Parsing pretty text files

Anonymous · ‎2009-05-04

I am working on a project to parse a text file. The problems I face are the following:
1. The file's format and content change slightly at the whim of the person generating the report (the reason I am shooting down writing custom code to do the parsing).
2. The TXT file is made to be human readable, in other words pretty: everything is lined up, so the delimiter (spaces) varies in width depending on the length of the data.
3. The file contains three main parts. The first two are in the following format: "Name Data Name Data Name Data", and the third is in this format:
Name Name Name
Data Data Data
I started looking into this software because, unlike myself, the main users are not code monkeys, so I figured that with the graphical interface, making a small change to the parsing would be pretty simple. What I am looking for from this post is a direction and maybe some ideas: which components would be the best fit for parsing the two formats of data while staying easily changeable by non-programmers? Usually this would be no problem, but since the format changes on a whim and non code monkeys have to keep up with the changes, this has become a bit more difficult. I look forward to hearing some input.
Thank You,
Zachary Long
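
For the third layout described above (a row of names with a row of data lined up underneath), one possible approach is to use the position of each header word as the column boundary. This is only a sketch: it assumes the columns are left-aligned and that every header name is a single word. The class name AlignedColumns and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlignedColumns {
    // Slices the data line at the start offset of each header word.
    // Assumption: columns are left-aligned, header names contain no spaces.
    static Map<String, String> parse(String header, String data) {
        List<String> names = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(header);
        while (m.find()) {
            names.add(m.group());
            starts.add(m.start());
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            // Clamp offsets so a short data line does not throw.
            int from = Math.min(starts.get(i), data.length());
            int to = (i + 1 < names.size())
                    ? Math.min(starts.get(i + 1), data.length())
                    : data.length();
            out.put(names.get(i), data.substring(from, to).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, padded to the same column widths.
        String header = String.format("%-10s%-12s%s", "Name", "City", "Qty");
        String data = String.format("%-10s%-12s%s", "Widget", "New York", "12");
        System.out.println(parse(header, data));
    }
}
```

Because the slicing follows the header offsets, a value that contains spaces (such as "New York" in the sample) stays in one piece, which plain space splitting would not manage.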

Anonymous · ‎2009-05-04

Hello friend
Can you show us an example of the file's content?
Best regards

shong

Anonymous · ‎2009-05-04

Here is a piece of one of the many files I need to parse; it shows the three sections. Something that I forgot to mention is that there are several reports per input file, separated by "END OF REPORT", which I figure will not be too hard to use for splitting the reports apart. The ultimate goal will be to combine all data using the data as a delimiter.
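
Splitting on the "END OF REPORT" marker could look roughly like the following. This is a minimal sketch in plain Java, assuming the marker appears somewhere on a line of its own and that blank lines can be dropped. The class name ReportSplitter and the sample lines are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReportSplitter {
    // Groups lines into reports; a line containing the marker closes the current report.
    static List<List<String>> split(List<String> lines) {
        List<List<String>> reports = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.contains("END OF REPORT")) {
                if (!current.isEmpty()) {
                    reports.add(current);
                }
                current = new ArrayList<>();
            } else if (!line.trim().isEmpty()) {
                current.add(line); // blank lines are skipped
            }
        }
        if (!current.isEmpty()) {
            reports.add(current); // trailing report with no closing marker
        }
        return reports;
    }

    public static void main(String[] args) {
        // Hypothetical input lines.
        List<String> lines = Arrays.asList("A 1", "B 2", "END OF REPORT", "A 3", "END OF REPORT");
        System.out.println(split(lines)); // prints [[A 1, B 2], [A 3]]
    }
}
```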

Zachary Long

Anonymous · ‎2009-05-06

Hello all, I am guessing by the lack of response that everyone is just as stumped as I am?
Thanks
Zachary Long

Anonymous · ‎2009-05-06

Hi Zachary,
Here's my 2 cents worth.
I am assuming you want each report to be a single output record (basically a many input -> one output scenario).
I would define the input as space (' ') delimited and output to a delimited file (';'). You will end up with a file with up to 7 (I think I counted correctly) columns.
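
Outside of the Talend components, the same space-to-semicolon step can be sketched in plain Java. This is an illustration only: it assumes that any run of spaces counts as one delimiter, and the class name SpaceToSemicolon and the sample line are made up.

```java
class SpaceToSemicolon {
    // Treats any run of whitespace as a single delimiter and re-joins the fields with ';'.
    static String convert(String line) {
        String[] fields = line.trim().split("\\s+");
        return String.join(";", fields);
    }

    public static void main(String[] args) {
        // Hypothetical padded line.
        System.out.println(convert("Name1   ValueA    Name2  ValueB"));
        // prints Name1;ValueA;Name2;ValueB
    }
}
```

Note that a value containing a space ends up in two columns with this approach, which is why some fields may need to be joined back together afterwards.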
Since no two lines are the same, I would then use tJavaRow to build the output record. If you search the Talend forum you can find examples of this.
You will need to check field 1 on some lines (e.g. 'for MACHINE) and field 4 on others (e.g. TRIM). You may also need to concatenate some fields back together to get the output you need.
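
The field checks and the re-joining could be written as two small helpers like the ones below. This is a sketch in plain Java, not actual tJavaRow code: the names LineFields, fieldEquals and rejoin are hypothetical, and the sample fields are invented.

```java
class LineFields {
    // True when the field at the given 1-based position exists and equals the keyword.
    static boolean fieldEquals(String[] fields, int position, String keyword) {
        return position >= 1
                && position <= fields.length
                && fields[position - 1].equals(keyword);
    }

    // Joins fields[from] up to (not including) fields[to] with single spaces.
    // Indexes are 0-based and clamped to the array bounds.
    static String rejoin(String[] fields, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = Math.max(from, 0); i < Math.min(to, fields.length); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(fields[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical fields from one split line.
        String[] fields = {"City", "New", "York", "TRIM"};
        System.out.println(fieldEquals(fields, 4, "TRIM")); // prints true
        System.out.println(rejoin(fields, 1, 3));           // prints New York
    }
}
```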
Since there are several reports in one input file, you will need to generate a simple sequence number for each report, and also each line of each report.
Then you can sort by report/line number (descending) and then use tUniqRow on report number (checking the 'Only once each duplicated key' option under the Advanced tab).
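
To make the numbering and the sort/unique idea concrete, here is a rough plain-Java equivalent. It is a sketch under assumptions: reports are separated by a line containing "END OF REPORT", and "unique" is taken to mean keeping the first row seen for each report number after sorting. The class LastLinePerReport and its Row type are invented for illustration and are not Talend APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LastLinePerReport {
    static final class Row {
        final int report;
        final int line;
        final String text;

        Row(int report, int line, String text) {
            this.report = report;
            this.line = line;
            this.text = text;
        }
    }

    // Assigns a report number and a line number (both starting at 1) to every line.
    static List<Row> number(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        int report = 1;
        int line = 0;
        for (String s : lines) {
            if (s.contains("END OF REPORT")) {
                report++;
                line = 0;
            } else {
                rows.add(new Row(report, ++line, s));
            }
        }
        return rows;
    }

    // Sorts by report, then by line descending, and keeps the first row per report.
    static Map<Integer, Row> lastPerReport(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> r.report)
                .thenComparing(Comparator.comparingInt((Row r) -> r.line).reversed()));
        Map<Integer, Row> kept = new LinkedHashMap<>();
        for (Row r : sorted) {
            kept.putIfAbsent(r.report, r);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical input lines.
        List<String> lines = Arrays.asList("a", "b", "END OF REPORT", "c", "d", "e", "END OF REPORT");
        for (Row r : lastPerReport(number(lines)).values()) {
            System.out.println(r.report + ";" + r.line + ";" + r.text);
        }
    }
}
```

After the descending sort, the first row for each report is its highest-numbered line, so this only gives one complete record per report if the per-report values have been accumulated onto that last line beforehand.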
It's not pretty but neither is the input.
Give it a go. If you have problems, maybe you can copy/paste a file sample instead of an image. I might have time to see if I can get it to work.
Bye for now,

Anonymous · ‎2009-05-06

Regex-based Perl parsing would be a great fit for this problem. You can use clever regexes to locate your position in the file, and then parse out the data you need.
If you're stuck with Java check out this package:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/package-summary.html
http://java.sun.com/developer/technicalArticles/releases/1.4regex/
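
As a rough idea of what this looks like with java.util.regex, the sketch below pulls name/value pairs out of a "Name Data Name Data" style line. It assumes each name and each value is a single token without spaces. The class name PairParser and the sample line are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairParser {
    // One name token, any amount of whitespace, one value token.
    private static final Pattern PAIR = Pattern.compile("(\\S+)\\s+(\\S+)");

    // Reads successive name/value pairs from a padded line, in order of appearance.
    static Map<String, String> parse(String line) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.put(m.group(1), m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample line.
        System.out.println(parse("MACHINE  M01     OPERATOR   Smith    SHIFT 2"));
        // prints {MACHINE=M01, OPERATOR=Smith, SHIFT=2}
    }
}
```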
