
Anonymous
Not applicable
2015-08-02
06:34 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Whats the best way to parse a CSV file with multiple headers?
Hi all,
I have a csv file that contains multiple headers separated by blanks and each section can be dynamic i.e. the number of entries under each header depends on the amount of data recorded. I've been investigating the use of the tFileInputMSDelimited operator but am having trouble getting this to work.
Or should I use a Regex operator to extract each of the sections ( i.e. Summary Stats, Statistics Overall Values, Category Dist)?
Anyway I would be grateful for any suggestions.
An example of the file and its format is pasted below, I'm using ".." to indicate the number of rows could be any number but usually in the hundreds but changes from file to file
Example Report: Test Report (Single) 01-01-70
Doe, John
Summary Stats
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
Time,Whole,"75%","11%","9%","5%","1%",
Statistics: Overall Values
dur,date, start, avg, maxpercentage
"1002"," 01/01/1970 19:07:40","1.1","10"
"1010867","01/01/1970 19:20:08","3.7","40%"
"1018866","01/01/1970 19:20:15","4.9","35%"
"1028866","01/01/1970 19:20:25","3.9","41%"
..
..
..
..
"1036616","01/01/1970 19:20:33","5","31%"
CATEGORY: Dist.
Attibute: Example.
att1,att2,
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
..
..
..
..
"10","15"
I have a csv file that contains multiple headers separated by blanks and each section can be dynamic i.e. the number of entries under each header depends on the amount of data recorded. I've been investigating the use of the tFileInputMSDelimited operator but am having trouble getting this to work.
Or should I use a Regex operator to extract each of the sections ( i.e. Summary Stats, Statistics Overall Values, Category Dist)?
Anyway I would be grateful for any suggestions.
An example of the file and its format is pasted below, I'm using ".." to indicate the number of rows could be any number but usually in the hundreds but changes from file to file
Example Report: Test Report (Single) 01-01-70
Doe, John
Summary Stats
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
Time,Whole,"75%","11%","9%","5%","1%",
Statistics: Overall Values
dur,date, start, avg, maxpercentage
"1002"," 01/01/1970 19:07:40","1.1","10"
"1010867","01/01/1970 19:20:08","3.7","40%"
"1018866","01/01/1970 19:20:15","4.9","35%"
"1028866","01/01/1970 19:20:25","3.9","41%"
..
..
..
..
"1036616","01/01/1970 19:20:33","5","31%"
CATEGORY: Dist.
Attibute: Example.
att1,att2,
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
..
..
..
..
"10","15"
904 Views
5 Replies

Anonymous
Not applicable
2015-08-02
11:22 PM
Author
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
what about the number of columns for each header? Always changing or fixed? For example: is it always 2 columns for
CATEGORY: Dist header?
904 Views

Anonymous
Not applicable
2015-08-03
03:17 PM
Author
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Shong,
No the number of columns stays the same for each header.
There is also extra text added to these files to create sections but they're not really important. The actual columns for each section incase I didn't describe it well enough would be:
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
dur,date, start, avg, maxpercentage
att1,att2
No the number of columns stays the same for each header.
There is also extra text added to these files to create sections but they're not really important. The actual columns for each section incase I didn't describe it well enough would be:
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
dur,date, start, avg, maxpercentage
att1,att2
904 Views

Anonymous
Not applicable
2015-08-05
04:18 PM
Author
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Just to update I decided to go with a python solution instead as I found a similar solution on stack overflow
http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers
I need to get something up and running but might revisit using the regex I have with the some of the regex file operators in talend.
http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers
I need to get something up and running but might revisit using the regex I have with the some of the regex file operators in talend.
904 Views

Anonymous
Not applicable
2015-08-06
10:20 AM
Author
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...
Ladislav
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...
Ladislav
904 Views

Anonymous
Not applicable
2015-08-07
04:29 PM
Author
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...
Ladislav
Thanks for that some great ideas of things to try!
904 Views
