Skip to main content
Announcements
A fresh, new look for the Data Integration & Quality forums and navigation! Read more about what's changed.
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

Whats the best way to parse a CSV file with multiple headers?

Hi all,
I have a csv file that contains multiple headers separated by blanks and each section can be dynamic i.e. the number of entries under each header depends on the amount of data recorded. I've been investigating the use of the tFileInputMSDelimited operator but am having trouble getting this to work.
Or should I use a Regex operator to extract each of the sections ( i.e. Summary Stats, Statistics Overall Values, Category Dist)?
Anyway I would be grateful for any suggestions.

An example of the file and its format is pasted below, I'm using ".." to indicate the number of rows could be any number but usually in the hundreds but changes from file to file

Example Report: Test Report (Single) 01-01-70
Doe, John
Summary Stats
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
Time,Whole,"75%","11%","9%","5%","1%",

Statistics: Overall Values
dur,date, start, avg, maxpercentage
"1002"," 01/01/1970 19:07:40","1.1","10"
"1010867","01/01/1970 19:20:08","3.7","40%"
"1018866","01/01/1970 19:20:15","4.9","35%"
"1028866","01/01/1970 19:20:25","3.9","41%"
..
..
..
..
"1036616","01/01/1970 19:20:33","5","31%"
CATEGORY: Dist.
Attibute: Example.
att1,att2,
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
..
..
..
..
"10","15"
Labels (3)
5 Replies
Anonymous
Not applicable
Author

what about the number of columns for each header? Always changing or fixed? For example: is it always 2 columns for  CATEGORY: Dist header?
Anonymous
Not applicable
Author

Hi Shong,
No the number of columns stays the same for each header.
There is also extra text added to these files to create sections but they're not really important. The actual columns for each section incase I didn't describe it well enough would be:
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
dur,date, start, avg, maxpercentage
att1,att2
Anonymous
Not applicable
Author

Just to update I decided to go with a python solution instead as I found a similar solution on stack overflow
http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers
I need to get something up and running but might revisit using the regex I have with the some of the regex file operators in talend.
Anonymous
Not applicable
Author

Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...

Ladislav
Anonymous
Not applicable
Author

Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...

Ladislav

Thanks for that some great ideas of things to try!