topic Re: Whats the best way to parse a CSV file with multiple headers? in Talend Studio

Whats the best way to parse a CSV file with multiple headers?

Anonymous — Sat, 16 Nov 2024 11:06:59 GMT

Hi all,
I have a csv file that contains multiple headers separated by blanks and each section can be dynamic i.e. the number of entries under each header depends on the amount of data recorded. I've been investigating the use of the tFileInputMSDelimited operator but am having trouble getting this to work.
Or should I use a Regex operator to extract each of the sections ( i.e. Summary Stats, Statistics Overall Values, Category Dist)?
Anyway I would be grateful for any suggestions.

An example of the file and its format is pasted below, I'm using ".." to indicate the number of rows could be any number but usually in the hundreds but changes from file to file

Example Report: Test Report (Single) 01-01-70
Doe, John
Summary Stats
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
Time,Whole,"75%","11%","9%","5%","1%",

Statistics: Overall Values
dur,date, start, avg, maxpercentage
"1002"," 01/01/1970 19:07:40","1.1","10"
"1010867","01/01/1970 19:20:08","3.7","40%"
"1018866","01/01/1970 19:20:15","4.9","35%"
"1028866","01/01/1970 19:20:25","3.9","41%"
..
..
..
..
"1036616","01/01/1970 19:20:33","5","31%"
CATEGORY: Dist.
Attibute: Example.
att1,att2,
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
"1","2"
..
..
..
..
"10","15"

Re: Whats the best way to parse a CSV file with multiple headers?

Anonymous — Mon, 03 Aug 2015 03:22:15 GMT

what about the number of columns for each header? Always changing or fixed? For example: is it always 2 columns for CATEGORY: Dist header?

Re: Whats the best way to parse a CSV file with multiple headers?

Anonymous — Mon, 03 Aug 2015 19:17:39 GMT

Hi Shong,
No the number of columns stays the same for each header.
There is also extra text added to these files to create sections but they're not really important. The actual columns for each section incase I didn't describe it well enough would be:
,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,
dur,date, start, avg, maxpercentage
att1,att2

Re: Whats the best way to parse a CSV file with multiple headers?

Anonymous — Wed, 05 Aug 2015 20:18:33 GMT

Just to update I decided to go with a python solution instead as I found a similar solution on stack overflow
http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers
I need to get something up and running but might revisit using the regex I have with the some of the regex file operators in talend.

Re: Whats the best way to parse a CSV file with multiple headers?

Anonymous — Thu, 06 Aug 2015 14:20:09 GMT

Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...

Ladislav

Re: Whats the best way to parse a CSV file with multiple headers?

Anonymous — Fri, 07 Aug 2015 20:29:06 GMT

Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.
Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:
context.myOutputFileName = CATEGORY: Dist.
another context.myOutputFileName = Statistics: Overall Values
etc.
You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.
Another solution might be reading full file into memory as big string and process it on string level...

Ladislav

Thanks for that some great ideas of things to try!