<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Whats the best way to parse a CSV file with multiple headers? in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354737#M120644</link>
    <description>Just to update: I decided to go with a Python solution instead, as I found a similar solution on Stack Overflow:
&lt;BR /&gt;
&lt;A href="http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers" target="_blank" rel="nofollow noopener noreferrer"&gt;http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers&lt;/A&gt;
&lt;BR /&gt;I need to get something up and running, but I might revisit using the regex I have with some of the regex file operators in Talend.</description>
    <pubDate>Wed, 05 Aug 2015 20:18:33 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2015-08-05T20:18:33Z</dc:date>
    <item>
      <title>Whats the best way to parse a CSV file with multiple headers?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354734#M120641</link>
      <description>Hi all, 
&lt;BR /&gt;I have a CSV file that contains multiple headers separated by blank lines, and each section is dynamic, i.e. the number of entries under each header depends on the amount of data recorded. I've been investigating the tFileInputMSDelimited operator but am having trouble getting it to work. 
&lt;BR /&gt;Or should I use a regex operator to extract each of the sections (i.e. Summary Stats, Statistics: Overall Values, CATEGORY: Dist.)? 
&lt;BR /&gt;Anyway, I would be grateful for any suggestions. 
&lt;BR /&gt; 
&lt;BR /&gt;An example of the file and its format is pasted below. I'm using ".." to indicate that the number of rows could be anything; it is usually in the hundreds but changes from file to file. 
&lt;BR /&gt; 
&lt;BR /&gt;Example Report: Test Report (Single) 01-01-70 
&lt;BR /&gt;Doe, John 
&lt;BR /&gt;Summary Stats 
&lt;BR /&gt;,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5, 
&lt;BR /&gt;Time,Whole,"75%","11%","9%","5%","1%", 
&lt;BR /&gt; 
&lt;BR /&gt;Statistics: Overall Values 
&lt;BR /&gt;dur,date, start, avg, maxpercentage 
&lt;BR /&gt;"1002"," 01/01/1970 19:07:40","1.1","10" 
&lt;BR /&gt;"1010867","01/01/1970 19:20:08","3.7","40%" 
&lt;BR /&gt;"1018866","01/01/1970 19:20:15","4.9","35%" 
&lt;BR /&gt;"1028866","01/01/1970 19:20:25","3.9","41%" 
&lt;BR /&gt;.. 
&lt;BR /&gt;.. 
&lt;BR /&gt;.. 
&lt;BR /&gt;.. 
&lt;BR /&gt;"1036616","01/01/1970 19:20:33","5","31%" 
&lt;BR /&gt;CATEGORY: Dist. 
&lt;BR /&gt;Attibute: Example. 
&lt;BR /&gt;att1,att2, 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;"1","2" 
&lt;BR /&gt;.. 
&lt;BR /&gt;.. 
&lt;BR /&gt;.. 
&lt;BR /&gt;.. 
&lt;BR /&gt;"10","15"</description>
      <pubDate>Sat, 16 Nov 2024 11:06:59 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354734#M120641</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T11:06:59Z</dc:date>
    </item>
    <item>
      <title>Re: Whats the best way to parse a CSV file with multiple headers?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354735#M120642</link>
      <description>What about the number of columns for each header? Is it always changing, or fixed? For example, is it always 2 columns for the CATEGORY: Dist. header?</description>
      <pubDate>Mon, 03 Aug 2015 03:22:15 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354735#M120642</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-08-03T03:22:15Z</dc:date>
    </item>
    <item>
      <title>Re: Whats the best way to parse a CSV file with multiple headers?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354736#M120643</link>
      <description>Hi Shong, 
&lt;BR /&gt;No, the number of columns stays the same for each header. 
&lt;BR /&gt;There is also extra text added to these files to create sections, but it's not really important. In case I didn't describe it well enough, the actual columns for each section are: 
&lt;BR /&gt;,,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5, 
&lt;BR /&gt;dur,date, start, avg, maxpercentage 
&lt;BR /&gt;att1,att2</description>
      <pubDate>Mon, 03 Aug 2015 19:17:39 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354736#M120643</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-08-03T19:17:39Z</dc:date>
    </item>
    <item>
      <title>Re: Whats the best way to parse a CSV file with multiple headers?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354737#M120644</link>
      <description>Just to update: I decided to go with a Python solution instead, as I found a similar solution on Stack Overflow:
&lt;BR /&gt;
&lt;A href="http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers" target="_blank" rel="nofollow noopener noreferrer"&gt;http://stackoverflow.com/questions/20293327/use-python-to-split-a-csv-file-with-multiple-headers&lt;/A&gt;
&lt;BR /&gt;I need to get something up and running, but I might revisit using the regex I have with some of the regex file operators in Talend.</description>
      <pubDate>Wed, 05 Aug 2015 20:18:33 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354737#M120644</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-08-05T20:18:33Z</dc:date>
    </item>
    <item>
      <title>Re: Whats the best way to parse a CSV file with multiple headers?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354738#M120645</link>
      <description>Use tFileInputFullRow to read the file row by row and connect it to a tJavaRow, where you detect each header and set context.myOutputFileName, which you then use with tFileOutputDelimited. This way you split each section into its own smaller file, which you can then easily process one by one. 
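&lt;BR /&gt;A minimal sketch of that header-detection step, written in Python rather than tJavaRow Java because the original poster settled on Python. The section titles come from the sample report in this thread; SECTION_TITLES, split_sections and the in-memory dict (standing in for the per-section output files) are hypothetical names for illustration, not Talend API.

```python
# Section titles as they appear in the sample report; adjust for real files.
SECTION_TITLES = ("Summary Stats", "Statistics: Overall Values", "CATEGORY: Dist.")


def split_sections(lines):
    """Group raw lines under the most recently seen section title."""
    sections = {}
    current = None  # plays the role of context.myOutputFileName
    for raw in lines:
        line = raw.strip()
        if line in SECTION_TITLES:
            # A new header was detected: switch the "output" to this section.
            current = line
            sections[current] = []
        elif current is not None and line:
            # Non-blank line inside a section; lines before the first title are skipped.
            sections[current].append(line)
    return sections


sample = [
    "Example Report: Test Report (Single) 01-01-70",
    "Doe, John",
    "Summary Stats",
    ",,Deg1 1,Deg 2, Deg 3,Deg 4,Deg 5,",
    'Time,Whole,"75%","11%","9%","5%","1%",',
    "",
    "Statistics: Overall Values",
    "dur,date, start, avg, maxpercentage",
    '"1010867","01/01/1970 19:20:08","3.7","40%"',
    "CATEGORY: Dist.",
    "Attibute: Example.",
    "att1,att2,",
    '"1","2"',
]
result = split_sections(sample)
```

&lt;BR /&gt;Each list in result could then be written to its own file, which is roughly what switching context.myOutputFileName achieves in the job described above.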
&lt;BR /&gt;Another solution is to use a tMap with multiple outputs and filter each output on a specific value, which might still be context.myOutputFileName. One output filters on: 
&lt;BR /&gt;context.myOutputFileName = CATEGORY: Dist. 
&lt;BR /&gt;another on context.myOutputFileName = Statistics: Overall Values 
&lt;BR /&gt;etc. 
&lt;BR /&gt;Since tMap gives you each row as a single field, you still need to split it into the expected fields in a tJavaRow with Java, but that is easy as well. 
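&lt;BR /&gt;The field-splitting step might look roughly like this, sketched in Python with the standard csv module rather than in tJavaRow Java. The helper name split_row is hypothetical, and dropping one trailing empty field is an assumption based on the stray trailing commas in the sample header rows.

```python
import csv


def split_row(line):
    """Split one raw CSV line into stripped fields."""
    # csv.reader handles the quoted values; skipinitialspace ignores
    # blanks that directly follow a delimiter (as in "date, start, avg").
    fields = next(csv.reader([line], skipinitialspace=True))
    fields = [field.strip() for field in fields]
    if fields and fields[-1] == "":
        fields.pop()  # assumed: a trailing comma does not mark a real column
    return fields


header = split_row("dur,date, start, avg, maxpercentage")
values = split_row('"1002"," 01/01/1970 19:07:40","1.1","10"')
```

&lt;BR /&gt;Note that in the sample the Statistics header names five columns while the data rows carry four values, so the header and the values may need to be reconciled before they are zipped together.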
&lt;BR /&gt;Another solution might be to read the full file into memory as one big string and process it at the string level... 
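&lt;BR /&gt;As a rough illustration of the string-level idea, the whole text can be split on the section titles with a regular expression. This is a Python sketch under the assumption that each title sits alone on its own line, as in the sample report; TITLES, TITLE_PATTERN and sections_from_text are hypothetical names.

```python
import re

# Section titles as they appear in the sample report.
TITLES = ["Summary Stats", "Statistics: Overall Values", "CATEGORY: Dist."]

# Match a title that occupies a whole line (trailing blanks allowed).
TITLE_PATTERN = re.compile(
    r"^(" + "|".join(re.escape(title) for title in TITLES) + r")[ \t]*$",
    re.MULTILINE,
)


def sections_from_text(text):
    """Return a dict mapping each section title to the text that follows it."""
    # With one capturing group, split() yields:
    # [preamble, title1, body1, title2, body2, ...]
    parts = TITLE_PATTERN.split(text)
    return {title: body.strip() for title, body in zip(parts[1::2], parts[2::2])}


text = (
    "Report header\n"
    "Summary Stats\n"
    "a,b\n"
    "\n"
    "Statistics: Overall Values\n"
    "c,d\n"
    "1,2\n"
    "CATEGORY: Dist.\n"
    "att1,att2,\n"
)
sections = sections_from_text(text)
```

&lt;BR /&gt;For a file loaded from disk, text would simply be the result of reading the whole file; this only suits files that fit comfortably in memory.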
&lt;BR /&gt; 
&lt;BR /&gt;Ladislav</description>
      <pubDate>Thu, 06 Aug 2015 14:20:09 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354738#M120645</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-08-06T14:20:09Z</dc:date>
    </item>
    <item>
      <title>Re: Whats the best way to parse a CSV file with multiple headers?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354739#M120646</link>
      <description>&lt;BLOCKQUOTE&gt; 
 &lt;TABLE border="1"&gt; 
  &lt;TBODY&gt; 
   &lt;TR&gt; 
    &lt;TD&gt;Use tFileInputFullRow to read the file by rows, connect it to tJavaRow where you will detect each header and configure context.myOutputFileName which you will use with tFileOutputDelimited. This way you will split each content to specific smaller files which you can then easily process one by one.&lt;BR /&gt;Another solution is that you will use tMap with multiple outputs and you will filter output for specific value which might be still context.myOutputFileName, one output will filter:&lt;BR /&gt;context.myOutputFileName = CATEGORY: Dist.&lt;BR /&gt;another context.myOutputFileName = Statistics: Overall Values&lt;BR /&gt;etc.&lt;BR /&gt;You still need to deal with delimiting/splitting the values in one row by some tJavaRow as soon as you get single field from tMap and need to split it to expected fields by java, but that is easy as well.&lt;BR /&gt;Another solution might be reading full file into memory as big string and process it on string level...&lt;BR /&gt;&lt;BR /&gt;Ladislav&lt;/TD&gt; 
   &lt;/TR&gt; 
  &lt;/TBODY&gt; 
 &lt;/TABLE&gt; 
&lt;/BLOCKQUOTE&gt; 
&lt;BR /&gt;Thanks for that, some great ideas of things to try!</description>
      <pubDate>Fri, 07 Aug 2015 20:29:06 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Whats-the-best-way-to-parse-a-CSV-file-with-multiple-headers/m-p/2354739#M120646</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-08-07T20:29:06Z</dc:date>
    </item>
  </channel>
</rss>

