<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: parsing HTML in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246754#M32172</link>
    <description>Hi all,&lt;BR /&gt;Perhaps you could pre-process your HTML file by reading it with tFileInputFullRow and checking the "skip empty rows" option.&lt;BR /&gt;Hope it helps.&lt;BR /&gt;Regards,&lt;BR /&gt;laurent</description>
    <pubDate>Tue, 01 Apr 2014 09:20:49 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2014-04-01T09:20:49Z</dc:date>
    <item>
      <title>parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246749#M32167</link>
      <description>Hi everybody.&lt;BR /&gt;I need to extract some information (all of it legal) from HTML pages. Is there a way to get just the information I need and strip out the HTML tags?&lt;BR /&gt;Thanks in advance.</description>
      <pubDate>Fri, 28 Mar 2014 09:17:34 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246749#M32167</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-03-28T09:17:34Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246750#M32168</link>
      <description>Hi,&lt;BR /&gt;Use tFileFetch to read the HTML file:&lt;BR /&gt;&lt;A href="https://help.talend.com/search/all?query=tFileFetch&amp;amp;content-lang=en" rel="nofollow noopener noreferrer"&gt;https://help.talend.com/search/all?query=tFileFetch&amp;amp;content-lang=en&lt;/A&gt;&lt;BR /&gt;You can use a library like jsoup to parse the HTML (this means writing some Java code).&lt;BR /&gt;There are also Exchange components such as tHTTPBot or tHTTPTableInput (for reading tables).&lt;BR /&gt;If the page is well-formed (X)HTML, you can use the Talend XML components after loading the HTML pages.&lt;BR /&gt;Hope it helps.&lt;BR /&gt;Regards,&lt;BR /&gt;laurent</description>
      <pubDate>Fri, 28 Mar 2014 10:58:34 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246750#M32168</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-03-28T10:58:34Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246751#M32169</link>
      <description>I already tried, but I ran into some problems. I started using Talend last week, so I'm not very experienced yet. &lt;BR /&gt;Could you please help me further with an example?&lt;BR /&gt;Thank you very much.</description>
      <pubDate>Fri, 28 Mar 2014 11:38:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246751#M32169</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-03-28T11:38:31Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246752#M32170</link>
      <description>Hi everybody,
&lt;BR /&gt;I tried tTikaExtractor and it works, but it doesn't remove the blank space between lines... 
&lt;BR /&gt;Is there a component that writes to the output file sequentially?
&lt;BR /&gt;thanks in advance.</description>
      <pubDate>Tue, 01 Apr 2014 08:20:55 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246752#M32170</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-04-01T08:20:55Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246753#M32171</link>
      <description>Hi rob911, &lt;BR /&gt;Would this component work for your spacing issue: &lt;A href="https://help.talend.com/search/all?query=tReplace&amp;amp;content-lang=en" target="_blank" rel="nofollow noopener noreferrer"&gt;TalendHelpCenter:tReplace&lt;/A&gt;?&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Is there a component that writes to the output file sequentially?&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;Could you give an example of what you mean by "writes to the output file sequentially"?&lt;BR /&gt;Best regards&lt;BR /&gt;Sabrina</description>
      <pubDate>Tue, 01 Apr 2014 08:47:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246753#M32171</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-04-01T08:47:01Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246754#M32172</link>
      <description>Hi all,&lt;BR /&gt;Perhaps you could pre-process your HTML file by reading it with tFileInputFullRow and checking the "skip empty rows" option.&lt;BR /&gt;Hope it helps.&lt;BR /&gt;Regards,&lt;BR /&gt;laurent</description>
      <pubDate>Tue, 01 Apr 2014 09:20:49 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246754#M32172</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-04-01T09:20:49Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246755#M32173</link>
      <description>I can't upload the screenshot of my file and Talend job. 
&lt;BR /&gt;kzone, where should I put tFileInputFullRow? In my job I have tTikaExtractor -&amp;gt; FixedFlowInput -&amp;gt; tFileOutputDelimited.... 
&lt;BR /&gt;xdshi, with tTikaExtractor I can remove every line of code from my HTML file, but the useful lines stay at the positions they had in the code. 
&lt;BR /&gt;Thanks to you both; I hope you can help me reach a solution. 
&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009MACn.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/154443iC5B8CACEF3D12C6A/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009MACn.png" alt="0683p000009MACn.png" /&gt;&lt;/span&gt;</description>
      <pubDate>Tue, 01 Apr 2014 10:07:34 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246755#M32173</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-04-01T10:07:34Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246756#M32174</link>
      <description>Hi, 
&lt;BR /&gt;You need to register and log in as a Community member first; then you'll get an image upload box that lets you upload screen captures and images (limits: 20 images per post; each image must be smaller than 1024x768 pixels and 200 KB).
&lt;BR /&gt;Best regards
&lt;BR /&gt;Sabrina</description>
      <pubDate>Tue, 01 Apr 2014 10:17:54 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246756#M32174</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-04-01T10:17:54Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246757#M32175</link>
      <description>I'm already registered, but I can't log in, and I don't know why. 
&lt;BR /&gt;Anyway, the problem is that the lines I'm interested in are not arranged sequentially in the file. I mean there are too many empty rows; those empty rows are where the code used to be. 
&lt;BR /&gt;I give Tika the URL I'm interested in and get the useful lines in a txt file, but they sit at the same positions as in the HTML file, and I want them in consecutive rows. 
&lt;BR /&gt;I followed this post: 
&lt;A href="https://community.qlik.com/s/feed/0D53p00007vCrGLCA0" rel="nofollow noopener noreferrer"&gt;https://community.talend.com/t5/Design-and-Development/how-do-we-retrieve-data-from-HTML-page/td-p/114529&lt;/A&gt; . 
&lt;BR /&gt;But the output is different and I don't know why! 
&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009MPcz.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/157233iD1A564EF62DE3BC2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009MPcz.png" alt="0683p000009MPcz.png" /&gt;&lt;/span&gt;</description>
      <pubDate>Tue, 01 Apr 2014 10:52:29 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246757#M32175</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-04-01T10:52:29Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246758#M32176</link>
      <description>&lt;BLOCKQUOTE&gt; 
 &lt;TABLE border="1"&gt; 
  &lt;TBODY&gt; 
   &lt;TR&gt; 
    &lt;TD&gt;Anyway, the problem is that the lines I'm interested in are not arranged sequentially in the file. I mean there are too many empty rows; those empty rows are where the code used to be.&lt;/TD&gt; 
   &lt;/TR&gt; 
  &lt;/TBODY&gt; 
 &lt;/TABLE&gt; 
&lt;/BLOCKQUOTE&gt; 
&lt;BR /&gt;Since you have: 
&lt;BR /&gt; tTikaExtractor -&amp;gt; FixedFlowInput -&amp;gt; tFileOutputDelimited 
&lt;BR /&gt;next, read the delimited file with tFileInputFullRow, skipping empty rows... 
&lt;BR /&gt;I'm not sure it's the most efficient way (in fact, I'm sure it isn't) 
&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009MACn.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/154443iC5B8CACEF3D12C6A/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009MACn.png" alt="0683p000009MACn.png" /&gt;&lt;/span&gt; but I'm also not sure what you're expecting. 
&lt;BR /&gt;regards 
&lt;BR /&gt;laurent</description>
      <pubDate>Tue, 01 Apr 2014 15:12:34 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246758#M32176</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-04-01T15:12:34Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246759#M32177</link>
      <description>Hi, 
&lt;BR /&gt;I tried tFileInputFullRow -&amp;gt; tFileOutputDelimited with empty rows skipped, but it doesn't remove the empty rows... 
&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009MPcz.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/157233iD1A564EF62DE3BC2/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009MPcz.png" alt="0683p000009MPcz.png" /&gt;&lt;/span&gt; 
&lt;BR /&gt;Regards</description>
      <pubDate>Thu, 03 Apr 2014 08:11:22 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246759#M32177</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-04-03T08:11:22Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246760#M32178</link>
      <description>Hi everybody,
&lt;BR /&gt;OK, I no longer need the file to be in order. 
&lt;BR /&gt;I just need to extract some lines... Is there a component that can help me with that? I need to specify some start words and some end words.
&lt;BR /&gt;Thanks in advance.</description>
      <pubDate>Fri, 04 Apr 2014 08:17:52 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246760#M32178</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-04-04T08:17:52Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246761#M32179</link>
      <description>I'm using tFileInputRegex and it's matching the lines I need... but how can I write these lines to an output file?&lt;BR /&gt;Using tFileInputRegex -&amp;gt; tFileOutputDelimited doesn't work.&lt;BR /&gt;Regards</description>
      <pubDate>Fri, 04 Apr 2014 09:14:11 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246761#M32179</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2014-04-04T09:14:11Z</dc:date>
    </item>
    <item>
      <title>Re: parsing HTML</title>
      <link>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246762#M32180</link>
      <description>Hi everybody,&lt;BR /&gt;I am reading an HTML file using tFileInputFullRow, but it isn't reading the file from the beginning. It should start reading at the &amp;lt;html&amp;gt; tag, but it starts somewhere else, and I'm not sure where. Note: I have not checked the random option of the component.</description>
      <pubDate>Mon, 07 Apr 2014 16:07:32 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/parsing-HTML/m-p/2246762#M32180</guid>
      <dc:creator>Ashok_Panda</dc:creator>
      <dc:date>2014-04-07T16:07:32Z</dc:date>
    </item>
  </channel>
</rss>

