<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: PDF and HTML parsers in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288516#M61995</link>
    <description>Hi 
&lt;BR /&gt;There is a custom component tPDFToText on Talend exchange 
&lt;BR /&gt; 
&lt;A href="http://www.talendforge.org/exchange/index.php?eid=346&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1" rel="nofollow noopener noreferrer"&gt;http://www.talendforge.org/exchange/index.php?eid=346&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1&lt;/A&gt; 
&lt;BR /&gt;It can be used to convert a PDF file to a text file, from which you can then extract a delimited area. 
&lt;BR /&gt;For HTML files, you can try the tHTTPTableInput component: 
&lt;BR /&gt; 
&lt;A href="http://www.talendforge.org/exchange/index.php?eid=72&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1" rel="nofollow noopener noreferrer"&gt;http://www.talendforge.org/exchange/index.php?eid=72&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1&lt;/A&gt; 
&lt;BR /&gt;Best regards 
&lt;BR /&gt;Shong</description>
    <pubDate>Sun, 31 Jul 2011 04:15:05 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2011-07-31T04:15:05Z</dc:date>
    <item>
      <title>PDF and HTML parsers</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288515#M61994</link>
      <description>Hi,
&lt;BR /&gt;Does Talend support parsing PDF and HTML files? If so, could you please let me know how to do this?
&lt;BR /&gt;Thanks &amp;amp; Regards,
&lt;BR /&gt;Syed</description>
      <pubDate>Sat, 16 Nov 2024 12:46:11 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288515#M61994</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T12:46:11Z</dc:date>
    </item>
    <item>
      <title>Re: PDF and HTML parsers</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288516#M61995</link>
      <description>Hi 
&lt;BR /&gt;There is a custom component tPDFToText on Talend exchange 
&lt;BR /&gt; 
&lt;A href="http://www.talendforge.org/exchange/index.php?eid=346&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1" rel="nofollow noopener noreferrer"&gt;http://www.talendforge.org/exchange/index.php?eid=346&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1&lt;/A&gt; 
&lt;BR /&gt;It can be used to convert a PDF file to a text file, from which you can then extract a delimited area. 
&lt;BR /&gt;For HTML files, you can try the tHTTPTableInput component: 
&lt;BR /&gt; 
&lt;A href="http://www.talendforge.org/exchange/index.php?eid=72&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1" rel="nofollow noopener noreferrer"&gt;http://www.talendforge.org/exchange/index.php?eid=72&amp;amp;product=tos&amp;amp;action=view&amp;amp;nav=1,1,1&lt;/A&gt; 
&lt;BR /&gt;Best regards 
&lt;BR /&gt;Shong</description>
      <pubDate>Sun, 31 Jul 2011 04:15:05 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288516#M61995</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2011-07-31T04:15:05Z</dc:date>
    </item>
    <item>
      <title>Re: PDF and HTML parsers</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288517#M61996</link>
      <description>Hi Shong, 
&lt;BR /&gt;I have downloaded the PDF parser and added it to TOS. The component generates plain text, but I am not sure which component 
&lt;BR /&gt;to use to read this text file. My PDF file contains a table, which is converted to text as follows: 
&lt;BR /&gt;---------------------------------------------------------------------------------------- 
&lt;BR /&gt;Instrument Details 
&lt;BR /&gt; 
&lt;BR /&gt;Asset Type:CORPORATE DEBT Provider:BCP Golden Copy 
&lt;BR /&gt; 
&lt;BR /&gt;Identifiers 
&lt;BR /&gt;ISIN XS0283708575 
&lt;BR /&gt;CUSIP EG1215284 
&lt;BR /&gt;SEDOL B1P8V35 
&lt;BR /&gt;CFI Code 
&lt;BR /&gt;Titu Code 65083001 
&lt;BR /&gt;Central Code 
&lt;BR /&gt;SIIB Code 
&lt;BR /&gt;RIC 
&lt;BR /&gt;Code Number NA 
&lt;BR /&gt;Issuer Details 
&lt;BR /&gt;Group Issue NO 
&lt;BR /&gt;-------------------------------------------------------------------------- 
&lt;BR /&gt;From this file I need to prepare key-value pairs and load them into the DB. 
&lt;BR /&gt;Example: 
&lt;BR /&gt;ISIN : XS0283708575 
&lt;BR /&gt;CUSIP : EG1215284 
&lt;BR /&gt;SEDOL : B1P8V35 
&lt;BR /&gt;Please suggest how I can do this. 
&lt;BR /&gt;Thanks &amp;amp; Regards, 
&lt;BR /&gt;Syed</description>
      <pubDate>Mon, 01 Aug 2011 06:59:21 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288517#M61996</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2011-08-01T06:59:21Z</dc:date>
    </item>
    <item>
      <title>Re: PDF and HTML parsers</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288518#M61997</link>
      <description>Hi 
&lt;BR /&gt;Do these three records always start with "ISIN", "CUSIP" and "SEDOL"? If so, use a tFileInputFullRow to read the file line by line, filter the rows that start with "ISIN", "CUSIP" or "SEDOL" with a tFilterRow, and then split each line into multiple fields with a tExtractDelimitedFields. For example: 
&lt;BR /&gt;tFileInputFullRow--main--&gt;tFilterRow--&gt;tExtractDelimitedFields--&gt;tLogRow 
&lt;BR /&gt;On the tFilterRow, enable advanced mode and set the filter expression as below: 
&lt;BR /&gt; 
&lt;PRE&gt;input_row.line.startsWith("ISIN")||input_row.line.startsWith("CUSIP")||input_row.line.startsWith("SEDOL")&lt;/PRE&gt; 
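&lt;BR /&gt;As a rough illustration outside Talend, the plain-Java sketch below applies the same prefix test to a few of the sample lines and splits each kept line at its first space. The class and method names (PrefixFilterSketch, keep, toPair) are hypothetical, and it assumes a single space separates key and value, as in the sample text. 

```java
// Sketch only: plain Java standing in for the filter and split steps.
// All names here are hypothetical and not part of Talend.
class PrefixFilterSketch {

    // Same test as the filter expression, applied to one line of text.
    static boolean keep(String line) {
        return line.startsWith("ISIN") || line.startsWith("CUSIP") || line.startsWith("SEDOL");
    }

    // Split a line such as "ISIN XS0283708575" into key and value at the first space.
    // Assumption: the key is a single word followed by one space.
    static String[] toPair(String line) {
        return line.trim().split(" ", 2);
    }

    public static void main(String[] args) {
        String[] lines = {
            "Identifiers",
            "ISIN XS0283708575",
            "CUSIP EG1215284",
            "SEDOL B1P8V35",
            "Titu Code 65083001"
        };
        for (String line : lines) {
            if (keep(line)) {
                String[] pair = toPair(line);
                // Skip lines that carry a key but no value.
                if (pair.length == 2) {
                    System.out.println(pair[0] + " : " + pair[1]);
                }
            }
        }
    }
}
```

&lt;BR /&gt;Inside a job the components above do this work; the sketch is only meant to show which lines the expression keeps and how they split. 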
&lt;BR /&gt;Best regards 
&lt;BR /&gt;Shong</description>
      <pubDate>Mon, 01 Aug 2011 07:18:42 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288518#M61997</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2011-08-01T07:18:42Z</dc:date>
    </item>
    <item>
      <title>Re: PDF and HTML parsers</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288519#M61998</link>
      <description>Hi Shong,
&lt;BR /&gt;I have set the 'Field Separator' to a space in the 'tExtractDelimitedFields' component.
&lt;BR /&gt;This works fine for the ISIN, CUSIP and SEDOL values, but I also have keys such as 'Titu Code' and 'Central Code'.
&lt;BR /&gt;For these it does not work, because the key itself contains a space.
&lt;BR /&gt;Could you please suggest how I can handle them?
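&lt;BR /&gt;One possible way to handle multi-word keys, sketched in plain Java: match each line against a list of known keys instead of splitting on every space. This assumes the full set of keys is known in advance. The key list below is taken from the sample text and is only illustrative, and the names KnownKeySketch, KEYS and toPair are hypothetical.

```java
// Sketch only: match each line against known keys so that
// multi-word keys such as "Titu Code" stay intact.
class KnownKeySketch {

    // Hypothetical key list, copied from the sample text in this thread.
    static final String[] KEYS = {
        "ISIN", "CUSIP", "SEDOL", "CFI Code", "Titu Code",
        "Central Code", "SIIB Code", "RIC", "Code Number"
    };

    // Returns {key, value} when the line starts with a known key, otherwise null.
    // The value is whatever follows the key and may be empty.
    static String[] toPair(String line) {
        String text = line.trim();
        for (String key : KEYS) {
            if (text.equals(key) || text.startsWith(key + " ")) {
                String value = text.substring(key.length()).trim();
                return new String[] { key, value };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "ISIN XS0283708575",
            "Titu Code 65083001",
            "Central Code",
            "Issuer Details"
        };
        for (String line : lines) {
            String[] pair = toPair(line);
            if (pair != null) {
                System.out.println(pair[0] + " : " + pair[1]);
            }
        }
    }
}
```

&lt;BR /&gt;Logic like this could probably be adapted into a custom routine or a tJavaRow, but that wiring is not shown here and would need testing against the real file.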
&lt;BR /&gt;
&lt;BR /&gt;Thanks &amp;amp; Regards,
&lt;BR /&gt;Syed</description>
      <pubDate>Mon, 01 Aug 2011 08:17:52 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288519#M61998</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2011-08-01T08:17:52Z</dc:date>
    </item>
    <item>
      <title>Re: PDF and HTML parsers</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288520#M61999</link>
      <description>Hi Shong,
&lt;BR /&gt;Can you please suggest how to solve this issue?
&lt;BR /&gt;Thanks &amp;amp; Regards,
&lt;BR /&gt;Syed</description>
      <pubDate>Tue, 02 Aug 2011 06:23:15 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-and-HTML-parsers/m-p/2288520#M61999</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2011-08-02T06:23:15Z</dc:date>
    </item>
  </channel>
</rss>

