<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic PDF data source in Talend in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/PDF-data-source-in-Talend/m-p/2204990#M5638</link>
    <description>Hello,
&lt;BR /&gt;PDF is a widely used format for storing information. Is there a connector in Talend that can read the content of a PDF file?
&lt;BR /&gt;Regards,
&lt;BR /&gt;SAmil</description>
    <pubDate>Sat, 16 Nov 2024 13:19:40 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2024-11-16T13:19:40Z</dc:date>
    <item>
      <title>PDF data source in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-data-source-in-Talend/m-p/2204990#M5638</link>
      <description>Hello,
&lt;BR /&gt;PDF is a widely used format for storing information. Is there a connector in Talend that can read the content of a PDF file?
&lt;BR /&gt;Regards,
&lt;BR /&gt;SAmil</description>
      <pubDate>Sat, 16 Nov 2024 13:19:40 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-data-source-in-Talend/m-p/2204990#M5638</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T13:19:40Z</dc:date>
    </item>
    <item>
      <title>Re: PDF data source in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/PDF-data-source-in-Talend/m-p/2204991#M5639</link>
      <description>PDFs are a nightmare data source for all ETL tools. Unfortunately, Talend is no exception. 
&lt;BR /&gt;Often, a PDF is stored as a single image rather than as text. This means that to retrieve any information from the "text" of the PDF, you would have to implement OCR routines. This is not a small task, and the big risk of this design is failing to extract all of the data from a PDF correctly. 
&lt;BR /&gt;If you have thousands of PDFs that must be loaded into the DB, it *might* be worth implementing OCR and integrating it into a Talend job. My advice is to try very hard to get your data in a machine-readable format, and to understand what you're getting into if you agree to parse PDF files.</description>
      <pubDate>Tue, 03 Aug 2010 00:35:59 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/PDF-data-source-in-Talend/m-p/2204991#M5639</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2010-08-03T00:35:59Z</dc:date>
    </item>
  </channel>
</rss>

