Anonymous
Not applicable

How to use the PDFs at different URLs and integrate their parameters into a file

Hi all, 

I saw the other topic that was posted. Unfortunately, its solution does not fit my needs.

I have different PDFs on the site https://www.cert.ssi.gouv.fr/. How can I extract the info from each PDF, export it to a database, and integrate it into a table?

 

Thank you very much,

8 Replies
fdenis
Master

PDFs are not data files!
Are there tags inside?
Can you get them as Excel files?
Do you have an OCR tool?
fdenis
Master

You perhaps need an RPA application.
Anonymous
Not applicable
Author

First of all, thank you for your answer, but I did not fully understand it. I want to use the PDFs on the site https://www.cert.ssi.gouv.fr/, extract the data table that exists in each PDF, and then integrate those tables into a single table.

Regards

fdenis
Master

It's good advertising for this site, but:
PDF files are meant for printing; they contain printable data.
Sometimes they also contain data in PDF tags, which is useful for indexing.
To convert a PDF to text you may need OCR, at least when the pages are scanned images.
RPA automates processes that a human would otherwise perform.
Talend is an ETL tool: it works with data, and it does not work with PDFs (as far as I know).
Anonymous
Not applicable
Author


I found a tpdftotext component that other users created on Talend Exchange, but I need to extract the table that is in each PDF, so it doesn't work for me.

Anonymous
Not applicable
Author

Hi,

 

If you are using a custom component, I would suggest contacting the author of the component directly. Reading from PDFs is not a good strategy, because the data in a PDF is laid out for easy reading by humans. If you do have to read the data in the PDFs, why not go to the source system that feeds the PDFs and pick the data up from there?

 

That is the ideal approach in an enterprise environment.

 

Tail note: Amazon is building a new service called Textract to read PDFs, but it is currently in preview. Once it is ready, you can make API calls from Talend to get the result set. Many third-party companies also offer APIs for fetching the data. You can try that route.

 

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂

Anonymous
Not applicable
Author

Hi @nthampi,
First of all, thank you for your time and your answer. I'm new to Talend, so let me ask my question with a simple example: how can I get the data in the DOCUMENT MANAGEMENT section of https://www.cert.ssi.gouv.fr/alerte/CERTFR-2019-ALE-008/
into a table like this:
Reference:               CERTFR-2019-ALE-008
Title:                   Vulnerability in Microsoft SharePoint Server
Date of first version:   29 May 2019
Date of last version:    29 May 2019
Source(s):               Microsoft Security Bulletin CVE-2019-0604 dated February 12, 2019

Thanks,

Regards

Anonymous
Not applicable
Author

Hi,

 

The simple answer is that there are no standard components in the Talend palette for this requirement. There might be components created by Talend community members on exchange.talend.com.

 

Another option is to write custom Java code to read the data, using routines in Talend, or to call a third-party API through REST calls from Talend.
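
To give a rough idea of the custom Java route, here is a minimal sketch of a routine-style helper. It assumes the PDF text has already been extracted to a plain string by some other step (a community component or a third-party API), and that each field sits on its own line starting with its label, as in the example table posted earlier in this thread. The class name AdvisoryFieldParser, the method name parse, and the label list are all hypothetical; the real PDFs may use different (for example French) labels, so treat this as a starting point and not as a tested solution for that site.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; in Talend this could live in a user routine.
class AdvisoryFieldParser {

    // Labels copied from the example in this thread (an assumption about the PDF text).
    static final String[] LABELS = {
        "Reference", "Title", "Date of first version",
        "Date of last version", "Source(s)"
    };

    // Returns label -> value for every known label found at the start of a line.
    static Map<String, String> parse(String extractedText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String rawLine : extractedText.split("\\R")) {
            String line = rawLine.trim();
            for (String label : LABELS) {
                // Case-insensitive "starts with label" check.
                if (!line.regionMatches(true, 0, label, 0, label.length())) {
                    continue;
                }
                String rest = line.substring(label.length());
                // The label must be followed by a colon, whitespace, or end of line,
                // so that a word like "References" is not mistaken for "Reference".
                if (!rest.isEmpty() && rest.charAt(0) != ':'
                        && !Character.isWhitespace(rest.charAt(0))) {
                    continue;
                }
                // Drop the separator (colon and/or spaces) before the value.
                String value = rest.replaceFirst("^[\\s:]+", "");
                fields.putIfAbsent(label, value); // keep the first occurrence only
                break;
            }
        }
        return fields;
    }
}
```

Each returned map corresponds to one row of the target table, with the labels as column names. In a job, the map entries could be copied to the output columns in a tJavaRow or a similar component, one PDF per row.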

 

Warm Regards,
Nikhil Thampi
