Anonymous
Not applicable

How to read pdf file in talend

Hello,
I need help reading the content of a PDF file into a variable so that I can store it in a text field in a database.
What sort of component am I supposed to use?
The process:
- list the files in a folder: ok
- read the file name to find the database row: ok
- read the content of the file to put it in the database: not ok
Does anyone have a solution?
Thanks,
David
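
For reference, the first two steps of the process above can be sketched in plain Java. This is only an illustration under assumptions: the class name `PdfFolderScan`, both method names, and the convention that the file name without its extension is the database row key are all hypothetical, not something stated in the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PdfFolderScan {

    /** Lists the regular files matching *.pdf directly inside the folder, sorted by path. */
    public static List<Path> listPdfFiles(Path folder) {
        List<Path> result = new ArrayList<>();
        // The glob is matched against the file name; case sensitivity depends on the platform.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, "*.pdf")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    result.add(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(result);
        return result;
    }

    /** Assumed convention: "INV-0042.pdf" belongs to the row whose key is "INV-0042". */
    public static String rowKeyFor(Path pdfFile) {
        String name = pdfFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }
}
```

The third step, extracting the text itself, needs a PDF library and is not covered by this sketch.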
21 Replies
Anonymous
Not applicable
Author

Hi,
Thanks for the very useful information.
I have written a method to read the PDF.
Can you please help me add the method as a routine, so that I can run the code from the Talend tool?
When I create a job I am able to view the code, but I am not able to edit it to add my method.
Please give me a suggestion.

Thanks,
caba
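
A rough sketch of what a user routine looks like, since the generated job code is read-only and custom methods go into a routine instead. A Talend routine is an ordinary Java class in the `routines` package whose public static methods can be called from a job. The class name `PdfRoutines` and the method `looksLikePdf` are made up for illustration; the method is only a stand-in (it checks for the `%PDF-` signature at the start of the file), and your own PDF-reading method would sit next to it in the same way.

```java
// package routines;  // Talend Studio places user routines in this package;
//                    // commented out here so the sketch compiles standalone.

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class PdfRoutines {

    /**
     * Hypothetical example routine: returns true if the file starts with
     * the "%PDF-" signature, false otherwise or if it cannot be read.
     */
    public static boolean looksLikePdf(String filePath) {
        byte[] expected = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] header = new byte[expected.length];
        try (InputStream in = Files.newInputStream(Paths.get(filePath))) {
            int total = 0;
            // read() may return fewer bytes than requested, so loop until the header is full.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                total += n;
            }
            return total == expected.length && Arrays.equals(header, expected);
        } catch (IOException e) {
            return false;
        }
    }
}
```

Inside a job, for example in a tJava or in a component expression, the call would then look like `PdfRoutines.looksLikePdf((String) globalMap.get("tFileList_1_CURRENT_FILEPATH"))`, assuming the files come from a component named `tFileList_1`. If your method depends on a third-party jar, that jar also has to be made available to the routine.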
Anonymous
Not applicable
Author

Check out the documentation https://help.talend.com/search/all?query=Managing+user+routines&content-lang=en
and let us know if you need further assistance.
Anonymous
Not applicable
Author

Hello Cabajones,
would you be so kind as to share your routine?
I am sure it would help others too.
Thanks,
Anonymous
Not applicable
Author

Is there any change in the status of this: "no component exists to read PDFs"?
Given the nature of PDFs, that's what I'd expect, just checking.
Anonymous
Not applicable
Author

Why should an ETL tool read a PDF file?
Anonymous
Not applicable
Author

I agree it doesn't make good sense, but my boss told me to ask. Your answer is reassuring.
Anonymous
Not applicable
Author

Good question. At the moment you have to use self-written code in a tJavaFlex, but I do not know how to read a PDF.
I would google for it. Sorry.
The only problem is: a PDF can be created from images, and the structure of its text follows the page layout rather than a fixed structure like an HTML table. A solution would mostly be an individual one for a particular PDF file, and every layout change in the file will have an impact on your code.
Anonymous
Not applicable
Author

Is there any change in the status of "no component exists to read PDFs"?
Okay, even if no component exists, is there any way to extract particular columnar data (no physical table structure is drawn in the PDF, but the data is visually divided into columns) and store it in DB table columns?
Using Java code and the iText library in a routine, I am able to read the PDF file, but as mentioned above, how do I extract the columns from the PDF?
Any code or URL reference for this would be helpful.
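
One possible starting point, as a minimal sketch: once the text of a page has been extracted as a string, split each line wherever two or more whitespace characters occur and keep only the lines that yield the expected number of cells. This assumes the extraction preserves the gaps between columns as runs of spaces, which depends on the library and the extraction strategy used, and it is tied to one particular layout. `PdfColumns`, `splitColumns` and `parseRows` are hypothetical names, not part of iText or Talend.

```java
import java.util.ArrayList;
import java.util.List;

public class PdfColumns {

    /** Splits one extracted text line into cells at runs of two or more whitespace characters. */
    public static String[] splitColumns(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // A single space stays inside a cell ("John Smith"); a wider gap separates cells.
        return trimmed.split("\\s{2,}");
    }

    /** Parses extracted page text and keeps only the lines with the expected column count. */
    public static List<String[]> parseRows(String pageText, int expectedColumns) {
        List<String[]> rows = new ArrayList<>();
        for (String line : pageText.split("\\r?\\n")) {
            String[] cells = splitColumns(line);
            if (cells.length == expectedColumns) {
                rows.add(cells); // titles, footers and blank lines are skipped
            }
        }
        return rows;
    }
}
```

Each returned array could then be mapped to the columns of the target table. Empty cells, or values that themselves contain wide gaps, would break this heuristic, so it needs checking against the real files.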
Anonymous
Not applicable
Author

Google "Java API for reading PDF files".
This is an unusual requirement (for the reasons already explained above), but if there is text in the PDF that can be retrieved, the best way is to write a Java routine that makes use of an existing Java API. One of Talend's massive advantages over other tools is the ease with which you can write your own components or just add code to a tJavaFlex to make use of third-party APIs.
Anonymous
Not applicable
Author

Hi Talend team,
We have a requirement to read data from one or more PDF files. We wanted to know whether Talend provides any component through which we can read the content of PDF files.
I have gone through various posts found via Google, and most of them say it can be done using a piece of Java code. The issue is that such code is customized for a particular file and does not work universally for any kind of PDF file. So I request you to share something on this, so that I can get a clear picture and decide whether to go ahead with Talend as the ETL tool for my assignment. Any sort of help would be appreciated.
Thanks