PDF data source in Talend

Hello,
PDF is a widely used format for storing information. Is there a connector that can read the content of a PDF file in Talend?
Regards,
SAmil
1 Reply

PDFs are a nightmare data source for all ETL tools, and unfortunately Talend is no exception.
Often, a PDF page is stored as a single image (a scanned document, for example) rather than as text. To retrieve any information from the "text" of such a PDF, you would have to implement OCR routines. That is not a small task, and the risk that some of the data is extracted incorrectly is the main weakness of this design.
If you have thousands of PDFs that must be loaded into the DB, it *might* be worth implementing OCR and integrating it into a Talend job. My advice is to try very hard to get your data in a machine-readable format, and to understand what you're getting into before you agree to parse PDF files.
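
If you do go down this road, one way to limit the OCR work is to try plain text extraction first and send only the pages that come back empty to OCR. Below is a rough sketch of that triage step in plain Java, which could live in a custom Talend routine. It assumes that some PDF library (not shown here) has already given you one extracted string per page. The class name, method names, and the 20-character threshold are all made up for illustration and are not part of Talend or any PDF library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Talend: decides which pages of a PDF
// still need OCR after a text-extraction pass has already been tried.
final class PdfTextTriage {

    // Assumed threshold: a page with fewer visible characters than this
    // is treated as image-only. Tune it for your own documents.
    static final int MIN_VISIBLE_CHARS = 20;

    // Counts the characters in the text that are not whitespace.
    static int visibleChars(String text) {
        if (text == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // True when the extracted text is too short to be a real text layer.
    static boolean needsOcr(String extractedPageText) {
        return visibleChars(extractedPageText) < MIN_VISIBLE_CHARS;
    }

    // Returns the 1-based numbers of the pages that should go to OCR.
    static List<Integer> pagesNeedingOcr(List<String> pageTexts) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (needsOcr(pageTexts.get(i))) {
                pages.add(i + 1);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in for the per-page output of a text-extraction library.
        List<String> pageTexts = Arrays.asList(
            "Invoice 1042, issued to Example Corp for consulting services.",
            "   \n  ",
            "");
        System.out.println(pagesNeedingOcr(pageTexts)); // prints [2, 3]
    }
}
```

This only tells you where OCR is needed. The OCR itself, and checking that its output is correct, is still the hard and risky part.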