Anonymous
Not applicable

How to read a PDF file in Talend

Hello,
I need help reading the content of a PDF file into a variable so that I can store it in a text field in a database.
Which component am I supposed to use?
The process:
- list the files in a folder: ok
- read the file name to find the database row: ok
- read the content of the file to put it in the database: not ok
Does anyone have a solution?
Thanks,
David
21 Replies
Anonymous
Not applicable
Author

Ciao, thanks for sharing this.

However, it is not clear how I can specify which PDF file should be read inside the script.

Can you clarify?

Thanks

tomwattsusa
Contributor

I loaded the Apache PDFBox jar pdfbox-app-2.0.25.jar with a tLibraryLoad component, then used a tJava component to read a PDF file.

 

tJava Code:

/*
File file = new File("/opt/sample.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
System.out.println("Text:" + text);
document.close();
*/

 

 

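// Open the PDF; try-with-resources closes the document even if extraction fails.
try (PDDocument document = PDDocument.load(new File("/opt/pdf.pdf"))) {
  // Encrypted documents are skipped rather than extracted.
  if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    // getText returns the text of the whole document as one String.
    String text = stripper.getText(document);
    System.out.println("Text:" + text);
  }
}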

 

tJava Advanced Settings:

 

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
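
On the question of how to specify which PDF is read: the path in the tJava code above is hard-coded, so one option is to pass it in as a variable instead. The following is only a rough sketch in plain Java, outside Talend, of the three steps from the original post (list the folder, derive the database key from the file name, read the content). The names PdfFolderReader, readFolder and keyFromFileName are hypothetical, and using the file name without its extension as the row key is an assumption. The extractor parameter stands in for the PDFBox calls shown above, so the sketch itself has no PDFBox dependency. Inside a job that iterates with tFileList, the current path is usually available as ((String) globalMap.get("tFileList_1_CURRENT_FILEPATH")), assuming the component is named tFileList_1.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical helper, not part of Talend or PDFBox.
final class PdfFolderReader {

    // Assumption: the database row key is the file name without its extension,
    // e.g. "INV-1001.pdf" gives "INV-1001".
    static String keyFromFileName(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    // Lists the *.pdf files in a folder and maps each key to the extracted text.
    // The extractor is where the PDFBox code from the reply would be plugged in.
    static Map<String, String> readFolder(Path folder, Function<Path, String> extractor)
            throws IOException {
        Map<String, String> textByKey = new LinkedHashMap<>();
        try (Stream<Path> files = Files.list(folder)) {
            files.filter(p -> p.getFileName().toString()
                               .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                 .sorted()
                 .forEach(p -> textByKey.put(keyFromFileName(p), extractor.apply(p)));
        }
        return textByKey;
    }
}
```

Each map entry can then be written to the text column of the row identified by the key, for example through a database output component or a prepared UPDATE statement.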

 
