govardhant85
Contributor

[resolved] How to read a PDF input file from Talend

Hi,
Is it possible to read a PDF file through Talend? We have to read this file and load its data into a target table.
Can you please suggest an approach?
Regards
Govardhan Turaka

Accepted Solutions
Anonymous
Not applicable

Hi Govardhan Turaka,
Please have a look at a related forum thread: https://community.talend.com/t5/Design-and-Development/How-to-read-pdf-file-in-talend/td-p/99998
Best regards
Sabrina


4 Replies
tomwattsusa
Contributor

I was able to read the text of PDFs using the Apache PDFBox library (pdfbox-app-2.0.25.jar).

I used the tLibraryLoad component to load the jar, then a tJava component to read the file.

 

tJava Code:

/*
File file = new File("/opt/sample.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
System.out.println("Text:" + text);
document.close();
*/

// try-with-resources closes the document even if text extraction throws
try (PDDocument document = PDDocument.load(new File("/opt/pdf.pdf"))) {
  // skip encrypted documents
  if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text:" + text);
  }
}

 

 

tJava Advanced Settings:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
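
The tJava snippet above only prints the extracted text. To get from there to rows for a target table, the text has to be split up somehow. Below is a minimal sketch of that step in plain Java, under assumptions that are not in the original post: the class PdfTextRows, the method toRows, and the delimiter regex are all hypothetical, and it assumes the PDF text comes out one record per line with a consistent separator, which real PDFs often do not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Talend or PDFBox.
class PdfTextRows {

    // Splits text into non-empty lines, then splits each line into fields
    // using delimiterRegex (an assumption about how columns are separated).
    static List<String[]> toRows(String text, String delimiterRegex) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines between records
            }
            rows.add(trimmed.split(delimiterRegex));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample text standing in for the output of stripper.getText(document).
        String text = "id  name\n\n1  Alice\n2  Bob\n";
        // Assumed separator: two or more whitespace characters.
        for (String[] row : toRows(text, "\\s{2,}")) {
            System.out.println(String.join("|", row));
        }
    }
}
```

In a job, the String returned by stripper.getText(document) would be passed in place of the sample text. How the rows then reach the target table depends on the job design and is not shown here.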

 


 

Anonymous
Not applicable

Hello,

Thanks for sharing this solution with the community.

Best regards

Sabrina

SNad1654691194
Contributor

I followed these exact steps but was unable to get it working.

 

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.pdfbox.pdmodel.PDDocument.<clinit>(PDDocument.java:98)
    at local_project.test_0_1.test.tJava_1Process(test.java:501)
    at local_project.test_0_1.test.tLibraryLoad_1Process(test.java:415)
    at local_project.test_0_1.test.runJobInTOS(test.java:804)
    at local_project.test_0_1.test.main(test.java:642)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 5 more
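
The trace shows that PDDocument itself was found, but org.apache.commons.logging.LogFactory was not, so the Commons Logging classes appear to be missing from the job's classpath. One possible cause (an assumption, not confirmed in this thread) is that the jar loaded by tLibraryLoad does not include Commons Logging; if so, loading a commons-logging jar with a second tLibraryLoad may help. Below is a small sketch, using only the JDK, for checking which classes are visible at runtime. The class ClasspathCheck and the method isOnClasspath are hypothetical names.

```java
// Hypothetical diagnostic helper, not part of Talend or PDFBox.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's
    // class loader. The class is not initialized by the lookup.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace above.
        String[] names = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.commons.logging.LogFactory"
        };
        for (String name : names) {
            System.out.println(name + " -> " + isOnClasspath(name));
        }
    }
}
```

The body of isOnClasspath could also be pasted into a tJava component ahead of the PDF code, so the check runs with the same classpath as the job.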