govardhant85
Contributor

[resolved] How to read a PDF input file from Talend

Hi,
Is it possible to read a PDF file with Talend? We need to read this file and load its data into a target table.
Could you please suggest an approach?
Regards
Govardhan Turaka

Accepted Solutions
Anonymous
Not applicable

Hi Govardhan Turaka,
Please have a look at this related forum thread: https://community.talend.com/t5/Design-and-Development/How-to-read-pdf-file-in-talend/td-p/99998
Best regards
Sabrina


4 Replies
tomwattsusa
Contributor

I was able to read the text of PDFs using the Apache PDFBox library (pdfbox-app-2.0.25.jar).

I used a tLibraryLoad component to load the jar, then a tJava component to read the file.

tJava Code:

/*
File file = new File("/opt/sample.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
System.out.println("Text:" + text);
document.close();
*/

// try-with-resources closes the document even if text extraction throws
try (PDDocument document = PDDocument.load(new File("/opt/pdf.pdf"))) {
  // Extract text only when the document is not encrypted
  if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text:" + text);
  }
}

tJava Advanced Settings:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
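
The tJava code above only prints the extracted text. To load it into a target table, the string would probably need to be split into records first. Here is a minimal sketch in plain Java, assuming one record per non-blank line. The class and method names (ExtractedTextRows, toRows) are hypothetical and not part of Talend or PDFBox, and real PDFs will often need layout-specific parsing beyond this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns text extracted from a PDF into candidate rows.
public class ExtractedTextRows {

    // Assumption: each non-blank line of the extracted text is one record.
    public static List<String> toRows(String text) {
        List<String> rows = new ArrayList<>();
        if (text == null) {
            return rows;
        }
        // \R matches any line break sequence (\n, \r\n, \r, ...)
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                rows.add(trimmed);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input standing in for the output of PDFTextStripper.getText(...)
        String text = "first line\r\n\n  second line  \n";
        for (String row : toRows(text)) {
            System.out.println("Row: " + row);
        }
    }
}
```

In a job, the resulting rows could then be passed on to whatever output component writes to the target table. That wiring depends on the job design and is not shown here.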

 


 

Anonymous
Not applicable

Hello,

Thanks for sharing this solution with the community.

Best regards

Sabrina

SNad1654691194
Contributor

I followed these exact steps but could not get it to work:

 

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.pdfbox.pdmodel.PDDocument.<clinit>(PDDocument.java:98)
    at local_project.test_0_1.test.tJava_1Process(test.java:501)
    at local_project.test_0_1.test.tLibraryLoad_1Process(test.java:415)
    at local_project.test_0_1.test.runJobInTOS(test.java:804)
    at local_project.test_0_1.test.main(test.java:642)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 5 more
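
The trace says that org.apache.commons.logging.LogFactory could not be found when PDDocument was initialized. PDFBox 2.x uses Apache Commons Logging, so that class has to be on the job's classpath alongside PDFBox. One likely cause, which is only an assumption here, is that the loaded jar does not bundle commons-logging; if so, loading a commons-logging jar with an additional tLibraryLoad may help. The sketch below is a plain-Java check for which of the required classes are visible at runtime. The class and method names (ClasspathCheck, isOnClasspath, missingClasses) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic: reports which required classes cannot be found.
public class ClasspathCheck {

    // Returns true if the class loader that loaded this class can locate the named class.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of classNames that could not be located, in input order.
    public static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Class names taken from the tJava imports and the stack trace above
        List<String> required = Arrays.asList(
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.commons.logging.LogFactory");
        for (String name : missingClasses(required)) {
            System.out.println("Missing from classpath: " + name);
        }
    }
}
```

Running the body of main from a tJava placed after the tLibraryLoad components should show whether both classes are visible before PDDocument.load is called.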