Anonymous
Not applicable

How to read a PDF file in Talend

Hello,
I need help reading the content of a PDF file into a variable so that I can store it in a text field in a database.
Which component am I supposed to use?
The process:
- list the files in a folder: ok
- read the file name to find the database row: ok
- read the content of the file to put it in the database: not ok
Does anyone have a solution?
Thanks,
David
21 Replies
Anonymous
Not applicable
Author

Hello David
Unfortunately, there is no component that can be used to extract data from a PDF file.
Best regards

shong
Anonymous
Not applicable
Author

OK, I found a solution: use a tJava after a tFileExist with this code:
// Requires "import java.io.*;" in the tJava Advanced settings
// Read the file found by tFileExist_2 line by line into a single string
String chaine = "";
InputStream ips = new FileInputStream((String) globalMap.get("tFileExist_2_FILENAME"));
InputStreamReader ipsr = new InputStreamReader(ips);
BufferedReader br = new BufferedReader(ipsr);
String ligne;
while ((ligne = br.readLine()) != null) {
    chaine += ligne + "\n";
}
br.close();
In the next component, use the chaine variable from the tJava component.
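How the chaine variable reaches the next component is not spelled out in this post. One common pattern, sketched here as an assumption rather than as what the author did, is to store the value in globalMap (a Map<String, Object> in generated Talend jobs) and read it back downstream. The class name TJavaHandOff, the method names, and the key "pdfContent" are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;

class TJavaHandOff {
    // Same line-by-line read as the tJava snippet, wrapped in a method.
    static String readAll(String fileName) throws IOException {
        StringBuilder chaine = new StringBuilder();
        // try-with-resources closes the reader even if a read fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName)))) {
            String ligne;
            while ((ligne = br.readLine()) != null) {
                chaine.append(ligne).append("\n");
            }
        }
        return chaine.toString();
    }

    // Stand-in for the tJava body: read the file, then publish the result.
    // "pdfContent" is a hypothetical key, not something Talend defines.
    static void tJavaBody(Map<String, Object> globalMap) throws IOException {
        String fileName = (String) globalMap.get("tFileExist_2_FILENAME");
        globalMap.put("pdfContent", readAll(fileName));
    }
}
```

Under that assumption, a downstream expression would read the value back with (String) globalMap.get("pdfContent").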
Anonymous
Not applicable
Author

In the end I prefer another solution:
create a routine (in Java) with a readFile function;
in the tMap before the data insertion, call routines.classname.functionname(pdffilenametoread)
Anonymous
Not applicable
Author

Hello friend
Can you share your job and routine on the forum?
Thanks for your support!
Best regards

shong
Anonymous
Not applicable
Author

I can't share the project because it belongs to my company, sorry about that.
To make this work:
In the Talend Repository, create a new routine:
// template routine Java
package routines;
import java.io.*;
/*
* user specification: the function's comment should contain keys as follows: 1. write about the function's comment.but
* it must be before the "{talendTypes}" key.
*
* 2. {talendTypes} 's value must be talend Type, it is required . its value should be one of: String, char | Character,
* long | Long, int | Integer, boolean | Boolean, byte | Byte, Date, double | Double, float | Float, Object, short |
* Short
*
* 3. {Category} define a category for the Function. it is required. its value is user-defined .
*
* 4. {param} 's format is: {param} <type> <name>
*
* <type> 's value should be one of: string, int, list, double, object, boolean, long, char, date. <name>'s value is the
* Function's parameter name. the {param} is optional. so if you the Function without the parameters. the {param} don't
* added. you can have many parameters for the Function.
*
* 5. {example} gives a example for the Function. it is optional.
*/
public class fichierRef {
    /**
     * readFile: reads the PDF file and returns its content as a string
     *
     *
     * {talendTypes} String
     *
     * {Category} User Defined
     *
     * {param} string() input: the name of the file to read
     *
     * {example} readFile("/etc/passwd") # hacking in progress ...
     */
    public static String readFile(String fichier) {
        // StringBuilder avoids re-copying the whole string on every line
        StringBuilder chaine = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fichier)))) {
            String ligne;
            while ((ligne = br.readLine()) != null) {
                chaine.append(ligne).append("\n");
            }
            return chaine.toString();
        } catch (Exception e) {
            // Any error (missing file, read failure) yields an empty string
            return "";
        }
    }
}

In any tMap where you need it, use an expression like this:
routines.fichierRef.readFile(row3.filename).getBytes()
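
The routine above decodes the file as text and then re-encodes it with getBytes(). For a binary PDF that round trip is not guaranteed to give back the original bytes: readLine() drops the original line terminators, and byte sequences that are not valid in the default charset may be replaced. If the goal is to store the file unchanged in a binary column (an assumption about the target schema), a minimal sketch is to read the bytes directly. The class name PdfBytes and the method readFileBytes are hypothetical and not part of the original post. Note that neither approach extracts the readable text of a PDF; that needs a PDF parsing library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper. In a Talend routine this would be declared
// "public class" inside "package routines;".
class PdfBytes {
    /**
     * Returns the exact bytes of the file, or an empty array if an
     * I/O error occurs (for example, the file does not exist).
     */
    public static byte[] readFileBytes(String fichier) {
        try {
            return Files.readAllBytes(Paths.get(fichier));
        } catch (IOException e) {
            return new byte[0];
        }
    }
}
```

Under those assumptions the tMap expression would become routines.PdfBytes.readFileBytes(row3.filename), with no getBytes() call.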
Anonymous
Not applicable
Author

Note that you could use a PDF library (such as iText) to extract some metadata.
Anonymous
Not applicable
Author

Hi,
Urgent, please.
I am new to Talend.
I need help reading a PDF and writing its contents to a txt file. Can someone help me get started?

I also tried adding tFileOutputPDF by setting the user component folder in the options window (Preferences > Talend > Components > User component folder), but the component does not appear in the palette.
Please give me some suggestions.

Thanks,
jones
Anonymous
Not applicable
Author

Hi Cabajones
tFileOutputPDF is a component that you can download from Talend Exchange (http://www.talendforge.org/exchange/).

Thanks
B. Anil Kumar
Anonymous
Not applicable
Author

Hi Jones
tFileOutputPDF is used to write data to a PDF file. There is no component that can read data from a PDF file; you need to hard-code the reading in a routine, as arfman did, and call it in a job.
Shong