topic Re: PDF and HTML parsers in Talend Studio

PDF and HTML parsers

Anonymous — Sat, 16 Nov 2024 12:46:11 GMT

Hi,
Does Talend support parsing PDF and HTML files? If so, could you please let me know how to do this?
Thanks & Regards,
Syed

Re: PDF and HTML parsers

Anonymous — Sun, 31 Jul 2011 04:15:05 GMT

Hi
There is a custom component, tPDFToText, on Talend Exchange:
http://www.talendforge.org/exchange/index.php?eid=346&product=tos&action=view&nav=1,1,1
It converts a PDF file to a text file, from which you can then extract a delimited area.
For HTML files, you can try the tHTTPTableInput component:
http://www.talendforge.org/exchange/index.php?eid=72&product=tos&action=view&nav=1,1,1
Best regards
Shong

Re: PDF and HTML parsers

Anonymous — Mon, 01 Aug 2011 06:59:21 GMT

Hi Shong,
I have downloaded the PDF parser and added it to TOS. The component generates plain text, but I am not sure which component
to use to read this text file, because my PDF file contains a table that is converted into text as follows:
----------------------------------------------------------------------------------------
Instrument Details

Asset Type:CORPORATE DEBT Provider:BCP Golden Copy

Identifiers
ISIN XS0283708575
CUSIP EG1215284
SEDOL B1P8V35
CFI Code
Titu Code 65083001
Central Code
SIIB Code
RIC
Code Number NA
Issuer Details
Group Issue NO
--------------------------------------------------------------------------
From this file I need to prepare key-value pairs and load them into the DB.
Example:
ISIN : XS0283708575
CUSIP : EG1215284
SEDOL : B1P8V35
Please suggest how I can do this.
Thanks & Regards,
Syed

Re: PDF and HTML parsers

Anonymous — Mon, 01 Aug 2011 07:18:42 GMT

Hi
Do these three records always start with "ISIN", "CUSIP" and "SEDOL"? If so, use a tFileInputFullRow to read the file line by line, filter the rows that start with "ISIN", "CUSIP" or "SEDOL" in a tFilterRow, and then split each line into multiple fields with a tExtractDelimitedFields. For example:
tFileInputFullRow--main-->tFilterRow-->tExtractDelimitedFields-->tLogRow
On tFilterRow, use the advanced mode and set the filter expression as below:

input_row.line.startsWith("ISIN")||input_row.line.startsWith("CUSIP")||input_row.line.startsWith("SEDOL")
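
To try the same condition outside the Studio, here is a minimal plain-Java sketch of what the filter expression above does. The class name LineFilterSketch, the KEYS list and the helper methods are made up for illustration and are not part of any Talend API; inside a job, the line would come from input_row.line instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-alone class, only for testing the filter logic outside Talend.
class LineFilterSketch {
    // Assumed list of prefixes, taken from the three keys named in the post.
    static final List<String> KEYS = Arrays.asList("ISIN", "CUSIP", "SEDOL");

    // Same test as the tFilterRow expression: keep a line if it starts with one of the keys.
    static boolean keep(String line) {
        if (line == null) {
            return false;
        }
        for (String key : KEYS) {
            if (line.startsWith(key)) {
                return true;
            }
        }
        return false;
    }

    // Applies keep() to every line and returns the surviving lines in order.
    static List<String> filter(List<String> lines) {
        return lines.stream().filter(LineFilterSketch::keep).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Identifiers", "ISIN XS0283708575", "CUSIP EG1215284",
            "SEDOL B1P8V35", "Titu Code 65083001", "Issuer Details");
        for (String line : filter(sample)) {
            System.out.println(line);
        }
    }
}
```

Note that startsWith is case-sensitive, so a line such as "Issuer Details" is not picked up by the "ISIN" prefix.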

Best regards
Shong

Re: PDF and HTML parsers

Anonymous — Mon, 01 Aug 2011 08:17:52 GMT

Hi Shong,
I have set the 'Field Separator' to a space in the 'tExtractDelimitedFields' component.
This works fine for the ISIN, CUSIP and SEDOL values, but I also have keys such as 'Titu Code' and 'Central Code', which themselves contain a space.
For these it does not work.
Can you please suggest how I can handle these?

Thanks & Regards,
Syed
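
A single space as field separator cuts multi-word keys such as 'Titu Code' in the wrong place. One possible way around this, sketched here in plain Java, is to match each line against a list of known keys and treat the rest of the line as the value. This is only a sketch under assumptions: the class KeyValueSplitSketch, the KNOWN_KEYS list (copied from the sample text earlier in the thread) and the split method are hypothetical names, not Talend components. The same logic could presumably be adapted to a tJavaRow or a user routine.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits "key value" lines where the key itself may contain spaces.
class KeyValueSplitSketch {
    // Assumed key list, taken from the sample text file shown in the thread.
    static final List<String> KNOWN_KEYS = Arrays.asList(
        "ISIN", "CUSIP", "SEDOL", "CFI Code", "Titu Code",
        "Central Code", "SIIB Code", "RIC", "Code Number", "Group Issue");

    // Returns (key, value) for a line that begins with a known key, or null otherwise.
    // The value is an empty string when nothing follows the key.
    static Map.Entry<String, String> split(String line) {
        if (line == null) {
            return null;
        }
        String trimmed = line.trim();
        String best = null;
        for (String key : KNOWN_KEYS) {
            // The key must be the whole line or be followed by a space.
            boolean matches = trimmed.equals(key) || trimmed.startsWith(key + " ");
            // Prefer the longest matching key in case one key is a prefix of another.
            if (matches && (best == null || key.length() > best.length())) {
                best = key;
            }
        }
        if (best == null) {
            return null;
        }
        String value = trimmed.substring(best.length()).trim();
        return new SimpleEntry<>(best, value);
    }

    public static void main(String[] args) {
        String[] sample = {
            "ISIN XS0283708575", "Titu Code 65083001", "Central Code", "Instrument Details"};
        for (String line : sample) {
            Map.Entry<String, String> kv = split(line);
            System.out.println(kv == null
                ? "(skipped) " + line
                : kv.getKey() + " : " + kv.getValue());
        }
    }
}
```

Lines that do not begin with a known key, such as section headings, return null and can be dropped before loading into the DB. The approach only works if the set of keys is known in advance.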

Re: PDF and HTML parsers

Anonymous — Tue, 02 Aug 2011 06:23:15 GMT

Hi Shong,
Can you please suggest how to solve this issue?
Thanks & Regards,
Syed