parsing HTML

_AnonymousUser · ‎2014-03-28

Hi everybody.
I need to extract some information (all of it legal) from HTML pages. Is there a way to keep just the information I need and strip out the HTML tags?
Thanks in advance.

Anonymous · ‎2014-03-28

hi,
use tFileFetch to read the HTML file:
https://help.talend.com/search/all?query=tFileFetch&content-lang=en
You can use a library like jsoup to parse the HTML (you'll need to write some Java code).
There are also Exchange components like tHTTPBot or tHTTPTableInput (to read a table).
If the page is well-formed (X)HTML, use the Talend XML components after loading the HTML page.
hope it helps
regards
laurent
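
For the well-formed (X)HTML case mentioned above, here is a minimal Java sketch of stripping tags with the JDK's built-in XML parser. It is only an illustration, not a Talend component: the class name XhtmlText, the method extractText and the sample page are made up for this example, and the code would need adapting before being used in a tJava component or a routine.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class XhtmlText {
    // Parses a well-formed XHTML string and returns its text with all tags removed.
    static String extractText(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        // getTextContent() concatenates the text of every descendant node.
        String raw = doc.getDocumentElement().getTextContent();
        // Collapse runs of whitespace (including line breaks) into single spaces.
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        // Sample page invented for the example.
        String page = "<html><body>\n"
                + "<h1>Price list</h1>\n"
                + "<p>Apples: <b>3</b> EUR</p>\n"
                + "</body></html>";
        System.out.println(extractText(page)); // prints "Price list Apples: 3 EUR"
    }
}
```

If the page is not well-formed, this parser throws an exception; that is where a tolerant HTML parser such as jsoup comes in (its Jsoup.parse(html).text() call returns the text without tags). Real pages that carry a DOCTYPE may also make the XML parser try to download the DTD, so treat this as a starting point only.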

_AnonymousUser · ‎2014-03-28

I already tried, but I ran into some problems. I only started using Talend last week, so I'm not very experienced with it.
Could you please help me a bit more, with an example?
thank you very much.

_AnonymousUser · ‎2014-04-01

Hi everybody,
I tried tTikaExtractor and it works, but it doesn't remove the blank space between lines...
Is there a component that writes to the output file sequentially?
thanks in advance.

Anonymous · ‎2014-04-01

Hi rob911,
Does the tReplace component (TalendHelpCenter:tReplace) solve your blank-space issue?

Is there a component that writes to the output file sequentially?

Could you give an example of what you mean by "write to the output file sequentially"?
Best regards
Sabrina
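
To make the search-and-replace idea concrete, here is a small Java sketch that removes blank lines from extracted text with a regular expression. It is plain Java, and whether the same pattern can be applied through tReplace in a given job is not confirmed here. The class name BlankLineRemover, the method removeBlankLines and the sample text are hypothetical.

```java
// Hypothetical helper class, named for this example only.
class BlankLineRemover {
    // Removes every line that is empty or contains only spaces and tabs.
    // A blank last line with no line break after it is left alone.
    static String removeBlankLines(String text) {
        // (?m) makes ^ match at the start of each line; \R matches any line break.
        return text.replaceAll("(?m)^[ \\t]*\\R", "");
    }

    public static void main(String[] args) {
        // Sample text invented for the example: useful lines separated by blank ones.
        String extracted = "Title\n\n\n   \nFirst useful line\n\nSecond useful line\n";
        System.out.print(removeBlankLines(extracted));
    }
}
```

With the sample text, the three useful lines end up on consecutive rows and their order is unchanged.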

Anonymous · ‎2014-04-01

hi all,
perhaps you could pre-process your HTML file by reading it with tFileInputFullRow and checking the "skip empty rows" option.
hope it helps
regards
laurent
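
As a rough picture of what reading a file row by row and skipping empty rows amounts to, here is a Java sketch using only the JDK. It is an assumption about the intended behaviour, not the component's actual implementation: the class name SkipEmptyRows, the method copyWithoutEmptyRows, the UTF-8 encoding and the temporary file names are all chosen for the example, and trimming the kept lines is an extra step added here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
class SkipEmptyRows {
    // Copies source to target, keeping only lines that contain non-whitespace text.
    static void copyWithoutEmptyRows(Path source, Path target) throws IOException {
        List<String> kept = new ArrayList<>();
        for (String line : Files.readAllLines(source, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed); // original order is preserved
            }
        }
        Files.write(target, kept, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the extracted text and the cleaned output.
        Path in = Files.createTempFile("tika_output", ".txt");
        Path out = Files.createTempFile("cleaned", ".txt");
        Files.write(in, "Title\n\n   \nFirst useful line\n\nSecond useful line\n"
                .getBytes(StandardCharsets.UTF_8));
        copyWithoutEmptyRows(in, out);
        Files.readAllLines(out, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

The output file then holds the useful lines one after another, which seems to be what "sequential rows" means in this thread.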

_AnonymousUser · ‎2014-04-01

I can't upload the screenshot of my file and Talend job.
kzone, where should I put tFileInputFullRow? In my job I have tTikaExtractor -> tFixedFlowInput -> tFileOutputDelimited....
xdshi, with tTikaExtractor I can remove every line of code from my HTML file, but the useful lines stay in the positions they had in the original code.
Thanks to you both, I hope you can help me get to a solution.

Anonymous · ‎2014-04-01

Hi,
You should register and log in as a Community member first; then you'll get an image upload box that lets you upload screen captures and images (limits: 20 images per post; each image must be smaller than 1024x768 pixels and 200 KB).
Best regards
Sabrina

_AnonymousUser · ‎2014-04-01

I'm already registered, but I can't log in and I don't know why.
Anyway, the problem is that the lines I'm interested in are not arranged sequentially in the file: there are too many empty rows, and those empty rows are where the code used to be.
I put the URL I'm interested in into Tika and get the useful lines in a txt file, but they are in the same positions as in the HTML file, and I want them in consecutive rows.
I followed this post: https://community.talend.com/t5/Design-and-Development/how-do-we-retrieve-data-from-HTML-page/td-p/1... .
But my output is different and I don't know why!

Anonymous · ‎2014-04-01

Anyway, the problem is that the lines I'm interested in are not arranged sequentially in the file: there are too many empty rows, and those empty rows are where the code used to be.

As you have:
tTikaExtractor -> tFixedFlowInput -> tFileOutputDelimited
next, read the delimited file with tFileInputFullRow, skipping empty rows...
I'm not sure it's the most efficient way (in fact, I'm sure it isn't), but I'm also not sure what you're expecting.
regards
laurent

Talend Data Integration

v5.x
