topic Re: parsing HTML in Talend Studio

parsing HTML

_AnonymousUser — Fri, 28 Mar 2014 09:17:34 GMT

Hi everybody.
I have to get some information (everything legal) from HTML pages. Is there a way to extract just the information I need and strip out the HTML tags?
Thanks in advance.

Re: parsing HTML

Anonymous — Fri, 28 Mar 2014 10:58:34 GMT

hi,
use tFileFetch to fetch the HTML file:
https://help.talend.com/search/all?query=tFileFetch&content-lang=en
You can use a library like jSoup to parse the HTML (this means writing some Java code).
There are also Exchange components like tHTTPBot or tHTTPTableInput (for reading tables).
If the page is well-formed (X)HTML, you can use Talend's XML components after loading the HTML page.
hope it helps
regards
laurent
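
To illustrate the "write some Java code" route, here is a minimal sketch of stripping tags with nothing but the standard library. It is a crude regex approach swapped in for illustration, not jSoup and not a real HTML parser: it does not decode entities such as `&nbsp;` and can be fooled by unusual markup. The class and method names (`HtmlTextSketch`, `stripTags`) are made up for this example; where such code would sit in a Talend job (a routine or a custom Java step, for instance) is an assumption left open here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend or jSoup.
final class HtmlTextSketch {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive, across lines.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, including tags that span several lines.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Runs of spaces and tabs (line breaks are deliberately kept).
    private static final Pattern SPACES = Pattern.compile("[ \\t]+");

    static String stripTags(String html) {
        String withoutScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return SPACES.matcher(withoutTags).replaceAll(" ").trim();
    }
}
```

For anything beyond simple pages, a proper parser such as the jSoup library mentioned above is likely the safer choice.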

Re: parsing HTML

_AnonymousUser — Fri, 28 Mar 2014 11:38:31 GMT

I already tried, but ran into some problems. I started using Talend last week, so I'm not very experienced.
Could you please help me further with an example?
thank you very much.

Re: parsing HTML

_AnonymousUser — Tue, 01 Apr 2014 08:20:55 GMT

Hi everybody,
I tried tTikaExtractor and it works, but it doesn't remove the blank space between lines...
Is there a component that writes to the output file sequentially?
thanks in advance.

Re: parsing HTML

Anonymous — Tue, 01 Apr 2014 08:47:01 GMT

Hi rob911,
Would the tReplace component (see the Talend Help Center) solve your blank-space issue?

Is there a component that writes to the output file sequentially?

Could you give an example of what you mean by "write to the output file sequentially"?
Best regards
Sabrina

Re: parsing HTML

Anonymous — Tue, 01 Apr 2014 09:20:49 GMT

hi all,
perhaps you could pre-process your HTML file by reading it with tFileInputFullRow and checking the "skip empty rows" option.
hope it helps
regards
laurent

Re: parsing HTML

_AnonymousUser — Tue, 01 Apr 2014 10:07:34 GMT

I can't upload the screenshot of my file and talend job.
kzone, where should I put tFileInputFullRow? In my job I have tTikaExtractor -> FixedFlowInput -> tFileOutputDelimited....
xdshi, with tTikaExtractor I can remove every line of code from my HTML file, but the useful lines stay in the positions they had in the source.
thanks to you two, hoping you can get me to a solution

Re: parsing HTML

Anonymous — Tue, 01 Apr 2014 10:17:54 GMT

Hi,
You should register and log in as a Community member first; then you'll get an image upload box that lets you upload screen captures and images (limits: 20 images per post, each image must be smaller than 1024x768 pixels and 200 KB).
Best regards
Sabrina

Re: parsing HTML

_AnonymousUser — Tue, 01 Apr 2014 10:52:29 GMT

I'm already registered, but I can't log in and I don't know why.
Anyway, the problem is that the lines I'm interested in are not arranged sequentially in the file: there are too many empty rows, left where the code used to be.
I give Tika the URL I'm interested in and get the useful lines in a txt file, but they are in the same positions as in the HTML file, and I want them in consecutive rows.
I followed this post: https://community.talend.com/t5/Design-and-Development/how-do-we-retrieve-data-from-HTML-page/td-p/114529 .
But my output is different and I don't know why!

Re: parsing HTML

Anonymous — Tue, 01 Apr 2014 15:12:34 GMT

Anyway, the problem is that the lines I'm interested in are not arranged sequentially in the file: there are too many empty rows, left where the code used to be.

Since you have:
tTikaExtractor -> FixedFlowInput -> tFileOutputDelimited
next, read the delimited file with tFileInputFullRow, skipping empty rows ...
It's not the most efficient way (I'm sure of that, in fact), but I'm not sure what you're expecting.
regards
laurent

Re: parsing HTML

_AnonymousUser — Thu, 03 Apr 2014 08:11:22 GMT

Hi,
I tried tFileInputFullRow -> tFileOutputDelimited with empty rows skipped, but it doesn't remove the empty rows...
Regards
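
One possible explanation, which is only an assumption and is not confirmed anywhere in this thread, is that the "empty" rows actually contain spaces, tabs, or non-breaking spaces left over from the markup, so they are not strictly empty. A minimal Java sketch of filtering such rows follows; the names `BlankLineFilterSketch` and `dropBlankLines` are hypothetical, and where this logic would go in the job is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only the lines that contain visible text.
final class BlankLineFilterSketch {
    static List<String> dropBlankLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Treat non-breaking spaces (U+00A0) like ordinary spaces,
            // then trim, so whitespace-only lines count as empty.
            String visible = line.replace('\u00A0', ' ').trim();
            if (!visible.isEmpty()) {
                kept.add(line); // keep the original line untouched
            }
        }
        return kept;
    }
}
```

The surviving lines come out in their original order, one after another, which appears to be the "sequential rows" result asked for earlier in the thread.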

Re: parsing HTML

_AnonymousUser — Fri, 04 Apr 2014 08:17:52 GMT

Hi everybody,
OK, I no longer need the file to be in order.
I just need to extract some lines... is there a component that can help me with that? I need to specify some start words and some end words.
Thanks in advance.
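
The start-word / end-word idea can be sketched in plain Java as below. This is only an illustration under assumptions: the markers are matched as simple substrings, both marker lines are included in the output, and the names `LineRangeSketch` and `between` are invented for the example rather than taken from any Talend component.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects every block of lines that starts at a line
// containing startWord and ends at the next line containing endWord.
final class LineRangeSketch {
    static List<String> between(List<String> lines, String startWord, String endWord) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (!inside && line.contains(startWord)) {
                inside = true; // block opens on this line
            }
            if (inside) {
                out.add(line);
                if (line.contains(endWord)) {
                    inside = false; // block closes on this line (may be the same line)
                }
            }
        }
        return out;
    }
}
```

A block whose end word never appears runs to the end of the input; whether that is the desired behaviour depends on the pages being processed.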

Re: parsing HTML

_AnonymousUser — Fri, 04 Apr 2014 09:14:11 GMT

I'm using tFileInputRegex and it matches the lines I need... but how can I write these lines to an output file?
Using tFileInputRegex -> tFileOutputDelimited doesn't work.
regards
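
Outside of the components, the same "match lines with a regex, then write them out" step can be sketched in standard Java. This does not explain why the tFileInputRegex -> tFileOutputDelimited link fails; it only shows the logic as a fallback. The class name, method names, and the UTF-8 encoding are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Talend API.
final class RegexLineWriterSketch {
    // Returns the lines in which the regex matches anywhere.
    static List<String> filterMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                matched.add(line);
            }
        }
        return matched;
    }

    // Reads input, keeps matching lines, writes them one per line; returns how many were written.
    static int writeMatches(Path input, Path output, String regex) {
        try {
            List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
            List<String> matched = filterMatches(lines, regex);
            Files.write(output, matched, StandardCharsets.UTF_8);
            return matched.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

`find()` is used so the pattern may match any part of a line; use `matches()` instead if the whole line has to fit the pattern.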

Re: parsing HTML

Ashok_Panda — Mon, 07 Apr 2014 16:07:32 GMT

Hi Everybody,
I am reading an HTML file using tFileInputFullRow, but it's not reading the file from the beginning. It should start reading at the <html> tag, but it starts somewhere else, and I'm not sure where. Note: I have not checked the component's random option.