_AnonymousUser
Specialist III

parsing HTML

Hi everybody.
I need to extract some information (all of it legal) from HTML pages. Is there a way to keep just the information I need and strip out the HTML tags?
Thanks in advance.
13 Replies
Anonymous
Not applicable

hi,
use tFileFetch to fetch the HTML file:
https://help.talend.com/search/all?query=tFileFetch&content-lang=en
You can use a library like jsoup to parse the HTML (this means writing some Java code).
There are also Exchange components like tHTTPBot or tHTTPTableInput (to read a table).
If the page is well-formed (X)HTML, use the Talend XML components after loading the HTML pages.
hope it helps
regards
laurent
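
For anyone who wants to see what "deleting the HTML tags" looks like in plain Java (for example inside a tJava or tJavaRow component), here is a minimal sketch. It is not the jsoup approach suggested above: it swaps in a crude regex-based strip that only needs the JDK, and it will get confused by malformed markup or a `>` inside attribute values or comments. The class name `HtmlTextSketch` and the method `stripTags` are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Talend component: a crude regex-based tag stripper.
final class HtmlTextSketch {

    // Whole <script>/<style> elements, whose bodies are code rather than page text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Any remaining tag such as <p>, </td> or <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Each tag becomes a space so that neighbouring words do not get glued together.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; a real HTML parser handles them all.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        // Collapse runs of spaces and tabs, but leave line breaks alone.
        return text.replaceAll("[ \\t]+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Price list</h1><p>Apples: <b>3</b> EUR</p></body></html>";
        System.out.println(stripTags(html));
    }
}
```

If adding a jar to the job is an option, a real parser such as jsoup is the more robust choice; this sketch is only meant to show the idea.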
_AnonymousUser
Specialist III
Author

I already tried that, but ran into some problems. I only started using Talend last week, so I'm not very experienced.
Could you please help me further with an example?
thank you very much.
_AnonymousUser
Specialist III
Author

Hi everybody,
I tried tTikaExtractor and it works, but it doesn't remove the blank space between lines...
Is there a component that writes to the output file sequentially?
thanks in advance.
Anonymous
Not applicable

Hi rob911,
Does the tReplace component (TalendHelpCenter:tReplace) solve your blank-space issue?
"Is there a component that writes to the output file sequentially?"

Could you give an example of what you mean by "write to the output file sequentially"?
Best regards
Sabrina
Anonymous
Not applicable

hi all,
perhaps you could pre-process your HTML file by reading it with tFileInputFullRow with the "Skip empty rows" option checked.
hope it helps
regards
laurent
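
To make the "skip empty rows" idea concrete, here is a rough Java sketch of the same filtering done by hand, e.g. on the text that tTikaExtractor returns. It is not the code Talend generates for tFileInputFullRow. The names `SkipEmptyRowsSketch` and `nonEmptyRows` are invented for illustration, and treating whitespace-only rows as empty (and trimming the rows that are kept) is an assumption of this sketch, not documented component behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Talend-generated code.
final class SkipEmptyRowsSketch {

    static List<String> nonEmptyRows(String text) {
        List<String> rows = new ArrayList<>();
        // \R matches any line break (\n, \r\n, \r), so Windows and Unix files both work.
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            // Assumption: a row containing only spaces or tabs counts as empty.
            if (!trimmed.isEmpty()) {
                rows.add(trimmed);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String extracted = "Title\n\n\n   \nFirst useful line\n\n\nSecond useful line\n";
        // The surviving rows are written one after another, with no gaps.
        System.out.println(String.join(System.lineSeparator(), nonEmptyRows(extracted)));
    }
}
```

The useful lines keep their original order; only the gaps left behind by the removed markup disappear.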
_AnonymousUser
Specialist III
Author

I can't upload the screenshot of my file and Talend job.
kzone, where should I put tFileInputFullRow? In my job I have tTikaExtractor -> FixedFlowInput -> tFileOutputDelimited....
xdshi, with tTikaExtractor I can remove every line of code from my HTML file, but the useful lines stay in the same position they had in the source.
Thanks to you both, I hope you can help me reach a solution.
Anonymous
Not applicable

Hi,
You should register and log in as a Community member first; then you'll get an image upload box that lets you upload screen captures and images (limits: 20 images per post, each image smaller than 1024x768 pixels and no larger than 200 KB).
Best regards
Sabrina
_AnonymousUser
Specialist III
Author

I'm already registered, but I can't log in and I don't know why.
Anyway, the problem is that the lines I'm interested in are not laid out consecutively in the file: there are too many empty rows, and those empty rows are where the code used to be.
I give Tika the URL I'm interested in and get the useful lines in a txt file, but they are in the same positions as in the HTML file, and I want them in consecutive rows.
I followed this post: https://community.talend.com/t5/Design-and-Development/how-do-we-retrieve-data-from-HTML-page/td-p/1... .
But my output is different and I don't know why!
Anonymous
Not applicable

"Anyway, the problem is that the lines I'm interested in are not laid out consecutively in the file: there are too many empty rows, and those empty rows are where the code used to be."

Since you have:
tTikaExtractor -> FixedFlowInput -> tFileOutputDelimited
next, read the delimited file with tFileInputFullRow, skipping empty rows ...
I'm not sure it's the most efficient way (in fact I'm sure it isn't), but I'm also not sure what you're expecting.
regards
laurent