Re: Web Scraping in Talend Studio

Web Scraping

Anonymous — Sat, 16 Nov 2024 11:13:59 GMT

Hi everyone,
I have the URL of a web page that contains a number of links. For each link, I need to scrape all of the content of the linked page.
I want to do this with TOS. It's the first time I've done anything like this.
Do I need to use a script, for example in Python, combined with a Talend job? Or can I do everything with Talend components alone (so without scripts)? Which components should I use?
Thanks all

Re: Web Scraping

Anonymous — Fri, 08 May 2015 09:48:07 GMT

Hello
Take a look at the tHttpRequest component. It sends an HTTP request to the server and returns the page content for the URL. Then use a regular expression or the tExtractXMLField component to extract all links from the response, and finally iterate over the links one by one. For example:
tHttpRequest--main--tExtractXMLField--main--tFlowToIterate--iterate--tHttpRequest--main--tLogRow
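If you go the regular expression route, the link extraction step could look roughly like the Java sketch below. Everything in it is an assumption for illustration: LinkExtractor and extractLinks are hypothetical names, not Talend components or APIs, and the pattern only handles quoted href attributes in anchor tags. It may be useful when the page is not well-formed XML, which is common for real HTML.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Talend): pulls link targets out of an HTML string.
final class LinkExtractor {

    // Matches a quoted href attribute inside an <a ...> tag, case-insensitively.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns absolute, de-duplicated links in the order they appear in the page.
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            String lower = href.toLowerCase();
            // Skip empty targets, in-page anchors and non-HTTP pseudo-links.
            if (href.isEmpty() || href.startsWith("#")
                    || lower.startsWith("javascript:") || lower.startsWith("mailto:")) {
                continue;
            }
            try {
                // Resolve relative links against the URL of the page they came from.
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // The href is not a valid URI (for example it contains spaces); ignore it.
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a.html\">A</a> <a class='x' href='b.html'>B</a>";
        for (String link : extractLinks(html, "http://example.com/dir/index.html")) {
            System.out.println(link);
        }
    }
}
```

In a job, a method like this could live in a custom routine and be called from a tJavaRow placed after the first tHttpRequest, with each returned link passed on to tFlowToIterate. That wiring is only a suggestion and depends on how your job is laid out.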
Best regards
Shong