Using Talend to crawl a website

_AnonymousUser — Wed, 03 Apr 2013 11:27:50 GMT

Hi,
My use case is this: I want to crawl a website, say talend.com, and load all of its content into Hadoop. After that I want to search the data for specific strings, use the matches to populate Hive, and create a report.
I want to use Talend to pull the data from the website and store it in Hadoop. I watched this video:
http://www.talend.com/resources/webinars/watch/215#validatewebinar
Based on it, when I use tFileFetch or tHttpRequest and connect to a URI such as "http://talend.com", I only get the first page, which I can save to a file. How can I iterate over the whole site? I need to know each distinct URL, such as talend.com/products, so that I can iteratively fetch every page under the master URL.

Re: Using Talend to crawl a website

Anonymous — Wed, 03 Apr 2013 21:17:31 GMT

I would use a regular expression to filter the content of the first page for links. After collecting all the links, you can iterate over them, fetch each page, extract its links in turn, and so on.
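
To make the idea concrete, here is a minimal sketch of that loop in plain Python rather than as a Talend job. It only illustrates the logic: pull href values out of a page with a regular expression, resolve them against the page URL, keep the ones on the same host, and visit each new URL once. The names extract_links, fetch_page and crawl, and the max_pages limit, are made up for this example and are not Talend components. The regex only handles quoted href attributes, so treat it as a starting point, not a complete HTML parser.

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen

# Matches href="..." or href='...' (quoted values only; an assumption of this sketch).
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) links on the same host as base_url, in page order."""
    host = urlparse(base_url).netloc
    links = []
    for raw in HREF_RE.findall(html):
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


def fetch_page(url):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=50):
    """Breadth-first crawl from start_url; returns a dict of {url: html}."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            # Network or HTTP error: skip this URL and carry on.
            continue
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://talend.com") would return the fetched pages keyed by URL, which could then be written out as files for loading into Hadoop. In a Talend job the same steps would presumably map onto a fetch component, a regex extraction step and an iterate link, but that mapping is an assumption and not something confirmed in this thread.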
