_AnonymousUser
Specialist III

Using Talend to crawl a website

Hi,
My use case is this: I want to crawl a website, say talend.com, extract all of its content into Hadoop, then search the data for specific strings, populate Hive, and build a report.
I want to use Talend to pull data from a website and store it in Hadoop. I watched this video:
http://www.talend.com/resources/webinars/watch/215#validatewebinar
Based on it, when I use a tFileFetch or tHttpRequest component pointed at a URI such as "http://talend.com", I only get the first page, which I can save to a file. How can I discover each distinct URL on the site (talend.com/products, etc.) and iteratively fetch every page under a master URL?
1 Reply
Anonymous
Not applicable

I would use a regular expression to filter the content of the first page for links. After collecting all the links, you can iterate over them, fetch each page, extract its links in turn, and so on.
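A minimal Java sketch of the regex-and-iterate idea described above. The class name, pattern, and sample HTML are illustrative, not from Talend; in a real job this logic would typically live in a custom routine or a tJavaFlex, with the extracted URLs driving a tFileFetch via iteration. Note that a regex misses many HTML edge cases, so an HTML parser (e.g. Jsoup) is usually preferable for production crawling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Illustrative pattern: captures the value of href="..." or href='...',
    // skipping pure fragment links. Not robust to all HTML, by design.
    static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Return every href value found in the raw HTML of a fetched page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical fetched page content standing in for tFileFetch output.
        String page = "<a href=\"http://talend.com/products\">Products</a>"
                    + "<a href='/download'>Download</a>";
        // Each extracted URL would then be fetched and scanned in turn.
        System.out.println(extractLinks(page));
    }
}
```

Each URL this returns can be fed back into the fetch step, with a visited set to avoid re-crawling the same page, until no new links appear.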