<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Using Talend to crawl a website in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Using-Talend-to-crawl-a-website/m-p/2354084#M120128</link>
    <description>I would use a regular expression to filter the content of the first page for links. After collecting all the links, you can iterate over them, and so on.</description>
    <pubDate>Wed, 03 Apr 2013 21:17:31 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2013-04-03T21:17:31Z</dc:date>
    <item>
      <title>Using Talend to crawl a website</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Using-Talend-to-crawl-a-website/m-p/2354083#M120127</link>
      <description>Hi,&lt;BR /&gt;My use case is like this: I want to crawl a website, say talend.com, and extract all the information on the website into Hadoop. After that I want to search for specific strings in the data and use them to populate Hive and create a report.&lt;BR /&gt;I want to use Talend to pull data from a website and store it in Hadoop. I watched this video:&lt;BR /&gt;&lt;A href="http://www.talend.com/resources/webinars/watch/215#validatewebinar" target="_blank" rel="nofollow noopener noreferrer"&gt;http://www.talend.com/resources/webinars/watch/215#validatewebinar&lt;/A&gt;&lt;BR /&gt;Based on this, when I use a tFileFetch or tHttpRequest component and connect to a URI, say "&lt;A href="http://talend.com" target="_blank" rel="nofollow noopener noreferrer"&gt;http://talend.com&lt;/A&gt;", I only get the first page, which I can save in a file. How can I iterate over the entire contents of a directory? I need to know each distinct URL, like talend.com/products, etc. How can I iteratively fetch all the files under a master URL?</description>
      <pubDate>Wed, 03 Apr 2013 11:27:50 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Using-Talend-to-crawl-a-website/m-p/2354083#M120127</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2013-04-03T11:27:50Z</dc:date>
    </item>
    <item>
      <title>Re: Using Talend to crawl a website</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Using-Talend-to-crawl-a-website/m-p/2354084#M120128</link>
      <description>I would use a regular expression to filter the content of the first page for links. After collecting all the links, you can iterate over them, and so on.</description>
      <pubDate>Wed, 03 Apr 2013 21:17:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Using-Talend-to-crawl-a-website/m-p/2354084#M120128</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2013-04-03T21:17:31Z</dc:date>
    </item>
  </channel>
</rss>

