<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Extract Data from URL in Talend in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319694#M89912</link>
    <description>I wrote a tutorial on exactly this about 3 years ago. You can find it here: 
&lt;A href="https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page&lt;/A&gt;
&lt;BR /&gt;
&lt;BR /&gt;I suspect that better libraries are now available, but the thing I learnt doing this is that you need to build your job with the website it is going to scrape in mind. Think of a website the way you would think of XML: you can't build a single job to handle every type of XML, and the same applies to websites.
&lt;BR /&gt;
&lt;BR /&gt;I hope the tutorial gives you a few ideas.</description>
    <pubDate>Fri, 16 Mar 2018 16:20:11 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2018-03-16T16:20:11Z</dc:date>
    <item>
      <title>Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319693#M89911</link>
      <description>&lt;DIV class="trans-verified-button-small"&gt; 
 &lt;SPAN class=""&gt;&lt;SPAN&gt;Hello everyone. I'm trying to do web crawling with Talend. I have managed to do it both with the tHTTPTableInput component from Talend Exchange and with Java code, by importing the jsoup library into a tJavaFlex component.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;The results are impressive, and I would like to do the same on other sites, but I'm still new to this field. Could someone give me a short overview and tell me what I'm missing? For simple, static sites such as "&lt;/SPAN&gt;&lt;SPAN&gt;&lt;A href="http://www.imdb.com/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;amp;pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&amp;amp;pf_rd_r=18KCBZY4GGNJFX5MGV7C&amp;amp;pf_rd_s=center-1&amp;amp;pf_rd_t=15506&amp;amp;pf_rd_i=top&amp;amp;ref_=chttp_tt_2" target="_blank" rel="nofollow noopener noreferrer"&gt;http://www.imdb.com/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;amp;pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&amp;amp;pf_rd_r=18KCBZY4GGNJFX5MGV7C&amp;amp;pf_rd_s=center-1&amp;amp;pf_rd_t=15506&amp;amp;pf_rd_i=top&amp;amp;ref_=chttp_tt_2&lt;/A&gt;", which is a movie rating site, I can do it with a few lines of Java.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;But for more complex, dynamic sites such as "&lt;A href="https://www.risultati.it" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.risultati.it&lt;/A&gt;", which is a live soccer results site, I cannot. What am I missing?&lt;/SPAN&gt;&lt;/SPAN&gt; 
&lt;/DIV&gt; 
&lt;DIV class="trans-verified-button-small"&gt; 
 &lt;SPAN class=""&gt;Is jsoup not powerful enough to crawl all kinds of sites? Thanks in advance to anyone who takes the time to open up this field to a newcomer.&lt;/SPAN&gt; 
&lt;/DIV&gt;</description>
      <pubDate>Sat, 16 Nov 2024 08:32:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319693#M89911</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T08:32:31Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319694#M89912</link>
      <description>I wrote a tutorial on exactly this about 3 years ago. You can find it here: 
&lt;A href="https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page&lt;/A&gt;
&lt;BR /&gt;
&lt;BR /&gt;I suspect that better libraries are now available, but the thing I learnt doing this is that you need to build your job with the website it is going to scrape in mind. Think of a website the way you would think of XML: you can't build a single job to handle every type of XML, and the same applies to websites.
&lt;BR /&gt;
&lt;BR /&gt;I hope the tutorial gives you a few ideas.</description>
      <pubDate>Fri, 16 Mar 2018 16:20:11 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319694#M89912</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-03-16T16:20:11Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319695#M89913</link>
      <description>&lt;P&gt;Hi rhall_2_0, thank you so much for your solution. It is a very nice, well-structured tutorial.&lt;/P&gt; 
&lt;P&gt;Unfortunately, it does not solve my problem. I tried your solution, but I cannot extract data from dynamic sites like this one: "&lt;A href="https://www.risultati.it/partita/MPX1oKd9/#informazioni-partita" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.risultati.it/partita/MPX1oKd9/#informazioni-partita&lt;/A&gt;". There are many other sites of this type. I'm interested in a particular football match, but I cannot get its data because I don't understand how the data is exposed on the site; I think it is loaded from a database in some way. The HTML that tHttpRequest returns in Talend does not contain all the data I need. Any other ideas? Thank you again in advance.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 10:48:39 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319695#M89913</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-03-19T10:48:39Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319696#M89914</link>
      <description>&lt;P&gt;I've looked at the source of the page you posted, and I think you might struggle with this. The page appears to have been written to obfuscate the data and prevent scraping, and it uses a lot of JavaScript. I am not sure you will be able to do this. You *could* try saving the pages locally as HTML and then processing them. That *might* make it slightly easier.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Mar 2018 22:02:36 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319696#M89914</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-03-19T22:02:36Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319697#M89915</link>
      <description>&lt;P&gt;Yes, I noticed this; in fact, I have been fighting with it for a while. It's not urgent, because I don't need it for work, but it's something I find very interesting and would like to do. In any case, thank you once again for your time! I hope to hear back if anyone finds a way to do it.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Mar 2018 08:48:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319697#M89915</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-03-20T08:48:31Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319698#M89916</link>
      <description>&lt;P&gt;Hello,&lt;BR /&gt;First of all, thank you for taking the time to help me.&lt;BR /&gt;I want to parse XML / HTML from the site&amp;nbsp;&lt;A href="https://www.cert.ssi.gouv.fr/" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.cert.ssi.gouv.fr/&lt;/A&gt;.&lt;/P&gt; 
&lt;P&gt;&lt;SPAN&gt;When I try to extract data from the HTML page that comes with the component, everything works fine. However, that page is very simple: it has no divs or blockquotes and is structured using only tables. When I try a page that uses more HTML tags, such as blockquotes, tHTTPTableInput does not seem to recognize the tables, and it throws:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&lt;BR /&gt;I expect to get a table like this:&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline"&gt;&lt;SPAN class="lia-message-image-wrapper lia-message-image-actions-narrow lia-message-image-actions-below"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009M791.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/142489i4E579AA69B83FC53/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009M791.png" alt="0683p000009M791.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;That is, I want to parse the HTML and extract every CERTEF with its title and publication date, along with all the VECs that exist in each CERTEF.&lt;BR /&gt;I do not know which component to use, or how to configure it, to extract exactly this table.&lt;/P&gt; 
&lt;P&gt;Thank you for your help.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2019 15:12:06 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319698#M89916</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-08-23T15:12:06Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319699#M89917</link>
      <description>&lt;P&gt;This is not going to be easy, and there is no component I know of that will just do it for you. I think you will need to use a bit of code. I have written a post which describes how I achieved something very similar. It is very complicated, but I am afraid there is no "easy" way of achieving this.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&lt;A href="https://community.qlik.com/s/feed/0D73p000004k5l5CAA#M95040" target="_blank"&gt;https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-component/m-p/155415/highlight/true#M95040&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2019 17:33:51 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319699#M89917</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-08-23T17:33:51Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319700#M89918</link>
      <description>&lt;P&gt;First, thank you for your answer.&lt;BR /&gt;Anyway, I created the following job, but I am having trouble writing the code.&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="parsehttp.PNG" style="width: 579px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009M7D4.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/153911iE35ED88AEE50D72B/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009M7D4.png" alt="0683p000009M7D4.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;I searched the questions in the community and found this, but&amp;nbsp;it doesn't work:&amp;nbsp;&lt;A href="https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page" target="_blank" rel="nofollow noopener noreferrer"&gt;https://www.rilhia.com/tutorials/using-third-party-java-library-scrape-content-table-web-page&lt;/A&gt;&lt;/P&gt; 
&lt;P&gt;I also don't know how to apply this approach to my project, because the site has several divs, PDFs, and links, and the data is not contained in specific tables.&lt;/P&gt; 
&lt;P&gt;Thanks&lt;/P&gt; 
&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2019 11:21:28 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319700#M89918</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-08-27T11:21:28Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from URL in Talend</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319701#M89919</link>
      <description>&lt;P&gt;&lt;A href="https://community.qlik.com/s/profile/0053p000007LPa0AAG"&gt;@mitra1367&lt;/A&gt;&amp;nbsp;The link to my original work no longer works, so I recreated that work here:&amp;nbsp;&lt;A href="https://community.qlik.com/s/feed/0D73p000004k5l5CAA#M95040" target="_blank" rel="noopener"&gt;https://community.talend.com/t5/Design-and-Development/Extract-Multiple-table-using-tHTTPTableInput-component/m-p/155415/highlight/true#M95040&lt;/A&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;As I think I said before, it is tricky, and you need to understand the third-party Java API that I mention. Take a look at its documentation.&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Unfortunately, scraping websites is notoriously hard because there is no standard way of displaying data, so your solution will usually be entirely bespoke to your problem.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2019 11:43:29 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Extract-Data-from-URL-in-Talend/m-p/2319701#M89919</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-08-27T11:43:29Z</dc:date>
    </item>
  </channel>
</rss>

