<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX) in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248089#M33052</link>
    <description>Another approach: 
&lt;BR /&gt;Java - extract an HTML tag from a String using Pattern and Matcher 
&lt;BR /&gt; 
&lt;A href="http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group" rel="nofollow noopener noreferrer"&gt;http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group&lt;/A&gt; 
&lt;BR /&gt;"Use the Java Pattern and Matcher classes, and supply a regular expression (regex) to the Pattern class that defines the tag you want to extract. Then use the find method of the Matcher class to see if there is a match, and if so, use the group method to extract the actual group of characters from the String that matches your regular expression." 
&lt;BR /&gt;"In the following source code I demonstrate how to extract the contents from a code tag from a longer HTML string:" 
&lt;BR /&gt;* * * 
&lt;BR /&gt;"It's important to note that this example is hard-coded to look for only one occurrence of this group. In a more robust example, where you want to find and extract the contents of every code tag, your code would look more like this, using a while loop with the find method:" 
&lt;BR /&gt; 
&lt;A href="http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group" rel="nofollow noopener noreferrer"&gt;http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group&lt;/A&gt; 
&lt;BR /&gt;This approach seems simpler than a full-blown SAX or DOM parser. 
&lt;BR /&gt;Jim</description>
    <pubDate>Thu, 16 Sep 2010 19:08:23 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2010-09-16T19:08:23Z</dc:date>
    <item>
      <title>Component to extract hyperlinks from a web page (HTML, PHP or ASPX)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248088#M33051</link>
      <description>I am a Talend Open Source newbie (1 week) and I need a component to extract a list of hyperlinks from an HTML page I download with tFileFetch. 
&lt;BR /&gt;The specific hyperlinks I need to extract point to downloadable data files. If I get a complete list of hyperlinks (one per row in a file), 
&lt;BR /&gt;then in a second step I can filter the list for the ones I am interested in, and in a third step I can iterate over the list and use string functions (from Talend Code\Routines) to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis. 
&lt;BR /&gt;I have successfully downloaded the HTML page by feeding the original HTML link to tFileFetch. 
&lt;BR /&gt;By HTML hyperlinks I mean everything between "&amp;lt;A" and "&amp;lt;/A&amp;gt;". 
&lt;BR /&gt;In general, extracting hyperlinks can be done with Regular Expressions or with an XQuery over parsed XML, but Talend's components 
&lt;BR /&gt;assume something close to a regular row and column structure (a schema) and blow up with malformed or loosely structured HTML. 
&lt;BR /&gt;Slightly off topic -- one exception (for my application) might be Exchange component tHTTPTableInput (how to install in TOS?). 
&lt;BR /&gt;I researched the topic and found convoluted Regular Expressions (RegEx): 
&lt;BR /&gt;&amp;lt;a.*href=('|")?(http\://.*?(?=\1)).*&amp;gt;\s*(+|.*?)?\s*&amp;lt;/a&amp;gt; 
&lt;BR /&gt; 
&lt;A href="http://vidmar.net/weblog/archive/2009/09/10/matching-links-with-regular-expression-in-html.aspx" target="_blank" rel="nofollow noopener noreferrer"&gt;http://vidmar.net/weblog/archive/2009/09/10/matching-links-with-regular-expression-in-html.aspx&lt;/A&gt; 
&lt;BR /&gt;and this interesting February 2008 blog post "Showdown - Java HTML Parsing Comparison" 
&lt;BR /&gt;on extracting hyperlinks using an XQuery from Java. 
&lt;BR /&gt; 
&lt;A href="http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/" target="_blank" rel="nofollow noopener noreferrer"&gt;http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/&lt;/A&gt; 
&lt;BR /&gt;"So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. " 
&lt;BR /&gt;* * * 
&lt;BR /&gt;"I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. " 
&lt;BR /&gt;* * * 
&lt;BR /&gt;"Finally, to judge the ability to parse the HTML, I ran the XQuery ?//a? to grab all the &amp;lt;a&amp;gt; tags from the document ." 
&lt;BR /&gt;NOTE: Compare the XML/XQUERY ""//a" to the Regular Expression "&amp;lt;a.*href=('|")?(http\://.*?(?=\1)).*&amp;gt;\s*(+|.*?)?\s*&amp;lt;/a&amp;gt;". 
&lt;BR /&gt;"The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. " 
&lt;BR /&gt;* * * 
&lt;BR /&gt;"One drawback to HtmlCleaner is that it?s not available in a Maven repository. Sometimes NekoHTML may be easier to use for this reason." 
&lt;BR /&gt; 
&lt;A href="http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/" target="_blank" rel="nofollow noopener noreferrer"&gt;http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/&lt;/A&gt; 
&lt;BR /&gt;The blog post does not give the complete Java code: 
&lt;BR /&gt;"I implemented each library in its own class extending from an AbstractScraper implementing a Scraper interface I created. " 
&lt;BR /&gt;* * * 
&lt;BR /&gt;"The implementation specific code for each library is below" 
&lt;BR /&gt;So, if we can get the complete Java code from the blog post author, can this be implemented in a custom code tJava component? 
&lt;BR /&gt;As I mentioned at the beginning, I have downloaded a page using tFileFetch 
&lt;BR /&gt;and if I can get a complete list of hyperlinks (one per row in a file) 
&lt;BR /&gt;in a second step I can filter the list (using which Talend component?) for the URLs I am interested in 
&lt;BR /&gt;and then in a third step I can iterate over the list and use string functions (from Talend Code\Routines) 
&lt;BR /&gt;to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis. 
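&lt;BR /&gt;The second and third steps can be sketched in plain Java. This is only an illustration under stated assumptions: the class name LinkFilter, the ".csv" suffix, and the example.com page URL are hypothetical, and the sketch does not show how the logic would be wired into a tJava component or a routine. It keeps the links that end with a given suffix and resolves each one against the URL of the downloaded page using java.net.URI. 

```java
import java.net.URI;
import java.util.Arrays;

public class LinkFilter {
    /** Keeps the links that end with the suffix and resolves each against the page URL. */
    public static String[] select(String pageUrl, String suffix, String[] links) {
        URI base = URI.create(pageUrl);
        return Arrays.stream(links)
                .filter(link -> link.endsWith(suffix))        // step 2: filter the list
                .map(link -> base.resolve(link).toString())   // step 3: build the full URL
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Hypothetical input: href values already extracted from the page, one per entry.
        String[] links = { "about.html", "files/data1.csv", "http://example.com/other/data2.csv" };
        for (String url : select("http://example.com/reports/index.html", ".csv", links)) {
            System.out.println(url);
        }
    }
}
```

&lt;BR /&gt;Relative links are resolved against the page URL and absolute links pass through unchanged, so each resulting URL could be handed to the second tFileFetch. 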
&lt;BR /&gt;But first, I have to get over this hump (extracting the links) -- can you help? 
&lt;BR /&gt;Thanks 
&lt;BR /&gt;Jim</description>
      <pubDate>Sat, 16 Nov 2024 13:17:18 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248088#M33051</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T13:17:18Z</dc:date>
    </item>
    <item>
      <title>Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248089#M33052</link>
      <description>Another approach: 
&lt;BR /&gt;Java - extract an HTML tag from a String using Pattern and Matcher 
&lt;BR /&gt; 
&lt;A href="http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group" rel="nofollow noopener noreferrer"&gt;http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group&lt;/A&gt; 
&lt;BR /&gt;"Use the Java Pattern and Matcher classes, and supply a regular expression (regex) to the Pattern class that defines the tag you want to extract. Then use the find method of the Matcher class to see if there is a match, and if so, use the group method to extract the actual group of characters from the String that matches your regular expression." 
&lt;BR /&gt;"In the following source code I demonstrate how to extract the contents from a code tag from a longer HTML string:" 
&lt;BR /&gt;* * * 
&lt;BR /&gt;"It's important to note that this example is hard-coded to look for only one occurrence of this group. In a more robust example, where you want to find and extract the contents of every code tag, your code would look more like this, using a while loop with the find method:" 
&lt;BR /&gt; 
&lt;A href="http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group" rel="nofollow noopener noreferrer"&gt;http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group&lt;/A&gt; 
&lt;BR /&gt;This approach seems simpler than a full-blown SAX or DOM parser. 
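&lt;BR /&gt;A minimal sketch of the while-loop variant, adapted here to collect href values rather than the contents of a code tag. The pattern, the class name HrefExtractor, and the sample text are assumptions made for illustration and are not taken from the linked article; a regular expression like this one remains fragile on loosely structured HTML. 

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Hypothetical pattern: captures the value of each href="..." or href='...' attribute.
    // Group 1 is the opening quote character, group 2 is the URL, \1 requires the same closing quote.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns every href value found in the text, one per line. */
    public static String extractLinks(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = HREF.matcher(html);
        while (m.find()) {                          // each call to find() moves to the next match
            out.append(m.group(2)).append('\n');    // group 2 is the text between the quotes
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample containing two href attributes with different quoting and case.
        String sample = "a HREF=\"http://example.com/data1.csv\" ... a href='files/data2.csv'";
        System.out.print(extractLinks(sample));
    }
}
```

&lt;BR /&gt;For the sample text, this prints http://example.com/data1.csv and files/data2.csv on separate lines, which is the one-link-per-row shape asked for in the original post. 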
&lt;BR /&gt;Jim</description>
      <pubDate>Thu, 16 Sep 2010 19:08:23 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248089#M33052</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2010-09-16T19:08:23Z</dc:date>
    </item>
    <item>
      <title>Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248090#M33053</link>
      <description>I have a proof of concept program working, but it requires pre-processing of the HTML file. 
&lt;BR /&gt;The pre-processing consists of inserting a blank space and an end-of-line string 
&lt;BR /&gt;after every &amp;lt;/A&amp;gt; string in the file. 
&lt;BR /&gt;For proof of concept I did the pre-processing in MS Word. 
&lt;BR /&gt;I hope to be able to do the pre-processing using GNU SED (stream editor). 
&lt;BR /&gt;While researching SED, I ran across this thread that was relevant to the original topic. 
&lt;BR /&gt;New To Java - java 'sed' like functionality? 
&lt;BR /&gt; 
&lt;A href="http://forums.sun.com/thread.jspa?threadID=743023" rel="nofollow noopener noreferrer"&gt;http://forums.sun.com/thread.jspa?threadID=743023&lt;/A&gt; 
&lt;BR /&gt;Code examples include reading the file name from the command line and 
&lt;BR /&gt;reading the entire file into a string (warning: the regex has to be controlled so that it 
&lt;BR /&gt;doesn't run past the first end tag and match end tags from later tag pairs; that's why I read the input a line 
&lt;BR /&gt;at a time and pre-process the file to make sure each tag pair is on a separate line). 
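&lt;BR /&gt;The pre-processing step itself (a blank space and a line break after every closing anchor tag) could also be done in Java instead of MS Word or SED. The following is a sketch under assumptions: the class name AnchorLineSplitter is hypothetical, the file is assumed to be UTF-8, and the angle brackets in the pattern are written as regex hex escapes. 

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class AnchorLineSplitter {
    // \x3C and \x3E are regex hex escapes for the opening and closing angle brackets,
    // so this pattern matches the closing anchor tag in either letter case.
    private static final Pattern CLOSE_A =
            Pattern.compile("\\x3C/a\\x3E", Pattern.CASE_INSENSITIVE);

    /** Appends a blank space and a line break after every closing anchor tag. */
    public static String splitAfterAnchors(String html) {
        // $0 re-inserts the matched tag itself, followed by the space and the line break.
        return CLOSE_A.matcher(html).replaceAll("$0 \n");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AnchorLineSplitter FILE");
            return;
        }
        // args[0] is a hypothetical path to the page saved by tFileFetch (UTF-8 assumed).
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.print(splitAfterAnchors(html));
    }
}
```

&lt;BR /&gt;After this step each tag pair ends its own line, so a line-at-a-time regex cannot match across end tags from later pairs. 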
&lt;BR /&gt;If Java uses zero based arrays, why is the matched string found at element one? 
&lt;BR /&gt;And do the single letter variables mean they are using Generics? 
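&lt;BR /&gt;On the numbering question, assuming the thread's code reads its result with Matcher.group(1): that index is a capture-group number, not an array position. In java.util.regex, group 0 is the entire match and group 1 is the text captured by the first pair of parentheses in the pattern. Likewise, if the single letters are names such as p and m, they are ordinary local variable names; Generics are type parameters written in angle brackets. A small sketch with a hypothetical pattern and input: 

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {
    /** Returns group 0 and group 1 of the first match, separated by " | ". */
    public static String describe(String text) {
        // p and m are plain local variable names, nothing to do with Generics.
        Pattern p = Pattern.compile("href=\"(.*?)\"");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        // group(0) is everything the pattern matched;
        // group(1) is only the part inside the first pair of parentheses.
        return m.group(0) + " | " + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(describe("a href=\"data1.csv\""));
        // prints: href="data1.csv" | data1.csv
    }
}
```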
&lt;BR /&gt;Jim</description>
      <pubDate>Fri, 17 Sep 2010 17:51:28 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Component-to-extract-hyperlinks-from-a-web-page-HTML-PHP-or-ASPX/m-p/2248090#M33053</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2010-09-17T17:51:28Z</dc:date>
    </item>
  </channel>
</rss>

