<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Retrieve Data from HTML document and insert it into database in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304041#M75855</link>
    <description>&lt;P&gt;Hi @Shicong Hong, thanks for replying. The problem is that the HTML document is stored locally on my laptop, which is why I had to use regular expressions to remove the HTML tags.&lt;/P&gt;&lt;P&gt;I don't know if thtmlinput is suitable in this case!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PS: I also tried the TtikaExtractor component, but it didn't work either.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
    <pubDate>Mon, 30 May 2022 10:48:02 GMT</pubDate>
    <dc:creator>behi5542</dc:creator>
    <dc:date>2022-05-30T10:48:02Z</dc:date>
    <item>
      <title>Retrieve Data from HTML document and insert it into database</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304039#M75853</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I want to extract data from an HTML document and insert it into a table, so I built the job shown below. In the tJavaFlex component I use this code to remove the HTML tags:&lt;/P&gt;&lt;P&gt;row15.content = row14.content.replaceAll("&amp;lt;style([\\s\\S]+?)&amp;lt;/style&amp;gt;", "").replaceAll("\\&amp;lt;\\?html(.+?)\\?\\&amp;gt;", "").replaceAll("\\&amp;lt;.*?\\&amp;gt;", "").replaceAll("&amp;amp;nbsp;", " ").trim();&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0695b00000RiXYMAA3.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/153446iB42A14B35C2133BC/image-size/large?v=v2&amp;amp;px=999" role="button" title="0695b00000RiXYMAA3.png" alt="0695b00000RiXYMAA3.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;After executing the job, it always shows this error: "Error on line 1 of document&amp;nbsp;: Content is not allowed in prolog"&lt;/P&gt;&lt;P&gt;Can anyone please help me find a solution so that I can extract the data from the HTML document? Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Fri, 15 Nov 2024 22:53:12 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304039#M75853</guid>
      <dc:creator>behi5542</dc:creator>
      <dc:date>2024-11-15T22:53:12Z</dc:date>
    </item>
    <item>
      <title>Re: Retrieve Data from HTML document and insert it into database</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304040#M75854</link>
      <description>&lt;P&gt;@behi behi&amp;nbsp;The error 'Content is not allowed in prolog' usually means there is something wrong in the file. You can read more discussion of this error on this &lt;A href="https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae" alt="https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae" target="_blank"&gt;page&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;BTW, there is no official component for extracting data from an HTML document. Here is a custom &lt;A href="https://exchange.talend.com/#marketplaceproductoverview:marketplace=marketplace%252F1&amp;amp;p=marketplace%252F1%252Fproducts%252F1295&amp;amp;pi=marketplace%252F1%252Fproducts%252F1295%252Fitems%252F1787" alt="https://exchange.talend.com/#marketplaceproductoverview:marketplace=marketplace%252F1&amp;amp;p=marketplace%252F1%252Fproducts%252F1295&amp;amp;pi=marketplace%252F1%252Fproducts%252F1295%252Fitems%252F1787" target="_blank"&gt;component&lt;/A&gt; shared by a community user on Talend Exchange; you can download it and test whether it meets your needs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Shong&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2022 05:28:58 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304040#M75854</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-05-30T05:28:58Z</dc:date>
    </item>
    <item>
      <title>Re: Retrieve Data from HTML document and insert it into database</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304041#M75855</link>
      <description>&lt;P&gt;Hi @Shicong Hong, thanks for replying. The problem is that the HTML document is stored locally on my laptop, which is why I had to use regular expressions to remove the HTML tags.&lt;/P&gt;&lt;P&gt;I don't know if thtmlinput is suitable in this case!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PS: I also tried the TtikaExtractor component, but it didn't work either.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2022 10:48:02 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Retrieve-Data-from-HTML-document-and-insert-it-into-database/m-p/2304041#M75855</guid>
      <dc:creator>behi5542</dc:creator>
      <dc:date>2022-05-30T10:48:02Z</dc:date>
    </item>
  </channel>
</rss>

