<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Detect and Reject Non UTF-8 files in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Detect-and-Reject-Non-UTF-8-files/m-p/2356162#M121752</link>
    <description>I need to detect and reject all incoming XML files that are not UTF-8 encoded.
&lt;BR /&gt;If my XML input file is of the form:
&lt;BR /&gt;&amp;lt;?xml version="1.0" encoding="EBCDIC"?&amp;gt;
&lt;BR /&gt;&amp;lt;book&amp;gt;
&lt;BR /&gt;&amp;lt;price&amp;gt;50£&amp;lt;/price&amp;gt;
&lt;BR /&gt;&amp;lt;/book&amp;gt;
&lt;BR /&gt;and the advanced settings of tFileInputXML and tFileOutputXML have UTF-8 selected, the job runs successfully, whereas I want the file to be rejected.
&lt;BR /&gt;Output file:
&lt;BR /&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&lt;BR /&gt;&amp;lt;root&amp;gt;
&lt;BR /&gt;&amp;lt;row&amp;gt;
&lt;BR /&gt;&amp;lt;price&amp;gt;50&amp;lt;/price&amp;gt;
&lt;BR /&gt;&amp;lt;/row&amp;gt;
&lt;BR /&gt;&amp;lt;/root&amp;gt;
&lt;BR /&gt;The file also needs to be rejected in the scenario below, where the XML declaration specifies UTF-8 but the data contains non-UTF-8 characters (the pound symbol in the example below):
&lt;BR /&gt;Input file:
&lt;BR /&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&lt;BR /&gt;&amp;lt;book&amp;gt;
&lt;BR /&gt;&amp;lt;price&amp;gt;50£&amp;lt;/price&amp;gt;
&lt;BR /&gt;&amp;lt;/book&amp;gt;</description>
    <pubDate>Tue, 04 Jun 2013 11:02:36 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2013-06-04T11:02:36Z</dc:date>
    <item>
      <title>Detect and Reject Non UTF-8 files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Detect-and-Reject-Non-UTF-8-files/m-p/2356162#M121752</link>
      <description>I need to detect and reject all incoming XML files that are not UTF-8 encoded.
&lt;BR /&gt;If my XML input file is of the form:
&lt;BR /&gt;&amp;lt;?xml version="1.0" encoding="EBCDIC"?&amp;gt;
&lt;BR /&gt;&amp;lt;book&amp;gt;
&lt;BR /&gt;&amp;lt;price&amp;gt;50£&amp;lt;/price&amp;gt;
&lt;BR /&gt;&amp;lt;/book&amp;gt;
&lt;BR /&gt;and the advanced settings of tFileInputXML and tFileOutputXML have UTF-8 selected, the job runs successfully, whereas I want the file to be rejected.
&lt;BR /&gt;Output file:
&lt;BR /&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&lt;BR /&gt;&amp;lt;root&amp;gt;
&lt;BR /&gt;&amp;lt;row&amp;gt;
&lt;BR /&gt;&amp;lt;price&amp;gt;50&amp;lt;/price&amp;gt;
&lt;BR /&gt;&amp;lt;/row&amp;gt;
&lt;BR /&gt;&amp;lt;/root&amp;gt;
&lt;BR /&gt;The file also needs to be rejected in the scenario below, where the XML declaration specifies UTF-8 but the data contains non-UTF-8 characters (the pound symbol in the example below):
&lt;BR /&gt;Input file:
&lt;BR /&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&lt;BR /&gt;&amp;lt;book&amp;gt;
&lt;BR /&gt;&amp;lt;price&amp;gt;50£&amp;lt;/price&amp;gt;
&lt;BR /&gt;&amp;lt;/book&amp;gt;</description>
      <pubDate>Tue, 04 Jun 2013 11:02:36 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Detect-and-Reject-Non-UTF-8-files/m-p/2356162#M121752</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2013-06-04T11:02:36Z</dc:date>
    </item>
    <item>
      <title>Re: Detect and Reject Non UTF-8 files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Detect-and-Reject-Non-UTF-8-files/m-p/2356163#M121753</link>
      <description>Hi
&lt;BR /&gt;There is no component or built-in function that can detect the file encoding. You can refer to the discussions on this 
&lt;A href="http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream" target="_blank" rel="nofollow noopener noreferrer"&gt;page&lt;/A&gt; and write a routine in Talend to detect the file encoding.
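&lt;BR /&gt;As a minimal sketch of such a routine, assuming the whole file fits in memory: a strict decoder from the standard java.nio.charset package can report whether a byte array is well-formed UTF-8. The class name Utf8Check and the method name isValidUtf8 are hypothetical, chosen only for illustration; they are not part of Talend.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (not a Talend built-in).
public class Utf8Check {

    // Returns true only if every byte sequence in data is well-formed UTF-8.
    public static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently
        // substituting a replacement character for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

&lt;BR /&gt;One possible way to use it is to read the file with java.nio.file.Files.readAllBytes and reject the file when isValidUtf8 returns false. This only validates the bytes; a file whose XML declaration names another encoding (such as EBCDIC) would need a separate check of the declaration itself.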
&lt;BR /&gt;Shong</description>
      <pubDate>Wed, 05 Jun 2013 04:22:47 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Detect-and-Reject-Non-UTF-8-files/m-p/2356163#M121753</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2013-06-05T04:22:47Z</dc:date>
    </item>
  </channel>
</rss>

