<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: how to clean a file from wrong encoded characters ? in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363481#M127356</link>
    <description>You're looking at the character representation; you need to look at the byte representation.&lt;BR /&gt;Example (made up): \u0001232 = A using Unicode / UTF-8, but a different encoding may render the same bytes as a character that looks like ╗ or, when no character is mapped to them, as &amp;lt;?&amp;gt;.&lt;BR /&gt;&lt;BR /&gt;I would still look for the original encoding, one capable of representing all the diacritics needed in your (French) language. Because of conversion problems the byte mapping was wrong and shows the wrong character. With the original encoding / collation you will probably be working with the correct byte ranges.&lt;BR /&gt;For example, Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ probably occupy a fixed byte range in your encoding.&lt;BR /&gt;&lt;BR /&gt;You need to map the bytes back, and to do that you need to find out which range of bytes is malformed. You can use a regex with byte ranges.&lt;BR /&gt;</description>
    <pubDate>Fri, 20 Jul 2018 12:22:27 GMT</pubDate>
    <dc:creator>Jesperrekuh</dc:creator>
    <dc:date>2018-07-20T12:22:27Z</dc:date>
    <item>
      <title>how to clean a file from wrong encoded characters ?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363478#M127353</link>
      <description>&lt;P&gt;hello,&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;I have been given a .txt file which contains lines that look like this:&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="testDjoWrongEncodedCharacters.JPG" style="width: 468px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009LzKM.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/154465i4F7FF204392D0CDD/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009LzKM.jpg" alt="0683p000009LzKM.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Here, for example, we have 4 lines with 5 columns each. In the 4th column, some characters are not recognized as UTF-8 characters.&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;What I would like to do is either 1/ erase those wrong characters, 2/ replace them with a space, or 3/ recover them so that they can be read correctly.&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;I tried to use a regex in a tMap component in order to erase or replace the wrong characters.&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="tMapWrongEncodedCharacters.JPG" style="width: 999px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009LzKR.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/140763i6CD6E2EE5644FCFC/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009LzKR.jpg" alt="0683p000009LzKR.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;But it didn't work! The wrong characters stayed the same...&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="jobTestWrongEncodedCharacters.JPG" style="width: 999px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009LzKW.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/157823i5568CAA3D7FA1AAD/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009LzKW.jpg" alt="0683p000009LzKW.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;I also tried using Notepad++ to convert my file from UTF-8 back to ANSI, but that did not work: the characters don't revert to what they should be. So using a routine to change the encoding of my file is not really an option either.&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;I am starting to run out of ideas and options. Does anyone have a good idea to share?&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;PS: I have attached my test file in case anyone wants to run some tests.&lt;/P&gt;</description>
      <pubDate>Sat, 16 Nov 2024 07:55:34 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363478#M127353</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T07:55:34Z</dc:date>
    </item>
    <item>
      <title>Re: how to clean a file from wrong encoded characters ?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363479#M127354</link>
      <description>&lt;P&gt;Try Windows-1252 (CP-1252).&lt;BR /&gt;If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.&lt;/P&gt;
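&lt;P&gt;A minimal sketch of how one might test candidate encodings against the raw bytes, written in Python for brevity. The candidate list, the function name try_decodings and the sample bytes are assumptions for illustration; they are not taken from the attached file.&lt;/P&gt;
&lt;PRE&gt;
```python
# Candidate encodings are an assumption; extend the list as needed.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-15"]


def try_decodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text, or None if the bytes are invalid in it}."""
    results = {}
    for name in candidates:
        try:
            # Strict mode (the default) raises on bytes the encoding cannot map.
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample: "cafe" with an acute e, saved as Windows-1252,
# so the accented letter is the single byte 0xE9.
sample = b"caf\xe9"
decoded = try_decodings(sample)

for name, text in decoded.items():
    print(f"{name}: {text!r}")
```
&lt;/PRE&gt;
&lt;P&gt;A successful decode only shows that the bytes are valid in that encoding, not that it is the right one, so the result still has to be checked by eye.&lt;/P&gt;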
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jul 2018 10:47:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363479#M127354</guid>
      <dc:creator>Jesperrekuh</dc:creator>
      <dc:date>2018-07-20T10:47:31Z</dc:date>
    </item>
    <item>
      <title>Re: how to clean a file from wrong encoded characters ?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363480#M127355</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;A href="https://community.qlik.com/s/profile/0053p000007LMrOAAW"&gt;@Dijke&lt;/A&gt;,&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Thank you for your quick answer, but I tried that and it didn't work.&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;See :&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="jobTestWrongEncodedCharacters2.JPG" style="width: 999px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009LzKl.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/144062i80609539E58648D4/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009LzKl.jpg" alt="0683p000009LzKl.jpg" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Even if I knew what the native encoding was, I think reverting the file to that encoding would still be impossible.&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Is there any other way to capture those characters in order to erase them? I think that might be simpler. I tried a regex allowing only alphanumeric characters ([^a-zA-Z0-9]) but&amp;nbsp;I couldn't capture/change/erase the wrong characters. Did I miss something here?&lt;/P&gt; 
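&lt;P&gt;For reference, a rough sketch of the erase/replace option, written in Python for brevity. It assumes the wrong characters have already been turned into the Unicode replacement character U+FFFD (or into control characters) when the file was read, which may not be the case here. The name replace_bad_chars and the sample line are made up. Note that a class like [^a-zA-Z0-9] would also match spaces, punctuation and valid accented letters, so a narrower class is used.&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# U+FFFD is the marker decoders emit for undecodable bytes. The other ranges
# are C0/C1 control characters, excluding tab, line feed and carriage return.
BAD_CHARS = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def replace_bad_chars(text, replacement=" "):
    """Replace undecodable-character markers and control characters."""
    return BAD_CHARS.sub(replacement, text)


# Hypothetical line with 5 columns; two accented letters were lost on read.
line = "1;Dupont;caf\ufffd;Montr\ufffdal;42"
cleaned = replace_bad_chars(line)
print(cleaned)
```
&lt;/PRE&gt;
&lt;P&gt;Passing an empty string as the replacement erases the characters instead of replacing them with a space. Either way the original letters are lost.&lt;/P&gt;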
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jul 2018 11:13:24 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363480#M127355</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-07-20T11:13:24Z</dc:date>
    </item>
    <item>
      <title>Re: how to clean a file from wrong encoded characters ?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363481#M127356</link>
      <description>You're looking at the character representation; you need to look at the byte representation.&lt;BR /&gt;Example (made up): \u0001232 = A using Unicode / UTF-8, but a different encoding may render the same bytes as a character that looks like ╗ or, when no character is mapped to them, as &amp;lt;?&amp;gt;.&lt;BR /&gt;&lt;BR /&gt;I would still look for the original encoding, one capable of representing all the diacritics needed in your (French) language. Because of conversion problems the byte mapping was wrong and shows the wrong character. With the original encoding / collation you will probably be working with the correct byte ranges.&lt;BR /&gt;For example, Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ probably occupy a fixed byte range in your encoding.&lt;BR /&gt;&lt;BR /&gt;You need to map the bytes back, and to do that you need to find out which range of bytes is malformed. You can use a regex with byte ranges.&lt;BR /&gt;</description>
      <pubDate>Fri, 20 Jul 2018 12:22:27 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363481#M127356</guid>
      <dc:creator>Jesperrekuh</dc:creator>
      <dc:date>2018-07-20T12:22:27Z</dc:date>
    </item>
    <item>
      <title>Re: how to clean a file from wrong encoded characters ?</title>
      <link>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363482#M127357</link>
      <description>&lt;P&gt;OK, now&amp;nbsp;I get it. Even though I would have preferred a quicker solution, I will try it that way to reach a durable solution.&amp;nbsp;&lt;/P&gt;
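&lt;P&gt;For anyone reading later, a rough sketch of what mapping the malformed bytes back could look like, in Python for brevity. It assumes the file is mostly valid UTF-8 with stray Windows-1252 bytes mixed in, which has not been confirmed for my file. Instead of a byte-range regex it uses Python's codec error-handler hook to locate the invalid bytes. The handler name cp1252_fallback, the function decode_mixed and the sample bytes are made up.&lt;/P&gt;
&lt;PRE&gt;
```python
import codecs


def _cp1252_fallback(err):
    """Decode the bytes that UTF-8 rejected as Windows-1252 (assumed source)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes undefined in Windows-1252 become U+FFFD instead of raising.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_mixed(raw):
    """Decode as UTF-8, mapping invalid byte runs back through Windows-1252."""
    return raw.decode("utf-8", errors="cp1252_fallback")


# Hypothetical line: valid UTF-8 (C3 A9, e acute) next to a stray
# Windows-1252 byte (E8, e grave).
raw = b"caf\xc3\xa9 cr\xe8me"
print(raw.hex(" "))       # byte view: 63 61 66 c3 a9 20 63 72 e8 6d 65
print(decode_mixed(raw))  # character view after mapping back
```
&lt;/PRE&gt;
&lt;P&gt;Printing the hex dump next to the decoded text makes it easier to see which byte values are the malformed ones.&lt;/P&gt;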
&lt;P&gt;Thank you for your input and all the explanations,&amp;nbsp;&lt;A href="https://community.qlik.com/s/profile/0053p000007LMrOAAW"&gt;@Dijke&lt;/A&gt;&amp;nbsp;!&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jul 2018 12:49:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/how-to-clean-a-file-from-wrong-encoded-characters/m-p/2363482#M127357</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-07-20T12:49:01Z</dc:date>
    </item>
  </channel>
</rss>

