Anonymous
Not applicable

How to clean a file of wrongly encoded characters?

hello, 

 

I have been given a .txt file which contains lines looking like this:

[screenshot: sample lines from the file]

In this example there are 4 lines with 5 columns each, and in the 4th column some characters are not recognized as valid UTF-8.

 

What I would like to do is either 1/ erase those wrong characters, 2/ replace them with a space, or 3/ recover them so they can be read correctly.

 

I tried to use a regex in a tMap component to erase or replace the wrong characters.

[screenshot: tMap regex expression]

 

But it didn't work out! The wrong characters are still there...

[screenshot attachment]

 

I also tried using Notepad++ to convert the file from UTF-8 back to ANSI, but that doesn't work: the characters don't revert to what they should be. So using a routine to change the encoding of the file is not really an option either.

 

I am starting to run out of ideas and options. Does anyone have a good idea to share?

 

PS: I've attached my test file if anyone wants to run some tests.


4 Replies
Jesperrekuh
Specialist

Try WINDOWS-1252 / CP-1252.
If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.
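Not part of the original reply, but a minimal sketch of what decoding with CP-1252 looks like in plain Java (the byte values here are illustrative, not taken from the attached file):

```java
import java.nio.charset.Charset;

public class Cp1252Demo {
    public static void main(String[] args) {
        // 0xE9 is 'é' in Windows-1252; decoding the raw bytes with the
        // right charset recovers the accented character.
        byte[] raw = { (byte) 0xE9, 't', (byte) 0xE9 };
        String s = new String(raw, Charset.forName("windows-1252"));
        System.out.println(s);  // prints "été"
    }
}
```

The same idea applies when reading the whole file: pass the charset explicitly instead of letting the platform default decide.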

 

Anonymous
Not applicable
Author

Hello @Dijke

 

Thank you for your quick answer, but I tried that and it didn't work out.

See : 

[screenshot attachment]

 

Even if I knew the native encoding, I think converting the file back to that encoding would still be impossible.

 

Is there any other way to capture those characters so I can erase them? I think that might be simpler. I tried a regex that only allows alphanumeric characters ([^a-zA-Z0-9]), but I couldn't capture/change/erase the wrong characters. Did I miss something here?
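For the erase/replace option, a sketch of what such a regex could look like as a plain Java expression (the kind usable in a tMap output column). This is an illustration, not the expression from the screenshots; note that it keeps spaces and punctuation, unlike [^a-zA-Z0-9]:

```java
public class StripNonAscii {
    // Replace every character outside the printable ASCII range (0x20-0x7E)
    // with a space; the same replaceAll expression can go in a tMap column.
    static String clean(String s) {
        return s.replaceAll("[^\\x20-\\x7E]", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("caf\u00E9 men\u00FC"));
    }
}
```

This destroys the accented characters rather than recovering them, so it only suits options 1 and 2 from the original question.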

 

 

Jesperrekuh
Specialist

You're looking at the character representation; you need to look at the byte representation.
For example, \u0041 = 'A' in Unicode/UTF-8, but under a different encoding the same bytes may come out as a character that looks like ╗, or, when no character is mapped at all, as <?>.

I would still search for the file's original encoding, one capable of showing all the diacritics your (French) text needs. Because of a conversion problem the byte mapping went wrong and shows false characters; with the original encoding/collation you would probably be working with the correct byte ranges.
For example, Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ probably occupy a fixed byte range in your encoding.

You need to map it back, and to find out which range of bytes is malformed; you can write a regex over byte ranges.
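One way to "map it back", assuming the damage came from UTF-8 bytes being decoded as ISO-8859-1/CP-1252 (a common cause of this kind of mojibake, though the thread doesn't confirm it for this file): re-encode the garbled string with the wrong charset to recover the original bytes, then decode those bytes as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // If UTF-8 bytes were mistakenly decoded as ISO-8859-1, re-encoding the
    // garbled string as ISO-8859-1 restores the original bytes, which can
    // then be decoded correctly as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00C3\u00A9" ("Ã©") is the classic two-character mojibake for 'é'
        System.out.println(repair("\u00C3\u00A9t\u00C3\u00A9"));  // prints "été"
    }
}
```

If the assumption about the original encoding is wrong, this will produce different garbage rather than the right characters, which is why identifying the original encoding first matters.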
Anonymous
Not applicable
Author

Ok, now I get it. Even though I would have preferred a quicker solution, I will try it that way to reach a durable one.

Thank you for your input and all the explanations, @Dijke !