Anonymous
Not applicable

How to clean a file of wrongly encoded characters?

Hello,

 

I have been given a .txt file which contains lines looking like this:

 0683p000009LzKM.jpg 

Here, for example, we have 4 lines with 5 columns each, and in my 4th column some characters are not recognized as UTF-8 characters.

 

What I would like to do is either (1) erase those wrong characters, (2) replace them with a space, or (3) recover them so they can be read correctly.

 

I tried using a regex in a tMap component to erase or replace the wrong characters.

0683p000009LzKR.jpg

 

But it didn't work: the wrong characters stayed exactly the same...

0683p000009LzKW.jpg 

 

I also tried using Notepad++ to convert my file from UTF-8 back to ANSI, but that doesn't work: the characters don't revert to how they should be. So using a routine to change the encoding of my file is not really an option either.

 

I am starting to run out of ideas and options. Does anyone have a good idea to share?

 

PS: I have attached my test file if anyone wants to run some tests.


4 Replies
Jesperrekuh
Specialist

Try WINDOWS-1252 / CP-1252.
If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.
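A quick way to check whether CP-1252 is the right guess is to decode the same raw bytes with both candidate charsets and see which one produces readable accents. A minimal plain-Java sketch (in a Talend job, the equivalent would be choosing a custom Encoding on the file-input component; the byte values below are illustrative):

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // Decode the same raw bytes with a candidate charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "François" encoded as CP-1252: 0xE7 is 'ç' in that code page,
        // but an invalid start of a multi-byte sequence in UTF-8.
        byte[] raw = {0x46, 0x72, 0x61, 0x6E, (byte) 0xE7, 0x6F, 0x69, 0x73};
        System.out.println(decode(raw, "windows-1252")); // François
        System.out.println(decode(raw, "UTF-8"));        // the 0xE7 byte becomes U+FFFD
    }
}
```

If the CP-1252 decode shows the expected accents, the file was simply being read with the wrong charset.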

 

Anonymous
Not applicable
Author

Hello @Dijke

 

thank you for your quick answer, but I tried that and it didn't work.

See : 

0683p000009LzKl.jpg

 

Even if I knew which encoding was the native one, I think converting the file back to that encoding would still be impossible.

 

Is there any other way to capture those characters so I can erase them? I think that might be simpler. I tried a regex allowing only alphanumeric characters ([^a-zA-Z0-9]), but I couldn't capture/change/erase the wrong characters. Did I miss something here?
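For what it's worth, here is a stripped-down sketch of options 1/2 (erase or blank the bad characters) in plain Java, the same String API a tMap expression uses. Two things to note: [^a-zA-Z0-9] also matches legitimate spaces and punctuation, so a class like "anything outside printable ASCII" is usually safer; and since String is immutable, the result of replaceAll must actually be used as the output expression (the column names below are placeholders):

```java
public class StripNonAscii {
    // Replace every character outside printable ASCII (0x20-0x7E) with a space.
    // In a tMap output expression this would look something like:
    //   row1.col4 == null ? null : row1.col4.replaceAll("[^\\x20-\\x7E]", " ")
    // (row1.col4 is a hypothetical name for the 4th column.)
    static String clean(String s) {
        return s == null ? null : s.replaceAll("[^\\x20-\\x7E]", " ");
    }

    public static void main(String[] args) {
        // "François" mis-decoded as "FranÃ§ois": Ã and § are non-ASCII.
        System.out.println(clean("Fran\u00C3\u00A7ois")); // Fran  ois
    }
}
```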

 

 

Jesperrekuh
Specialist

You're looking at the character representation; you need to look at the byte representation. For example (made up): \u0001232 = A in Unicode/UTF-8, but a different encoding may render the same bytes as a character like ╗, or as <?> when no character is mapped at all.

I would still search for the file's original encoding, the one capable of showing all the diacritics your (French) text needs. Because of a conversion problem, the byte mapping went wrong and now shows false characters; with the original encoding/collation you will probably be working with the correct byte ranges.
For example, Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ probably occupy a fixed byte range in your encoding.

You need to map it back, and you need to find out which range of bytes is malformed; you can write a regex over byte ranges.
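To illustrate the "map it back" idea with a minimal sketch: when the damage is a single bad conversion (e.g. text that was really UTF-8 got decoded as Latin-1/CP-1252), re-encoding the garbled string with the wrong charset recovers the raw bytes, and decoding those bytes with the right charset recovers the original characters. This only works if no byte was lost in the round trip, so treat it as a diagnostic sketch, not a guaranteed fix:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Assumption: the visible text is UTF-8 bytes that were wrongly
    // decoded as Latin-1. Undo that: encode back to the raw bytes,
    // then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(repair("Fran\u00C3\u00A7ois")); // François
    }
}
```

If the repaired output still shows replacement characters, the mangling went through more than one conversion and the original bytes may be unrecoverable; in that case stripping the bad byte range is the fallback.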
Anonymous
Not applicable
Author

Ok, now I get it. Even though I would have preferred a quicker fix, I will try it that way to reach a durable solution.

Thank you for your input and all the explanations, @Dijke !