topic Re: how to clean a file from wrong encoded characters ? in Talend Studio

how to clean a file from wrong encoded characters ?

Anonymous — Sat, 16 Nov 2024 07:55:34 GMT

Hello,

I have been given a .txt file which contains lines looking like this:

Here, for example, we have 4 lines with 5 columns each. In the 4th column, some characters are not recognized as UTF-8 characters.

What I would like to do is either 1/ erase those wrong characters, 2/ replace them with a space, or 3/ recover them so that they can be read correctly.

I tried using a regex in a tMap component to erase or replace the wrong characters.

But it didn't work! The wrong characters stay the same...

I also tried using Notepad++ to convert my file from UTF-8 back to ANSI, but that does not work: the characters don't revert to what they should be. So using a routine to change the encoding of my file is not really an option either.

I am starting to run out of ideas and options. Does anyone have a good idea to share?

PS: I have attached my test file in case anyone wants to run some tests.

Re: how to clean a file from wrong encoded characters ?

Jesperrekuh — Fri, 20 Jul 2018 10:47:31 GMT

Try WINDOWS-1252 / CP-1252.
If the data comes directly from a database, ask its owner/sender which collation is used in the table settings.
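
A rough way to test a candidate encoding, sketched in Java since Talend routines are Java: decode the raw bytes strictly and see whether the decoder complains. The class name, method name and sample bytes are made up for illustration, they are not taken from your file. Note that a clean decode only shows the bytes are valid in that charset, not that it is the charset the sender actually used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: name and sample data are assumptions for illustration.
class EncodingProbe {

    /** Returns the decoded text, or null if the bytes are not valid in the given charset. */
    static String strictDecode(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    // REPORT makes the decoder throw instead of silently substituting U+FFFD
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "cafe" with an acute e, as a Windows-1252 producer would write it:
        // the accented letter is the single byte 0xE9.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // A lone 0xE9 is an incomplete multi-byte sequence in UTF-8, so this is rejected.
        System.out.println(strictDecode(raw, StandardCharsets.UTF_8) == null);
        // The same bytes are valid Windows-1252.
        System.out.println(strictDecode(raw, Charset.forName("windows-1252")) != null);
    }
}
```

In a real job you would read the file bytes (for example with java.nio.file.Files.readAllBytes) and try the few encodings the sender could plausibly have used.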

Re: how to clean a file from wrong encoded characters ?

Anonymous — Fri, 20 Jul 2018 11:13:24 GMT

Hello @Dijke,

Thank you for your quick answer, but I tried that and it didn't work out.

See:

Even if I knew the native encoding, I think reverting the file back to that encoding would still be impossible.

Is there any other way to capture those characters in order to erase them? I think it might be simpler. I tried a regex that allows only alphanumeric characters ([^a-zA-Z0-9]), but I couldn't capture/change/erase the wrong characters. Did I miss something here?

Re: how to clean a file from wrong encoded characters ?

Jesperrekuh — Fri, 20 Jul 2018 12:22:27 GMT

You're looking at the character representation... you need to look at its byte representation.
Example (made up): \u0001232 = A using Unicode / UTF-8 ... but a different encoding may result in a char that looks like ╗, or, when no character maps to those bytes, something like <?>
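
To make the character-versus-byte point concrete, here is a small Java sketch that prints the bytes behind a character in two encodings. The class and method names are invented for the example, and the sample character (e with an acute accent, U+00E9) is only an assumption about the kind of letter involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names and sample character are assumptions for illustration.
class ByteDump {

    /** Formats each byte as two uppercase hex digits, separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String eAcute = "\u00e9";

        byte[] utf8Bytes = eAcute.getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8Bytes));               // C3 A9
        System.out.println(hex(eAcute.getBytes(cp1252))); // E9

        // Same two UTF-8 bytes, read with the wrong charset: one letter becomes two chars
        String wrong = new String(utf8Bytes, cp1252);
        System.out.println(wrong.length());               // 2
    }
}
```

The same letter is one byte in one encoding and two in the other, which is why a file read with the wrong setting shows extra or false characters.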

I would still search for the original encoding, one that is capable of representing all the diacritics needed in your (French) language. Because of conversion problems the byte mapping was wrong and shows a false char... with the original encoding / collation you will probably be working with the correct byte ranges.
Example: Ḃ, ḃ, Ċ, ċ, Ḋ, ḋ, Ḟ, ḟ, Ġ, ġ, Ṁ, ṁ, Ṡ, ṡ, Ṫ, ṫ is probably a fixed byte range in your encoding.
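
If the damage comes from one specific wrong conversion, it can sometimes be undone by replaying it backwards. The sketch below assumes the most common case, UTF-8 bytes that were read as Windows-1252; I have not confirmed that this is what happened to your file, and the class and method names are invented. It is also not lossless: Windows-1252 leaves a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so characters whose UTF-8 form contains one of those bytes cannot be recovered this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: assumes the text is UTF-8 that was wrongly decoded as Windows-1252.
class MojibakeRepair {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Re-encodes the garbled text to its original bytes, then decodes those bytes as UTF-8. */
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(CP1252);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "francais" with a cedilla, after the wrong decode: the cedilla letter (U+00E7)
        // shows up as the two chars U+00C3 U+00A7.
        String garbled = "fran\u00C3\u00A7ais";
        System.out.println(repair(garbled).equals("fran\u00E7ais")); // true
    }
}
```

Plain ASCII text passes through unchanged, so the call is harmless on columns that were never damaged, but only as long as the assumption about the two encodings holds.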

You need to map it back... and find out which range of bytes is malformed; you can do a regex with byte ranges.
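
Java regexes match characters, not bytes, so one way to get a byte-range regex is to decode the raw bytes as ISO-8859-1, where every byte 0x00..0xFF becomes the character with the same value. This is only a sketch with invented names, and the range 0x80-0xFF is an assumption: it blanks every non-ASCII byte, including correctly encoded accented letters, so narrow the range once you know which bytes are actually malformed.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper: names and the chosen byte range are assumptions for illustration.
class ByteRangeCleaner {

    // After an ISO-8859-1 decode each char equals its source byte,
    // so this character class is effectively the byte range 0x80..0xFF.
    private static final Pattern NON_ASCII_BYTES = Pattern.compile("[\\x80-\\xFF]+");

    /** Replaces every run of non-ASCII bytes with a single space. */
    static String blankNonAscii(byte[] raw) {
        String byteView = new String(raw, StandardCharsets.ISO_8859_1);
        return NON_ASCII_BYTES.matcher(byteView).replaceAll(" ");
    }

    public static void main(String[] args) {
        // 'a', then the two bytes C3 A9, then 'b'
        byte[] raw = {'a', (byte) 0xC3, (byte) 0xA9, 'b'};
        System.out.println(blankNonAscii(raw)); // a b
    }
}
```

Working on the bytes rather than on the decoded string avoids the situation where the regex never sees the bad bytes because the reader has already turned them into something else.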

Re: how to clean a file from wrong encoded characters ?

Anonymous — Fri, 20 Jul 2018 12:49:01 GMT

OK, now I get it. Even though I would have preferred a quicker solution, I will try it that way to get a durable solution.

Thank you for your input and all the explanations, @Dijke !