Documents for QlikView related information.
Anyone who deals with Unicode data should account for U+3000 (Ideographic Space), and perhaps all the other Unicode space characters as well (see Wikipedia).
If you are parsing text to build word lists or similar, using the space as a delimiter, the best approach is to first replace U+3000 with a regular space, chr(32):
replace( text, chr( num#( '3000', '(hex)') ), chr(32) )
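To show where that expression could sit in a script, here is a minimal sketch that normalizes U+3000 and then splits the text into words. The table names RawText and Words, the field names Text and Word, and the sample string are made up for illustration; Chr(12288) is U+3000 written in decimal.

```qlik
// Hypothetical sample data: one string with an ideographic space
// (U+3000 = Chr(12288)) after the first word and a regular space
// before the third word.
RawText:
LOAD 'alpha' & Chr(12288) & 'beta gamma' as Text
AutoGenerate 1;

// Swap U+3000 for a regular space first, then split on the space.
// SubField() with two parameters in a LOAD creates one record per substring.
Words:
LOAD SubField(Replace(Text, Chr(Num#('3000', '(hex)')), Chr(32)), Chr(32)) as Word
Resident RawText;

DROP Table RawText;
```

Without the Replace() step, 'alpha' and 'beta' would stay glued together as a single word, because SubField() only splits on the delimiter it is given.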
There are also some more space characters in Unicode you could stumble on:
As a side note: I ran into this while processing Japanese data.
This was interesting. I did not realize that there were so many different whitespace characters...
The solution using the replace() function works fine, but each call can only convert a single whitespace type. I would instead use a mapping table and the MapSubString function for the conversion, e.g.:
Blanks:
Mapping Load chr(Num#(Ord,Notation)) as ChangeFrom, chr(32) as ChangeTo inline
[Notation, Ord
(DEC), 160
(HEX), 2000
(HEX), 3000];
followed by a
Mapsubstring('Blanks', Text) as CorrectedText
in the Load statement. This way you can convert many characters with one function call.
This is a good solution. It could be easily extended for other characters to reduce noise.
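As a rough sketch of such an extension, the mapping table only needs more rows. The four extra code points here (tab, em space U+2003, thin space U+2009, narrow no-break space U+202F) are illustrative picks, not a complete list; add whichever characters actually occur in your data.

```qlik
// Same pattern as the Blanks mapping table, with four more rows:
// tab (decimal 9), U+2003, U+2009 and U+202F. The extra rows are
// examples only, not an exhaustive list of Unicode spaces.
Blanks:
Mapping Load chr(Num#(Ord,Notation)) as ChangeFrom, chr(32) as ChangeTo inline
[Notation, Ord
(DEC), 9
(DEC), 160
(HEX), 2000
(HEX), 2003
(HEX), 2009
(HEX), 202F
(HEX), 3000];
```

The MapSubString('Blanks', Text) call in the Load statement stays unchanged; only the table grows.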
PurgeChar() is also a good option if you want to remove those characters outright rather than replace them with a regular space.
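A minimal sketch of the PurgeChar() route, assuming a hypothetical source table RawTexts with a field Text. The second argument lists every character to strip: Chr(160) is the no-break space, Chr(8192) is U+2000 and Chr(12288) is U+3000.

```qlik
// RawTexts and Text are assumed names, for illustration only.
// PurgeChar() deletes every character listed in its second argument.
Cleaned:
LOAD
    Text,
    PurgeChar(Text, Chr(160) & Chr(8192) & Chr(12288)) as CleanedText
Resident RawTexts;
```

Keep in mind that the characters are deleted, not replaced, so two words separated only by one of these spaces end up joined. For word lists the replacement approaches are usually the safer choice.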