Qlik Community

Unicode text parsing and U+3000 (Ideographic Space)

rbecher

Hi,

Anyone dealing with Unicode data should take U+3000 (Ideographic Space) into account, and perhaps the other Unicode space characters as well: Wikipedia

If you are parsing text to build word lists or similar, using space as the delimiter, the best approach is to first replace U+3000 with a regular space, chr(32):

replace( text, chr( num#( '3000', '(hex)') ), chr(32) )
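
As a minimal sketch of how this could fit into a word-list load: the replace() call can be wrapped in SubField(), which, when called with two parameters in a Load statement, produces one record per substring. The table name RawTexts and the field names Text and Word are assumptions for illustration only.

```qlik
// Hypothetical source: a table RawTexts with a field named Text.
Words:
// Preceding load: drop the empty words produced by consecutive spaces.
Load Word
Where Len(Word) > 0;
// Swap U+3000 for a regular space, then split on space (one record per word).
Load
     SubField(
          Replace( Text, Chr( Num#( '3000', '(hex)' ) ), Chr(32) ),
          ' '
     ) as Word
Resident RawTexts;
```

The preceding load is optional; without it, runs of spaces leave empty values in Word.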

Good luck!

Ralf

Update:

Unicode defines several more space characters that you could stumble on:

http://www.fileformat.info/info/unicode/category/Zs/list.htm


Comments
rbecher

Just as a side note: I came across this while processing Japanese data.

Henric_Cronström

Ralf

This was interesting. I did not realize that there were so many different whitespace characters...

The solution using the replace() function works fine, but each call converts only a single whitespace type. I would instead use a mapping table and the MapSubString function for the conversion, e.g.:

Blanks:
Mapping Load
     chr(Num#(Ord,Notation)) as ChangeFrom,
     chr(32) as ChangeTo
     inline
[Notation, Ord
(DEC), 160
(HEX), 2000
(HEX), 3000];

followed by a

     MapSubString('Blanks', Text) as CorrectedText

in the Load statement. This way you can convert many characters with one function call.
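
Put together, the two pieces might look like the sketch below. The source table RawTexts and the result table CleanTexts are hypothetical names, and the three extra rows (U+1680, U+202F, U+205F) are assumptions added for illustration, taken from the Unicode Zs (space separator) category rather than from the example above.

```qlik
// Mapping table: every listed space character is mapped to a regular space.
Blanks:
Mapping Load
     Chr( Num#( Ord, Notation ) ) as ChangeFrom,
     Chr(32) as ChangeTo
Inline [
Notation, Ord
(DEC), 160
(HEX), 1680
(HEX), 2000
(HEX), 202F
(HEX), 205F
(HEX), 3000
];

// Hypothetical source table RawTexts with a field named Text.
CleanTexts:
Load
     MapSubString( 'Blanks', Text ) as CorrectedText
Resident RawTexts;
```

Further characters can be handled by adding rows to the inline table; the Load statement itself stays unchanged.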

HIC

rbecher

Thanks Henric!

This is a good solution. It can easily be extended to other characters to reduce noise.

PurgeChar() is also a good option for removing such characters outright.
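
A rough sketch of that variant, assuming the same hypothetical RawTexts table and Text field; the output field name TextWithoutNoise is likewise made up for the example:

```qlik
// PurgeChar() deletes every character listed in its second argument.
// Here: no-break space (decimal 160) and ideographic space (U+3000).
Load
     PurgeChar( Text, Chr(160) & Chr( Num#( '3000', '(hex)' ) ) ) as TextWithoutNoise
Resident RawTexts;
```

Note that PurgeChar() removes the characters rather than replacing them, so two words separated only by such a space end up joined. It suits noise characters better than delimiters.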

- Ralf

Last update: 08-29-2011 04:53 PM