Qlik Community

Unicode text parsing and U+3000 (Ideographic Space)

Hi,

Anyone who deals with Unicode data should account for U+3000 (Ideographic Space), and perhaps all the other Unicode space characters as well (see Wikipedia).

If you parse text to build word lists or similar, using the space as a delimiter, the best approach is to replace U+3000 with a regular space, chr(32), first:

replace( text, chr( num#( '3000', '(hex)') ), chr(32) )
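As a minimal sketch of how this fits into a word-list load, the replacement can be wrapped in SubField(), which generates one record per substring when called with two parameters in a Load statement. The table names Sample and Words and the field names TextID, Text and Word are made up for illustration:

```qlik
// Hypothetical sample row: the first two words are separated by U+3000,
// the last two by a regular space.
Sample:
Load
    RecNo() as TextID,
    '東京' & Chr(Num#('3000', '(hex)')) & '大阪 京都' as Text
AutoGenerate 1;

// Normalize U+3000 to chr(32) first, then split on the regular space.
// SubField() with two parameters expands to one record per word.
Words:
Load
    TextID,
    SubField(Replace(Text, Chr(Num#('3000', '(hex)')), Chr(32)), ' ') as Word
Resident Sample;
```

Without the Replace() call, the first two words would stay glued together as a single value.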

Good luck!

Ralf

Update:

There are also some more space characters in Unicode you could stumble on:

http://www.fileformat.info/info/unicode/category/Zs/list.htm


Comments

Just as a side note: I came across this while processing Japanese data.

Ralf

This was interesting. I did not realize that there were so many different whitespace characters...

The solution using the replace() function works fine, but each call can only convert one single whitespace type. I would instead use a mapping table and the MapSubString function for the conversion, e.g.:

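// 160 = U+00A0 (no-break space), 2000 = U+2000 (en quad),
// 3000 = U+3000 (ideographic space); all are mapped to chr(32).
Blanks:
Mapping Load
          chr(Num#(Ord,Notation)) as ChangeFrom,
          chr(32) as ChangeTo
          inline
[Notation, Ord
(DEC), 160
(HEX), 2000
(HEX), 3000];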

followed by a

     MapSubString('Blanks', Text) as CorrectedText

in the Load statement. This way you can convert many characters with one function call.
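The same pattern could be extended to the whole Unicode space-separator category (Zs). The following is only a sketch: the table name AllBlanks and the field name Code are made up, and the list holds the non-ASCII Zs code points as I know them from recent Unicode versions, so check it against the list linked above before relying on it:

```qlik
// Sketch: map every non-ASCII Zs code point to a regular space.
// Code holds the code point in hexadecimal notation.
AllBlanks:
Mapping Load
    Chr(Num#(Code, '(hex)')) as ChangeFrom,
    Chr(32) as ChangeTo
Inline [
Code
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
202F
205F
3000
];
```

It would be used in the same way, with MapSubString('AllBlanks', Text) as CorrectedText in the Load statement.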

HIC


Thanks Henric!

This is a good solution. It could easily be extended to other characters to reduce noise.

PurgeChar() is also a good option for removing such characters.
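A minimal sketch of the PurgeChar() variant follows. Unlike the mapping approach, PurgeChar() deletes the listed characters instead of replacing them, so it suits characters that should vanish completely rather than word separators. The choice of U+200B (zero width space) and U+FEFF (zero width no-break space) as "noise" is my assumption, and the names Source, Cleaned, Text and CleanedText are placeholders:

```qlik
// Hypothetical source row containing a zero width space (U+200B).
Source:
Load
    'word' & Chr(Num#('200B', '(hex)')) & 'list' as Text
AutoGenerate 1;

// PurgeChar() removes every occurrence of each character
// in its second argument.
Cleaned:
Load
    PurgeChar(Text, Chr(Num#('200B', '(hex)')) & Chr(Num#('FEFF', '(hex)'))) as CleanedText
Resident Source;
```

Applying PurgeChar() to real space characters would glue neighbouring words together, so those are better handled with the mapping table.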

- Ralf

Last update: 2011-08-29 04:53 PM