Unicode text parsing and U+3000 (Ideographic Space)

    Hi,

     

    Anyone dealing with Unicode data should be aware of U+3000 (Ideographic Space), and perhaps of the other Unicode space characters as well (see Wikipedia).

     

    If you parse text to build word lists or similar, using the space as the delimiter, it is best to first replace U+3000 with a regular space, chr(32):

     

    replace( text, chr( num#( '3000', '(hex)') ), chr(32) )
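For readers outside Qlik, here is a minimal Python sketch of the same idea. It is only an illustration, not part of the original tip; the function name `split_words` and the sample text are made up for the example.

```python
# Hypothetical helper: mirrors the replace-then-split idea from the post.
IDEOGRAPHIC_SPACE = "\u3000"  # the same code point as U+3000

def split_words(text):
    """Replace U+3000 with a regular space, then split on the regular space only."""
    normalized = text.replace(IDEOGRAPHIC_SPACE, " ")
    # Consecutive spaces produce empty strings; drop them.
    return [word for word in normalized.split(" ") if word]

words = split_words("東京\u3000大阪 京都")
```

The explicit `" "` delimiter imitates a parser that only knows chr(32). Python's `str.split()` without an argument already treats U+3000 as whitespace, so the replacement matters mainly when the delimiter is fixed.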

     

    Good luck!

     

    Ralf

     

    Update:

    Unicode has several more space characters you might stumble on (general category Zs, Space Separator):

    http://www.fileformat.info/info/unicode/category/Zs/list.htm
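If you want to cover the whole Zs category rather than U+3000 alone, one possible approach is to check each character's Unicode general category. This is a rough Python sketch under that assumption; `normalize_spaces` is a hypothetical name, and which characters count as Zs depends on the Unicode version shipped with your Python.

```python
import unicodedata

def normalize_spaces(text):
    """Replace every character of general category Zs with a regular space."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )

# No-Break Space, Em Space and Ideographic Space are all in category Zs.
cleaned = normalize_spaces("a\u00a0b\u2003c\u3000d")
```

Note that this also turns No-Break Space (U+00A0) into a breaking space, which may or may not be what you want. Tabs and line breaks are not in Zs and stay untouched.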