soowork
Contributor III

issue with unicode character in JSON with tFileInputJSON_1

I am using Talend BigData 5.4.1 (5.4.1.r111943). Similar to https://community.talend.com/t5/Design-and-Development/Invalid-XML-character-in-json-file/m-p/66674, I am encountering an error when trying to consume a JSON file that has a Unicode control character in it. In my case, tFileInputJSON failed with "An invalid XML character (Unicode: 0x1b) was found in the element content of the document."

Is there a fix or workaround for this issue yet?

Like GuruGulabKhatri, I also tried to strip the Unicode character out in a tMap and had no luck (e.g. row1.line.replaceAll("\\u001b", "")).

If there is no fix or workaround, is it known exactly which Unicode characters will cause tFileInputJSON to fail?

Thanks in advance.
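
On the question of which characters fail: the error text mentions an "invalid XML character", which suggests (this is an inference, not something confirmed in this thread) that an XML parser is involved somewhere in the component. If so, the XML 1.0 specification's Char production is the likely rule: tab, line feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD and 0x10000-0x10FFFF are allowed, and everything else (including 0x1b) is rejected. A minimal sketch of that check follows; the class and method names are made up for illustration.

```java
final class XmlCharCheck {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of s without the code points XML 1.0 does not allow. */
    static String stripInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10(0x1B));          // the ESC character from the error message
        System.out.println(stripInvalid("\u001BSam"));   // string starting with a real ESC character
    }
}
```

Note that this operates on actual characters in a Java string. It does not touch escape sequences that are still spelled out as text inside the JSON file.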

2 Replies
Anonymous

Hi soowork

>I also tried to strip the unicode character out in a tMap and had no luck (e.g. row1.line.replaceAll("\\u001b", "")).
0x1b is not necessarily \u001b (for example, \u00b7 is 0xc2b7 in UTF-8); you need to find a translation table like this one: https://en.wikipedia.org/wiki/List_of_Unicode_characters

>If there is no fix or work around, it is known exactly which Unicode characters will cause tFileInputJSON to fail?
I think this might be the case when there is no equivalent of the Unicode character in your target code page (which I would expect to be ISO-8859-15), but this is only a guess as I don't have Talend BigData.

In TIS 5.4, my JSON component has an option in the "Advanced settings" section to switch the encoding. Have you tried that?

cheers
dj
soowork
Contributor III
Author

Thanks dj.
In my case, I checked the incoming file and saw it written as "\u001BSam", which I interpreted as [ESC]Sam.
That was why I tried to replace "\u001b".
But even if I got the replace to work, that would only help if I did it for every possible breaking character. Do you know whether it objects to any Unicode character, or only to control characters?
I haven't tried changing the encoding; I will explore that. Ideally, though, I would like to strip or ignore such control characters rather than let them through.
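
A possible explanation for the failed replace, offered as a sketch rather than a confirmed diagnosis: if the file contains the six characters backslash, u, 0, 0, 1, B as text (a JSON escape), then the regex in replaceAll("\\u001b", "") does not match it, because Java's regex engine reads that pattern as the single ESC character. Matching the escape text needs a doubled backslash in the pattern. The hypothetical helper below removes both the spelled-out escapes for control characters (leaving tab, line feed and carriage return alone) and any raw control characters; the class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class ControlCharStripper {

    // JSON escape text for U+0000..U+001F, except tab (09), LF (0A) and CR (0D).
    // The four backslashes in the Java literal become one literal backslash in the regex.
    private static final Pattern ESCAPED_CONTROL =
            Pattern.compile("\\\\u00(?:0[0-8bBcCeEfF]|1[0-9a-fA-F])");

    // Raw control characters, again leaving tab, LF and CR in place.
    private static final Pattern RAW_CONTROL =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    static String strip(String line) {
        String withoutEscapes = ESCAPED_CONTROL.matcher(line).replaceAll("");
        return RAW_CONTROL.matcher(withoutEscapes).replaceAll("");
    }

    public static void main(String[] args) {
        // A line as it might appear in the file: the escape is text, not an ESC character.
        String line = "{\"name\":\"\\u001BSam\"}";

        // The original attempt: the pattern means "ESC character", so the text escape is not matched.
        System.out.println(line.replaceAll("\\u001b", ""));

        // The helper removes the spelled-out escape.
        System.out.println(strip(line));
    }
}
```

In a tMap expression this could be called as ControlCharStripper.strip(row1.line), assuming the class is made available as a user routine. One limitation: a plain regex cannot tell an escape from an escaped backslash followed by the letters u001B, so input containing that sequence would be altered incorrectly.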