Solved: Re: Clean accented character and white space in co... - Page 2 - Qlik Community

Anonymous · ‎2017-05-02

I have a workflow as follows. In the column 'summary', I want to:

1. remove question marks (?)
2. remove leading and trailing white space from the text
3. replace accented letters with their English equivalents, for example é becomes e.

Input

?? at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion | ????????
Jinanhaolu Ñ manager

Output

at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion |
Jinanhaolu N manager

For the accented letters, the above is just a sample. They can be anything, and I do not have a finite list to give as an example.

Thanks in advance!!

TRF · ‎2017-05-02

What's the encoding of the tFileInputDelimited?

Anonymous · ‎2017-05-02

UTF-8

@TRF wrote:
What's the encoding of the tFileInputDelimited?

TRF · ‎2017-05-02

But is your file actually encoded as UTF-8?
I just tested on my side and it works fine.

vboppudi · ‎2017-05-02

Hi,

The following steps might help you.

Step1: Change file read encoding

Step2: Create a new routine named stripAccents with the script below.

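package routines;

import java.text.Normalizer;

public class stripAccents {

    // Returns s with accents removed, e.g. "é" becomes "e".
    public static String stripAccents(String s) {
        // NFD splits each accented letter into a base letter plus combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}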

Create a job: src --> tMap --> tLogRow

COL is the input column in the source, row1.COL is the input in tMap, and COL is the output column in tMap.

output COL --> stripAccents.stripAccents(row1.COL).replaceAll("[?]", "").trim()
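As a rough way to check the expression outside Talend, the same chain can be run in a small standalone class. This is only a sketch: the class name CleanSummaryDemo and the method clean are made up for illustration, and trim() is used here on the assumption that white space should be dropped at both ends of the value. The accented letters are written as \u escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;

// Hypothetical standalone harness, not part of the Talend job.
class CleanSummaryDemo {

    // Decompose accented letters (NFD), then drop the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Strip accents, remove literal question marks, trim both ends.
    static String clean(String s) {
        return stripAccents(s).replace("?", "").trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "?? at Shenzhen Xingjiexun Electronics Co.Ltd",
            "Designer at FabUnion | ????????",
            "Jinanhaolu \u00D1 manager",                       // N with tilde
            "aaa\u00E9\u00E9\u00E9\u00E0\u00E0\u00E0\u00E7\u00E7\u00E7bbbb" // e-acute, a-grave, c-cedilla
        };
        for (String s : samples) {
            // Brackets make any leftover white space visible.
            System.out.println("[" + clean(s) + "]");
        }
    }
}
```

Letters that have no decomposed form in Unicode (ø or ß, for example) pass through this approach unchanged, so they would need their own replacement rule if they occur in the data.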

Input Data:

?? at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion | ????????
Jinanhaolu Ñ manager
aaaéééàààçççbbbb
Shenzhen WenTong electronic co.Ltd Ñ power adapter

Output Data:

Hope this helps!

Regards,

Anonymous · ‎2017-05-02

@TRF can you post a screenshot? @vboppudi the file is in UTF-8 format, and if I change the encoding in the input, the file is not read properly. I faced this issue before and it took me a week to understand the reason; after I switched to UTF-8, the data was read properly.

TRF · ‎2017-05-02

Here is the job with the tFileInputDelimited:

The Advanced settings tab of the tFileInputDelimited:

The input file with the Encoding menu (from Notepad++):

Finally, the result:

@Enthusiast, let us know the encoding of your file.

Regards,

Anonymous · ‎2017-05-02

It's appearing as ANSI when I open it in Notepad++.

TRF · ‎2017-05-02

So just select ISO-8859-15 as the encoding system in the Advanced settings tab.

It works (I've tried).
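To illustrate why the declared encoding matters, here is a minimal sketch in plain Java (outside Talend) that decodes the same bytes two ways. The class name EncodingMismatchDemo and the helper decode are made up for this example, and it assumes the JVM provides the ISO-8859-15 charset, which common JDK builds do. When single-byte encoded text is read as UTF-8, the byte for Ñ is not valid UTF-8 and is replaced by the Unicode replacement character U+FFFD, which many tools then display as "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Talend job.
class EncodingMismatchDemo {

    // Decode raw bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15");

        // Simulate the file content: one byte per character, N-tilde stored as 0xD1.
        byte[] raw = "Jinanhaolu \u00D1 manager".getBytes(latin9);

        // Correct setting: the accented letter survives and can then be cleaned.
        System.out.println(decode(raw, latin9));

        // Wrong setting: 0xD1 followed by a space is not valid UTF-8.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        System.out.println(wrong.contains("\uFFFD")); // true when the letter was lost
    }
}
```

In Notepad++, "ANSI" generally refers to the Windows system code page rather than one fixed charset, so if ISO-8859-15 still gives odd characters, the matching Windows code page may be worth trying.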

Clean accented character and white space in column

Talend Data Integration

v6.x