I have a workflow as follows. In the column 'summary', I want to:
1. remove question marks (?)
2. remove white space from the text
3. replace accented letters with their English equivalents, for example é becomes e.
Input
?? at Shenzhen Xingjiexun Electronics Co.Ltd Designer at FabUnion | ???????? Jinanhaolu Ñ manager
Output
at Shenzhen Xingjiexun Electronics Co.Ltd Designer at FabUnion | Jinanhaolu N manager
For the accented letters, the above is just a sample: it can be any character, and I do not have a finite list to give as an example.
Thanks in advance!!
Hi,
The following steps might help you.
Step 1: Change the file read encoding.
Step 2: Create a new routine, stripAccents, with the script below.
package routines;

import java.text.Normalizer;

public class stripAccents {
    // Decompose accented characters (NFD), then drop the combining marks.
    public static String stripAccents(String s) {
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
        return s;
    }
}
Step 3: Create the job: src --> tMap --> tLogRow
Use COL as the input column in the source and row1.COL as the input in tMap, with COL as the output column in tMap.
output COL --> stripAccents.stripAccents(row1.COL).replaceAll("[?]", "").replaceAll("^ ", "")
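For anyone who wants to check the expression outside Talend, here is a minimal standalone sketch of the same cleanup. The class name SummaryCleaner and the method clean are made up for illustration and are not part of the routine above. It also differs from the tMap expression in one respect: replaceAll("^ ", "") removes only a single leading space, so the sketch uses trim() to drop white space at both ends. Accented characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;

public class SummaryCleaner {
    // Hypothetical helper (not part of the original routine):
    // strips accents, removes question marks, trims both ends.
    public static String clean(String s) {
        if (s == null) {
            return null;
        }
        // NFD splits e.g. "é" into "e" plus a combining accent mark,
        // and the regex then removes the combining marks.
        String noAccents = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // replace() treats "?" literally, so no regex escaping is needed.
        return noAccents.replace("?", "").trim();
    }

    public static void main(String[] args) {
        // prints: at Shenzhen Xingjiexun Electronics Co.Ltd
        System.out.println(clean("?? at Shenzhen Xingjiexun Electronics Co.Ltd"));
        // "\u00D1" is the letter N with a tilde; prints: Jinanhaolu N manager
        System.out.println(clean("Jinanhaolu \u00D1 manager"));
        // e-acute, a-grave, c-cedilla; prints: aaaeeeaaacccbbbb
        System.out.println(clean("aaa\u00E9\u00E9\u00E9\u00E0\u00E0\u00E0\u00E7\u00E7\u00E7bbbb"));
    }
}
```

If you prefer to keep everything in tMap, the same idea would be to replace the last replaceAll("^ ", "") with trim().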
Input Data:
?? at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion | ????????
Jinanhaolu Ñ manager
aaaéééàààçççbbbb
Shenzhen WenTong electronic co.Ltd Ñ power adapter
Output Data:
Hope this helps!
Regards,
Here is the job with the tFileInputDelimited:
The Advanced settings tab of the tFileInputDelimited:
The input file with the Encoding menu (from Notepad++):
Finally, the result:
@Enthusiast, let us know the encoding system for your file.
Regards,
It's appearing as ANSI when I open it in Notepad++.
So just select ISO-8859-15 as the encoding system in the Advanced settings tab.
It works (I've tried).
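To illustrate why the encoding setting matters, here is a small sketch in plain Java, not Talend code. The class EncodingDemo and the method decode are hypothetical names, and it assumes your JVM ships the ISO-8859-15 charset, which standard JDKs normally do. The same byte gives the expected letter only when it is decoded with the charset the file was written in, and the accent stripping can only work on a correctly decoded letter.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Hypothetical helper: decode raw file bytes with a named charset.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xD1 is the single-byte code for N with a tilde in ISO-8859-15.
        byte[] raw = {(byte) 0xD1};

        String asLatin9 = decode(raw, "ISO-8859-15");
        String asUtf8 = decode(raw, "UTF-8");

        // true: the byte decodes to the intended letter
        System.out.println(asLatin9.equals("\u00D1"));
        // false: the lone byte is not valid UTF-8, so the letter is lost
        System.out.println(asUtf8.equals("\u00D1"));
    }
}
```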