I have a workflow as follows. In the column 'summary', I want to:
1. remove question marks (?)
2. remove white space from the text
3. replace accented letters with their English equivalents, for example é becomes e.
Input
?? at Shenzhen Xingjiexun Electronics Co.Ltd Designer at FabUnion | ???????? Jinanhaolu Ñ manager
Output
at Shenzhen Xingjiexun Electronics Co.Ltd Designer at FabUnion | Jinanhaolu N manager
For the accented letters, the above is just a sample: they can be anything, and I do not have a finite list to give as an example.
Thanks in advance!!
Hi,
The following steps might help you.
Step 1: Change the file read encoding.
Step 2: Create a new routine stripAccents with the script below.
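package routines;

import java.text.Normalizer;

public class stripAccents {
    // Decompose accented characters (NFD), then drop the combining marks.
    public static String stripAccents(String s) {
        if (s == null) {
            return null; // let null column values pass through unchanged
        }
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
        return s;
    }
}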
Create the job: src --> tMap --> tLogRow
COL is the input column in the source, row1.COL is the input in tMap, and COL is the output column in tMap.
Output COL expression --> stripAccents.stripAccents(row1.COL).replaceAll("[?]", "").replaceAll("^ ", "")
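For anyone who wants to try the expression outside Talend, here is a minimal standalone sketch of the same idea. The class name StripAccentsDemo and the helper clean are made up for illustration and are not part of the job; the sample strings come from this thread, written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.text.Normalizer;

class StripAccentsDemo {
    // Same approach as the routine: NFD-decompose, then drop the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper mirroring the tMap expression:
    // strip accents, remove every '?', then remove one leading space.
    static String clean(String s) {
        return stripAccents(s).replaceAll("[?]", "").replaceAll("^ ", "");
    }

    public static void main(String[] args) {
        // "aaaéééàààçççbbbb" written with escapes
        System.out.println(clean("aaa\u00E9\u00E9\u00E9\u00E0\u00E0\u00E0\u00E7\u00E7\u00E7bbbb"));
        // "Jinanhaolu Ñ manager"
        System.out.println(clean("Jinanhaolu \u00D1 manager"));
        System.out.println(clean("?? at Shenzhen Xingjiexun Electronics Co.Ltd"));
    }
}
```

Note that "^ " only removes a single space at the very start of the string; spaces left in the middle after the question marks are removed stay as they are.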
Input Data:
?? at Shenzhen Xingjiexun Electronics Co.Ltd
Designer at FabUnion | ????????
Jinanhaolu Ñ manager
aaaéééàààçççbbbb
Shenzhen WenTong electronic co.Ltd Ñ power adapter
Output Data:
Hope this helps!
Regards,
Hi,
Please provide some sample data and expected output.
Regards,
Hi,
Here is an example of how to do it:
First, load the commons-lang3-3.4.jar file and import org.apache.commons.lang3.StringUtils.
For that, in the tLibraryLoad Basic settings select "commons-lang3-3.4.jar", then in the Advanced settings enter import org.apache.commons.lang3.StringUtils; in the Import field (without quotation marks).
In tJavaRow, enter the following (or something similar in tMap, depending on your use case):
output_row.line = StringUtils.stripAccents(input_row.line);
tFixedFlowInput is here to generate data for the flow ("aaaéééàààçççbbbb" for my example), and the result is:
aaaeeeaaacccbbbb
Hope this helps,
Sorry, I forgot "?" and space.
Just replace:
output_row.line = StringUtils.stripAccents(input_row.line);
with:
output_row.line = StringUtils.stripAccents(input_row.line).replaceAll("[? ]", "");
That's all.
How should I connect tLibraryLoad and tJavaRow in my workflow?
Should it be as follows? Please suggest if I should arrange these components in a different way.
tMap -> tLibraryLoad -> tJavaRow -> tFileOutputDelimited
Well, if you just want to remove the leading white space (not all of it), just use:
output_row.line = StringUtils.stripAccents(input_row.line).replaceAll("[?]", "").replaceAll("^ ", "");
There may be a shorter form, but this works.
Regards,
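To make the difference between the two replaceAll variants concrete, here is a small sketch using the sample line from the question. It uses only the JDK (java.text.Normalizer stands in for StringUtils.stripAccents so the example runs without the jar). The class name and the third variant, collapseSpaces, are my own additions and were not proposed in this thread; it is only one possible way to get the single-spaced output shown in the question.

```java
import java.text.Normalizer;

class WhitespaceCleanupDemo {
    // JDK-only stand-in for StringUtils.stripAccents (assumption: same effect on Latin accents).
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Variant 1 from the thread: "[? ]" removes every '?' and every space.
    static String removeAllSpaces(String s) {
        return stripAccents(s).replaceAll("[? ]", "");
    }

    // Variant 2 from the thread: remove '?', then one leading space only.
    static String removeLeadingSpace(String s) {
        return stripAccents(s).replaceAll("[?]", "").replaceAll("^ ", "");
    }

    // Hypothetical variant 3: remove '?', collapse whitespace runs, trim both ends.
    static String collapseSpaces(String s) {
        return stripAccents(s).replaceAll("[?]", "").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "?? at Shenzhen Xingjiexun Electronics Co.Ltd Designer at FabUnion"
                + " | ???????? Jinanhaolu \u00D1 manager";
        System.out.println(removeAllSpaces(input));    // words run together
        System.out.println(removeLeadingSpace(input)); // double space remains after '|'
        System.out.println(collapseSpaces(input));     // single spaces throughout
    }
}
```

In tJavaRow the third variant would just be the same chain of calls appended to StringUtils.stripAccents(input_row.line).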
I downloaded the jar file from http://book2s.com/java/jar/c/commons-lang3/download-commons-lang3-3.4.jar.html, tried the suggested solution, and made tLibraryLoad the first component. Below is how tLibraryLoad is configured:
Basic Settings
Advanced settings
And this is how tJavaRow is configured. I added the column name 'summary' after output_row and input_row in the code as follows:
However, I am getting this error:
Execution failed : Job compile errors
At least job "Test2_Copy" has a compile errors, please fix and export again.
Error Line: 49
Detail Message: Syntax error on token ""org.apache.commons.lang3.StringUtils;"", delete this token
There may be some other errors caused by JVM compatibility. Make sure your JVM setup is similar to the studio.
You must load the library first: tLibraryLoad - OnSubjobOk -> tFileList
Also verify the Advanced settings of tLibraryLoad. The Import field must contain import org.apache.commons.lang3.StringUtils;
Edit: OK, forget that; just remove both " characters in the Import field (it is Java code, not just a string).
I inserted import org.apache.commons.lang3.StringUtils; in the Advanced settings field and it ran without any error; however, the output is not what I need. It simply replaces the accented Ñ with a question mark (?):
Shenzhen WenTong electronic co.Ltd Ñ power adapter
is converted into
Shenzhen WenTong electronic co.Ltd ? power adapter
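The thread does not confirm the cause of this, but one plausible explanation, in line with the "change the file read encoding" step suggested earlier, is a character set mismatch: if the text passes through an encoding that cannot represent Ñ before the accent stripping runs (or when the result is written), Java substitutes a literal question mark. The sketch below only demonstrates that substitution; the class and method names are hypothetical, and the encoding actually used by the job is an assumption that would need to be checked on the file input and output components.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLossDemo {
    // Encode the text to bytes in the given charset and decode it back,
    // which is what effectively happens when a file is written and re-read.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "Shenzhen WenTong electronic co.Ltd Ñ power adapter" with Ñ as an escape
        String line = "Shenzhen WenTong electronic co.Ltd \u00D1 power adapter";

        // US-ASCII cannot represent Ñ, so getBytes replaces it with '?'.
        System.out.println(roundTrip(line, StandardCharsets.US_ASCII));

        // UTF-8 and ISO-8859-1 can both represent Ñ, so the text survives.
        System.out.println(roundTrip(line, StandardCharsets.UTF_8).equals(line));
        System.out.println(roundTrip(line, StandardCharsets.ISO_8859_1).equals(line));
    }
}
```

If this is what is happening, a '?' produced this way is an ordinary question mark, so no accent-stripping step can recover the N afterwards; the encoding has to match where the data is read.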