Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I am looking to split a column which looks like this
1 can cream of mushroom soup
1 carton sour cream
2 (16 oz.) pkg. frozen corn
1 (8 oz.) pkg. cream cheese
cubed
1/3 c. butter
cubed
I would like to split it into:
Quantity, Measurement, Item
For ex: 1 | can | Cream of Mushroom Soup
Since all the rows in this particular column have a different format, I am finding it difficult to use any specific delimiter to split it. And also I am not familiar with regex. I am new to Talend. Any help/ pointers is appreciated. Thank You!
Do you have any other information to help with this? For example, a list of all of the products/items that you will be looking for? Is this list produced by a human (ie is it typed) or by another system? If it is typed by a human, you will have trouble with spelling mistakes and differing measurements (kg, KG, kilo, kilogram, kilograms, etc).
If you know of all measurements and/or all items, then it will be a lot easier to do this.
This is extracted from recipe websites. There is no lookup data to support this. So basically right now , I have to clean this column and create lookups from this data to create a normalized database.
OK, I just knocked this up quickly. It won't be perfect, will need some fine tuning and probably isn't the most efficient way of doing this......but it works for your examples 🙂
This all takes place in a tJavaFlex Main code section.....
// Split the sentence by spaces
String[] words = row2.sentence.split(" ");
boolean quantityFound = false;
boolean measurementFound = false;
String quantity = "";
String measurement = "";
String item = "";
//Loop over each word
for(int i = 0; i<words.length; i++){
//Check for words that match quantity values
if(words[i].matches("\\d*|(\\d*\\/\\d*)")&&!quantityFound){
quantity = quantity+words[i];
quantityFound = true;
i++;
}
//Check for measurement words
if(quantityFound&&!measurementFound){
if(words[i].matches("\\(\\d*|(.*?)\\)|\\d*\\/\\d*|(pkg.)|(can)|(carton)|(c.)")){
measurement = measurement+" " +words[i];
}else{
measurementFound = true;
}
}
//The rest is the item
if(quantityFound&&measurementFound){
item = item + " " +words[i];
}
row3.quantity = quantity;
row3.measurement = measurement.trim();
row3.item = item.trim();
}
When I run this against the examples you gave, it returns this.....
1|can|cream of mushroom soup
1|carton|sour cream
2|(16 oz.) pkg.|frozen corn
1|(8 oz.) pkg.|cream cheese cubed
1/3|c.|butter cubed
Wow thank you so much, with the sample data it worked like magic. But with the dataset I am using it gives me below exception.
Exception in component tJavaFlex_1 (CodeCleaning)
java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5
at local_project.codecleaning_0_1.CodeCleaning.tFileInputDelimited_1Process(CodeCleaning.java:1211)
at local_project.codecleaning_0_1.CodeCleaning.runJobInTOS(CodeCleaning.java:1696)
at local_project.codecleaning_0_1.CodeCleaning.main(CodeCleaning.java:1534)
[statistics] disconnected
Add a System.out.println() to the code to reveal the data that caused it to fail. As I said, it was just a quick piece of code, so I wasn't expecting it to be perfect. Also, I suspect there are other examples that don't meet the rules.
This error will be caused by something like this.....
One kg baked beans Heinz 57
By the way, I noticed a slight error in the code I gave you. The output fields should be outside of the for loop. Like this....
// Split the sentence by spaces
String[] words = row3.sentence.split(" ");
boolean quantityFound = false;
boolean measurementFound = false;
String quantity = "";
String measurement = "";
String item = "";
//Loop over each word
for(int i = 0; i<words.length; i++){
//Check for words that match quantity values
if(words[i].matches("\\d*|(\\d*\\/\\d*)")&&!quantityFound){
quantity = quantity+words[i];
quantityFound = true;
i++;
}
//Check for measurement words
if(quantityFound&&!measurementFound){
if(words[i].matches("\\(\\d*|(.*?)\\)|\\d*\\/\\d*|(pkg.)|(can)|(carton)|(c.)")){
measurement = measurement+" " +words[i];
}else{
measurementFound = true;
}
}
//The rest is the item
if(quantityFound&&measurementFound){
item = item + " " +words[i];
}
}
row4.quantity = quantity;
row4.measurement = measurement.trim();
row4.item = item.trim();
This code won't fix the problem you have here. To get a better idea of how to fix it, we'll nee more variations of possibilities.