Skip to main content
Announcements
Introducing Qlik Answers: A plug-and-play, Generative AI powered RAG solution. READ ALL ABOUT IT!
cancel
Showing results for 
Search instead for 
Did you mean: 
flor1
Contributor
Contributor

Splitting a Column

I am looking to split a column which looks like this

   1 can cream of mushroom soup

   1 carton sour cream

   2 (16 oz.) pkg. frozen corn

   1 (8 oz.) pkg. cream cheese

   cubed

   1/3 c. butter

   cubed

I would like to split it into:

   Quantity, Measurement, Item 

   For ex: 1 | can | Cream of Mushroom Soup

Since all the rows in this particular column have a different format, I am finding it difficult to use any specific delimiter to split it. And also I am not familiar with regex. I am new to Talend. Any help/ pointers is appreciated. Thank You!

Labels (3)
5 Replies
Anonymous
Not applicable

Do you have any other information to help with this? For example, a list of all of the products/items that you will be looking for? Is this list produced by a human (ie is it typed) or by another system? If it is typed by a human, you will have trouble with spelling mistakes and differing measurements (kg, KG, kilo, kilogram, kilograms, etc).

 

If you know of all measurements and/or all items, then it will be a lot easier to do this.

flor1
Contributor
Contributor
Author

This is extracted from recipe websites. There is no lookup data to support this. So basically right now , I have to clean this column and create lookups from this data to create a normalized database.

Anonymous
Not applicable

OK, I just knocked this up quickly. It won't be perfect, will need some fine tuning and probably isn't the most efficient way of doing this......but it works for your examples 🙂

 

This all takes place in a tJavaFlex Main code section.....

 

// Split the sentence by spaces

String[] words = row2.sentence.split(" ");

 

boolean quantityFound = false;

boolean measurementFound = false;

 

String quantity = "";

String measurement = "";

String item = "";

 

 

//Loop over each word

for(int i = 0; i<words.length; i++){

 

//Check for words that match quantity values

if(words[i].matches("\\d*|(\\d*\\/\\d*)")&&!quantityFound){

quantity = quantity+words[i];

quantityFound = true;

i++;

}

//Check for measurement words

if(quantityFound&&!measurementFound){

if(words[i].matches("\\(\\d*|(.*?)\\)|\\d*\\/\\d*|(pkg.)|(can)|(carton)|(c.)")){

measurement = measurement+" " +words[i];

}else{

measurementFound = true;

}

}

//The rest is the item 

if(quantityFound&&measurementFound){

item = item + " " +words[i];

}

row3.quantity = quantity;

row3.measurement = measurement.trim();

row3.item = item.trim();

}

 

 

When I run this against the examples you gave, it returns this.....

 

1|can|cream of mushroom soup

1|carton|sour cream

2|(16 oz.) pkg.|frozen corn

1|(8 oz.) pkg.|cream cheese cubed

1/3|c.|butter cubed

flor1
Contributor
Contributor
Author

Wow thank you so much, with the sample data it worked like magic. But with the dataset I am using it gives me below exception.

 

Exception in component tJavaFlex_1 (CodeCleaning)

java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5

at local_project.codecleaning_0_1.CodeCleaning.tFileInputDelimited_1Process(CodeCleaning.java:1211)

at local_project.codecleaning_0_1.CodeCleaning.runJobInTOS(CodeCleaning.java:1696)

at local_project.codecleaning_0_1.CodeCleaning.main(CodeCleaning.java:1534)

[statistics] disconnected

Anonymous
Not applicable

Add a System.out.println() to the code to reveal the data that caused it to fail. As I said, it was just a quick piece of code, so I wasn't expecting it to be perfect. Also, I suspect there are other examples that don't meet the rules.

 

This error will be caused by something like this.....

 

One kg baked beans Heinz 57

 

By the way, I noticed a slight error in the code I gave you. The output fields should be outside of the for loop. Like this....

 

// Split the sentence by spaces

 

String[] words = row3.sentence.split(" ");

 

 

 

boolean quantityFound = false;

 

boolean measurementFound = false;

 

 

 

String quantity = "";

 

String measurement = "";

 

String item = "";

 

 

 

 

 

//Loop over each word

 

for(int i = 0; i<words.length; i++){

 

 

 

//Check for words that match quantity values

 

if(words[i].matches("\\d*|(\\d*\\/\\d*)")&&!quantityFound){

 

quantity = quantity+words[i];

 

quantityFound = true;

 

i++;

 

}

 

 

//Check for measurement words

 

if(quantityFound&&!measurementFound){

 

if(words[i].matches("\\(\\d*|(.*?)\\)|\\d*\\/\\d*|(pkg.)|(can)|(carton)|(c.)")){

 

measurement = measurement+" " +words[i];

 

}else{

 

measurementFound = true;

 

}

 

 

}

 

 

//The rest is the item 

 

if(quantityFound&&measurementFound){

 

item = item + " " +words[i];

 

}

 

}

 

row4.quantity = quantity;

 

row4.measurement = measurement.trim();

 

row4.item = item.trim();

 

 

This code won't fix the problem you have here. To get a better idea of how to fix it, we'll nee more variations of possibilities.