Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Qlik Open Lakehouse is Now Generally Available! Discover the key highlights and partner resources here.
cancel
Showing results for 
Search instead for 
Did you mean: 
flor1
Contributor
Contributor

Splitting a Column

I am looking to split a column which looks like this

   1 can cream of mushroom soup

   1 carton sour cream

   2 (16 oz.) pkg. frozen corn

   1 (8 oz.) pkg. cream cheese

   cubed

   1/3 c. butter

   cubed

I would like to split it into:

   Quantity, Measurement, Item 

   For ex: 1 | can | Cream of Mushroom Soup

Since all the rows in this particular column have a different format, I am finding it difficult to use any specific delimiter to split it. And also I am not familiar with regex. I am new to Talend. Any help/ pointers is appreciated. Thank You!

Labels (3)
5 Replies
Anonymous
Not applicable

Do you have any other information to help with this? For example, a list of all of the products/items that you will be looking for? Is this list produced by a human (ie is it typed) or by another system? If it is typed by a human, you will have trouble with spelling mistakes and differing measurements (kg, KG, kilo, kilogram, kilograms, etc).

 

If you know of all measurements and/or all items, then it will be a lot easier to do this.

flor1
Contributor
Contributor
Author

This is extracted from recipe websites. There is no lookup data to support this. So basically right now , I have to clean this column and create lookups from this data to create a normalized database.

Anonymous
Not applicable

OK, I just knocked this up quickly. It won't be perfect, will need some fine tuning and probably isn't the most efficient way of doing this......but it works for your examples 🙂

 

This all takes place in a tJavaFlex Main code section.....

 

// Split the sentence by spaces

String[] words = row2.sentence.split(" ");

 

boolean quantityFound = false;

boolean measurementFound = false;

 

String quantity = "";

String measurement = "";

String item = "";

 

 

//Loop over each word

for(int i = 0; i<words.length; i++){

 

//Check for words that match quantity values

if(words[i].matches("\\d*|(\\d*\\/\\d*)")&&!quantityFound){

quantity = quantity+words[i];

quantityFound = true;

i++;

}

//Check for measurement words

if(quantityFound&&!measurementFound){

if(words[i].matches("\\(\\d*|(.*?)\\)|\\d*\\/\\d*|(pkg.)|(can)|(carton)|(c.)")){

measurement = measurement+" " +words[i];

}else{

measurementFound = true;

}

}

//The rest is the item 

if(quantityFound&&measurementFound){

item = item + " " +words[i];

}

row3.quantity = quantity;

row3.measurement = measurement.trim();

row3.item = item.trim();

}

 

 

When I run this against the examples you gave, it returns this.....

 

1|can|cream of mushroom soup

1|carton|sour cream

2|(16 oz.) pkg.|frozen corn

1|(8 oz.) pkg.|cream cheese cubed

1/3|c.|butter cubed

flor1
Contributor
Contributor
Author

Wow thank you so much, with the sample data it worked like magic. But with the dataset I am using it gives me below exception.

 

Exception in component tJavaFlex_1 (CodeCleaning)

java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5

at local_project.codecleaning_0_1.CodeCleaning.tFileInputDelimited_1Process(CodeCleaning.java:1211)

at local_project.codecleaning_0_1.CodeCleaning.runJobInTOS(CodeCleaning.java:1696)

at local_project.codecleaning_0_1.CodeCleaning.main(CodeCleaning.java:1534)

[statistics] disconnected

Anonymous
Not applicable

Add a System.out.println() to the code to reveal the data that caused it to fail. As I said, it was just a quick piece of code, so I wasn't expecting it to be perfect. Also, I suspect there are other examples that don't meet the rules.

 

This error will be caused by something like this.....

 

One kg baked beans Heinz 57

 

By the way, I noticed a slight error in the code I gave you. The output fields should be outside of the for loop. Like this....

 

// Split the sentence by spaces

 

String[] words = row3.sentence.split(" ");

 

 

 

boolean quantityFound = false;

 

boolean measurementFound = false;

 

 

 

String quantity = "";

 

String measurement = "";

 

String item = "";

 

 

 

 

 

//Loop over each word

 

for(int i = 0; i<words.length; i++){

 

 

 

//Check for words that match quantity values

 

if(words[i].matches("\\d*|(\\d*\\/\\d*)")&&!quantityFound){

 

quantity = quantity+words[i];

 

quantityFound = true;

 

i++;

 

}

 

 

//Check for measurement words

 

if(quantityFound&&!measurementFound){

 

if(words[i].matches("\\(\\d*|(.*?)\\)|\\d*\\/\\d*|(pkg.)|(can)|(carton)|(c.)")){

 

measurement = measurement+" " +words[i];

 

}else{

 

measurementFound = true;

 

}

 

 

}

 

 

//The rest is the item 

 

if(quantityFound&&measurementFound){

 

item = item + " " +words[i];

 

}

 

}

 

row4.quantity = quantity;

 

row4.measurement = measurement.trim();

 

row4.item = item.trim();

 

 

This code won't fix the problem you have here. To get a better idea of how to fix it, we'll nee more variations of possibilities.