Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hey everyone,
I'm attempting to combine various field values into one value on a particular field. For example, if we had a field named fruit and it contained values such as Banana, Strawberry, banana, straw berry, strawberries, blueberries, and Pineapple.
The easiest approach is obviously to go into the data source and quickly modify these values, however that is not currently possible. My pie chart is currently giving me a different slice for each of these field values, however we obviously would like to have all of the strawberries combined into one slice.
I've had a degree of success using set analysis, however a colleague recently suggested using an IF statement in my script. Any help would be greatly appreciated!
or joining the group name using a static levenshtein distance limit in the script:
tabFruits:
LOAD *,
Ceil(Rand()*100) as SomeFruitValue
INLINE [
fruit
Banana
Strawberry
banana
straw berry
strawberries
blueberries
Pineapple
];
tabFruitGroups:
LOAD Distinct fruit Resident tabFruits;
Join
LOAD * Inline [
fruit2, fruitGroup
Banana, Yellow Fruit
Pineapple, Yellow Fruit
Strawberry, Berry
Blueberry, Berry
];
Left Join (tabFruits)
LOAD fruit,
fruitGroup
Resident tabFruitGroups
Where levenshtein(Upper(fruit),Upper(fruit2))<=3;
DROP Table tabFruitGroups;
hope this helps
regards
Marco
Hi Chris,
Maybe a calculated dimension like:
if(Match(fruit,'Banana', 'Strawberry', 'banana', 'straw berry', 'strawberries', 'blueberries', 'Pineapple'),'Fruit','Not Fruit')
cheers
Andrew
Could you explain the function of Match? Initially, I was attempting to group together my budgets based on a description about them, and the data source had them named with different conventions (e.g. and, And, &).
However, I am also looking for a way to combine values regardless of how similar their names are to generate charts that management would like to see. In the example I wrote, it would be along the lines of "We want to group together all the blueberries with strawberries, but leave the rest as they are"
Are both of these possible via the expression you posted? Thank you for your help!
Hi Chris,
The best explanation of the Match function is to be found in QlikView help, but briefly it returns a number if the value being checked matched one of the values in the list that follows. If it matches the first in the list we get 1, second 2 etc. If no match then 0. It can be used as an if condition, 0 will cause condition to be not met, >0 and the condition is satisfied.
Look into the WildMatch function in the help. This allows to search for similar strings using wildcards.
If fruit = 'strawberry'
match(fruit, 'berry') = 0
wildmatch(fruit, '*berry') = 1
Your best bet might be to build up a table mapping possible field values to the standardised values you want to use. You can do this in an external file like an Excel file. Look up the function ApplyMap. You can use that to map to correct values in the script.
Good luck
Andrew
Hi,
maybe one solution could be to group the fruit names by similarity (e.g. using the Levenshtein distance) without hard coding specific fruit names at all:
tabFruits:
LOAD *,
Ceil(Rand()*100) as SomeFruitValue
INLINE [
fruit
Banana
Strawberry
banana
straw berry
strawberries
blueberries
Pineapple
];
tabFruitGroups:
LOAD Distinct fruit Resident tabFruits;
Join
LOAD fruit as fruit2
Resident tabFruitGroups;
Join
LOAD fruit,
fruit2,
levenshtein(Upper(fruit),Upper(fruit2)) as LevenshtDistU
Resident tabFruitGroups;
refer to:
string matching with fuzzy, trigram (n-gram), levenshtein, etc.
for the implementation of the required levenshtein function.
I guess in a productive application the grouping of the fruit names should be done in the script to really get all relevant names into one group and probably name the group by its most frequent member.
The front end aggregation in this example does not deliver useful results for all selected levenshtein distances.
hope this helps
regards
Marco
Thank you!
This will definitely help in terms of cleaning data. I don't suppose there is any way to use the Levenshtein distance to group based on different criteria is there? For example, say you wanted to group Pineapple and Banana (yellow fruits) despite their vastly different names.
Is this still possible using the methodology you have outlined? It sounds like it would be more applicable to do this through the script editor as you mentioned towards the end of your post, but as a new user I'm not sure the exact syntax for grouping field values.
Hi,
maybe one solution might be to create a table that defines the different groups and check for similarities using the same algorithm:
tabFruits:
LOAD *,
Ceil(Rand()*100) as SomeFruitValue
INLINE [
fruit
Banana
Strawberry
banana
straw berry
strawberries
blueberries
Pineapple
];
tabFruitGroups:
LOAD Distinct fruit Resident tabFruits;
Join
LOAD * Inline [
fruit2, fruitGroup
Banana, Yellow Fruit
Pineapple, Yellow Fruit
Strawberry, Berry
Blueberry, Berry
];
Join
LOAD fruit,
fruit2,
levenshtein(Upper(fruit),Upper(fruit2)) as LevenshtDistU
Resident tabFruitGroups;
hope this helps
regards
Marco
or joining the group name using a static levenshtein distance limit in the script:
tabFruits:
LOAD *,
Ceil(Rand()*100) as SomeFruitValue
INLINE [
fruit
Banana
Strawberry
banana
straw berry
strawberries
blueberries
Pineapple
];
tabFruitGroups:
LOAD Distinct fruit Resident tabFruits;
Join
LOAD * Inline [
fruit2, fruitGroup
Banana, Yellow Fruit
Pineapple, Yellow Fruit
Strawberry, Berry
Blueberry, Berry
];
Left Join (tabFruits)
LOAD fruit,
fruitGroup
Resident tabFruitGroups
Where levenshtein(Upper(fruit),Upper(fruit2))<=3;
DROP Table tabFruitGroups;
hope this helps
regards
Marco
Andrew,
I appreciate the help. I looked through the documentation for the match() function, and I managed to combine the fields in question. However, I was unable to define more than 2 fields for the grouping. In the end, I ended up with 2 groups:
1. Group containing all desired field values that should be grouped
2. All other fields
I tried to find a way to combine the IF/MATCH/WILDMATCH functions such that I could define multiple fields, however I kept receiving an error in my expression. Do you have any suggestions on how I may modify the function in a manner such as:
if(Match(fruit,'Banana', 'Pineapple'), 'Yellow Fruit', Match(fruit, 'Strawberry', 'strawberries'), 'Red Fruit', Matc(fruit, 'blueberries'),'Blue Fruit','Not Fruit')
Anything with functionality like this would be a huge help, thank you.
THANK YOU!!! This got me to where I was wanting to go. I'm going to have to look farther into the script to figure out how the function works. I had attempted originally to use just the inline table and a left join, however it wouldn't work until I created the Levenshtein function. Thank you again!!