Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi Everyone,
I have a field composed of text, and I have kind of duplicates in it that I would like to eliminate. I say "kind of" because actually there is the proper text I would like to keep and abbreviations of that text I'm trying to get rid of. here an example:
MYFIELD |
EXPECTED RESULT |
TOM |
TOMATE |
TOMA |
|
TOMATE |
|
CAROT |
CAROTTE |
CAROTTE |
|
CARTE |
|
CHOUX |
CHOUX ROUGE |
CHOUX ROUGE |
as you can see, I keep each time the longest description, and sometimes the abbreviation is done by taking off some letters inside the proper name, not only at the end of it.
I was thinking of something like "if row1 included into row2 then row2" but I didn't succeed to find the right formula for it, and on the other hand it won't work for "CARTE" becoming "CAROTTE" in the above example.
Do you have any idea of how I could achieve this?
Thanks a lot for your help !
Hi, Try below.
MapProperText:
Mapping LOAD * Inline [
ChangeFrom,ChangeTo
TOM,TOMATE
TOMATE,TOMATE
TOMA,TOMATE
TOMATE,TOMATE
CAROT,CAROTTE
CAROTTE,CAROTTE
CAROTTE,CAROTTE
CARTE,CAROTTE
CHOUX,CHOUX ROUGE
CHOUX ROUGE,CHOUX ROUGE
];
Data:
LOAD * ,
ApplyMap('MapProperText',MYFIELD) as "Proper Name"
Inline [
MYFIELD
TOM
TOMATE
TOMA
TOMATE
CAROT
CAROTTE
CAROTTE
CARTE
CHOUX
CHOUX ROUGE
];
Do the values have codes or IDs to distinguish between variants?
Hi Pierre,
no they don't. MYFIELD is the result of combined cleaning actions on 3 other fields. So the table actually look like this:
Field1 | Field2 | Field3 | MyField |
I'm trying to clean further MYFIELD to achieve the result in the above example.
if the current result in MyField is "TOM" for instance instead of "TOMATE" it means that "TOMATE" is never shown in the 3 other fields, but "TOM", while I, being a human, know that "TOM" stands for "TOMATE". Note that the "TOM" can be in the middle of any of the 3 fields.
Thanks !
So, your field named "Myfield" has values TOM, TOMA, TOMATE. From here you need to find the maximum length string Is my understanding correct?
Or your Field1 contains "TOM", Field2 contains "TOMA" and Field3 contains "TOMATE" and you do a concat and have all three together in "Myfield" column?
I think you could use a mapping approach but you will need some degree of manual work because you need to define the final mapping-results. This means for each possible search-value must exists a valid return-value.
It's not mandatory necessary to create such mapping manually else the search-values might be derived from the final values, for example:
temp:
load *, mid(F, 1, L - iterno() + 1) as R while iterno() <= L;
load *, len(F) as L;
load 'TOMATO' as F, autogenerate 1;
which applies per internal while-loop multiple value-parts in extra records. Such logic might be applied several times to run it backwards and forwards and/or to pick values in between and might be also combined with additionally conditions in regard to the min/max lengths (returning just a single char or twice won't be sensible).
With a similar logic you may also consider any typos by reversing chars against each other and/or replacing them with similar chars. Of course this won't be trivial but you may start it with just a few iterations.
Very important to create such listing is to add recno(), rowno() and iterno() as additionally fields because they are needed to sort the mapping-table. A mapping returned always the first matching and therefore you will look at first for: TOMATO before you look for TOM. Further for the above you couldn't use applymap() else mapsubstring() which won't directly return you final value else it will place the final value within calling-string and needs to be afterwards extracted, for example with textbetween() on additionally wrapped chars around your final values.
Mapsubstring() has further the benefit that it could replace multiple values within a string which might be also helpful to rank and sort the matches.
Beside this it might be useful not to apply everything within a single mapping else you may use multiple ones - again to increase the quality of the matching and the ranking. Another point which may minimize the needed efforts might be not to do this against a concatenated string of 3 fields else against the single fields.
If it looked like a quite heavy work - it is. But I think there aren't many alternatives and if they won't be really easier at least not to derive a valid final value. Nevertheless you may look on features like:
LevenshteinDist - script and chart function | Qlik Cloud Help
hi Iswarya,
MyField contains TOM, TOMA, TOMATE, ... and yes I'm looking for the longest word, but between TOM, TOMA and TOMATE. I'm also looking fo rthe longest word between CARO, CAROT, CAROTTE who are in the same field.
Field1, Field2 and Field3 contain a lot more text each, I do clean the 3 of them to get the TOM/TOMA/TOMATE.
hi Marcus,
thanks, sounds interesting, I'll dig further in that direction. I know it's a heavy work 😉