Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
Delphines
Contributor III
Contributor III

Clean a field with itself

Hi Everyone, 

I have a field composed of text, and I have kind of duplicates in it that I would like to eliminate. I say "kind of" because actually there is the proper text I would like to keep and abbreviations of that text I'm trying to get rid of. here an example: 

MYFIELD

EXPECTED RESULT

TOM

TOMATE

TOMA

TOMATE

CAROT

CAROTTE

CAROTTE

CARTE

CHOUX

CHOUX ROUGE

CHOUX ROUGE

as you can see, I keep each time the longest description, and sometimes the abbreviation is done by taking off some letters inside the proper name, not only at the end of it. 

I was thinking of something like "if row1 included into row2 then row2" but I didn't succeed to find the right formula for it, and on the other hand it won't work for "CARTE" becoming "CAROTTE" in the above example.

Do you have any idea of how I could achieve this? 

Thanks a lot for your help !

Labels (2)
8 Replies
BrunPierre
Partner - Master
Partner - Master

Hi, Try below.

MapProperText:
Mapping LOAD * Inline [
ChangeFrom,ChangeTo
TOM,TOMATE
TOMATE,TOMATE
TOMA,TOMATE
TOMATE,TOMATE
CAROT,CAROTTE
CAROTTE,CAROTTE
CAROTTE,CAROTTE
CARTE,CAROTTE
CHOUX,CHOUX ROUGE
CHOUX ROUGE,CHOUX ROUGE
];

Data:
LOAD * ,
ApplyMap('MapProperText',MYFIELD) as "Proper Name"
Inline [
MYFIELD
TOM
TOMATE
TOMA
TOMATE
CAROT
CAROTTE
CAROTTE
CARTE
CHOUX
CHOUX ROUGE
];

Delphines
Contributor III
Contributor III
Author

Hi Pierre,

Thanks for your suggestion but I don't think it'll work as it seems rather manual i.e. I'm giving the answer with the mapping table while I need something looking at the field and picking the right term each time. My left column is already the result of some cleaning actions, and actually contains about 1M rows, and the source of it is dynamic, so I can't definitely have a "manual" mapping table.

Kind regards,

Delphine
BrunPierre
Partner - Master
Partner - Master

Do the values have codes or IDs to distinguish between variants?

Delphines
Contributor III
Contributor III
Author

Hi Pierre, 

no they don't. MYFIELD is the result of combined cleaning actions on 3 other fields. So the table actually look like this: 

Field1 Field2 Field3 MyField

I'm trying to clean further MYFIELD to achieve the result in the above example.

if the current result in MyField is "TOM" for instance instead of "TOMATE" it means that "TOMATE" is never shown in the 3 other fields, but "TOM", while I, being a human, know that "TOM" stands for "TOMATE". Note that the "TOM" can be in the middle of any of the 3 fields. 

Thanks !

_Iswarya_
Contributor III
Contributor III

@Delphines 

So, your field named "Myfield" has values TOM, TOMA, TOMATE. From here you need to find the maximum length string Is my understanding correct?

Or your Field1 contains "TOM", Field2 contains "TOMA" and Field3 contains "TOMATE" and you do a concat and have all three together in "Myfield" column?

 

marcus_sommer

I think you could use a mapping approach but you will need some degree of manual work because you need to define the final mapping-results. This means for each possible search-value must exists a valid return-value.

It's not mandatory necessary to create such mapping manually else the search-values might be derived from the final values, for example:

temp:
load *, mid(F, 1, L - iterno() + 1) as R while iterno() <= L;
load *, len(F) as L;
load 'TOMATO' as F, autogenerate 1;

which applies per internal while-loop multiple value-parts in extra records. Such logic might be applied several times to run it backwards and forwards and/or to pick values in between and might be also combined with additionally conditions in regard to the min/max lengths (returning just a single char or twice won't be sensible).

With a similar logic you may also consider any typos by reversing chars against each other and/or replacing them with similar chars. Of course this won't be trivial but you may start it with just a few iterations.

Very important to create such listing is to add recno(), rowno() and iterno() as additionally fields because they are needed to sort the mapping-table. A mapping returned always the first matching and therefore you will look at first for: TOMATO before you look for TOM. Further for the above you couldn't use applymap() else mapsubstring() which won't directly return you final value else it will place the final value within calling-string and needs to be afterwards extracted, for example with textbetween() on additionally wrapped chars around your final values.

Mapsubstring() has further the benefit that it could replace multiple values within a string which might be also helpful to rank and sort the matches.

Beside this it might be useful not to apply everything within a single mapping else you may use multiple ones - again to increase the quality of the matching and the ranking. Another point which may minimize the needed efforts might be not to do this against a concatenated string of 3 fields else against the single fields.

If it looked like a quite heavy work - it is. But I think there aren't many alternatives and if they won't be really easier at least not to derive a valid final value. Nevertheless you may look on features like:

LevenshteinDist - script and chart function | Qlik Cloud Help

Delphines
Contributor III
Contributor III
Author

hi Iswarya, 

MyField contains TOM, TOMA, TOMATE, ... and yes I'm looking for the longest word, but between TOM, TOMA and TOMATE. I'm also looking fo rthe longest word between CARO, CAROT, CAROTTE who are in the same field. 

Field1, Field2 and Field3 contain a lot more text each, I do clean the 3 of them to get the TOM/TOMA/TOMATE.

Delphines
Contributor III
Contributor III
Author

hi Marcus, 

thanks, sounds interesting, I'll dig further in that direction. I know it's a heavy work 😉