Skip to main content
Announcements
Join us at Qlik Connect for 3 magical days of learning, networking,and inspiration! REGISTER TODAY and save!
cancel
Showing results for 
Search instead for 
Did you mean: 
SilviyaK
Contributor III
Contributor III

Fuzzy matching/likeliness between two values in a field

Hi all, 

I am working with sales data which has company name details but it is not unified. For example, I can have many lines with the Coca Cola company, but the company name can vary in the 'Company Names' field, for example:

1. coca cola
2. coca-cola
3. cocca-colla
4. cocacola 556

etc...

Imagine there can be all kinds of typos, mistakes or random text as the company name field is free text.

What I did is that I extracted the company names from the email addresses of the users in the system, so I basically created a more unified list of all the companies and I named that field 'Company Domains'.

The issue is that not all lines in my data have email address so for these lines the 'Company Domains' field is going to remain null while the 'Company Names' field will have some random company name there (most probably with typos). 

Is there any way to tell Qlik to check if the line has value under 'Company Domain' and in case it doesn't then to check the 'Company Name' and match that value with the closest possible from 'Company Domain'? I believe it needs to be something like fuzzy matching or likeliness between field contents.

Thanks in advance!

Labels (1)
3 Replies
MarcoWedel

maybe you can calculate the minimal Levenshtein distance between names and domains?:

https://help.qlik.com/en-US/sense/August2023/Subsystems/Hub/Content/Sense_Hub/Scripting/StringFuncti...

 

SilviyaK
Contributor III
Contributor III
Author

Thanks! It is an interesting idea, but I am still not sure how I can use it as for some lines the function gives me a very low number for distance, but in others it can go as high as 60, so not really sure how to use it in a condition. Any further ideas are welcome! 

marcus_sommer

You may be able to improve the results if you don't only apply the distance on the origin field else also on a cleaned version and/or splitting the values by a delimiter. Cleaning could mean to upper/lower case everything, removing all spaces and other usual separating chars like '-' and/or removing all kinds of special chars like .;-#:

Further you may try to include any context-checks to the records by looking for zip-codes, countries, time-zones or similar information and/or previous/following records or similar ones. Everything would be rather ugly and only improving it a little bit ... so you should balance the efforts and telling them: "... shit in = shit out ..."