Solved: String Percentage Comparison? - Qlik Community

cbaqir · ‎2017-06-13

Is there a way to get QV to compare how similar two values are?

In other words, match on 75% of the string to return possible duplicates?

John Doe - ABCD

John Doe - EFGH

MarcoWedel · ‎2017-06-14

Hi,

maybe one solution might be:

tabPlanTemp:
LOAD Distinct
    PLAN_DESC
FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])
Where Len(Trim(PLAN_DESC));

Join
LOAD PLAN_DESC as PLAN_DESC2
Resident tabPlanTemp;

tabPlan:
LOAD PLAN_DESC,
    PLAN_DESC2,
    Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity
Resident tabPlanTemp
Where not PLAN_DESC follows PLAN_DESC2;

DROP Table tabPlanTemp;

Another example:

Fuzzy Matching/Joining

hope this helps

regards

Marco

View solution in original post

Peter_Cammaert · ‎2017-06-13

As a starter to detect string similarity, see the implementation of the Levenshtein Distance algorithm here: string matching with fuzzy, trigram (n-gram), levenshtein, etc.

Background info can be found here: Levenshtein distance - Wikipedia

MarcoWedel · ‎2017-06-13

can you post some more examples?

Is it always the first part of the string you're interested in?

regards

Marco

cbaqir · ‎2017-06-14

It's not always the first part of the string. Here's an example where there is an extra space in the middle:

MBS - Modified Barium Swallow Study

Here's another example not at the end:

Transphenoidal (Pituitary) Surgery Postoperative

Transsphenoidal (Pituitary) Surgery Postoperative

Example < vs less than:

POC Risk for Nutritional Imbalance: < Requirements (Newborn)

POC Risk for Nutritional Imbalance: < Requirements (NICU)

POC Risk for Nutritional Imbalance: less than Requirements (Newborn)

POC Risk for Nutritional Imbalance: less than Requirements (NICU)

cbaqir · ‎2017-06-14

I've attached sample data. I ultimately want to show a list that shows matches between PLAN_DESC at like a 75% match.

MarcoWedel · ‎2017-06-14

Hi,

maybe one solution might be:

tabPlanTemp:
LOAD Distinct
    PLAN_DESC
FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])
Where Len(Trim(PLAN_DESC));

Join
LOAD PLAN_DESC as PLAN_DESC2
Resident tabPlanTemp;

tabPlan:
LOAD PLAN_DESC,
    PLAN_DESC2,
    Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity
Resident tabPlanTemp
Where not PLAN_DESC follows PLAN_DESC2;

DROP Table tabPlanTemp;

Another example:

Fuzzy Matching/Joining

hope this helps

regards

Marco

cbaqir · ‎2017-06-15

Super cool. Thanks!

scottford‌ clondono‌

MarcoWedel · ‎2017-06-15

you're welcome

regards

Marco

cbaqir · ‎2017-06-21

The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.

cbaqir · ‎2017-06-29

The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.