Skip to main content
Woohoo! Qlik Community has won “Best in Class Community” in the 2024 Khoros Kudos awards!
Announcements
Nov. 20th, Qlik Insider - Lakehouses: Driving the Future of Data & AI - PICK A SESSION
cancel
Showing results for 
Search instead for 
Did you mean: 
cbaqir
Specialist II
Specialist II

String Percentage Comparison?

Is there a way to get QV to compare how similar two values are?

In other words, match on 75% of the string to return possible duplicates?

John Doe - ABCD

John Doe - EFGH

1 Solution

Accepted Solutions
MarcoWedel

Hi,

maybe one solution might be:

QlikCommunity_Thread_263975_Pic1.JPG

QlikCommunity_Thread_263975_Pic2.JPG

QlikCommunity_Thread_263975_Pic3.JPG

tabPlanTemp:

LOAD Distinct

    PLAN_DESC

FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])

Where Len(Trim(PLAN_DESC));

Join

LOAD PLAN_DESC as PLAN_DESC2

Resident tabPlanTemp;

tabPlan:

LOAD PLAN_DESC,

    PLAN_DESC2,

    Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity

Resident tabPlanTemp

Where not PLAN_DESC follows PLAN_DESC2;

DROP Table tabPlanTemp;

Another example:

Fuzzy Matching/Joining

hope this helps

regards

Marco

View solution in original post

10 Replies
Peter_Cammaert
Partner - Champion III
Partner - Champion III

As a starter to detect string similarity, see the implementation of the Levenshtein Distance algorithm here: string matching with fuzzy, trigram (n-gram), levenshtein, etc.

Background info can be found here: Levenshtein distance - Wikipedia

MarcoWedel

can you post some more examples?

Is it always the first part of the string you're interested in?

regards

Marco

cbaqir
Specialist II
Specialist II
Author

It's not always the first part of the string. Here's an example where there is an extra space in the middle:

 

MBS -  Modified Barium Swallow Study
MBS - Modified Barium Swallow Study

Here's another example not at the end:

   

Transphenoidal (Pituitary) Surgery Postoperative
Transsphenoidal (Pituitary) Surgery Postoperative

Example < vs less than:

 

POC Risk for Nutritional Imbalance:  < Requirements (Newborn)
POC Risk for Nutritional Imbalance:  < Requirements (NICU)
POC Risk for Nutritional Imbalance:  less than Requirements (Newborn)
POC Risk for Nutritional Imbalance:  less than Requirements (NICU)
cbaqir
Specialist II
Specialist II
Author

I've attached sample data. I ultimately want to show a list that shows matches between PLAN_DESC at like a 75% match.

MarcoWedel

Hi,

maybe one solution might be:

QlikCommunity_Thread_263975_Pic1.JPG

QlikCommunity_Thread_263975_Pic2.JPG

QlikCommunity_Thread_263975_Pic3.JPG

tabPlanTemp:

LOAD Distinct

    PLAN_DESC

FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])

Where Len(Trim(PLAN_DESC));

Join

LOAD PLAN_DESC as PLAN_DESC2

Resident tabPlanTemp;

tabPlan:

LOAD PLAN_DESC,

    PLAN_DESC2,

    Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity

Resident tabPlanTemp

Where not PLAN_DESC follows PLAN_DESC2;

DROP Table tabPlanTemp;

Another example:

Fuzzy Matching/Joining

hope this helps

regards

Marco

cbaqir
Specialist II
Specialist II
Author

Super cool. Thanks!

scottfordclondono

MarcoWedel

you're welcome

regards

Marco

cbaqir
Specialist II
Specialist II
Author

The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.

cbaqir
Specialist II
Specialist II
Author

The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.