Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Is there a way to get QV to compare how similar two values are?
In other words, match on 75% of the string to return possible duplicates?
John Doe - ABCD
John Doe - EFGH
Hi,
maybe one solution might be:
tabPlanTemp:
LOAD Distinct
PLAN_DESC
FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])
Where Len(Trim(PLAN_DESC));
Join
LOAD PLAN_DESC as PLAN_DESC2
Resident tabPlanTemp;
tabPlan:
LOAD PLAN_DESC,
PLAN_DESC2,
Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity
Resident tabPlanTemp
Where not PLAN_DESC follows PLAN_DESC2;
DROP Table tabPlanTemp;
Another example:
hope this helps
regards
Marco
As a starter to detect string similarity, see the implementation of the Levenshtein Distance algorithm here: string matching with fuzzy, trigram (n-gram), levenshtein, etc.
Background info can be found here: Levenshtein distance - Wikipedia
can you post some more examples?
Is it always the first part of the string you're interested in?
regards
Marco
It's not always the first part of the string. Here's an example where there is an extra space in the middle:
MBS - Modified Barium Swallow Study |
MBS - Modified Barium Swallow Study |
Here's another example not at the end:
Transphenoidal (Pituitary) Surgery Postoperative |
Transsphenoidal (Pituitary) Surgery Postoperative |
Example < vs less than:
POC Risk for Nutritional Imbalance: < Requirements (Newborn) |
POC Risk for Nutritional Imbalance: < Requirements (NICU) |
POC Risk for Nutritional Imbalance: less than Requirements (Newborn) |
POC Risk for Nutritional Imbalance: less than Requirements (NICU) |
I've attached sample data. I ultimately want to show a list that shows matches between PLAN_DESC at like a 75% match.
Hi,
maybe one solution might be:
tabPlanTemp:
LOAD Distinct
PLAN_DESC
FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])
Where Len(Trim(PLAN_DESC));
Join
LOAD PLAN_DESC as PLAN_DESC2
Resident tabPlanTemp;
tabPlan:
LOAD PLAN_DESC,
PLAN_DESC2,
Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity
Resident tabPlanTemp
Where not PLAN_DESC follows PLAN_DESC2;
DROP Table tabPlanTemp;
Another example:
hope this helps
regards
Marco
you're welcome
regards
Marco
The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.
The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.