Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
 
					
				
		
 cbaqir
		
			cbaqir
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Is there a way to get QV to compare how similar two values are?
In other words, match on 75% of the string to return possible duplicates?
John Doe - ABCD
John Doe - EFGH
 
					
				
		
 MarcoWedel
		
			MarcoWedel
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Hi,
maybe one solution might be:
tabPlanTemp:
LOAD Distinct
PLAN_DESC
FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])
Where Len(Trim(PLAN_DESC));
Join
LOAD PLAN_DESC as PLAN_DESC2
Resident tabPlanTemp;
tabPlan:
LOAD PLAN_DESC,
PLAN_DESC2,
Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity
Resident tabPlanTemp
Where not PLAN_DESC follows PLAN_DESC2;
DROP Table tabPlanTemp;
Another example:
hope this helps
regards
Marco
 Peter_Cammaert
		
			Peter_Cammaert
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		As a starter to detect string similarity, see the implementation of the Levenshtein Distance algorithm here: string matching with fuzzy, trigram (n-gram), levenshtein, etc.
Background info can be found here: Levenshtein distance - Wikipedia
 
					
				
		
 MarcoWedel
		
			MarcoWedel
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		can you post some more examples?
Is it always the first part of the string you're interested in?
regards
Marco
 
					
				
		
 cbaqir
		
			cbaqir
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		It's not always the first part of the string. Here's an example where there is an extra space in the middle:
| MBS - Modified Barium Swallow Study | 
| MBS - Modified Barium Swallow Study | 
Here's another example not at the end:
| Transphenoidal (Pituitary) Surgery Postoperative | 
| Transsphenoidal (Pituitary) Surgery Postoperative | 
Example < vs less than:
| POC Risk for Nutritional Imbalance: < Requirements (Newborn) | 
| POC Risk for Nutritional Imbalance: < Requirements (NICU) | 
| POC Risk for Nutritional Imbalance: less than Requirements (Newborn) | 
| POC Risk for Nutritional Imbalance: less than Requirements (NICU) | 
 
					
				
		
 cbaqir
		
			cbaqir
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		I've attached sample data. I ultimately want to show a list that shows matches between PLAN_DESC at like a 75% match.
 
					
				
		
 MarcoWedel
		
			MarcoWedel
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		Hi,
maybe one solution might be:
tabPlanTemp:
LOAD Distinct
PLAN_DESC
FROM [https://community.qlik.com/servlet/JiveServlet/download/1287827-283097/POWER%20PLAN%20COUNTS.xlsx] (ooxml, embedded labels, table is [PLAN COUNT EXAMPLE])
Where Len(Trim(PLAN_DESC));
Join
LOAD PLAN_DESC as PLAN_DESC2
Resident tabPlanTemp;
tabPlan:
LOAD PLAN_DESC,
PLAN_DESC2,
Num(1-Levenshtein(PLAN_DESC,PLAN_DESC2)/RangeMax(Len(PLAN_DESC),Len(PLAN_DESC2)),'0.0%') as Similarity
Resident tabPlanTemp
Where not PLAN_DESC follows PLAN_DESC2;
DROP Table tabPlanTemp;
Another example:
hope this helps
regards
Marco
 
					
				
		
 cbaqir
		
			cbaqir
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		 
					
				
		
 MarcoWedel
		
			MarcoWedel
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		you're welcome
regards
Marco
 
					
				
		
 cbaqir
		
			cbaqir
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.
 
					
				
		
 cbaqir
		
			cbaqir
		
		
		
		
		
		
		
		
	
			
		
		
			
					
		The load and drop of the temp table takes a very long time. Is there a way I can use a QVD to speed up the process? I've never tried a QVD on top of a temp table.
