Skip to main content
Announcements
Have questions about Qlik Connect? Join us live on April 10th, at 11 AM ET: SIGN UP NOW
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Approximate String Matching in QlikView

Given a dataset that includes address data (e.g. address lines 1, 2, 3, city, state, zip), I'm trying to find a way in QlikView to group identical or *very similar* addresses. One requirement might be to show a list of customers that seem to be sharing the same address.

It's easy enough in QlikView to isolate exact duplicates, but I need something a bit more fuzzy. One address might have "Street" spelled out, another just "St."; one address might have the ZIP code plus four, another just the five digit ZIP. You get the idea. I don't see an out-of-the-box function in QlikView that will accomplish this. (soundex() is good for single names but not this.)

For starters, I think the entire address will have to be concatenated into a single string -- and maybe all punctuation stripped. Already, we've lost something if, say, address lines 1 and 2 are reversible. I realize no solution will be perfect, but I want something as robust as possible.

I found this intriguing -- it's a clever algorithm for ranking approximate string matches:

http://www.catalysoft.com/articles/StrikeAMatch.html. This is the kind of functinality I need. I know I can add VBScript functions into a module in QlikView, but this function operates on two strings, and I don't see a way to work it in.

I'm looking for ideas. I'm using QlikView 11, and I'm relatively new to the tool. Thanks.

4 Replies
swuehl
MVP
MVP

You can create a script that mimics the proposed algorithm without using VBScript, just QV functions, maybe something like

INPUT:

LOAD RecNo() as ID, iterno() as PAIRID, INPUT, UPPER(mid(INPUT,iterno(),2)) as PAIR INLINE [

INPUT

France

French

Frater

] while iterno() <= len(INPUT)-1;

JOIN LOAD ID as ID2, PAIRID as PAIRID2, PAIR as PAIR2,INPUT as INPUT2  Resident INPUT;

SIMILARITY:

LOAD ID, ID2, only(INPUT) as INPUT, only(INPUT2) as INPUT2 ,

2*(count(if(PAIR = PAIR2, PAIR))) / (count(DISTINCT PAIRID)+count(DISTINCT PAIRID2)) as Similarity

Resident INPUT where ID<ID2 group by ID,ID2;

drop table INPUT;

But this will create quite a huge temporary table when applied to several, probably large INPUT values.

You might want to look into other tools that are specialized on data cleansing / quality, QV is just not build for that.

Another idea might be to push all your addresses to a service that generates geo coordinates out of the input, I vaguely remember some threads that discussed that, then use a similar method as above to calculate distances from pairs of coordinates:

How do I use latitude and longitude values to find near by records

Step by step of how to Google Geocode my addresses?

Gysbert_Wassenaar

This might be of interest to you: QlikView Application: TIQViewQVD data profiling

And maybe this app that includes a vbscript metaphone implementation.


talk is cheap, supply exceeds demand
Not applicable
Author

Interesting responses. swuehl, thanks for taking the time. I'm looking more closely at what you put together.

GW, I'm interested in your link, but don't seem to have access/security to view it. Can you make it available?

Gysbert_Wassenaar

Ah, yes. The TIQViewQVD app is in the Data Quality group. You can request to join this group. Once your request has been granted by the group owner you will be able to access the document.


talk is cheap, supply exceeds demand