
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Approximate String Matching in QlikView
Given a dataset that includes address data (e.g. address lines 1, 2, 3, city, state, zip), I'm trying to find a way in QlikView to group identical or *very similar* addresses. One requirement might be to show a list of customers that seem to be sharing the same address.
It's easy enough in QlikView to isolate exact duplicates, but I need something a bit more fuzzy. One address might have "Street" spelled out, another just "St."; one address might have the ZIP code plus four, another just the five digit ZIP. You get the idea. I don't see an out-of-the-box function in QlikView that will accomplish this. (soundex() is good for single names but not this.)
For starters, I think the entire address will have to be concatenated into a single string -- and maybe all punctuation stripped. Already, we've lost something if, say, address lines 1 and 2 are reversible. I realize no solution will be perfect, but I want something as robust as possible.
I found this intriguing -- it's a clever algorithm for ranking approximate string matches:
http://www.catalysoft.com/articles/StrikeAMatch.html. This is the kind of functinality I need. I know I can add VBScript functions into a module in QlikView, but this function operates on two strings, and I don't see a way to work it in.
I'm looking for ideas. I'm using QlikView 11, and I'm relatively new to the tool. Thanks.


- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can create a script that mimics the proposed algorithm without using VBScript, just QV functions, maybe something like
INPUT:
LOAD RecNo() as ID, iterno() as PAIRID, INPUT, UPPER(mid(INPUT,iterno(),2)) as PAIR INLINE [
INPUT
France
French
Frater
] while iterno() <= len(INPUT)-1;
JOIN LOAD ID as ID2, PAIRID as PAIRID2, PAIR as PAIR2,INPUT as INPUT2 Resident INPUT;
SIMILARITY:
LOAD ID, ID2, only(INPUT) as INPUT, only(INPUT2) as INPUT2 ,
2*(count(if(PAIR = PAIR2, PAIR))) / (count(DISTINCT PAIRID)+count(DISTINCT PAIRID2)) as Similarity
Resident INPUT where ID<ID2 group by ID,ID2;
drop table INPUT;
But this will create quite a huge temporary table when applied to several, probably large INPUT values.
You might want to look into other tools that are specialized on data cleansing / quality, QV is just not build for that.
Another idea might be to push all your addresses to a service that generates geo coordinates out of the input, I vaguely remember some threads that discussed that, then use a similar method as above to calculate distances from pairs of coordinates:
How do I use latitude and longitude values to find near by records

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This might be of interest to you: QlikView Application: TIQViewQVD data profiling
And maybe this app that includes a vbscript metaphone implementation.
talk is cheap, supply exceeds demand

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Interesting responses. swuehl, thanks for taking the time. I'm looking more closely at what you put together.
GW, I'm interested in your link, but don't seem to have access/security to view it. Can you make it available?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ah, yes. The TIQViewQVD app is in the Data Quality group. You can request to join this group. Once your request has been granted by the group owner you will be able to access the document.
talk is cheap, supply exceeds demand
