Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
therealdees
Creator III
Creator III

Better approach using LevenshteinDist?

Hi!

I'm building something to validate address input strings by comparing the input value to the correct value. I'm using LevenshteinDist and everything works as expected, except I'm trying to think something more efficient that won't use many nested Ifs to validate.

Basically, I have different type of errors/flags that will be added according to the difference between the two strings: ERROR, FORMATTING, DIVERGENT, UNKNOWN.

For now, I'm using LevenshteinDist(), WildMatch(), Len(), Right() and Left() to partially compare the strings. If LevenDist = 0 then the string is perfectly written, else, I'm comparing the strings differently, according to the LevenDist value, for example:

 

If LevenDist = 1

and Len(string1) = Len(string2)

and WildMatch(string1, Left(string2, Len(string2) - 2)'*') > 0

or LevenDist = 1

and Len(string1) = Len(string2)

and WildMatch(string1, Right(string2, '*'Len(string2) - 2)) > 0,

'FORMATTING'

 

If LevenDist >= 1

and Len(string1) - Len(string2) >= -2
and Len(string1) - Len(string2) <= 2

and WildMatch(string1, Left(string2, Len(string2) - 2)'*')

or LevenDist >= 1

and Len(string1) - Len(string2) >= -2
and Len(string1) - Len(string2) <= 2

and WildMatch(string1, '*'Right(string2, Len(string2) - 2)),

'ERROR'

 

The other flags are more specific and therefore easier to identify. These are the ones that will give me the expected output in most cases, except for some cases depending on what letter index is different. What I do is compare, for example, 8 chars of a 10 char string, either from left or right to see if it wildmatches. If so, then the string is very similar and therefore it's a error or a 'formatting' error. The problem is the wrong letter could be anywhere and using Left and Right might not work as intended. I know I could make many condition and use Mid() as well, but there must be something simpler to do this. Please enlighten me!

I'm not attached at all to the approach. This is what I could think of and it works ok, but I'm looking for suggestions to either improve this logic or do something much better dynamically.

 

In other words, I need to compare 2 strings and flag it as 'ERROR' or 'FORMATTING ERROR' only if the strings are indeed similar:

string1 = SENADORPOMPEU

string 2 = SERNADORPOMPEU

Result = ERROR, because the len doesn't match but most of the strings do

 

string1 = SENADORPOMPEU

string2 = ELIASYAZBEK

Result = Unknown, because LevenDist is high and the strings have nothing do to between them

 

string1 = ELIASYAZBEK

string2 = ELIASIAZBEK

Result = 'FORMATTING', because the strings have the same Len() but there's a single different letter

 

Labels (2)
2 Solutions

Accepted Solutions
rwunderlich
Partner Ambassador/MVP
Partner Ambassador/MVP

 I wonder if you could use the LevenshteinDist value more granularly. For example consider a value of 2 or 3 to be a match if the lengths are the same. And a value of 2 or less to be a match when the lengths are different.

You could adjust your thresholds based  on the string length. For example, if the length is more than 10, 2 would be a match otherwise 1 would be a match.

Using the dist value would solve your problem of detecting unlike characters in the middle of the string as well as the ends.

-Rob

View solution in original post

marcus_sommer

I think you could simplify your approach by applying the essential checks in beforehand and creating appropriate flags, like:

Len(string1) - Len(string2) as LenDiff,

sign(Len(string1) - Len(string2)) as LenFlag

and similar with the wildmatch() checks.

This alone won't remove the need of n if-loops but in a next step you may combine these flags with the levenshstein-result maybe in an approach like:

pick(match(floor(levenshstein, 0.1) * 1000 + LenFlag, 95, 96, ..., 103, 104, ...), 'x', 'x', 'x', 'y', 'y', ....)

which is targeted to create exact values. If the number of possible matching- and return-values grows a bit higher these data might be loaded externally from an Excel and per concat() assigned to variables or maybe just used as mapping-table. 

View solution in original post

9 Replies
Peter_Adams
Employee
Employee


@therealdees wrote:

Hi!

I'm building something to validate address input strings by comparing the input value to the correct value. I'm using LevenshteinDist and everything works as expected, except I'm trying to think something more efficient that won't use many nested Ifs to validate.

Basically, I have different type of errors/flags that will be added according to the difference between the two strings: ERROR, FORMATTING, DIVERGENT, UNKNOWN.

For now, I'm using LevenshteinDist(), WildMatch(), Len(), Right() and Left() to partially compare the strings. If LevenDist = 0 then the string is perfectly written, else, I'm comparing the strings differently, according to the LevenDist value, for example:

 

If LevenDist = 1

and Len(string1) = Len(string2)

and WildMatch(string1, Left(string2, Len(string2) - 2)'*') > 0

or LevenDist = 1

and Len(string1) = Len(string2)

and WildMatch(string1, Right(string2, '*'Len(string2) - 2)) > 0,

'FORMATTING'

 

If LevenDist >= 1

and Len(string1) - Len(string2) >= -2
and Len(string1) - Len(string2) <= 2

and WildMatch(string1, Left(string2, Len(string2) - 2)'*')

or LevenDist >= 1

and Len(string1) - Len(string2) >= -2
and Len(string1) - Len(string2) <= 2

and WildMatch(string1, '*'Right(string2, Len(string2) - 2)),

'ERROR'

 

The other flags are more specific and therefore easier to identify. These are the ones that will give me the expected output in most cases, except for some cases depending on what letter index is different. What I do is compare, for example, 8 chars of a 10 char string, either from left or right to see if it wildmatches. If so, then the string is very similar and therefore it's a error or a 'formatting' error. The problem is the wrong letter could be anywhere and using Left and Right might not work as intended. I know I could make many condition and use Mid() as well, but there must be something simpler to do this. Please enlighten me!

I'm not attached at all to the approach. This is what I could think of and it works ok, but I'm looking for suggestions to either improve this logic or do something much better dynamically.

 

In other words, I need to compare 2 strings and flag it as 'ERROR' or 'FORMATTING ERROR' only if the strings are indeed similar:

string1 = SENADORPOMPEU

string 2 = SERNADORPOMPEU

Result = ERROR, because the len doesn't match but most of the strings do

 

string1 = SENADORPOMPEU

string2 = ELIASYAZBEK

Result = Unknown, because LevenDist is high and the strings have nothing do to between them

 

string1 = ELIASYAZBEK

string2 = ELIASIAZBEK

Result = 'FORMATTING', because the strings have the same Len() but there's a single different letter

 


If this is client-managed Qlik, you could take a look at using the Reg-Ex web connector.  I have not tried this but I know RegEx has powerful pattern matching capabilities.  It would require the QWC installation.    https://help.qlik.com/en-US/connectors/Subsystems/Web_Connectors_help/Content/Connectors_QWC/Data-So...

therealdees
Creator III
Creator III
Author

Hi Peter, thanks for the reply!!

 

Unfortunately I'm using Qlik SaaS. Do you know if there's any similar solution for SaaS?

Peter_Adams
Employee
Employee

Sorry, I don't see a similar connector type in Qlik Sense SAAS.

rwunderlich
Partner Ambassador/MVP
Partner Ambassador/MVP

 I wonder if you could use the LevenshteinDist value more granularly. For example consider a value of 2 or 3 to be a match if the lengths are the same. And a value of 2 or less to be a match when the lengths are different.

You could adjust your thresholds based  on the string length. For example, if the length is more than 10, 2 would be a match otherwise 1 would be a match.

Using the dist value would solve your problem of detecting unlike characters in the middle of the string as well as the ends.

-Rob

therealdees
Creator III
Creator III
Author

Hi, Rob, thanks for the reply

It's kinda obvious now when hearing it from outside, hehe. I'll try it as soon as I'm back.

 

Thanks!!!

marcus_sommer

I think you could simplify your approach by applying the essential checks in beforehand and creating appropriate flags, like:

Len(string1) - Len(string2) as LenDiff,

sign(Len(string1) - Len(string2)) as LenFlag

and similar with the wildmatch() checks.

This alone won't remove the need of n if-loops but in a next step you may combine these flags with the levenshstein-result maybe in an approach like:

pick(match(floor(levenshstein, 0.1) * 1000 + LenFlag, 95, 96, ..., 103, 104, ...), 'x', 'x', 'x', 'y', 'y', ....)

which is targeted to create exact values. If the number of possible matching- and return-values grows a bit higher these data might be loaded externally from an Excel and per concat() assigned to variables or maybe just used as mapping-table. 

therealdees
Creator III
Creator III
Author

Hi, Marcus

Thanks for the reply!

I kinda understood the logic behind your suggestion, but I'm a bit confused about the expression. Why set the floor step to 0.1 and why multiple it by 1000? And why sum it to the LenFlag after? Could you provide me more details?

 

PS: I just got back from 'vacation', sorry for the delay

marcus_sommer

Sorry for the confusion with floor(). I thought that the levenshtein-function would return a float-result and the rounding was aimed to restrict the results to a defined range of values. I just took a look in the help and the function results in integer - so you could just ignore this part.

The (n) multiplication and (n) additions have the purpose to avoid an overlapping between the various parts - to ensure that each combination of the flags return an unique value. Quite often is this kind of measurement done per string-concatenation of the parameter (field/variable-values or expression-results) like:

F1 & '|' & F2 & '|' & F3 as Key_F123

But if these values are numbers or could be replaced by numbers and don't have too many digits it would be possible to create unique (key) values with mathematical operations. Direct benefits would be a lesser RAM consumption and faster processing as by using strings whereby it would depend on the size and kind of the data-set if it would be significantly or rather negligible. Beside this it could be easier with numeric values to create the targeted mapping-table, for example by applying any loops or cutting any upper/lower outliers with range-functions.

To make it short - the aim is to create a mapping-table with n dedicated values which is then queried per applymap() which avoids the nested if-stuff.

therealdees
Creator III
Creator III
Author

This is really interesting , specially about using numbers instead and saving processing. I'm already having trouble performing plenty other similar tasks. Actually the whole script is about flagging different input fields. I'll definitely look into it and try to implement it as a map.

In the meanwhile I followed @rwunderlich pointing and he's right about me complicating things, I don't need left and right, just Len and LevenDist. I fixed the strings I had isolated for testing purposes and I'm loading a big sample now, but I'm pretty sure there will be special cases that I'll have to deal with, specially short strings, like 3 to 5 length. For those I guess mapping the flag using the levendist + len would do the job.

Thank you both.