Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Levenshtein Distance During Load

I have a question regarding data manipulations and calculations during the load script.  I read some other discussions on the forum that talk about Levenshtein distance, which I'd like to use during my load.  Here is what I'm trying to do - You may want to just read this description and refrain from looking at my script as I'm fairly certain it's far off what I actually want to do.

I have a list of invoices: (It will be much longer than this, I'm just starting simple)

LOAD * INLINE [

  Invoice

  1228357-0

  1228382-0

  1228356-0

  1228355-0

];

I want to get the average levenshtein distance of each invoice number from all the other invoice numbers.  The entries with the lowest average numbers will be the most similar to each other.  The problem I'm having is that the levenshtein function that I'm using takes two arguments and I can't figure out how to write my script to do what I want:

' Source:

' http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#VBScript

Function levenshtein( a, b )

  Dim i,j,cost,d,min1,min2,min3

' Avoid calculations where there there are empty words

  If Len( a ) = 0 Then levenshtein = Len( b 😞 Exit Function

  If Len( b ) = 0 Then levenshtein = Len( a 😞 Exit Function

' Array initialization

  ReDim d( Len( a ), Len( b ) )

  For i = 0 To Len( a 😞 d( i, 0 ) = i: Next

  For j = 0 To Len( b 😞 d( 0, j ) = j: Next

' Actual calculation

  For i = 1 To Len( a )

  For j = 1 To Len( b )

                        If Mid(a, i, 1) = Mid(b, j, 1) Then cost = 0 Else cost = 1 End If

  ' Since min() function is not a part of VBScript, we'll "emulate" it below

  min1 = ( d( i - 1, j ) + 1 )

  min2 = ( d( i, j - 1 ) + 1 )

  min3 = ( d( i - 1, j - 1 ) + cost )

  If min1 <= min2 And min1 <= min3 Then

  d( i, j ) = min1

  ElseIf min2 <= min1 And min2 <= min3 Then

  d( i, j ) = min2

  Else

  d( i, j ) = min3

  End If

  Next

  Next

  levenshtein = d( Len( a ), Len( b ) )

End Function

This is my most recent attempt which gets me the distance of each of the invoices from the first invoice. It's a long way off from what I'm trying to accomplish:

LOAD

  RowNo() as Number,

  Invoice,

  levenshtein(Invoice, Peek(Invoice, 0)) as Distance,

  Peek(Invoice, 0) as Invoice2;

LOAD * INLINE [

  Invoice

  1228357-0

  1228382-0

  1228356-0

  1228355-0

];

Any help to point me in the right direction would be appreciated.

Thanks in advance,

Steve

1 Solution

Accepted Solutions
swuehl
MVP
MVP

You want to compare each invoice to each other. You can do this using

INVOICES:

LOAD * INLINE [ 

Invoice 

1228357-0  

  1228382-0 

  1228356-0 

  1228355-0 

]; 


JOIN (INVOICES)

LOAD Invoice AS InvoiceToCompare

RESIDENT INVOICES;


LEVENTSHTEIN:

LOAD

     Invoice,

     InvoiceToCompare,

     levenshtein(Invoice, InvoiceToCompare) as Distance

RESIDENT INVOICES

Where Invoice <> InvoiceToCompare;


DROP TABLE INVOICES;


edit:

The JOIN might create a huge table (remember, it's the combination of all invoices with all invoices).

I also filtered the identical invoices that the JOIN produced.

View solution in original post

1 Reply
swuehl
MVP
MVP

You want to compare each invoice to each other. You can do this using

INVOICES:

LOAD * INLINE [ 

Invoice 

1228357-0  

  1228382-0 

  1228356-0 

  1228355-0 

]; 


JOIN (INVOICES)

LOAD Invoice AS InvoiceToCompare

RESIDENT INVOICES;


LEVENTSHTEIN:

LOAD

     Invoice,

     InvoiceToCompare,

     levenshtein(Invoice, InvoiceToCompare) as Distance

RESIDENT INVOICES

Where Invoice <> InvoiceToCompare;


DROP TABLE INVOICES;


edit:

The JOIN might create a huge table (remember, it's the combination of all invoices with all invoices).

I also filtered the identical invoices that the JOIN produced.