19 Replies Latest reply: May 9, 2014 6:39 AM by Dariusz Mielczarek RSS

    Levenshtein Algorithm

      Dear All,

       

      Does anyone try to apply Levenshtein algorithm in QlikView?

       

      Your inputs are highly appreciated!!

       

      Regards,

       

      Omer

        • Re: Levenshtein Algorithm
          Anand Chouhan

          I am not hear about that can you explain more about the Levenshtein algorithm.

          • Re: Levenshtein Algorithm
            Advait Thakur

            Hi Omer,

            This is indeed but just to let you know Levenshtein is not implemented yet in QlikView. Few months before I was wondering for a matrix which could be utilize to measure the period or distance between two things (Months, Days, Years or anything). I believe the most common way to calculate this is by dynamic approach, may be VB script, C# or Java program but this is not the feature of QlikView. I tried and played with variables but did not succeeded.

             

            As of now there is no functions which fulfill levenshtein algorithm in QlikView till version SR6.

            I marked this as a priority, as the need for such analysis is present in market.

             

            I hope this is informative and hence request to mark this so other community users can get instant results of the post/questions.

             

            Regards

            Advait

              • Re: Re: Levenshtein Algorithm
                Dariusz Mielczarek

                Hello,

                Here is result of my attempts to implement Levenshtein Algorithm

                I want to think about how to change the formulas to improve presentation:

                - by transfer it to pivot ..
                - by using Aggr ()


                I think that there should be also possible to create some extension for it...



                Regards
                Darek

                  • Re: Re: Re: Levenshtein Algorithm
                    Dariusz Mielczarek

                    Here is version 3.

                    User may at the application level input string to search.

                     

                    regards

                    Darek

                        • Re: Levenshtein Algorithm
                          Dariusz Mielczarek

                           

                          29-04-2014 17:28 użytkownik "Marco Wedel" <qcwebmaster@qlik.com>

                          napisał:

                           

                                Qlik Community <http://community.qlik.com/>   Levenshtein Algorithm

                           

                          reply from Marco Wedel<http://community.qlik.com/people/MarcoWedel?et=watches.email.thread>in

                          Scripting - View the full discussion<http://community.qlik.com/message/518811?et=watches.email.thread#518811>

                           

                            • Re: Levenshtein Algorithm

                              Now I am testing it with Marco's approach with 140M rows.

                               

                              Then i will try Dariusz's approach as well.

                               

                              I will update you about the results guys.

                               

                              Thanks,

                               

                              Omer

                                • Re: Levenshtein Algorithm
                                  Dariusz Mielczarek

                                  Omer,

                                   

                                  Please, let me know what are your user expectations? Do they need to input searched words on the fly, or maybe you know everything at the script level?

                                   

                                  I can imagine, that your goal may be "only" to analyse some text data and counting similiar words from this text. If we have this case, you should make calculations during reload or even outside qlikview... Application should be fast even with 140 mln rows.

                                   

                                  But if user wants to input some string and then search it in loaded data, then rather some on the fly calculations are needed. I think, that my approach will be to hevy for 140 mln words. I assumed, that i have row for each letter to have faster response. But with this data amount it will be rather difficult to load it this way and i think, that there will be necessary to load words and divide it to letters on the fly.

                                   

                                  Anyway, i think, that there are possible some improvements. For example we can try to find some candidates first (using simplier comaprissions with set analysis) and then calculate Levenshtein distance only for them.

                                   

                                  Just for fun i am still trying to "translate" Levenshtein distance calculation to multidimensional algorithm.

                                   

                                  Please, let me know what are your user expectations? Do they need to input searched words on the fly, or maybe you know everything at the script level?

                                   

                                  Please have a look also on possible QlikView with "R" integration. Maybe it is some direction. Let me know...

                                   

                                  regards

                                   

                                  Darek

                                    • Re: Levenshtein Algorithm

                                      Darek,

                                       

                                      You are right, i just want to analyse the some text data and counting similar words. Try to find out miss-typos some more analysis.

                                       

                                      I test this case at SAS with a specific function (complev() in SAS) an it takes really long time and it asks for huge disk space as well. So, instead to calculating it row by row, i transform the data to pivot table and then use array like loop to find out. It takes only a couple minutes in SAS with this approach.

                                       

                                      Then, i started think about how can I apply this in QlikView to use the power of in-memory. By using macro, it takes app. 20 hours to reload. Hence, it is not very feasible.

                                       

                                      So, right now i try to find out how i can integrate SAS with QlikView. If you have any better ideas, i really appreciate hear it.

                                       

                                      Thanks,

                                       

                                      Omer

                                        • Re: Levenshtein Algorithm
                                          Dariusz Mielczarek

                                          Omer,

                                           

                                          i'am still working how to calculate distance on the fly. I'am close to solution which, i hope, should be effective.

                                           

                                          I also tested approach like in ver3, i attached before, but with additional macro, which, using dynamic update or inputfields applies only final results (values from last row of two words combination) calculated in one straightable into memory and then results may be presented by another table (without rows we dont want to see).

                                          In my case, where user could input search string on the fly and i loaded about 1000 words, response time was not bad.

                                           

                                          I also found interesting transformation of this algorithm, which gives me great hope to have fast solution inside qlikview even without macro (i dont like macro's) and with minimal usage of aggr() (aggr likes memory and may be slow) . But i am still working on it.

                                           

                                          You told before about 140M rows.

                                          So, how many Levenshtein distances you would like to check? 140Mx140M?

                                           

                                           

                                          About your SAS and fast calculation in pivot - i think it is similiar like my v3 in QV....

                                          Have you checked how much time my approach needs with your data volumes?

                                           

                                           

                                           

                                          regards

                                          Darek

                                            • Re: Levenshtein Algorithm

                                              Darek,

                                               

                                              Thanks a lot I could not check the v3 sample. I try to check that one asap.

                                               

                                              We are on the same page as well. I do not want to use macro's and try to minimal the usage of aggr().

                                               

                                              In Levenshtein distance, my cut off point will 3 or 4. After 4, it becomes meaningless. After the Cartesian product it becomes 140 M . Initially, it is 12K records.


                                              I will update you about the outcomes of further tests.


                                              Regards,


                                              Omer

                                                • Re: Re: Levenshtein Algorithm
                                                  Dariusz Mielczarek

                                                  Omer,

                                                   

                                                  i still dont know what users will do with those data, but

                                                  if you dont need application which will calculate distance for words inputed by user vie UI, it will be better to calculate it during reload or outside.

                                                  Maybe try to use some fast condition to check if two words are candidates to meet cutpoint and then calculate distance only for those candidates, like this (using function proposed by Marco):

                                                   

                                                  wset1:

                                                  LOAD * INLINE [

                                                      w1

                                                      aba

                                                      Qlik

                                                      Qlik

                                                      QlikView

                                                      darek mielczarek

                                                      Business Discovery

                                                      levenshtein

                                                          ola

                                                      milan

                                                      poland

                                                      polska

                                                      abc

                                                     

                                                  ];

                                                   

                                                  join

                                                  load w1 as w2 Resident wset1;

                                                   

                                                  wset:

                                                  load rowno() as id,

                                                  w1 as word1, w2 as word2,

                                                  if(w1=w2,'the same', if(KeepChar(w1,w2)>len(w1)-4 and KeepChar(w2,w1)>len(w2)-4 and fabs(len(w1)-len(w2))<=4,'candidate','non candidate')) as class

                                                  Resident wset1;

                                                   

                                                  DROP Table wset1;

                                                   

                                                   

                                                  load id,

                                                  //here: distance expression (word1,word2)

                                                  levenshtein(word1,word2) as LevDist

                                                  Resident wset

                                                  where

                                                  class = 'candidate';

                                                   

                                                   

                                                  regards

                                                  Darek

                                                  • Re: Re: Levenshtein Algorithm
                                                    Dariusz Mielczarek

                                                    In my oppinion disance calculation on the fly has sense only if it is search for string user inputed in UI.

                                                    In this case you can:

                                                    1.

                                                    use hidden straight table, which calculates distance with above() and above( total), with expression "expr" like this:

                                                    if(nbr=0,nbr1,if(nbr1=0,nbr,rangemin(Above(expr,1)+1,Above(total expr,word_len1+1)+1 ,Above(total expr,word_len1+2)+if(letter=letter1,0,1))))

                                                     

                                                    and macro which will read results from this table (last row for each word) and input it into another table (via dynamic update or inputsum), and those results will be presented.

                                                     

                                                    2.

                                                    use modified "parallel" algorithm:

                                                    there are the same letters ("0") and cost between two of thems it is value like:

                                                    rangemin(x2-x1,y2-y1)+fabs(x2-x1-(y2-y1))-1,

                                                     

                                                    where x2>x1 and y2>y1,

                                                    x1=0...len(word1)

                                                    y1=0...len(word2)

                                                    x2=1 ... len(word1)+1

                                                    y2=1 ... len(word2)+1

                                                     

                                                    So, if we have:

                                                    word1=comp

                                                    word2=ccwp

                                                     

                                                    we will have matrix:

                                                    1,1;1;2;4,4

                                                     

                                                    we can have 3*2=6 pathes in this matrix (we have 2 elements on 1'st column and 1 element on 2'nd column, so number of pathes is (2+1)*(1+1)=6)

                                                     

                                                    we will duplicate it and add 0,0 as start piont to first instance and 5,5 as end point to 2'nd instance.

                                                     

                                                    so, wi will have 2 matrixes:

                                                    1)

                                                    0,0;1,1;1,2;4,4

                                                    2)

                                                    1,1;1,2;4,4;5,5

                                                     

                                                    combintions (possible pathes), where x means that we skip point from this column:

                                                    path1:

                                                    0,0;x;x;5,5

                                                    path2:

                                                    0,0;1,1;x;5,5

                                                    path3:

                                                    0,0;1,1;4,4;5,5

                                                    path4:

                                                    0,0;1,2;x;5,5

                                                    path5:

                                                    0,0;1,2;4,4;5,5

                                                    path6:

                                                    0,0;x;4,4;5,5

                                                     

                                                    for path1 we have 1 pair:

                                                    0,0 and 5,5, so result is rangemin(5-0,5-0)+fabs(5-0-(5-0))-1=4 (this is distance when we have no mathing letter)

                                                     

                                                     

                                                    for path2 we have 2 pairs to calculate:

                                                    0,0 -> 1,1

                                                    1,1 ->5,5

                                                     

                                                    so, our result is:

                                                    rangemin(1-0,1-0)+fabs(1-0-(1-0))-1 +

                                                    rangemin(5-1,5-1)+fabs(5-1-(5-1))-1 = 0+3=4

                                                     

                                                    for path3 we have 3 pairs:

                                                     

                                                    0,0->1,1

                                                    1,1->4,4

                                                    4,4->5,5, so:

                                                    rangemin(1-0,1-0)+fabs(1-0-(1-0))-1 +

                                                    rangemin(4-1,4-1)+fabs(4-1-(4-1))-1 +

                                                    rangemin(5-4,5-4)+fabs(5-4-(5-4))-1 =0+2+0=2

                                                     

                                                     

                                                    for path4 we have 2 pairs:

                                                     

                                                    0,0->1,2

                                                    1,2->5,5

                                                    rangemin(1-0,2-0)+fabs(1-0-(2-0))-1 +

                                                    rangemin(5-1,5-2)+fabs(5-1-(5-2))-1 = 1+1-1  +  3+1-1=4

                                                     

                                                    for path5 we have 3 pairs:

                                                    0,0->1,2

                                                    1,2->4,4

                                                    4,4->5,5, so

                                                    rangemin(1-0,2-0)+fabs(1-0-(2-0))-1 +

                                                    rangemin(4-1,4-2)+fabs(4-1-(4-2))-1 +

                                                    rangemin(5-4,5-4)+fabs(5-4-(5-4))-1 =1+   2+1-1   +0=3

                                                     

                                                    for path6 we have 2 pairs:

                                                    0,0->4,4

                                                    4,4->5,5

                                                    rangemin(4-0,4-0)+fabs(4-0-(4-0))-1

                                                    rangemin(5-4,5-4)+fabs(5-4-(5-4))-1=3+0=3

                                                     

                                                    so, our winner path is path3 and result is 2

                                                     

                                                     

                                                    It is possible having pairs and knowing number of combinations to know on which path this pais should be calculated.

                                                     

                                                    This parallel version of algorihm i am trying to implement. But it is not easy...

                                                     

                                                     

                                                    regards

                                                    Darek


                                        • Re: Levenshtein Algorithm
                                          Dariusz Mielczarek

                                          Marco,

                                          I have one more idea and hope, that i find tonight even better way to

                                          implement it ...

                                          29-04-2014 17:28 użytkownik "Marco Wedel" <qcwebmaster@qlik.com>

                                          napisał:

                                           

                                                Qlik Community <http://community.qlik.com/>   Levenshtein Algorithm

                                           

                                          reply from Marco Wedel<http://community.qlik.com/people/MarcoWedel?et=watches.email.thread>in

                                          Scripting - View the full discussion<http://community.qlik.com/message/518811?et=watches.email.thread#518811>

                                           

                                  • Re: Levenshtein Algorithm
                                    Stefan Wühl

                                    Have a look at the second page of the second linked thread in my answer here:

                                    String Comparison

                                    • Re: Levenshtein Algorithm
                                      Marco Wedel

                                      Hi Omer,

                                       

                                      although it might not be the fastest solution (in terms of execution time), this application might help:

                                       

                                      QlikCommunity_Thread_115800_Pic1.JPG.jpg

                                       

                                      QlikCommunity_Thread_115800_Pic2.JPG.jpg

                                       

                                      tabLevenshteinDistances:
                                      LOAD
                                        *,
                                        levenshtein(A,B) as LevDistAB
                                      Inline [
                                      A,B
                                      kitten,sitting
                                      saturday,sunday
                                      book,back
                                      table,able
                                      iPod,iPad
                                      cellar,door
                                      qlik,click
                                      ];
                                      
                                      
                                      
                                      
                                      
                                      
                                      

                                       

                                      It's just a copy/paste-solution using a VBScript function found at wikibooks:

                                      http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#VBScript

                                       

                                      The reload execution time for a 1.000.000 rows xlsx table raised from 45sec to 2min 58sec when adding the levenshtein distance calculation.

                                       

                                      hope it might be of any use though

                                       

                                      regards

                                       

                                      Marco

                                      • Re: Levenshtein Algorithm

                                        Hi guys,

                                         

                                        Thanks a lot for the help.

                                         

                                        It really help me.

                                         

                                        Regards,

                                         

                                        Omer