
Re: Levenshtein Algorithm
Anand Chouhan Apr 28, 2014 9:42 AM (in response to Omer Citak )I am not hear about that can you explain more about the Levenshtein algorithm.

Re: Levenshtein Algorithm
Advait Thakur Apr 28, 2014 3:41 PM (in response to Omer Citak )Hi Omer,
This is indeed but just to let you know Levenshtein is not implemented yet in QlikView. Few months before I was wondering for a matrix which could be utilize to measure the period or distance between two things (Months, Days, Years or anything). I believe the most common way to calculate this is by dynamic approach, may be VB script, C# or Java program but this is not the feature of QlikView. I tried and played with variables but did not succeeded.
As of now there is no functions which fulfill levenshtein algorithm in QlikView till version SR6.
I marked this as a priority, as the need for such analysis is present in market.
I hope this is informative and hence request to mark this so other community users can get instant results of the post/questions.
Regards
Advait

Re: Re: Levenshtein Algorithm
Dariusz Mielczarek Apr 29, 2014 5:52 AM (in response to Advait Thakur)Hello,
Here is result of my attempts to implement Levenshtein Algorithm
I want to think about how to change the formulas to improve presentation:
 by transfer it to pivot ..
 by using Aggr ()I think that there should be also possible to create some extension for it...
Regards
Darek
Re: Re: Re: Levenshtein Algorithm
Dariusz Mielczarek Apr 29, 2014 8:17 AM (in response to Dariusz Mielczarek)Here is version 3.
User may at the application level input string to search.
regards
Darek


Re: Levenshtein Algorithm
Dariusz Mielczarek Apr 29, 2014 12:12 PM (in response to Marco Wedel )29042014 17:28 użytkownik "Marco Wedel" <qcwebmaster@qlik.com>
napisał:
Qlik Community <http://community.qlik.com/> Levenshtein Algorithm
reply from Marco Wedel<http://community.qlik.com/people/MarcoWedel?et=watches.email.thread>in
Scripting  View the full discussion<http://community.qlik.com/message/518811?et=watches.email.thread#518811>

Re: Levenshtein Algorithm
Omer Citak Apr 30, 2014 4:15 AM (in response to Dariusz Mielczarek)Now I am testing it with Marco's approach with 140M rows.
Then i will try Dariusz's approach as well.
I will update you about the results guys.
Thanks,
Omer

Re: Levenshtein Algorithm
Dariusz Mielczarek Apr 30, 2014 4:59 AM (in response to Omer Citak )Omer,
Please, let me know what are your user expectations? Do they need to input searched words on the fly, or maybe you know everything at the script level?
I can imagine, that your goal may be "only" to analyse some text data and counting similiar words from this text. If we have this case, you should make calculations during reload or even outside qlikview... Application should be fast even with 140 mln rows.
But if user wants to input some string and then search it in loaded data, then rather some on the fly calculations are needed. I think, that my approach will be to hevy for 140 mln words. I assumed, that i have row for each letter to have faster response. But with this data amount it will be rather difficult to load it this way and i think, that there will be necessary to load words and divide it to letters on the fly.
Anyway, i think, that there are possible some improvements. For example we can try to find some candidates first (using simplier comaprissions with set analysis) and then calculate Levenshtein distance only for them.
Just for fun i am still trying to "translate" Levenshtein distance calculation to multidimensional algorithm.
Please, let me know what are your user expectations? Do they need to input searched words on the fly, or maybe you know everything at the script level?
Please have a look also on possible QlikView with "R" integration. Maybe it is some direction. Let me know...
regards
Darek

Re: Levenshtein Algorithm
Omer Citak May 8, 2014 5:12 AM (in response to Dariusz Mielczarek)Darek,
You are right, i just want to analyse the some text data and counting similar words. Try to find out misstypos some more analysis.
I test this case at SAS with a specific function (complev() in SAS) an it takes really long time and it asks for huge disk space as well. So, instead to calculating it row by row, i transform the data to pivot table and then use array like loop to find out. It takes only a couple minutes in SAS with this approach.
Then, i started think about how can I apply this in QlikView to use the power of inmemory. By using macro, it takes app. 20 hours to reload. Hence, it is not very feasible.
So, right now i try to find out how i can integrate SAS with QlikView. If you have any better ideas, i really appreciate hear it.
Thanks,
Omer

Re: Levenshtein Algorithm
Dariusz Mielczarek May 8, 2014 6:11 AM (in response to Omer Citak )Omer,
i'am still working how to calculate distance on the fly. I'am close to solution which, i hope, should be effective.
I also tested approach like in ver3, i attached before, but with additional macro, which, using dynamic update or inputfields applies only final results (values from last row of two words combination) calculated in one straightable into memory and then results may be presented by another table (without rows we dont want to see).
In my case, where user could input search string on the fly and i loaded about 1000 words, response time was not bad.
I also found interesting transformation of this algorithm, which gives me great hope to have fast solution inside qlikview even without macro (i dont like macro's) and with minimal usage of aggr() (aggr likes memory and may be slow) . But i am still working on it.
You told before about 140M rows.
So, how many Levenshtein distances you would like to check? 140Mx140M?
About your SAS and fast calculation in pivot  i think it is similiar like my v3 in QV....
Have you checked how much time my approach needs with your data volumes?
regards
Darek

Re: Levenshtein Algorithm
Omer Citak May 9, 2014 4:01 AM (in response to Dariusz Mielczarek)Darek,
Thanks a lot I could not check the v3 sample. I try to check that one asap.
We are on the same page as well. I do not want to use macro's and try to minimal the usage of aggr().
In Levenshtein distance, my cut off point will 3 or 4. After 4, it becomes meaningless. After the Cartesian product it becomes 140 M . Initially, it is 12K records.
I will update you about the outcomes of further tests.
Regards,
Omer

Re: Re: Levenshtein Algorithm
Dariusz Mielczarek May 9, 2014 5:42 AM (in response to Omer Citak )Omer,
i still dont know what users will do with those data, but
if you dont need application which will calculate distance for words inputed by user vie UI, it will be better to calculate it during reload or outside.
Maybe try to use some fast condition to check if two words are candidates to meet cutpoint and then calculate distance only for those candidates, like this (using function proposed by Marco):
wset1:
LOAD * INLINE [
w1
aba
Qlik
Qlik
QlikView
darek mielczarek
Business Discovery
levenshtein
ola
milan
poland
polska
abc
];
join
load w1 as w2 Resident wset1;
wset:
load rowno() as id,
w1 as word1, w2 as word2,
if(w1=w2,'the same', if(KeepChar(w1,w2)>len(w1)4 and KeepChar(w2,w1)>len(w2)4 and fabs(len(w1)len(w2))<=4,'candidate','non candidate')) as class
Resident wset1;
DROP Table wset1;
load id,
//here: distance expression (word1,word2)
levenshtein(word1,word2) as LevDist
Resident wset
where
class = 'candidate';
regards
Darek

Re: Re: Levenshtein Algorithm
Dariusz Mielczarek May 9, 2014 6:39 AM (in response to Omer Citak )In my oppinion disance calculation on the fly has sense only if it is search for string user inputed in UI.
In this case you can:
1.
use hidden straight table, which calculates distance with above() and above( total), with expression "expr" like this:
if(nbr=0,nbr1,if(nbr1=0,nbr,rangemin(Above(expr,1)+1,Above(total expr,word_len1+1)+1 ,Above(total expr,word_len1+2)+if(letter=letter1,0,1))))
and macro which will read results from this table (last row for each word) and input it into another table (via dynamic update or inputsum), and those results will be presented.
2.
use modified "parallel" algorithm:
there are the same letters ("0") and cost between two of thems it is value like:
rangemin(x2x1,y2y1)+fabs(x2x1(y2y1))1,
where x2>x1 and y2>y1,
x1=0...len(word1)
y1=0...len(word2)
x2=1 ... len(word1)+1
y2=1 ... len(word2)+1
So, if we have:
word1=comp
word2=ccwp
we will have matrix:
1,1;1;2;4,4
we can have 3*2=6 pathes in this matrix (we have 2 elements on 1'st column and 1 element on 2'nd column, so number of pathes is (2+1)*(1+1)=6)
we will duplicate it and add 0,0 as start piont to first instance and 5,5 as end point to 2'nd instance.
so, wi will have 2 matrixes:
1)
0,0;1,1;1,2;4,4
2)
1,1;1,2;4,4;5,5
combintions (possible pathes), where x means that we skip point from this column:
path1:
0,0;x;x;5,5
path2:
0,0;1,1;x;5,5
path3:
0,0;1,1;4,4;5,5
path4:
0,0;1,2;x;5,5
path5:
0,0;1,2;4,4;5,5
path6:
0,0;x;4,4;5,5
for path1 we have 1 pair:
0,0 and 5,5, so result is rangemin(50,50)+fabs(50(50))1=4 (this is distance when we have no mathing letter)
for path2 we have 2 pairs to calculate:
0,0 > 1,1
1,1 >5,5
so, our result is:
rangemin(10,10)+fabs(10(10))1 +
rangemin(51,51)+fabs(51(51))1 = 0+3=4
for path3 we have 3 pairs:
0,0>1,1
1,1>4,4
4,4>5,5, so:
rangemin(10,10)+fabs(10(10))1 +
rangemin(41,41)+fabs(41(41))1 +
rangemin(54,54)+fabs(54(54))1 =0+2+0=2
for path4 we have 2 pairs:
0,0>1,2
1,2>5,5
rangemin(10,20)+fabs(10(20))1 +
rangemin(51,52)+fabs(51(52))1 = 1+11 + 3+11=4
for path5 we have 3 pairs:
0,0>1,2
1,2>4,4
4,4>5,5, so
rangemin(10,20)+fabs(10(20))1 +
rangemin(41,42)+fabs(41(42))1 +
rangemin(54,54)+fabs(54(54))1 =1+ 2+11 +0=3
for path6 we have 2 pairs:
0,0>4,4
4,4>5,5
rangemin(40,40)+fabs(40(40))1
rangemin(54,54)+fabs(54(54))1=3+0=3
so, our winner path is path3 and result is 2
It is possible having pairs and knowing number of combinations to know on which path this pais should be calculated.
This parallel version of algorihm i am trying to implement. But it is not easy...
regards
Darek







Re: Levenshtein Algorithm
Dariusz Mielczarek Apr 29, 2014 12:24 PM (in response to Marco Wedel )Marco,
I have one more idea and hope, that i find tonight even better way to
implement it ...
29042014 17:28 użytkownik "Marco Wedel" <qcwebmaster@qlik.com>
napisał:
Qlik Community <http://community.qlik.com/> Levenshtein Algorithm
reply from Marco Wedel<http://community.qlik.com/people/MarcoWedel?et=watches.email.thread>in
Scripting  View the full discussion<http://community.qlik.com/message/518811?et=watches.email.thread#518811>





Re: Levenshtein Algorithm
Stefan Wühl Apr 28, 2014 4:23 PM (in response to Omer Citak )Have a look at the second page of the second linked thread in my answer here:

Re: Levenshtein Algorithm
Marco Wedel Apr 28, 2014 4:55 PM (in response to Omer Citak )Hi Omer,
although it might not be the fastest solution (in terms of execution time), this application might help:
tabLevenshteinDistances: LOAD *, levenshtein(A,B) as LevDistAB Inline [ A,B kitten,sitting saturday,sunday book,back table,able iPod,iPad cellar,door qlik,click ];
It's just a copy/pastesolution using a VBScript function found at wikibooks:
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#VBScript
The reload execution time for a 1.000.000 rows xlsx table raised from 45sec to 2min 58sec when adding the levenshtein distance calculation.
hope it might be of any use though
regards
Marco

QlikCommunity_Thread_115800.qvw 153.8 K

Re: Levenshtein Algorithm
Dariusz Mielczarek Apr 30, 2014 4:35 AM (in response to Marco Wedel )Marco,
is it possible to fire macro on the fly, from expression level?
regards
Darek

Re: Levenshtein Algorithm
Marco Wedel Apr 30, 2014 6:56 AM (in response to Dariusz Mielczarek)Hi,
I don't think we can use functions defined in a VB module in expressions in the front end, no.
see also:
Can a function defined in Module be called as an expression in a table chart?
regards
Marco



Re: Levenshtein Algorithm
Omer Citak Apr 29, 2014 10:07 AM (in response to Omer Citak )Hi guys,
Thanks a lot for the help.
It really help me.
Regards,
Omer