Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Dear All,
Does anyone try to apply Levenshtein algorithm in QlikView?
Your inputs are highly appreciated!!
Regards,
Omer
Marco,
I have one more idea and hope, that i find tonight even better way to
implement it ...
29-04-2014 17:28 użytkownik "Marco Wedel" <qcwebmaster@qlik.com>
napisał:
Qlik Community <http://community.qlik.com/> Levenshtein Algorithm
reply from Marco Wedel<http://community.qlik.com/people/MarcoWedel?et=watches.email.thread>in
Scripting - View the full discussion<http://community.qlik.com/message/518811?et=watches.email.thread#518811>
Now I am testing it with Marco's approach with 140M rows.
Then i will try Dariusz's approach as well.
I will update you about the results guys.
Thanks,
Omer
Marco,
is it possible to fire macro on the fly, from expression level?
regards
Darek
Omer,
Please, let me know what are your user expectations? Do they need to input searched words on the fly, or maybe you know everything at the script level?
I can imagine, that your goal may be "only" to analyse some text data and counting similiar words from this text. If we have this case, you should make calculations during reload or even outside qlikview... Application should be fast even with 140 mln rows.
But if user wants to input some string and then search it in loaded data, then rather some on the fly calculations are needed. I think, that my approach will be to hevy for 140 mln words. I assumed, that i have row for each letter to have faster response. But with this data amount it will be rather difficult to load it this way and i think, that there will be necessary to load words and divide it to letters on the fly.
Anyway, i think, that there are possible some improvements. For example we can try to find some candidates first (using simplier comaprissions with set analysis) and then calculate Levenshtein distance only for them.
Just for fun i am still trying to "translate" Levenshtein distance calculation to multidimensional algorithm.
Please, let me know what are your user expectations? Do they need to input searched words on the fly, or maybe you know everything at the script level?
Please have a look also on possible QlikView with "R" integration. Maybe it is some direction. Let me know...
regards
Darek
Hi,
I don't think we can use functions defined in a VB module in expressions in the front end, no.
see also:
Can a function defined in Module be called as an expression in a table chart?
regards
Marco
Darek,
You are right, i just want to analyse the some text data and counting similar words. Try to find out miss-typos some more analysis.
I test this case at SAS with a specific function (complev() in SAS) an it takes really long time and it asks for huge disk space as well. So, instead to calculating it row by row, i transform the data to pivot table and then use array like loop to find out. It takes only a couple minutes in SAS with this approach.
Then, i started think about how can I apply this in QlikView to use the power of in-memory. By using macro, it takes app. 20 hours to reload. Hence, it is not very feasible.
So, right now i try to find out how i can integrate SAS with QlikView. If you have any better ideas, i really appreciate hear it.
Thanks,
Omer
Omer,
i'am still working how to calculate distance on the fly. I'am close to solution which, i hope, should be effective.
I also tested approach like in ver3, i attached before, but with additional macro, which, using dynamic update or inputfields applies only final results (values from last row of two words combination) calculated in one straightable into memory and then results may be presented by another table (without rows we dont want to see).
In my case, where user could input search string on the fly and i loaded about 1000 words, response time was not bad.
I also found interesting transformation of this algorithm, which gives me great hope to have fast solution inside qlikview even without macro (i dont like macro's) and with minimal usage of aggr() (aggr likes memory and may be slow) . But i am still working on it.
You told before about 140M rows.
So, how many Levenshtein distances you would like to check? 140Mx140M?
About your SAS and fast calculation in pivot - i think it is similiar like my v3 in QV....
Have you checked how much time my approach needs with your data volumes?
regards
Darek
Darek,
Thanks a lot I could not check the v3 sample. I try to check that one asap.
We are on the same page as well. I do not want to use macro's and try to minimal the usage of aggr().
In Levenshtein distance, my cut off point will 3 or 4. After 4, it becomes meaningless. After the Cartesian product it becomes 140 M . Initially, it is 12K records.
I will update you about the outcomes of further tests.
Regards,
Omer
Omer,
i still dont know what users will do with those data, but
if you dont need application which will calculate distance for words inputed by user vie UI, it will be better to calculate it during reload or outside.
Maybe try to use some fast condition to check if two words are candidates to meet cutpoint and then calculate distance only for those candidates, like this (using function proposed by Marco):
wset1:
LOAD * INLINE [
w1
aba
Qlik
Qlik
QlikView
darek mielczarek
Business Discovery
levenshtein
ola
milan
poland
polska
abc
];
join
load w1 as w2 Resident wset1;
wset:
load rowno() as id,
w1 as word1, w2 as word2,
if(w1=w2,'the same', if(KeepChar(w1,w2)>len(w1)-4 and KeepChar(w2,w1)>len(w2)-4 and fabs(len(w1)-len(w2))<=4,'candidate','non candidate')) as class
Resident wset1;
DROP Table wset1;
load id,
//here: distance expression (word1,word2)
levenshtein(word1,word2) as LevDist
Resident wset
where
class = 'candidate';
regards
Darek
In my oppinion disance calculation on the fly has sense only if it is search for string user inputed in UI.
In this case you can:
1.
use hidden straight table, which calculates distance with above() and above( total), with expression "expr" like this:
if(nbr=0,nbr1,if(nbr1=0,nbr,rangemin(Above(expr,1)+1,Above(total expr,word_len1+1)+1 ,Above(total expr,word_len1+2)+if(letter=letter1,0,1))))
and macro which will read results from this table (last row for each word) and input it into another table (via dynamic update or inputsum), and those results will be presented.
2.
use modified "parallel" algorithm:
there are the same letters ("0") and cost between two of thems it is value like:
rangemin(x2-x1,y2-y1)+fabs(x2-x1-(y2-y1))-1,
where x2>x1 and y2>y1,
x1=0...len(word1)
y1=0...len(word2)
x2=1 ... len(word1)+1
y2=1 ... len(word2)+1
So, if we have:
word1=comp
word2=ccwp
we will have matrix:
1,1;1;2;4,4
we can have 3*2=6 pathes in this matrix (we have 2 elements on 1'st column and 1 element on 2'nd column, so number of pathes is (2+1)*(1+1)=6)
we will duplicate it and add 0,0 as start piont to first instance and 5,5 as end point to 2'nd instance.
so, wi will have 2 matrixes:
1)
0,0;1,1;1,2;4,4
2)
1,1;1,2;4,4;5,5
combintions (possible pathes), where x means that we skip point from this column:
path1:
0,0;x;x;5,5
path2:
0,0;1,1;x;5,5
path3:
0,0;1,1;4,4;5,5
path4:
0,0;1,2;x;5,5
path5:
0,0;1,2;4,4;5,5
path6:
0,0;x;4,4;5,5
for path1 we have 1 pair:
0,0 and 5,5, so result is rangemin(5-0,5-0)+fabs(5-0-(5-0))-1=4 (this is distance when we have no mathing letter)
for path2 we have 2 pairs to calculate:
0,0 -> 1,1
1,1 ->5,5
so, our result is:
rangemin(1-0,1-0)+fabs(1-0-(1-0))-1 +
rangemin(5-1,5-1)+fabs(5-1-(5-1))-1 = 0+3=4
for path3 we have 3 pairs:
0,0->1,1
1,1->4,4
4,4->5,5, so:
rangemin(1-0,1-0)+fabs(1-0-(1-0))-1 +
rangemin(4-1,4-1)+fabs(4-1-(4-1))-1 +
rangemin(5-4,5-4)+fabs(5-4-(5-4))-1 =0+2+0=2
for path4 we have 2 pairs:
0,0->1,2
1,2->5,5
rangemin(1-0,2-0)+fabs(1-0-(2-0))-1 +
rangemin(5-1,5-2)+fabs(5-1-(5-2))-1 = 1+1-1 + 3+1-1=4
for path5 we have 3 pairs:
0,0->1,2
1,2->4,4
4,4->5,5, so
rangemin(1-0,2-0)+fabs(1-0-(2-0))-1 +
rangemin(4-1,4-2)+fabs(4-1-(4-2))-1 +
rangemin(5-4,5-4)+fabs(5-4-(5-4))-1 =1+ 2+1-1 +0=3
for path6 we have 2 pairs:
0,0->4,4
4,4->5,5
rangemin(4-0,4-0)+fabs(4-0-(4-0))-1
rangemin(5-4,5-4)+fabs(5-4-(5-4))-1=3+0=3
so, our winner path is path3 and result is 2
It is possible having pairs and knowing number of combinations to know on which path this pais should be calculated.
This parallel version of algorihm i am trying to implement. But it is not easy...
regards
Darek