Number of difference between two string

mall1m · ‎2020-11-10

Dear all

I'm doing a comparison between lines over 60 fields with 60 peeks (60 is an average but number of fields can vary from 20 to 300 so i'm building the peek with a "for" loop)

These 60 peek take a very long time as they are calculated over 4 million lines (around 30 min)

I'm trying to find another way to proceed to speed up the calculation

For each field, I just have 3 possible value (0,1 or 2)n, so for example if I have 4 fields A,B,C,D I would to compare with only one peek the following values

example:

A1|B0|C1|D0|E1|F1

A1|B0|C0|D0|E1|F2

==> here, I have 2 difference: C1/C0 and F1/F2

Do you have any idea on how to get the number of differences without doing

"if(peek(A)=A,0,1)+if(peek(B)=B,0,1) + if(... " and so on

Extra question: I also need to count the number of major differences: if A change from 0 to 1 or from 1 to 2 these are minor but a change from 0 to 2 is considered as major major

example:
A1|B0|C1|D0|E1|F1

A0|B0|C2|D2|E1|F1

==> here, I have 1 difference: D0/D2

I'm trying to adapt these code to my needs, without success for now
https://community.qlik.com/t5/New-to-Qlik-Sense/Compare-2-strings/td-p/1517313

rwunderlich · ‎2020-11-10

Out of the 4 million, approximately what percentage of lines typically have differences?

-Rob

mall1m · ‎2020-11-10

thanks for your interest Rob 🙂

more than 90% have differences

But what I didn't explain: I'm not always comparing the same line together

I'm using ODAG to insert selection from my user

if I have 5 lines (1,2,3,4,5) => my user may filters on a date or a category, then I wil have to compare only line 1,3,5

(Right now I'm testing an optimization which consists to compare only required fields and not the 60 fields, in most of the case i have less than 30 fields to compare so I made a table with combination of all fields - it raise to around 400 combinations - then with a for loop, I do the peek only on the required field ... which I hope will speed up the process a bit )

I hope to find a way to do less peek using PurgeChar and subfield on a concatenated string (but I'm not sure it will be faster...)

marcus_sommer · ‎2020-11-10

It's an interesting case and I think it will be quite hard to improve here the performance because it is a comparing of each field and for each record against the previous one. Especially if more as 90% are different to each other it doesn't look very promising to develop a logic to exclude the equal records from the check.

Because of the fact that there are so many checks against the previous value within these loads it might be worth to try previous() instead of peek(). AFAIK both are very similar in the performance but I never measured them. A further measure which may run a bit faster might be to change your check to something like this:

rangesum(-(peek(A)=A), -(peek(B)=B), ...)

A quite different method to your approach could be to transform the data with a crosstable-statement, ordering the result properly in a following load and applying the check only to one field. Of course it would run against much more records and therefore I'm not sure that there would be any improvements - but maybe you apply here some incremental logics and/or pre-calculate some results so that your final ODAG don't need to do the check-work else just to pick the appropriate results.

- Marcus

deduplication

Peek

Script

string comparison