L_VN
Partner - Contributor III

Slow performance with for loops

I have a dataset with a facts table that contains this basketball data:
 
Facts:
X, // x coordinate, [0, 50]
Y, // y coordinate, [0, ~100]
ShotType, // shot categories
ShotValue, // 2 or 3
ShotMade // 1 or 0
 
 
I want to create an expected points model by simply counting, for each shot, how many shots of the same type and value were made versus taken from roughly the same spot.
 
XP = \frac{N_{\text{made shots}} \cdot \text{Value}}{N_{\text{taken shots}}}
 
Right now I am doing this with two for loops:
 
 
LET vMaxD = 1; // shots within this distance of one another are viewed as similar for the purpose of XP

tmp:
LOAD * Inline [
ShotId, TimesMade, TimesShot, XP, ChanceToMake
];

// outer loop: one pass per shot in the facts table
FOR i = 0 to NoOfRows('FACTS') - 1
    LET vMade = 0;
    LET vTot = 0;
    LET vShotId = Peek('ShotId', i, 'FACTS');
    LET X1 = Peek('X', i, 'FACTS');
    LET Y1 = Peek('Y', i, 'FACTS');
    LET ShotVal1 = Peek('ShotValue', i, 'FACTS');
    LET ShotType1 = Peek('ShotType', i, 'FACTS');

    // inner loop: compare shot i against every other shot j
    FOR j = 0 to NoOfRows('FACTS') - 1
        LET X2 = Peek('X', j, 'FACTS');
        LET Y2 = Peek('Y', j, 'FACTS');
        LET ShotVal2 = Peek('ShotValue', j, 'FACTS');
        LET ShotType2 = Peek('ShotType', j, 'FACTS');
        LET ShotMade = Peek('ShotMade', j, 'FACTS');

        LET d = sqrt(Pow($(X1) - $(X2), 2) + Pow($(Y1) - $(Y2), 2));
        LET vMade = if(d < vMaxD AND ShotVal1 = ShotVal2 AND ShotType1 = ShotType2, vMade + ShotMade, vMade);
        LET vTot = if(d < vMaxD AND ShotVal1 = ShotVal2 AND ShotType1 = ShotType2, vTot + 1, vTot);
    NEXT j

    LET XP = if(vTot > 0, vMade / vTot * ShotVal1, 0);
    LET ChanceToMake = if(vTot > 0, vMade / vTot, 0);

    Concatenate(tmp)
    LOAD * Inline [
        ShotId, TimesMade, TimesShot, XP, ChanceToMake
        $(vShotId), $(vMade), $(vTot), $(XP), $(ChanceToMake)
    ];
NEXT i
 
I am aware that this will be a costly O(N^2) calculation, but even in the case of i in [0, 999], j in [0, 1] it takes a significant amount of time. In the special case of vMaxD = 1 we can omit the sqrt() function from the distance calculation (d < 1 exactly when the squared distance is < 1), which saves some computation, but this is just a small fix and doesn't really affect the overall performance much.
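For illustration, with vMaxD = 1 the comparison inside the inner loop could run on the squared distance directly (a sketch reusing the variable names from the loop above):

// sketch: comparing the squared distance against vMaxD^2 avoids the sqrt() call
LET dSquared = Pow($(X1) - $(X2), 2) + Pow($(Y1) - $(Y2), 2);
LET vMade = if(dSquared < vMaxD * vMaxD AND ShotVal1 = ShotVal2 AND ShotType1 = ShotType2, vMade + ShotMade, vMade);
LET vTot = if(dSquared < vMaxD * vMaxD AND ShotVal1 = ShotVal2 AND ShotType1 = ShotType2, vTot + 1, vTot);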
 
I chose not to use a cartesian product of the facts table, as it contains roughly a million rows and the product would become too big to handle.
 
Is there some way to speed up the calculations?
4 Replies
marcus_sommer

I'm not sure I really understand what you want to do, but it is bound to be slow - at least with a data set of around 1 M records.

This is mainly caused by looping over the table from outside. For each variable assignment Qlik jumps from the loop to the table to fetch the indexed value, which can be regarded as an I/O operation that has to be handled between Qlik and the OS and adds some delay each time.

Further, you create at least a million loads, each of which needs to be initialized by checking the existence of the field names and field values. In fact it is far more than a million, because the nested loop also creates a cartesian product.

Some years ago we had a somewhat similar task: creating KML files with various local/regional areas just from the lat/long information of our POS. For this we needed a logic to find the outermost POS, and as a possible solution we identified the Graham scan (Wikipedia). I didn't try to implement it as program logic in QlikView but gave it to a colleague who developed several solutions in PowerShell, Java and Python. They all worked, but were not faster than creating appropriate cartesian tables in QlikView, performing all the calculations there and then ordering/filtering/aggregating the final results.

Therefore the external programming had no benefit for us, and it also showed that it's possible to transfer such logic into native load statements. Qlik is really optimized for that, but it's not a programming tool like Java or Python.

In regard to your task I could imagine two ways of improving the performance. One would be a bit similar to your current approach, but looping not through the data table but through the symbol tables: each field is stored as a table of only its distinct values plus a bit-stuffed pointer into the data table. If there is a significant degree of redundancy in the data, the number of iterations needed would be reduced accordingly. Further, I think the calculation of d could be separated from the various if() conditions - i.e. calculate it separately, map it back, and then do all the other processing.
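A minimal sketch of that idea (field name X used as an example; FieldValueCount()/FieldValue() read the distinct values directly, so the loop runs once per distinct value instead of once per row):

// hedged sketch: loop over the distinct values of a field instead of over every row
FOR k = 1 to FieldValueCount('X')
    LET vX = FieldValue('X', $(k));      // k-th distinct value of X (1-based)
    // ...do the per-value calculation here and map the result back to the rows...
NEXT k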

Another way would be to create the mentioned cartesian product with an inside loop within the load; here is an example of the general method:

load *, recno() as RecNo, rowno() as RowNo, iterno() as Iterno,
           StartValue + iterno() - 1 as ContinuousValue
from Source while StartValue + iterno() - 1 <= EndValue;

In the end you may not strictly need the various record IDs, but they are quite useful for validating the data and/or ordering/filtering/matching it in any way. Also, with this kind of logic you might be able to skip duplicates by loading distinct, by loading from the field values, or by adding where clauses. Here too it might be possible to separate the calculations and match the results afterwards.

Nevertheless it will remain quite a heavy task. Finally, I suggest considering an incremental logic and storing historical calculations in sliced QVDs (by YYYYMM or similar categories).
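A rough sketch of such an incremental logic (the slice list and file names are only placeholders): one QVD per month, and only months without a stored QVD get recalculated:

// hedged sketch: reload finished monthly slices from QVD, only recalculate missing ones
FOR Each vMonth in '202401', '202402', '202403'      // placeholder YYYYMM slices
    IF Not IsNull(FileTime('XP_$(vMonth).qvd')) THEN
        XP_Results:
        LOAD * FROM [XP_$(vMonth).qvd] (qvd);        // slice already calculated: just reload it
    ELSE
        // ...calculate XP only for the shots of month $(vMonth) into XP_Results...
        // STORE XP_Results INTO [XP_$(vMonth).qvd] (qvd);
    END IF
NEXT vMonth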

L_VN
Partner - Contributor III
Author

Yes, when doing it over the entire dataset it's a really tall task (quadratic time complexity over an n > 10^6 dataset), but even when I downsized the problem to i = 1000, j = 3 it took a really long time.

 

But if I understand you correctly, it's the looping-over-tables logic that takes such a long time, so creating a cartesian table and doing the calculations on it will be much faster?

marcus_sommer

Yes, in my experience creating cartesian tables with several million records is quite fast, especially if the tables are rather small with just a few fields. This is in general valid for join and for loop approaches, for example:

t:
load Product from Products;

join(t)
load Date, 0 as Values from Calendar;

which might then be concatenated to a fact table to create a fact record for each Product & Date, or used as a basis table for any mapping targets, as well as:

load *, rowno() as RowNo;
load *, ContinuousValue1 * ContinuousValue2 as X;
load *, StartValue2 + iterno() - 1 as ContinuousValue2, iterno() as Iterno2
     while StartValue2 + iterno() - 1 <= EndValue2;
load *, recno() as RecNo, iterno() as Iterno1,
     StartValue1 + iterno() - 1 as ContinuousValue1
from Source while StartValue1 + iterno() - 1 <= EndValue1;

in which multiple inside-loops are used within a preceding load.

Both methods are just show-cases to demonstrate possibilities, and there are quite probably ways to reduce the effort. The important thing is to check whether really everything needs to be created first and then filtered, or whether the creation itself could already include some conditions. For example, I would rather not apply a join like the one above, but instead map the created and release dates from the products and then apply a while loop like in the second show-case, which could reduce the number of records significantly as well as improve the data quality by not producing old dates for new products.
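A possible sketch of that (table and field names are only assumptions): map the created date per product and let the while loop generate only the dates from then until today:

// hedged sketch (assumed names): expand each product only over its own valid date range
CreatedDateMap:
Mapping Load Product, CreatedDate From Products;

ProductDates:
Load Product,
     Date(StartDate + iterno() - 1) as Date
While StartDate + iterno() - 1 <= Today();
Load Distinct Product,
     ApplyMap('CreatedDateMap', Product) as StartDate
From Orders;    // whatever table holds the products to be expanded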

For your case I could imagine similar possibilities, for example creating the cartesian product and the on-top calculations only once, not for each record, and then mapping the results back to the records.
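For example, mapping precomputed results back might look roughly like this (a sketch, assuming the per-shot results already sit in the tmp table from the question):

// hedged sketch: compute XP once per ShotId, then map it back to the fact records
XPMap:
Mapping Load ShotId, XP Resident tmp;

FactsWithXP:
Load *,
     ApplyMap('XPMap', ShotId, 0) as XP
Resident FACTS;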

vincent_ardiet_
Specialist

I haven't gone deep into your code, but could you do your cartesian join without the loop and then put the computation in a load?

Facts2:
Load ShotId, X as X1, Y as Y1, ShotType as ShotType1, ShotValue as ShotValue1, ShotMade as ShotMade1 From Facts;

Inner Join (Facts2)
Load X as X2, Y as Y2, ShotType as ShotType2, ShotValue as ShotValue2, ShotMade as ShotMade2 From Facts;

Facts3:
Load
ShotId,
sqrt(Pow(X1 - X2, 2) + Pow(Y1 - Y2, 2)) as d
...
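Purely as a sketch of how the rest might look (not part of the original reply; the where-clause mirrors the vMaxD = 1 and same-type/same-value logic from the question, and the aggregation produces one row per shot):

// hedged sketch: compute d on the joined table, filter to similar shots, then aggregate per ShotId
Facts3:
Load ShotId,
     Sum(ShotMade2) as TimesMade,
     Count(ShotMade2) as TimesShot,
     Sum(ShotMade2) / Count(ShotMade2) * ShotValue1 as XP,
     Sum(ShotMade2) / Count(ShotMade2) as ChanceToMake
Where d < 1 and ShotType1 = ShotType2 and ShotValue1 = ShotValue2
Group By ShotId, ShotValue1;
Load *,
     sqrt(Pow(X1 - X2, 2) + Pow(Y1 - Y2, 2)) as d
Resident Facts2;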