Logical Inference and Aggregations - Qlik Community

hic · ‎2013-07-15

Every time you click, the Qlik engine recalculates everything.

Everything.

A new selection implies a new situation: Other field values than before are possible; other summations need to be made; the charts and the KPIs get other values than before. The state vectors and the objects are invalidated. Everything needs to be recalculated since this is what the user demands.

Well, there is of course a cache also – so that the Qlik engine doesn’t have to recalculate something which has been calculated before. So it isn’t quite true that everything is recalculated: If a calculation has been made before, the result is simply fetched from the cache. But it is true that nothing is pre-calculated. There is no need for that. Everything can be done in real-time.

The Qlik engine is an on-demand calculation engine.

From a principal point, there are two steps in the recalculation of data: The logical inference in the data model, and the calculations of all objects, including sheet labels and alerts.

The logical inference is done first. The goal is to figure out which field values in the symbol tables are possible and which records in the data tables are possible, given the new selection. There is no number crunching involved - it is a purely logical process. The result is stored in the state vectors.

Think of it as if the selection propagates from one table in the data model to all other tables. Table after table is evaluated and the Qlik engine figures out which values and records are possible, and which are excluded.

When the logical inference is done, the Qlik engine starts to evaluate all exposed objects. List boxes and dimensions in charts must be populated and sorted. All expressions – in charts, in text boxes, in labels, in alerts – must be calculated. Objects that are on other sheets, minimized or hidden, are however not calculated.

The calculations are always aggregations based on the data records that have been marked as possible by the logical inference engine. I.e., the objects do not persist any data on their own.

The calculation phase is usually the phase that takes time – often over 90% of the response time is due to calculations. The calculations are asynchronous and multi-threaded on several levels: First of all, every object is calculated in its own thread. Secondly, in the 64-bit version, many aggregations e.g. Sum() are calculated using several threads, so that a sum in one single object can be calculated quickly using several CPUs.

Finally, when an object has been calculated, it is rendered. Since the calculation is asynchronous and multi-threaded, some objects are rendered long before other objects are ready.

And when an object has been rendered, you can click again. And everything is repeated.

HIC

PS. All of the above is of course true for both QlikView and Qlik Sense. Both use the same engine.

If you want to read more about the Qlik engine internals, see

Symbol Tables and Bit-Stuffed Pointers

Colors, States and State vectors

The Calculation Engine

mr_barriesmith · ‎2013-07-15

Very clever people making all this happen.

Are variables evaluated before visible sheet objects - as one group?

Is QlikView smart enough to only evaluate variables involved in the visible sheet and object set?

The engine that hands out the multi-threading: is it smart enough to know that SUM(Amount) and sum(Amount) are the same?

All the best, Adam

hic · ‎2013-07-15

Q1: Variables are considered part of the data, so, yes, they are evaluated before the sheet objects.

Q2: Variable dependencies is an interesting topic... Basically, since you can use $-expansions that have different values depending on selection - and nested such - to fetch a variable, it is virtually impossible to figure out where a specific variable is used. So, I believe the answer to Q2 is no. But I need to check with some of our developers to be completely sure.

Q3: I believe it is. I know it is smart enough to recognize both that it is the same aggregation and that the scope of the aggregation is the same, when it looks in the cache. So left to check, is only that no other threads have have started to calculate the same aggregation, but not yet entered the result in the cache.

HIC

skokenes · ‎2013-07-15

Hi Henric and Adam,

Mike Steedle actually found that equivalent expressions written slightly differently will not be recognized by QV as being the same.

Near the bottom of this post about performance with column references, Mike mentions:

Note the other possible danger of using explicit expression formulas in the final column: writing an explicit formula slightly differently, like changing the case of a function or inserting spaces before or after parentheses, makes performance even worse (about 3x worse, in this example). QlikView doesn’t detect that sum(Sales) and Sum(Sales) mean the same thing within a chart, even though they have the same result.

hic · ‎2013-07-15

I would be very surprised if what you claim is correct.

First of all, the CalcTimes as reported in the document properties, sheet properties and memory statistics are not reliable. The code was written before the latest incarnation of the caching was implemented, and also before much of the multi-tasking was implemented, so I wouldn't trust them to be accurate. To make a proper test, you would have to measure the calculation time some other way.

With the current implementation, the response times should not differ between full expressions, label references and column references. But, there may of course be bugs... Soif you have an example where you think that the response times differ, then please send it to me.

HIC

skokenes · ‎2013-07-15

Thanks for the information Henric. I will leave commenting on the nature of the tests to Mike.

Can you define current implementation? Do you mean QlikView 11, or a specific version of QlikView 11, like QV11.2SR2?

One thing I have noticed with column references in QV11SR2 and below is that when I use a function like rand() in an expression, then the result appears to cache after a recalculation. So if I create an expression with rand() and then reference that column multiple times using label or column references, the referenced columns will be equal but will not equal the first column:

https://dl.dropboxusercontent.com/u/99177292/Cache%20test.qvw

Also, how would you define unreliable for the CalcTimes? Do you mean just in terms of absolute value? From experience, it would appear at the very least that you can use them to examine relative changes in performance for a single object across different configurations. In this scenario, the fact that it doesn't take into account the multi-tasking capabilities shouldn't matter. If I understand the example we are discussing now, the multi-threaded calculations shouldn't even come into play because you have stated the software should check whether equivalent expressions are being calculated, thus preventing the same expression from being calculated multiple times.

If this CalcTime cannot be relied on, what would you suggest as an alternative?

Thanks for the feedback

Speros

EDIT: typos

brindlogcool · ‎2013-07-15

CalcTimes as reported in the document properties, sheet properties and memory statistics are not reliable.

How the Data Island will work? Data island is the single threaded process or Multi threaded process?

It would be very helpful for the performance optimization if we are able to track the expression exuection is single threaded or multithreaded

msteedle · ‎2013-07-15

Henric,

I linked a simple example that verifies what Speros mentioned in QlikView Desktop. It does not cache the results when you change the case of the letters in SUM or when you convert the straight table to a graphical chart (combo chart).

http://blog.axc.net/wp-content/uploads/2013/07/expression-caching-test.qvw

Even if the calc times in the Sheet Properties are not accurate, you can tell the difference between the chart calculating from scratch and re-using existing results.

I haven't tested on QlikView Server but would be curious to know if the results are the same.

Mike

msteedle · ‎2013-07-15

I believe your #2 is a fallacy I have seen in a number of performance-related best practices documents ("delete unused variables"). I have also read that the same is true only of variables with definitions that start with an equals sign.

0. To test, create a very slow-calcing expression and store it in a variable. I might recommend =Sum(Aggr(1, Dim1 & Dim2 & Dim3...)). It doesn't have to make sense--just keep adding dimensions until it's good and slow. On an existing tab, create a text object that displays that variable result. Save and close QlikView.

1. Open the application, make selections on that tab, and observe calc times. The variable will be weighing on the calculations. Delete the text object, save, and close.

2. Open the application, make selections on that tab, and observe calc times. The variable will have no discernible effect: calc times are much faster, even though the variable still exists, after deleting the text object that referenced it. Now delete the variable from the Variable Overview, save, and close.

3. Open the application, make selections on that tab, and observe calc times. The performance is the same as the last test case, before the variable was deleted.

From these tests you can surmise that a variable actively being referenced is the only way it can materially impact an application. A weird thing is that, even though the Variable Overview doesn't display the result of the variables containing expressions, the presence of a slow-calcing variable does seem to make the Variable Overview take longer to open.

hic · ‎2013-07-16

Many different questions, but I’ll try to answer them all

On variable evaluation:

Variables are evaluated in two steps. If a variable is assigned a value through an action or a macro, this is done before the sheet objects. However, if a variable has a value which is a string starting with an equal sign, there is a second evaluation step: when the variable is used. In other words, when a sheet object that uses the variable is evaluated, the expression value is used instead of the real variable value. So, Michael, you’re absolutely right in your statement that a slow-calculating variable affects performance the way you describe.

On Expressions, label references and column references:

First of all, the case used in the function name does matter. I checked this with our developers, and the cached result will not be found if the cases differ. I stand corrected.

Further, when an expression is calculated, the aggregations that constitute sub-expressions are re-used. So, if the expression is Sum(A) / Count(B), then the Sum(A) and Count(B) will be cached and re-used. If the same expression is written again, the two aggregations will be re-used, but the division needs to be done again. As opposed to when a label reference or column reference is re-used: Then the final number is fetched directly from the previous expression. For an Aggr() function, this can mean a fairly large difference. But for a single Sum(Amount) there should be no difference at all. So my previous statement was not entirely correct.

On logical islands:

For a 64-bit QlikView, both the logical inference and the aggregations are multi-threaded, also for logical islands. But there is a catch: The first step in a chart evaluation is to generate all combinations between the fields in the expression, and this is a single threaded process. Normally, this is not a problem since it takes very little time, but if you in your chart have one field from the logical island and another from the main data model, this step may take some time.

HIC

sparur · ‎2013-07-19

Hello Henric.

Thanks a lot for the very useful and interesting post.

But I have to find out for myself one thing...

You wrote that
"Objects that are on other sheets, minimized or hidden, are however not calculated"

Is it really correct? I ask, because I definitely remember, that I read some best-practise and optimization documentations, where was written that hidden object always calculates and for optimizing you have to minimized before hidding or use Calculation condition.

Thanks,

Anatoly.