Former Employee

## Performance of Conditional Aggregations

In my previous post I compared different methods for making conditional aggregations. In it, I claimed that Set Analysis is often faster than other methods, such as an If() function or a multiplication with a flag.

Also, in a recent forum discussion the performance benefits of the different methods were debated, and the multiplication method was favored by some.

So I decided to make my own test.

The main conclusion from my test is that Set Analysis is by far the most performant way to make a conditional aggregation – in the cases when you need performance. The chart below shows the average chart calculation time for the five different ways to make a conditional aggregation in a situation where you have a large amount of data.

From this chart, you can draw several conclusions:

• Set Analysis is the fastest alternative for large data sets.
• Set Analysis performs relatively better when the selection ratio is small (the sub-set of data that the condition picks out), since the subsequent aggregation runs over a much smaller number of rows. This is in sharp contrast to the other methods, where the selection ratio hardly affects the result.
• The three methods in the middle (numeric comparison as condition, Boolean flag as condition and multiplication) are roughly the same from a performance perspective.
• An If()-function with a string comparison is by far the worst choice.
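For reference, the five variants could be written roughly as follows. This is only a sketch: the field names Amount and Condition match the test setup described further down, but the actual flag values (1, 'Yes') and the ConditionText field are assumptions for illustration.

```qlik
// 1. Set Analysis (fastest on large data sets)
Sum({<Condition={1}>} Amount)

// 2. If() with a numeric comparison as condition
Sum(If(Condition=1, Amount))

// 3. If() with a Boolean flag as condition
Sum(If(Condition, Amount))

// 4. Multiplication with the flag
Sum(Condition * Amount)

// 5. If() with a string comparison (slowest; ConditionText is hypothetical)
Sum(If(ConditionText='Yes', Amount))
```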

But it is not a clear-cut case: if you instead make the same measurements with a smaller data set, Set Analysis is not the most efficient method. The chart below shows the result for a smaller data amount. Note that even though the data amount is still considerable (1M records), it is small enough for all response times to be under a second, whereas in most cases they are an order of magnitude larger in the graph above.

The reason is that there is an overhead in Set Analysis that has to be incurred regardless of whether the data amount is large or not. So for small data amounts, the performance gain in the aggregation is not large enough to cover the overhead.

The bottom line is that Set Analysis is the method you should use for large data amounts. For smaller data amounts, it doesn’t really matter which method you choose: They are all fast enough.

The test was made on my dual-core laptop with 16GB of memory. The data model consisted of three tables: one fact table and two dimension tables. The fact table contained 100 million records.

The calculation time of a pivot table with the field Dim as dimension and the sum of Amount as expression was measured, using the different ways to code the condition. The field Condition was used as flag in the condition.

The measurement was repeated for different user selections in Dim (99M records, 10M records and 1M records), for different selection ratios in the condition (0.5%, 5% and 50%), and for different cardinality in the Condition Dimension (1000 records, 1M records).

The measurements were made starting with a cleared cache, then making a series of different selections in the field Dim of which the last three were recorded. This way the cache was populated with basic calculations and indexes, but not with the specific chart calculation.

HIC

Further reading related to this topic:

Conditional Aggregations

Former Employee

I have not compared the different set analysis methods, but as I wrote in a previous comment, I wouldn't expect any large differences between string, number, Boolean and expression in the Set Analysis expression. These are all compared against the content of the symbol table, which normally is a lot smaller than the data table, and hence faster than comparing in the data table.

The P() and E() will cost a little more though, since they are indirect selections. You could say that a normal Set Analysis is equivalent to making an extra selection, and then aggregating over a smaller data set. The indirect Set Analysis would then be equivalent to making two extra selections, and then aggregating over a smaller data set.
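As an illustration of such an indirect selection (the field names Customer, Product and the value 'Shoe' are hypothetical):

```qlik
// Sum of Amount for customers that have bought the product 'Shoe'.
// P() returns the possible values of Customer given the inner selection.
Sum({<Customer = P({<Product={'Shoe'}>} Customer)>} Amount)

// E() instead returns the excluded values, i.e. customers
// that have NOT bought 'Shoe'.
Sum({<Customer = E({<Product={'Shoe'}>} Customer)>} Amount)
```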

HIC

MVP

Somehow I knew it and have stuck to it since I learned Set Analysis at Qonnections 2009.

Luminary Alumni

Henric Cronström Thanks for the follow-up. In my experience, larger data tends to lead to more denormalized data models, specifically for the purpose of minimizing calculation times, which is why I was interested in whether there was any change in the results with large data when moving the Condition Dimension fields directly to the fact table. (I think that approach is in accordance with QlikView data model best practice documentation, Qonnections presentations, etc.)

Luminary Alumni

Another great post. Many thanks HIC.

One quick question:

> The reason is that there is an overhead in Set Analysis, that has to be performed independently of whether the data amount is large or not. So for small data amounts, the performance gain in the aggregation is not large enough to cover the overhead.

Referring to the above statements: are you referring to symbol tables? I.e., Set Analysis performs calculations based on the symbol tables irrespective of whether they are small or large tables. If not, can you please explain what this overhead is?

Cheers,

DV

Former Employee

A Set Analysis expression is in principle the same as a selection. Hence, the overhead is to make an additional selection and create the corresponding state space vectors, so that the aggregation can be made over a record set different from that of the real selection.
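In other words, the following two should return the same number (assuming a numeric flag field Condition, as in the test setup); the set expression simply performs the "selection" internally before aggregating:

```qlik
// Equivalent to selecting Condition=1 in the UI
// and then computing Sum(Amount):
Sum({<Condition={1}>} Amount)
```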

HIC

Not applicable

Thanks a lot for sharing your results.

Don't forget that small datasets can very easily become large data sets, so I think the overhead of set analysis is worth taking in order to future proof your application.

Not applicable

Hi Henric, thanks for your great post.

How do you measure the calculation times of the expressions?

Thank you.

Former Employee

You can see this in Document Properties > Sheet.

HIC

Not applicable

hic

Thanks for sharing these excellent performance stats. Can you please suggest the best method to achieve the requirement below for a large data set (10M records)?

we have two buttons "Next Quarter" and "Next Four Quarters".

All the chart expressions (around 30 charts and 75+ expressions) have to work in both scenarios. Currently we are using an If() condition like the one below, which is making the application very slow.

```qlik
=If($(vNextQuarterButton)=1,
    Sum({<QuarterYear={$(vNextQuarter)}>} Spend),
    Sum({<QuarterYear={$(vNextQuarter), $(vNextQuarter1), $(vNextQuarter2), $(vNextQuarter3)}>} Spend)
)
```

Regards,

Sai.

MVP

Hi Sai,

you should avoid the If() function, because both output expressions (the true and the false branch) are always calculated before the condition is tested.

- Ralf
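One way to remove the If() here could be to let a variable expand to the wanted list of quarters, so that a single set expression covers both scenarios. This is only a sketch: the variable names come from Sai's expression above, and vQuarterList is a hypothetical helper variable.

```qlik
// Define once (e.g. in the script, or recomputed when the button is pressed):
// SET vQuarterList = $(=If($(vNextQuarterButton)=1,
//     '$(vNextQuarter)',
//     '$(vNextQuarter), $(vNextQuarter1), $(vNextQuarter2), $(vNextQuarter3)'));

// Then every chart expression reduces to a single set expression:
Sum({<QuarterYear={$(vQuarterList)}>} Spend)
```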
