Skip to main content
Announcements
Have questions about Qlik Connect? Join us live on April 10th, at 11 AM ET: SIGN UP NOW
hic
Former Employee
Former Employee

In my previous post I compared different methods for making Conditional Aggregations. In it, I claimed that Set Analysis often is faster than other methods, like an If()-function or a multiplication with a flag.

Also, in a recent discussion on the forum performance benefits of the different methods are discussed, and the multiplication method is favored by some.

So I decided to make my own test.

The main conclusion from my test is that Set Analysis is by far the most performant way to make a conditional aggregation – in the cases when you need performance. The chart below shows the average chart calculation time for the five different ways to make a conditional aggregation in a situation where you have a large amount of data.

Image1 Main conclusion SL_C_LM.png

From this chart, you can draw several conclusions:

  • Set Analysis is the fastest alternative for large data sets.
  • Set Analysis is relatively better if the selection ratio is small (the sub-set of data that the condition picks out), since the following aggregation runs over a much smaller number of rows. This is in sharp contrast to the other methods where the selection ratio hardly affects the result.
  • The three methods in the middle (numeric comparison as condition, Boolean flag as condition and multiplication) are roughly the same from a performance perspective.
  • An If()-function with a string comparison is by far the worst choice.

But it is not a clear-cut case: If you instead make the same measurements with a smaller data set, Set Analysis is not the most efficient method. The chart below shows the result for a smaller data amount. Note that even though the data amount still is considerable (1M records), it is small enough for all response times to be under a second, whereas they in most cases are an order of magnitude larger in the above graph.

Image2 SA poor SL_C_S.png

The reason is that there is an overhead in Set Analysis, that has to be performed independently of whether the data amount is large or not. So for small data amounts, the performance gain in the aggregation is not large enough to cover the overhead.

The bottom line is that Set Analysis is the method you should use for large data amounts. For smaller data amounts, it doesn’t really matter which method you choose: They are all fast enough.

About the test:

The test was made on my dual-core laptop with 16GB of memory. The data model consisted of three tables; one fact table and two dimension tables. The fact table contained 100 million records.

Image5 Data Model.png

The calculation time of a pivot table with the field Dim as dimension and the sum of Amount as expression was measured, using the different ways to code the condition. The field Condition was used as flag in the condition.

The measurement was repeated for different user selections in Dim (99M records, 10M records and 1M records), for different selection ratios in the condition (0.5%, 5% and 50%), and for different cardinality in the Condition Dimension (1000 records, 1M records).

The measurements were made starting with a cleared cache, then making a series of different selections in the field Dim of which the last three were recorded. This way the cache was populated with basic calculations and indexes, but not with the specific chart calculation.

HIC

Further reading related to this topic:

Conditional Aggregations

20 Comments
Jason_Michaelides
Luminary Alumni
Luminary Alumni

Yet another fantastically useful post HIC - thank you. I have been looking at this myself lately and came to the same conclusion.

0 Likes
15,894 Views
Not applicable

Hey Henrich, thanks for the overview!

I suppose that since you didn't make the distinction for a set analysis between a string, boolean or number that this is all the same for the set analysis?

0 Likes
15,894 Views
mikkeltaylor
Creator II
Creator II

very useful post,but unfortunately 2 months to late for us

We've just launched a large app and used the multiplication route i.e. x*y*z, but will definitely convert this to set analysis in the future when we next do large scale development on the app or run into performance issues.

Thanks.

0 Likes
15,894 Views
hic
Former Employee
Former Employee

Mathias Vanden Auweele

No I didn't make the distinction between string, Boolean and number in Set Analysis. But I don't think that this has any larger impact: A string comparison in an If()-function will be performed once per record in the data table, whereas a string comparison in a set expression will be matched against the content of the symbol table, which is a much smaller data set.

HIC

15,894 Views
ThornOfCrowns
Specialist II
Specialist II

Nice work as always. Interesting to see the difference is so big over the large data set.

0 Likes
15,894 Views
Luis_Cortizo
Former Employee
Former Employee

Thank you Henric for your post.

So in the small data set scenario set analysis takes less than twice the time of the best technical solution possible (and that's the worst set analysis scenario), but as soon as the app starts growing we should expect a difference in performance as muchas as being more than 5 times faster. As the times being measured on the small data set scenario are under one second, I'd always recommend to use Set Analysis in any QlikView application

In the second scenario, as you said, it doesn't make a big impact on performance but as soon as you start feeding data in the application, if you haven't used Set Analysis in your expressions prepare for investing a decent amount of time and effort in changing every If() expression that you have developed. And is fairly common that Qlik apps grow beyond the initial expectation (because we all want to analyze more data)

For me the final conclussion is that in most cases, using If() statements will mean to work at least twice, and nobody likes to work twice

0 Likes
15,894 Views
msteedle
Luminary Alumni
Luminary Alumni

Henric Cronström Would it be possible for you to retest with the Condition field moved directly to the fact table? I think this is more often the scenario, especially with large data sets. Thanks.

0 Likes
10,704 Views
hic
Former Employee
Former Employee

I totally agree with your conclusions. If you want to prepare your application for large data amounts, you should avoid If() conditions inside aggregations, and instead use Set Analysis as much as possible.

HIC

10,704 Views
Not applicable

It would be interesting to also see how Set Analysis performs amongst its various options i.e.

- Set Analysis with P()

- Set Analysis with E()

- Set Analysis with string as filter

- Set Analysis with number as filter

- Set Analysis with boolean as filter

- Set Analysis with expression as filter

These seem to be cases we most often run into. My initial take is that the boolean filter performs best whereas the E() the slowest over a given data set.

0 Likes
10,704 Views
hic
Former Employee
Former Employee

@ Michael Steedle

I am not sure I agree that it is more common with the condition field in the fact table. Quite the contrary, I would say. Anyhow, I did the test, and the result is that there is no big difference between having the flag in the fact table or having it in a small (~1000 records) dimensional table. What you should avoid, is a large dimensional table.

HIC

Image3 Flag table comparison.png

10,704 Views