hic
Former Employee

Do you belong to the group of people who think that Count(distinct…) in a chart is a slow, single-threaded operation that should be avoided?

 

If so, I can tell you that you are wrong.

 

Well - it used to be single-threaded and slow, but that was long ago. It was fixed as early as – I think – version 9, but the rumor of its slowness lives on like an urban myth that refuses to die. Today the calculation is multi-threaded and optimized.

 

To prove that Count(distinct…) is faster than many people think, I constructed a test which shows that it is not slower – in fact, it is a lot faster than the alternative solutions.

 

I created a data model with a very large fact table, tested at five sizes: 1M, 3M, 10M, 30M and 100M records. In it, I created a secondary key with a large number of distinct values: 1%, 0.1% and 0.01% of the number of records in the fact table.
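
A load script along the following lines could generate such mock-up data. This is a minimal sketch – the original test script is not published, so the table name, [Fact ID] and the [Dim ID] assignment are my assumptions:

// 100M-record fact table. Ceil(Rand() * 1000000) yields roughly 1M distinct
// secondary keys, i.e. the 1% case; adjust both constants for the other cases.
Facts:
LOAD
    RecNo() as [Fact ID],
    Ceil(Rand() * 1000000) as [Secondary ID],
    Ceil(Rand() * 26) as [Dim ID]
AUTOGENERATE 100000000;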

 

The goal was to count the number of distinct values of the secondary key when making a selection. There are several ways that this can be done:

  • Use count distinct in the fact table: Count(distinct [Secondary ID])
  • Use count on a second table that just contains the unique IDs: Count([Secondary ID Copy])
  • Use sum on a field that just contains ‘1’ in the second table: Sum([Secondary ID Count])
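
The second and third methods require an extra table holding one record per unique ID. The sketch below shows how such a table could be built – the field names come from the expressions above, but the exact script is my assumption:

SecondaryIDs:
LOAD DISTINCT
    [Secondary ID],
    [Secondary ID] as [Secondary ID Copy],
    1 as [Secondary ID Count]
RESIDENT Facts;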

 

[Figure: Data model]

 

I also created a dimension ("Dim" in the “Dim Table”) with 26 values, randomly assigned to the records in the fact table. Then I recorded the response times for three charts, each using “Dim” as dimension and one of the three expressions above. I did this for four different selections.
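
The Dim Table itself can be as simple as the following sketch (again an assumption – any 26-value dimension linked to the fact table over [Dim ID] would do):

DimTable:
LOAD
    RecNo() as [Dim ID],
    Chr(64 + RecNo()) as Dim    // 26 values: A to Z
AUTOGENERATE 26;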

 

[Figure: Progress bars]

 

Then I remade all measurements using “Dim ID” as dimension, i.e. I moved the dimension, too, into the fact table. Finally, I loaded all the recorded data into QlikView and analyzed it.

 

The first obvious result is that the response time increases with the number of records in the fact table. This is hardly surprising…

 

[Figure: Average response time]

 

…so I need to compensate for this: I divide the response times by the number of fact table records and get a normalized response time in picoseconds:

 

[Figure: Average response time, normalized]
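
(As an illustration of the scale: a normalized response time of 1,000 ps on a 100M-record fact table corresponds to 1,000E-12 s × 1E8 records = 0.1 s of actual response time. The number is illustrative, not one of the measured values.)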

 

This graph is extremely interesting. It clearly shows that if I use a Count(distinct…) on the fact table, I have a response time that is considerably smaller than if I make a count or a sum in a dimensional table. The table below shows the numbers.

 

[Figure: Comparison table]

 

Finally, I calculated the ratio between the response times for having the dimension in the fact table vs. in a dimensional table, and the corresponding ratio for making the aggregation in the fact table vs. in a dimensional table.

 

[Figure: Comparison graph]

 

This graph shows the relative response time I get by moving the dimension or the aggregation into the fact table. For instance, at 100M records, the response time from a fact table aggregation (i.e. a Count(distinct…)) is only 20% of that of an aggregation made in a dimensional table – a five-fold speed-up.

 

This is the behavior with my mock-up data on my four-core laptop with 16GB of RAM. If you run a similar test, you may get a somewhat different result, since the calculations depend very much on both the hardware and the data model. But I still think it is safe to say that you should not spend time avoiding the use of Count(distinct…) on a field in the fact table.

 

In fact, if you need to improve performance, you should consider moving your ID to the fact table – especially if you have a large fact table.
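
In script terms, such a move could look like the sketch below – a hypothetical example with invented table and field names, not the test script from this post:

// Copy the ID into the fact table at load time, so that
// Count(distinct [Customer ID]) can be evaluated directly on the fact table.
CustomerMap:
MAPPING LOAD [Order ID], [Customer ID] RESIDENT Orders;

Facts:
LOAD *,
    ApplyMap('CustomerMap', [Order ID]) as [Customer ID]
RESIDENT FactsRaw;

DROP TABLE FactsRaw;    // and drop [Customer ID] from Orders to avoid a second link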

 

HIC

76 Comments
msteedle
Luminary Alumni

What is the effect on performance if the key field you are counting is in more than two tables?

What is the effect on the actual results the calculation returns if the exact same domain of values is not present in every table, e.g. missing reference data, extraneous reference data, or multiple transaction tables with the same key?

kalyandg
Partner - Creator III

Hi HIC,

Bravo! Really super post.

Best Regards,

Kalyan

hic
Former Employee

Jay touches on a very important issue: the trade-off between speed and clarity.

You should not always go for the fastest solution. You should only go for speed if you already have a performance problem - otherwise clarity is more important. So I would make the same choice as Jay: for small apps, you should prioritize clarity.

"Premature optimization is the root of all evil."

HIC

rwunderlich
Partner Ambassador/MVP

Fascinating post, Henric. We are going to have to revise the Developer I curriculum! It all makes great sense (counting the state array), and the principles have parallels in script processing, where scalar operations against fieldvalues() or a QVD frequently outperform a Resident load.

Can you post your test kit so others can reproduce?

How confident can I be that the calc time numbers are an accurate metric of "best performance" when doing testing? Is there a "sub expression" effect across objects? For example, given two objects with different expressions:

=sum(total sales)

=sum(sales) / sum(total sales)

Will total sales be calculated once and the cached result reused in the "other object"? Any way to determine which object will calculate it?

Forgive me if I've wandered off the post topic. Thanks again for your willingness to explore and explain questions such as this post.

Rob

hic
Former Employee

My test kit is mock-up data. To verify (or refute) my assertion, it would make a lot more sense for people to run similar tests on data other than mine.

The calc numbers are by no means accurate. They can, however, be used as fair estimates. There is unfortunately no way to find the times for sub-expressions.

HIC

vgutkovsky
Master II

Henric, I echo what everyone else is saying – this is definitely going to change how Developer I is taught 🙂 Thanks for the great post!

Vlad

christian77
Partner - Specialist

Thanks Henric.

Another imposed rule has been dethroned.


Jason_Michaelides
Luminary Alumni

Wow.

Crikey.

Oops.

(Thank you.)

Stefan_Walther
Employee

Unfortunately, the charts in the comments are not visible anymore. Regards, Stefan

Not applicable

Hi HIC,

Very interesting post. I would like to comment on the link table aspect of your post.

After playing with QV for quite some time, I must say that the most difficult thing to master is how to create a link table. You once said it is best to give users the data model they want, and frankly speaking, not many are able to provide the best link, as it depends on what results they need in their analysis.

Paul
