Skip to main content
Announcements
Qlik Connect 2024! Seize endless possibilities! LEARN MORE
hic
Former Employee
Former Employee

Do you belong to the group of people who think that Count(distinct…) in a chart is a slow, single-threaded operation that should be avoided?

 

If so, I can tell you that you are wrong.

 

Well - it used to be single-threaded and slow, but that was long ago. It was fixed already for – I think – version 9, but the rumor of its slowness lives on like an urban myth that refuses to die. Today the calculation is multi-threaded and optimized.

 

To prove that Count(distinct…) is faster than what many people think, I constructed a test which categorically shows that it is not slower – it is in fact a lot faster than the alternative solutions.

 

I created a data model with a very large fact table: 1M, 3M, 10M, 30M and 100M records. In it, I created a secondary key, with a large number of distinct values: 1%, 0.1% and 0.01% of the number of records in the fact table.

 

The goal was to count the number of distinct values of the secondary key when making a selection. There are several ways that this can be done:

  • Use count distinct in the fact table: Count(distinct [Secondary ID])
  • Use count on a second table that just contains the unique IDs: Count([Secondary ID Copy])
  • Use sum on a field that just contains ‘1’ in the second table: Sum([Secondary ID Count])

 

Data model.png

 

I also created a dimension ("Dim" in the “Dim Table”) with 26 values, also randomly assigned to the data in the fact table. Then I recorded the response times for three charts, each using “Dim” as dimension and one of the three expressions above. I made this for four different selections.

 

Progress bars.png

 

Then I remade all measurements using “Dim ID” as dimension, i.e. I moved also the dimension to the fact table. Finally, I loaded all the recorded data into QlikView and analyzed it.

 

The first obvious result is that the response time increases with the number of records in the fact table. This is hardly surprising…

 

Avg response time.png

 

…so I need to compensate for this: I divide the response times with the number of fact table records and get a normalized response time in picoseconds:

 

Avg response time normalized.png

 

This graph is extremely interesting. It clearly shows that if I use a Count(distinct…) on the fact table, I have a response time that is considerably smaller than if I make a count or a sum in a dimensional table. The table below shows the numbers.

 

Comparison table.png

 

Finally, I calculated the ratios between the response times for having the dimension in the fact table vs. in a dimensional table, and the same ratio for making the aggregation in the fact table vs. in a dimensional table.

 

Comparison graph.png

 

This graph shows the relative response time I get by moving the dimension or the aggregation into the fact table. For instance, at 100M records, the response time from a fact table aggregation (i.e. a Count(distinct…)) is only 20% of an aggregation that is made in a dimensional table.

 

This is the behavior on my mock-up data on my four-core laptop with 16GB. If you make a similar test, you may get a slightly different result since the calculations depend very much on both hardware and the data model. But I still think it is safe to say that you should not spend time avoiding the use of Count(distinct…) on a field in the fact table.

 

In fact, you should consider moving your ID to the fact table if you need to improve the performance. Especially if you have a large fact table.

 

HIC

76 Comments
Anonymous
Not applicable

See the .qvw without data here: http://content.heldendaten.eu/testdistinct.zip

Reload it once, to generate the data.

0 Likes
892 Views
rwunderlich
Partner Ambassador/MVP
Partner Ambassador/MVP

Roland,

I tested the QVW you posted and saw only multi-threaded activity. Here's my video:

https://www.youtube.com/watch?v=XoUKju5lBRk

QV12 SR4. I also tried QV11 SR13 and got almost exactly the same execution profile.

I repeated the test with Sense Desktop 3.0

Qlik Sense Count(DISTINCT) Threading test - YouTube

Same results. Hightly multi-threaded.

-Rob

http://masterssummit.com

http://qlikviewcookbook.com

0 Likes
892 Views
Anonymous
Not applicable

HI Rob!

Thx for your testing;

Not quite sure, but I think your "SingleThreaded" phase is pretty short or is covered by some other process.

Please set the vMonths variable to 64. This should cause a longer "Single Threaded" phase.

I did a video on a stronger machine:

- Intel E5  2667 V4, hyperthreading off, NUMA off

- High Performance Powerplan

- QlikView 12 SR4

- No other users on the server

One can clearly see the single threaded phase from second 25-45 in the video.

64month singlethreaded - YouTube

Maybe we can ask someone from Scalabilty Center to validate our testing..

Thx,

Roland

0 Likes
925 Views
Shivam_Sahu
Contributor III
Contributor III

Is it as quick as SUM function ?

0 Likes
901 Views
Oleg_Troyansky
Partner Ambassador/MVP
Partner Ambassador/MVP

Based on a recent test, collectively performed by the team at the Masters Summit for Qlik, we came to the following conclusions:

- When used "globally", with no dimensions, in a text object, Count(Distinct) produces the same results at the same calc time as the SUM(Counter), very fast. We tend to believe that this this case, the counting is done on the Symbols table, as opposed to the Data table, and therefore it's a lot faster.

- However, when used in a Chart with a Dimension, Count(Distinct) was approximately 10 times slower than Sum(Counter). We can't say for sure that the operation was single-threaded, but it was certainly slower than the Sum.

Make your conclusions accordingly. I'd love to hear about your benchmarking tests of the same.

Cheers,

Oleg Troyansky 

0 Likes
853 Views
armandfrigo
Partner Ambassador
Partner Ambassador

Hi All,

Could it be that QlikView count(distinct) operations were multi-threaded and Qlik Sense is single-threaded on the same operations?

This is what is conveyed by Qlik Support in that video: https://youtu.be/Vwmx0nBNk0Q?t=1140

Cheers,

Armand

0 Likes
561 Views