Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Qlik and ServiceNow Partner to Bring Trusted Enterprise Context into AI-Powered Workflows. Learn More!
hic
Former Employee
Former Employee

Do you belong to the group of people who think that Count(distinct…) in a chart is a slow, single-threaded operation that should be avoided?

 

If so, I can tell you that you are wrong.

 

Well - it used to be single-threaded and slow, but that was long ago. It was fixed already for – I think – version 9, but the rumor of its slowness lives on like an urban myth that refuses to die. Today the calculation is multi-threaded and optimized.

 

To prove that Count(distinct…) is faster than what many people think, I constructed a test which categorically shows that it is not slower – it is in fact a lot faster than the alternative solutions.

 

I created a data model with a very large fact table: 1M, 3M, 10M, 30M and 100M records. In it, I created a secondary key, with a large number of distinct values: 1%, 0.1% and 0.01% of the number of records in the fact table.

 

The goal was to count the number of distinct values of the secondary key when making a selection. There are several ways that this can be done:

  • Use count distinct in the fact table: Count(distinct [Secondary ID])
  • Use count on a second table that just contains the unique IDs: Count([Secondary ID Copy])
  • Use sum on a field that just contains ‘1’ in the second table: Sum([Secondary ID Count])

 

Data model.png

 

I also created a dimension ("Dim" in the “Dim Table”) with 26 values, also randomly assigned to the data in the fact table. Then I recorded the response times for three charts, each using “Dim” as dimension and one of the three expressions above. I made this for four different selections.

 

Progress bars.png

 

Then I remade all measurements using “Dim ID” as dimension, i.e. I moved also the dimension to the fact table. Finally, I loaded all the recorded data into QlikView and analyzed it.

 

The first obvious result is that the response time increases with the number of records in the fact table. This is hardly surprising…

 

Avg response time.png

 

…so I need to compensate for this: I divide the response times with the number of fact table records and get a normalized response time in picoseconds:

 

Avg response time normalized.png

 

This graph is extremely interesting. It clearly shows that if I use a Count(distinct…) on the fact table, I have a response time that is considerably smaller than if I make a count or a sum in a dimensional table. The table below shows the numbers.

 

Comparison table.png

 

Finally, I calculated the ratios between the response times for having the dimension in the fact table vs. in a dimensional table, and the same ratio for making the aggregation in the fact table vs. in a dimensional table.

 

Comparison graph.png

 

This graph shows the relative response time I get by moving the dimension or the aggregation into the fact table. For instance, at 100M records, the response time from a fact table aggregation (i.e. a Count(distinct…)) is only 20% of an aggregation that is made in a dimensional table.

 

This is the behavior on my mock-up data on my four-core laptop with 16GB. If you make a similar test, you may get a slightly different result since the calculations depend very much on both hardware and the data model. But I still think it is safe to say that you should not spend time avoiding the use of Count(distinct…) on a field in the fact table.

 

In fact, you should consider moving your ID to the fact table if you need to improve the performance. Especially if you have a large fact table.

 

HIC

76 Comments
Jason_Michaelides
Partner - Master II
Partner - Master II

Absolutely. I was just trying to refocus the discussion and it looked to me like it was straying into a general COUNT(DISTINCT) is better than SUM(Counter) thread.  Your point is "when plotting against a dimension". When counting Fact records and anything else that isn't repeating values (i.e. fully distinct) then Sum(Counter) is faster (in my tests anyway).

I'm probably reading something that isn't there

0 Likes
872 Views
Not applicable

THANK YOU for posting this!  I have been struggling with creating a Flag counter field for a project I am working on to avoid using Distinct Counts.  You have simplified my life! I am sticking with my distinct count so I can mark this metric off as complete

0 Likes
872 Views
Alejandro_Hernández
Former Employee
Former Employee

Jason Michaelides wrote:

When counting Fact records and anything else that isn't repeating values (i.e. fully distinct) then Sum(Counter) is faster (in my tests anyway).

Could you post your results?

Thanks

0 Likes
892 Views
Anonymous
Not applicable

So, Count(distinct) ...

Yes or Not ?

CB.

0 Likes
892 Views
hic
Former Employee
Former Employee

To me, the answer is clear:

If you need to count the number of distinct values of something, you should not be afraid to use Count(distinct...). Further, you should definetely not aim for a kludgy solution by creating an additional table that really isn't necessary.

But which solution to choose depends on which dimension you have in the chart.

HIC

892 Views
Not applicable

in the case of executing count on the fact table, why do response times get faster with more rows?

0 Likes
892 Views
thanstad
Creator
Creator

This is probably not a new way to present distinct values by (in this case by customer)

CustomerStart_DateEnd_Date
101.11.199501.06.1996
105.08.200010.10.2001
111.01.2010

When presenting a Straight or Pivot table with distinct customer I often use to move the dates from Dimension to expression and in this case make a MAX(Start_Date) because I am interesting to show the latest contract.

/Tormod Hanstad

0 Likes
892 Views
sujeetsingh
Master III
Master III

Really a valuable thread .Thank you

0 Likes
892 Views
jeremy_fourman
Luminary
Luminary

Stumbled across this post, excellent write up and explanation, thank you!

0 Likes
928 Views
Jason_Michaelides
Partner - Master II
Partner - Master II

Again, Abhi - please do not post unrelated questions on existing threads. Start a question of your own and someone will help if they can.

Jason

0 Likes
928 Views