Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

A Myth About Count(distinct …)

Do you belong to the group of people who think that Count(distinct…) in a chart is a slow, single-threaded operation that should be avoided?

 

If so, I can tell you that you are wrong.

 

Well - it used to be single-threaded and slow, but that was long ago. It was fixed – I think – already in version 9, but the rumor of its slowness lives on like an urban myth that refuses to die. Today the calculation is multi-threaded and optimized.

 

To show that Count(distinct…) is faster than many people think, I constructed a test which categorically shows that it is not slower – it is in fact a lot faster than the alternative solutions.

 

I created a data model with a very large fact table, in five sizes: 1M, 3M, 10M, 30M and 100M records. In it, I created a secondary key with a large number of distinct values: 1%, 0.1% and 0.01% of the number of records in the fact table.
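
As an illustration, such a fact table could be generated with a load script along these lines. This is only a sketch under assumed parameters – 10M rows with 0.1% key cardinality is just one of the tested combinations, and [Fact ID] is a hypothetical name:

    // Mock-up fact table: 10M rows; the secondary key has 0.1% cardinality,
    // i.e. 10,000 distinct values. Dim ID links to a 26-value dimension table.
    [Fact Table]:
    LOAD
        RecNo()                   AS [Fact ID],
        Floor(Rand() * 10000) + 1 AS [Secondary ID],
        Floor(Rand() * 26) + 1    AS [Dim ID]
    AUTOGENERATE 10000000;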

 

The goal was to count the number of distinct values of the secondary key when making a selection. There are several ways that this can be done (for the side-table variants, see the script sketch after the list):

  • Use count distinct in the fact table: Count(distinct [Secondary ID])
  • Use count on a second table that just contains the unique IDs: Count([Secondary ID Copy])
  • Use sum on a field that just contains ‘1’ in the second table: Sum([Secondary ID Count])
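
For the second and third methods, the side table holding the unique IDs could be built as follows – a sketch, assuming a resident fact table like the one above:

    // One row per distinct Secondary ID: a copy of the key for counting,
    // and a constant 1 for summing.
    [ID Table]:
    LOAD DISTINCT
        [Secondary ID],
        [Secondary ID] AS [Secondary ID Copy],
        1              AS [Secondary ID Count]
    RESIDENT [Fact Table];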

 

[Image: Data model]

 

I also created a dimension (“Dim” in the “Dim Table”) with 26 values, also randomly assigned to the records in the fact table. Then I recorded the response times for three charts, each using “Dim” as dimension and one of the three expressions above. I did this for four different selections.
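
Such a dimension table could be generated as in the sketch below (the letters used as values are my assumption; the random [Dim ID] link was already included in the fact table sketch above):

    // 26 dimension values, A..Z, keyed by Dim ID
    [Dim Table]:
    LOAD
        RecNo()           AS [Dim ID],
        Chr(64 + RecNo()) AS Dim
    AUTOGENERATE 26;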

 

[Image: Progress bars]

 

Then I repeated all measurements using “Dim ID” as dimension, i.e. with the dimension also moved into the fact table. Finally, I loaded all the recorded data into QlikView and analyzed it.

 

The first obvious result is that the response time increases with the number of records in the fact table. This is hardly surprising…

 

[Image: Average response time]

 

…so I need to compensate for this: I divide the response times by the number of fact table records and get a normalized response time in picoseconds per record:
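
For example (using an illustrative number, not one of the measured values): a response time of 300 ms on a 100M-record fact table corresponds to 0.3 s / 100,000,000 records = 3 nanoseconds, i.e. 3,000 picoseconds per record.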

 

[Image: Average response time, normalized]

 

This graph is extremely interesting. It clearly shows that a Count(distinct…) on the fact table gives a considerably shorter response time than a count or a sum in a dimensional table. The table below shows the numbers.

 

[Image: Comparison table]

 

Finally, I calculated the ratio between the response times for having the dimension in the fact table vs. in a dimensional table, and the corresponding ratio for making the aggregation in the fact table vs. in a dimensional table.

 

[Image: Comparison graph]

 

This graph shows the relative response time I get by moving the dimension or the aggregation into the fact table. For instance, at 100M records, the response time of a fact table aggregation (i.e. a Count(distinct…)) is only 20% of that of an aggregation made in a dimensional table.
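
In other words, at that size the Count(distinct…) in the fact table is roughly five times faster than the corresponding aggregation in a dimensional table.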

 

This is the behavior on my mock-up data, on my four-core laptop with 16 GB of RAM. If you make a similar test, you may get a somewhat different result, since the calculations depend very much on both the hardware and the data model. But I still think it is safe to say that you should not spend time avoiding the use of Count(distinct…) on a field in the fact table.

 

In fact, you should consider moving your ID into the fact table if you need to improve performance – especially if you have a large fact table.
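
If the ID currently lives in a side table, the move can be a simple join. Below is a hypothetical sketch, with [Product Table], [Product ID] and [Product Number] standing in for whatever your own model uses:

    // Copy the ID into the fact table, so that a chart can use
    // Count(distinct [Product Number]) without hopping across a table link.
    LEFT JOIN ([Fact Table])
    LOAD
        [Product ID],
        [Product Number]
    RESIDENT [Product Table];

The trade-off is a wider fact table, but the penalty of the extra column is often smaller than the gain of a faster aggregation.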

 

HIC

75 Comments

True to your word (master summit). Thank you very much for solving this interesting puzzle! This lets me focus on what's really important.

Mathias


I just had to investigate this. The assumption that Count(distinct...) should always be avoided didn't fit with my experience. It was true in earlier versions and, in all fairness, the QlikTech communication has been to avoid it. Time to change that!

HIC

MVP & Luminary

Thanks Henric!

I'm embarrassed for not testing it myself before. We've been teaching it for years and never thought to test the premise.

Now we need to reverse our message – "if you were previously taught that count(distinct) causes problems, scratch that and get rid of all the counter fields that you added".

cheers,

Oleg Troyansky

srikanthsri1
Valued Contributor

Hi Henric,

As always, another informative post...

I have a question for you. It may not fit in this context, but I'm interested to know how you were able to calculate the response time.

Are you looking at Sheet Properties -- Object -- Calculation Time after each selection,

or just using a stopwatch?

I need to do some kind of testing on one of my applications to find details like

response time

RAM used

CPU consumed

Any suggestions/tips will help me a lot.

Thanks

Sri1


I looked at the object CalcTime in Document Properties (Sheets tab), i.e. the same number you find under Sheet Properties.

For RAM usage, you can create a memory statistics app: Recipe for a Memory Statistics analysis

HIC

Partner

Hi Henric!

Thx for this great post!

Which slow single-threaded operations are left in QV11?

- Group By in the script seems to utilize only one core

- Synthetic key calculation at the end of the script (should be avoided anyway)

- Anything on the front end?

Thx,

Roland

srikanthsri1
Valued Contributor

Thanks for the response.

One more question: when it says the calculation time is 296, is that in seconds, microseconds or picoseconds?

How should we read that number?

Thanks Again

Sri

Partner

Great... It halved the calculation time of some formulas with aggr and ifs.

jdvermeire
Contributor

Henric,

I believe there may be a flaw in your analysis. Your assertion is that Count(Distinct ...) is actually faster than Count() and Sum(), but what I believe you are actually showing is that calculation times depend on the number of table associations that have to be traversed. The chart below illustrates my point:

[Image: Comparison table]

If you look at how much the response time improves just for Count(Distinct ...) between having the dimension in the dim table vs. the fact table, it's safe to assume that the jump from the fact table to the ID table has a similar effect on your Count(...) and Sum(...) response times.  What this proves is that inter-table aggregations require more resources than intra-table aggregations.  In order to give an apples-to-apples comparison, "Secondary ID Copy" and "Secondary ID Count" would need to exist in the fact table on the first instance of each unique "Secondary ID".  I would be curious to see the results in that case.


The test isn't fair. The count(distinct) never has to make the traversal to [ID Table], but the count and sum both do.

I suspect that because count(distinct) can aggregate directly on the bit-mapped index, it gets far better cache treatment than the others, which amplifies its advantage.

Still, your point that count(distinct) is not single-threaded is clear.


@ Roland Vecera

Many things in the script are single-threaded, probably also the creation of synthetic keys. In the front end, most things are multi-threaded, but not all.

For example, when a chart is calculated, there is first a step where the engine creates a link table, if the dimension and the fields in the aggregation reside in different tables. This is single-threaded. Further, if the aggregation contains fields from different data tables, the engine creates a lookup table with all combinations of the fields before it can aggregate. This is also single-threaded. Normally these two steps are done quickly, so you don't notice the single-threading. But if the second largest table is also a large table, these steps will take some time.

See the picture below from my tests. Note the CPU usage: single-threading!

[Image: CPU usage showing single-threading]

HIC


@ srikanthsri1

The charts contain the units - either milliseconds or picoseconds.

HIC


@ Jeffrey Vermeire ; @ Jay Jakosky

You are both correct that what takes time is the hop to the next table.

However, I still claim that it is a fair test: In real life, the developer has the choice of using Count(distinct ...) in the fact table, or a Count(ID) or Sum(Count) in a side table. Other options do not exist. It is not possible to put [Secondary ID Count] in the fact table and expect QlikView to make a correct calculation. Hence, a relevant comparison.

HIC


Other needs exist. You need a real-world example. Here's one:

[Dim Table] contains customers, with a Customer ID and Name. [Fact Table] is traditional sales. [ID Table] is products, containing Product ID, Product Number, Product Version, and Description.

But for this company the Product ID is a compound key of the highly unique product number and a low-cardinality Version. Most products have only one version. And a hypothetical QV chart does not care about the Version, just the Product Number.

You want the count of unique Product Numbers by customer. So now you have to do a count(distinct) that crosses into the third table.

Partner

A brave man - challenging HIC! Great to see the discussion though ... Philip


HIC, I understand your statement that there are no other options. The three expressions you gave are isomorphisms. The isomorphisms for my scenario would require a 4th table in the data model.

My problem is that your analysis charts are so dramatically in favor of count(distinct) that most readers would conclude that they should use it. But what I see is a cautionary tale about traversing table links.


@ Jay Jakosky

I totally agree that there are cases where you, for various reasons, don't want to – or maybe even cannot – move the ID into the fact table. And that's fine. Then you need to make the aggregation on the side table.

But I still say that you should consider the possibility of using Count(distinct...) on the fact table, if you have a large fact table and performance issues. Often the penalty of a larger fact table (from adding a column there) is smaller than the gain of a faster aggregation.

My main goal when writing this post was to kill the myth that you should avoid Count(distinct...) because of performance problems. Because that is a misconception.

HIC

PS You write "But what I see is a cautionary tale about traversing table links."  I can only say: Then you read it correctly!


Thanks Henric!

Partner

Which would be more efficient?

1) Stringing together your unique Product Number (excluding Version – you must be able to derive this from the full Product ID) and Customer ID as a field in the fact table, created during the script load, and doing a Count(distinct) on that, or

2) Doing a Count(distinct) on Product Number in the product table, or

3) Creating a fourth table with just Product Number and doing a Count or Sum on that (without distinct).

My guess would be (1) – fewer hops, at the slight overhead of an extra field in the fact table.

Philip


More time efficient? Number 1. But with a slight loss in clarity. "Why is this duplicated?"

For small apps, I'd choose #2. Clarity over speed. Network latency and JavaScript performance eclipse QV engine calc times until the app reaches millions of rows. Then there's a middle ground where #1 is more efficient and it is noticeable in user response times. But then, as you get into hundreds of millions of rows, #1 becomes space inefficient, so you should move back to #2.

Thanks to Henric, we can say #3 is a poor choice in every respect: speed, space, clarity and maintainability.

J

Partner

What is the effect on performance if the key field you are counting is in more than two tables?

What is the effect on the actual results the calculation returns if the exact same domain of values is not present in every table, e.g. missing reference data, extraneous reference data, or multiple transaction tables with the same key?

Partner

Hi HIC,

Bravo! Really super post.

Best Regards,

Kalyan


Jay touches on a very important issue: the trade-off between speed and clarity.

You should not always go for the fastest solution. You should only go for speed if you already have a performance problem – otherwise clarity is more important. So I would make the same choice as Jay: for small apps, you should prioritize clarity.

"Premature optimization is the root of all evil."

HIC

MVP & Luminary

Fascinating post Henric. We are going to have to revise the Developer I curriculum! It all makes great sense (counting the state array) and the principles have parallels in script processing where scalar operations against fieldvalues() or a QVD frequently outperform Resident load.

Can you post your test kit so others can reproduce?

How confident can I be that the calc time numbers are an accurate metric of "best performance" when doing testing? Is there a "sub expression" effect across objects? For example, given two objects with different expressions:

=sum(total sales)

=sum(sales) / sum(total sales)

Will total sales be calculated once and the cached result reused in the "other object"? Any way to determine which object will calculate it?

Forgive me if I've wandered off the post topic. Thanks again for your willingness to explore and explain questions such as this post.

Rob


My test kit is mock-up data. To verify (or refute) my assertion, it would make a lot more sense for people to run similar tests on data other than mine.

The calc numbers are by no means accurate. They can, however, be used as fair estimates. There is unfortunately no way to find the times for sub-expressions.

HIC

vgutkovsky
Honored Contributor II

Henric, I echo what everyone else is saying--this is definitely going to change how Developer I is taught 🙂 Thanks for the great post!

Vlad

Partner

Thanks Henric.

Another myth has been dethroned.


Partner

Wow.

Crikey.

Oops.

(Thank you.)

Employee

Unfortunately, the charts in the comments are not visible anymore ... Regards, Stefan


Hi HIC,

Very interesting post. I'd like to comment on the link tables mentioned in your post.

After playing with QV for quite some time, I must say that the most difficult thing to master is how to create a link table. You once said it is best to give users the data model they want, and frankly speaking, not many are able to create the best link table, since it depends on what results they need in their analysis.

Paul
