A Myth About Count(distinct …) - Qlik Community

hic · 2013-10-22

Do you belong to the group of people who think that Count(distinct…) in a chart is a slow, single-threaded operation that should be avoided?

If so, I can tell you that you are wrong.

Well, it used to be single-threaded and slow, but that was long ago. It was fixed as early as version 9, I think, but the rumor of its slowness lives on like an urban myth that refuses to die. Today the calculation is multi-threaded and optimized.

To prove that Count(distinct…) is faster than many people think, I constructed a test which categorically shows that it is not slower. In fact, it is a lot faster than the alternative solutions.

I created a data model with a very large fact table, in five sizes: 1M, 3M, 10M, 30M and 100M records. In it, I created a secondary key with a large number of distinct values, in three variants: 1%, 0.1% and 0.01% of the number of records in the fact table.

The goal was to count the number of distinct values of the secondary key when making a selection. There are several ways that this can be done:

Use count distinct in the fact table: Count(distinct [Secondary ID])
Use count on a second table that just contains the unique IDs: Count([Secondary ID Copy])
Use sum on a field that just contains ‘1’ in the second table: Sum([Secondary ID Count])
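To make the three alternatives concrete, here is a minimal Python sketch of what each expression computes. It is not QlikView code and says nothing about engine performance; the tiny tables and the helper names (`possible_ids`, `count_distinct_in_fact`, and so on) are hypothetical and only mimic the field names used above.

```python
# Hypothetical mock-up of the data model; field names follow the post.
fact_table = [
    {"Dim ID": "A", "Secondary ID": 1},
    {"Dim ID": "A", "Secondary ID": 1},
    {"Dim ID": "B", "Secondary ID": 2},
    {"Dim ID": "B", "Secondary ID": 3},
    {"Dim ID": "C", "Secondary ID": 3},
]

# Second table: one row per unique ID, with a copy of the ID and a constant 1.
id_table = [
    {"Secondary ID": sid, "Secondary ID Copy": sid, "Secondary ID Count": 1}
    for sid in sorted({row["Secondary ID"] for row in fact_table})
]


def possible_ids(selected_dims):
    """IDs still associated with the fact rows that survive the selection."""
    return {r["Secondary ID"] for r in fact_table if r["Dim ID"] in selected_dims}


def count_distinct_in_fact(selected_dims):
    # Analogue of Count(distinct [Secondary ID]): stays inside the fact table.
    return len(possible_ids(selected_dims))


def count_in_id_table(selected_dims):
    # Analogue of Count([Secondary ID Copy]): counts non-null rows in the ID table.
    ids = possible_ids(selected_dims)
    return sum(
        1
        for r in id_table
        if r["Secondary ID"] in ids and r["Secondary ID Copy"] is not None
    )


def sum_in_id_table(selected_dims):
    # Analogue of Sum([Secondary ID Count]): adds up the constant 1 per ID.
    ids = possible_ids(selected_dims)
    return sum(r["Secondary ID Count"] for r in id_table if r["Secondary ID"] in ids)
```

All three functions return the same number for any selection. The sketch shows only that the expressions are interchangeable in result, which is why response time is the deciding factor.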

I also created a dimension (“Dim” in the “Dim Table”) with 26 values, also randomly assigned to the data in the fact table. Then I recorded the response times for three charts, each using “Dim” as dimension and one of the three expressions above. I did this for four different selections.

Then I repeated all measurements using “Dim ID” as dimension, i.e. I also moved the dimension to the fact table. Finally, I loaded all the recorded data into QlikView and analyzed it.

The first obvious result is that the response time increases with the number of records in the fact table. This is hardly surprising…

…so I need to compensate for this: I divide the response times by the number of fact table records and get a normalized response time in picoseconds:
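As a rough illustration of this normalization, the sketch below divides a response time by the record count and converts the result to picoseconds. It assumes the recorded times are in milliseconds, which the post does not state; the function name and the sample numbers are hypothetical, not measured values.

```python
PICOSECONDS_PER_MILLISECOND = 1_000_000_000


def normalized_response_ps(calc_time_ms, fact_rows):
    """Response time per fact table record, in picoseconds.

    Assumes calc_time_ms is in milliseconds (an assumption, not from the post).
    """
    if fact_rows <= 0:
        raise ValueError("fact_rows must be positive")
    return calc_time_ms * PICOSECONDS_PER_MILLISECOND / fact_rows


# Hypothetical example: 250 ms on a 100M-record fact table.
print(normalized_response_ps(250, 100_000_000))  # 2500.0
```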

This graph is extremely interesting. It clearly shows that if I use a Count(distinct…) on the fact table, I get a response time that is considerably shorter than if I make a count or a sum in a dimensional table. The table below shows the numbers.

Finally, I calculated the ratios between the response times for having the dimension in the fact table vs. in a dimensional table, and the same ratio for making the aggregation in the fact table vs. in a dimensional table.

This graph shows the relative response time I get by moving the dimension or the aggregation into the fact table. For instance, at 100M records, the response time from a fact table aggregation (i.e. a Count(distinct…)) is only 20% of an aggregation that is made in a dimensional table.
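The ratio itself is a simple division. The sketch below shows the arithmetic with made-up numbers chosen to reproduce the 20% figure; the function name and the inputs are hypothetical.

```python
def relative_response_time(fact_table_time, dim_table_time):
    """Fact-table response time as a fraction of the dimensional-table one."""
    if dim_table_time <= 0:
        raise ValueError("dim_table_time must be positive")
    return fact_table_time / dim_table_time


# Hypothetical times in the same unit: 200 in the fact table, 1000 in the dimensional table.
ratio = relative_response_time(200, 1000)
print(f"{ratio:.0%}")  # 20%
```

A value below 1 means the fact table variant is faster. Both times must be in the same unit and measured at the same fact table size.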

This is the behavior on my mock-up data on my four-core laptop with 16GB. If you make a similar test, you may get a slightly different result since the calculations depend very much on both hardware and the data model. But I still think it is safe to say that you should not spend time avoiding the use of Count(distinct…) on a field in the fact table.

In fact, you should consider moving your ID to the fact table if you need to improve the performance. Especially if you have a large fact table.

HIC

2013-10-22

True to your word (master summit). Thank you very much for solving this interesting puzzle! This lets me focus on what's really important.

Mathias

hic · 2013-10-22

I just had to investigate this. The assumption that Count(distinct...) should always be avoided didn't fit with my experience. It was true in earlier versions and, in all fairness, the QlikTech communication has been to avoid it. Time to change that!

HIC

Oleg_Troyansky · 2013-10-22

Thanks Henric!

I'm embarrassed that I never tested it myself. We've been teaching it for years and never thought to test the premise.

Now we need to reverse our message: "if you were previously taught that Count(distinct ...) causes problems, scratch that and get rid of all the counter fields that you added".

cheers,

Oleg Troyansky

Anonymous · 2013-10-22

Hi Henric,

As always, another informative post.

I have a question for you. It may not fit in this context, but I'm interested to know how you were able to calculate the response time.

Are you looking at the calculation time per object in Sheet properties after each selection, or just using a stopwatch?

I need to do this kind of testing for one of my applications, to find out details like:

response time

RAM used

CPU consumed

Any suggestions or tips will help me a lot.

Thanks

Sri1

hic · 2013-10-22

I looked at the object CalcTime in Document properties (sheet tab), i.e. the same number you find under Sheet properties.

For RAM usage, you can create a memory statistics app: Recipe for a Memory Statistics analysis

HIC

Anonymous · 2013-10-22

Hi Henric!

Thx for this great post!

Which slow single-threaded operations are left in QV11?

- Group By in the script seems to utilize only one core

- Synthetic key calculation at the end of the script (should be avoided anyway)

- Anything on the frontend?

Thx,

Roland

Anonymous · 2013-10-22

Thanks for the response.

One more question: when it says the calculation time is 296, is that in seconds, microseconds or picoseconds?

How should we read that number?

Thanks Again

Sri

Carlos_Reyes · 2013-10-22

Great... It halved the calculation time of some formulas with aggr and ifs.

Anonymous · 2013-10-22

Henric,

I believe that there may be a flaw in your analysis. Your assertion is that Count(Distinct ...) is actually faster than Count() and Sum(), but I believe that what you are actually showing is that calculation time depends on the number of table associations that have to be traversed. The chart below illustrates my point:

If you look at how much the response time improves just for Count(Distinct ...) between having the dimension in the dim table vs. the fact table, it's safe to assume that the jump from the fact table to the ID table has a similar effect on your Count(...) and Sum(...) response times. What this proves is that inter-table aggregations require more resources than intra-table aggregations. In order to give an apples-to-apples comparison, "Secondary ID Copy" and "Secondary ID Count" would need to exist in the fact table on the first instance of each unique "Secondary ID". I would be curious to see the results in that case.

2013-10-22

The test isn't fair. The count(distinct) never has to make the traversal to [ID Table], but the count and sum both do.

I suspect that because count(distinct) can aggregate directly on the bit-mapped index, it gets far better cache treatment than the others, which amplifies its advantage.

Still, your point that count(distinct) is not single-threaded is clear.