Qlik Community

Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Employee
Employee

Count or Count distinct?

When counting something in QlikView, you should use the Count() aggregation function. But – should you use Count or Count distinct?

Although very similar, the two constructions return completely different results. And often the wrong one is used…

So, let me start by clarifying what the two constructions really count:

  • Count( <expression> ) counts the number of records in the data.
  • Count( distinct <expression> ) counts the number of distinct values of the expression.

Only instances where the expression isn't NULL are included. Further, if the expression contains references to more than one field, an n-tuple (a temporary table) is created from the cross product of the constituent fields, and the count is performed in the n-tuple instead of in the data table.

This means that in the common case – when the <expression> is a simple field reference – the Count(…) is equivalent to the number of rows in the data table, whereas the Count(distinct …) is equivalent to the number of rows in the symbol table. Read more about data tables and symbol tables here.

An example: You have a fact table and you want to count the number of orders. Then Count(OrderID) will return the number of records in the fact table, which often is a number larger than the number of orders, since one order usually can have several order lines. In this case you probably want to count the number of unique Order IDs, and should most likely choose Count(distinct OrderID) instead.

So, Count(OrderID) does not count the orders!

Image2.png

Another trap I have seen people fall into, is when a key field is used as parameter in the Count() function. If the distinct clause is used, there is no problem: The number of distinct values is well defined also for key fields. But if you omit the distinct clause, the number is not well defined: In which data table should QlikView count the number of records? There are several possible source tables, since the field is a key.

The Count(Key) will however return a number, but this number is not always the number you want. Sometimes it will be the count from the "wrong" table. So I would recommend that you never use a Count() function on a key field, unless you use the distinct clause.

Otherwise, the work-around for using the Count() function (without the distinct clause) on a key field, is the create a copy of this field in the table where you want to count the records, and use this field inside the Count() instead of the key field.

Bottom line: For key fields, either use the distinct clause, or the above work-around.

HIC

See also A myth about Count distinct

5 Comments
Not applicable

Otherwise, the work-around for using the Count() function (without the distinct clause) on a key field, is the create a copy of this field in the table where you want to count the records, and use this field inside the Count() instead of the key field.

If you do this in a large data model be very careful, as QV will not be able to compress the data (all values will be unique by virtue of the fact the field is a key), so this is likely to impact your RAM and storage requirements dramatically.

78 Views
martinpedersen
Contributor

Nice article.

My experience is that you will get better performace, if you create a count-field direct in the load script. For instance if you have a distinct OrderTable add the  "1 as OrderIdCount" and then you can use the expression "Sum(OrderIdCount)" in your objects.

0 Likes
78 Views
Employee
Employee

@ Graeme Smith If the key values are unique, then you don't need the work-around. Just use Count(distinct Key). The work-around is only necessary if the key values are not unique.

@ Martin Pedersen My experience is the opposite. I tested exactly this for performance. Read more about my investigations on A Myth About Count(distinct …).

HIC

78 Views
martinpedersen
Contributor

Hi Henric,

I think it depends on the complexity on your orderID-keys. I got a case where the "OrderIds" are very long text keys and my performance seems to improve a lot when i use a sum(Count-Field) instead - but perhaps i could use a autonumber of the OrderId instead and get even better performace...

0 Likes
78 Views
Not applicable

Henric Cronström - point taken.  Still, in the context of the OrderID example you gave, there could still potentially be a large number of unique OrderID's in the orders table (ultimately depends on the nature of the orders and average number of order line items).  Duplicating this field could still have significant impact on RAM and storage requirements in larger data models, and I still caution people to think carefully about duplicating fields with large numbers of discrete values (irrespective of whether they are unique or not!).

0 Likes
78 Views