Not applicable

aggregate count distinct - what is the optimal architecture?

I want to calculate the average number of counterparties a customer had in a given quarter. Data rows have transaction_id, cust_id, customer_counterparty_id, and quarter_id (actually, transactions have a date and I use a calendar table to map date -> quarter).

In SQL that would be something like avg(count(distinct customer_counterparty) over (partition by cust_id, quarter_id)).

This is what I have in QlikView, and it is slow, a minute or longer:

avg ( aggr( DISTINCT count(DISTINCT customer_counterparty),  cust_id,quarter_id ))

Is there a tip on how I can speed up this calculation?

I have already tried preparing a help table with the DISTINCT (customer, customer_counterparty) pairs during the load and using sum instead of a count distinct, but this did not bring much. It seems the aggr part is the slow one, and I notice that the QV calculation runs at only about 10% overall CPU (12 cores out of 30).
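Roughly, the helper table looked something like this (a simplified sketch; the table name Counterparty_pairs and the counterparty_flag field are just what I am calling them here):

Counterparty_pairs:
LOAD DISTINCT
    cust_id,
    customer_counterparty_id,
    1 AS counterparty_flag
RESIDENT Transaction_table;
// in the real app the link back to the fact table needs care,
// otherwise this creates a synthetic key on cust_id + customer_counterparty_id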

The point of this chart is to be able to use the usual QV functionality and filter counterparties on different criteria, so I cannot just pre-calculate the count(DISTINCT customer_counterparty) per cust_id, quarter_id in the script.

BTW, if I get it right, the first DISTINCT is just for clarity (aggr is distinct by default), and the second DISTINCT is because I want to count distinct counterparties.

5 Replies
vgutkovsky
Master II

You're correct that the first DISTINCT doesn't do anything, as aggr() is distinct by default. Please post a screenshot of your table viewer, along with the approximate number of rows in each table.

Regards,

Vlad

Not applicable
Author

To give you a clear, isolated picture I would need to create a new app with only two tables (right now there are tens of them).

But only two tables are actually involved in the calculation, so hopefully this will do:

Transaction_table: (transaction_id, cust_id, customer_counterparty_id, transaction_date), about 60 million rows

and Calendar_table: (transaction_date, quarter_id), about 2 thousand rows.
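In load-script terms, the relevant part of the model is roughly this (only a sketch; the FROM sources are placeholders):

Transaction_table:
LOAD
    transaction_id,
    cust_id,
    customer_counterparty_id,
    transaction_date
FROM transactions.qvd (qvd);

Calendar_table:
LOAD
    transaction_date,
    quarter_id
FROM calendar.qvd (qvd);

The two tables associate on transaction_date, so quarter_id sits one link away from the transaction table that the aggr runs over.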

vgutkovsky
Master II

So are you counting counterparty or counterparty_id? For a transaction table this large, you do incur a cost if you need to go across a table link to another table. Try to make sure that all fields you will use in your aggr reside in the same table. Also, what kind of hardware are you running? You would need a pretty powerful machine to handle a data set like this, so if yours isn't, I'm not surprised it's performing slowly.

Regards,

Vlad

Not applicable
Author

Vlad Gutkovsky wrote:

So are you counting counterparty or counterparty_id?

Sorry, it was a typo; the field is called customer_counterparty, and that is what I meant by both counterparty and counterparty_id.

For a transaction table this large, you do incur a cost if you need to go across a table link to another table. Try to make sure that all fields that you will use in your aggr reside in the same table.

Hm, this may be the case, but I have many other calculations where I use linked tables across multiple link legs. I have powerful hardware, and this is the only calculation that takes more than a few seconds.

Do you suggest adding quarter_id to the Transactions table at the loading stage (in the script)?

I am trying to avoid adding calculated columns to that big table, as that slows down the loads significantly too.
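If it comes to that, one thing I could try is a mapping load instead of a join, which I understand is cheaper than joining onto the 60-million-row table (just a sketch; the load order and FROM source are assumptions):

QuarterMap:
MAPPING LOAD
    transaction_date,
    quarter_id
RESIDENT Calendar_table;   // assumes Calendar_table is already loaded

Transaction_table:
LOAD
    transaction_id,
    cust_id,
    customer_counterparty_id,
    transaction_date,
    ApplyMap('QuarterMap', transaction_date) AS quarter_id   // brings quarter_id into the fact table so the aggr stays in one table
FROM transactions.qvd (qvd);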

vgutkovsky
Master II

The quarter thing is worth a shot; you may see some performance improvement. Other expressions aren't as heavy as an aggr, so that could be why you're seeing worse performance on this one. One more question: where are you using this expression, in a chart or a text object? If in a chart, what are the chart dimensions and what type of chart is it? FYI, the worst chart in terms of performance is a pivot table...

Vlad