Skip to main content
Announcements
Qlik Connect 2025: 3 days of full immersion in data, analytics, and AI. May 13-15 | Orlando, FL: Learn More
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Data model/distinct count challenge

Hello,

my requirement is to be able to compute a distinct count of Account Numbers based on any selections the user makes.

Doing a distinct count worked OK with a small data set. The challenge comes when we increased the data volume (now being over 400 million records) which then causes performance issues.

We tried aggregating the data on the database side by computing the distinct count in the database table at the various combinations of all the dimensions required ( ex: at division,network, time of day etc.) , but the problem with that approach is the fact that you can’t sum up a distinct count in the dashboard if the users wanted to see a distinct count at a higher level of granularity because that would result in double counting (or more) the Account Numbers i.e one Account Number can be in multiple divisions, networks, time of day etc.

Has anyone has any experience with similar requirements? Any thoughts on how we can accomplish this whether it involves a different data model or some type of data manipulation we can do in the script or the dashboard?

Thanks, Maria

5 Replies
Not applicable
Author

Hi Maria,

you can try making an order by of your table and adding a flag that says if the previous row has the same value, then in your expression you just have to make a sum of your flag; the code should be something like this:

load

     fieldA,

     fieldB,

     fieldToCount,

     if(peek('fieldToCount') = fieldToCount,0,1) as Flag

from...

order by fieldToCount Asc;

after this, the expression should be: sum(Flag)

Hope this helps

Regards!

rwunderlich
Partner Ambassador/MVP
Partner Ambassador/MVP

Create an AccountNumberCount Table with two fields:

AccountNumber -- all the DISTINCT AccountNumbers in your data.

AccountNumberCounter = 1

AccountNumber, whether in a dimension or a fact, will point to AccountNumber in the Table. Then sum(AccountNumberCounter) will give you the correct count.

Depending on your model, this may create a loop. You can fix that by using a different link field for each table to the AccountNumberCountTable. Use autonumber() for the links and they won't take up much space.

-Rob

http://robwunderlich.com

Not applicable
Author

Rob,

this is a good idea but you are assuming that we are keeping the Account Number in the main table.

We want to aggregate the data because the volume is getting too big. If we aggregate the data, we will loose the Account Number, but still want to be able to compute the distinct count of Account Number at the various levels.

Any ideas?

Maria

johnw
Champion III
Champion III

I would bet that the simplest and smallest data model required to support your requirement is one in which you haven't aggregated your data by dropping the account numbers.  In other words, I would bet that what you want to do is impossible.  I can't think how to arrange a proof, though, so I could be wrong.

Gabriela, if I understood your solution, I see a problem with it.  Let's take a simple example:

AccountType, AccountNumber
A, 1
A, 2
A, 3
B, 2
B, 3
B, 4
C, 3
C, 4
C, 5

Your solution would set the flag like this:

AccountType, AccountNumber, Flag
A, 1, 1
A, 2, 1
B, 2, 0
A, 3, 1
B, 3, 0
C, 3, 0
B, 4, 1
C, 4, 0
C, 5, 1

And then aggregation with sum() to get rid of our account numbers would give us this:

AccountType, SumFlag
A, 3
B, 1
C, 1

Select B and C.  How many distinct account numbers?  The right answer is 4, but this data model tells us 2.

rwunderlich
Partner Ambassador/MVP
Partner Ambassador/MVP

Maria,

I don't think my solution requires keeping transaction level tables in the model. What I'm suggesting is building up the AccountNumberCount table as you do the aggregation  Assume a number of transactions, each with an AccountNumber, Region, Product  and Amount. Rollup will be sum(Amount) by Region. Build a link table that links each Region to every AccountNumber in the AccountNumberCount table that appears in that Region. Links would also be built for other rollups such as Product.

So selecting a Region will associate a certain number of rows in the AccountNumberCount table. Selecting products sold in that Region would further narrow the rows in AccountNumberCount. At any given point of selections, =sum(AccountNumberCounter) should give the correct count of AccountNumbers.

-Rob