AlexPolorotov
Partner Ambassador

Count distinct clients in dynamically changing groups (combination of values from 7 dimensions)

Table "client contracts". 
 
Total number of rows: 3 billion
Limited RAM: 256GB

Field name          Field type   Unique values
clientId            key          10 mln
bankId              key          100
date                dimension
flag_1              dimension    2
flag_2              dimension    3
flag_3              dimension    3
currency            dimension    3
product             dimension    8
region              dimension    20
totalamount         measure
outstanding amount  measure

One row is a version of a client contract; a client can have contracts with different banks, currencies and products at the same time.

Users can select any value in any dimension and see the trend of the number of unique clients (Count distinct clientId).
Scenarios: 
1. Users want to see the number of unique clients. Selected period: 1Q2019 - 3Q2019, product: Mortgage, bankIds: 1, 2, 3.
2. Users want to see the trend of the number of unique clients. Selected product: Mortgage Student, flag_1: yes.

Example: 
1. User selected Bank X - 50 unique clients. 
2. User selected Bank Y - 50 unique clients. 
3. User selected Banks X and Y - only 30 unique clients, because distinct counts are not additive. And this is with only one group-by field (bank).

As you can see, the number of possible selection combinations is enormous, and the COUNT DISTINCT calculation takes too long for each selection state.
That is why our idea is to pre-aggregate the data, save it in QVDs, connect it to the fact table and look up a precalculated number of unique clients for each possible selection state.
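
For illustration only, a minimal sketch of such a pre-aggregation step might look like this (table, field and file names here are assumptions, not the actual model):

// Hypothetical pre-aggregation: one QVD per grouping combination,
// storing a precalculated distinct client count.
AggByBankProduct:
LOAD
    bankId,
    product,
    Count(DISTINCT clientId) AS UniqueClients
FROM ClientContracts.qvd (qvd)
GROUP BY bankId, product;

STORE AggByBankProduct INTO AggByBankProduct.qvd (qvd);
DROP TABLE AggByBankProduct;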

But what would be the best data model in our case?
What is the best way to connect the fact table with many aggregated tables of mixed granularity (generic keys etc.)?
How can we quickly calculate Count distinct for each selection state in the load script (only by hand?)

2 Replies
Vegar
MVP

My big concern with your suggested solution is the enormous number of possible selection combinations that you would need to create. As an alternative, have you considered reducing the granularity of the whole application, i.e. aggregating the whole application on the fields you list in your table? If there is a further need to examine details, let the users generate them themselves by offering On-Demand apps (ODAG) to the end users.
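
A rough sketch of what such an aggregation could look like (field and file names are assumptions); grouping on all the listed fields collapses the contract versions while still keeping clientId available for counting:

// Hypothetical reduced-granularity load: collapse contract versions
// by grouping on every listed dimension plus clientId.
AggContracts:
LOAD
    clientId, bankId, date, flag_1, flag_2, flag_3, currency, product, region,
    Sum(totalamount)          AS totalamount,
    Sum([outstanding amount]) AS [outstanding amount]
FROM ClientContracts.qvd (qvd)
GROUP BY clientId, bankId, date, flag_1, flag_2, flag_3, currency, product, region;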

Another consideration (it might not help you all the way, due to the large number of transactions) is to avoid count distinct in favour of sum(). Add a client dimension table to your data model (one record per clientId) and add a field [No of Clients] with the value 1 on every row. Then use SUM([No of Clients]) in the expression.
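
A minimal sketch of that idea, assuming the fact table is called ClientContracts:

// One row per clientId, so Sum([No of Clients]) over this table
// equals the number of distinct clients in the current selection.
Clients:
LOAD DISTINCT
    clientId,
    1 AS [No of Clients]
RESIDENT ClientContracts;

// Chart expression: Sum([No of Clients]) instead of Count(DISTINCT clientId)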


marcus_sommer

I doubt that you could connect the fact table with multiple aggregated tables - at least not with any performance improvement (I assume it would rather be the opposite), although I have never tried anything like this. There is a good blog post from HIC within the community about using mixed granularities of data, and it relates to a single fact table: the fact table contains detailed data as well as aggregated data. That approach is aimed mainly at aggregating measures, not at counting a key, and at reducing the number of records by keeping the detailed data only for the last n periods while the earlier ones are aggregated on a monthly and/or yearly level, or something similar.

Maybe you could apply it in some way. Otherwise, the above-mentioned approach would add more records, although it might not take longer to calculate with them, because through an appropriate flag field you could refer directly to the aggregated parts.
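
As a hedged sketch of that mixed-granularity idea (table, field and file names are assumed), detailed and aggregated records live in one fact table and a flag field lets expressions target only the aggregated part:

Facts:
LOAD *, 0 AS IsAggregated
FROM ContractsRecent.qvd (qvd);     // detailed data, last n periods only

CONCATENATE (Facts)
LOAD
    bankId, product, date,
    Sum(totalamount) AS totalamount,
    1 AS IsAggregated
FROM ContractsHistory.qvd (qvd)     // older data, aggregated per day/bank/product
GROUP BY bankId, product, date;

// Expression that reads only the aggregated rows:
// Sum({< IsAggregated = {1} >} totalamount)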

To get back to your approach: you may try it as a second and independent data model, although IMO two specialized applications would be more suitable here.

Another way which might improve the performance is to merge all related fields - those used in your expressions and as dimensions within your objects - into the fact table. Usually the star schema is the best compromise for a data model in regard to effort, handling and performance, but with larger datasets a big fact table should perform better within the UI. The essential point here is that creating the virtual table underneath each object (it returns the dimensional context on which the expressions are performed) is the most expensive calculation within the UI, because this step is single-threaded.
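
For example (table and field names are assumptions), dimension attributes could be joined straight into the fact table so the UI objects only ever touch one table:

// Hypothetical denormalization: move a dimension table's attributes
// into the fact table, then drop the dimension table.
LEFT JOIN (Facts)
LOAD bankId, bankName, bankRegion
RESIDENT Banks;

DROP TABLE Banks;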

You mentioned only the requirement of selecting from various fields. If this is really the essential point - and not that many or all fields are also used as dimensions within the objects - you may consider merging some of the dimensions. Apart from date, there are 6 dimensions with around 10,000 combinations (2 * 3 * 3 * 3 * 8 * 20 = 8,640). You could use autonumber() to create an appropriate unique value within the fact table and also within an extra island table. The user selections (on this dimensional subset) then go through the island table, and you could refer to them within your expressions, maybe with something like:

count({< MergedFlag = p(MergedFlagFromIsland)>} distinct clientId)
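
A sketch of how such a merged key and island table could be built (the key names are chosen to match the expression above; everything else is an assumption):

Facts:
LOAD *,
    AutoNumber(flag_1 & '|' & flag_2 & '|' & flag_3 & '|' &
               currency & '|' & product & '|' & region, 'Merged') AS MergedFlag
FROM ClientContracts.qvd (qvd);

// Island table: shares no key with Facts and is used only for the selections.
SelectionIsland:
LOAD DISTINCT
    flag_1   AS Sel_flag_1,
    flag_2   AS Sel_flag_2,
    flag_3   AS Sel_flag_3,
    currency AS Sel_currency,
    product  AS Sel_product,
    region   AS Sel_region,
    AutoNumber(flag_1 & '|' & flag_2 & '|' & flag_3 & '|' &
               currency & '|' & product & '|' & region, 'Merged') AS MergedFlagFromIsland
RESIDENT Facts;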

I use such island tables quite often, though not ones this large and not to address performance challenges but to simplify usability; still, I could imagine that if the selections really are the essential point, it may improve the performance.

Further, you may consider setting calculation conditions - maybe with an on/off button - to avoid calculating the various intermediate selection steps.
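
For example, the heavy chart's calculation condition could check a variable (hypothetical name vCalcOn) that a button toggles between 0 and 1:

// Define a default in the load script (hypothetical variable name):
SET vCalcOn = 0;

// In the UI, a button or input box toggles vCalcOn between 0 and 1,
// and the chart's calculation condition is set to:  =vCalcOn = 1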

Besides all this: how long do the calculations currently take, and what are your expectations?

- Marcus