Skip to main content
Announcements
Accelerate Your Success: Fuel your data and AI journey with the right services, delivered by our experts. Learn More
cancel
Showing results for 
Search instead for 
Did you mean: 
CavellMok
Contributor II
Contributor II

Aggr() Optimization on Large Dataset with User Input

Hi everyone,

I'm frustrated in this case because after 1 week hard work of building the dashboard, turns out the performance is unacceptable. I'm sure there are a smarter way to do it and I can't figure it out alone. Open to any suggestions as long as it works with huge dataset.

Background:

Main purpose of the dashboard is to allow user to quickly identify bulk shipment (received or shipped) to designated countries/for designated type. User can input quantity that they want to check(highlighted in yellow in above picture) for specific country/type.

2 datasets - Customer, Transaction

Sample Data.PNG

 

If user input 200 in USA quantity, it will only show the customer who received >= 200 on that day(group by date,cust_id). In this case, customer 1001 will be filter out because received from USA = 0 which is less than 200. If user input -10 in USA quantity, 1001 will be shown because total shipped is 149 which pass 10.

I already developed a solution but it doesn't work with large volume data.

My dataset volume: Customer-3M rows, Transaction-3M rows

My Solution(doesn't work with huge dataset)

vCountryUSA : USA quantity input by user (default is 0)

vCountryUSAFlag

aggr(nodistinct {1}
 if( $(vCountryUSA) > 0,//User input value greater than 0
     if(ttl_received_qty= 0 and ttl_shipped_qty <> 0,1,//If no inflow txn, not pass (set = 1) 
    if(in_out = 'Received' and country = 'USA',
                  if(sum(Fund) >= $(vCountryUSA), 0,1)//pass threshold = 0
                  )
   )   
 ,
 if( $(vCountryUSA) < 0,//else if
     if((ttl_received_qty <> 0) and ttl_shipped_qty = 0,1,//If no outflow txn, not pass (set = 1)    
    if(in_out = 'Shipped' and country = 'USA',
                  if(sum(Fund) <= $(vCountryUSA), 0,1)//pass threshold = 0
                  )                
   )
      ,0 )//if user input = 0
 )
,date, cust_id, in_out, country)

 

Create the same Flag for Vietnam(vCountryVietnamFlag )

vCountryFinalFlag

aggr(nodistinct {1}
if(
If(NumericCount($(vCountryVietnamFlag)), Sum($(vCountryVietnamFlag)), NULL()) = 0 
and If(NumericCount($(vCountryUSAFlag)), Sum($(vCountryUSAFlag)), NULL()) = 0 
                   ,0,1)
,date, cust_id)

Repeat the same for Type

vFinalFlag

Aggr(
 if(Sum($(vCountryFinalFlag)) > 0
  or Sum($(vTypeFinalFlag))> 0
    , 1,0)
,date, cust_id)

In the sheet, I put this expression, so it only shows the customer who meet the requirement

Sum({< cust_id = {"=sum($(vFinalFlag)) = 0"}>}Fund)

Thanks!

 

 

21 Replies
CavellMok
Contributor II
Contributor II
Author

As the aggr doesn't work in large data, I also tried to use set analysis. However, I encounter below problem.

a.png

 

Assume user input 10 in Vietnam. Expected to show 1001, 1002 and 1004 because 170,80,50 > 10.

Now using the set analysis, no customer shows. Take customer 1001 as example, let's look at the first part in the set analysis highlighted in blue below. There is no shipped record to China for 1001, so Sum({<country = {'China'},in_out = {'Shipped'}>}qty is null. When compare to vCountryChina(0), it is false. I'm stuck here. (I tried to replace null to zero but not success)

Sum({<[transaction.cust_id] = {"=Sum({<country = {'China'},in_out = {'Shipped'}>}qty) <= $(vCountryChina)"}>
* <[transaction.cust_id] = {"=Sum({<country = {'USA'},in_out = {'Shipped'}>}qty) <= $(vCountryUSA) "} >
* <[transaction.cust_id] = {"=Sum({<country = {'Vietnam'},in_out = {'Shipped'}>}qty) <= $(vCountryVietnam) "} >
} qty)

Also, attached qvf with set analysis for your understanding. Thanks a lot!

sunny_talwar

So, since you entered a value in Vietnam, do you only want to see rows with Vietnam? or do you need to show other countries as well? Also, you mentioned absolute value, but your expressions are not using fabs() function to change the Sum() from negative or positive to positive.

Also, you want to check the sum at date and customer_id level?

Also, 1001 have two shipped for USA which sum to -149... USA is set to 0... why would you want to see 1001 customer?

I feel like I am missing something... can you share the aggr() expression which works slow, but gives the right output.

CavellMok
Contributor II
Contributor II
Author

The qvf I shared is showing the aggr which provide the correct but doesn’t work for huge data set solution. When compare to user input value, the concept is compare the number only regardless of the + or - . For example, if user input 10, total received qty of that person has to >= 10. if user input -10, total shipped qty of that person has to <= -10. I didn’t use abs to do it. I use IF() to check, i.e. if $(vCountryVietnam) > 0 then total received qty has to be >= 0;if $(vCountryVietnam) < 0 then total shipped qty has to be <= 0 .

when User input 10 in Vietnam. It means he wants to only see the customer who has total received qty from Vietnam >= 10 (group by date, customer ). For the same example I used in the previous post, you can see aggr shows 3 customers but set analysis show nothing.

1001 USA is -149 and user input is 0. -149 is <=0. That’s why this customer should shows

 

In the mean time, I've rewrite the Aggr. This is a bit faster, can work with 1.3M but still not 3M data

vCountryUSAFlag

Aggr(Nodistinct {1}
     If( $(vCountryUSA) = 0,0,
      If( country = 'USA',
   If( $(vCountryUSA) > 0,
      If(in_out = 'Received' ,
                    If(Sum(qty) >= $(vCountryUSA), 0,1)
                    )      
            ,
 
    if(in_out = 'Shipped' ,
                  if(Sum(qty) <= $(vCountryUSA), 0,1)
    )               

   )
  )
 )
  
,date, cust_id, in_out, country)

marcus_sommer

I have the impression that you ran in the wrong direction and that you should rethink your datamodel because I doubt that's really suitable for your wanted measures. 3 M of records mean that's a rather small application and if it's (nearly) not capable to run that the whole approach is inappropriate.

The best compromise by a datamodel is usually a star-scheme but if the UI is too slow you should develop it in the direction of a single table (at least all fields which are included in such nested if-loops / aggr constructs should come from one table). Further every logic which could be (pre-)calculated within the script should be done there and may replace one or the other of the if-loops and/or simplify them.

- Marcus

CavellMok
Contributor II
Contributor II
Author

Hi Marcus,

Thanks for your suggestion! I've never thought about data model. My Real transaction dataset I think is already in start schema. Date,Cust_id,In_out,Country,Type,Transaction_category,Transaction_group, Transaction_Type,Quantity. And except quantity, all others fields are keys. What's your opinion on this?

 

marcus_sommer

Only with these information it's difficult to estimate if your datamodel is suitable or not. I believe that nearly all fields are KEY's is rather adverse in regard of performance. I think a screenshot of your datamodel would helpful.

- Marcus

sunny_talwar

I am seeing no Aggr() expression in the attached qvf file... 

CavellMok
Contributor II
Contributor II
Author

Aggr() are in the variables

sunny_talwar

But where on the sheet are you using those variables?

CavellMok
Contributor II
Contributor II
Author

data model.png