3 Replies Latest reply: Apr 25, 2014 8:59 AM by Mike Swinn RSS

    Median of Aggregate to Consider Nulls/Missing Data

      I am looking to produce a chart showing median sales by customer for each month since their signup. My data is at transaction level and I have coded a field for the number of months within the load script.




      Customer          months_since_signup          sale_value

      A                         1                                             5.50

      A                         1                                             4.30

      A                         1                                             6.50

      B                         1                                             3.60

      A                         2                                             5.40

      A                         4                                             6.50

      B                         5                                             7.00


      I started by setting months_since_sale as a dimension and an expression similar to the following:


           median(aggr(sum(sale_value, customer_ID)))


      This does give median sales but only across customers who have a sale in a given month. It does not consider customers who have no sales since there are no data points in the first place. The result is therefore meaningless since customers who drop off over time are ignored.


      So for example if I have 5 customers, with sales in their second month of:


           5, 4, 3, (nothing), (nothing)


      Then the expression would produce a 'median' of 4 since it only considered the first three values. I want it to take the median of


           5, 4, 3, 0, 0

      ...and produce 3 as the result.



      What I'd like to do is somehow use an intermediate table which contains a record for every customer in every month, ideally within my expression rather than data load. I'm not sure this is possible with the aggr() function.


      The mean is simple by working it differently as:


           sum(sale_value) / count(distinct total customer_ID)


      But I can't use this method for median.


      Any ideas? I'm thinking a detatched dimension may be required but I want to keep it as simple as possible.