Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
I need to aggregate my data to decrease the size of it, but I also need to calculate the average of these aggregated values afterwards. If you aggregate the data first, then calculate the average afterwards (aggregating an aggregation of a value) then the average will be different (sometimes very different) compared to averaging the unaggregated data.
I was wondering if anyone has any suggestions or links to some resources that discuss techniques that can be used to decrease the variance between the two averages when you have to aggregate data? My maths and statistics knowledge isn't great. I assume some kind of weighted averaging might help this, or maybe there are more advanced techniques. Thanks.
Say, you want to aggregate 'Amount' monthly instead of detailed date wise data, and you need the average also. then what i would do is - simply aggregate the data monthly with two operations SUM and AVG at the time of aggregation simulteneously during load. Like:
LOAD Month, Sum(Amount) as SumofAmount, Avg(Amount) as AmountAvegare
FROM.....Group By Month;
Regards, tresesco
An AVG aggregation does get you closer than SUM but it is still pretty far out. You can see from this very basic example attached the differences when just using simple AVG or SUM aggregation then averaging the result.
This is obvious. Sum and Average can't be carried together with aggregation again and again. Because the aggregation entity-count varries in different level. I did not understand your requirement properly. In that case (as you want), i believe you have to mantain a count(rather DISTINCT COUNT) for dimension, or you can do that at the front end. The Total amount would not varry for any case of aggregation. So you try with customed formula (for average) like : SUM(Amount)/Count(Distinct YourDimension).
See, if it could be an option.
Regards, tresesco