Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I have a fact table that has data for hundreds of customers and tracks a good 45 attributes for each customer. My end-user wants to do some statistical analysis on the customers, but in a convoluted way. For each of the attributes, on each of the days, there are amounts of customers who express that attribute with a value. Here is a small example:
cal_date----attribute------------value-#day_flag--fb_early_bound--ob_early_bound--c_count--avg?
3/29/2010--birth_decade-----197---#-1----------11/23/2009---------------------------------274------???
3/29/2010--birth_decade-----198---#-1----------11/23/2009---------------------------------610------???
3/29/2010--birth_decade-----199---#-1----------11/23/2009---------------------------------147------???
3/29/2010--homeareacode--441---#-1----------11/23/2009---------------------------------885------???
3/29/2010--homeareacode--442---#-1----------11/23/2009---------------------------------171------???
3/30/2010--birth_decade-----197---##1---------------------------------02/04/2010---------383------???
3/30/2010--birth_decade-----198---##1---------------------------------02/04/2010---------808------???
3/30/2010--birth_decade-----199---##1---------------------------------02/04/2010---------231------???
3/30/2010--homeareacode--441---##1---------------------------------02/04/2010--------1177-----???
3/30/2010--homeareacode--442---##1---------------------------------02/04/2010---------256------???
In the example I have only 2 of the many attributes for only a few random values and only a couple days. You can tell our demographic skews towards those born in the 80s. 🙂 Now, what of these early_bound dates? I will get to them shortly...
What the end-user wants is to find the average of each attribute's value over the past 18 or 30 days. And to make it more complicated, he attaches qualities to the days. He keeps track of 1st-business-days-of-the-week (day_flag of -1), other-business-days-of-the-week (day_flag of 1), Saturdays (day_flag of 2), and all others (Sundays/holidays with day_flag of 0). So when he calcs this average, the days must correspond to the same flavor of day_flag.
So, for 3/29 (a Monday and a 1st-business-day-of-the-week), for birth_decade 198, he will take the previous 18 days customer-count data that also has a flavor of "-1", sum them up then divides by 18. For convenience, I have calculated the lower-bound date for every single day, or more to the point, I believe John Witherspoon did it for me awhile ago 🙂 Note that if it was an other-business-day-of-the-week, like 3/30, we would substitute the constant of 30 for the constant of 18 and only look for days with day_flag of "1".
Then on top of this, he wants to calculate the std_deviation along with the average, and then use the average and std_dev to make yet another calculation:
For 1st-business-days-of-the-week: risk-limit = average-of-past-18-days + 1.645 * standard deviation-of-past-18-days / SQRT(18).
For other-business-days-of-the-week: risk-limit = average-of-past-30-days + 1.645 * standard deviation-of-past-30-days / SQRT(30).
Then he wants a flag column of if the customer_count for that attribute value on that day exceeds this risk-limit.
So I've tried to make sense of this symbol-soup of Set Analysis and failed miserably. I'm not even sure it can be done. Someone please prove me wrong 🙂
Looking for help the next day...
Please, can someone take a shot at this? I need help.
OK, so for each cal_date, you have a cal_lower_bound_date_18 and cal_lower_bound_date_30, or something like that? So that if you select cal_date = 3/29/2010, the cal_lower_bound_date_18 associated with it will be 17 weeks before that, and the cal_lower_bound_date_30 associated with it will be 29 weeks back?
And the user is selecting a specific day? So they already narrowed the working set down to the flavor of that day? if they select 3/29/20101, the flavor of -1 will be in white, and all other flavors will be gray?
For your 18 day expression, for instance, it sounds like your set analysis would then include:
{<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>}
You don't need to mention flavor, since the right flavor is the only one possible, and the lower bound you've previously calculated handles getting the right number of days. Then I gather they're also selecting attribute=birth_decade and value=198. Then you want an average of the c_count over the last 18 qualifying days? I think this:
avg({<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count)
And the standard deviation should be:
stdev{<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count)
And the risk limit would then be:
if(flavor=-1,avg({<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count)
+ 1.645 * stdev({<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count) / sqrt(18)
,avg({<cal_date={">=($=cal_lower_bound_date_30) <=($=cal_date)"}>} c_count)
+ 1.645 * stdev({<cal_date={">=($=cal_lower_bound_date_30) <=($=cal_date)"}>} c_count) / sqrt(30))
And your flag would then be:
c_count >
if(flavor=-1,avg({<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count)
+ 1.645 * stdev({<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count) / sqrt(18)
,avg({<cal_date={">=($=cal_lower_bound_date_30) <=($=cal_date)"}>} c_count)
+ 1.645 * stdev({<cal_date={">=($=cal_lower_bound_date_30) <=($=cal_date)"}>} c_count) / sqrt(30))
But I have the feeling that I've drastically misunderstood quite a few things along the way. Like which date you're using in your chart as a dimension, since you have three of them in your table, and that could screw everything up. So chances are that that's not even close to what you want.
John,
Thanks for taking the time to investigate. Hate to say it, but your expression gave no results, which is something that I'm used to when I make my own expressions ![]()
avg({<cal_date={">=($=cal_lower_bound_date_18) <=($=cal_date)"}>} c_count)
Switching around the punctuation and killing =-signs at least yielded the c_count as results when I entered this:
avg({<cal_date={">=$(fb_early_bound)<$(cal_date)"}>} c_count)
I've tried other attempts, and at least this following one gives me some type of average, altho' not for the date ranges I'm looking for (it works over all the dates, not any subset)
avg(total <first_business, test1_attribute, test1_value> c_count)
Just can't figure out a way to meld this last expression with one using sets that works ONLY for the date range of similar business-quality-days...
Again, I've probably drastically misunderstood what you're after, but if you just want to combine those two expressions:
avg({<cal_date={">=$(fb_early_bound)<$(cal_date)"}>} total <first_business, test1_attribute, test1_value> c_count)
As far as procedure goes, I almost never write giant expressions like I posted all in one go. I build them up bit by bit, checking at each stage that I'm getting what I expected, and fixing problems as they pop up. If I go too far, I start tearing it apart, bit by bit, until I find what's working and what isn't.