Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I am working with a data set that occasionally contains human data entry errors. An example would be entering 10,000 instead of 1,000. I would like to locate and eliminate these outliers from my data. I have created a sample data set containing two "errors".
1. Customer 51091 typically buys between $7,500 and $8,200. However, on 1/27/17 they purchased $80,390. This would be an outlier and I would want to exclude from the data set.
2. Customer 51095 typically buys between $55,000 and $65,000. However, on 3/1/17 they purchased $603,210. This would be an outlier and I would want to exclude from the data set.
May be this
ExTableMapping:
Mapping
LOAD Cust,
Avg(Amt) * 1.5 as [Upper Limit]
FROM
SampleData01.xlsx
(ooxml, embedded labels, table is Sheet1)
Group By Cust;
Fact:
LOAD Date,
Cust,
Amt,
UpperLimit
Where Exclude = 1;
LOAD Date,
Cust,
Amt,
ApplyMap('ExTableMapping', Cust, 0) as UpperLimit,
If(Amt <= ApplyMap('ExTableMapping', Cust, 0), 1, 0) as Exclude
FROM
SampleData01.xlsx
(ooxml, embedded labels, table is Sheet1);
Exclude this in the script or the front end? Do you have a rule file which specify what is the range for each customer?
I will need to do this in the script. The range for each customer will have to be calculated based on their order history. Customer 51091 has an average of ~$7,900 (without the outlier) so an order of $80,390 is more than 10x their average. Likewise, their Max order was $8,121. Possibly exclude anything that is 150% their previous max order. I'm open to ideas.
May be this
ExTableMapping:
Mapping
LOAD Cust,
Avg(Amt) * 1.5 as [Upper Limit]
FROM
SampleData01.xlsx
(ooxml, embedded labels, table is Sheet1)
Group By Cust;
Fact:
LOAD Date,
Cust,
Amt,
UpperLimit
Where Exclude = 1;
LOAD Date,
Cust,
Amt,
ApplyMap('ExTableMapping', Cust, 0) as UpperLimit,
If(Amt <= ApplyMap('ExTableMapping', Cust, 0), 1, 0) as Exclude
FROM
SampleData01.xlsx
(ooxml, embedded labels, table is Sheet1);