Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
hobanwashburne
Creator
Creator

Smoothing outliers

I am working with a data set that occasionally contains human data entry errors. An example would be entering 10,000 instead of 1,000. I would like to locate and eliminate these outliers from my data. I have created a sample data set containing two "errors".

1. Customer 51091 typically buys between $7,500 and $8,200. However, on 1/27/17 they purchased $80,390. This would be an outlier and I would want to exclude from the data set.

2. Customer 51095 typically buys between $55,000 and $65,000. However, on 3/1/17 they purchased $603,210. This would be an outlier and I would want to exclude from the data set.

1 Solution

Accepted Solutions
sunny_talwar

May be this

ExTableMapping:

Mapping

LOAD Cust,

    Avg(Amt) * 1.5 as [Upper Limit]

FROM

SampleData01.xlsx

(ooxml, embedded labels, table is Sheet1)

Group By Cust;

Fact:

LOAD Date,

    Cust,

    Amt,

    UpperLimit

Where Exclude = 1;

LOAD Date,

    Cust,

    Amt,

    ApplyMap('ExTableMapping', Cust, 0) as UpperLimit,

    If(Amt <= ApplyMap('ExTableMapping', Cust, 0), 1, 0) as Exclude

FROM

SampleData01.xlsx

(ooxml, embedded labels, table is Sheet1);

View solution in original post

3 Replies
sunny_talwar

Exclude this in the script or the front end? Do you have a rule file which specify what is the range for each customer?

hobanwashburne
Creator
Creator
Author

I will need to do this in the script. The range for each customer will have to be calculated based on their order history. Customer 51091 has an average of ~$7,900 (without the outlier) so an order of $80,390 is more than 10x their average. Likewise, their Max order was $8,121. Possibly exclude anything that is 150% their previous max order. I'm open to ideas.

sunny_talwar

May be this

ExTableMapping:

Mapping

LOAD Cust,

    Avg(Amt) * 1.5 as [Upper Limit]

FROM

SampleData01.xlsx

(ooxml, embedded labels, table is Sheet1)

Group By Cust;

Fact:

LOAD Date,

    Cust,

    Amt,

    UpperLimit

Where Exclude = 1;

LOAD Date,

    Cust,

    Amt,

    ApplyMap('ExTableMapping', Cust, 0) as UpperLimit,

    If(Amt <= ApplyMap('ExTableMapping', Cust, 0), 1, 0) as Exclude

FROM

SampleData01.xlsx

(ooxml, embedded labels, table is Sheet1);