What are outliers, extremes and how to identify them in your data? What is an outlier?In statistics and data science an outlier is an observation...
What are outliers, extremes and how to identify them in your data?
What is an outlier?
In statistics and data science an outlier is an observation that lies in abnormal distance from other observations in dataset. Simply said, it could be too high or too low value.
The effects of outliers
It is not possible to generally say whether it is better to have dataset with or without outliers for analysing purposes. Still, however, it is certain that we should know about them and make our own decision if to drop or not to drop them - based on specific tasks and objectives.
Since an outlier is the value which distance from the rest of dataset is abnormal then each measure based on distances between observations in the dataset could be significantly biased thanks to its presence 😊. If the measure could be influenced by outliers, we call it a non-resistant measure. On the other hand, if presence of outlier does not have any impact on result then the measure is called resistant.
If we use non-resistant measure, we should identify outliers and decide if we want to exclude them or not. In some cases, we need to know the real value of the measure calculated above all the data including outliers. Some examples of non-resistant measures:
- Standard deviation
- Pearson’s correlation coefficient
The result of resistant measures are not affected by outliers:
- Spearman’s correlation coefficient (the distance in Pearson’s correlation coefficient is replaced by ranking that ignore absolute distances - it leads to change from non-resistant to the resistant measure)
How to identify outliers? Mathematical methods
The most common method used for outliers identification is BoxPlot – the border between normal and abnormal observations is derived by using interquartile range which is visualized by the graph called BoxPlot. Another added value of this method is that it classifies abnormal values into two levels – outliers and extremes. If we want to explain corresponding formulas correctly we must define quantiles, quartiles and interquartile range and then the final formula – but do not be sad or frightened, it is not as hard as it seems to be 😊.
- x% quantile splits the whole dataset into 2 subsets in the way that x% values from the dataset are less than the x% quantile value (if we have dataset where 15% quantile = 123, it means that 15% of all observations are less than 123)
- quartiles are nothing more and nothing less than quantiles splitting the dataset into 4 subsets:
- 25% quantile = 1st quartile (Q1) – one quarter of all observations are less than the value
- 50% quantile = 2nd quartile (Q2) – one half of all observations are less than the value = median
- 75% quantile = 3rd quartile (Q3) – three quarters of all observations are less than the value = one quarter of all observations are higher the value
- Interquartile range (know as IQR) is the difference between Q3 and Q1
- Specific observation X is classified as outlier if the following is true:
- X > Q3 + 1.5*IQR alebo X < Q1 - 1.5*IQR => X > 2.5*Q3 - 1.5*Q1 alebo X < 2.5*Q1 - 1.5*Q3
- Specific observation X is classified as extreme if the following is true:
- X > Q3 + 3*IQR alebo X < Q1 - 3*IQR => X > 4*Q3 - 3*Q1 alebo X < 4*Q1 - 3*Q3
Naturally BoxPlot (IQR) is not the only method for outliers detection. Instead of values based on IQR you can use ± 2*standard deviation from average as borders. Other methodologies are completely different – they use density or classification instead of distances.
How to identify outliers in Qlik Sense? IQR methodology.
In the first place we generate our own testing dataset:
rowno() as X,
exp(num#(rowno())) as Y
We can easily visualize this data thanks to combo chart with the dimensions X and measure Only(Y):
If we want to have outliers identification in the table with dimension X as a new column with binary values @IsOutlier, @IsExtreme, we create new dimensions with corresponding labels and these expressions based on formulas (mentioned above).
(Y > 2.5*fractile(total Y,0.75)-1.5*fractile(total Y,0.25))
(Y < 2.5*fractile(total Y,0.25)-1.5*fractile(total Y,0.75))
(Y > 4*fractile(total Y,0.75)-3*fractile(total Y,0.25))
(Y < 4*fractile(total Y,0.25)-3*fractile(total Y,0.75))
We can add also the background color for more intuitive identification of abnormal observations or use the colorize in our scatter plot:
We can also pre-calculate IQR already in the script what would lead to more simple expression. However, it causes static identification of outliers – algorithms will not be corresponding to actual filters. In some cases this could be exactly what is needed so it is good to know about the possibility.
Visual representation of this algorithm is integrated in Qlik Sense since June 2017.
As it is in Qlik Sense you have a lot of different possibilities so it is up to you which one is the best fit for your use case.
... it makes sense, Sense makes it 😉