(Sample) Size Matters!
By Ioannis Giakoumakis
Principal Business Intelligence Consultant
In our daily work as hands-on business intelligence consultants we routinely perform ETL operations to extract, transform, load and visualize data. A big part of the job is to check data quality: cleanse the data, link it, enhance it and try to make sense of it. When it comes to data quality, however, we focus on how “good” the data is and pay far less attention to how “much” data there is per dimension.
Consider that the data volume might simply be too low, and the data not diverse enough, to base our insights and decisions on.
Simply put: is the data volume high enough, and the data diverse enough, to trust it?
The law of small numbers
In his excellent book “Thinking, Fast and Slow” (1), a must-read for every business intelligence professional (and not only), Daniel Kahneman describes the “law of small numbers”, which in summary reads as follows (in his own words, p. 118):
“The exaggerated faith in small samples is only one example of a more general illusion – we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.
Statistics produce many observations that appear to beg for causal explanations but do not lend themselves to such explanations. Many facts of the world are due to chance, including accidents of sampling. Causal explanations of chance events are inevitably wrong.”
What Daniel Kahneman is saying is, among other things, that a low sample size can produce statistics that are just the outcome of pure luck and therefore inevitably wrong. Because of the low sample size, we may see patterns in the data that are simply coincidence, and these patterns will disappear as soon as the sample size increases. It is like creating a basket analysis for a supermarket chain based on 1000 transactions sourced from one shop, when there are millions from multiple shops every month. Whatever comes out of that basket analysis is wrong.
Daniel Kahneman does an excellent job explaining the law, but let me be bold enough to give a simple example and also make my own analogy, adjusted to what we do and love most: dashboards.
The simple example (2)
Please take a moment to answer the following quiz:
Which of these events is the most likely to happen when flipping a coin?
1. Flip 2 or more heads when flipping 3 coins
2. Flip 20 or more heads when flipping 30 coins
3. Flip 200 or more heads when flipping 300 coins
4. They are all equally likely
What say you? (3)
Events 1 to 3 all describe a 2-out-of-3 proportion of heads (approx. 66%). So the first thought that might have crossed our minds is that they could all be equally likely; at first glance, answer 4 seems correct.
But when flipping a number of coins, we expect about half (50%) of the outcomes to be heads. Each flip of the coin is an individual event, unaffected by the previous outcomes, so the 50% heads, 50% tails expectation holds.
As we flip more and more coins, we should expect the observed proportion of heads to be closer and closer to 50%. Flipping only 3 coins can result in any of the following (8 results – 2^3 – H: heads – T: tails):
Number of heads | Outcomes | Percentage of heads in 3 flips
0 | TTT | 0%
1 | HTT, THT, TTH | 33%
2 | HHT, HTH, THH | 67%
3 | HHH | 100%
So flipping 3 coins leads to 2 or more heads half of the time, giving a result of 66% or even 100% heads, way off the expected 50%. But as the number of coins increases, the proportion of heads will get closer and closer to 50%, as the binomial distribution (4), our logic and real-world experience all indicate. So the more coins we flip, the smaller the chance that heads (or tails…) will prevail.
For reference, here are the actual probabilities for the quiz above:
- P(2 or more heads out of 3) = 0.5
- P(20 or more heads out of 30) = 0.05
- P(200 or more heads out of 300) = 0.000000004
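These tail probabilities can be verified directly from the binomial distribution; here is a minimal sketch in plain Python (standard library only, no dashboard tooling involved):

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(at least k heads in n flips of a coin with heads probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(2, 3))     # exactly 0.5
print(prob_at_least(20, 30))   # roughly 0.05
print(prob_at_least(200, 300)) # vanishingly small, on the order of 1e-9
```

The larger the sample, the harder it becomes for the observed proportion of heads to stray far from 50%.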
So if you answered that event 1 is the most likely to happen, then yes, that is the correct answer!
(I didn’t…)
Bottom line: Yes, weird or rare results can happen a few times, but not many times!
The dashboard analogy
Let’s now assume that we have a dashboard containing sports equipment sales for a business physically based in Greece, with numerous shops of its own across the country. The dashboard also includes master customer data, which is self-reported: each customer has been asked to fill out a form, either on the spot at the shops or online, stating facts such as sex, income, family size and type, and residence post code, among other things.
The average percentage of customers buying ski equipment across the country and throughout the year is around 3.6%.
Drilling down into the data reveals a shocking fact: there is a seaside area in the south of the country, on the island of Crete (5), where on average 25.3% of customers have bought ski equipment. Imagine that… People in Crete, where the average yearly temperature is over 20 degrees Celsius and the sun shines 300 days a year (8 hours of sunshine a day on average) (6), buy more ski equipment than people in the rest of the country…
Marketing people in the company are thrilled with the information. What an excellent opportunity to make ski equipment sales skyrocket. What is it that they do in the nearby Cretan shop so efficiently, that the rest of the shops do not?
But is the data correct? The first thought that comes to their minds is to have the IT department verify it. The IT department replies that it is the business’s responsibility to verify the integrity of the data; the business replies that they have no time to clean up a mess created by IT; and after a few polite email exchanges somebody takes responsibility, checks the data and concludes that it is absolutely correct.
So it is the truth then: People in an area in Crete buy a lot more ski equipment on average than the rest of the country. I wonder how they even try it on without melting down… So where is the catch?
The catch is the sample size: this is a small area in Crete, where everybody knows everybody, so shop employees do not even bother to ask customers to register their personal data. Only a small number of customers have self-registered online, and it so happened that the ones who bought ski equipment were among the registered ones.
The world is back to normal again.
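To see how easily a handful of registered customers can produce such an outlier, here is an illustrative simulation. Every figure except the 3.6% nationwide rate (the number of areas, the registered customers per area) is invented for the sketch:

```python
import random

random.seed(42)

TRUE_RATE = 0.036   # nationwide share of ski-equipment buyers (from the story)
N_AREAS = 500       # hypothetical number of areas with few registrations
SAMPLE_SIZE = 15    # hypothetical number of registered customers per area

# Observed ski-buyer rate per area, each computed from a tiny sample
rates = []
for _ in range(N_AREAS):
    buyers = sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
    rates.append(buyers / SAMPLE_SIZE)

print(f"average observed rate: {sum(rates) / N_AREAS:.3f}")  # close to 3.6%
print(f"most extreme area:     {max(rates):.3f}")            # far above 3.6%
```

Even though every simulated area has the same underlying 3.6% rate, some area will always look like a spectacular outlier purely by chance, exactly as the Cretan shop did.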
A solution (7)
How does the law affect our work as business intelligence consultants? How should we make sure that the law does not mislead our users?
There are several ways to deal with it:
- Include data quality measures in the dashboard to make users more aware.
- Set up rules that will exclude data from the dashboard in the first place (during load), if the volume is not adequate.
- Apply statistical corrections and weight the data.
- Add last updated information per data source. Sample size might be small because data is missing altogether.
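A minimal sketch of the first approach in plain Python: deriving the footnote figures (sample size and share of unknown values) before handing them to the dashboard. The records and the "region" field are made up for illustration:

```python
# Hypothetical customer records; None marks a customer who never registered
records = [
    {"customer": 1, "region": "Crete"},
    {"customer": 2, "region": "Attica"},
    {"customer": 3, "region": None},
    {"customer": 4, "region": "Crete"},
    {"customer": 5, "region": None},
]

sample_size = len(records)
unknown = sum(1 for r in records if r["region"] is None)
pct_unknown = 100 * unknown / sample_size

footnote = f"n = {sample_size}, unknown region: {pct_unknown:.0f}%"
print(footnote)  # n = 5, unknown region: 40%
```

The same two figures can of course be produced inside the BI tool itself; the point is only that they should travel with every chart.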
Personally, I prefer the first approach, including some quality measures, as it is difficult to agree on and set up rules that would exclude potentially misleading data. We need to pay extreme attention to the following fact: this is not erroneous data. It is merely misleading data, because the low sample size produces weird results.
Applying statistical corrections can be really tricky and controversial. As an example, suppose that a biased sample of 100 patients included 20 men and 80 women. A researcher could correct for this imbalance by attaching a weight of 2.5 for each male and 0.625 for each female. This would adjust any estimates to achieve the same expected value as a sample that included exactly 50 men and 50 women, unless men and women differed in their likelihood of taking part in the survey (8). But in business intelligence, this is a bit far-fetched. We should simply include all the needed data.
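The arithmetic behind that weighting, sketched in Python with the figures from the example:

```python
men, women, target = 20, 80, 50  # biased sample vs. the balanced 50/50 design

w_men = target / men      # each man counts 2.5 times
w_women = target / women  # each woman counts 0.625 times

# After weighting, both groups contribute as in a 50/50 sample
print(w_men, w_women)                # 2.5 0.625
print(men * w_men, women * w_women)  # 50.0 50.0
```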
Usually I add a footnote to every chart that needs one, stating the sample size as well as the percentage of null or unknown values. I also make the null, unknown or missing values part of the chart itself (apologies for the pie chart…):
The null or unknown values could be omitted from the chart and only mentioned in the footnote:
Notice how drastically the chart changes. Even with the footnote present, it can be misleading. Personally I prefer to include them, so that it is easy for the user to identify that there are null or unknown values, thus raising concern about data quality and awareness about potential misleading conclusions.
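The shift can be reproduced with a few invented numbers; note how the same category’s share doubles once unknowns are silently dropped:

```python
# Hypothetical response counts for a single chart dimension
counts = {"yes": 30, "no": 20, "unknown": 50}

total = sum(counts.values())       # 100 responses overall
known = total - counts["unknown"]  # 50 usable responses

share_with_unknowns = counts["yes"] / total  # unknowns shown in the chart
share_without = counts["yes"] / known        # unknowns silently excluded

print(share_with_unknowns, share_without)  # 0.3 0.6
```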
It is also good practice to add an extra sheet to our dashboards, with information regarding sources, number of records, past and coming updates, etc. A sample table can be seen in the image below:
This is actually a typical case of sampling bias, where a sample (the data) is collected in such a way that some members of the intended population are less likely to be included than others. The data collected does not represent the true distribution, or is simply too low in volume (8).
A low and not diverse enough sample can lead to weird results and misleading conclusions. We and our users need to be aware of that, and we should make it very obvious in our dashboards by including data quality measures, either as text or visually.
I hope you enjoyed this reading!
- (1) Thinking, Fast and Slow, Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011, ISBN 978-0-374-53355-7
- (2) Coin flipping example (https://brilliant.org/ )
- (3) Aragorn parries the blow with the Sword of Elendil, much to the ghost’s surprise. The King of the Dead: “That blade was broken!” Aragorn: “Fight for us, and regain your honour. What say you?”
- (4) https://en.wikipedia.org/wiki/Binomial_distribution
- (5) Travel to Crete www.ferries.gr
- (6) Weather in Crete https://www.tripadvisor.com/Travel-g189413-s208/Crete:Greece:Weather.And.When.To.Go.html
- (7) Charts created using QlikSense (www.qlik.com)
- (8) https://en.wikipedia.org/wiki/Sampling_bias