The Problem of Small Sample Size in Data Analysis

By: Eric Parker


Eric Parker lives in Seattle and has been teaching Tableau and Alteryx since 2014. He's helped thousands of students solve their most pressing problems. If you have a question, feel free to reach out to him directly via email.

Small sample sizes can make fools of us all. You see them hinder analysis all over the place.

For example…

A small sample size of votes leads to prematurely declaring the winner of a political race.

A small sample size of plays can lead you to make incorrect assumptions about outcome probabilities in sports.

A small sample size of wins at a casino can lead you to wrongly believe you should keep gambling.

 
 

I’m a big sports fan. Having grown up in the Seattle area and still living here, I am a fan of the Seahawks. I watch a good number of their games and consume plenty of commentary about their performance.

Here are a few funny “small sample size” anecdotes I’ve heard in the last year.

●        The Seahawks have never lost a game in which the opposing quarterback threw for 400 yards or more. (Prior to the 2020 season, this had happened only 7 times.)

●        Last year (2019), the Seahawks didn’t allow a touchdown on any drive where they had a sack.

●        Russell Wilson has won his last 10 games that started at 10 am Pacific Time (1 pm Eastern).

 
 

At the time these statements were shared, I believe they were all true. The problem with statistics built on small sample sizes isn’t that they are false; it’s that they aren’t all that meaningful. They become a problem when talking heads interpret them as predictive.

●        “Allow 400 passing yards and you won’t lose!”

●        “Get a sack and you won’t give up a touchdown!”

●        “Russell Wilson can’t be beat early in the mornings.”

This is an easy and somewhat unfair critique from me. I’m asking ex-athletes who likely haven’t studied informatics or statistics to do a deep dive on data.

I think the responsible thing to do, where possible, would be to follow the small-sample statistic with a larger-sample one. For example (italicized portions are made up):

●        “The Seahawks have not lost a game in the last decade when the opposing quarterback threw for 400 yards or more. That has happened 7 times. Across the NFL, it has happened 250 times in the last 10 years, and the team whose quarterback throws for 400+ yards wins the game 47% of the time.”

●        “Last year the Seahawks didn’t allow a touchdown on any drive where they had a sack. They had 26 sacks last year. In the entire NFL last year, there were 1,052 sacks. Those 1,052 sacks happened on 980 total drives. Of those 980 drives, only 58 resulted in a touchdown. That’s only a 5.9% touchdown rate.”

●        “Russell Wilson has won his last 10 East Coast 1 pm starts in a row. He is 12-6 in those games all time, a winning percentage of 67%. During the time Russell has been in the league, there have been 180 instances where a West Coast team played on the East Coast with a 1 pm start. Those teams win only 42% of the time.”
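To put rough numbers on how much less a small sample tells you, here is a minimal Python sketch that computes an approximate 95% Wilson score interval for a few of the counts quoted above. The Wilson interval is my choice of technique, not something the commentators or the numbers above come with, and the function name `wilson_interval` and the `examples` dictionary are made up for illustration. Remember that some of the counts are themselves invented for the sake of the example.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Counts taken from the statements above (several are made up in the post).
examples = {
    "Seahawks wins when the opponent throws for 400+ yards": (7, 7),
    "Wilson's all-time record in 1 pm Eastern starts": (12, 18),
    "League-wide sack drives that ended in a touchdown": (58, 980),
}

for label, (k, n) in examples.items():
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, plausible range {low:.1%} to {high:.1%}")
```

The 7-for-7 record looks like 100%, but with only 7 games the plausible range stretches down to roughly 65%. The 980-drive figure, by contrast, is pinned down to within a few percentage points.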

 
 

There are two challenges in getting something like this to happen:

A)     That doesn’t make for great radio or TV. Who wants 7 different numbers recited to them when they can have one fun, digestible fact instead?

B)      It requires some data preparation and analysis skills, which in turn require technical expertise and time.

What’s the moral of the story?

Small sample sizes can offer fun stories, but take them with a grain of salt. As the old investing adage goes, “Past results do not assure future outcomes,” especially when only a limited number of results is being analyzed. Whenever possible, compare your small-sample analysis to a larger sample to see whether the assumptions you drew from fewer data values still hold up.
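If you want to see the effect for yourself, a quick simulation helps. The sketch below is only an illustration under stated assumptions: it pretends there is a fixed "true" win rate of 47% (borrowing the made-up league-wide figure from earlier), draws many samples of 7 games and many samples of 250 games, and compares how far the observed win rates wander. The function names, the seed, and the number of simulated samples are all arbitrary choices of mine.

```python
import random

def observed_rates(true_rate, sample_size, n_samples, rng):
    """Simulate n_samples samples and return the observed win rate of each."""
    rates = []
    for _ in range(n_samples):
        wins = sum(rng.random() < true_rate for _ in range(sample_size))
        rates.append(wins / sample_size)
    return rates

def spread(rates):
    """Distance between the highest and lowest observed rate."""
    return max(rates) - min(rates)

rng = random.Random(2020)   # fixed seed so reruns give the same numbers
true_rate = 0.47            # assumed "true" rate, taken from the made-up example

small = observed_rates(true_rate, sample_size=7, n_samples=2000, rng=rng)
large = observed_rates(true_rate, sample_size=250, n_samples=2000, rng=rng)

# Share of 7-game samples that look "perfect" in one direction or the other.
extreme_share = sum(r in (0.0, 1.0) for r in small) / len(small)

print(f"7-game samples:   observed rates range over {spread(small):.0%}")
print(f"250-game samples: observed rates range over {spread(large):.0%}")
print(f"7-game samples that were all wins or all losses: {extreme_share:.1%}")
```

Even though every simulated sample comes from the same underlying rate, the 7-game samples swing far more widely than the 250-game samples, and a small share of them come out as a perfect record purely by chance.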
