What we can learn about data collection from the Houston Astros

By: Eric Parker


Eric lives in Seattle and has been teaching Tableau and Alteryx for 5 years. He's helped thousands of students solve their most pressing problems. If you have a question, feel free to reach out to him directly via email.

I recently finished a book called Astroball, written by Ben Reiter. It tells the unlikely story of how the Astros went from the worst team in baseball to winning a World Series a few seasons later. As much as I wish I could say I didn’t like the book (I’m a Mariners fan, after all), it was a compelling read and gave me great insight into their organization and methodology.

For much of their early years, the Astros were on a bumpy ride. In 2011, they were bought by a new owner. That new owner wanted the team to win a World Series but also realized it would take time to get there. He hired a few analytical folks from the St. Louis Cardinals and told them to do what they saw fit.

Much like the team in the book and movie Moneyball, they exploited inefficiencies in the market for baseball players. They started with a small analytical team of only a few people. They knew there was a lot of information they didn’t have and needed to start collecting. The realization that you don’t have the information you need yet, and that it could take months, maybe even years, to collect it all can be daunting.

However, they knew there was only one way forward: begin collecting the data. A large part of what they did was to weight players’ past performances by where they had played previously. For instance, if someone played at a large college against great competition, their performance would be given more weight than that of someone who played at a small college against inferior competition.


It seems like an obvious concept, but making it accurate is difficult. How much better is playing at UCLA than playing at UC Irvine? 2 times? 1.5 times? 10 times? Creating the model took a lot of back-testing and plenty of continual refinement as well.
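To make the weighting idea a little more concrete, here is a minimal sketch of what back-testing a competition multiplier could look like. This is not the Astros' actual model. The function names, the program labels, and every number in it are made up for illustration, and it assumes a player's later outcome is measured on the same scale as the weighted college stat.

```python
# Hypothetical back-test of a competition multiplier. All names and numbers
# here are invented for illustration; none of them come from the Astros.

# Each record: (raw college stat, program type, outcome observed later)
HISTORY = [
    (0.300, "large_program", 0.450),
    (0.280, "large_program", 0.420),
    (0.320, "small_program", 0.320),
]


def weighted_performance(raw_stat, program, weights):
    """Scale a raw stat by the multiplier assigned to the player's program type."""
    return raw_stat * weights[program]


def backtest_error(weights, history):
    """Mean absolute gap between the weighted stat and the later outcome."""
    errors = [
        abs(weighted_performance(raw, program, weights) - outcome)
        for raw, program, outcome in history
    ]
    return sum(errors) / len(errors)


def pick_best_large_weight(candidates, history):
    """Try each candidate multiplier for large programs and keep the one with the lowest error."""
    return min(
        candidates,
        key=lambda w: backtest_error(
            {"large_program": w, "small_program": 1.0}, history
        ),
    )


best = pick_best_large_weight([1.5, 2.0, 10.0], HISTORY)
print(best)
```

With these made-up records, the 1.5 multiplier fits best. The continual refinement comes from rerunning the same comparison as new seasons of data arrive.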

Another opportunity they looked to exploit was figuring out which current major league players could improve their performance through better utilization. For example, if a pitcher had a curveball that opposing batters rarely hit but threw it only 15% of the time, how would their performance improve if that share increased to 30%?
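A rough way to frame that what-if is as a weighted average over the pitcher's pitch mix. The sketch below is only an illustration: the hit rates are invented, the helper names are hypothetical, and it assumes each pitch stays equally hard to hit no matter how often it is thrown, which real batters would not let happen.

```python
# Hypothetical what-if on pitch usage. The hit rates are invented, and the
# model assumes a pitch's hit rate does not change as its usage changes.

def expected_hit_rate(pitch_mix, hit_rate_by_pitch):
    """Usage-weighted average of how often batters get a hit off each pitch."""
    return sum(share * hit_rate_by_pitch[pitch] for pitch, share in pitch_mix.items())


def shift_usage(pitch_mix, pitch, new_share):
    """Set one pitch's share and rescale the others so the shares still sum to 1."""
    old_share = pitch_mix[pitch]
    if old_share >= 1.0:
        raise ValueError("no other pitches to rescale")
    scale = (1.0 - new_share) / (1.0 - old_share)
    return {
        name: (new_share if name == pitch else share * scale)
        for name, share in pitch_mix.items()
    }


current_mix = {"fastball": 0.85, "curveball": 0.15}
hit_rates = {"fastball": 0.30, "curveball": 0.20}  # made-up values

new_mix = shift_usage(current_mix, "curveball", 0.30)
before = expected_hit_rate(current_mix, hit_rates)
after = expected_hit_rate(new_mix, hit_rates)
print(round(before, 3), round(after, 3))
```

In this toy example, doubling the curveball's share lowers the expected hit rate from about 0.285 to about 0.270. The point is the shape of the question, not the specific numbers.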

This helped them lock in on undervalued players like Charlie Morton (to see his stats and improvement upon joining the Astros, follow the link).


We know how the story ends. The Astros went from the laughingstock of baseball in the early 2010s to winning the World Series in 2017.

The Astros’ analytical team has at least tripled in size since its first members were hired by new management in 2011. They now have a lot more resources and firepower to throw into collecting data, creating sophisticated models, and continually tweaking those models with updated information.

Why should you care? Most organizations have information that, if it were captured, would help them immensely. For example, if you’re in marketing, wouldn’t it be helpful to know which channels are most effective for capturing new customers? If you’re in sales, wouldn’t you like to know which indicators are most predictive of an impending sale?

Often, those values aren’t captured because the undertaking seems too daunting. Here’s my recommendation: get started anyway! If that additional information is going to be valuable for the company, it’s worth it. It might take you years to capture all the information, but as the Astros have shown, even before all the models are “complete,” they can provide extraordinary value.