The book (and later the movie) Moneyball by Michael Lewis tells the story of how the American baseball team Oakland Athletics in 2002 leveraged the power of data instead of relying on experts. Better data and better analysis led them to find and exploit market inefficiencies.
The team was one of the poorest in a period when only rich teams could afford the all-star players (the imbalance in total salaries being something like 4 to 1). A new ownership in 1995 had been improving the team’s wins, but in 2001 the loss of three key players and budget cuts prompted a new idea: take a quantitative approach and find undervalued players.
The traditional way to select players was through scouting, but Oakland and its general manager Billy Beane (Brad Pitt in the movie…) selected players based on their statistics, without any prejudice. Specifically, his assistant, the Harvard graduate Paul DePodesta, looked at the data to find which skills were undervalued.
A huge repository of American baseball statistics (whose empirical analysis is known as sabermetrics) is Lahman’s Baseball Database. It contains complete batting and pitching statistics from 1871 to the present, plus fielding statistics, standings, team stats, managerial records, post-season data and more.
In [1]:
import pandas as pd
In [5]:
baseball = pd.read_csv("../datasets/baseball.csv")
In [6]:
baseball.head()
Out[6]:
In [7]:
baseball.columns
Out[7]:
The feature named ‘W’ is the number of wins in the season ‘Year’ and the feature ‘Playoffs’ is a boolean value (0 = didn’t make it to the playoffs; 1 = did).
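As a quick look at the target (an extra step, not in the original analysis), we can count how many team seasons ended in the playoffs:
baseball.Playoffs.value_counts()  # 0 = missed the playoffs, 1 = made it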
In [8]:
moneyball = baseball[baseball.Year < 2002]
moneyball.head()
Out[8]:
In [9]:
moneyball[['Team','Year','W','Playoffs']].head()
Out[9]:
The first step was to understand what was needed for the team to enter the playoffs. Judging from past seasons, DePodesta estimated that it takes 95 wins to be reasonably sure of making it to the playoffs.
In [10]:
import matplotlib.pyplot as plt
%matplotlib inline
In [11]:
my_yticks = ['NO', 'YES']
plt.yticks([0, 1], my_yticks)  # label the two possible Playoffs values
plt.scatter(moneyball.W, moneyball.Playoffs)
plt.axvline(x=95, color='r', linestyle='dashed')
Out[11]:
Note: you might ask why the goal is to make it to the playoffs and not directly to win the World Series.
The reason is that the A's managers see their job as making sure the team makes it to the playoffs – after that all bets are off.
“Over a long season the luck evens out, and the skill shines through. But in a series of three out of five, or even four out of seven, anything can happen.”
95 wins are necessary to be relatively sure of going to the playoffs.
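As a sanity check of this threshold (an extra step, not in the original analysis), we can compute the share of past teams with at least 95 wins that made the playoffs:
moneyball[moneyball.W >= 95].Playoffs.mean()  # should be close to 1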
But how many Runs are then needed?
To win games a team needs to score more “runs” than its opponent, but how many? DePodesta used linear regression to find out. Let's see how.
The feature ‘RS’ is the number of runs scored and ‘RA’ is the number of runs allowed. We can add a feature which summarises both of them by calculating the difference:
In [12]:
moneyball = moneyball.assign(RD = moneyball.RS - moneyball.RA)
moneyball.head()
Out[12]:
In [13]:
plt.scatter(moneyball.RD, moneyball.W)
Out[13]:
The next step is to formalise the relation between Wins and Runs.
Our output would be the number of wins W and we want to see how it depends on the runs difference RD:
W = beta0 + beta1 * RD
This is a simple linear regression and the beta parameters can be estimated from the data.
We use the sklearn library and its linear_model module:
In [14]:
from sklearn import linear_model
In [15]:
WinsModel = linear_model.LinearRegression()
The input variable is the Runs Difference, and the output is the Wins
In [16]:
features = moneyball[['RD']].copy() # input
features.insert(0, 'intercept', 1)  # explicit intercept column (sklearn also fits its own intercept, so this column's coefficient will be ~0)
In [17]:
WinsModel.fit(features, moneyball.W)
Out[17]:
To see the fitted beta parameters we can examine the following values:
In [18]:
WinsModel.intercept_
Out[18]:
In [19]:
WinsModel.coef_
Out[19]:
Therefore the prediction formula for the number of wins is:
Wins = 80.9042 + 0.1045 * RD
To get the run difference necessary we need to solve this simple equation:
95 = 80.9042 + 0.1045 * RD
RD = (95-80.9042)/0.1045 = 134.89
A run difference of 135 is necessary for 95 wins.
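Equivalently (a small extra step, not in the original notebook), we can solve for RD directly from the fitted parameters:
RD_needed = (95 - WinsModel.intercept_) / WinsModel.coef_[1]  # coef_[0] belongs to the explicit intercept column and is ~0
RD_needed  # about 134.9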
We can verify it by predicting the wins for a run difference of 135:
In [20]:
WinsModel.predict([[1, 135]])
Out[20]:
How does a team increase the runs difference (RD)?
There are two ways: either scoring more runs (RS) or allowing fewer runs (RA).
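A quick correlation check (an extra step, not in the original analysis) confirms both levers: RS is positively and RA negatively correlated with W:
moneyball[['RS', 'RA', 'W']].corr()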
The A’s started using a different method to select players, based on their statistics, not on their looks.
Most teams focused on Batting Average (BA): getting on base by hitting the ball.
The A’s discovered that BA was overvalued and that two other baseball statistics were significantly more important than anything else: On-Base Percentage (OBP) and Slugging Percentage (SLG).
We can use linear regression to verify which baseball player features are more important to predict runs.
In [21]:
RSmodel = linear_model.LinearRegression()
features = moneyball[['OBP', 'SLG', 'BA']] # input
features.insert(0, 'intercept', 1)
RSmodel.fit(features, moneyball.RS)
Out[21]:
In [22]:
RSmodel.intercept_
Out[22]:
In [23]:
RSmodel.coef_
Out[23]:
Note that there are now three beta parameters besides the intercept, one for each input feature (the coefficient shown for the explicit 'intercept' column is zero, since sklearn estimates the intercept separately).
The prediction formula for Scored Runs would be:
RS = -788 + 2917 * OBP + 1638 * SLG - 369 * BA
The score() function returns the R2 metric for the model:
In [24]:
RSmodel.score(features, moneyball.RS)
Out[24]:
This is very high, close to the maximum of 1.
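To make explicit what score() computes, here is R2 calculated by hand (a small extra check, not in the original notebook):
pred = RSmodel.predict(features)
sse = ((moneyball.RS - pred) ** 2).sum()   # residual sum of squares
sst = ((moneyball.RS - moneyball.RS.mean()) ** 2).sum()  # total sum of squares
1 - sse / sst  # matches RSmodel.score(features, moneyball.RS)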
And if we remove the BA feature?
In [25]:
RSmodel2 = linear_model.LinearRegression()
features = moneyball[['OBP', 'SLG']] # input
features.insert(0, 'intercept', 1)
RSmodel2.fit(features, moneyball.RS)
Out[25]:
In [26]:
RSmodel2.score(features, moneyball.RS)
Out[26]:
There is almost no difference in the R2 when removing the BA feature – a sign that BA is not significant.
Batting Average (BA) is overvalued
On Base Percentage (OBP) and Slugging Percentage (SLG) are enough.
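For a more formal check (an extra step, not in the original notebook, assuming the statsmodels package is available), we can inspect the full regression summary:
import statsmodels.api as sm

X = sm.add_constant(moneyball[['OBP', 'SLG', 'BA']])
ols = sm.OLS(moneyball.RS, X).fit()
ols.summary()  # note BA's negative coefficient – counterintuitive, a typical sign of multicollinearity with OBP and SLG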
In [27]:
RSmodel2.intercept_
Out[27]:
In [28]:
RSmodel2.coef_
Out[28]:
Predicting the Runs Allowed is more difficult because there are missing values in the data:
In [29]:
moneyball.isnull().sum()
Out[29]:
In [30]:
moneyball.shape
Out[30]:
Out of 902 rows, 812 are missing the OOBP and OSLG features, which would be critical for the RA prediction.
Just as an exercise we can go on, but the results will not be very significant.
There are too many missing values to impute them with averages; it is better to drop these rows altogether.
In [31]:
import numpy as np
In [32]:
moneyball = moneyball[np.isfinite(moneyball['OOBP'])]
In [33]:
moneyball = moneyball[np.isfinite(moneyball['OSLG'])]
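Note: the two filters above can also be written in a single step with pandas' dropna:
moneyball = moneyball.dropna(subset=['OOBP', 'OSLG'])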
In [34]:
moneyball.shape
Out[34]:
Now the data set is reduced to 90 rows ...
In [35]:
RAmodel = linear_model.LinearRegression()
features = moneyball[['OOBP', 'OSLG']]
features.insert(0, 'intercept', 1)
RAmodel.fit(features, moneyball.RA)
Out[35]:
In [36]:
RAmodel.intercept_
Out[36]:
In [37]:
RAmodel.coef_
Out[37]:
In [38]:
RAmodel.score(features, moneyball.RA)
Out[38]:
We know that the A's team OBP in 2002 is 0.339 and the team SLG is 0.430.
How many Runs and Wins can we expect?
We could just plug these values into the formula above:
In [39]:
RS = -804.63 + 2737.77*0.339 + 1584.91*0.43
RS
Out[39]:
Or we can use the handy function predict():
In [40]:
RSmodel2.predict([[1, 0.339, 0.43]])
Out[40]:
We predict 805 runs scored in 2002.
Let's also predict RA, even though the model is weak (it was fitted on only 90 rows), just as an example.
We know that the team OOBP in 2002 is 0.307 and the team OSLG is 0.373.
In [41]:
RAmodel.predict([[1, 0.307, 0.373]])
Out[41]:
Our 2002 prediction is 622 Runs Allowed.
In reality, Paul DePodesta at the A's used a similar but slightly different approach and predicted 650-670 Runs Allowed.
And finally the number of wins:
In [42]:
WinsModel.predict([[1, (805-622)]])
Out[42]:
So our prediction for the A's in 2002 is 100 wins in total, which would probably be enough to make it to the playoffs.
The actual 2002 final results: the A's won 103 games – and they made it to the playoffs.
Models (even relatively simple ones) allow managers to value players more accurately and to minimise risk.
Every major league baseball team now has a statistics group.
Analytics are used in other sports, too.