Moneyball: a linear regression example

The book (and later a movie) Moneyball by Michael Lewis tells the story of how the Oakland Athletics baseball team in 2002 leveraged the power of data instead of relying on experts. Better data, and better analysis of it, led the team to find and exploit market inefficiencies.

The team was one of the poorest in a period when only rich teams could afford the all-star players (the imbalance in total salaries was something like 4 to 1). New ownership in 1995 had been improving the team’s wins, but in 2001 the loss of three key players and budget cuts prompted a new idea: take a quantitative approach and find undervalued players.

The traditional way to select players was through scouting, but Oakland and its general manager Billy Beane (Brad Pitt in the movie…) selected players based on their statistics, without prejudice. Specifically, his assistant – the Harvard graduate Paul DePodesta – looked at the data to find which skills were undervalued.

The empirical analysis of baseball statistics is known as sabermetrics, and a huge repository of such statistics is the Lahman Baseball Database. This database contains complete batting and pitching statistics from 1871 to the present, plus fielding statistics, standings, team stats, managerial records, post-season data and more.

Read baseball data


In [1]:
import pandas as pd

In [5]:
baseball = pd.read_csv("../datasets/baseball.csv")

In [6]:
baseball.head()


Out[6]:
Team League Year RS RA W OBP SLG BA Playoffs RankSeason RankPlayoffs G OOBP OSLG
0 ARI NL 2012 734 688 81 0.328 0.418 0.259 0 NaN NaN 162 0.317 0.415
1 ATL NL 2012 700 600 94 0.320 0.389 0.247 1 4.0 5.0 162 0.306 0.378
2 BAL AL 2012 712 705 93 0.311 0.417 0.247 1 5.0 4.0 162 0.315 0.403
3 BOS AL 2012 734 806 69 0.315 0.415 0.260 0 NaN NaN 162 0.331 0.428
4 CHC NL 2012 613 759 61 0.302 0.378 0.240 0 NaN NaN 162 0.335 0.424

In [7]:
baseball.columns


Out[7]:
Index(['Team', 'League', 'Year', 'RS', 'RA', 'W', 'OBP', 'SLG', 'BA',
       'Playoffs', 'RankSeason', 'RankPlayoffs', 'G', 'OOBP', 'OSLG'],
      dtype='object')

The feature named ‘W’ is the number of wins in the season ‘Year’ and the feature ‘Playoffs’ is a boolean value (0 = didn’t make it to the playoffs; 1 = did make it).

Subset to only include moneyball years (until 2001)


In [8]:
moneyball = baseball[baseball.Year < 2002]
moneyball.head()


Out[8]:
Team League Year RS RA W OBP SLG BA Playoffs RankSeason RankPlayoffs G OOBP OSLG
330 ANA AL 2001 691 730 75 0.327 0.405 0.261 0 NaN NaN 162 0.331 0.412
331 ARI NL 2001 818 677 92 0.341 0.442 0.267 1 5.0 1.0 162 0.311 0.404
332 ATL NL 2001 729 643 88 0.324 0.412 0.260 1 7.0 3.0 162 0.314 0.384
333 BAL AL 2001 687 829 63 0.319 0.380 0.248 0 NaN NaN 162 0.337 0.439
334 BOS AL 2001 772 745 82 0.334 0.439 0.266 0 NaN NaN 161 0.329 0.393

In [9]:
moneyball[['Team','Year','W','Playoffs']].head()


Out[9]:
Team Year W Playoffs
330 ANA 2001 75 0
331 ARI 2001 92 1
332 ATL 2001 88 1
333 BAL 2001 63 0
334 BOS 2001 82 0

The goal: make it to the playoffs

The first step was to understand what was needed for the team to enter the playoffs. Judging from past seasons, DePodesta estimated that it takes 95 wins to be reasonably sure of making the playoffs.


In [10]:
import matplotlib.pyplot as plt
%matplotlib inline

In [11]:
my_yticks = ['NO', 'YES']
plt.yticks([0, 1], my_yticks)  # label the two possible Playoffs values
plt.scatter(moneyball.W, moneyball.Playoffs)
plt.axvline(x=95, color='r', linestyle='dashed')


Out[11]:
<matplotlib.lines.Line2D at 0x10d178400>

Note: you might ask why the goal is to make it to the playoffs and not directly to win the World Series.
The reason is that the A's managers see their job as making sure the team reaches the playoffs – after that, all bets are off.

“Over a long season the luck evens out, and the skill shines through. But in a series of three out of five, or even four out of seven, anything can happen.”

The goal: win 95 games

95 wins are necessary to be reasonably sure of going to the playoffs.
But how many runs are needed to get there?

Compute Run Difference

To win games a team needs to score more “runs” than its opponent, but how many more? DePodesta used linear regression to find out; we can retrace his steps.

The feature ‘RS’ is the number of runs scored and ‘RA’ is the number of runs allowed. We can add a feature that summarises both of them by computing their difference:


In [12]:
moneyball = moneyball.assign(RD = moneyball.RS - moneyball.RA)
moneyball.head()


Out[12]:
Team League Year RS RA W OBP SLG BA Playoffs RankSeason RankPlayoffs G OOBP OSLG RD
330 ANA AL 2001 691 730 75 0.327 0.405 0.261 0 NaN NaN 162 0.331 0.412 -39
331 ARI NL 2001 818 677 92 0.341 0.442 0.267 1 5.0 1.0 162 0.311 0.404 141
332 ATL NL 2001 729 643 88 0.324 0.412 0.260 1 7.0 3.0 162 0.314 0.384 86
333 BAL AL 2001 687 829 63 0.319 0.380 0.248 0 NaN NaN 162 0.337 0.439 -142
334 BOS AL 2001 772 745 82 0.334 0.439 0.266 0 NaN NaN 161 0.329 0.393 27

Wins are directly correlated with Run Difference


In [13]:
plt.scatter(moneyball.RD, moneyball.W)


Out[13]:
<matplotlib.collections.PathCollection at 0x110aa91d0>

The next step is to formalise the relation between Wins and Run Difference.

Our output is the number of wins W and we want to see how it depends on the run difference RD:

W = beta0 + beta1 * RD

This is a linear regression, and the beta parameters can be estimated from the data.
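For a single input, the least-squares betas also have a closed-form solution: beta1 = cov(RD, W) / var(RD) and beta0 = mean(W) - beta1 * mean(RD). A minimal sketch on made-up numbers (not the Lahman data), just to illustrate the estimation:

```python
import numpy as np

# Hypothetical run differences and win totals lying exactly on a line
rd = np.array([-100, -50, 0, 50, 100, 150], dtype=float)
wins = 81 + 0.105 * rd

# Closed-form ordinary least squares for one predictor
beta1 = np.cov(rd, wins, bias=True)[0, 1] / np.var(rd)  # population cov / var
beta0 = wins.mean() - beta1 * rd.mean()
print(beta0, beta1)  # recovers intercept 81 and slope 0.105
```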

Regression model to predict wins

We use the sklearn library and its linear_model module.


In [14]:
from sklearn import linear_model

In [15]:
WinsModel = linear_model.LinearRegression()

The input variable is the run difference (RD) and the output is the number of wins (W):


In [16]:
features = moneyball[['RD']]  # input
# explicit intercept column: its fitted coefficient stays 0 because
# LinearRegression(fit_intercept=True) already handles the intercept
features.insert(0, 'intercept', 1)

In [17]:
WinsModel.fit(features, moneyball.W)


Out[17]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

To see the fitted beta parameters we can examine the following values:


In [18]:
WinsModel.intercept_


Out[18]:
80.881374722838132

In [19]:
WinsModel.coef_


Out[19]:
array([ 0.        ,  0.10576562])

Therefore the prediction formula for the number of wins is:
Wins = 80.8814 + 0.1058 * RD

To find the run difference needed for 95 wins we solve this simple equation:
95 = 80.8814 + 0.1058 * RD
RD = (95 - 80.8814) / 0.1058 ≈ 133.5

A run difference of about 135 is enough for 95 wins.
We can verify it by predicting the wins for RD = 135:


In [20]:
WinsModel.predict([[1, 135]])


Out[20]:
array([ 95.15973375])
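Equivalently, the required run difference can be solved directly from the fitted parameters, without calling predict(), using the (rounded) intercept and slope printed above:

```python
# Rounded fitted values from WinsModel above
intercept, slope = 80.8814, 0.1058

# Solve 95 = intercept + slope * RD for RD
rd_needed = (95 - intercept) / slope
print(rd_needed)  # ≈ 133.4, so about 135 to be safe
```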

The goal: score 135+ runs more than opponent

How does a team increase the runs difference (RD)?
There are two ways: either scoring more runs (RS) or allowing fewer runs (RA).

The A’s started using a different method to select players, based on their statistics, not on their looks.

Most teams focused on Batting Average (BA): getting on base by hitting the ball.

The A’s discovered that BA was overvalued and two baseball statistics were significantly more important than anything else:

  • On-Base Percentage (OBP)
    Percentage of time a player gets on base (including walks)
  • Slugging Percentage (SLG)
    How far a player gets around the bases on his turn (measures power)
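Both statistics are simple ratios over a player's counting stats. A sketch with hypothetical season totals (the variable names are illustrative, not Lahman column names), using the standard definitions:

```python
# Hypothetical season totals for one player
H, BB, HBP, AB, SF = 150, 60, 5, 500, 4  # hits, walks, hit-by-pitch, at-bats, sac flies
singles, doubles, triples, homers = 100, 30, 5, 15  # breakdown of the 150 hits

# On-Base Percentage: how often the player reaches base
obp = (H + BB + HBP) / (AB + BB + HBP + SF)

# Slugging Percentage: total bases per at-bat (weights extra-base hits)
slg = (singles + 2 * doubles + 3 * triples + 4 * homers) / AB

print(round(obp, 3), round(slg, 3))
```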

We can use linear regression to verify which baseball player features are more important to predict runs.

Regression model to predict runs scored


In [21]:
RSmodel = linear_model.LinearRegression()
features = moneyball[['OBP', 'SLG', 'BA']]  # input
features.insert(0, 'intercept', 1)
RSmodel.fit(features, moneyball.RS)


Out[21]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [22]:
RSmodel.intercept_


Out[22]:
-788.45704708079347

In [23]:
RSmodel.coef_


Out[23]:
array([    0.        ,  2917.42140821,  1637.92766577,  -368.96606009])

Note that there are now three beta coefficients, one per input feature (the first entry of coef_ belongs to the explicit intercept column and stays at zero).

The prediction formula for Runs Scored is:
RS = -788 + 2917 * OBP + 1638 * SLG - 369 * BA

The score() function returns the R2 metric for the model:


In [24]:
RSmodel.score(features, moneyball.RS)


Out[24]:
0.93020162587862809

Which is very high, close to the maximum of 1.
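The R2 returned by score() is 1 minus the ratio of the residual to the total sum of squares; a minimal sketch on made-up numbers showing the manual computation matches score():

```python
import numpy as np
from sklearn import linear_model

# Toy data: y roughly linear in x
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = linear_model.LinearRegression()
model.fit(x, y)

# R^2 = 1 - SS_res / SS_tot
pred = model.predict(x)
ss_res = ((y - pred) ** 2).sum()          # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()      # total variation
r2 = 1 - ss_res / ss_tot

print(r2, model.score(x, y))  # the two values agree
```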

And if we remove the BA feature?


In [25]:
RSmodel2 = linear_model.LinearRegression()
features = moneyball[['OBP', 'SLG']]  # input
features.insert(0, 'intercept', 1)
RSmodel2.fit(features, moneyball.RS)


Out[25]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [26]:
RSmodel2.score(features, moneyball.RS)


Out[26]:
0.92958106080965974

There is almost no difference in the R2 when the BA feature is removed:
a sign that BA is not significant.

Batting Average (BA) is overvalued
On Base Percentage (OBP) and Slugging Percentage (SLG) are enough.


In [27]:
RSmodel2.intercept_


Out[27]:
-804.62706106223959

In [28]:
RSmodel2.coef_


Out[28]:
array([    0.        ,  2737.76802227,  1584.90860546])

Regression model to predict runs allowed

Predicting the Runs Allowed is more difficult because there are missing values in the data:


In [29]:
moneyball.isnull().sum()


Out[29]:
Team              0
League            0
Year              0
RS                0
RA                0
W                 0
OBP               0
SLG               0
BA                0
Playoffs          0
RankSeason      748
RankPlayoffs    748
G                 0
OOBP            812
OSLG            812
RD                0
dtype: int64

In [30]:
moneyball.shape


Out[30]:
(902, 16)

Out of 902 rows, 812 are missing the OOBP and OSLG features, which would be critical for predicting RA.

Just as an exercise we can go on, but the results will not be very reliable.

There are too many missing values to replace them with averages; it is better to drop these rows altogether.
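Equivalently, pandas can drop the incomplete rows in one step with dropna (a sketch on a toy frame; the notebook itself filters with np.isfinite):

```python
import numpy as np
import pandas as pd

# Toy frame with missing opponent stats (illustrative values)
df = pd.DataFrame({'OOBP': [0.31, np.nan, 0.33],
                   'OSLG': [0.40, 0.41, np.nan]})

# Keep only rows where both columns are present
clean = df.dropna(subset=['OOBP', 'OSLG'])
print(len(clean))  # 1: only the first row has both values
```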


In [31]:
import numpy as np

In [32]:
moneyball = moneyball[np.isfinite(moneyball['OOBP'])]

In [33]:
moneyball = moneyball[np.isfinite(moneyball['OSLG'])]

In [34]:
moneyball.shape


Out[34]:
(90, 16)

Now the data set is reduced to 90 rows ...


In [35]:
RAmodel = linear_model.LinearRegression()
features = moneyball[['OOBP', 'OSLG']]
features.insert(0, 'intercept', 1)
RAmodel.fit(features, moneyball.RA)


Out[35]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [36]:
RAmodel.intercept_


Out[36]:
-837.37788861333513

In [37]:
RAmodel.coef_


Out[37]:
array([    0.        ,  2913.59948582,  1514.28595842])

In [38]:
RAmodel.score(features, moneyball.RA)


Out[38]:
0.9073132027228662

2002 Prediction

We know that the A's Team OBP in 2002 was 0.339 and the Team SLG was 0.430.
How many runs and wins can we expect?

We could just plug these values into the formula above:


In [39]:
RS = -804.63 + 2737.77*0.339 + 1584.91*0.43
RS


Out[39]:
804.9853300000001

Or we can use the handy function predict():


In [40]:
RSmodel2.predict([[1, 0.339, 0.43]])


Out[40]:
array([ 804.98699884])

We predict 805 Runs scored in 2002

Let's also predict RA, just as an example, even though the model was fitted on only 90 rows.
We know that the Team OOBP in 2002 was 0.307 and the Team OSLG was 0.373.


In [41]:
RAmodel.predict([[1, 0.307, 0.373]])


Out[41]:
array([ 621.92581602])

Our 2002 prediction is 622 Runs Allowed.

In reality, Paul DePodesta at the A's used a similar but slightly different approach and predicted 650-670 Runs Allowed.

And finally the number of wins:


In [42]:
WinsModel.predict([[1, (805-622)]])


Out[42]:
array([ 100.23648363])

So our prediction for the A's in 2002 is 100 wins in total, which should be enough to make the playoffs.

Final results

These are 2002 final results:

  • Actual runs scored were 800 instead of 805
  • Runs allowed were 653 instead of 622 (note the bigger difference)
  • And wins were 103 instead of 100

They made it to the playoffs

Models (even relatively simple models) allow managers to more accurately value players and minimise risk.

Every major league baseball team now has a statistics group.
Analytics are used in other sports, too.