How have MLB Hall of Famers' salaries changed over time? I will use Sean Lahman's Baseball Database to explore this and a few related questions.
The Lahman database only has salary data going back to 1985, but baseball-reference.com has much more. I obtained this data for a large set of players. Here's what it looks like:
In [1]:
import pandas as pd
import numpy as np
In [2]:
salaries = pd.read_csv('./../data/Salaries/salary.csv')
In [3]:
salaries.head()
Out[3]:
In [4]:
salaries.shape
Out[4]:
How many unique players?
In [5]:
len(salaries.bbrefID.unique())
Out[5]:
Now let's establish a connection to the Lahman Database. I have this database loaded into a psql database on an AWS instance. I'll connect to it here.
In [6]:
from sqlalchemy import create_engine
import getpass
In [7]:
passw = getpass.getpass("Password Please: ")
In [8]:
cnx = create_engine('postgresql://adam:%s@52.23.226.111:5432/baseball'%passw)
Here are the tables in the database:
In [9]:
print(', '.join(pd.read_sql_query("select table_name from information_schema.tables where table_schema = 'public';", cnx).table_name.tolist()))
Let's take a look at the hall of fame table.
In [10]:
hall_of_fame = pd.read_sql_query('select * from hall_of_fame;',cnx)
In [11]:
hall_of_fame.head()
Out[11]:
In [12]:
hall_of_fame.votedby.value_counts()
Out[12]:
In [13]:
hall_of_fame.category.value_counts()
Out[13]:
I'll only consider Players. Also, I'll exclude players from the Negro League since I do not have salary data on them.
I'll make a python set of all the player ids of these hall of famers.
In [14]:
hall = set(hall_of_fame[(hall_of_fame.inducted=='Y') &
(hall_of_fame.category=='Player') &
(hall_of_fame.votedby!='Negro League')].player_id)
hall.discard('griffcl01')  # not inducted as a player: http://www.baseball-reference.com/players/g/griffcl01.shtml
In [15]:
len(hall)
Out[15]:
Now let's filter the salary table to just Hall of Famers. We need to first match the bbref IDs to the player_id that the Lahman database uses.
In [16]:
player = pd.read_sql_query('select * from player;',cnx)
In [17]:
bbid_to_pid = {b:p for b,p in zip(player.bbref_id,player.player_id)}
pid_to_name = {p:(fn,ln) for p,fn,ln in zip(player.player_id,player.name_first,player.name_last)}
In [18]:
salaries.insert(0,'player_id',[bbid_to_pid[bbid] for bbid in salaries.bbrefID])
In [19]:
salaries = salaries[salaries.player_id.isin(hall)].reset_index(drop=True)
In [20]:
salaries.head(3)
Out[20]:
In [21]:
salaries.shape
Out[21]:
Let's see if we have data on all 225 Hall of Famers.
In [22]:
len(salaries.player_id.unique())
Out[22]:
Ok, that's not bad. Let's see how many null values there are for salary.
In [23]:
sum(salaries.Salary.isnull())
Out[23]:
Yikes, that's a lot. We'll have to figure out a smart way to deal with that.
Let's see some of the oldest data.
In [24]:
salaries.sort_values('Year').head(7)
Out[24]:
Some of the null values arise because a year in which a player played for multiple teams has a null entry for one of the teams. After converting Salary to a number, I'll group by player_id and Year and see how many truly missing Salary entries we have.
In [25]:
salaries.Salary = pd.to_numeric(salaries.Salary.str.replace('$', '', regex=False).str.replace(',', '', regex=False))
salaries.Year = salaries.Year.astype(int)
In [26]:
salaries.head(3)
Out[26]:
In [27]:
unique_player_years = salaries.groupby(['player_id','Year']).sum().shape[0]
null_player_years = sum(salaries.groupby(['player_id','Year']).sum().Salary.isnull())
In [28]:
print(unique_player_years, null_player_years, null_player_years / unique_player_years)
Still, 39% of the data is missing. Eventually I will impute it in a way that makes sense. First, let's start visualizing the data a little. Let's aggregate the mean salary by year.
In [29]:
counts = salaries.dropna().groupby('Year',as_index=False).count()[['Year','Salary']]
In [30]:
counts.head(5)
Out[30]:
In [31]:
counts.tail(3)
Out[31]:
To avoid too noisy a picture, I'm going to restrict the mean salaries to years in which we have at least 4 players' salaries.
In [32]:
mean_salaries = salaries.dropna().groupby('Year',as_index=False).mean()[counts.Salary>3]
In [33]:
mean_salaries.head(3)
Out[33]:
I'll plot the average HOF salary across time. Actually, I'll plot the log of salary since that will make it easier to visualize.
In [34]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [35]:
plt.plot(mean_salaries.Year,np.log10(mean_salaries.Salary));
plt.xlabel('Year');
plt.ylabel('Log of Average Salary');
plt.title('Log of Average Salary of Hall of Famers');
Ok, this is as expected: the average salary has been increasing through time. It's also reassuring that the average salary plateaued from about 1926 to about 1940 and then dropped around World War II, which matches historical expectations.
Let's plot this next to the Consumer Price Index to see it in some context. Since the CPI was not officially tracked until 1913, I obtained earlier CPI data from here. This is the series used by Robert Shiller in Irrational Exuberance, and it happens to start in 1871, the first year of MLB. I'll read it in and then do the necessary steps to plot it next to the average salary.
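The inflation adjustment used throughout the rest of this notebook is just a ratio of CPI values. Here is the arithmetic on its own, with placeholder CPI numbers (not values from the Shiller series):

```python
def adjust_for_inflation(amount, cpi_from, cpi_to):
    """Express `amount` (in the source year's dollars) in the target year's dollars."""
    return amount * (cpi_to / cpi_from)

# e.g. $1,000 in a year when CPI was 10.0, expressed in a year when CPI is 200.0
print(adjust_for_inflation(1000, 10.0, 200.0))  # -> 20000.0
```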
In [36]:
cpi = pd.read_csv('./../data/Salaries/ie_data.csv')[['Date','CPI']]
In [37]:
cpi.head(3)
Out[37]:
I'll just use March of every year since I don't need month-by-month data.
In [38]:
cpi = cpi[cpi.Date.astype(str).str.endswith('03')].reset_index(drop=True)
cpi.Date = cpi.Date.astype(int)
cpi.columns = ['Year','CPI']
In [39]:
cpi.head(3)
Out[39]:
Now I want to see how the salary of a base year compares to the rest of the years once we adjust for inflation. I'll use 1881 as the base year for now since it is the first year we have non-null data for. I'll calculate what 1,372 dollars in 1881 corresponds to in each of the other years, then plot the result.
In [40]:
cpi.insert(len(cpi.columns),'Base-1881',cpi.CPI.values/cpi[cpi.Year==1881].CPI.values)
In [41]:
cpi.head(4)
Out[41]:
In [42]:
adjusted_base_salary = pd.merge(cpi,mean_salaries,on='Year')
adjusted_base_salary['1881_Salary_Adjusted'] = adjusted_base_salary.iloc[0].Salary*adjusted_base_salary['Base-1881']
In [43]:
adjusted_base_salary.head(4)
Out[43]:
In [44]:
plt.plot(adjusted_base_salary.Year,np.log10(adjusted_base_salary.Salary),label='Average Salary');
plt.plot(adjusted_base_salary.Year,np.log10(adjusted_base_salary['1881_Salary_Adjusted']),label='CPI');
plt.xlabel('Year');
plt.ylabel('Log Dollars');
plt.title('Log of Average Salary Versus Log CPI');
plt.legend();
Ok, so we can see quite clearly that the average salary of hall of famers has outpaced the rate of inflation. Let's see this one other way. Let's put all the average salaries in 2009 dollars (the last year in this data) and then plot the average through time.
In [45]:
adjusted_base_salary.insert(3,'Base-2009',adjusted_base_salary.iloc[-1].CPI/adjusted_base_salary.CPI)
In [46]:
adjusted_base_salary.head(2)
Out[46]:
In [47]:
adjusted_base_salary['Salary_in_2009_dollars'] = adjusted_base_salary.Salary * adjusted_base_salary['Base-2009']
In [48]:
adjusted_base_salary.head(2)
Out[48]:
In [49]:
plt.plot(adjusted_base_salary.Year,np.log10(adjusted_base_salary.Salary_in_2009_dollars),label='Average Salary');
plt.xlabel('Year');
plt.ylabel('Log Dollars');
plt.title('Log of Average Salary in 2009 Dollars');
plt.legend();
Since log10 of 10 million is 7, this corresponds to Hall of Famers making around 10 million dollars a year on average in the mid-2000s. Before the turn of the 20th century, Hall of Fame players were making only between 30 and 100 thousand dollars in 2009 dollars.
Salaries have increased tremendously over the past 40 years. Hall of Fame caliber players now average 10 times more per year over the course of their careers than Hank Aaron made at the peak of his earning power.
I feel as though I have satisfactorily answered the first of my four driving questions. Now on to the rest.
We would like to impute the missing data. Here's my plan: use the ratios between players' earnings across the years of their careers to build an average career earnings trajectory, then use that trajectory to impute the missing values.
Because the typical career trajectory might itself change over time, I'll bin the data into 6 periods: pre-1900, 1900–1919, 1920–1939, 1940–1959, 1960–1979, and 1980 onward.
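The imputation idea on a toy example (all numbers below are made up, not real salaries): if the average trajectory says year 3 pays 1.5x the year-1 salary, a player's missing year-3 salary can be estimated from each of his observed years and the estimates averaged.

```python
import numpy as np

# Hypothetical average trajectory: ratio of each career year's salary to year 1.
trajectory = {1: 1.0, 2: 1.2, 3: 1.5, 4: 1.6}

# A player's observed salaries by career year; year 3 is missing.
observed = {1: 100_000, 2: 130_000, 4: 150_000}

# Estimate year 3 from every observed year, then average the estimates.
estimates = [trajectory[3] / trajectory[y] * s for y, s in observed.items()]
imputed = np.mean(estimates)
print(imputed)
```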
First, we will have to drop the players whose salaries are all missing, since we have no information on them.
In [50]:
counts_by_player = salaries.groupby('player_id').count()
players_to_drop = counts_by_player[counts_by_player.Salary == 0].index
In [51]:
players_to_drop
Out[51]:
In [52]:
salaries = salaries[~salaries.player_id.isin(players_to_drop)].reset_index(drop=True)
First let me insert a column for year of the career and adjust all the salaries for inflation.
In [53]:
salaries.insert(3, 'Year_of_career', np.zeros(len(salaries)))
for bbref in pd.unique(salaries.bbrefID):
    salaries.loc[salaries.bbrefID == bbref, 'Year_of_career'] = np.arange(1, sum(salaries.bbrefID == bbref) + 1)
In [54]:
cpi.insert(len(cpi.columns),'Base-2010',cpi[cpi.Year==2010].CPI.values/cpi.CPI.values)
In [55]:
year_to_base_2010 = {y:b for y,b in zip(cpi.Year,cpi['Base-2010'])}
In [56]:
salaries.insert(len(salaries.columns),'Salary-2010',[year_to_base_2010[y]*s for y,s in zip(salaries.Year,salaries.Salary)])
In [57]:
salaries.head(3)
Out[57]:
Now I'm going to drop duplicate player–Year combinations. I need to make sure to drop only the null row and not the non-null one, so I'll first sort by player, Year, and Salary (nulls sort last) and then drop duplicates, keeping the first.
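As a quick sanity check on toy data (pandas places NaN values last when sorting by default, so `keep='first'` retains the non-null row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'player': ['a', 'a', 'b'],
                   'year':   [1990, 1990, 1990],
                   'salary': [np.nan, 100.0, 200.0]})

# NaN sorts last within each player/year group, so keep='first' keeps the value.
df = df.sort_values(['player', 'year', 'salary'])
df = df.drop_duplicates(subset=['player', 'year'], keep='first')
print(df.salary.tolist())  # -> [100.0, 200.0]
```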
In [58]:
salaries = salaries.sort_values(['player_id','Year','Salary'])
salaries = salaries.drop_duplicates(subset=['player_id','Year'],keep='first')
Now I'm going to calculate the ratio of first year salary to the rest of the year's salaries across all players for which the first year's salary is available.
In [59]:
max_seasons = int(salaries.Year_of_career.max())
A = pd.DataFrame({'%d' % i: [[] for _ in range(max_seasons)] for i in range(1, max_seasons + 1)})[['%d' % i for i in range(1, max_seasons + 1)]]
for player_id, df in salaries.groupby('player_id'):
    for year1 in df.Year_of_career:
        for year2 in df.Year_of_career:
            ratio = (df[df.Year_of_career == year1]['Salary-2010'].values
                     / df[df.Year_of_career == year2]['Salary-2010'].values)[0]
            if np.isnan(ratio):
                continue
            A.iloc[int(year1) - 1, int(year2) - 1].append(ratio)
Time to plot the data.
In [60]:
x, y, w = [], [], []
for u, arr in enumerate(A['1']):
    if len(arr) <= 3:
        continue
    x.append(u + 1)
    y.append(np.mean(arr))
    w.append(1 / np.std(arr) if np.std(arr) != 0 else 1)
In [61]:
from scipy.interpolate import InterpolatedUnivariateSpline
In [62]:
s = InterpolatedUnivariateSpline(x, y, w, k=1)
In [63]:
plt.scatter(x,y);
plt.plot(x,s(x));
plt.title('Average Hall of Famer Earning Trajectory')
plt.xlabel('Year of Career')
plt.ylabel('Ratio to First Year Salary');
Now I'll use this average trajectory to impute the missing salary data. For each missing year, I'll estimate the salary from every available year of that player's salary data and take the average of the estimates.
In [64]:
for player_id, df in salaries.groupby('player_id'):
    for year1 in df.Year_of_career:
        if np.isnan(df[df.Year_of_career == year1]['Salary-2010'].values[0]):
            impute = []
            for year2 in df.Year_of_career:
                val = df[df.Year_of_career == year2]['Salary-2010'].values[0]
                if np.isnan(val):
                    continue
                impute.append(s(year1) / s(year2) * val)
            salaries.loc[(salaries.player_id == player_id) & (salaries.Year_of_career == year1), 'Salary-2010'] = np.mean(impute)
In [65]:
sum(salaries['Salary-2010'].isnull())
Out[65]:
Yay! No more nulls. Now let's bin the data into our 6 bins then visualize the career earning trajectories for each of the bins.
In [66]:
salaries.insert(len(salaries.columns),'Bin_1',salaries.Year<1900)
salaries.insert(len(salaries.columns),'Bin_2',np.logical_and(salaries.Year>=1900,salaries.Year<1920))
salaries.insert(len(salaries.columns),'Bin_3',np.logical_and(salaries.Year>=1920,salaries.Year<1940))
salaries.insert(len(salaries.columns),'Bin_4',np.logical_and(salaries.Year>=1940,salaries.Year<1960))
salaries.insert(len(salaries.columns),'Bin_5',np.logical_and(salaries.Year>=1960,salaries.Year<1980))
salaries.insert(len(salaries.columns),'Bin_6',salaries.Year>=1980)
In [67]:
for b in range(1, 7):
    bin_means = salaries[salaries['Bin_%d' % b]].groupby('Year_of_career', as_index=False).mean()
    base_salary = bin_means.iloc[0]['Salary-2010']
    plt.plot(bin_means.Year_of_career, bin_means['Salary-2010'] / base_salary, label='Bin %d' % b)
plt.legend();
plt.xlabel('Year of Career');
plt.ylabel("Ratio to First Year's Salary");
plt.title('Career Earnings Trajectory Across Six Time Periods');
So not only are Hall of Famers making more money than ever, but the ratio of their peak-career salary to their rookie salary is higher as well.
Who is the highest paid Hall of Famer of all time? If what I've seen so far has taught me anything, it's that Hall of Famer pay has changed substantially throughout the history of baseball. How best to answer this question can definitely be debated, but I think players should be compared to their peers to control for aspects of the game changing through time.
To answer this question I will look at two metrics: highest single-season pay and average pay over a career, both in 2010 dollars.
I will take a nearest-neighbors approach to this question. My X will have two features: the year a player's career started and the year it ended. For each player, I'll find his k nearest neighbors and compare their average metric to the player's own metric. The players with the largest positive differences I'll deem the highest paid.
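The leave-one-out scheme described above can be sketched on synthetic data (random career spans and pay, not the real Hall of Fame numbers): fit a KNeighborsRegressor on everyone but the held-out player, predict his "era-expected" pay from his career span, and score him by how far above it he sits.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: (first year, last year) of career, plus a pay metric.
X = np.column_stack([rng.integers(1900, 1990, 50),
                     rng.integers(1990, 2010, 50)])
pay = rng.uniform(1e5, 1e7, 50)

residuals = []
for i in range(len(pay)):
    mask = np.arange(len(pay)) != i                       # hold player i out
    knn = KNeighborsRegressor(n_neighbors=8).fit(X[mask], pay[mask])
    residuals.append(pay[i] - knn.predict(X[i:i + 1])[0])  # actual minus era-expected

print(int(np.argmax(residuals)))  # index of the most "overpaid" synthetic player
```

Positive residuals mean a player out-earned contemporaries with similar career spans, which is exactly the comparison used below on the real data.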
First I'll make a nice dataframe with player_id, first season of career, last season of career, highest single season pay and average pay.
In [68]:
gb = salaries.groupby('player_id')
first_season = gb['Year'].min().to_dict()
last_season = gb['Year'].max().to_dict()
highest_season_pay = gb['Salary-2010'].max().to_dict()
ave_pay = gb['Salary-2010'].mean().to_dict()
In [69]:
pids = pd.unique(salaries.player_id)
salaries_new = pd.DataFrame({'player_id': pids,
                             'first_season': [first_season[p] for p in pids],
                             'last_season': [last_season[p] for p in pids],
                             'highest_season_pay': [highest_season_pay[p] for p in pids],
                             'ave_pay': [ave_pay[p] for p in pids]})
salaries_new = salaries_new[['player_id','first_season','last_season','highest_season_pay','ave_pay']]
In [70]:
salaries_new.head(3)
Out[70]:
Let's try k equals 8 nearest neighbors first.
In [71]:
from sklearn.neighbors import KNeighborsRegressor
In [72]:
knn = KNeighborsRegressor(n_neighbors=8, weights='uniform')
d_larg = {}
for player_id in pd.unique(salaries_new.player_id):
    X = salaries_new[salaries_new.player_id != player_id].iloc[:, 1:3].values
    y = salaries_new[salaries_new.player_id != player_id].iloc[:, -2]
    knn.fit(X, y)
    d_larg[player_id] = (salaries_new[salaries_new.player_id == player_id].iloc[:, -2].values
                         - knn.predict(salaries_new[salaries_new.player_id == player_id].iloc[:, 1:3].values))[0]
Top 5 Players:
In [73]:
for key in sorted(d_larg, key=d_larg.get, reverse=True)[:5]:
    print(key, ' '.join(pid_to_name[key]), d_larg[key])
Let's try k equals 12 also.
In [74]:
knn = KNeighborsRegressor(n_neighbors=12, weights='uniform')
d_larg = {}
for player_id in pd.unique(salaries_new.player_id):
    X = salaries_new[salaries_new.player_id != player_id].iloc[:, 1:3].values
    y = salaries_new[salaries_new.player_id != player_id].iloc[:, -2]
    knn.fit(X, y)
    d_larg[player_id] = (salaries_new[salaries_new.player_id == player_id].iloc[:, -2].values
                         - knn.predict(salaries_new[salaries_new.player_id == player_id].iloc[:, 1:3].values))[0]
In [75]:
for key in sorted(d_larg, key=d_larg.get, reverse=True)[:5]:
    print(key, ' '.join(pid_to_name[key]), d_larg[key])
Seems pretty robust to the choice of k. This metric really favors pitchers. Now let's look at the average pay metric, trying k = 10 this time.
In [76]:
knn = KNeighborsRegressor(n_neighbors=10, weights='uniform')
d_ave = {}
for player_id in pd.unique(salaries_new.player_id):
    X = salaries_new[salaries_new.player_id != player_id].iloc[:, 1:3].values
    y = salaries_new[salaries_new.player_id != player_id].iloc[:, -1]
    knn.fit(X, y)
    d_ave[player_id] = (salaries_new[salaries_new.player_id == player_id].iloc[:, -1].values
                        - knn.predict(salaries_new[salaries_new.player_id == player_id].iloc[:, 1:3].values))[0]
In [77]:
for key in sorted(d_ave, key=d_ave.get, reverse=True)[:5]:
    print(key, ' '.join(pid_to_name[key]), d_ave[key])
According to this analysis, Pedro Martinez is the highest paid Hall of Famer of all time.
I'll conclude by showing the lowest paid Hall of Famers of all time by both metrics, reusing the fits from above for brevity's sake.
In [78]:
for key in sorted(d_larg, key=d_larg.get)[:5]:
    print(key, ' '.join(pid_to_name[key]), d_larg[key])
In [79]:
for key in sorted(d_ave, key=d_ave.get)[:5]:
    print(key, ' '.join(pid_to_name[key]), d_ave[key])
According to this analysis, either Craig Biggio or Roberto Alomar is the lowest paid Hall of Famer of all time, depending on which metric you prefer.