Lahman’s Baseball Database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.
One of the topics covered in Lahman's Baseball Database is MLB annual salaries. This notebook provides a sample analysis that explores trends in player salaries. Several questions are addressed:
Bootstrap notebook with necessary notebook and library dependencies.
This notebook requires the installation of the following software dependencies:
!pip install statsmodels
In [22]:
# Provide the inline code necessary for loading any required libraries
In [23]:
import matplotlib.pyplot as plt
%matplotlib inline
In [24]:
import pandas as pd
df_mlb_Salaries = pd.read_csv('/resources/mlb_Salaries_2011.csv').dropna()
df_mlb_Salaries.describe()
Out[24]:
In [25]:
df_mlb_Salaries.plot()
Out[25]:
In [26]:
df_mlb_Salaries.tail()
Out[26]:
In [27]:
american_league = df_mlb_Salaries.query('lgID == "AL"')
national_league = df_mlb_Salaries.query('lgID == "NL"')
In [28]:
american_league_avg_annual_salary = american_league.groupby(['yearID']).mean()[['salary']]
national_league_avg_annual_salary = national_league.groupby(['yearID']).mean()[['salary']]
In [29]:
al = pd.Series(american_league_avg_annual_salary.salary, name="American League")
nl = pd.Series(national_league_avg_annual_salary.salary, name="National League")
In [30]:
df_league_salary_history = pd.concat([al, nl], axis=1)
df_league_salary_history.head()
Out[30]:
In [31]:
df_league_salary_history.describe()
Out[31]:
In [32]:
df_league_salary_history.plot()
plt.rcParams['xtick.major.pad']='10'
plt.title('MLB Annual Salary Averages')
plt.xlabel('Year')
plt.ylabel('Average Salary')
yvalues = american_league_avg_annual_salary.index
plt.xticks(yvalues, rotation='vertical')
plt.show()
In [33]:
def annual_salary_delta(previous_year_salary, current_year_salary):
'''Compute annual salary delta.'''
return current_year_salary - previous_year_salary
def annual_league_salary_increases(salaries):
'''Return list of annual salary deltas for a series of league salaries.'''
salincperyr = []
for idx, val in enumerate(salaries):
if idx == 0:
salincperyr.append(0)
elif idx == len(salaries)-1:
salincperyr.append(0)
else:
salincperyr.append(annual_salary_delta(salaries.iloc[idx-1],
salaries.iloc[idx]))
return salincperyr
In [34]:
tablecols = []
tablecols.append(yvalues)
tablecols.append(df_league_salary_history["American League"].tolist())
tablecols.append(df_league_salary_history["National League"].tolist())
tablecols.append(annual_league_salary_increases(df_league_salary_history["American League"]))
tablecols.append(annual_league_salary_increases(df_league_salary_history["National League"]))
df_league_salary_detailed_history = pd.DataFrame(tablecols).T
df_league_salary_detailed_history.columns = ['Year','alAvgSalary', 'nlAvgSalary','alAnnualSalaryIncrease', 'nlAnnualSalaryIncrease']
df_league_salary_detailed_history.set_index(df_league_salary_detailed_history['Year'])
df_league_salary_detailed_history
Out[34]:
In [35]:
x = df_league_salary_detailed_history['Year']
plt.plot(x,df_league_salary_detailed_history["alAnnualSalaryIncrease"], label="American League")
plt.plot(x,df_league_salary_detailed_history["nlAnnualSalaryIncrease"], label="National League")
plt.title('MLB Year-to-Year Salary Deltas')
plt.xlabel('Year')
plt.ylabel('Average Salary')
xvalues = american_league_avg_annual_salary.index
plt.xticks(xvalues, rotation='vertical')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [36]:
def create_season_opener_datetime_index(date_series):
'''Return a series of datetime index objects for a series of date strings'''
import datetime
import pandas as pd
dates = []
for i in date_series:
d = datetime.date(i,4,1)
dates.append(d)
return pd.DatetimeIndex(dates)
We have several data structures we need to transform in order to perform predictive analytics. Most notably is the need to have our data structures indexed using a datetime object. Our original data contains only the year for a season but our algorithms require an actual date so we will assume April 1st for each year as the season opener.
In [37]:
season_opener_dates = create_season_opener_datetime_index(df_league_salary_detailed_history['Year'])
df_al_annual_salary_inc = pd.DataFrame(
df_league_salary_detailed_history["alAnnualSalaryIncrease"].tolist(),
index=season_opener_dates,
columns=["alAnnualSalaryIncrease"]
)
df_nl_annual_salary_inc = pd.DataFrame(
df_league_salary_detailed_history["nlAnnualSalaryIncrease"].tolist(),
index=season_opener_dates,
columns=["nlAnnualSalaryIncrease"]
)
print df_al_annual_salary_inc.head()
print df_nl_annual_salary_inc.head()
In [38]:
import statsmodels.api as sm
al_ar_model = sm.tsa.AR(df_al_annual_salary_inc)
al_ar_model_res = al_ar_model.fit()
print al_ar_model_res.params
nl_ar_model = sm.tsa.AR(df_nl_annual_salary_inc)
nl_ar_model_res = nl_ar_model.fit()
print nl_ar_model_res.params
In [39]:
predict_al_increases = al_ar_model_res.predict(start='1995-04-01', end='2025-01-01', dynamic=True)
predict_nl_increases = nl_ar_model_res.predict(start='1995-04-01', end='2025-01-01', dynamic=True)
In [40]:
plt.plot(predict_al_increases.index,predict_al_increases, label="American League")
plt.plot(predict_nl_increases.index,predict_nl_increases, label="National League")
plt.rcParams['xtick.major.pad']='15'
plt.title('MLB Predictive Annual Averages Salary Increase')
plt.xlabel('Year')
plt.ylabel('Average Salary Increases')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [41]:
data_to_compare = {'AL Prediction': predict_al_increases,
'AL Actual': df_al_annual_salary_inc.alAnnualSalaryIncrease,
'NL Prediction': predict_nl_increases,
'NL Actual': df_nl_annual_salary_inc.nlAnnualSalaryIncrease
}
df_annual_growth_predictive_analysis = pd.DataFrame(data_to_compare)
ax = df_annual_growth_predictive_analysis.plot()
ax.set_ylabel('Average Salary Increases')
plt.rcParams['xtick.major.pad']='15'
plt.title('MLB Annual Salary Growth Predictive Analysis')
plt.xlabel('Year')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()