Baseball started in the United States in the 1700s and 1800s. In the United States, it is called the national pastime, because so many people in the United States used to spend a lot of time playing or watching baseball games. Today, though, most Americans follow football more than baseball, especially when it is time for the Super Bowl. Baseball is a game of statistics. Since the flow of a baseball game has natural breaks to it, and normally players act individually rather than performing in clusters, the sport lends itself to easy record-keeping and statistics. Statistics have been kept for professional baseball since the creation of the National League and American League, now part of Major League Baseball. The importance of statistics in baseball increased with the advent of free agency in 1974. Statistics became vital information in determining individual salary, with the free agent market clearly setting what each player should be paid for his given statistics.
Baseball statistics are used in evaluating a player's and/or a team's performance. The following analysis is based on baseball data made available by Sean Lahman, an award-winning database journalist, and author. The data can be downloaded from his website. The version of the data used in the following analysis is 2015-comma-delimited version. Let us start by exploring the player's salaries.
In [1]:
# Import dependencies
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib as mpl
import seaborn as sns
# Set display properties
%pylab inline
sns.set_style('white')
pd.options.display.max_rows = 8
pd.options.display.max_columns = 15
In [2]:
# Read data from csv file into pandas dataframe
def read_csv(filename):
# The argument to this function is the filename of csv file
path = './baseballdatabank-master/core/'
data = pd.read_csv(path + filename, sep=',', keep_default_na=False, na_values=[''])
print filename, 'has', data.shape[0], 'rows and', data.shape[1], 'columns'
return data
In [3]:
# Format summary statistics of salary
def format_salary(x):
return r'$%1.2fM' % (x*1e-6)
In [4]:
# Read in the salary details
salaries = read_csv('Salaries.csv')
In [5]:
salaries.info()
In [6]:
print salaries.head(), '\n'
print 'Data from', salaries.yearID.min(), 'to', salaries.yearID.max()
In [7]:
# Salary Stats
def yearly_stats(groupby):
# The argument to this function is the column name by which data needs to be grouped: ['playerID', 'teamID']
cols = ['yearID']
cols.append(groupby)
s = salaries.groupby(cols, as_index=False).sum()
s = s.groupby('yearID').agg({"salary": {"max_salary" : np.max,
"avg_salary" : np.mean,
"median_salary" : np.median,
"min_salary" : np.min}})
s.columns = s.columns.get_level_values(1)
return s
In [8]:
yearly_salary_stats = yearly_stats('playerID')
yearly_payroll_stats = yearly_stats('teamID')
In [9]:
# Format salary labels to make them concise in plots
def millions(x, pos):
# The two args are the value and tick position
return format_salary(x)
formatter = FuncFormatter(millions)
In [10]:
# Plot average annual salary vs year
f, ax = plt.subplots()
f.set_size_inches((8,5))
plt.plot(yearly_salary_stats.index, yearly_salary_stats.avg_salary, label='Avg. Player Salary')
ax.yaxis.set_major_formatter(formatter)
plt.title("Distribution of Players' Salary over Time", fontsize=15, weight='bold')
plt.xlabel("Year", weight='bold')
plt.ylabel("Average Salary (in millions)", weight='bold')
plt.grid()
Average yearly salary of a player has an increasing trend with time. There are some up's and down's which coincide with some of the major events in baseball history. For example, the abrupt increase from 1990 to 1991 coincides with 1990 Major League Baseball lockout which was resolved by raising minimum major league salary from \$68,000 to \$100,000. The period 1993-95 has a non-increasing trend. During this period, a dispute played out between the players and the ownership regarding the proposal of salary-cap which led to the 1994–95 Major League Baseball strike. As a result of this strike, the 1994 World series was canceled. Post-strike years has an ever increasing trend. There is a small dip in the year 2004 which coincides with the announcement of a new drug policy in Jan 2004 which led to the suspension of many MLB players for the use of performance-enhancing drugs.
In [11]:
yearly_salary_stats.avg_salary.describe().iloc[1:].apply(format_salary)
Out[11]:
The average salary of a player has increased by about 800% from \$0.5M in 1985 to \$4.3M in 2015. In our effort to explore the MLB salaries, let us group the player salaries into teams and take a look at the distribution of the amount spent by various teams on their payrolls.
In [12]:
# Plot average annual payroll vs year
f, ax = plt.subplots()
f.set_size_inches((8,5))
plt.plot(yearly_payroll_stats.index, yearly_payroll_stats.min_salary, label='Min. Payroll')
plt.plot(yearly_payroll_stats.index, yearly_payroll_stats.avg_salary, label='Avg. Payroll')
plt.plot(yearly_payroll_stats.index, yearly_payroll_stats.max_salary, label='Max. Payroll')
ax.fill_between(yearly_payroll_stats.index, yearly_payroll_stats.min_salary,
yearly_payroll_stats.max_salary, alpha=0.2, color='0.85')
ax.yaxis.set_major_formatter(formatter)
plt.title("Distribution of Team Payrolls over Time", fontsize=15, weight='bold')
plt.xlabel("Year", weight='bold')
plt.ylabel("Payroll (in millions)", weight='bold')
plt.legend(loc='best')
plt.grid()
Team payrolls have an interesting trend over time. The maximum amount spent on payrolls increases more steeply than the minimum amount. During 1985-90, the increase is very low, and the range of salaries is almost constant. From 1991 to 2005 the maximum amount spent on payroll increased steeply while the increase in the minimum amount spent is very low. Moreover, this increasing trend in maximum payroll is absent after 2005. Let us compare individual teams on their payroll amounts.
In [13]:
# Yearly payroll of teams
yearly_payroll = salaries.groupby(['teamID', 'yearID'], as_index=False).sum()
In [14]:
# Spread of amount spent by individual teams on payroll
plt.figure(figsize=(8,10))
order = yearly_payroll.groupby('teamID').median().sort_values(by='salary', ascending=False).index
g = sns.boxplot(data=yearly_payroll, x='salary', y='teamID', order=order, palette='viridis')
plt.setp(g.artists, alpha=0.8)
g.xaxis.set_major_formatter(formatter)
g.set_title("Distribution of Payroll across Teams (1985-2015)", fontsize=15, weight='bold')
g.set_xlabel("Team Payroll (in millions)", weight='bold')
g.set_ylabel("TeamID", weight='bold')
g.grid()
The boxplot clearly shows that not all teams spend equally on their payrolls. For example, the minimum amount spent by Los Angeles Angels of Anaheim (LAA) is higher than the median amount of all individual teams. A possible explanation for the observed trend can be that some teams can spend huge amounts on the payroll to hire the best talent while others cannot support the best players financially. How can we test this hypothesis? If our hypothesis is true then the team's payroll should translate to teams performance in regular season and post-season. We test our hypothesis by performing regression analysis relating regular season and post-season success to team payroll.
In [15]:
# Read in team details
teams = read_csv('Teams.csv') #Regular season details
teams_post = read_csv('SeriesPost.csv') #Post-Season details
In [16]:
teams.info()
In [17]:
teams.head()
Out[17]:
In [18]:
# Add salary data to team details
teams = teams.merge(yearly_payroll, on=['teamID', 'yearID'], how='left')
In [19]:
# Calculate Regular season winning percentage
teams['win_percent'] = (teams.W / teams.G) * 100
In [20]:
teams.head()
Out[20]:
In [21]:
teams_post.info()
In [22]:
teams_post.head()
Out[22]:
The post-season details are not in a friendly format for our analysis. The following code cell reshapes the data and calculates post-season wins and winning percentages for individual teams in each year.
In [23]:
# Reshape post-season details to calculate winning percentage
games_won = teams_post.groupby(['teamIDwinner','yearID']).sum()
games_lost = teams_post.groupby(['teamIDloser','yearID']).sum()
playoffs = pd.concat([games_won, games_lost], axis=0).reset_index()
playoffs.columns = ['teamID', 'yearID', 'post_W', 'post_L', 'post_T']
playoffs['post_G'] = playoffs.post_W + playoffs.post_L + playoffs.post_T
playoffs = playoffs.groupby(['teamID', 'yearID']).sum().reset_index()
In [24]:
# Add salary data into the post-season details
playoffs = playoffs.merge(yearly_payroll, on=['teamID', 'yearID'], how='left', copy=False)
In [25]:
# Calculate post-season winning percentage
playoffs['post_win_percent'] = (playoffs.post_W / playoffs.post_G) *100
playoffs.head()
Out[25]:
In [26]:
# Linear regression of winning percentage and amount spent on payroll
def payroll_reg(df, s):
# The two arguments to this function dataframe and season type {'rs'for Regular Season and 'ps'for Post-Season}
# Dicts for case specific values
# column containing win percentage
y = {'rs': 'win_percent', 'ps': 'post_win_percent'}
# title of the plot
t = {'rs': 'Regular Season %Wins Vs Payroll (1985-2015)', 'ps': 'Post-Season %Wins Vs Payroll (1985-2015)'}
# color for the plot
c = {'rs': sns.color_palette()[5], 'ps': sns.color_palette()[3]}
g = sns.lmplot(data=df, x='salary', y=y[s], size=5, aspect=1.5,
scatter_kws={'color': c[s]}, line_kws={'color': c[s]})
g.fig.suptitle(t[s], fontsize=15, weight='bold')
g.set_xlabels("Payroll (in millions)", weight='bold')
g.set_ylabels("%Wins", weight='bold')
sns.despine(right=False, top=False)
# Linear regression using scipy.stats
mask = ~pd.isnull(df['salary']) & ~pd.isnull(df[y[s]])
slope, intercept, r_value, p_value, std_err = stats.linregress(x=df['salary'][mask]*1e-6,
y=df[y[s]][mask])
g.ax.xaxis.set_major_formatter(formatter)
#Add regression summary to plots
g.ax.text(0.01, 0.935,
r"$R^2={r}$".format(r='%.2g'%r_value**2) + "\n"
r"$r={r}, \thinspace p={p}$".format(r='%.2g'%r_value, p='%.2g'%p_value) + "\n" +
r"$y={m}x+{c}$".format(m=round(slope,2), c=round(intercept,2)),
ha='left', va='top',
transform = g.ax.transAxes, fontsize=12)
g.ax.grid()
g.fig.tight_layout(rect=[0, 0, 1, 0.97])
In [27]:
payroll_reg(teams, 'rs')
Based on the data at hand, a regression analysis relating percentage wins to team payroll shows a statistically significant positive relationship. The coefficient for payroll suggests that every \$30000 spent on payroll increases regular season wins by 1%. The result of the regression supports our hypothesis that the free agency market is partly responsible for the competitive imbalance in the regular season.
In [28]:
payroll_reg(playoffs, 'ps')
In our analysis of post-season wins, the result of the regression analysis is not as expected. The results suggest statistically insignificant relationship between percentage wins and payroll. It suggests that a short post-season is a random event in which team payroll has a negligible effect on wins.
The results of the regression analysis suggest that team payroll has a significant impact on team's success in the regular season due to free agency market. The ability to purchase talent in the open market and a 162 game season over six months diminishes the random elements leading to expected effects of team payroll on regular season winning percentage. While conventional wisdom leads us to think that teams have the ability to "buy" championships in baseball because there is no salary-cap, our analysis does not support this claim. We did not find a statistically significant relationship between post-season wins and team payroll.
We saw in the previous section that amount spent on payroll has a correlation of 0.21 with regular season wins which is statistically significant. But payroll is not a strong predictor of regular season wins. Only 4% of the total variance in team wins is explained by payroll ($R^2=0.044$). Let us explore further to find better predictors of regular season wins.
The Teams.csv file has a number of team metrics along with team wins. Let us first explore the correlation between various metrics with the help of a clustermap.
In [29]:
# Create a correlation clustermap
def draw_clustermap(df, fmt=None):
# Remove ordinal columns
non_ordinal_cols = [col for col in df.columns if col not in ['yearID', 'Rank']]
# Compute the correlation matrix
corr = df[non_ordinal_cols].corr()
# Plot a clustermap using correlation as a distance metric
g = sns.clustermap(corr, cmap='Spectral', linewidths=.5, metric='correlation', annot=(fmt!=None), fmt=fmt)
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
In [30]:
# Clustermap of team metrics
draw_clustermap(teams)
As we have a large number of metrics in our dataset, let us split them into categories for ease of analysis. The categories into which we divide the metrics are the following:
In [31]:
Offense = ['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'HBP', 'SF', 'CS']
Defense = ['RA', 'ER', 'ERA', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'SV', 'SHO', 'CG', 'FP', 'DP']
Misc = ['G', 'Ghome', 'salary', 'win_percent']
In [32]:
# Clustermap of Offense metrics
draw_clustermap(teams[Offense + Misc], '.2f')
The clustermap of offense metrics suggests that team wins (win_percent) have the highest positive correlation with Runs scored (R). And, Runs scored are highly correlated with Hits (H), At-Bats(AB), Doubles(2B), Base on Balls (BB), Home Runs (HR) and Sacrifice Flies (SF). Hits and At-Bats create opportunities to score runs. Doubles, Base on Balls, Home Runs and Sacrifice Flies are positive outcomes of the opportunities, that aid scoring a run.
In [33]:
# Clustermap of Defense metrics
draw_clustermap(teams[Defense + Misc], '.2f')
The clustermap for defense metrics suggests that Earned Run Average (ERA) has the highest negative correlation with team wins. ERA represents the number of earned runs a pitcher allows per nine innings. Following ERA, Runs Allowed (RA) has the highest negative correlation with wins. Also, there is a positive relationship between wins and Shutouts (SHO).
Based on our observations from the correlation clustermaps for offense and defense metrics, we find that teams wins are positively related to Runs Scored and negatively related to Runs Allowed. This is as expected because to win, a team has to score more runs while playing offense than runs allowed while playing defense. Let us create a single metric out of runs scored and runs allowed to compare with team wins.
In [34]:
# Difference in Runs scored and Runs Allowed
teams['R-RA'] = teams.R - teams.RA
In [35]:
# Plot team winning percent in regular season vs Runs scored, Runs allowed and their difference
g = sns.pairplot(data=teams, x_vars=['R','RA','R-RA'], y_vars=['win_percent'])
g.fig.set_size_inches((10,4))
sns.despine(top=False, right=False)
g.axes[0,0].set_ylabel("%Wins", weight='bold', fontsize=10)
title = {0: "%Wins Vs Runs Scored", 1: "%Wins Vs Runs Allowed", 2: "%Wins Vs (R-RA)"}
x_label = {0: "Runs Scored", 1: "Runs Allowed", 2: "Runs Scored - Runs Allowed"}
linregress = {0: stats.linregress(x=teams['R'], y=teams['win_percent']),
1: stats.linregress(x=teams['RA'], y=teams['win_percent']),
2: stats.linregress(x=teams['R-RA'], y=teams['win_percent'])}
for i in range(3):
g.axes[0,i].grid()
g.axes[0,i].set_title(title[i], weight='bold', fontsize=12)
g.axes[0,i].set_xlabel(x_label[i], weight='bold', fontsize=10)
slope, intercept, r_value, p_value, std_err = linregress[i]
g.axes[0,i].plot()
g.axes[0,i].text(0.01, 0.97,
r"$r={r}, \thinspace p={p}$".format(r='%.2g'%r_value, p='%.2g'%p_value),
ha='left', va='top',
transform = g.axes[0,i].transAxes, fontsize=11)
The above plots help in visualizing our findings from clustermaps. Team wins in the regular season are highly correlated $(r=0.92)$ with the difference between Runs scored and Runs allowed.
The aim in the regular season is to make playoffs. Based on our observation on the correlation between wins and $(R-RA)$, let us find the number of wins needed to make playoffs and also the difference in runs scored and runs allowed $(R-RA)$ to win the required number of games. The format of MLB post-season has undergone multiple changes over time in order to reduce the possibility of best teams in a league to be left out of the postseason. To account for these changes, let us split the data into periods based on post-season format and analyze if there is an significant difference across periods in team wins needed to make playoffs.
In [36]:
# Merge playoff data with team details
teams = teams.merge(playoffs, on=['teamID','yearID'], how='left')
In [37]:
# Create a column with 1 for making playoff and 0 otherwise
teams['playoffs'] = teams.post_G.map(lambda x: 0 if pd.isnull(x) else 1)
teams.head()
Out[37]:
In [38]:
# Add column to filter data based on post-season format
def mlb_postseason_format(x):
# One Round - World Series (WS)
if x < 1969:
return 'Before 1969'
# Two Round - WS + League Championship Series (LCS)
elif x < 1994:
return '1969-1993'
# Three Rounds - WS + LCS + Division Series (DS) with one wildcard team
elif x < 2012:
return '1994-2011'
# Expanded Wildcard - WS + LCS + DS with two wildcard teams
else:
return '2012-Present'
In [39]:
# Add column to filter on periods
teams['ps_format_period'] = teams.yearID.apply(mlb_postseason_format)
In [40]:
# One-way ANOVA for testing statistically significant differences in team wins to make playoffs in each period
periods = teams.ps_format_period.unique()
d = teams[teams.playoffs == 1]
# Seperate out data for each period
W_groups = {k: d[d.ps_format_period == k]['W'] for k in periods} # Team wins
R_groups = {k: d[d.ps_format_period == k]['R-RA'] for k in periods} # R-RA
print "Results of One way Anova for Wins"
print stats.f_oneway(W_groups[periods[0]], W_groups[periods[1]], W_groups[periods[2]], W_groups[periods[3]])
print "\nResults of One way Anova for R-RA"
print stats.f_oneway(R_groups[periods[0]], R_groups[periods[1]], R_groups[periods[2]], R_groups[periods[3]])
The results suggests that there is atleast one pair of groups which are statistically different. Let us perform individual t-tests to find such pairs.
In [41]:
# Two sample t-test for finding siginifant differences
print 'Pair-wise t-test results'
for i in range(3):
for j in range(i+1,4):
print periods[i],'-',periods[j],'\n', stats.ttest_ind(W_groups[periods[i]], W_groups[periods[j]]),'\n'
The results suggest that the period before 1969 is significantly different in team wins required to make playoffs from all its successive periods. Also, there is no siginificant difference among post 1969 post-season formats. We use this result and reduce the number of periods from four to two.
In [42]:
# Add column to filter data based on post-season format
def mlb_postseason_format(x):
if x < 1969:
return 'Before 1969'
else:
return '1969-Present'
In [43]:
# Update column to reflect new groups
teams['ps_format_period'] = teams.yearID.apply(mlb_postseason_format)
In [44]:
# Visualizing our results
plt.figure(figsize=(9,4.5))
periods = teams.ps_format_period.unique()
plt.suptitle("Distribution of metrics for teams which made playoffs", weight='bold', fontsize=12)
ax1 = plt.subplot(121)
for p in periods:
mask = (teams.ps_format_period == p) & (teams.playoffs == 1)
d = teams[mask]
g = sns.kdeplot(d['W'], label=p, ax=ax1, shade=True)
g.axes.grid(True)
d = teams[teams.playoffs == 1]
result = stats.ttest_ind(d[d.ps_format_period == periods[0]]['W'],
d[d.ps_format_period == periods[1]]['W'])
g.axes.text(0.01, 0.8,
r"$t_{{stat}}={t}, \thinspace p={p}$".format(t='%.2g'%result[0], p='%.2g'%result[1]),
ha='left', va='top',
transform = g.axes.transAxes, fontsize=11)
g.axes.set_title("Team Wins", weight='bold', fontsize=11)
ax2 = plt.subplot(122)
for p in periods:
mask = (teams.ps_format_period == p) & (teams.playoffs == 1)
d = teams[mask]
g = sns.kdeplot(d['R-RA'], label=p, ax=ax2, shade=True)
g.axes.grid(True)
d = teams[teams.playoffs == 1]
result = stats.ttest_ind(d[d.ps_format_period == periods[0]]['R-RA'],
d[d.ps_format_period == periods[1]]['R-RA'])
g.axes.text(0.01, 0.95,
r"$t_{{stat}}={t}, \thinspace p={p}$".format(t='%.2g'%result[0], p='%.2g'%result[1]),
ha='left', va='top',
transform = g.axes.transAxes, fontsize=11)
g.axes.set_title("Runs Scored - Runs Allowed", weight='bold', fontsize=11)
plt.tight_layout(rect=[0, 0, 1, 0.97])
In [45]:
# Linear Regression to find difference in runs needed to win the required numer of games to make playoffs
g = sns.FacetGrid(data=teams, col='ps_format_period', col_wrap=2, hue='playoffs')
g.map(plt.plot, 'R-RA', 'W', marker='o', linestyle='None', mec='w', mew=0.6)
g.fig.set_size_inches((9,5))
g.fig.suptitle("Wins Vs R-RA", weight='bold', fontsize=12)
g.set_ylabels('Wins', weight='bold')
g.set_xlabels('Runs Scored - Runs Allowed', weight='bold')
sns.despine(top=False, right=False)
periods = teams.ps_format_period.unique()
for i, ax in enumerate(g.axes.flatten()):
mask = (teams.ps_format_period == periods[i]) & (teams.playoffs == 1)
d = teams[mask]
ax.set_title(periods[i], weight='bold', fontsize=11)
ax.grid()
# Linear regression using scipy.stats
mask = ~pd.isnull(d['W']) & ~pd.isnull(d['R-RA'])
slope, intercept, r_value, p_value, std_err = stats.linregress(x=d['R-RA'][mask],
y=d['W'][mask])
#Add regression summary to plots
ax.text(0.01, 0.97,
r"$R^2={r}$".format(r='%.2g'%r_value**2) + "\n"
r"$r={r}, \thinspace p={p}$".format(r='%.2g'%r_value, p='%.2g'%p_value) + "\n" +
r"$y={m}x+{c}$".format(m=round(slope,2), c=round(intercept,2)),
ha='left', va='top',
transform = ax.transAxes, fontsize=11)
y = round(d['W'].mean(),2)
x = round((y-intercept)*(1/slope),2)
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xmax = ((x-xlim[0])/(xlim[1]-xlim[0]))
ymax = ((y-ylim[0])/(ylim[1]-ylim[0]))
ax.axhline(y=y, linestyle='--', xmax=xmax, c='0.75')
ax.axvline(x=x, linestyle='--', ymax=ymax, c='0.75')
ax.annotate(s = "$Wins={y}$".format(y=y),
xy=(x, y), xytext=(-100, -10), textcoords='offset points')
ax.annotate(s = "$R-RA={x}$".format(x=x),
xy=(x, y), xytext=(3, -120), textcoords='offset points', rotation=-90)
g.fig.tight_layout(rect=[0, 0, 1, 0.97])
For the post-season format prior to 1969 average wins of the teams which made playoffs are 97 and the corresponding difference in runs scored and runs allowed is 85 (using regression equation). Post-1969 values for the same are 94 wins and 115 runs. Thus, in order to make playoffs, a team has to score 115 more runs than they allowed during the regular season to expect to win 94 games and get into post-season.
We started our analysis by looking at player salaries and tried to correlate the observed up and downs with major events in baseball history. Then we moved to the analysis of team payrolls. We made an observation that there is a difference in spending on payroll across teams. We tried to reason out by hypothesizing that best talent goes to the teams which are capable of spending large amounts on payrolls, which leads to competitive imbalance. We tested are hypothesis by correlating team payrolls with performance in regular season and post-season. We found a statistically significant result in the case of regular season wins, but not in the case of post-season wins. As a result, we failed to reject our hypothesis in the case of the regular season and attributed the insignificant result in the case of post-season performance to its format which makes it a random event where payroll has no effect on wins.
Following our significant result in the case of regular season wins, we started exploring the team metrics to find the influential ones which aid team wins. We observed that team wins have the highest correlation with runs scored while playing offense and runs allowed while playing defense. Based on this observation, we calculated the correlation between team wins and the difference in runs scored and runs allowed. We found a statistically significant correlation coefficient. Using this we tried to build a model to predict the number of runs a team has to score over and above the runs that they allowed in order to win required number of games to make playoffs.
The analysis can be further extended to calculate statistics which correlate well with team runs and thus act as better predictors of teams and/or players performance which in turn help in effective allocation and utilization of team payrolls.