Baseball Statistics

Baseball started in the United States in the 1700s and 1800s. In the United States, it is called the national pastime, because so many people in the United States used to spend a lot of time playing or watching baseball games. Today, though, most Americans follow football more than baseball, especially when it is time for the Super Bowl. Baseball is a game of statistics. Since the flow of a baseball game has natural breaks to it, and normally players act individually rather than performing in clusters, the sport lends itself to easy record-keeping and statistics. Statistics have been kept for professional baseball since the creation of the National League and American League, now part of Major League Baseball. The importance of statistics in baseball increased with the advent of free agency in 1974. Statistics became vital information in determining individual salary, with the free agent market clearly setting what each player should be paid for his given statistics.

Baseball statistics are used in evaluating a player's and/or a team's performance. The following analysis is based on baseball data made available by Sean Lahman, an award-winning database journalist, and author. The data can be downloaded from his website. The version of the data used in the following analysis is 2015-comma-delimited version. Let us start by exploring the player's salaries.

Exploring MLB Player's Salaries



In [1]:

    
# Import dependencies
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib as mpl
import seaborn as sns

# Set display properties
%pylab inline
sns.set_style('white')
pd.options.display.max_rows = 8
pd.options.display.max_columns = 15









    



Populating the interactive namespace from numpy and matplotlib



In [2]:

    
# Read data from csv file into pandas dataframe
def read_csv(filename):
    # The argument to this function is the filename of csv file
    path = './baseballdatabank-master/core/'
    data = pd.read_csv(path + filename, sep=',', keep_default_na=False, na_values=[''])
    print filename, 'has', data.shape[0], 'rows and', data.shape[1], 'columns'
    return data



In [3]:

    
# Format summary statistics of salary
def format_salary(x):
    return r'$%1.2fM' % (x*1e-6)



In [4]:

    
# Read in the salary details
salaries = read_csv('Salaries.csv')









    



Salaries.csv has 25575 rows and 5 columns



In [5]:

    
salaries.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25575 entries, 0 to 25574
Data columns (total 5 columns):
yearID      25575 non-null int64
teamID      25575 non-null object
lgID        25575 non-null object
playerID    25575 non-null object
salary      25575 non-null int64
dtypes: int64(2), object(3)
memory usage: 999.1+ KB



In [6]:

    
print salaries.head(), '\n'
print 'Data from', salaries.yearID.min(), 'to', salaries.yearID.max()









    



   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000 

Data from 1985 to 2015

Distribution of player's salaries over time



In [7]:

    
# Salary Stats
def yearly_stats(groupby):
    # The argument to this function is the column name by which data needs to be grouped: ['playerID', 'teamID']
    cols = ['yearID']
    cols.append(groupby)
    s = salaries.groupby(cols, as_index=False).sum()
    s = s.groupby('yearID').agg({"salary": {"max_salary" : np.max,
                                            "avg_salary" : np.mean,
                                            "median_salary" : np.median,
                                            "min_salary" : np.min}})
    s.columns = s.columns.get_level_values(1)
    return s



In [8]:

    
yearly_salary_stats = yearly_stats('playerID')
yearly_payroll_stats = yearly_stats('teamID')



In [9]:

    
# Format salary labels to make them concise in plots
def millions(x, pos):
    # The two args are the value and tick position
    return format_salary(x)

formatter = FuncFormatter(millions)



In [10]:

    
# Plot average annual salary vs year
f, ax = plt.subplots()
f.set_size_inches((8,5))

plt.plot(yearly_salary_stats.index, yearly_salary_stats.avg_salary, label='Avg. Player Salary')
ax.yaxis.set_major_formatter(formatter)

plt.title("Distribution of Players' Salary over Time", fontsize=15, weight='bold')
plt.xlabel("Year", weight='bold')
plt.ylabel("Average Salary (in millions)", weight='bold')

plt.grid()

Average yearly salary of a player has an increasing trend with time. There are some up's and down's which coincide with some of the major events in baseball history. For example, the abrupt increase from 1990 to 1991 coincides with 1990 Major League Baseball lockout which was resolved by raising minimum major league salary from \$68,000 to \$100,000. The period 1993-95 has a non-increasing trend. During this period, a dispute played out between the players and the ownership regarding the proposal of salary-cap which led to the 1994–95 Major League Baseball strike. As a result of this strike, the 1994 World series was canceled. Post-strike years has an ever increasing trend. There is a small dip in the year 2004 which coincides with the announcement of a new drug policy in Jan 2004 which led to the suspension of many MLB players for the use of performance-enhancing drugs.



In [11]:

    
yearly_salary_stats.avg_salary.describe().iloc[1:].apply(format_salary)









    Out[11]:





mean    $1.98M
std     $1.22M
min     $0.42M
25%     $0.98M
50%     $2.00M
75%     $3.04M
max     $4.30M
Name: avg_salary, dtype: object

The average salary of a player has increased by about 800% from \$0.5M in 1985 to \$4.3M in 2015. In our effort to explore the MLB salaries, let us group the player salaries into teams and take a look at the distribution of the amount spent by various teams on their payrolls.

Distribution of Team payrolls over time



In [12]:

    
# Plot average annual payroll vs year
f, ax = plt.subplots()
f.set_size_inches((8,5))

plt.plot(yearly_payroll_stats.index, yearly_payroll_stats.min_salary, label='Min. Payroll')
plt.plot(yearly_payroll_stats.index, yearly_payroll_stats.avg_salary, label='Avg. Payroll')
plt.plot(yearly_payroll_stats.index, yearly_payroll_stats.max_salary, label='Max. Payroll')

ax.fill_between(yearly_payroll_stats.index, yearly_payroll_stats.min_salary,
                yearly_payroll_stats.max_salary, alpha=0.2, color='0.85')
ax.yaxis.set_major_formatter(formatter)

plt.title("Distribution of Team Payrolls over Time", fontsize=15, weight='bold')
plt.xlabel("Year", weight='bold')
plt.ylabel("Payroll (in millions)", weight='bold')
plt.legend(loc='best')

plt.grid()

Team payrolls have an interesting trend over time. The maximum amount spent on payrolls increases more steeply than the minimum amount. During 1985-90, the increase is very low, and the range of salaries is almost constant. From 1991 to 2005 the maximum amount spent on payroll increased steeply while the increase in the minimum amount spent is very low. Moreover, this increasing trend in maximum payroll is absent after 2005. Let us compare individual teams on their payroll amounts.



In [13]:

    
# Yearly payroll of teams
yearly_payroll = salaries.groupby(['teamID', 'yearID'], as_index=False).sum()



In [14]:

    
# Spread of amount spent by individual teams on payroll
plt.figure(figsize=(8,10))

order = yearly_payroll.groupby('teamID').median().sort_values(by='salary', ascending=False).index
g = sns.boxplot(data=yearly_payroll, x='salary', y='teamID', order=order, palette='viridis')

plt.setp(g.artists, alpha=0.8)
g.xaxis.set_major_formatter(formatter)
g.set_title("Distribution of Payroll across Teams (1985-2015)", fontsize=15, weight='bold')
g.set_xlabel("Team Payroll (in millions)", weight='bold')
g.set_ylabel("TeamID", weight='bold')

g.grid()

The boxplot clearly shows that not all teams spend equally on their payrolls. For example, the minimum amount spent by Los Angeles Angels of Anaheim (LAA) is higher than the median amount of all individual teams. A possible explanation for the observed trend can be that some teams can spend huge amounts on the payroll to hire the best talent while others cannot support the best players financially. How can we test this hypothesis? If our hypothesis is true then the team's payroll should translate to teams performance in regular season and post-season. We test our hypothesis by performing regression analysis relating regular season and post-season success to team payroll.

Data Wrangling

In order to perform our analysis, we need to restructure our data to fit our requirements. To begin with, read in the team details from Teams.csv and SeriesPost.csv. The first file contains the details of regular season wins and the second file contains the post-season wins.



In [15]:

    
# Read in team details
teams = read_csv('Teams.csv') #Regular season details
teams_post = read_csv('SeriesPost.csv') #Post-Season details









    



Teams.csv has 2805 rows and 48 columns
SeriesPost.csv has 307 rows and 9 columns



In [16]:

    
teams.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2805 entries, 0 to 2804
Data columns (total 48 columns):
yearID            2805 non-null int64
lgID              2805 non-null object
teamID            2805 non-null object
franchID          2805 non-null object
divID             1288 non-null object
Rank              2805 non-null int64
G                 2805 non-null int64
Ghome             2406 non-null float64
W                 2805 non-null int64
L                 2805 non-null int64
DivWin            1260 non-null object
WCWin             624 non-null object
LgWin             2777 non-null object
WSWin             2448 non-null object
R                 2805 non-null int64
AB                2805 non-null int64
H                 2805 non-null int64
2B                2805 non-null int64
3B                2805 non-null int64
HR                2805 non-null int64
BB                2805 non-null int64
SO                2685 non-null float64
SB                2661 non-null float64
CS                1946 non-null float64
HBP               480 non-null float64
SF                480 non-null float64
RA                2805 non-null int64
ER                2805 non-null int64
ERA               2805 non-null float64
CG                2805 non-null int64
SHO               2805 non-null int64
SV                2805 non-null int64
IPouts            2805 non-null int64
HA                2805 non-null int64
HRA               2805 non-null int64
BBA               2805 non-null int64
SOA               2805 non-null int64
E                 2805 non-null int64
DP                2488 non-null float64
FP                2805 non-null float64
name              2805 non-null object
park              2771 non-null object
attendance        2526 non-null float64
BPF               2805 non-null int64
PPF               2805 non-null int64
teamIDBR          2805 non-null object
teamIDlahman45    2805 non-null object
teamIDretro       2805 non-null object
dtypes: float64(10), int64(25), object(13)
memory usage: 1.0+ MB



In [17]:

    
teams.head()









    Out[17]:






  
    
      
      yearID
      lgID
      teamID
      franchID
      divID
      Rank
      G
      ...
      park
      attendance
      BPF
      PPF
      teamIDBR
      teamIDlahman45
      teamIDretro
    
  
  
    
      0
      1871
      NA
      BS1
      BNA
      NaN
      3
      31
      ...
      South End Grounds I
      NaN
      103
      98
      BOS
      BS1
      BS1
    
    
      1
      1871
      NA
      CH1
      CNA
      NaN
      2
      28
      ...
      Union Base-Ball Grounds
      NaN
      104
      102
      CHI
      CH1
      CH1
    
    
      2
      1871
      NA
      CL1
      CFC
      NaN
      8
      29
      ...
      National Association Grounds
      NaN
      96
      100
      CLE
      CL1
      CL1
    
    
      3
      1871
      NA
      FW1
      KEK
      NaN
      7
      19
      ...
      Hamilton Field
      NaN
      101
      107
      KEK
      FW1
      FW1
    
    
      4
      1871
      NA
      NY2
      NNA
      NaN
      5
      33
      ...
      Union Grounds (Brooklyn)
      NaN
      90
      88
      NYU
      NY2
      NY2
    
  

5 rows × 48 columns



In [18]:

    
# Add salary data to team details
teams = teams.merge(yearly_payroll, on=['teamID', 'yearID'], how='left')



In [19]:

    
# Calculate Regular season winning percentage
teams['win_percent'] = (teams.W / teams.G) * 100



In [20]:

    
teams.head()









    Out[20]:






  
    
      
      yearID
      lgID
      teamID
      franchID
      divID
      Rank
      G
      ...
      BPF
      PPF
      teamIDBR
      teamIDlahman45
      teamIDretro
      salary
      win_percent
    
  
  
    
      0
      1871
      NA
      BS1
      BNA
      NaN
      3
      31
      ...
      103
      98
      BOS
      BS1
      BS1
      NaN
      64.516129
    
    
      1
      1871
      NA
      CH1
      CNA
      NaN
      2
      28
      ...
      104
      102
      CHI
      CH1
      CH1
      NaN
      67.857143
    
    
      2
      1871
      NA
      CL1
      CFC
      NaN
      8
      29
      ...
      96
      100
      CLE
      CL1
      CL1
      NaN
      34.482759
    
    
      3
      1871
      NA
      FW1
      KEK
      NaN
      7
      19
      ...
      101
      107
      KEK
      FW1
      FW1
      NaN
      36.842105
    
    
      4
      1871
      NA
      NY2
      NNA
      NaN
      5
      33
      ...
      90
      88
      NYU
      NY2
      NY2
      NaN
      48.484848
    
  

5 rows × 50 columns



In [21]:

    
teams_post.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 9 columns):
yearID          307 non-null int64
round           307 non-null object
teamIDwinner    307 non-null object
lgIDwinner      307 non-null object
teamIDloser     307 non-null object
lgIDloser       307 non-null object
wins            307 non-null int64
losses          307 non-null int64
ties            307 non-null int64
dtypes: int64(4), object(5)
memory usage: 21.7+ KB



In [22]:

    
teams_post.head()









    Out[22]:






  
    
      
      yearID
      round
      teamIDwinner
      lgIDwinner
      teamIDloser
      lgIDloser
      wins
      losses
      ties
    
  
  
    
      0
      1884
      WS
      PRO
      NL
      NYP
      AA
      3
      0
      0
    
    
      1
      1885
      WS
      CHC
      NL
      STL
      AA
      3
      3
      1
    
    
      2
      1886
      WS
      STL
      AA
      CHC
      NL
      4
      2
      0
    
    
      3
      1887
      WS
      DTN
      NL
      STL
      AA
      10
      5
      0
    
    
      4
      1888
      WS
      NYG
      NL
      STL
      AA
      6
      4
      0

The post-season details are not in a friendly format for our analysis. The following code cell reshapes the data and calculates post-season wins and winning percentages for individual teams in each year.



In [23]:

    
# Reshape post-season details to calculate winning percentage
games_won = teams_post.groupby(['teamIDwinner','yearID']).sum()
games_lost = teams_post.groupby(['teamIDloser','yearID']).sum()

playoffs = pd.concat([games_won, games_lost], axis=0).reset_index()
playoffs.columns = ['teamID', 'yearID', 'post_W', 'post_L', 'post_T']

playoffs['post_G'] = playoffs.post_W + playoffs.post_L + playoffs.post_T

playoffs = playoffs.groupby(['teamID', 'yearID']).sum().reset_index()



In [24]:

    
# Add salary data into the post-season details
playoffs = playoffs.merge(yearly_payroll, on=['teamID', 'yearID'], how='left', copy=False)



In [25]:

    
# Calculate post-season winning percentage
playoffs['post_win_percent'] = (playoffs.post_W / playoffs.post_G) *100
playoffs.head()









    Out[25]:






  
    
      
      teamID
      yearID
      post_W
      post_L
      post_T
      post_G
      salary
      post_win_percent
    
  
  
    
      0
      ANA
      2002
      11
      5
      0
      16
      61721667.0
      68.750000
    
    
      1
      ANA
      2004
      3
      0
      0
      3
      100534667.0
      100.000000
    
    
      2
      ARI
      1999
      3
      1
      0
      4
      68703999.0
      75.000000
    
    
      3
      ARI
      2001
      11
      6
      0
      17
      85082999.0
      64.705882
    
    
      4
      ARI
      2002
      3
      0
      0
      3
      102819999.0
      100.000000

Analysis and Conclusions



In [26]:

    
# Linear regression of winning percentage and amount spent on payroll
def payroll_reg(df, s):
    # The two arguments to this function dataframe and season type {'rs'for Regular Season and 'ps'for Post-Season}
    
    # Dicts for case specific values
    # column containing win percentage
    y = {'rs': 'win_percent', 'ps': 'post_win_percent'}
    # title of the plot
    t = {'rs': 'Regular Season %Wins Vs Payroll (1985-2015)', 'ps': 'Post-Season %Wins Vs Payroll (1985-2015)'}
    # color for the plot
    c = {'rs': sns.color_palette()[5], 'ps': sns.color_palette()[3]}
    
    g = sns.lmplot(data=df, x='salary', y=y[s], size=5, aspect=1.5,
                   scatter_kws={'color': c[s]}, line_kws={'color': c[s]})

    g.fig.suptitle(t[s], fontsize=15, weight='bold')
    g.set_xlabels("Payroll (in millions)", weight='bold')
    g.set_ylabels("%Wins", weight='bold')
    sns.despine(right=False, top=False)
    
    # Linear regression using scipy.stats
    mask = ~pd.isnull(df['salary']) & ~pd.isnull(df[y[s]])
    slope, intercept, r_value, p_value, std_err = stats.linregress(x=df['salary'][mask]*1e-6,
                                                                   y=df[y[s]][mask])
    g.ax.xaxis.set_major_formatter(formatter)
    
    #Add regression summary to plots
    g.ax.text(0.01, 0.935,
              r"$R^2={r}$".format(r='%.2g'%r_value**2) + "\n"
              r"$r={r}, \thinspace p={p}$".format(r='%.2g'%r_value, p='%.2g'%p_value) + "\n" + 
              r"$y={m}x+{c}$".format(m=round(slope,2), c=round(intercept,2)),
              ha='left', va='top',
              transform = g.ax.transAxes, fontsize=12)

    g.ax.grid()

    g.fig.tight_layout(rect=[0, 0, 1, 0.97])



In [27]:

    
payroll_reg(teams, 'rs')

Based on the data at hand, a regression analysis relating percentage wins to team payroll shows a statistically significant positive relationship. The coefficient for payroll suggests that every \$30000 spent on payroll increases regular season wins by 1%. The result of the regression supports our hypothesis that the free agency market is partly responsible for the competitive imbalance in the regular season.



In [28]:

    
payroll_reg(playoffs, 'ps')

In our analysis of post-season wins, the result of the regression analysis is not as expected. The results suggest statistically insignificant relationship between percentage wins and payroll. It suggests that a short post-season is a random event in which team payroll has a negligible effect on wins.

The results of the regression analysis suggest that team payroll has a significant impact on team's success in the regular season due to free agency market. The ability to purchase talent in the open market and a 162 game season over six months diminishes the random elements leading to expected effects of team payroll on regular season winning percentage. While conventional wisdom leads us to think that teams have the ability to "buy" championships in baseball because there is no salary-cap, our analysis does not support this claim. We did not find a statistically significant relationship between post-season wins and team payroll.

Exploring regular season Wins

We saw in the previous section that amount spent on payroll has a correlation of 0.21 with regular season wins which is statistically significant. But payroll is not a strong predictor of regular season wins. Only 4% of the total variance in team wins is explained by payroll ($R^2=0.044$). Let us explore further to find better predictors of regular season wins.

The Teams.csv file has a number of team metrics along with team wins. Let us first explore the correlation between various metrics with the help of a clustermap.



In [29]:

    
# Create a correlation clustermap
def draw_clustermap(df, fmt=None):
    # Remove ordinal columns
    non_ordinal_cols = [col for col in df.columns if col not in ['yearID', 'Rank']]

    # Compute the correlation matrix
    corr = df[non_ordinal_cols].corr()
    
    # Plot a clustermap using correlation as a distance metric
    g = sns.clustermap(corr, cmap='Spectral', linewidths=.5, metric='correlation', annot=(fmt!=None), fmt=fmt)
    plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
    plt.show()



In [30]:

    
# Clustermap of team metrics
draw_clustermap(teams)

As we have a large number of metrics in our dataset, let us split them into categories for ease of analysis. The categories into which we divide the metrics are the following:

Offense
Defence
Miscellaneous



In [31]:

    
Offense = ['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'HBP', 'SF', 'CS']
Defense = ['RA', 'ER', 'ERA', 'IPouts', 'HA', 'HRA', 'BBA', 'SOA', 'E', 'SV', 'SHO', 'CG', 'FP', 'DP']
Misc = ['G', 'Ghome', 'salary', 'win_percent']



In [32]:

    
# Clustermap of Offense metrics
draw_clustermap(teams[Offense + Misc], '.2f')

The clustermap of offense metrics suggests that team wins (win_percent) have the highest positive correlation with Runs scored (R). And, Runs scored are highly correlated with Hits (H), At-Bats(AB), Doubles(2B), Base on Balls (BB), Home Runs (HR) and Sacrifice Flies (SF). Hits and At-Bats create opportunities to score runs. Doubles, Base on Balls, Home Runs and Sacrifice Flies are positive outcomes of the opportunities, that aid scoring a run.



In [33]:

    
# Clustermap of Defense metrics
draw_clustermap(teams[Defense + Misc], '.2f')

The clustermap for defense metrics suggests that Earned Run Average (ERA) has the highest negative correlation with team wins. ERA represents the number of earned runs a pitcher allows per nine innings. Following ERA, Runs Allowed (RA) has the highest negative correlation with wins. Also, there is a positive relationship between wins and Shutouts (SHO).

Based on our observations from the correlation clustermaps for offense and defense metrics, we find that teams wins are positively related to Runs Scored and negatively related to Runs Allowed. This is as expected because to win, a team has to score more runs while playing offense than runs allowed while playing defense. Let us create a single metric out of runs scored and runs allowed to compare with team wins.



In [34]:

    
# Difference in Runs scored and Runs Allowed
teams['R-RA'] = teams.R - teams.RA

Analysis and Conclusions

Let us test the statistical significance of correlation coefficents for %Wins and Runs Scored(R), Runs Allowed(RA) and our new metric R-RA.



In [35]:

    
# Plot team winning percent in regular season vs Runs scored, Runs allowed and their difference
g = sns.pairplot(data=teams, x_vars=['R','RA','R-RA'], y_vars=['win_percent'])

g.fig.set_size_inches((10,4))
sns.despine(top=False, right=False)
g.axes[0,0].set_ylabel("%Wins", weight='bold', fontsize=10)

title = {0: "%Wins Vs Runs Scored", 1: "%Wins Vs Runs Allowed", 2: "%Wins Vs (R-RA)"}
x_label = {0: "Runs Scored", 1: "Runs Allowed", 2: "Runs Scored - Runs Allowed"}
linregress = {0: stats.linregress(x=teams['R'], y=teams['win_percent']),
              1: stats.linregress(x=teams['RA'], y=teams['win_percent']),
              2: stats.linregress(x=teams['R-RA'], y=teams['win_percent'])}

for i in range(3):
    g.axes[0,i].grid()
    g.axes[0,i].set_title(title[i], weight='bold', fontsize=12)
    g.axes[0,i].set_xlabel(x_label[i], weight='bold', fontsize=10)
    
    slope, intercept, r_value, p_value, std_err = linregress[i]
    
    g.axes[0,i].plot()
    g.axes[0,i].text(0.01, 0.97,
                  r"$r={r}, \thinspace p={p}$".format(r='%.2g'%r_value, p='%.2g'%p_value),
                  ha='left', va='top',
                  transform = g.axes[0,i].transAxes, fontsize=11)

The above plots help in visualizing our findings from clustermaps. Team wins in the regular season are highly correlated $(r=0.92)$ with the difference between Runs scored and Runs allowed.

The aim in the regular season is to make playoffs. Based on our observation on the correlation between wins and $(R-RA)$, let us find the number of wins needed to make playoffs and also the difference in runs scored and runs allowed $(R-RA)$ to win the required number of games. The format of MLB post-season has undergone multiple changes over time in order to reduce the possibility of best teams in a league to be left out of the postseason. To account for these changes, let us split the data into periods based on post-season format and analyze if there is an significant difference across periods in team wins needed to make playoffs.



In [36]:

    
# Merge playoff data with team details
teams = teams.merge(playoffs, on=['teamID','yearID'], how='left')



In [37]:

    
# Create a column with 1 for making playoff and 0 otherwise
teams['playoffs'] = teams.post_G.map(lambda x: 0 if pd.isnull(x) else 1)
teams.head()









    Out[37]:






  
    
      
      yearID
      lgID
      teamID
      franchID
      divID
      Rank
      G
      ...
      post_W
      post_L
      post_T
      post_G
      salary_y
      post_win_percent
      playoffs
    
  
  
    
      0
      1871
      NA
      BS1
      BNA
      NaN
      3
      31
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0
    
    
      1
      1871
      NA
      CH1
      CNA
      NaN
      2
      28
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0
    
    
      2
      1871
      NA
      CL1
      CFC
      NaN
      8
      29
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0
    
    
      3
      1871
      NA
      FW1
      KEK
      NaN
      7
      19
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0
    
    
      4
      1871
      NA
      NY2
      NNA
      NaN
      5
      33
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      0
    
  

5 rows × 58 columns



In [38]:

    
# Add column to filter data based on post-season format
def mlb_postseason_format(x):
    # One Round - World Series (WS)
    if x < 1969:
        return 'Before 1969'
    # Two Round - WS + League Championship Series (LCS)
    elif x < 1994:
        return '1969-1993'
    # Three Rounds - WS + LCS + Division Series (DS) with one wildcard team
    elif x < 2012:
        return '1994-2011'
    # Expanded Wildcard - WS + LCS + DS with two wildcard teams
    else:
        return '2012-Present'



In [39]:

    
# Add column to filter on periods
teams['ps_format_period'] = teams.yearID.apply(mlb_postseason_format)



In [40]:

    
# One-way ANOVA for testing statistically significant differences in team wins to make playoffs in each period
periods = teams.ps_format_period.unique()

d = teams[teams.playoffs == 1]

# Seperate out data for each period
W_groups = {k: d[d.ps_format_period == k]['W'] for k in periods}     # Team wins
R_groups = {k: d[d.ps_format_period == k]['R-RA'] for k in periods}  # R-RA

print "Results of One way Anova for Wins"
print stats.f_oneway(W_groups[periods[0]], W_groups[periods[1]], W_groups[periods[2]], W_groups[periods[3]])

print "\nResults of One way Anova for R-RA"
print stats.f_oneway(R_groups[periods[0]], R_groups[periods[1]], R_groups[periods[2]], R_groups[periods[3]])









    



Results of One way Anova for Wins
F_onewayResult(statistic=7.5817082933683171, pvalue=6.031647492167738e-05)

Results of One way Anova for R-RA
F_onewayResult(statistic=41.228117538079914, pvalue=2.6452565592601677e-23)

The results suggests that there is atleast one pair of groups which are statistically different. Let us perform individual t-tests to find such pairs.



In [41]:

    
# Two sample t-test for finding siginifant differences
print 'Pair-wise t-test results'
for i in range(3):
    for j in range(i+1,4):
        print periods[i],'-',periods[j],'\n', stats.ttest_ind(W_groups[periods[i]], W_groups[periods[j]]),'\n'









    



Pair-wise t-test results
Before 1969 - 1969-1993 
Ttest_indResult(statistic=3.6659416880625022, pvalue=0.0003044397845877576) 

Before 1969 - 1994-2011 
Ttest_indResult(statistic=4.2531589543168797, pvalue=2.9148332550004145e-05) 

Before 1969 - 2012-Present 
Ttest_indResult(statistic=3.976856778967035, pvalue=0.00010270108283426609) 

1969-1993 - 1994-2011 
Ttest_indResult(statistic=-0.86652406351529654, pvalue=0.38707556193643644) 

1969-1993 - 2012-Present 
Ttest_indResult(statistic=-0.019443678234783865, pvalue=0.98451446073571836) 

1994-2011 - 2012-Present 
Ttest_indResult(statistic=0.9233141850613481, pvalue=0.35712208541879475)

The results suggest that the period before 1969 is significantly different in team wins required to make playoffs from all its successive periods. Also, there is no siginificant difference among post 1969 post-season formats. We use this result and reduce the number of periods from four to two.

Before 1969
1969-Present



In [42]:

    
# Add column to filter data based on post-season format
def mlb_postseason_format(x):
    if x < 1969:
        return 'Before 1969'
    else:
        return '1969-Present'



In [43]:

    
# Update column to reflect new groups
teams['ps_format_period'] = teams.yearID.apply(mlb_postseason_format)



In [44]:

    
# Visualizing our results
plt.figure(figsize=(9,4.5))

periods = teams.ps_format_period.unique()

plt.suptitle("Distribution of metrics for teams which made playoffs", weight='bold', fontsize=12)

ax1 = plt.subplot(121)
for p in periods:
    mask = (teams.ps_format_period == p) & (teams.playoffs == 1)
    d = teams[mask]
    g = sns.kdeplot(d['W'], label=p, ax=ax1, shade=True)
    g.axes.grid(True)

d = teams[teams.playoffs == 1]
result = stats.ttest_ind(d[d.ps_format_period == periods[0]]['W'],
                         d[d.ps_format_period == periods[1]]['W'])    
g.axes.text(0.01, 0.8,
            r"$t_{{stat}}={t}, \thinspace p={p}$".format(t='%.2g'%result[0], p='%.2g'%result[1]),
            ha='left', va='top',
            transform = g.axes.transAxes, fontsize=11)
g.axes.set_title("Team Wins", weight='bold', fontsize=11)

ax2 = plt.subplot(122)
for p in periods:
    mask = (teams.ps_format_period == p) & (teams.playoffs == 1)
    d = teams[mask]
    g = sns.kdeplot(d['R-RA'], label=p, ax=ax2, shade=True)
    g.axes.grid(True)

d = teams[teams.playoffs == 1]
result = stats.ttest_ind(d[d.ps_format_period == periods[0]]['R-RA'],
                         d[d.ps_format_period == periods[1]]['R-RA'])    
g.axes.text(0.01, 0.95,
            r"$t_{{stat}}={t}, \thinspace p={p}$".format(t='%.2g'%result[0], p='%.2g'%result[1]),
            ha='left', va='top',
            transform = g.axes.transAxes, fontsize=11)
g.axes.set_title("Runs Scored - Runs Allowed", weight='bold', fontsize=11)

plt.tight_layout(rect=[0, 0, 1, 0.97])



In [45]:

    
# Linear Regression to find difference in runs needed to win the required numer of games to make playoffs
g = sns.FacetGrid(data=teams, col='ps_format_period', col_wrap=2, hue='playoffs')
g.map(plt.plot, 'R-RA', 'W', marker='o', linestyle='None', mec='w', mew=0.6)

g.fig.set_size_inches((9,5))
g.fig.suptitle("Wins Vs R-RA", weight='bold', fontsize=12)
g.set_ylabels('Wins', weight='bold')
g.set_xlabels('Runs Scored - Runs Allowed', weight='bold')
sns.despine(top=False, right=False)

periods = teams.ps_format_period.unique()

for i, ax in enumerate(g.axes.flatten()):
    mask = (teams.ps_format_period == periods[i]) & (teams.playoffs == 1)
    d = teams[mask]
    ax.set_title(periods[i], weight='bold', fontsize=11)
    ax.grid()
    
    # Linear regression using scipy.stats
    mask = ~pd.isnull(d['W']) & ~pd.isnull(d['R-RA'])
    slope, intercept, r_value, p_value, std_err = stats.linregress(x=d['R-RA'][mask],
                                                                   y=d['W'][mask])
    
    #Add regression summary to plots
    ax.text(0.01, 0.97,
            r"$R^2={r}$".format(r='%.2g'%r_value**2) + "\n"
            r"$r={r}, \thinspace p={p}$".format(r='%.2g'%r_value, p='%.2g'%p_value) + "\n" + 
            r"$y={m}x+{c}$".format(m=round(slope,2), c=round(intercept,2)),
            ha='left', va='top',
            transform = ax.transAxes, fontsize=11)
    
    y = round(d['W'].mean(),2)
    x = round((y-intercept)*(1/slope),2)
    
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    xmax = ((x-xlim[0])/(xlim[1]-xlim[0]))
    ymax = ((y-ylim[0])/(ylim[1]-ylim[0]))
    
    ax.axhline(y=y, linestyle='--', xmax=xmax, c='0.75')
    ax.axvline(x=x, linestyle='--', ymax=ymax, c='0.75')
    ax.annotate(s = "$Wins={y}$".format(y=y),
                xy=(x, y), xytext=(-100, -10), textcoords='offset points')
    ax.annotate(s = "$R-RA={x}$".format(x=x),
                xy=(x, y), xytext=(3, -120), textcoords='offset points', rotation=-90)

g.fig.tight_layout(rect=[0, 0, 1, 0.97])

For the post-season format prior to 1969 average wins of the teams which made playoffs are 97 and the corresponding difference in runs scored and runs allowed is 85 (using regression equation). Post-1969 values for the same are 94 wins and 115 runs. Thus, in order to make playoffs, a team has to score 115 more runs than they allowed during the regular season to expect to win 94 games and get into post-season.

Summary

We started our analysis by looking at player salaries and tried to correlate the observed up and downs with major events in baseball history. Then we moved to the analysis of team payrolls. We made an observation that there is a difference in spending on payroll across teams. We tried to reason out by hypothesizing that best talent goes to the teams which are capable of spending large amounts on payrolls, which leads to competitive imbalance. We tested are hypothesis by correlating team payrolls with performance in regular season and post-season. We found a statistically significant result in the case of regular season wins, but not in the case of post-season wins. As a result, we failed to reject our hypothesis in the case of the regular season and attributed the insignificant result in the case of post-season performance to its format which makes it a random event where payroll has no effect on wins.

Following our significant result in the case of regular season wins, we started exploring the team metrics to find the influential ones which aid team wins. We observed that team wins have the highest correlation with runs scored while playing offense and runs allowed while playing defense. Based on this observation, we calculated the correlation between team wins and the difference in runs scored and runs allowed. We found a statistically significant correlation coefficient. Using this we tried to build a model to predict the number of runs a team has to score over and above the runs that they allowed in order to win required number of games to make playoffs.

The analysis can be further extended to calculate statistics which correlate well with team runs and thus act as better predictors of teams and/or players performance which in turn help in effective allocation and utilization of team payrolls.

	yearID	lgID	teamID	franchID	divID	Rank	G	...	park	attendance	BPF	PPF	teamIDBR	teamIDlahman45	teamIDretro
0	1871	NA	BS1	BNA	NaN	3	31	...	South End Grounds I	NaN	103	98	BOS	BS1	BS1
1	1871	NA	CH1	CNA	NaN	2	28	...	Union Base-Ball Grounds	NaN	104	102	CHI	CH1	CH1
2	1871	NA	CL1	CFC	NaN	8	29	...	National Association Grounds	NaN	96	100	CLE	CL1	CL1
3	1871	NA	FW1	KEK	NaN	7	19	...	Hamilton Field	NaN	101	107	KEK	FW1	FW1
4	1871	NA	NY2	NNA	NaN	5	33	...	Union Grounds (Brooklyn)	NaN	90	88	NYU	NY2	NY2

	yearID	lgID	teamID	franchID	divID	Rank	G	...	BPF	PPF	teamIDBR	teamIDlahman45	teamIDretro	salary	win_percent
0	1871	NA	BS1	BNA	NaN	3	31	...	103	98	BOS	BS1	BS1	NaN	64.516129
1	1871	NA	CH1	CNA	NaN	2	28	...	104	102	CHI	CH1	CH1	NaN	67.857143
2	1871	NA	CL1	CFC	NaN	8	29	...	96	100	CLE	CL1	CL1	NaN	34.482759
3	1871	NA	FW1	KEK	NaN	7	19	...	101	107	KEK	FW1	FW1	NaN	36.842105
4	1871	NA	NY2	NNA	NaN	5	33	...	90	88	NYU	NY2	NY2	NaN	48.484848

	yearID	round	teamIDwinner	lgIDwinner	teamIDloser	lgIDloser	wins	losses	ties
0	1884	WS	PRO	NL	NYP	AA	3	0	0
1	1885	WS	CHC	NL	STL	AA	3	3	1
2	1886	WS	STL	AA	CHC	NL	4	2	0
3	1887	WS	DTN	NL	STL	AA	10	5	0
4	1888	WS	NYG	NL	STL	AA	6	4	0

	teamID	yearID	post_W	post_L	post_G	salary	post_win_percent
0	ANA	2002	11	5	16	61721667.0	68.750000
1	ANA	2004	3	0	3	100534667.0	100.000000
2	ARI	1999	3	1	4	68703999.0	75.000000
3	ARI	2001	11	6	17	85082999.0	64.705882
4	ARI	2002	3	0	3	102819999.0	100.000000

Baseball Statistics

Exploring MLB Player's Salaries

Distribution of player's salaries over time

Distribution of Team payrolls over time

Data Wrangling

Analysis and Conclusions

Exploring regular season Wins

Analysis and Conclusions

Summary

Reference