May 2016
Written by Anthony Riccio at NYU Stern
Contact: ajr587@stern.nyu.edu
Mark McGwire and Sammy Sosa are the two names that come to mind when I think about the steroid era in Major League Baseball (MLB). The steroid era refers to a period in MLB when a number of players were believed to have used performance-enhancing drugs (PEDs) to improve offensive performance. David Wells stated that “25 to 40 percent of all Major Leaguers used PEDs.” Jose Canseco stated in his tell-all book ‘Juiced’ that as many as 80 percent of Major Leaguers used PEDs.
Although steroids were banned in MLB in 1991, the league did not implement league-wide testing until 2003. The steroid era is generally considered to have run from the late ‘80s through the late 2000s. The one thing that baseball has a lot of is data, so I wanted to take a deeper dive into the offensive statistics from what was considered the steroid era and see if I can draw any conclusions based on the data.
In [29]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
%matplotlib inline
# check versions
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
Packages Imported
I imported pandas, a Python package for data manipulation and analysis; matplotlib.pyplot for graphing; and NumPy for scientific computing. The sys and datetime modules are used only to print the Python version and today's date.
I wanted a data set with extensive coverage of the most common offensive statistics (e.g. home runs, hits), in a format that I can break down into easily manageable pieces.
After some research, I stumbled upon Sean Lahman's archive of baseball statistics (http://www.seanlahman.com/baseball-archive/statistics/)
This is the data set I will be using for my project. The code below reads a copy of the Batting table hosted on GitHub.
In [2]:
url = 'https://raw.githubusercontent.com/maxtoki/baseball_R/master/data/Batting.csv' # importing our data
bs = pd.read_csv(url)
bs.head() # taking a look at the data
Out[2]:
In [3]:
print('Dimensions: ', bs.shape) # looking at the categories I can work with
print('Column labels: ', bs.columns)
print('Row labels: ', bs.index)
In [4]:
names = list(bs) # Changing the year label
bs = bs.rename(columns={names[1]: 'Year'})
bs.head(2)
Out[4]:
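Renaming by position (`names[1]`) depends on the column order of the file. A minimal sketch of renaming by name instead, assuming the year column is called `yearID` as it is in the Lahman Batting table. The helper name `rename_year` and the tiny data frame are made up for illustration.

```python
import pandas as pd

def rename_year(df, old="yearID", new="Year"):
    """Rename the year column by name; fail loudly if it is missing."""
    if old not in df.columns:
        raise KeyError(f"expected a column named {old!r}")
    return df.rename(columns={old: new})

# Made-up rows, only to show the call (not real statistics)
demo = pd.DataFrame({"playerID": ["a01", "b02"], "yearID": [1998, 1999], "HR": [3, 4]})
renamed = rename_year(demo)
print(list(renamed.columns))
```

If the file layout ever changes, this version raises an error instead of silently renaming the wrong column.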
In [5]:
bsmall = bs.head() #inspecting the data
bsmall
Out[5]:
In [6]:
bsmall.shape
Out[6]:
In [7]:
bsmall.describe() # summary statistics for the numeric columns
Out[7]:
Number of Home Runs by Year
Looking at home runs over a 25-year period shows something interesting: the number of home runs per season begins to spike over the '94-'00 seasons. This period lines up with what is generally described as the steroid era, which makes it a substantial piece of evidence for our conclusions. Home runs spike during this period and then begin to dip as the league became more stringent about testing.
In [8]:
hrs_by_year = bs.groupby(['Year'])['HR'].sum() # grouping data by year and hrs
In [9]:
hrs_by_year[2004] # check a data point
Out[9]:
In [10]:
hrs_by_year.tail(25)
Out[10]:
In [11]:
plt.figure(figsize = (16,7))
plt.plot(hrs_by_year.tail(25), color = 'red', marker= '*')
plt.suptitle('Total Home Runs by Year', fontsize=18)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Home Runs', fontsize=12)
Out[11]:
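Note that `tail(25)` picks the last 25 years in whatever file is loaded, so the window moves if the data set is updated. A rough sketch of selecting an explicit range of years instead, assuming the Series is indexed by year as `hrs_by_year` is. The helper name `year_window` and the numbers are made up for illustration.

```python
import pandas as pd

def year_window(series, start, end):
    """Return the part of a year-indexed Series from start to end, inclusive."""
    return series.sort_index().loc[start:end]

# Made-up totals indexed by year, only to show the call
totals = pd.Series([10, 12, 15, 11], index=[1990, 1991, 1992, 1993])
window = year_window(totals, 1991, 1992)
print(window)
```

With the real data this could be called as `year_window(hrs_by_year, first_year, last_year)` and the result passed to `plt.plot` in place of `hrs_by_year.tail(25)`.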
Number of Stolen Bases by Year
Next, I wanted to look at the number of stolen bases by season to see how it compared to home runs and doubles. My expectation was that as players bulked up they would lose speed and ultimately steal fewer bases. The data is consistent with this explanation, or with the alternative that the increase in home runs and doubles reduced the need for players to steal bases.
In [12]:
sb_by_year = bs.groupby(['Year'])['SB'].sum() # grouping data by year and stolen bases
In [13]:
sb_by_year.tail(25)
Out[13]:
In [14]:
plt.figure(figsize = (16,7)) # plotting the data
plt.plot(sb_by_year.tail(25), color = 'blue', marker= '*')
plt.suptitle('Stolen Bases by Year', fontsize=18)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Stolen Bases', fontsize=12)
Out[14]:
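Home runs and stolen bases are on different scales, which makes the two plots hard to compare by eye. One possible way to put them side by side is to index each series to 100 in its first year. This is only a sketch: the helper name `index_to_base` and the numbers are made up for illustration.

```python
import pandas as pd

def index_to_base(series):
    """Scale a Series so that its first value equals 100."""
    return series * 100 / series.iloc[0]

# Made-up yearly totals, only to show the call
yearly = pd.DataFrame(
    {"HR": [100, 120, 150], "SB": [200, 190, 180]},
    index=[1990, 1991, 1992],
)
indexed = yearly.apply(index_to_base)  # applied column by column
print(indexed)
```

A value above 100 means the statistic has grown since the first year in the window, and a value below 100 means it has fallen.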
Number of Doubles by Year
I also wanted to look at the number of doubles by season over the same time period to see if it was correlated with home runs. I would expect to see a positive correlation between the two, because a high percentage of doubles are almost home runs that just didn't have the height or distance to make it out of the park. Looking at the data, we do see a spike in doubles around the same time as home runs.
In [15]:
scndbase_by_year = bs.groupby(['Year'])['2B'].sum() # grouping data by year and doubles
scndbase_by_year.tail(25)
Out[15]:
In [16]:
plt.figure(figsize = (16,7)) # plotting our data
plt.plot(scndbase_by_year.tail(25), color = 'green', marker= '*')
plt.suptitle('Doubles by Year', fontsize=18)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Doubles', fontsize=12)
Out[16]:
Number of Hits by Year
I also wanted to plot the number of hits, because hits could have increased at the same time as home runs. If so, the increase in hits could have driven up the number of home runs in proportion, which would not support our assumption. The data shows a positive correlation between hits and home runs, and between hits and doubles.
In [17]:
hits_by_year = bs.groupby(['Year'])['H'].sum() # grouping data by year and hits
hits_by_year.tail(25)
Out[17]:
In [18]:
plt.figure(figsize = (16,7)) # plotting our data
plt.plot(hits_by_year.tail(25), color = 'blue', marker= '*')
plt.suptitle('Hits by Year', fontsize=18)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Hits', fontsize=12)
Out[18]:
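The correlations mentioned in these sections are read off the plots. A minimal sketch of computing them directly, assuming a batting table with `Year`, `HR`, `2B`, `H` and `SB` columns like `bs`. The helper name `yearly_correlations` and the rows are made up for illustration.

```python
import pandas as pd

def yearly_correlations(batting, columns=("HR", "2B", "H", "SB"), year_col="Year"):
    """Sum the given columns per year and return their Pearson correlation matrix."""
    totals = batting.groupby(year_col)[list(columns)].sum()
    return totals.corr()

# Made-up player-season rows, only to show the call
demo = pd.DataFrame({
    "Year": [1990, 1990, 1991, 1991, 1992, 1992],
    "HR":   [5, 10, 8, 12, 12, 15],
    "2B":   [10, 20, 15, 20, 20, 25],
    "H":    [80, 120, 90, 125, 100, 130],
    "SB":   [20, 10, 15, 9, 10, 5],
})
corr = yearly_correlations(demo)
print(corr.round(2))
```

Season totals that all grow with the number of games played will correlate for that reason alone, so a high value here is a starting point rather than proof of a link.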
Stacked Bar Graph of Hits
So let's dive a little deeper and break down home runs and doubles as a percentage of hits. This stacked bar graph helps show the percentage increase in doubles and home runs, but it does not display as clear a picture as I would like.
In [19]:
df = bs.groupby(['Year'])[['2B','HR','H']].sum() # looking at home runs and doubles as a % of total hits
df = df.tail(25).copy() # copy so the new columns are added to an independent frame
df['Other'] = df['H'] - df['HR'] - df['2B']
df['% HR'] = df['HR'] / df['H']
df['% 2B'] = df['2B'] / df['H']
df['% Other'] = df['Other'] / df['H']
df = df.drop(columns=['2B', 'HR', 'H', 'Other']) # keep only the percentage columns
df
Out[19]:
In [20]:
my_plot = df.plot(kind='bar', stacked=True, figsize=(16,7), fontsize=14) # creating a stacked bar graph
my_plot.set_title("Hits by Category", fontsize=16)
my_plot.set_xlabel("Year", fontsize=14)
my_plot.set_ylabel("Share of Hits", fontsize=14)
my_plot.legend(["HR","2B","Other"], loc=9, ncol=3) # same order as the columns: % HR, % 2B, % Other
Out[20]:
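Share of hits is one way to normalize home runs. Another, sketched here, is home runs per at-bat, which does not depend on how many games or teams there were in a season. This assumes the batting table has an `AB` (at-bats) column, as the Lahman Batting table does. The helper name `hr_per_at_bat` and the rows are made up for illustration.

```python
import pandas as pd

def hr_per_at_bat(batting, year_col="Year"):
    """Return home runs divided by at-bats for each year."""
    totals = batting.groupby(year_col)[["HR", "AB"]].sum()
    return totals["HR"] / totals["AB"]

# Made-up player-season rows, only to show the call
demo_ab = pd.DataFrame({
    "Year": [1990, 1990, 1991, 1991],
    "HR":   [5, 10, 8, 12],
    "AB":   [400, 600, 450, 550],
})
rate = hr_per_at_bat(demo_ab)
print(rate)
```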
% of Home Runs
So I decided to plot just the percentage of hits that were home runs to see if this can help us draw conclusions about the steroid era. Looking at the graph, we can indeed see that even though hits increased, the percentage of hits that were home runs also increased over the steroid period in baseball.
In [21]:
df2 = df.plot(kind='barh', figsize=(16,7), fontsize=14, y='% HR', title='% of Hits HRs') # plotting HRs as a % of total hits
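To put a number on what the bar chart shows, one option is to compare the average home run share inside a chosen window of years with the average outside it. This is a rough sketch, assuming a year-indexed Series such as `df['% HR']`. The helper name `mean_inside_outside`, the window endpoints and the values are made up for illustration, and the choice of window is a judgment call.

```python
import pandas as pd

def mean_inside_outside(series, start, end):
    """Mean of a year-indexed Series inside [start, end] and outside it."""
    inside = (series.index >= start) & (series.index <= end)
    return series[inside].mean(), series[~inside].mean()

# Made-up shares indexed by year, only to show the call
share = pd.Series([0.08, 0.09, 0.11, 0.12, 0.10], index=[1990, 1991, 1992, 1993, 1994])
inside_mean, outside_mean = mean_inside_outside(share, 1992, 1993)
print(inside_mean, outside_mean)
```

A gap between the two means describes the size of the shift, but it does not by itself say what caused it.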
In conclusion, there was a clear power surge in baseball offense over the 1993-2004 period that increased the number of home runs hit, both in total and as a percentage of hits. This power surge begins to fade around the time the league increased its focus on testing for performance-enhancing drugs. So the offensive records broken over this period shouldn't stand without an asterisk, because it seems like steroids were a major factor.