A project by Nathan Ding (njd304@stern.nyu.edu) on the effects of temperature on major league batters
Spring 2016 Semester
The Ideal Gas Law (PV = nRT) tells us that as the temperature of a gas rises in a rigid container, the pressure of the gas rises with it, because the gas molecules move faster on average. In essence, the amount of energy contained within the system rises, as heat is nothing more than thermal (kinetic) energy. While the Ideal Gas Law holds only for gases, a similar increase in molecular vibrations, and therefore energy, is seen in solid objects as well. When the temperature rises, the amount of energy contained within a solid increases. The purpose of this project is to examine the effects of temperature on the game of baseball, with specific regard to the hitting aspect of the game.
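To put a rough number on the gas-law relationship described above, here is a minimal sketch. It assumes a fixed volume and a fixed amount of gas, so that pressure scales with absolute temperature (P2/P1 = T2/T1). The function names and the 50 and 90 degree Fahrenheit example temperatures are illustrative choices, not values taken from the project data.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvin (the gas law needs absolute temperature)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


def pressure_ratio(t1_f, t2_f):
    """P2 / P1 for a gas held at constant volume and amount: P2/P1 = T2/T1."""
    return fahrenheit_to_kelvin(t2_f) / fahrenheit_to_kelvin(t1_f)


# Illustrative temperatures only: a cold game (50 F) versus a hot game (90 F).
ratio = pressure_ratio(50.0, 90.0)
print(f"Pressure at 90 F is {ratio:.3f} times the pressure at 50 F")
```

Under these assumptions, the 40 degree Fahrenheit swing changes the pressure by only about 8 percent, because the comparison has to be made in kelvin rather than in degrees Fahrenheit.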
The art of hitting an MLB fastball combines an incredible amount of luck, lightning-fast reflexes, and skill. Hitters often have less than half a second to determine whether or not to swing at a ball. However, when sharp contact is made with a fastball screaming towards the plate at over 90 miles per hour, the sheer velocity and energy the ball carries with it helps it fly off the bat at an even faster speed. The higher the pitch velocity, the more energy a ball contains, and the faster its "exit velocity" (the speed of the ball as it leaves the bat). This project looks to examine whether or not the extra energy provided by the ball's temperature plays a significant role in MLB hitters' ability to hit the ball harder. By analyzing the rates of extra base hits (doubles, triples, and home runs, which generally require a ball to be hit much harder and farther than a single) at different temperature ranges, I hope to discover a significant correlation between temperature and hitting rates.
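The "less than half a second" figure above can be checked with simple arithmetic. The sketch below assumes the ball travels the full 60.5 feet from the pitching rubber to home plate at a constant speed. Both are simplifications, since pitchers release the ball somewhat closer to the plate and the ball slows in flight. The function and constant names are illustrative.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600
RUBBER_TO_PLATE_FT = 60.5  # regulation distance from pitching rubber to home plate


def flight_time_seconds(speed_mph, distance_ft=RUBBER_TO_PLATE_FT):
    """Time for a pitch to cover distance_ft at a constant speed_mph."""
    feet_per_second = speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
    return distance_ft / feet_per_second


for mph in (85, 90, 95, 100):
    print(f"{mph} mph pitch: {flight_time_seconds(mph):.3f} seconds to the plate")
```

At 90 miles per hour the ball covers 132 feet per second, so even under these generous assumptions the hitter has well under half a second from release to contact.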
In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
%matplotlib inline
Pandas: I imported pandas for use in reading my many .csv files and because the pandas module contains dataframes, which are much easier to use for data analysis than lists or dictionaries.
matplotlib.pyplot: matplotlib.pyplot was used to create graphs and scatterplots of the data, and because the creation of figure and axis objects with matplotlib allows for easier manipulation of the physical aspects of a plot.
statsmodels.formula.api: statsmodels.formula.api was imported for the linear regression models at the end of this project.
The data for this project was collected from baseball-reference.com's Play Index, which allows users to sort and search for baseball games based on a multitude of criteria, including team, player, and weather conditions (temperature, wind speed/direction, and precipitation). Unfortunately, the Play Index only allows registered users to access and export a list of 300 games at a time. As a result, I had to download 33 separate CSV files from the website to gather all 9-inning MLB games from the 2013 - 2015 seasons. The total number of games used in this data set was 8805. Because the filenames were all 'C:/Users/Nathan/Desktop/BaseBall/data/play-index_game_finder.cgi_ajax_result_table.csv' followed by a number in parentheses, I was able to use a for loop to combine all the data into one large dataframe.
An online version of these files is available at this link
In [16]:
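#import data from CSVs
base = 'C:/Users/Nathan/Desktop/BaseBall/data/play-index_game_finder.cgi_ajax_result_table'
#the first download has no number; the later ones end in " (1)", " (2)", ...
frames = [pd.read_csv(base + '.csv')]
#range(1,32) reads the files numbered (1) through (31)
for i in range(1,32):
    file = base + ' (' + str(i) +').csv'
    frames.append(pd.read_csv(file))
#DataFrame.append was removed in pandas 2.0, so concatenate all the frames at once
total = pd.concat(frames)
total.head(30)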
Out[16]:
Because of the nature of the baseball-reference.com Play Index, there were some repeated games in the CSV files, and after every 25 games the headers would reappear. To clean the data, I first removed the unnecessary columns by looping over the column names and deleting every column not named in the important list. I then removed each row where the 'Temp' value was the string 'Temp', because those rows were repeated header rows. Finally, I removed the duplicate entries from the dataframe using the df.drop_duplicates() method.
In [17]:
#Clean data to remove duplicates, unwanted stats
important = ['Date','H', '2B', '3B', 'HR', 'Temp']
#loop over a copy of the column names so deleting columns mid-loop is safe
for col in list(total.columns):
    if col in important:
        continue
    del total[col]
#remove headers
total = total[total.Temp != 'Temp']
#remove duplicates
total = total.drop_duplicates()
#remove date -> cannot remove before because there are items that are identical except for date
del total['Date']
# remove date from important list
important.remove('Date')
total.head(5)
Out[17]:
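The cleaning steps above (dropping repeated header rows, dropping duplicate games, then converting the stats to integers) can be sketched on a tiny made-up table. The four CSV rows below are invented for illustration and are not taken from the Play Index export; only the column names match the ones used in this project.

```python
import io

import pandas as pd

# Invented sample: one duplicated game and one repeated header row.
raw = """Date,H,2B,3B,HR,Temp
2015-04-06,9,2,0,1,55
2015-04-06,9,2,0,1,55
Date,H,2B,3B,HR,Temp
2015-04-07,7,1,1,0,48
"""

sample = pd.read_csv(io.StringIO(raw))

# The repeated header forces every column to be read as text,
# so the header rows can be found by comparing against the column name.
clean = sample.loc[sample["Temp"] != "Temp"].drop_duplicates().copy()

# With the header rows gone, the stat columns convert cleanly to integers.
stat_cols = ["H", "2B", "3B", "HR", "Temp"]
clean = clean.astype({col: int for col in stat_cols})

print(clean)
print(clean.dtypes)
```

The conversion to integers has to come after the header rows are removed, since the text 'Temp' cannot be parsed as a number.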
In [18]:
#change dtypes to int
total[['Temp', 'HR', '2B', '3B', 'H']] = total[['Temp', 'HR', '2B', '3B', 'H']].astype(int)
total.dtypes
Out[18]:
In [19]:
#calculate extra-base-hits (XBH) (doubles, triples, home runs) for each game
#by creating a new column in the dataframe
total['XBH'] = total['2B'] + total['3B'] + total['HR']
#append XBH to important list
important.append('XBH')
In [20]:
#separate data into new dataframes based on temperature ranges
#50 and below
minus50 = total[total.Temp <= 50]
#50-60
t50 = total[total.Temp <= 60]
t50 = t50[t50.Temp > 50]
#60-70
t60 = total[total.Temp <= 70]
t60 = t60[t60.Temp > 60]
#70-80
t70 = total[total.Temp <= 80]
t70 = t70[t70.Temp > 70]
#80-90
t80 = total[total.Temp <= 90]
t80 = t80[t80.Temp > 80]
#90-100
t90 = total[total.Temp <= 100]
t90 = t90[t90.Temp > 90]
#over 100
over100= total[total.Temp > 100]
minus50.head(5)
Out[20]:
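As an alternative to building one dataframe per temperature range by hand, the same binning could be sketched with pd.cut, which assigns each game a range label in one step. This is not the approach used in this project, just a compact equivalent. The six games below are made up, and the label strings are my own choice; the bin edges mirror the ranges above, where each range excludes its lower edge and includes its upper edge.

```python
import pandas as pd

# Made-up games for illustration only.
games = pd.DataFrame({
    "Temp": [45, 50, 51, 60, 75, 101],
    "XBH": [2, 4, 3, 5, 6, 1],
})

# Each bin is (lower, upper], matching "Temp > lower" and "Temp <= upper".
edges = [float("-inf"), 50, 60, 70, 80, 90, 100, float("inf")]
labels = ["<=50", "51-60", "61-70", "71-80", "81-90", "91-100", ">100"]

games["range"] = pd.cut(games["Temp"], bins=edges, labels=labels)

# observed=True leaves out ranges that contain no games.
xbh_per_game = games.groupby("range", observed=True)["XBH"].mean()
print(xbh_per_game)
```

One advantage of this form is that a range cannot be silently left out of the later averages, since every game carries its own label.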
In [21]:
#New dataframe organized by temperature
#t50 (above 50, up to 60) is included so that every range built above is represented
rangelist = [minus50, t50, t60, t70, t80, t90, over100]
data_by_temp = pd.DataFrame()
data_by_temp['ranges']=['<50', "50's", "60's","70's","80's","90's",">100"]
#calculate per-game averages by temperature range
for i in important:
    data_by_temp[i+'/Game'] = [sum(x[i])/len(x) for x in rangelist]
#set index to temperature ranges
data_by_temp = data_by_temp.set_index('ranges')
data_by_temp.head(10)
Out[21]:
I made two bar graphs: one comparing average extra base hits per game by temperature range, and one comparing home runs per game. Home runs are the farthest-hit balls, so in theory they should show the largest temperature impact if there is in fact a measurable impact on the baseballs. I then made two scatterplots of the complete data to look for some sort of trendline. Unfortunately, because extra base hits and home runs per game can take only a small number of whole-number values, the scatterplots did not come out as I had hoped.
In [22]:
#plots
fig, ax=plt.subplots()
data_by_temp['XBH/Game'].plot(ax=ax,kind='bar',color='blue', figsize=(10,6))
ax.set_title("Extra Base Hits Per Game by Temp Range", fontsize=18)
ax.set_ylim(2,3.6)
ax.set_ylabel("XBH/Game")
ax.set_xlabel("Temperature")
plt.xticks(rotation='horizontal')
Out[22]:
In [23]:
#plots
fig, ax=plt.subplots()
data_by_temp['HR/Game'].plot(ax=ax,kind='bar',color='red', figsize=(10,6))
ax.set_title("Home Runs Per Game by Temp Range", fontsize=18)
ax.set_ylim(0,1.2)
ax.set_ylabel("HR/Game")
ax.set_xlabel("Temperature")
plt.xticks(rotation='horizontal')
Out[23]:
In [24]:
#scatterplot of every game: temperature vs extra base hits
fig, ax = plt.subplots()
ax.scatter(total['Temp'],total['XBH'])
ax.set_title("Temp vs Total Extra Base Hits", fontsize = 18)
ax.set_ylabel("XBH/Game")
ax.set_xlabel("Temperature")
plt.xticks(rotation='horizontal')
ax.set_ylim(-1,14)
Out[24]:
In [25]:
#scatterplot of every game: temperature vs home runs
fig, ax = plt.subplots()
ax.scatter(total['Temp'],total['HR'])
ax.set_title("Temp vs Total Home Runs", fontsize = 18)
ax.set_ylabel("HR/Game")
ax.set_xlabel("Temperature")
plt.xticks(rotation='horizontal')
ax.set_ylim(-1,10)
Out[25]:
I ran a linear regression of extra base hits on temperature for the master data set to see if there was a correlation, and did the same for home runs. The r-squared values are very small, which is not surprising given that there are realistically only a limited number of possible home runs per game and the sample size is so large (see the scatterplots above). The regressions for extra base hits and temperature, as well as home runs and temperature, both show only a minuscule correlation between temperature and hits. Because the slope values are so small (a 100 degree increase in temperature correlates to an increase of 1 extra base hit and 0.7 home runs), there is basically no meaningful correlation. After all, a 100 degree increase is basically the entire range of this project.
In [26]:
regression = smf.ols(formula="XBH ~ Temp", data=total).fit()
regression.params
regression.summary()
Out[26]:
In [27]:
regression2 = smf.ols(formula="HR ~ Temp", data=total).fit()
regression2.params
regression2.summary()
Out[27]:
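To make the slope and r-squared figures discussed above more concrete, here is a minimal sketch of a one-variable least-squares fit written out by hand. It is not the statsmodels implementation, and the six games below are made up for illustration, so the numbers it prints are not the results of this project. The helper name least_squares is my own.

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r_squared) for a simple linear fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up games for illustration only (not taken from the Play Index data).
temps = [50, 50, 70, 70, 90, 90]
xbh = [1, 5, 2, 5, 2, 6]

slope, intercept, r_squared = least_squares(temps, xbh)

# Multiplying the per-degree slope by 100 gives the change over a 100 degree swing.
per_100_degrees = slope * 100
print(f"slope per degree: {slope:.4f}")
print(f"change per 100 degrees: {per_100_degrees:.2f} XBH")
print(f"r-squared: {r_squared:.4f}")
```

The toy data shows the same pattern described above: a positive slope can sit alongside a very low r-squared when games at the same temperature vary widely in how many extra base hits they produce.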
Ultimately, the results of this project were mixed and more negative than positive. Though the bar graph of average extra base hits per game showed a steady increase as temperature increased, the same was not true for average home runs per game. Furthermore, the regression analysis showed only a tiny relationship between the variables. Although the results were statistically significant, this was due more to the huge sample size than to the existence of a meaningful correlation. The data collected failed to suggest that temperature has a large impact on the ability of MLB hitters to hit a baseball with power. I had been hoping to discover a more substantial effect of temperature on the ability of hitters to hit the ball far. A possible expansion of this experiment would be a team-by-team (or stadium-by-stadium) breakdown of how each team or stadium performed under different temperature conditions. For example, it is likely that the Tampa Bay Rays or the Los Angeles Angels, who are from the southern part of the US and unaccustomed to playing in colder temperatures, may have been more affected than a team like the Boston Red Sox, who regularly play in colder games, especially in the spring and fall months.
Data from this project was collected from the baseball-reference.com Play Index, which can be found at http://www.baseball-reference.com/play-index/. In order to unlock the full potential of the Play Index, a paid membership is required.