Since the first Tommy John surgery was performed in 1974, shoulder and elbow injuries have become priority issues for players, coaches, and general managers. Recovery from shoulder and elbow soft tissue injuries, particularly ulnar collateral ligament (UCL) tears and glenoid labrum tears, is often slow and grueling because of the drastic nature of surgical reconstruction and the intense rehabilitation required. With 112 UCL injuries requiring reconstructive surgery in the 2015 season alone, the competitive and economic costs continue to rise, prompting many to investigate risk factors associated with upper extremity injuries. Some have posited an association with rising fastball velocity, pitch counts, and pitch variability; however, because of small sample sizes, few have found statistically significant relationships. Nevertheless, many professional and amateur organizations are taking conservative approaches to developing young pitchers, encouraging them to limit pitch counts, extend rest between starts, and delay the use of off-speed pitches.
In the first part of this paper, I used MLB disabled list (DL) data compiled by FanGraphs writer Jeff Zimmerman and salary information provided by Spotrac to compute the average length of playing time lost due to injury and the economic cost of lost salary over the past five seasons. In the second part, I used data from pitchf/x, a pitch tracking system created by Sportvision and installed in every MLB stadium, to look at pitching characteristics leading up to an injury.
In [37]:
'''Data were imported from referenced sources and stored locally'''
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
import statsmodels.formula.api as smf
%matplotlib inline
file1 = '/Users/isaacgammal/Desktop/Sports data/pitchers.xlsx'
df1 = pd.read_excel(file1, usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12,13]) #injured pitchers and salaries
file2 = '/Users/isaacgammal/Downloads/fx.xlsx' #pitchf/x data for injured and healthy pitchers
df2 = fx = pd.read_excel(file2)
In [38]:
df1.head()
Out[38]:
In [39]:
# compute and plot mean length of time on disabled list by injury location
x = df1['Days on DL'].groupby(df1['Location']).mean()
plt.style.use('fivethirtyeight')  # set the style before creating the figure so it applies to it
fig, ax = plt.subplots()
x.plot(kind='barh', ax=ax, legend=False)
ax.set_title('Time Spent on Disabled List by Injury Location', fontsize=16)
ax.set_xlabel('Average Days on DL')
ax.set_ylabel('Injury Location')
# highlight two bars in red; these indices are positions in the Axes' child list
# and depend on the order of the plotted locations
ax.get_children()[4].set_color('r')
ax.get_children()[17].set_color('r')
The graph above plots average length of disability due to injury, broken down by injury location. Shoulder and elbow injuries are far and away the most devastating and common pitching injuries. The aggregate number of days spent on the DL due to UCL injury is 10,414, representing a staggering 31% of the total number of days for all elbow injuries, and 12% for all pitching injuries.
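The shares quoted above are ratios of summed DL days. The notebook does not show how they were computed, so the following is only a minimal sketch on made-up data. `Location` and `Days on DL` are columns used in the cells above; the `Injury` column, which is used here to pick out UCL injuries, is a hypothetical name, as are all the values.

```python
import pandas as pd

# Toy stand-in for df1. 'Injury' is a hypothetical column and every value is invented.
toy = pd.DataFrame({
    'Location': ['Elbow', 'Elbow', 'Elbow', 'Shoulder', 'Back'],
    'Injury': ['UCL tear', 'Inflammation', 'UCL tear', 'Labrum tear', 'Strain'],
    'Days on DL': [180, 20, 100, 60, 40],
})

# Total DL days across all injuries and across elbow injuries only.
total_days = toy['Days on DL'].sum()
elbow_days = toy.loc[toy['Location'] == 'Elbow', 'Days on DL'].sum()

# DL days attributed to UCL injuries (matched by a text search on the hypothetical column).
is_ucl = toy['Injury'].str.contains('UCL', case=False)
ucl_days = toy.loc[is_ucl, 'Days on DL'].sum()

# Shares of elbow-injury days and of all injury days.
ucl_share_of_elbow = ucl_days / elbow_days
ucl_share_of_all = ucl_days / total_days

print(f'UCL days on DL: {ucl_days}')
print(f'Share of elbow days: {ucl_share_of_elbow:.0%}')
print(f'Share of all days: {ucl_share_of_all:.0%}')
```

With the real data, the same three sums would give the aggregate days and the two percentages reported in the text.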
In [40]:
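# compute and plot mean salary by injury location
y = df1['Salary'].groupby(df1['Location']).mean()
plt.style.use('fivethirtyeight')  # set the style before creating the figure so it applies to it
fig, ax = plt.subplots(figsize=(10,6))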
y.plot(kind='barh', ax=ax, legend=False)
ax.set_title('Average Sunk Salary by Injury Location', fontsize=16)
ax.set_xlabel('Average Salary')
ax.set_ylabel('Injury Location')
ax.get_children()[4].set_color('r')
ax.get_children()[17].set_color('r')
Interestingly, despite being the most severe injuries, shoulder and elbow injuries fall in the middle of the pack in terms of lost salary (about $4,000,000 on average). There are several possible explanations for this finding. One is that pitchers with a history of these injuries are flagged as injury risks and then offered lower salaries in contract negotiations.
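The notebook does not state whether the `Salary` column is already prorated to the time missed. If it held full-season salaries, one way to approximate lost salary would be to scale each salary by the fraction of the season spent on the DL. The sketch below does this on invented numbers. The 183-day season length, the `Lost salary` column, and the cap at one full season are all assumptions made for illustration, not the author's method.

```python
import pandas as pd

SEASON_DAYS = 183  # assumed regular-season length in days; adjust for the season in question

# Toy stand-in for df1 with invented full-season salaries.
toy = pd.DataFrame({
    'Location': ['Elbow', 'Shoulder', 'Back'],
    'Salary': [3_660_000, 1_830_000, 915_000],
    'Days on DL': [183, 61, 366],
})

# Fraction of the season missed, capped at 1 so no stint costs more than one season's pay.
fraction_missed = (toy['Days on DL'] / SEASON_DAYS).clip(upper=1.0)

# Prorated salary paid while the pitcher was unavailable.
toy['Lost salary'] = toy['Salary'] * fraction_missed

# Average and total lost salary, grouped the same way as the plot above.
mean_lost = toy.groupby('Location')['Lost salary'].mean()
total_lost = toy['Lost salary'].sum()

print(mean_lost)
print(f'Total lost salary: ${total_lost:,.0f}')
```

The cap is a simplification: a stint that spans two seasons would more properly be split across both seasons' salaries.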
Next I loaded the pitchf/x database and merged it with the DL database. Because the data were divided into injured and healthy pitchers, I first handled the two groups separately and then concatenated them to get a database of all pitchers. The variables in the pitchf/x database included maximum velocity (vFA), the difference between the maximum and minimum velocity pitch (delta), and the number of unique pitches thrown (# pitches). These variables were used as predictors and regressed against innings pitched (IP), a continuous variable used as a proxy for injury and scaled up for relievers relative to starters.
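A left merge keeps every DL record even when no pitchf/x row matches, which silently introduces missing values into the regression variables. The sketch below, on invented data, shows one way to check for that with the `indicator` option of `pd.merge` and to label the two groups before stacking them. The `injured` flag column and all of the toy values are assumptions added for illustration; only `Name`, `Season`, `Days on DL`, `vFA`, and `IP` come from the notebook.

```python
import pandas as pd

# Toy stand-ins for the DL table, the pitchf/x table, and the healthy-pitcher table.
dl = pd.DataFrame({'Name': ['A', 'B'], 'Season': [2014, 2015], 'Days on DL': [60, 120]})
pitchfx = pd.DataFrame({'Name': ['A', 'C'], 'Season': [2014, 2015],
                        'vFA': [93.1, 95.4], 'IP': [50.0, 180.0]})
healthy = pd.DataFrame({'Name': ['C'], 'Season': [2015], 'vFA': [95.4], 'IP': [180.0]})

# Left merge on pitcher-season; indicator=True adds a '_merge' column recording
# whether each row was found in both tables or only in the left one.
merged = pd.merge(dl, pitchfx, how='left', on=['Name', 'Season'], indicator=True)

# DL records with no pitchf/x match: their vFA and IP will be NaN.
unmatched = merged[merged['_merge'] == 'left_only']
print('Unmatched DL records:', list(unmatched['Name']))

# Flag each group (hypothetical 'injured' column), then stack into one table.
fx_injured = merged.drop(columns='_merge').assign(injured=1)
fx_healthy = healthy.assign(injured=0)
fx_all = pd.concat([fx_injured, fx_healthy], axis=0, ignore_index=True)

print(fx_all)
```

Rows with missing values in a formula's variables are dropped by default when the model is fit, so counting unmatched records first shows how many injured pitcher-seasons actually enter each regression.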
In [56]:
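# attach pitchf/x data to each injured pitcher-season (the left join keeps every DL record)
fx_injured = pd.merge(df1,df2,how='left',on=['Name','Season'])
fx_healthy = pd.read_csv('/Users/isaacgammal/Downloads/healthy.csv')
# stack injured and healthy pitchers; ignore_index avoids duplicate row labels
fx = pd.concat([fx_injured,fx_healthy],axis=0,ignore_index=True)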
In [57]:
fx.head()
Out[57]:
In [58]:
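# vFA is on the left of '~', so it is the response and IP is the regressor
lm = smf.ols(formula='vFA ~ IP', data=fx).fit()
# summary() already includes the coefficients; only a cell's last expression is displayed
lm.summary()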
Out[58]:
In [59]:
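# Delta is on the left of '~', so it is the response and IP is the regressor
lm2 = smf.ols(formula='Delta ~ IP', data=fx).fit()
# summary() already includes the coefficients; only a cell's last expression is displayed
lm2.summary()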
Out[59]:
Regressing max velocity and delta against innings pitched did not yield strong fits: the R^2 is low in both models. However, the regression of delta against innings pitched revealed a significant positive relationship, suggesting that pitchers who vary their speeds more tend to log more innings and, to the extent that innings pitched is a valid proxy for health, may be more likely to avoid injury.
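A low R^2 and a significant slope are not contradictory: with enough observations, a weak but real association can be detected even though it explains little of the variance. The sketch below illustrates this on synthetic data, not the notebook's data, using the same `smf.ols` call as above. It also fits the model in both directions to show that, in a one-variable regression, the R^2, the sign of the slope, and its t-statistic do not depend on which variable is placed on the left of `~`. The sample size, coefficients, and noise level are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data only: a weak positive link between Delta and IP plus a lot of noise.
rng = np.random.default_rng(0)
n = 2000
delta = rng.normal(loc=10.0, scale=2.0, size=n)
ip = 100.0 + 2.25 * (delta - 10.0) + rng.normal(scale=30.0, size=n)
toy = pd.DataFrame({'Delta': delta, 'IP': ip})

# Same orientation as the notebook's formula: Delta is the response.
fit_a = smf.ols('Delta ~ IP', data=toy).fit()
# Reversed orientation: IP is the response.
fit_b = smf.ols('IP ~ Delta', data=toy).fit()

# R^2 is the squared correlation in both cases, so it is identical.
print('R^2, Delta ~ IP:', round(fit_a.rsquared, 4))
print('R^2, IP ~ Delta:', round(fit_b.rsquared, 4))

# The slope estimates differ in size but share their sign and t-statistic.
print('slope t-stat, Delta ~ IP:', round(fit_a.tvalues['IP'], 3))
print('slope t-stat, IP ~ Delta:', round(fit_b.tvalues['Delta'], 3))
print('slope p-value:', fit_a.pvalues['IP'])
```

A significant slope in either orientation describes an association only. Whether varying speeds protects pitchers or healthier pitchers simply get to throw more innings cannot be read from the fit.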
Pitching injuries are widespread and contributed to nearly $1.4 billion in lost salary over the past five seasons. Despite the ubiquity of these injuries, our knowledge of the associated risk factors is still limited. Further work should focus on developing models to predict pitching injuries, since effective preventative measures depend on knowing which pitchers are at risk.