To begin, we load the data into a pandas DataFrame from a CSV file.
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/data.csv') # read in the csv file
Let's take a cursory glance at the data to see what we're working with.
In [2]:
df.head()
Out[2]:
There's a lot of data we don't care about. For example, 'PassAttempt' is a binary attribute, but the 'PlayType' attribute is already set to 'Pass' for passing plays, making 'PassAttempt' redundant.
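Before dropping a column as redundant, we can sanity-check the claim. This is a minimal sketch on a toy frame with the same two column names but made-up rows:

```python
import pandas as pd

# Toy frame mimicking the two columns in question; the real data has many more rows
plays = pd.DataFrame({
    'PlayType':    ['Pass', 'Run', 'Pass', 'Punt'],
    'PassAttempt': [1, 0, 1, 0],
})

# If 'PassAttempt' is redundant, it should agree with (PlayType == 'Pass') on every row
redundant = ((plays['PlayType'] == 'Pass').astype(int) == plays['PassAttempt']).all()
print(redundant)  # True
```

Running the same check on the full DataFrame would confirm whether the two columns ever disagree before deleting one.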
We define a list of the columns we're not interested in, and then delete them.
In [3]:
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder',
                     'PosTeamScore', 'PassAttempt', 'RushAttempt',
                     'DefTeamScore', 'Season', 'PlayAttempted']

# Iterate through and delete the columns we don't want
for col in columns_to_delete:
    if col in df:
        del df[col]
We can then grab a list of the remaining column names.
In [4]:
df.columns
Out[4]:
As a temporary step, we replace every missing value (NaN) with -1 so that the columns can be cast to numeric types instead of objects.
In [5]:
df = df.replace(to_replace=np.nan,value=-1)
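As an aside, `fillna` is the more direct spelling of this NaN replacement. A minimal sketch on a toy frame (the column name here is made up) shows the two are equivalent:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'yards': [3.0, np.nan, 7.0]})

# replace(np.nan, -1) and fillna(-1) produce the same result
a = toy.replace(to_replace=np.nan, value=-1)
b = toy.fillna(-1)
print(a.equals(b))  # True
```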
At this point, many columns are stored as generic objects or with excessively large data types.
In [6]:
df.info()
We define four lists based on the types of features we're using. Binary features are separated from the other categorical features so that they can be stored in less space.
In [7]:
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
'ScoreDiff', 'AbsScoreDiff']
ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
categorical_features = (df.columns.difference(continuous_features)
                                  .difference(ordinal_features)
                                  .difference(binary_features))
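`Index.difference` returns the (sorted) labels not present in the passed list, which is what lets us define the categorical set by elimination. A tiny sketch with made-up column names:

```python
import pandas as pd

cols = pd.Index(['TimeSecs', 'qtr', 'Passer', 'down'])
continuous = ['TimeSecs']
ordinal = ['qtr', 'down']

# Chained difference() calls peel away the labels already classified
categorical = cols.difference(continuous).difference(ordinal)
print(list(categorical))  # ['Passer']
```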
We then cast all of the columns to the appropriate underlying data types.
In [8]:
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)
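Casting binary flags down to `int8` cuts their storage by a factor of eight relative to the default `int64`, which is the point of keeping them in their own list. A quick sketch on a synthetic Series illustrates the savings:

```python
import numpy as np
import pandas as pd

# 1000 binary flags stored as int64 vs int8
flags = pd.Series(np.zeros(1000), dtype=np.int64)
print(flags.memory_usage(index=False))                   # 8000 bytes
print(flags.astype(np.int8).memory_usage(index=False))   # 1000 bytes
```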
We also recode 'PassOutcome' as a binary variable, mapping 'Complete' to 1 and 'Incomplete Pass' to 0.
In [9]:
df['PassOutcome'].replace(['Complete', 'Incomplete Pass'], [1, 0], inplace=True)
Now all of the columns are encoded the way we'd like them to be.
In [10]:
df.info()
Now we can start to take a look at what's in each of our columns.
In [11]:
df.describe()
Out[11]:
In [12]:
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
#Embed figures in the Jupyter Notebook
%matplotlib inline
#Use GGPlot style for matplotlib
plt.style.use('ggplot')
In [14]:
pass_plays = df[df['PlayType'] == "Pass"]
pass_plays_grouped = pass_plays.groupby(by=['Passer'])
We can take a random sample of passers and show their completion rate:
In [15]:
completion_rate = pass_plays_grouped.PassOutcome.sum() / pass_plays_grouped.PassOutcome.count()
completion_rate_sampled = completion_rate.sample(10)
completion_rate_sampled.sort_values(inplace=True)
completion_rate_sampled.plot(kind='barh')
Out[15]:
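The sum-over-count ratio used above is just the per-passer mean of the 0/1 'PassOutcome' column. A quick sketch on toy data (hypothetical passers 'A' and 'B') shows the pattern:

```python
import pandas as pd

outcomes = pd.DataFrame({
    'Passer':      ['A', 'A', 'A', 'B', 'B'],
    'PassOutcome': [1, 0, 1, 1, 1],
})
g = outcomes.groupby('Passer')['PassOutcome']

# For a 0/1 column, sum/count is simply the mean: completions per attempt
ratio = g.sum() / g.count()
print(ratio)  # A: 0.667, B: 1.0
```

So `pass_plays_grouped.PassOutcome.mean()` would compute the same completion rates in one call.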
We can also group by both passer and receiver, to check for highly effective QB-Receiver combos.
In [16]:
pass_plays_grouped = pass_plays.groupby(by=['Passer', 'Receiver'])
completion_rate = pass_plays_grouped.PassOutcome.sum() / pass_plays_grouped.PassOutcome.count()
completion_rate_sampled = completion_rate.sample(10)
completion_rate_sampled.sort_values(inplace=True)
completion_rate_sampled.plot(kind='barh')
Out[16]:
We can eliminate combos with 10 or fewer pass attempts together, and then re-sample the data. This removes the noise of QB-receiver combos whose completion rates are very high or low simply because they've played very little together.
In [17]:
pass_plays_grouped_filtered = pass_plays_grouped.filter(lambda g: len(g)>10).groupby(by=['Passer', 'Receiver'])
completion_rate = pass_plays_grouped_filtered.PassOutcome.sum() / pass_plays_grouped_filtered.PassOutcome.count()
completion_rate_sampled = completion_rate.sample(10)
completion_rate_sampled.sort_values(inplace=True)
completion_rate_sampled.plot(kind='barh')
Out[17]:
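`groupby().filter()` drops entire groups that fail the predicate and returns the surviving rows, not aggregates, which is why we have to re-group afterwards. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'Passer': ['A', 'A', 'A', 'B'],
                    'yds':    [5, 7, 3, 9]})

# Keep only rows belonging to groups with more than 2 rows;
# passer 'B' (one row) is dropped entirely
kept = toy.groupby('Passer').filter(lambda g: len(g) > 2)
print(kept['Passer'].unique())  # ['A']
```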
We can also extract the highest-completion percentage combos. Here we take the top-10 most reliable QB-receiver pairs.
In [18]:
completion_rate.sort_values(inplace=True, ascending=False)
completion_rate[0:10].plot(kind='barh')
Out[18]:
Next, let's find the top ten rushers based on yards per carry (only for rushers with more than 10 carries).
In [19]:
rush_plays = df[df.Rusher != -1]  # -1 marks plays with no rusher (NaN before the earlier replacement)
rush_plays_grouped = rush_plays.groupby(by=['Rusher']).filter(lambda g: len(g) > 10).groupby(by=["Rusher"])
In [20]:
yards_per_carry = rush_plays_grouped["Yards.Gained"].sum() / rush_plays_grouped["Yards.Gained"].count()
yards_per_carry.sort_values(inplace=True, ascending=False)
yards_per_carry[0:10].plot(kind='barh')
Out[20]:
Now let's take a look at defenses. Which defenses allowed the fewest overall yards?
In [21]:
def_play_groups = df.groupby("DefensiveTeam")
In [22]:
def_yards_allowed = def_play_groups["Yards.Gained"].sum()
def_yards_allowed.sort_values(inplace=True)
def_yards_allowed.plot(kind='barh')
Out[22]:
It looks like the Denver Broncos allowed the fewest yards overall. Go Broncos! Let's see if there were any specific weaknesses in the Broncos' defense.
In [23]:
broncos_def_plays = df[df.DefensiveTeam == "DEN"]
broncos_def_pass_plays = broncos_def_plays[broncos_def_plays.PlayType == "Pass"]
broncos_def_rush_plays = broncos_def_plays[broncos_def_plays.PlayType == "Run"]
print("Passing yards: " + str(broncos_def_pass_plays["Yards.Gained"].sum()))
print("Rushing yards: " + str(broncos_def_rush_plays["Yards.Gained"].sum()))
It looks like they gave up a lot more passing yards than rushing yards. Let's see what QBs caused the Denver defense the most trouble.
In [24]:
pass_plays = df[df.PlayType == "Pass"]
pass_plays_against_den = pass_plays[pass_plays.DefensiveTeam == "DEN"]
pass_plays_against_den_grouped = pass_plays_against_den.groupby("Passer")
qbs_yards_against_den = pass_plays_against_den_grouped["Yards.Gained"].sum()
qbs_yards_against_den.sort_values(inplace=True, ascending=False)
qbs_yards_against_den.plot(kind='barh')
Out[24]:
It looks like the Raiders' Derek Carr caused a lot of trouble. From this data we can see that when the Broncos play the Raiders, they need to focus more on pass defense.
To figure out which Oakland WR Denver should put their best cornerback on, let's see which Oakland receiver was the highest-performing against Denver.
In [25]:
oak_pass_plays_against_den = pass_plays_against_den[pass_plays_against_den.posteam == "OAK"]
oak_pass_plays_against_den_grouped = oak_pass_plays_against_den.groupby("Receiver")
oak_receivers_yards_against_den = oak_pass_plays_against_den_grouped["Yards.Gained"].sum()
oak_receivers_yards_against_den.sort_values(inplace=True, ascending=False)
oak_receivers_yards_against_den.plot(kind='barh')
Out[25]:
M. Rivera and M. Crabtree were the dominant receivers. Going into the game against Oakland this year, we can expect Denver to put their best cornerbacks on them.
But who are their best cornerbacks? A reasonable proxy is to rank Denver defenders by interception count.
In [26]:
denver_interceptions = pass_plays_against_den[pass_plays_against_den.InterceptionThrown == 1]
denver_interceptions_grouped = denver_interceptions.groupby("Interceptor")
denver_cornerback_interceptions = denver_interceptions_grouped.Interceptor.count()
denver_cornerback_interceptions_sorted = denver_cornerback_interceptions.sort_values(ascending=False)
denver_cornerback_interceptions_sorted.plot(kind='barh')
Out[26]:
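Counting rows per group like this is equivalent to calling `value_counts()` on the column directly. A toy sketch (the player names are reused purely for illustration):

```python
import pandas as pd

picks = pd.DataFrame({'Interceptor': ['A. Talib', 'A. Talib', 'D. Trevathan']})

# Group-and-count vs value_counts(): same descending tallies
by_group = picks.groupby('Interceptor').Interceptor.count().sort_values(ascending=False)
by_vc = picks['Interceptor'].value_counts()
print(by_group.tolist() == by_vc.tolist())  # True
```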
The Broncos' top choices to cover M. Rivera and M. Crabtree, therefore, are A. Talib and D. Trevathan.
In [13]:
import seaborn as sns
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
Let's visualize the pairwise relationships among a few numeric features, restricted to Dallas' offensive plays.
In [24]:
df_plot = df[["Yards.Gained", "yrdline100", "TimeSecs", "ydsnet", "ScoreDiff", "posteam"]]
df_plot = df_plot[df_plot.posteam == "DAL"]
sns.pairplot(df_plot, size=2)
Out[24]:
Next, let's compare the distribution of yards gained on Denver's pass and run plays.
In [37]:
den_plays = df[df.posteam == "DEN"]
den_passes = den_plays[den_plays.PlayType == "Pass"]
den_runs = den_plays[den_plays.PlayType == "Run"]
den_plays = pd.concat([den_passes, den_runs])
sns.violinplot(x="PlayType", y="Yards.Gained", data=den_plays, inner="quart")
Out[37]:
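The pass/run subset built above with two masks and a `concat` can also be expressed in a single step with `isin`. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'posteam':  ['DEN', 'DEN', 'DEN', 'OAK'],
    'PlayType': ['Pass', 'Run', 'Punt', 'Pass'],
})

# isin() selects both play types with one mask, no concat needed
den_plays = toy[(toy.posteam == 'DEN') & (toy.PlayType.isin(['Pass', 'Run']))]
print(den_plays.PlayType.tolist())  # ['Pass', 'Run']
```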
In [38]:
import plotly
print (plotly.__version__) # version 1.9.x required
plotly.offline.init_notebook_mode() # run at the start of every notebook
We can also build an interactive scatter of the same plays with Plotly, sizing each marker by yards to go.
In [50]:
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis
plotly.offline.iplot({
    'data': [
        Scatter(x=den_plays.PlayType,
                y=den_plays["Yards.Gained"],
                text=den_plays.PlayType.values.astype(str),
                marker=Marker(size=den_plays.ydstogo, sizemode='area', sizeref=1),
                mode='markers')
    ],
    'layout': Layout(xaxis=XAxis(title='Play Type'), yaxis=YAxis(title='Yards Gained'))
}, show_link=False)