To begin, we load the data from a CSV file into a pandas DataFrame.
In [45]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/data.csv') # read in the csv file
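As a small aside, the sketch below shows how `read_csv` infers column types when values are missing. The in-memory CSV and its values are made up for illustration and are not taken from the real data file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has no value for 'down'.
csv_text = "qtr,down,PlayType\n1,1,Pass\n1,,Kickoff\n"
df_toy = pd.read_csv(io.StringIO(csv_text))

# A numeric column with a missing value is read as float, because NaN
# cannot be stored in a plain integer column.
print(df_toy.dtypes)
print(df_toy["down"].isna().sum())
```

This is why integer-like columns can show up as floats or objects right after loading.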
Let's take a cursory glance at the data to see what we're working with.
In [46]:
df.head()
Out[46]:
Many of the columns are redundant or irrelevant to our analysis. For example, 'PassAttempt' is a binary attribute, but there's also an attribute called 'PlayType' which is already set to 'Pass' for a passing play.
We define a list of the columns we're not interested in, and then delete them.
In [47]:
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder',
                     'PosTeamScore', 'PassAttempt', 'RushAttempt',
                     'DefTeamScore', 'Season', 'PlayAttempted']
# Iterate through and delete the columns we don't want
for col in columns_to_delete:
    if col in df:
        del df[col]
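The same clean-up can be written in one call with `DataFrame.drop`. This is a minimal sketch on a made-up toy frame, not the real data; `errors="ignore"` plays the role of the `if col in df` check above.

```python
import pandas as pd

# Toy frame standing in for the play-by-play data (values are made up).
toy = pd.DataFrame({
    "Date": ["2015-09-10", "2015-09-10"],
    "PlayType": ["Pass", "Run"],
    "Yards.Gained": [7, 3],
})

# "Season" is deliberately absent from the toy frame; errors="ignore"
# skips names that are not present instead of raising a KeyError.
to_drop = ["Date", "Season"]
trimmed = toy.drop(columns=to_drop, errors="ignore")

print(list(trimmed.columns))
```

Unlike `del`, `drop` returns a new frame and leaves the original untouched unless the result is assigned back.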
We can then grab a list of the remaining column names.
In [48]:
df.columns
Out[48]:
As a temporary measure, we replace missing values with -1 so that the affected columns can be cast to integers (instead of being stored as objects).
In [49]:
df.info()
In [50]:
df = df.replace(to_replace=np.nan,value=-1)
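One thing to keep in mind is that -1 is a sentinel, not a real observation, so it should be masked out before computing statistics. The sketch below illustrates this on a made-up column; the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: 'down' is missing for the kickoff (values are made up).
toy = pd.DataFrame({
    "down": [1.0, np.nan, 3.0],
    "PlayType": ["Pass", "Kickoff", "Run"],
})

# Fill the gap with the sentinel, after which the integer cast succeeds.
filled = toy.fillna({"down": -1})
filled["down"] = filled["down"].astype(np.int64)

# Exclude the sentinel rows before averaging; otherwise -1 skews the result.
valid_mean = filled.loc[filled["down"] != -1, "down"].mean()
print(filled["down"].tolist(), valid_mean)
```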
At this point, many columns are encoded as objects, or with excessively large data types.
In [51]:
df.info()
We define four lists based on the types of features we're using. Binary features are separated from the other categorical features so that they can be stored in less space.
In [52]:
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
                       'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
                       'ScoreDiff', 'AbsScoreDiff']
ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
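# Everything that is not continuous, ordinal, or binary is treated as categorical
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features).difference(binary_features)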
We then cast all of the columns to the appropriate underlying data types.
In [53]:
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)
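To see why the smaller type matters, we can compare memory usage before and after a cast. This is a rough sketch on a made-up four-row frame; the savings on the real data will depend on its size.

```python
import numpy as np
import pandas as pd

# Toy frame with explicit 64-bit integers (values are made up).
toy = pd.DataFrame({"Touchdown": [0, 1, 0, 0], "qtr": [1, 2, 3, 4]}, dtype=np.int64)

before = toy.memory_usage(index=False).sum()

# A 0/1 flag fits in a single byte, so int8 is enough.
toy["Touchdown"] = toy["Touchdown"].astype(np.int8)

after = toy.memory_usage(index=False).sum()
print(before, after)
```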
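Next comes some additional reformatting: we encode 'PassOutcome' as a binary value, and we drop the rows that don't represent actual plays (quarter ends, two-minute warnings, and end-of-game markers).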
In [54]:
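# Assign the result back; calling replace(..., inplace=True) on a selected column may not update df
df['PassOutcome'] = df['PassOutcome'].replace(['Complete', 'Incomplete Pass'], [1, 0])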
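An alternative way to express this encoding is with an explicit dictionary and `Series.map`. The sketch below assumes, for illustration, that non-pass plays carry the -1 placeholder from the earlier missing-value replacement; the series itself is made up.

```python
import pandas as pd

# Toy outcomes; -1 stands in for plays with no pass outcome (an assumption).
outcomes = pd.Series(["Complete", "Incomplete Pass", -1, "Complete"])

# map() returns NaN for anything not in the dictionary, so we restore the
# -1 placeholder afterwards and store the result in a compact integer type.
encoding = {"Complete": 1, "Incomplete Pass": 0}
encoded = outcomes.map(encoding).fillna(-1).astype("int8")

print(encoded.tolist())
```

The dictionary keeps each label next to its code, which makes the mapping harder to get wrong than two parallel lists.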
In [55]:
df = df[df["PlayType"] != 'Quarter End']
df = df[df["PlayType"] != 'Two Minute Warning']
df = df[df["PlayType"] != 'End of Game']
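The three filters can also be combined into a single step with `Series.isin`. This is a minimal sketch on a made-up frame; the list name `non_plays` is introduced here for illustration.

```python
import pandas as pd

# Toy frame mixing real plays with bookkeeping rows (values are made up).
toy = pd.DataFrame({
    "PlayType": ["Pass", "Quarter End", "Run", "End of Game", "Two Minute Warning"],
    "Yards.Gained": [7, 0, 3, 0, 0],
})

# Rows whose PlayType is in this list are not actual plays.
non_plays = ["Quarter End", "Two Minute Warning", "End of Game"]

# ~ negates the boolean mask, keeping only rows NOT in the list.
plays_only = toy[~toy["PlayType"].isin(non_plays)]
print(plays_only)
```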
Now all of the objects are encoded the way we'd like them to be.
In [56]:
df.info()
Now we can start to take a look at what's in each of our columns.
In [57]:
df.describe()
Out[57]:
In [58]:
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
#Embed figures in the Jupyter Notebook
%matplotlib inline
#Use GGPlot style for matplotlib
plt.style.use('ggplot')
In [59]:
pass_plays = df[df['PlayType'] == "Pass"]
pass_plays_grouped = pass_plays.groupby(by=['Passer'])
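We look at the yards gained, grouped by whether or not the play resulted in a first down: the number of plays, the total yards, and the average yards per play.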
In [60]:
first_downs_grouped = df.groupby(by=['FirstDown'])
print(first_downs_grouped['Yards.Gained'].count())
print("-----------------------------")
print(first_downs_grouped['Yards.Gained'].sum())
print("-----------------------------")
print(first_downs_grouped['Yards.Gained'].sum()/first_downs_grouped['Yards.Gained'].count())
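The count, sum, and ratio above can also be produced as a single table with `agg`. This is a sketch on made-up numbers, not output from the real data; note that `mean` equals `sum / count` because both ignore missing values.

```python
import pandas as pd

# Toy frame: two plays without a first down, three with (values are made up).
toy = pd.DataFrame({
    "FirstDown": [0, 0, 1, 1, 1],
    "Yards.Gained": [2.0, 4.0, 10.0, 12.0, 14.0],
})

# One row per group, one column per statistic.
summary = toy.groupby("FirstDown")["Yards.Gained"].agg(["count", "sum", "mean"])
print(summary)
```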
Next, we compute the same statistics grouped by play type.
In [61]:
plays_grouped = df.groupby(by=['PlayType'])
print(plays_grouped['Yards.Gained'].count())
print("-----------------------------")
print(plays_grouped['Yards.Gained'].sum())
print("-----------------------------")
print(plays_grouped['Yards.Gained'].sum()/plays_grouped['Yards.Gained'].count())
Next, we compute the pairwise correlations between the numeric columns and plot the resulting matrix to see which attributes move together.
In [62]:
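size = 10
# Restrict to numeric columns; object columns cannot be correlated
corr = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(size, size))
ax.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns);
# Rotate the x-axis labels so that the column names don't overlap
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
plt.yticks(range(len(corr.columns)), corr.columns);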
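A matrix plot can be hard to read with many columns, so it may help to list the strongest pairs directly. The sketch below does this on a made-up frame; the column values are assumptions and the ranking says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with three numeric columns and one text column (values are made up).
toy = pd.DataFrame({
    "ydstogo": [10, 5, 1, 8, 3],
    "Yards.Gained": [12, 4, 2, 9, 3],
    "ScoreDiff": [0, 7, -3, 0, 14],
    "PlayType": ["Pass", "Run", "Run", "Pass", "Run"],
})

toy_corr = toy.corr(numeric_only=True)

# Keep only the strict upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).stack().dropna()

# Rank pairs by the absolute strength of the correlation.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```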
We can also visualize the correlations with seaborn. First we take a quick look at a couple of attributes of interest, and then we plot the correlation matrix as a heatmap and as a clustered heatmap.
In [63]:
import seaborn as sns
In [64]:
# df_dropped = df.dropna()
# df_dropped.info()
selected_types = df.select_dtypes(exclude=["object"])
useful_attributes = df[['FieldGoalDistance','ydstogo']]
print(useful_attributes)
In [65]:
sns.heatmap(corr)
Out[65]:
In [66]:
cluster_corr = sns.clustermap(corr)
plt.setp(cluster_corr.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
# plt.xticks(rotation=90)
Out[66]: