To begin, we load the data from a CSV file into a pandas DataFrame.
In [3]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/data.csv') # read in the csv file
Let's take a cursory glance at the data to see what we're working with.
In [4]:
df.head()
Out[4]:
Many of the attributes are redundant or irrelevant to our analysis. For example, 'PassAttempt' is a binary attribute, but 'PlayType' is already set to 'Pass' for a passing play, so 'PassAttempt' adds no information.
We define a list of the columns we're not interested in and then delete them.
In [5]:
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'TimeUnder',
                     'PosTeamScore', 'PassAttempt', 'RushAttempt',
                     'DefTeamScore', 'Season', 'PlayAttempted']
# Iterate through and delete the columns we don't want
for col in columns_to_delete:
    if col in df:
        del df[col]
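As an aside, the loop above can probably be written as a single call to `DataFrame.drop`. The sketch below uses a made-up toy frame (not the real play-by-play file) and assumes a pandas version that accepts the `columns=` keyword; `errors="ignore"` plays the role of the `if col in df` guard.

```python
import pandas as pd

# Toy frame standing in for the play-by-play data (values are made up).
toy = pd.DataFrame({
    "Date": ["2015-09-10", "2015-09-10"],
    "PlayType": ["Pass", "Run"],
    "Yards.Gained": [7, 3],
})

# 'Season' is absent from the toy frame; errors="ignore" skips it
# instead of raising, much like the `if col in df` check.
columns_to_delete = ["Date", "Season"]
trimmed = toy.drop(columns=columns_to_delete, errors="ignore")

print(list(trimmed.columns))
```

Unlike `del`, `drop` returns a new frame and leaves the original untouched, so the result has to be assigned back if the columns should really be gone.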
We can then list the remaining column names.
In [6]:
df.columns
Out[6]:
As a temporary measure, we replace missing values with -1 so that the affected columns can be cast to integers instead of being stored as objects.
In [7]:
df.info()
In [8]:
df = df.replace(to_replace=np.nan,value=-1)
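To see why the sentinel is needed, here is a minimal sketch on made-up values. The column name is borrowed from the dataset, but the numbers are hypothetical. NaN has no integer representation, so the cast only succeeds once the missing entry has been filled.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
toy = pd.DataFrame({"down": [1.0, np.nan, 3.0]})

# Substitute the sentinel first; casting the raw column to an
# integer type would fail because of the NaN.
filled = toy.fillna(-1)
filled["down"] = filled["down"].astype(np.int64)

print(filled["down"].tolist())
```

The trade-off is that -1 now looks like a real value, so any later statistic on such a column has to filter the sentinel out first.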
At this point, many attributes are still encoded as objects or with excessively large data types.
In [9]:
df.info()
We define four lists based on the types of features we're using. Binary features are separated from the other categorical features so that they can be stored in less space.
In [10]:
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
                       'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
                       'ScoreDiff', 'AbsScoreDiff']
ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features)
We then cast all of the columns to the appropriate underlying data types.
In [11]:
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)
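The space saving from the `int8` cast can be checked with `memory_usage`. This is only a sketch on a tiny made-up frame; the two column names are taken from `binary_features`, the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up binary flags stored as 64-bit integers.
flags = pd.DataFrame({"Touchdown": [0, 1, 0, 0],
                      "Fumble": [0, 0, 1, 0]}, dtype=np.int64)

compact = flags.astype(np.int8)

# memory_usage reports bytes per column; index=False leaves out the index.
before = flags.memory_usage(index=False).sum()
after = compact.memory_usage(index=False).sum()
print(before, after)
```

Each value shrinks from eight bytes to one, so the binary columns take one eighth of the space they did before.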
We do some further reformatting: pass outcomes are encoded as integers, and rows that do not represent actual plays are dropped.
In [12]:
# Assign the result back rather than calling replace(inplace=True) on a column selection
df['PassOutcome'] = df['PassOutcome'].replace(['Complete', 'Incomplete Pass'], [1, 0])
In [13]:
df = df[df["PlayType"] != 'Quarter End']
df = df[df["PlayType"] != 'Two Minute Warning']
df = df[df["PlayType"] != 'End of Game']
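The three filters above could also be combined into one mask with `Series.isin`. A minimal sketch on a made-up column; the list name `non_plays` is introduced here for illustration only.

```python
import pandas as pd

# Made-up play types, two of which are not real plays.
toy = pd.DataFrame({"PlayType": ["Pass", "Quarter End", "Run", "End of Game"]})

# Hypothetical helper list of the marker rows to drop.
non_plays = ["Quarter End", "Two Minute Warning", "End of Game"]

# ~ negates the mask, keeping only rows whose PlayType is not in the list.
plays_only = toy[~toy["PlayType"].isin(non_plays)]

print(plays_only["PlayType"].tolist())
```

This keeps the list of excluded play types in one place, which is easier to extend than a chain of separate comparisons.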
Now all of the objects are encoded the way we'd like them to be.
In [14]:
df.info()
Now we can start to take a look at what's in each of our columns.
In [15]:
df.describe()
Out[15]:
In [16]:
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
#Embed figures in the Jupyter Notebook
%matplotlib inline
#Use GGPlot style for matplotlib
plt.style.use('ggplot')
In [17]:
pass_plays = df[df['PlayType'] == "Pass"]
pass_plays_grouped = pass_plays.groupby(by=['Passer'])
Look at the count, total, and average of yards gained, grouped by whether the play resulted in a first down.
In [18]:
first_downs_grouped = df.groupby(by=['FirstDown'])
print(first_downs_grouped['Yards.Gained'].count())
print("-----------------------------")
print(first_downs_grouped['Yards.Gained'].sum())
print("-----------------------------")
print(first_downs_grouped['Yards.Gained'].sum()/first_downs_grouped['Yards.Gained'].count())
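The three print statements compute a count, a sum, and their ratio, which is the group mean. A sketch of the same summary in one call, using `agg` on made-up numbers:

```python
import pandas as pd

# Made-up plays: two without a first down, three with one.
toy = pd.DataFrame({
    "FirstDown": [0, 0, 1, 1, 1],
    "Yards.Gained": [2, 4, 10, 12, 14],
})

# One row per group, one column per statistic.
summary = toy.groupby("FirstDown")["Yards.Gained"].agg(["count", "sum", "mean"])

print(summary)
```

The `mean` column matches `sum / count` only when the column has no missing values. That holds here because NaN was replaced earlier, though any -1 sentinels left in a column would still be averaged in.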
Compute the same summary of yards gained, grouped by play type.
In [19]:
plays_grouped = df.groupby(by=['PlayType'])
print(plays_grouped['Yards.Gained'].count())
print("-----------------------------")
print(plays_grouped['Yards.Gained'].sum())
print("-----------------------------")
print(plays_grouped['Yards.Gained'].sum()/plays_grouped['Yards.Gained'].count())
Next, we plot the correlation matrix of the numeric attributes to see which of them move together.
In [20]:
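size = 10
# Correlate only the numeric columns; object columns cannot be correlated
corr = df.select_dtypes(include=[np.number]).corr()
fig, ax = plt.subplots(figsize=(size, size))
ax.matshow(corr)
plt.xticks(range(len(corr.columns)), corr.columns)
# Rotate the x-axis labels so the column names stay readable
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
plt.yticks(range(len(corr.columns)), corr.columns)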
Out[20]:
We also import seaborn, which we use for the heatmaps and violin plots that follow.
In [21]:
import seaborn as sns
%matplotlib inline
In [22]:
# df_dropped = df.dropna()
# df_dropped.info()
selected_types = df.select_dtypes(exclude=["object"])
useful_attributes = df[['FieldGoalDistance','ydstogo']]
print(useful_attributes)
In [23]:
sns.heatmap(corr)
Out[23]:
In [24]:
cluster_corr = sns.clustermap(corr)
plt.setp(cluster_corr.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
# plt.xticks(rotation=90)
Out[24]:
In [25]:
fg_analysis = df[['FieldGoalDistance','FieldGoalResult', 'PlayType']]
fg_analysis = fg_analysis[fg_analysis['FieldGoalResult'] != -1]
fg_grouped = fg_analysis.groupby(by=["FieldGoalResult"])
# Average field goal distance per result (equivalent to sum/count on the numeric column)
print(fg_grouped['FieldGoalDistance'].mean())
sns.violinplot(x="FieldGoalResult", y="FieldGoalDistance", data=fg_analysis, inner="quart")
Out[25]:
In [26]:
fg_analysis = fg_analysis[fg_analysis['FieldGoalResult'] != "Blocked"]
fg_analysis = fg_analysis[fg_analysis['PlayType'] == "Field Goal"]
sns.violinplot(x = "PlayType", y="FieldGoalDistance", hue="FieldGoalResult", data=fg_analysis, inner="quart", split = True)
Out[26]:
In [27]:
pass_analysis = df[df.PlayType == 'Pass']
pass_analysis = pass_analysis[['PassOutcome','PassLength','PassLocation']]
# print(pass_analysis)
pass_analysis = pass_analysis[pass_analysis.PassLength != -1]
pa_grouped = pass_analysis.groupby(by=['PassLength'])
print(pa_grouped.count())
# pass_analysis['SuccessfulPass'] = pd.cut(df.PassOutcome,[0,1,2],2,labels=['Complete','Incomplete'])
pass_analysis.info()
# Draw a nested violinplot and split the violins for easier comparison
# sns.violinplot(x="PassLocation", y="SuccessfulPass", hue="PassLength", data=pass_analysis, split=True,
# inner="quart")
# sns.despine(left=True)
pass_info = pd.crosstab([pass_analysis['PassLength'], pass_analysis['PassLocation']],
                        pass_analysis.PassOutcome.astype(bool))
print(pass_info)
pass_info.plot(kind='bar', stacked=True)
Out[27]:
In [28]:
df.RunGap.value_counts()
Out[28]:
In [29]:
# Normalize each row so the counts become rates that sum to 1
pass_rate = pass_info.div(pass_info.sum(axis=1).astype(float),
                          axis=0)
# print(pass_rate)
pass_rate.plot(kind='barh',
               stacked=True)
Out[29]:
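The row normalization done with `div` above appears to be the same thing `pd.crosstab` does with its `normalize="index"` argument. A sketch on made-up passes; the `Outcome` column and its string labels are hypothetical stand-ins for the boolean outcome used in the notebook.

```python
import pandas as pd

# Made-up passes: three short, two deep.
toy = pd.DataFrame({
    "PassLength": ["Short", "Short", "Short", "Deep", "Deep"],
    "Outcome": ["Complete", "Complete", "Incomplete", "Complete", "Incomplete"],
})

# Manual route: raw counts, then divide each row by its total.
counts = pd.crosstab(toy["PassLength"], toy["Outcome"])
manual = counts.div(counts.sum(axis=1).astype(float), axis=0)

# Built-in route: normalize="index" makes every row sum to 1.
builtin = pd.crosstab(toy["PassLength"], toy["Outcome"], normalize="index")

print(builtin)
```

Keeping the raw counts around, as the notebook does, is still useful, because a rate computed from a handful of passes is much less reliable than one computed from hundreds.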
In [30]:
# Run data
In [31]:
run_analysis = df[df.PlayType == 'Run']
run_analysis = run_analysis[['Yards.Gained','RunGap','RunLocation']]
runlocation_violinplot = sns.violinplot(x="RunLocation", y="Yards.Gained", data=run_analysis, inner="quart")
# Copy the filtered rows so the replacement below acts on an independent frame
run_analysis = run_analysis[run_analysis.RunLocation != -1].copy()
# A missing RunGap (-1) on a located run is treated as a run up the middle
run_analysis['RunGap'] = run_analysis['RunGap'].replace(-1, 'up the middle')
# run_analysis['RunLocation'].replace(-1, 'no location',inplace=True)
ra_grouped = run_analysis.groupby(by=['RunGap'])
print(ra_grouped.count())
run_analysis.info()
sns.set(style="whitegrid", palette="muted")
# Draw point, bar, and violin plots of yards gained by run location and gap.
# sns.catplot is the current name of sns.factorplot (renamed in seaborn 0.9);
# kind="point" keeps factorplot's default plot type.
sns.catplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis, kind="point")
sns.catplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis, kind="bar")
sns.catplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis, kind="violin")
Out[31]:
In [32]:
# Compare only the left and right options
run_lr = run_analysis[(run_analysis['RunLocation'] == 'right') | (run_analysis['RunLocation'] == 'left')]
# sns.catplot is the current name of sns.factorplot (renamed in seaborn 0.9)
sns.catplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_lr, kind="bar")
Out[32]:
In [33]:
rungap_violinplot = sns.violinplot(x="RunGap", y="Yards.Gained", data=run_analysis, inner="quart")