The NFL (National Football League) has 32 teams split into two conferences, the AFC and NFC. Each of the 32 teams plays 16 games during the regular season (non-playoff season) every year. Due to the considerable viewership of American football, as well as the pervasiveness of fantasy football, considerable data about the game is collected. During the 2015-2016 season, information about every play from each game that occurred was logged. All of that data was consolidated into a single data set which is analyzed throughout this report.
The data being used for analysis is a table of 63 attributes for 46,129 rows (plays). The data will be analyzed to identify two potential insights. The first goal, motivated by the prevalence of fantasy football, is to identify players who perform exceptionally well, and specifically to identify in what situations a player excels. The second goal, motivated by the need for coaching insights, is to produce situationally-aware metrics for the potential success of a play. For example: given a field location, score differential, team, and time, identify what type of play is most likely to be successful.
Two forms of player performance analysis are relevant for fantasy football and general player performance evaluation. The first is a novel analysis, wherein all players of a certain position are ranked based on their performance at that position. This analysis can provide insight into identifying which players are most valuable for a fantasy team. The second is player-to-player comparison. Fantasy players are often faced with a decision of which player to play on their fantasy team in any given week. They must choose between players based on their individual player performances, as well as their matchups for the week. Consider a situation where player A is individually superior to player B, but player B is facing a team whose defense is very weak, while player A is facing a team whose defense is strong. Which player is expected to outperform the other? This question can be answered by analyzing the performance of each individual player against their respective opponents.
Offensive play-calling is a very difficult task, and is often a cause of error for teams and coaches. Providing a data-informed situational understanding of the probable outcomes of certain types of plays could help inform coaches' play-calling. Analyzing the statistical outcomes of play-calls can be done on a league-wide, per-team, or per-matchup basis. As the analysis becomes more specific (narrowing down to a specific team, or a specific matchup of two teams), the relevance of the analysis increases, but so does the margin of error.
The vast amount of money, pride, and time involved in NFL football is profound. It is for that reason that the play-by-play data was gathered in the first place. The intent of analyzing the data is to identify trends or statistics which can meaningfully influence the decisions made by coaches and It is important to define a metric by which the results of any analyses will be measured. Since two main forms of analysis will occur, two performance metrics must be defined.
Any meaningful player performance analysis must include a novel look at season-long player performance. For a running back, for example, total carries, yards, and touchdowns must be calculated. However, this novel analysis is simply a baseline. In order for a player performance analysis to be considered effective or meaningful, specific trends must be identified for that player which do not appear during routine stat summaries. For example, for a running back, a meaningful and effective analysis may conclude that the player in question performs significantly better when playing against teams whose defenses are very strong against passing plays, or that he performs significantly better when playing away, as opposed to at home.
In order to effectively inform offensive play-calling, play-call analysis must discover trends which identify, for a given game scenario, play calls which have statistically significantly higher probable yardage outcomes than other play calls. For example, given a scenario where an offense is down by 14 points in the 3rd quarter, on their own 35 yard line, an effective play-call analysis would be one that identified that a run play would produce statistically significantly more yardage than a passing play.
Play-calling optimization could also be effective in a generalized scenario. For example, an effective analysis may reveal that offenses have the most success with running up the middle of the offensive line when near the goal line, but have more success with runs to the outside when nearer to the middle of the field.
In [295]:
#For final version of report, remove warnings for aesthetics.
import warnings
warnings.filterwarnings('ignore')
#Libraries used for data analysis
import pandas as pd
import numpy as np
from sklearn import preprocessing
df = pd.read_csv('data/data.csv') # read in the csv file
#List of attributes which aren't going to be used for analysis
columns_to_delete = ['Unnamed: 0', 'Date', 'time', 'PassAttempt', 'RushAttempt',
'DefTeamScore', 'Season', 'PlayAttempted']
#Iterate through and delete the columns we don't want
for col in columns_to_delete:
if col in df:
del df[col]
Missing data needs to be identified and either removed or imputed.
For many columns, there is intentionally missing data (for example, the "Interceptor" column is N/A when no interception was thrown).
In order to help identify missing data, the attributes will be labeled as continuous, ordinal, binary, or categorical, and each scale of data will be imputed on its own.
In [296]:
#Defining list of column names of each of the scales of variables being used.
#Interval and Ratio features are grouped together, and binary features are separated from other ordinal features
continuous_features = ['TimeSecs', 'PlayTimeDiff', 'yrdln', 'yrdline100',
'ydstogo', 'ydsnet', 'Yards.Gained', 'Penalty.Yards',
'ScoreDiff', 'AbsScoreDiff']
ordinal_features = ['Drive', 'qtr', 'down']
binary_features = ['GoalToGo', 'FirstDown','sp', 'Touchdown', 'Safety', 'Fumble']
categorical_features = df.columns.difference(continuous_features).difference(ordinal_features)
First, the categorical features will be examined for missing data. Only three categorical columns have missing data: FieldGoalDistance, FirstDown, and GoalToGo. For each of these attributes, missing data represents plays where the attribute does not apply. FieldGoalDistance is set to NaN when no field goal occurs, and GoalToGo and FirstDown are set to NaN when the play is not an n-th down play (i.e. it's a kickoff or an extra point). Because these attributes are not meaningful for plays where they are set to NaN, it is acceptable to leave the missing data in those columns as NaN, as rows with NaN will be excluded from analysis related to those attributes. (For instance, to analyze field goal plays, it's completely reasonable to exclude all plays where no field goal attempt was made).
Finally, the continuous features will be examined for missing data. The continuous variables will be checked for missing data one-by-one, and imputation or deletion will occur on an attribute-by-attribute basis.
First, the dataset will be examined to see which columns include NaN values.
In [297]:
df[continuous_features].describe()
Out[297]:
There are many plays in the dataset which are placeholder rows which represent the end of a quarter, half, or game. Many of these plays have NA values for a number of continuous attributes. All of these plays will be removed from the dataset, as they don't represent actual plays that occurred on the field.
In [298]:
df = df[["end" != x[0:3].lower() for x in df.desc]]
df[continuous_features].describe()
Out[298]:
Removing the "End of something" plays eliminates all NaN values for TimeSecs and PlayTimeDiff
The remaining continuous attributes with NaN columns will be evaluated one-by-one and imputation or deletion will occur on an attribute-by-attribute basis.
The remaining missing attributes are all caused by TimeOut plays or Two-Minute-Warning plays. Because these rows are not actual plays, they can be deleted from the dataset.
In [299]:
# Remove rows representing Timeouts and Two-Minute-Warnings
df = df[[play not in ["Timeout", "Two Minute Warning"] for play in df.PlayType]]
Removing the timeouts and two-minute-warnings removes all remaining NaN values from the continuous feature set.
In [300]:
df[continuous_features].describe()
Out[300]:
Now that NaN values have been removed, columns can be coerced to the correct encoding for their data type. However, because NaNs exist in the data, some columns can not be coerced into numeric representations. In order to allow for coercion to numeric representations, NaN values will be replaced with -1. The value -1 will only exist in categorical and ordinal features, so it will not skew the data in the continuous columns.
In [301]:
#Replace NaNs in categorical and ordinal columns with -1
df = df.replace(to_replace=np.nan,value=-1)
#Coercing the data columns to the correct types
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)
df[binary_features] = df[binary_features].astype(np.int8)
After deleting missing data (rows for end of play and timeouts/two-minute warnings), 42867 of the 46128 original play rows remain in the data set. (About 93% of the data). The 7% of rows that were eliminated were not relevant to the analyses in this report, so the loss of those rows is acceptable.
In addition, the removal and re-encoding of data reduced the dataset's size in memory from 22.5MB to 16.3MB, a decrease in memory expense of over 25%.
In [302]:
df.info()
For the following visualizations, the libraries matplotlib, seaborn, and plotly will be utilized.
In [303]:
#Setup seaborn
import seaborn as sns
sns.set_palette('muted')
In [304]:
#Setup plotly
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook
In [305]:
#Setup matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
#Embed figures in the Jupyter Notebook
%matplotlib inline
#Use GGPlot style for matplotlib
plt.style.use('ggplot')
The first step in visualizing play data is examining some of the individual attributes of plays in isolation. A number of single-attribute visualizations will be used to gather some cursory information about the data, including identifying some of the top players in the NFL.
The first important statistic to visualize it the distribution of different play types that can occur. The most important play types to identify are Pass and Run. These are the two most frequent types of plays, and the relative frequencies of the two is important when deciding how to organize a defense.
In [306]:
#Group data by playtype and plot the counts of the groups
df.groupby("PlayType").PlayType.count().sort_values().plot(kind='barh')
Out[306]:
A quick look at the play type distribution shows that pass plays are generally more common than run plays. Many of the less-frequent play types are not critical in designing a defense. For example, the frequency with which a QB kneel occurs is not particularly important, as a QB kneel is not a play that warrants any real defensive strategy.
While the relative difference in frequencies of run and pass plays are easily identified, it's difficult to visualize the ratio of onside kicks to standard kickoffs. However, the percentage of kickoffs which are onside kicks can be easily computed.
In [307]:
#Computing percentage of kickoffs which are on-side kicks
normal_kicks = len(df[df.PlayType == "Kickoff"])
onside_kicks = len(df[df.PlayType == "Onside Kick"])
print(str(onside_kicks / (normal_kicks + onside_kicks))[0:6] + "% of kicks are onside-kick attempts")
With such a small percentage of onside kicks, a deeper look into when an onside kick will occur is necessary. Onside kicks will be examined in more detail in the upcoming sections.
In [308]:
print("Mean Yards.Gained per play: " + str(df['Yards.Gained'].mean()))
print("Mean ydstogo per play: " + str(df['ydstogo'].mean()))
print("Median Yards.Gained per play: " + str(df['Yards.Gained'].median()))
print("Median ydstogo per play: " + str(df['ydstogo'].median()))
#Boxplot of Yards.Gained and ydstogo
df[["Yards.Gained", "ydstogo"]].plot(kind='box', ylim=[-20, 30])
Out[308]:
This boxplot shows that Yards.Gained is a slightly right-skewed attribute. The mean is more than 3 yards higher than the median. ydstogo, on the other hand, is left skewed. The median is 10 yards, while the mean is less than 8 yards. This is likely the result of the first-down mechanic of football.
Since Yards.Gained is essentially the most important feature of a play with respect to the offense's goal of earning a touchdown, a more in-depth visualization of the attribute is warranted. A better look into the distribution of Yards.Gained can be achieved with a histogram:
In [309]:
print(df["Yards.Gained"].mode())
#Plot the distribution of yards gained
sns.distplot(df["Yards.Gained"])
Out[309]:
The histogram of yardage gain shows that there is a strong peak at around zero yards. Additionally, the mode of yards gained is 0. However, the data is quite right-skewed. While mostly unimodal, the distribution has a significantly larger number of instances of plays with new positive yardage than net negative. This is expected.
Identifying top players is an important problem for coaches and fantasy team managers alike. There are a number of metrics by which top players can be identified. For rushers, these metrics are yards per carry, total yards, and total rushing touchdowns.
The top running backs and running players can be identified by finding the average yardage per carry of each running back over the course of the season.
In [310]:
#Select only rushing plays
rush_plays = df[(df.Rusher != -1)]
#Select groups of running plays by Rusher, but only for Rushers with more than 10 carries
rush_plays_grouped = rush_plays.groupby(by=['Rusher']).filter(lambda g: len(g) > 10).groupby(by=["Rusher"])
In [311]:
#Calculate the yards_per_carry for each rusher and sort them
yards_per_carry = rush_plays_grouped["Yards.Gained"].sum() / rush_plays_grouped["Yards.Gained"].count()
yards_per_carry.sort_values(inplace=True, ascending=False)
#Coerce the list back into a data frame for plotting
yards_per_carry_df = pd.DataFrame({'yards_per_carry': yards_per_carry, 'rusher' : yards_per_carry.index})
#Plot the distribution of yards-per-carry
ax = sns.barplot(x="rusher", y="yards_per_carry", data=yards_per_carry_df)
ax.set(xlabel='Rusher', ylabel='Yards Per Carry')
ax.set_xticks([])
Out[311]:
The above barplot shows the mean yards per carry for every rusher with more than 10 carries in the 2015 NFL season. There is clearly a top-tier of rushers, who have the most yards-per-carry. Those are players who are likely valuable assets, so they should be identified.
In [312]:
#Plot the yards-per-carry for only the top 10 rushers by yards-per-carry
ax = sns.barplot(x="yards_per_carry", y="rusher", data=yards_per_carry_df[:10])
ax.set(xlabel='Yards Per Carry', ylabel='Rusher')
Out[312]:
Interestingly, it looks like the top rushers are mostly QBs. They are in many ways outliers. The fact that they earn many yards per carry is largely because they run very infrequently, and when they do, defensive ends are too far down the field to tackle them.
To isolate RBs, total running yardage can be assessed, and yards per carry can be assessed only for players who rarely throw the ball, to exclude QBs.
In [313]:
#Calculate total running yards for each player, sort them, and coerce them into a data frame for plotting
total_running_yards = rush_plays_grouped["Yards.Gained"].sum()
total_running_yards.sort_values(inplace=True, ascending=False)
total_running_yards_df = pd.DataFrame({'total_yards': total_running_yards, 'rusher' : total_running_yards.index})
Once the total running yards of each rusher have been calculated, they can be visualized. First, a histogram will show the distribution of rushers based on total yards gained over the course of the season.
In [314]:
#Plot the distribution of total rushing yards per rusher in the NFL
ax = sns.distplot(total_running_yards_df['total_yards'], bins = 30)
ax.set(xlabel='Total Season Yardage', ylabel='Percentage of Rushers')
ax.set_yticks([])
Out[314]:
This histogram shows that the majority of rushers only earn 100-200 yards in a season. This makes it even more important to isolate the elite rushers, because very few earn more than 1000 yards in a season.
To isolate top rushers, all rushers will be plotted based on their total yardage for the season.
In [315]:
# Plot each individual rusher and their total rushing yards, sorted by total yards
ax = sns.barplot(x="rusher", y="total_yards", data=total_running_yards_df)
ax.set(xlabel='Rusher', ylabel='Total Rushing Yards')
ax.set_xticks([])
Out[315]:
The above figure shows that the range of total yards earned by a rusher is much larger than the range of yards per carry. This is due to the fact that some rushers have many more carries than others.
A subset of these rushers can be isolated to identify the rushers with the most yards per season. These will be considered top-tier RBs.
In [316]:
#Make barplot of rushers by total season rushing yards
ax = sns.barplot(x="total_yards", y="rusher", data=total_running_yards_df[:20])
ax.set(xlabel='Total Rushing Yards', ylabel='Rusher')
Out[316]:
This reveals the top 20 RBs based on total yardage. This is an important insight when it comes to fantasy football, where the best running backs are those which earn the most total yards in a season. The same analysis can be done to find the RBs with the most running touchdowns.
In [317]:
#Isolate touchdown rushing plays
touchdown_rush_plays = rush_plays[rush_plays.Touchdown == 1]
#Select groups of running plays by Rusher, but only for Rushers with more than 2 TDs
rush_plays_grouped = touchdown_rush_plays.groupby(by=['Rusher']).filter(lambda g: len(g) > 2).groupby(by=["Rusher"])
#Count TDs for each rusher
total_tds = rush_plays_grouped.size()
#Convert to DF for Seaborn
total_tds.sort_values(inplace=True, ascending=False)
total_tds_df = pd.DataFrame({'touchdowns': total_tds, 'rusher' : total_tds.index})
#Plot touchdowns by RB
ax = sns.barplot(x="touchdowns", y="rusher", data=total_tds_df[:20])
ax.set(xlabel='Total Touchdowns', ylabel='Rusher')
Out[317]:
The above chart shows the number of rushing touchdowns earned by each rusher over the course of the season. This is an additional metric by which rushers can be ranked relative to one another.
It may be of interest to identify which teams most frequently commit penalties, and which commit penalties the least. This can be achieved with a simple bar plot.
In [318]:
#Take only plays where penalties occurred
penalties = df[df['Penalty.Yards'] != 0]
#Plot the number of penalties per team
penalties['PenalizedTeam'].value_counts().sort_values().plot(kind='barh', stacked=True, fontsize=10)
Out[318]:
The above bar plot shows how many penalties each team committed over the course of the season. This information can help teams identify if they need to make an active effort to reduce their penalization, or perhaps even to become more aggressive.
The distribution of penalty yardage is also of interest. A violin plot will be used to visualize the distribution of yards associated with penalties.
In [319]:
#Violin plot of penalty yardage
sns.violinplot(penalties["Penalty.Yards"])
Out[319]:
The above violin plot indicates that penalty distance is tri-modal. This is likely caused by the fact that many penalties in football have a pre-prescribed yardage associated with them. However, this plot also identifies which of those sets of penalties is most common. For example, it shows that 5-yard penalties are nearly twice as common as 10-yard penalties.
There are a number of additional interesting things to visualize when it comes to single-attributes. However, in the interest of brevity, this report will now move on to multi-variate visualizations, the meat of the analysis.
To take a first look at comparing some of the attributes, a scatter matrix will be used to see if there are any correlations or patterns among a handful of the continuous attributes in the dataset.
Yards.Gained, ydstogo, TimeSecs, and ScoreDiff will be compared to eachother in a scatter matrix. This set of attributes is selected because it stands to reason that there may be correlations or patterns among these data columns. For example, it may be that when ScoreDiff is a large negative number (when the offensive team is behind), Yards.Gained is higher, because the offense is running low-percentage high-gain passing players.
The scatter matrix is color-coated by playtype (run or pass) to see if there is any pattern between those two play types with respect to the attributes being plotted. A sample of 250 plays will be plotted, to make the plots simpler to read.
In [320]:
#Isolate variables for
df_plot = df[["Yards.Gained", "ydstogo", "TimeSecs", "PlayType", "ScoreDiff"]]
df_plot = df_plot[[x in ["Run", "Pass"] for x in df_plot.PlayType]].sample(250)
#Plot a scatter matrix of the 4 continuous attributes, color-coded by PlayType
sns.set_palette("muted")
sns.pairplot(df_plot, size=2, hue="PlayType")
Out[320]:
From this simple scatter matrix, no strong correlations are evident. Some basic patterns emerge, such as the fact that high-yardage plays seem to be overwhelmingly pass plays, and plays where ydstogo is very high typically are pass plays, while plays where ydstogo is very low are typically run plays.
The histograms on the diagonal also contain some interesting information. For example, ScoreDiff follows a zero-centered normal distribution. This is expected, due to the nature of the ScoreDiff attribute. As expected,TimeSecs follows a reasonably uniform distribution. The Yards.Gained attribute, however, does not follow a normal distribution. It follows a uni-modal left-skewed distribution. The vast majority of plays result in a -5 to 5 yard gain. The Yards.Gained histogram also confirms the fact that run plays generally generate fewer yards than passing plays.
The fact that no strong correlations seem to emerge among the 4 plotted attributes indicates that no one of these attributes is a good predictor of the others. The benefit to this is that, because the attributes are so weakly correlated, they truly represent multiple dimensions in the data, and are not simply data redundancy.
Before diving into the details of play type distributions and per-team play performance, a novel look has to be taken into the distribution of play types over the 4 different downs that can occur. This can be done with a simple stacked bar plot.
In [321]:
#Isolate releant attributes
df_playtype = df[['qtr', 'down', 'TimeSecs', 'PlayType']]
df_playtype = df[df.down != -1]
#Group and then re-flatten
dfp_sample = df_playtype[['down', 'PlayType']].sample(20000)
dfp_sample_reset = dfp_sample.groupby(['down', 'PlayType']).size().reset_index()
#Data transformation to get it to a plot-able form
playtypes = ['Field Goal', 'Pass', 'Run', 'Sack', 'Spike', 'Punt']
dfp_list = [dfp_sample_reset[dfp_sample_reset.PlayType == x] for x in playtypes]
for i, x in enumerate(dfp_list):
dfp_list[i] = dfp_list[i][['down',0]]
dfp_list[i].columns = ['down', playtypes[i]]
#Concatenate the list into one data frame and then plot it
pd.concat([x.set_index('down') for x in dfp_list], axis=1).plot(kind='bar', stacked=True)
Out[321]:
This stacked barplot shows the number of instances of each type of play being called for every down that can occur. Interestingly, it seems that run plays become less and less common as down increases. For 1st down plays, run plays are the most likely, but for 3rd down plays, pass plays represent a large majority of plays. On 4th down, on the other hand, punts and field goals dominate (as expected), but a close look shoes that pass plays outnumber run plays on 4th down, which is also a valuable insight.
It's likely that some teams are more likely to run than others, and that other teams are more likely to throw. This could be the result of having a star QB or a star RB. To identify teams which focus more on passing or running, all teams will be plotted on a stacked barplot where the y-axis is the percentage of plays that are runs or passes, sorted by run percentage.
In [322]:
#Isolate running and passing plays, and group by team and play types
team_analysis = df[['posteam', 'PlayType', 'Yards.Gained']]
team_analysis = team_analysis[team_analysis.posteam != -1]
team_analysis = team_analysis[(team_analysis.PlayType == 'Run') | (team_analysis.PlayType == 'Pass')]
team_grouped = team_analysis.groupby(['posteam', 'PlayType'], sort=True)
#get the count of each playtype per team and coerce back into un-grouped dataframe
teams_count = team_grouped.count()
teams_count.columns = ['count']
teams_count = teams_count.reset_index()
#gets the total number of plays for each team (Run + Pass)
def teamSum(x):
return teams_count[teams_count.posteam == x.posteam]['count'].sum()
#add to teams_count as column 'totals'
totals = teams_count.apply(teamSum, axis=1)
teams_count['totals'] = totals
#finds percentage of PlayTypes for each team
def teamPercent(x):
return x['count'] / x['totals']
#adds percentages to teams_count
teams_count['percentages'] = teams_count.apply(teamPercent, axis=1)
teams_count_perc = teams_count[['posteam', 'PlayType', 'percentages']]
#this is to format the data to make plotting easier
trp = teams_count_perc[teams_count_perc.PlayType == 'Run']
tpp = teams_count_perc[teams_count_perc.PlayType == 'Pass']
trp = trp.set_index('posteam')
tpp = tpp.set_index('posteam')
pass_run = pd.concat([trp, tpp], axis=1)
#plot playtype percentage per team
pass_run.columns = ['PlayType1', 'RunPercentage', 'PlayType2', 'PassPercentage']
pass_run = pass_run.sort_values(by = 'RunPercentage')
pass_run[['RunPercentage', 'PassPercentage']].plot(kind = 'bar', stacked=True)
print("Minimum Run Percentage: " + str(pass_run['RunPercentage'].min()))
print("Maximum Run Percentage: " + str(pass_run['RunPercentage'].max()))
While there is some variation in the distribution of play type among teams, there is not a very wide range of percentages. All teams fall between 35% and 51% run plays. However, this could be valuable in predicting the PlayType for a certain play.
Since all teams seem to follow a similar distribution of run plays vs pass plays, it's worth seeing if some teams are more successful than others in terms of run and pass plays. This can be assessed with a barplot made of average running yards and average passing yards per play for every team in the league.
In [323]:
#Take group sum and coerce back to data frame
team_yards = team_grouped.sum().reset_index()
team_yards['totals'] = totals
#Get average yardage for each team
def teamAvgYards(x):
temp = x['Yards.Gained'] / x['totals']
return temp
#Compute average yardage
team_yards['AvgYards'] = team_yards.apply(teamAvgYards, axis=1)
team_yards = team_yards[['posteam', 'PlayType', 'AvgYards']]
#Split into two data frames for runs and passes
tyr = team_yards[team_yards.PlayType == 'Run']
typ = team_yards[team_yards.PlayType == 'Pass']
#Sort each list in place
tyr =tyr.set_index('posteam').sort_values(by = 'AvgYards')
typ =typ.set_index('posteam').sort_values(by = 'AvgYards')
#Make the bar plots
tyr.plot(kind='bar', title = "Average Rushing Yards per Team")
typ.plot(kind='bar', title = "Average Passing Yards per Team")
Out[323]:
This pair of bar plots shows that there is significant variation in the average yardage of passing and running plays for each team in the NFL. Interestingly, the amount of variation in average yardage per team is very similar for pass plays and run plays. Average running yardage varies from 1.3 to 2.6, while average passing yardage varies from 3.3 to 5.1. There is slightly more variation in running plays by percentage, but it's not dramatically more than it is for passing.
Interestingly, the top teams by average rushing yards appear to be, generally speaking, some of the worst teams in terms of average passing yards, and vise-versa.
To further analyze the observed trend that the best teams in terms of passing yards per pass play are the worst teams in terms of rushing yards per rush play, a simple scatter plot with a line of best fit is used.
In [324]:
#Concatenate the passing and running plays for plotting
team_avg_yards = pd.concat([tyr, typ], axis=1)
team_avg_yards.columns = ['a', 'AvgRunYards', 'b', 'AvgPassYards']
#Scatter plot with regression line and confidence bounds.
sns.lmplot(x = "AvgRunYards", y = "AvgPassYards", data = team_avg_yards)
Out[324]:
There appears to be a somewhat strong negative trend between average rushing yards per carry and average passing yards per throw for NFL teams. This trend is not strong enough to predict average rushing yards meaningfully based on average passing yards, or vise-versa, but it does provide some insight into coaching strategies, offensive play-calling, and fantasy football. It identifies a general trend that high-octane passing teams are less likely to produce top-performing RBs, and powerful rushing teams aren't likely to yield large gains in the air.
In [325]:
fg_analysis = df[['FieldGoalDistance','FieldGoalResult', 'PlayType']]
#Remove plays that aren't kicks
fg_analysis = fg_analysis[fg_analysis['FieldGoalDistance'] > -1.0]
#Group by result
fg_grouped = fg_analysis.groupby(by=["FieldGoalResult"])
print(fg_grouped.sum()/fg_grouped.count())
#Plot the distributions of yardage for each field goal result
sns.violinplot(x="FieldGoalResult", y="FieldGoalDistance", data=fg_analysis, inner="quart")
Out[325]:
The violin plot of field goal results indicates a number of things. The first meaningful insight is that the distribution of successful field goal attempts is nearly uniform between 20 and 50 yards. Missed field goals, on the other hand, follow a very unimodal distribution, with the majority of missed attempts occurring at around 50 yards. Blocked field goals also follow a unimodal distribution. However, the distribution of blocked field goals is much less normal than that of missed field goals.
In [326]:
#Remove blocked kicks
fg_analysis = fg_analysis[fg_analysis['FieldGoalResult'] != "Blocked"]
fg_analysis = fg_analysis[fg_analysis['PlayType'] == "Field Goal"]
#Plot scored and missed field goals, side-by-side
sns.violinplot(x = "PlayType", y="FieldGoalDistance", hue="FieldGoalResult", data=fg_analysis, inner="quart", split = True)
Out[326]:
The above graph, which is essentially a simplification of the previous graph, shows side-by-side the distributions of his and missed field goals. It helps identify the relative chances of success and failure for different kick distances.
To identify trends in pass performance, the pass plays can be split up into the pass location. The dataset separates passes into 'right', 'left', and 'middle.' A violin plot of the three locations will be used to visualize the distributions of the pass plays that occurred in those locations.
In [327]:
#Isolate pass plays
df_plot = df[df.PlayType == "Pass"]
#Remove incomplete passes
df_plot = df_plot[df_plot.PassLocation != -1]
#Group by pass locations
groups_by_pass_location = df_plot[['Yards.Gained', 'PassLocation']].groupby("PassLocation")
distances_by_pass_location = groups_by_pass_location.sum() / groups_by_pass_location.count()
print(distances_by_pass_location)
#Violin plot
sns.violinplot(x="PassLocation", y = "Yards.Gained", data = df_plot)
Out[327]:
The above violin plot indicates that left and right passes, as expected, have very similar yardage distributions. Middle passes, on the other hand, are more bimodal, and are more likely to go for more than 20 yards than passes to the left or right. However, generally speaking, there doesn't appear to be a significant difference in pass performance based on pass location.
In [328]:
#Isolate pass plays
df_plot = df[df.PlayType == "Run"]
#Ignore plays with missing run location data
df_plot = df_plot[df_plot.RunLocation != -1]
#Group by run locations
groups_by_run_location = df_plot[['Yards.Gained', 'RunLocation']].groupby("RunLocation")
distances_by_run_location = groups_by_run_location.sum() / groups_by_run_location.count()
print(distances_by_run_location)
#Violin plot
sns.violinplot(x="RunLocation", y = "Yards.Gained", data = df_plot)
Out[328]:
This distributions of right and left run plays are, as expected, quite similar. The middle run plays, however, follow a slightly narrower distribution. They appear to frequently result in very few yards, but also seem to rarely result in negative yardage. Additionally, the upper bound on yardage gain from outside runs appears to be larger than it is for middle runs. However, once again, there is no obvious difference in run performance based on run location.
Two runs in the same location may differ based on the blocking strategy. Blocking strategy refers to the type of offensive player who provides a block for the runner. These players can be tackles, guards, or ends. To compare the blocking abilities of the three types of players, a factor plot is used.
In [329]:
#Isolate run plays and their features
run_analysis = df[df.PlayType == 'Run']
run_analysis = run_analysis[['Yards.Gained','RunGap','RunLocation']]
run_analysis = run_analysis[run_analysis.RunGap != -1]
#Reformat data for prettification
run_analysis = run_analysis[run_analysis.RunLocation != -1]
#Set seaborn style
sns.set(style="whitegrid", palette="muted")
# Draw a categorical scatterplot to show each observation
sns.factorplot(x="RunLocation", y="Yards.Gained", hue="RunGap", data=run_analysis)
Out[329]:
The above factor plot shows a number of things. First, run plays where an end player was the blocker are generally the most successful. They outperform the other blockers, even when error bounds are considered. Second, runs to the left seem to have a slightly higher average yardage than runs to the right, and outside runs generally have higher yardage outcomes than middle runs. The final interesting insight is that middle runs where tackles are the blocking players perform very poorly. This is likely because tackles are on the outside edge of the offensive line, so they are not in an ideal position blocking. Tackles, however, seem to perform best on the right-side runs. This is likely because the job of the left tackle is almost always to protect the QB, not to block for a runner.
It may stand to reason that scoring happens more-or-less frequently in different quarters of the game. Moreover, it's possible that touchdowns happen more-or-less frequently than other scoring plays in different quarters. A stacked barplot is used to visualize any such trend.
In [330]:
#Isolate scoring plays and relevant features
quarter_data = df[df['sp'] == 1]
quarter_data = quarter_data[['qtr','Touchdown','sp']]
#Group by quarter
qd_grouped = quarter_data.groupby(by=['qtr'])
#Crosstab data for plotting
qd_info = pd.crosstab([quarter_data['qtr'] ],
quarter_data.Touchdown.astype(bool))
#Barplot of stacked scoring plays by quarter
qd_info.plot(kind='bar', stacked=True)
Out[330]:
The above bar plot shows that, in general, scoring is more frequent in the 3rd and 4th quarters. This is likely due to the fact that a kickoff always occurs at the beginning of the 1st and 3rd quarters, where the 2nd and 4th quarters can start mid-drive. With respect to the frequency of touchdowns relative to other types of scoring plays, it seems there is a reasonably even distribution among the quarters.
In general, it's reasonable to consider a possible trend between game time elapsed and score difference. To visualize the possibility of such a trend, a joint plot will be used. Because both attributes are continuous ratio attributes, a scatter plot and histogram combination is an ideal plot.
In [331]:
#Isolate scoring data and relevant attributes
time_score_data = df[df['sp'] == 1]
time_score_data = time_score_data[['sp','ScoreDiff','TimeSecs']]
#Joint plot of the two attributes
g = sns.jointplot("TimeSecs", "ScoreDiff", data=time_score_data, kind="reg",
xlim=(3600, -900), ylim=(-40, 50), color="r", size=12)
The joint plot of ScoreDiff and TimeSecs shows absolutely no meaningful trend between the two attributes. ScoreDiff appears to follow a very normal distribution, while TimeSecs (as expected) follows a continuous distribution. These two attributes appear to be almost completely uncorrelated.
In the absence of a trend between game time and score difference, perhaps one exists when isolating teams that are winning at any given point of play. It stands to reason that, perhaps, as more time is elapsed, the team that is winning will earn a larger and larger lead. Once again, a joint plot will be used to visualize such a trend.
In [332]:
#isolate winning teams
time_score_data = time_score_data[time_score_data['ScoreDiff']>0]
#Joint plot of the two attributes
sns.jointplot("TimeSecs", "ScoreDiff", data=time_score_data, kind="reg",
xlim=(3600, -900), ylim=(-40, 50), color="r", size=12)
Out[332]:
Once again, there is very little correlation between the two attributes. There appears to be a slightly stronger trend, as indicated by the larger correlation coefficient, but the attributes are still somewhat distinct. The new distribution of ScoreDiff looks somewhat like a right-skewed beta distribution, but in reality it is just the left half of the normal distribution from the previous plot.
In different quarters, it's possible that the distribution of play frequency throughout the duration of the quarter is different. For example, one might expect that in the 4th quarter, the last few minutes have lots of plays, as a team may be trying to catch up from a late deficit. To visualize trends between quarter and plays-per-time over the quarter, a simple split-histogram is be used. The vertical axis represents the number of plays, in total, that occurred during a given minute of a quarter. The horizontal axis shows the quarter, and minute within that quarter.
In [333]:
quarter_data = df[df['sp'] == 1]
quarter_data = quarter_data[['qtr','TimeUnder']]
sns.countplot(x="qtr",hue="TimeUnder",data=quarter_data, hue_order = range(15,-1,-1))
Out[333]:
What the above set of histograms depicts is that, in general, in the 1st and 3rd quarters, very few plays occur in the first few minutes. In the 2nd and 4th quarters, lots of plays occur in the final few minutes of the quarter. Interestingly, this is especially true for the 2nd quarter, where a significant plurality of plays occur with less than a minute left of play.
Plays from the NFL dataset will be classified based on play type. In future reports, attempts will be made to build a model which can, given attributes about a play, identify if the play is a running play, a passing play, a punt, and so on.
Before a classifier can be built, the PlayType attribute will be compared to existing attributes to see what attributes are strongly correlated with PlayType. Those attributes will be important components in a regression-based classifier.
As a first attempt to identify factors strongly correlate with play types, a correlation plot will be used to show which attributes are highly correlated or inversely-correlated. Note that a number of the attributes are not continuous, which means that correlations are not always meaningful. However, all of the attributes are included to demonstrate the general lack of correlation among items in the dataset.
In [334]:
#Make a heatmap of the correlations among the data attributes
sns.heatmap(df.corr())
Out[334]:
The first important insight from the above correlation plot is that, in general, there is little correlation among the various attributes in the dataset. This is represented by the paleness of the plot i general. White squares represent no correlation, and most of the pairs of attributes have correlations which aren't far from zero.
There are a number of attributes which appear to have strong correlations with one another. However, many of these are simply results of the nature of the sport and the nature of the attributes themselves. For example, ydsnet and ydline100 are highly correlated, because as a team advances down the field and their yardage gained increases, they also reach a higher yard line on the field.
To look at a few specific continuous attributes, a correlation plot of the continuous attributes graphed in a scatter matrix earlier in the report can be used:
In [335]:
#Isolate a few attributes whose correlation will be mapped
df_plot = df[["Yards.Gained", "ydstogo", "TimeSecs", "ScoreDiff"]]
#Produce heatmap of the correlations among the attributes
sns.heatmap(df_plot.corr())
Out[335]:
This correlation plot, like the above plot, shows very little correlation among the plotted datasets. This has an important ramification for classification: because these attributes are very un-correlated, principle component analysis will likely not do an excellent job of reducing the dimensionality of the dataset.
To see if Yards.Gained is a good predictor of play type, a violin plot of the Yards.Gained attribute will be performed where the data is split up between rushing plays and passing plays. Note that Yards.Gained is only a meaningful predictor for two possible PlayTypes, rush and pass. Those are the only two that will be included in the violin.
Two violins will be drawn, one for scoring plays, one for non-scoring plays. This will help visualize if there is a difference in the Yards.Gained-PlayType correlation for scoring plays.
In [336]:
#Isolate run and pass plays
df_plot = df[[x in ['Run', 'Pass'] for x in df.PlayType]]
#Plot the yards gained distributions for the two types of plays
sns.violinplot(x = "sp", y = "Yards.Gained", data =df_plot, hue="PlayType", split=True, inner = 'quart')
Out[336]:
What the above violin plot demonstrates is that there is very little correlation between Yards.Gained and PlayType. Because both violins are somewhat symmetrical, and the quartiles for each side of each violin are somewhat similar, it is difficult to discern strictly based on Yards.Gained and sp (scoringplay) if a play was a run or a pass.
It stands to reason that there may be a correlation between score difference and play type, as teams who are behind by a lot of points may need to throw the ball frequently to try to catch up to their opponents. A scatter plot of PlayType vs ScoreDiff will help identify such a correlation, if one exists.
In [337]:
#Isolate run and pass plays
df_plot = df[[x in ['Run', 'Pass'] for x in df.PlayType]]
print("Mean ScoreDiff for run play: " + str(df_plot[df_plot.PlayType == 'Run'].ScoreDiff.mean()))
print("Mean ScoreDiff for pass play: " + str(df_plot[df_plot.PlayType == 'Pass'].ScoreDiff.mean()))
#PlayType converted to floating point value with jitter
playType_as_int = [1 if x == "Run" else 2 for x in df_plot.PlayType]
#Add jitter for visualization
playType_as_int = playType_as_int + np.random.rand(len(playType_as_int))/2.5
df_plot.PlayType = pd.Series(playType_as_int)
#Plot without a linear regression
sns.lmplot(x = "ScoreDiff", y = "PlayType", data = df_plot.sample(1000), fit_reg=False)
Out[337]:
The above visualization shows no meaningful correlation between PlayType and ScoreDiff. The top group of plays are passing plays, while the bottom group of plays are running plays. The means, however, show that there is, on average, a difference between the score differences for the two types of plays. This suggests that running plays do, on average, occur more often when a team is in the lead.
The distributions of those ScoreDifferences can be visualized using a set of boxplots:
In [338]:
#Isolate run and pass plays
df_plot = df[[x in ['Run', 'Pass'] for x in df.PlayType]]
#Boxplot of the ScoreDiff when these types of plays occur
sns.boxplot(x = "PlayType", y = "ScoreDiff", data = df_plot)
Out[338]:
The overlap in the 25-50th quantiles of the ScoreDiff for run and pass plays indicates that classification based solely on ScoreDiff would be very inaccurate. However, there is a slight difference in the distributions of ScoreDiff for each play type, which may be leveraged in the classification process in some way.
Since ScoreDiff is not a string predictor of PlayType, perhaps the combination of ydstogo and down are. For example, when offenses are met with a 3rd and long situation, they may be forced to go with a passing play, because a running play just won't get them the yards they need.
A violin plot of the ydstogo for a play for each possible down, split by PlayType, may help identify what types of plays are likely to be called given a specific down and yardage-to-go.
In [339]:
#Isolate run and pass plays
df_plot = df[[x in ['Run', 'Pass'] for x in df.PlayType]]
df_plot = df_plot[df_plot.down != -1]
#Violin plot, split by PlayType, with down on x axis and ydstogo on y axis
sns.violinplot(x = "down", y = "ydstogo", data = df_plot, fit_reg=False, hue = "PlayType", split=True)
Out[339]:
What the above visualization indicates is that as down increases, the ability to predict PlayType becomes better. For first down situations, the distributions of run and pass plays are essentially even based on yardage. The distributions become less and less symmetric until 4th down, where running plays occur much less frequently for high-yardage situations.
Perhaps simply down is a good predictor of play type. To check, a bar plot of the percentage of plays on each down which are running plays.
In [340]:
#Isolate run and pass plays
df_plot = df[[x in ['Run', 'Pass'] for x in df.PlayType]]
df_plot = df_plot[df_plot.down != -1]
#PlayType converted to integer
playType_as_int = [1 if x == "Run" else 0 for x in df_plot.PlayType]
df_plot.PlayType = pd.Series(playType_as_int)
#Group by down and count how often each type of play occurs
df_plot_groups = df_plot.groupby('down')
run_plays_by_down = df_plot_groups.PlayType.sum() / df_plot_groups.PlayType.count()
#Barplot the data
run_plays_by_down.plot(kind='barh')
Out[340]:
The barplot of down vs percentage of running plays shows that there is no noticeable correlation between what down it is and what kind of play occurs. Therefore, down on its own is not a good predictor of PlayType. However, down and ydstogo may provide some insight into PlayType, as seen in the above violin plot.
There are a few different features about each play that are missing, and that may provide additional insight to the analyses performed on the data set.
Each play includes information about the game time when the play began, but don't include information about how long the play took. This information could be added to the dataset by iterating through the rows of the dataset and, for each row, calculating the difference in time between the given row and the row before it. Adding this data column would require checking to make sure that the GameIDs, quarters, and posteams match for the two rows, so that time of play actually represented the time of a single play.
In [341]:
PlayTime = [-1 for x in range(len(df))]
for i in range(len(df)):
try:
if df.GameID[i] == df.GameID[i-1] and df.posteam[i] == df.posteam[i-1] and df.qtr[i-1] == df.qtr[i]:
PlayTime[i] = df.TimeSecs[i-1] - df.TimeSecs[i]
except:
PlayTime[i] = -1
PlayTime = pd.Series(PlayTime)
df['PlayTime'] = PlayTime
A violin plot of the data provides a novel affirmation that the data addition worked.
In [342]:
df_plot = df[[x in ['Run', 'Pass'] for x in df.PlayType]]
sns.violinplot(x = "PlayType", y = "PlayTime", data = df_plot)
Out[342]:
It looks as if the addition of the data was successful, but before it can be used in analysis more verification would need to occur. Given, however, that 45 seconds are allowed between plays, this distribution makes sense. The bi-modality can be explained by the fact that sometimes the clock is stopped between plays, and sometimes it is not (this depends on the nature of the previous play).
The location where the game occurred would be valuable information in analyzing play data, as home-field advantage is a significant factor in professional football. This data could be collected from an external source and added into the dataset if it were deemed useful for further analysis.
For long-distance pass plays, as well as field goal attempts, punts, and kickoffs, the type of roof that a stadium has is very important. Open stadiums are exposed to wind, and that can cause difficulty for kickers and quarterbacks. Moreover, open stadiums expose players to natural weather like rain and snow, which can have a large impact on the outcome of many types of plays, as wet footballs are generally more difficult to throw, catch, and carry.
The long-standing Packers-Vikings NFC North rivalry is going to see the light of day this coming Sunday. To provide some insight into the upcoming game, a number of visualizations will be made about the two matchups that occurred between the two teams in the 2015-2016 season.
First, the data for the Green Bay - Minnesota matchup (GB vs MIN) will be isolated:
In [343]:
#Select plays in GB vs MIN games
GBvMIN = df[[x in ['GB', 'MIN' ] for x in df.posteam]]
GBvMIN = GBvMIN[[x in ['GB', 'MIN' ] for x in GBvMIN.DefensiveTeam]]
#Split the data into the two games
Game_1 = GBvMIN[GBvMIN.GameID == 2015112205]
Game_2 = GBvMIN[GBvMIN.GameID != 2015112205]
In [344]:
# plot the score over time for Game 1 GB vs MIN
ax = sns.lmplot(x="TimeSecs", y="PosTeamScore", data=Game_1, hue="posteam",
order=1, ci=None, scatter_kws={"s": 80});
# reverse the way time is displayed so the start of the game is on the right
ax.set(xlim=(3600,-900))
ax.fig.suptitle("MIN vs GB | Game 1 Score over Time")
# plot the score over time for Game 2 GB vs MIN
ax = sns.lmplot(x="TimeSecs", y="PosTeamScore", data=Game_2, hue="posteam",
order=1, ci=None, scatter_kws={"s": 80});
# reverse the way time is displayed so the start of the game is on the right
ax.set(xlim=(3600,-900))
ax.fig.suptitle("MIN vs GB | Game 2 Score over Time")
Out[344]:
A first look at the time series data for the two games shows no serious advantage in favor of one team over the other in terms of the scorelines of their previous games. In the first matchup, the Packers won by a strong margin over the Vikings, but in the second matchup, the Vikings took the win with a one-touchdown lead.
Because there's no clear winner between the two based on overall score, an in-depth look will be taken at the rushing, passing, and defensive capabilities of the two teams.
First, a comparison must be made between the two teams in terms of general performance of rushing and passing plays. Afterwards, a more in-depth look can be taken at the running and passing capabilities of the two offenses. To visualize the difference, a violin plot shows the average yard gained per play type per team.
In [345]:
# get the playtype data we want to show
GBvMIN_playtype = GBvMIN[((GBvMIN.PlayType == 'Pass') | (GBvMIN.PlayType == 'Run'))]
# plot the running and passing yards gained against each other for GB vs MIN games
ax = sns.violinplot(hue = "posteam", y="Yards.Gained", x="PlayType", data=GBvMIN_playtype, inner="quart", split = True)
ax.set_title("MIN vs GB | Running and Passing Against Each Other in 2015")
Out[345]:
The above violin plots seem reasonably symmetrical; however, it does seem like the Packers, despite having a lower median yardage gain for passing plays, have a much higher upper-bound on their yardage gain, especially for passing plays.
To take a look at which team performs best on the ground when the two match up, a bar plot of the rushers from each team and how they perform when they face head-to-head is used. The bars are color-coded by team.
In [346]:
#Isolate the running plays
run_plays = GBvMIN[GBvMIN.PlayType == "Run"]
run_plays_grouped = run_plays.groupby('Rusher')
from statistics import mode
#Format the data for plotting
total_running_yards = run_plays_grouped["Yards.Gained"].sum()
total_running_yards.sort_values(inplace=True, ascending=False)
teams = [mode(run_plays[run_plays.Rusher == x].posteam.tolist()) for x in total_running_yards.index]
total_running_yards_df = pd.DataFrame({'total_yards': total_running_yards, 'rusher' : total_running_yards.index, 'team': teams})
#Plot the rushers
ax = sns.barplot(y="rusher", x="total_yards", data=total_running_yards_df, hue = 'team')
#Print the total rushing yards for each team
print("Total Rushing Yards:\n" + str(run_plays.groupby('posteam')['Yards.Gained'].sum()))
From the looks of it, when the Packers meet the Vikings head-to-head, RB Eddie Lacy outperforms league star Adrian Peterson. However, the 2nd, 3rd, and 4th top rushers are Vikings, so the Vikings actually outperform the packers overall in terms of rushing yards. The Packers put up 207 yards in the two matchups, while the Vikings put up 245.
While the Vikings, with their star RB Adrian Peterson, seem to dominate the ground game, it's possible that the Packers' star QB Aaron Rodgers gives them an edge in the matchup. To identify which team has a superior passing game, the same plot from above (for rushing) is used.
In [347]:
#Isolating the passing plays
pass_plays = GBvMIN[GBvMIN.PlayType == "Pass"]
pass_plays = pass_plays[pass_plays.Receiver != -1]
pass_plays_grouped = pass_plays.groupby('Receiver')
total_passing_yards = pass_plays_grouped["Yards.Gained"].sum()
total_passing_yards.sort_values(inplace=True, ascending=False)
teams = [mode(pass_plays[pass_plays.Receiver == x].posteam.tolist()) for x in total_passing_yards.index]
total_running_yards_df = pd.DataFrame({'total_yards': total_passing_yards, 'receiver' : total_passing_yards.index, 'team': teams})
ax = sns.barplot(y="receiver", x="total_yards", data=total_running_yards_df, hue = 'team')
print("Total Passing Yards:\n" + str(pass_plays.groupby('posteam')['Yards.Gained'].sum()))
As expected, it looks like the Packers have a significant edge on the Vikings in terms of passing yards. The Packers earned over 100 more yards in the air than the Vikings, and the top Packers receiver produced double the yardage of the top Vikings receiver.
It looks like, once again, the Packers/Vikings matchup will be a close call this weekend. The Vikings show an edge in terms of rushing yards, while the Packers show an edge in terms of passing yards. Both teams won one of the two matchups last year. However, because the Packers won by a much larger margin, and their passing potential outshines that of the Vikings so significantly, the Packers seem to be a slight favorite for the upcoming matchup.
This is based solely on data from last year's NFL season. Many things have changed since then, so the slight edge that the Packers seem to have has no real predictive quality with respect to their upcoming matchup.