nflscrapR is an awesome R library that queries the official NFL API for play-by-play data and parses it into an R data frame. Data is available from 2009 through the latest week of the current season. In this blog post, I'll explore the seven full seasons of play-by-play data from 2009 through 2015.
To download a season of play-by-play data to an R data frame, execute the following in an R session:
#Download and load the nflscrapR package
devtools::install_github(repo = "maksimhorowitz/nflscrapR")
library('nflscrapR')
#Create DataFrame with 2015 play by play data
pbp_2015 <- season_play_by_play(2015)
After using nflscrapR in an R session to download and save the play-by-play data for 2009-2015, I explore the data below in Python.
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
nfl_data = pd.read_csv('/home/max/nfl_stats/data/pbp_2009_2015.csv', low_memory=False)
In [3]:
#Print (Rows, Columns) of Data
print(nfl_data.shape)
#Print Variable-Names and First Two Values for Each Variable.
with pd.option_context('display.max_rows', 999, 'display.max_colwidth', 25):
    print(nfl_data.head(2).transpose())
How cool is this?! Almost any question I can think of regarding NFL play outcomes is suddenly queryable. First, though, let's clarify exactly what information some of the variables contain.
Some of the variables look categorical, but it's not obvious what the categories are. We can check, and get the percentage occurrence of each category, by calling value_counts() on each column (a pandas Series) with the "normalize=True" option.
(The ".to_frame()" in the cell below is only there to suppress the "dtype: float64" line that would otherwise be printed under each value_counts section.)
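As a quick illustration of what "normalize=True" does, here is a minimal sketch on a made-up toy Series (the values are invented for the example and are not from the play-by-play file). Note that value_counts() drops missing values by default, so the shares are fractions of the non-missing rows only, which matters for columns that are empty on most plays.

```python
import pandas as pd

# Toy data, made up for illustration; not taken from the play-by-play file.
toy = pd.Series(['Run', 'Pass', 'Pass', 'Sack', 'Pass', None])

# Raw counts per category (missing values are dropped by default).
counts = toy.value_counts()

# normalize=True divides each count by the number of non-missing rows,
# so the resulting shares sum to 1.
shares = toy.value_counts(normalize=True)

print(counts.to_frame())
print(shares.to_frame())
```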
In [4]:
selected_columns = ['PassOutcome', 'PassLength', 'PassLocation', 'RunLocation', 'RunGap', 'PlayType']
for c in selected_columns:
    print(nfl_data[c].value_counts(normalize=True).to_frame(), '\n')
Wow, that's a lot of play types. For most statistics we'd be interested in, we'd want to restrict our plays to runs and passes. The PlayTypes above also show that we'll need to count Sacks as Passes, since sacks occur during pass attempts.
In [6]:
#Create new DataFrame where Play-Type is Run/Pass/Sack
is_run_pass_sack = nfl_data['PlayType'].isin(['Run', 'Pass', 'Sack'])
#.copy() so that adding columns later doesn't trigger a SettingWithCopyWarning
runs_passes_sacks = nfl_data[is_run_pass_sack].copy()
What's the average yards gained per play?
In [8]:
runs_passes_sacks['Yards.Gained'].mean()
Out[8]:
What percentage of plays are run vs pass?
In [9]:
runs_passes_sacks['PlayType'].value_counts(normalize=True).to_frame()
Out[9]:
So pass attempts make up 54.6% + 3.7% (sacks) = 58.3% of plays, and the other 41.7% are runs, at least from 2009-2015.
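Rather than adding the two percentages by hand, the combined share can be computed directly. This is a minimal sketch: the helper name pass_attempt_share is hypothetical, and the toy Series is made up for illustration.

```python
import pandas as pd

def pass_attempt_share(play_types, pass_labels=('Pass', 'Sack')):
    """Fraction of plays whose type counts as a pass attempt.

    Hypothetical helper for illustration; treats every label in
    pass_labels as a pass attempt.
    """
    # isin() gives a boolean Series; its mean is the share of True values.
    return play_types.isin(list(pass_labels)).mean()

# Made-up toy plays: 5 passes, 1 sack, 4 runs.
toy_plays = pd.Series(['Pass'] * 5 + ['Sack'] * 1 + ['Run'] * 4)
toy_share = pass_attempt_share(toy_plays)
print(toy_share)
```

On the real data, this would be called as pass_attempt_share(runs_passes_sacks['PlayType']).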
How does the run/pass split look on first downs?
In [10]:
first_downs = runs_passes_sacks[runs_passes_sacks['down']==1]
first_downs['PlayType'].value_counts(normalize=True).to_frame()
Out[10]:
What percentage of coaches' challenges are successful?
In [25]:
nfl_data['ChalReplayResult'].value_counts(normalize=True).to_frame()
Out[25]:
A challenge is successful if it reverses the call on the field, so 41% of challenges are successful.
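The success rate can also be pulled out as a single number. This is a sketch under an assumption: it supposes that overturned calls are labelled 'Reversed' in ChalReplayResult, so check the labels in the value_counts output before relying on it. The helper name and the toy data are made up for illustration.

```python
import pandas as pd

def challenge_success_rate(results, reversed_label='Reversed'):
    """Share of decided challenges that overturned the call on the field.

    Hypothetical helper; reversed_label is an assumed category name.
    """
    # Plays with no challenge have a missing value, so drop them first.
    decided = results.dropna()
    return (decided == reversed_label).mean()

# Made-up toy column: 2 reversals out of 5 decided challenges.
toy_results = pd.Series(['Upheld', 'Reversed', None, 'Upheld',
                         'Reversed', 'Upheld', None])
toy_rate = challenge_success_rate(toy_results)
print(toy_rate)
```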
What are the average yards gained per pass-play and per run-play?
In [26]:
runs_passes_sacks.groupby('PlayType')['Yards.Gained'].mean().to_frame()
Out[26]:
So passes look way more productive on average. How is that affected by counting sacks as pass attempts?
In [16]:
runs_passes_sacks['PlayType2'] = runs_passes_sacks['PlayType'].replace({'Sack':'Pass'})
runs_passes_sacks.groupby('PlayType2')['Yards.Gained'].mean()
Out[16]:
So including sacks lowers the pass-play average by almost 0.9 yards per play, or about 12%. Even with sacks included, though, passes still look way more effective on average: about 2 yards per play more, which is almost 50% more than the run plays. I wonder how consistent that is by year.
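As an aside on the arithmetic above, the absolute and relative drop can be computed instead of read off two outputs. This is a minimal sketch: the helper name sack_effect is hypothetical, and the toy yardages are made up so the numbers are easy to check by hand.

```python
import pandas as pd

def sack_effect(df, type_col='PlayType', yards_col='Yards.Gained'):
    """Return (absolute drop, relative drop) in pass yards per play
    when sacks are counted as pass attempts. Hypothetical helper."""
    passes_only = df.loc[df[type_col] == 'Pass', yards_col].mean()
    with_sacks = df.loc[df[type_col].isin(['Pass', 'Sack']), yards_col].mean()
    drop = passes_only - with_sacks
    return drop, drop / passes_only

# Made-up toy plays: passes average 8 yards, one sack loses 8 yards.
toy_df = pd.DataFrame({
    'PlayType': ['Pass', 'Pass', 'Pass', 'Sack', 'Run'],
    'Yards.Gained': [10, 6, 8, -8, 3],
})
abs_drop, rel_drop = sack_effect(toy_df)
print(abs_drop, rel_drop)
```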
In [158]:
annual = runs_passes_sacks.groupby(['PlayType2', 'Season'], as_index=False)['Yards.Gained'].mean()
sn.pointplot(data=annual, x='Season', y='Yards.Gained', hue='PlayType2')
Out[158]:
Yep, the hefty pass-over-run premium in average yards per play goes back at least to 2009.
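To read the premium as numbers rather than off the plot, the per-season averages can be pivoted so that Pass and Run sit side by side. This is a sketch on a made-up frame with the same columns as annual; the yardage values are invented for illustration and are not results from the data.

```python
import pandas as pd

# Made-up stand-in for the `annual` frame; values are illustrative only.
annual_toy = pd.DataFrame({
    'PlayType2': ['Pass', 'Run', 'Pass', 'Run'],
    'Season': [2009, 2009, 2010, 2010],
    'Yards.Gained': [6.0, 4.0, 6.5, 4.5],
})

# One row per season, one column per play type.
wide = annual_toy.pivot(index='Season', columns='PlayType2',
                        values='Yards.Gained')

# Pass-over-run premium in yards per play for each season.
wide['premium'] = wide['Pass'] - wide['Run']
print(wide)
```

Swapping annual_toy for annual gives the season-by-season premium for the real data.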