My new favorite dataset is the trove of NFL play-by-play data now downloadable in R through nflscrapR. However, in another post, I noticed something that didn't look right with the GoalToGo variable, based on the plot below.
[Plot image: probability distributions of Yards.Gained; the right sub-plot shows GoalToGo plays]
The problem here is that the probability distribution in the right sub-plot says that in GoalToGo situations (e.g. '3rd and Goal'), a play never gained more than 10 yards.
At first this seems to make sense since you're "goal-to-go" when you get inside the 10-yardline. But, if you're "goal-to-go" at the 10 yardline and get called for a penalty, you replay the down, still goal-to-go, backed up another 5-15 yards. Some of those plays must have gone for more than 10 Yards.Gained.
So, leaving aside the actual Yards.Gained on a play for a moment, how many cases can we find where GoalToGo was at least set beyond the 10-yardline?
In [1]:
library(nflscrapR)
pbp_data <- season_play_by_play(2015)
In [ ]:
# Goal-to-go plays that started beyond the 10-yardline
subset(pbp_data, GoalToGo == 1 & yrdline100 > 10, select = c('down', 'desc', 'Yards.Gained', 'yrdline100'))
None? Ok. So now let's track down a case that we think should be goal-to-go beyond the 10-yardline, but somehow isn't.
In [52]:
# Filter on goal-to-go plays with a False Start penalty that would move the offense back beyond the 10-yardline
g2g_penalty <- subset(pbp_data, GoalToGo == 1 & yrdline100 > 5 & PenaltyType == "False Start",
                      select = c('Date', 'posteam', 'Drive', 'qtr', 'down', 'ydstogo', 'yrdline100', 'GoalToGo', 'PenaltyType', 'Penalty.Yards', 'Accepted.Penalty', 'Yards.Gained'))
head(g2g_penalty)
We've identified a bunch of plays above where we'd expect the subsequent play on that drive (after the Penalty.Yards are enforced) to be "goal-to-go" beyond the 10-yardline. We can pull the rest of those drives (matching on date/posteam/drive/qtr) to view those subsequent plays and inspect what's going on some more. And having just done this, I can confirm we find instances where GoalToGo should be 1, but is set to 0 incorrectly.
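The drive lookup described above could be sketched roughly as follows. To keep it runnable without downloading a season, this uses a tiny made-up data frame (`toy_pbp`) in place of `pbp_data`; the team codes, yard lines and the `play_order` column are hypothetical and only there for illustration.

```r
# Toy stand-in for pbp_data. All values are invented for illustration,
# and play_order is a hypothetical column used only to sort the plays.
toy_pbp <- data.frame(
  Date        = rep("2015-09-13", 4),
  posteam     = c("AAA", "AAA", "AAA", "BBB"),
  Drive       = c(3, 3, 3, 4),
  qtr         = c(1, 1, 1, 1),
  play_order  = 1:4,
  down        = c(1, 1, 2, 1),
  ydstogo     = c(7, 12, 12, 10),
  yrdline100  = c(7, 12, 12, 80),
  GoalToGo    = c(1, 0, 0, 0),
  PenaltyType = c("False Start", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Same filter as the cell above, applied to the toy data
toy_penalty <- subset(toy_pbp, GoalToGo == 1 & yrdline100 > 5 & PenaltyType == "False Start")

# One row per drive that contains a flagged penalty
drive_keys <- unique(toy_penalty[, c("Date", "posteam", "Drive", "qtr")])

# Inner join: keeps every play belonging to one of those drives
toy_drives <- merge(toy_pbp, drive_keys, by = c("Date", "posteam", "Drive", "qtr"))
toy_drives <- toy_drives[order(toy_drives$play_order), ]

toy_drives[, c("posteam", "Drive", "down", "ydstogo", "yrdline100", "GoalToGo")]
```

On the real data, the same pattern should work with `pbp_data` in place of `toy_pbp` and `g2g_penalty` in place of `toy_penalty`. Note that matching on qtr as well will drop any plays from a drive that carries over into the next quarter.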
But... we've also accidentally stumbled onto an example of the opposite error, where GoalToGo should be 0 but is set to 1 incorrectly. The first row of the g2g_penalty output above is a 2nd-down-and-6 from the 8, clearly not a "goal-to-go" situation, but that row still lists GoalToGo = 1.
So is GoalToGo simply set to 1 for every play inside the 10-yardline? We can answer this question by breaking out the handy cut() function for converting numeric variables to categorical variables, and then looking at a contingency table of GoalToGo vs yrdline-as-a-factor.
In [66]:
pbp_data$yrdline100_factor <- cut(pbp_data$yrdline100, breaks=c(0,10, 100))
table(pbp_data$GoalToGo, pbp_data$yrdline100_factor)
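As a quick aside on how cut() bins values with these breaks: by default the intervals are closed on the right, so a play snapped exactly at the 10 lands in the first bin. A minimal sketch on a few made-up yard lines:

```r
# Hypothetical yard-line values, chosen to sit on either side of the break at 10
yard_lines <- c(1, 10, 11, 99)

# Default right = TRUE gives the intervals (0,10] and (10,100]
bins <- cut(yard_lines, breaks = c(0, 10, 100))

levels(bins)
as.character(bins)
```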
Based on the table() output above, the GoalToGo variable is apparently just a flag for whether the offense is within ten yards of the end zone.
This is definitely incorrect, as some plays inside the 10 are not goal-to-go, and some plays outside the 10 are.
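One way to check the recorded flag is to recompute it from the down-and-distance columns. The sketch below assumes a play is goal-to-go when the yards needed for a first down reach the goal line (ydstogo >= yrdline100); that rule is my own working definition, not something taken from nflscrapR, and the `g2g_check` column and the toy rows are hypothetical.

```r
# Made-up plays for illustration: a 1st-and-goal from the 10, a 2nd-and-6
# from the 8, a goal-to-go pushed back to the 15, and a 3rd-and-2 at midfield-ish
toy <- data.frame(
  down       = c(1, 2, 1, 3),
  ydstogo    = c(10, 6, 15, 2),
  yrdline100 = c(10, 8, 15, 40),
  GoalToGo   = c(1, 1, 0, 0)   # as the inside-the-10 flag would record it
)

# Assumed rule: goal-to-go when the line to gain is at or past the goal line
toy$g2g_check <- as.integer(toy$ydstogo >= toy$yrdline100)

# Plays where the recorded flag disagrees with the recomputed one
mismatches <- subset(toy, GoalToGo != g2g_check)
mismatches

# Contingency table of recorded vs recomputed flag
table(recorded = toy$GoalToGo, recomputed = toy$g2g_check)
```

Applied to `pbp_data`, the same comparison should surface both kinds of error: plays inside the 10 that are flagged but are not goal-to-go, and plays outside the 10 that are goal-to-go but not flagged. Rows with a missing ydstogo or yrdline100 would need separate handling.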