Problem With nflScrapR GoalToGo Variable

My new favorite dataset is the trove of NFL play-by-play data now downloadable in R through nflScrapR. However, while working on another post, I noticed something that didn't look right with the GoalToGo variable, based on the plot below.


The problem here is that the probability distribution in the right sub-plot says that in GoalToGo situations (e.g. '3rd and Goal'), no play ever gained more than 10 yards.

At first this seems to make sense, since you're "goal-to-go" when you get inside the 10-yardline. But if you're "goal-to-go" at the 10-yardline and get called for a penalty, you replay the down, still goal-to-go, backed up another 5-15 yards. Some of those plays must have gone for more than 10 Yards.Gained.

Tracking Down Cases Which Appear Misclassified

So, leaving aside the actual Yards.Gained on a play for a moment, how many plays can we find where GoalToGo is set at all beyond the 10-yardline?

In [1]:
library(nflscrapR)  # provides season_play_by_play(); note the package name is spelled nflscrapR
pbp_data <- season_play_by_play(2015)

In [ ]:
subset(pbp_data, GoalToGo == 1 & yrdline100 > 10, select = c('down', 'desc', 'Yards.Gained', 'yrdline100'))

None? Ok. So now let's track down a case that we think should be goal-to-go beyond the 10-yardline, but somehow isn't.

In [52]:
# Filter on goal-to-go plays with a False Start penalty that would move the offense back beyond the 10-yardline
g2g_penalty <- subset(pbp_data, GoalToGo == 1 & yrdline100 > 5 & PenaltyType == "False Start",
      select = c('Date', 'posteam', 'Drive', 'qtr', 'down', 'ydstogo', 'yrdline100', 'GoalToGo', 'PenaltyType', 'Penalty.Yards', 'Accepted.Penalty', 'Yards.Gained'))
head(g2g_penalty)  # show the first few matching plays

             Date  posteam  Drive  qtr  down  ydstogo  yrdline100  GoalToGo  PenaltyType  Penalty.Yards  Accepted.Penalty  Yards.Gained
  233  2015-09-13      CHI      1    1     2        6           8         1  False Start              5                 1             0
  567  2015-09-13      CLE      2    1     1        9           9         1  False Start              5                 1             0
  643  2015-09-20       NE      4    1     1       10          10         1  False Start              5                 1             0
30922  2015-09-20       NE     17    3     1        8           8         1  False Start              5                 1             0
20841  2015-09-20      DAL     11    2     2       10          10         1  False Start              5                 1             0
27223  2015-09-27      STL     14    4     2        9           9         1  False Start              5                 1             0

We've identified a bunch of plays above where we'd expect the subsequent play on that drive (after the Penalty.Yards are enforced) to be "goal-to-go" beyond the 10-yardline. We can pull the rest of those drives (matching on date/posteam/drive/qtr) to view those subsequent plays and inspect what's going on some more. And having just done this, I can confirm we find instances where GoalToGo should be 1, but is set to 0 incorrectly.
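Pulling the rest of those drives could look roughly like the sketch below. It matches on Date/posteam/Drive/qtr as described above; the helper name plays_on_flagged_drives, the play_index column, and the toy_pbp rows are all made up for illustration (the toy rows are not real nflscrapR output), so treat this as one possible approach rather than the exact code I ran.

```r
# Toy stand-in for pbp_data: hypothetical rows, NOT real nflscrapR output.
toy_pbp <- data.frame(
  Date        = c("2015-09-13", "2015-09-13", "2015-09-13", "2015-09-13"),
  posteam     = c("CHI", "CHI", "CHI", "GB"),
  Drive       = c(1, 1, 1, 2),
  qtr         = c(1, 1, 1, 1),
  down        = c(1, 1, 2, 1),
  ydstogo     = c(9, 14, 14, 10),
  yrdline100  = c(9, 14, 14, 80),
  GoalToGo    = c(1, 0, 0, 0),
  PenaltyType = c("False Start", NA, NA, NA),
  stringsAsFactors = FALSE
)

drive_keys <- c("Date", "posteam", "Drive", "qtr")

# Hypothetical helper: return every play on any drive that contains
# at least one flagged play, in the original play order.
plays_on_flagged_drives <- function(pbp, flagged, keys = drive_keys) {
  pbp$play_index <- seq_len(nrow(pbp))          # remember the original order
  flagged_drives <- unique(flagged[, keys, drop = FALSE])
  out <- merge(pbp, flagged_drives, by = keys)  # inner join on the drive keys
  out[order(out$play_index), ]                  # merge() may reorder rows
}

# Same filter as the cell above, applied to the toy data.
flagged <- subset(toy_pbp, GoalToGo == 1 & yrdline100 > 5 & PenaltyType == "False Start")

drive_plays <- plays_on_flagged_drives(toy_pbp, flagged)
drive_plays[, c(drive_keys, "down", "ydstogo", "yrdline100", "GoalToGo")]
```

With the real data you would pass pbp_data and g2g_penalty instead of the toy frames, then scan the rows that follow each penalty for yrdline100 > 10.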

But... we've also accidentally stumbled onto an example of the opposite error - where GoalToGo should be 0 but is set to 1 incorrectly. The first row above is a 2nd-down-and-6 from the 8, clearly not a "goal-to-go" situation, but that first row still lists GoalToGo = 1.

Is GoalToGo just an Alias for [yrdline100 <= 10]?

We can answer this question by breaking out the handy cut() function for converting numeric variables to categorical variables, and then looking at a contingency table of GoalToGo vs yrdline-as-a-factor.

In [66]:
pbp_data$yrdline100_factor <- cut(pbp_data$yrdline100, breaks = c(0, 10, 100))

table(pbp_data$GoalToGo, pbp_data$yrdline100_factor)

    (0,10] (10,100]
  0      0    43135
  1   2886        0


The GoalToGo variable is apparently (incorrectly) just a flag for whether the offense is within ten yards of the end zone. Based on the table() output above:

  • In all 2,886 plays where the offense is within 10 yards of the end zone, GoalToGo is set to 1.
  • In all 43,135 plays where the offense is more than 10 yards from the end zone, GoalToGo is set to 0.

This is definitely incorrect, as some plays inside the 10 are not goal-to-go, and some plays outside the 10 are.
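
One way to look for the misclassified plays is to compare the packaged flag against a rule of my own: call a play goal-to-go when ydstogo equals yrdline100, i.e. the line to gain is the goal line. That rule is an assumption on my part, not something the package documents, and the GoalToGo_alt column and the toy rows below are hypothetical (not real nflscrapR output). A minimal sketch:

```r
# Toy rows (hypothetical, not real nflscrapR output) covering four cases:
# 2nd-and-6 from the 8, 2nd-and-goal from the 8,
# 1st-and-goal from the 15, and 1st-and-10 from the 45.
toy <- data.frame(
  ydstogo    = c(6, 8, 15, 10),
  yrdline100 = c(8, 8, 15, 45),
  GoalToGo   = c(1, 1, 0, 0)   # set the way the package appears to: yrdline100 <= 10
)

# Assumed rule: goal-to-go when yards to go equal yards to the end zone.
toy$GoalToGo_alt <- as.integer(toy$ydstogo == toy$yrdline100)

# Does the packaged flag coincide exactly with the plain yard-line test?
alias_holds <- all(toy$GoalToGo == as.integer(toy$yrdline100 <= 10))

# Rows where the packaged flag and the assumed rule disagree.
mismatches <- subset(toy, GoalToGo != GoalToGo_alt)

table(packaged = toy$GoalToGo, assumed = toy$GoalToGo_alt)
```

Run against pbp_data instead of the toy frame, the off-diagonal cells of that table would count the suspect plays in each direction. The rule relies on ydstogo itself being recorded correctly, so any mismatches are worth spot-checking against the play descriptions.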