Classification of Play Type in NFL Play-By-Play Data

Ian Johnson, Derek Phanekham, Travis Siems

Introduction

The NFL (National Football League) has 32 teams split into two conferences, the AFC and NFC. Each of the 32 teams plays 16 games during the regular season (non-playoff season) every year. Due to the considerable viewership of American football, as well as the pervasiveness of fantasy football, considerable data about the game is collected. During the 2015-2016 season, information about every play from each game that occurred was logged. All of that data was consolidated into a single data set which is analyzed throughout this report.

In this report, we will attempt to classify the type of a play, given the game situation before the play began.

The Classification Task

We will attempt to classify plays based on play type using information about the state of the game prior to the start of the play. This is expected to be an exceptionally difficult classification task, due to the amount of noise in the dataset (specifically, the decision to run vs pass the ball is often a seemingly random one). A successful classifier would have huge value to defensive coordinators, who could call plays based on the expected offensive playcall. Because it may be very difficult to identify what play will be called, it is relevant to provide a probability of a given playcall in a situation. For example, it would be useful to provide the probability of a 4th down conversion attempt, even if the overall prediction is that a punt occurs.

Data Preparation

In order to prepare the data for classification, a number of variables from the original dataset will be removed, as they measure the result of the play, not the state of the game prior to the start of the play. The dataset being included in this report has had previous cleaning and preprocessing performed in our previous report.


In [47]:
#For final version of report, remove warnings for aesthetics.
import warnings
warnings.filterwarnings('ignore')

#Libraries used for data analysis
import pandas as pd
import numpy as np
from sklearn import preprocessing

df = pd.read_csv('data/cleaned.csv') # read in the csv file

colsToInclude = ['GameID', 'Drive', 'qtr', 'down',
                 'TimeSecs', 'yrdline100','ydstogo','ydsnet',
                 'GoalToGo','posteam','DefensiveTeam',
                 'PosTeamScore','ScoreDiff', 'PlayType']

df = df[colsToInclude]
df = df[[p not in ["Sack", "No Play", "QB Kneel", "Spike"] for p in df.PlayType]]
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 38600 entries, 0 to 42875
Data columns (total 14 columns):
GameID           38600 non-null int64
Drive            38600 non-null int64
qtr              38600 non-null int64
down             38600 non-null int64
TimeSecs         38600 non-null float64
yrdline100       38600 non-null float64
ydstogo          38600 non-null float64
ydsnet           38600 non-null float64
GoalToGo         38600 non-null int64
posteam          38600 non-null object
DefensiveTeam    38600 non-null object
PosTeamScore     38600 non-null float64
ScoreDiff        38600 non-null float64
PlayType         38600 non-null object
dtypes: float64(6), int64(5), object(3)
memory usage: 4.4+ MB

A subset of attributes will be one-hot encoded, as they are categorical variables which won't work with our classification algorithm. The following Python function was used for one-hot encoding, and was adapted from the website referenced in the code.


In [48]:
from sklearn.feature_extraction import DictVectorizer

#Simple function for 1 hot encoding
def encode_onehot(df, cols):
    """
    One-hot encoding is applied to columns specified in a pandas DataFrame.
    
    Modified from: https://gist.github.com/kljensen/5452382
    
    Details:
    
    http://en.wikipedia.org/wiki/One-hot
    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    
    @param df pandas DataFrame
    @param cols a list of columns to encode
    @return a DataFrame with one-hot encoding
    """
    vec = DictVectorizer()
    
    vec_data = pd.DataFrame(vec.fit_transform(df[cols].to_dict(outtype='records')).toarray())
    vec_data.columns = vec.get_feature_names()
    vec_data.index = df.index
    
    df = df.drop(cols, axis=1)
    df = df.join(vec_data)
    return df

df = encode_onehot(df, cols=['posteam', 'DefensiveTeam'])

The following are descriptions of the remaining data columns in the play-by-play dataset. Note that the one-hot encoded columns do not follow the structure listed below, but for the sake of readability they are presented as if they were not one-hot encoded.

  • GameID (nominal): A unique integer which identifies each game played
  • Drive (ordinal): The number of the drive during a game when the play occurred (indexed at one, so the first drive of the game has Drive 1 and the nth drive has Drive n)
  • qtr (interval): The quarter of the game when the play occurred
  • down (interval): The down when the play occurred (1st, 2nd, 3rd, or 4th)
  • TimeSecs (interval): The remaining game time, in seconds, when the play began
  • yrdline100 (ratio): The absolute yard-line on the field where the play started (from 0 to 100, where 0 is the defensive end zone and 100 is the offensive end zone of the team with the ball)
  • ydstogo (ratio): The number of yards from the line of scrimmage to the first-down line
  • ydsnet (ratio): The number of yards from the beginning of the drive to the current line of scrimmage
  • GoalToGo (nominal): A binary attribute whose value is 1 if there is no first down line (the end-zone is the first down line) or 0 if there is a normal first down line
  • posteam (nominal): A 2-or-3 character code representing the team on offense
  • PosTeamScore (ratio): The score of the team with possesion of the ball
  • DefensiveTeam (nominal): A 2-or-3 character code representing the team on defense
  • ScoreDiff: (ratio) The difference in score between the offensive and defensive at the time of the play.
  • PlayType: (nominal) An attribute that identifies the type of play (i.e. Kickoff, Run, Pass, Sack, etc)

Performance Metrics

The value of a classifier will be evaulated using the following cost matrix. Costs in the matrix which are set to 1 represent play predictions that would never actually occur in the context of a football game. For example, if we predicted a pass play and a kickoff occurs, then the classifier has a significant flaw.

Bolded weights represent actual mispredictions that could occur.

Actual Play Pass Run Kickoff Punt Extra Point Field Goal Onside Kick
Predicted Play
Pass 0 0.1 1 0.15 0.15 0.1 1
Run 0.1 0 1 0.15 0.15 0.1 1
Kickoff 1 1 0 1 1 1 0.75
Punt 0.25 0.25 1 0 1 0.15 1
Extra Point 0.4 0.4 1 1 0 1 1
Field Goal 0.4 0.4 1 0.1 1 0 1
Onside Kick 1 1 0.25 1 1 1 0

This performance metric is the best for this classification problem because the actual potential cost of an incorrect play prediction varies significantly based on the nature of the misclassification. In an actual football game, it would be very costly to predict an extra point and have the opposing team run a pass play. This means that they ran a fake extra point and went for a two-point conversion. However, if a pass play is predicted and a run play occurs, the cost of the error is minimal because the defensive strategy for defending against run and pass plays.

Because the goal of this classification is to help inform defensive play-calling, a cost matrix is helpful because it allows a defensive coordinator to set his own costs to produce his own classifier, without any knowledge of the actualy computation that occurs.

Cross Validation Methodology

We will use a stratified 10-fold cross validation technique to compare the models. We use a sequential partition of the data because this mirrors how data will be collected and analyzed. For our use, we assume that it is okay to use data in the “future” to predict data “now” because it can represent data from a previous football season. For example, if we use the first 90% of data for training and the remaining 10% of data for testing, that would simulate using most of the current season's data to predict plays towards the end of this season. If we use the first 50% and last 40% of data for training and the remaining 10% for testing, this would simulate using 40% of the previous season's data and the first 50% of this season's data to predict plays happening around the middle of the current season.

By using stratification, we can use a more representative sample of the distribution of play types. This stratified sample will reduce variance in the estimation. Given that the dataset only covers play-by-play data for one football season and using our assumptions, we can conclude that the stratified 10-fold cross validation technique will give us an ideal comparison between models.


In [49]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

#Using a 10-fold stratified shuffle split.
cv = StratifiedKFold(n_splits=10)

y = df.PlayType.values
X = df.drop('PlayType', 1).values

Modeling

Before we build any models, we define a cost function in Python below, which is used to test all of our forthcoming models. It computes the item-wise product of a confusion matrix and our cost matrix, and returns the sum of all of the elements in the resulting matrix. We also define a function to calculate area under roc curve for a multiclass classification problem.


In [74]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_curve, auc
from scipy import interp

import matplotlib.pyplot as plt
%matplotlib inline 
plt.style.use('ggplot')

cost_mat =     [[0  ,.1  , 1   , .15 , 0.15, .1 , 1   ],
                [.1 , 0  , 1   , 0.15, 0.15, 0.1, 1   ],
                [1  , 1  , 0   , 1   , 1   , 1  , 0.75],
                [.25,0.25, 1   , 0   , 1   ,0.15,  1  ],
                [0.4, 0.4, 1   , 1   , 0   , 1  ,  1  ],
                [0.4, 0.4, 1   , 0.1 , 1   , 0  ,  1  ],
                [1  , 1  , 0.25, 1   , 1   , 1  ,  0  ]]

def cost(Y, yhat):
    return np.sum(np.multiply(confusion_matrix(Y,yhat), cost_mat))

def auc_of_roc(Y,yhat):
    classes = ['Pass', 'Run', 'Kickoff', 'Punt', 'Extra Point', 'Field Goal', 'Onside Kick']
    
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    for c in classes:
        tempY = [x==c for x in Y]
        tempYhat = [x==c for x in yhat]
        
        fpr, tpr, thresholds = roc_curve(tempY, tempYhat)
        mean_tpr += interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0

        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=2, label=c+' (area = %0.2f)' % (roc_auc))
    
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')

    mean_tpr /= len(classes)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    
    plt.plot(mean_fpr, mean_tpr, color='g', linestyle='--',label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)
    
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    
    print(mean_auc)
    plt.show()
    
    return mean_auc

auc_roc_scorer = make_scorer(auc_of_roc)
scorer = make_scorer(cost)

An Accessory Plotting Function

The following plotting function will be used for tuning hyperparameters for various algorithms. It is adapted from one of the instructional ML notebooks for CSE 5393


In [75]:
#Adapted From MachineLearningNotebooks/09_Evaluation
#Credit: Dr. Eric Larson
def plot_filled(train_scores,test_scores,train_x_axis, xlabel=''):

    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)

    plt.plot(train_x_axis, test_mean,
             color='blue', marker='o',
             markersize=5, label='testing cost')

    plt.fill_between(train_x_axis,
                     test_mean + test_std,
                     test_mean - test_std,
                     alpha=0.15, color='blue')

    plt.grid()
    plt.xlabel(xlabel)
    plt.ylabel('Cost')
    plt.legend(loc='upper right')

Logistic Regression

The first model we will use is a logistic regression model. Before creating the final model, we will tune the cost variable C using a validation curve generated with the cross validation strategy described above. For both the hyperparameter tuning and the final model creation, we will use a class weight dictionary to encode the costs of each class. These costs will be computed using the row sums of the cost matrix.


In [52]:
#Building the class weight map
PlayTypes = df.PlayType.value_counts().index.tolist()
Costs = [sum(x) for x in cost_mat]

ClassWeights = dict(zip(PlayTypes, Costs))

In [53]:
#Some includes
from sklearn.model_selection import validation_curve
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [ ]:
#Note: This takes a long time to run. Don't re-run it unless you have to!

#create pipeline of scaling and logistic regression
clf = Pipeline([('sca',StandardScaler()),
                ('clf',LogisticRegression(class_weight=ClassWeights))])

#Some possible values of 
param_range = [.0001, .001, .01, .1, 1, 10, 100, 1000, 10000]

train_scores, test_scores = validation_curve(
                estimator=clf, 
                X=X, 
                y=y, 
                param_name='clf__C', 
                param_range=param_range,
                scoring=scorer,
                cv=cv,
                n_jobs=-1)

In order to assess the optimal value of C, we plot the value of C vs the cost of the produced classifier:


In [70]:
plot_filled(train_scores, test_scores, param_range, xlabel='Parameter Sweep')
plt.xscale('log')
plt.title("Cost for Different C values for Logistic Regression")
plt.show()


Based on the above plot, we can see that the (approximate) optimal value of C is 1. While larger values of C don't percievably change the training cost, they do introduce a larger variance, and likely result in more overfitting. Therefore, C will be set to 1 for our final Logistic Regression model.


In [76]:
#Note: This takes a long time to run. Don't re-run it unless you have to!

#Re-define the pipeline, this time with the tuned value of C (C=1)
clf = Pipeline([('sca',StandardScaler()),
                ('clf',LogisticRegression(C=1, class_weight = ClassWeights))])

per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=scorer
                                   )

LRCosts = per_fold_eval_criteria

After calculating the costs of the model on the test folds, we also compute the area under the multiclass ROC curve.


In [76]:
per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=auc_roc_scorer
                                   )

LR_auc = per_fold_eval_criteria


0.806065259118
0.790951159182
0.800936966774
0.783116216593
0.797494279892
0.790935844399
0.793982532328
0.790658416893
0.812471306801
0.795203900461

In [58]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(LRCosts)),LRCosts)
plt.ylim([min(LRCosts)-20,max(LRCosts)+5])
plt.xlabel("Fold")
plt.ylabel("Cost")
plt.title("Cost per Fold for Logistic Regression")
print("Mean Cost: ", np.mean(LRCosts))


Mean Cost:  1219.1

The above plot shows the costs of the models built and evaluated on each of the ten folds. Note that the y-axis is not zeroed at zero, so the relative heights of the models is not meaningful. The mean cost for the logistic regression models was 1219.1


In [59]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(LR_auc)),LR_auc)
# plt.ylim([min(LR_auc)-20,max(LR_auc)+5])
plt.xlabel("Fold")
plt.ylabel("Area Under Curve")
plt.title("AoC per Fold for Logistic Regression")
print("Mean AUC: ", np.mean(LR_auc))


Mean AUC:  0.796181589

The above plot shows the areas under the ROC curves for the classifier for each of the folds of the test data. It is readily observable that the area under the curve is relatively constant across folds, which is not true for the costs.

Support Vector Machine

Our next model will use a support vector machine with the RBF kernel. We will tune the hyper-parameters C and gamma for the SVM using the same method used for tuning the logistic regression model. The same cost metric and cross-validation strategy will be used for the SVM as was used for logistic regression.


In [80]:
from sklearn.svm import SVC

In [16]:
#Note: This takes a VERY long time to run. Don't re-run it unless you have to!

#create pipeline of scaling and logistic regression
clf = Pipeline([('sca',StandardScaler()),
                ('clf',SVC(class_weight=ClassWeights, kernel='rbf'))])

#Some possible values of C. Fewer values are used because this takes forever to run.
param_range = [.01, .1, 1, 10]

train_scores, test_scores = validation_curve(
                estimator=clf, 
                X=X, 
                y=y, 
                param_name='clf__C', 
                param_range=param_range,
                scoring=scorer,
                cv=cv,
                n_jobs=-1)

Once again, we plot the value of C vs the cost of the produced classifier to identify the optimal value of C


In [71]:
plot_filled(train_scores, test_scores, param_range, xlabel='Parameter Sweep')
plt.xscale('log')
plt.title("Cost for Different C values for SVM")
plt.show()


Once again, we identify that the optimal value of C for the SVM Classifier is 1. At C=1, we have the lowest cost, but we have a high variance in cost. However, the upper-bound of the cost at C=1 is still lower than the upper bound for all other values of C. Therefore, our final SVM classifier will be built using a C of 1. Next, the gamma variable will be tuned, using the same process that was used for tuning C. Note that tuning the parameters in order like this may not result in an optimal solution. However, it is the only way to tune the hyperparameters in a feasible amount of time.


In [18]:
#Note: This takes a VERY long time to run. Don't re-run it unless you have to!

#create pipeline of scaling and logistic regression
clf = Pipeline([('sca',StandardScaler()),
                ('clf',SVC(class_weight=ClassWeights, kernel='rbf', C=1))])

#Some possible values of gamma. 1/76 is default (1/num_attributes)
param_range = [1/10000, 1/1000 , 1/76, 1/10, 1]

train_scores, test_scores = validation_curve(
                estimator=clf, 
                X=X, 
                y=y, 
                param_name='clf__gamma', 
                param_range=param_range,
                scoring=scorer,
                cv=cv,
                n_jobs=-1)

In [72]:
plot_filled(train_scores, test_scores, param_range, xlabel='Parameter Sweep')
plt.xscale('log')
plt.title("Cost for Different values of gamma for SVM")
plt.show()


The above plot shows that the optimal value of gamma, in this case, is 1/1000. Note that only 5 values of gamma were evaluated, due to the significant time required to test each possible value of gamma.


In [81]:
#Note: This takes a long time to run. Don't re-run it unless you have to!

#Re-define the pipeline, this time with the tuned value of C (C=1) and gamma (gamma=0.1)
clf = Pipeline([('sca',StandardScaler()),
                ('clf',SVC(C=1, class_weight = ClassWeights, gamma=0.001))])

per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=scorer,
                                    n_jobs=-1
                                    )

SVMCosts = per_fold_eval_criteria

Once again, the area under curve metric is generated for the multi-class ROC curve.


In [82]:
per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=auc_roc_scorer,
                                    n_jobs=-1
                                    )

SVM_auc = per_fold_eval_criteria


0.817416892504
0.822900130599
0.824939500757
0.823814826357
0.820142011332
0.825981425321
0.816645390954
0.826639532116
0.824924570031
0.623288535146

In [60]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(SVMCosts)),SVMCosts)
plt.ylim([min(SVMCosts)-200,max(SVMCosts)+5])
plt.xlabel("Fold")
plt.ylabel("Cost")
plt.title("Cost per Fold for SVM")
print("Mean Cost: ", np.mean(SVMCosts))


Mean Cost:  1205.76

The above plot shows the costs of each of the 10 folds of testing data when run against a model trained with the remaining 9 folds. The costs of this model are saved for comparison against the other algorithms in later analysis. The mean cost of this model is slightly lower than the mean cost of the logistic regression model; however, this may not be statistically meaningful. That will be explored in later sections. Both models have had the most cost associated with classifying the 10th and final fold. If this continues for the remaining models, its meaning may warrant some investigation.


In [67]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(SVM_auc)),SVM_auc)

plt.xlabel("Fold")
plt.ylabel("Area Under Curve")
plt.title("AoC per Fold for SVM")
print("Mean AUC: ", np.mean(SVM_auc))


Mean AUC:  0.802669282

The ROC_AUC scores for the SVM model do show a significant decrease for the 10th fold, which is expected, considering the high cost of the model for the 10th fold.

Multi-Layer Perceptron


In [85]:
from sklearn.neural_network import MLPClassifier

In [ ]:
#Note: This takes a VERY long time to run. Don't re-run it unless you have to!

#create pipeline of scaling and neural network
clf_neural = Pipeline([('sca',StandardScaler()),
                ('clf',MLPClassifier(max_iter=1000))])

#testing different numbers of hidden layers. 100 is default
param_range = [1, 5,10, 50,100]

train_scores, test_scores = validation_curve(
                estimator=clf_neural, 
                X=X, 
                y=y, 
                param_name='clf__hidden_layer_sizes', 
                param_range=param_range,
                scoring=scorer,
                cv=cv,
                n_jobs=-1)

In [73]:
plot_filled(train_scores, test_scores, param_range, xlabel='Parameter Sweep')
plt.xscale('log')
plt.title("Cost for Different numbers of hidden layers for MLP")
plt.show()


From this graph we can see that the optimal number of hidden layers to use is 10. With our cost function, 10 hidden layers gives us the lowest cost with the least amount of variance.


In [86]:
#Note: This takes a long time to run. Don't re-run it unless you have to!

clf = Pipeline([('sca',StandardScaler()),
                ('clf',MLPClassifier(max_iter=1000, hidden_layer_sizes=10))])

per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=scorer,
                                    n_jobs=-1
                                   )
MLPCosts = per_fold_eval_criteria

In [87]:
per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=auc_roc_scorer,
                                    n_jobs=-1
                                   )
MLP_auc = per_fold_eval_criteria


0.831932677183
0.807841555319
0.829729379173
0.825582231534
0.817486830549
0.829286275465
0.826826123924
0.842430159365
0.650199658494
0.842802548927

In [62]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(MLPCosts)),MLPCosts)
plt.ylim([min(MLPCosts)-200,max(MLPCosts)+5])
plt.xlabel("Fold")
plt.ylabel("Cost")
plt.title("Cost per Fold for MLP")
print("Mean Cost: ", np.mean(MLPCosts))


Mean Cost:  1150.765

The above plot shows the costs of each of the 10 folds of testing data when run against a model trained with the remaining 9 folds. The costs of this model are saved for comparison against the other algorithms in later analysis. The mean cost of this algorithm was below both SVM and Logistic Regression's mean costs by 50.


In [66]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(MLP_auc)),MLP_auc)
plt.xlabel("Fold")
plt.ylabel("Area Under Curve")
plt.title("AoC per Fold for MLP")
print("Mean AUC: ", np.mean(MLP_auc))


Mean AUC:  0.810411745

Similarly to the SVM classifier, the 10th fold's ROC_AUC score is much lower than the other folds, which is consistent with the high cost associated with the 10th fold.

Random Forest Classifier


In [77]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

In [ ]:
#RANDOM FOREST CLASSIFIER

#create pipeline of scaling and logistic regression
clf = Pipeline([('sca',StandardScaler()),
                ('clf',RandomForestClassifier(class_weight=ClassWeights))])

#Some possible values of 
param_range = [10,25,50,100,250,500,1000]

train_scores, test_scores = validation_curve(
                    estimator=clf, 
                    X=X, 
                    y=y, 
                    param_name='clf__n_estimators', 
                    param_range=param_range,
                    scoring=scorer,
                    cv=cv,
                    n_jobs=-1)

In [78]:
plot_filled(train_scores, test_scores, param_range, xlabel='Parameter Sweep')
plt.xscale('log')
plt.title("Cost for Different numbers of estimators for Random Forest")
plt.show()


Based on the above parameter sweep plot, we can see that an ideal value of n_estimators is 250. Beyond 250, there appears to be very minimal decrease in cost, and an increase in cost variance. Therefore, 250 is identified as a good choice. Note that larger values of n_estimators also lead to greatly increased running times, so 250 is a good tradeoff between efficiency and cost minimization.


In [93]:
clf = Pipeline([('sca',StandardScaler()),
                ('clf',RandomForestClassifier(class_weight=ClassWeights, n_estimators=250))])

per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=scorer,
                                    n_jobs=-1
                                   )

RFCosts = per_fold_eval_criteria

In [79]:
per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=auc_roc_scorer,
                                    n_jobs=-1
                                   )

RF_auc = per_fold_eval_criteria


0.783116216593
0.790935844399
0.793982532328
0.797494279892
0.790951159182
0.790658416893
0.800936966774
0.806065259118
0.795203900461
0.812471306801

In [64]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(RFCosts)),RFCosts)
plt.ylim([min(RFCosts)-200,max(RFCosts)+200])
plt.xlabel("Fold")
plt.ylabel("Cost")
plt.title("Cost per Fold for Random Forest Classifier")
print("Mean Cost: ", np.mean(RFCosts))


Mean Cost:  1017.465

The above chart (which shows the costs of the 10 folds from the stratified folds) shows that, random forests, unlike the previous models, which perform worst for fold 10 performs worst for fold 1. Moreover, the mean cost from the 10 folds is significantly lower for random forests than it is for any of the other models, even tuned SVMs and MLPs. Since random forests are a somewhat novel classification algorithm, this is an unexpected result. The following section will examine whether or not the result is statistically significant.


In [65]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(RF_auc)),RF_auc)
plt.xlabel("Fold")
plt.ylabel("Area Under Curve")
plt.title("AoC per Fold for Random Forest Classifier")
print("Mean AUC: ", np.mean(RF_auc))


Mean AUC:  0.854063393

The random forests's AUC_ROC scores are consistent with its cost scores in that it performs very well across all 10 folds, unlike the other classifiers.

Classifier Comparison

We will now compare each of the classifiers to see if any of them is statistically significantly better or worse than any others, in the hopes of identifying which classifier(s) is/are the best.


In [54]:
#These are the saved costs of the tests run above.
#Run this if you don't want to re-run the classifiers

LRCosts = np.array([ 1161.3, 1230.3, 1202.35, 1229.6, 1218.95, 1247.95, 1174.75, 1208.6, 1157.9, 1359.3])
SVMCosts = np.array([ 1128.4, 1142.8, 1094.2, 1151.1, 1104.05, 1152.75,  1087.25,  1122.65,
  1107.2,   1967.2 ])
MLPCosts = np.array([ 1131.5, 1130.45, 1086.05, 1133.5, 1081.2, 1145.4, 1073.0,  1142.85,
  1093.4, 1490.3 ])
RFCosts = np.array([ 1429.75, 1003.6, 993.9, 953.2, 937.8, 949.85, 929.6, 978.2,
   999.75,   999.0  ])
ECCosts = np.array([ 1071.2 ,  1108.95,  1109.4 ,  1111.05,  1097.8 ,  1105.55,
        1046.4 ,  1086.3 ,  1086.85,  1163.3 ])

In [55]:
#These are the saved AUC of the tests run above.
#Run this if you don't want to re-run the classifiers

LR_auc = np.array([ 0.80606526,  0.79095116,  0.80093697,  0.78311622,  0.79749428,  0.79093584,
  0.79398253,  0.79065842,  0.81247131,  0.7952039 ])
SVM_auc = np.array([ 0.82598143,  0.81741689,  0.8249395,   0.82014201,  0.82290013,  0.81664539,
  0.82663953,  0.82381483,  0.82492457,  0.62328854])
MLP_auc = np.array([ 0.84243016,  0.80784156,  0.82558223,  0.82972938,  0.82682612,  0.82928628,
  0.81748683,  0.83193268,  0.84280255,  0.65019966])
RF_auc = np.array([ 0.83747838,  0.83882935,  0.84658976,  0.84024972,  0.87242755,  0.83743138,
  0.87135475,  0.87090366,  0.87515512,  0.85021426])
EC_auc = np.array([ 0.8259078,   0.81925832,  0.82215783,  0.81867119,  0.82228697,  0.82155269,
  0.82831326,  0.82261223,  0.82877846,  0.82978827])

Cost Difference

The following getDifference function is used to create a tuple of the range of possible differences of mean for two sets of costs, with 95% confidence. We use the second of the two confidence interval tests proposed in the ICA3 reversed assignment, because the datasets cannot be assumed to be independent, so the binomial approximation to the normal distribution does not hold.


In [56]:
def getDifference(cost1,cost2,z_val=2.26,size=10):
    diff12 = cost1 - cost2
    sigma12 = np.sqrt(np.sum(diff12*diff12) * 1/(size-1))
    d12 = (np.mean(diff12) + 1/(np.sqrt(size)) * z_val * sigma12, np.mean(diff12) - 1/(np.sqrt(size)) * z_val * sigma12)
    return d12

dLR_SVM = np.array(getDifference(LRCosts,SVMCosts))
dLR_MLP = np.array(getDifference(LRCosts,MLPCosts))
dLR_RF = np.array(getDifference(LRCosts,RFCosts))
dSVM_MLP = np.array(getDifference(SVMCosts,MLPCosts))
dSVM_RF = np.array(getDifference(SVMCosts,RFCosts))
dMLP_RF = np.array(getDifference(MLPCosts,RFCosts))

print("Average LR vs SVM Difference: " , dLR_SVM )
print("Average LR vs MLP Difference: " , dLR_MLP )
print("Average LR vs RF Difference:  " , dLR_RF  )
print("Average SVM vs MLP Difference:" , dSVM_MLP)
print("Average SVM vs RF Difference: " , dSVM_RF )
print("Average MLP vs RF Difference: " , dMLP_RF )


Average LR vs SVM Difference:  [ 170.64403373 -143.96403373]
Average LR vs MLP Difference:  [ 143.33155717   -6.66155717]
Average LR vs RF Difference:   [ 397.96843539    5.30156461]
Average SVM vs MLP Difference: [ 169.08200777  -59.09200777]
Average SVM vs RF Difference:  [ 451.73519301  -75.14519301]
Average MLP vs RF Difference:  [ 302.19898901  -35.59898901]

What the above confidence intervals show that there is only one pair of classifiers whose results are statistically different with 95% confidence. The Logistic Regression model has a statistically significantly higher cost than the Random Forest model. All other models cannot be statistically shown to be different from one another in terms of cost.

Area Under Multiclass ROC Curve Difference

The same analysis will be performed for the area under the ROC curves for each classifier to determine if one is statistically significantly different than the other with respect to the area under curve. Note that because of the non-equal costs in the cost matrix, it is possible for one model to be statistically different from another in terms of ROC score despite having statistically similar costs.


In [15]:
dLR_SVM = np.array(getDifference(LR_auc,SVM_auc))
dLR_MLP = np.array(getDifference(LR_auc,MLP_auc))
dLR_RF = np.array(getDifference(LR_auc,RF_auc))
dSVM_MLP = np.array(getDifference(SVM_auc,MLP_auc))
dSVM_RF = np.array(getDifference(SVM_auc,RF_auc))
dMLP_RF = np.array(getDifference(MLP_auc,RF_auc))

print("Average LR vs SVM Difference: " , dLR_SVM )
print("Average LR vs MLP Difference: " , dLR_MLP )
print("Average LR vs RF Difference:  " , dLR_RF  )
print("Average SVM vs MLP Difference:" , dSVM_MLP)
print("Average SVM vs RF Difference: " , dSVM_RF )
print("Average MLP vs RF Difference: " , dMLP_RF )


Average LR vs SVM Difference:  [ 0.03885419 -0.05182957]
Average LR vs MLP Difference:  [ 0.02766033 -0.05612064]
Average LR vs RF Difference:   [-0.01281666 -0.10294695]
Average SVM vs MLP Difference: [ 0.00242973 -0.01791466]
Average SVM vs RF Difference:  [ 0.0082021  -0.11099032]
Average MLP vs RF Difference:  [ 0.00916535 -0.09646865]

The statistical significance comparisons shown above show the same result as the cost-based comparisons. No two models are statistically significantly different other than the Logistic Regression model vs Random Forest model, where the latter outperforms the former with statistical significance.

Ensemble Classifier

An ensemble classifier is built using the VotingClassifier from sklearn.


In [97]:
from sklearn.ensemble import VotingClassifier

clf = Pipeline([('sca',StandardScaler()),
                ('clf', VotingClassifier(
                            estimators=[('lr',  LogisticRegression(C=1)), 
                                        ('svc', SVC(kernel='rbf', C=1)),
                                        ('mlp', MLPClassifier(max_iter=1000, hidden_layer_sizes=10)),
                                        ('rfc', RandomForestClassifier(n_estimators=250))], 
                            voting='hard'))
                ])

per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=scorer,
                                    n_jobs=-1
                                   )

ECCosts = per_fold_eval_criteria

Once again, the area under the ROC curve will be evaluated for the final ensemble classifier.


In [98]:
per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=auc_roc_scorer,
                                    n_jobs=-1
                                   )

EC_auc = per_fold_eval_criteria


0.822612225013
0.818671185116
0.822157834288
0.825907797789
0.828313258664
0.822286974619
0.821552690995
0.819258323579
0.828778464824
0.829788266696

In [68]:
#Plot the costs of the evaluations on each fold
plt.bar(range(len(ECCosts)),ECCosts)
plt.ylim([min(ECCosts)-200,max(ECCosts)+5])
plt.xlabel("Fold")
plt.ylabel("Cost")
plt.title("Cost per Fold for Ensemble Classifier")
print("Mean Cost: ", np.mean(ECCosts))


Mean Cost:  1098.68

The above chart of the costs of the classifiers built and tested on each of the 10 stratified folds shows that the ensemble classifier, like all of the other sklearn classifiers, has the most trouble classifying the 10th fold. We beleive that the general difficulty of classifying the 10th fold may be caused by the fact that the last few games of an NFL season can be very unusual due to teams making late-season changes to either make a last effort to make the playoffs, or to rest important players if they've guaranteed a spot in the playoffs, or have already been eliminated. The next section will see if, with any statistical significance, the ensemble classifier outperforms the individual ones.


In [69]:
#Plot the AUC of the evaluations on each fold
plt.bar(range(len(RF_auc)),RF_auc)
plt.xlabel("Fold")
plt.ylabel("Area Under Curve")
plt.title("AoC per Fold for Ensemble Classifier")
print("Mean AUC: ", np.mean(RF_auc))


Mean AUC:  0.854063393

Despite the wider range of score differences with respect to the cost metric, the above plot shows that the ROC_AUC scores for the ensemble classifier are relatively constant across the 10 folds.

Comparison of Ensemble to Original Classifiers

The same comparison method used for the original classifiers will be used for the final classifier.


In [20]:
dEC_LR = getDifference(ECCosts,LRCosts)
dEC_SVM = getDifference(ECCosts,SVMCosts)
dEC_MLP = getDifference(ECCosts,MLPCosts)
dEC_RF = getDifference(ECCosts,RFCosts)

ensemble_comparisons = np.array([dEC_LR,dEC_SVM,dEC_MLP,dEC_RF])

print(ensemble_comparisons)


[[ -24.13003694 -223.57996306]
 [  93.28904627 -314.31904627]
 [  36.34980151 -147.38980151]
 [ 197.92494928  -42.36494928]]

Each of the four confidence intervals produced from comparing the costs of the individual classifiers to the ensemble classifiers shows no statistically significant difference in costs, except for the interval for logistic regression vs ensemble. Therefore, it can be concluded with statistical significance at 95% confidence that the ensemble classifier outperforms only the logistic regression model with respect to the cost metric.

The same comparison is performed below for the area under ROC curve metric.


In [16]:
dEC_LR = getDifference(EC_auc,LR_auc)
dEC_SVM = getDifference(EC_auc,SVM_auc)
dEC_MLP = getDifference(EC_auc,MLP_auc)
dEC_RF = getDifference(EC_auc,RF_auc)

ensemble_comparisons = np.array([dEC_LR,dEC_SVM,dEC_MLP,dEC_RF])

print(ensemble_comparisons)


[[ 0.04922206  0.00628016]
 [ 0.07048958 -0.02796274]
 [ 0.05697372 -0.0299318 ]
 [-0.00503404 -0.05522734]]

Interestingly, the above confidence intervals show two statistically different pairs of models. First, the ensemble classifier is significantly better than the logistic regression classifier, with 95% confidence. Second, the random forest classifier is significantly better than the ensemble classifier with respect to the area under the curve score, with 95% confidence. This information can lead to a reasonable conclusion that the random forest model is, in fact, the best model, as it statistically outperforms the ensemble model, and is very inexpensive to compute compared to the other models.

Deployment

Given the performance of all of the algorithms tested above with respect to both cost and area under roc curve, we think that the best business decision would be to deploy the random-forest based model, as the random forest model was the only one to outperform any other model with statistical significance, and it is by far the cheapest model to train. Given that the mean ROC_AUC score for the random forest model was approximately 0.85, we are confident that our model can provide meaningful insight to football organizations, and we think that it has considerable value to a defensive coordinator. However, we do not beleive that our model can in any way replace a defensive coordinator, because it fails to consider a number of factors which exist in a real-game situation.

The model only considers basic metrics of the game, and does not in any way consider the intricacies of a football game, such as the location of the game, the weather, the crowd, any injuries that may exist on a team, and so on.

Considering the available data to train the model, we beleive that our model is high-performing and viable for use in an actual professional football setting. It can, with reasonable 'accuracy,' predict what type of play will be called in a given game situation.

In order to ultimately determine the viability of our model, we would like to implement it alongside a defensive coordinator, and keep track of the predictions of the defensive coordinator and the predictions of the model, and see if one is better than the other, with statistical significance.

An Additional Classifier (in R)

For the sake of comparison, one non-sklearn classifier will be used on the dataset. An association-rule-based classifier from R package arulesCBA is used to train a model on the dataset and predict one fold of the dataset in the following R code:

df <- read.csv("data/cleaned.csv", stringsAsFactors = TRUE)
df <- df[c('GameID', 'Drive', 'qtr', 'down','TimeSecs', 'yrdline100','ydstogo','ydsnet','GoalToGo','posteam','DefensiveTeam','PosTeamScore','ScoreDiff', 'PlayType')]

df <- df[(df$PlayType != "Sack"),]
df <- df[(df$PlayType != "No Play"),]
df <- df[(df$PlayType != "QB Kneel"),]
df <- df[(df$PlayType != "Spike"),]
df$PlayType <- factor(df$PlayType)

df[c("GameID", "Drive","qtr","down","GoalToGo","posteam","DefensiveTeam","PlayType")] <- sapply(df[c("GameID", "Drive","qtr","down","GoalToGo","posteam","DefensiveTeam","PlayType")], as.factor)

library(arulesCBA)

df[c("TimeSecs", "yrdline100","ydstogo","ydsnet","PosTeamScore","ScoreDiff")] <- sapply(df[c("TimeSecs", "yrdline100","ydstogo","ydsnet","PosTeamScore","ScoreDiff")], function(x) discretize(x, categories = 5))
df <- as.data.frame(sapply(df, as.factor))

classifier <- CBA(df[1:34740,], "PlayType", support=0.04, maxtime=0.05, confidence = 0.80)
results <- predict(classifier, df[34741:38600,])

library(caret)
conf <- confusionMatrix(results, df$PlayType[34741:38600])[[2]]

write.csv(conf, file="data/confusion.csv", row.names = FALSE, col.names = FALSE)

The arulesCBA classifier was used because we beleived that a rule-based classification strategy would work well for this classification problem. Below, the cost of the model for a single fold is calculated. Note that because stratified folding was not used, and only one fold was evaluated, we can't compare this model to the existing models with statistical significance.


In [34]:
confMat = pd.read_csv('data/confusion.csv')
print("Cost:",np.sum(np.multiply(confMat.values, cost_mat)))


Cost: 1671.4

The cost of the resulting classifier is 1,671. This is above average for each of the individual classifiers built above by about 15%. The rule-based classifier may have produced better results if it weren't for the memory constraints on the way it's currently implemented in R. A novel conclusion is that the rule-based classifier, in this form, is not as strong as the preceding classifiers. However, take note that the hyper-parameters in the CBA algorithm could be tuned to potentially significantly increase the effectiveness of the algorithm if the implementation were more efficient.

A more insightful conclusion to draw is that R is clearly far superior to Python, and whoever wrote arulesCBA is an awful developer.