Inferential Statistics Project

In this notebook I perform inferential statistical tests on a dataset of NCAA tournament games spanning 10 seasons (2003-2012). Many games in the NCAA tournament take place between a team that is heavily favored to win ("the favorite") and a lower-regarded team that fans, experts, and oddsmakers expect to have a much lower chance of winning ("the underdog"). The quality of each team is indicated by its "seed", a ranking from 1 (best) to 16 (worst) assigned prior to the tournament. An "upset" occurs when an underdog with a much higher numerical seed beats a favorite with a lower numerical seed. For this dataset I selected only games with a seed differential of at least 4 slots: for example, a 1-seed playing a team seeded 5 or higher, or a 2-seed playing a team seeded 6 or higher. Every game in this dataset therefore had upset potential, and the goal is to predict upsets with a reasonable degree of accuracy.
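To make the selection rule concrete, here is a minimal sketch (not the actual data pipeline) of how the seed-differential filter and the upset label could be derived. The u_seed/f_seed names follow the dataset's conventions, while u_score/f_score are hypothetical columns used only for illustration.

import pandas as pd

# Illustrative sketch of the selection rule: one row per game, with the
# favorite's and underdog's seeds and final scores (the scores are made up).
games = pd.DataFrame({'f_seed': [1, 2, 3], 'u_seed': [8, 7, 6],
                      'f_score': [70, 65, 80], 'u_score': [72, 60, 75]})

games['seed_diff'] = games['u_seed'] - games['f_seed']
games = games[games['seed_diff'] >= 4].copy()                        # keep only games with upset potential
games['upset'] = (games['u_score'] > games['f_score']).astype(int)   # 1 if the underdog won
print(games[['f_seed', 'u_seed', 'seed_diff', 'upset']])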

Using data on the two teams involved in each game ("features"), I will build statistical models to predict upsets. I've collected around 70 features per team describing seed level, average offensive and defensive performance, distance travelled to the game, the coach's tournament experience, and other team-specific factors.


In [1]:
# import packages used in the notebook
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, roc_auc_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler

In [2]:
# setup for plots
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('ggplot')

# create a list of colors from matplotlib ggplot style
colors = [c['color'] for c in plt.rcParams['axes.prop_cycle']]
b = colors[1]
r = colors[0]
g = colors[5]
gry = colors[3]

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# load and subset the data
# selecting seasons 2003-2012
path = "C:\Users\mworley\Dropbox\capstone\data"
mats_uppo = pd.read_csv(path + r'\processed\matchups_upsetpot.csv')
data = mats_uppo[(mats_uppo.season > 2002) & (mats_uppo.season < 2013)].copy()

In [4]:
# split labels from features, split into training and test set
labels = data['upset'].reset_index(drop=True)
feats = data.drop(['upset', 'upsetpot', 'daynum', 'season', 'f_scoremarg', 'u_scoremarg',
                   'u_team_id', 'f_team_id'], axis=1).reset_index(drop=True)

x_train, x_test, y_train, y_test = train_test_split(feats, labels,
                                                    test_size=0.2,
                                                    random_state=0)

# standardize features
x_trainst = x_train.copy()
for col in x_trainst.columns:
    x_trainst[col] = (x_trainst[col] - np.mean(x_trainst[col])) / np.std(x_trainst[col])
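The loop above z-scores each feature using the training-set mean and standard deviation. As a quick sanity check (a sketch, not part of the original analysis), the same transformation can be reproduced with scikit-learn's StandardScaler, which I use later to standardize the test set.

# Sketch: StandardScaler reproduces the manual z-scoring above, since both divide
# by the population standard deviation.  Prints True as long as no column has zero variance.
sc_check = StandardScaler().fit(x_train)
print(np.allclose(sc_check.transform(x_train), x_trainst.values))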

In [5]:
# confirm that the training set is stratified similarly to the full set with respect to upsets
print data.groupby('upset')['upset'].count() / len(data)
print float(np.sum(y_train)) / float(len(y_train))


upset
0    0.800971
1    0.199029
Name: upset, dtype: float64
0.179331306991

The full set of games consists of 20% upsets, while the training set is 18% upsets.

How many features does the dataset include? How large is the training set?


In [6]:
print "Number of features: %d" % (len(x_trainst.columns))
print "Number of examples: %d" % (len(x_trainst))


Number of features: 163
Number of examples: 329

There are 163 features and 329 training cases ("examples") in the training set.

Through intuitive understanding of basketball and a prior data visualization project, I have an idea of approximately 20 features I'd like to examine in statistical models.


In [7]:
# %% set a list of features to test
feat_list = ['u_seed', 'f_seed', 'u_adjoe', 'u_adjde', 'u_em', 'f_adjoe', 
            'f_adjde', 'f_em', 'u_dist', 'f_dist', 'u_dstdelt', 'u_texp', 'f_texp',
            'u_orbtomarg', 'f_orbtomarg', 'u_sos', 'f_sos', 'u_cvisits',
            'f_cvisits', 'u_cfar', 'f_cfar']

# %% create a dictionary for features
feat_dict = {'u_seed': "Underdog Seed",
             'f_seed': 'Favorite Seed',
             'u_adjoe': 'Underdog Offense',
             'u_adjde': 'Underdog Defense',
             'u_em': 'Underdog Margin',
             'f_adjoe': 'Favorite Offense',
             'f_adjde': 'Favorite Defense',
             'f_em': 'Favorite Margin',
             'u_dist': 'Underdog Distance',
             'f_dist': 'Favorite Distance',
             'u_dstdelt': 'Distance Difference',
             'u_texp': 'Underdog Experience',
             'f_texp': 'Favorite Experience',
             'u_orbtomarg': 'Underdog ORB/TO Margin',
             'f_orbtomarg': 'Favorite ORB/TO Margin',
             'u_sos': 'Underdog Schedule',
             'f_sos': 'Favorite Schedule',
             'u_cvisits': 'Underdog Coach Experience',
             'f_cvisits': 'Favorite Coach Experience',
             'u_cfar': 'Underdog Coach Success',
             'f_cfar': 'Favorite Coach Success'}

These include 10 features on each of the underdog and favorite involved in each game, plus a 21st feature (Distance Difference) computed as Underdog Travel Distance minus Favorite Travel Distance. By printing the summary statistics, we can confirm there are no missing data or extreme outlier values for any of the features.


In [8]:
print x_train[feat_list].describe()


           u_seed      f_seed     u_adjoe     u_adjde        u_em     f_adjoe  \
count  329.000000  329.000000  329.000000  329.000000  329.000000  329.000000   
mean    11.638298    3.003040  105.250689   94.313704   10.936942  111.799766   
std      2.945624    1.717021    4.981474    4.447976    6.940847    3.663976   
min      5.000000    1.000000   88.155400   84.867400  -14.628600  100.651000   
25%     10.000000    1.000000  102.138000   91.344400    6.697500  109.364000   
50%     12.000000    3.000000  105.684000   93.825100   12.643300  111.688000   
75%     14.000000    4.000000  108.764000   97.272200   16.154200  114.502000   
max     16.000000    7.000000  117.164000  111.451000   25.179300  121.987000   

          f_adjde        f_em       u_dist       f_dist    u_dstdelt  \
count  329.000000  329.000000   329.000000   329.000000   329.000000   
mean    88.678858   23.120907   911.234043   749.264438   161.969605   
std      3.436152    4.232619   606.321091   609.849372   695.977470   
min     78.209600   13.165100    14.000000     0.000000 -2182.000000   
25%     86.180900   20.211000   430.000000   268.000000  -183.000000   
50%     88.286500   23.033800   757.000000   579.000000   128.000000   
75%     91.180300   25.779700  1353.000000  1080.000000   509.000000   
max     99.630100   33.213700  2638.000000  2517.000000  2021.000000   

           u_texp      f_texp  u_orbtomarg  f_orbtomarg       u_sos  \
count  329.000000  329.000000   329.000000   329.000000  329.000000   
mean     0.628566    1.805221     1.543785     2.481467    1.620608   
std      0.994026    1.406055     2.851658     2.563652    4.923999   
min      0.000000    0.000000    -5.100000    -4.937500  -10.780000   
25%      0.000000    0.842626    -0.533300     0.689700   -2.070000   
50%      0.000000    1.690943     1.718800     2.343800    1.510000   
75%      0.930844    2.679878     3.470600     4.428600    5.960000   
max      5.314445    5.577663     8.161300     8.264700   11.410000   

            f_sos   u_cvisits   f_cvisits      u_cfar      f_cfar  
count  329.000000  329.000000  329.000000  329.000000  329.000000  
mean     7.691945    2.747720    8.784195    1.686930    4.279635  
std      1.814940    3.655959    6.179154    1.783011    1.741069  
min     -2.600000    0.000000    0.000000    0.000000    0.000000  
25%      6.820000    0.000000    3.000000    0.000000    3.000000  
50%      7.860000    1.000000    7.000000    1.000000    5.000000  
75%      8.810000    4.000000   13.000000    3.000000    6.000000  
max     12.040000   21.000000   26.000000    6.000000    6.000000  

Models that include many strongly correlated features can produce unreliable coefficient estimates. Below I plot a heat map of the feature correlation matrix to examine the degree of association between the features in this dataset. In this plot blue cells indicate negative correlations and red cells indicate positive correlations, with darker colors indicating stronger correlations.


In [9]:
# %%
# heatmap of correlation matrix
df = x_trainst[feat_list].copy()
for col in list(df.columns):
    df.rename(columns={col: feat_dict[col]}, inplace=True)
corrmat = df.corr()
plt.figure(figsize=(9, 9))
ax = plt.axes()
ax.tick_params(labelsize=12)
#sns.set(font_scale=0.8)
sns.heatmap(corrmat, square=True) #fmt='.2f',annot=True)
plt.show()


Some of the strongest correlations are due to the features being calculated from the same data and representative of a very similar construct. For example, team margin is strongly positively correlated with team offense and strongly negatively correlated with team defense. This is expected since team margin is computed as the difference between offense (points scored) and defense (points allowed).

  • Schedule strength is positively correlated with offense and margin, and negatively correlated with defense (but only for underdogs). The offensive and defensive metrics are adjusted for schedule strength, so this pattern is expected.
  • Margins are negatively correlated with seed, supporting the expectation that teams who win by more points receive more favorable seeds for the tournament.

Other interesting correlations:

  • Positive correlation between underdog schedule and margin - underdogs who schedule tougher opponents also win by greater differentials. This may be because the typical teams in "underdog" roles in the tournament are dominant small or mid-level conference teams who also schedule a few games against tougher opponents each season.
  • Underdog coach experience/success is positively correlated with schedule - underdogs led by coaches with more tournament experience/success also schedule tougher opponents.

Now I want to examine the relationship between these features and upsets. I'll use logistic regression, which models the probability of a binary outcome (upset or not) as a function of numeric features.

A reasonable first step is to examine "univariate" models where each feature is tested as a sole predictor of upsets. To do this systematically, I'll run a separate logistic regression for each feature, save the results of each model in a dataset, and inspect the complete dataset of results. For each model I will save the odds ratio and p-value for the feature.
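For reference, logistic regression maps a linear combination of the features to a probability through the logistic (sigmoid) function, so each coefficient acts on the log-odds scale. The sketch below is purely illustrative; the coefficient value is made up.

# Reference sketch: the logistic (sigmoid) link used by sm.Logit below maps any
# log-odds value to a probability between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))     # 0.5: zero log-odds corresponds to a 50/50 game
print(sigmoid(-0.31))   # ~0.42: negative log-odds lowers the upset probability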


In [10]:
# set up empty results dataframe
rescols = ['feature', 'odds_ratio', 'p_value', 'accuracy', 'precision', 'recall',
           'f1', 'roc_auc']
results = pd.DataFrame(columns=rescols)
threshold = 0.5

# loop over list of features, fit model, append results to results dataframe
for f in feat_list:
    logit = sm.Logit(y_train, x_trainst[f])
    result = logit.fit(disp=False)
    prob = result.predict(x_trainst[f])
    pred = np.array(prob > threshold, dtype=float)
    n = feat_dict[f]
    o = round(np.exp(result.params.values), 2)
    p = round(result.pvalues.values, 3)
    s1 = round(accuracy_score(y_train, pred), 2)
    s2 = round(precision_score(y_train, pred), 2)
    s3 = round(recall_score(y_train, pred), 2)
    s4 = round(f1_score(y_train, pred), 2)
    s5 = round(roc_auc_score(y_train, pred), 2)
    l = [n, o, p, s1, s2, s3, s4 , s5]
    results = results.append(pd.Series(l, index=rescols), ignore_index=True)

In [11]:
results.sort_values('odds_ratio')


Out[11]:
    feature                    odds_ratio  p_value  accuracy  precision  recall    f1  roc_auc
7   Favorite Margin                  0.73    0.007      0.53       0.23    0.66  0.34     0.58
3   Underdog Defense                 0.76    0.017      0.51       0.21    0.64  0.32     0.56
14  Favorite ORB/TO Margin           0.77    0.017      0.55       0.24    0.68  0.35     0.60
20  Favorite Coach Success           0.79    0.034      0.57       0.24    0.66  0.36     0.61
16  Favorite Schedule                0.81    0.060      0.59       0.23    0.56  0.33     0.58
0   Underdog Seed                    0.82    0.069      0.59       0.25    0.63  0.36     0.61
12  Favorite Experience              0.84    0.113      0.49       0.21    0.66  0.32     0.56
5   Favorite Offense                 0.88    0.256      0.53       0.22    0.61  0.32     0.56
18  Favorite Coach Experience        0.90    0.351      0.53       0.23    0.69  0.34     0.59
17  Underdog Coach Experience        0.94    0.606      0.40       0.18    0.64  0.28     0.50
10  Distance Difference              0.94    0.599      0.50       0.19    0.56  0.29     0.53
8   Underdog Distance                0.95    0.656      0.45       0.18    0.58  0.27     0.50
9   Favorite Distance                1.02    0.874      0.60       0.22    0.47  0.30     0.55
19  Underdog Coach Success           1.05    0.689      0.57       0.19    0.44  0.27     0.52
11  Underdog Experience              1.10    0.395      0.57       0.18    0.41  0.25     0.50
2   Underdog Offense                 1.18    0.140      0.50       0.19    0.58  0.29     0.53
6   Favorite Defense                 1.27    0.031      0.58       0.23    0.56  0.32     0.57
15  Underdog Schedule                1.27    0.031      0.57       0.25    0.68  0.36     0.61
13  Underdog ORB/TO Margin           1.33    0.012      0.53       0.24    0.71  0.35     0.60
4   Underdog Margin                  1.34    0.010      0.50       0.23    0.76  0.36     0.61
1   Favorite Seed                    1.36    0.006      0.65       0.27    0.54  0.36     0.61

Features with an odds ratio below 1 have a negative association with upsets (higher values are associated with fewer upsets), while features with an odds ratio above 1 have a positive association with upsets (higher values are associated with more upsets).
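As a worked example of this interpretation (a sketch; the coefficient below is illustrative, chosen to match the Favorite Margin odds ratio in the table): the odds ratio is the exponential of the logistic regression coefficient, so for a standardized feature it gives the multiplicative change in the odds of an upset per one-standard-deviation increase in that feature.

# Sketch: converting a coefficient on a standardized feature into an odds ratio.
# A coefficient of about -0.31 gives exp(-0.31) ~ 0.73, i.e. each additional standard
# deviation of favorite scoring margin multiplies the odds of an upset by ~0.73.
coef = -0.31
odds_ratio = np.exp(coef)
print('Odds ratio: %.2f' % odds_ratio)
print('Change in odds per +1 SD: %.0f%%' % ((odds_ratio - 1) * 100))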

Here are the characteristics of favorite teams with the strongest association with upsets:

  • Lower scoring margin: typically outscore their opponents by a relatively small margin
  • Lower offensive rebound & turnover margin: have less of an advantage in rebounds and turnovers than their opponents
  • Lower coach success: led by coaches who haven't won a large number of games in a previous tournament
  • Lower strength of schedule: played a relatively easy schedule prior to the tournament
  • Higher defense: typically allow a higher number of points scored by their opponents
  • Higher seed: assigned a higher seed value for the tournament

Underdogs who sprung upsets on favorites were likely to have these characteristics:

  • Lower defense: typically allow relatively low scoring by their opponents
  • Lower seed: assigned a lower seed value for the tournament
  • Higher schedule: played a relatively difficult schedule prior to the tournament
  • Higher offensive rebound & turnover margin: typically have a greater advantage in rebounds and turnovers than their opponents
  • Higher scoring margin: typically outscore their opponents by a relatively large margin

All of these results make sense. The features for favorites reflect a lower quality team, while the features for underdogs reflect a higher quality team.

The goal is to use these results to predict the outcomes of games. After fitting a model, I'd like to use it to predict upset probabilities and inspect how well these predicted probabilities are aligned with the true outcomes of the games.

First I will inspect predictions for simple single-feature models, before incorporating multiple features into the model predictions. My intuition is that including more features should improve the predictions.

Below I display model predictions for the logistic regression of upsets as predicted by favorite scoring margin.


In [12]:
# re-run logistic model for a feature
logit = sm.Logit(y_train, x_trainst['f_em'])
result = logit.fit(disp=False)
df = x_train.copy()
threshold = 0.5
df['prb'] = result.predict(x_trainst['f_em'])
df['pred'] = np.array(df['prb'] > threshold, dtype=float)
df['lab'] = y_train

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, sharex=True, figsize=(8, 5))
fig.subplots_adjust(top=0.85)  # leave room for the suptitle
st = fig.suptitle('Upset Probabilities', fontsize=20)
x1 = df.loc[df.lab == 0, 'f_em']
y1 = df.loc[df.lab == 0, 'prb']
x2 = df.loc[df.lab == 1, 'f_em']
y2 = df.loc[df.lab == 1, 'prb']
ax1.tick_params(labelsize=14)
ax1.scatter(x1, y1, s=40, marker='v', color=b, alpha=0.7)
ax1.set_title('True Non-Upsets', fontsize=18)
ax1.set_xlabel('Favorite Margin', fontsize=18)
ax1.set_ylabel('Probability', fontsize=18)
ax1.set_xlim([np.min(df['f_em']) - 1, np.max(df['f_em']) + 1])
ax1.set_ylim(0, 1)
ax1.axhline(y=0.5, color='black', linewidth=2)
ax2.scatter(x2, y2, s=40, marker='^', color=r, alpha=0.7)
ax2.set_title('True Upsets', fontsize=18)
ax2.set_xlabel('Favorite Margin', fontsize=18)
ax2.tick_params(labelsize=14)
ax2.axhline(y=0.5, color='black', linewidth=2)
plt.tight_layout()
plt.show()


The predicted probabilities reflect the model's assumptions and the use of only a single feature: each probability is a smooth, monotonic (logistic) function of that one feature.

When using the model to classify games, probabilities above a threshold (typically 0.50) are classified as upsets, while probabilities below the threshold are classified as non-upsets. Probabilities for the true upsets (right panel) are mostly above 0.5, meaning the model does predict an upset for most of the games that were true upsets. But the probabilities for the true non-upsets (left panel) are fairly evenly distributed above and below 0.5, so the model classifies many true non-upsets as upsets (lots of false positives). This would be a problem if one were using the predictions to bet on upsets, as it would produce many losing bets that might not be offset by the winning bets.
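One way to quantify this tradeoff (a sketch, not part of the original analysis) is to sweep the classification threshold for the fitted single-feature model and watch precision and recall move in opposite directions; df['prb'] and y_train are the predicted probabilities and labels from the cells above.

# Sketch: precision and recall for the favorite-margin model at several thresholds.
# Raising the threshold trades recall (fewer upsets caught) for precision (fewer false positives).
for t in [0.4, 0.5, 0.6, 0.7]:
    pred_t = np.array(df['prb'] > t, dtype=float)
    print('threshold %.1f: precision %.2f, recall %.2f' % (
        t, precision_score(y_train, pred_t), recall_score(y_train, pred_t)))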

Below I print some model scores to examine some objective indicators of model performance in predicting upsets.


In [13]:
preds1 = df['pred']

print 'Accuracy score: %r' % (accuracy_score(y_train, preds1 ))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, preds1 ))
print(classification_report(y_train, preds1 ))
print(confusion_matrix(y_train, preds1))


Accuracy score: 0.53191489361702127
ROC AUC score: 0.58236032642812308
             precision    recall  f1-score   support

          0       0.87      0.50      0.64       270
          1       0.23      0.66      0.34        59

avg / total       0.76      0.53      0.58       329

[[136 134]
 [ 20  39]]

As suggested by the plot, the model does predict the majority of upsets as upsets (recall = .66), but also predicts many false positives as reflected by the low precision (.23).

Below are model predictions for the logistic regression of upsets as predicted by underdog scoring margin.


In [14]:
# re-run logistic model for a feature
logit = sm.Logit(y_train, x_trainst['u_em'])
result = logit.fit(disp=False)
df = x_train.copy()
df['prb'] = result.predict(x_trainst['u_em'])
df['pred'] = np.array(df['prb'] > threshold, dtype=float)
df['lab'] = y_train

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, sharex=True, figsize=(8, 5))
fig.subplots_adjust(top=0.85)  # leave room for the suptitle
st = fig.suptitle('Upset Probabilities', fontsize=20)
x1 = df.loc[df.lab == 0, 'u_em']
y1 = df.loc[df.lab == 0, 'prb']
x2 = df.loc[df.lab == 1, 'u_em']
y2 = df.loc[df.lab == 1, 'prb']
ax1.tick_params(labelsize=14)
ax1.scatter(x1, y1, s=40, marker='v', color=b, alpha=0.7)
ax1.set_title('True Non-Upsets', fontsize=18)
ax1.set_xlabel('Underdog Margin', fontsize=18)
ax1.set_ylabel('Probability', fontsize=18)
ax1.set_xlim([np.min(df['u_em']) - 1, np.max(df['u_em']) + 1])
ax1.set_ylim(0, 1)
ax1.axhline(y=0.5, color='black', linewidth=2)
ax2.scatter(x2, y2, s=40, marker='^', color=r, alpha=0.7)
ax2.set_title('True Upsets', fontsize=18)
ax2.set_xlabel('Underdog Margin', fontsize=18)
ax2.tick_params(labelsize=14)
ax2.axhline(y=0.5, color='black', linewidth=2)
plt.tight_layout()
plt.show()


With underdog margin as the model feature, the model would classify many non-upsets as upsets (probabilities above .50). Both the favorite and underdog margin models make many of these "false positive" errors.


In [15]:
preds2 = df['pred']

print 'Accuracy score: %r' % (accuracy_score(y_train, df['pred']))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, df['pred']))
print(classification_report(y_train, df['pred']))
print(confusion_matrix(y_train, df['pred']))


Accuracy score: 0.50455927051671734
ROC AUC score: 0.60543000627746391
             precision    recall  f1-score   support

          0       0.90      0.45      0.60       270
          1       0.23      0.76      0.36        59

avg / total       0.78      0.50      0.55       329

[[121 149]
 [ 14  45]]

The underdog margin model is slightly better on recall (.76) for upsets and exactly the same on precision (.23). Inspecting the confusion matrix reveals the underdog margin model is getting more of the true upsets right (45 compared to 39) but also has more false positives for non-upsets (149 to 134).

Combining these features into a single model may improve the predictions beyond those generated by the models that include either feature alone.


In [16]:
# re-run logistic model for multiple features
logit = sm.Logit(y_train, x_trainst[['f_em', 'u_em']])
result = logit.fit(disp=False)
preds3 = np.array(result.predict(x_trainst[['f_em', 'u_em']]) > threshold, dtype=float)

print 'Accuracy score: %r' % (accuracy_score(y_train, preds3))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, preds3))
print(classification_report(y_train, preds3))
print(confusion_matrix(y_train, preds3))


Accuracy score: 0.55319148936170215
ROC AUC score: 0.66155053358443183
             precision    recall  f1-score   support

          0       0.93      0.49      0.64       270
          1       0.26      0.83      0.40        59

avg / total       0.81      0.55      0.60       329

[[133 137]
 [ 10  49]]

This model produces the best scores yet, but is still predicting too many upsets. Most of the true upsets are predicted as upsets (recall = .83), but only around 1/4 of the predicted upsets are true upsets (precision = .26).

It's possible that adding more features to the model will improve the predictions. Next I'll incorporate the 11 features with the strongest univariate associations with upsets (p-values below .07 in the results table above). Below is a summary of the model results.


In [17]:
dfcols = ['f_em', 'u_em', 'f_seed', 'u_seed', 'u_orbtomarg', 
          'f_orbtomarg', 'f_adjde', 'u_adjde', 'u_sos', 
          'f_sos', 'f_cfar']
train = x_trainst[dfcols].copy()
# re-run logistic model for a feature
logit = sm.Logit(y_train, train)
result = logit.fit(disp=False)
print result.summary()
print pd.Series(np.exp(result.params.values), name='Odds Ratios')


                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  upset   No. Observations:                  329
Model:                          Logit   Df Residuals:                      318
Method:                           MLE   Df Model:                           10
Date:                Mon, 08 May 2017   Pseudo R-squ.:                 -0.3908
Time:                        13:22:43   Log-Likelihood:                -215.23
converged:                       True   LL-Null:                       -154.75
                                        LLR p-value:                     1.000
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
f_em           -0.0659      0.200     -0.330      0.741      -0.457       0.325
u_em           -0.1280      0.261     -0.490      0.624      -0.640       0.384
f_seed          0.1363      0.177      0.768      0.443      -0.212       0.484
u_seed         -0.1379      0.226     -0.609      0.542      -0.582       0.306
u_orbtomarg     0.2462      0.129      1.914      0.056      -0.006       0.498
f_orbtomarg    -0.2021      0.122     -1.663      0.096      -0.440       0.036
f_adjde         0.0872      0.143      0.610      0.542      -0.193       0.367
u_adjde        -0.1034      0.163     -0.634      0.526      -0.423       0.216
u_sos           0.1833      0.213      0.860      0.390      -0.235       0.601
f_sos          -0.1582      0.124     -1.273      0.203      -0.402       0.085
f_cfar         -0.1022      0.131     -0.782      0.434      -0.358       0.154
===============================================================================
0     0.936207
1     0.879847
2     1.145989
3     0.871186
4     1.279111
5     0.816975
6     1.091066
7     0.901794
8     1.201145
9     0.853685
10    0.902852
Name: Odds Ratios, dtype: float64

In this model the odds ratios and p-values have changed from those above when each feature was used in the model by itself. These changes reflect the correlations between the features and the overlap in their association with upsets.

Below I print the classification scores for this model. For comparison I also print the confusion matrix and scores for the model with favorite/underdog margins.


In [18]:
preds4 = np.array(result.predict(train) > threshold, dtype=float)
print '2-feature model'
print 'Accuracy score: %r' % (accuracy_score(y_train, preds3))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, preds3))
print (classification_report(y_train, preds3))
print(confusion_matrix(y_train, preds3))
print '11-feature model'
print 'Accuracy score: %r' % (accuracy_score(y_train, preds4))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, preds4))
print(classification_report(y_train, preds4))
print(confusion_matrix(y_train, preds4))


2-feature model
Accuracy score: 0.55319148936170215
ROC AUC score: 0.66155053358443183
             precision    recall  f1-score   support

          0       0.93      0.49      0.64       270
          1       0.26      0.83      0.40        59

avg / total       0.81      0.55      0.60       329

[[133 137]
 [ 10  49]]
11-feature model
Accuracy score: 0.60182370820668696
ROC AUC score: 0.67793471437539232
             precision    recall  f1-score   support

          0       0.93      0.56      0.70       270
          1       0.28      0.80      0.42        59

avg / total       0.81      0.60      0.65       329

[[151 119]
 [ 12  47]]

The model with all 11 features has improved accuracy, precision, and f1-score, at the cost of slightly lower upset recall (.80 vs .83). The most noticeable improvement is the reduction in false positives (from 137 to 119). Including the additional features has improved the balance between precision and recall, as shown by the increase in the weighted-average f1-score (.60 to .65).

So far I've only inspected how well each model classifies the data used to train it. Ultimately, the goal is to use the model to make predictions on unseen data (new games that haven't been played yet). The scores examined so far likely overestimate that performance.

Here I use 5-fold cross-validation to gauge how well the model might classify unseen data.


In [19]:
# set high C parameter value to examine "non-regularized model"
clf = LogisticRegression(random_state=0, n_jobs=-1, C=100000)
scores = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
df_noreg = pd.DataFrame()

for scr in scores:
    results = cross_val_score(clf, train, y_train, cv=5, scoring=scr)
    s = pd.Series(results, name=scr)
    df_noreg[scr] = s

print(df_noreg)


   accuracy  precision    recall        f1   roc_auc
0  0.818182   0.500000  0.166667  0.250000  0.719136
1  0.818182   0.500000  0.166667  0.250000  0.679012
2  0.818182   0.500000  0.166667  0.250000  0.675926
3  0.833333   1.000000  0.083333  0.153846  0.864198
4  0.815385   0.333333  0.090909  0.142857  0.641414

Each row of the dataframe contains the scores for one of the 5 held-out folds, each of which is about 1/5 of the training set. To compare with the training scores, I'll take the means across folds.


In [20]:
print "Training set scores"
print 'Accuracy score: %r' % (accuracy_score(y_train, preds3))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, preds3))
print (classification_report(y_train, preds3))
print "Cross-validation scores"
print df_noreg.mean()


Training set scores
Accuracy score: 0.55319148936170215
ROC AUC score: 0.66155053358443183
             precision    recall  f1-score   support

          0       0.93      0.49      0.64       270
          1       0.26      0.83      0.40        59

avg / total       0.81      0.55      0.60       329

Cross-validation scores
accuracy     0.820653
precision    0.566667
recall       0.134848
f1           0.209341
roc_auc      0.715937
dtype: float64

While the accuracy, ROC AUC, and upset precision scores are better for the cross-validation folds, the recall and f1 scores are much worse. One should be skeptical of the accuracy scores in particular for this dataset, as it is heavily imbalanced (80% of games are non-upsets). Based on the cross-validation results, in a real-life setting this model would be expected to correctly identify only about 13% of the actual upsets, and only about 57% of the games it predicts as upsets would be actual upsets.
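To make the imbalance point concrete, here is a sketch (not part of the original analysis) of a trivial baseline that always predicts the majority class; it reaches roughly the same cross-validated accuracy while never identifying a single upset.

from sklearn.dummy import DummyClassifier

# Sketch: a classifier that always predicts "no upset" scores about 0.82 accuracy
# (the non-upset base rate) on this training set but has zero recall for upsets.
baseline = DummyClassifier(strategy='most_frequent')
print(cross_val_score(baseline, train, y_train, cv=5, scoring='accuracy').mean())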

Prediction of unseen data can likely be improved by tweaking the hyperparameters of the model. Because I don't know ahead of time which hyperparameter values are likely to be the best for the model, I'll set up a process for the model to search over a grid that I provide.


In [21]:
# %% define a classifier
clf = LogisticRegression(random_state=0, n_jobs=-1)

# %% set up hyperparameter grid
c_params = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
pen_params = ('l1', 'l2')
params = {'penalty':pen_params, 'C':c_params}
paramcols = map(lambda x: 'param_' + x, params.keys())

Above I provide a range of 9 values for the logistic regression C parameter (the inverse of regularization strength, so smaller values mean stronger regularization), along with the two penalty types, L1 and L2. Below I set up a loop over the 5 scores I've tracked thus far (accuracy, precision, recall, f1, ROC AUC). For each score type, the loop will find the best hyperparameter set for that score, print the score and parameters, and save the results in a dataframe.


In [22]:
# %% set up empty results dataframe
results = pd.DataFrame(columns=['mean_test_score',
                                'mean_train_score',
                                'param_C',
                                'param_penalty',
                                'params',
                                'rank_test_score'])

# %% list of scores to iterate over
scores = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
for scr in scores:
    gs = GridSearchCV(estimator=clf,
                      param_grid=params,
                      scoring=scr,
                      cv=5,
                      n_jobs=-1)
    gsf = gs.fit(train, y_train)
    print "Best %s: %r"  % (scr, round(gsf.best_score_, 3))
    print "Best parameters: %s" % (str(gsf.best_params_))
    df = pd.DataFrame.from_dict(gsf.cv_results_)
    res_temp = df.ix[:,'mean_test_score': 'rank_test_score']
    res_temp['scoring'] = scr
    results = pd.concat([results, res_temp])


Best accuracy: 0.83
Best parameters: {'penalty': 'l2', 'C': 0.1}
Best precision: 0.667
Best parameters: {'penalty': 'l2', 'C': 0.1}
Best recall: 0.373
Best parameters: {'penalty': 'l2', 'C': 0.0001}
Best f1: 0.378
Best parameters: {'penalty': 'l2', 'C': 0.0001}
Best roc_auc: 0.738
Best parameters: {'penalty': 'l2', 'C': 0.001}

In [23]:
# for comparison, print cross-validation results of the "non-regularized" logistic regression above
print df_noreg.mean()


accuracy     0.820653
precision    0.566667
recall       0.134848
f1           0.209341
roc_auc      0.715937
dtype: float64

Results of the grid search show that l2 is the best penalty parameter, while the best C parameter value depends on the score used. Compared to cross-validation results on the "non-regularized" logistic regression above, the grid search cross-validation results have higher scores for every score type.

When seeking to make predictions, it still isn't clear which model I would use, because different score types (accuracy, precision, etc.) have different "best" models. It would be good to visualize the grid search results to get a sense of the tradeoffs between the different models.


In [24]:
# %% create plot for each in list of scores
for scr in scores:
    df = results.loc[results.scoring == scr].sort_values(paramcols)
    best = round(np.max(df.mean_test_score), 3)
    plt.plot(c_params, df.ix[df.param_penalty=='l1', ['mean_test_score']],
             marker='o',
             markersize=5, label='L1')
    plt.plot(c_params, df.ix[df.param_penalty=='l2', ['mean_test_score']],
             linestyle='--',
             marker='s', markersize=5,
             label='L2')
    plt.title('Logistic Regression w/ 5-fold CV: ' + str(scr))
    plt.xscale('log')
    plt.legend(loc='lower right', handlelength=5, borderpad=1.2, labelspacing=1.2)
    plt.xlabel('Parameter C')
    plt.xticks(c_params)
    plt.ylabel('Test ' + str(scr))
    ymin = np.min([df.mean_test_score]) - .10
    ymax = np.max([df.mean_test_score]) + .10
    plt.ylim(ymin, ymax)
    plt.tight_layout()
    plt.show()


For the L2-regularized models, accuracy and ROC AUC don't change very much; the major trade-off is between precision and recall. The f1 score (which conveys the balance between precision and recall) maxes out at C = 0.0001, so at this point I'd consider L2 regularization with C = 0.0001 the "best model". Since this result was obtained at the lowest C value in the grid, I'm curious whether even better results can be obtained with lower values.


In [25]:
# repeat process with some lower c parameter values

# %% set up hyperparameter grid
c_params = [0.0000001, 0.000001, 0.00001]
pen_params = ('l1', 'l2')
params = {'penalty':pen_params, 'C':c_params}
paramcols = map(lambda x: 'param_' + x, params.keys())

# %% set up empty results dataframe
results_2 = pd.DataFrame(columns=['mean_test_score',
                                'mean_train_score',
                                'param_C',
                                'param_penalty',
                                'params',
                                'rank_test_score'])

# %% list of scores to iterate over
for scr in scores:
    gs = GridSearchCV(estimator=clf,
                      param_grid=params,
                      scoring=scr,
                      cv=5,
                      n_jobs=-1)
    gsf = gs.fit(train, y_train)
    print "Best %s: %r"  % (scr, round(gsf.best_score_, 3))
    print "Best parameters: %s" % (str(gsf.best_params_))
    df = pd.DataFrame.from_dict(gsf.cv_results_)
    res_temp = df.ix[:,'mean_test_score': 'rank_test_score']
    res_temp['scoring'] = scr
    results_2 = pd.concat([results_2, res_temp])


Best accuracy: 0.821
Best parameters: {'penalty': 'l1', 'C': 1e-07}
Best precision: 0.416
Best parameters: {'penalty': 'l2', 'C': 1e-07}
Best recall: 0.373
Best parameters: {'penalty': 'l2', 'C': 1e-07}
Best f1: 0.378
Best parameters: {'penalty': 'l2', 'C': 1e-07}
Best roc_auc: 0.736
Best parameters: {'penalty': 'l2', 'C': 1e-07}

In [26]:
print "Best scores for C .0001-10000"
print results.groupby('scoring')['mean_test_score'].max()
print "Best scores for C .0000001-.00001"
print results_2.groupby('scoring')['mean_test_score'].max()


Best scores for C .0001-10000
scoring
accuracy     0.829787
f1           0.378258
precision    0.667173
recall       0.372755
roc_auc      0.737632
Name: mean_test_score, dtype: float64
Best scores for C .0000001-.00001
scoring
accuracy     0.820669
f1           0.378258
precision    0.416136
recall       0.372755
roc_auc      0.736084
Name: mean_test_score, dtype: float64

The f1 score does not improve with increasingly small C parameter values, nor do any of the other scores.

The training set used in the grid search contained only the 11 features that were pre-screened based on the univariate model results. I think it's worth testing whether better results can be obtained by estimating the model on the full set of 21 features that were originally tested.


In [27]:
# create training set with 21 features
train_21 = x_trainst[feat_list].copy()

# %% define a classifier
clf = LogisticRegression(random_state=0, n_jobs=-1)

# %% set up hyperparameter grid
c_params = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
pen_params = ('l1', 'l2')
params = {'penalty':pen_params, 'C':c_params}
paramcols = map(lambda x: 'param_' + x, params.keys())

# %% set up empty results dataframe
results3 = pd.DataFrame(columns=['mean_test_score',
                                'mean_train_score',
                                'param_C',
                                'param_penalty',
                                'params',
                                'rank_test_score'])

# %% list of scores to iterate over
scores = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
for scr in scores:
    gs = GridSearchCV(estimator=clf,
                      param_grid=params,
                      scoring=scr,
                      cv=5,
                      n_jobs=-1)
    gsf = gs.fit(train_21, y_train)
    print "Best %s: %r"  % (scr, round(gsf.best_score_, 3))
    print "Best parameters: %s" % (str(gsf.best_params_))
    df = pd.DataFrame.from_dict(gsf.cv_results_)
    res_temp = df.ix[:,'mean_test_score': 'rank_test_score']
    res_temp['scoring'] = scr
    results3 = pd.concat([results3, res_temp])


Best accuracy: 0.821
Best parameters: {'penalty': 'l1', 'C': 0.0001}
Best precision: 0.53
Best parameters: {'penalty': 'l2', 'C': 0.1}
Best recall: 0.388
Best parameters: {'penalty': 'l2', 'C': 0.0001}
Best f1: 0.352
Best parameters: {'penalty': 'l2', 'C': 0.0001}
Best roc_auc: 0.725
Best parameters: {'penalty': 'l2', 'C': 0.01}

In [28]:
print "best cross-validation score from 11-feature model"
print results.groupby('scoring')['mean_test_score'].max()
print "best cross-validation score from 21-feature model"
print results3.groupby('scoring')['mean_test_score'].max()


best cross-validation score from 11-feature model
scoring
accuracy     0.829787
f1           0.378258
precision    0.667173
recall       0.372755
roc_auc      0.737632
Name: mean_test_score, dtype: float64
best cross-validation score from 21-feature model
scoring
accuracy     0.820669
f1           0.351828
precision    0.530091
recall       0.388229
roc_auc      0.724937
Name: mean_test_score, dtype: float64

I find it interesting that although the scores are very close, the best-scoring models using the 11 "screened" features are superior to the models with all 21 features on nearly every score. This demonstrates that for this particular problem, more features are not necessarily better for predicting out-of-sample data.

Finally, I'd like to take the "best f1 model" from above (L2 penalty, C = 0.0001, 11 screened features) and use it to classify the test set.


In [29]:
# standardize feature test set
test = x_test[dfcols]
sc = StandardScaler()
sc.fit(x_train[dfcols])
test = sc.transform(test)

In [30]:
clf = LogisticRegression(random_state=0, n_jobs=-1, C=0.0001)
clf.fit(train, y_train)
preds_train = clf.predict(train)
preds_test = clf.predict(test)

Below I print the model's performance on the training and test sets, alongside the cross-validation scores of the non-regularized model for comparison.


In [31]:
print "Cross-validation scores for non-regularized model"
print df_noreg.mean()
print 'Training Set Scores'
print 'Accuracy score: %r' % (accuracy_score(y_train, preds_train))
print 'ROC AUC score: %r' % (roc_auc_score(y_train, preds_train))
print(classification_report(y_train, preds_train))
print(confusion_matrix(y_train, preds_train))
print 'Test Set Scores'
print 'Accuracy score: %r' % (accuracy_score(y_test, preds_test))
print 'ROC AUC score: %r' % (roc_auc_score(y_test, preds_test))
print(classification_report(y_test, preds_test))
print(confusion_matrix(y_test, preds_test))


Cross-validation scores for non-regularized model
accuracy     0.820653
precision    0.566667
recall       0.134848
f1           0.209341
roc_auc      0.715937
dtype: float64
Training Set Scores
Accuracy score: 0.80851063829787229
ROC AUC score: 0.66478342749529185
             precision    recall  f1-score   support

          0       0.88      0.89      0.88       270
          1       0.46      0.44      0.45        59

avg / total       0.80      0.81      0.81       329

[[240  30]
 [ 33  26]]
Test Set Scores
Accuracy score: 0.77108433734939763
ROC AUC score: 0.6807971014492753
             precision    recall  f1-score   support

          0       0.82      0.88      0.85        60
          1       0.61      0.48      0.54        23

avg / total       0.76      0.77      0.76        83

[[53  7]
 [12 11]]

The L2-regularized logistic regression with C = 0.0001 and 11 screened features has test-set scores only slightly lower than the training-set scores, and its test-set precision and recall for upsets are actually superior to the training-set values. The comparable training/test performance provides a reasonable expectation for performance on future unseen data.

Compared to the cross-validation results from the non-regularized 11-feature model, upset precision on the test set is only slightly higher (61% vs 57%) but upset recall is drastically higher (48% vs 13%), which is also reflected in a much higher upset F1 score (0.54 vs 0.21).

On the held-out test set the logistic regression correctly identified 48% of the upsets as upsets, and 61% of the games predicted as upsets were true upsets. It would be nice to identify more of the upsets (reduce the false negatives), but a 61% hit rate on upsets would provide a solid rate of return since wagers on underdogs typically return greater odds.
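As a rough illustration of that last point (a sketch; the +150 moneyline is hypothetical, and the 61% hit rate is the test-set precision from above), a bettor wagering one unit on every predicted upset would expect a positive return under these assumptions.

# Sketch: expected profit per 1-unit wager on predicted upsets, assuming a hypothetical
# +150 underdog moneyline (a win pays 1.5 units) and the 61% test-set precision.
hit_rate, payout = 0.61, 1.5
print('Expected profit per unit wagered: %.2f' % (hit_rate * payout - (1 - hit_rate)))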

The next step in the project would be to see if model performance can be further improved. I only tested 21 of the 163 available features; it's possible that training the model with some (or all?) of the remaining 142 features would improve the classification scores. There are also other classification algorithms I would like to try, in particular those that can more readily accommodate non-linearity. For example, underdog seed is one obviously important feature that would probably be best modeled by some nonlinear function.