I love football more than engineers love coffee, all my Turi friends know that. Throughout the course of an NFL season I have fantasy teams, point-spread pools, survivor pools and non-stop pontification on the latest terrible play call, awful refs and amazing plays (GO HAWKS!). Despite hearing about all the neato, life-changing, world-saving, cutting-edge, inspirational things machine learning could be used for, my first thought was "Huh. I should learn that and use it for football."
Well, let's take a shot.
I'm no data scientist and this post won't be terribly math-y. I'm a software engineer that's fond of BBQ and learning by doing. Thankfully, I work with a slew of smart folks kind enough to assist with these educational shenanigans and provide guidance when I run off the rails. We'll apply the simplest of machine learning concepts, linear regression, to football in the simplest way and see what happens. You get to learn from all of the glorious mistakes I made so that you can go on to make bigger, better mistakes. We'll learn, together, inch by inch.
My dataset is from Armchair Analysis, which is a pretty rad source of well-curated, nicely documented NFL stats. You can grab yourself a copy for pretty cheap.
In [22]:
#Fire up the GraphLab engine
import graphlab as gl
import graphlab.aggregate as agg
The team SFrame has one row per team per game, holding that team's stats for the game. The games SFrame has info about each game itself: which teams played in it, where, and so on. So first, let me narrow things down to what I think are some relevant columns and join some info about each game onto the team table.
In [23]:
team = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\TEAM.csv', header=True)
games = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\GAME.csv', header=True)
# TID   = team id
# GID   = game id
# TNAME = team name
# PTS   = points
# RY    = rushing yards
# RA    = rushing attempts
# PY    = passing yards
# PA    = pass attempts
# PC    = pass completions
# PU    = punts
# SK    = sacks against
# FUM   = fumbles lost
# INT   = interceptions by the defense
# TOP   = time of possession
# TD    = touchdowns
# TDR   = rushing touchdowns
# TDP   = passing touchdowns
# PLO   = total plays on offense
# PLD   = total plays on defense
# DP    = points by defense
team = team.select_columns(
['TNAME', 'TID', 'GID', 'PTS', 'RY', 'RA', 'PY', 'PA', 'PC', 'PU',
'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP'])
team['TOP'] = team['TOP'].astype(float)
team = team.join(games.select_columns(['GID', 'SEAS', 'WEEK']), 'GID')
team['SEAS'] = team['SEAS'].astype(str)
# restrict to regular season
team = team[team['WEEK'] < 18]
Here's what we've got so far:
In [24]:
team.head(5)
Out[24]:
Ok. Great. I have data... how do I data science it? I know from intro reading that linear regression is probably the simplest thing for me to try. But... what kind of question can I even answer with linear regression? Well, the way I understand it, we can predict a continuous variable Y that changes based on the values of a bunch of other variables: X1, X2, and so on. As all the Xs change, our predicted Y changes with them. (There are better, math-happy, grown-up explanations out there if you're after next-level enlightenment.) The data science "hello world" example everyone on the planet uses to explain linear regression is predicting the price of a house: how much will a house cost (Y) based on a bunch of info we have (Xs) like square footage, number of rooms, etc.? I have a bunch of info about team stats. What do I want to predict?
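Before picking a target, here's the one-formula version of what linear regression actually fits (the standard textbook form, nothing GraphLab-specific):

$$Y = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n$$

Training is just finding the weights $w_i$ that make the predicted Y land as close as possible to the actual Y across all of the examples.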
Let's try to predict the total season wins for a team given that team's stats for the year. Right now I don't have a season summary for each team, but I can make one. I'll also need to generate a season-win column for each team. I'm also going to split my data into a training set and a test set: I feed my model the training set to learn on, then reserve the test set to see how well the model performs against data it hasn't seen yet. What I don't need to do is code up the algorithm itself; GraphLab Create has a toolkit for that.
In [25]:
# add a column to indicate if the team won this game or not
winners = team.groupby(key_columns='GID', operations={'WIN_TID': agg.ARGMAX('PTS', 'TID')})
team = team.join(winners, 'GID')
team['WIN'] = team['TID'] == team['WIN_TID']
# create season summary
of_interest = ['PTS', 'RY', 'RA', 'PY', 'PA', 'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP', 'WIN']
season = team.groupby(['SEAS', 'TNAME'], {'%s_SUM' % x: agg.SUM(x) for x in of_interest})
# keep the generated *_SUM column names as candidate features, minus the target
season_sums = filter(lambda x: '_SUM' in x, season.column_names())
season_sums.remove('WIN_SUM')
I now have season summaries that look like this:
In [26]:
season.head()
Out[26]:
Awesome. Let's get to it.
In [27]:
# predict number of wins
#split the data into train and test
season_train, season_test = season.random_split(0.8, seed=0)
lin_model = gl.linear_regression.create(season_train, target='WIN_SUM', features=season_sums)
lin_pred = lin_model.predict(season_test)
lin_result = lin_model.evaluate(season_test)
How'd we do?
In [28]:
print lin_result
Printing out lin_result shows us our max_error was around 5, and our RMSE (root mean squared error, a favorite measure of how wrong you are) is around 1.8. That's... not bad? Just browsing the first few predictions, the model is generally close but not great.
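For the record, RMSE is exactly what the name spells out: square each prediction's error, average the squares over the test set, then take the square root:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

So an RMSE around 1.8 means my season-win predictions are typically off by somewhere around two wins.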
In [29]:
print season_test.head(5)['WIN_SUM']
print lin_pred[0:5]
The first five actual win totals of my test data are (4, 11, 5, 10, 4) and the first few predictions out of lin_pred come out around (5.9, 11.8, 4.2, 8.7, ...). I can print the lin_model to get a little more information about what's going on here, or use lin_model.show() to see it all pretty in the browser.
In [30]:
gl.canvas.set_target('ipynb')
lin_model.show()
Not surprisingly, the Xs with positive coefficients are things like the sum of passing touchdowns, total points, and time of possession. The Xs that drag the Y value down are things like interceptions, sacks, and fumbles. But for the life of me, I can't figure out why the sum of touchdowns would bring the win-sum prediction down. I had to ask my data science sherpa Chris why this could happen.
Chris: I suspect that several of your features are highly correlated. That makes it tricky for the linear regression model to produce parameter estimates that aren't noisy; it doesn't know whether to make the coefficients on two highly correlated features large or small, since the predictions can come out the same either way. (Check out Wikipedia's page on "multicollinearity" to read more.)
You have two options:
- hand select a subset of the features, and see if the coefficients from the model make more sense. Then decide if the accuracy of the previous model is worth the decrease in interpretability.
- experiment with increasing the l2_penalty and l1_penalty arguments: this encourages the optimization to find estimates where some features are either close to 0 or dropped entirely (sketched below).
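I ended up going with option one, but for completeness, here's a minimal sketch of what option two might look like. The l2_penalty and l1_penalty arguments are real parameters of GraphLab Create's linear_regression.create; the penalty values themselves are just guesses I'd need to tune, not settings I actually validated.
In [ ]:
# Option two, sketched: regularization pushes noisy, correlated
# coefficients toward zero (the L1 term can drop features entirely).
# The penalty values below are arbitrary starting points, not tuned.
reg_model = gl.linear_regression.create(season_train, target='WIN_SUM',
                                        features=season_sums,
                                        l2_penalty=0.1, l1_penalty=1.0)
reg_model.show()  # compare these coefficients against the unregularized run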
Yeah, actually. I do have multiple columns that are kinda correlated. All of the points columns are related: the sums of rushing and passing TDs feed directly into the sum of total points. Alright, throwing every single column of data I had at this didn't give me what I wanted. Given Chris' explanation, the simplest thing to try is removing the features that duplicate the effect of points from the season_sums list and training the model again. After removing TDR_SUM (rushing touchdown sum), TDP_SUM (passing touchdown sum), and PTS_SUM (total points), I end up with coefficients that look like this.
In [31]:
# Why does TD_SUM end up in a negative coefficient?
# Points are represented by multiple features, they're correlated and it's confusing the model
season_sums.remove('PTS_SUM')
season_sums.remove('TDP_SUM')
season_sums.remove('TDR_SUM')
lin_model = gl.linear_regression.create(season_train, target='WIN_SUM', features=season_sums)
lin_pred = lin_model.predict(season_test)
lin_result = lin_model.evaluate(season_test)
lin_model.show()
That's more what I expected. Touchdowns and time of possession lead the list of positive coefficients while interceptions lead the list of negative. It doesn’t look like my predictions changed all that much as a result, but the coefficients make more sense to me now.
In [32]:
print lin_result
print season_test.head(5)['WIN_SUM']
print lin_pred[0:5]
Now, what if I want to predict who won a specific game? Rather than trying to predict a sum, I want to predict whether one column, specifically the WIN column, contains a 1 (they won) or a 0 (they lost) based on a team's total stats for just that game. This sounds fun. But first, Chris advised that I do a little data reformatting. It'll be easier to try to predict a win by looking at the difference between the numbers of the two teams playing each other. I'm going to express this as Home minus Away, in columns named <STAT>_DIFF (PTS_DIFF, RY_DIFF, and so on).
One side note here. Whereas earlier I was trying to predict a continuous number, what I'm really trying to do now is predict whether a game is a WIN or a LOSS. In other words, the WIN column contains two categories: zero or one. It's not going to be a 0.7 or a 1.3. Using a formula similar to linear regression to classify the value of Y like this is commonly called logistic regression. Rather than code up the math by hand, I'm going to use GLC's toolkit and get right to the fun stuff.
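For the formula-curious: logistic regression takes the same weighted sum that linear regression uses and squashes it through the sigmoid function, so the output always lands between 0 and 1 and can be read as the probability of a win:

$$P(\mathrm{WIN} = 1) = \frac{1}{1 + e^{-(w_0 + w_1 X_1 + \dots + w_n X_n)}}$$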
In [43]:
# OK, so after talking to Chris, refactor the data into one row per game
# With values being Home - Away
# add info about who was home and away
more = team.join(games.select_columns(['GID', 'V', 'H', 'PTSV', 'PTSH']), 'GID')
home = more[more['TNAME'] == more['H']].remove_columns(['PTSH', 'PTSV'])
visit = more[more['TNAME'] == more['V']].remove_columns(['PTSH', 'PTSV'])
difftable = home.join(visit, 'GID')
# after the join, the visitor's info always lands under '.1' column names
for thing in ['PTS', 'RY', 'RA', 'PY', 'PA', 'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP']:
    difftable['%s_DIFF' % thing] = difftable[thing] - difftable['%s.1' % thing]
diff_train, diff_test = difftable.random_split(0.8, seed=20)
diff_lgc_model = gl.logistic_classifier.create(diff_train, target='WIN',
                                               features=[i for i in difftable.column_names() if "_DIFF" in i])
diff_result = diff_lgc_model.evaluate(diff_test)
print diff_result
diff_lgc_model.show()
Me: HOLY GUACAMOLE!! LOOK AT MY ACCURACY! IT'S ALMOST PERFECT!
Chris: That can't be right.
Me: I CAN PREDICT THE FUTURE!
Chris: I don't really think that's how it w-
Me: I'M QUITTING MY JOB AND GOING TO VEGAS! I AM AMAZING!!! SEE YOU CLOWNS LATER AAAAAAHAHAHAHAHA!!!!!!!
Chris: Did you leave the difference in scores as a feature for the model to use?
Me: AAAHAHAHA-... what? Yes? ... Yes, I did.
Chris: So the model learned, with 100% accuracy, that the final score predicts who won.
Me: Ahem. Oh. ... sorry about the clowns thing.
Chris: ...
Me: ... I'm also un-quitting.
Chris: ...
Me: ... I'll go back to my desk now.
Yeah. Once again, putting all the info I had into this black box wasn't quite the best idea. I want to predict who won based on every other stat, but I don't want to use the score itself. The model picks up pretty quickly that a feature like, oh, say, the point differential between the teams is a pretty darn good indicator of what's in the WIN column. Clever. Let's remove that stuff.
In [40]:
# CLEARLY I CAN PREDICT THE FUTURE
# oh. remove points
better_features = [i for i in difftable.column_names() if "_DIFF" in i]
better_features.remove('PTS_DIFF')
better_features.remove('TD_DIFF')
better_features.remove('TDR_DIFF')
better_features.remove('TDP_DIFF')
better_features.remove('DP_DIFF')
better_model = gl.logistic_classifier.create(diff_train, target='WIN', features=better_features)
better_model.show()
print better_model.evaluate(diff_test)
That's not bad. The confusion matrix here is an awesome little guide to how the model did guessing the outcome of games: it counts how many wins and losses the model labeled correctly, and how many it got wrong in each direction.
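Side note: Canvas isn't the only way to see that matrix. As far as I can tell, the dictionary that evaluate() hands back includes the confusion matrix alongside the accuracy number, so a sketch like this should print it at the console (treat the key name as my assumption):
In [ ]:
# Pull the confusion matrix out of the evaluation results.
# Each row pairs an actual label with a predicted label and a count.
eval_dict = better_model.evaluate(diff_test)
print eval_dict['confusion_matrix']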
While I’m in the business of classifying things, I browsed the list of classifiers in the API searching for other high-powered tools to naively explore. Chris pointed me to the boosted trees classifier as an interesting way to explore the effects of each of my features on the classification. It’s not entirely clear to me how a tree, regardless of whether I boosted it or purchased it legally, is going to help me out but it looks easy enough to try.
In [45]:
#Exploring the trees
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN', features=better_features)
btc_classifier.show(view="Tree", tree_id=1)
Um, what? I have no idea what I just did and this visualization isn’t helpful at all. I’ve created what looks like an insane decision shrub of a tree but I don’t think that every single tiny decision and combination of features is really this important. Let me try that again, this time using just one decision tree and limiting its depth.
In [46]:
btc_class2 = gl.boosted_trees_classifier.create(diff_train, target='WIN', features=better_features,
                                                max_iterations=1, max_depth=3)
btc_class2.show(view="Tree", tree_id=0)
Ok, limiting the depth did wonders for my readability here. This is interesting: the tree isn't making decisions around the things I thought were important. The big first branch happens on rushing attempts. I can view what the model thinks is important with a call to model.get_feature_importance(), which counts, for each feature, the tree nodes that branch on it.
In [47]:
btc_class2.get_feature_importance()
Out[47]:
Huh. Not what I thought. The number of rushing attempts is more important than interceptions and more important than rushing yards? While I could maybe see that, it doesn’t line up with the features that other folks on the internet have concluded are important. I’d expect to see something about offense, maybe passing, or total rush yards. Maybe if I make the tree deeper it’d be a little closer to what I’ve read is important in a football game?
In [48]:
btc_bigger = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                features=better_features, max_iterations=50, max_depth=5)
btc_bigger.show(view="Tree", tree_id=0)
Nope.
I’m going to try going back and adding some more features… maybe compute average yards per pass attempt, yards per rush attempt, and average starting field position. You know, just some good-to-know stuff.
In [49]:
# OK, so this isn't really what I expected.
# The first branch is rushing attempts, then the interceptions difference.
# Should probably add more features.
# PEN  = penalty yardage against
# SRP  = successful rush plays
# SPP  = successful pass plays
# SFPY = total starting field position yardage; dividing it by the number
#        of drives on offense (DRV) gives average starting field position
# PU   = punts
# DRV  = drives on offense
# Create a pass yards per attempt feature, an average starting field
# position (SFPY/DRV), and rushing yards per rush attempt (avg rush).
extra = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\TEAM.csv', header=True)
extra = extra.select_columns(
    ['TID', 'GID', 'SRP', 'SPP', 'PLO', 'PLD', 'PEN', 'SFPY', 'DRV'])
team = team.join(extra, on=['GID', 'TID'])
# GID 364 (CAR/MIA 2001) has no data for DRV. Without the "if x[...] != 0"
# guards you get a divide by zero, but ONLY at the point you try to
# materialize the SFrame, because the operations are lazy! Gah.
team['YD_PER_P_ATT'] = team.apply(lambda x: float(x['PY']) / float(x['PA']) if x['PA'] != 0 else 0)
team['YD_PER_R_ATT'] = team.apply(lambda x: float(x['RY']) / float(x['RA']) if x['RA'] != 0 else 0)
team['AVG_STFP'] = team.apply(lambda x: float(x['SFPY']) / float(x['DRV']) if x['DRV'] != 0 else 0)
# On second thought, I want to drop that game altogether since I don't trust it.
# The score and other references show they did have offensive drives. wth?
team = team.filter_by([364], 'GID', exclude=True)
features = ['YD_PER_P_ATT', 'YD_PER_R_ATT', 'AVG_STFP', 'SRP', 'PY', 'RY', 'DRV', 'SPP', 'FUM', 'SK', 'INT', 'PLO',
'PLD', 'TOP', 'PEN', 'PU']
# doing that joining stuff again
more = team.join(games.select_columns(['GID', 'V', 'H', 'PTSV', 'PTSH']), 'GID')
home = more[more['TNAME'] == more['H']].remove_columns(['PTSH', 'PTSV'])
visit = more[more['TNAME'] == more['V']].remove_columns(['PTSH', 'PTSV'])
difftable = home.join(visit, 'GID')
# the visitor's info again lands under the '.1' column names,
# so let's express everything as a home-minus-visitor difference
for thing in features:
    difftable['%s_DIFF' % thing] = difftable[thing] - difftable['%s.1' % thing]
In [50]:
diff_train, diff_test = difftable.random_split(0.8, seed=0)
stuff = [ i for i in difftable.column_names() if "_DIFF" in i]
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                    features=stuff)
btc_classifier.show(view="Tree", tree_id=1)
Gah, look at that mess again.
In [55]:
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                    features=stuff, max_depth=3)
all_f_results = btc_classifier.evaluate(diff_test)
btc_classifier.show(view="Tree")
print all_f_results
Hey! Not bad!
In [59]:
print btc_classifier.get_feature_importance()
important_f = btc_classifier.get_feature_importance()['feature']
Oh man, that looks … actually pretty good. My most important features relate to the passing game, field position and ball control.
Out of all the features I gave it, the model identified the most important ones. Just for giggles, I can start making life hard for my model. Yes, yards per pass attempt appears to be important. What if I ONLY gave the model that feature? How accurate am I then? Thanks to Chris’ suggestion, I can loop through the list of features here, add them one by one and see how accurate I am with just that subset of features.
In [62]:
models_i = []
accuracy_i = []
f_imp = []
for i in range(1, len(important_f)):
    # train on only the i most important features
    btc = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                             features=important_f[:i],
                                             max_depth=3)
    models_i.append(btc)
    accuracy_i.append(btc.evaluate(diff_test))
    f_imp.append(btc.get_feature_importance())
accurate = [x['accuracy'] for x in accuracy_i]
test_sf = gl.SFrame({'feature': range(1, len(important_f)), 'accuracy': accurate})
In [63]:
test_sf.show(view="Line Chart", x='feature', y='accuracy')
Keeping in mind I’m adding them in by order of importance, it looks like I get 75% accuracy from just the first feature, yards per pass attempt. I get another big boost by including starting field position, interceptions, and rushing yards… but after that, the rest of the features really aren’t contributing much. That sounds about right, given what I’ve read about the current state of the pro football world.
And there you have it. This is what it feels like to get into machine learning. Lots of “I have no idea what I’m doing” followed by an occasional “huh… that was pretty cool”. There’s plenty of additional tuning that can (and should) be done, but being able to get moving with the basics and pull out insights quickly makes the topic a lot more approachable. And once I figure out how to use machine learning to beat the odds, I’m going to Vegas.
I’ll send Chris a postcard.