Any Given Sunday: Football and a Machine Learning Rookie

I love football more than engineers love coffee, all my Turi friends know that. Throughout the course of an NFL season I have fantasy teams, point-spread pools, survivor pools and non-stop pontification on the latest terrible play call, awful refs and amazing plays (GO HAWKS!). Despite hearing about all the neato, life-changing, world-saving, cutting-edge, inspirational things machine learning could be used for, my first thought was "Huh. I should learn that and use it for football."

Well, let's take a shot.

I'm no data scientist and this post won't be terribly math-y. I'm a software engineer that's fond of BBQ and learning by doing. Thankfully, I work with a slew of smart folks kind enough to assist with these educational shenanigans and provide guidance when I run off the rails. We'll apply the simplest of machine learning concepts, linear regression, to football in the simplest way and see what happens. You get to learn from all of the glorious mistakes I made so that you can go on to make bigger, better mistakes. We'll learn, together, inch by inch.

Setting up the data

My dataset is from Armchair Analysis, which is a pretty rad source of well-curated, nicely documented, NFL stats. You can grab yourself a copy for pretty cheap.


In [22]:
#Fire up the GraphLab engine
import graphlab as gl
import graphlab.aggregate as agg

The team SFrame has all of one team's stats per game. The games SFrame has info about each game itself, which teams played in it, where, etc. So first, let me narrow it down to what I think are some relevant columns and join some info about each game to the team table.


In [23]:
team = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\TEAM.csv', header=True)
games = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\GAME.csv', header=True)
# TID = team id
# GID = game id
# TNAME = team name
# PTS = points
# RY = rushing yards
# RA = rushing attempts
# PY = passing yards
# PA = pass attempts
# PC = pass completions
# PU = punts
# SK = sacks against
# FUM = fumbles lost
# INT = interceptions for defense
# TOP = time of possession
# TD = touchdowns
# TDR = rushing touchdowns
# TDP = passing touchdowns
# PLO = total plays offense
# PLD = total plays defense
# DP = points by defense
team = team.select_columns(
    ['TNAME', 'TID', 'GID', 'PTS', 'RY', 'RA', 'PY', 'PA', 'PC', 'PU',
     'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP',  'DP'])
team['TOP'] = team['TOP'].astype(float)

team = team.join(games.select_columns(['GID', 'SEAS', 'WEEK']), 'GID')
team['SEAS'] = team['SEAS'].astype(str)

# restrict to regular season
team = team[team['WEEK'] < 18]


PROGRESS: Finished parsing file C:\Users\Susan\Documents\NFLData_2000-2014\csv\TEAM.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.120085 secs.
PROGRESS: Finished parsing file C:\Users\Susan\Documents\NFLData_2000-2014\csv\TEAM.csv
PROGRESS: Parsing completed. Parsed 7978 lines in 0.116081 secs.
PROGRESS: Finished parsing file C:\Users\Susan\Documents\NFLData_2000-2014\csv\GAME.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.044031 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,str,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,str,str,long,long,long,long,long,long,str,str,str,str,str,str,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
PROGRESS: Finished parsing file C:\Users\Susan\Documents\NFLData_2000-2014\csv\GAME.csv
PROGRESS: Parsing completed. Parsed 3989 lines in 0.048034 secs.
Inferred types from first line of file as 
column_type_hints=[long,long,long,str,str,str,str,str,str,str,str,str,str,str,str,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

Here's what we've got so far


In [24]:
team.head(5)


Out[24]:
TNAME TID GID PTS RY RA PY PA PC PU SK FUM INT TOP TD TDR TDP DP SEAS WEEK
SF 1 1 28 92 24 247 36 23 5 1 0 1 27.5 4 1 3 0 2000 1
ATL 2 1 36 95 32 264 31 16 2 0 1 0 31.5 3 0 2 7 2000 1
JAC 3 2 27 119 40 279 34 24 3 4 0 0 37.0 3 2 1 0 2000 1
CLE 4 2 7 96 16 153 27 19 5 1 1 0 23.0 1 0 1 0 2000 1
PHI 5 3 41 306 46 119 29 16 2 1 0 3 39.5 5 3 1 7 2000 1
[5 rows x 20 columns]

Predict the total season wins for a team

Ok. Great. I have data... how do I data science it? I know from intro reading that linear regression is probably the simplest thing for me to try. But... what kind of question can I even answer with linear regression? Well, the way I understand it, we can predict a continuous variable Y that changes based on the values of a bunch of other variables: X1, X2, and so on. As all the Xs change, our predicted Y changes. Here is the better, math-happy, grown-up explanation to achieve next-level enlightenment. The data science "hello world" example everyone on the planet uses to explain linear regression is predicting the price of a house. How much will a house cost (Y) based on a bunch of info we have (the Xs) like square footage, number of rooms, etc.? Well, I have a bunch of info about team stats. What do I want to predict?
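
To make that concrete, here's the world's tiniest sketch of the idea (made-up house numbers, nothing learned from real data): the model learns one weight per X plus an intercept, and a prediction is just the weighted sum.

# A sketch of what linear regression predicts. The weights below are hypothetical,
# not fitted to anything; the point is only the shape of the formula.
w0 = 50000.0            # intercept
w1, w2 = 120.0, 8000.0  # weights the model would learn for sqft and rooms
sqft, rooms = 1800, 3   # the Xs for one imaginary house
price = w0 + w1 * sqft + w2 * rooms  # the continuous Y we're predicting
print price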

Let's try to predict the total season wins for a team given that team's stats for the year. Right now I don't have a season summary for each team, but I can make one. I'll also need to generate a season win column for each team. I'm also going to split my data into a training set and a test set: I can feed my model the training set to learn on, then reserve the test set to see how well the model performs against data it hasn't seen yet. What I don't need to do is code up the algorithm; GraphLab Create has a toolkit for that.


In [25]:
# add a column to indicate if the team won this game or not
winners = team.groupby(key_columns='GID', operations={'WIN_TID': agg.ARGMAX('PTS', 'TID')})
team = team.join(winners, 'GID')
team['WIN'] = team['TID'] == team['WIN_TID']

#create season summary
of_interest= ['PTS', 'RY', 'RA', 'PY', 'PA', 'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP', 'WIN']
season = team.groupby(['SEAS', 'TNAME'], {'%s_SUM' % x : agg.SUM(x) for x in of_interest})
season_sums = filter(lambda x: '_SUM' in x, season.column_names())
season_sums.remove('WIN_SUM')

I now have season summaries that look like this


In [26]:
season.head()


Out[26]:
SEAS TNAME INT_SUM DP_SUM RY_SUM SK_SUM RA_SUM TD_SUM FUM_SUM TDP_SUM TOP_SUM PY_SUM PTS_SUM
2005 TEN 14 30 1533 31 389 33 12 20 499.0 3597 299
2013 ARI 22 39 1560 41 402 41 9 24 496.5 4002 379
2014 NYJ 15 2 2285 47 501 27 9 16 497.0 2946 283
2005 ATL 13 23 2563 39 514 39 16 19 486.5 2679 351
2003 PHI 11 9 2026 43 406 43 11 17 454.5 3020 374
2011 NYJ 18 27 1699 40 436 45 16 26 493.0 3297 377
2007 PHI 15 0 1988 49 407 38 12 24 496.0 3755 336
2010 NYJ 14 25 2382 28 526 39 7 20 522.0 3229 367
2012 PHI 15 0 1893 48 407 29 22 18 476.5 3781 280
2000 ATL 20 9 1217 61 347 25 14 14 470.5 2780 252
TDR_SUM PA_SUM WIN_SUM
8 588 4
12 573 10
11 495 4
17 450 8
23 481 12
14 543 8
12 575 8
14 520 11
10 614 4
6 514 4
[10 rows x 16 columns]

Awesome. Let's get to it.


In [27]:
# predict number of wins
#split the data into train and test
season_train, season_test = season.random_split(0.8, seed=0)
lin_model = gl.linear_regression.create(season_train, target='WIN_SUM', features=season_sums)
lin_pred = lin_model.predict(season_test)
lin_result = lin_model.evaluate(season_test)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 362
PROGRESS: Number of features          : 13
PROGRESS: Number of unpacked features : 13
PROGRESS: Number of coefficients    : 14
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.001001     | 5.115316           | 2.863893             | 1.634377      | 1.284790        |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

How'd we do?


In [28]:
print lin_result


{'max_error': 5.340498547412093, 'rmse': 1.7816995360472347}

Printing out lin_result shows us our max_error was around 5, and our rmse (root mean squared error, a favorite measure of how wrong you are) is around 1.8. That's ... not bad? Just browsing the first few, the model is generally close but not great.


In [29]:
print season_test.head(5)['WIN_SUM']
print lin_pred[0:5]


[4L, 11L, 5L, 10L, 4L]
[5.928937053986296, 11.790802796979062, 4.276084765261491, 8.579262126730567, 6.401094017887274]

The first five actual win totals of my test data are (4, 11, 5, 10, 4) and my first five predictions out of lin_pred are (5.9, 11.8, 4.3, 8.6, 6.4). I can print lin_model to get a little more information about what's going on here, or use lin_model.show() to see it all pretty in my browser.
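
To double-check what that rmse means, here's a quick by-hand version over just these five rows (a sketch; the toolkit's 1.8 is computed over the whole test set, so the numbers won't match exactly).

# Rough rmse over the five rows shown above: square the misses, average them, take the root.
import math
actual    = [4, 11, 5, 10, 4]
predicted = [5.9, 11.8, 4.3, 8.6, 6.4]
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / float(len(actual)))
print rmse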


In [30]:
gl.canvas.set_target('ipynb')
lin_model.show()
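
If you'd rather skip the browser, you can poke at the same information directly. This assumes the fitted model exposes a 'coefficients' SFrame; if your version of GraphLab Create doesn't, stick with show().

# Peek at the learned coefficients without Canvas
coefs = lin_model['coefficients']
print coefs.sort('value', ascending=False)  # largest positive weights first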


Not surprisingly, the Xs with positive coefficients are things like the sum of passing touchdowns, total points, and time of possession. The Xs that drag the Y value down are things like interceptions, sacks, and fumbles. But for the life of me, I can't figure out why the sum of touchdowns would bring the win sum prediction down. I had to ask my data science sherpa Chris why this could happen.

Chris: I suspect that several of your features are highly correlated. This makes it tricky for the linear regression model to have parameter estimates that aren't noisy; it doesn't know whether or not to make two highly correlated features be large or small, since there are ways for the predictions to be the same in either case. (Check out wikipedia's page on "multicollinearity" to read more.)

You have two options:

  1. hand select a subset of the features, and see if the coefficients from the model make more sense. Then decide if the accuracy of the previous model is worth the decrease in interpretability.
  2. experiment with increasing the l2_penalty and l1_penalty arguments: this encourages the optimization to find estimates where some features are either close to 0 or dropped entirely (a rough sketch of this is below).
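
I went with option 1 below, but for the record, option 2 would look roughly like this. The l2_penalty and l1_penalty arguments are real parameters of gl.linear_regression.create; the specific values here are just guesses to start experimenting with, not anything I tuned.

# Option 2 sketch: penalize big coefficients so correlated features stop fighting.
reg_model = gl.linear_regression.create(season_train, target='WIN_SUM',
                                        features=season_sums,
                                        l2_penalty=0.1, l1_penalty=1.0)
print reg_model.evaluate(season_test)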

Yeah, actually. I do have multiple columns that are kinda correlated. All of the points columns are related; the sums of rushing and passing TDs feed directly into the sum of total points. Alright, throwing every single column of data I had at this didn't give me what I wanted. Given Chris' explanation, the simplest thing to try is removing the columns that duplicate the effect of points from the season_sums list of features and training the model again. After removing TDR_SUM (rushing touchdown sum), TDP_SUM (passing touchdown sum) and PTS_SUM (total points), I end up with coefficients that look like this.


In [31]:
# Why does TD_SUM end up in a negative coefficient?
# Points are represented by multiple features, they're correlated and it's confusing the model
season_sums.remove('PTS_SUM')
season_sums.remove('TDP_SUM')
season_sums.remove('TDR_SUM')
lin_model = gl.linear_regression.create(season_train, target='WIN_SUM', features=season_sums)
lin_pred = lin_model.predict(season_test)
lin_result = lin_model.evaluate(season_test)
lin_model.show()


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 357
PROGRESS: Number of features          : 10
PROGRESS: Number of unpacked features : 10
PROGRESS: Number of coefficients    : 11
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.001000     | 5.141604           | 3.519644             | 1.722067      | 1.745545        |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

That's more what I expected. Touchdowns and time of possession lead the list of positive coefficients while interceptions lead the list of negative. It doesn’t look like my predictions changed all that much as a result, but the coefficients make more sense to me now.


In [32]:
print lin_result
print season_test.head(5)['WIN_SUM']
print lin_pred[0:5]


{'max_error': 4.890134040943327, 'rmse': 1.8691112976359405}
[4L, 11L, 5L, 10L, 4L]
[5.356049945089315, 11.550561219697165, 4.353967601094646, 7.642765711124727, 6.205331611073706]

Predict who won a game

Now, what if I want to predict who won a specific game? Rather than trying to predict a sum, I want to predict whether one column, specifically the WIN column, contains a 1 (they won) or a 0 (they lost) based on a team's total stats for just that game. This sounds fun. But first, Chris advised that I do a little data reformatting. It'll be easier to try to predict a win by looking at the difference between the numbers of the two teams playing each other. I'm going to express this as Home minus Away in columns suffixed with _DIFF.

One side note here. Whereas earlier I was trying to predict a continuous number, what I'm really trying to do now is predict whether it's a WIN or a LOSS. In other words, the WIN column contains two categories: zero or one. It's not going to be a 0.7 or 1.3. Classifying the value Y this way, using a formula similar to linear regression, is commonly called logistic regression. Rather than code up the math by hand, I'm going to use GLC's toolkit, getting right to the fun stuff.
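
The gist, as I understand it (a sketch of the idea, not GraphLab's internals): take the same kind of weighted sum of the Xs, squash it through a sigmoid so it lands between 0 and 1, and read that number as "probability of a win".

# Logistic regression sketch with hypothetical weighted sums, just to show the shape.
import math

def classify(weighted_sum):
    p = 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid squashes to (0, 1)
    return 1 if p > 0.5 else 0                 # 1 = WIN, 0 = LOSS

print classify(2.0), classify(-1.3)            # prints: 1 0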


In [43]:
# OK, so after talking to Chris, refactor the data into one row per game
# With values being Home - Away
# add info about who was home and away
more = team.join(games.select_columns(['GID', 'V', 'H', 'PTSV', 'PTSH']), 'GID')


home = more[more['TNAME'] == more['H']].remove_columns(['PTSH', 'PTSV'])
visit = more[more['TNAME'] == more['V']].remove_columns(['PTSH', 'PTSV'])
difftable = home.join(visit, 'GID')
# the visitor's info is now always under the '.1'-suffixed column names

for thing in ['PTS', 'RY', 'RA', 'PY', 'PA', 'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP']:
    difftable['%s_DIFF' % thing] = difftable[thing] - difftable['%s.1' % thing]

diff_train, diff_test = difftable.random_split(0.8, seed=20)


diff_lgc_model = gl.logistic_classifier.create(diff_train, target='WIN',
                                               features=[ i for i in difftable.column_names() if "_DIFF" in i])


diff_result = diff_lgc_model.evaluate(diff_test)
print diff_result
diff_lgc_model.show()


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2923
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 13
PROGRESS: Number of unpacked features : 13
PROGRESS: Number of coefficients    : 14
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.012009     | 0.941156          | 0.954023            |
PROGRESS: | 2         | 3        | 0.033024     | 0.969894          | 0.977011            |
PROGRESS: | 3         | 4        | 0.043031     | 0.988368          | 0.988506            |
PROGRESS: | 4         | 5        | 0.054040     | 0.998632          | 0.994253            |
PROGRESS: | 5         | 6        | 0.065046     | 0.998974          | 1.000000            |
PROGRESS: | 6         | 7        | 0.075054     | 0.998632          | 1.000000            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

{'confusion_matrix': Columns:
	target_label	int
	predicted_label	int
	count	int

Rows: 2

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        0        |  294  |
|      1       |        1        |  433  |
+--------------+-----------------+-------+
[2 rows x 3 columns]
, 'accuracy': 1.0}

Me: HOLY GUACAMOLE!! LOOK AT MY ACCURACY! IT'S ALMOST PERFECT!

Chris: That can't be right.

Me: I CAN PREDICT THE FUTURE!

Chris: I don't really think that's how it w-

Me: I'M QUITTING MY JOB AND GOING TO VEGAS! I AM AMAZING!!! SEE YOU CLOWNS LATER AAAAAAHAHAHAHAHA!!!!!!!

Chris: Did you leave the difference in scores as a feature for the model to use?

Me: AAAHAHAHA-... what? Yes? ... Yes, I did.

Chris: So the model learned, with 100% accuracy, that the final score predicts who won.

Me: Ahem. Oh.

Me: ... sorry about the clowns thing.

Chris: ...

Me: ... I'm also un-quitting.

Chris: ...

Me: ... I'll go back to my desk now.

Yeah. Once again, putting all the info I had into this black box wasn't quite the best idea. I want to predict who won based on every other stat, but I don't want to use the score. The model picks up pretty quickly that a feature like, oh, say, the point differential between the teams is a pretty darn good indicator of what's in the WIN column. Clever. Let's remove that stuff.


In [40]:
# CLEARLY I CAN PREDICT THE FUTURE
# oh.  remove points

better_features= [ i for i in difftable.column_names() if "_DIFF" in i]
better_features.remove('PTS_DIFF')
better_features.remove('TD_DIFF')
better_features.remove('TDR_DIFF')
better_features.remove('TDP_DIFF')
better_features.remove('DP_DIFF')

better_model = gl.logistic_classifier.create(diff_train, target='WIN', features=better_features)
better_model.show()
print better_model.evaluate(diff_test)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2935
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Number of coefficients    : 9
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.009008     | 0.884497          | 0.885350            |
PROGRESS: | 2         | 3        | 0.017012     | 0.885179          | 0.878981            |
PROGRESS: | 3         | 4        | 0.023016     | 0.883475          | 0.885350            |
PROGRESS: | 4         | 5        | 0.031022     | 0.883816          | 0.885350            |
PROGRESS: | 5         | 6        | 0.040029     | 0.883816          | 0.885350            |
PROGRESS: | 6         | 7        | 0.047033     | 0.883816          | 0.885350            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

{'confusion_matrix': Columns:
	target_label	int
	predicted_label	int
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      0       |        1        |   60  |
|      1       |        1        |  359  |
|      1       |        0        |   43  |
|      0       |        0        |  270  |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.8592896174863388}

That's not bad. The confusion matrix here is an awesome little guide to how the model did at guessing the outcomes of games.
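
As a sanity check on that accuracy number, you can read it straight off the confusion matrix: the two rows where target_label and predicted_label agree are the games the model got right.

# Accuracy from the confusion matrix above: correct predictions / all predictions
correct = 359 + 270  # predicted a win and it was a win, predicted a loss and it was a loss
wrong   = 60 + 43    # predicted a win but they lost, predicted a loss but they won
print float(correct) / (correct + wrong)  # ~0.859, matching evaluate()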

(N)ewb (F)eature exp(L)oration

While I’m in the business of classifying things, I browsed the list of classifiers in the API searching for other high-powered tools to naively explore. Chris pointed me to the boosted trees classifier as an interesting way to explore the effects of each of my features on the classification. It’s not entirely clear to me how a tree, regardless of whether I boosted it or purchased it legally, is going to help me out but it looks easy enough to try.


In [45]:
#Exploring the trees
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN', features=better_features)
btc_classifier.show(view="Tree", tree_id=1)


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2924
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.683e-001  7.977e-001        0.00s
PROGRESS:      1  8.830e-001  8.497e-001        0.01s
PROGRESS:      2  8.916e-001  8.786e-001        0.01s
PROGRESS:      3  8.984e-001  8.671e-001        0.02s
PROGRESS:      4  9.053e-001  8.728e-001        0.02s
PROGRESS:      5  9.124e-001  8.728e-001        0.03s
PROGRESS:      6  9.142e-001  8.728e-001        0.03s
PROGRESS:      7  9.193e-001  8.786e-001        0.04s
PROGRESS:      8  9.254e-001  8.786e-001        0.04s
PROGRESS:      9  9.302e-001  8.786e-001        0.05s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Um, what? I have no idea what I just did and this visualization isn’t helpful at all. I’ve created what looks like an insane decision shrub of a tree but I don’t think that every single tiny decision and combination of features is really this important. Let me try that again, this time using just one decision tree and limiting its depth.


In [46]:
btc_class2 = gl.boosted_trees_classifier.create(diff_train, target='WIN', features=better_features, max_iterations=1,
                                                    max_depth=3)
btc_class2.show(view="Tree", tree_id=0)


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2962
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.126e-001  7.852e-001        0.00s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Ok, limiting the depth did wonders for readability here. This is interesting: the tree isn't making decisions around the things I thought were important. The big first branch happens on rushing attempts. I can view what the model thinks is important with a call to model.get_feature_importance(), which gives me a count of the nodes in the tree that branch on each feature.


In [47]:
btc_class2.get_feature_importance()


Out[47]:
feature count
INT_DIFF 2
RA_DIFF 2
SK_DIFF 2
PA_DIFF 1
[4 rows x 2 columns]

Huh. Not what I thought. The number of rushing attempts is more important than interceptions and more important than rushing yards? While I could maybe see that, it doesn’t line up with the features that other folks on the internet have concluded are important. I’d expect to see something about offense, maybe passing, or total rush yards. Maybe if I make the tree deeper it’d be a little closer to what I’ve read is important in a football game?


In [48]:
btc_bigger = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                features=better_features, max_iterations=50, max_depth=5)
btc_bigger.show(view="Tree", tree_id=0)


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2939
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.425e-001  7.848e-001        0.00s
PROGRESS:      1  8.636e-001  8.418e-001        0.01s
PROGRESS:      2  8.738e-001  8.608e-001        0.01s
PROGRESS:      3  8.799e-001  8.671e-001        0.02s
PROGRESS:      4  8.901e-001  8.608e-001        0.02s
PROGRESS:      5  8.925e-001  8.861e-001        0.03s
PROGRESS:      6  8.962e-001  8.987e-001        0.03s
PROGRESS:      7  8.989e-001  8.924e-001        0.03s
PROGRESS:      8  9.034e-001  8.924e-001        0.04s
PROGRESS:      9  9.078e-001  8.924e-001        0.04s
PROGRESS:     10  9.064e-001  8.924e-001        0.05s
PROGRESS:     11  9.081e-001  8.861e-001        0.05s
PROGRESS:     12  9.115e-001  8.797e-001        0.05s
PROGRESS:     13  9.143e-001  8.797e-001        0.06s
PROGRESS:     14  9.156e-001  8.734e-001        0.06s
PROGRESS:     15  9.187e-001  8.734e-001        0.07s
PROGRESS:     16  9.207e-001  8.734e-001        0.08s
PROGRESS:     17  9.234e-001  8.734e-001        0.09s
PROGRESS:     18  9.238e-001  8.734e-001        0.10s
PROGRESS:     19  9.258e-001  8.861e-001        0.10s
PROGRESS:     20  9.279e-001  8.861e-001        0.11s
PROGRESS:     21  9.296e-001  8.861e-001        0.11s
PROGRESS:     22  9.299e-001  8.797e-001        0.12s
PROGRESS:     23  9.340e-001  8.797e-001        0.12s
PROGRESS:     24  9.364e-001  8.797e-001        0.13s
PROGRESS:     25  9.364e-001  8.797e-001        0.13s
PROGRESS:     26  9.381e-001  8.797e-001        0.14s
PROGRESS:     27  9.384e-001  8.797e-001        0.15s
PROGRESS:     28  9.384e-001  8.797e-001        0.15s
PROGRESS:     29  9.388e-001  8.797e-001        0.16s
PROGRESS:     30  9.401e-001  8.797e-001        0.17s
PROGRESS:     31  9.401e-001  8.797e-001        0.17s
PROGRESS:     32  9.408e-001  8.797e-001        0.18s
PROGRESS:     33  9.435e-001  8.797e-001        0.18s
PROGRESS:     34  9.449e-001  8.797e-001        0.19s
PROGRESS:     35  9.486e-001  8.734e-001        0.20s
PROGRESS:     36  9.483e-001  8.797e-001        0.20s
PROGRESS:     37  9.486e-001  8.797e-001        0.21s
PROGRESS:     38  9.479e-001  8.797e-001        0.22s
PROGRESS:     39  9.496e-001  8.797e-001        0.22s
PROGRESS:     40  9.513e-001  8.797e-001        0.23s
PROGRESS:     41  9.517e-001  8.797e-001        0.23s
PROGRESS:     42  9.541e-001  8.797e-001        0.24s
PROGRESS:     43  9.568e-001  8.734e-001        0.24s
PROGRESS:     44  9.592e-001  8.734e-001        0.25s
PROGRESS:     45  9.612e-001  8.734e-001        0.26s
PROGRESS:     46  9.605e-001  8.734e-001        0.26s
PROGRESS:     47  9.619e-001  8.734e-001        0.27s
PROGRESS:     48  9.622e-001  8.734e-001        0.27s
PROGRESS:     49  9.622e-001  8.734e-001        0.28s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Nope.

I’m going to try going back and adding some more features… maybe compute average yards per pass attempt, yards per rush attempt, average starting field position. You know, just some good to know stuff.


In [49]:
# OK so this isn't really what I expected.
# The first branch is Rushing Attempts, then Interceptions difference.
# should probably add more features
# PEN = penalty yardage against
# SRP = Successful rush plays
# SPP = Successful pass plays
# SFPY =  The Total Starting Field Position Yardage: Dividing by the # of Drives on Offense (DRV)
#         produces the Average Starting Field Position.
# PU = Punts
# DRV = Drives on Offense

# Create a pass yds / attempt feature, and an average starting field position (SFPY/DRV)
# Total rush / rush attempts (AVG RUSH)
extra = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\TEAM.csv', header=True)
# games = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\GAME.csv', header=True)
extra = extra.select_columns(
    [ 'TID', 'GID','SRP',  'SPP',  'PLO', 'PLD', 'PEN', 'SFPY', 'DRV'])
# team_more_info['TOP'] = team_more_info['TOP'].astype(float)
# team_more_info = team_more_info.join(games.select_columns(['GID', 'SEAS', 'WEEK']), 'GID')
# team_more_info = team_more_info[team['WEEK'] < 18]
team = team.join(extra, on=['GID', 'TID'])

#GID 364 has no data (CAR/MIA 2001) for DRV ... without the if x['blah'] !=0 you get a divide by zero
# but ONLY at the point you try and materialize the sframe because the operations are lazy! gah.

team['YD_PER_P_ATT'] = team.apply(lambda x: float(x['PY']) / float(x['PA']) if x['PA'] != 0 else 0)
team['YD_PER_R_ATT'] = team.apply(lambda x: float(x['RY']) / float(x['RA']) if x['RA'] != 0 else 0)
team['AVG_STFP']     = team.apply(lambda x: float(x['SFPY']) / float(x['DRV'])  if x['DRV'] != 0 else 0)

# On second thought, I want to drop that game altogether since I don't trust it. The score and other
# references show they did have at least one offensive drive. wth?
team = team.filter_by( [364], 'GID', exclude=True)

features = ['YD_PER_P_ATT', 'YD_PER_R_ATT', 'AVG_STFP', 'SRP', 'PY', 'RY', 'DRV', 'SPP', 'FUM', 'SK', 'INT', 'PLO',
            'PLD', 'TOP', 'PEN', 'PU']

# doing that joining stuff again
more = team.join(games.select_columns(['GID', 'V', 'H', 'PTSV', 'PTSH']), 'GID')
home = more[more['TNAME'] == more['H']].remove_columns(['PTSH', 'PTSV'])
visit = more[more['TNAME'] == more['V']].remove_columns(['PTSH', 'PTSV'])
difftable = home.join(visit, 'GID')
# the visitor's info is now always under the '.1'-suffixed column names
#so let's express everything again as a difference
for thing in features:
    difftable['%s_DIFF' % thing] = difftable[thing] - difftable['%s.1' % thing]


PROGRESS: Finished parsing file C:\Users\Susan\Documents\NFLData_2000-2014\csv\TEAM.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.135093 secs.
PROGRESS: Finished parsing file C:\Users\Susan\Documents\NFLData_2000-2014\csv\TEAM.csv
PROGRESS: Parsing completed. Parsed 7978 lines in 0.122088 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,str,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,str,str,long,long,long,long,long,long,str,str,str,str,str,str,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [50]:
diff_train, diff_test = difftable.random_split(0.8, seed=0)
stuff = [ i for i in difftable.column_names() if "_DIFF" in i]
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                    features=stuff)
btc_classifier.show(view="Tree", tree_id=1)


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2915
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 16
PROGRESS: Number of unpacked features : 16
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.954e-001  8.205e-001        0.01s
PROGRESS:      1  9.005e-001  8.269e-001        0.02s
PROGRESS:      2  9.129e-001  8.718e-001        0.02s
PROGRESS:      3  9.177e-001  8.654e-001        0.03s
PROGRESS:      4  9.259e-001  8.590e-001        0.03s
PROGRESS:      5  9.369e-001  8.782e-001        0.04s
PROGRESS:      6  9.358e-001  8.910e-001        0.06s
PROGRESS:      7  9.400e-001  8.846e-001        0.06s
PROGRESS:      8  9.444e-001  8.846e-001        0.07s
PROGRESS:      9  9.458e-001  8.846e-001        0.08s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Gah, look at that mess again.


In [55]:
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                    features=stuff, max_depth=3)

all_f_results = btc_classifier.evaluate(diff_test)
btc_classifier.show(view="Tree")
print all_f_results


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2932
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 16
PROGRESS: Number of unpacked features : 16
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.254e-001  8.489e-001        0.00s
PROGRESS:      1  8.571e-001  8.273e-001        0.01s
PROGRESS:      2  8.547e-001  8.705e-001        0.01s
PROGRESS:      3  8.636e-001  8.561e-001        0.02s
PROGRESS:      4  8.660e-001  8.705e-001        0.02s
PROGRESS:      5  8.738e-001  8.849e-001        0.03s
PROGRESS:      6  8.782e-001  8.705e-001        0.03s
PROGRESS:      7  8.810e-001  8.921e-001        0.04s
PROGRESS:      8  8.844e-001  8.993e-001        0.04s
PROGRESS:      9  8.932e-001  8.921e-001        0.04s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

{'confusion_matrix': Columns:
	target_label	int
	predicted_label	int
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|      1       |        1        |  372  |
|      0       |        0        |  280  |
|      1       |        0        |   47  |
|      0       |        1        |   53  |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.8670212765957447}

Hey! Not bad!


In [59]:
print btc_classifier.get_feature_importance()
important_f  = btc_classifier.get_feature_importance()['feature']


+-------------------+-------+
|      feature      | count |
+-------------------+-------+
| YD_PER_P_ATT_DIFF |   15  |
|      RY_DIFF      |   14  |
|   AVG_STFP_DIFF   |   12  |
|      INT_DIFF     |   8   |
|      FUM_DIFF     |   6   |
|      SRP_DIFF     |   6   |
|      PEN_DIFF     |   5   |
|      SK_DIFF      |   2   |
|      DRV_DIFF     |   1   |
|      PU_DIFF      |   1   |
+-------------------+-------+
[10 rows x 2 columns]

Oh man, that looks … actually pretty good. My most important features relate to the passing game, field position and ball control.

Out of all the features I gave it, the model identified the most important ones. Just for giggles, I can start making life hard for my model. Yes, yards per pass attempt appears to be important. What if I ONLY gave the model that feature? How accurate am I then? Thanks to Chris’ suggestion, I can loop through the list of features here, add them one by one and see how accurate I am with just that subset of features.


In [62]:
models_i = []
accuracy_i = []
f_imp = []
for i in range(1, len(important_f )):
    print
    btc = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                             features=important_f[:i],
                                             max_depth=3)
    models_i.append(btc)
    accuracy_i.append(btc.evaluate(diff_test))
    f_imp.append(btc.get_feature_importance())

accurate = [ x['accuracy'] for x in accuracy_i]
test_sf = gl.SFrame( {'feature': range(1, len(important_f)), 'accuracy': accurate} )


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2924
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  7.469e-001  7.279e-001        0.00s
PROGRESS:      1  7.469e-001  7.279e-001        0.00s
PROGRESS:      2  7.469e-001  7.279e-001        0.01s
PROGRESS:      3  7.469e-001  7.279e-001        0.01s
PROGRESS:      4  7.469e-001  7.279e-001        0.01s
PROGRESS:      5  7.500e-001  7.279e-001        0.01s
PROGRESS:      6  7.500e-001  7.279e-001        0.01s
PROGRESS:      7  7.500e-001  7.279e-001        0.01s
PROGRESS:      8  7.500e-001  7.279e-001        0.01s
PROGRESS:      9  7.517e-001  7.211e-001        0.02s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2921
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 2
PROGRESS: Number of unpacked features : 2
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.100e-001  8.133e-001        0.00s
PROGRESS:      1  8.069e-001  8.133e-001        0.00s
PROGRESS:      2  8.148e-001  8.133e-001        0.01s
PROGRESS:      3  8.144e-001  8.133e-001        0.01s
PROGRESS:      4  8.148e-001  8.133e-001        0.01s
PROGRESS:      5  8.138e-001  8.133e-001        0.01s
PROGRESS:      6  8.144e-001  8.133e-001        0.01s
PROGRESS:      7  8.148e-001  8.133e-001        0.02s
PROGRESS:      8  8.148e-001  8.133e-001        0.02s
PROGRESS:      9  8.148e-001  8.133e-001        0.02s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2925
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 3
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.202e-001  7.671e-001        0.00s
PROGRESS:      1  8.403e-001  7.329e-001        0.01s
PROGRESS:      2  8.441e-001  7.329e-001        0.01s
PROGRESS:      3  8.479e-001  7.329e-001        0.01s
PROGRESS:      4  8.520e-001  7.534e-001        0.01s
PROGRESS:      5  8.540e-001  7.740e-001        0.02s
PROGRESS:      6  8.561e-001  7.740e-001        0.02s
PROGRESS:      7  8.544e-001  7.740e-001        0.02s
PROGRESS:      8  8.554e-001  7.671e-001        0.03s
PROGRESS:      9  8.568e-001  7.671e-001        0.03s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2940
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 4
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.207e-001  8.321e-001        0.00s
PROGRESS:      1  8.364e-001  8.550e-001        0.01s
PROGRESS:      2  8.497e-001  8.626e-001        0.01s
PROGRESS:      3  8.514e-001  8.779e-001        0.01s
PROGRESS:      4  8.544e-001  8.779e-001        0.01s
PROGRESS:      5  8.619e-001  8.702e-001        0.02s
PROGRESS:      6  8.636e-001  8.779e-001        0.02s
PROGRESS:      7  8.653e-001  8.855e-001        0.02s
PROGRESS:      8  8.673e-001  8.931e-001        0.02s
PROGRESS:      9  8.670e-001  9.008e-001        0.03s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2919
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 5
PROGRESS: Number of unpacked features : 5
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.181e-001  8.750e-001        0.01s
PROGRESS:      1  8.345e-001  8.355e-001        0.01s
PROGRESS:      2  8.438e-001  8.684e-001        0.01s
PROGRESS:      3  8.565e-001  8.224e-001        0.01s
PROGRESS:      4  8.602e-001  8.289e-001        0.02s
PROGRESS:      5  8.650e-001  8.421e-001        0.02s
PROGRESS:      6  8.722e-001  8.224e-001        0.02s
PROGRESS:      7  8.750e-001  8.355e-001        0.03s
PROGRESS:      8  8.777e-001  8.158e-001        0.03s
PROGRESS:      9  8.777e-001  8.158e-001        0.03s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2904
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.340e-001  7.784e-001        0.00s
PROGRESS:      1  8.464e-001  8.204e-001        0.01s
PROGRESS:      2  8.567e-001  8.084e-001        0.01s
PROGRESS:      3  8.629e-001  8.024e-001        0.01s
PROGRESS:      4  8.657e-001  8.204e-001        0.01s
PROGRESS:      5  8.705e-001  8.144e-001        0.02s
PROGRESS:      6  8.743e-001  8.263e-001        0.02s
PROGRESS:      7  8.760e-001  8.263e-001        0.02s
PROGRESS:      8  8.798e-001  8.323e-001        0.03s
PROGRESS:      9  8.822e-001  8.383e-001        0.03s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2929
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 7
PROGRESS: Number of unpacked features : 7
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.218e-001  8.380e-001        0.01s
PROGRESS:      1  8.327e-001  8.521e-001        0.01s
PROGRESS:      2  8.508e-001  8.592e-001        0.02s
PROGRESS:      3  8.621e-001  8.662e-001        0.02s
PROGRESS:      4  8.686e-001  8.803e-001        0.02s
PROGRESS:      5  8.730e-001  8.944e-001        0.02s
PROGRESS:      6  8.774e-001  9.014e-001        0.03s
PROGRESS:      7  8.812e-001  9.014e-001        0.03s
PROGRESS:      8  8.829e-001  8.944e-001        0.03s
PROGRESS:      9  8.860e-001  9.085e-001        0.04s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2919
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 8
PROGRESS: Number of unpacked features : 8
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.232e-001  8.882e-001        0.00s
PROGRESS:      1  8.338e-001  8.684e-001        0.01s
PROGRESS:      2  8.527e-001  8.947e-001        0.01s
PROGRESS:      3  8.619e-001  9.013e-001        0.01s
PROGRESS:      4  8.702e-001  9.079e-001        0.02s
PROGRESS:      5  8.691e-001  8.947e-001        0.02s
PROGRESS:      6  8.763e-001  9.013e-001        0.02s
PROGRESS:      7  8.811e-001  9.079e-001        0.02s
PROGRESS:      8  8.866e-001  9.079e-001        0.03s
PROGRESS:      9  8.911e-001  9.145e-001        0.03s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 2894
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 9
PROGRESS: Number of unpacked features : 9
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0  8.286e-001  7.627e-001        0.00s
PROGRESS:      1  8.583e-001  8.192e-001        0.01s
PROGRESS:      2  8.642e-001  8.475e-001        0.01s
PROGRESS:      3  8.645e-001  8.475e-001        0.01s
PROGRESS:      4  8.704e-001  8.701e-001        0.02s
PROGRESS:      5  8.732e-001  8.757e-001        0.02s
PROGRESS:      6  8.822e-001  8.701e-001        0.02s
PROGRESS:      7  8.901e-001  8.814e-001        0.03s
PROGRESS:      8  8.887e-001  8.757e-001        0.03s
PROGRESS:      9  8.922e-001  8.870e-001        0.03s
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


In [63]:
test_sf.show(view="Line Chart", x='feature', y='accuracy')


Keeping in mind that I'm adding them in order of importance, it looks like I get about 75% accuracy from just the first feature, yards per pass attempt. I get another big boost by including rushing yards, starting field position and interceptions… but after that, the rest of the features really aren't contributing much. This sounds about right, from what I've read about the current state of the pro football world.

And there you have it. This is what it feels like to get into machine learning. Lots of "I have no idea what I'm doing" followed by an occasional "huh… that was pretty cool". There's plenty of additional tuning that can (and should) be done, but being able to create insights and get moving with the basics quickly makes the topic a lot more approachable. And once I figure out how to use machine learning to beat the odds, I'm going to Vegas.

I’ll send Chris a postcard.

