I love football more than engineers love coffee, all my Turi friends know that. Throughout the course of an NFL season I have fantasy teams, point-spread pools, survivor pools and non-stop pontification on the latest terrible play call, awful refs and amazing plays (GO HAWKS!). Despite hearing about all the neato, life-changing, world-saving, cutting-edge, inspirational things machine learning could be used for, my first thought was "Huh. I should learn that and use it for football."
Well, let's take a shot.
I'm no data scientist and this post won't be terribly math-y. I'm a software engineer that's fond of BBQ and learning by doing. Thankfully, I work with a slew of smart folks kind enough to assist with these educational shenanigans and provide guidance when I run off the rails. We'll apply the simplest of machine learning concepts, linear regression, to football in the simplest way and see what happens. You get to learn from all of the glorious mistakes I made so that you can go on to make bigger, better mistakes. We'll learn, together, inch by inch.
My dataset is from Armchair Analysis, which is a pretty rad source of well-curated, nicely documented NFL stats. You can grab yourself a copy for pretty cheap.
In [22]:
#Fire up the GraphLab engine
import graphlab as gl
import graphlab.aggregate as agg
The team SFrame has one row per team per game, holding that team's stats for the game. The games SFrame has info about each game itself: which teams played in it, where, and so on. So first, let me narrow things down to what I think are some relevant columns and join some info about each game onto the team table.
In [23]:
team = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\TEAM.csv', header=True)
games = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\GAME.csv', header=True)
# TID   = team id
# GID   = game id
# TNAME = team name
# PTS   = points
# RY    = rushing yards
# RA    = rushing attempts
# PY    = passing yards
# PA    = pass attempts
# PC    = pass completions
# PU    = punts
# SK    = sacks against
# FUM   = fumbles lost
# INT   = interceptions by the defense
# TOP   = time of possession
# TD    = touchdowns
# TDR   = rushing touchdowns
# TDP   = passing touchdowns
# PLO   = total plays on offense
# PLD   = total plays on defense
# DP    = points by defense
team = team.select_columns(
['TNAME', 'TID', 'GID', 'PTS', 'RY', 'RA', 'PY', 'PA', 'PC', 'PU',
'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP'])
team['TOP'] = team['TOP'].astype(float)
team = team.join(games.select_columns(['GID', 'SEAS', 'WEEK']), 'GID')
team['SEAS'] = team['SEAS'].astype(str)
# restrict to regular season
team = team[team['WEEK'] < 18]
Here's what we've got so far:
In [24]:
team.head(5)
Out[24]:
Ok. Great. I have data... how do I data science it? I know from intro reading that linear regression is probably the simplest thing for me to try. But... what kind of question can I even answer with linear regression? Well, the way I understand it, we can predict a continuous variable Y that changes based on the values of a bunch of other variables: X1, X2, and so on. As all the Xs change, our predicted Y changes with them. (There are better, math-happy, grown-up explanations out there if you're after next-level enlightenment.) The data science "hello world" example everyone on the planet uses to explain linear regression is predicting the price of a house: how much will a house cost (Y) based on a bunch of info we have (Xs) like square footage, number of rooms, etc.? I have a bunch of info about team stats. What do I want to predict?
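Before picking a target, here's the one-formula version of what linear regression actually fits (the standard textbook form, nothing GraphLab-specific):

$$Y = w_0 + w_1 X_1 + w_2 X_2 + \dots + w_n X_n$$

Training is just finding the weights $w_i$ that make the predicted Y land as close as possible to the actual Y across all of the examples.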
Let's try to predict the total season wins for a team given that team's stats for the year. Right now I don't have a season summary for each team, but I can make one. I'll also need to generate a season-win column for each team. I'm also going to split my data into a training set and a test set: I feed my model the training set to learn on, then reserve the test set to see how well the model performs against data it hasn't seen yet. What I don't need to do is code up the algorithm itself; GraphLab Create has a toolkit for that.
In [25]:
# add a column to indicate if the team won this game or not
winners = team.groupby(key_columns='GID', operations={'WIN_TID': agg.ARGMAX('PTS', 'TID')})
team = team.join(winners, 'GID')
team['WIN'] = team['TID'] == team['WIN_TID']
# create season summary
of_interest = ['PTS', 'RY', 'RA', 'PY', 'PA', 'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP', 'WIN']
season = team.groupby(['SEAS', 'TNAME'], {'%s_SUM' % x: agg.SUM(x) for x in of_interest})
# keep the generated *_SUM column names as candidate features, minus the target
season_sums = filter(lambda x: '_SUM' in x, season.column_names())
season_sums.remove('WIN_SUM')
I now have season summaries that look like this:
In [26]:
season.head()
Out[26]:
Awesome. Let's get to it.
In [27]:
# predict number of wins
#split the data into train and test
season_train, season_test = season.random_split(0.8, seed=0)
lin_model = gl.linear_regression.create(season_train, target='WIN_SUM', features=season_sums)
lin_pred = lin_model.predict(season_test)
lin_result = lin_model.evaluate(season_test)
How'd we do?
In [28]:
print lin_result
Printing out lin_result shows us our max_error was around 5, and our RMSE (root mean squared error, a favorite measure of how wrong you are) is around 1.8. That's... not bad? Just browsing the first few predictions, the model is generally close but not great.
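For the record, RMSE is exactly what the name spells out: square each prediction's error, average the squares over the test set, then take the square root:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

So an RMSE around 1.8 means my season-win predictions are typically off by somewhere around two wins.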
In [29]:
print season_test.head(5)['WIN_SUM']
print lin_pred[0:5]
The first five actual win totals of my test data are (4, 11, 5, 10, 4) and the first few predictions out of lin_pred come out around (5.9, 11.8, 4.2, 8.7, ...). I can print the lin_model to get a little more information about what's going on here, or use lin_model.show() to see it all pretty in the browser.
In [30]:
gl.canvas.set_target('ipynb')
lin_model.show()
Not surprisingly, the Xs with positive coefficients are things like the sum of passing touchdowns, total points, and time of possession. The Xs that drag the Y value down are things like interceptions, sacks, and fumbles. But for the life of me, I can't figure out why the sum of touchdowns would bring the win-sum prediction down. I had to ask my data science sherpa Chris why this could happen.
Chris: I suspect that several of your features are highly correlated. That makes it tricky for the linear regression model to produce parameter estimates that aren't noisy; it doesn't know whether to make the coefficients on two highly correlated features large or small, since the predictions can come out the same either way. (Check out Wikipedia's page on "multicollinearity" to read more.)
You have two options:
- hand select a subset of the features, and see if the coefficients from the model make more sense. Then decide if the accuracy of the previous model is worth the decrease in interpretability.
- experiment with increasing the l2_penalty and l1_penalty arguments: this encourages the optimization to find estimates where some features are either close to 0 or dropped entirely (sketched below).
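I ended up going with option one, but for completeness, here's a minimal sketch of what option two might look like. The l2_penalty and l1_penalty arguments are real parameters of GraphLab Create's linear_regression.create; the penalty values themselves are just guesses I'd need to tune, not settings I actually validated.
In [ ]:
# Option two, sketched: regularization pushes noisy, correlated
# coefficients toward zero (the L1 term can drop features entirely).
# The penalty values below are arbitrary starting points, not tuned.
reg_model = gl.linear_regression.create(season_train, target='WIN_SUM',
                                        features=season_sums,
                                        l2_penalty=0.1, l1_penalty=1.0)
reg_model.show()  # compare these coefficients against the unregularized run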
Yeah, actually. I do have multiple columns that are kinda correlated. All of the points columns are related: the sums of rushing and passing TDs feed directly into the sum of total points. Alright, throwing every single column of data I had at this didn't give me what I wanted. Given Chris' explanation, the simplest thing to try is removing the features that duplicate the effect of points from the season_sums list and training the model again. After removing TDR_SUM (rushing touchdown sum), TDP_SUM (passing touchdown sum), and PTS_SUM (total points), I end up with coefficients that look like this.
In [31]:
# Why does TD_SUM end up in a negative coefficient?
# Points are represented by multiple features, they're correlated and it's confusing the model
season_sums.remove('PTS_SUM')
season_sums.remove('TDP_SUM')
season_sums.remove('TDR_SUM')
lin_model = gl.linear_regression.create(season_train, target='WIN_SUM', features=season_sums)
lin_pred = lin_model.predict(season_test)
lin_result = lin_model.evaluate(season_test)
lin_model.show()
That's more what I expected. Touchdowns and time of possession lead the list of positive coefficients while interceptions lead the list of negative. It doesn’t look like my predictions changed all that much as a result, but the coefficients make more sense to me now.
In [32]:
print lin_result
print season_test.head(5)['WIN_SUM']
print lin_pred[0:5]
Now, what if I want to predict who won a specific game? Rather than trying to predict a sum, I want to predict whether one column, specifically the WIN column, contains a 1 (they won) or a 0 (they lost) based on a team's total stats for just that game. This sounds fun. But first, Chris advised that I do a little data reformatting. It'll be easier to try to predict a win by looking at the difference between the numbers of the two teams playing each other. I'm going to express this as Home minus Away, in columns named <STAT>_DIFF (PTS_DIFF, RY_DIFF, and so on).
One side note here. Whereas earlier I was trying to predict a continuous number, what I'm really trying to do now is predict whether a game is a WIN or a LOSS. In other words, the WIN column contains two categories: zero or one. It's not going to be a 0.7 or a 1.3. Using a formula similar to linear regression to classify the value of Y like this is commonly called logistic regression. Rather than code up the math by hand, I'm going to use GLC's toolkit and get right to the fun stuff.
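For the formula-curious: logistic regression takes the same weighted sum that linear regression uses and squashes it through the sigmoid function, so the output always lands between 0 and 1 and can be read as the probability of a win:

$$P(\mathrm{WIN} = 1) = \frac{1}{1 + e^{-(w_0 + w_1 X_1 + \dots + w_n X_n)}}$$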
In [43]:
# OK, so after talking to Chris, refactor the data into one row per game
# With values being Home - Away
# add info about who was home and away
more = team.join(games.select_columns(['GID', 'V', 'H', 'PTSV', 'PTSH']), 'GID')
home = more[more['TNAME'] == more['H']].remove_columns(['PTSH', 'PTSV'])
visit = more[more['TNAME'] == more['V']].remove_columns(['PTSH', 'PTSV'])
difftable = home.join(visit, 'GID')
# after the join, the visitor's info always lands under '.1' column names
for thing in ['PTS', 'RY', 'RA', 'PY', 'PA', 'SK', 'FUM', 'INT', 'TOP', 'TD', 'TDR', 'TDP', 'DP']:
    difftable['%s_DIFF' % thing] = difftable[thing] - difftable['%s.1' % thing]
diff_train, diff_test = difftable.random_split(0.8, seed=20)
diff_lgc_model = gl.logistic_classifier.create(diff_train, target='WIN',
                                               features=[i for i in difftable.column_names() if "_DIFF" in i])
diff_result = diff_lgc_model.evaluate(diff_test)
print diff_result
diff_lgc_model.show()
Me: HOLY GUACAMOLE!! LOOK AT MY ACCURACY! IT'S ALMOST PERFECT!
Chris: That can't be right.
Me: I CAN PREDICT THE FUTURE!
Chris: I don't really think that's how it w-
Me: I'M QUITTING MY JOB AND GOING TO VEGAS! I AM AMAZING!!! SEE YOU CLOWNS LATER AAAAAAHAHAHAHAHA!!!!!!!
Chris: Did you leave the difference in scores as a feature for the model to use?
Me: AAAHAHAHA-... what? Yes? ... Yes, I did.
Chris: So the model learned, with 100% accuracy, that the final score predicts who won.
Me: Ahem. Oh. ... sorry about the clowns thing.
Chris: ...
Me: ... I'm also un-quitting.
Chris: ...
Me: ... I'll go back to my desk now.
Yeah. Once again, putting all the info I had into this black box wasn't quite the best idea. I want to predict who won based on every other stat, but I don't want to use the score itself. The model picks up pretty quickly that a feature like, oh, say, the point differential between the teams is a pretty darn good indicator of what's in the WIN column. Clever. Let's remove that stuff.
In [40]:
# CLEARLY I CAN PREDICT THE FUTURE
# oh. remove points
better_features = [i for i in difftable.column_names() if "_DIFF" in i]
better_features.remove('PTS_DIFF')
better_features.remove('TD_DIFF')
better_features.remove('TDR_DIFF')
better_features.remove('TDP_DIFF')
better_features.remove('DP_DIFF')
better_model = gl.logistic_classifier.create(diff_train, target='WIN', features=better_features)
better_model.show()
print better_model.evaluate(diff_test)
That's not bad. The confusion matrix here is an awesome little guide to how the model did guessing the outcome of games: it counts how many wins and losses the model labeled correctly, and how many it got wrong in each direction.
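Side note: Canvas isn't the only way to see that matrix. As far as I can tell, the dictionary that evaluate() hands back includes the confusion matrix alongside the accuracy number, so a sketch like this should print it at the console (treat the key name as my assumption):
In [ ]:
# Pull the confusion matrix out of the evaluation results.
# Each row pairs an actual label with a predicted label and a count.
eval_dict = better_model.evaluate(diff_test)
print eval_dict['confusion_matrix']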
While I’m in the business of classifying things, I browsed the list of classifiers in the API searching for other high-powered tools to naively explore. Chris pointed me to the boosted trees classifier as an interesting way to explore the effects of each of my features on the classification. It’s not entirely clear to me how a tree, regardless of whether I boosted it or purchased it legally, is going to help me out but it looks easy enough to try.
In [45]:
#Exploring the trees
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN', features=better_features)
btc_classifier.show(view="Tree", tree_id=1)
Um, what? I have no idea what I just did and this visualization isn’t helpful at all. I’ve created what looks like an insane decision shrub of a tree but I don’t think that every single tiny decision and combination of features is really this important. Let me try that again, this time using just one decision tree and limiting its depth.
In [46]:
btc_class2 = gl.boosted_trees_classifier.create(diff_train, target='WIN', features=better_features,
                                                max_iterations=1, max_depth=3)
btc_class2.show(view="Tree", tree_id=0)
Ok, limiting the depth did wonders for my readability here. This is interesting: the tree isn't making decisions around the things I thought were important. The big first branch happens on rushing attempts. I can view what the model thinks is important with a call to model.get_feature_importance(), which counts, for each feature, the tree nodes that branch on it.
In [47]:
btc_class2.get_feature_importance()
Out[47]:
Huh. Not what I thought. The number of rushing attempts is more important than interceptions and more important than rushing yards? While I could maybe see that, it doesn’t line up with the features that other folks on the internet have concluded are important. I’d expect to see something about offense, maybe passing, or total rush yards. Maybe if I make the tree deeper it’d be a little closer to what I’ve read is important in a football game?
In [48]:
btc_bigger = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                features=better_features, max_iterations=50, max_depth=5)
btc_bigger.show(view="Tree", tree_id=0)
Nope.
I’m going to try going back and adding some more features… maybe compute average yards per pass attempt, yards per rush attempt, and average starting field position. You know, just some good-to-know stuff.
In [49]:
# OK, so this isn't really what I expected.
# The first branch is rushing attempts, then the interceptions difference.
# Should probably add more features.
# PEN  = penalty yardage against
# SRP  = successful rush plays
# SPP  = successful pass plays
# SFPY = total starting field position yardage; dividing it by the number
#        of drives on offense (DRV) gives average starting field position
# PU   = punts
# DRV  = drives on offense
# Create a pass yards per attempt feature, an average starting field
# position (SFPY/DRV), and rushing yards per rush attempt (avg rush).
extra = gl.SFrame.read_csv('C:\\Users\\Susan\\Documents\\NFLData_2000-2014\\csv\\TEAM.csv', header=True)
extra = extra.select_columns(
    ['TID', 'GID', 'SRP', 'SPP', 'PLO', 'PLD', 'PEN', 'SFPY', 'DRV'])
team = team.join(extra, on=['GID', 'TID'])
# GID 364 (CAR/MIA 2001) has no data for DRV. Without the "if x[...] != 0"
# guards you get a divide by zero, but ONLY at the point you try to
# materialize the SFrame, because the operations are lazy! Gah.
team['YD_PER_P_ATT'] = team.apply(lambda x: float(x['PY']) / float(x['PA']) if x['PA'] != 0 else 0)
team['YD_PER_R_ATT'] = team.apply(lambda x: float(x['RY']) / float(x['RA']) if x['RA'] != 0 else 0)
team['AVG_STFP'] = team.apply(lambda x: float(x['SFPY']) / float(x['DRV']) if x['DRV'] != 0 else 0)
# On second thought, I want to drop that game altogether since I don't trust it.
# The score and other references show they did have offensive drives. wth?
team = team.filter_by([364], 'GID', exclude=True)
features = ['YD_PER_P_ATT', 'YD_PER_R_ATT', 'AVG_STFP', 'SRP', 'PY', 'RY', 'DRV', 'SPP', 'FUM', 'SK', 'INT', 'PLO',
'PLD', 'TOP', 'PEN', 'PU']
# doing that joining stuff again
more = team.join(games.select_columns(['GID', 'V', 'H', 'PTSV', 'PTSH']), 'GID')
home = more[more['TNAME'] == more['H']].remove_columns(['PTSH', 'PTSV'])
visit = more[more['TNAME'] == more['V']].remove_columns(['PTSH', 'PTSV'])
difftable = home.join(visit, 'GID')
# the visitor's info again lands under the '.1' column names,
# so let's express everything as a home-minus-visitor difference
for thing in features:
    difftable['%s_DIFF' % thing] = difftable[thing] - difftable['%s.1' % thing]
In [50]:
diff_train, diff_test = difftable.random_split(0.8, seed=0)
stuff = [ i for i in difftable.column_names() if "_DIFF" in i]
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                    features=stuff)
btc_classifier.show(view="Tree", tree_id=1)
Gah, look at that mess again.
In [55]:
btc_classifier = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                                    features=stuff, max_depth=3)
all_f_results = btc_classifier.evaluate(diff_test)
btc_classifier.show(view="Tree")
print all_f_results
Hey! Not bad!
In [59]:
print btc_classifier.get_feature_importance()
important_f = btc_classifier.get_feature_importance()['feature']
Oh man, that looks … actually pretty good. My most important features relate to the passing game, field position and ball control.
Out of all the features I gave it, the model identified the most important ones. Just for giggles, I can start making life hard for my model. Yes, yards per pass attempt appears to be important. What if I ONLY gave the model that feature? How accurate am I then? Thanks to Chris’ suggestion, I can loop through the list of features here, add them one by one and see how accurate I am with just that subset of features.
In [62]:
models_i = []
accuracy_i = []
f_imp = []
for i in range(1, len(important_f)):
    # train on only the i most important features
    btc = gl.boosted_trees_classifier.create(diff_train, target='WIN',
                                             features=important_f[:i],
                                             max_depth=3)
    models_i.append(btc)
    accuracy_i.append(btc.evaluate(diff_test))
    f_imp.append(btc.get_feature_importance())
accurate = [x['accuracy'] for x in accuracy_i]
test_sf = gl.SFrame({'feature': range(1, len(important_f)), 'accuracy': accurate})
In [63]:
test_sf.show(view="Line Chart", x='feature', y='accuracy')
Keeping in mind I’m adding them in by order of importance, it looks like I get 75% accuracy from just the first feature, yards per pass attempt. I get another big boost by including starting field position, interceptions, and rushing yards… but after that, the rest of the features really aren’t contributing much. That sounds about right, given what I’ve read about the current state of the pro football world.
And there you have it. This is what it feels like to get into machine learning. Lots of “I have no idea what I’m doing” followed by an occasional “huh… that was pretty cool”. There’s plenty of additional tuning that can (and should) be done, but being able to get moving with the basics and pull out insights quickly makes the topic a lot more approachable. And once I figure out how to use machine learning to beat the odds, I’m going to Vegas.
I’ll send Chris a postcard.