This notebook builds a machine learning model that can be used to predict the outcomes of professional soccer games. The notebook was created for the "Predicting the Future with the Google Cloud Platform" talk at Google I/O 2014 by Jordan Tigani and Felipe Hoffa. A link to the presentation is here: https://www.youtube.com/watch?v=YyvvxFeADh8
Once the machine learning model is built, we use it to predict outcomes in the World Cup. If you are seeing this after the World Cup is over, you can use it to predict hypothetical matchups (how would the 2010 World Cup winners do against the current champions?). You can also see how various strategies would affect prediction outcomes. Maybe you'd like to add player salary data and see how that affects predictions (likely it will help a lot). Maybe you'd like to try Poisson Regression instead of Logistic Regression. Or maybe you'd like to try data transformation techniques like whitening or PCA.
The model uses Logistic Regression, built from touch-by-touch data about three different soccer leagues (English Premier League, Spanish La Liga, and American Major League Soccer) over multiple seasons. Because the data is licensed, only the aggregated statistics about those games are available. (If you have ideas for other statistics you'd like to see, create a new issue in the https://github.com/GoogleCloudPlatform/ipython-soccer-predictions GitHub repo and we'll see what we can do.) The match_stats.py file shows the raw queries that were used to generate the stats.
There are four python files that are used by this notebook. They must be in the path. These are: match_stats.py, features.py, world_cup.py, and power.py.
Since we're providing this notebook as part of a Docker image that can be run on Google Compute Engine, we'll override the authorization used in the Pandas BigQuery connector to use GCE auth. This will mean that you don't have to do any authorization on your own. You must, however, have the BigQuery API enabled in your Google Cloud Project (https://console.developers.google.com). Because the data sizes (after aggregation) are quite small, you may not need to enable billing.
In [1]:
from oauth2client.gce import AppAssertionCredentials
from bigquery_client import BigqueryClient
from pandas.io import gbq

def GetMetadata(path):
    # Read a value from the GCE metadata server (e.g. the current project id).
    import urllib2
    BASE_PATH = 'http://metadata/computeMetadata/v1/'
    request = urllib2.Request(BASE_PATH + path, headers={'Metadata-Flavor': 'Google'})
    return urllib2.urlopen(request).read()

credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')
client = BigqueryClient(credentials=credentials,
                        api='https://www.googleapis.com',
                        api_version='v2',
                        project_id=GetMetadata('project/project-id'))
# Override the pandas BigQuery connector's authentication to use the GCE
# service account credentials set up above.
gbq._authenticate = lambda: client
In [1]:
from pandas.io import gbq

# Import the four python modules that we use.
import match_stats
import features
import world_cup
import power

# Pull a single row from the per-team game summary to verify that the
# BigQuery connection works and to see what the data looks like.
query = "SELECT * FROM (%(summary_query)s) LIMIT 1" % {
    'summary_query': match_stats.team_game_summary_query()}
gbq.read_gbq(query)
Out[1]:
This will return a pandas dataframe that contains the features that will be used to build a model.
The features query will read from the game summary table that has prepared per-game statistics that will be used to predict outcomes. The data has been aggregated from touch-by-touch data from Opta. However, since that data is not public, we use these prepared statistics instead of the raw data.
In order to predict a game, we look at the previous N games of history for each team, where N is defined here as history_size.
In [3]:
import features
reload(features)
# Sets the history size. This is how far back we will look before each game to aggregate statistics
# to predict the next game. For example, a history size of 5 will look at the previous 5 games played
# by a particular team in order to predict the next game.
history_size = 6
game_summaries = features.get_game_summaries()
data = features.get_features(history_size)
The features include rollups from the last N games (history_size). Most of them are averages that are computed per minute of game time. Per-minute stats are used so that we can normalize for World Cup games that go into overtime.
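To make the per-minute idea concrete, here is a minimal sketch (not the notebook's actual implementation, which lives in the SQL inside match_stats.py) of turning a raw per-game stat into a per-minute rate and averaging it over each team's previous games. The column names 'passes', 'minutes_played' and 'timestamp' are hypothetical.
In [ ]:
import pandas as pd

def per_minute_history(games, stat_col, history_size):
    # Sort each team's games in time order, convert the raw stat to a
    # per-minute rate, and average that rate over the previous
    # `history_size` games (shift(1) excludes the game being predicted).
    games = games.sort(['teamid', 'timestamp'])
    rate = games[stat_col] / games['minutes_played']
    return rate.groupby(games['teamid']).apply(
        lambda s: pd.rolling_mean(s.shift(1), history_size))

# Example: average passes per minute over each team's previous 6 games.
# passes_per_minute_hist = per_minute_history(game_summaries, 'passes', 6)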
The following columns are the features that will be used to build the prediction model:
The following columns are included as metadata about the match:
The following columns are target variables that we will be attempting to predict. These columns must be dropped before any prediction is done, but are useful when building a model. The models that we will build below will just try to predict outcome (points) but other models may choose to predict goals, which is why they are also included here.
In [4]:
# Partition the world cup data and the club data. We're only going to train our model using club data.
club_data = data[data['competitionid'] != 4]
# Show the features for the latest game in competition id 4, which is the world cup.
data[data['competitionid'] == 4].iloc[0]
Out[4]:
Compute the crosstabs for goals scored vs outcomes. Scoring more than 5 goals means you're guaranteed to win, and scoring no goals means you lose about 75% of the time (sometimes you tie!).
In [5]:
import pandas as pd

pd.crosstab(
    club_data['goals'],
    club_data.replace(
        {'points': {
            0: 'lose', 1: 'tie', 3: 'win'}})['points'])
Out[5]:
We're going to train a logistic regression model based on the club data only. This will use an external code file, world_cup.py, to build the model.
The output of this cell will be a logistic regression model and a test set that we can use to check how good we are at predicting outcomes. The cell will also print out the (pseudo) R-squared value for the regression. This is a measure of how well the model fits the data (higher is better).
In [6]:
import world_cup
reload(world_cup)
import match_stats
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Don't train on games that ended in a draw, since they have less signal.
train = club_data.loc[club_data['points'] != 1]
# train = club_data
(model, test) = world_cup.train_model(
    train, match_stats.get_non_feature_columns())
print "\nRsquared: %0.03g" % model.prsquared
The logistic regression model is built using regularization; this means that it penalizes complex models. This has the side effect of helping us with feature selection: features that are not important will be dropped out of the model completely.
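To see the effect of L1 regularization in isolation, here is a standalone sketch on synthetic data (just an illustration, not the notebook's world_cup.train_model): coefficients for uninformative features get driven to exactly zero.
In [ ]:
import numpy as np
import statsmodels.api as sm

# Fit an L1-regularized logistic regression on random data where only two of
# the ten features carry any signal; the rest should be zeroed out.
rng = np.random.RandomState(0)
X = sm.add_constant(rng.randn(500, 10))
y = ((X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.randn(500)) > 0).astype(float)
l1_model = sm.Logit(y, X).fit_regularized(method='l1', alpha=10.0, disp=0)
print l1_model.params  # most of the uninformative coefficients are exactly 0.0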
We can divide the features into three buckets: positive features (having more of these increases the predicted chance of winning), dropped features (regularization pushed their coefficients to exactly zero, so they don't affect the prediction), and negative features (having more of these decreases the predicted chance of winning).
In [7]:
import numpy as np

def print_params(model, limit=None):
    params = model.params.copy()
    params.sort(ascending=False)
    del params['intercept']
    if not limit:
        limit = len(params)
    print("Positive features")
    params.sort(ascending=False)
    print np.exp(params[[param > 0.001 for param in params]]).sub(1)[:limit]
    print("\nDropped features")
    print params[[param == 0.0 for param in params]][:limit]
    print("\nNegative features")
    params.sort(ascending=True)
    print np.exp(params[[param < -0.001 for param in params]]).sub(1)[:limit]

print_params(model, 10)
This cell uses the test set (which was not used during the creation of the model) to predict outcomes. We can examine a few of the predictions to see how well we did. We'll show 5 each from two buckets: cases where we got it right, and cases where we got it wrong. We can see if these make sense. When we display these, the home team is always on the left.
For example, it might show that we predicted Manchester United playing at home beating Sunderland. This is completely reasonable and we'd expect that the outcome would be 3 points (a victory).
The columns of the output are:
In [8]:
reload(world_cup)
results = world_cup.predict_model(model, test,
    match_stats.get_non_feature_columns())
predictions = world_cup.extract_predictions(
    results.copy(), results['predicted'])

print 'Correct predictions:'
predictions[(predictions['predicted'] > 50) & (predictions['points'] == 3)][:5]
Out[8]:
In [9]:
print '\nIncorrect predictions:'
predictions[(predictions['predicted'] > 50) & (predictions['points'] < 3)][:5]
Out[9]:
Next, we want to actually quantify how good our predictions are. We can compute the lift ("How much better are we doing than random chance?"), compute the AUC (the area under the ROC curve), and plot the ROC curve. AUC is arguably the most interesting number: it ranges from 0.5 (your model is no better than dumb luck) to 1.0 (perfect prediction).
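For reference, here is a rough sketch of how these two numbers could be computed by hand (assuming scikit-learn is available in this environment; world_cup.validate has its own implementation). 'predicted' is the win probability as a percentage, as in the cells above; the true labels and the baseline are built in the next cell.
In [ ]:
from sklearn.metrics import roc_auc_score

def auc_and_lift(y_true, predicted_pct, baseline, threshold=50.0):
    # AUC only depends on the ranking of the scores, so percentages are fine.
    auc = roc_auc_score(y_true, predicted_pct)
    # Lift: among the games we call as wins, how often did the team actually
    # win, relative to the base rate of wins?
    picked = [p > threshold for p in predicted_pct]
    wins_picked = sum(1.0 for yy, pp in zip(y_true, picked) if yy and pp)
    precision = wins_picked / max(1, sum(picked))
    return auc, precision / baseline

# Example (using the variables defined in the next cell):
# auc, lift = auc_and_lift(y, results['predicted'], baseline)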
In [10]:
import pylab as pl

# Compute a baseline, which is the percentage of overall outcomes that are actually wins.
# (Remember that in soccer we can have draws too.)
baseline = (sum([yval == 3 for yval in club_data['points']])
            * 1.0 / len(club_data))
y = [yval == 3 for yval in test['points']]
world_cup.validate(3, y, results['predicted'], baseline,
                   compute_auc=True)
pl.show()
One thing that is missing, if you're predicting the next game based on the previous few games, is schedule difficulty: some teams may have just played a really tough schedule, while other teams have played against much weaker competition.
We can solve for schedule difficulty by running another regression; this one computes a power ranking, similar to the FIFA/Coca-Cola rankings for international soccer teams (power rankings for other sports, like college (American) football, may be familiar).
Once we compute the power ranking (which creates a stack ranking of all of the teams), we can add that power ranking as a feature to our model, then rebuild it and re-validate it. The regression essentially automates the process of looking at relationships like "Well, team A beat team B and team B beat team C, so A is probably better than C".
The output here will show the power ranking for various teams. This can be useful to spot-check the ranking, since if we rank Wigan at 1.0 and Chelsea at 0.0, something is likely wrong.
Note that because there isn't a strict ordering to the data (if team A beats team B and team B beats team C, sometimes team C will then beat team A), we sometimes fail to assign an ordering to all of the teams (especially where the data is sparse). For teams that we can't rank, we put them in the middle (0.5).
Additionally, because the rankings for international teams are noisy and sparse, we chunk the rankings into quartiles. So teams that have been ranked will show up as 0, .33, .66, or 1.0.
Once we add this to the model, the performance generally improves significantly.
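Here is a minimal standalone sketch of the power-ranking regression idea (made-up teams and results; the real implementation is in power.py): one indicator column per team, +1 for the team and -1 for its opponent, regressed against the result.
In [ ]:
import pandas as pd
import statsmodels.api as sm

# Made-up results: A beat B, B beat C, A beat C, but C beat A (a cycle).
games = pd.DataFrame(
    [('A', 'B', 1.0), ('B', 'C', 1.0), ('A', 'C', 1.0), ('C', 'A', 1.0)],
    columns=['team', 'op_team', 'result'])
teams = sorted(set(games['team']) | set(games['op_team']))

# Design matrix: +1 for the team, -1 for the opponent in each game.
X = pd.DataFrame(0.0, index=games.index, columns=teams)
for ii, row in games.iterrows():
    X.loc[ii, row['team']] = 1.0
    X.loc[ii, row['op_team']] = -1.0

# A simple least-squares version of the ranking regression, rescaled to [0, 1].
ranking = sm.OLS(games['result'], X).fit().params
print (ranking - ranking.min()) / (ranking.max() - ranking.min())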
In [11]:
import math
import power
reload(power)
reload(world_cup)

def points_to_sgn(p):
    if p > 0.1: return 1.0
    elif p < -0.1: return -1.0
    else: return 0.0

power_cols = [
    ('points', points_to_sgn, 'points'),
]

power_data = power.add_power(club_data, game_summaries, power_cols)
power_train = power_data.loc[power_data['points'] != 1]
# power_train = power_data

(power_model, power_test) = world_cup.train_model(
    power_train, match_stats.get_non_feature_columns())
print "\nRsquared: %0.03g, Power Coef %0.03g" % (
    power_model.prsquared,
    math.exp(power_model.params['power_points']))

power_results = world_cup.predict_model(power_model, power_test,
    match_stats.get_non_feature_columns())
power_y = [yval == 3 for yval in power_test['points']]
world_cup.validate(3, power_y, power_results['predicted'], baseline,
                   compute_auc=True, quiet=False)
pl.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

# Add the old model to the graph
world_cup.validate('old', y, results['predicted'], baseline,
                   compute_auc=True, quiet=True)
pl.legend(loc="lower right")
pl.show()
print_params(power_model, 8)
Now that we've got a model that we like, let's look at predicting the World Cup. We can build the same statistics (features) for the World Cup games that we did for the club games. In this case, however, we don't have the targets; that is, we don't know who won. (For some of the previous games we do know who won, but let's predict them all as if we didn't know.)
features.get_wc_features() will build the features from the World Cup games.
In [12]:
import world_cup
import features
reload(match_stats)
reload(features)
reload(world_cup)
wc_data = world_cup.prepare_data(features.get_wc_features(history_size))
wc_labeled = world_cup.prepare_data(features.get_features(history_size))
wc_labeled = wc_labeled[wc_labeled['competitionid'] == 4]
wc_power_train = game_summaries[game_summaries['competitionid'] == 4].copy()
Once we have the model and the features, we can start predicting.
There are a couple of differences between the World Cup and club data. For one, while home team advantage is important in club games, who is really at home? Is it only Brazil? What about other South American teams? Some models give the 'is home' status to only Brazil; others give partial status to other teams from the same continent, since historical data shows that teams from the same continent tend to outperform.
We use a slightly modified model that is, however, somewhat subjective. We assign a value to is_home between 0.0 and 1.0 depending on the fan support (both numbers and enthusiasm) that a team enjoys. This is a result of noticing, in the early rounds, that the teams with the more enthusiastic supporters did better. For example, Chile's fans were deafening in support of their team, but Spain's fans barely showed up (Chile upset Spain 2-0). There were a number of other cases like this; many involved South American sides, but others involved teams that had sent a lot of supporters (Mexico, for example). Some teams, like the USA, had a lot of fans, but they were more reserved... they got a lower score. This factor was set based on first-hand reports from the group games.
In [13]:
import pandas as pd
wc_home = pd.read_csv('wc_home.csv')

def add_home_override(df, home_map):
    for ii in xrange(len(df)):
        team = df.iloc[ii]['teamid']
        if team in home_map:
            df['is_home'].iloc[ii] = home_map[team]
        else:
            # If we don't know, assume not at home.
            df['is_home'].iloc[ii] = 0.0

home_override = {}
for ii in xrange(len(wc_home)):
    row = wc_home.iloc[ii]
    home_override[row['teamid']] = row['is_home']

# Add home team overrides.
add_home_override(wc_data, home_override)
The lattice of teams playing each other in the World Cup is pretty sparse. Many teams haven't played each other for decades. Many European teams rarely play South American ones, and even more rarely play Asian ones. We can use the same technique as we did for the club games, but we have to be prepared for failure.
We'll output the power rankings from the previous games. We should eyeball them to make sure they make sense.
In [14]:
# When training power data, since the games span multiple competitions, just set is_home to 0.5.
# Otherwise, when we looked at games from the 2010 world cup, we'd think Brazil was still at
# home instead of South Africa.
wc_power_train['is_home'] = 0.5
wc_power_data = power.add_power(wc_data, wc_power_train, power_cols)

wc_results = world_cup.predict_model(power_model, wc_power_data,
    match_stats.get_non_feature_columns())
Now's the moment we've been waiting for. Let's predict some World Cup games, starting with the ones that have already happened.
We will output 4 columns:
But wait! These predictions are different from the ones you published!
There are three reasons why the prediction numbers might be different from the numbers you may have seen as published predictions:
In [15]:
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

wc_with_points = wc_power_data.copy()
wc_with_points.index = pd.Index(
    zip(wc_with_points['matchid'], wc_with_points['teamid']))
wc_labeled.index = pd.Index(
    zip(wc_labeled['matchid'], wc_labeled['teamid']))
wc_with_points['points'] = wc_labeled['points']

wc_pred = world_cup.extract_predictions(wc_with_points,
    wc_results['predicted'])

# Reverse our predictions to show the most recent games first.
wc_pred = wc_pred.reindex(index=wc_pred.index[::-1])

# Show our predictions for the games that have already happened.
wc_pred[wc_pred['points'] >= 0.0]
Out[15]:
Let's look at the stats for the teams in the final. We can compare them by eyeball to see which one we think will win:
In [16]:
final = wc_power_data[wc_power_data['matchid'] == '731830']
final
Out[16]:
Now let's look at the recent games that fed into these predictions:
In [17]:
op = game_summaries

def countryStats(d, name):
    # Return the game summaries for a single national team.
    pred = d['team_name'] == name
    return d[pred]

fr = countryStats(op, 'France')
ge = countryStats(op, 'Germany')
ar = countryStats(op, 'Argentina')
br = countryStats(op, 'Brazil')
ne = countryStats(op, 'Netherlands')
ge[:6]
Out[17]:
OK now that we've looked at the data every which way possible, let's predict the final results:
In [18]:
# Games that haven't been played yet have no points value, so this selects the remaining games.
wc_pred[~(wc_pred['points'] >= 0)][[
    'team_name', 'op_team_name', 'predicted']]
Out[18]: