Here I try to predict FIFA World Cup match results, based on previous matches from the cups held since 1950.
I'll use an MLP neural network classifier. My inputs will be the past matches (replacing each team name with a set of stats for both teams), and my output will be a number indicating the result (0 = tie, 1 = team1 wins, 2 = team2 wins).
I'll be using pybrain for the classifier, pandas to hack my way through the data, and pygal for the graphs (far easier than matplotlib). I'll also use a lot of extra useful things implemented in the utils.py file, mostly to abstract the data processing I need before I feed the classifier.
In [1]:
from random import random
from IPython.display import SVG
import pygal
from pybrain.structure import SigmoidLayer
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError
from utils import get_matches, get_team_stats, extract_samples, normalize, split_samples, graph_teams_stat_bars, graph_matches_results_scatter
In [2]:
# the features I will feed to the classifier as input data.
input_features = ['year',
                  'matches_won_percent',
                  'podium_score_yearly',
                  'matches_won_percent_2',
                  'podium_score_yearly_2',]
# the feature giving the result the classifier must learn to predict (I recommend always using 'winner')
output_feature = 'winner'
# used to avoid including tied matches in the learning process. I found this greatly improves the classifier accuracy.
# I know there will be some ties, but I'm willing to fail on those and have better accuracy with all the rest.
# at this point, this code will break if you set it to False, because the network uses a sigmoid function with a
# threshold for output, so it is able to distinguish only 2 kinds of results.
exclude_ties = True
# used to duplicate matches data, reversing the teams (team1->team2, and vice versa).
# This helps with visualizations, and also improves the precision of the predictions by avoiding a dependence on the
# order of the teams in the input.
duplicate_with_reversed = True
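The reversed duplication itself happens inside get_matches in utils.py, which isn't shown here. A minimal sketch of the idea, assuming each match is a dict with the hypothetical keys team1, team2 and winner (these names are mine, not necessarily the ones utils.py uses):

```python
def duplicate_reversed(matches):
    """Return the matches plus a mirrored copy of each one (team1 <-> team2)."""
    # Hypothetical record layout: {'team1': ..., 'team2': ..., 'winner': 0|1|2}
    swapped_winner = {0: 0, 1: 2, 2: 1}  # a tie stays a tie
    mirrored = [
        {'team1': m['team2'],
         'team2': m['team1'],
         'winner': swapped_winner[m['winner']]}
        for m in matches
    ]
    return matches + mirrored


# Brazil beat Mexico in 1950, so the mirrored record must say team2 won.
sample = [{'team1': 'Brazil', 'team2': 'Mexico', 'winner': 1}]
doubled = duplicate_reversed(sample)
```

After mirroring, every pairing appears in both orders, so the classifier can't learn anything from which team happens to be listed first.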
In [3]:
def show(graph):
    '''Small utility to display pygal graphs'''
    return SVG(graph.render())
First we need the team stats. We can't feed the classifier inputs like ('Argentina', 'Brazil'), we need to give it numbers. And not just any numbers, not just ids, but numbers that could be somewhat related to the result of the matches.
For example: the percentage of matches won by each team is something that could have an impact on the result, so that stat is a very good candidate.
We just calculate lots of stats per team, and later we will decide which ones to use.
In [4]:
team_stats = get_team_stats()
team_stats
Out[4]:
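get_team_stats is implemented in utils.py and isn't shown here. As an illustration of the kind of stat it produces, this is a rough sketch of a won-matches percentage, assuming matches come as (team1, team2, winner) tuples. That layout and the placeholder team names are my assumptions, not the real data.

```python
from collections import defaultdict


def matches_won_percent(matches):
    """Percentage of played matches that each team won."""
    played = defaultdict(int)
    won = defaultdict(int)
    for team1, team2, winner in matches:  # winner: 0 = tie, 1 = team1, 2 = team2
        played[team1] += 1
        played[team2] += 1
        if winner == 1:
            won[team1] += 1
        elif winner == 2:
            won[team2] += 1
    return {team: 100.0 * won[team] / played[team] for team in played}


# Placeholder teams and results, only to show the shape of the output.
stats = matches_won_percent([
    ('A', 'B', 1),  # A beats B
    ('A', 'C', 0),  # A and C tie
])
```

Here team A played two matches and won one, so it ends up with 50%, while B and C both stay at 0%.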
Let's visualize some of those stats, just because it helps paint a bigger picture of how good the teams are.
(you can hover with your mouse over the '...' on the x axis to see the team name)
In [5]:
show(graph_teams_stat_bars(team_stats, 'matches_won_percent'))
Out[5]:
Podium score is an invented measure of how good the teams are, based on the top 4 teams of each cup. The first team receives 8 points, the second 4, the third 2, and the fourth 1. All the rest receive 0 points. As you can see, the scoring is exponential, because each position implies an exponentially bigger number of matches won than the next one.
In [6]:
show(graph_teams_stat_bars(team_stats, 'podium_score_yearly'))
Out[6]:
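The podium score described above can be sketched in a few lines. This version only sums the raw points per team; how utils.py turns them into the yearly figure isn't shown here, so treat that part as unknown. The input layout (one list of four teams per cup) is an assumption of mine.

```python
PODIUM_POINTS = [8, 4, 2, 1]  # 1st to 4th place, as described in the text


def podium_scores(cups):
    """Sum podium points per team over a list of cups.

    Each cup is a list with its top four teams, ordered from first to fourth.
    """
    scores = {}
    for top_four in cups:
        for points, team in zip(PODIUM_POINTS, top_four):
            scores[team] = scores.get(team, 0) + points
    return scores


# Placeholder team names, not real cup results.
scores = podium_scores([
    ['A', 'B', 'C', 'D'],
    ['B', 'A', 'E', 'C'],
])
```

Teams that never reach the top four simply don't appear in the result, which is equivalent to the 0 points the text gives them.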
In [7]:
matches = get_matches(with_team_stats=True,
                      duplicate_with_reversed=duplicate_with_reversed,
                      exclude_ties=exclude_ties)
matches
Out[7]:
Can the results be classified? Can we see a pattern, some kind of grouping of results based on the stats of both teams?
Let's try visualizing two of the most interesting ones: matches won percent, and podium score yearly (mean).
In [8]:
show(graph_matches_results_scatter(matches, 'matches_won_percent', 'matches_won_percent_2'))
Out[8]:
In [9]:
show(graph_matches_results_scatter(matches, 'podium_score_yearly', 'podium_score_yearly_2'))
Out[9]:
Before drawing any conclusions: there is more there than what you can see with your eyes. At some locations there could be more than 1 point, and you only see the one on top.
The first graph tells us something that most people already expect: there is a small tendency in the results, the team with the better matches won percent tends to win. The second graph also shows a similar relation between podium score yearly and the result, even if it's not visible to the eye because of the overlapping dots.
But remember, the classifier can learn a lot more than just those simple relations based on the info we give to it. These graphs were just a screening to confirm some basic intuitions.
Ok, now we have everything we need. Let's feed the selected input features to the neural network classifier, and let it learn.
We have to normalize the data, otherwise the features with larger values (like the year) will carry a greater weight in the prediction.
Also, we use a percentage of the samples to train, but keep the rest "hidden": we don't let the classifier see them while learning. After the training we use those hidden samples to "test" the ability of the classifier to predict data it has never seen before (and for which we already know the correct answer).
In [10]:
inputs, outputs = extract_samples(matches,
                                  input_features,
                                  output_feature)
normalizer, inputs = normalize(inputs)
train_inputs, train_outputs, test_inputs, test_outputs = split_samples(inputs, outputs)
n = buildNetwork(len(input_features),
                 10 * len(input_features),
                 10 * len(input_features),
                 1,
                 outclass=SigmoidLayer,
                 bias=True)
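normalize and split_samples come from utils.py, which isn't shown here, so their exact behavior is unknown. The sketch below shows one common way to do both steps: min-max scaling per column and a shuffled hold-out split. The function names, the 75% train fraction and the sample values are all assumptions for illustration.

```python
import random


def min_max_scale(rows):
    """Scale every column of `rows` to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [float(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def hold_out_split(inputs, outputs, train_fraction=0.75, seed=0):
    """Shuffle the samples and split them into train and test parts."""
    indexes = list(range(len(inputs)))
    random.Random(seed).shuffle(indexes)
    cut = int(len(indexes) * train_fraction)
    train, test = indexes[:cut], indexes[cut:]
    return ([inputs[i] for i in train], [outputs[i] for i in train],
            [inputs[i] for i in test], [outputs[i] for i in test])


# Made-up (year, matches_won_percent) rows: before scaling, the year column
# is dozens of times larger than the percentage column.
rows = [[1950, 40.0], [1990, 60.0], [2014, 50.0]]
scaled = min_max_scale(rows)
```

After scaling, both columns live in the same range, so neither dominates just because of its units.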
To be able to evaluate the results and show progress during the learning cycle, we need these two functions, which help us calculate how well the network can predict the results of the matches used to learn, and of the matches it doesn't know.
In [11]:
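def neural_result(sample):
    """Call the neural network, and translate its output to a match result."""
    # the network has a single sigmoid output, a value between 0 and 1
    n_output = n.activate(sample)[0]
    if n_output >= 0.5:
        return 2
    else:
        return 1
def test_network():
    """Print the accuracy (%) on the train and test sets."""
    # lists instead of map(), so this also works where map() returns an iterator
    train_predictions = [neural_result(sample) for sample in train_inputs]
    test_predictions = [neural_result(sample) for sample in test_inputs]
    print((100 - percentError(train_predictions, train_outputs),
           100 - percentError(test_predictions, test_outputs)))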
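percentError reports the percentage of mismatches, so subtracting it from 100 gives the accuracy. A plain-Python sketch of the same two ideas, thresholding a sigmoid output and scoring the predictions, without pybrain (the function names and the sample outputs are mine, for illustration only):

```python
def to_result(network_output, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a match result (1 or 2)."""
    return 2 if network_output >= threshold else 1


def accuracy_percent(predicted, expected):
    """Percentage of predictions that match the expected results."""
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * hits / len(expected)


# Made-up network outputs and expected winners.
predicted = [to_result(output) for output in [0.1, 0.7, 0.5, 0.3]]
score = accuracy_percent(predicted, [1, 2, 1, 1])
```

An output of exactly 0.5 falls on the team2 side, the same tie-breaking choice that neural_result makes.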
Create a train set (a kind of dataset that pybrain uses to train neural networks), and display initial accuracy on both sets (train and test).
In [12]:
train_set = ClassificationDataSet(len(input_features))
for sample, result in zip(train_inputs, train_outputs):
    # results are 1 or 2, the sigmoid output needs targets of 0 or 1
    train_set.addSample(sample, [result - 1])
trainer = BackpropTrainer(n, dataset=train_set, momentum=0.5, weightdecay=0.0)
train_set.assignClasses()
test_network()
Train the network for a given number of iterations. You can re-run this step many times, and it will keep learning, but as you know, if you train too much you can end up overfitting the training data (this is visible when the test set accuracy starts to decrease).
In [13]:
for i in range(20):
    trainer.train()
    test_network()
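Watching for the point where test accuracy starts to drop can be automated. This is a small sketch of a patience-based stopping rule applied to a list of per-iteration test accuracies. It is not something the notebook does, and the accuracy history is made up.

```python
def best_epoch(test_accuracies, patience=3):
    """Return the index of the best epoch seen before stopping.

    Scanning stops once the accuracy has failed to improve for `patience`
    consecutive epochs.
    """
    best_index, best_score = 0, float('-inf')
    for index, score in enumerate(test_accuracies):
        if score > best_score:
            best_index, best_score = index, score
        elif index - best_index >= patience:
            break
    return best_index


# Made-up history: accuracy peaks at epoch 2 and then degrades.
history = [60.0, 68.0, 72.0, 71.0, 70.5, 69.0, 73.0]
stop_at = best_epoch(history)
```

The late 73.0 is never reached because the rule gives up after three epochs without improvement. Strictly speaking, picking the stopping point with the test set leaks information into the model choice, so a separate validation split would be cleaner.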
The closer this score is to 100%, the better the classifier's predictions are. A score of 100 means the classifier always predicts the exact real result, something impossible.
Something around 75% sounds impressive, but it's less impressive than it looks: consider that just flipping a coin will get you 50%. So this sits in the middle between flipping a coin and having a time machine.
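The 50% coin baseline is exact here, not just approximate: with ties excluded and every match duplicated with the teams reversed, each win for team1 has a mirrored win for team2. A quick check of that class balance, using made-up winners:

```python
from collections import Counter


def class_shares(winners):
    """Percentage of samples belonging to each class."""
    counts = Counter(winners)
    total = float(len(winners))
    return {label: 100.0 * count / total for label, count in counts.items()}


# Made-up winners, skewed towards team1 before mirroring.
original = [1, 1, 1, 2]
mirrored = [2 if winner == 1 else 1 for winner in original]
shares = class_shares(original + mirrored)
```

Because both classes end up with half of the samples, always guessing the same team scores 50% too, so anything above that is real signal.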
With the classifier already trained, we can start making predictions. But we need a little function able to translate inputs like this: (2014, 'Argentina', 'Brazil'), to the numeric inputs the classifier expects (based on the input features).
This function does the conversion, also normalizes the data with the same normalizer used before, and then just asks the classifier for the prediction.
In [14]:
def predict(year, team1, team2):
    """Predict the winner of a match between team1 and team2 in a given year."""
    inputs = []
    for feature in input_features:
        # features ending in '_2' are the stats of the second team
        from_team_2 = '_2' in feature
        feature = feature.replace('_2', '')
        if feature in team_stats.columns.values:
            team = team2 if from_team_2 else team1
            value = team_stats.loc[team, feature]
        elif feature == 'year':
            value = year
        else:
            raise ValueError("Don't know where to get feature: " + feature)
        inputs.append(value)
    inputs = normalizer.transform(inputs)
    result = neural_result(inputs)
    if result == 0:
        return 'tie'
    elif result == 1:
        return team1
    elif result == 2:
        return team2
    else:
        return 'Unknown result: ' + str(result)
In [15]:
predict(1950, 'Mexico', 'Brazil') # real result: 4-0 wins Brazil
Out[15]:
In [16]:
predict(1990, 'United Arab Emirates', 'Colombia') # real result: 2-0 wins Colombia
Out[16]:
In [17]:
predict(2002, 'South Africa', 'Spain') # real result: 2-3 wins Spain
Out[17]:
In [18]:
predict(2010, 'Japan', 'Cameroon') # real result: 1-0 wins Japan
Out[18]:
In [19]:
predict(2014, 'Argentina', 'Brazil')
Out[19]:
In [20]:
predict(2014, 'Spain', 'Haiti')
Out[20]:
In [21]:
predict(2014, 'Russia', 'Germany')
Out[21]:
In [22]:
predict(2014, 'Russia', 'Russia')
Out[22]: