World Cup Learning

Here I try to predict the results of FIFA World Cup matches, based on the matches played in previous cups since 1950.

I'll use an MLP (multilayer perceptron) neural network classifier. My inputs will be the past matches (replacing each team name with a set of stats for both teams), and my output will be a number indicating the result (0 = tie, 1 = team1 wins, 2 = team2 wins).

I'll be using pybrain for the classifier, pandas to hack my way through the data, and pygal for the graphs (far easier than matplotlib). A lot of extra useful things are implemented in the utils.py file, mostly to abstract the data processing I need before I feed the classifier.



In [1]:

    
from random import random

from IPython.display import SVG
import pygal

from pybrain.structure import SigmoidLayer
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError

from utils import (get_matches, get_team_stats, extract_samples, normalize,
                   split_samples, graph_teams_stat_bars,
                   graph_matches_results_scatter)

Configs



In [2]:

    
# the features I will feed to the classifier as input data.
input_features = ['year',
                  'matches_won_percent',
                  'podium_score_yearly',
                  'matches_won_percent_2',
                  'podium_score_yearly_2',]

# the feature giving the result the classifier must learn to predict (I recommend always using 'winner')
output_feature = 'winner'

# used to avoid including tied matches in the learning process. I found this greatly improves the classifier accuracy.
# I know there will be some ties, but I'm willing to fail on those and have better accuracy with all the rest.
# at this point, this code will break if you set it to False, because the network uses a sigmoid function with a 
# threshold for output, so it is able to distinguish only 2 kinds of results.
exclude_ties = True

# used to duplicate matches data, reversing the teams (team1->team2, and vice versa).
# This helps with visualizations, and also improves the precision of the predictions by avoiding a dependence on the
# order of the teams in the input.
duplicate_with_reversed = True
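
To make the reversed duplication concrete, here is a minimal sketch of what it means for a single match. The dict layout and the helper names (reverse_match, duplicate_with_reversed_matches) are made up for this illustration; the real work happens inside get_matches in utils.py, which may well do it differently.

```python
def reverse_match(match):
    """Return a copy of a match seen from the other team's side.

    `match` is a plain dict using the column names shown in this notebook.
    The dict layout is an assumption of this sketch, not the utils.py one.
    """
    # winner encoding from the introduction: 0 = tie, 1 = team1, 2 = team2
    swapped_winner = {0: 0, 1: 2, 2: 1}
    return {
        'year': match['year'],
        'team1': match['team2'],
        'team2': match['team1'],
        'score1': match['score2'],
        'score2': match['score1'],
        'winner': swapped_winner[match['winner']],
    }


def duplicate_with_reversed_matches(matches):
    """Return the original matches followed by their reversed copies."""
    return list(matches) + [reverse_match(match) for match in matches]


# Brazil 4-0 Mexico (1950), the first row of the matches table
example = {'year': 1950, 'team1': 'Brazil', 'team2': 'Mexico',
           'score1': 4, 'score2': 0, 'winner': 1}
both = duplicate_with_reversed_matches([example])
```

After this, the classifier sees the same match twice: once as a win for team1 (Brazil) and once as a win for team2 (Brazil again, now in the second slot), so it can't learn that "being listed first" matters.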



In [3]:

    
def show(graph):
    '''Small utility to display pygal graphs'''
    return SVG(graph.render())

Team stats

First we need the team stats. We can't feed the classifier inputs like ('Argentina', 'Brazil'); we need to give it numbers. And not just any numbers, like ids, but numbers that could be somewhat related to the result of the matches.

For example: the percentage of matches won by each team is something that could have an impact on the result, so that stat is a very good candidate.

We just calculate a lot of stats per team, and later we will decide which ones to use.
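
As an illustration, the three derived stats seem to follow directly from the raw counters. This is only a sketch under that assumption: the derived_stats helper is hypothetical, and get_team_stats may compute them in another way. The numbers used are Brazil's raw counters from the team_stats table in Out[4].

```python
def derived_stats(matches_played, matches_won, years_played,
                  podium_score, cups_won):
    """Compute the derived columns from the raw counters (assumed formulas)."""
    return {
        # percentage of the played matches that were won
        'matches_won_percent': 100.0 * matches_won / matches_played,
        # podium points and cups averaged over the cups the team played
        'podium_score_yearly': float(podium_score) / years_played,
        'cups_won_yearly': float(cups_won) / years_played,
    }


# Brazil's raw counters, as reported in the team_stats table
brazil = derived_stats(matches_played=89, matches_won=63, years_played=16,
                       podium_score=102, cups_won=5)
```

Rounded to six decimals, these match the values in Brazil's row of the table, which is why I believe the formulas are the right ones.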



In [4]:

    
team_stats = get_team_stats()
team_stats









    Out[4]:

team                   matches_played  matches_won  years_played  podium_score  cups_won  matches_won_percent  podium_score_yearly  cups_won_yearly
Brazil                             89           63            16           102         5            70.786517             6.375000         0.312500
Canada                              3            0             1             0         0             0.000000             0.000000         0.000000
Serbia and Montenegro               3            0             1             0         0             0.000000             0.000000         0.000000
Kuwait                              3            0             1             0         0             0.000000             0.000000         0.000000
Scotland                           23            4             8             0         0            17.391304             0.000000         0.000000
Costa Rica                         10            3             3             0         0            30.000000             0.000000         0.000000
Ivory Coast                         6            2             2             0         0            33.333333             0.000000         0.000000
Wales                               5            1             1             0         0            20.000000             0.000000         0.000000
Argentina                          64           33            13            40         2            51.562500             3.076923         0.153846
Bolivia                             4            0             2             0         0             0.000000             0.000000         0.000000
Cameroon                           20            4             6             0         0            20.000000             0.000000         0.000000
Ecuador                             7            3             2             0         0            42.857143             0.000000         0.000000
Ghana                               9            4             2             0         0            44.444444             0.000000         0.000000
Saudi Arabia                       13            2             4             0         0            15.384615             0.000000         0.000000
Australia                          10            2             3             0         0            20.000000             0.000000         0.000000
Iran                                9            1             3             0         0            11.111111             0.000000         0.000000
Algeria                             9            2             3             0         0            22.222222             0.000000         0.000000
El Salvador                         6            0             2             0         0             0.000000             0.000000         0.000000
Republic of Ireland                13            2             3             0         0            15.384615             0.000000         0.000000
Slovenia                            6            1             2             0         0            16.666667             0.000000         0.000000
Chile                              26            7             7             4         0            26.923077             0.571429         0.000000
Belgium                            32           10             8             2         0            31.250000             0.250000         0.000000
Haiti                               3            0             1             0         0             0.000000             0.000000         0.000000
Iraq                                3            0             1             0         0             0.000000             0.000000         0.000000
Spain                              53           27            12            18         1            50.943396             1.500000         0.083333
China PR                            3            0             1             0         0             0.000000             0.000000         0.000000
Netherlands                        41           22             7            26         0            53.658537             3.714286         0.000000
Denmark                            16            8             4             0         0            50.000000             0.000000         0.000000
Poland                             30           15             6             8         0            50.000000             1.333333         0.000000
Morocco                            13            2             4             0         0            15.384615             0.000000         0.000000
Croatia                            13            6             3             4         0            46.153846             1.333333         0.000000
Switzerland                        24            7             7             0         0            29.166667             0.000000         0.000000
Honduras                            6            0             2             0         0             0.000000             0.000000         0.000000
New Zealand                         6            0             2             0         0             0.000000             0.000000         0.000000
Jamaica                             3            1             1             0         0            33.333333             0.000000         0.000000
England                            59           26            13            18         1            44.067797             1.384615         0.076923
Uruguay                            43           14            10            22         1            32.558140             2.200000         0.100000
United Arab Emirates                3            0             1             0         0             0.000000             0.000000         0.000000
South Africa                        9            2             3             0         0            22.222222             0.000000         0.000000
Egypt                               3            0             1             0         0             0.000000             0.000000         0.000000
Colombia                           13            3             4             0         0            23.076923             0.000000         0.000000
South Korea                        28            5             8             2         0            17.857143             0.250000         0.000000
Turkey                             10            5             2             4         0            50.000000             2.000000         0.000000
Italy                              71           36            15            54         2            50.704225             3.600000         0.133333
Czech Republic                      3            1             1             0         0            33.333333             0.000000         0.000000
France                             48           23            10            34         1            47.916667             3.400000         0.100000
Slovakia                            4            1             1             0         0            25.000000             0.000000         0.000000
Peru                               13            4             3             0         0            30.769231             0.000000         0.000000
Norway                              7            2             2             0         0            28.571429             0.000000         0.000000
Nigeria                            14            4             4             0         0            28.571429             0.000000         0.000000
Israel                              3            0             1             0         0             0.000000             0.000000         0.000000
Zaire                               3            0             1             0         0             0.000000             0.000000         0.000000
Czechoslovakia                     23            7             6             8         0            30.434783             1.333333         0.000000
Austria                            25           10             6             4         0            40.000000             0.666667         0.000000
Togo                                3            0             1             0         0             0.000000             0.000000         0.000000
Germany                            98           59            15            94         3            60.204082             6.266667         0.200000
Ukraine                             5            2             1             0         0            40.000000             0.000000         0.000000
Northern Ireland                   13            3             3             0         0            23.076923             0.000000         0.000000
United States                      25            5             7             0         0            20.000000             0.000000         0.000000
Trinidad and Tobago                 3            0             1             0         0             0.000000             0.000000         0.000000
                                  ...          ...           ...           ...       ...                  ...                  ...              ...

76 rows × 8 columns

Let's visualize some of those stats, just because it helps paint a bigger picture of how good the teams are.

(you can hover with your mouse over the '...' on the x axis to see the team name)



In [5]:

    
show(graph_teams_stat_bars(team_stats, 'matches_won_percent'))









    Out[5]:

Podium score is an invented measure of how good the teams are, based on the top 4 teams of each cup. The first team receives 8 points, the second 4, the third 2, and the fourth 1. All the rest receive 0 points. As you can see, the scoring is exponential (each position is worth twice the next one), because each position implies a much bigger number of matches won than the next one.
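
The scoring rule above fits in a couple of lines. This is just a sketch of the rule as described: the podium_score function and its list-of-positions input are hypothetical, not the code in utils.py, and the positions in the example are made up.

```python
# points per final position, as described above; any other position scores 0
PODIUM_POINTS = {1: 8, 2: 4, 3: 2, 4: 1}


def podium_score(final_positions):
    """Sum the podium points over a team's final positions, one per cup."""
    return sum(PODIUM_POINTS.get(position, 0) for position in final_positions)


# an imaginary team: champion once, third once, and 9th in another cup
imaginary_team_score = podium_score([1, 3, 9])
```

Dividing that total by the number of cups the team played gives the yearly version of the score used in the graph that follows.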



In [6]:

    
show(graph_teams_stat_bars(team_stats, 'podium_score_yearly'))









    Out[6]:

Matches

Now we need to get the matches data, including the "reversed" duplication of matches, and adding the team stats to each match.



In [7]:

    
matches = get_matches(with_team_stats=True,
                      duplicate_with_reversed=duplicate_with_reversed,
                      exclude_ties=exclude_ties)
        
matches









    Out[7]:

     score1  score2             team1             team2  year  score_diff  winner  matches_played  matches_won  years_played  podium_score  cups_won  matches_won_percent  podium_score_yearly  cups_won_yearly  matches_played_2  matches_won_2  years_played_2  podium_score_2  cups_won_2  ...
0         4       0            Brazil            Mexico  1950           4       1              89           63            16           102         5            70.786517             6.375000         0.312500                46             12              13               0           0  ...
1         3       0        Yugoslavia       Switzerland  1950           3       1              34           14             8             2         0            41.176471             0.250000         0.000000                24              7               7               0           0  ...
3         4       1        Yugoslavia            Mexico  1950           3       1              34           14             8             2         0            41.176471             0.250000         0.000000                46             12              13               0           0  ...
4         2       0            Brazil        Yugoslavia  1950           2       1              89           63            16           102         5            70.786517             6.375000         0.312500                34             14               8               2           0  ...
5         2       1       Switzerland            Mexico  1950           1       1              24            7             7             0         0            29.166667             0.000000         0.000000                46             12              13               0           0  ...
6         2       0           England             Chile  1950           2       1              59           26            13            18         1            44.067797             1.384615         0.076923                26              7               7               4           0  ...
7         3       1             Spain     United States  1950           2       1              53           27            12            18         1            50.943396             1.500000         0.083333                25              5               7               0           0  ...
8         2       0             Spain             Chile  1950           2       1              53           27            12            18         1            50.943396             1.500000         0.083333                26              7               7               4           0  ...
9         1       0     United States           England  1950           1       1              25            5             7             0         0            20.000000             0.000000         0.000000                59             26              13              18           1  ...
10        1       0             Spain           England  1950           1       1              53           27            12            18         1            50.943396             1.500000         0.083333                59             26              13              18           1  ...
11        5       2             Chile     United States  1950           3       1              26            7             7             4         0            26.923077             0.571429         0.000000                25              5               7               0           0  ...
12        3       2            Sweden             Italy  1950           1       1              41           14             9            16         0            34.146341             1.777778         0.000000                71             36              15              54           2  ...
14        2       0             Italy          Paraguay  1950           2       1              71           36            15            54         2            50.704225             3.600000         0.133333                25              6               7               0           0  ...
15        8       0           Uruguay           Bolivia  1950           8       1              43           14            10            22         1            32.558140             2.200000         0.100000                 4              0               2               0           0  ...
17        7       1            Brazil            Sweden  1950           6       1              89           63            16           102         5            70.786517             6.375000         0.312500                41             14               9              16           0  ...
18        6       1            Brazil             Spain  1950           5       1              89           63            16           102         5            70.786517             6.375000         0.312500                53             27              12              18           1  ...
19        3       2           Uruguay            Sweden  1950           1       1              43           14            10            22         1            32.558140             2.200000         0.100000                41             14               9              16           0  ...
20        3       1            Sweden             Spain  1950           2       1              41           14             9            16         0            34.146341             1.777778         0.000000                53             27              12              18           1  ...
21        2       1           Uruguay            Brazil  1950           1       1              43           14            10            22         1            32.558140             2.200000         0.100000                89             63              16             102           5  ...
22        5       0            Brazil            Mexico  1954           5       1              89           63            16           102         5            70.786517             6.375000         0.312500                46             12              13               0           0  ...
23        1       0        Yugoslavia            France  1954           1       1              34           14             8             2         0            41.176471             0.250000         0.000000                48             23              10              34           1  ...
25        3       2            France            Mexico  1954           1       1              48           23            10            34         1            47.916667             3.400000         0.100000                46             12              13               0           0  ...
26        4       1           Germany            Turkey  1954           3       1              98           59            15            94         3            60.204082             6.266667         0.200000                10              5               2               4           0  ...
27        9       0           Hungary       South Korea  1954           9       1              26           11             7             8         0            42.307692             1.142857         0.000000                28              5               8               2           0  ...
28        8       3           Hungary           Germany  1954           5       1              26           11             7             8         0            42.307692             1.142857         0.000000                98             59              15              94           3  ...
29        7       0            Turkey       South Korea  1954           7       1              10            5             2             4         0            50.000000             2.000000         0.000000                28              5               8               2           0  ...
30        7       2           Germany            Turkey  1954           5       1              98           59            15            94         3            60.204082             6.266667         0.200000                10              5               2               4           0  ...
31        2       0           Uruguay    Czechoslovakia  1954           2       1              43           14            10            22         1            32.558140             2.200000         0.100000                23              7               6               8           0  ...
32        1       0           Austria          Scotland  1954           1       1              25           10             6             4         0            40.000000             0.666667         0.000000                23              4               8               0           0  ...
33        7       0           Uruguay          Scotland  1954           7       1              43           14            10            22         1            32.558140             2.200000         0.100000                23              4               8               0           0  ...
34        5       0           Austria    Czechoslovakia  1954           5       1              25           10             6             4         0            40.000000             0.666667         0.000000                23              7               6               8           0  ...
35        2       1       Switzerland             Italy  1954           1       1              24            7             7             0         0            29.166667             0.000000         0.000000                71             36              15              54           2  ...
37        4       1             Italy           Belgium  1954           3       1              71           36            15            54         2            50.704225             3.600000         0.133333                32             10               8               2           0  ...
38        2       0           England       Switzerland  1954           2       1              59           26            13            18         1            44.067797             1.384615         0.076923                24              7               7               0           0  ...
39        4       1       Switzerland             Italy  1954           3       1              24            7             7             0         0            29.166667             0.000000         0.000000                71             36              15              54           2  ...
40        7       5           Austria       Switzerland  1954           2       1              25           10             6             4         0            40.000000             0.666667         0.000000                24              7               7               0           0  ...
41        4       2           Uruguay           England  1954           2       1              43           14            10            22         1            32.558140             2.200000         0.100000                59             26              13              18           1  ...
42        2       4            Brazil           Hungary  1954          -2       2              89           63            16           102         5            70.786517             6.375000         0.312500                26             11               7               8           0  ...
43        0       2        Yugoslavia           Germany  1954          -2       2              34           14             8             2         0            41.176471             0.250000         0.000000                98             59              15              94           3  ...
44        4       2           Hungary           Uruguay  1954           2       1              26           11             7             8         0            42.307692             1.142857         0.000000                43             14              10              22           1  ...
45        6       1           Germany           Austria  1954           5       1              98           59            15            94         3            60.204082             6.266667         0.200000                25             10               6               4           0  ...
46        1       3           Uruguay           Austria  1954          -2       2              43           14            10            22         1            32.558140             2.200000         0.100000                25             10               6               4           0  ...
47        2       3           Hungary           Germany  1954          -1       2              26           11             7             8         0            42.307692             1.142857         0.000000                98             59              15              94           3  ...
48        3       1           Germany         Argentina  1958           2       1              98           59            15            94         3            60.204082             6.266667         0.200000                64             33              13              40           2  ...
49        1       0  Northern Ireland    Czechoslovakia  1958           1       1              13            3             3             0         0            23.076923             0.000000         0.000000                23              7               6               8           0  ...
50        3       1         Argentina  Northern Ireland  1958           2       1              64           33            13            40         2            51.562500             3.076923         0.153846                13              3               3               0           0  ...
53        6       1    Czechoslovakia         Argentina  1958           5       1              23            7             6             8         0            30.434783             1.333333         0.000000                64             33              13              40           2  ...
54        2       1  Northern Ireland    Czechoslovakia  1958           1       1              13            3             3             0         0            23.076923             0.000000         0.000000                23              7               6               8           0  ...
55        7       3            France          Paraguay  1958           4       1              48           23            10            34         1            47.916667             3.400000         0.100000                25              6               7               0           0  ...
57        3       2        Yugoslavia            France  1958           1       1              34           14             8             2         0            41.176471             0.250000         0.000000                48             23              10              34           1  ...
58        3       2          Paraguay          Scotland  1958           1       1              25            6             7             0         0            24.000000             0.000000         0.000000                23              4               8               0           0  ...
59        2       1            France          Scotland  1958           1       1              48           23            10            34         1            47.916667             3.400000         0.100000                23              4               8               0           0  ...
61        3       0            Sweden            Mexico  1958           3       1              41           14             9            16         0            34.146341             1.777778         0.000000                46             12              13               0           0  ...
64        2       1            Sweden           Hungary  1958           1       1              41           14             9            16         0            34.146341             1.777778         0.000000                26             11               7               8           0  ...
66        4       0           Hungary            Mexico  1958           4       1              26           11             7             8         0            42.307692             1.142857         0.000000                46             12              13               0           0  ...
67        2       1             Wales           Hungary  1958           1       1               5            1             1             0         0            20.000000             0.000000         0.000000                26             11               7               8           0  ...
68        3       0            Brazil           Austria  1958           3       1              89           63            16           102         5            70.786517             6.375000         0.312500                25             10               6               4           0  ...
71        2       0            Russia           Austria  1958           2       1              37           17             9             2         0            45.945946             0.222222         0.000000                25             10               6               4           0  ...
73        2       0            Brazil            Russia  1958           2       1              89           63            16           102         5            70.786517             6.375000         0.312500                37             17               9               2           0  ...
74        1       0            Russia           England  1958           1       1              37           17             9             2         0            45.945946             0.222222         0.000000                59             26              13              18           1  ...
        ...     ...               ...               ...   ...         ...     ...             ...          ...           ...           ...       ...                  ...                  ...              ...               ...            ...             ...             ...         ...

1100 rows × 23 columns

Can the results be classified? Can we see a pattern, some kind of grouping of results based on the stats of both teams?

Let's try visualizing two of the most interesting ones: matches won percent, and podium score yearly (mean).



In [8]:

    
show(graph_matches_results_scatter(matches, 'matches_won_percent', 'matches_won_percent_2'))









    Out[8]:



In [9]:

    
show(graph_matches_results_scatter(matches, 'podium_score_yearly', 'podium_score_yearly_2'))









    Out[9]:

Before any conclusions: there is more there than what you can see with your eyes. At some locations there could be more than one point, and you only see the one on top.

The first graph tells us something that most people already expect: there is a slight tendency in the results, where the team with the better percentage of matches won tends to win. The second graph shows a similar relation between the yearly podium score and the result, even if it's less visible to the eye because of the overlapping dots.

But remember, the classifier can learn a lot more than just those simple relations based on the info we give it. These graphs were just a quick check to confirm some basic intuitions.

Learn

Ok, now we have everything we need. Let's feed the selected input features to the neural network classifier, and let it learn.

We have to normalize the data, otherwise the features with larger values (like the year, compared to a yearly podium score) would have a disproportionate weight on the prediction.
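
To show what normalizing means here, this is a minimal sketch that standardizes each column to zero mean and unit variance. It's an assumption on my part that something like this is what normalize does: the real function lives in utils.py and may use a different scaling. The helper names (fit_standardizer, standardize) are hypothetical.

```python
def fit_standardizer(rows):
    """Return a (mean, std) pair per column for a list of equal-length rows."""
    stats = []
    for column in zip(*rows):
        mean = sum(column) / float(len(column))
        variance = sum((value - mean) ** 2 for value in column) / float(len(column))
        std = variance ** 0.5 or 1.0  # constant columns: avoid dividing by zero
        stats.append((mean, std))
    return stats


def standardize(rows, stats):
    """Rescale every value using the per-column (mean, std) pairs."""
    return [[(value - mean) / std for value, (mean, std) in zip(row, stats)]
            for row in rows]


# (year, matches_won_percent) pairs taken from three rows of the matches table
raw = [[1950, 70.786517], [1954, 41.176471], [1958, 60.204082]]
stats = fit_standardizer(raw)
scaled = standardize(raw, stats)
```

Keeping the fitted stats around (like the normalizer returned by normalize) matters, because any new input we want to predict later has to be rescaled with the same means and deviations used during training.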

Also, we use a percentage of the samples to train, but keep the rest "hidden": we don't let the classifier see them while learning. After the training we use those hidden samples to "test" the ability of the classifier to predict data it has never seen before (and for which we already know the correct answer).
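
A random split like the one described could look like the sketch below. The split_train_test name, the 75% default and the fixed seed are all assumptions made for the example; split_samples in utils.py may pick the samples differently. Only the order of the returned values mirrors the call used in this notebook.

```python
import random


def split_train_test(inputs, outputs, train_fraction=0.75, seed=0):
    """Shuffle the samples and split them into train and test parts.

    Returns train_inputs, train_outputs, test_inputs, test_outputs.
    """
    indices = list(range(len(inputs)))
    # a seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_indices, test_indices = indices[:cut], indices[cut:]
    return ([inputs[i] for i in train_indices],
            [outputs[i] for i in train_indices],
            [inputs[i] for i in test_indices],
            [outputs[i] for i in test_indices])


# toy data: 8 samples whose output is just twice the input
toy_inputs = list(range(8))
toy_outputs = [value * 2 for value in toy_inputs]
tr_in, tr_out, te_in, te_out = split_train_test(toy_inputs, toy_outputs)
```

The important part is that each input stays paired with its own output, and that no sample ends up in both parts.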



In [10]:

    
inputs, outputs = extract_samples(matches,
                                  input_features,
                                  output_feature)

normalizer, inputs = normalize(inputs)

train_inputs, train_outputs, test_inputs, test_outputs = split_samples(inputs, outputs)

n = buildNetwork(len(input_features),
                 10 * len(input_features),
                 10 * len(input_features),
                 1,
                 outclass=SigmoidLayer,
                 bias=True)

To evaluate the results and show progress during the learning cycle, we need these two functions, which help us calculate how well the network can predict the results of the matches used to learn, and of the matches it doesn't know.



In [11]:

    
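def neural_result(sample):
    """Call the neural network, and translate its output to a match result."""
    # the network has a single sigmoid output: 0.5 or more means team2 wins
    n_output = n.activate(sample)[0]
    if n_output >= 0.5:
        return 2
    else:
        return 1
    
def test_network():
    """Print the accuracy (in percent) on the train and test sets."""
    # build lists (not lazy map objects) so this works on Python 2 and 3
    train_predictions = [neural_result(sample) for sample in train_inputs]
    test_predictions = [neural_result(sample) for sample in test_inputs]
    print((100 - percentError(train_predictions, train_outputs),
           100 - percentError(test_predictions, test_outputs)))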
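
In case the "100 - percentError" part looks cryptic, this sketch does the same calculation without pybrain, so the meaning of the numbers is clear. The activations are made-up values, and both helper names are hypothetical.

```python
def threshold_to_result(activation, threshold=0.5):
    """Translate a sigmoid output into a result: 2 = team2 wins, 1 = team1 wins."""
    return 2 if activation >= threshold else 1


def accuracy_percent(predicted, expected):
    """Percentage of predictions that match the known results."""
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * hits / len(expected)


activations = [0.91, 0.12, 0.50, 0.49]  # made-up network outputs
expected = [2, 1, 1, 1]                 # made-up real results
predicted = [threshold_to_result(a) for a in activations]
accuracy = accuracy_percent(predicted, expected)
```

Here three of the four predictions are right (the 0.50 activation is rounded to a team2 win, which is wrong), so the accuracy is 75.0, the same kind of number test_network prints for each set.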

Create a train set (a kind of dataset that pybrain uses to train neural networks), and display initial accuracy on both sets (train and test).



In [12]:

    
train_set = ClassificationDataSet(len(input_features))

for sample_input, sample_output in zip(train_inputs, train_outputs):
    # results are 1 or 2, but the sigmoid output needs targets of 0 or 1
    train_set.addSample(sample_input, [sample_output - 1])

trainer = BackpropTrainer(n, dataset=train_set, momentum=0.5, weightdecay=0.0)

train_set.assignClasses()

test_network()









    



(50.78979343863912, 54.51263537906137)

Train the network for a given number of iterations. You can re-run this step many times and it will keep learning, but as you know, if you train too much you can end up overfitting the training data (this is visible when the test set accuracy starts to decrease).
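
This notebook simply runs a fixed number of iterations, but one possible way to decide when to stop is to remember the best score seen so far, and give up after a few iterations without improvement. This is only a sketch: best_epoch, the patience parameter and the scores are hypothetical. Strictly speaking, the scores used for this decision should come from a separate validation set rather than from the test set itself.

```python
def best_epoch(scores, patience=5):
    """Return (epoch, score) of the best accuracy, stopping early once
    `patience` epochs in a row fail to improve on it."""
    best_index, best_score = 0, scores[0]
    for index, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = index, score
        elif index - best_index >= patience:
            break  # no improvement for too long, stop looking
    return best_index + 1, best_score  # epochs are counted from 1

made_up_scores = [70.0, 72.5, 74.0, 73.0, 73.5, 72.0, 71.0, 75.0]
print(best_epoch(made_up_scores, patience=3))
# (3, 74.0)
```

With a larger patience the search would continue and find the later peak, so the value is a trade-off between training time and the risk of stopping too soon.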



In [13]:

    
for i in range(20):
    trainer.train()
    test_network()









    



(72.17496962332928, 77.9783393501805)
(73.02551640340218, 75.09025270758123)
(73.63304981773997, 75.81227436823104)
(73.63304981773997, 75.45126353790613)
(73.87606318347508, 74.72924187725631)
(74.24058323207777, 74.0072202166065)
(74.36208991494533, 74.36823104693141)
(73.87606318347508, 76.17328519855596)
(74.48359659781288, 75.09025270758123)
(73.99756986634264, 75.45126353790613)
(73.2685297691373, 72.20216606498195)
(74.726609963548, 74.36823104693141)
(74.726609963548, 74.72924187725631)
(74.24058323207777, 75.81227436823104)
(74.60510328068044, 75.09025270758123)
(74.96962332928311, 74.72924187725631)
(74.726609963548, 75.09025270758123)
(74.24058323207777, 72.92418772563177)
(74.36208991494533, 74.72924187725631)
(74.60510328068044, 76.17328519855596)

The closer this score is to 100%, the better the classifier is at making predictions. A score of 100 means the classifier always predicts the exact real result, something impossible.

Something around 75% sounds impressive, but keep it in perspective: with ties excluded there are only two possible results, so just flipping a coin will get you 50%. So this sits in the middle between flipping a coin and having a time machine.
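
The coin is one reference point. Another common one is to always predict the most frequent result: if one of the two results appears more often in the data, that trivial strategy already scores above 50%. A small sketch, where majority_baseline and the list of results are made up (they are not the real matches):

```python
from collections import Counter

def majority_baseline(results):
    """Accuracy (%) of always predicting the most frequent result."""
    most_common_count = Counter(results).most_common(1)[0][1]
    return 100.0 * most_common_count / len(results)

made_up_results = [1, 1, 1, 2, 1, 2, 1, 2]  # hypothetical winners
print(majority_baseline(made_up_results))
# 62.5
```

Comparing the classifier against this number on the real test set would tell how much it learned beyond "team1 usually wins" (or "team2 usually wins").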

Predict

With the classifier already trained, we can start making predictions. But we need a little function able to translate inputs like this: (2014, 'Argentina', 'Brazil'), to the numeric inputs the classifier expects (based on the input features).

This function does the conversion, also normalizes the data with the same normalizer used before, and then just asks the classifier for the prediction.



In [14]:

    
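def predict(year, team1, team2):
    """Predict the result of a match, returning the name of the winner."""
    inputs = []
    
    for feature in input_features:
        # features ending in '_2' are stats of the second team
        from_team_2 = feature.endswith('_2')
        if from_team_2:
            feature = feature[:-2]
        
        if feature in team_stats.columns.values:
            team = team2 if from_team_2 else team1
            value = team_stats.loc[team, feature]
        elif feature == 'year':
            value = year
        else:
            raise ValueError("Don't know where to get feature: " + feature)
            
        inputs.append(value)
        
    # use the same normalizer that was fitted before training
    inputs = normalizer.transform(inputs)
    result = neural_result(inputs)
    
    # with ties excluded neural_result only returns 1 or 2,
    # the other branches are kept just in case that changes
    if result == 0:
        return 'tie'
    elif result == 1:
        return team1
    elif result == 2:
        return team2
    else:
        return 'Unknown result: ' + str(result)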
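
To make the conversion step more concrete, here is a self-contained sketch of it without pandas or the network. It only builds the raw (not yet normalized) input vector. The stats are copied from the team stats table of this notebook, while raw_inputs and the plain dict layout are just assumptions for the example:

```python
# stats copied from the team stats table of this notebook
stats = {
    'Argentina': {'matches_won_percent': 51.5625,
                  'podium_score_yearly': 3.076923},
    'Brazil': {'matches_won_percent': 70.786517,
               'podium_score_yearly': 6.375},
}
features = ['year', 'matches_won_percent', 'podium_score_yearly',
            'matches_won_percent_2', 'podium_score_yearly_2']

def raw_inputs(year, team1, team2):
    """Build the un-normalized input vector for a match."""
    values = []
    for feature in features:
        if feature == 'year':
            values.append(year)
        elif feature.endswith('_2'):
            # stats of the second team, stored without the suffix
            values.append(stats[team2][feature[:-2]])
        else:
            values.append(stats[team1][feature])
    return values

print(raw_inputs(2014, 'Argentina', 'Brazil'))
# [2014, 51.5625, 3.076923, 70.786517, 6.375]
```

Swapping the teams swaps the two halves of the vector, which is why the order of team1 and team2 matters for the prediction.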

Some predictions about the past, compared to real results:

Even though we know those results, and some of them were used to train, that doesn't guarantee the classifier will predict the real result.



In [15]:

    
predict(1950, 'Mexico', 'Brazil')  # real result: 4-0 wins Brazil









    Out[15]:





'Brazil'



In [16]:

    
predict(1990, 'United Arab Emirates', 'Colombia')  # real result: 2-0 wins Colombia









    Out[16]:





'Colombia'



In [17]:

    
predict(2002, 'South Africa', 'Spain')  # real result: 2-3 wins Spain









    Out[17]:





'Spain'



In [18]:

    
predict(2010, 'Japan', 'Cameroon')  # real result: 1-0 wins Japan









    Out[18]:





'Japan'

Some predictions about the future:

(at least these were "future" at the moment of programming)



In [19]:

    
predict(2014, 'Argentina', 'Brazil')









    Out[19]:





'Argentina'



In [20]:

    
predict(2014, 'Spain', 'Haiti')









    Out[20]:





'Spain'



In [21]:

    
predict(2014, 'Russia', 'Germany')









    Out[21]:





'Germany'



In [22]:

    
predict(2014, 'Russia', 'Russia')









    Out[22]:





'Russia'

team	matches_played	matches_won	years_played	podium_score	cups_won	matches_won_percent	podium_score_yearly	cups_won_yearly
Brazil	89	63	16	102	5	70.786517	6.375000	0.312500
Canada	3	0	1	0	0	0.000000	0.000000	0.000000
Serbia and Montenegro	3	0	1	0	0	0.000000	0.000000	0.000000
Kuwait	3	0	1	0	0	0.000000	0.000000	0.000000
Scotland	23	4	8	0	0	17.391304	0.000000	0.000000
Costa Rica	10	3	3	0	0	30.000000	0.000000	0.000000
Ivory Coast	6	2	2	0	0	33.333333	0.000000	0.000000
Wales	5	1	1	0	0	20.000000	0.000000	0.000000
Argentina	64	33	13	40	2	51.562500	3.076923	0.153846
Bolivia	4	0	2	0	0	0.000000	0.000000	0.000000
Cameroon	20	4	6	0	0	20.000000	0.000000	0.000000
Ecuador	7	3	2	0	0	42.857143	0.000000	0.000000
Ghana	9	4	2	0	0	44.444444	0.000000	0.000000
Saudi Arabia	13	2	4	0	0	15.384615	0.000000	0.000000
Australia	10	2	3	0	0	20.000000	0.000000	0.000000
Iran	9	1	3	0	0	11.111111	0.000000	0.000000
Algeria	9	2	3	0	0	22.222222	0.000000	0.000000
El Salvador	6	0	2	0	0	0.000000	0.000000	0.000000
Republic of Ireland	13	2	3	0	0	15.384615	0.000000	0.000000
Slovenia	6	1	2	0	0	16.666667	0.000000	0.000000
Chile	26	7	7	4	0	26.923077	0.571429	0.000000
Belgium	32	10	8	2	0	31.250000	0.250000	0.000000
Haiti	3	0	1	0	0	0.000000	0.000000	0.000000
Iraq	3	0	1	0	0	0.000000	0.000000	0.000000
Spain	53	27	12	18	1	50.943396	1.500000	0.083333
China PR	3	0	1	0	0	0.000000	0.000000	0.000000
Netherlands	41	22	7	26	0	53.658537	3.714286	0.000000
Denmark	16	8	4	0	0	50.000000	0.000000	0.000000
Poland	30	15	6	8	0	50.000000	1.333333	0.000000
Morocco	13	2	4	0	0	15.384615	0.000000	0.000000
Croatia	13	6	3	4	0	46.153846	1.333333	0.000000
Switzerland	24	7	7	0	0	29.166667	0.000000	0.000000
Honduras	6	0	2	0	0	0.000000	0.000000	0.000000
New Zealand	6	0	2	0	0	0.000000	0.000000	0.000000
Jamaica	3	1	1	0	0	33.333333	0.000000	0.000000
England	59	26	13	18	1	44.067797	1.384615	0.076923
Uruguay	43	14	10	22	1	32.558140	2.200000	0.100000
United Arab Emirates	3	0	1	0	0	0.000000	0.000000	0.000000
South Africa	9	2	3	0	0	22.222222	0.000000	0.000000
Egypt	3	0	1	0	0	0.000000	0.000000	0.000000
Colombia	13	3	4	0	0	23.076923	0.000000	0.000000
South Korea	28	5	8	2	0	17.857143	0.250000	0.000000
Turkey	10	5	2	4	0	50.000000	2.000000	0.000000
Italy	71	36	15	54	2	50.704225	3.600000	0.133333
Czech Republic	3	1	1	0	0	33.333333	0.000000	0.000000
France	48	23	10	34	1	47.916667	3.400000	0.100000
Slovakia	4	1	1	0	0	25.000000	0.000000	0.000000
Peru	13	4	3	0	0	30.769231	0.000000	0.000000
Norway	7	2	2	0	0	28.571429	0.000000	0.000000
Nigeria	14	4	4	0	0	28.571429	0.000000	0.000000
Israel	3	0	1	0	0	0.000000	0.000000	0.000000
Zaire	3	0	1	0	0	0.000000	0.000000	0.000000
Czechoslovakia	23	7	6	8	0	30.434783	1.333333	0.000000
Austria	25	10	6	4	0	40.000000	0.666667	0.000000
Togo	3	0	1	0	0	0.000000	0.000000	0.000000
Germany	98	59	15	94	3	60.204082	6.266667	0.200000
Ukraine	5	2	1	0	0	40.000000	0.000000	0.000000
Northern Ireland	13	3	3	0	0	23.076923	0.000000	0.000000
United States	25	5	7	0	0	20.000000	0.000000	0.000000
Trinidad and Tobago	3	0	1	0	0	0.000000	0.000000	0.000000
	...	...	...	...	...	...	...	...

	score1	score2	team1	team2	year	score_diff	winner	matches_played	matches_won	years_played	podium_score	cups_won	matches_won_percent	podium_score_yearly	cups_won_yearly	matches_played_2	matches_won_2	years_played_2	podium_score_2	cups_won_2
0	4	0	Brazil	Mexico	1950	4	1	89	63	16	102	5	70.786517	6.375000	0.312500	46	12	13	0	0	...
1	3	0	Yugoslavia	Switzerland	1950	3	1	34	14	8	2	0	41.176471	0.250000	0.000000	24	7	7	0	0	...
3	4	1	Yugoslavia	Mexico	1950	3	1	34	14	8	2	0	41.176471	0.250000	0.000000	46	12	13	0	0	...
4	2	0	Brazil	Yugoslavia	1950	2	1	89	63	16	102	5	70.786517	6.375000	0.312500	34	14	8	2	0	...
5	2	1	Switzerland	Mexico	1950	1	1	24	7	7	0	0	29.166667	0.000000	0.000000	46	12	13	0	0	...
6	2	0	England	Chile	1950	2	1	59	26	13	18	1	44.067797	1.384615	0.076923	26	7	7	4	0	...
7	3	1	Spain	United States	1950	2	1	53	27	12	18	1	50.943396	1.500000	0.083333	25	5	7	0	0	...
8	2	0	Spain	Chile	1950	2	1	53	27	12	18	1	50.943396	1.500000	0.083333	26	7	7	4	0	...
9	1	0	United States	England	1950	1	1	25	5	7	0	0	20.000000	0.000000	0.000000	59	26	13	18	1	...
10	1	0	Spain	England	1950	1	1	53	27	12	18	1	50.943396	1.500000	0.083333	59	26	13	18	1	...
11	5	2	Chile	United States	1950	3	1	26	7	7	4	0	26.923077	0.571429	0.000000	25	5	7	0	0	...
12	3	2	Sweden	Italy	1950	1	1	41	14	9	16	0	34.146341	1.777778	0.000000	71	36	15	54	2	...
14	2	0	Italy	Paraguay	1950	2	1	71	36	15	54	2	50.704225	3.600000	0.133333	25	6	7	0	0	...
15	8	0	Uruguay	Bolivia	1950	8	1	43	14	10	22	1	32.558140	2.200000	0.100000	4	0	2	0	0	...
17	7	1	Brazil	Sweden	1950	6	1	89	63	16	102	5	70.786517	6.375000	0.312500	41	14	9	16	0	...
18	6	1	Brazil	Spain	1950	5	1	89	63	16	102	5	70.786517	6.375000	0.312500	53	27	12	18	1	...
19	3	2	Uruguay	Sweden	1950	1	1	43	14	10	22	1	32.558140	2.200000	0.100000	41	14	9	16	0	...
20	3	1	Sweden	Spain	1950	2	1	41	14	9	16	0	34.146341	1.777778	0.000000	53	27	12	18	1	...
21	2	1	Uruguay	Brazil	1950	1	1	43	14	10	22	1	32.558140	2.200000	0.100000	89	63	16	102	5	...
22	5	0	Brazil	Mexico	1954	5	1	89	63	16	102	5	70.786517	6.375000	0.312500	46	12	13	0	0	...
23	1	0	Yugoslavia	France	1954	1	1	34	14	8	2	0	41.176471	0.250000	0.000000	48	23	10	34	1	...
25	3	2	France	Mexico	1954	1	1	48	23	10	34	1	47.916667	3.400000	0.100000	46	12	13	0	0	...
26	4	1	Germany	Turkey	1954	3	1	98	59	15	94	3	60.204082	6.266667	0.200000	10	5	2	4	0	...
27	9	0	Hungary	South Korea	1954	9	1	26	11	7	8	0	42.307692	1.142857	0.000000	28	5	8	2	0	...
28	8	3	Hungary	Germany	1954	5	1	26	11	7	8	0	42.307692	1.142857	0.000000	98	59	15	94	3	...
29	7	0	Turkey	South Korea	1954	7	1	10	5	2	4	0	50.000000	2.000000	0.000000	28	5	8	2	0	...
30	7	2	Germany	Turkey	1954	5	1	98	59	15	94	3	60.204082	6.266667	0.200000	10	5	2	4	0	...
31	2	0	Uruguay	Czechoslovakia	1954	2	1	43	14	10	22	1	32.558140	2.200000	0.100000	23	7	6	8	0	...
32	1	0	Austria	Scotland	1954	1	1	25	10	6	4	0	40.000000	0.666667	0.000000	23	4	8	0	0	...
33	7	0	Uruguay	Scotland	1954	7	1	43	14	10	22	1	32.558140	2.200000	0.100000	23	4	8	0	0	...
34	5	0	Austria	Czechoslovakia	1954	5	1	25	10	6	4	0	40.000000	0.666667	0.000000	23	7	6	8	0	...
35	2	1	Switzerland	Italy	1954	1	1	24	7	7	0	0	29.166667	0.000000	0.000000	71	36	15	54	2	...
37	4	1	Italy	Belgium	1954	3	1	71	36	15	54	2	50.704225	3.600000	0.133333	32	10	8	2	0	...
38	2	0	England	Switzerland	1954	2	1	59	26	13	18	1	44.067797	1.384615	0.076923	24	7	7	0	0	...
39	4	1	Switzerland	Italy	1954	3	1	24	7	7	0	0	29.166667	0.000000	0.000000	71	36	15	54	2	...
40	7	5	Austria	Switzerland	1954	2	1	25	10	6	4	0	40.000000	0.666667	0.000000	24	7	7	0	0	...
41	4	2	Uruguay	England	1954	2	1	43	14	10	22	1	32.558140	2.200000	0.100000	59	26	13	18	1	...
42	2	4	Brazil	Hungary	1954	-2	2	89	63	16	102	5	70.786517	6.375000	0.312500	26	11	7	8	0	...
43	0	2	Yugoslavia	Germany	1954	-2	2	34	14	8	2	0	41.176471	0.250000	0.000000	98	59	15	94	3	...
44	4	2	Hungary	Uruguay	1954	2	1	26	11	7	8	0	42.307692	1.142857	0.000000	43	14	10	22	1	...
45	6	1	Germany	Austria	1954	5	1	98	59	15	94	3	60.204082	6.266667	0.200000	25	10	6	4	0	...
46	1	3	Uruguay	Austria	1954	-2	2	43	14	10	22	1	32.558140	2.200000	0.100000	25	10	6	4	0	...
47	2	3	Hungary	Germany	1954	-1	2	26	11	7	8	0	42.307692	1.142857	0.000000	98	59	15	94	3	...
48	3	1	Germany	Argentina	1958	2	1	98	59	15	94	3	60.204082	6.266667	0.200000	64	33	13	40	2	...
49	1	0	Northern Ireland	Czechoslovakia	1958	1	1	13	3	3	0	0	23.076923	0.000000	0.000000	23	7	6	8	0	...
50	3	1	Argentina	Northern Ireland	1958	2	1	64	33	13	40	2	51.562500	3.076923	0.153846	13	3	3	0	0	...
53	6	1	Czechoslovakia	Argentina	1958	5	1	23	7	6	8	0	30.434783	1.333333	0.000000	64	33	13	40	2	...
54	2	1	Northern Ireland	Czechoslovakia	1958	1	1	13	3	3	0	0	23.076923	0.000000	0.000000	23	7	6	8	0	...
55	7	3	France	Paraguay	1958	4	1	48	23	10	34	1	47.916667	3.400000	0.100000	25	6	7	0	0	...
57	3	2	Yugoslavia	France	1958	1	1	34	14	8	2	0	41.176471	0.250000	0.000000	48	23	10	34	1	...
58	3	2	Paraguay	Scotland	1958	1	1	25	6	7	0	0	24.000000	0.000000	0.000000	23	4	8	0	0	...
59	2	1	France	Scotland	1958	1	1	48	23	10	34	1	47.916667	3.400000	0.100000	23	4	8	0	0	...
61	3	0	Sweden	Mexico	1958	3	1	41	14	9	16	0	34.146341	1.777778	0.000000	46	12	13	0	0	...
64	2	1	Sweden	Hungary	1958	1	1	41	14	9	16	0	34.146341	1.777778	0.000000	26	11	7	8	0	...
66	4	0	Hungary	Mexico	1958	4	1	26	11	7	8	0	42.307692	1.142857	0.000000	46	12	13	0	0	...
67	2	1	Wales	Hungary	1958	1	1	5	1	1	0	0	20.000000	0.000000	0.000000	26	11	7	8	0	...
68	3	0	Brazil	Austria	1958	3	1	89	63	16	102	5	70.786517	6.375000	0.312500	25	10	6	4	0	...
71	2	0	Russia	Austria	1958	2	1	37	17	9	2	0	45.945946	0.222222	0.000000	25	10	6	4	0	...
73	2	0	Brazil	Russia	1958	2	1	89	63	16	102	5	70.786517	6.375000	0.312500	37	17	9	2	0	...
74	1	0	Russia	England	1958	1	1	37	17	9	2	0	45.945946	0.222222	0.000000	59	26	13	18	1	...
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...