Sorin Peste (sorinpeste@gmail.com)
This is a walkthrough on using Machine Learning to predict who is going to win the next FIFA World Cup.
We're going to use historical information about international football (soccer) matches to build a model capable of predicting future match results.
Afterwards, we're going to use that model to run multiple simulations of the next World Cup tournament, and produce statistics about which teams are the most likely to win it all.
This document is meant for people who are new to Machine Learning, and want to better understand the data science process, as well as the R language.
NOTE: If you attended any of my presentations about this walkthrough, you can find the presentation slides here.
NOTE 2: The GitHub repository containing all the code I used is https://github.com/neaorin/PredictTheWorldCup. There you can also find the JavaScript code I used to acquire the historical match data from the FIFA website, as well as the Python program which simulates the tournament thousands of times.
If you want to do this walkthrough on your own machine, or if you'd like to customize it, you can clone this notebook by using the Clone button at the top of this page. This will copy the entire notebook into your own Azure Notebooks workspace, where you can edit it.
You can also find an R source file containing the entire code on my GitHub repo. Open that with the editor of your choice - RStudio is a popular one.
With that out of the way, let's begin!
Below is a typical workflow for a Data Science project such as ours.
The first step is defining an objective, an outcome for our model to predict:
Given a match between two teams, what is the expected goal differential at the end of the match?
In other words, we're going to attempt to predict the outcome variable illustrated below, which is the difference between the number of goals scored by team1 and the number of goals scored by team2.
Let's note that the outcome value can be zero (for draws), positive (whenever team1 wins) or negative (whenever team2 wins the match).
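As a quick illustration (with made-up scores), here is how the outcome value relates to a final score:
# toy example: outcome is simply team1's goals minus team2's goals
example <- data.frame(
  team1 = c("AUT", "IRL", "FRA", "PER"),
  team2 = c("LUX", "DEN", "ITA", "CHI"),
  team1Score = c(7, 2, 1, 0),
  team2Score = c(0, 1, 1, 2)
)
example$outcome <- example$team1Score - example$team2Score
example
# outcome is 7, 1, 0 and -2 respectively: positive when team1 wins, zero for a draw, negative when team2 wins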
We're going to use a dataset containing more than thirty thousand international football matches played between 1950 and 2017. All of these matches were played between senior men's national teams - there are no club matches, and no youth or women's games.
The dataset is available as CSV and JSON files.
Below is a small sample from the JSON file:
[
{
"date": "19560930",
"team1": "AUT",
"team1Text": "Austria",
"team2": "LUX",
"team2Text": "Luxembourg",
"resText": "7-0",
"statText": "",
"venue": "Ernst Happel Stadium - Vienna , Austria",
"IdCupSeason": "10",
"CupName": "FIFA World Cup™ Qualifier",
"team1Score": "7",
"team2Score": "0"
},
{
"date": "19561003",
"team1": "IRL",
"team1Text": "Republic of Ireland",
"team2": "DEN",
"team2Text": "Denmark",
"resText": "2-1",
"statText": "",
"venue": "DUBLIN - Dublin , Republic of Ireland",
"IdCupSeason": "10",
"CupName": "FIFA World Cup™ Qualifier",
"team1Score": "2",
"team2Score": "1"
},
...
]
We also have some information regarding the international associations and the FIFA confederations they are part of. We may find that useful when looking at past opponents of a team.
This dataset is available as a CSV file.
Below is a sample of the data:
confederation,name,fifa_code,ioc_code
CAF,Algeria,ALG,ALG
CAF,Angola,ANG,ANG
CAF,Benin,BEN,BEN
CAF,Botswana,BOT,BOT
...
AFC,Afghanistan,AFG,AFG
AFC,Australia,AUS,---
AFC,Bahrain,BHR,BRN
AFC,Bangladesh,BAN,BAN
...
UEFA,Albania,ALB,ALB
UEFA,Andorra,AND,AND
UEFA,Armenia,ARM,ARM
UEFA,Austria,AUT,AUT
...
CONMEBOL,Argentina,ARG,ARG
CONMEBOL,Bolivia,BOL,BOL
CONMEBOL,Brazil,BRA,BRA
CONMEBOL,Chile,CHI,CHI
Last, we have a list of the teams which have qualified for the World Cup, and their group stage draw.
This dataset is available as a CSV file.
Below is a sample of the data:
name,draw
RUS,A1
IRN,F2
KOR,A3
JPN,G2
KSA,H2
AUS,B3
TUN,A2
NGA,B2
CIV,C2
...
First, we're going to load a few R libraries from CRAN - the Comprehensive R Archive Network - into our environment.
In [2]:
# prepare the R environment
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
dplyr, # Data munging functions
zoo, # Feature engineering rolling aggregates
data.table, # Feature engineering
ggplot2, # Graphics
scales, # Time formatted axis
readr, # Reading input files
stringr, # String functions
reshape2, # restructure and aggregate data
randomForest, # Random forests
corrplot, # correlation plots
Metrics, # Eval metrics for ML
vcd # Visualizing discrete distributions
)
# set options for plots
options(repr.plot.width=6, repr.plot.height=6)
In [3]:
# Load the matches data
if(!file.exists("matches.csv")){
tryCatch(download.file('https://github.com/neaorin/PredictTheWorldCup/raw/master/input/matches.csv'
,destfile="./matches.csv",method="auto"))
}
if(file.exists("matches.csv")) matches_original <- read_csv("matches.csv")
head(matches_original)
First let's perform some basic cleanup on the dataset.
In [4]:
# eliminate any duplicates that may exist in the dataset
matches <- matches_original %>%
distinct(.keep_all = TRUE, date, team1, team2)
# the date field is formatted as a string (e.g. 19560930) - transform that into R date
matches$date <- as.POSIXct(strptime(matches$date, "%Y%m%d"), origin="1960-01-01", tz="UTC")
# generate an id column for future use (joins etc)
matches$match_id = seq.int(nrow(matches))
head(matches)
summary(matches)
More often than not, the best way to understand a dataset is to turn it into a picture.
Or rather, multiple pictures.
Fortunately, R has some useful tools in this regard - and a lot of them come with the very popular ggplot2 package.
The ggplot2 documentation and the many tutorials available online are useful resources when learning the package.
For example, let's get a sense of the number of games which have been played over the years, and how close they were from a competitive standpoint.
In [5]:
# how many international games have been played over the years?
matches %>%
ggplot(mapping = aes(year(date))) +
geom_bar(aes(fill=CupName), width=1, color="black") +
theme(legend.position = "bottom", legend.direction = "vertical") + ggtitle("Matches played by Year")
In [6]:
# how many goals have been scored per game over the years?
matches %>%
dplyr::group_by(year = year(date)) %>%
dplyr::summarize(
totalgames = n(),
totalgoals = sum(team1Score + team2Score),
goalspergame = totalgoals / totalgames
) %>%
ggplot(mapping = aes(x = year, y = goalspergame)) +
geom_point() +
geom_smooth(method = "loess") + ggtitle("Goals scored per game, over time")
In [7]:
# what values is our dataset missing?
ggplot_missing <- function(x){
x %>%
is.na %>%
melt %>%
ggplot(mapping = aes(x = Var2,
y = Var1)) +
geom_raster(aes(fill = value)) +
scale_fill_grey(name = "",
labels = c("Present","Missing")) +
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
labs(x = "Variables in Dataset",
y = "Rows / observations")
}
ggplot_missing(matches)
#Amelia::missmap(matches, main = "Missing values")
A few things we can note from our graphs above, as well as examining the dataset:
Although our sample size of matches played before 1960 is fairly small, we can see the era of free-wheeling, mostly attacking football coming to an end with the perfection of defensive tactics like catenaccio in Italy, and eventually giving way to the era of Total Football, which took off in the early '70s. We may need to factor some of these developments into our model.
pen1Score and pen2Score are only present whenever a match ended on penalties. All missing values indicate a match which did not get to a penalty shootout. Therefore, the values aren't really missing, so we can use these features if we want; however, the number of observations is fairly small, and penalty shoot-outs do have a reputation of being a lottery of sorts, especially at the highest level of play when the prize is advancement to a later stage of the World Cup.
One additional thing to note is that, for matches ending on penalties, team1Score = team2Score; these values do not include the penalty shoot-out goals. Still, those games weren't really draws, since they were decided on penalties. From a pure performance standpoint, however, we might decide to consider a team losing on penalties as being closer to a draw than an actual loss.
statText is also only present for matches which didn't end in regulation time. It includes extra-time matches as well as penalty shoot-outs as above. Unlike pen1Score and pen2Score, however, team1Score and team2Score do include all the goals scored in extra time. As before, we might decide that a team performed better if they lost in extra time vs. a loss in regulation; however, we will disregard this field for now.
venue is interesting for the purpose of determining the home team, which in football has a distinct advantage (as we will later conclude). However, since venue is a text field, we will need to do some pattern matching with team names to determine the correct home team, which may present several problems. Also, about 15 percent of values are missing for this column, which will force us to consider these matches as being played in a neutral venue.
CupName can be useful to determine whether a game was played as a friendly, a qualifier, or a final tournament. Simple pattern matching will be enough for this task.
IdCupSeason can be ignored at this time.
resText can be ignored as all the information it contains is also available in other non-text fields.
Finally, we may have an issue with the fact that team1 consistently performs better than team2 - likely because most of the games list the home team first:
In [8]:
summary(matches$team1Score - matches$team2Score)
The mean value for the goal differential is greater than 0.5, which may present a problem later on when training the model - it may pick up a bias that team1 is better than team2, which is something we'd rather avoid, especially since the World Cup final tournament is played in a single country.
So let's get rid of that by simply randomizing the order in which teams are listed for any one match.
In [9]:
set.seed(4342)
matches$switch = runif(nrow(matches), min = 0, max = 1)
matches <- bind_rows(
matches %>% dplyr::filter(switch < 0.5),
matches %>% dplyr::filter(switch >= 0.5) %>%
dplyr::mutate(
x_team2 = team2,
team2 = team1,
team1 = x_team2,
x_team2Text = team2Text,
team2Text = team1Text,
team1Text = x_team2Text,
x_resText = "",
x_team2Score = team2Score,
team2Score = team1Score,
team1Score = x_team2Score,
x_team2PenScore = team2PenScore,
team2PenScore = team1PenScore,
team1PenScore = x_team2PenScore
) %>%
dplyr::select(
date, team1, team1Text, team2, team2Text, resText, statText, venue, IdCupSeason, CupName, team1Score, team2Score, team1PenScore, team2PenScore, match_id, switch
)
) %>%
dplyr::arrange(date) %>%
dplyr::select(-c(switch))
summary(matches$team1Score - matches$team2Score)
Now, we can create some additional features about the matches.
In [10]:
# is the game played in a neutral venue
matches$team1Home <- mapply(grepl, pattern=matches$team1Text, x=matches$venue, MoreArgs = list(fixed = TRUE, ignore.case = FALSE))
matches$team2Home <- mapply(grepl, pattern=matches$team2Text, x=matches$venue, MoreArgs = list(fixed = TRUE, ignore.case = FALSE))
matches$neutralVenue <- !(matches$team1Home | matches$team2Home)
# text-matching the venue is not 100% accurate.
# some games get TRUE for both team1 and team2 (ex. Congo DR vs Congo)
# in this case, team1 is at home
matches$team2Home[(matches$team1Home == TRUE) & (matches$team2Home == TRUE)] <- FALSE
# game type: Friendly, Qualifier, Final Tournament
matches$friendly <- FALSE
matches$friendly[matches$CupName == "Friendly"] <- TRUE
matches$qualifier <- FALSE
matches$qualifier[matches$CupName %like% "qual"] <- TRUE
matches$finaltourn <- FALSE
matches$finaltourn[matches$CupName %like% "final"] <- TRUE
head(matches)
At this point, we're going to eliminate friendly matches from the dataset.
This decision is based on the observation that, with few exceptions, the main objective for a team playing a friendly is not to win it, but to evaluate its own players and tactics.
For this reason it's not uncommon for friendlies to allow an unlimited number of substitutions, and for a team to roll out its entire squad during a friendly game.
If you'd like to experiment with keeping friendlies in the dataset, you can comment out the line below.
In [11]:
# only use official matches (no friendlies)
matches <- matches %>% dplyr::filter(friendly == FALSE)
Up until this point we've only looked at individual matches. However, what we really need is to look at each team's performance over its history.
When we build our predictive model, we'd like to supply it with as many features about each of the teams about to be involved in a match. For that, we need to have a team-centric dataset with historical data.
Building this dataset is simple: take each observation in matches - which has the form "team1 vs team2" - and produce two separate observations of the form "team1 played against team2" and "team2 played against team1" respectively.
In [12]:
# transform the matches table into a team performance table, where each team being
# involved in a game is a separate observation (row)
teamperf <- bind_rows(
(matches %>%
dplyr::mutate(
name = team1,
opponentName = team2,
homeVenue = team1Home,
neutralVenue = neutralVenue,
gs = team1Score,
ga = team2Score,
gd = gs - ga,
w = (team1Score > team2Score),
l = (team1Score < team2Score),
d = (team1Score == team2Score),
friendly = friendly,
qualifier = qualifier,
finaltourn = finaltourn
) %>%
dplyr::select (match_id, date, name, opponentName, homeVenue, neutralVenue, gs, ga, gd, w, l, d, friendly, qualifier, finaltourn))
,
(matches %>%
dplyr::mutate(
name = team2,
opponentName = team1,
homeVenue = team2Home,
neutralVenue = neutralVenue,
gs = team2Score,
ga = team1Score,
gd = gs - ga,
w = (team1Score < team2Score),
l = (team1Score > team2Score),
d = (team1Score == team2Score),
friendly = friendly,
qualifier = qualifier,
finaltourn = finaltourn
) %>%
dplyr::select (match_id, date, name, opponentName, homeVenue, neutralVenue, gs, ga, gd, w, l, d, friendly, qualifier, finaltourn))
) %>%
dplyr::arrange(date)
head(teamperf)
In order to capture some information about how good each team is, let's define a winning percentage formula:
winpercentage = (wins + 0.5 * draws) / games played
Then, let's plot that for each team which has played a significant number of games.
We're going to define the win percentage formula and plot as R functions, since we might want to re-use them after we further tweak our dataset.
In [13]:
# Out of the teams who have played at least 100 games, what are the winning percentages for each of those teams?
formula_winpercentage <- function(totalgames, wins, draws) {
return ((wins + 0.5 * draws) / totalgames)
}
plot_winpercentage <- function(teamperf, mingames) {
teamperf %>%
group_by(name) %>%
summarize(
totalgames = n(),
wins = length(w[w==TRUE]),
draws = length(d[d==TRUE]),
winpercentage = formula_winpercentage(totalgames, wins, draws)
) %>%
filter(totalgames >= mingames ) %>%
ggplot(mapping = aes(x = winpercentage, y = totalgames)) +
geom_point(size = 1.5) +
geom_text(aes(label=name), hjust=-.2 , vjust=-.2, size=3) +
geom_vline(xintercept = .5, linetype = 2, color = "red") +
ggtitle("Winning Percentage vs Games Played") +
expand_limits(x = c(0,1))
}
plot_winpercentage(teamperf, 100)
Straight away we can see that there are some potential issues with our dataset.
For one thing, some countries have ceased to exist, either because they dissolved into multiple countries - for example the Soviet Union (URS), Yugoslavia (YUG) or Czechoslovakia (TCH) - or because they united into one country, as was the case with the German reunification of 1990, when West Germany (FRG) and East Germany (GDR) unified into a single Germany (GER).
Another case is when a country renamed itself - for example from Zaire (ZAI) to Democratic Republic of the Congo (COD).
There is a complete list of obsolete FIFA country codes - codes which stood for countries and territories that no longer exist.
From our perspective, for the purposes of continuity we would like to consider the new countries as successors to (part of) the old ones, because it will allow us to take past performance into account instead of starting from scratch. The process is not 100% straightforward - for example, which of the six countries should we consider as a successor to Yugoslavia? - but we will make a best-effort attempt.
In [14]:
# transform old country codes into new ones.
countryCodeMappings <- matrix(c(
"FRG","GER",
"TCH","CZE",
"URS","RUS",
"SCG","SRB",
"ZAI","COD"
), ncol=2, byrow = TRUE)
for (i in 1:nrow(countryCodeMappings)) {
teamperf$name[teamperf$name == countryCodeMappings[i,1]] <- countryCodeMappings[i,2]
teamperf$opponentName[teamperf$opponentName == countryCodeMappings[i,1]] <- countryCodeMappings[i,2]
matches$team1[matches$team1 == countryCodeMappings[i,1]] <- countryCodeMappings[i,2]
matches$team2[matches$team2 == countryCodeMappings[i,1]] <- countryCodeMappings[i,2]
}
In [15]:
# let's run the win percentage graph again
plot_winpercentage(teamperf, 100)
Since our model will predict match results, it would be useful to also look at the distribution of match scores as well as total number of goals scored per game.
In [16]:
# what is the occurrence frequency for match scores?
scorefreq <- matches %>%
group_by(team1Score, team2Score) %>%
summarise(
n = n(),
freq = n / nrow(matches)
) %>%
ungroup() %>%
mutate(
scoretext = paste(team1Score,"-",team2Score)
) %>%
arrange(desc(freq))
head(scorefreq, 20)
In [17]:
# distribution of goals scored per match
gsfreq <- matches %>%
group_by(gs = team1Score + team2Score) %>%
summarise(
n = n(),
freq = n / nrow(matches)
) %>%
ungroup() %>%
arrange(desc(freq))
head(gsfreq, 10)
gsfreq %>%
filter(freq >= 0.01) %>%
ggplot(mapping = aes(x = gs, y = freq)) + geom_bar(stat = "identity") + ggtitle("Goals scored per match distribution")
In [18]:
# distribution of goal differential
gdfreq <- matches %>%
group_by(gd = team1Score - team2Score) %>%
summarise(
n = n(),
freq = n / nrow(matches)
) %>%
ungroup() %>%
arrange(gd)
head(gdfreq %>% filter(abs(gd)<=4), 10)
gdfreq %>%
filter(abs(gd)<=4) %>%
ggplot(mapping = aes(x = gd, y = freq)) + geom_bar(stat = "identity") + ggtitle("Goal differential distribution")
Since our aim is to predict the goal differential (win margin) between the two teams in a match, we'd like to get rid of outliers - values which lie far out at the extremes of this variable's range. The reason is that outliers can drastically change the results of data analysis and statistical modeling. They increase the error variance, reduce the power of statistical tests, and can ultimately bias or influence estimates.
So let's deal with all matches where the absolute goal differential is greater than 7.
First, let's verify how many of those we've got in the first place:
In [19]:
# how many outliers do we have?
temp <- matches %>% dplyr::filter(abs(team1Score - team2Score) > 7)
head(temp)
paste(nrow(temp), "matches, or", (nrow(temp)/nrow(matches)*100), "% of total.")
Not bad - only a very small percentage of games would be outliers as per our definition.
NOTE: As a rule of thumb, any value which is out of range of the 5th and 95th percentile may be considered an outlier.
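If we wanted to apply that rule of thumb here, we could inspect the percentiles of the goal differential directly (a quick illustrative check, not part of the main flow):
# check the 5th and 95th percentiles of the goal differential
quantile(matches$team1Score - matches$team2Score, probs = c(0.05, 0.95))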
Let's deal with the outliers in teamperf by capping the goal differential to the [-7, +7] interval:
In [20]:
# get rid of all the outliers by capping the gd to [-7, +7]
teamperf$gd[teamperf$gd < -7] <- -7
teamperf$gd[teamperf$gd > +7] <- +7
We may also want to take into account the fact that teams play most of their matches against opponents from the same FIFA Confederation - for example, European teams play mostly against other UEFA members, while African teams face other CAF members for the most part. Only during final tournaments like the Olympics, the World Cup and the Confederations Cup will teams play official (non-friendly) matches against non-confederation opponents.
Since not all confederations are of the same general strength, we can adjust our teamperf dataset to also include information about the confederation the opponent belongs to. We can assign adjustment coefficients to each confederation, in a similar way to how FIFA's World Ranking algorithm accounts for regional strength.
In [21]:
# get information about the various FIFA confederations and the teams they contain
if(!file.exists("teams.csv")){
tryCatch(download.file('https://raw.githubusercontent.com/neaorin/PredictTheWorldCup/master/input/teams.csv'
,destfile="./teams.csv",method="auto"))
}
if(file.exists("teams.csv")) teams <- read_csv("teams.csv")
# confederations and adjustment coefficients for them
confederations <- as.data.frame(matrix(c(
"UEFA","0.99",
"CONMEBOL","1.00",
"CONCACAF","0.85",
"AFC","0.85",
"CAF","0.85",
"OFC","0.85"
), ncol=2, byrow = TRUE, dimnames = list(NULL, c("confederation","adjust"))), stringsAsFactors = FALSE)
confederations$confederation <- as.vector(confederations$confederation)
confederations$adjust <- as.numeric(confederations$adjust)
# add a confederation coefficient for the opponent faced
teamperf <- teamperf %>%
dplyr::left_join(teams, by=c("opponentName" = "fifa_code")) %>%
dplyr::left_join(confederations, by=c("confederation")) %>%
dplyr::mutate(
opponentConfederationCoefficient = adjust
) %>%
dplyr::select(match_id, date, name = name.x, opponentName, opponentConfederationCoefficient, homeVenue, neutralVenue, gs, ga, gd, w, l, d, friendly, qualifier, finaltourn)
# set missing values to 1
teamperf$opponentConfederationCoefficient[is.na(teamperf$opponentConfederationCoefficient)] <- 1
Now, let's calculate some lag features for each team which is about to play a game.
We'll look at the previous N games a team has played, up to the game in question, and we'll calculate the percentage of wins, draws and losses, as well as the goal differential per game, for those past N games.
For example, taking N=10:
last10games_w_per = (number of wins in the past 10 games) / 10
last10games_d_per = (number of draws in the past 10 games) / 10
last10games_l_per = (number of losses in the past 10 games) / 10
last10games_gd_per = (goals scored - goals conceded in the past 10 games) / 10
We'll use three different values for N (10, 30 and 50) to capture short-, medium-, and long-term form.
We'll calculate those values for every team and every game in our dataset.
To model the strength of opposition faced, we'll use the same technique with respect to the opponentConfederationCoefficient values we introduced earlier.
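To make the windowing concrete, here is a toy illustration (with made-up results) of the rolling-sum trick used below: summing over a window of width N+1 and then subtracting the current game leaves the total over the previous N games.
# toy example: a team's win indicator over six games, with a window of N = 3
w <- c(1, 0, 1, 1, 0, 1)
# sum over the current game plus the previous 3, then subtract the current game
zoo::rollapplyr(w, width = 3 + 1, FUN = sum, fill = NA, partial = TRUE) - w
# result: 0 1 1 2 2 2 - the number of wins in (up to) the previous 3 games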
In [22]:
# Let's calculate some lag features for each team which is about to play a game
# we'll take three windows: last 10 games, last 30 games, last 50 games.
# for each window we'll calculate some values
# lagfn: rolling sum over the previous 'width' games, excluding the current one
# (partial = TRUE computes over whatever history is available early on)
lagfn <- function(data, width) {
  return (rollapplyr(data, width = width + 1, FUN = sum, fill = NA, partial = TRUE) - data)
}
# lagfn_per: the same rolling sum, expressed per game over the window
lagfn_per <- function(data, width) {
  return (lagfn(data, width) / width)
}
team_features <- teamperf %>%
dplyr::arrange(name, date) %>%
dplyr::group_by(name) %>%
dplyr::mutate(
last10games_w_per = lagfn_per(w, 10),
last30games_w_per = lagfn_per(w, 30),
last50games_w_per = lagfn_per(w, 50),
last10games_l_per = lagfn_per(l, 10),
last30games_l_per = lagfn_per(l, 30),
last50games_l_per = lagfn_per(l, 50),
last10games_d_per = lagfn_per(d, 10),
last30games_d_per = lagfn_per(d, 30),
last50games_d_per = lagfn_per(d, 50),
last10games_gd_per = lagfn_per(gd, 10),
last30games_gd_per = lagfn_per(gd, 30),
last50games_gd_per = lagfn_per(gd, 50),
last10games_opp_cc_per = lagfn_per(opponentConfederationCoefficient, 10),
last30games_opp_cc_per = lagfn_per(opponentConfederationCoefficient, 30),
last50games_opp_cc_per = lagfn_per(opponentConfederationCoefficient, 50)
) %>%
dplyr::select (
match_id, date, name, opponentName, gs, ga,
w, last10games_w_per, last30games_w_per, last50games_w_per,
l, last10games_l_per, last30games_l_per, last50games_l_per,
d, last10games_d_per, last30games_d_per, last50games_d_per,
gd, last10games_gd_per, last30games_gd_per, last50games_gd_per,
opponentConfederationCoefficient, last10games_opp_cc_per, last30games_opp_cc_per, last50games_opp_cc_per
) %>%
dplyr::ungroup()
head((team_features %>% dplyr::filter(name == "BRA" & date >= '1970-01-01')), n = 20)
summary(team_features)
Now that we have built a series of team-specific features, we need to fold them back into match-specific features.
We will then have a set of features for both teams about to face each other.
In [23]:
# fold per-team features into per-match features
match_features <- matches %>%
left_join(team_features, by=c("match_id", "team1" = "name")) %>%
left_join(team_features, by=c("match_id", "team2" = "name"), suffix=c(".t1",".t2")) %>%
dplyr::select(
date, match_id, team1, team2, team1Home, team2Home, neutralVenue, team1Score, team2Score, friendly, qualifier, finaltourn,
last10games_w_per.t1,
last30games_w_per.t1,
last50games_w_per.t1,
last10games_l_per.t1,
last30games_l_per.t1,
last50games_l_per.t1,
last10games_d_per.t1,
last30games_d_per.t1,
last50games_d_per.t1,
last10games_gd_per.t1,
last30games_gd_per.t1,
last50games_gd_per.t1,
last10games_opp_cc_per.t1,
last30games_opp_cc_per.t1,
last50games_opp_cc_per.t1,
last10games_w_per.t2,
last30games_w_per.t2,
last50games_w_per.t2,
last10games_l_per.t2,
last30games_l_per.t2,
last50games_l_per.t2,
last10games_d_per.t2,
last30games_d_per.t2,
last50games_d_per.t2,
last10games_gd_per.t2,
last30games_gd_per.t2,
last50games_gd_per.t2,
last10games_opp_cc_per.t2,
last30games_opp_cc_per.t2,
last50games_opp_cc_per.t2,
outcome = gd.t1
)
head(match_features)
names(match_features)
We're also going to get rid of some columns which should not be used in training - specifically team1Score and team2Score. We will use the new outcome column instead, which is the difference between team1Score and team2Score.
In [24]:
# drop all non-interesting columns, and those which should not be supplied for new data (like scores)
match_features <- match_features %>%
dplyr::select(-c(match_id,team1Score,team2Score))
head(match_features)
names(match_features)
Let's also have a look at how correlated our numeric features are:
In [25]:
# correlation matrix
cormatrix <- cor(match_features %>% dplyr::select(-c(date, team1, team2, team1Home, team2Home, neutralVenue, friendly, qualifier, finaltourn)) )
corrplot(cormatrix, type = "upper", order = "original", tl.col = "black", tl.srt = 45, tl.cex = 0.5)
Well, we don't have a lot of surprises here. It looks like the goal differential lag values are strongly correlated with the win and loss lag values (as expected).
Also, we have strong positive correlations between lag features measuring the same metric over different lag windows. For example, last10games_w_per.t1, last30games_w_per.t1 and last50games_w_per.t1 are correlated. Also unsurprisingly, the correlations between the 10- and 50-game window metrics are weaker than between the 10- and 30-, or 30- and 50-game windows. But it does seem to suggest that good teams generally keep being good over time, and bad teams keep being bad.
Last, none of our features has a correlation (positive or negative) with our outcome variable that's much stronger than the others.
In [26]:
# create the training formula
trainformula <- as.formula(paste('outcome',
paste(names(match_features %>% dplyr::select(-c(date,team1,team2,outcome))),collapse=' + '),
sep=' ~ '))
trainformula
We're going to split our match_features into a training and a testing dataset. We're going to use the training data to fit our model, then we're going to use the testing data to evaluate its accuracy.
We're going to use the matches played before 2009 to train our model, and the matches from 2009 through 2014 to validate it.
NOTE: Although we're going to skip this step for the tutorial, model cross-validation is an important step to verify how a model will respond to a new, unknown data set. We would be creating multiple "folds" of training and testing set combinations from our original data set, and validating each combination to obtain a more complete picture of our model's predictive power.
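For reference, a minimal sketch of how time-based folds could be constructed (the cutoff years below are arbitrary, and this is not part of the main tutorial flow):
# time-based cross-validation folds: train on everything before a cutoff year,
# validate on the following five years, then slide the cutoff forward
cv_cutoffs <- c(1990, 1995, 2000, 2005)
cv_folds <- lapply(cv_cutoffs, function(y) {
  list(
    train = match_features %>% dplyr::filter(year(date) < y),
    test  = match_features %>% dplyr::filter(year(date) >= y & year(date) < y + 5)
  )
})
# a model would then be trained and evaluated on each fold, and the metrics averaged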
In [27]:
# training and testing datasets
data.train1 <- match_features %>% dplyr::filter(date < '2009/1/1')
data.test1 <- match_features %>% dplyr::filter(date >= '2009/1/1' & date <= '2015/1/1')
nrow(data.train1)
nrow(data.test1)
Now it's time to train our model.
Since we're going to train a model to predict a numeric value (goal differential), we have a wide choice of regression algorithms we could use.
Indeed, we might decide to try several algorithms, with a variety of parameter combinations for each of them, in order to find the optimal model and training strategy.
For this tutorial we're going to use a random forest, an algorithm which grows multiple decision trees from the features presented to it, and has each individual tree "vote" on the outcome for each new input vector (in other words, each new match to predict). It's fast, fairly accurate, and it gives an unbiased estimate of the generalization error, which makes cross-validation unnecessary for this particular algorithm.
The R implementation of the random forest algorithm is available in the randomForest package.
We're going to tell the algorithm to grow 500 trees.
NOTE: The training process should take several minutes.
In [28]:
# train a random forest
model.randomForest1 <- randomForest::randomForest(trainformula, data = data.train1,
importance = TRUE, ntree = 500)
summary(model.randomForest1)
In [29]:
randomForest::importance(model.randomForest1, type=1)
randomForest::varImpPlot(model.randomForest1, type=1)
We can now expose our trained model to the test dataset, and calculate performance metrics - the mean absolute error (MAE) and the root mean squared error (RMSE), both expressed in goals of differential.
In [30]:
data.pred.randomForest1 <- predict(model.randomForest1, data.test1, predict.all = TRUE)
metrics.randomForest1.mae <- Metrics::mae(data.test1$outcome, data.pred.randomForest1$aggregate)
metrics.randomForest1.rmse <- Metrics::rmse(data.test1$outcome, data.pred.randomForest1$aggregate)
paste("Mean Absolute Error:", metrics.randomForest1.mae)
paste("Root Mean Square Error:",metrics.randomForest1.rmse)
abs_error <- abs(data.test1$outcome - data.pred.randomForest1$aggregate)
plot(abs_error, main="Absolute error per test match")
With a trained model at our disposal, we can now run tournament simulations on it.
For example, let's take the qualified teams for the FIFA 2018 World Cup.
In [31]:
if(!file.exists("wc2018qualified.csv")){
tryCatch(download.file('https://raw.githubusercontent.com/neaorin/PredictTheWorldCup/master/src/TournamentSim/wc2018qualified.csv'
,destfile="./wc2018qualified.csv",method="auto"))
}
if(file.exists("wc2018qualified.csv")) qualified <- read_csv("wc2018qualified.csv")
We can now run the entire tournament 10,000 times, calling into the model to get a prediction for each match.
For performance reasons, we could generate all the possible two-team combinations, then ask the model for predictions for each combination, and store those predictions.
We can store the mean values, as well as the standard deviation of the predicted values from every one of our decision trees. This will allow us to simulate a more realistic distribution of results, for multiple iterations of the same match.
In [32]:
# get a list of possible matches to be played at the world cup
data.topredict <- expand.grid(team1 = qualified$name, team2 = qualified$name, stringsAsFactors = FALSE) %>% filter(team1 < team2)
temp <- teamperf %>%
semi_join(qualified, by = c("name")) %>%
group_by(name) %>%
summarise(
date = max(date)
)
temp <- team_features %>%
semi_join(temp, by = c("name", "date"))
# calculate the features for every possible match
data.topredict <- data.topredict %>%
left_join(temp, by = c("team1" = "name")) %>%
left_join(temp, by = c("team2" = "name"), suffix = c(".t1", ".t2")) %>%
dplyr::select(
team1, team2,
last10games_w_per.t1,
last30games_w_per.t1,
last50games_w_per.t1,
last10games_l_per.t1,
last30games_l_per.t1,
last50games_l_per.t1,
last10games_d_per.t1,
last30games_d_per.t1,
last50games_d_per.t1,
last10games_gd_per.t1,
last30games_gd_per.t1,
last50games_gd_per.t1,
last10games_opp_cc_per.t1,
last30games_opp_cc_per.t1,
last50games_opp_cc_per.t1,
last10games_w_per.t2,
last30games_w_per.t2,
last50games_w_per.t2,
last10games_l_per.t2,
last30games_l_per.t2,
last50games_l_per.t2,
last10games_d_per.t2,
last30games_d_per.t2,
last50games_d_per.t2,
last10games_gd_per.t2,
last30games_gd_per.t2,
last50games_gd_per.t2,
last10games_opp_cc_per.t2,
last30games_opp_cc_per.t2,
last50games_opp_cc_per.t2
) %>%
mutate(
date = as.POSIXct("2018-06-14"),
team1Home = (team1 == "RUS"), team2Home = (team2 == "RUS"), neutralVenue = !(team1Home | team2Home),
friendly = FALSE, qualifier = FALSE, finaltourn = TRUE
)
head(data.topredict)
At this point, our data.topredict table contains all the possible two-team match combinations, with calculated features for each team.
We can now ask our model to predict outcomes for these matches:
In [33]:
# ask the model to predict our world cup matches
data.predicted <- predict(model.randomForest1, data.topredict, predict.all = TRUE)
head(data.predicted$individual)
head(data.predicted$aggregate)
So, for every game in our input dataset, we've got the individual predictions from every one of our decision trees, as well as the mean value of those predictions.
We're going to save the mean values, as well as the standard deviation of the individual tree predictions. The standard deviation is a measure of how dispersed our values are; in other words, how close to (or far away from) the mean the individual values are.
In [34]:
# calculate the standard deviation of the individual predictions of each match
data.predicted$sd = apply(data.predicted$individual, c(1), sd)
# keep only the interesting columns for running tournament simulations
data.staticpred <- data.topredict %>%
dplyr::select(team1, team2)
data.staticpred$outcome = data.predicted$aggregate
data.staticpred$sd = data.predicted$sd
head(data.staticpred)
We can use the mean and standard deviation values to pick an individual outcome for a match. For example, we can use the normal distribution in conjunction with R's rnorm function to pick an outcome for a match where we have obtained a predicted mean and standard deviation from the model.
For instance, let's assume we need to provide predicted outcomes for a Brazil vs Argentina match.
In [35]:
temp <- data.staticpred %>% dplyr::filter(team1 == "ARG" & team2 == "BRA")
temp
In [36]:
set.seed(4342)
draw_threshold <- 0.4475
temp2 <- rnorm(100, temp$outcome, temp$sd)
temp2
plot(round(temp2),xlab="Match Index",ylab="Goal Diff", main="ARG vs BRA, 100 simulated matches")
abline(h = 0, v = 0, col = "gray60")
abline(h = -0.4475, v = 0, col = "gray60", lty=3)
abline(h = +0.4475, v = 0, col = "gray60", lty=3)
mtext(c("BRA","Draw","ARG"),side=2,line=-3,at=c(-3,0,3),col= "red")
paste("ARG won", length(temp2[temp2 > +draw_threshold]), "matches.")
paste("BRA won", length(temp2[temp2 < -draw_threshold]), "matches.")
paste(length(temp2[temp2 >= -draw_threshold & temp2 <= +draw_threshold]), "matches drawn.")
The points in the above graph are individual outcomes (goal differentials) which we can round to the nearest integer, or to a predicted draw - in the case of values that are "close enough" to zero.
Let's have a look at the result distribution for the real-world matches between these two teams:
In [37]:
# real results of ARG vs BRA matches
temp <- as.vector(teamperf %>% filter(name == "ARG" & opponentName == "BRA") %>% dplyr::select(gd))
plot(temp$gd,xlab="Match Index",ylab="Goal Diff", main="ARG vs BRA, real matches")
abline(h = 0, v = 0, col = "gray60")
abline(h = -0.4475, v = 0, col = "gray60", lty=3)
abline(h = +0.4475, v = 0, col = "gray60", lty=3)
mtext(c("BRA","Draw","ARG"),side=2,line=-3,at=c(-3,0,3),col= "red")
paste("ARG won", nrow(temp %>% dplyr::filter(gd > 0)), "matches.")
paste("BRA won", nrow(temp %>% dplyr::filter(gd < 0)), "matches.")
paste(nrow(temp %>% dplyr::filter(gd == 0)), "matches drawn.")
For comparison, let's also generate a plot for outcomes of a more unbalanced match-up: Brazil vs Egypt.
In [40]:
set.seed(4342)
temp <- data.staticpred %>% dplyr::filter(team1 == "BRA" & team2 == "EGY")
temp
temp2 <- rnorm(100, temp$outcome, temp$sd)
plot(round(temp2),xlab="Match Index",ylab="Goal Diff", main="BRA vs EGY, 100 simulated matches")
abline(h = 0, v = 0, col = "gray60")
abline(h = -0.4475, v = 0, col = "gray60", lty=3)
abline(h = +0.4475, v = 0, col = "gray60", lty=3)
mtext(c("EGY","Draw","BRA"),side=2,line=-3,at=c(-3,0,3), col="red")
paste("BRA won", length(temp2[temp2 > +draw_threshold]), "matches.")
paste("EGY won", length(temp2[temp2 < -draw_threshold]), "matches.")
paste(length(temp2[temp2 >= -draw_threshold & temp2 <= +draw_threshold]), "matches drawn.")
As we can see, starting from a rather inconspicuous-looking prediction we can generate a range of possible match results which, while respecting the World Cup's capacity for surprises, is still in line with what experience tells us should happen.
Now that we're armed with a way of generating multiple predictions for every possible game in the World Cup, we can write a rather straightforward program to run the tournament a large number of times - for example 10,000 iterations.
We can save our predictions to a CSV file and use it as input for the simulator program.
In [39]:
write_csv(data.staticpred, "wc2018staticPredictions.csv")
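For illustration, here is a rough R sketch of the core match-drawing logic; the actual simulator is the Python program in the repository, which also handles the full group and knockout structure:
# rough sketch: draw one result for a single match from the static predictions
# (data.staticpred only stores combinations where team1 < team2 alphabetically)
simulate_match <- function(t1, t2, predictions, draw_threshold = 0.4475) {
  p <- predictions %>% dplyr::filter(team1 == t1 & team2 == t2)
  gd <- rnorm(1, p$outcome, p$sd)
  if (gd > +draw_threshold) return(t1)
  if (gd < -draw_threshold) return(t2)
  return("draw")
}
# example: simulate_match("BRA", "EGY", data.staticpred)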
Now we can run the program. You simply need Python 3.x installed in order to run it.
Clone the GitHub repository, then cd to the correct folder, and run it.
Assuming a Windows machine, the steps you need to perform should look like the following, with Python 3 already installed:
git clone https://github.com/neaorin/PredictTheWorldCup.git
cd PredictTheWorldCup\src\TournamentSim
python simulateworldcup.py
On a Linux or Mac the steps should be similar.
The program will output a list of tournament winners (one for each iteration) to a simresults.csv file.
The R code that's going to run after this step will attempt to download a version of this file from the GitHub repository; however, if you ran the Python program yourself and would like to use your own simresults.csv file, you can simply upload it into this Azure Notebook library by using the Data / Upload... menu at the top of this notebook.
Once we have the results inside the simresults.csv file, we can load it up into R and see who won tournaments, and who didn't:
In [41]:
# Load the results of the simulation program
if(!file.exists("simresults.csv")){
tryCatch(download.file('https://raw.githubusercontent.com/neaorin/PredictTheWorldCup/master/src/TournamentSim/simresults.csv'
,destfile="./simresults.csv",method="auto"))
}
if(file.exists("simresults.csv")) simresults <- read_csv("simresults.csv")
head(simresults, 20)
In [42]:
# plot the winners
winsperteam <- simresults %>%
dplyr::group_by(winner) %>%
dplyr::summarize(
wins = n()
) %>%
dplyr::arrange(desc(wins))
winsperteam$winner <- factor(winsperteam$winner, levels = winsperteam$winner[order(winsperteam$wins, decreasing = FALSE)])
ggplot(winsperteam, mapping = aes(x=winner, y=wins)) +
geom_bar(stat="identity") +
coord_flip() +
geom_text(aes(label=paste(wins / 100, "%")), vjust=0.3, hjust=-0.1, size=2.1) +
ggtitle("Tournament simulation winners (10,000 iterations)")
Just for fun, let's calculate the odds to win the tournament predicted by our model:
In [43]:
# calculate the sports odds: total number of simulations divided by number of wins
winsperteam$odds <- nrow(simresults) / winsperteam$wins
writeLines(paste(winsperteam$winner, ": ",round(winsperteam$odds), " to 1\n"))
And that's it! We now have a surefire way of making money by betting on sports!
(pretty sure I'm not the first person ever to say those words :P)
Sorin