Allen Downey
This notebook contains a solution to a problem I posed in my Bayesian statistics class:
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals?
Scoring in games like soccer and hockey can be (reasonably) well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, $\lambda$, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.
Based on this modeling decision, we can answer the questions by
My solution uses the ThinkBayes2 framework, which is described in Think Bayes, and summarized in this notebook.
I'll start with Step 2.
If goal-scoring is a Poisson process, the distribution of time between goals is exponential with parameter $\lambda$, the goal scoring rate. In this case we are given as data the inter-arrival time of the first two goals, 11 minutes and 12 minutes. We can define a new class that inherits from thinkbayes2.Suite
and provides an appropriate Likelihood
function:
In [1]:
import thinkbayes2
class Soccer(thinkbayes2.Suite):
"""Represents hypotheses about goal-scoring rates."""
def Likelihood(self, data, hypo):
"""Computes the likelihood of the data under the hypothesis.
hypo: goal rate in goals per game
data: interarrival time in minutes
"""
x = data
lam = hypo / 90
like = thinkbayes2.EvalExponentialPdf(x, lam)
return like
Likelihood
computes the likelihood of data
given hypo
, where data
is an observed time between goals in minutes, and hypo
is a hypothetical goal-scoring rate in goals per game.
After converting hypo
to goals per minute, we can compute the likelihood of the data by evaluating the exponential probability density function (PDF). The result is a density, and therefore not a true probability. But the result from Likelihood
only needs to be proportional to the probability of the data; it doesn't have to be a probability.
Now we can get back to Step 1.
Before the game starts, what should we believe about Germany's goal-scoring rate against Brazil? We could use previous tournament results to construct the prior, but to keep things simple, I'll just use the average goal-scoring rate from all matches in the tournament, which was 2.67 goals per game (total for both teams).
To construct the prior, I'll start with an unrealistic uniform distribution and update it with fake data until the mean matches the observed rate for a single team, 1.34 goals per game.
In [2]:
import numpy
import thinkplot
hypos = numpy.linspace(0, 12, 201)
suite = Soccer(hypos)
suite.Update(134) # fake data chosen by trial and error to yield the observed prior mean
thinkplot.Pdf(suite)
suite.Mean()
Out[2]:
Now that we have a prior, we can update with the time of the first goal, 11 minutes.
In [3]:
suite.Update(11) # time until first goal is 11 minutes
thinkplot.Pdf(suite)
suite.Mean()
Out[3]:
After the first goal, the posterior mean rate is almost 1.9 goals per game.
Now we update with the second goal:
In [4]:
suite.Update(12) # time between first and second goals is 12 minutes
thinkplot.Pdf(suite)
suite.Mean()
Out[4]:
After the second goal, the posterior mean goal rate is 2.3 goals per game.
Now on to Step 3.
If we knew the actual goal scoring rate, $\lambda$, we could predict how many goals Germany would score in the remaining $t = 90-23$ minutes. The distribution of goals would be Poisson with parameter $\lambda t$.
We don't actually know $\lambda$, but we can use the posterior distribution of $\lambda$ to generate a predictive distribution for the number of additional goals.
In [5]:
def PredRemaining(suite, rem_time):
"""Plots the predictive distribution for additional number of goals.
suite: posterior distribution of lam in goals per game
rem_time: remaining time in the game in minutes
"""
metapmf = thinkbayes2.Pmf()
for lam, prob in suite.Items():
lt = lam * rem_time / 90
pred = thinkbayes2.MakePoissonPmf(lt, 15)
metapmf[pred] = prob
thinkplot.Pdf(pred, color='gray', alpha=0.3, linewidth=0.5)
mix = thinkbayes2.MakeMixture(metapmf)
return mix
mix = PredRemaining(suite, 90-23)
PredRemaining
takes the posterior distribution of $\lambda$ and the remaining game time in minutes (I'm ignoring so-called "injury time").
It loops through the hypotheses in suite
, computes the predictive distribution of additional goals for each hypothesis, and assembles a "meta-Pmf" which is a Pmf that maps from each predictive distribution to its probability. The figure shows each of the distributions in the meta-Pmf.
Finally, PredRemaining
uses MakeMixture
to compute the mixture of the distributions. Here's what the predictive distribution looks like.
In [6]:
thinkplot.Hist(mix)
thinkplot.Config(xlim=[-0.5, 10.5])
After the first two goals, the most likely outcome is that Germany will score once more, but there is a substantial chance of scoring 0 or 2--4 additional goals.
Now we can answer the original questions: what is the chance of scoring 5 or more additional goals:
In [7]:
mix.ProbGreater(4)
Out[7]:
After the first two goals, there was only a 6% chance of scoring 5 more times. And the expected number of additional goals was only 1.7.
In [8]:
mix.Mean()
Out[8]:
That's the end of this example. But for completeness (and if you are curious), here is the code for MakeMixture
:
In [9]:
def MakeMixture(metapmf, label='mix'):
"""Make a mixture distribution.
Args:
metapmf: Pmf that maps from Pmfs to probs.
label: string label for the new Pmf.
Returns: Pmf object.
"""
mix = Pmf(label=label)
for pmf, p1 in metapmf.Items():
for x, p2 in pmf.Items():
mix.Incr(x, p1 * p2)
return mix
In [9]: