Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

```
In [1]:
```# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
!pip install empiricaldist

```
In [2]:
```# Get utils.py and create directories
import os
if not os.path.exists('utils.py'):
!wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py
if not os.path.exists('figs'):
!mkdir figs

```
In [3]:
```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from empiricaldist import Pmf
from utils import decorate, savefig

This notebook introduces the Poisson process, which is a model used to describe events that occur at random intervals. In this context, "process" has a mathematical definition that is almost unrelated to its usual meaning.

As an example, we'll model goal-scoring in soccer, which is American English for the game everyone else calls "football". We'll use goals scored in a game to estimate the parameter of a Poisson process; then we'll use the posterior distribution to make predictions.

And we'll solve The World Cup Problem.

In the 2018 FIFA World Cup final, France defeated Croatia 4 goals to 2. Based on this outcome:

How confident should we be that France is the better team?

If the same teams played again, what is the chance France would win again?

To answer these questions, we have to make some modeling decisions.

First, I'll assume that for any team against another team there is some unknown goal-scoring rate, measured in goals per game, which I'll denote

`lam`

.Second, I'll assume that a goal is equally likely during any minute of a game. So, in a 90 minute game, the probability of scoring during any minute is

`lam/90`

.Third, I'll assume that a team never scores twice during the same minute.

Of course, none of these assumptions is completely true in the real world, but I think they are reasonable simplifications.

As George Box said, "All models are wrong; some are useful."

In this case, the model is useful because if these assumption are true, we expect the number of goals scored in a game to follow a Poisson distribution.

If the number of goals scored in a game follows a Poisson distribution with a goal-scoring rate, $\lambda$, the probability of scoring $k$ goals is

$f(k; \lambda) = \lambda^k \exp(-\lambda) ~/~ k!$

SciPy provides a `poisson`

object that represents a Poisson distribution.
We can create one like this:

```
In [4]:
```from scipy.stats import poisson
lam = 1.4
dist = poisson(lam)
dist

```
Out[4]:
```

`poisson`

provides `pmf`

, which evaluates the PMF of the Poisson distribution.

```
In [5]:
```k = 4
dist.pmf(k)

```
Out[5]:
```

If the average goal-scoring rate is 1.4 goals per game, the probability of scoring 4 goals is about 4%.

The following function uses `poisson.pmf`

to evaluate the PMF of the Poisson distribution for values of `k`

from 0 to a given upper limit, `high`

.

```
In [6]:
```def make_poisson_pmf(lam, high):
"""Make a PMF of a Poisson distribution.
lam: event rate
high: upper bound on number of events, `k`
returns: Pmf of number of events
"""
qs = np.arange(high)
ps = poisson(lam).pmf(qs)
pmf = Pmf(ps, qs)
pmf.normalize()
return pmf

`Pmf`

object.
For example, here's the distribution of goals scored for `lam=1.4`

.

```
In [7]:
```pmf_goals = make_poisson_pmf(lam=1.4, high=10)

The following figure shows what it looks like.

```
In [8]:
```def decorate_goals(title=''):
decorate(xlabel='Number of goals',
ylabel='PMF',
title=title)

```
In [9]:
```pmf_goals.bar(alpha=0.7, label='Poisson distribution')
decorate_goals('Distribution of goals scored')
savefig('fig07-01')

```
```

The most likely outcomes are 0, 1, and 2; higher values are possible but increasingly unlikely. Values above 7 are negligible. This distribution shows that if we know the goal scoring rate, we can predict the number of goals.

Now let's turn it around: given a number of goals, what can we say about the goal-scoring rate?

To answer that, we need to think about the prior distribution of `lam`

, which represents the range of possible values and their probabilities before we see the score.

If you have ever seen a soccer game, you have some information about `lam`

. In most games, teams score a few goals each. In rare cases, a team might score more than 5 goals, but they almost never score more than 10.

Using data from previous World Cups I estimate that each team scores about 1.4 goals per game, on average. So I'll set the mean of `lam`

to be 1.4.

For a good team against a bad one, we expect `lam`

to be higher; for a bad team against a good one, we expect it to be lower.

To model the distribution of goal-scoring rates, I'll use a gamma distribution, which I chose because:

The goal scoring rate is a continuous quantity that cannot be less than 0, and the gamma distribution is appropriate for this kind of quantity.

The gamma distribution has only one parameter,

`alpha`

, which is the mean. So it's easy to construct a gamma distribution with the mean we want.As we'll see, the shape of the Gamma distribution is a reasonable choice, given what we know about soccer.

The gamma distribution is continuous, but we'll approximate it with a discrete `Pmf`

.

SciPy provides `gamma`

, which creates an object that represents a gamma distribution.

The `gamma`

object provides provides `pdf`

, which evaluates the **probability density function** (PDF) of the gamma distribution.

Here's how we use it.

```
In [10]:
```from scipy.stats import gamma
alpha = 1.4
qs = np.linspace(0, 10, 101)
ps = gamma(alpha).pdf(qs)

The parameter, `alpha`

, is the mean of the distribution.
The `qs`

are possible values of $\lambda$ from 0 to 10.
The `ps`

are probability densities, which we can think of as unnormalized probabilities.

If we put the densities in a `Pmf`

and normalize them, like this:

```
In [11]:
```prior = Pmf(ps, qs)
prior.normalize()

```
Out[11]:
```

The result is a discrete `Pmf`

that approximates a gamma distribution.

Here's what it looks like.

```
In [12]:
```def decorate_rate(title=''):
decorate(xlabel='Goal scoring rate (lam)',
ylabel='PMF',
title=title)

```
In [13]:
```prior.plot(label='prior', color='gray')
decorate_rate('Prior distribution')
savefig('fig07-02')

```
```

`lam`

is usually less than 2, occasionally as high as 6, and seldom higher than that. And the mean is about 1.4.

```
In [14]:
```prior.mean()

```
Out[14]:
```

Suppose you are given the goal-scoring rate, $\lambda$, and asked to compute the probability of scoring a number of goals, $k$. That is precisely the question we answered by computing the Poisson distribution:

$ f(k; \lambda) = \lambda^k \exp(-\lambda) ~/~ k! $

For example, if $\lambda$ is 1.4, the probability of scoring 4 goals in a game is:

```
In [15]:
```lam = 1.4
k = 4
poisson(lam).pmf(4)

```
Out[15]:
```

Now suppose we are have an array of possible values for $\lambda$, like this:

```
In [16]:
```lams = prior.qs

We can compute the likelihood of the data for each hypothetical value of lam, like this:

```
In [17]:
```k = 4
likelihood = poisson(lams).pmf(k)

And that's all we need to do the update. To get the posterior distribution, we multiply the prior by the likelihoods we just computed and normalize the result.

The following function encapsulates these steps.

```
In [18]:
```def update_poisson(pmf, data):
"""Update the PMF with a Poisson likelihood
pmf: Series that represents the prior
data: integer number of goals
returns: posterior
"""
k = data
lams = pmf.qs
likelihood = poisson(lams).pmf(k)
pmf *= likelihood
pmf.normalize()

The first parameter is the prior; the second is the number of goals.

In the example, France scored 4 goals, so I'll make a copy of the prior and update it with the data.

```
In [19]:
```france = prior.copy()
update_poisson(france, 4)

Here's what the posterior distribution looks like:

```
In [20]:
```prior.plot(label='prior', color='gray')
france.plot(label='France posterior', color='C0')
decorate_rate('Posterior distribution for France')

```
```

`k=4`

, makes us think higher values of `lam`

are more likely and lower values are less likely. So the posterior distribution is shifted to the right.

Let's do the same for Croatia:

```
In [21]:
```croatia = prior.copy()
update_poisson(croatia, 2)

And here are the results.

```
In [22]:
```prior.plot(label='prior', color='gray')
france.plot(label='France posterior', color='C0')
croatia.plot(label='Croatia posterior', color='C3')
decorate_rate('Posterior distributions for France and Croatia')
savefig('fig07-03')

```
```

Here are the posterior means for these distributions.

```
In [23]:
```prior.mean(), croatia.mean(), france.mean()

```
Out[23]:
```

Recall that the mean of the prior distribution is 1.4. After Croatia scores 2 goals, their posterior mean is 1.7, which is near the midpoint of the prior and the data. Likewise after France scores 4 goals, their posterior mean is 2.7.

These results are typical of a Bayesian update: the location of the posterior distribution is a compromise between the prior and the data.

Now that we have a posterior distribution for each team, we can answer the first question: How confident should we be that France is the better team?

In the model, "better" means having a higher goal-scoring rate against the opponent. We can use the posterior distributions to compute the probability that a random value drawn from France's distribution exceeds a value drawn from Croatia's.

One way to do that is to enumerate all pairs of values from the two distributions, adding up the total probability that one value exceeds the other.

```
In [24]:
```def prob_gt(pmf1, pmf2):
"""Compute the probability of superiority.
pmf1: Pmf object
pmf2: Pmf object
returns: float probability
"""
total = 0
for q1, p1 in pmf1.items():
for q2, p2 in pmf2.items():
if q1 > q2:
total += p1 * p2
return total

This is similar to the method we use in Chapter 5 to compute the distribution of a sum.

Here's how we use it:

```
In [25]:
```prob_gt(france, croatia)

```
Out[25]:
```

`Pmf`

provides a function that does the same thing.

```
In [26]:
```Pmf.prob_gt(france, croatia)

```
Out[26]:
```

The result is close to 75%. So, on the basis of one game, we have moderate confidence that France is actually the better team.

Of course, we should remember that this result is based on the assumption that the goal-scoring rate is constant. In reality, if a team is down by one goal, they might play more aggressively toward the end of the game, making them more likely to score, but also more likely to give up an additional goal.

As always, the results are only as good as the model.

Now we can take on the second question: If the same teams played again, what is the chance Croatia would win?

To answer this question, we'll generate the "posterior predictive distribution", which is the number of goals we expect a team to score.

If we knew the goal scoring rate, `lam`

, the distribution of goals would be a Poisson distribution with parameter `lam`

.

Since we don't know `lam`

, the distribution of goals is a mixture of a Poisson distributions with different values of `lam`

.

First I'll generate a sequence of `Pmf`

objects, one for each value of `lam`

.

```
In [27]:
```pmf_seq = [make_poisson_pmf(lam, 12) for lam in prior.qs]

The following figure shows what some of these distributions look like.

```
In [28]:
```for pmf in pmf_seq[::25]:
pmf.plot(lw=1)

```
```

The predictive distribution is a mixture of these `Pmf`

s, weighted with the posterior probabilities.

We can use `make_mixture`

from Chapter 6 to compute this mixture.

```
In [29]:
```from utils import make_mixture
pred_france = make_mixture(france, pmf_seq)

Here's the predictive distribution for the number of goals France would score in a rematch.

```
In [30]:
```pred_france.bar(color='C0', label='France')
decorate_goals('Predictive distribution')
pred_france.mean()

```
Out[30]:
```

This distribution represents two sources of uncertainty: we don't know the actual value of `lam`

, and even if we did, we would not know the number of goals in the next game.

Here's the predictive distribution for Croatia.

```
In [31]:
```pred_croatia = make_mixture(croatia, pmf_seq)

```
In [32]:
```pred_croatia.bar(color='C3', label='Croatia')
decorate_goals('Predictive distribution')
pred_croatia.mean()

```
Out[32]:
```

```
In [33]:
```plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
pred_france.bar(color='C0', label='France')
decorate_goals('Predictive distribution')
plt.subplot(1, 2, 2)
pred_croatia.bar(color='C3', label='Croatia')
decorate_goals('Predictive distribution')
savefig('fig07-04')

```
```

```
In [34]:
```win = Pmf.prob_gt(pred_france, pred_croatia)
win

```
Out[34]:
```

```
In [35]:
```lose = Pmf.prob_lt(pred_france, pred_croatia)
lose

```
Out[35]:
```

```
In [36]:
```tie = Pmf.prob_eq(pred_france, pred_croatia)
tie

```
Out[36]:
```

Assuming that France wins half of the ties, their chance of winning the rematch is about 65%.

```
In [37]:
```win + tie/2

```
Out[37]:
```

As an exercise at the end of this notebook, you'll have a chance to work on this variation on the World Cup Problem:

In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?

In this version, notice that the data is not the number of goals in a fixed period of time, but the time between goals.

To compute the likelihood of data like this, we can take advantage of the theory of Poisson processes again. If each team has a constant goal-scoring rate, we expect the time between goals to follow an exponential distribution.

If the goal-scoring rate is $\lambda$, the probability of seeing an interval between goals of $t$ is proportional to the PDF of the exponential distribution:

$f(t; \lambda) = \lambda~\exp(-\lambda t)$

Because $t$ is a continuous quantity, the value of this expression is not a probability; it is a probability density. However, it is proportional to the probability of the data, so we can use it as a likelihood in a Bayesian update.

SciPy provides `expon`

, which creates an object that represents an exponential distribution.
However, it does not take `lam`

as a parameter in the way you might expect, which makes it awkward to work with.
Since the PDF of the exponential distribution is so easy to evaluate, I'll use my own function.

```
In [38]:
```def expo_pdf(t, lam):
"""Compute the PDF of the exponential distribution.
lam: time
λ: rate
returns: probability density
"""
return lam * np.exp(-lam * t)

`lam`

is 1.4; we can compute the distribution of $t$ like this:

```
In [39]:
```lam = 1.4
qs = np.linspace(0, 4, 101)
ps = expo_pdf(qs, lam)
pmf_time = Pmf(ps, qs)
pmf_time.normalize()

```
Out[39]:
```

And here's what it looks like:

```
In [40]:
```def decorate_time(title=''):
decorate(xlabel='Time between goals (games)',
ylabel='PMF',
title=title)

```
In [41]:
```pmf_time.plot(label='exponential lam = 1.4')
decorate_time('Distribution of time between goals')
savefig('fig07-05')

```
```

It is counterintuitive, but true, that the most likely time to score a goal is immediately. After that, the probability of each possible interval is a little lower.

With a goal-scoring rate of 1.4, it is possible that a team will take more than one game to score a goal, but it is unlikely that they will take more than two games.

This chapter introduces three new distributions, so it can be hard to keep them straight. Let's review:

If a system satisfies the assumptions of a Poisson model, the number of events in a period of time follows a Poisson distribution, which is a discrete distribution with integer quantities from 0 to infinity. In practice, we can usually ignore low-probability quantities above a finite limit.

Also under the Poisson model, the interval between events follows an exponential distribution, which is a continuous distribution with quantities from 0 to infinity. Because it is continuous, it is described by a probability density function (PDF) rather than a probability mass function (PMF). But when we use an exponential distribution to compute the likelihood of the data, we can treat densities as unnormalized probabilities.

The Poisson and exponential distributions are parameterized by an event rate, denoted $\lambda$ or

`lam`

.For the prior distribution of $\lambda$, I used a gamma distribution, which is a continuous distribution with quantities from 0 to infinity, but I approximated it with a discrete, bounded PMF. The gamma distribution has one parameter, denoted $\alpha$ or

`alpha`

, which is also its mean.

I chose the gamma distribution because the shape is consistent with our background knowledge about goal-scoring rates. There are many other distributions we could have used; however, we will see in Chapter XX that the gamma distribution can be a particularly good choice.

But we have a few things to do before we get there, starting with these exercises.

**Exercise:** Let's finish off the exercise we started:

In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?

Here are the steps I recommend:

Starting with the same gamma prior we used in the previous problem, compute the likelihood of scoring a goal after 11 minutes for each possible value of

`lam`

. Don't forget to convert all times into units of games.Compute the posterior distribution of

`lam`

for Germany after the first goal.Compute the likelihood of scoring another goal after 12 more minutes and do another update. Plot the prior, posterior after one goal, and posterior after two goals.

Compute the posterior predictive distribution of goals Germany might score during the remaining time in the game,

`90-23`

minutes. Note: you will have to think about how to generate predicted goals for a fraction of a game.Compute the probability of scoring 5 or more goals during the remaining time.

```
In [42]:
```# Solution
def update_expo(pmf, data):
"""Update based on an observed interval
pmf: prior PMF
data: time between goals in minutes
"""
t = data / 90
lams = pmf.qs
likelihood = expo_pdf(t, lams)
pmf *= likelihood
pmf.normalize()

```
In [43]:
```# Solution
germany = prior.copy()
update_expo(germany, 11)
germany2 = germany.copy()
update_expo(germany2, 12)

```
In [44]:
```# Solution
germany.mean(), germany2.mean()

```
Out[44]:
```

```
In [45]:
```# Solution
prior.plot(color='gray', label='Prior')
germany.plot(color='C3', label='Posterior after 1 goal')
germany2.plot(color='C8', label='Posterior after 2 goals')
decorate_rate()

```
```

```
In [46]:
```# Solution
t = (90-23) / 90
pmf_seq = [make_poisson_pmf(lam*t, 12)
for lam in germany.qs]

```
In [47]:
```# Solution
pmf_germany = make_mixture(germany, pmf_seq)

```
In [48]:
```# Solution
pmf_germany.bar(color='C8', label='germany')
decorate_goals('Predictive distribution')

```
```

```
In [49]:
```# Solution
pmf_germany[5]

```
Out[49]:
```

```
In [50]:
```# Solution
pmf_germany.prob_ge(5)

```
Out[50]:
```

**Exercise:** Returning to the first version of the World Cup Problem. Suppose France and Croatia play a rematch. What is the probability that France scores first?

Hint: Compute the posterior predictive distribution for the time until the first goal by making a mixture of exponential distributions. You can use the following function to make a PMF that approximates an exponential distribution.

```
In [51]:
```def make_expo_pmf(lam, high):
"""Make a PMF of an exponential distribution.
lam: event rate
high: upper bound on the interval `t`
returns: Pmf of the interval between events
"""
qs = np.linspace(0, high, 101)
ps = expo_pdf(qs, lam)
pmf = Pmf(ps, qs)
pmf.normalize()
return pmf

```
In [52]:
```# Solution
pmf_seq = [make_expo_pmf(lam, high=4) for lam in prior.qs]

```
In [53]:
```# Solution
pred_france = make_mixture(france, pmf_seq)
pred_croatia = make_mixture(croatia, pmf_seq)

```
In [54]:
```# Solution
pred_france.plot()
pred_croatia.plot()
decorate_time('Posterior predictive distribution')

```
```

```
In [55]:
```# Solution
Pmf.prob_lt(pred_france, pred_croatia)

```
Out[55]:
```

**Exercise:** In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
Bruins played a best-of-seven championship series against the despised
Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then
won the next two games 8-1 and 4-0. At this point in the series, what
is the probability that Boston will win the next game, and what is
their probability of winning the championship?

To choose a prior distribution, I got some statistics from http://www.nhl.com, specifically the average goals per game for each team in the 2010-11 season. The distribution well modeled by a gamma distribution with mean 2.8.

In what ways do you think the outcome of these games might violate the assumptions of the Poisson model? How would these violations affect your predictions.

```
In [56]:
```# Solution
# When a team is winning or losing by an insurmountable margin,
# they might remove their best players from the game, which
# would affect their goal-scoring rate, violating the assumption
# that the goal scoring rate is constant.
# In this example, Boston wins the third game 8-1, but scoring
# eight goals in a game might not reflect their true long-term
# goal-scoring rate.
# As a result, the analysis below might overestimate the chance
# that Boston wins.
# As it turned out, they did not.

```
In [57]:
```# Solution
from scipy.stats import gamma
alpha = 2.8
qs = np.linspace(0, 15, 101)
ps = gamma.pdf(qs, alpha)
prior_hockey = Pmf(ps, qs)
prior_hockey.normalize()

```
Out[57]:
```

```
In [58]:
```# Solution
prior_hockey.plot()
decorate_rate('Prior distribution for hockey')
prior_hockey.mean()

```
Out[58]:
```

```
In [59]:
```# Solution
bruins = prior_hockey.copy()
for data in [0, 2, 8, 4]:
update_poisson(bruins, data)
bruins.mean()

```
Out[59]:
```

```
In [60]:
```# Solution
canucks = prior_hockey.copy()
for data in [1, 3, 1, 0]:
update_poisson(canucks, data)
canucks.mean()

```
Out[60]:
```

```
In [61]:
```# Solution
canucks.plot(label='Canucks')
bruins.plot(label='Bruins')
decorate_rate('Posterior distributions')

```
```

```
In [62]:
```# Solution
pmf_seq = [make_poisson_pmf(lam, 15) for lam in bruins.qs]

```
In [63]:
```# Solution
pred_bruins = make_mixture(bruins, pmf_seq)
pred_bruins.bar(label='Bruins', color='C1')
decorate_goals('Predictive distribution')

```
```

```
In [64]:
```# Solution
pred_canucks = make_mixture(canucks, pmf_seq)
pred_canucks.bar(label='Canucks')
decorate_goals('Predictive distribution')

```
```

```
In [65]:
```# Solution
win = Pmf.prob_gt(pred_bruins, pred_canucks)
lose = Pmf.prob_lt(pred_bruins, pred_canucks)
tie = Pmf.prob_eq(pred_bruins, pred_canucks)
win, lose, tie

```
Out[65]:
```

```
In [66]:
```# Solution
# Assuming the Bruins win half of the ties,
# their chance of winning the next game is...
p = win + lose/2
p

```
Out[66]:
```

```
In [67]:
```# Solution
# Their chance of winning the series is their
# chance of winning k=2 or k=3 of the remaining
# n=3 games.
from scipy.stats import binom
n = 3
a = binom.pmf([2,3], n, p)
a.sum()

```
Out[67]:
```

```
In [ ]:
```