Think Bayes

Second Edition

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)



In [1]:

    
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist



In [2]:

    
# Get utils.py and create directories

import os

if not os.path.exists('utils.py'):
    !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py
        
if not os.path.exists('figs'):
    !mkdir figs



In [3]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from empiricaldist import Pmf
from utils import decorate, savefig

Introduction

This notebook introduces the Poisson process, which is a model used to describe events that occur at random intervals. In this context, "process" has a mathematical definition that is almost unrelated to its usual meaning.

As an example, we'll model goal-scoring in soccer, which is American English for the game everyone else calls "football". We'll use goals scored in a game to estimate the parameter of a Poisson process; then we'll use the posterior distribution to make predictions.

And we'll solve The World Cup Problem.

The World Cup Problem

In the 2018 FIFA World Cup final, France defeated Croatia 4 goals to 2. Based on this outcome:

How confident should we be that France is the better team?
If the same teams played again, what is the chance France would win again?

To answer these questions, we have to make some modeling decisions.

First, I'll assume that for any team against another team there is some unknown goal-scoring rate, measured in goals per game, which I'll denote lam.
Second, I'll assume that a goal is equally likely during any minute of a game. So, in a 90 minute game, the probability of scoring during any minute is lam/90.
Third, I'll assume that a team never scores twice during the same minute.

Of course, none of these assumptions is completely true in the real world, but I think they are reasonable simplifications.

As George Box said, "All models are wrong; some are useful."

In this case, the model is useful because if these assumption are true, we expect the number of goals scored in a game to follow a Poisson distribution.

The Poisson distribution

If the number of goals scored in a game follows a Poisson distribution with a goal-scoring rate, $\lambda$, the probability of scoring $k$ goals is

$f(k; \lambda) = \lambda^k \exp(-\lambda) ~/~ k!$

SciPy provides a poisson object that represents a Poisson distribution. We can create one like this:



In [4]:

    
from scipy.stats import poisson

lam = 1.4
dist = poisson(lam)
dist

poisson provides pmf, which evaluates the PMF of the Poisson distribution.



In [5]:

    
k = 4
dist.pmf(k)

If the average goal-scoring rate is 1.4 goals per game, the probability of scoring 4 goals is about 4%.

The following function uses poisson.pmf to evaluate the PMF of the Poisson distribution for values of k from 0 to a given upper limit, high.



In [6]:

    
def make_poisson_pmf(lam, high):
    """Make a PMF of a Poisson distribution.
    
    lam: event rate
    high: upper bound on number of events, `k`
    
    returns: Pmf of number of events
    """
    qs = np.arange(high)
    ps = poisson(lam).pmf(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf

The result is a normalized Pmf object. For example, here's the distribution of goals scored for lam=1.4.



In [7]:

    
pmf_goals = make_poisson_pmf(lam=1.4, high=10)

The following figure shows what it looks like.



In [8]:

    
def decorate_goals(title=''):
    decorate(xlabel='Number of goals',
        ylabel='PMF',
        title=title)



In [9]:

    
pmf_goals.bar(alpha=0.7, label='Poisson distribution')

decorate_goals('Distribution of goals scored')
savefig('fig07-01')

The most likely outcomes are 0, 1, and 2; higher values are possible but increasingly unlikely. Values above 7 are negligible. This distribution shows that if we know the goal scoring rate, we can predict the number of goals.

Now let's turn it around: given a number of goals, what can we say about the goal-scoring rate?

To answer that, we need to think about the prior distribution of lam, which represents the range of possible values and their probabilities before we see the score.

The prior

If you have ever seen a soccer game, you have some information about lam. In most games, teams score a few goals each. In rare cases, a team might score more than 5 goals, but they almost never score more than 10.

Using data from previous World Cups I estimate that each team scores about 1.4 goals per game, on average. So I'll set the mean of lam to be 1.4.

For a good team against a bad one, we expect lam to be higher; for a bad team against a good one, we expect it to be lower.

To model the distribution of goal-scoring rates, I'll use a gamma distribution, which I chose because:

The goal scoring rate is a continuous quantity that cannot be less than 0, and the gamma distribution is appropriate for this kind of quantity.
The gamma distribution has only one parameter, alpha, which is the mean. So it's easy to construct a gamma distribution with the mean we want.
As we'll see, the shape of the Gamma distribution is a reasonable choice, given what we know about soccer.

The gamma distribution is continuous, but we'll approximate it with a discrete Pmf.

SciPy provides gamma, which creates an object that represents a gamma distribution.

The gamma object provides provides pdf, which evaluates the probability density function (PDF) of the gamma distribution.

Here's how we use it.



In [10]:

    
from scipy.stats import gamma

alpha = 1.4
qs = np.linspace(0, 10, 101)
ps = gamma(alpha).pdf(qs)

The parameter, alpha, is the mean of the distribution. The qs are possible values of $\lambda$ from 0 to 10. The ps are probability densities, which we can think of as unnormalized probabilities.

If we put the densities in a Pmf and normalize them, like this:



In [11]:

    
prior = Pmf(ps, qs)
prior.normalize()

The result is a discrete Pmf that approximates a gamma distribution.

Here's what it looks like.



In [12]:

    
def decorate_rate(title=''):
    decorate(xlabel='Goal scoring rate (lam)',
        ylabel='PMF',
        title=title)



In [13]:

    
prior.plot(label='prior', color='gray')
decorate_rate('Prior distribution')
savefig('fig07-02')

This distribution represents our prior knowledge about goal scoring: lam is usually less than 2, occasionally as high as 6, and seldom higher than that. And the mean is about 1.4.



In [14]:

    
prior.mean()

As usual, reasonable people could disagree about the details of the prior, but this is good enough to get started. Let's do an update.

The update

Suppose you are given the goal-scoring rate, $\lambda$, and asked to compute the probability of scoring a number of goals, $k$. That is precisely the question we answered by computing the Poisson distribution:

$ f(k; \lambda) = \lambda^k \exp(-\lambda) ~/~ k! $

For example, if $\lambda$ is 1.4, the probability of scoring 4 goals in a game is:



In [15]:

    
lam = 1.4
k = 4
poisson(lam).pmf(4)

Now suppose we are have an array of possible values for $\lambda$, like this:



In [16]:

    
lams = prior.qs

We can compute the likelihood of the data for each hypothetical value of lam, like this:



In [17]:

    
k = 4
likelihood = poisson(lams).pmf(k)

And that's all we need to do the update. To get the posterior distribution, we multiply the prior by the likelihoods we just computed and normalize the result.

The following function encapsulates these steps.



In [18]:

    
def update_poisson(pmf, data):
    """Update the PMF with a Poisson likelihood
    
    pmf: Series that represents the prior
    data: integer number of goals
    
    returns: posterior
    """
    k = data
    lams = pmf.qs
    likelihood = poisson(lams).pmf(k)
    pmf *= likelihood
    pmf.normalize()

The first parameter is the prior; the second is the number of goals.

In the example, France scored 4 goals, so I'll make a copy of the prior and update it with the data.



In [19]:

    
france = prior.copy()
update_poisson(france, 4)

Here's what the posterior distribution looks like:



In [20]:

    
prior.plot(label='prior', color='gray')
france.plot(label='France posterior', color='C0')

decorate_rate('Posterior distribution for France')

The data, k=4, makes us think higher values of lam are more likely and lower values are less likely. So the posterior distribution is shifted to the right.

Let's do the same for Croatia:



In [21]:

    
croatia = prior.copy()
update_poisson(croatia, 2)

And here are the results.



In [22]:

    
prior.plot(label='prior', color='gray')
france.plot(label='France posterior', color='C0')
croatia.plot(label='Croatia posterior', color='C3')

decorate_rate('Posterior distributions for France and Croatia')
savefig('fig07-03')

Here are the posterior means for these distributions.



In [23]:

    
prior.mean(), croatia.mean(), france.mean()

Recall that the mean of the prior distribution is 1.4. After Croatia scores 2 goals, their posterior mean is 1.7, which is near the midpoint of the prior and the data. Likewise after France scores 4 goals, their posterior mean is 2.7.

These results are typical of a Bayesian update: the location of the posterior distribution is a compromise between the prior and the data.

Probability of superiority

Now that we have a posterior distribution for each team, we can answer the first question: How confident should we be that France is the better team?

In the model, "better" means having a higher goal-scoring rate against the opponent. We can use the posterior distributions to compute the probability that a random value drawn from France's distribution exceeds a value drawn from Croatia's.

One way to do that is to enumerate all pairs of values from the two distributions, adding up the total probability that one value exceeds the other.



In [24]:

    
def prob_gt(pmf1, pmf2):
    """Compute the probability of superiority.
    
    pmf1: Pmf object
    pmf2: Pmf object
    
    returns: float probability
    """
    total = 0
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            if q1 > q2:
                total += p1 * p2
    return total

This is similar to the method we use in Chapter 5 to compute the distribution of a sum.

Here's how we use it:



In [25]:

    
prob_gt(france, croatia)

Pmf provides a function that does the same thing.



In [26]:

    
Pmf.prob_gt(france, croatia)

The result is close to 75%. So, on the basis of one game, we have moderate confidence that France is actually the better team.

Of course, we should remember that this result is based on the assumption that the goal-scoring rate is constant. In reality, if a team is down by one goal, they might play more aggressively toward the end of the game, making them more likely to score, but also more likely to give up an additional goal.

As always, the results are only as good as the model.

Predicting the rematch

Now we can take on the second question: If the same teams played again, what is the chance Croatia would win?

To answer this question, we'll generate the "posterior predictive distribution", which is the number of goals we expect a team to score.

If we knew the goal scoring rate, lam, the distribution of goals would be a Poisson distribution with parameter lam.

Since we don't know lam, the distribution of goals is a mixture of a Poisson distributions with different values of lam.

First I'll generate a sequence of Pmf objects, one for each value of lam.



In [27]:

    
pmf_seq = [make_poisson_pmf(lam, 12) for lam in prior.qs]

The following figure shows what some of these distributions look like.



In [28]:

    
for pmf in pmf_seq[::25]:
    pmf.plot(lw=1)

The predictive distribution is a mixture of these Pmfs, weighted with the posterior probabilities.

We can use make_mixture from Chapter 6 to compute this mixture.



In [29]:

    
from utils import make_mixture

pred_france = make_mixture(france, pmf_seq)

Here's the predictive distribution for the number of goals France would score in a rematch.



In [30]:

    
pred_france.bar(color='C0', label='France')
decorate_goals('Predictive distribution')
pred_france.mean()

This distribution represents two sources of uncertainty: we don't know the actual value of lam, and even if we did, we would not know the number of goals in the next game.

Here's the predictive distribution for Croatia.



In [31]:

    
pred_croatia = make_mixture(croatia, pmf_seq)



In [32]:

    
pred_croatia.bar(color='C3', label='Croatia')
decorate_goals('Predictive distribution')
pred_croatia.mean()



In [33]:

    
plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
pred_france.bar(color='C0', label='France')
decorate_goals('Predictive distribution')

plt.subplot(1, 2, 2)
pred_croatia.bar(color='C3', label='Croatia')
decorate_goals('Predictive distribution')

savefig('fig07-04')

We can use these distributions to compute the probability that France wins, loses, or ties the rematch.



In [34]:

    
win = Pmf.prob_gt(pred_france, pred_croatia)
win



In [35]:

    
lose = Pmf.prob_lt(pred_france, pred_croatia)
lose



In [36]:

    
tie = Pmf.prob_eq(pred_france, pred_croatia)
tie

Assuming that France wins half of the ties, their chance of winning the rematch is about 65%.



In [37]:

    
win + tie/2

This is a bit lower than their probability of superiority, which is 75%. And that makes sense; even if they are better team, they might lose the game.

The Exponential Distribution

As an exercise at the end of this notebook, you'll have a chance to work on this variation on the World Cup Problem:

In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?

In this version, notice that the data is not the number of goals in a fixed period of time, but the time between goals.

To compute the likelihood of data like this, we can take advantage of the theory of Poisson processes again. If each team has a constant goal-scoring rate, we expect the time between goals to follow an exponential distribution.

If the goal-scoring rate is $\lambda$, the probability of seeing an interval between goals of $t$ is proportional to the PDF of the exponential distribution:

$f(t; \lambda) = \lambda~\exp(-\lambda t)$

Because $t$ is a continuous quantity, the value of this expression is not a probability; it is a probability density. However, it is proportional to the probability of the data, so we can use it as a likelihood in a Bayesian update.

SciPy provides expon, which creates an object that represents an exponential distribution. However, it does not take lam as a parameter in the way you might expect, which makes it awkward to work with. Since the PDF of the exponential distribution is so easy to evaluate, I'll use my own function.



In [38]:

    
def expo_pdf(t, lam):
    """Compute the PDF of the exponential distribution.
    
    lam: time
    λ: rate
    
    returns: probability density
    """
    return lam * np.exp(-lam * t)

To see what the exponential distribution looks like, let's assume again that lam is 1.4; we can compute the distribution of $t$ like this:



In [39]:

    
lam = 1.4
qs = np.linspace(0, 4, 101)
ps = expo_pdf(qs, lam)
pmf_time = Pmf(ps, qs)
pmf_time.normalize()

And here's what it looks like:



In [40]:

    
def decorate_time(title=''):
    decorate(xlabel='Time between goals (games)',
             ylabel='PMF',
             title=title)



In [41]:

    
pmf_time.plot(label='exponential lam = 1.4')

decorate_time('Distribution of time between goals')
savefig('fig07-05')

It is counterintuitive, but true, that the most likely time to score a goal is immediately. After that, the probability of each possible interval is a little lower.

With a goal-scoring rate of 1.4, it is possible that a team will take more than one game to score a goal, but it is unlikely that they will take more than two games.

Summary

This chapter introduces three new distributions, so it can be hard to keep them straight. Let's review:

If a system satisfies the assumptions of a Poisson model, the number of events in a period of time follows a Poisson distribution, which is a discrete distribution with integer quantities from 0 to infinity. In practice, we can usually ignore low-probability quantities above a finite limit.
Also under the Poisson model, the interval between events follows an exponential distribution, which is a continuous distribution with quantities from 0 to infinity. Because it is continuous, it is described by a probability density function (PDF) rather than a probability mass function (PMF). But when we use an exponential distribution to compute the likelihood of the data, we can treat densities as unnormalized probabilities.
The Poisson and exponential distributions are parameterized by an event rate, denoted $\lambda$ or lam.
For the prior distribution of $\lambda$, I used a gamma distribution, which is a continuous distribution with quantities from 0 to infinity, but I approximated it with a discrete, bounded PMF. The gamma distribution has one parameter, denoted $\alpha$ or alpha, which is also its mean.

I chose the gamma distribution because the shape is consistent with our background knowledge about goal-scoring rates. There are many other distributions we could have used; however, we will see in Chapter XX that the gamma distribution can be a particularly good choice.

But we have a few things to do before we get there, starting with these exercises.

Exercises

Exercise: Let's finish off the exercise we started:

In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?

Here are the steps I recommend:

Starting with the same gamma prior we used in the previous problem, compute the likelihood of scoring a goal after 11 minutes for each possible value of lam. Don't forget to convert all times into units of games.
Compute the posterior distribution of lam for Germany after the first goal.
Compute the likelihood of scoring another goal after 12 more minutes and do another update. Plot the prior, posterior after one goal, and posterior after two goals.
Compute the posterior predictive distribution of goals Germany might score during the remaining time in the game, 90-23 minutes. Note: you will have to think about how to generate predicted goals for a fraction of a game.
Compute the probability of scoring 5 or more goals during the remaining time.



In [42]:

    
# Solution goes here



In [43]:

    
# Solution goes here



In [44]:

    
# Solution goes here



In [45]:

    
# Solution goes here



In [46]:

    
# Solution goes here



In [47]:

    
# Solution goes here



In [48]:

    
# Solution goes here



In [49]:

    
# Solution goes here



In [50]:

    
# Solution goes here

Exercise: Returning to the first version of the World Cup Problem. Suppose France and Croatia play a rematch. What is the probability that France scores first?

Hint: Compute the posterior predictive distribution for the time until the first goal by making a mixture of exponential distributions. You can use the following function to make a PMF that approximates an exponential distribution.



In [51]:

    
def make_expo_pmf(lam, high):
    """Make a PMF of an exponential distribution.
    
    lam: event rate
    high: upper bound on the interval `t`
    
    returns: Pmf of the interval between events
    """
    qs = np.linspace(0, high, 101)
    ps = expo_pdf(qs, lam)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf



In [52]:

    
# Solution goes here



In [53]:

    
# Solution goes here



In [54]:

    
# Solution goes here



In [55]:

    
# Solution goes here

Exercise: In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston Bruins played a best-of-seven championship series against the despised Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then won the next two games 8-1 and 4-0. At this point in the series, what is the probability that Boston will win the next game, and what is their probability of winning the championship?

To choose a prior distribution, I got some statistics from http://www.nhl.com, specifically the average goals per game for each team in the 2010-11 season. The distribution well modeled by a gamma distribution with mean 2.8.

In what ways do you think the outcome of these games might violate the assumptions of the Poisson model? How would these violations affect your predictions.



In [56]:

    
# Solution goes here



In [57]:

    
# Solution goes here



In [58]:

    
# Solution goes here



In [59]:

    
# Solution goes here



In [60]:

    
# Solution goes here



In [61]:

    
# Solution goes here



In [62]:

    
# Solution goes here



In [63]:

    
# Solution goes here



In [64]:

    
# Solution goes here



In [65]:

    
# Solution goes here



In [66]:

    
# Solution goes here



In [67]:

    
# Solution goes here



In [ ]: