Copyright 2016 Allen Downey
MIT License: http://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
%matplotlib inline
%precision 6
import matplotlib.pyplot as plt
import numpy as np
In [2]:
from inspect import getsourcelines
def show_code(func):
    # Print the source code of the given function or class.
    lines, _ = getsourcelines(func)
    for line in lines:
        print(line, end='')
In [3]:
from distribution import Pmf
show_code(Pmf)
I'll create a `Pmf` that represents a biased coin, which has a 60% chance of landing heads and a 40% chance of landing tails. If you are bothered by the fact that it is physically very difficult to bias a coin toss in this way, imagine a 10-sided die with `H` on 6 sides and `T` on 4 sides.
In [4]:
coin = Pmf(dict(H=0.6, T=0.4))
coin.display()
We can use the `+` operator to compute the possible outcomes of two coin tosses and their probabilities.
In [5]:
twice = coin + coin
twice.display()
And similarly, the possible outcomes of three coin tosses.
In [6]:
thrice = sum([coin]*3)
thrice.display()
Notice that the outcomes take the order of the tosses into account, so `HT` is considered a different outcome from `TH`.

If we don't care about the order, and only care about the number of heads and tails, we can loop through the outcomes and count the number of heads.
In [7]:
for val, prob in sorted(thrice.items()):
    heads = val.count('H')
    print(heads, prob)
And we can make a new Pmf that maps from the total number of heads to the probability of that total.
In [8]:
def make_pmf_heads(coin, n):
    # Enumerate all outcomes of n tosses and collect the total
    # probability for each number of heads.
    coins = sum([coin]*n)
    pmf = Pmf()
    for val, prob in coins.items():
        heads = val.count('H')
        pmf[heads] += prob
    return pmf
Here's what it looks like:
In [9]:
pmf_heads = make_pmf_heads(coin, 3)
pmf_heads.display()
Exercise: Create `pmf_heads` for a few different values of `n` and plot them using the `plot_pmf` method.
In [10]:
# Solution
for n in [5, 10, 15]:
    make_pmf_heads(coin, n).plot_pmf()
Exercise: Run the following example and see how long it takes. Try it out with a few values of `n` and see how the run time depends on `n`.
In [11]:
n = 15
%time make_pmf_heads(coin, n).plot_pmf()
This way of computing `pmf_heads` is not very efficient. For `n` tosses, there are $2^n$ possible outcomes, and for large values of `n`, that is not tractable.
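Just to get a sense of the scale, here is the raw count of outcomes for a few values of `n` (plain arithmetic, nothing specific to `Pmf`):

for n in [10, 20, 30, 40]:
    print(n, 2**n)

By `n=40` there are about a trillion outcomes, so enumerating them all is hopeless.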
In the next section we will figure out a better way.
In [12]:
from sympy import symbols
p = symbols('p')
sym_coin = Pmf(dict(H=p, T=1-p))
sym_coin.display()
Now we can see the distribution of the number of heads, which I'll call `k`, after a few tosses:
In [13]:
make_pmf_heads(sym_coin, 2).display()
In [14]:
make_pmf_heads(sym_coin, 3).display()
In [15]:
make_pmf_heads(sym_coin, 4).display()
The general pattern is that the probability of `k` heads after `n` tosses is the product of three terms:

- an integer coefficient,
- the probability of `k` heads, which is `p**k`, and
- the probability of `n-k` tails, which is `(1-p)**(n-k)`.

You might already know that the coefficient is the "binomial coefficient", which is written ${n \choose k}$ and pronounced "n choose k". But pretend for a moment that you don't know that, and let's figure it out.
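For example, in the results for three tosses, the outcomes with two heads (HHT, HTH, and THH) each have probability $p^2 (1-p)$, so the total probability of two heads is

$PMF(2; 3, p) = 3 \, p^2 (1-p)$

and the coefficient is 3.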
To make the pattern easier to see, I'll create a fair coin where `p = 1-p = 1/2`:
In [16]:
fair_coin = Pmf(dict(H=p, T=p))
fair_coin.display()
Now every outcome has the same probability, `p**n`:
In [17]:
thrice = sum([fair_coin]*3)
thrice.display()
So when we count the number of heads, it is easier to see the coefficients.
In [18]:
pmf_heads = make_pmf_heads(fair_coin, 3)
pmf_heads.display()
And even easier if we divide through by `p**n`:
In [19]:
for val, prob in pmf_heads.items():
    print(val, prob / p**3)
We can assemble the code from the previous cells into a function that prints the coefficients for a given value of `n`:
In [20]:
def coefficients(n):
    # Print the coefficient of each term in the distribution of heads
    # for n tosses of a fair coin.
    fair_coin = Pmf(dict(H=p, T=p))
    pmf_heads = make_pmf_heads(fair_coin, n)
    for val, prob in pmf_heads.items():
        print(prob / p**n, end=' ')
    print()
Here are the coefficients for `n=3`:
In [21]:
coefficients(3)
And here they are for `n` in the range from 1 to 9:
In [22]:
for n in range(1, 10):
    coefficients(n)
Now we can look for patterns.
If we flip the coin `n` times, the coefficient associated with getting `k` heads is 1 if `k` is 0 or `n`.

Otherwise, the `(n, k)` coefficient is the sum of two coefficients from the previous row: `(n-1, k)` and `(n-1, k-1)`.
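For example, the coefficient for 2 heads in 4 tosses is the sum of two coefficients from the row for 3 tosses:

${4 \choose 2} = {3 \choose 2} + {3 \choose 1} = 3 + 3 = 6$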
We can use these observations to compute the binomial coefficient recursively:
In [23]:
def binomial_coefficient(n, k):
    # The coefficient is 1 at the ends of each row.
    if k == 0 or k == n:
        return 1
    # Otherwise, it is the sum of two coefficients from the previous row.
    return binomial_coefficient(n-1, k) + binomial_coefficient(n-1, k-1)
And it yields the same results.
In [24]:
binomial_coefficient(9, 5)
Out[24]: 126
Here are the results for `n` from 1 to 9 again:
In [25]:
for n in range(1, 10):
    for k in range(n+1):
        print(binomial_coefficient(n, k), end=' ')
    print()
So far so good.
Exercise: SciPy provides a "special" function called `binom` that you can import from `scipy.special`. Test it to confirm that it is consistent with our function.

Note that `scipy.special.binom` returns a float, so for large values of `n` it is only approximate.
In [26]:
# Solution
from scipy.special import binom
binom(9, 5)
Out[26]: 126.0
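To see the approximation, we can compare `binom` to an exact integer computation (a minimal sketch using factorials; `exact_binom` is a helper defined here, not part of SciPy):

from math import factorial

def exact_binom(n, k):
    # Exact integer binomial coefficient via factorials.
    return factorial(n) // (factorial(k) * factorial(n-k))

print(binom(100, 50))        # float: only about 16 significant digits
print(exact_binom(100, 50))  # exact 30-digit integer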
Exercise: The recursive implementation of `binomial_coefficient` is inefficient for large values of `n` because it computes the same intermediate results many times. You can speed it up (a lot!) by memoizing previously computed results.

Write a version called `fast_binom` that caches results in a dictionary.
In [27]:
# Solution

def fast_binom(n, k, cache={}):
    # Memoized binomial coefficient; the cache dictionary is shared
    # across calls, so each (n, k) pair is computed at most once.
    if k == 0 or k == n:
        return 1
    try:
        return cache[n, k]
    except KeyError:
        res = fast_binom(n-1, k) + fast_binom(n-1, k-1)
        cache[n, k] = res
        return res

for n in range(1, 10):
    for k in range(n+1):
        print(fast_binom(n, k), end=' ')
    print()
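To see how much memoization helps, we can time the two versions on a larger input (rough timing; the exact numbers depend on your machine):

%time binomial_coefficient(24, 12)
%time fast_binom(24, 12)

The recursive version makes millions of redundant calls, while the memoized version computes each `(n, k)` pair at most once.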
Bringing it all together, we have
$PMF(k; n,p) = {n \choose k} p^k (1-p)^{n-k}$
This equation can be interpreted at two levels:
1. If you want to know the probability of `k` heads in `n` tosses, where the probability of heads in each toss is `p`, you can use the formula on the right to compute it.
2. More abstractly, this equation states that the formula on the right is the PMF of `k` with the parameters `n` and `p`.
To make the difference between these interpretations clear, let's look at two functions:
1. `eval_binomial_pmf` evaluates this formula for given values of `k`, `n`, and `p`.
2. `make_binomial_pmf` evaluates the formula for a range of values of `k`, and returns the resulting `Pmf` object.
First, here's `eval_binomial_pmf`:
In [28]:
from scipy.special import binom

def eval_binomial_pmf(k, n, p):
    # Evaluate the binomial PMF for a single value of k.
    return binom(n, k) * p**k * (1-p)**(n-k)
We can use it to compute probabilities for each value of `k` directly:
In [29]:
n = 3
p = 0.6
for k in range(n+1):
    print(k, eval_binomial_pmf(k, n, p))
And here's the corresponding `pmf_heads` for comparison:
In [30]:
coin = Pmf(dict(H=p, T=1-p))
make_pmf_heads(coin, n).display()
They are the same, at least within floating point error.
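We can confirm the match programmatically (a quick check; as elsewhere in this notebook, it assumes `Pmf` supports indexing by value):

pmf_heads = make_pmf_heads(coin, n)
for k in range(n+1):
    print(np.allclose(pmf_heads[k], eval_binomial_pmf(k, n, p)))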
Now here's `make_binomial_pmf`:
In [31]:
def make_binomial_pmf(n, p):
    # Build the Pmf of k for the given parameters n and p.
    pmf = Pmf()
    for k in range(n+1):
        pmf[k] = eval_binomial_pmf(k, n, p)
    return pmf
We can use it to make a `Pmf` that contains the possible values of `k` and their probabilities:
In [32]:
pmf = make_binomial_pmf(n, p)
pmf.display()
Here's what the distribution of `k` looks like for given values of `n` and `p`:
In [33]:
pmf.plot_pmf()
plt.xlabel('k');
If we hold `p` constant, we can see how the distribution of `k` changes as `n` increases:
In [34]:
p = 0.6
for n in [5, 10, 15]:
    make_binomial_pmf(n, p).plot_pmf(label=n)
plt.legend();
Exercise: Keeping `n=10`, plot the distribution of `k` for a few different values of `p`.
In [35]:
# Solution
n = 10
for p in [0.2, 0.5, 0.8]:
    make_binomial_pmf(n, p).plot_pmf(label=p)
plt.legend();
The `Pmf` objects we just created represent different distributions of `k` based on different values of the parameters `n` and `p`.

As `make_binomial_pmf` demonstrates, if you give me `n` and `p`, I can compute a distribution of `k`.
Speaking casually, people sometimes refer to this equation as the PMF of "the" binomial distribution:
$PMF(k; n,p) = {n \choose k} p^k (1-p)^{n-k}$
More precisely, it is a "family" of distributions, where the parameters `n` and `p` specify a particular member of the family.
Exercise: Suppose you toss a fair coin 10 times. What is the probability of getting 5 heads?
If you run this experiment many times, what is the mean number of heads you expect? What is the variance in the number of heads?
In [36]:
# Solution
n = 10
p = 0.5
pmf = make_binomial_pmf(n, p)
print(pmf[5])
print(pmf.mean())
print(pmf.var())
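As a check, the binomial distribution has mean $np$ and variance $np(1-p)$, which we can compute directly:

print(n * p)          # analytic mean: 5.0
print(n * p * (1-p))  # analytic variance: 2.5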
Exercise: Suppose you toss a fair coin 10 times. What is the probability of getting fewer than 5 heads?
If you run this experiment many times, what is the median number of heads you expect? What is the interquartile range (IQR) in the number of heads?
Hint: You might want to make a CDF.
In [37]:
# Solution
from distribution import compute_cumprobs, Cdf

xs, ps = compute_cumprobs(pmf.d)
cdf = Cdf(xs, ps)
print(cdf[4])  # probability of 4 or fewer heads, i.e., fewer than 5

low, median, high = cdf.values([0.25, 0.5, 0.75])
print(median, high - low)  # median and interquartile range
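As another check, we can get the same probability of fewer than 5 heads by summing the PMF directly:

print(sum(pmf[k] for k in range(5)))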