Copyright 2016 Allen Downey
MIT License: http://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
%matplotlib inline
%precision 6
import matplotlib.pyplot as plt
import numpy as np
In [2]:
from inspect import getsourcelines
def show_code(func):
    lines, _ = getsourcelines(func)
    for line in lines:
        print(line, end='')
In [3]:
from distribution import Pmf
show_code(Pmf)
I'll create a Pmf that represents a biased coin, which has a 60% chance of landing heads and a 40% chance of landing tails.
If you are bothered by the fact that it is physically very difficult to bias a coin toss in this way, imagine a 10-sided die with H on 6 sides and T on 4 sides.
In [4]:
coin = Pmf(dict(H=0.6, T=0.4))
coin.display()
We can use the + operator to compute the possible outcomes of two coin tosses and their probabilities.
In [5]:
twice = coin + coin
twice.display()
And similarly, the possible outcomes of three coin tosses.
In [6]:
thrice = sum([coin]*3)
thrice.display()
Notice that the outcomes take the order of the tosses into account, so HT is considered a different outcome from TH.
If we don't care about the order and we only care about the number of heads and tails, we can loop through the outcomes and count the number of heads.
In [7]:
from collections import Counter
for val, prob in sorted(thrice.items()):
    heads = val.count('H')
    print(heads, prob)
And we can make a new Pmf that maps from the total number of heads to the probability of that total.
In [8]:
def make_pmf_heads(coin, n):
    coins = sum([coin]*n)
    pmf = Pmf()
    for val, prob in coins.items():
        heads = val.count('H')
        pmf[heads] += prob
    return pmf
Here's what it looks like:
In [9]:
pmf_heads = make_pmf_heads(coin, 3)
pmf_heads.display()
Exercise: Create pmf_heads for a few different values of n and plot them using the plot_pmf method.
In [10]:
# Solution
for n in [5, 10, 15]:
    make_pmf_heads(coin, n).plot_pmf()
Exercise: Run the following example and see how long it takes. Try it out with a few values of n and see how the run time depends on n.
In [11]:
n = 15
%time make_pmf_heads(coin, n).plot_pmf()
This way of computing pmf_heads is not very efficient. For n tosses there are $2^n$ possible outcomes, so for large values of n, enumerating them is not tractable.
In the next section we will figure out a better way.
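To get a sense of how fast the enumeration blows up, here is a quick sketch (plain Python, nothing beyond what's already imported) of the number of outcomes as n grows:
In [ ]:
# each additional toss doubles the number of distinct H/T sequences
for n in [10, 20, 30, 40]:
    print(n, 2**n)
To find that better way, we can use SymPy to represent the probability of heads symbolically and look for a pattern: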
In [12]:
from sympy import symbols
p = symbols('p')
sym_coin = Pmf(dict(H=p, T=1-p))
sym_coin.display()
Now we can see the distribution of the number of heads, which I'll call k, after a few tosses:
In [13]:
make_pmf_heads(sym_coin, 2).display()
In [14]:
make_pmf_heads(sym_coin, 3).display()
In [15]:
make_pmf_heads(sym_coin, 4).display()
The general pattern is that the probability of k heads after n tosses is the product of three terms:
1. An integer coefficient.
2. The probability of k heads, which is p**k.
3. The probability of n-k tails, which is (1-p)**(n-k).
You might already know that the coefficient is the "binomial coefficient", which is written $n \choose k$ and pronounced "n choose k". But pretend for a moment that you don't know that and let's figure it out.
To make the pattern easier to see, I'll create a fair coin where p = 1-p = 1/2.
In [16]:
fair_coin = Pmf(dict(H=p, T=p))
fair_coin.display()
Now every outcome has the same probability, p**n:
In [17]:
thrice = sum([fair_coin]*3)
thrice.display()
So when we count the number of heads, it is easier to see the coefficients.
In [18]:
pmf_heads = make_pmf_heads(fair_coin, 3)
pmf_heads.display()
And the coefficients are even easier to see if we divide through by p**n:
In [19]:
for val, prob in pmf_heads.items():
    print(val, prob / p**3)
We can assemble the code from the previous cells into a function that prints the coefficients for a given value of n:
In [20]:
def coefficients(n):
    fair_coin = Pmf(dict(H=p, T=p))
    pmf_heads = make_pmf_heads(fair_coin, n)
    for val, prob in pmf_heads.items():
        print(prob / p**n, end=' ')
    print()
Here are the coefficients for n=3:
In [21]:
coefficients(3)
And here they are for n in the range from 1 to 9:
In [22]:
for n in range(1, 10):
    coefficients(n)
Now we can look for patterns.
If we flip the coin n times, the coefficient associated with getting k heads is 1 if k is 0 or n.
Otherwise, the (n, k) coefficient is the sum of two coefficients from the previous row: (n-1, k) and (n-1, k-1).
We can use these observations to compute the binomial coefficient recursively:
In [23]:
def binomial_coefficient(n, k):
    if k == 0 or k == n:
        return 1
    return binomial_coefficient(n-1, k) + binomial_coefficient(n-1, k-1)
And it yields the same results.
In [24]:
binomial_coefficient(9, 5)
Out[24]:
126
Here are the results for n from 1 to 9 again.
In [25]:
for n in range(1, 10):
    for k in range(n+1):
        print(binomial_coefficient(n, k), end=' ')
    print()
So far so good.
Exercise: SciPy provides a "special" function called binom that you can import from scipy.special. Test it to confirm that it is consistent with our function.
Note that scipy.special.binom returns a float, so for large values of n it is only approximate.
In [26]:
# Solution
from scipy.special import binom
binom(9, 5)
Out[26]:
126.0
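Since binom returns a float, we can see the loss of precision by comparing it with an exact integer computation. Here's a minimal sketch, assuming Python 3.8 or later, where the standard library provides math.comb:
In [ ]:
from math import comb
# scipy's binom returns a float, so the result is approximate for large n
print(binom(100, 50))
# math.comb returns an exact integer
print(comb(100, 50))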
Exercise: The recursive implementation of binomial_coefficient is inefficient for large values of n because it computes the same intermediate results many times. You can speed it up (a lot!) by memoizing previously computed results.
Write a version called fast_binom that caches results in a dictionary.
In [27]:
# Solution
def fast_binom(n, k, cache={}):
    # the mutable default argument persists between calls,
    # so previously computed results are shared across invocations
    if k == 0 or k == n:
        return 1
    try:
        return cache[n, k]
    except KeyError:
        res = fast_binom(n-1, k) + fast_binom(n-1, k-1)
        cache[n, k] = res
        return res

for n in range(1, 10):
    for k in range(n+1):
        print(fast_binom(n, k), end=' ')
    print()
Bringing it all together, we have
$PMF(k; n,p) = {n \choose k} p^k (1-p)^{n-k}$
This equation can be interpreted at two levels:
If you want to know the probability of k heads in n tosses, where the probability of heads in each toss is p, you can use the formula on the right to compute it.
More abstractly, this equation states that the formula on the right is the PMF of k with the parameters n and p.
To make the difference between these interpretations clear, let's look at two functions:
1. `eval_binomial_pmf` evaluates this formula for given values of `k`, `n`, and `p`.
2. `make_binomial_pmf` evaluates the formula for a range of values of `k`, and returns the resulting `Pmf` object.
First, here's eval_binomial_pmf:
In [28]:
from scipy.special import binom
def eval_binomial_pmf(k, n, p):
    return binom(n, k) * p**k * (1-p)**(n-k)
We can use it to compute probabilities for each value of k directly:
In [29]:
n = 3
p = 0.6
for k in range(n+1):
    print(k, eval_binomial_pmf(k, n, p))
And here's the corresponding pmf_heads for comparison:
In [30]:
coin = Pmf(dict(H=p, T=1-p))
make_pmf_heads(coin, n).display()
They are the same, at least within floating point error.
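If you'd rather check the agreement programmatically, here's a minimal sketch using np.isclose (assuming, as the code above does, that Pmf supports bracket indexing):
In [ ]:
# compare the formula with the enumeration-based Pmf for each k
pmf_enum = make_pmf_heads(coin, n)
print(all(np.isclose(eval_binomial_pmf(k, n, p), pmf_enum[k])
          for k in range(n+1)))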
Now here's make_binomial_pmf:
In [31]:
def make_binomial_pmf(n, p):
    pmf = Pmf()
    for k in range(n+1):
        pmf[k] = eval_binomial_pmf(k, n, p)
    return pmf
We can use it to make a Pmf that contains the possible values of k and their probabilities:
In [32]:
pmf = make_binomial_pmf(n, p)
pmf.display()
Here's what the distribution of k looks like for given values of n and p:
In [33]:
pmf.plot_pmf()
plt.xlabel('k');
If we hold p constant, we can see how the distribution of k changes as n increases:
In [34]:
p = 0.6
for n in [5, 10, 15]:
    make_binomial_pmf(n, p).plot_pmf(label=n)
plt.legend();
Exercise: Keeping n=10, plot the distribution of k for a few different values of p.
In [35]:
# Solution
n = 10
for p in [0.2, 0.5, 0.8]:
    make_binomial_pmf(n, p).plot_pmf(label=p)
plt.legend();
The Pmf objects we just created represent different distributions of k based on different values of the parameters n and p.
As make_binomial_pmf demonstrates, if you give me n and p, I can compute a distribution of k.
Speaking casually, people sometimes refer to this equation as the PMF of "the" binomial distribution:
$PMF(k; n,p) = {n \choose k} p^k (1-p)^{n-k}$
More precisely, it is a "family" of distributions, where the parameters n and p specify a particular member of the family.
Exercise: Suppose you toss a fair coin 10 times. What is the probability of getting 5 heads?
If you run this experiment many times, what is the mean number of heads you expect? What is the variance in the number of heads?
In [36]:
# Solution
n = 10
p = 0.5
pmf = make_binomial_pmf(n, p)
print(pmf[5])
print(pmf.mean())
print(pmf.var())
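For the binomial family the mean and variance also have well-known closed forms, $np$ and $np(1-p)$, which we can use to sanity-check the Pmf methods:
In [ ]:
# closed-form moments of the binomial distribution
print(n * p)            # mean: 10 * 0.5 = 5.0
print(n * p * (1 - p))  # variance: 10 * 0.5 * 0.5 = 2.5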
Exercise: Suppose you toss a fair coin 10 times. What is the probability of getting fewer than 5 heads?
If you run this experiment many times, what is the median number of heads you expect? What is the interquartile range (IQR) in the number of heads?
Hint: You might want to make a CDF.
In [37]:
# Solution
from distribution import compute_cumprobs, Cdf
xs, ps = compute_cumprobs(pmf.d)
cdf = Cdf(xs, ps)
print(cdf[4])
low, median, high = cdf.values([0.25, 0.5, 0.75])
print(median, high-low)
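As a cross-check, the probability of fewer than 5 heads is also the sum of the PMF over k from 0 to 4, which should match cdf[4]:
In [ ]:
# P(k < 5) by summing the PMF directly
print(sum(pmf[k] for k in range(5)))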