This notebook presents code and exercises from Think Bayes, second edition.
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
In [90]:
from __future__ import print_function, division
% matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import math
import numpy as np
from thinkbayes2 import Pmf, Cdf, Suite, Joint, EvalBinomialPmf
import thinkplot
From DASL(http://lib.stat.cmu.edu/DASL/Stories/ImprovingReadingAbility.html)
An educator conducted an experiment to test whether new directed reading activities in the classroom will help elementary school pupils improve some aspects of their reading ability. She arranged for a third grade class of 21 students to follow these activities for an 8-week period. A control classroom of 23 third graders followed the same curriculum without the activities. At the end of the 8 weeks, all students took a Degree of Reading Power (DRP) test, which measures the aspects of reading ability that the treatment is designed to improve.
Summary statistics on the two groups of children show that the average score of the treatment class was almost ten points higher than the average of the control class. A two-sample t-test is appropriate for testing whether this difference is statistically significant. The t-statistic is 2.31, which is significant at the .05 level.
I'll use Pandas to load the data into a DataFrame.
In [2]:
import pandas as pd
df = pd.read_csv('drp_scores.csv', skiprows=21, delimiter='\t')
df.head()
Out[2]:
And use groupby
to compute the means for the two groups.
In [3]:
grouped = df.groupby('Treatment')
for name, group in grouped:
print(name, group.Response.mean())
The Normal
class provides a Likelihood
function that computes the likelihood of a sample from a normal distribution.
In [4]:
from scipy.stats import norm
class Normal(Suite, Joint):
def Likelihood(self, data, hypo):
"""
data: sequence of test scores
hypo: mu, sigma
"""
mu, sigma = hypo
likes = norm.pdf(data, mu, sigma)
return np.prod(likes)
The prior distributions for mu
and sigma
are uniform.
In [5]:
mus = np.linspace(20, 80, 101)
sigmas = np.linspace(5, 30, 101)
I use itertools.product
to enumerate all pairs of mu
and sigma
.
In [6]:
from itertools import product
control = Normal(product(mus, sigmas))
data = df[df.Treatment=='Control'].Response
control.Update(data)
Out[6]:
After the update, we can plot the probability of each mu
-sigma
pair as a contour plot.
In [7]:
thinkplot.Contour(control, pcolor=True)
thinkplot.Config(xlabel='mu', ylabel='sigma')
And then we can extract the marginal distribution of mu
In [8]:
pmf_mu0 = control.Marginal(0)
thinkplot.Pdf(pmf_mu0)
thinkplot.Config(xlabel='mu', ylabel='Pmf')
And the marginal distribution of sigma
In [9]:
pmf_sigma0 = control.Marginal(1)
thinkplot.Pdf(pmf_sigma0)
thinkplot.Config(xlabel='sigma', ylabel='Pmf')
Exercise: Run this analysis again for the control group. What is the distribution of the difference between the groups? What is the probability that the average "reading power" for the treatment group is higher? What is the probability that the variance of the treatment group is higher?
In [10]:
# Solution goes here
In [11]:
# Solution goes here
In [12]:
# Solution goes here
In [13]:
# Solution goes here
In [14]:
# Solution goes here
In [15]:
# Solution goes here
In [16]:
# Solution goes here
In [17]:
# Solution goes here
In [18]:
# It looks like there is a high probability that the mean of
# the treatment group is higher, and the most likely size of
# the effect is 9-10 points.
# It looks like the variance of the treated group is substantially
# smaller, which suggests that the treatment might be helping
# low scorers more than high scorers.
Suppose you are playing paintball in an indoor arena 30 feet wide and 50 feet long. You are standing near one of the 30 foot walls, and you suspect that one of your opponents has taken cover nearby. Along the wall, you see several paint spatters, all the same color, that you think your opponent fired recently.
The spatters are at 15, 16, 18, and 21 feet, measured from the lower-left corner of the room. Based on these data, where do you think your opponent is hiding?
Here's the Suite that does the update. It uses MakeLocationPmf
,
defined below.
In [19]:
class Paintball(Suite, Joint):
"""Represents hypotheses about the location of an opponent."""
def __init__(self, alphas, betas, locations):
"""Makes a joint suite of parameters alpha and beta.
Enumerates all pairs of alpha and beta.
Stores locations for use in Likelihood.
alphas: possible values for alpha
betas: possible values for beta
locations: possible locations along the wall
"""
self.locations = locations
pairs = [(alpha, beta)
for alpha in alphas
for beta in betas]
Suite.__init__(self, pairs)
def Likelihood(self, data, hypo):
"""Computes the likelihood of the data under the hypothesis.
hypo: pair of alpha, beta
data: location of a hit
Returns: float likelihood
"""
alpha, beta = hypo
x = data
pmf = MakeLocationPmf(alpha, beta, self.locations)
like = pmf.Prob(x)
return like
In [ ]:
In [20]:
def MakeLocationPmf(alpha, beta, locations):
"""Computes the Pmf of the locations, given alpha and beta.
Given that the shooter is at coordinates (alpha, beta),
the probability of hitting any spot is inversely proportionate
to the strafe speed.
alpha: x position
beta: y position
locations: x locations where the pmf is evaluated
Returns: Pmf object
"""
pmf = Pmf()
for x in locations:
prob = 1.0 / StrafingSpeed(alpha, beta, x)
pmf.Set(x, prob)
pmf.Normalize()
return pmf
In [ ]:
In [21]:
def StrafingSpeed(alpha, beta, x):
"""Computes strafing speed, given location of shooter and impact.
alpha: x location of shooter
beta: y location of shooter
x: location of impact
Returns: derivative of x with respect to theta
"""
theta = math.atan2(x - alpha, beta)
speed = beta / math.cos(theta)**2
return speed
The prior probabilities for alpha
and beta
are uniform.
In [22]:
alphas = range(0, 31)
betas = range(1, 51)
locations = range(0, 31)
suite = Paintball(alphas, betas, locations)
suite.UpdateSet([15, 16, 18, 21])
Out[22]:
To visualize the joint posterior, I take slices for a few values of beta
and plot the conditional distributions of alpha
. If the shooter is close to the wall, we can be somewhat confident of his position. The farther away he is, the less certain we are.
In [23]:
locations = range(0, 31)
alpha = 10
betas = [10, 20, 40]
thinkplot.PrePlot(num=len(betas))
for beta in betas:
pmf = MakeLocationPmf(alpha, beta, locations)
pmf.label = 'beta = %d' % beta
thinkplot.Pdf(pmf)
thinkplot.Config(xlabel='Distance',
ylabel='Prob')
Here are the marginal posterior distributions for alpha
and beta
.
In [24]:
marginal_alpha = suite.Marginal(0, label='alpha')
marginal_beta = suite.Marginal(1, label='beta')
print('alpha CI', marginal_alpha.CredibleInterval(50))
print('beta CI', marginal_beta.CredibleInterval(50))
thinkplot.PrePlot(num=2)
thinkplot.Cdf(Cdf(marginal_alpha))
thinkplot.Cdf(Cdf(marginal_beta))
thinkplot.Config(xlabel='Distance',
ylabel='Prob')
To visualize the joint posterior, I take slices for a few values of beta
and plot the conditional distributions of alpha
. If the shooter is close to the wall, we can be somewhat confident of his position. The farther away he is, the less certain we are.
In [25]:
betas = [10, 20, 40]
thinkplot.PrePlot(num=len(betas))
for beta in betas:
cond = suite.Conditional(0, 1, beta)
cond.label = 'beta = %d' % beta
thinkplot.Pdf(cond)
thinkplot.Config(xlabel='Distance',
ylabel='Prob')
Another way to visualize the posterio distribution: a pseudocolor plot of probability as a function of alpha
and beta
.
In [26]:
thinkplot.Contour(suite.GetDict(), contour=False, pcolor=True)
thinkplot.Config(xlabel='alpha',
ylabel='beta',
axis=[0, 30, 0, 20])
Here's another visualization that shows posterior credible regions.
In [27]:
d = dict((pair, 0) for pair in suite.Values())
percentages = [75, 50, 25]
for p in percentages:
interval = suite.MaxLikeInterval(p)
for pair in interval:
d[pair] += 1
thinkplot.Contour(d, contour=False, pcolor=True)
thinkplot.Text(17, 4, '25', color='white')
thinkplot.Text(17, 15, '50', color='white')
thinkplot.Text(17, 30, '75')
thinkplot.Config(xlabel='alpha',
ylabel='beta',
legend=False)
In [ ]:
Exercise: From John D. Cook
"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.
Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find. The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine]."
So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs. What is the Bayesian estimate of the number of errors based on this data?
In [79]:
def shared_bugs(p1, p2, bugs):
k1 = np.random.random(bugs) < p1
k2 = np.random.random(bugs) < p2
return np.sum(k1 & k2)
In [80]:
p1 = .20
p2 = .15
bugs = 100
In [81]:
bug_pmf = Pmf()
for trial in range(1000):
bug_pmf[shared_bugs(p1, p2, bugs)] += 1
In [84]:
bug_pmf.Normalize()
bug_pmf.Print()
In [89]:
thinkplot.Hist(bug_pmf)
Now do some bayes
In [122]:
from scipy import special
class bugFinder(Suite, Joint):
def Likelihood(self, data, hypo):
"""
data: (k1, k2, c)
hypo: (n, p1, p1)
"""
n = hypo[0]
p1 = hypo[1]
p2 = hypo[2]
k1 = data[0]
k2 = data[1]
c = data[2]
like1 = EvalBinomialPmf(k1, n, p1)
like2 = EvalBinomialPmf(k2, n, p2)
return like1 * like2
In [123]:
p1 = np.linspace(0, 1, 40)
p2 = np.linspace(0, 1, 40)
n = np.linspace(32, 300, 40)
hypos = []
for p1_ in p1:
for p2_ in p2:
for n_ in n:
hypos.append((n_, p1_, p2_))
In [124]:
bug_finder_suite = bugFinder(hypos)
In [125]:
bug_finder_suite.Update([20, 15, 3])
Out[125]:
In [126]:
thinkplot.Contour(bug_finder_suite.GetDict(), contour=False, pcolor=True)
thinkplot.Config(xlabel='alpha',
ylabel='beta',
axis=[0, 30, 0, 20])
In [ ]:
In [ ]:
Exercise: The GPS problem. According to Wikipedia

GPS included a (currently disabled) feature called Selective Availability (SA) that adds intentional, time varying errors of up to 100 meters (328 ft) to the publicly available navigation signals. This was intended to deny an enemy the use of civilian GPS receivers for precision weapon guidance. [...] Before it was turned off on May 2, 2000, typical SA errors were about 50 m (164 ft) horizontally and about 100 m (328 ft) vertically.[10] Because SA affects every GPS receiver in a given area almost equally, a fixed station with an accurately known position can measure the SA error values and transmit them to the local GPS receivers so they may correct their position fixes. This is called Differential GPS or DGPS. DGPS also corrects for several other important sources of GPS errors, particularly ionospheric delay, so it continues to be widely used even though SA has been turned off. The ineffectiveness of SA in the face of widely available DGPS was a common argument for turning off SA, and this was finally done by order of President Clinton in 2000.
Suppose it is 1 May 2000, and you are standing in a field that is 200m square. You are holding a GPS unit that indicates that your location is 51m north and 15m west of a known reference point in the middle of the field.
However, you know that each of these coordinates has been perturbed by a "feature" that adds random errors with mean 0 and standard deviation 30m.
1) After taking one measurement, what should you believe about your position?
Note: Since the intentional errors are independent, you could solve this problem independently for X and Y. But we'll treat it as a two-dimensional problem, partly for practice and partly to see how we could extend the solution to handle dependent errors.
You can start with the code in gps.py.
2) Suppose that after one second the GPS updates your position and reports coordinates (48, 90). What should you believe now?
3) Suppose you take 8 more measurements and get:
(11.903060613102866, 19.79168669735705)
(77.10743601503178, 39.87062906535289)
(80.16596823095534, -12.797927542984425)
(67.38157493119053, 83.52841028148538)
(89.43965206875271, 20.52141889230797)
(58.794021026248245, 30.23054016065644)
(2.5844401241265302, 51.012041625783766)
(45.58108994142448, 3.5718287379754585)
At this point, how certain are you about your location?
In [32]:
# Solution goes here
In [33]:
# Solution goes here
In [34]:
# Solution goes here
In [35]:
# Solution goes here
In [36]:
# Solution goes here
In [37]:
# Solution goes here
Exercise: The Flea Beetle problem from DASL
Datafile Name: Flea Beetles
Datafile Subjects: Biology
Story Names: Flea Beetles
Reference: Lubischew, A.A. (1962) On the use of discriminant functions in taxonomy. Biometrics, 18, 455-477. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, 254-255.
Authorization: Contact Authors
Description: Data were collected on the genus of flea beetle Chaetocnema, which contains three species: concinna (Con), heikertingeri (Hei), and heptapotamica (Hep). Measurements were made on the width and angle of the aedeagus of each beetle. The goal of the original study was to form a classification rule to distinguish the three species.
Number of cases: 74
Variable Names:
Width: The maximal width of aedeagus in the forpart (in microns)
Angle: The front angle of the aedeagus (1 unit = 7.5 degrees)
Species: Species of flea beetle from the genus Chaetocnema
Suggestions:
Plot CDFs for the width and angle data, broken down by species, to get a visual sense of whether the normal distribution is a good model.
Use the data to estimate the mean and standard deviation for each variable, broken down by species.
Given a joint posterior distribution for mu
and sigma
, what is the likelihood of a given datum?
Write a function that takes a measured width and angle and returns a posterior PMF of species.
Use the function to classify each of the specimens in the table and see how many you get right.
In [38]:
import pandas as pd
df = pd.read_csv('flea_beetles.csv', delimiter='\t')
df.head()
Out[38]:
In [39]:
# Solution goes here