This notebook presents code and exercises from Think Bayes, second edition.
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import math
import numpy as np
from thinkbayes2 import Pmf, Cdf, Suite, Joint
import thinkplot
From DASL (http://lib.stat.cmu.edu/DASL/Stories/ImprovingReadingAbility.html):
An educator conducted an experiment to test whether new directed reading activities in the classroom will help elementary school pupils improve some aspects of their reading ability. She arranged for a third grade class of 21 students to follow these activities for an 8-week period. A control classroom of 23 third graders followed the same curriculum without the activities. At the end of the 8 weeks, all students took a Degree of Reading Power (DRP) test, which measures the aspects of reading ability that the treatment is designed to improve.
Summary statistics on the two groups of children show that the average score of the treatment class was almost ten points higher than the average of the control class. A two-sample t-test is appropriate for testing whether this difference is statistically significant. The t-statistic is 2.31, which is significant at the .05 level.
In [2]:
import pandas as pd
df = pd.read_csv('drp_scores.csv', skiprows=21, delimiter='\t')
df.head()
Out[2]:
In [3]:
grouped = df.groupby('Treatment')
for name, group in grouped:
    print(name, group.Response.mean())
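As a quick sanity check on the t-statistic quoted above, SciPy's two-sample t-test can be run directly on the scores (a minimal sketch; the 'Treated' label is an assumption about how the CSV codes the treatment group):

from scipy.stats import ttest_ind

treated_scores = df[df.Treatment=='Treated'].Response    # group label assumed
control_scores = df[df.Treatment=='Control'].Response

# classic two-sample t-test on the DRP scores
t_stat, p_value = ttest_ind(treated_scores, control_scores)
print('t', t_stat, 'p', p_value)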
In [4]:
from scipy.stats import norm
class Normal(Suite, Joint):

    def Likelihood(self, data, hypo):
        """
        data: sequence of test scores
        hypo: mu, sigma
        """
        mu, sigma = hypo
        likes = norm.pdf(data, mu, sigma)
        return np.prod(likes)
In [5]:
from itertools import product
mus = np.linspace(20, 80, 101)
sigmas = np.linspace(5, 30, 101)
In [6]:
control = Normal(product(mus, sigmas))
data = df[df.Treatment=='Control'].Response
control.Update(data)
Out[6]:
In [7]:
thinkplot.Contour(control, pcolor=True)
In [8]:
pmf_mu0 = control.Marginal(0)
thinkplot.Pdf(pmf_mu0)
In [9]:
pmf_sigma0 = control.Marginal(1)
thinkplot.Pdf(pmf_sigma0)
Exercise: Run this analysis again for the treatment group. What is the distribution of the difference between the groups? What is the probability that the average "reading power" for the treatment group is higher? What is the probability that the variance of the treatment group is higher?
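One possible way to start (a sketch only, assuming the treatment group is labeled 'Treated' in the CSV; the remaining questions follow the same pattern, and the solution cells below are left open):

treatment = Normal(product(mus, sigmas))
data = df[df.Treatment=='Treated'].Response
treatment.Update(data)

pmf_mu1 = treatment.Marginal(0)
pmf_sigma1 = treatment.Marginal(1)

# probability that the treatment mean exceeds the control mean,
# summing over all pairs of hypothetical means; a similar double sum
# over the sigma marginals answers the variance question
p_mu = sum(p1 * p0
           for mu1, p1 in pmf_mu1.Items()
           for mu0, p0 in pmf_mu0.Items()
           if mu1 > mu0)
print('P(mu_treated > mu_control)', p_mu)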
In [10]:
# Solution goes here
In [11]:
# Solution goes here
In [12]:
# Solution goes here
In [13]:
# Solution goes here
In [14]:
# Solution goes here
In [15]:
# Solution goes here
In [16]:
# Solution goes here
In [17]:
class Paintball(Suite, Joint):
    """Represents hypotheses about the location of an opponent."""

    def __init__(self, alphas, betas, locations):
        """Makes a joint suite of parameters alpha and beta.

        Enumerates all pairs of alpha and beta.
        Stores locations for use in Likelihood.

        alphas: possible values for alpha
        betas: possible values for beta
        locations: possible locations along the wall
        """
        self.locations = locations
        pairs = [(alpha, beta)
                 for alpha in alphas
                 for beta in betas]
        Suite.__init__(self, pairs)

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        hypo: pair of alpha, beta
        data: location of a hit

        Returns: float likelihood
        """
        alpha, beta = hypo
        x = data
        pmf = MakeLocationPmf(alpha, beta, self.locations)
        like = pmf.Prob(x)
        return like
In [18]:
def MakeLocationPmf(alpha, beta, locations):
    """Computes the Pmf of the locations, given alpha and beta.

    Given that the shooter is at coordinates (alpha, beta),
    the probability of hitting any spot is inversely proportional
    to the strafing speed.

    alpha: x position
    beta: y position
    locations: x locations where the pmf is evaluated

    Returns: Pmf object
    """
    pmf = Pmf()
    for x in locations:
        prob = 1.0 / StrafingSpeed(alpha, beta, x)
        pmf.Set(x, prob)
    pmf.Normalize()
    return pmf
In [19]:
def StrafingSpeed(alpha, beta, x):
    """Computes strafing speed, given location of shooter and impact.

    alpha: x location of shooter
    beta: y location of shooter
    x: location of impact

    Returns: derivative of x with respect to theta
    """
    theta = math.atan2(x - alpha, beta)
    speed = beta / math.cos(theta)**2
    return speed
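For reference, the formula follows from simple geometry: if the shooter at (alpha, beta) fires at angle theta measured from the perpendicular to the wall, the impact point is x = alpha + beta * tan(theta), so dx/dtheta = beta / cos(theta)**2, which is the expression returned above.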
In [20]:
alphas = range(0, 31)
betas = range(1, 51)
locations = range(0, 31)
suite = Paintball(alphas, betas, locations)
suite.UpdateSet([15, 16, 18, 21])
Out[20]:
In [21]:
locations = range(0, 31)
alpha = 10
betas = [10, 20, 40]

thinkplot.PrePlot(num=len(betas))
for beta in betas:
    pmf = MakeLocationPmf(alpha, beta, locations)
    pmf.label = 'beta = %d' % beta
    thinkplot.Pdf(pmf)

thinkplot.Config(xlabel='Distance',
                 ylabel='Prob')
In [22]:
marginal_alpha = suite.Marginal(0, label='alpha')
marginal_beta = suite.Marginal(1, label='beta')
print('alpha CI', marginal_alpha.CredibleInterval(50))
print('beta CI', marginal_beta.CredibleInterval(50))
thinkplot.PrePlot(num=2)
thinkplot.Cdf(Cdf(marginal_alpha))
thinkplot.Cdf(Cdf(marginal_beta))
thinkplot.Config(xlabel='Distance',
                 ylabel='Prob')
In [23]:
betas = [10, 20, 40]
thinkplot.PrePlot(num=len(betas))

for beta in betas:
    cond = suite.Conditional(0, 1, beta)
    cond.label = 'beta = %d' % beta
    thinkplot.Pdf(cond)

thinkplot.Config(xlabel='Distance',
                 ylabel='Prob')
In [24]:
thinkplot.Contour(suite.GetDict(), contour=False, pcolor=True)
thinkplot.Config(xlabel='alpha',
                 ylabel='beta',
                 axis=[0, 30, 0, 20])
In [25]:
d = dict((pair, 0) for pair in suite.Values())

percentages = [75, 50, 25]
for p in percentages:
    interval = suite.MaxLikeInterval(p)
    for pair in interval:
        d[pair] += 1

thinkplot.Contour(d, contour=False, pcolor=True)
thinkplot.Text(17, 4, '25', color='white')
thinkplot.Text(17, 15, '50', color='white')
thinkplot.Text(17, 30, '75')
thinkplot.Config(xlabel='alpha',
                 ylabel='beta',
                 legend=False)
Exercise: From John D. Cook
"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.
Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find. The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine]."
So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs. What is the Bayesian estimate of the number of errors based on this data?
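One way to set up the Bayesian version (a sketch with assumed uniform priors; the Lincoln class, the prior ranges, and the use of scipy.stats.multinomial are choices made here, not prescribed by the exercise):

from itertools import product
import numpy as np
from scipy.stats import multinomial

class Lincoln(Suite, Joint):
    """Hypotheses about the total number of bugs and the testers' skills."""

    def Likelihood(self, data, hypo):
        """
        data: k1, k2, c (bugs found by tester 1, tester 2, and by both)
        hypo: n, p1, p2 (total bugs; detection probability of each tester)
        """
        n, p1, p2 = hypo
        k1, k2, c = data
        if n < k1 + k2 - c:
            return 0
        # counts of bugs found by both, only tester 1, only tester 2, neither
        counts = [c, k1 - c, k2 - c, n - k1 - k2 + c]
        probs = [p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)]
        return multinomial.pmf(counts, n, probs)

ns = range(32, 350)
ps = np.linspace(0.1, 0.9, 9)
suite = Lincoln(product(ns, ps, ps))
suite.Update((20, 15, 3))
print('posterior mean of n', suite.Marginal(0).Mean())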
In [26]:
# Solution goes here
In [27]:
# Solution goes here
In [28]:
# Solution goes here
In [29]:
# Solution goes here
Exercise: The GPS problem. According to Wikipedia:

GPS included a (currently disabled) feature called Selective Availability (SA) that adds intentional, time varying errors of up to 100 meters (328 ft) to the publicly available navigation signals. This was intended to deny an enemy the use of civilian GPS receivers for precision weapon guidance. [...] Before it was turned off on May 2, 2000, typical SA errors were about 50 m (164 ft) horizontally and about 100 m (328 ft) vertically.[10] Because SA affects every GPS receiver in a given area almost equally, a fixed station with an accurately known position can measure the SA error values and transmit them to the local GPS receivers so they may correct their position fixes. This is called Differential GPS or DGPS. DGPS also corrects for several other important sources of GPS errors, particularly ionospheric delay, so it continues to be widely used even though SA has been turned off. The ineffectiveness of SA in the face of widely available DGPS was a common argument for turning off SA, and this was finally done by order of President Clinton in 2000.
Suppose it is 1 May 2000, and you are standing in a field that is 200m square. You are holding a GPS unit that indicates that your location is 51m north and 15m west of a known reference point in the middle of the field.
However, you know that each of these coordinates has been perturbed by a "feature" that adds random errors with mean 0 and standard deviation 30m.
1) After taking one measurement, what should you believe about your position?
Note: Since the intentional errors are independent, you could solve this problem independently for X and Y. But we'll treat it as a two-dimensional problem, partly for practice and partly to see how we could extend the solution to handle dependent errors.
You can start with the code in gps.py.
2) Suppose that after one second the GPS updates your position and reports coordinates (48, 90). What should you believe now?
3) Suppose you take 8 more measurements and get:
(11.903060613102866, 19.79168669735705)
(77.10743601503178, 39.87062906535289)
(80.16596823095534, -12.797927542984425)
(67.38157493119053, 83.52841028148538)
(89.43965206875271, 20.52141889230797)
(58.794021026248245, 30.23054016065644)
(2.5844401241265302, 51.012041625783766)
(45.58108994142448, 3.5718287379754585)
At this point, how certain are you about your location?
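A possible starting point, following the pattern of the Normal suite above (the grid resolution, the sign convention for the reported coordinates, and the Gps class name are assumptions; gps.py provides the original scaffolding):

from itertools import product
from scipy.stats import norm

class Gps(Suite, Joint):
    """Hypotheses about your actual (x, y) position in the field."""

    def Likelihood(self, data, hypo):
        """
        data: measured (x, y) coordinates
        hypo: hypothetical actual (x, y) coordinates
        """
        mx, my = data
        x, y = hypo
        # the two error components are independent Gaussians with sd 30 m
        return norm.pdf(mx, x, 30) * norm.pdf(my, y, 30)

# the field is 200 m square with the reference point at the center
coords = np.linspace(-100, 100, 101)
suite = Gps(product(coords, coords))

# first reading: 51 m north, 15 m west (sign convention assumed)
suite.Update((51, -15))

# later readings can be folded in with further calls to Update,
# or all at once with UpdateSet
thinkplot.Contour(suite, pcolor=True)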
In [30]:
# Solution goes here
In [31]:
# Solution goes here
In [32]:
# Solution goes here
In [33]:
# Solution goes here
In [34]:
# Solution goes here
In [35]:
# Solution goes here