Think Bayes

This notebook presents example code and exercise solutions for Think Bayes.

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
# Configure Jupyter so figures appear in the notebook
%matplotlib inline

# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'

# import classes from thinkbayes2
from thinkbayes2 import Pmf, Suite

import pandas as pd
import numpy as np

import thinkplot

Interpreting medical tests

Suppose you are a doctor treating a 40-year-old female patient. After she gets a routine screening mammogram, the result comes back positive (defined below).

The patient asks whether this result indicates that she has breast cancer. You interpret this question as, "What is the probability that this patient has breast cancer, given a positive test result?"

How would you respond?

The following background information from the Breast Cancer Screening Consortium (BCSC) might help:

Cancer Rate (per 1,000 examinations) and Cancer Detection Rate (per 1,000 examinations) for 1,838,372 Screening Mammography Examinations from 2004 to 2008 by Age -- based on BCSC data through 2009.

Performance Measures for 1,838,372 Screening Mammography Examinations from 2004 to 2008 by Age -- based on BCSC data through 2009.



In [27]:

    
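class BayesTable(pd.DataFrame):
    """Table with one row per hypothesis and columns for each step of a Bayesian update."""

    def __init__(self, hypo, prior=1, **options):
        columns = ['prior', 'likelihood', 'unnorm', 'posterior']
        super().__init__(index=hypo, columns=columns, **options)
        self.prior = prior
    
    def mult(self):
        # Unnormalized posterior: prior times likelihood
        self.unnorm = self.prior * self.likelihood
        
    def norm(self):
        # Divide by the total so the posteriors sum to 1;
        # return the normalizing constant
        nc = np.sum(self.unnorm)
        self.posterior = self.unnorm / nc
        return nc
    
    def update(self):
        self.mult()
        return self.norm()
    
    def reset(self):
        # Start a new table whose prior is the current posterior.
        # The hypotheses are stored in the index (there is no `hypo` attribute).
        return BayesTable(self.index, self.posterior)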

Assumptions and interpretation

According to the first table, the cancer rate per 1,000 examinations is 2.65 for women aged 40-44. The notes explain that this rate is based on "the number of examinations with a tissue diagnosis of ductal carcinoma in situ or invasive cancer within 1 year following the examination and before the next screening mammography examination". It would therefore be more precise to call it the rate of diagnosis within a year of the examination, not the rate of actual cancers.

Since untreated invasive breast cancer is likely to become symptomatic, we expect a large fraction of cancers to be diagnosed eventually. But there might be a long delay between developing a cancer and diagnosis, and a patient might die of another cause before diagnosis. So we should consider this rate as a lower bound on the probability that a patient has cancer at the time of the examination.
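To see how much the answer depends on treating the diagnosis rate as the prior, here is a rough sensitivity check. It is only a sketch: the multipliers 1, 2, and 5 are hypothetical and do not come from the BCSC data, the function name `ppv_for_prior` is made up for this example, and the default sensitivity (73.4%) and specificity (87.7%) are the figures for this age group from the second BCSC table.

```python
def ppv_for_prior(base_rate, sensitivity=0.734, specificity=0.877):
    """Return P(cancer | positive test) for a given prior probability."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

diagnosed_rate = 2.65 / 1000  # diagnosis rate from the first table

# Hypothetical multipliers: what if true prevalence were 1x, 2x, or 5x
# the one-year diagnosis rate? These factors are assumptions.
results = {}
for factor in [1, 2, 5]:
    results[factor] = ppv_for_prior(diagnosed_rate * factor)
    print(factor, round(results[factor], 4))
```

If the true prevalence is higher than the diagnosis rate, the probability of cancer after a positive test is higher too, so the figure based on 2.65 per 1,000 is best read as a lower bound.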

According to the second table, the sensitivity of the test for women in this age group is 73.4%; the specificity is 87.7%. From these, we can get the conditional probabilities:

P(positive test | cancer) = sensitivity
P(positive test | no cancer) = (1 - specificity)

Now we can use a Bayes table to compute the probability we are interested in, P(cancer | positive test).
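As a cross-check on the table-based approach, the same quantity can be computed directly from Bayes's theorem. The following is a minimal sketch in plain Python; the function name `posterior_given_positive` is illustrative and is not part of thinkbayes2.

```python
def posterior_given_positive(base_rate, sensitivity, specificity):
    """Return P(cancer | positive test) by Bayes's theorem."""
    # P(positive test and cancer)
    true_pos = base_rate * sensitivity
    # P(positive test and no cancer)
    false_pos = (1 - base_rate) * (1 - specificity)
    # Divide by the total probability of a positive test
    return true_pos / (true_pos + false_pos)

# Values quoted above for women aged 40-44
p = posterior_given_positive(2.65 / 1000, 0.734, 0.877)
print(p)
```

The denominator is the total probability of a positive test, which is the same normalizing constant that the Bayes table computes.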



In [28]:

    
base_rate = 2.65 / 1000
hypo = ['cancer', 'no cancer']
prior = [base_rate, 1-base_rate]
table = BayesTable(hypo, prior)









    Out[28]:

             prior  likelihood  unnorm  posterior
cancer     0.00265         NaN     NaN        NaN
no cancer  0.99735         NaN     NaN        NaN



In [29]:

    
sensitivity = 0.734
specificity = 0.877
table.likelihood = [sensitivity, 1-specificity]
table









    Out[29]:

             prior  likelihood  unnorm  posterior
cancer     0.00265       0.734     NaN        NaN
no cancer  0.99735       0.123     NaN        NaN



In [34]:

    
likelihood_ratio = table.likelihood['cancer'] / table.likelihood['no cancer']









    Out[34]:





5.967479674796748
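The likelihood ratio connects to the odds form of Bayes's theorem: posterior odds equal prior odds times the likelihood ratio. Here is a small sketch of that calculation; the helper names `prob_to_odds` and `odds_to_prob` are chosen for illustration.

```python
def prob_to_odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def odds_to_prob(o):
    """Convert odds in favor to a probability."""
    return o / (1 + o)

base_rate = 2.65 / 1000
sensitivity = 0.734
specificity = 0.877

# Likelihood ratio of a positive test: cancer versus no cancer
likelihood_ratio = sensitivity / (1 - specificity)

# Odds form of Bayes's theorem
prior_odds = prob_to_odds(base_rate)
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = odds_to_prob(posterior_odds)
print(posterior_prob)
```

A positive result multiplies the odds of cancer by about 6, but because the prior odds are so small, the posterior probability remains low.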



In [31]:

    
table.update()
table









    Out[31]:

             prior  likelihood    unnorm  posterior
cancer     0.00265       0.734  0.001945   0.015608
no cancer  0.99735       0.123  0.122674   0.984392



In [36]:

    
table.posterior['cancer'] * 100









    Out[36]:





1.5608355537652119

So there is a 1.56% chance that this patient has cancer, given that the initial screening mammogram was positive.

This result is called the positive predictive value (PPV) of the test, which we could have read from the second table.
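One way to see why the PPV is so low is to count cases in an imagined cohort. The sketch below assumes a hypothetical group of 100,000 women aged 40-44 (the cohort size is arbitrary, chosen only to give round numbers) and applies the base rate, sensitivity, and specificity used above.

```python
n = 100_000  # hypothetical cohort size, chosen for round numbers
base_rate = 2.65 / 1000
sensitivity = 0.734
specificity = 0.877

with_cancer = n * base_rate          # about 265 women
without_cancer = n - with_cancer     # about 99,735 women

# Positive tests among each group
true_positives = with_cancer * sensitivity            # about 195
false_positives = without_cancer * (1 - specificity)  # about 12,267

# PPV: fraction of positive tests that are true positives
ppv = true_positives / (true_positives + false_positives)
print(f"{true_positives:.0f} true positives, {false_positives:.0f} false positives")
print(f"PPV = {ppv:.4f}")
```

False positives outnumber true positives by roughly 63 to 1, which is why a positive screening result still leaves the probability of cancer below 2%.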

This data was the basis, in 2009, for the recommendation of the US Preventive Services Task Force,



In [49]:

    
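def compute_ppv(base_rate, sensitivity, specificity):
    """Return the posterior Pmf after a positive test.

    The positive predictive value is the probability of 'cancer'.
    """
    pmf = Pmf()
    # Unnormalized posteriors: prior times likelihood of a positive test
    pmf['cancer'] = base_rate * sensitivity
    pmf['no cancer'] = (1 - base_rate) * (1 - specificity)
    pmf.Normalize()
    return pmf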



In [51]:

    
pmf = compute_ppv(base_rate, sensitivity, specificity)









    Out[51]:





Pmf({'cancer': 0.01560835553765212, 'no cancer': 0.9843916444623478})



In [55]:

    
ages = [40, 50, 60, 70, 80]
rates = pd.Series([2.65, 4.28, 5.70, 6.76, 8.51], index=ages)









    Out[55]:





40    2.65
50    4.28
60    5.70
70    6.76
80    8.51
dtype: float64



In [58]:

    
# Note: this uses the sensitivity and specificity for ages 40-44 at every age
for age, rate in rates.items():
    # rates are per 1,000 examinations; convert to a probability
    pmf = compute_ppv(rate / 1000, sensitivity, specificity)
    print(age, f"{pmf['cancer']:.4f}")



40 0.0156
50 0.0250
60 0.0331
70 0.0390
80 0.0487


