Think Bayes

This notebook presents example code and exercise solutions for Think Bayes.

Copyright 2016 Allen B. Downey

MIT License:

In [1]:
# Configure Jupyter so figures appear in the notebook
%matplotlib inline

# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'

# import classes from thinkbayes2
from thinkbayes2 import Pmf, Suite

import pandas as pd
import numpy as np

import thinkplot

Interpreting medical tests

Suppose you are a doctor treating a 40-year old female patient. After she gets a routine screening mammogram, the result comes back positive (defined below).

The patient asks whether this result indicates that she has breast cancer. You interpret this question as, "What is the probability that this patient has breast cancer, given a positive test result?"

How would you respond?

The following background information from the Breast Cancer Screening Consortium (BCSC) might help:

Cancer Rate (per 1,000 examinations) and Cancer Detection Rate (per 1,000 examinations) for 1,838,372 Screening Mammography Examinations from 2004 to 2008 by Age -- based on BCSC data through 2009.

Performance Measures for 1,838,372 Screening Mammography Examinations1 from 2004 to 2008 by Age -- based on BCSC data through 2009.

In [27]:
class BayesTable(pd.DataFrame):
    def __init__(self, hypo, prior=1, **options):
        columns = ['prior', 'likelihood', 'unnorm', 'posterior']
        super().__init__(index=hypo, columns=columns, **options)
        self.prior = prior
    def mult(self):
        self.unnorm = self.prior * self.likelihood
    def norm(self):
        nc = np.sum(self.unnorm)
        self.posterior = self.unnorm / nc
        return nc
    def update(self):
        return self.norm()
    def reset(self):
        return BayesTable(self.hypo, self.posterior)

Assumptions and interpretation

According to the first table, the cancer rate per 1000 examinations is 2.65 for women age 40-44. The notes explain that this rate is based on "the number of examinations with a tissue diagnosis of ductal carcinoma in situ or invasive cancer within 1 year following the examination and before the next screening mammography examination", so it would be more precise to say that it is the rate of diagnosis within a year of the examination, not the rate of actual cancers.

Since untreated invasive breast cancer is likely to become symptomatic, we expect a large fraction of cancers to be diagnosed eventually. But there might be a long delay between developing a cancer and diagnosis, and a patient might die of another cause before diagnosis. So we should consider this rate as a lower bound on the probability that a patient has cancer at the time of the examination.

According to the second table, the sensitivity of the test for women in this age group is 73.4%; the specificity is 87.7%. From these, we can get the conditional probabilities:

P(positive test | cancer) = sensitivity
P(positive test | no cancer) = (1 - specificity)

Now we can use a Bayes table to compute the probability we are interested in, P(cancer | positive test)

In [28]:
base_rate = 2.65 / 1000
hypo = ['cancer', 'no cancer']
prior = [base_rate, 1-base_rate]
table = BayesTable(hypo, prior)

prior likelihood unnorm posterior
cancer 0.00265 NaN NaN NaN
no cancer 0.99735 NaN NaN NaN

In [29]:
sensitivity = 0.734
specificity = 0.877
table.likelihood = [sensitivity, 1-specificity]

prior likelihood unnorm posterior
cancer 0.00265 0.734 NaN NaN
no cancer 0.99735 0.123 NaN NaN

In [34]:
likelihood_ratio = table.likelihood['cancer'] / table.likelihood['no cancer']


In [31]:

prior likelihood unnorm posterior
cancer 0.00265 0.734 0.001945 0.015608
no cancer 0.99735 0.123 0.122674 0.984392

In [36]:
table.posterior['cancer'] * 100


So there is a 1.56% chance that this patient has cancer, given that the initial screening mammogram was positive.

This result is called the positive predictive value (PPV) of the test, which we could have read from the second table

This data was the basis, in 2009, for the recommendation of the US Preventive Services Task Force,

In [49]:
def compute_ppv(base_rate, sensitivity, specificity):
    pmf = Pmf()
    pmf['cancer'] = base_rate * sensitivity
    pmf['no cancer'] = (1 - base_rate) * (1 - specificity)
    return pmf

In [51]:
pmf = compute_ppv(base_rate, sensitivity, specificity)

Pmf({'cancer': 0.01560835553765212, 'no cancer': 0.9843916444623478})

In [55]:
ages = [40, 50, 60, 70, 80]
rates = pd.Series([2.65, 4.28, 5.70, 6.76, 8.51], index=ages)

40    2.65
50    4.28
60    5.70
70    6.76
80    8.51
dtype: float64

In [58]:
for age, rate in rates.items():
    pmf = compute_ppv(rate, sensitivity, specificity)
    print(age, pmf['cancer'])

40 1.116493987314525
50 1.1473441243499096
60 1.1603294783259839
70 1.166569488592548
80 1.173548315582017

In [ ]: