This notebook presents example code and exercise solutions for Think Bayes.
Copyright 2016 Allen B. Downey
MIT License: https://opensource.org/licenses/MIT
In [1]:
# Configure Jupyter so figures appear in the notebook
%matplotlib inline
# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# import classes from thinkbayes2
from thinkbayes2 import Pmf, Suite
import pandas as pd
import numpy as np
import thinkplot
Suppose you are a doctor treating a 40-year-old female patient. After she gets a routine screening mammogram, the result comes back positive (defined below).
The patient asks whether this result indicates that she has breast cancer. You interpret this question as, "What is the probability that this patient has breast cancer, given a positive test result?"
How would you respond?
The following background information from the Breast Cancer Surveillance Consortium (BCSC) might help:
In [27]:
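class BayesTable(pd.DataFrame):
    """Table of hypotheses with prior, likelihood, and posterior columns."""

    def __init__(self, hypo, prior=1, **options):
        columns = ['prior', 'likelihood', 'unnorm', 'posterior']
        super().__init__(index=hypo, columns=columns, **options)
        self.prior = prior

    def mult(self):
        # Unnormalized posterior: prior times likelihood
        self.unnorm = self.prior * self.likelihood

    def norm(self):
        # Normalize the posterior and return the normalizing constant
        nc = np.sum(self.unnorm)
        self.posterior = self.unnorm / nc
        return nc

    def update(self):
        self.mult()
        return self.norm()

    def reset(self):
        # Start a new table whose prior is the current posterior
        # (the hypotheses are stored in the index)
        return BayesTable(self.index, self.posterior)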
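The same prior-times-likelihood-then-normalize steps can be sketched with a plain DataFrame, without subclassing. This is only an illustration: `bayes_table` is a hypothetical helper, and the hypotheses and numbers are made up to show the mechanics, not taken from the BCSC tables.

```python
import pandas as pd

def bayes_table(hypos, prior, likelihood):
    """Build a Bayes table as a plain DataFrame (hypothetical helper)."""
    table = pd.DataFrame(index=hypos)
    table["prior"] = prior
    table["likelihood"] = likelihood
    # Unnormalized posterior: prior times likelihood
    table["unnorm"] = table["prior"] * table["likelihood"]
    # Divide by the total so the posteriors sum to 1
    table["posterior"] = table["unnorm"] / table["unnorm"].sum()
    return table

# Illustrative numbers only
demo = bayes_table(["A", "B"], prior=[0.5, 0.5], likelihood=[0.75, 0.5])
print(demo)
```

With equal priors, the posterior is simply proportional to the likelihood, so hypothesis A ends up with probability 0.6 and B with 0.4.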
According to the first table, the cancer rate per 1000 examinations is 2.65 for women age 40-44. The notes explain that this rate is based on "the number of examinations with a tissue diagnosis of ductal carcinoma in situ or invasive cancer within 1 year following the examination and before the next screening mammography examination", so it would be more precise to say that it is the rate of diagnosis within a year of the examination, not the rate of actual cancers.
Since untreated invasive breast cancer is likely to become symptomatic, we expect a large fraction of cancers to be diagnosed eventually. But there might be a long delay between developing a cancer and diagnosis, and a patient might die of another cause before diagnosis. So we should consider this rate as a lower bound on the probability that a patient has cancer at the time of the examination.
According to the second table, the sensitivity of the test for women in this age group is 73.4%; the specificity is 87.7%. From these, we can get the conditional probabilities:
P(positive test | cancer) = sensitivity
P(positive test | no cancer) = (1 - specificity)
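For reference, these two conditional probabilities enter Bayes's theorem as follows. The expansion of the denominator assumes that "cancer" and "no cancer" are the only two hypotheses, as in the rest of this example.

```latex
P(\text{cancer} \mid \text{positive test}) =
\frac{P(\text{cancer}) \, P(\text{positive test} \mid \text{cancer})}
     {P(\text{cancer}) \, P(\text{positive test} \mid \text{cancer})
      + P(\text{no cancer}) \, P(\text{positive test} \mid \text{no cancer})}
```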
Now we can use a Bayes table to compute the probability we are interested in, P(cancer | positive test).
In [28]:
base_rate = 2.65 / 1000
hypo = ['cancer', 'no cancer']
prior = [base_rate, 1-base_rate]
table = BayesTable(hypo, prior)
In [29]:
sensitivity = 0.734
specificity = 0.877
table.likelihood = [sensitivity, 1-specificity]
table
In [34]:
likelihood_ratio = table.likelihood['cancer'] / table.likelihood['no cancer']
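One way to read the likelihood ratio is through the odds form of Bayes's rule: posterior odds equal prior odds times the likelihood ratio. The sketch below uses plain floats and the base rate, sensitivity, and specificity quoted in the text; the variable names `prior_odds`, `posterior_odds`, and `posterior_prob` are introduced here for illustration.

```python
# Values quoted in the text for women aged 40-44
base_rate = 2.65 / 1000
sensitivity = 0.734
specificity = 0.877

# How much more likely a positive test is with cancer than without
likelihood_ratio = sensitivity / (1 - specificity)

# Odds form of Bayes's rule
prior_odds = base_rate / (1 - base_rate)
posterior_odds = prior_odds * likelihood_ratio

# Convert odds back to a probability
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio: {likelihood_ratio:.2f}")
print(f"P(cancer | positive test): {posterior_prob:.4f}")
```

A positive result multiplies the odds of cancer by about 6, but because the prior odds are so small, the posterior probability stays below 2%.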
In [31]:
table.update()
table
In [36]:
table.posterior['cancer'] * 100
So there is a 1.56% chance that this patient has cancer, given that the initial screening mammogram was positive.
This result is called the positive predictive value (PPV) of the test, which we could have read from the second table.
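To see why the PPV is so low, it can help to count people rather than multiply probabilities. The following sketch applies the same base rate, sensitivity, and specificity to a hypothetical cohort of 100,000 examinations; the cohort size is an arbitrary assumption and does not affect the result.

```python
n_screened = 100_000          # hypothetical cohort size
base_rate = 2.65 / 1000       # from the first table
sensitivity = 0.734           # from the second table
specificity = 0.877           # from the second table

with_cancer = n_screened * base_rate
without_cancer = n_screened - with_cancer

# Positive tests among patients with and without cancer
true_positives = with_cancer * sensitivity
false_positives = without_cancer * (1 - specificity)

# PPV: fraction of positive tests that are true positives
ppv = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"PPV: {ppv:.4f}")
```

False positives outnumber true positives by roughly 60 to 1, because the large group without cancer generates far more positive tests than the small group with cancer.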
This data was the basis, in 2009, for the recommendation of the US Preventive Services Task Force,
In [49]:
def compute_ppv(base_rate, sensitivity, specificity):
    pmf = Pmf()
    pmf['cancer'] = base_rate * sensitivity
    pmf['no cancer'] = (1 - base_rate) * (1 - specificity)
    pmf.Normalize()
    return pmf
In [51]:
pmf = compute_ppv(base_rate, sensitivity, specificity)
In [55]:
ages = [40, 50, 60, 70, 80]
rates = pd.Series([2.65, 4.28, 5.70, 6.76, 8.51], index=ages)
In [58]:
for age, rate in rates.items():
    # rates are per 1000 examinations, so convert to a probability
    pmf = compute_ppv(rate / 1000, sensitivity, specificity)
    print(age, pmf['cancer'])
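The same calculation can be sketched without thinkbayes2 by operating on the whole Series at once. This version assumes, as the loop does, that the sensitivity and specificity quoted for the 40-44 age group apply to every age, which is a simplification; the names `base_rates` and `ppv` are introduced here for illustration.

```python
import pandas as pd

# Sensitivity and specificity quoted in the text for ages 40-44,
# reused for all ages as a simplifying assumption
sensitivity = 0.734
specificity = 0.877

# Cancer rates per 1000 examinations, indexed by age
ages = [40, 50, 60, 70, 80]
rates = pd.Series([2.65, 4.28, 5.70, 6.76, 8.51], index=ages)

# Convert rates per 1000 to probabilities
base_rates = rates / 1000

# Bayes's theorem applied element-wise
true_pos = base_rates * sensitivity
false_pos = (1 - base_rates) * (1 - specificity)
ppv = true_pos / (true_pos + false_pos)

print(ppv.round(4))
```

Under these assumptions the PPV rises with age only because the base rate rises.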