# Bayesian Survival Analysis



In [1]:

from __future__ import print_function, division

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import thinkbayes2
import thinkplot



## Survival analysis

Suppose that you are an auto insurance company interested in the time between collisions for a particular driver. If the probability of a collision is roughly constant over time, the time between collisions will follow an exponential distribution.

Here's an example with parameter $\lambda = 0.5$ collisions / year.



In [2]:

from thinkbayes2 import MakeExponentialPmf

pmf = MakeExponentialPmf(lam=0.5, high=30)
thinkplot.Pdf(pmf)






For the exponential distribution, the mean and standard deviation are both $1/\lambda$.

In this case they are only approximate because we truncated the distribution.
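As a quick check, independent of thinkbayes2, a large sample from an exponential distribution with $\lambda = 0.5$ should have mean and standard deviation close to $1/\lambda = 2$ (note that numpy parameterizes the exponential by its scale, $1/\lambda$):

```python
import numpy as np

rng = np.random.default_rng(17)
lam = 0.5
# numpy's exponential takes the scale parameter, which is 1/lam
sample = rng.exponential(scale=1/lam, size=100_000)
print(sample.mean(), sample.std())  # both close to 1/lam = 2
```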



In [3]:

pmf.Mean(), pmf.Std()




Out[3]:

(1.9255614181751446, 1.9994621154028254)



From the PMF, we can compute the CDF.



In [4]:

cdf = pmf.MakeCdf()
thinkplot.Cdf(cdf)






And from the CDF, we can compute the survival function, which is the complement of the CDF.

$SF(x) = Prob\{X > x\} = 1 - Prob\{X \le x\} = 1 - CDF(x)$
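For the exponential distribution this relationship has a simple closed form: $CDF(x) = 1 - e^{-\lambda x}$, so $SF(x) = e^{-\lambda x}$. A numpy-only sketch of the complement:

```python
import numpy as np

lam = 0.5
xs = np.linspace(0, 10, 6)
cdf = 1 - np.exp(-lam * xs)  # exponential CDF
sf = 1 - cdf                 # survival function: complement of the CDF
print(sf)                    # same as np.exp(-lam * xs)
```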



In [5]:

from survival import MakeSurvivalFromCdf

sf = MakeSurvivalFromCdf(cdf)
thinkplot.Plot(sf)






From the survival function we can get the hazard function, which is the probability of a collision at $x$, given no collision prior to $x$.



In [6]:

hf = sf.MakeHazardFunction()
thinkplot.Plot(hf)






If the distribution is truly exponential, the hazard function is constant for all $x$.

In this case it goes to 1 at the end, again because we truncated the distribution.
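In continuous terms the hazard function is $h(x) = PDF(x) \, / \, SF(x)$. For the exponential distribution the factor $e^{-\lambda x}$ appears in both numerator and denominator and cancels, leaving $h(x) = \lambda$. A numpy check of that cancellation (independent of the truncated PMF above):

```python
import numpy as np

lam = 0.5
xs = np.linspace(0, 10, 50)
pdf = lam * np.exp(-lam * xs)  # exponential PDF
sf = np.exp(-lam * xs)         # exponential survival function
hazard = pdf / sf
print(hazard.min(), hazard.max())  # both equal lam = 0.5
```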

Exercise: Go back and increase the value of high, and confirm that the hazard function is a constant until we approach the point where we cut off the distribution.

Given the survival function, we can compute the distribution of remaining lifetime, conditioned on current age. The following function computes the mean remaining lifetime for a range of ages.



In [7]:

def RemainingLifetime(sf):
    """Computes remaining lifetime as a function of age.

    sf: survival function
    returns: Series that maps from age to remaining lifetime
    """
    pmf = sf.MakePmf()
    d = {}
    for t in sorted(pmf.Values()):
        pmf[t] = 0
        if pmf.Total():
            pmf.Normalize()
            d[t] = pmf.Mean() - t

    return pd.Series(d)

mean_rem_life = RemainingLifetime(sf)



And here's what it looks like for the exponential survival function.



In [8]:

thinkplot.Plot(mean_rem_life)






The mean time until a collision is pretty much constant, until we approach the point where we truncate the distribution.
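This constancy is the memorylessness of the exponential distribution: conditioned on surviving to age $t$, the remaining lifetime is still exponential with the same $\lambda$. A sampling check with numpy alone:

```python
import numpy as np

# Memorylessness check: among exponential lifetimes that exceed age t,
# the mean remaining lifetime is still 1/lam, regardless of t.
rng = np.random.default_rng(1)
lam = 0.5
sample = rng.exponential(scale=1/lam, size=200_000)
for t in [0.0, 1.0, 2.0]:
    rem = sample[sample > t] - t
    print(t, rem.mean())  # each close to 1/lam = 2
```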

## The Weibull distribution

The Weibull distribution is a generalization of the exponential distribution that takes an additional "shape" parameter, k.

When k=1, the Weibull is an exponential distribution. Other values of k yield survival curves with different shapes, and hazard functions that increase, decrease, or both. So the Weibull family can capture a wide range of survival patterns.
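The Weibull hazard function has the closed form $h(x) = (k/\lambda)(x/\lambda)^{k-1}$, so $k < 1$ gives a decreasing hazard, $k = 1$ a constant one, and $k > 1$ an increasing one. A small numpy-only sketch (not part of thinkbayes2):

```python
import numpy as np

def weibull_hazard(xs, lam, k):
    # h(x) = (k / lam) * (x / lam)**(k - 1)
    return (k / lam) * (xs / lam) ** (k - 1)

xs = np.linspace(0.5, 5, 10)
decreasing = weibull_hazard(xs, lam=2.0, k=0.5)  # k < 1
constant = weibull_hazard(xs, lam=2.0, k=1.0)    # k = 1: exponential
increasing = weibull_hazard(xs, lam=2.0, k=1.5)  # k > 1
```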



In [9]:

from thinkbayes2 import MakeWeibullPmf

pmf = MakeWeibullPmf(lam=2.0, k=1.5, high=30)
thinkplot.Pdf(pmf)






Exercise: In the previous section, replace the exponential distribution with a Weibull distribution and run the analysis again. What can you infer about the values of the parameters and the behavior of the hazard function and remaining lifetime?

## Bayesian survival analysis

Suppose you are the manager of a large building with many light fixtures. To figure out how often you will need to replace lightbulbs, you install 10 bulbs and measure the time until they fail.

To generate some fake data, I'll choose a Weibull distribution and generate a random sample (let's suppose it's in years):



In [10]:

def SampleWeibull(lam, k, n=1):
    return np.random.weibull(k, size=n) * lam

data = SampleWeibull(lam=2, k=1.5, n=10)
data




Out[10]:

array([ 1.07415385,  0.8178688 ,  2.66135791,  3.02946871,  3.89009206,
0.59634116,  0.87744983,  2.64212958,  0.63081272,  3.59077345])



Exercise: Write a class called LightBulb that inherits from Suite and provides a Likelihood function that takes an observed lifespan as data and a tuple, (lam, k), as a hypothesis. It should return a likelihood proportional to the probability of the observed lifespan in a Weibull distribution with the given parameters.

Test your method by creating a LightBulb object with an appropriate prior and updating it with the data above.

Plot the posterior distributions of lam and k. As the sample size increases, does the posterior distribution converge on the values of lam and k used to generate the sample?



In [11]:

# Hint

from thinkbayes2 import Suite, Joint, EvalWeibullPdf

class LightBulb(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1
        return like




In [12]:

# Solution

class LightBulb(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = EvalWeibullPdf(x, lam, k)
        return like




In [13]:

from itertools import product

lams = np.linspace(0.001, 6, 101)
ks = np.linspace(0.001, 8, 101)

suite = LightBulb(product(lams, ks))




In [14]:

suite.UpdateSet(data)


Out[14]:

5.7385301188039346e-09




In [15]:

thinkplot.Contour(suite)







In [16]:

pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()




Out[16]:

2.4625081728904044




In [17]:

pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()




Out[17]:

1.6608383071678676



Exercise: Go back and run this analysis again with n=20 and see if the posterior distributions seem to be converging on the actual parameters.

## Censored data

Exercise: Now suppose that instead of observing a complete lifespan, you observe a lightbulb that has operated for 1 year and is still working. Write another version of LightBulb that takes data in this form and performs an update.



In [18]:

# Hint

from thinkbayes2 import EvalWeibullCdf

class LightBulb2(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1
        return like




In [19]:

# Solution

class LightBulb2(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1 - EvalWeibullCdf(x, lam, k)
        return like




In [20]:

from itertools import product

lams = np.linspace(0.001, 10, 101)
ks = np.linspace(0.001, 10, 101)

suite = LightBulb2(product(lams, ks))




In [21]:

suite.Update(1)




Out[21]:

0.83584776184046183




In [22]:

thinkplot.Contour(suite)







In [23]:

pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()




Out[23]:

5.6163331229520157




In [24]:

pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()




Out[24]:

5.2480876231355253



Note: based on this data alone, we can rule out some small values of lam and k, but we can't rule out large values. Without more data or a more informative prior, the results are not useful.

To see why, try increasing the upper bounds in the prior distribution.
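To see why large parameter values survive the update, note that the likelihood of the censored observation is $SF(1) = e^{-(1/\lambda)^k}$, which approaches 1 as $\lambda$ grows; the likelihood plateaus, so no amount of prior range helps. A numpy-only sketch:

```python
import numpy as np

def weibull_sf(x, lam, k):
    # Weibull survival function: 1 - CDF
    return np.exp(-(x / lam) ** k)

# Likelihood of "still working after 1 year" for ever-larger lam:
for lam in [1, 5, 20, 100]:
    print(lam, weibull_sf(1, lam, 1.5))  # approaches 1
```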

## Left censored data

Exercise: Suppose you install a light bulb and then you don't check on it for a year, but when you come back, you find that it has burned out. Extend LightBulb to handle this kind of data, too.



In [25]:

# Hint

class LightBulb3(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1
        return like




In [26]:

# Solution

class LightBulb3(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = EvalWeibullCdf(x, lam, k)
        return like




In [27]:

from itertools import product

lams = np.linspace(0.001, 20, 101)
ks = np.linspace(0.001, 20, 101)

suite = LightBulb3(product(lams, ks))




In [28]:

suite.Update(1)




Out[28]:

0.08013753974381374




In [29]:

thinkplot.Contour(suite)







In [30]:

pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()




Out[30]:

2.8365008331664181




In [31]:

pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()




Out[31]:

7.3098570578312811



This example has some of the same problems as the previous one. Based on this data alone, we can't pin down the parameters much.

## Pulling it together

Exercise: Suppose you have 15 lightbulbs installed at different times over a 10 year period. When you observe them, some have died and some are still working. Write a version of LightBulb that takes data in the form of a (flag, x) tuple, where:

1. If flag is eq, it means that x is the actual lifespan of a bulb that has died.
2. If flag is gt, it means that x is the current age of a bulb that is still working, so it is a lower bound on the lifespan.
3. If flag is lt, it means that x is the elapsed time between installation and the first time the bulb is seen broken, so it is an upper bound on the lifespan.

First, I'll generate a Pandas DataFrame with random start times and lifespans. The columns are:

• start: time when the bulb was installed

• lifespan: lifespan of the bulb in years

• end: time when bulb died or will die

• age_t: age of the bulb at t=10



In [32]:

import pandas as pd

lam = 2
k = 1.5
n = 15
t_end = 10
starts = np.random.uniform(0, t_end, n)
lifespans = SampleWeibull(lam, k, n)

df = pd.DataFrame({'start': starts, 'lifespan': lifespans})
df['end'] = df.start + df.lifespan
df['age_t'] = t_end - df.start
df.head()




Out[32]:

   lifespan     start        end     age_t
0  1.388902  0.540694   1.929596  9.459306
1  2.193729  5.315099   7.508829  4.684901
2  3.388538  4.122471   7.511009  5.877529
3  4.825324  5.278081  10.103405  4.721919
4  1.531704  5.688559   7.220263  4.311441

Now I'll process the DataFrame to generate data in the form we want for the update.



In [33]:

data = []
for i, row in df.iterrows():
    if row.end < t_end:
        data.append(('eq', row.lifespan))
    else:
        data.append(('gt', row.age_t))

for pair in data:
    print(pair)




('eq', 1.3889018262027999)
('eq', 2.1937293645100646)
('eq', 3.3885380077362912)
('gt', 4.7219194448895783)
('eq', 1.5317042697961687)
('eq', 1.5558517330164294)
('eq', 2.9663661336026252)
('gt', 1.5041879932546003)
('eq', 3.682370914588053)
('gt', 1.0721506849858393)
('gt', 1.6194561275894479)
('eq', 1.1862252712721322)
('eq', 0.33329204152931641)
('eq', 0.47672090066923911)
('eq', 2.0003409069507132)




In [34]:

# Hint

class LightBulb4(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        flag, x = data
        like = 1
        return like




In [35]:

# Solution

class LightBulb4(Suite, Joint):

    def Likelihood(self, data, hypo):
        lam, k = hypo
        flag, x = data
        if flag == 'eq':
            like = EvalWeibullPdf(x, lam, k)
        elif flag == 'gt':
            like = 1 - EvalWeibullCdf(x, lam, k)
        elif flag == 'lt':
            like = EvalWeibullCdf(x, lam, k)
        else:
            raise ValueError('Invalid data')
        return like




In [36]:

from itertools import product

lams = np.linspace(0.001, 10, 101)
ks = np.linspace(0.001, 10, 101)

suite = LightBulb4(product(lams, ks))




In [37]:

suite.UpdateSet(data)


Out[37]:

2.2606259309970298e-11




In [38]:

thinkplot.Contour(suite)







In [39]:

pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()




Out[39]:

2.9495499859462395




In [40]:

pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()




Out[40]:

1.6230093351964052



## Prediction

Suppose we know that, for a particular kind of lightbulb in a particular location, the distribution of lifespans is well modeled by a Weibull distribution with lam=2 and k=1.5. If we install n=100 lightbulbs and come back one year later, what is the distribution of c, the number of lightbulbs that have burned out?

The probability that any given bulb has burned out comes from the CDF of the distribution.



In [41]:

lam = 2
k = 1.5
p = EvalWeibullCdf(1, lam, k)
p




Out[41]:

0.29781149867344037
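As a sanity check, the Weibull CDF has the closed form $1 - e^{-(x/\lambda)^k}$, so $p$ can be computed directly with numpy:

```python
import numpy as np

lam, k = 2, 1.5
x = 1
p = 1 - np.exp(-(x / lam) ** k)  # Weibull CDF evaluated at x = 1
print(p)  # ≈ 0.29781
```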



The number of bulbs that have burned out is distributed Binom(n, p).



In [42]:

from thinkbayes2 import MakeBinomialPmf

n = 100
pmf_c = MakeBinomialPmf(n, p)
thinkplot.Pdf(pmf_c)






Or we can approximate the distribution with a random sample.



In [43]:

n = 100
sample = np.random.binomial(n, p, 1000)
pdf_c = thinkbayes2.EstimatedPdf(sample)
thinkplot.Pdf(pdf_c)
np.mean(sample), np.std(sample)




Out[43]:

(29.751999999999999, 4.4304058504836776)
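The sample mean and standard deviation agree well with the binomial formulas $np$ and $\sqrt{np(1-p)}$:

```python
import numpy as np

n, p = 100, 0.29781149867344037
mean = n * p                    # ≈ 29.78
std = np.sqrt(n * p * (1 - p))  # ≈ 4.57
print(mean, std)
```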



Exercise: Now suppose that lam and k are not known precisely, but we have a LightBulb object that represents the joint posterior distribution of the parameters after seeing some data. Compute the posterior predictive distribution for c, the number of bulbs burned out after one year.



In [44]:

# Solution

n = 100
ns = []
params = suite.Sample(1000)

for lam, k in params:
    p = EvalWeibullCdf(1, lam, k)
    sample = np.random.binomial(n, p, 10)
    ns.extend(sample)

pdf_c_mix = thinkbayes2.EstimatedPdf(ns)
thinkplot.Pdf(pdf_c_mix)
np.mean(ns), np.std(ns)




Out[44]:

(18.332899999999999, 8.5569782978572526)




In [45]:

# Solution

# Alternatively, you can get a more precise solution by making
# a mixture of binomials.

from thinkbayes2 import Pmf

n = 100
t_return = 1

metapmf = Pmf()
for (lam, k), prob in suite.Items():
    p = EvalWeibullCdf(t_return, lam, k)
    pmf = MakeBinomialPmf(n, p)
    metapmf[pmf] = prob




In [46]:

# Solution

from thinkbayes2 import MakeMixture

mix = MakeMixture(metapmf)
thinkplot.Pdf(mix)







In [47]:

# Solution

mix.Mean(), mix.Std()




Out[47]:

(18.125005795937824, 8.634899225787988)



