Bayesian Survival Analysis

MIT License: https://opensource.org/licenses/MIT



In [1]:

    
from __future__ import print_function, division

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import thinkbayes2
import thinkplot

Survival analysis

Suppose that you are an auto insurance company interested in the time between collisions for a particular driver. If the probability of a collision is roughly constant over time, the time between collisions will follow an exponential distribution.

Here's an example with parameter $\lambda = 0.5$ collisions / year.



In [2]:

    
from thinkbayes2 import MakeExponentialPmf

pmf = MakeExponentialPmf(lam=0.5, high=30)
thinkplot.Pdf(pmf)
thinkplot.Config(xlabel='Lifetime', ylabel='PMF')

For the exponential distribution, the mean and standard deviation are $1/\lambda$.

In this case they are only approximate because we truncated the distribution.



In [3]:

    
pmf.Mean(), pmf.Std()









    Out[3]:





(1.9255614181751446, 1.9994621154028254)

From the PMF, we can compute the CDF.



In [4]:

    
cdf = pmf.MakeCdf()
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='Lifetime', ylabel='CDF')

And from the CDF, we can compute the survival function, which is the complement of the CDF.

$SF(x) = Prob\{X > x\} = 1 - Prob\{X \le x\} = 1 - CDF(x)$



In [5]:

    
from survival import MakeSurvivalFromCdf

sf = MakeSurvivalFromCdf(cdf)
thinkplot.Plot(sf)
thinkplot.Config(xlabel='Lifetime', ylabel='Survival function')

From the survival function we can get the hazard function, which is the probability of a collision at $x$, given no collision prior to $x$.



In [6]:

    
hf = sf.MakeHazardFunction()
thinkplot.Plot(hf)
thinkplot.Config(xlabel='Lifetime', ylabel='Hazard function')

If the distribution is truly exponential, the hazard function is constant for all $x$.

In this case it goes to 1 at the end, again because we truncated the distribution.

Exercise: Go back and increase the value of high, and confirm that the hazard function is a constant until we approach the point where we cut off the distribution.

Remaining lifetime

Given the survival function, we can compute the distribution of remaining lifetime, conditioned on current age. The following function computes the mean remaining lifetime for a range of ages.



In [7]:

    
def RemainingLifetime(sf):
    """Computes remaining lifetime as a function of age.
    
    sf: survival function
    returns: Series that maps from age to remaining lifetime
    """
    pmf = sf.MakePmf()
    d = {}
    for t in sorted(pmf.Values()):
        pmf[t] = 0
        if pmf.Total():
            pmf.Normalize()
            d[t] = pmf.Mean() - t

    return pd.Series(d)

And here's what it looks like for the exponential survival function.



In [8]:

    
mean_rem_life = RemainingLifetime(sf)
thinkplot.Plot(mean_rem_life)
thinkplot.Config(xlabel='Lifetime', ylabel='Survival function')

The mean time until a collision is pretty much constant, until we approach the point where we truncate the distribution.

The Weibull distribution

The Weibull distribution is a generalization of the exponential distribution that takes an additional "shape" parameter, k.

When k=1, the Weibull is an exponential distribution. Other values of k yield survival curves with different shapes, and hazard functions that increase, decrease, or both. So the Weibull family can capture a wide range of survival patterns.



In [9]:

    
from thinkbayes2 import MakeWeibullPmf

pmf = MakeWeibullPmf(lam=2.0, k=1.5, high=30)
thinkplot.Pdf(pmf)
thinkplot.Config(xlabel='Lifetime', ylabel='PMF')

Exercise: In the previous section, replace the exponential distribution with a Weibull distribituion and run the analysis again. What can you infer about the values of the parameters and the behavior of the hazard function and remaining lifetime?

Bayesian survival analysis

Suppose you are the manager of a large building with many light fixtures. To figure out how often you will need to replace lightbulbs, you install 10 bulbs and measure the time until they fail.

To generate some fake data, I'll choose a Weibull distribution and generate a random sample (let's suppose it's in years):



In [10]:

    
def SampleWeibull(lam, k, n=1):
    return np.random.weibull(k, size=n) * lam

data = SampleWeibull(lam=2, k=1.5, n=10)
data









    Out[10]:





array([ 0.80460326,  1.35802123,  1.3170894 ,  3.92508256,  1.23148026,
        0.94237378,  1.56099169,  3.21101094,  2.11097004,  1.06647873])

Exercise: Write a class called LightBulb that inherits from Suite and provides a Likelihood function that takes an observed lifespan as data and a tuple, (lam, k), as a hypothesis. It should return a likelihood proportional to the probability of the observed lifespan in a Weibull distribution with the given parameters.

Test your method by creating a LightBulb object with an appropriate prior and update it with the data above.

Plot the posterior distributions of lam and k. As the sample size increases, does the posterior distribution converge on the values of lam and k used to generate the sample?



In [11]:

    
# Hint

from thinkbayes2 import Suite, Joint, EvalWeibullPdf

class LightBulb(Suite, Joint):
    
    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1
        return like



In [12]:

    
# Solution goes here



In [13]:

    
from itertools import product

lams = np.linspace(0.001, 6, 101)
ks = np.linspace(0.001, 8, 101)

suite = LightBulb(product(lams, ks))



In [14]:

    
suite.UpdateSet(data)









    Out[14]:





0.9999999999998264



In [15]:

    
thinkplot.Contour(suite)



In [16]:

    
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()









    Out[16]:





3.0005000000005237



In [17]:

    
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()









    Out[17]:





4.0005000000006969

Exercise: Go back and run this analysis again with n=20 and see if the posterior distributions seem to be converging on the actual parameters.

Censored data

Exercise: Now suppose that instead of observing a complete lifespan, you observe a lightbulb that has operated for 1 year and is still working. Write another version of LightBulb that takes data in this form and performs an update.



In [18]:

    
# Hint

from thinkbayes2 import EvalWeibullCdf

class LightBulb2(Suite, Joint):
    
    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1
        return like



In [19]:

    
# Solution goes here



In [20]:

    
from itertools import product

lams = np.linspace(0.001, 10, 101)
ks = np.linspace(0.001, 10, 101)

suite = LightBulb2(product(lams, ks))



In [21]:

    
suite.Update(1)









    Out[21]:





0.9999999999998264



In [22]:

    
thinkplot.Contour(suite)



In [23]:

    
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()









    Out[23]:





5.0005000000008701



In [24]:

    
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()









    Out[24]:





5.0005000000008701

Note: based on this data alone, we can rule out some small values of lam and k, but we can't rule out large values. Without more data or a more informative prior, the results are not useful.

To see why, try increasing the upper bounds in the prior distribition.

Left censored data

Exercise: Suppose you install a light bulb and then you don't check on it for a year, but when you come back, you find that it has burned out. Extend LightBulb to handle this kind of data, too.



In [25]:

    
# Hint

class LightBulb3(Suite, Joint):
    
    def Likelihood(self, data, hypo):
        lam, k = hypo
        x = data
        like = 1
        return like



In [26]:

    
# Solution goes here



In [27]:

    
from itertools import product

lams = np.linspace(0.001, 20, 101)
ks = np.linspace(0.001, 20, 101)

suite = LightBulb3(product(lams, ks))



In [28]:

    
suite.Update(1)









    Out[28]:





0.9999999999998264



In [29]:

    
thinkplot.Contour(suite)



In [30]:

    
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()









    Out[30]:





10.000500000001749



In [31]:

    
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()









    Out[31]:





10.000500000001747

This example has some of the same problems as the previous one. Based on this data alone, we can't pin down the parameters much.

Pulling it together

Exercise: Suppose you have 15 lightbulbs installed at different times over a 10 year period. When you observe them, some have died and some are still working. Write a version of LightBulb that takes data in the form of a (flag, x) tuple, where:

If flag is eq, it means that x is the actual lifespan of a bulb that has died.
If flag is gt, it means that x is the current age of a bulb that is still working, so it is a lower bound on the lifespan.
If flag is lt, it means that x is the elapsed time between installation and the first time the bulb is seen broken, so it is an upper bound on the lifespan.

To help you test, I will generate some fake data.

First, I'll generate a Pandas DataFrame with random start times and lifespans. The columns are:

start: time when the bulb was installed
lifespan: lifespan of the bulb in years
end: time when bulb died or will die
age_t: age of the bulb at t=10



In [32]:

    
import pandas as pd

lam = 2
k = 1.5
n = 15
t_end = 10
starts = np.random.uniform(0, t_end, n)
lifespans = SampleWeibull(lam, k, n)

df = pd.DataFrame({'start': starts, 'lifespan': lifespans})
df['end'] = df.start + df.lifespan
df['age_t'] = t_end - df.start

df.head()

Now I'll process the DataFrame to generate data in the form we want for the update.



In [33]:

    
data = []
for i, row in df.iterrows():
    if row.end < t_end:
        data.append(('eq', row.lifespan))
    else:
        data.append(('gt', row.age_t))
        
for pair in data:
    print(pair)









    



('eq', 1.1759860510154005)
('gt', 0.29556020328479349)
('gt', 0.27910123141193566)
('eq', 1.1848797804175306)
('gt', 0.29001207938665274)
('gt', 1.3978634865036383)
('eq', 1.2334994412767566)
('eq', 0.16868843981156056)
('eq', 2.601843857679202)
('eq', 1.1254408135714249)
('eq', 0.51933656112111692)
('eq', 1.7603791482384576)
('eq', 1.2154630771067745)
('eq', 1.6544578964981635)
('eq', 0.6926668890434553)



In [34]:

    
# Hint

class LightBulb4(Suite, Joint):
    
    def Likelihood(self, data, hypo):
        lam, k = hypo
        flag, x = data
        like = 1
        return like



In [35]:

    
# Solution goes here



In [36]:

    
from itertools import product

lams = np.linspace(0.001, 10, 101)
ks = np.linspace(0.001, 10, 101)

suite = LightBulb4(product(lams, ks))



In [37]:

    
suite.UpdateSet(data)









    Out[37]:





0.9999999999998264



In [38]:

    
thinkplot.Contour(suite)



In [39]:

    
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()









    Out[39]:





5.0005000000008701



In [40]:

    
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()









    Out[40]:





5.0005000000008701

Prediction

Suppose we know that, for a particular kind of lightbulb in a particular location, the distribution of lifespans is well modeled by a Weibull distribution with lam=2 and k=1.5. If we install n=100 lightbulbs and come back one year later, what is the distribution of c, the number of lightbulbs that have burned out?

The probability that any given bulb has burned out comes from the CDF of the distribution.



In [41]:

    
lam = 2
k = 1.5
p = EvalWeibullCdf(1, lam, k)
p









    Out[41]:





0.29781149867344037

The number of bulbs that have burned out is distributed Binom(n, p).



In [42]:

    
from thinkbayes2 import MakeBinomialPmf

n = 100
pmf_c = MakeBinomialPmf(n, p)
thinkplot.Pdf(pmf_c)

Or we can approximate the distribution with a random sample.



In [43]:

    
n = 100
sample = np.random.binomial(n, p, 1000)
pdf_c = thinkbayes2.EstimatedPdf(sample)
thinkplot.Pdf(pdf_c)
np.mean(sample), np.std(sample)









    Out[43]:





(29.84, 4.7144883073351664)

Exercise: Now suppose that lam and k are not known precisely, but we have a LightBulb object that represents the joint posterior distribution of the parameters after seeing some data. Compute the posterior predictive distribution for c, the number of bulbs burned out after one year.



In [44]:

    
# Solution goes here



In [45]:

    
# Solution goes here



In [46]:

    
# Solution goes here



In [47]:

    
# Solution goes here



In [ ]:

	lifespan	start	end	age_t
0	1.175986	7.149446	8.325432	2.850554
1	1.187085	9.704440	10.891525	0.295560
2	3.673732	9.720899	13.394631	0.279101
3	1.184880	3.730789	4.915669	6.269211
4	2.401481	9.709988	12.111469	0.290012