Copyright 2017 Allen Downey
MIT License: https://opensource.org/licenses/MIT
In [1]:
from __future__ import print_function, division
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import thinkbayes2
import thinkplot
Suppose that you are an auto insurance company interested in the time between collisions for a particular driver. If the probability of a collision is roughly constant over time, the time between collisions will follow an exponential distribution.
Here's an example with parameter $\lambda = 0.5$ collisions / year.
In [2]:
from thinkbayes2 import MakeExponentialPmf
pmf = MakeExponentialPmf(lam=0.5, high=30)
thinkplot.Pdf(pmf)
thinkplot.Config(xlabel='Lifetime', ylabel='PMF')
For the exponential distribution, the mean and standard deviation are $1/\lambda$.
In this case they are only approximate because we truncated the distribution.
In [3]:
pmf.Mean(), pmf.Std()
Out[3]:
From the PMF, we can compute the CDF.
In [4]:
cdf = pmf.MakeCdf()
thinkplot.Cdf(cdf)
thinkplot.Config(xlabel='Lifetime', ylabel='CDF')
And from the CDF, we can compute the survival function, which is the complement of the CDF.
$SF(x) = Prob\{X > x\} = 1 - Prob\{X \le x\} = 1 - CDF(x)$
In [5]:
from survival import MakeSurvivalFromCdf
sf = MakeSurvivalFromCdf(cdf)
thinkplot.Plot(sf)
thinkplot.Config(xlabel='Lifetime', ylabel='Survival function')
From the survival function we can get the hazard function, which is the probability of a collision at $x$, given no collision prior to $x$.
In [6]:
hf = sf.MakeHazardFunction()
thinkplot.Plot(hf)
thinkplot.Config(xlabel='Lifetime', ylabel='Hazard function')
If the distribution is truly exponential, the hazard function is constant for all $x$.
In this case it goes to 1 at the end, again because we truncated the distribution.
Exercise: Go back and increase the value of high
, and confirm that the hazard function is a constant until we approach the point where we cut off the distribution.
In [7]:
def RemainingLifetime(sf):
"""Computes remaining lifetime as a function of age.
sf: survival function
returns: Series that maps from age to remaining lifetime
"""
pmf = sf.MakePmf()
d = {}
for t in sorted(pmf.Values()):
pmf[t] = 0
if pmf.Total():
pmf.Normalize()
d[t] = pmf.Mean() - t
return pd.Series(d)
And here's what it looks like for the exponential survival function.
In [8]:
mean_rem_life = RemainingLifetime(sf)
thinkplot.Plot(mean_rem_life)
thinkplot.Config(xlabel='Lifetime', ylabel='Survival function')
The mean time until a collision is pretty much constant, until we approach the point where we truncate the distribution.
The Weibull distribution is a generalization of the exponential distribution that takes an additional "shape" parameter, k
.
When k=1
, the Weibull is an exponential distribution. Other values of k
yield survival curves with different shapes, and hazard functions that increase, decrease, or both. So the Weibull family can capture a wide range of survival patterns.
In [9]:
from thinkbayes2 import MakeWeibullPmf
pmf = MakeWeibullPmf(lam=2.0, k=1.5, high=30)
thinkplot.Pdf(pmf)
thinkplot.Config(xlabel='Lifetime', ylabel='PMF')
Exercise: In the previous section, replace the exponential distribution with a Weibull distribituion and run the analysis again. What can you infer about the values of the parameters and the behavior of the hazard function and remaining lifetime?
Suppose you are the manager of a large building with many light fixtures. To figure out how often you will need to replace lightbulbs, you install 10 bulbs and measure the time until they fail.
To generate some fake data, I'll choose a Weibull distribution and generate a random sample (let's suppose it's in years):
In [10]:
def SampleWeibull(lam, k, n=1):
return np.random.weibull(k, size=n) * lam
data = SampleWeibull(lam=2, k=1.5, n=10)
data
Out[10]:
Exercise: Write a class called LightBulb
that inherits from Suite
and provides a Likelihood
function that takes an observed lifespan as data and a tuple, (lam, k)
, as a hypothesis. It should return a likelihood proportional to the probability of the observed lifespan in a Weibull distribution with the given parameters.
Test your method by creating a LightBulb
object with an appropriate prior and update it with the data above.
Plot the posterior distributions of lam
and k
. As the sample size increases, does the posterior distribution converge on the values of lam
and k
used to generate the sample?
In [11]:
# Hint
from thinkbayes2 import Suite, Joint, EvalWeibullPdf
class LightBulb(Suite, Joint):
def Likelihood(self, data, hypo):
lam, k = hypo
x = data
like = 1
return like
In [12]:
# Solution goes here
In [13]:
from itertools import product
lams = np.linspace(0.001, 6, 101)
ks = np.linspace(0.001, 8, 101)
suite = LightBulb(product(lams, ks))
In [14]:
suite.UpdateSet(data)
Out[14]:
In [15]:
thinkplot.Contour(suite)
In [16]:
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()
Out[16]:
In [17]:
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()
Out[17]:
Exercise: Go back and run this analysis again with n=20
and see if the posterior distributions seem to be converging on the actual parameters.
Exercise: Now suppose that instead of observing a complete lifespan, you observe a lightbulb that has operated for 1 year and is still working. Write another version of LightBulb
that takes data in this form and performs an update.
In [18]:
# Hint
from thinkbayes2 import EvalWeibullCdf
class LightBulb2(Suite, Joint):
def Likelihood(self, data, hypo):
lam, k = hypo
x = data
like = 1
return like
In [19]:
# Solution goes here
In [20]:
from itertools import product
lams = np.linspace(0.001, 10, 101)
ks = np.linspace(0.001, 10, 101)
suite = LightBulb2(product(lams, ks))
In [21]:
suite.Update(1)
Out[21]:
In [22]:
thinkplot.Contour(suite)
In [23]:
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()
Out[23]:
In [24]:
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()
Out[24]:
Note: based on this data alone, we can rule out some small values of lam
and k
, but we can't rule out large values. Without more data or a more informative prior, the results are not useful.
To see why, try increasing the upper bounds in the prior distribition.
In [25]:
# Hint
class LightBulb3(Suite, Joint):
def Likelihood(self, data, hypo):
lam, k = hypo
x = data
like = 1
return like
In [26]:
# Solution goes here
In [27]:
from itertools import product
lams = np.linspace(0.001, 20, 101)
ks = np.linspace(0.001, 20, 101)
suite = LightBulb3(product(lams, ks))
In [28]:
suite.Update(1)
Out[28]:
In [29]:
thinkplot.Contour(suite)
In [30]:
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()
Out[30]:
In [31]:
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()
Out[31]:
This example has some of the same problems as the previous one. Based on this data alone, we can't pin down the parameters much.
Exercise: Suppose you have 15 lightbulbs installed at different times over a 10 year period. When you observe them, some have died and some are still working. Write a version of LightBulb
that takes data in the form of a (flag, x)
tuple, where:
flag
is eq
, it means that x
is the actual lifespan of a bulb that has died.flag
is gt
, it means that x
is the current age of a bulb that is still working, so it is a lower bound on the lifespan.flag
is lt
, it means that x
is the elapsed time between installation and the first time the bulb is seen broken, so it is an upper bound on the lifespan.To help you test, I will generate some fake data.
First, I'll generate a Pandas DataFrame with random start times and lifespans. The columns are:
start
: time when the bulb was installed
lifespan
: lifespan of the bulb in years
end
: time when bulb died or will die
age_t
: age of the bulb at t=10
In [32]:
import pandas as pd
lam = 2
k = 1.5
n = 15
t_end = 10
starts = np.random.uniform(0, t_end, n)
lifespans = SampleWeibull(lam, k, n)
df = pd.DataFrame({'start': starts, 'lifespan': lifespans})
df['end'] = df.start + df.lifespan
df['age_t'] = t_end - df.start
df.head()
Out[32]:
Now I'll process the DataFrame to generate data in the form we want for the update.
In [33]:
data = []
for i, row in df.iterrows():
if row.end < t_end:
data.append(('eq', row.lifespan))
else:
data.append(('gt', row.age_t))
for pair in data:
print(pair)
In [34]:
# Hint
class LightBulb4(Suite, Joint):
def Likelihood(self, data, hypo):
lam, k = hypo
flag, x = data
like = 1
return like
In [35]:
# Solution goes here
In [36]:
from itertools import product
lams = np.linspace(0.001, 10, 101)
ks = np.linspace(0.001, 10, 101)
suite = LightBulb4(product(lams, ks))
In [37]:
suite.UpdateSet(data)
Out[37]:
In [38]:
thinkplot.Contour(suite)
In [39]:
pmf_lam = suite.Marginal(0)
thinkplot.Pdf(pmf_lam)
pmf_lam.Mean()
Out[39]:
In [40]:
pmf_k = suite.Marginal(1)
thinkplot.Pdf(pmf_k)
pmf_k.Mean()
Out[40]:
Suppose we know that, for a particular kind of lightbulb in a particular location, the distribution of lifespans is well modeled by a Weibull distribution with lam=2
and k=1.5
. If we install n=100
lightbulbs and come back one year later, what is the distribution of c
, the number of lightbulbs that have burned out?
The probability that any given bulb has burned out comes from the CDF of the distribution.
In [41]:
lam = 2
k = 1.5
p = EvalWeibullCdf(1, lam, k)
p
Out[41]:
The number of bulbs that have burned out is distributed Binom(n, p).
In [42]:
from thinkbayes2 import MakeBinomialPmf
n = 100
pmf_c = MakeBinomialPmf(n, p)
thinkplot.Pdf(pmf_c)
Or we can approximate the distribution with a random sample.
In [43]:
n = 100
sample = np.random.binomial(n, p, 1000)
pdf_c = thinkbayes2.EstimatedPdf(sample)
thinkplot.Pdf(pdf_c)
np.mean(sample), np.std(sample)
Out[43]:
Exercise: Now suppose that lam
and k
are not known precisely, but we have a LightBulb
object that represents the joint posterior distribution of the parameters after seeing some data. Compute the posterior predictive distribution for c
, the number of bulbs burned out after one year.
In [44]:
# Solution goes here
In [45]:
# Solution goes here
In [46]:
# Solution goes here
In [47]:
# Solution goes here
In [ ]: