Suppose we generate data from a mixture of exponentials with no censoring (the approach extends to censored data, but none occurs in our application). Furthermore, the data are not observed directly: our instrument has binned them into integer buckets. For example, if the actual value is 4.3, the instrument's resolution is too coarse to record the decimal part, so 4.3 goes into the "4.x" bucket.

This model is easy to handle in lifelines. Instead of worrying about the binning, we can treat the system as an interval-censoring problem. That is, if an observation landed in the $i$th bin, then we know the true data point occurred somewhere in $[i, i+1)$, between the lower edge of the $i$th bin and the lower edge of the $(i+1)$th bin. This is precisely interval censoring.
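To make the reduction concrete, here is a sketch of the likelihood it implies. The symbols $n_i$ (the number of observations that landed in bin $i$) and $\ell$ (the log-likelihood) are notation introduced here for illustration, and $S$ denotes the survival function of the underlying continuous duration $T$:

$$P(i \le T < i+1) = S(i) - S(i+1), \qquad \ell = \sum_{i} n_i \log\left(S(i) - S(i+1)\right)$$

Each bin contributes only through the drop in survival across it, so the exact position of a point inside its bin never enters the fit.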

We can use lifelines' custom model semantics to create the mixture model as well. The true survival function is:

$$S(t) = p \exp\left(-\frac{t}{\lambda_1}\right) + (1-p)\exp\left(-\frac{t}{\lambda_2}\right)$$

Therefore the cumulative hazard is:

$$H(t) = -\log(S(t)) = -\log\left(p \exp\left(-\frac{t}{\lambda_1}\right) + (1-p)\exp\left(-\frac{t}{\lambda_2}\right)\right) $$
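As a quick sanity check on these two formulas, the sketch below evaluates them with plain NumPy. The function names `survival` and `cumulative_hazard` and the parameter values (20, 40, 0.5) are illustrative assumptions, not part of lifelines.

```python
import numpy as np

def survival(t, lambda1, lambda2, p):
    """S(t) for a two-component exponential mixture (scale parameterization)."""
    t = np.asarray(t, dtype=float)
    return p * np.exp(-t / lambda1) + (1 - p) * np.exp(-t / lambda2)

def cumulative_hazard(t, lambda1, lambda2, p):
    """H(t) = -log(S(t))."""
    return -np.log(survival(t, lambda1, lambda2, p))

# Illustrative parameter values, chosen only for this demonstration.
t = np.arange(0, 6)
S = survival(t, 20.0, 40.0, 0.5)
H = cumulative_hazard(t, 20.0, 40.0, 0.5)

# Probability mass of each integer bin [i, i+1): the drop in survival across it.
bin_probs = S[:-1] - S[1:]
print(H)
print(bin_probs)
```

The cumulative hazard starts at zero and increases with `t`, and the bin probabilities are exactly the quantities an interval-censored likelihood is built from.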



In [10]:

    
from numpy.random import exponential
from lifelines.fitters import ParametricUnivariateFitter
import autograd.numpy as np
np.random.seed(10)



In [6]:

    
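# 10,000 draws with mean 20 and 500 draws with mean 40
T = np.array([exponential(20) for _ in range(10000)] + [exponential(40) for _ in range(500)])
# the instrument only records the integer part, so floor each duration before counting
counts_obs = np.bincount(np.floor(T).astype(int))
# left edge of each integer bucket: 0, 1, ..., largest observed bucket
T_obs = np.arange(len(counts_obs))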



In [19]:

    
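from matplotlib import pyplot as plt

plt.figure(figsize=(13,5))
# weight each integer bucket by its count; passing counts_obs as the data would histogram the counts themselves
plt.hist(T_obs, bins=np.arange(len(T_obs) + 1), weights=counts_obs, density=True)
plt.title("Histogram of observed durations")
plt.xlim(0, 100)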









    Out[19]:





(0, 100)



In [8]:

    
# To help the model avoid the "non-identifiability" problem, we can set the *upper* bound of the first lambda to be
# the average of the observed data, and the *lower* bound of the second lambda to the same value. Why?
# We'd like to partition the positive reals into two halves, each containing one of the lambdas. The mean
# of the mixture is v = p * lambda1 + (1-p) * lambda2, which has the property lambda1 < v < lambda2 for 0 < p < 1,
# so the sample mean, which estimates v, should partition the space correctly.
mean_obs = np.average(T_obs, weights=counts_obs)
print(mean_obs)









    



20.015428571428572
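The partition argument can be checked numerically. The following is a minimal sketch on freshly simulated data, assuming illustrative values (20, 40, 0.5) and a separate random generator; it does not reproduce the numbers above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values, assumed for this check only.
lambda1, lambda2, p = 20.0, 40.0, 0.5
n = 200_000

# Draw each duration from component 1 with probability p, else from component 2.
from_first = rng.random(n) < p
T = np.where(from_first, rng.exponential(lambda1, n), rng.exponential(lambda2, n))

mixture_mean = p * lambda1 + (1 - p) * lambda2  # theoretical mean of the mixture
binned_mean = np.floor(T).mean()                # mean of the integer-binned data

print(mixture_mean, T.mean(), binned_mean)
```

Flooring pulls the observed mean down by roughly half a bin, so the binned mean sits slightly below the mixture mean. When `p` is close to 1 the mixture mean is already close to `lambda1`, and the bound can end up very near the true value of the first parameter.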



In [9]:

    
class MixtureExponential(ParametricUnivariateFitter):

    _fitted_parameter_names = ['lambda1', 'lambda2', 'p']
    _bounds = [(0, mean_obs), (mean_obs, None), (0, 1)]

    def _cumulative_hazard(self, params, t):
        l1, l2, p = params
        return -np.log(p * np.exp(-t / l1) + (1-p) * np.exp(-t / l2))


model = MixtureExponential()
model.fit_interval_censoring(
    lower_bound=T_obs, 
    upper_bound=T_obs + 1, 
    weights=counts_obs, 
    initial_point=np.array([mean_obs / 2, mean_obs * 2, 0.5])
)

model.print_summary()









    







  
    
    model                       lifelines.MixtureExponential
    number of observations      10500
    number of events observed   0
    log-likelihood              -42208.35
    hypothesis                  lambda1 != 10.0077, lambda2 != 40.0309, p != 0.5

              coef  se(coef)  coef lower 95%  coef upper 95%      z       p  -log2(p)
    lambda1  19.00      1.01           17.01           20.98   8.88  <0.005     60.36
    lambda2  36.46     10.32           16.24           56.68  -0.35    0.73      0.46
    p         0.91      0.10            0.71            1.11   4.06  <0.005     14.30
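The point estimates in the summary can be turned back into per-bin probabilities, which is one way to compare the fit against the observed counts. The sketch below hard-codes the rounded estimates from the summary (19.00, 36.46, 0.91) and the sample size 10500; the helper `fitted_survival` and the 0 to 100 range are assumptions made for illustration.

```python
import numpy as np

# Rounded point estimates copied from the summary table above.
lambda1_hat, lambda2_hat, p_hat = 19.00, 36.46, 0.91
n_obs = 10500

def fitted_survival(t):
    """Mixture survival function evaluated at the fitted parameters."""
    t = np.asarray(t, dtype=float)
    return p_hat * np.exp(-t / lambda1_hat) + (1 - p_hat) * np.exp(-t / lambda2_hat)

# Integer bin edges 0, 1, ..., 100; bin i covers [i, i+1).
edges = np.arange(0, 101)
bin_probs = fitted_survival(edges[:-1]) - fitted_survival(edges[1:])

# Expected number of observations per bin under the fitted model.
expected_counts = n_obs * bin_probs

# Mean duration implied by the fitted mixture.
fitted_mean = p_hat * lambda1_hat + (1 - p_hat) * lambda2_hat
print(expected_counts[:5])
print(fitted_mean)
```

Plotting `expected_counts` over the observed bucket counts gives a visual goodness-of-fit check that only uses quantities the interval-censored likelihood actually sees.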



In [17]:

    
model.plot_cumulative_hazard(figsize=(8,5))









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x11d8f9390>


