## Time-lagged conversion rates and cure models

Suppose in our population we have a subpopulation that will never experience the event of interest. Or, for some subjects the event will occur so far in the future that it's essentially at time infinity. The survival function should not asymptically approach zero, but some positive value. Models that describe this are sometimes called cure models or time-lagged conversion models.

There's a serious fault in using parametric models for these types of problems that non-parametric models don't have. The most common parametric models like Weibull, Log-Normal, etc. all have strictly increasing cumulative hazard functions, which means the corresponding survival function will always converge to 0.

Let's look at an example of this problem. Below I generated some data that has individuals who will not experience the event, not matter how long we wait.



In [1]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from matplotlib import pyplot as plt
import pandas as pd
plt.style.use('seaborn-colorblind')




In [2]:

N = 200
U = np.random.rand(N)
T = -(logit(-np.log(U) / 0.5) - np.random.exponential(2, N) - 6.00) / 0.50

E = ~np.isnan(T)
T[np.isnan(T)] = 50




In [3]:

from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter().fit(T, E)
kmf.plot(figsize=(8,4))
plt.ylim(0, 1);






It should be clear that there is an asymptote at around 0.6. The non-parametric model will always show this. If this is true, then the cumulative hazard function should have a horizontal asymptote as well. Let's use the Nelson-Aalen model to see this.



In [4]:

from lifelines import NelsonAalenFitter

naf = NelsonAalenFitter().fit(T, E)
naf.plot(figsize=(8,4))




Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x11b8b9b50>



However, when we try a parametric model, we will see that it won't extrapolate very well. Let's use the flexible two parameter LogLogisticFitter model.



In [5]:

from lifelines import LogLogisticFitter

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

t = np.linspace(0, 40)
llf = LogLogisticFitter().fit(T, E, timeline=t)

llf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])

llf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])

t = np.linspace(0, 100)
llf = LogLogisticFitter().fit(T, E, timeline=t)

llf.plot_survival_function(ax=ax[1][0])
kmf.plot(ax=ax[1][0])

llf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])




Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x108136890>



The LogLogistic model does a quite terrible job of capturing the not only the asymptotes, but also the intermediate values as well. If we extended the survival function out further, we would see that it eventually nears 0.

### Custom parametric models to handle asymptotes

Focusing on modeling the cumulative hazard function, what we would like is a function that increases up to a limit and then tapers off to an asymptote. We can think long and hard about these (I did), but there's a family of functions that have this property that we are very familiar with: cumulative distribution functions! By their nature, they will asympotically approach 1. And, they are readily available in the SciPy and autograd libraries. So our new model of the cumulative hazard function is:

$$H(t; c, \theta) = c F(t; \theta)$$

where $c$ is the (unknown) horizontal asymptote, and $\theta$ is a vector of (unknown) parameters for the CDF, $F$.

We can create a custom cumulative hazard model using ParametricUnivariateFitter (for a tutorial on how to create custom models, see this here). Let's choose the Normal CDF for our model.

Remember we must use the imports from autograd for this, i.e. from autograd.scipy.stats import norm.



In [6]:

from lifelines.fitters import ParametricUnivariateFitter

class UpperAsymptoteFitter(ParametricUnivariateFitter):

_fitted_parameter_names = ["c_", "mu_", "sigma_"]

_bounds = ((0, None), (None, None), (0, None))

def _cumulative_hazard(self, params, times):
c, mu, sigma = params
return c * norm.cdf((times - mu) / sigma, loc=0, scale=1)




In [7]:

uaf = UpperAsymptoteFitter().fit(T, E)
uaf.print_summary(3)
uaf.plot(figsize=(8,4))




.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

model
lifelines.UpperAsymptoteFitter

number of observations
200

number of events observed
78

log-likelihood
-357.204

hypothesis
c_ != 1, mu_ != 0, sigma_ != 1

coef
se(coef)
coef lower 95%
coef upper 95%
z
p
-log2(p)

c_
0.494
0.056
0.383
0.604
-8.973
<0.0005
61.592

mu_
16.233
0.490
15.273
17.193
33.150
<0.0005
798.089

sigma_
4.260
0.343
3.587
4.933
9.498
<0.0005
68.659

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x11e29a490>



We get a lovely asympotical cumulative hazard. The summary table suggests that the best value of $c$ is 0.586. This can be translated into the survival function asymptote by $\exp(-0.586) \approx 0.56$.

Let's compare this fit to the non-parametric models.



In [8]:

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

t = np.linspace(0, 40)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)

uaf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])

uaf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])

t = np.linspace(0, 100)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)
uaf.plot_survival_function(ax=ax[1][0])
kmf.survival_function_.plot(ax=ax[1][0])

uaf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])




Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x108307090>



I wasn't expect this good of a fit. But there it is. This was some artificial data, but let's try this technique on some real life data.



In [9]:

uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()

kmf.fit(T, E).plot(ax=ax, ci_show=False)
print("---")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
np.exp(-uaf.summary.loc['c_', 'coef']),
np.exp(-uaf.summary.loc['c_', 'coef upper 95%']),
np.exp(-uaf.summary.loc['c_', 'coef lower 95%']),
)
)




.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

model
lifelines.UpperAsymptoteFitter

number of observations
42

number of events observed
30

log-likelihood
-118.60

hypothesis
c_ != 1, mu_ != 0, sigma_ != 1

coef
se(coef)
coef lower 95%
coef upper 95%
z
p
-log2(p)

c_
1.63
0.36
0.94
2.33
1.78
0.07
3.75

mu_
13.44
1.73
10.06
16.82
7.79
<0.005
47.07

sigma_
7.03
1.07
4.94
9.12
5.65
<0.005
25.91

---
Estimated lower bound: 0.20 (0.10, 0.39)



So we might expect that about 20% will not have the even in this population (but make note of the large CI bounds too!)



In [10]:

# Another, less obvious, dataset.

uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()

kmf.fit(T, E).plot(ax=ax)
print("---")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
np.exp(-uaf.summary.loc['c_', 'coef']),
np.exp(-uaf.summary.loc['c_', 'coef upper 95%']),
np.exp(-uaf.summary.loc['c_', 'coef lower 95%']),
)
)




.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

model
lifelines.UpperAsymptoteFitter

number of observations
863

number of events observed
140

log-likelihood
-1458.88

hypothesis
c_ != 1, mu_ != 0, sigma_ != 1

coef
se(coef)
coef lower 95%
coef upper 95%
z
p
-log2(p)

c_
0.29
0.03
0.24
0.35
-24.37
<0.005
433.49

mu_
1139.88
101.60
940.75
1339.01
11.22
<0.005
94.62

sigma_
872.59
66.31
742.62
1002.56
13.14
<0.005
128.67

---
Estimated lower bound: 0.75 (0.70, 0.79)



#### Using alternative functional forms

An even simplier model might look like $c \left(1 - \frac{1}{\lambda t + 1}\right)$, however this model cannot handle any "inflection points" like our artificial dataset has above. However, it works well for this Lung dataset.

With all cure models, one important feature is the ability to extrapolate to unforseen times.



In [11]:

from lifelines.fitters import ParametricUnivariateFitter

class SimpleUpperAsymptoteFitter(ParametricUnivariateFitter):

_fitted_parameter_names = ["c_", "lambda_"]

_bounds = ((0, None), (0, None))

def _cumulative_hazard(self, params, times):
c, lambda_ = params
return c * (1 -  1 /(lambda_ * times + 1))




In [12]:

# Another, less obvious, dataset.

saf = SimpleUpperAsymptoteFitter().fit(T, E, timeline=np.arange(1, 10000))
ax = saf.plot_survival_function(figsize=(8,4))
saf.print_summary(4)

kmf.fit(T, E).plot(ax=ax)
print("---")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
np.exp(-saf.summary.loc['c_', 'coef']),
np.exp(-saf.summary.loc['c_', 'coef upper 95%']),
np.exp(-saf.summary.loc['c_', 'coef lower 95%']),
)
)




.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

model
lifelines.SimpleUpperAsymptoteFitter

number of observations
863

number of events observed
140

log-likelihood
-1392.1610

hypothesis
c_ != 1, lambda_ != 1

coef
se(coef)
coef lower 95%
coef upper 95%
z
p
-log2(p)

c_
0.4252
0.0717
0.2847
0.5658
-8.0151
<5e-05
49.6912

lambda_
0.0006
0.0002
0.0003
0.0009
-5982.3551
<5e-05
inf

---
Estimated lower bound: 0.65 (0.57, 0.75)