Fitting regularized censored regression models for Weibull or Exponential distributions

$$ \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bX}{\boldsymbol{X}} \newcommand{\bZ}{\boldsymbol{Z}} \newcommand{\bx}{\boldsymbol{x}} \newcommand{\bz}{\boldsymbol{z}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bt}{\boldsymbol{t}} \newcommand{\biota}{\boldsymbol{\iota}} \newcommand{\bdelta}{\boldsymbol{\delta}} $$

There are many situations in which the observed data is censored, meaning that its value is only partially known. For example, income data published by national statistics agencies is often top-coded at some upper bound (greater than \$500K, for example) in order to preserve the privacy of high-income households. In the biomedical domain in which I work, certain biological measurements are not measurable if they fall outside some detectable range. In both cases, the recorded value is partially known because we know the measurement was at least this large, also known as right-censoring. The canonical case of right-censoring is survival time, in which a patient is known to have lived for at least this long. When a measurement is known to be at most some value, this is known as left-censoring.

In this post I will show how to build a regularized linear model from scratch in python to fit right-censored data with an explicit likelihood model. The data challenge I have in mind is that of missing-value imputation for censored data; namely, we want to predict the expected value of a censored measurement.
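To make the setup concrete, here is a minimal sketch (not from the original post) of the data format assumed in the code snippets that follow: a design matrix with a leading intercept column, observed times, and censoring indicators (1 if the event was observed, 0 if right-censored). All variable names are illustrative.

```python
import numpy as np

# Illustrative right-censored data: X has an intercept column, t are the
# observed times, and delta are the censoring indicators (1 = event observed,
# 0 = right-censored).
rng = np.random.default_rng(1)
n, p = 100, 5
X = np.c_[np.ones(n), rng.normal(size=(n, p))]          # intercept + covariates
beta_true = np.r_[0.5, rng.normal(scale=0.5, size=p)]
lam = np.exp(X @ beta_true)                              # rate lambda_i = exp(x_i' beta)
t_event = rng.exponential(scale=1 / lam)                 # latent event times
t_cens = rng.exponential(scale=1 / lam.mean(), size=n)   # independent censoring times
t = np.minimum(t_event, t_cens)                          # what we actually observe
delta = (t_event <= t_cens).astype(float)                # 1 if the event was observed
```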

Exponential distribution

Recall that the density, survival, and hazard functions of an exponential random variable can be written as: $f(t_i;\lambda)=\lambda \exp(-\lambda t_i)$, $S(t_i;\lambda)=\exp(-\lambda t_i)$, and $h(t_i; \lambda) = \lambda$. For a sample of $n$ i.i.d. observations with censoring indicators $\delta_i$ ($\delta_i = 1$ if the event was observed, $\delta_i = 0$ if right-censored), the likelihood can be written as:

$$ \begin{align*} L(\bbeta) &= \prod_{i=1}^n \hspace{1mm} f(\bbeta;t_i)^{\delta_i} S(\bbeta;t_i)^{1-\delta_i} \\ &= \prod_{i=1}^n \hspace{1mm} h(\bbeta; t_i)^{\delta_i} S(\bbeta;t_i) \\ \end{align*} $$

Hence the log-likelihood is

$$ \begin{align*} \ell(\bbeta) &= \sum_{i=1}^n \big[ \delta_i \log h(\bbeta; t_i) + \log S(\bbeta;t_i) \big] \end{align*} $$

As we have parameterized $h(\bbeta; t_i) = \lambda_i = \exp(\bx_i^T \bbeta)$, where $\bx_i^T \bbeta = \sum_{j=0}^p \beta_j x_{ij}$, and $S(\bbeta; t_i) = \exp(- \lambda_i t_i)= \exp(-\exp(\bx_i^T \bbeta)\, t_i)$, we can define the log-likelihood as:

$$ \begin{align*} \ell(\bbeta) &= \sum_{i=1}^n \big[ \delta_i (\bx_i^T \bbeta ) - \exp(\bx_i^T \bbeta ) t_i \big] \\ \frac{\partial \ell(\bbeta)}{\partial \beta_j} &= \sum_{i=1}^n \big[ \delta_i x_{ij} - z_i x_{ij} \big] = \bX_j^T(\bdelta - \bz) , \hspace{3mm} z_i = \exp(\bx_i^T \bbeta) t_i \\ \frac{\partial ^2 \ell(\bbeta)}{\partial \beta_j \partial \beta_k} &= - \sum_{i=1}^n x_{ij}x_{ik} z_i \end{align*} $$

While the partial derivatives above were derived for the log-likelihood, since we will be using function minimizers, the vectorized gradient and Hessian below are given for the negative log-likelihood (simply a sign change):

$$ \begin{align*} \nabla_\bbeta &= - \bX^T (\bdelta - \bz) \\ H_\bbeta &= \bX^T \bZ \bX, \hspace{2mm} \text{diag}(\bZ)_i = [\exp(\bx_i^T \bbeta) t_i ] \end{align*} $$
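As a sanity check, a minimal NumPy sketch of these three quantities might look like the following (function and variable names are my own, not part of any library); they could, for example, be handed to `scipy.optimize.minimize`.

```python
import numpy as np

def exp_nll(beta, X, t, delta):
    """Negative log-likelihood of the right-censored exponential model."""
    eta = X @ beta                        # x_i' beta = log(lambda_i)
    return -np.sum(delta * eta - np.exp(eta) * t)

def exp_grad(beta, X, t, delta):
    """Gradient of the negative log-likelihood: -X'(delta - z)."""
    z = np.exp(X @ beta) * t
    return -X.T @ (delta - z)

def exp_hess(beta, X, t, delta):
    """Hessian of the negative log-likelihood: X' Z X with Z = diag(z)."""
    z = np.exp(X @ beta) * t
    return (X * z[:, None]).T @ X
```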

Iteratively re-weighted least squares

The Newton-step update at iteration $t+1$ can be written as:

$$ \begin{align*} \bbeta^{(t+1)} &\gets \bbeta^{(t)} - (H_{\bbeta^{(t)}})^{-1} \nabla_{\bbeta^{(t)}} \\ &\gets \bbeta^{(t)} + (\bX^T \bZ^{(t)} \bX )^{-1} \bX^T (\bdelta - \bz^{(t)}) \\ &\gets (\bX^T \bZ^{(t)} \bX )^{-1} \bX^T \bZ^{(t)} \by^{(t)}, \hspace{2mm} \by^{(t)} = \bX \bbeta^{(t)} + (\bZ^{(t)})^{-1} (\bdelta - \bz^{(t)}) \end{align*} $$

The last line reveals that the Newton update for the Exponential likelihood model is equivalent to a weighted least-squares problem.

$$ \begin{align*} \bbeta^{(t+1)} &= \arg\min_{\bbeta} \hspace{2mm} (\by^{(t)} - \bX \bbeta)^T \bZ^{(t)} (\by^{(t)} - \bX \bbeta) \end{align*} $$
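A single unpenalized Newton step can therefore be coded as an ordinary least-squares solve on a $\sqrt{\bZ}$-rescaled design matrix and working response. A rough sketch (reusing the illustrative names from the snippets above):

```python
import numpy as np

def irls_update(beta, X, t, delta):
    """One unpenalized Newton step, written as a weighted least-squares solve."""
    z = np.exp(X @ beta) * t                  # working weights z_i
    y_work = X @ beta + (delta - z) / z       # working response y^(t)
    sw = np.sqrt(z)                           # sqrt(Z) rescaling of rows
    beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y_work * sw, rcond=None)
    return beta_new
```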

Hence if our original log-likelihood has an L1 (Lasso) penalty of the form:

$$ \begin{align*} \ell_{\gamma}(\bbeta) &= \ell(\bbeta) - \gamma \|\bbeta_{-0}\|_1 , \end{align*} $$

where the L1-norm penalty is applied to the non-intercept coefficients, then the Newton update will be equivalent to solving a weighted least-squares Lasso problem at each iteration.

$$ \begin{align} \bbeta^{(t+1)} &= \arg\min_{\bbeta} \hspace{2mm} (\by^{(t)} - \bX \bbeta)^T \bZ^{(t)} (\by^{(t)} - \bX \bbeta) + \gamma \|\bbeta_{-0}\|_1 \label{eq:irls} \end{align} $$

The sklearn package has very fast solvers for the type of optimization problem seen in $\eqref{eq:irls}$.
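For example, one penalized IRLS update could be sketched with `sklearn.linear_model.Lasso`, passing the working weights through `sample_weight` (assuming a reasonably recent scikit-learn version) and letting the unpenalized intercept be fit internally. Note that sklearn scales the squared-error term by the number of samples, so the mapping between $\gamma$ and sklearn's `alpha` below is only up to a constant and would be tuned in practice.

```python
import numpy as np
from sklearn.linear_model import Lasso

def irls_lasso_update(beta, X, t, delta, gam):
    """One penalized IRLS step: a weighted Lasso on the working response.

    Sketch only: sklearn's Lasso minimizes a (1/2n)-scaled squared error plus
    alpha * ||coef||_1, so gam maps to alpha only up to a constant factor.
    """
    z = np.exp(X @ beta) * t                     # working weights z_i
    y_work = X @ beta + (delta - z) / z          # working response y^(t)
    model = Lasso(alpha=gam / len(t), fit_intercept=True, max_iter=10_000)
    # X[:, 1:] drops the explicit intercept column; Lasso fits its own, unpenalized.
    model.fit(X[:, 1:], y_work, sample_weight=z)
    return np.r_[model.intercept_, model.coef_]
```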

Weibull model

In some cases the exponential distribution may not be flexible enough to handle different types of censoring processes. For example, the exponential distribution has a hazard rate which is constant across time (or whatever unit $t_i$ represents). The Weibull distribution allows for a measurement-level dependent hazard through an additional shape parameter ($\alpha$) on top of the rate parameter ($\lambda$). A simple rate/shape Weibull distribution has the following density, survival, and hazard functions: $f(t_i;\lambda,\alpha) = \alpha\lambda t_i^{\alpha-1}\exp(-\lambda t_i^\alpha)$, $S(t_i;\lambda,\alpha)=\exp(-\lambda t_i^\alpha)$, and $h(t_i;\lambda, \alpha) = \alpha \lambda t_i^{\alpha-1}$. When $\alpha=1$, the Weibull distribution reduces to the Exponential, making the latter a nested special case.

In the case of $n$ data points with a $p$-dimensional covariate vector, we will again parameterize the rate parameter as $\lambda_i = \exp(\bx_i^T \bbeta)$, and leave the shape parameter as a global term. Because $\alpha \in (0,\infty)$, it will be easier to optimize over $\alpha(\phi)=\exp(\phi)$ so that $\phi \in (-\infty,\infty)$. The log-likelihood function then becomes:

$$ \begin{align*} \ell(\bbeta, \phi) &= \sum_{i=1}^n \big\{ \delta_i\big(\phi + (\bx_i^T\bbeta)+[\exp(\phi)-1]\cdot \log t_i \big) - \exp(\bx_i^T\bbeta)t_i^{\exp(\phi)} \big\} \\ \end{align*} $$
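A direct NumPy translation of this log-likelihood (negated, so it can be minimized) might look like the following sketch; names are illustrative and consistent with the earlier snippets.

```python
import numpy as np

def weibull_nll(beta, phi, X, t, delta):
    """Negative log-likelihood of the right-censored Weibull model, alpha = exp(phi)."""
    eta = X @ beta                                        # log(lambda_i) = x_i' beta
    alpha = np.exp(phi)
    ll = delta * (phi + eta + (alpha - 1.0) * np.log(t)) - np.exp(eta) * t ** alpha
    return -np.sum(ll)
```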

The partial derivatives for both components can be calculated.

$$ \begin{align*} \frac{\partial \ell(\phi;\bbeta)}{\partial \phi} &= \sum_{i=1}^n \delta_i + e^\phi \sum_{i=1}^n \log t_i \Big[\delta_i - \exp(\bx_i^T \bbeta) t_i^{\exp(\phi)} \Big] \\ \frac{\partial \ell(\bbeta; \phi)}{\partial \bbeta} &= \bX^T( \bdelta - \bz_\phi), \hspace{2mm} z_{i,\phi} = \exp(\bx_i^T\bbeta) t_i^{\exp(\phi)} \\ \frac{\partial^2 \ell(\bbeta; \phi)}{\partial \bbeta \partial \bbeta^T } &= - \bX^T \bZ_\phi \bX, \hspace{2mm} \text{diag}(\bZ_\phi)_i = z_{i,\phi} \end{align*} $$

In order to optimize this likelihood, a block-coordinate descent approach will prove useful: the parameter $\phi$ is updated first through a root-finding method, and then $\bbeta$ is solved for with the IRLS approach highlighted above.

$$ \begin{align*} \phi^{(t+1)} &\gets \phi \hspace{1mm} \text{s.t.} \hspace{2mm} \sum_{i=1}^n \delta_i + e^\phi \sum_{i=1}^n \log t_i \Big[\delta_i - \exp(\bx_i^T \bbeta^{(t)}) t_i^{\exp(\phi)} \Big] = 0 \\ \bbeta^{(t+1)} &\gets \arg\min_{\bbeta} \hspace{2mm} (\by^{(t)} - \bX \bbeta)^T \bZ^{(t)}_\phi (\by^{(t)} - \bX \bbeta) + \gamma \|\bbeta_{-0}\|_1, \hspace{2mm} \by^{(t)} = \bX \bbeta^{(t)} + (\bZ^{(t)}_\phi)^{-1} (\bdelta - \bz^{(t)}_\phi) \end{align*} $$
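Putting the two updates together, a bare-bones block coordinate descent might look like the sketch below: the $\phi$-score is driven to zero with `scipy.optimize.brentq` (the bracket $[-4, 4]$ is an arbitrary assumption), and $\bbeta$ is then refreshed with the weighted-Lasso IRLS step. Names and the fixed iteration count are illustrative only.

```python
import numpy as np
from scipy.optimize import brentq
from sklearn.linear_model import Lasso

def dphi(phi, beta, X, t, delta):
    """Score of the log-likelihood with respect to phi (set to zero in the phi-update)."""
    z = np.exp(X @ beta) * t ** np.exp(phi)
    return np.sum(delta) + np.exp(phi) * np.sum(np.log(t) * (delta - z))

def fit_weibull_lasso(X, t, delta, gam, n_iter=25):
    """Block coordinate descent: root-find phi, then one penalized IRLS step for beta."""
    beta, phi = np.zeros(X.shape[1]), 0.0        # phi = 0 starts at the exponential model
    for _ in range(n_iter):
        # (i) update phi by solving d ell / d phi = 0 over an assumed bracket
        phi = brentq(dphi, -4.0, 4.0, args=(beta, X, t, delta))
        # (ii) update beta with the weighted-Lasso IRLS step, using the Weibull weights
        z = np.exp(X @ beta) * t ** np.exp(phi)
        y_work = X @ beta + (delta - z) / z
        model = Lasso(alpha=gam / len(t), fit_intercept=True, max_iter=10_000)
        model.fit(X[:, 1:], y_work, sample_weight=z)
        beta = np.r_[model.intercept_, model.coef_]
    return beta, phi
```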
