Bayesian Logistic Regression with PyStan

TODO: Work in progress

Authors: Jonah Gabry, Ben Goodrich, Aki Vehtari, Tuomas Sivula

The introduction to Bayesian logistic regression is based on a CRAN vignette by Jonah Gabry and Ben Goodrich. The CRAN vignette was adapted into an R notebook by Aki Vehtari. Instead of the wells data used in the CRAN vignette, the Pima Indians data set is used here, and the end of the notebook differs significantly from the vignette. The R notebook was ported to this Python notebook by Aki Vehtari and Tuomas Sivula.

Introduction

This notebook explains how to estimate generalized linear models (GLMs) for binary (Bernoulli) response variables using PyStan.

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to the posterior distribution of the unknowns conditional on the observed data.
  2. Draw from the posterior distribution using Markov chain Monte Carlo (MCMC).
  3. Evaluate how well the model fits the data and possibly revise the model.
  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).

This notebook demonstrates Steps 1-3 when the likelihood is the product of conditionally independent binomial distributions (possibly with only one trial per observation).

Likelihood

For a binomial GLM the likelihood for one observation $y$ can be written as a conditionally binomial PMF $$\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},$$ where $n$ is the known number of trials, $\pi = g^{-1}(\eta)$ is the probability of success and $\eta = \alpha + \mathbf{x}^\top \boldsymbol{\beta}$ is a linear predictor. For a sample of size $N$, the likelihood of the entire sample is the product of $N$ individual likelihood contributions.

Because $\pi$ is a probability, for a binomial model the link function $g$ maps between the unit interval (the support of $\pi$) and the set of all real numbers $\mathbb{R}$. When applied to a linear predictor $\eta$ with values in $\mathbb{R}$, the inverse link function $g^{-1}(\eta)$ therefore returns a valid probability between 0 and 1.

The two most common link functions used for binomial GLMs are the logit and probit functions. With the logit (or log-odds) link function $g(x) = \ln{\left(\frac{x}{1-x}\right)}$, the likelihood for a single observation becomes

$$\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y \left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} = \binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y} \left(\frac{1}{1 + e^{\eta}}\right)^{n - y}$$

and the probit link function $g(x) = \Phi^{-1}(x)$ yields the likelihood

$$\binom{n}{y} \left(\Phi(\eta)\right)^{y} \left(1 - \Phi(\eta)\right)^{n - y},$$

where $\Phi$ is the CDF of the standard normal distribution. The differences between the logit and probit functions are minor and -- if, as rstanarm does by default, the probit is scaled so its slope at the origin matches the logit's -- the two link functions should yield similar results. Unless the user has a specific reason to prefer the probit link, we recommend the logit simply because it will be slightly faster and more numerically stable.

In theory, there are infinitely many possible link functions, although in practice only a few are typically used.
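As a quick numerical illustration (a sketch, not part of the original analysis), the snippet below compares the inverse logit with a probit whose argument is rescaled by roughly 1.6 so that the two slopes match at the origin; the resulting probabilities differ by no more than about 0.02 over the whole range.

import numpy as np
from scipy.stats import norm

eta = np.linspace(-4, 4, 81)
inv_logit = 1 / (1 + np.exp(-eta))   # logit^{-1}(eta)
scaled_probit = norm.cdf(eta / 1.6)  # Phi(eta / 1.6), slope-matched at the origin
# the maximum absolute difference stays below about 0.02
print(np.abs(inv_logit - scaled_probit).max())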

Priors

A full Bayesian analysis requires specifying prior distributions $f(\alpha)$ and $f(\boldsymbol{\beta})$ for the intercept and vector of regression coefficients.

As an example, suppose we have $K$ predictors and believe --- prior to seeing the data --- that $\alpha, \beta_1, \dots, \beta_K$ are as likely to be positive as they are to be negative, but are highly unlikely to be far from zero. These beliefs can be represented by normal distributions with mean zero and a small scale (standard deviation).

If, on the other hand, we have less a priori confidence that the parameters will be close to zero, then we could use a larger scale for the normal distribution and/or a distribution with heavier tails than the normal, such as the Student's $t$ distribution.
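To make the difference concrete, the short sketch below (illustrative only; the 7 degrees of freedom and scale 2.5 anticipate the prior used later in this notebook) compares the prior probability of a coefficient being far from zero under a normal and under a Student's $t$ prior with the same scale; the $t$ prior leaves noticeably more mass in the tails.

from scipy.stats import norm, t

scale = 2.5
# prior probability that |beta| > 10, i.e. four prior scales from zero
print('normal tail:   ', 2 * norm.sf(10, loc=0, scale=scale))
print('student-t tail:', 2 * t.sf(10, df=7, loc=0, scale=scale))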

Posterior

With independent prior distributions, the joint posterior distribution for $\alpha$ and $\boldsymbol{\beta}$ is proportional to the product of the priors and the $N$ likelihood contributions:

$$f\left(\alpha,\boldsymbol{\beta} | \mathbf{y},\mathbf{X}\right) \propto f\left(\alpha\right) \times \prod_{k=1}^K f\left(\beta_k\right) \times \prod_{i=1}^N { g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i-y_i}}.$$

This is the posterior distribution that PyStan will draw from when using MCMC.
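For intuition, here is a minimal sketch of the unnormalized log posterior for the logit link, written with independent normal priors for simplicity (the Stan models below use Student's $t$ priors instead); when sampling, Stan evaluates essentially this quantity and its gradient.

import numpy as np

def log_posterior(alpha, beta, X, y, prior_scale=2.5):
    # hypothetical helper for illustration only, not used elsewhere in the notebook
    eta = alpha + X @ beta                              # linear predictor
    # Bernoulli-logit log-likelihood: sum_i y_i*eta_i - log(1 + exp(eta_i))
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    # independent normal(0, prior_scale) priors, up to an additive constant
    log_prior = -0.5 * (alpha**2 + np.sum(beta**2)) / prior_scale**2
    return log_lik + log_prior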

Logistic Regression Example

When the logit link function is used the model is often referred to as a logistic regression model (the inverse logit function is the CDF of the standard logistic distribution). As an example, we show how to carry out an analysis of the Pima Indians data set with PyStan, similar to the analysis in Chapter 5.4 of Gelman and Hill (2007).


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import stan interface
import pystan

In [2]:
# add utilities directory to path
import os, sys
util_path = os.path.abspath(os.path.join(os.path.pardir, 'utilities_and_data'))
if util_path not in sys.path and os.path.exists(util_path):
    sys.path.insert(0, util_path)

# import from utilities
import stan_utility
import psis  # pareto smoothed importance sampling
import plot_tools

Data

First we load and pre-process data.


In [3]:
# load data
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'diabetes.csv'
    )
)
data = pd.read_csv(data_path)
# print some basic info()
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [4]:
# preview the first few rows
data.head()


Out[4]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

In [5]:
# some summary
data.describe()


Out[5]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Preprocess data


In [6]:
# modify the data column names slightly for easier typing
# rename DiabetesPedigreeFunction to dpf
data.rename(columns={'DiabetesPedigreeFunction': 'dpf'}, inplace=True)
# make lower
data.rename(columns=lambda old_name: old_name.lower(), inplace=True)

In [7]:
# remove observation rows that have a value of 0 in any of the selected variables
normed_predictors = [
    'glucose',
    'bloodpressure',
    'skinthickness',
    'insulin',
    'bmi'
]
data = data[(data[normed_predictors] != 0).all(axis=1)]

In [8]:
# scale the covariates for easier comparison of coefficient posteriors
# N.B. int columns turn into floats
data.iloc[:,:-1] -= data.iloc[:,:-1].mean()
data.iloc[:,:-1] /= data.iloc[:,:-1].std()

In [9]:
# preview the first few rows again
data.head()


Out[9]:
pregnancies glucose bloodpressure skinthickness insulin bmi dpf age outcome
3 -0.716511 -1.089653 -0.373178 -0.584363 -0.522175 -0.709514 -1.030559 -0.967063 0
4 -1.027899 0.465719 -2.453828 0.556709 0.100502 1.424909 5.108582 0.209318 1
6 -0.093734 -1.446093 -1.653578 0.271441 -0.572662 -0.296859 -0.796108 -0.476904 1
8 -0.405123 2.409934 -0.053078 1.507603 3.255961 -0.368007 -1.056609 2.169953 1
13 -0.716511 2.150705 -0.853328 -0.584363 5.805571 -0.424924 -0.361940 2.758143 1

In [10]:
# preparing the inputs
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

In [11]:
# get shape into variables
n, p = X.shape
print('number of observations = {}'.format(n))
print('number of predictors = {}'.format(p))


number of observations = 392
number of predictors = 8

Stan model code for logistic regression

Logistic regression with Student's $t$ prior as discussed above.


In [12]:
with open('logistic_t.stan') as file:
    print(file.read())


/**
 * Logistic regression t-prior
 *
 * Priors:
 *     weights - student t
 *     intercept - student t
 */
data {
    int<lower=0> n;               // number of data points
    int<lower=1> d;               // explanatory variable dimension
    matrix[n, d] X;               // explanatory variable
    int<lower=0,upper=1> y[n];    // response variable
    int<lower=1> p_alpha_df;      // prior degrees of freedom for alpha
    real p_alpha_loc;             // prior location for alpha
    real<lower=0> p_alpha_scale;  // prior scale for alpha
    int<lower=1> p_beta_df;       // prior degrees of freedom for beta
    real p_beta_loc;              // prior location for beta
    real<lower=0> p_beta_scale;   // prior scale for beta
}
parameters {
    real alpha;      // intercept
    vector[d] beta;  // explanatory variable weights
}
transformed parameters {
    // linear predictor
    vector[n] eta;
    eta = alpha + X * beta;
}
model {
    alpha ~ student_t(p_alpha_df, p_alpha_loc, p_alpha_scale);
    beta ~ student_t(p_beta_df, p_beta_loc, p_beta_scale);
    y ~ bernoulli_logit(eta);
}
generated quantities {
    vector[n] log_lik;
    for (i in 1:n)
        log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);
}


In [13]:
model = stan_utility.compile_model('logistic_t.stan')


Using cached StanModel

Set priors and sample from the posterior

Here we'll use a Student's $t$ prior with 7 degrees of freedom and a scale of 2.5, which, as discussed above, is a reasonable default prior when coefficients should be close to zero but have some chance of being large. PyStan returns draws from the posterior distribution of the parameters, which describe the uncertainty about their unknown values.


In [14]:
data1 = dict(
    n=n,
    d=p,
    X=X,
    y=y,
    p_alpha_df=7,
    p_alpha_loc=0,
    p_alpha_scale=2.5,
    p_beta_df=7,
    p_beta_loc=0,
    p_beta_scale=2.5
)
fit1 = model.sampling(data=data1, seed=74749)
samples1 = fit1.extract(permuted=True)

Inspect the resulting posterior

Check n_effs and Rhats


In [15]:
# print summary of selected variables
# use pandas data frame for layout
summary = fit1.summary(pars=['alpha', 'beta'])
pd.DataFrame(
    summary['summary'],
    index=summary['summary_rownames'],
    columns=summary['summary_colnames']
)


Out[15]:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha -1.015174 0.002262 0.143076 -1.299494 -1.109587 -1.014207 -0.919651 -0.744491 4000.0 0.999538
beta[0] 0.268131 0.003343 0.179087 -0.092588 0.145121 0.268793 0.390940 0.617576 2869.0 1.000258
beta[1] 1.214370 0.003466 0.185477 0.858116 1.087410 1.212147 1.340282 1.577813 2863.0 1.000793
beta[2] -0.018038 0.002448 0.154842 -0.317233 -0.127057 -0.015590 0.090371 0.279008 4000.0 0.999856
beta[3] 0.123833 0.003049 0.179358 -0.226432 0.000636 0.127423 0.245617 0.471099 3461.0 0.999971
beta[4] -0.097538 0.002803 0.159845 -0.412036 -0.200400 -0.097875 0.007030 0.226323 3252.0 0.999609
beta[5] 0.507957 0.003516 0.197709 0.136263 0.368516 0.507830 0.640744 0.906786 3162.0 1.000351
beta[6] 0.407946 0.002358 0.149134 0.123293 0.304112 0.404529 0.506626 0.705823 4000.0 0.999301
beta[7] 0.356895 0.003518 0.192169 -0.001486 0.224827 0.352623 0.481480 0.749368 2984.0 1.001202

The n_eff values are high and all Rhat values are below 1.1, which is good.

Next we check divergences, E-BFMI, and treedepth exceedances, as explained in the Robust Statistical Workflow with PyStan case study by Michael Betancourt.


In [16]:
stan_utility.check_treedepth(fit1)
stan_utility.check_energy(fit1)
stan_utility.check_div(fit1)


0 of 4000 iterations saturated the maximum tree depth of 10 (0.0%)
0.0 of 4000 iterations ended with a divergence (0.0%)

Everything is fine based on these diagnostics and we can proceed with our analysis.

Visualise the marginal posterior distributions of each parameter


In [17]:
# plot histograms
fig, axes = plot_tools.hist_multi_sharex(
    [samples1['alpha']] + [sample for sample in samples1['beta'].T],
    rowlabels=['intercept'] + list(X.columns),
    n_bins=60,
    x_lines=0,
    figsize=(7, 10)
)


We can use Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) to estimate the predictive performance of the model.


In [18]:
loo1, loos1, ks1 = psis.psisloo(samples1['log_lik'])
loo1_se = np.sqrt(np.var(loos1, ddof=1)*n)
print('elpd_loo: {:.4} (SE {:.3})'.format(loo1, loo1_se))


elpd_loo: -182.3 (SE 12.0)

In [19]:
# check the number of large (> 0.5) Pareto k estimates
np.sum(ks1 > 0.5)


Out[19]:
0

Alternative horseshoe prior on weights

In this example, with $n \gg p$, we do not expect a different prior to make much difference; the horseshoe prior is usually more useful when $n < p$.

The global scale parameter for the horseshoe prior is chosen as recommended by Juho Piironen and Aki Vehtari (2017), On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), PMLR 54. arXiv preprint arXiv:1610.05559.
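Concretely, with $p_0$ denoting a prior guess for the number of relevant predictors, the global scale used here is $$\tau_0 = \frac{p_0}{p - p_0} \frac{1}{\sqrt{n}},$$ which matches how tau0 is computed in the code cell below that prepares data2.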


In [20]:
with open('logistic_hs.stan') as file:
    print(file.read())


/**
 * Logistic regression HS-prior
 *
 * Priors:
 *     weights - hierarchical shrinkage
 *     intercept - student t
 */

data {
    int<lower=0> n;                     // number of data points
    int<lower=1> d;                     // explanatory variable dimension
    matrix[n, d] X;                     // explanatory variable
    int<lower=0,upper=1> y[n];          // response variable
    int<lower=1> p_alpha_df;            // prior alpha degrees of freedom
    real p_alpha_loc;                   // prior alpha location
    real<lower=0> p_alpha_scale;        // prior scale alpha
    int<lower=1> p_beta_df;             // prior beta degrees of freedom
    int<lower=1> p_beta_global_df;      // prior beta global degrees of freedom
    real<lower=0> p_beta_global_scale;  // prior beta global scale
}

parameters {

    // intercept
    real alpha;

    // auxiliary variables for the variance parameters
    vector[d] z;
    vector<lower=0>[d] lambda_r1;
    vector<lower=0>[d] lambda_r2;
    real<lower=0> tau_r1;
    real<lower=0> tau_r2;
}

transformed parameters {

    vector<lower=0>[d] lambda;  // local variance parameter
    real<lower=0> tau;          // global variance parameter
    vector[d] beta;             // explanatory variable weights
    vector[n] eta;              // linear predictor

    lambda = lambda_r1 .* sqrt(lambda_r2);
    tau = tau_r1 * sqrt(tau_r2);
    beta = z .* (lambda*tau);
    eta = alpha + X * beta;
}

model {

    // student t prior for intercept
    alpha ~ student_t(p_alpha_df, p_alpha_loc, p_alpha_scale);

    z ~ normal(0.0, 1.0);

    // half t priors for lambdas
    lambda_r1 ~ normal(0.0, 1.0);
    lambda_r2 ~ inv_gamma(0.5*p_beta_df, 0.5*p_beta_df);

    // half t priors for tau
    tau_r1 ~ normal(0.0, p_beta_global_scale);
    tau_r2 ~ inv_gamma(0.5*p_beta_global_df, 0.5*p_beta_global_df);

    // observation model
    y ~ bernoulli_logit(eta);
}

generated quantities {
    vector[n] log_lik;
    for (i in 1:n)
        log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);
}


In [21]:
model = stan_utility.compile_model('logistic_hs.stan')


Using cached StanModel

In [22]:
p0 = 2 # prior guess for the number of relevant variables
tau0 = p0 / (p - p0) * 1 / np.sqrt(n)
data2 = dict(
    n=n,
    d=p,
    X=X,
    y=y,
    p_alpha_df=7,
    p_alpha_loc=0,
    p_alpha_scale=2.5,
    p_beta_df=1,
    p_beta_global_df=1,
    p_beta_global_scale=tau0
)
fit2 = model.sampling(data=data2, seed=74749)
samples2 = fit2.extract(permuted=True)

The summary and histograms below show that the horseshoe prior has shrunk the posterior distributions of the irrelevant features closer to zero, while hardly affecting the posterior distributions of the relevant features.


In [23]:
# print summary of selected variables
# use pandas data frame for layout
summary = fit2.summary(pars=['alpha', 'beta'])
pd.DataFrame(
    summary['summary'],
    index=summary['summary_rownames'],
    columns=summary['summary_colnames']
)


Out[23]:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha -0.972157 0.002265 0.143276 -1.264426 -1.066998 -0.964746 -0.873828 -0.700521 4000.0 1.000846
beta[0] 0.182342 0.003634 0.169962 -0.058117 0.039738 0.156690 0.300738 0.561509 2187.0 1.001279
beta[1] 1.134855 0.002612 0.165206 0.800193 1.023235 1.136612 1.242016 1.468268 4000.0 1.001648
beta[2] 0.014218 0.001454 0.091937 -0.177740 -0.026688 0.004477 0.053512 0.227477 4000.0 1.000275
beta[3] 0.104842 0.002593 0.147235 -0.107136 0.002692 0.062478 0.186669 0.453316 3225.0 0.999200
beta[4] -0.024354 0.004711 0.099047 -0.259446 -0.068772 -0.007807 0.024773 0.161351 442.0 1.012295
beta[5] 0.396166 0.003994 0.189707 0.002694 0.272083 0.406092 0.528994 0.745738 2256.0 1.000674
beta[6] 0.291092 0.003909 0.161427 -0.001565 0.178404 0.295211 0.401841 0.605812 1705.0 1.003422
beta[7] 0.321249 0.004307 0.194694 -0.016229 0.173146 0.327803 0.463665 0.693086 2043.0 1.000677

In [24]:
# plot histograms
fig, axes = plot_tools.hist_multi_sharex(
    [samples2['alpha']] + [sample for sample in samples2['beta'].T],
    rowlabels=['intercept'] + list(X.columns),
    n_bins=60,
    x_lines=0,
    figsize=(7, 10)
)


We also compute LOO for the model with the horseshoe prior. The expected log predictive density is slightly higher, but not significantly so. This is not surprising, as this is an easy data set with $n \gg p$.


In [25]:
loo2, loos2, ks2 = psis.psisloo(samples2['log_lik'])
loo2_se = np.sqrt(np.var(loos2, ddof=1)*n)
print('elpd_loo: {:.4} (SE {:.3})'.format(loo2, loo2_se))


elpd_loo: -181.8 (SE 11.1)

In [26]:
# check the number of large (> 0.5) Pareto k estimates
np.sum(ks2 > 0.5)


Out[26]:
0

In [27]:
elpd_diff = loos2 - loos1
elpd_diff_se = np.sqrt(np.var(elpd_diff, ddof=1)*n)
elpd_diff = np.sum(elpd_diff)
print('elpd_diff: {:.4} (SE {:.3})'.format(elpd_diff, elpd_diff_se))


elpd_diff: 0.5638 (SE 1.51)
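
As a rough final check (assuming the usual normal approximation for the elpd difference), we can compare the difference to its standard error; here the difference is well within one standard error, so the two priors give practically the same predictive performance for this data set.

# ratio of the elpd difference to its standard error (illustrative check)
print('elpd_diff / SE = {:.2f}'.format(elpd_diff / elpd_diff_se))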