TODO: Work in progress
Authors: Jonah Gabry, Ben Goodrich, Aki Vehtari, Tuomas Sivula
The introduction to Bayesian logistic regression is from a CRAN vignette by Jonah Gabry and Ben Goodrich. The CRAN vignette was modified into an R notebook by Aki Vehtari. Instead of the wells data used in the CRAN vignette, the Pima Indians diabetes data set is used here. The end of the notebook differs significantly from the CRAN vignette. The R notebook was ported to this Python notebook by Aki Vehtari and Tuomas Sivula.
This notebook explains how to estimate generalized linear models (GLMs) for binary (Bernoulli) response variables using PyStan.
The four steps of a Bayesian analysis are:
1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to the posterior distribution of the unknowns conditional on the observed data.
2. Draw from the posterior distribution, typically using Markov chain Monte Carlo (MCMC).
3. Evaluate how well the model fits the data and possibly revise the model.
4. Draw from the posterior predictive distribution of the outcome(s) to check the model's predictions and to see how manipulating a predictor affects (a function of) the outcome(s).
For a binomial GLM the likelihood for one observation $y$ can be written as a conditionally binomial PMF $$\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},$$ where $n$ is the known number of trials, $\pi = g^{-1}(\eta)$ is the probability of success and $\eta = \alpha + \mathbf{x}^\top \boldsymbol{\beta}$ is a linear predictor. For a sample of size $N$, the likelihood of the entire sample is the product of $N$ individual likelihood contributions.
Because $\pi$ is a probability, for a binomial model the link function $g$ maps between the unit interval (the support of $\pi$) and the set of all real numbers $\mathbb{R}$. When applied to a linear predictor $\eta$ with values in $\mathbb{R}$, the inverse link function $g^{-1}(\eta)$ therefore returns a valid probability between 0 and 1.
The two most common link functions used for binomial GLMs are the logit and probit functions. With the logit (or log-odds) link function $g(x) = \ln{\left(\frac{x}{1-x}\right)}$, the likelihood for a single observation becomes
$$\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y \left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} = \binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y} \left(\frac{1}{1 + e^{\eta}}\right)^{n - y}$$and the probit link function $g(x) = \Phi^{-1}(x)$ yields the likelihood
$$\binom{n}{y} \left(\Phi(\eta)\right)^{y} \left(1 - \Phi(\eta)\right)^{n - y},$$where $\Phi$ is the CDF of the standard normal distribution. The differences between the logit and probit functions are minor and -- if, as rstanarm does by default, the probit is scaled so its slope at the origin matches the logit's -- the two link functions should yield similar results. Unless the user has a specific reason to prefer the probit link, we recommend the logit simply because it will be slightly faster and more numerically stable.
In theory, there are infinitely many possible link functions, although in practice only a few are typically used.
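As a quick numerical check (a sketch using NumPy and SciPy, not part of the original analysis), we can verify that the inverse logit and a probit whose slope at the origin matches the logit's give very similar probabilities:
import numpy as np
from scipy.stats import norm

# a grid of linear predictor values
eta = np.linspace(-4, 4, 9)

# inverse logit: exp(eta) / (1 + exp(eta))
inv_logit = 1 / (1 + np.exp(-eta))

# probit scaled so its slope at the origin matches the logit's:
# the logistic density at 0 is 1/4 and the normal density at 0 with sd s
# is 1/(s*sqrt(2*pi)), so matching slopes gives s = 4/sqrt(2*pi) ~ 1.6
s = 4 / np.sqrt(2 * np.pi)
scaled_probit = norm.cdf(eta / s)

print(np.column_stack((eta, inv_logit, scaled_probit)))
The two columns agree closely except far out in the tails, which is why the choice between the logit and a suitably scaled probit rarely matters in practice.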
A full Bayesian analysis requires specifying prior distributions $f(\alpha)$ and $f(\boldsymbol{\beta})$ for the intercept and vector of regression coefficients.
As an example, suppose we have $K$ predictors and believe --- prior to seeing the data --- that $\alpha, \beta_1, \dots, \beta_K$ are as likely to be positive as they are to be negative, but are highly unlikely to be far from zero. These beliefs can be represented by normal distributions with mean zero and a small scale (standard deviation).
If, on the other hand, we have less a priori confidence that the parameters will be close to zero then we could use a larger scale for the normal distribution and/or a distribution with heavier tails than the normal like the Student's $t$ distribution.
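To make the "heavier tails" point concrete, here is a small sketch (not part of the original analysis) comparing the prior probability of a large coefficient under a normal prior and a Student's $t$ prior with the same scale:
from scipy.stats import norm, t

scale = 2.5
# prior probability that a coefficient is more than 4 scale units from zero
print('normal:       ', 2 * norm.sf(10, loc=0, scale=scale))
print('student-t(7): ', 2 * t.sf(10, df=7, loc=0, scale=scale))
The $t$ prior assigns noticeably more probability to large values, so it shrinks less aggressively when the data point to a large effect.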
With independent prior distributions, the joint posterior distribution for $\alpha$ and $\boldsymbol{\beta}$ is proportional to the product of the priors and the $N$ likelihood contributions:
$$f\left(\alpha,\boldsymbol{\beta} | \mathbf{y},\mathbf{X}\right) \propto f\left(\alpha\right) \times \prod_{k=1}^K f\left(\beta_k\right) \times \prod_{i=1}^N { g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i-y_i}}.$$ This is the posterior distribution that PyStan will draw from when using MCMC.
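To make the formula concrete, the unnormalized log posterior for the Bernoulli special case ($n_i = 1$) could be written directly in NumPy as below. This is only an illustrative sketch, with normal priors as in the example above; Stan builds and samples from this density for us.
import numpy as np
from scipy.stats import norm
from scipy.special import expit  # inverse logit

def log_posterior_unnorm(alpha, beta, X, y, prior_scale=2.5):
    # linear predictor and success probabilities
    eta = alpha + X @ beta
    pi = expit(eta)
    # Bernoulli log likelihood (n_i = 1 for every observation)
    log_lik = np.sum(y * np.log(pi) + (1 - y) * np.log1p(-pi))
    # independent normal(0, prior_scale) priors on the intercept and coefficients
    log_prior = (norm.logpdf(alpha, scale=prior_scale)
                 + np.sum(norm.logpdf(beta, scale=prior_scale)))
    return log_lik + log_prior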
When the logit link function is used the model is often referred to as a logistic regression model (the inverse logit function is the CDF of the standard logistic distribution). As an example, here we will show how to carry out an analysis of the Pima Indians data set, similar to the analysis in Chapter 5.4 of Gelman and Hill (2007), using PyStan.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import stan interface
import pystan
In [2]:
# add utilities directory to path
import os, sys
util_path = os.path.abspath(os.path.join(os.path.pardir, 'utilities_and_data'))
if util_path not in sys.path and os.path.exists(util_path):
    sys.path.insert(0, util_path)
# import from utilities
import stan_utility
import psis # pareto smoothed importance sampling
import plot_tools
In [3]:
# load data
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'diabetes.csv'
    )
)
data = pd.read_csv(data_path)
# print some basic info()
data.info()
In [4]:
# preview some first rows
data.head()
Out[4]:
In [5]:
# some summary
data.describe()
Out[5]:
In [6]:
# modify the data column names slightly for easier typing
# rename DiabetesPedigreeFunction to dpf
data.rename(columns={'DiabetesPedigreeFunction': 'dpf'}, inplace=True)
# make lower
data.rename(columns=lambda old_name: old_name.lower(), inplace=True)
In [7]:
# remove observation rows with value 0 in selected variables
normed_predictors = [
    'glucose',
    'bloodpressure',
    'skinthickness',
    'insulin',
    'bmi'
]
data = data[(data[normed_predictors] != 0).all(axis=1)]
In [8]:
# scale the covariates for easier comparison of coefficient posteriors
# N.B. int columns turn into floats
data.iloc[:,:-1] -= data.iloc[:,:-1].mean()
data.iloc[:,:-1] /= data.iloc[:,:-1].std()
In [9]:
# preview the first rows again
data.head()
Out[9]:
In [10]:
# preparing the inputs
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
In [11]:
# get shape into variables
n, p = X.shape
print('number of observations = {}'.format(n))
print('number of predictors = {}'.format(p))
In [12]:
with open('logistic_t.stan') as file:
    print(file.read())
In [13]:
model = stan_utility.compile_model('logistic_t.stan')
Here we'll use a Student's $t$ prior with 7 degrees of freedom and a scale of 2.5, which, as discussed above, is a reasonable default when the coefficients should be close to zero but may occasionally be large. PyStan returns draws from the posterior distribution of the parameters; these draws describe the uncertainty about the unknown parameter values.
In [14]:
data1 = dict(
    n=n,
    d=p,
    X=X,
    y=y,
    p_alpha_df=7,
    p_alpha_loc=0,
    p_alpha_scale=2.5,
    p_beta_df=7,
    p_beta_loc=0,
    p_beta_scale=2.5
)
fit1 = model.sampling(data=data1, seed=74749)
samples1 = fit1.extract(permuted=True)
In [15]:
# print summary of selected variables
# use pandas data frame for layout
summary = fit1.summary(pars=['alpha', 'beta'])
pd.DataFrame(
    summary['summary'],
    index=summary['summary_rownames'],
    columns=summary['summary_colnames']
)
Out[15]:
The n_eff values are high and all Rhat values are less than 1.1, which is good.
Next we check divergences, E-BFMI, and treedepth exceedances, as explained in the Robust Statistical Workflow with PyStan case study by Michael Betancourt.
In [16]:
stan_utility.check_treedepth(fit1)
stan_utility.check_energy(fit1)
stan_utility.check_div(fit1)
Everything is fine based on these diagnostics and we can proceed with our analysis.
Visualise the marginal posterior distribution of each parameter.
In [17]:
# plot histograms
fig, axes = plot_tools.hist_multi_sharex(
    [samples1['alpha']] + [sample for sample in samples1['beta'].T],
    rowlabels=['intercept'] + list(X.columns),
    n_bins=60,
    x_lines=0,
    figsize=(7, 10)
)
We can use Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) to estimate the predictive performance of the model.
In [18]:
loo1, loos1, ks1 = psis.psisloo(samples1['log_lik'])
loo1_se = np.sqrt(np.var(loos1, ddof=1)*n)
print('elpd_loo: {:.4} (SE {:.3})'.format(loo1, loo1_se))
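The standard error above is the usual approximation for a sum of $n$ pointwise contributions, $$\mathrm{SE}\left(\widehat{\mathrm{elpd}}_{\mathrm{loo}}\right) = \sqrt{n \, \mathrm{Var}\left(\widehat{\mathrm{elpd}}_{\mathrm{loo},i}\right)},$$ which is exactly what the line computing loo1_se does.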
In [19]:
# check the number of large (> 0.5) Pareto k estimates
np.sum(ks1 > 0.5)
Out[19]:
Next we fit a model with a horseshoe prior on the coefficients. In this example, with $n \gg p$, we do not expect the change of prior to make much difference; the horseshoe prior is usually more useful when $n < p$.
The global scale parameter for the horseshoe prior is chosen as recommended by Juho Piironen and Aki Vehtari (2017). On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), PMLR 54. arXiv preprint arXiv:1610.05559.
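Concretely, if $p_0$ is the prior guess for the number of relevant variables out of $p$, the global scale used in the code below is $$\tau_0 = \frac{p_0}{p - p_0} \, \frac{1}{\sqrt{n}}.$$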
In [20]:
with open('logistic_hs.stan') as file:
    print(file.read())
In [21]:
model = stan_utility.compile_model('logistic_hs.stan')
In [22]:
p0 = 2 # prior guess for the number of relevant variables
tau0 = p0 / (p - p0) * 1 / np.sqrt(n)
data2 = dict(
    n=n,
    d=p,
    X=X,
    y=y,
    p_alpha_df=7,
    p_alpha_loc=0,
    p_alpha_scale=2.5,
    p_beta_df=1,
    p_beta_global_df=1,
    p_beta_global_scale=tau0
)
fit2 = model.sampling(data=data2, seed=74749)
samples2 = fit2.extract(permuted=True)
In the summary and histograms below, we see that the horseshoe prior has shrunk the posterior distributions of the irrelevant coefficients closer to zero, without affecting the posterior distributions of the relevant coefficients.
In [23]:
# print summary of selected variables
# use pandas data frame for layout
summary = fit2.summary(pars=['alpha', 'beta'])
pd.DataFrame(
    summary['summary'],
    index=summary['summary_rownames'],
    columns=summary['summary_colnames']
)
Out[23]:
In [24]:
# plot histograms
fig, axes = plot_tools.hist_multi_sharex(
    [samples2['alpha']] + [sample for sample in samples2['beta'].T],
    rowlabels=['intercept'] + list(X.columns),
    n_bins=60,
    x_lines=0,
    figsize=(7, 10)
)
We also compute LOO for the model with the horseshoe prior. The expected log predictive density is higher, but not significantly so. This is not surprising, as this is an easy data set with $n \gg p$.
In [25]:
loo2, loos2, ks2 = psis.psisloo(samples2['log_lik'])
loo2_se = np.sqrt(np.var(loos2, ddof=1)*n)
print('elpd_loo: {:.4} (SE {:.3})'.format(loo2, loo2_se))
In [26]:
# check the number of large (> 0.5) Pareto k estimates
np.sum(ks2 > 0.5)
Out[26]:
In [27]:
elpd_diff = loos2 - loos1
elpd_diff_se = np.sqrt(np.var(elpd_diff, ddof=1)*n)
elpd_diff = np.sum(elpd_diff)
print('elpd_diff: {:.4} (SE {:.3})'.format(elpd_diff, elpd_diff_se))
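As a rough follow-up (a sketch assuming a normal approximation for the paired elpd difference, which is only a crude summary), one could estimate the probability that the horseshoe model predicts better:
from scipy import stats

# normal approximation: probability that the horseshoe model (model 2) is better
print('Pr(model 2 better) ~ {:.2f}'.format(stats.norm.cdf(elpd_diff / elpd_diff_se)))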