TODO: Work in progress
Authors: Jonah Gabry, Ben Goodrich, Aki Vehtari, Tuomas Sivula
The introduction to Bayesian logistic regression is from a CRAN vignette by Jonah Gabry and Ben Goodrich. The CRAN vignette was modified into an R notebook by Aki Vehtari. Instead of the wells data used in the CRAN vignette, the Pima Indians diabetes data set is used here. The end of the notebook differs significantly from the CRAN vignette. The R notebook was ported to this Python notebook by Aki Vehtari and Tuomas Sivula.
This notebook explains how to estimate generalized linear models (GLMs) for binary (Bernoulli) response variables using PyStan.
The four steps of a Bayesian analysis are:
1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknowns. This joint distribution is proportional to the posterior distribution of the unknowns conditional on the observed data.
2. Draw from the posterior distribution, typically using Markov chain Monte Carlo (MCMC).
3. Evaluate how well the model fits the data and possibly revise the model.
4. Draw from the posterior predictive distribution of the outcome(s) to check the model's predictions and to see how manipulating a predictor affects (a function of) the outcome(s).
For a binomial GLM the likelihood for one observation $y$ can be written as a conditionally binomial PMF $$\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},$$ where $n$ is the known number of trials, $\pi = g^{-1}(\eta)$ is the probability of success and $\eta = \alpha + \mathbf{x}^\top \boldsymbol{\beta}$ is a linear predictor. For a sample of size $N$, the likelihood of the entire sample is the product of $N$ individual likelihood contributions.
Because $\pi$ is a probability, for a binomial model the link function $g$ maps between the unit interval (the support of $\pi$) and the set of all real numbers $\mathbb{R}$. When applied to a linear predictor $\eta$ with values in $\mathbb{R}$, the inverse link function $g^{-1}(\eta)$ therefore returns a valid probability between 0 and 1.
The two most common link functions used for binomial GLMs are the logit and probit functions. With the logit (or log-odds) link function $g(x) = \ln{\left(\frac{x}{1-x}\right)}$, the likelihood for a single observation becomes
$$\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y \left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} = \binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y} \left(\frac{1}{1 + e^{\eta}}\right)^{n - y}$$and the probit link function $g(x) = \Phi^{-1}(x)$ yields the likelihood
$$\binom{n}{y} \left(\Phi(\eta)\right)^{y} \left(1 - \Phi(\eta)\right)^{n - y},$$where $\Phi$ is the CDF of the standard normal distribution. The differences between the logit and probit functions are minor and -- if, as rstanarm does by default, the probit is scaled so its slope at the origin matches the logit's -- the two link functions should yield similar results. Unless the user has a specific reason to prefer the probit link, we recommend the logit simply because it will be slightly faster and more numerically stable.
In theory, there are infinitely many possible link functions, although in practice only a few are typically used.
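As a quick numerical check (a sketch using NumPy and SciPy, not part of the original analysis), we can verify that the inverse logit and a probit whose slope at the origin matches the logit's give very similar probabilities:
import numpy as np
from scipy.stats import norm

# a grid of linear predictor values
eta = np.linspace(-4, 4, 9)

# inverse logit: exp(eta) / (1 + exp(eta))
inv_logit = 1 / (1 + np.exp(-eta))

# probit scaled so its slope at the origin matches the logit's:
# the logistic density at 0 is 1/4 and the normal density at 0 with sd s
# is 1/(s*sqrt(2*pi)), so matching slopes gives s = 4/sqrt(2*pi) ~ 1.6
s = 4 / np.sqrt(2 * np.pi)
scaled_probit = norm.cdf(eta / s)

print(np.column_stack((eta, inv_logit, scaled_probit)))
The two columns agree closely except far out in the tails, which is why the choice between the logit and a suitably scaled probit rarely matters in practice.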
A full Bayesian analysis requires specifying prior distributions $f(\alpha)$ and $f(\boldsymbol{\beta})$ for the intercept and vector of regression coefficients.
As an example, suppose we have $K$ predictors and believe --- prior to seeing the data --- that $\alpha, \beta_1, \dots, \beta_K$ are as likely to be positive as they are to be negative, but are highly unlikely to be far from zero. These beliefs can be represented by normal distributions with mean zero and a small scale (standard deviation).
If, on the other hand, we have less a priori confidence that the parameters will be close to zero then we could use a larger scale for the normal distribution and/or a distribution with heavier tails than the normal like the Student's $t$ distribution.
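To make the "heavier tails" point concrete, here is a small sketch (not part of the original analysis) comparing the prior probability of a large coefficient under a normal prior and a Student's $t$ prior with the same scale:
from scipy.stats import norm, t

scale = 2.5
# prior probability that a coefficient is more than 4 scale units from zero
print('normal:       ', 2 * norm.sf(10, loc=0, scale=scale))
print('student-t(7): ', 2 * t.sf(10, df=7, loc=0, scale=scale))
The $t$ prior assigns noticeably more probability to large values, so it shrinks less aggressively when the data point to a large effect.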
With independent prior distributions, the joint posterior distribution for $\alpha$ and $\boldsymbol{\beta}$ is proportional to the product of the priors and the $N$ likelihood contributions:
$$f\left(\alpha,\boldsymbol{\beta} | \mathbf{y},\mathbf{X}\right) \propto f\left(\alpha\right) \times \prod_{k=1}^K f\left(\beta_k\right) \times \prod_{i=1}^N { g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i-y_i}}.$$ This is the posterior distribution that PyStan will draw from when using MCMC.
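To make the formula concrete, the unnormalized log posterior for the Bernoulli special case ($n_i = 1$) could be written directly in NumPy as below. This is only an illustrative sketch, with normal priors as in the example above; Stan builds and samples from this density for us.
import numpy as np
from scipy.stats import norm
from scipy.special import expit  # inverse logit

def log_posterior_unnorm(alpha, beta, X, y, prior_scale=2.5):
    # linear predictor and success probabilities
    eta = alpha + X @ beta
    pi = expit(eta)
    # Bernoulli log likelihood (n_i = 1 for every observation)
    log_lik = np.sum(y * np.log(pi) + (1 - y) * np.log1p(-pi))
    # independent normal(0, prior_scale) priors on the intercept and coefficients
    log_prior = (norm.logpdf(alpha, scale=prior_scale)
                 + np.sum(norm.logpdf(beta, scale=prior_scale)))
    return log_lik + log_prior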
When the logit link function is used the model is often referred to as a logistic regression model (the inverse logit function is the CDF of the standard logistic distribution). As an example, here we will show how to carry out an analysis of the Pima Indians data set, similar to the analysis in Chapter 5.4 of Gelman and Hill (2007), using PyStan.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import stan interface
import pystan
In [2]:
# add utilities directory to path
import os, sys
util_path = os.path.abspath(os.path.join(os.path.pardir, 'utilities_and_data'))
if util_path not in sys.path and os.path.exists(util_path):
    sys.path.insert(0, util_path)
# import from utilities
import stan_utility
import psis # pareto smoothed importance sampling
import plot_tools
In [3]:
# load data
data_path = os.path.abspath(
    os.path.join(
        os.path.pardir,
        'utilities_and_data',
        'diabetes.csv'
    )
)
data = pd.read_csv(data_path)
# print some basic info()
data.info()
In [4]:
# preview some first rows
data.head()
Out[4]:
In [5]:
# some summary
data.describe()
Out[5]:
In [6]:
# modify the data column names slightly for easier typing
# rename DiabetesPedigreeFunction to dpf
data.rename(columns={'DiabetesPedigreeFunction': 'dpf'}, inplace=True)
# make lower
data.rename(columns=lambda old_name: old_name.lower(), inplace=True)
In [7]:
# remove observation rows with value 0 in selected variables
normed_predictors = [
    'glucose',
    'bloodpressure',
    'skinthickness',
    'insulin',
    'bmi'
]
data = data[(data[normed_predictors] != 0).all(axis=1)]
In [8]:
# scale the covariates for easier comparison of coefficient posteriors
# N.B. int columns turn into floats
data.iloc[:,:-1] -= data.iloc[:,:-1].mean()
data.iloc[:,:-1] /= data.iloc[:,:-1].std()
In [9]:
# preview the first rows again
data.head()
Out[9]:
In [10]:
# preparing the inputs
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
In [11]:
# get shape into variables
n, p = X.shape
print('number of observations = {}'.format(n))
print('number of predictors = {}'.format(p))
In [12]:
with open('logistic_t.stan') as file:
    print(file.read())
In [13]:
model = stan_utility.compile_model('logistic_t.stan')
Here we'll use a Student's $t$ prior with 7 degrees of freedom and a scale of 2.5, which, as discussed above, is a reasonable default when the coefficients should be close to zero but may occasionally be large. PyStan returns draws from the posterior distribution of the parameters; these draws describe the uncertainty about the unknown parameter values.
In [14]:
data1 = dict(
    n=n,
    d=p,
    X=X,
    y=y,
    p_alpha_df=7,
    p_alpha_loc=0,
    p_alpha_scale=2.5,
    p_beta_df=7,
    p_beta_loc=0,
    p_beta_scale=2.5
)
fit1 = model.sampling(data=data1, seed=74749)
samples1 = fit1.extract(permuted=True)
In [15]:
# print summary of selected variables
# use pandas data frame for layout
summary = fit1.summary(pars=['alpha', 'beta'])
pd.DataFrame(
    summary['summary'],
    index=summary['summary_rownames'],
    columns=summary['summary_colnames']
)
Out[15]:
The n_eff values are high and all Rhat values are less than 1.1, which is good.
Next we check divergences, E-BFMI, and treedepth exceedances, as explained in the Robust Statistical Workflow with PyStan case study by Michael Betancourt.
In [16]:
stan_utility.check_treedepth(fit1)
stan_utility.check_energy(fit1)
stan_utility.check_div(fit1)
Everything is fine based on these diagnostics and we can proceed with our analysis.
Visualise the marginal posterior distribution of each parameter.
In [17]:
# plot histograms
fig, axes = plot_tools.hist_multi_sharex(
    [samples1['alpha']] + [sample for sample in samples1['beta'].T],
    rowlabels=['intercept'] + list(X.columns),
    n_bins=60,
    x_lines=0,
    figsize=(7, 10)
)
We can use Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) to estimate the predictive performance of the model.
In [18]:
loo1, loos1, ks1 = psis.psisloo(samples1['log_lik'])
loo1_se = np.sqrt(np.var(loos1, ddof=1)*n)
print('elpd_loo: {:.4} (SE {:.3})'.format(loo1, loo1_se))
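The standard error above is the usual approximation for a sum of $n$ pointwise contributions, $$\mathrm{SE}\left(\widehat{\mathrm{elpd}}_{\mathrm{loo}}\right) = \sqrt{n \, \mathrm{Var}\left(\widehat{\mathrm{elpd}}_{\mathrm{loo},i}\right)},$$ which is exactly what the line computing loo1_se does.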
In [19]:
# check the number of large (> 0.5) Pareto k estimates
np.sum(ks1 > 0.5)
Out[19]:
Next we fit a model with a horseshoe prior on the coefficients. In this example, with $n \gg p$, we do not expect the change of prior to make much difference; the horseshoe prior is usually more useful when $n < p$.
The global scale parameter for the horseshoe prior is chosen as recommended by Juho Piironen and Aki Vehtari (2017). On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), PMLR 54. arXiv preprint arXiv:1610.05559.
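Concretely, if $p_0$ is the prior guess for the number of relevant variables out of $p$, the global scale used in the code below is $$\tau_0 = \frac{p_0}{p - p_0} \, \frac{1}{\sqrt{n}}.$$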
In [20]:
with open('logistic_hs.stan') as file:
    print(file.read())
In [21]:
model = stan_utility.compile_model('logistic_hs.stan')
In [22]:
p0 = 2 # prior guess for the number of relevant variables
tau0 = p0 / (p - p0) * 1 / np.sqrt(n)
data2 = dict(
    n=n,
    d=p,
    X=X,
    y=y,
    p_alpha_df=7,
    p_alpha_loc=0,
    p_alpha_scale=2.5,
    p_beta_df=1,
    p_beta_global_df=1,
    p_beta_global_scale=tau0
)
fit2 = model.sampling(data=data2, seed=74749)
samples2 = fit2.extract(permuted=True)
In the summary and histograms below, we see that the horseshoe prior has shrunk the posterior distributions of the irrelevant coefficients closer to zero, without affecting the posterior distributions of the relevant coefficients.
In [23]:
# print summary of selected variables
# use pandas data frame for layout
summary = fit2.summary(pars=['alpha', 'beta'])
pd.DataFrame(
    summary['summary'],
    index=summary['summary_rownames'],
    columns=summary['summary_colnames']
)
Out[23]:
In [24]:
# plot histograms
fig, axes = plot_tools.hist_multi_sharex(
    [samples2['alpha']] + [sample for sample in samples2['beta'].T],
    rowlabels=['intercept'] + list(X.columns),
    n_bins=60,
    x_lines=0,
    figsize=(7, 10)
)
We also compute LOO for the model with the horseshoe prior. The expected log predictive density is higher, but not significantly so. This is not surprising, as this is an easy data set with $n \gg p$.
In [25]:
loo2, loos2, ks2 = psis.psisloo(samples2['log_lik'])
loo2_se = np.sqrt(np.var(loos2, ddof=1)*n)
print('elpd_loo: {:.4} (SE {:.3})'.format(loo2, loo2_se))
In [26]:
# check the number of large (> 0.5) Pareto k estimates
np.sum(ks2 > 0.5)
Out[26]:
In [27]:
elpd_diff = loos2 - loos1
elpd_diff_se = np.sqrt(np.var(elpd_diff, ddof=1)*n)
elpd_diff = np.sum(elpd_diff)
print('elpd_diff: {:.4} (SE {:.3})'.format(elpd_diff, elpd_diff_se))
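As a rough follow-up (a sketch assuming a normal approximation for the paired elpd difference, which is only a crude summary), one could estimate the probability that the horseshoe model predicts better:
from scipy import stats

# normal approximation: probability that the horseshoe model (model 2) is better
print('Pr(model 2 better) ~ {:.2f}'.format(stats.norm.cdf(elpd_diff / elpd_diff_se)))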