Author: Thomas Wiecki
This tutorial is adapted from a blog post by Thomas Wiecki called "The Inference Button: Bayesian GLMs made easy with PyMC3".
This tutorial appeared as a post in a small series on Bayesian GLMs on my blog.
In this blog post I will talk about how to specify and fit a Bayesian linear regression, and how the new glm submodule of PyMC3 makes this as easy as pressing the Inference Button.
Ready? Let's get started!
There is a huge paradigm shift underway in many scientific disciplines: The Bayesian Revolution.
While the theoretical benefits of Bayesian over Frequentist stats have been discussed at length elsewhere (see Further Reading below), there is a major obstacle that hinders wider adoption -- usability (this is one of the reasons DARPA put out a large grant to improve Probabilistic Programming).
This is mildly ironic because the beauty of Bayesian statistics is its generality. Frequentist stats have a bazillion different tests for every different scenario. In Bayesian land you define your model exactly as you think is appropriate and hit the Inference Button(TM) (i.e. run the magical MCMC sampling algorithm).
Yet when I ask my colleagues why they use frequentist stats (even though they would like to use Bayesian stats) the answer is that software packages like SPSS or R make it very easy to run all those individual tests with a single command (and more often than not, they don't know the exact model and inference method being used).
While there are great Bayesian software packages like JAGS, BUGS, Stan and PyMC, they are written for Bayesian statisticians who know very well what model they want to build.
Unfortunately, "the vast majority of statistical analysis is not performed by statisticians" -- so what we really need are tools for scientists and not for statisticians.
In the interest of putting my code where my mouth is I wrote a submodule for the upcoming PyMC3 that makes construction of Bayesian Generalized Linear Models (GLMs) as easy as Frequentist ones in R.
While future blog posts will explore more complex models, I will start here with the simplest GLM -- linear regression. In general, frequentists think about Linear Regression as follows:
$$ Y = X\beta + \epsilon $$

where $Y$ is the output we want to predict (or dependent variable), $X$ is our predictor (or independent variable), and $\beta$ are the coefficients (or parameters) of the model we want to estimate. $\epsilon$ is an error term which is assumed to be normally distributed.
We can then use Ordinary Least Squares or Maximum Likelihood to find the best fitting $\beta$.
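For comparison, here is a minimal frequentist sketch (not from the original post) that finds the best fitting $\beta$ by ordinary least squares with NumPy; it produces point estimates only, with no notion of uncertainty:

# Sketch: ordinary least squares with NumPy (illustrative only).
import numpy as np

np.random.seed(0)
x = np.linspace(0, 1, 200)
y = 1 + 2 * x + np.random.normal(scale=0.5, size=200)  # true intercept 1, slope 2

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # point estimates of intercept and slope -- no uncertainty attached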
Bayesians take a probabilistic view of the world and express this model in terms of probability distributions. Our above linear regression can be rewritten to yield:
$$ Y \sim \mathcal{N}(X \beta, \sigma^2) $$

In words, we view $Y$ as a random variable (or random vector) of which each element (data point) is distributed according to a Normal distribution. The mean of this Normal distribution is provided by our linear predictor with variance $\sigma^2$.
While this is essentially the same model, there are two critical advantages of Bayesian estimation:
Priors: we can quantify any prior knowledge we might have by placing priors on the parameters; for example, if we think the noise is likely to be small we can put more prior probability mass on low values of $\sigma$.
Quantifying uncertainty: we do not get a single best-fitting estimate of $\beta$ as above but instead a full posterior distribution over how plausible different parameter values are; with few data points this uncertainty will be large and the posterior wide.
In [1]:
%matplotlib inline
from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt
Create some toy data to play around with and scatter-plot it.
Essentially we are creating a regression line defined by intercept and slope and adding data points by sampling from a Normal with its mean set to the regression line.
In [2]:
size = 200
true_intercept = 1
true_slope = 2
x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=.5, size=size)
data = dict(x=x, y=y)
In [3]:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
ax.plot(x, y, 'x', label='sampled data')
ax.plot(x, true_regression_line, label='true regression line', lw=2.)
plt.legend(loc=0);
Let's fit a Bayesian linear regression model to this data. As you can see, model specifications in PyMC3 are wrapped in a with statement.
Here we use the awesome new NUTS sampler (our Inference Button) to draw 3000 posterior samples.
In [4]:
with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    sigma = HalfCauchy('sigma', beta=10, testval=1.)
    intercept = Normal('Intercept', 0, sd=20)
    x_coeff = Normal('x', 0, sd=20)

    # Define likelihood
    likelihood = Normal('y', mu=intercept + x_coeff * x,
                        sd=sigma, observed=y)

    # Inference!
    trace = sample(3000, njobs=2)  # draw 3000 posterior samples using NUTS sampling
This should be fairly readable for people who know probabilistic programming. However, would my non-statistician friend know what all this does? Moreover, recall that this is an extremely simple model that would be one line in R. Having multiple, potentially transformed regressors, interaction terms or link functions would make this much more complex and error-prone.
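To make the point concrete, here is a rough sketch (not part of the original tutorial) of what a manual specification with a second, purely hypothetical regressor x2 might look like -- note the extra prior and the longer mu expression, which is exactly the kind of boilerplate that grows with every added term:

# Sketch only: x2 is a hypothetical extra regressor, not part of the tutorial's data.
x2 = np.random.normal(size=size)
with Model() as model_two_regressors:
    sigma = HalfCauchy('sigma', beta=10, testval=1.)
    intercept = Normal('Intercept', 0, sd=20)
    x_coeff = Normal('x', 0, sd=20)
    x2_coeff = Normal('x2', 0, sd=20)
    likelihood = Normal('y', mu=intercept + x_coeff * x + x2_coeff * x2,
                        sd=sigma, observed=y)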
The new GLM.from_formula() function in the glm submodule instead takes a Patsy linear model specifier from which it creates a design matrix. It then adds random variables for each of the coefficients and an appropriate likelihood to the model.
In [5]:
with Model() as model:
    # Specify the GLM and pass in the data. The resulting linear model, its likelihood
    # and all its parameters are automatically added to our model.
    glm.GLM.from_formula('y ~ x', data)
    trace = sample(3000, njobs=2)  # draw 3000 posterior samples using NUTS sampling
Much shorter, but this code does the exact same thing as the model specification above (and you can still change the priors and everything else if you want to). GLM.from_formula() parses the Patsy model string, adds random variables for each regressor (Intercept and slope x in this case), adds a likelihood (by default, a Normal is chosen), and adds all other variables (sigma). Finally, it initializes the parameters to a good starting point by estimating a frequentist linear model using statsmodels.
If you are not familiar with R's syntax, 'y ~ x' specifies that we have an output variable y that we want to estimate as a linear function of x.
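If you are curious what Patsy actually does with such a formula, you can build the design matrix yourself; this is just an illustrative sketch using the patsy package directly (the tutorial itself does not require it):

# Sketch: inspect the design matrix Patsy builds from the formula 'y ~ x'.
import patsy
y_dm, X_dm = patsy.dmatrices('y ~ x', data)
print(X_dm.design_info.column_names)  # ['Intercept', 'x']
print(np.asarray(X_dm)[:3])  # first rows: a column of ones plus the x values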
Bayesian inference does not give us only one best fitting line (as maximum likelihood does) but rather a whole posterior distribution of likely parameters. Let's plot the posterior distribution of our parameters and the individual samples we drew.
In [6]:
plt.figure(figsize=(7, 7))
traceplot(trace[100:])
plt.tight_layout();
The left side shows our marginal posterior -- for each parameter value on the x-axis we get a probability on the y-axis that tells us how likely that parameter value is.
There are a couple of things to see here. The first is that the sampling chains for the individual parameters (right side) appear well converged and stationary (there are no large drifts or other odd patterns).
Secondly, the maximum posterior estimate of each variable (the peak in the left-side distributions) is very close to the true parameters used to generate the data (x is the regression coefficient and sigma is the standard deviation of our Normal).
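To check this numerically rather than by eye, you can also print a posterior summary. Assuming the summary function from PyMC3 is available via the star import above, a minimal sketch is:

# Sketch: numerical summary of the posterior (means, standard deviations, intervals),
# discarding the first 100 samples as burn-in, like in the trace plot above.
summary(trace[100:])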
In the GLM we thus do not only have one best fitting regression line, but many. A posterior predictive plot takes multiple samples from the posterior (intercepts and slopes) and plots a regression line for each of them. Here we are using the plot_posterior_predictive_glm() convenience function for this.
In [7]:
plt.figure(figsize=(7, 7))
plt.plot(x, y, 'x', label='data')
plot_posterior_predictive_glm(trace, samples=100,
                              label='posterior predictive regression lines')
plt.plot(x, true_regression_line, label='true regression line', lw=3., c='y')
plt.title('Posterior predictive regression lines')
plt.legend(loc=0)
plt.xlabel('x')
plt.ylabel('y');
As you can see, our estimated regression lines are very similar to the true regression line. But since we only have limited data we have uncertainty in our estimates, here expressed by the variability of the lines.
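If you prefer to summarize that variability as a band rather than as individual lines, one option (a sketch, not from the original post) is to evaluate the sampled intercepts and slopes on the x grid and take percentiles:

# Sketch: turn posterior samples into a 95% credible band for the regression line.
intercepts = trace['Intercept'][100:]
slopes = trace['x'][100:]
lines = intercepts[:, None] + slopes[:, None] * x[None, :]  # one line per posterior sample
lower, upper = np.percentile(lines, [2.5, 97.5], axis=0)
plt.plot(x, y, 'x', label='data')
plt.fill_between(x, lower, upper, alpha=0.3, label='95% credible band')
plt.legend(loc=0);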
PyMC3 allows GLM specification with convenient syntax borrowed from R. This is the first post of a small series on Bayesian GLMs I am preparing. Next week I will describe how the Student T distribution can be used to perform robust linear regression.
Beyond that, there are also other good resources on Bayesian statistics worth exploring.