Frequentism vs. Bayesianism II: Fitting a Line to Data

In a previous post I gave a brief practical introduction to the frequentist and Bayesian approaches to statistics, covering both the philosophical underpinnings and practical aspects of these computations in Python. If you haven't yet read that post, please go read that first: most of what is below will assume that you've seen that previous material. Here, as promised, I'm going to explore a (slightly) more complicated model: the problem of fitting a straight line to data.

You might laugh at me calling "fitting a line to data" a more complicated model. It isn't extremely complicated, true, but it is an excellent test bed for demonstrating the frequentist and Bayesian approaches to a general set of problems.

The Easy Way

Of course, simple straight-line regression is such a common problem that most statistics packages have a canned version built-in: one of the many ways to do this in Python is with NumPy's polyfit function. polyfit fits a polynomial to data, and a straight line is simply a degree-1 polynomial.

First, though, we need to generate some data to try this on. We'll start by defining a model and drawing some points from it with Gaussian errors:


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# define our model
theta = [1, -5]  # [slope, intercept]
def f(x, theta):
    return theta[0] * x + theta[1]

# Generate some data
sigma = 1  # error on y
np.random.seed(0)
x = 10 * np.random.random(50)
y = np.random.normal(f(x, theta), sigma)

Now that we have some data to work with, we can use np.polyfit to find the line of best fit:


In [3]:
# Fit a line to the data
theta_fit = np.polyfit(x, y, deg=1)
print(theta_fit)


[ 0.96927329 -5.00721008]

In [4]:
# Visualize the data & fit
xfit = np.linspace(0, 10)
plt.plot(xfit, f(xfit, theta_fit))
plt.errorbar(x, y, sigma, fmt='.k', ecolor='gray')
plt.xlabel('x')
plt.ylabel('y');


Voilà! We have our best-fit line!

What Is This Function Doing?

When using a tool like np.polyfit, it's rarely enough to simply apply the method and move on. To use a model like this in your research or data analysis, you really should make sure you know what's happening in the background, so that you understand the assumptions and approximations the model might entail.

To this end, let's start from what we were looking at in the last post: the likelihood. For Gaussian measurement errors like the ones we have here, the likelihood is simply the product of the Gaussian probabilities of the individual measurements:

$$ \mathcal{L}(\theta~|~D) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[\frac{-[y_i - f(x_i;\theta)]^2}{2\sigma^2} \right] $$

This should be looking pretty familiar now. By taking the logarithm and ignoring constant terms, you can convince yourself that the likelihood $\mathcal{L}$ is maximized when the following quantity is minimized:

$$ S(\theta) = \sum_{i=1}^N [y_i - f(x_i;\theta)]^2 $$

Thus the optimal model is found by minimizing the sum of squared residuals. You might recognize this as the standard least squares fitting procedure, and now we see where it comes from: the least squares approach is derived from the maximum likelihood result!
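
To make the logarithm step explicit: taking the log of the likelihood above gives

$$ \log\mathcal{L}(\theta~|~D) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N [y_i - f(x_i;\theta)]^2 $$

The first term is independent of $\theta$, so maximizing $\log\mathcal{L}$ (and hence $\mathcal{L}$) is equivalent to minimizing $S(\theta)$.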

This minimization can be performed analytically in the case of a best-fit line (see this Wolfram article), and can also be solved in general for any linear model, even if the errors $\sigma_i$ are different for each point! But stepping through that derivation is not particularly enlightening, and the analytic approach doesn't easily generalize to more complex models. For those reasons, we'll skip the derivation and go straight to the numerical approach, using tools built into SciPy to find the optimal model for our data.

SciPy's optimize module contains many useful routines for numerical optimization, one of which is the leastsq optimizer. The nice thing about leastsq is that it can also return an estimate of the error on the final parameters, which essentially comes from fitting an N-dimensional Gaussian to the likelihood surface at its maximum:


In [5]:
from scipy.optimize import leastsq
theta_guess = [1, 1]

def minfunc(theta):
    return y - f(x, theta)

print(leastsq(minfunc, theta_guess)[0])


[ 0.96927329 -5.00721008]

Comparing to the np.polyfit result above, we see that the answer is basically identical. This reflects the fact that polyfit is solving the exact same least-squares problem we wrote above; it's just taking a few extra steps to cast the minimization in terms of matrix algebra, which generally leads to a faster solution.
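
To make that matrix-algebra framing concrete, here is a minimal sketch of the idea (an illustration only, not polyfit's actual implementation): build a design matrix whose columns are $[x, 1]$ and solve the linear least-squares problem with np.linalg.lstsq:


In [ ]:
# least-squares fit via the design matrix: columns are [x, 1]
X = np.vstack([x, np.ones_like(x)]).T
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(theta_lstsq)  # should match the polyfit / leastsq results above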

The benefit of optimize.leastsq, though, is that it can also quite quickly give you an estimate of the errors on your fit parameters. This is accomplished by asking the routine to return the full_output:


In [6]:
result = leastsq(minfunc, theta_guess, full_output=True)
mean = result[0]
cov = result[1]

print("best fit:")
print(mean)
print("error covariance matrix:")
print(cov)


best fit:
[ 0.96927329 -5.00721008]
error covariance matrix:
[[ 0.00269801 -0.01451436]
 [-0.01451436  0.09808219]]

The error is reported not as a single errorbar for each parameter, but as an error covariance matrix. The diagonal terms in this covariance matrix are related to the square of the typical error bars, and the off-diagonal terms give the covariance between these errors.
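
For a quick look at the typical size of each error bar, we can take the square root of the diagonal of this matrix (a rough per-parameter summary that ignores the off-diagonal covariance):


In [ ]:
# 1-sigma error estimates from the diagonal of the covariance matrix
slope_err, intercept_err = np.sqrt(np.diag(cov))
print("slope     = {0:.3f} +/- {1:.3f}".format(mean[0], slope_err))
print("intercept = {0:.3f} +/- {1:.3f}".format(mean[1], intercept_err))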

This error covariance is often visualized as an ellipse, which we can create using a small utility function to convert the matrix to a tuple of $(r_1, r_2, \alpha)$ which describes the ellipse (for background, see e.g. Section 3.5 in our book). After defining this function, we'll plot the Ellipse representing the error bounds:


In [7]:
def get_ellipse(M):
    """
    Compute the rotation and principal axes of a 2D covariance matrix M
    """
    alpha = 0.5 * np.arctan2(2 * M[1, 0], (M[0, 0] - M[1, 1]))
    s1 = 0.5 * (M[0, 0] + M[1, 1])
    s2 = np.sqrt(0.25 * (M[0, 0] - M[1, 1]) ** 2 + M[1, 0] ** 2)
    return np.sqrt(s1 + s2), np.sqrt(s1 - s2), np.degrees(alpha)

In [8]:
from matplotlib.patches import Ellipse
fig, ax = plt.subplots()

# plot a dot on the best-fit model
ax.plot(mean[:1], mean[1:], 'ok')

# plot cross-hairs on the true value
ax.axvline(theta[0], color='lightgray')
ax.axhline(theta[1], color='lightgray')

# Plot 1-2-3 sigma ellipses
r1, r2, alpha = get_ellipse(cov)
for s in [1, 2, 3]:
    #r1, r2 are radii: ellipse needs full widths 2*r1, 2*r2
    ax.add_patch(Ellipse(mean, s * 2 * r1, s * 2 * r2, angle=alpha, fc='none'))

ax.axis('auto')
ax.set_xlabel('slope')
ax.set_ylabel('intercept')


Out[8]:
<matplotlib.text.Text at 0x10627c2d0>

Looking at this, we see that there is a (negative) correlation between the errors on the slope and the intercept, and this correlation makes intuitive sense: if you increase the slope of the line, you must decrease the intercept for it to still fit the data, and vice versa.
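
To put a rough number on this, the correlation coefficient of the two errors is just the off-diagonal covariance scaled by the two standard deviations:


In [ ]:
# correlation coefficient between the slope and intercept errors
print(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1]))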

To make this more clear, let's plot a sampling of the possible lines covered by these ellipses. We can do this using the np.random.multivariate_normal function, which generates draws from a multivariate normal distribution:


In [9]:
theta_sample = np.random.multivariate_normal(mean, cov, 200)

fig, ax = plt.subplots()
ax.errorbar(x, y, sigma, fmt='.k', ecolor='gray')
for theta_r in theta_sample:
    ax.plot(xfit, f(xfit, theta_r), '-b', lw=0.5, alpha=0.05)


This distribution of lines indicates the range of possible fits found through the least squares approach.

The Bayesian Approach

Let's now take a look at the Bayesian approach to the same problem. As always, we'll start by writing out Bayes' theorem:

$$ P(\theta~|~D, M) = \frac{P(D~|~\theta,M)P(\theta~|~M)}{P(D~|~M)} $$

Just to be explicit, this is what each of the terms means:

  • $\theta$: the model parameters, i.e. $\theta_0 = {\rm slope}$, $\theta_1 = {\rm intercept}$
  • $D$: the data, i.e. $\{x_i, y_i\}$.
  • $M$: the model, i.e. "a straight line"
  • $P(\theta~|~D,M)$: the posterior, which is what we want to compute
  • $P(D~|~\theta,M)$: the likelihood, proportional to $\mathcal{L}(\theta~|~D)$ above
  • $P(\theta~|~M)$: the prior, encoding what we know (or assume) about $\theta$ beforehand
  • $P(D~|~M)$: effectively a normalization constant

As we saw last time, if we use a flat prior $P(\theta~|~M) \propto 1$, we find

$$ P(\theta~|~D,M) \propto \mathcal{L}(\theta~|~D) $$

and maximizing the posterior will give the same result as the frequentist maximum likelihood approach.

Computing the Bayesian Result

As we did in the previous post, we'll use the emcee package to solve this problem in a Bayesian manner:


In [15]:
import emcee

def log_prior(theta):
    return 1

def log_likelihood(theta, x, y, sigma):
    return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2)
                         + (y - f(x, theta)) ** 2 / sigma ** 2)

def log_posterior(theta, x, y, sigma):
    return log_prior(theta) + log_likelihood(theta, x, y, sigma)

ndim = 2  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000  # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take
        
# run the sampler
starting_guesses = 10 * np.random.rand(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, sigma])
sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.chain[:, nburn:, :].reshape(-1, 2)  # discard burn-in points

Now let's use these samples to visualize the posterior, using the utility routine from astroML:


In [16]:
from astroML.plotting import plot_mcmc
fig = plt.figure()
ax = plot_mcmc(sample.T, fig=fig, labels=['slope', 'intercept'], colors='k')
ax[0].plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
ax[0].axvline(theta[0], c='lightgray')
ax[0].axhline(theta[1], c='lightgray')
ax[0].grid(True, alpha=0.3)


As expected, we find parameter constraints very similar to those of the frequentist method, above.

Marginalization: Where Bayesianism Starts to Shine

Looking at this example and the ones from the previous post, you might begin to wonder what the point of this discussion is. The two approaches seem to give virtually the same results for these simple problems, frequentism through optimization and Gaussian error estimation, Bayesianism through MCMC posterior sampling. So where do the approaches diverge?

In my opinion, the primary strength of the Bayesian approach is the concept of marginalization. Marginalization is the process of integrating out ("marginalizing over") one or more parameters whose values are not of interest for the result. Mathematically, if we have a posterior over two parameters, $P(\theta_1, \theta_2~|~D,M)$, the posterior marginalized over $\theta_2$ can be written

$$ P(\theta_1~|~D,M) = \int_{-\infty}^\infty P(\theta_1, \theta_2~|~D,M)~{\rm d}\theta_2 $$

This marginalized posterior gives us the result for $\theta_1$ without reference to $\theta_2$: essentially, $\theta_2$ is treated as a nuisance parameter and integrated away.

In practice, when we use a sample-based approach such as MCMC, marginalization is extremely easy: we just ignore the variable we're marginalizing over. So, for example, to find the posterior distribution for the slope of the line, marginalizing over the intercept, we can simply do the following:


In [22]:
fig, ax = plt.subplots()
ax.hist(sample[:, 0], 50, histtype='stepfilled',
        alpha=0.3, normed=True)
ax.set_xlabel('slope')
ax.set_ylabel('P(slope|D,M)');


This simple example, though, doesn't fully demonstrate the power of Bayesian marginalization. To see it in action, let's take a look at a more complicated situation: fitting a line with outliers.

Controlling for Bad Data: Fitting a Line with Outliers

Everything up until this point has basically assumed that the data is clean: i.e. that the errors on the data actually reflect the real measurement errors. Let's add some outliers to our dataset and see what this does to the basic frequentist fit:


In [89]:
np.random.seed(0)
x = 10 * np.random.random(20)
y = np.random.normal(f(x, theta), sigma)

# Make some outliers
y[7] -= 10
y[15] += 10

# plot the input
plt.errorbar(x, y, sigma, fmt='.k')
plt.plot(xfit, f(xfit, theta), ':k');

# do a simple frequentist fit:
theta_fit = np.polyfit(x, y, deg=1)
plt.plot(xfit, f(xfit, theta_fit), '-b')


Out[89]:
[<matplotlib.lines.Line2D at 0x10751bc10>]

The dotted line shows the input model, and the blue line shows the basic maximum likelihood fit. With this, we see that the presence of just a couple outliers changes the line of best fit completely!

The basic reason for this over-influence of outliers is the fact that we're using a Gaussian likelihood. The likelihood of each point is suppressed exponentially with the square of its deviation in units of the standard deviation, so the influence of a single far outlier can quickly exceed the combined influence of many good points.
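
As a quick back-of-the-envelope illustration (the numbers here are chosen purely for illustration): a single point that is off by $5\sigma$ contributes

$$ \frac{(5\sigma)^2}{2\sigma^2} = 12.5 $$

to the negative log-likelihood, the same penalty as 25 points that are each off by $1\sigma$, so one bad point can outweigh many good ones.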

Controlling for Outliers: the Frequentist Approach

One of the more common frequentist approaches to correcting for outliers is to adjust the loss function so that points with large residuals are penalized less heavily; a sketch of this idea follows below.
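
As a minimal sketch of this idea (the Huber loss and the threshold $c = 3$ below are illustrative choices, not the only option; x, y, sigma, f, xfit, and theta_fit are the objects defined in the cells above), we can replace the squared loss with one that grows only linearly for large residuals and minimize it numerically:


In [ ]:
from scipy import optimize

def huber_loss(t, c=3):
    # quadratic for |t| <= c, linear beyond: large residuals are penalized less
    return np.where(np.abs(t) <= c,
                    0.5 * t ** 2,
                    c * (np.abs(t) - 0.5 * c))

def total_huber_loss(theta, x=x, y=y, sigma=sigma, c=3):
    return huber_loss((y - f(x, theta)) / sigma, c).sum()

theta_robust = optimize.minimize(total_huber_loss, [1, 0], method='Nelder-Mead').x
print(theta_robust)

# compare the robust fit to the standard squared-loss fit from above
plt.errorbar(x, y, sigma, fmt='.k', ecolor='gray')
plt.plot(xfit, f(xfit, theta_fit), '-b', label='squared loss')
plt.plot(xfit, f(xfit, theta_robust), '-r', label='Huber loss')
plt.legend();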


Questioning the Prior

But is a flat prior the right choice? Let's assume a flat prior on the slope: any slope in the range $[-100, 100]$ has equal probability. What does the distribution of models under this prior look like? We can use the same trick as above and draw samples from this distribution to look at the results:


In [9]:
slope = -100 + 200 * np.random.random(1000)
fig, ax = plt.subplots()
xfit = np.linspace(-1, 1)

for m in slope:
    ax.plot(xfit, m * xfit, '-b', alpha=0.1)
ax.set_ylim(-1, 1);


What we see is that a uniform distribution of slopes leads to an effective weighting toward models that are close to vertical. That is, a flat prior on the slope subtly biases the result. With this in mind, a better prior might be one which is uniform in the azimuthal angle $\phi$ rather than uniform in the slope $m$.


In [10]:
phi = np.pi * (np.random.random(1000) - 0.5)
slope = np.tan(phi)
fig, ax = plt.subplots()

for m in slope:
    ax.plot(xfit, m * xfit, '-b', alpha=0.1)
ax.set_ylim(-1, 1);


This choice of prior distribution seems to cover the parameter space much more regularly, and to be much less biased toward large slopes. (By the way, it can be shown that with uniform $\phi$, the slope $m = \tan\phi$ is distributed according to $P(m) \propto 1 / (m^2 + 1)$; see Section 3.1 of our book for details.)
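
For completeness, the change of variables behind that result is a one-liner: with $\phi$ distributed uniformly on $(-\pi/2, \pi/2)$ and $m = \tan\phi$,

$$ P(m) = P(\phi)\left|\frac{{\rm d}\phi}{{\rm d}m}\right| = \frac{1}{\pi}\,\frac{1}{1 + m^2} $$

which is exactly the $1/(m^2 + 1)$ weighting that appears in the prior used below.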

Here is where frequentists are tempted to declare victory. The analysis fundamentally depends on the prior: how can you choose something so important seemingly out of thin air, and then claim that your results are not biased? Doesn't this whole exercise just show the fundamental silliness of the Bayesian approach?

A Bayesian would counter that, no, it actually shows the silliness of the frequentist approach. The frequentist maximum likelihood is, after all, just a special case of Bayesianism with an implicit choice of prior. This unmentioned flat prior will bias the frequentist results toward high slopes, whether they want to admit it or not. Not naming this bias does not cause it to disappear, and it's much more honest to take the Bayesian route and explicitly show what assumptions are entering the calculation.

And the holy war rages on.


The Bayesian Result

Philosophical arguments aside, though, what does the result of the Bayesian approach look like? As in the previous post, we'll compute the result via MCMC, using the emcee package:


In [11]:
import emcee
from astroML.plotting import plot_mcmc

def log_flat_prior(theta):
    return 1  # flat prior

def log_angle_prior(theta):
    # flat prior on the angle phi, i.e. P(slope) proportional to 1 / (slope^2 + 1)
    return -np.log(theta[0] ** 2 + 1)

def log_likelihood(theta, x, y, sigma):
    return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2)
                         + (y - f(x, theta)) ** 2 / sigma ** 2)

def run_mcmc(x, y, sigma, log_prior, title,
             ndim=2, nwalkers=50, nburn=1000, nsteps=2000):
    # define the posterior
    def log_posterior(theta, x, y, sigma):
        return log_prior(theta) + log_likelihood(theta, x, y, sigma)
        
    # run the sampler
    starting_guesses = 10 * np.random.rand(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, sigma])
    sampler.run_mcmc(starting_guesses, nsteps)
    sample = sampler.chain[:, nburn:, :].reshape(-1, 2)  # discard burn-in points
    
    # plot the results
    fig = plt.figure()
    ax = plot_mcmc(sample.T, fig=fig, labels=['slope', 'intercept'], colors='k')
    ax[0].plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
    ax[0].axvline(theta[0], c='lightgray')
    ax[0].axhline(theta[1], c='lightgray')
    ax[0].grid(True, alpha=0.3)
    ax[0].set_title(title)
    return ax[0]

In [12]:
ax1 = run_mcmc(x, y, sigma, log_flat_prior, 'flat prior on slope')
ax2 = run_mcmc(x, y, sigma, log_angle_prior, 'flat prior on angle')
ax2.axis(ax1.axis()); # make axis limits the same


Out[12]:
(0.7718434614741706,
 1.172217452344317,
 -6.2919716729826627,
 -3.7404353180342311)
