In [1]:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import cauchy
import scipy.stats as stats
from JSAnimation import IPython_display
import pymc3 as pm
import theano as thea
import theano.tensor as T
%matplotlib inline
There are two packages commonly used for MCMC calculations in astronomy:
emcee
The emcee package is a pure-Python package written by astronomer Dan Foreman-Mackey. It is a lightweight package which implements a fairly sophisticated affine-invariant ensemble MCMC sampler. Because the package is pure Python (i.e. it contains no compiled extensions) it is extremely easy to install; with pip, simply type "pip install emcee" at the command line.
emcee requires very little package-specific boilerplate: you pass it a Python function which returns a value proportional to the log-posterior probability, and it returns samples from that posterior. All you need to do is define your log-posterior (in Python) and emcee will sample from that distribution. Because it is pure Python it does not have specially-defined objects for the common distributions (e.g. uniform, normal, etc.), and it performs particularly well when the likelihood is expensive to calculate (the Python overhead is then negligible).
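To illustrate the interface, here is a minimal sketch of sampling a one-dimensional Gaussian posterior with emcee (the log_posterior function and the walker/step counts below are illustrative only, not part of this notebook):

import numpy as np
import emcee

def log_posterior(theta):
    '''Log of an (unnormalized) standard normal posterior in one parameter.'''
    return -0.5 * theta[0] ** 2

ndim, nwalkers, nsteps = 1, 20, 1000
p0 = np.random.randn(nwalkers, ndim)              # random starting positions for the walkers
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, nsteps)                      # run the affine-invariant ensemble sampler
samples = sampler.flatchain                       # flattened array of posterior samples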
PyMC3 (don't use PyMC2!)
The PyMC package has many more features than emcee, including built-in support for efficient sampling of common prior distributions. PyMC by default uses the classic Metropolis-Hastings sampler, one of the earliest MCMC algorithms. PyMC3 is a rewrite of PyMC which improves both the API and the installation process. The hard work in PyMC3 is done through Theano, a Python package used for machine learning.
More details about PyMC are available from the PyMC User Guide.
See "Probabilistic Programming and Bayesian Methods for Hackers" by Cam Davidson-Pilon for a great introduction.
We will start with a simple use case where we take a set of data drawn from a Normal distribution and then fit a Gaussian to these data. In doing so we will outline the details and structure of PyMC3 to prepare us for the exercises later on.
Probability distributions can take many different forms, which pymc3 supports through theano (you can also create your own distributions).
As we consider which priors to use, see Jake VanderPlas's discussion of how to define non-informative priors.
We need to define the model we will be fitting and the distributions that we will use for the priors. For our simple Gaussian example the likelihood has two parameters, $\mu$ and $\sigma$, and we can assume that these parameters have uniform priors
Likelihood: $p(D\,|\,\theta) = N(\mu,\sigma)$
Priors: $p(\mu,\sigma)$ as uniform distributions
In [ ]:
#generate a set of data
N = 200
np.random.seed(44)
mu_0 = 10.
sigma_0 = 2.
y = np.random.normal(loc=mu_0, scale=sigma_0, size=N)
#set up the model
#length of our chain
nsamples = 2000
In [4]:
with pm.Model() as model:  # variables created within a given model's context are assigned to the model
    # set up the priors for the parameters
    mu = pm.Uniform('mu', lower=-20, upper=20)      # a simple uniform prior
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    # set up the likelihood
    y_obs = pm.Normal('Y_obs', mu, sigma, observed=y)  # we use the canned distributions in PYMC3
    # choose our sampling technique
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(nsamples, step, start, random_seed=123, njobs=2, progressbar=True)
In [5]:
fig = plt.figure(figsize=(10, 10))
hist_mu, bins_mu = np.histogram(trace['mu'], normed=True)
ax = fig.add_axes((0.4, 0.1, 0.55, 0.29))
ax.plot(0.5 * (bins_mu[1:] + bins_mu[:-1]), hist_mu,
'-k', drawstyle='steps-mid')
ax.set_xlabel(r'$\mu$')
plt.xlim(9.4, 10.4)
plt.show()
In [6]:
lines = {var:trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines= lines)
pm.autocorrplot(trace)
pm.df_summary(trace)
Out[6]:
traceplot shows, for each parameter, the marginal posterior distribution (left) and the sampled values as a function of step in the chain (right). The right-hand side shows broad oscillation: both inlying and extreme values occur frequently but at irregular intervals, indicating the Markov chain was 'well-mixed' (sampling evenly around the optimal position).
autocorrplot shows the autocorrelation of each chain as a function of lag; a rapid fall-off to low autocorrelation indicates efficient, nearly independent sampling, while a slowly decaying autocorrelation means the chain is mixing poorly.
"Monitoring convergence of iterative simulation is straightforward—discard the first part of the simulations and then compare the variances of quantities of interest within and between chains—and inference given approximate convergence is even simpler: just mix the simulations together and use them as a joint distribution. Both these ideas can and have been refined, but the basic concepts are straightforward and robust.
The hard part is knowing what to do when the simulations are slow to converge."
Convergence statistics
There are a number of proposed statistics for testing convergence
Geweke (1992): Compare the mean of the first fraction and the last fraction of a series. The chain is divided into segments and this statistic is computed for each. For convergence the score should vary between -1 and 1.
$$ z = \frac{E[t_s] - E[t_e]}{\sqrt{V[t_s] + V[t_e]}} $$
with $E[t_s]$ the expectation value and $V[t_s]$ the variance of the chain segment.
Gelman and Rubin (1992): Compare the variance between multiple chains to the variance within each chain. For convergence the between-chain and within-chain variances should be the same; chains that haven't converged give values greater than one. Gelman and Rubin show that if $\hat{R}$ is greater than 1.1 or 1.2 we need a longer burn-in.
$$\hat{R} = \frac{\hat{V}}{\hat{W}}$$
with $\hat{W}$ the within-chain variance and $\hat{V}$ the posterior variance estimate for the pooled traces.
Both of these statistics are provided by pymc3
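To make the formulas concrete, here is a minimal numpy sketch of both statistics (a simplified version of what the pymc3 diagnostics compute; the function names and the commented usage lines are illustrative only):

def geweke_z(chain, first=0.1, last=0.5):
    '''Geweke z-score comparing the means of the first and last fractions of a chain.'''
    a = chain[: int(first * len(chain))]
    b = chain[int((1 - last) * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var() + b.var())

def gelman_rubin_r(chains):
    '''Gelman-Rubin statistic for an array of chains with shape (n_chains, n_samples).'''
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    V_hat = (n - 1) / n * W + B / n             # pooled posterior variance estimate
    return V_hat / W                            # some implementations report the square root of this ratio

# e.g. using the two chains of 'mu' sampled above (assumes njobs=2):
# print(geweke_z(trace.get_values('mu', chains=0)))
# print(gelman_rubin_r(np.array([trace.get_values('mu', chains=c) for c in (0, 1)])))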
In [7]:
# Geweke diagnostic
g = pm.diagnostics.geweke(trace, intervals=20)
plt.scatter(*g[0]['mu'].T)
plt.hlines([-1,1], 0, 1000, linestyles='dotted')
plt.xlim(0, 1000)
#Gelman-Rubin diagnostic
pm.diagnostics.gelman_rubin(trace)
Out[7]:
Repeat the Gaussian example but now fit a Cauchy distribution to the data below. Start by defining your own priors, set up your likelihood function, choose your sampling technique, and build your PyMC3 model. The two cells below this one provide some plotting routines for showing the marginals of the trace (feel free to use your own, e.g. a seaborn PairGrid).
The Cauchy distribution has two parameters, $\mu$ and $\gamma$, and is often used because of its long tails:
$$\frac {1}{\pi \gamma \,\left[1+\left({\frac {x-\mu}{\gamma }}\right)^{2}\right]}$$
Describe what happens as you reduce the number of samples.
In [2]:
#using PYMC3 to fit Cauchy distribution
# generate observed data
N = 50
np.random.seed(44)
mu_0 = 0
gamma_0 = 2
y = cauchy(mu_0, gamma_0).rvs(N)
nsamples = 2000
with pm.Model() as model:
    # define the priors
    mu = pm.Uniform('mu', lower=-5, upper=5)
    gamma = pm.Uniform('gamma', lower=0, upper=5)
    # define likelihood - need to write out p(D|model) as this is a Cauchy
    y_obs = pm.Cauchy('y_obs', mu, gamma, observed=y)
    # inference
    start = pm.find_MAP()  # define the starting point using gradient descent (or another optimizer)
    step = pm.NUTS(scaling=start)  # note: here we are using the NUTS sampler (more later)
    trace = pm.sample(nsamples, step, start, random_seed=123, njobs=3, progressbar=True)
In [3]:
def convert_to_stdev(logL):
    '''
    Given a grid of log-likelihood values, convert them to cumulative
    standard deviation. This is useful for drawing contours from a
    grid of likelihoods.
    '''
    sigma = np.exp(logL)
    shape = sigma.shape
    sigma = sigma.ravel()
    # obtain the indices to sort and unsort the flattened array
    i_sort = np.argsort(sigma)[::-1]
    i_unsort = np.argsort(i_sort)
    sigma_cumsum = sigma[i_sort].cumsum()
    sigma_cumsum /= sigma_cumsum[-1]
    return sigma_cumsum[i_unsort].reshape(shape)


def plotDistributionsMarginals(trace_x, trace_y, xlim=(-5, 5), ylim=(0, 5), nbins=41):
    '''Given a pair of traces, plot their joint and marginal distributions.

    Note: this routine draws into the current global figure `fig`
    (create a figure before calling it, as in the next cell).
    '''
    bins = (np.linspace(xlim[0], xlim[1], nbins), np.linspace(ylim[0], ylim[1], nbins))
    # compute histogram of results to plot below
    L_MCMC, trace_x_bins, trace_y_bins = np.histogram2d(trace_x, trace_y, bins=bins)
    L_MCMC[L_MCMC == 0] = 1E-16  # prevents zero-division errors
    hist_x, bins_x = np.histogram(trace_x, bins=trace_x_bins, normed=True)
    hist_y, bins_y = np.histogram(trace_y, bins=trace_y_bins, normed=True)

    # first axis: likelihood contours
    ax1 = fig.add_axes((0.4, 0.4, 0.55, 0.55))
    ax1.xaxis.set_major_formatter(plt.NullFormatter())
    ax1.yaxis.set_major_formatter(plt.NullFormatter())
    ax1.contour(0.5 * (trace_x_bins[:-1] + trace_x_bins[1:]),
                0.5 * (trace_y_bins[:-1] + trace_y_bins[1:]),
                convert_to_stdev(np.log(L_MCMC.T)),
                levels=(0.683, 0.955, 0.997),
                colors='k')

    # second axis: marginalized over x
    ax2 = fig.add_axes((0.1, 0.4, 0.29, 0.55))
    ax2.xaxis.set_major_formatter(plt.NullFormatter())
    ax2.plot(hist_y, 0.5 * (bins_y[1:] + bins_y[:-1]
                            - bins_y[1] + bins_y[0]), '-k', drawstyle='steps')
    ax2.set_ylabel(r'$Parameter\ y$')
    ax2.set_ylim(ylim[0], ylim[1])

    # third axis: marginalized over y
    ax3 = fig.add_axes((0.4, 0.1, 0.55, 0.29))
    ax3.yaxis.set_major_formatter(plt.NullFormatter())
    ax3.plot(0.5 * (bins_x[1:] + bins_x[:-1]), hist_x, '-k', drawstyle='steps-mid')
    ax3.set_xlabel(r'$Parameter\ x$')
    plt.xlim(xlim[0], xlim[1])
In [4]:
fig = plt.figure(figsize=(5, 5))
plotDistributionsMarginals (trace['mu'], trace['gamma'])
Let's start with regression (inferring the expectation value of $y$ given $x$, i.e., the conditional expectation value). We have all seen the classical maximum likelihood approach for this, but we can easily use the Bayesian methodology seen earlier this week to write the posterior pdf for the model parameters
$$p({\theta}\,|\,x,y,I) \propto p(x,y \,|\, {\theta}, I) \, p({\theta}\,|\, I)$$
Here the information $I$ describes the error behavior for the dependent variable. If the $y$ error distribution is Gaussian, with the width for the $i$th data point given by $\sigma_i$, and the errors on $x$ are negligible, then
$$ p(y_i\,|\,x_i,{\theta}, I) = {1 \over \sigma_i \sqrt{2\pi}} \, \exp{\left({-[y_i-f(x_i|{\theta})]^2 \over 2 \sigma_i^2}\right)} $$
So $y$ is a random variable with a normal distribution whose mean is predicted by the model. We estimate the posterior for the model parameters as well as for the variance in the data (also a probability distribution).
In [3]:
size = 50
intercept = 1.4
slope = 2.7
x = np.linspace(0, 1, size)
# y = a + b*x
model_y = intercept + slope * x
# add noise
y_err = np.random.normal(scale=.5, size=size)
y = model_y + y_err
data = dict(x=x, y=y)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
ax.plot(x, y, 'x', label='sampled data')
ax.plot(x, model_y, label='true regression line', lw=2.)
plt.legend(loc=0)
Out[3]:
In [4]:
nsample = 10000
sigma2 = 1.0
with pm.Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    sigma = pm.HalfNormal('sigma', sd=10)
    theta0 = pm.Normal('theta0', 0, sd=20)
    theta1 = pm.Normal('theta1', 0, sd=20)
    # Define likelihood
    likelihood = pm.Normal('y', mu=theta0 + theta1 * x,
                           sd=sigma, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
pm.traceplot(trace)
Out[4]:
In [5]:
pm.df_summary(trace)
Out[5]:
In [14]:
### Plot the predictive posterior samples
plt.scatter(x, y, s=30, label='data')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
This is very similar to the previous example, except that we express the model using the patsy Python package (which can concisely describe statistical models and create the associated design matrix).
patsy is a Python package for describing statistical models (especially linear models, or models that have a linear component) and building design matrices (the syntax is borrowed from R). For example, a formula such as y ~ x + I(x**2) would generate a model for a polynomial regression that includes x$^2$, x, and an intercept. Note that the intercept does not need to be specified explicitly within the formula; patsy includes it by default.
For our linear model the patsy formula is y ~ x. glm() parses the model, adds the random variables (slope and intercept), creates a likelihood (a Normal distribution by default), and adds other variables (e.g. sigma). glm() also initializes the parameters (i.e. what find_MAP() had done previously).
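As a quick illustration of what patsy produces (the quadratic formula here is only an example, not part of this analysis), we can inspect the design matrix it builds from the data dict defined above:

from patsy import dmatrices

# build design matrices for an (illustrative) quadratic model;
# the columns of X_dm are: Intercept, x, I(x ** 2)
y_dm, X_dm = dmatrices('y ~ x + I(x**2)', data)
print(X_dm[:3])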
In [15]:
#Using GLM to fit a linear model
nsamples = 10000
with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x', data)
    step = pm.NUTS()
    trace = pm.sample(nsamples, njobs=2)
plt.scatter(x, y, s=30, label='data')
plt.plot(x, slope*x + intercept, label='true regression line', lw=3., c='red')
plt.legend(loc='best')
pm.plot_posterior_predictive_glm(trace, samples=100,
                                 label='posterior predictive regression lines')
pm.traceplot(trace)
Out[15]:
We can now extend the examples to include uncertainties within the data. For this we will use the Hogg, Bovy, and Lang data set and base our answers on the notebooks in the pymc3 repository
In [6]:
#read in the Hogg data (initially exclude the 4 outlier data points)
data = np.load('hoggData.npy')
x = data['x'][4:]
y = data['y'][4:]
sigma_x = data['sigma_x'][4:]
sigma_y = data['sigma_y'][4:]
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
plt.xlabel('x',fontsize=16)
plt.ylabel('y',fontsize=16)
Out[6]:
In [17]:
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    sigma = pm.HalfNormal('sigma', sd=10)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Define Normal likelihood
    likelihood = pm.Normal('likelihood', mu=yest, sd=sigma, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
Out[17]:
In [18]:
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
In [19]:
#repeat using y errors
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Use y error from dataset, convert into theano variable
    thea_sigma_y = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX), name='sigma_y')
    ## Define Normal likelihood
    likelihood = pm.Normal('likelihood', mu=yest, sd=thea_sigma_y, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
Out[19]:
In [20]:
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
The L2 norm (assuming the likelihood has a normal distribution) is sensitive to outliers (i.e. it squares the residuals). A number of approaches exist for correcting for outliers. These include "sigma-clipping", using interquartile ranges, and taking the median of solutions of subsets of the data.
Other approaches include changing the likelihood function to one that allows for longer tails, or assuming that the data are drawn from a mixture of two Gaussian error distributions (one for the function and the other for the outliers):
$$\log \mathcal{L} = \sum_{i=1}^{N} \log \left[ \frac{1 - B_{i}}{\sqrt{2 \pi \sigma_{in}^{2}}} \exp \left( - \frac{(x_{i} - \mu_{in})^{2}}{2\sigma_{in}^{2}} \right) \right] + \sum_{i=1}^{N} \log \left[ \frac{B_{i}}{\sqrt{2 \pi (\sigma_{in}^{2} + \sigma_{out}^{2})}} \exp \left( - \frac{(x_{i}- \mu_{out})^{2}}{2(\sigma_{in}^{2} + \sigma_{out}^{2})} \right) \right]$$
where:
$\bf{B}$ is Bernoulli-distributed, with $B_{i} \in \{0\ \mathrm{(inlier)},\ 1\ \mathrm{(outlier)}\}$
If we use MCMC we can marginalize over the nuisance parameters $B$, $\sigma_{out}$, and $\mu_{out}$. We could also calculate the probability that a point is drawn from the outlier or "model" Gaussian.
In [21]:
x = data['x']
y = data['y']
sigma_x = data['sigma_x']
sigma_y = data['sigma_y']
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
plt.xlabel('x',fontsize=16)
plt.ylabel('y',fontsize=16)
Out[21]:
In [22]:
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Use y error from dataset, convert into theano variable
    thea_sigma_y = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX), name='sigma_y')
    ## Define Normal likelihood
    likelihood = pm.Normal('likelihood', mu=yest, sd=thea_sigma_y, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
plt.figure()
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
When we use a normal likelihood function, excursions are weighted heavily (by the square of the deviation). Assuming that the distribution of residuals is not well described by a Gaussian, come up with a way to reduce the impact of outliers on the model fit.
In [23]:
# robust regression
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Use y error from dataset, convert into theano variable
    thea_sigma_y = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX), name='sigma_y')
    ## Define the likelihood
    ## define prior for Student T degrees of freedom
    nu = pm.Uniform('nu', lower=1, upper=100)
    ## Define Student T likelihood
    likelihood = pm.StudentT('likelihood', mu=yest, sd=thea_sigma_y, nu=nu, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
plt.figure()
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
In [24]:
import theano.tensor as T

def logp_signoise(yobs, is_outlier, yest_in, sigma_y_in, yest_out, sigma_y_out):
    '''
    Define custom loglikelihood for inliers vs outliers.
    NOTE: in this particular case we don't need to use theano's @as_op
    decorator because (as stated by Twiecki in conversation) that's only
    required if the likelihood cannot be expressed as a theano expression.
    We also now get the gradient computation for free.
    '''
    # likelihood for inliers
    pdfs_in = T.exp(-(yobs - yest_in + 1e-4)**2 / (2 * sigma_y_in**2))
    pdfs_in /= T.sqrt(2 * np.pi * sigma_y_in**2)
    logL_in = T.sum(T.log(pdfs_in) * (1 - is_outlier))

    # likelihood for outliers
    pdfs_out = T.exp(-(yobs - yest_out + 1e-4)**2 / (2 * (sigma_y_in**2 + sigma_y_out**2)))
    pdfs_out /= T.sqrt(2 * np.pi * (sigma_y_in**2 + sigma_y_out**2))
    logL_out = T.sum(T.log(pdfs_out) * is_outlier)

    return logL_in + logL_out
In [25]:
with pm.Model() as mdl_signoise:
    ## Define weakly informative Normal priors to give Ridge regression
    b0 = pm.Normal('b0_intercept', mu=0, sd=10, testval=pm.floatX(0.1))
    b1 = pm.Normal('b1_slope', mu=0, sd=10, testval=pm.floatX(1.))

    ## Define linear model
    yest_in = b0 + b1 * x

    ## Define weakly informative priors for the mean and variance of outliers
    yest_out = pm.Normal('yest_out', mu=0, sd=100, testval=pm.floatX(1.))
    sigma_y_out = pm.HalfNormal('sigma_y_out', sd=100, testval=pm.floatX(1.))

    ## Define Bernoulli inlier / outlier flags according to a hyperprior
    ## fraction of outliers, itself constrained to [0,.5] for symmetry
    frac_outliers = pm.Uniform('frac_outliers', lower=0., upper=.5)
    is_outlier = pm.Bernoulli('is_outlier', p=frac_outliers, shape=x.shape[0],
                              testval=np.random.rand(x.shape[0]) < 0.2)

    ## Extract observed y and sigma_y from dataset, encode as theano objects
    yobs = thea.shared(np.asarray(y, dtype=thea.config.floatX), name='yobs')
    sigma_y_in = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX),
                             name='sigma_y_in')

    ## Use custom likelihood using DensityDist
    likelihood = pm.DensityDist('likelihood', logp_signoise,
                                observed={'yobs': yobs, 'is_outlier': is_outlier,
                                          'yest_in': yest_in, 'sigma_y_in': sigma_y_in,
                                          'yest_out': yest_out, 'sigma_y_out': sigma_y_out})
In [26]:
with mdl_signoise:
    ## two-step sampling to create Bernoulli inlier/outlier flags
    step1 = pm.Metropolis([frac_outliers, yest_out, sigma_y_out, b0, b1])
    step2 = pm.step_methods.BinaryGibbsMetropolis([is_outlier])
    ## take samples
    traces_signoise = pm.sample(20000, step=[step1, step2], tune=10000, progressbar=True)
In [27]:
lines = {var: traces_signoise[var].mean() for var in traces_signoise.varnames}
pm.traceplot(traces_signoise, lines=lines)
plt.figure()
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(traces_signoise['b1_slope'][-500:], traces_signoise['b0_intercept'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
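As noted above, we can also estimate the probability that each point was drawn from the outlier distribution by averaging the sampled Bernoulli flags over the chain. A minimal sketch, assuming the traces_signoise trace from the cells above:

# posterior probability that each data point is an outlier:
# the mean of the sampled Bernoulli flags over the (post burn-in) samples
outlier_prob = traces_signoise['is_outlier'][-1000:].mean(axis=0)
for i, p in enumerate(outlier_prob):
    print('point %2d: P(outlier) = %.2f' % (i, p))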
Other forms for the likelihood can be assumed, e.g., using the $L_1$ norm if it is known that the measurement errors follow an exponential distribution instead of a Gaussian distribution:
$$ p(y_i|x_i,{\theta}, I) \propto \exp{\left(\frac{-|y_i- (\theta_0 + \theta_1x_{i})|}{ \Delta_i}\right)}. $$
Regularization can be expressed in terms of a Bayesian framework as well (see XXX) by using the prior to impose constraints on the probability distribution of the regression coefficients. If we assume that the prior on the regression coefficients is Gaussian, with the width of this Gaussian governed by the regularization parameter $\lambda$, then we can write it as
$$p(\theta | I ) \propto \exp{\left(\frac{-\lambda\, \theta^T \theta}{2}\right)}$$
Multiplying the likelihood function by this prior results in a posterior distribution whose exponent contains
$$(Y - M \theta)^T(Y - M \theta) + \lambda\, \theta^T \theta$$
which is equivalent to ridge (Tikhonov) regularization.
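A minimal sketch of how the $L_1$ (exponential error) likelihood could be expressed in PyMC3, using the built-in Laplace distribution; this is illustrative only, and it assumes the Hogg x, y, and sigma_y arrays from above, with sigma_y standing in for the scale $\Delta_i$:

with pm.Model() as model_l1:
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    yest = theta0 + theta1 * x
    # Laplace likelihood: logp is proportional to -|y - yest| / b, i.e. the L1 norm
    likelihood = pm.Laplace('likelihood', mu=yest, b=sigma_y, observed=y)
    trace_l1 = pm.sample(2000, progressbar=True)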