In [1]:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import cauchy
import scipy.stats as stats
from JSAnimation import IPython_display
import pymc3 as pm
import theano as thea
import theano.tensor as T
%matplotlib inline
There are two packages commonly used for MCMC calculations in astronomy:
emcee
The emcee package is a pure-Python package written by astronomer Dan Foreman-Mackey. It is a lightweight package which implements a fairly sophisticated affine-invariant ensemble MCMC sampler. Because the package is pure Python (i.e. it contains no compiled extensions) it is extremely easy to install; with pip, simply type "pip install emcee" at the command line.
emcee requires very little package-specific boilerplate: you pass it a Python function which returns a value proportional to the log-posterior probability, and it returns samples from that posterior. All you need to do is define your log-posterior (in Python) and emcee will sample from that distribution. Because it is pure Python it does not have specially-defined objects for the common distributions (e.g. uniform, normal, etc.), and it performs particularly well when the likelihood is expensive to calculate (the Python overhead is then negligible).
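To illustrate the interface, here is a minimal sketch of sampling a one-dimensional Gaussian posterior with emcee (the log_posterior function and the walker/step counts below are illustrative only, not part of this notebook):

import numpy as np
import emcee

def log_posterior(theta):
    '''Log of an (unnormalized) standard normal posterior in one parameter.'''
    return -0.5 * theta[0] ** 2

ndim, nwalkers, nsteps = 1, 20, 1000
p0 = np.random.randn(nwalkers, ndim)              # random starting positions for the walkers
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, nsteps)                      # run the affine-invariant ensemble sampler
samples = sampler.flatchain                       # flattened array of posterior samples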
PyMC3 (don't use PyMC2!)
The PyMC package has many more features than emcee, including built-in support for efficient sampling of common prior distributions. PyMC by default uses the classic Metropolis-Hastings sampler, one of the earliest MCMC algorithms. PyMC3 is a rewrite of PyMC which improves both the API and the installation process. The hard work in PyMC3 is done through Theano, a Python package used for machine learning.
More details about PyMC are available from the PyMC User Guide.
See "Probabilistic Programming and Bayesian Methods for Hackers" by Cam Davidson-Pilon for a great introduction.
We will start with a simple use case where we take a set of data drawn from a Normal distribution and then fit a Gaussian to these data. In doing so we will outline the details and structure of PyMC3 to prepare us for the exercises later on.
Probability distributions can take many different forms, which pymc3 supports through theano (you can also create your own distributions).
As we consider which priors to use, see Jake VanderPlas's discussion of how to define non-informative priors.
We need to define the model we will be fitting and the distributions that we will use for the priors. For our simple Gaussian example the likelihood has two parameters, $\mu$ and $\sigma$, and we can assume that these parameters have uniform priors
Likelihood: $p(D\,|\,\theta) = N(\mu,\sigma)$
Priors: $p(\mu,\sigma)$ as uniform distributions
In [ ]:
#generate a set of data
N = 200
np.random.seed(44)
mu_0 = 10.
sigma_0 = 2.
y = np.random.normal(loc=mu_0, scale=sigma_0, size=N)
#set up the model
#length of our chain
nsamples = 2000
In [4]:
with pm.Model() as model:  # variables created within a given model's context are assigned to the model
    # set up the priors for the parameters
    mu = pm.Uniform('mu', lower=-20, upper=20)      # a simple uniform prior
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    # set up the likelihood
    y_obs = pm.Normal('Y_obs', mu, sigma, observed=y)  # we use the canned distributions in PYMC3
    # choose our sampling technique
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace = pm.sample(nsamples, step, start, random_seed=123, njobs=2, progressbar=True)
In [5]:
fig = plt.figure(figsize=(10, 10))
hist_mu, bins_mu = np.histogram(trace['mu'], normed=True)
ax = fig.add_axes((0.4, 0.1, 0.55, 0.29))
ax.plot(0.5 * (bins_mu[1:] + bins_mu[:-1]), hist_mu,
'-k', drawstyle='steps-mid')
ax.set_xlabel(r'$\mu$')
plt.xlim(9.4, 10.4)
plt.show()
In [6]:
lines = {var:trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines= lines)
pm.autocorrplot(trace)
pm.df_summary(trace)
Out[6]:
traceplot shows, for each parameter, the marginal posterior distribution (left) and the sampled values as a function of step in the chain (right). The right-hand side shows broad oscillation: both inlying and extreme values occur frequently but at irregular intervals, indicating the Markov chain was 'well-mixed' (sampling evenly around the optimal position).
autocorrplot shows the autocorrelation of each chain as a function of lag; a rapid fall-off to low autocorrelation indicates efficient, nearly independent sampling, while a slowly decaying autocorrelation means the chain is mixing poorly.
"Monitoring convergence of iterative simulation is straightforward—discard the first part of the simulations and then compare the variances of quantities of interest within and between chains—and inference given approximate convergence is even simpler: just mix the simulations together and use them as a joint distribution. Both these ideas can and have been refined, but the basic concepts are straightforward and robust.
The hard part is knowing what to do when the simulations are slow to converge."
Convergence statistics
There are a number of proposed statistics for testing convergence
Geweke (1992): Compare the mean of the first fraction and the last fraction of a series. The chain is divided into segments and this statistic is computed for each. For convergence the score should vary between -1 and 1.
$$ z = \frac{E[t_s] - E[t_e]}{\sqrt{V[t_s] + V[t_e]}} $$
with $E[t_s]$ the expectation value and $V[t_s]$ the variance of the chain segment.
Gelman and Rubin (1992): Compare the variance between multiple chains to the variance within each chain. For convergence the between-chain and within-chain variances should be the same; chains that haven't converged give values greater than one. Gelman and Rubin show that if $\hat{R}$ is greater than 1.1 or 1.2 we need a longer burn-in.
$$\hat{R} = \frac{\hat{V}}{\hat{W}}$$
with $\hat{W}$ the within-chain variance and $\hat{V}$ the posterior variance estimate for the pooled traces.
Both of these statistics are provided by pymc3
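To make the formulas concrete, here is a minimal numpy sketch of both statistics (a simplified version of what the pymc3 diagnostics compute; the function names and the commented usage lines are illustrative only):

def geweke_z(chain, first=0.1, last=0.5):
    '''Geweke z-score comparing the means of the first and last fractions of a chain.'''
    a = chain[: int(first * len(chain))]
    b = chain[int((1 - last) * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var() + b.var())

def gelman_rubin_r(chains):
    '''Gelman-Rubin statistic for an array of chains with shape (n_chains, n_samples).'''
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    V_hat = (n - 1) / n * W + B / n             # pooled posterior variance estimate
    return V_hat / W                            # some implementations report the square root of this ratio

# e.g. using the two chains of 'mu' sampled above (assumes njobs=2):
# print(geweke_z(trace.get_values('mu', chains=0)))
# print(gelman_rubin_r(np.array([trace.get_values('mu', chains=c) for c in (0, 1)])))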
In [7]:
# Geweke diagnostic
g = pm.diagnostics.geweke(trace, intervals=20)
plt.scatter(*g[0]['mu'].T)
plt.hlines([-1,1], 0, 1000, linestyles='dotted')
plt.xlim(0, 1000)
#Gelman-Rubin diagnostic
pm.diagnostics.gelman_rubin(trace)
Out[7]:
Repeat the Gaussian example but now fit a Cauchy distribution to the data below. Start by defining your own priors, set up your likelihood function, choose your sampling technique, and build your PyMC3 model. The two cells below this one provide some plotting routines for showing the marginals of the trace (feel free to use your own, e.g. a seaborn PairGrid).
The Cauchy distribution has two parameters, $\mu$ and $\gamma$, and is often used because of its long tails:
$$\frac {1}{\pi \gamma \,\left[1+\left({\frac {x-\mu}{\gamma }}\right)^{2}\right]}$$
Describe what happens as you reduce the number of samples.
In [2]:
#using PYMC3 to fit Cauchy distribution
# generate observed data
N = 50
np.random.seed(44)
mu_0 = 0
gamma_0 = 2
y = cauchy(mu_0, gamma_0).rvs(N)
nsamples = 2000
with pm.Model() as model:
    # define the priors
    mu = pm.Uniform('mu', lower=-5, upper=5)
    gamma = pm.Uniform('gamma', lower=0, upper=5)
    # define likelihood - need to write out p(D|model) as this is a Cauchy
    y_obs = pm.Cauchy('y_obs', mu, gamma, observed=y)
    # inference
    start = pm.find_MAP()  # define the starting point using gradient descent (or another optimizer)
    step = pm.NUTS(scaling=start)  # note: here we are using the NUTS sampler (more later)
    trace = pm.sample(nsamples, step, start, random_seed=123, njobs=3, progressbar=True)
In [3]:
def convert_to_stdev(logL):
    '''
    Given a grid of log-likelihood values, convert them to cumulative
    standard deviation. This is useful for drawing contours from a
    grid of likelihoods.
    '''
    sigma = np.exp(logL)
    shape = sigma.shape
    sigma = sigma.ravel()
    # obtain the indices to sort and unsort the flattened array
    i_sort = np.argsort(sigma)[::-1]
    i_unsort = np.argsort(i_sort)
    sigma_cumsum = sigma[i_sort].cumsum()
    sigma_cumsum /= sigma_cumsum[-1]
    return sigma_cumsum[i_unsort].reshape(shape)


def plotDistributionsMarginals(trace_x, trace_y, xlim=(-5, 5), ylim=(0, 5), nbins=41):
    '''Given a pair of traces, plot their joint and marginal distributions.

    Note: this routine draws into the current global figure `fig`
    (create a figure before calling it, as in the next cell).
    '''
    bins = (np.linspace(xlim[0], xlim[1], nbins), np.linspace(ylim[0], ylim[1], nbins))
    # compute histogram of results to plot below
    L_MCMC, trace_x_bins, trace_y_bins = np.histogram2d(trace_x, trace_y, bins=bins)
    L_MCMC[L_MCMC == 0] = 1E-16  # prevents zero-division errors
    hist_x, bins_x = np.histogram(trace_x, bins=trace_x_bins, normed=True)
    hist_y, bins_y = np.histogram(trace_y, bins=trace_y_bins, normed=True)

    # first axis: likelihood contours
    ax1 = fig.add_axes((0.4, 0.4, 0.55, 0.55))
    ax1.xaxis.set_major_formatter(plt.NullFormatter())
    ax1.yaxis.set_major_formatter(plt.NullFormatter())
    ax1.contour(0.5 * (trace_x_bins[:-1] + trace_x_bins[1:]),
                0.5 * (trace_y_bins[:-1] + trace_y_bins[1:]),
                convert_to_stdev(np.log(L_MCMC.T)),
                levels=(0.683, 0.955, 0.997),
                colors='k')

    # second axis: marginalized over x
    ax2 = fig.add_axes((0.1, 0.4, 0.29, 0.55))
    ax2.xaxis.set_major_formatter(plt.NullFormatter())
    ax2.plot(hist_y, 0.5 * (bins_y[1:] + bins_y[:-1]
                            - bins_y[1] + bins_y[0]), '-k', drawstyle='steps')
    ax2.set_ylabel(r'$Parameter\ y$')
    ax2.set_ylim(ylim[0], ylim[1])

    # third axis: marginalized over y
    ax3 = fig.add_axes((0.4, 0.1, 0.55, 0.29))
    ax3.yaxis.set_major_formatter(plt.NullFormatter())
    ax3.plot(0.5 * (bins_x[1:] + bins_x[:-1]), hist_x, '-k', drawstyle='steps-mid')
    ax3.set_xlabel(r'$Parameter\ x$')
    plt.xlim(xlim[0], xlim[1])
In [4]:
fig = plt.figure(figsize=(5, 5))
plotDistributionsMarginals (trace['mu'], trace['gamma'])
Let's start with regression (inferring the expectation value of $y$ given $x$, i.e., the conditional expectation value). We have all seen the classical maximum likelihood approach for this, but we can easily use the Bayesian methodology seen earlier this week to write the posterior pdf for the model parameters
$$p({\theta}\,|\,x,y,I) \propto p(x,y \,|\, {\theta}, I) \, p({\theta}\,|\, I)$$
Here the information $I$ describes the error behavior for the dependent variable. If the $y$ error distribution is Gaussian, with the width for the $i$th data point given by $\sigma_i$, and the errors on $x$ are negligible, then
$$ p(y_i\,|\,x_i,{\theta}, I) = {1 \over \sigma_i \sqrt{2\pi}} \, \exp{\left({-[y_i-f(x_i|{\theta})]^2 \over 2 \sigma_i^2}\right)} $$
So $y$ is a random variable with a normal distribution whose mean is predicted by the model. We estimate the posterior for the model parameters as well as for the variance in the data (also a probability distribution).
In [3]:
size = 50
intercept = 1.4
slope = 2.7
x = np.linspace(0, 1, size)
# y = a + b*x
model_y = intercept + slope * x
# add noise
y_err = np.random.normal(scale=.5, size=size)
y = model_y + y_err
data = dict(x=x, y=y)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
ax.plot(x, y, 'x', label='sampled data')
ax.plot(x, model_y, label='true regression line', lw=2.)
plt.legend(loc=0)
Out[3]:
In [4]:
nsample = 10000
sigma2 = 1.0
with pm.Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    sigma = pm.HalfNormal('sigma', sd=10)
    theta0 = pm.Normal('theta0', 0, sd=20)
    theta1 = pm.Normal('theta1', 0, sd=20)
    # Define likelihood
    likelihood = pm.Normal('y', mu=theta0 + theta1 * x,
                           sd=sigma, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
pm.traceplot(trace)
Out[4]:
In [5]:
pm.df_summary(trace)
Out[5]:
In [14]:
### Plot the predictive posterior samples
plt.scatter(x, y, s=30, label='data')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
This is very similar to the previous example, except that we express the model using the patsy Python package (which can concisely describe statistical models and create the associated design matrix).
patsy is a Python package for describing statistical models (especially linear models, or models that have a linear component) and building design matrices (the syntax is borrowed from R). For example, a formula such as y ~ x + I(x**2) would generate a model for a polynomial regression that includes x$^2$, x, and an intercept. Note that the intercept does not need to be specified explicitly within the formula; patsy includes it by default.
For our linear model the patsy formula is y ~ x. glm() parses the model, adds the random variables (slope and intercept), creates a likelihood (a Normal distribution by default), and adds other variables (e.g. sigma). glm() also initializes the parameters (i.e. what find_MAP() had done previously).
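As a quick illustration of what patsy produces (the quadratic formula here is only an example, not part of this analysis), we can inspect the design matrix it builds from the data dict defined above:

from patsy import dmatrices

# build design matrices for an (illustrative) quadratic model;
# the columns of X_dm are: Intercept, x, I(x ** 2)
y_dm, X_dm = dmatrices('y ~ x + I(x**2)', data)
print(X_dm[:3])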
In [15]:
#Using GLM to fit a linear model
nsamples = 10000
with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x', data)
    step = pm.NUTS()
    trace = pm.sample(nsamples, njobs=2)
plt.scatter(x, y, s=30, label='data')
plt.plot(x, slope*x + intercept, label='true regression line', lw=3., c='red')
plt.legend(loc='best')
pm.plot_posterior_predictive_glm(trace, samples=100,
                                 label='posterior predictive regression lines')
pm.traceplot(trace)
Out[15]:
We can now extend the examples to include uncertainties within the data. For this we will use the Hogg, Bovy, and Lang data set and base our answers on the notebooks in the pymc3 repository
In [6]:
#read in the Hogg data (initially exclude the 4 outlier data points)
data = np.load('hoggData.npy')
x = data['x'][4:]
y = data['y'][4:]
sigma_x = data['sigma_x'][4:]
sigma_y = data['sigma_y'][4:]
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
plt.xlabel('x',fontsize=16)
plt.ylabel('y',fontsize=16)
Out[6]:
In [17]:
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    sigma = pm.HalfNormal('sigma', sd=10)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Define Normal likelihood
    likelihood = pm.Normal('likelihood', mu=yest, sd=sigma, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
Out[17]:
In [18]:
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
In [19]:
#repeat using y errors
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Use y error from dataset, convert into theano variable
    thea_sigma_y = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX), name='sigma_y')
    ## Define Normal likelihood
    likelihood = pm.Normal('likelihood', mu=yest, sd=thea_sigma_y, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
Out[19]:
In [20]:
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
The L2 norm (assuming the likelihood has a normal distribution) is sensitive to outliers (i.e. it squares the residuals). A number of approaches exist for correcting for outliers. These include "sigma-clipping", using interquartile ranges, and taking the median of solutions of subsets of the data.
Other approaches include changing the likelihood function to one that allows for longer tails, or assuming that the data are drawn from a mixture of two Gaussian error distributions (one for the function and the other for the outliers):
$$\log \mathcal{L} = \sum_{i=1}^{N} \log \left[ \frac{1 - B_{i}}{\sqrt{2 \pi \sigma_{in}^{2}}} \exp \left( - \frac{(x_{i} - \mu_{in})^{2}}{2\sigma_{in}^{2}} \right) \right] + \sum_{i=1}^{N} \log \left[ \frac{B_{i}}{\sqrt{2 \pi (\sigma_{in}^{2} + \sigma_{out}^{2})}} \exp \left( - \frac{(x_{i}- \mu_{out})^{2}}{2(\sigma_{in}^{2} + \sigma_{out}^{2})} \right) \right]$$
where:
$\bf{B}$ is Bernoulli-distributed, with $B_{i} \in \{0\ \mathrm{(inlier)},\ 1\ \mathrm{(outlier)}\}$
If we use MCMC we can marginalize over the nuisance parameters $B$, $\sigma_{out}$, and $\mu_{out}$. We could also calculate the probability that a point is drawn from the outlier or "model" Gaussian.
In [21]:
x = data['x']
y = data['y']
sigma_x = data['sigma_x']
sigma_y = data['sigma_y']
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
plt.xlabel('x',fontsize=16)
plt.ylabel('y',fontsize=16)
Out[21]:
In [22]:
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Use y error from dataset, convert into theano variable
    thea_sigma_y = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX), name='sigma_y')
    ## Define Normal likelihood
    likelihood = pm.Normal('likelihood', mu=yest, sd=thea_sigma_y, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
plt.figure()
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
When we use a normal likelihood function, excursions are weighted heavily (by the square of the deviation). Assuming that the distribution of residuals is not well described by a Gaussian, come up with a way to reduce the impact of outliers on the model fit.
In [23]:
# robust regression
nsample = 2000
with pm.Model() as model:
    ## Define weakly informative Normal priors to give Ridge regression
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    ## Define linear model
    yest = theta0 + theta1 * x
    ## Use y error from dataset, convert into theano variable
    thea_sigma_y = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX), name='sigma_y')
    ## Define the likelihood
    ## define prior for Student T degrees of freedom
    nu = pm.Uniform('nu', lower=1, upper=100)
    ## Define Student T likelihood
    likelihood = pm.StudentT('likelihood', mu=yest, sd=thea_sigma_y, nu=nu, observed=y)
    # Inference!
    start = pm.find_MAP()  # Find starting value by optimization
    step = pm.NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(nsample, step, start=start, progressbar=True, njobs=4)  # run with 4 cores
lines = {var: trace[var].mean() for var in trace.varnames}
pm.traceplot(trace, lines=lines)
plt.figure()
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(trace['theta1'][-500:], trace['theta0'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
In [24]:
import theano.tensor as T

def logp_signoise(yobs, is_outlier, yest_in, sigma_y_in, yest_out, sigma_y_out):
    '''
    Define custom loglikelihood for inliers vs outliers.
    NOTE: in this particular case we don't need to use theano's @as_op
    decorator because (as stated by Twiecki in conversation) that's only
    required if the likelihood cannot be expressed as a theano expression.
    We also now get the gradient computation for free.
    '''
    # likelihood for inliers
    pdfs_in = T.exp(-(yobs - yest_in + 1e-4)**2 / (2 * sigma_y_in**2))
    pdfs_in /= T.sqrt(2 * np.pi * sigma_y_in**2)
    logL_in = T.sum(T.log(pdfs_in) * (1 - is_outlier))

    # likelihood for outliers
    pdfs_out = T.exp(-(yobs - yest_out + 1e-4)**2 / (2 * (sigma_y_in**2 + sigma_y_out**2)))
    pdfs_out /= T.sqrt(2 * np.pi * (sigma_y_in**2 + sigma_y_out**2))
    logL_out = T.sum(T.log(pdfs_out) * is_outlier)

    return logL_in + logL_out
In [25]:
with pm.Model() as mdl_signoise:
    ## Define weakly informative Normal priors to give Ridge regression
    b0 = pm.Normal('b0_intercept', mu=0, sd=10, testval=pm.floatX(0.1))
    b1 = pm.Normal('b1_slope', mu=0, sd=10, testval=pm.floatX(1.))

    ## Define linear model
    yest_in = b0 + b1 * x

    ## Define weakly informative priors for the mean and variance of outliers
    yest_out = pm.Normal('yest_out', mu=0, sd=100, testval=pm.floatX(1.))
    sigma_y_out = pm.HalfNormal('sigma_y_out', sd=100, testval=pm.floatX(1.))

    ## Define Bernoulli inlier / outlier flags according to a hyperprior
    ## fraction of outliers, itself constrained to [0,.5] for symmetry
    frac_outliers = pm.Uniform('frac_outliers', lower=0., upper=.5)
    is_outlier = pm.Bernoulli('is_outlier', p=frac_outliers, shape=x.shape[0],
                              testval=np.random.rand(x.shape[0]) < 0.2)

    ## Extract observed y and sigma_y from dataset, encode as theano objects
    yobs = thea.shared(np.asarray(y, dtype=thea.config.floatX), name='yobs')
    sigma_y_in = thea.shared(np.asarray(sigma_y, dtype=thea.config.floatX),
                             name='sigma_y_in')

    ## Use custom likelihood using DensityDist
    likelihood = pm.DensityDist('likelihood', logp_signoise,
                                observed={'yobs': yobs, 'is_outlier': is_outlier,
                                          'yest_in': yest_in, 'sigma_y_in': sigma_y_in,
                                          'yest_out': yest_out, 'sigma_y_out': sigma_y_out})
In [26]:
with mdl_signoise:
    ## two-step sampling to create Bernoulli inlier/outlier flags
    step1 = pm.Metropolis([frac_outliers, yest_out, sigma_y_out, b0, b1])
    step2 = pm.step_methods.BinaryGibbsMetropolis([is_outlier])
    ## take samples
    traces_signoise = pm.sample(20000, step=[step1, step2], tune=10000, progressbar=True)
In [27]:
lines = {var: traces_signoise[var].mean() for var in traces_signoise.varnames}
pm.traceplot(traces_signoise, lines=lines)
plt.figure()
plt.scatter(x, y, c='k', s=9)
plt.errorbar(x, y, yerr=sigma_y, xerr=sigma_x, fmt='o')
for a_, b_ in zip(traces_signoise['b1_slope'][-500:], traces_signoise['b0_intercept'][-500:]):
    plt.plot(x, a_ * x + b_, c='gray', alpha=0.1)
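As noted above, we can also estimate the probability that each point was drawn from the outlier distribution by averaging the sampled Bernoulli flags over the chain. A minimal sketch, assuming the traces_signoise trace from the cells above:

# posterior probability that each data point is an outlier:
# the mean of the sampled Bernoulli flags over the (post burn-in) samples
outlier_prob = traces_signoise['is_outlier'][-1000:].mean(axis=0)
for i, p in enumerate(outlier_prob):
    print('point %2d: P(outlier) = %.2f' % (i, p))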
Other forms for the likelihood can be assumed, e.g., using the $L_1$ norm if it is known that the measurement errors follow an exponential distribution instead of a Gaussian distribution:
$$ p(y_i|x_i,{\theta}, I) \propto \exp{\left(\frac{-|y_i- (\theta_0 + \theta_1x_{i})|}{ \Delta_i}\right)}. $$
Regularization can be expressed in terms of a Bayesian framework as well (see XXX) by using the prior to impose constraints on the probability distribution of the regression coefficients. If we assume that the prior on the regression coefficients is Gaussian, with the width of this Gaussian governed by the regularization parameter $\lambda$, then we can write it as
$$p(\theta | I ) \propto \exp{\left(\frac{-\lambda\, \theta^T \theta}{2}\right)}$$
Multiplying the likelihood function by this prior results in a posterior distribution whose exponent contains
$$(Y - M \theta)^T(Y - M \theta) + \lambda\, \theta^T \theta$$
which is equivalent to ridge (Tikhonov) regularization.
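A minimal sketch of how the $L_1$ (exponential error) likelihood could be expressed in PyMC3, using the built-in Laplace distribution; this is illustrative only, and it assumes the Hogg x, y, and sigma_y arrays from above, with sigma_y standing in for the scale $\Delta_i$:

with pm.Model() as model_l1:
    theta0 = pm.Normal('theta0', mu=0, sd=100)
    theta1 = pm.Normal('theta1', mu=0, sd=100)
    yest = theta0 + theta1 * x
    # Laplace likelihood: logp is proportional to -|y - yest| / b, i.e. the L1 norm
    likelihood = pm.Laplace('likelihood', mu=yest, b=sigma_y, observed=y)
    trace_l1 = pm.sample(2000, progressbar=True)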