Introduction


In [1]:
import numpy as np
import matplotlib
matplotlib.use("nbagg")
import matplotlib.pyplot as plt
fig, ax = plt.subplots()

To demonstrate the difference between the frequentist and Bayesian perspectives, we will use a simple regression problem.

We will create $N$ points evenly spaced on $[0,1]$ and use them as input to $\sin(2\pi \mathbf{x})$. We add noise to each observation with a sample from a zero-mean Gaussian with standard deviation $0.3$, $\epsilon \sim \mathcal{N}(0,0.3^2)$. This creates our dependent random variable $t$.

$$\mathbf{t} = \sin(2 \pi \mathbf{x}) + \boldsymbol{\epsilon}$$


Notice that $\mathbf{t}$ is stochastic even though $\mathbf{x}$ is not. This is because $\mathbf{t}$ depends on the random variable $\boldsymbol{\epsilon}$. Viewed as a stochastic process indexed by $x$, it is also nonstationary: the sinusoidal driving term makes its mean $\sin(2\pi x)$ vary with $x$. Because that mean function is smooth, we can compute its analytic derivative, $\frac{d\,\mathbb{E}[t]}{dx} = 2\pi\cos(2\pi x)$.


In [2]:
def gen_data(size=10):
    x = np.linspace(0,1,size)        # evenly spaced inputs on [0,1]
    eps = 0.3*np.random.randn(size)  # Gaussian noise with standard deviation 0.3
    t = np.sin(2*np.pi*x)+eps        # noisy observations of sin(2*pi*x)
    return (x,t)
x, t = gen_data()

In [3]:
def plot_dataset(axis,x,t):
    axis.plot(np.linspace(0,1,500), np.sin(2*np.pi*np.linspace(0,1,500)), c='g', zorder=1, label=r'$\sin(2\pi x)$')
    axis.scatter(x,t,edgecolor='k',c=(0,0,1,0.5), s=80, zorder=2, label=r'$\sin(2\pi x)+\epsilon$')
    axis.set_xlabel('$x$'); axis.set_ylabel(r'$t = \sin(2 \pi x)+\epsilon$')
plot_dataset(ax,x,t)
ax.legend()
plt.show()


Given only the $N$ observations $\mathcal{D} = \lbrace \mathbf{x}, \mathbf{t} \rbrace$, our goal is to learn $\sin(2 \pi \mathbf{x})$. In other words, we'd like to recover the true underlying data-generating process, or find the signal through the noise. We should appreciate for a moment just how difficult this problem is: given a finite number of observations, we wish to generalize to an infinite number of values. In general this is impossible to do exactly. As we shall soon see, the capacity $C$ of the model we choose, the number of observations $N$ we have, and the inference techniques we employ will all influence the accuracy of our approximation.


Let us begin with one of the simplest models, an $M^{\text{th}}$ order polynomial.

$$y(x,\mathbf{w}) = \sum_{i=0}^M w_i x^i$$

In [4]:
def model(x,w,order):
    X = np.vander(x,order,True)   # design matrix with columns 1, x, ..., x^(order-1)
    y = np.dot(X,w)               # polynomial evaluated at each x
    return y

where the vector $\mathbf{w}$ contains the coefficients, or parameters, of the model. Choosing a simple polynomial may at first seem naive, but remember that within some domain of convergence, a power series can approximate any function exactly. Though as we will see, using polynomials of higher and higher order does not immediately yield better results, as we might expect.

We have data and a model; how do we measure how accurately the model fits the data? There are many things to consider when designing our cost function, or performance measure. For now, however, we use a very common technique: examine how far each predicted value $y(x,\mathbf{w})$ deviates from its observed value $t$.


In [5]:
fig, ax = plt.subplots()
plot_dataset(ax,x,t)
for i,(xi,ti) in enumerate(zip(x,t)):
    label = 'error' if i == 0 else None  # label only the first segment
    ax.plot([xi,xi],[ti,np.sin(2*np.pi*xi)],c='r',label=label,zorder=0)

ax.legend(); ax.set(title='Visualization of SSE')
plt.show()


Taking each deviation, squaring, and summing the result gives us the aptly named sum of squared errors (SSE) cost function

$$ J(\mathbf{w}) = \frac{1}{2N}\sum_{j=1}^N (t_j-y(x_j,\mathbf{w}))^2$$

where normalizing over the $N$ observations allows us to compare results for different sized datasets.


In [6]:
def J(w,x,t,order):
    c = 1./(2*x.size)           # 1/(2N) normalization
    diff = t-model(x,w,order)   # residuals t - y(x,w)
    cost = c*(diff**2).sum()
    return cost

We now have all the pieces of a typical machine learning problem: the experience, the task, and the performance measure. Our experience, E, is the dataset $\mathcal{D} = \lbrace (x_i,t_i)\rbrace$. Our task, T, is to fit our model $y(x,\mathbf{w})$ to the dataset $\mathcal{D}$. Our performance, P, is the SSE given by $J(\mathbf{w})$. Now we need an algorithm, and according to Mitchell's definition, our algorithm will be a machine learning algorithm if its performance on the task improves with experience, i.e., if our model's fit to the data, as measured by the SSE, improves with more observations $(x_i,t_i)$.

For this problem, the cost function is quadratic in $\mathbf{w}$ and therefore has a unique minimum. This can be found analytically, using statistics, calculus, and linear algebra. All such derivations produce what is called the Normal Equation

$$\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{t}$$

where $X$ is the Vandermonde matrix for the $N$ points on $[0,1]$. Numerically this is a nightmare to solve: the Vandermonde matrix is notoriously ill-conditioned, and explicitly forming and inverting $X^TX$ squares the condition number, so the Normal Equation is not used in practice. In practice we solve the linear least squares system with factorization methods (QR, SVD) or iterative schemes.
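
Below is a minimal sketch (not part of the original notebook) contrasting the two approaches on the data generated above; the cubic order is an assumed example. `np.linalg.lstsq` solves the least squares problem with an SVD-based LAPACK routine, avoiding the explicit inverse.

In [ ]:
# Compare the explicit Normal Equation against a factorization-based solver.
# x and t come from the cells above; 4 columns (a cubic) is an assumed choice.
X = np.vander(x, 4, True)
w_normal = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, t))  # Normal Equation solve
w_lstsq = np.linalg.lstsq(X, t, rcond=None)[0]              # SVD-based least squares
print(np.allclose(w_normal, w_lstsq))     # True at low order
print(np.linalg.cond(np.dot(X.T, X)))     # forming X^T X squares the condition number of X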


Capacity and Overfitting

First let us examine the accuracy of our results as a function of the order of the polynomial.


In [7]:
domain = np.linspace(0,1,200)
test_x, test_t = gen_data()
test_err = []
train_err = []
ws = []
for order in range(2,11):
    fig, ax = plt.subplots()
    X = np.vander(x,order,True)
    w = np.linalg.lstsq(X,t,rcond=None)[0]
    ws.append(w)
    train_err.append(J(w,x,t,order))
    test_err.append(J(w,test_x,test_t,order))
    pts = np.vander(domain,order,True)
    y = np.dot(pts,w)
    plot_dataset(ax,x,t)
    ax.plot(domain,y,c='r',label='model')
    ax.legend()
    ax.set_title('Order %d Polynomial' % (order-1))
    plt.show()



In [8]:
fig, ax = plt.subplots()
ax.scatter(range(len(train_err)),train_err,color='r',label='train'); 
ax.plot(train_err,color='r')
ax.scatter(range(len(test_err)),test_err,color='b',label='test'); 
ax.plot(test_err,color='b')
ax.legend()
ax.set(title='Overfitting', xlabel='Polynomial Order', ylabel='Error')
ax.set_xticks(range(0,9)); ax.set_xticklabels(range(1,10))
plt.show()


Regularization

If we examine the magnitude of the parameters as a function of polynomial order, we see that they quickly become unwieldy.


In [9]:
def J2(w,x,t,order,lamda):
    c = 1./(2*x.size)
    diff = t-model(x,w,order)
    cost = c*(diff**2).sum() + 0.5*lamda*np.dot(w,w)  # SSE plus shrinkage penalty
    return cost

In [10]:
ws


Out[10]:
[array([ 0.49093896, -1.37435857]),
 array([ 0.34962688, -0.42050202, -0.95385656]),
 array([ -0.20689209,   8.74218523, -25.1028045 ,  16.09929863]),
 array([ -0.25754578,  10.64169876, -34.69534782,  31.48535822,  -7.69302979]),
 array([ -0.25225057,  10.1593045 , -30.67429247,  20.1191751 ,
          5.33518954,  -5.21128773]),
 array([ -2.81342371e-01,   1.79224523e+01,  -1.26450329e+02,
          4.32348216e+02,  -7.91287491e+02,   7.03398504e+02,
         -2.36203264e+02]),
 array([ -3.12263701e-01,   5.13614413e+01,  -6.78219819e+02,
          3.70440433e+03,  -1.00432258e+04,   1.41954211e+04,
         -1.00274505e+04,   2.79749922e+03]),
 array([ -3.10174778e-01,   3.64117072e+01,  -3.76737018e+02,
          1.43044632e+03,  -1.46484488e+03,  -3.66394352e+03,
          1.08084749e+04,  -9.95918980e+03,   3.18917226e+03]),
 array([ -3.12180354e-01,   2.45783183e+02,  -5.21285337e+03,
          4.49911247e+04,  -2.06107105e+05,   5.53857584e+05,
         -9.02099264e+05,   8.75504614e+05,  -4.65285594e+05,
          1.04105504e+05])]

In [11]:
ws_abs = [np.abs(wsi).sum() for wsi in ws]
ws_abs


Out[11]:
[1.8652975300581691,
 1.7239854475938268,
 50.151180446904746,
 84.772980373836205,
 71.751499902159566,
 2307.8915978236569,
 41497.894538674271,
 30929.530527259481,
 3157409.7371006855]

In [12]:
fig, ax = plt.subplots()
ax.plot(ws_abs)
ax.set(xlabel='Polynomial Order', ylabel=r'$\sum_i |w_i|$')
ax.set_xticks(range(9)); ax.set_xticklabels(range(1,10))
plt.show()


Therefore we add a term to the cost function that penalizes the parameters for growing too large. Methods that employ such scale-controlling terms are often referred to as shrinkage methods; in statistics this one is called ridge regression, and in the machine learning literature it is referred to as weight decay.

$$ J(\mathbf{w}) = \frac{1}{2N}\sum_{j=1}^N (t_j-y(x_j,\mathbf{w}))^2 + \frac{\lambda}{2}\left\| \mathbf{w} \right\|^2$$
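
Setting the gradient of this regularized cost to zero gives a modified Normal Equation, $(X^TX + N\lambda I)\mathbf{w} = X^T\mathbf{t}$. The sketch below is not in the original notebook, and the penalty value is an assumed example; it shows how even a tiny $\lambda$ shrinks the $9^{th}$ order coefficients that blew up above.

In [ ]:
# Ridge regression sketch: solve (X^T X + N*lambda*I) w = X^T t and compare the
# coefficient magnitudes with the unregularized sums printed above.
order = 10                       # 10 columns: a 9th order polynomial
lamda = np.exp(-18)              # an assumed example value of the penalty strength
X = np.vander(x, order, True)
A = np.dot(X.T, X) + x.size*lamda*np.eye(order)
w_ridge = np.linalg.solve(A, np.dot(X.T, t))
print(np.abs(w_ridge).sum())     # orders of magnitude smaller than before
print(J2(w_ridge, x, t, order, lamda))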

Dataset Size and Overfitting

Now we examine how our training and test errors change as the number of observations increases.


In [13]:
test_err = []
train_err = []
sizes = np.arange(10,300,10)
for size in sizes:
    trnerr = 0
    tsterr = 0
    for i in range(20):                    # average over 20 random datasets
        train_x, train_t = gen_data(size)
        X = np.vander(train_x, 10, True)   # 10 columns: a 9th order polynomial
        w = np.linalg.lstsq(X, train_t, rcond=None)[0]
        trnerr += J(w, train_x, train_t, 10)
        test_x, test_t = gen_data(size)
        tsterr += J(w, test_x, test_t, 10)

    train_err.append(trnerr/20)
    test_err.append(tsterr/20)


    
fig, ax = plt.subplots()
ax.scatter(range(len(train_err)),train_err,color='r',label='train'); 
ax.plot(train_err,color='r')
ax.scatter(range(len(test_err)),test_err,color='b',label='test'); 
ax.plot(test_err,color='b')
ax.legend()
ax.set(title='Consistency for $9^{th}$ Order Polynomial', xlabel='Size of Dataset', ylabel='Error')
plt.xticks(range(0,sizes.size,3),sizes[::3])
plt.show()



In [14]:
domain = np.linspace(0,1,200)
fig, ax = plt.subplots()
pts = np.vander(domain,10,True)   # 10 columns to match the 9th order fit above
y = np.dot(pts,w)
plot_dataset(ax,train_x,train_t)
ax.plot(domain,y,c='r',label='model')
ax.legend()
ax.set_title('Order 9 Polynomial')
plt.show()


Probability and Bayesian Primer

Let our dataset be given by $\mathcal{D} = \lbrace \mathbf{x}, \mathbf{t} \rbrace$ and our model by $y(x,\mathbf{w})$. Our model is parameterized by $\mathbf{w}$ where $\dim{\mathbf{w}}=L$.

The joint distribution tells us everything about the parameters and the data together, and is given by

$$p(w, \mathcal{D})$$

The product rule of probability theory allows us to express the joint distribution in terms of conditional and marginal distributions and is given by

$$p(w, \mathcal{D}) = p(w \lvert \mathcal{D})p(\mathcal{D}) = p(\mathcal{D} \lvert w)p(w)$$

The sum rule of probability theory allows us to eliminate certain variables and find what are called marginal distributions. It is given by

$$p(w) = \int p(w, \mathcal{D}) d\mathcal{D}$$

$$p(\mathcal{D}) = \int p(w,\mathcal{D}) dw$$

for continuous random variables, and $$p(w) = \sum_{\mathcal{D}} p(w, \mathcal{D})$$ $$p(\mathcal{D}) = \sum_{w} p(w,\mathcal{D})$$ for discrete random variables.

Dividing by $p(\mathcal{D})$ in the product rule above we arrive at Bayes Formula.

$$p(w \lvert \mathcal{D}) = \frac{p(\mathcal{D}\lvert w)p(w)}{p(\mathcal{D})}$$

where

$$p(\mathcal{D}\lvert w) \hspace{5pt} \text{is called the Likelihood}$$

$$p(w) \hspace{5pt} \text{is called the Prior}$$

$$p(\mathcal{D}) \hspace{5pt} \text{is called the Marginal}$$

$$p(w \lvert \mathcal{D}) \hspace{5pt} \text{is called the Posterior}$$

Another common way of expressing Bayes Formula involves combining both the product and sum rules

$$p(w \lvert \mathcal{D}) = \frac{p(\mathcal{D}\lvert w)p(w)}{\sum_{w}p(\mathcal{D}\lvert w)p(w)}$$
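
As a toy numeric illustration of the discrete form (the numbers here are assumed, not from the text), suppose $w$ ranges over two hypotheses about a coin, fair ($q=0.5$) or biased ($q=0.9$), with a uniform prior, and we observe a single head.

In [ ]:
# Discrete Bayes Formula: the denominator is the sum rule applied to the numerator.
likelihood = np.array([0.5, 0.9])        # p(D|w) for w = fair, biased
prior = np.array([0.5, 0.5])             # p(w), uniform over the two hypotheses
marginal = np.sum(likelihood*prior)      # p(D) by the sum rule
posterior = likelihood*prior/marginal    # p(w|D)
print(posterior)                         # [0.357... 0.642...]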

Probabilistic Perspective

Let us now cast the regression problem in a stochastic framework. Again our goal is to predict the target value $t$ given an input $x$, based on a set of training data $\mathbf{x} = (x_1,\dots,x_n)^T$ and their corresponding target values $\mathbf{t} = (t_1,\dots,t_n)^T$. We know there is noise in the data, there always is, and until now we have not taken it into consideration. Now, however, we will directly consider the uncertainty in our predictions. We will assume that given an input $x$, our target value $t$ follows a Gaussian distribution with mean $y(x,\mathbf{w})$ and variance $\sigma$. As in our previous formulation, the mean $y(x,\mathbf{w})$ is our prediction; unlike the previous formulation, here we explicitly account for our uncertainty via the variance $\sigma$.

$$p(t \lvert x,\mathbf{w},\sigma) = \mathcal{N}(t\lvert y(x,\mathbf{w}),\sigma)$$

Now if we assume that our training data $\lbrace \mathbf{x}, \mathbf{t} \rbrace$ are independent and identically distributed (iid), then we can express our likelihood function as

$$p(\mathbf{t} \lvert \mathbf{x},\mathbf{w},\sigma) = \prod_{n=1}^N \mathcal{N}(t_n\lvert y(x_n,\mathbf{w}),\sigma)$$

We notice a few things here. Firstly, in casting this problem stochastically we have increased the number of parameters from $\dim{\mathbf{w}}$ to $\dim{\mathbf{w}}+1$ to account for the extra $\sigma$ term. Secondly, by assuming $\sigma$ to be constant for all values of $x$, we are assuming the noise is homoscedastic. If instead we believed the noise changed for different values of $x$, we could allow the variance to be a function of the input, $\sigma(x)$, in which case the noise is called heteroscedastic.

This new formulation changes our task T. Before, we were fitting an $M^{th}$ order polynomial to the data; here we wish to fit the parameters $\theta = \lbrace{\mathbf{w}, \sigma}\rbrace$. Our goal this time will be to find the parameters $\theta^*$ that maximize the probability $p(\mathbf{t} \lvert \mathbf{x}, \mathbf{w}, \sigma)$. This goal, maximizing the likelihood, is the quintessential frequentist approach. Where before we explicitly chose the performance measure P to be the SSE, here we analytically optimize the likelihood and use the result as our performance measure.

The log function is monotonic; therefore the extrema of $\ln f(x)$ occur at the same points as the extrema of $f(x)$. Considering this, and the fact that the logarithm of an exponential-family distribution is analytically convenient, we will optimize the log of the likelihood.

$$ \ln p(\mathbf{t} \lvert \mathbf{x}, \mathbf{w}, \sigma) = -\frac{1}{2\sigma} \sum_{n=1}^N \lbrace y(x_n,\mathbf{w}) - t_n \rbrace^2 - \frac{N}{2} \ln \sigma- \frac{N}{2} \ln(2\pi)$$

Let us maximize the log likelihood with respect to $\mathbf{w}$. We notice two things: firstly, the last two terms are independent of $\mathbf{w}$; secondly, maximizing the likelihood is equivalent to minimizing the negative log likelihood.

$$ \mathbf{w}_{ML} = \arg\max_\mathbf{w} \hspace{4pt} \ln p(\mathbf{t} \lvert \mathbf{x}, \mathbf{w}, \sigma) = \arg\min_\mathbf{w} \hspace{4pt} -\ln p(\mathbf{t} \lvert \mathbf{x}, \mathbf{w}, \sigma) = \arg\min_{\mathbf{w}} \frac{1}{2} \sum_{n=1}^N \lbrace y(x_n,\mathbf{w}) - t_n \rbrace^2 = \arg\min_\mathbf{w} \text{SSE}$$

Thus the SSE has arisen as a consequence of maximizing the likelihood under the assumption of Gaussian noise. The optimal parameters found with such a technique are aptly called the maximum likelihood (ML) solutions. With the maximum likelihood solution for $\mathbf{w}$ found, we can also find the maximum likelihood solution for the noise variance.

$$\sigma_{ML} = \frac{1}{N} \sum_{n=1}^N \lbrace y(x_n, \mathbf{w}_{ML})-t_n \rbrace^2$$

With the optimal parameters found we can define our predictive distribution

$$p(t \lvert x, \mathbf{w}_{ML}, \sigma_{ML}) = \mathcal{N}(t \lvert y(x, \mathbf{w}_{ML}), \sigma_{ML})$$
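As a sketch (the cubic order here is an assumed choice, not from the text), we can compute $\mathbf{w}_{ML}$ and $\sigma_{ML}$ on our dataset and shade the predictive distribution one standard deviation, $\sqrt{\sigma_{ML}}$, around the mean.

In [ ]:
# Maximum likelihood fit plus predictive uncertainty band.
order = 4                                     # cubic model, an assumed choice
X = np.vander(x, order, True)
w_ml = np.linalg.lstsq(X, t, rcond=None)[0]   # w_ML, the least squares solution
sigma_ml = ((np.dot(X, w_ml)-t)**2).mean()    # ML estimate of the noise variance
domain = np.linspace(0, 1, 200)
y = np.dot(np.vander(domain, order, True), w_ml)
fig, ax = plt.subplots()
plot_dataset(ax, x, t)
ax.plot(domain, y, c='r', label=r'$y(x,\mathbf{w}_{ML})$')
ax.fill_between(domain, y-np.sqrt(sigma_ml), y+np.sqrt(sigma_ml),
                color='r', alpha=0.2, label=r'$\pm\sqrt{\sigma_{ML}}$')
ax.legend()
plt.show()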

Let's Go Deeper

What is the expected value of our SSE loss function $L(t,y(\bf{x}))$? We have dropped the dependence of $y$ on $\bf{w}$ for notational convenience and here $y$ operates element-wise on the vector $\bf{x}$.

$$ \mathbb{E}[L] = \iint L(t,y(\mathbf{x}))p(\mathbf{x},t)\text{d}\mathbf{x}\text{d}t $$

$$ \mathbb{E}[L] = \iint \lbrace y(\mathbf{x})-t \rbrace^2 p(\mathbf{x},t)\text{d}\mathbf{x}\text{d}t $$

We wish to choose a $y(\mathbf{x})$ that minimizes the expected loss. We use the calculus of variations

$$ \frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})} = 2 \int \lbrace y(\mathbf{x})-t \rbrace p(\mathbf{x},t)\text{d}t $$

We now set it to zero and solve for $y(\mathbf{x})$

$$ y(\mathbf{x})\int p(\mathbf{x},t)\text{d}t - \int t p(\mathbf{x},t)\text{d}t = 0$$

$$ y(\mathbf{x}) = \frac{\int t p(\mathbf{x},t)\text{d}t}{p(\mathbf{x})} = \int t p(t \lvert \mathbf{x})\text{d}t = \mathbb{E}_t[t \lvert \mathbf{x}]$$

Another way to arrive at this result is as follows

$$ \lbrace y(\mathbf{x}) - t \rbrace^2 = \lbrace y(\mathbf{x}) -\mathbb{E}_t[t \lvert \mathbf{x}] + \mathbb{E}_t[t \lvert \mathbf{x}]- t \rbrace^2$$

$$ \lbrace y(\mathbf{x}) - t \rbrace^2 = \lbrace y(\mathbf{x}) -\mathbb{E}_t[t \lvert \mathbf{x}] \rbrace^2 + 2\lbrace y(\mathbf{x}) -\mathbb{E}_t[t \lvert \mathbf{x}] \rbrace\lbrace \mathbb{E}_t[t \lvert \mathbf{x}]- t \rbrace + \lbrace \mathbb{E}_t[t \lvert \mathbf{x}]- t \rbrace^2$$

We now substitute this expression into the loss function

$$ \mathbb{E}[L] = \iint \lbrace y(\mathbf{x}) -\mathbb{E}_t[t \lvert \mathbf{x}] \rbrace^2 p(\mathbf{x},t)\text{d}\mathbf{x}\text{d}t + 2\iint \lbrace y(\mathbf{x}) -\mathbb{E}_t[t \lvert \mathbf{x}] \rbrace\lbrace \mathbb{E}_t[t \lvert \mathbf{x}]- t \rbrace p(\mathbf{x},t)\text{d}\mathbf{x}\text{d}t + \iint \lbrace \mathbb{E}_t[t \lvert \mathbf{x}]- t \rbrace^2 p(\mathbf{x},t)\text{d}\mathbf{x}\text{d}t $$

Integrating over $t$, we notice that the cross term vanishes, and we are left with

$$ \mathbb{E}[L] = \int \lbrace y(\mathbf{x}) -\mathbb{E}_t[t \lvert \mathbf{x}] \rbrace^2 p(\mathbf{x})\text{d}\mathbf{x} + \iint \lbrace \mathbb{E}_t[t \lvert \mathbf{x}]- t \rbrace^2 p(\mathbf{x},t)\text{d}\mathbf{x}\text{d}t$$

The first term vanishes exactly when $y(\mathbf{x}) = \mathbb{E}_t[t \lvert \mathbf{x}]$, while the second term, the conditional variance of $t$ averaged over $\mathbf{x}$, represents the irreducible minimum loss due to noise.
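
We can sanity-check this decomposition numerically (a sketch, not in the original): for our dataset $\mathbb{E}_t[t \lvert x] = \sin(2\pi x)$, so its expected squared loss should approach the irreducible noise variance $0.3^2 = 0.09$, while any other predictor pays an additional penalty.

In [ ]:
# Monte Carlo check of the expected loss decomposition on a large sample.
big_x, big_t = gen_data(100000)
loss_mean = ((np.sin(2*np.pi*big_x) - big_t)**2).mean()  # predictor E[t|x]
loss_zero = ((0.0 - big_t)**2).mean()                    # constant predictor y(x) = 0
print(loss_mean)   # approximately 0.09, the noise floor
print(loss_zero)   # approximately 0.59: 0.09 plus the mean of sin^2(2*pi*x), 0.5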

A Step Towards Bayes

Frequentist

$$p(\bf{t} \lvert \bf{x},\bf{w},\sigma)$$

This is the likelihood function and is key to understanding the frequentist perspective, the perspective of frequencies of random repeatable events. It poses the problem as "what is the probability of the data, given a certain set of parameters?". In the previous example we found the set of parameters that maximize this distribution using a popular estimator called maximum likelihood. In other words, we estimated the parameters that maximize the probability of our data, which maximizes the probability of the true data ON AVERAGE over repeated datasets. This is the underlying distinction between the Bayesian and frequentist perspectives.

In the frequentist perspective, the data is considered the random variable, and the aim is to find solutions that account for our uncertainty in the observed data. The parameters are considered fixed, and their value is determined by some form of 'estimator'. Error bars or uncertainty in the parameters are found by considering the distribution of possible datasets $\mathcal{D}$.

In a frequentist perspective the data is the random variable

Bayesian

$$p(\bf{w} \lvert \bf{x},\bf{t},\sigma) \propto p(\bf{t} \lvert \bf{x},\bf{w},\sigma)p(\bf{w})$$

Above we have the Bayesian viewpoint. From this perspective there is only one single dataset $\mathcal{D}$, the one that is actually observed. The uncertainty in the parameters is expressed via the probability distribution describing them.

In a Bayesian perspective the parameters are the random variable.

Let us examine the curve fitting problem from a more Bayesian perspective by introducing a prior over the parameters. For ease of exposition we will assume a simple Gaussian form

$$p(\bf{w} \lvert \alpha) = \mathcal{N}(\bf{w} \lvert \bf{0}, \alpha \bf{I}) = \bigg ( \frac{1}{2\pi \alpha} \bigg)^{(M+1)/2}\exp\bigg\{-\frac{1}{2\alpha}\bf{w}^T\bf{w}\bigg\}$$

Using Bayes theorem above, we can express the posterior in terms of the likelihood and the prior. We interpret their product as proportional to the distribution describing the probability of $\bf{w}$ given the observed data $\mathcal{D} = \{\bf{x},\bf{t}\}$. If we find the $\bf{w}$ that maximizes this posterior distribution, we are finding the so-called maximum a posteriori (MAP) solution. Using techniques similar to ML, we take the negative logarithm of the posterior and find that its maximum is given by the minimum of

$$\mathbf{w}_{MAP} = \arg\min_{\mathbf{w}} \frac{1}{2\sigma} \sum_{n=1}^N \lbrace y(x_n,\mathbf{w}) - t_n \rbrace^2 + \frac{1}{2\alpha}\mathbf{w}^T\mathbf{w}$$

We see that maximizing the posterior distribution is equivalent to minimizing the SSE with a shrinkage regularization term, where $\lambda = \sigma/\alpha$ (up to the $1/N$ normalization we used earlier).

Full Bayes

Even though we've included a prior we're still making a point estimate of $\bf{w}$. A true fully Bayesian approach involves consistently applying the sum and product rules...

Coin Flipping with Bayes

To understand the difficulty and complexity of using Bayes, let us explore a very simple problem: coin flipping.


In [15]:
flip = lambda: np.random.binomial(n=1,p=0.5) # Bernoulli random variable

Now let us pose the problem of learning the probability distribution of a new, unseen coin with probability of heads $q$ and tails $1-q$. We begin by assuming no information about the coin, so our prior will be a uniform distribution. The likelihood function will be a Bernoulli, and the posterior will be the normalized product of the two.

$$ p(q \lvert x) = \frac{p(x \lvert q)p(q)}{p(x)}$$

Let us derive the update rule. First we must define all the terms on the right-hand side. We start with the likelihood...

$$p(x_1 \lvert q) = q^{x_1} (1-q)^{1-x_1}$$

For the prior, we begin with the assumption that every value of $q$ is equally likely. We will allow the data to teach us the properties of the coin

$$p_0(q) = 1$$

For the marginal we must integrate over all possible values of $q$.

$$p(x_1) = \int_0^1 p(x_1 \lvert q)p_0(q)\text{d}q = \int_0^1 q^{x_1} (1-q)^{1-x_1}\text{d}q$$

Using repeated integration by parts (the integral is the Beta function $B(x_1+1, 2-x_1)$), it can be shown to equal

$$\int_0^1 q^{x_1} (1-q)^{1-x_1}\text{d}q = \frac{x_1!(1-x_1)!}{2}$$

However, we observe that $x_1 \in \{0,1\}$ and in either case the numerator is always equal to 1, therefore

$$\int_0^1 q^{x_1} (1-q)^{1-x_1}\text{d}q = \frac{1}{2}$$

We combine the above equations to see what the updated prior would look like after observing one coin toss...

$$p_1(q \lvert x_1) = \frac{p(x_1 \lvert q)p_0(q)}{p(x_1)} = \frac{q^{x_1} (1-q)^{1-x_1}(1)}{\frac{1}{2}} = 2q^{x_1} (1-q)^{1-x_1}$$

Notice how the posterior of the previous observation becomes the prior of the next observation.

$$p_2(q \lvert x_1, x_2) = \frac{p(x_2 \lvert q)p_1(q\lvert x_1)}{\int_0^1 p(x_2 \lvert q)p_1(q \lvert x_1)\text{d}q }$$

In [16]:
fig, ax = plt.subplots()
ax.plot([0,1],[1,1],c='r')
ax.set_title('Prior Distribution $p_0(q)$'); ax.set_xlabel('$q$'); ax.set_ylabel('$p_0(q)$')
plt.show()


Now we flip a coin and update our prior using the above update rule.


In [17]:
x = 1
p = lambda x,q: 2*q**x*(1-q)**(1-x)
xpts = np.linspace(0,1,100)
fig, ax = plt.subplots()
ax.plot(xpts,p(x,xpts),c='r')
ax.set_title('Updated Prior Distribution $p_1(q | x_1=$ %d $)$' %(x)); ax.set_xlabel('$q$'); ax.set_ylabel('$p_1(q)$')
plt.show()


Let us derive the next update rule. We go through a similar process, but this time the prior is no longer uniform; it is equal to the value found above.

$$p_2(q \lvert x_1, x_2) = \frac{p(x_2 \lvert q)p_1(q\lvert x_1)}{p(x_2\lvert x_1)}$$

$$p_2(q \lvert x_1, x_2) = \frac{q^{x_2}(1-q)^{1-x_2}\big (2q^{x_1}(1-q)^{1-x_1}\big)}{\int_0^1q^{x_2}(1-q)^{1-x_2}\big (2q^{x_1}(1-q)^{1-x_1}\big)\text{d}q} = \frac{q^{x_1+x_2}(1-q)^{2-(x_1+x_2)}}{\int_0^1 q^{x_1+x_2}(1-q)^{2-(x_1+x_2)}\text{d}q}$$

The integral in the denominator can be evaluated using the same technique as before

$$\int_0^1 q^{x_1+x_2}(1-q)^{2-(x_1+x_2)}\text{d}q = \frac{(x_1+x_2)!(2-x_1-x_2)!}{3!}$$

Combining the result of the integral into Bayes formula we arrive at the second updated prior

$$p_2(q \lvert x_1, x_2) = \frac{6q^{x_1+x_2}(1-q)^{2-(x_1+x_2)}}{(x_1+x_2)!(2-x_1-x_2)!}$$

In [18]:
from scipy.special import factorial as fact
x1 = 1
x2 = 0
p = lambda q: (6*q**(x1+x2)*(1-q)**(2-x1-x2))/(fact(x1+x2)*fact(2-x1-x2))
xpts = np.linspace(0,1,100)
fig, ax = plt.subplots()
ax.plot(xpts,p(xpts),c='r')
ax.set_title('Updated Prior Distribution $p_2(q | x_1=$ %d, $x_2=$ %d$)$' %(x1,x2)); ax.set_xlabel('$q$'); ax.set_ylabel('$p_2(q)$')
plt.show()
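
Rather than deriving each update by hand, we can generalize the process numerically (a sketch, not in the original): represent the prior on a grid of $q$ values and apply Bayes Formula once per flip, normalizing with a quadrature rule in place of the analytic integral. After many flips the posterior concentrates around the coin's true bias of $0.5$.

In [ ]:
# Grid-based sequential Bayesian updating, using the flip() defined earlier.
qs = np.linspace(0, 1, 200)
prior = np.ones_like(qs)                  # p_0(q) = 1, the uniform prior
for n in range(100):
    xn = flip()
    likelihood = qs**xn*(1-qs)**(1-xn)    # Bernoulli likelihood of this flip
    prior = likelihood*prior              # numerator of Bayes Formula
    prior /= np.trapz(prior, qs)          # divide by the marginal (quadrature)
fig, ax = plt.subplots()
ax.plot(qs, prior, c='r')
ax.set_title('Posterior after 100 flips'); ax.set_xlabel('$q$'); ax.set_ylabel('$p_{100}(q)$')
plt.show()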


To demonstrate further, let us go back to regression, but this time we fit a straight line with a simple linear model $y(x,\bf{w}) = w_0 + w_1x$. We assume an isotropic Gaussian prior, as sketched below.
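
Below is a minimal sketch of that setup (the prior and noise variances and the true line are assumed example values). Because the Gaussian prior is conjugate to the Gaussian likelihood, the posterior over $\mathbf{w}$ is also Gaussian, with covariance $S = (\alpha^{-1}I + \sigma^{-1}X^TX)^{-1}$ and mean $\mathbf{m} = \sigma^{-1}S X^T\mathbf{t}$, following our convention that $\alpha$ and $\sigma$ denote variances.

In [ ]:
# Conjugate Bayesian update for the straight line model y(x,w) = w0 + w1*x.
alpha = 2.0                                  # assumed prior variance on the weights
sigma = 0.3**2                               # assumed noise variance
a0, a1 = -0.3, 0.5                           # assumed true parameters of the line
line_x = np.random.rand(20)
line_t = a0 + a1*line_x + np.sqrt(sigma)*np.random.randn(20)
X = np.vander(line_x, 2, True)               # design matrix with columns [1, x]
S = np.linalg.inv(np.eye(2)/alpha + np.dot(X.T, X)/sigma)   # posterior covariance
m = np.dot(S, np.dot(X.T, line_t))/sigma                    # posterior mean
print(m)                                     # approaches [a0, a1] as data accumulates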

Summary

Frequentist

  1. Data is the random variable
  2. Maximizing the likelihood (ML) is estimating which parameters make the data most probable on average
  3. Susceptible to bias and overfitting
  4. Easily interpretable

Bayesian

  1. Parameters are the random variable
  2. Maximizing the posterior (MAP) is finding which parameters are most probable given the data
  3. Priors are often chosen for mathematical convenience rather than as a reflection of prior beliefs, which makes them difficult to interpret
  4. Protection against overfitting

Full Bayes

  1. Parameters and data are random variables
  2. Alternate between considering the data and parameters as being observed and learn the full posterior distribution
  3. Learns full distribution, everything about the data
  4. With a well chosen prior, bias and overfitting are often of little concern
  5. In practice it is computationally intractable; we must use approximation methods such as variational inference and Monte Carlo techniques
