$\newcommand{\ind}[1]{\left[#1\right]}$

Model

  • Data set $$ {\cal D} = \{ x_1, \dots x_N \} $$
  • Model with parameter $\theta$ $$ p(\cal D | \theta) $$

Maximum Likelihood

  • Maximum Likelihood (ML) $$ \theta^{\text{ML}} = \arg\max_{\theta} \log p({\cal D} | \theta) $$
  • Predictive distribution $$ p(x_{N+1} | {\cal D} ) \approx p(x_{N+1} | \theta^{\text{ML}}) $$

Maximum A Posteriori

  • Prior $$ p(\theta) $$

  • Maximum a posteriori (MAP): regularised maximum likelihood $$ \theta^{\text{MAP}} = \arg\max_{\theta} \log \left[ p({\cal D} | \theta)\, p(\theta) \right] $$

  • Predictive distribution $$ p(x_{N+1} | {\cal D} ) \approx p(x_{N+1} | \theta^{\text{MAP}}) $$

Bayesian Learning

  • We treat parameters on the same footing as all other variables
  • We integrate over unknown parameters rather than using point estimates (remember the many-dice example)
    • Self-regularisation, avoids overfitting
    • Natural setup for online adaptation
    • Model selection
  • Predictive distribution \begin{eqnarray} p(x_{N+1} , {\cal D} ) &=& \int d\theta \;\; p(x_{N+1} | \theta) p( {\cal D}| \theta) p(\theta) \\ &=& \int d\theta \;\; p(x_{N+1}| \theta) p( {\cal D}, \theta) \\ &=& \int d\theta \;\; p(x_{N+1}| \theta) p( \theta| {\cal D}) p({\cal D}) \\ &=& p({\cal D}) \int d\theta \;\; p(x_{N+1}| \theta) p( \theta| {\cal D}) \\ p(x_{N+1} | {\cal D} ) &=& \int d\theta \;\; p(x_{N+1} | \theta) p(\theta | {\cal D}) \end{eqnarray} where the last line follows after dividing both sides by $p({\cal D})$

The interpretation is that the past data provide an 'update' to the prior, which is then used for the current prediction; a small numerical sketch of this predictive integral is given below.

  • Bayesian learning is just inference ...
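As a concrete, purely illustrative sketch of the predictive integral, one can approximate it numerically on a grid over a one-dimensional parameter. The Bernoulli example below anticipates the coin model treated in the rest of these notes; the data and grid are made up.

import numpy as np

# Illustrative sketch: grid approximation of the Bayesian predictive distribution
# for a Bernoulli model with a uniform prior on theta. The data are made up.
x = np.array([1, 0, 1, 1, 1, 0, 1])       # observed flips
theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter

log_lik = x.sum()*np.log(theta) + (len(x) - x.sum())*np.log(1 - theta)
log_prior = np.zeros_like(theta)          # uniform prior (up to a constant)

post = np.exp(log_lik + log_prior)
post /= post.sum()                        # normalised posterior on the grid

# p(x_{N+1}=1 | D) = int dtheta p(x=1|theta) p(theta|D), approximated by a sum
p_next_is_one = np.sum(theta * post)
print(p_next_is_one)                      # close to (c_1 + 1)/(N + 2) for a uniform prior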

Learning the parameter of a (possibly fake) coin

Suppose we have a coin that is flipped several times independently. A natural question to ask is whether one can predict the outcome of the next flip.

It depends. If we already know that the coin is fair, there is nothing we can learn from past data, and the future flips are indeed independent of the previous ones. However, if we do not know the probability of the coin, we can estimate this parameter from past data to make a better prediction. Mathematically, the model is identical to the generic model above, where $\theta$ is the parameter of the coin.

Maximum Likelihood Estimation

We observe the outcome of $N$ coin flips $\{x^{(n)}\}_{n=1\dots N}$ where $x^{(n)} \in \left\{0,1\right\}$. The model is a Bernoulli distribution with parameter $\pi = (\pi_0, \pi_1)$. We have $\pi_0 = 1 - \pi_1$ where $0 \leq \pi_1 \leq 1$.

\begin{eqnarray} x^{(n)} & \sim & p(x^{(n)}|\pi) = (1-\pi_1)^{1-x^{(n)} } \pi_1^{x^{(n)} } \end{eqnarray}

The log-likelihood is

\begin{eqnarray} {\cal L}(\pi_1) & = & \sum_{n=1}^N (1- x^{(n)}) \log (1 - \pi_1) + \sum_{n=1}^N x^{(n)} \log (\pi_1) \\ & = & \log (1 - \pi_1) \sum_{n=1}^N (1- x^{(n)}) + \log (\pi_1) \sum_{n=1}^N x^{(n)} \end{eqnarray}

We define the number of $0$'s
\begin{eqnarray} c_0 = \sum_{n=1}^N (1- x^{(n)}) \end{eqnarray} and $1$'s as \begin{eqnarray} c_1 = \sum_{n=1}^N x^{(n)} \end{eqnarray}

\begin{eqnarray} {\cal L}(\pi_1) & = & \log (1 - \pi_1) c_0 + \log (\pi_1) c_1 \end{eqnarray}

We compute the gradient and set it to zero \begin{eqnarray} \frac{\partial}{\partial \pi_1} {\cal L}(\pi_1) & = & - \frac{c_0}{1 - \pi_1} + \frac{c_1}{\pi_1} = 0 \end{eqnarray}

The solution is, as expected, the relative frequency of ones \begin{eqnarray} \pi_1^{\text{ML}} & = &\frac{c_1}{c_0 + c_1} = \frac{c_1}{N} \end{eqnarray}
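A quick sanity check, as a minimal sketch (the coin and its bias below are simulated and not part of the notes): the ML estimate is just the relative frequency of ones.

import numpy as np

# Simulate N independent flips of a coin with an (assumed) true pi_1 = 0.7
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=100)

c_1 = x.sum()         # number of ones
c_0 = len(x) - c_1    # number of zeros

pi_ML = c_1 / (c_0 + c_1)
print(pi_ML)          # relative frequency of ones, close to 0.7 for large N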

Maximum A Posteriori Estimation

We need a prior over the probability parameter. One choice is the beta distribution

\begin{eqnarray} p(\pi_1) & = & \mathcal{B}(\pi_1; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta) } \pi_1^{\alpha-1} (1-\pi_1)^{\beta-1} \end{eqnarray}

The log joint distribution of the data and the parameter is \begin{eqnarray} \log p(X, \pi_1) & = & \log p(\pi_1) + \sum_{n=1}^N \log p(x^{(n)}|\pi_1) \\ & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & + (\alpha-1) \log \pi_1 + (\beta-1) \log(1-\pi_1) \\ & & + c_1 \log (\pi_1) + c_0 \log (1 - \pi_1) \\ & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \end{eqnarray}

The gradient is

\begin{eqnarray} \frac{\partial}{\partial \pi_1} \log p(X, \pi_1) & = & - \frac{\beta + c_0 -1}{1 - \pi_1} + \frac{\alpha + c_1 -1}{\pi_1} = 0 \end{eqnarray}

We can solve for the parameter. \begin{eqnarray} \pi_1 (\beta + c_0 -1) & = & (1 - \pi_1) (\alpha + c_1 -1) \\ \pi_1 \beta + \pi_1 c_0 - \pi_1 & = & \alpha + c_1 - 1 - \pi_1 \alpha - \pi_1 c_1 + \pi_1 \\ \pi_1 & = & \frac{\alpha - 1 + c_1}{\alpha + \beta - 2 + c_0 + c_1} \\ \end{eqnarray}

When the prior is flat, i.e., when $\alpha = \beta = 1$, MAP and ML solutions coincide.
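A small sketch comparing the two point estimates; the prior hyperparameters act as pseudo-counts that pull the MAP estimate towards the prior mode (the numbers below are illustrative, not from the notes).

# Illustrative counts and hyperparameters
alpha, beta = 5.0, 5.0      # prior pseudo-counts favouring a fair coin
c_0, c_1 = 2, 8             # made-up counts from 10 flips

pi_ML  = c_1 / (c_0 + c_1)
pi_MAP = (alpha - 1 + c_1) / (alpha + beta - 2 + c_0 + c_1)

print(pi_ML, pi_MAP)        # MAP (2/3) is pulled towards 0.5 relative to ML (0.8)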

Full Bayesian Inference

We infer the posterior

\begin{eqnarray} p(\pi_1| X) & = & \frac{p(\pi_1, X)}{p(X)} \end{eqnarray}

The log joint density is \begin{eqnarray} \log p(X, \pi_1) & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \end{eqnarray}

At this stage, we may try to evaluate the integral $$ p(X) = \int d\pi_1 p(X, \pi_1) $$

Rather than evaluating this integral directly, a simpler approach, analogous to 'completing the square', is to add and subtract terms so that we can recognise a known, normalised density inside the expression.

\begin{eqnarray} \log p(X, \pi_1) & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & - \log \Gamma(\alpha + \beta + c_0 + c_1) + \log \Gamma(\alpha + c_1) + \log \Gamma(\beta + c_0) \\ & & + \log \Gamma(\alpha + \beta + c_0 + c_1) - \log \Gamma(\alpha + c_1) - \log \Gamma(\beta + c_0) \\ & & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \\ & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & - \log \Gamma(\alpha + \beta + c_0 + c_1) + \log \Gamma(\alpha + c_1) + \log \Gamma(\beta + c_0) \\ & & + \log \mathcal{B}(\pi_1; \alpha + c_1, \beta + c_0) \\ & = & \log p(X) + \log p(\pi_1| X) \end{eqnarray}

From the resulting expression, exponentiating both sides, we identify \begin{eqnarray} p(\pi_1| X) & = & \mathcal{B}(\pi_1; \alpha + c_1, \beta + c_0) \\ p(X) & = & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha + c_1)\Gamma(\beta + c_0)}{\Gamma(\alpha + \beta + c_0 + c_1)} \end{eqnarray}
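The closed form for $p(X)$ can be verified numerically by integrating the joint density over a grid; a small sketch with illustrative hyperparameters and counts:

import numpy as np
from scipy.special import gammaln

alpha, beta = 2.0, 3.0    # illustrative hyperparameters
c_0, c_1 = 4, 6           # illustrative counts

# Closed-form log marginal likelihood log p(X)
log_pX = (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
          + gammaln(alpha + c_1) + gammaln(beta + c_0)
          - gammaln(alpha + beta + c_0 + c_1))

# Numerical check: integrate the joint p(X, pi_1) over a fine grid
pi = np.linspace(1e-6, 1 - 1e-6, 100001)
log_joint = (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
             + (alpha + c_1 - 1)*np.log(pi) + (beta + c_0 - 1)*np.log(1 - pi))
log_pX_grid = np.log(np.sum(np.exp(log_joint)) * (pi[1] - pi[0]))

print(log_pX, log_pX_grid)   # the two values should agree closely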

Predictive distribution: let $a=\alpha + c_1$ and $b=\beta+c_0$. Then

\begin{eqnarray} p(x|X) = \int d\pi_1 p(x|\pi_1) p(\pi_1| X) & = & \int d\pi_1 \mathcal{BE}(x; \pi_1) \mathcal{B}(\pi_1; a, b) \\ & = & \int d\pi_1 \; \pi_1^{\ind{x=1}}(1-\pi_1)^{\ind{x=0}}\mathcal{B}(\pi_1; a, b) \\ & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + \ind{x=1})\Gamma(b + \ind{x=0})}{\Gamma(a + b + 1)} \end{eqnarray}

In particular,

\begin{eqnarray} p(x=1|X) & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + 1)\Gamma(b)}{\Gamma(a + b + 1)} \\ & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{a\,\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a + b)} \\ & = & \frac{a}{a+b} \end{eqnarray}
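The result $p(x=1|X) = a/(a+b)$ is exactly the posterior mean of $\pi_1$; a one-line check with scipy (hyperparameters and counts below are illustrative):

from scipy.stats import beta as beta_dist

alpha, beta_ = 2.0, 3.0     # illustrative prior hyperparameters
c_0, c_1 = 4, 6             # illustrative counts
a, b = alpha + c_1, beta_ + c_0

# The predictive probability p(x=1|X) equals the posterior mean of pi_1
print(a / (a + b), beta_dist(a, b).mean())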

Alternative Derivation

Alternatively, we may directly write \begin{eqnarray} p(X, \pi_1) & = & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi_1^{(\alpha + c_1 -1)} (1-\pi_1)^{(\beta + c_0 -1)} \end{eqnarray}

\begin{eqnarray} p(X) &=& \int d\pi_1 p(X, \pi_1) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int d\pi_1 \pi_1^{(\alpha + c_1 -1)} (1-\pi_1)^{(\beta + c_0 -1)} \end{eqnarray}

From the definition of the beta distribution, we can arrive at the 'formula' for the integral \begin{eqnarray} 1 &=& \int d\pi \mathcal{B}(\pi; a, b) \\ & = & \int d\pi \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \pi^{(a -1)} (1-\pi)^{(b -1)} \\ \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} & = & \int d\pi \pi^{(a -1)} (1-\pi)^{(b -1)} \end{eqnarray} Just substitute $a = \alpha + c_1$ and $b = \beta + c_0$
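The beta integral itself can be checked by numerical quadrature; a small sketch with arbitrary positive values of $a$ and $b$:

import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

a, b = 3.5, 2.0   # arbitrary positive values for the check

closed_form = np.exp(gammaln(a) + gammaln(b) - gammaln(a + b))
numerical, _ = quad(lambda p: p**(a - 1) * (1 - p)**(b - 1), 0.0, 1.0)

print(closed_form, numerical)   # should agree to numerical precision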

An Approximation

For large $x$, we have the following approximation \begin{eqnarray} \log \Gamma(x + a) - \log \Gamma(x) & \approx & a \log(x) \\ \Gamma(x + a) & \approx & \Gamma(x) x^a \\ \end{eqnarray}
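A quick numerical check of this approximation for a few (arbitrary) values of $x$:

import numpy as np
from scipy.special import gammaln

a = 2.5
for x in [10.0, 100.0, 1000.0]:
    exact = gammaln(x + a) - gammaln(x)
    approx = a * np.log(x)
    print(x, exact, approx)   # the difference shrinks as x grows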

When $c_0$ and $c_1$ are large, we obtain:

\begin{eqnarray} p(X) & \approx & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(c_1)\Gamma(c_0)c_0^{\beta}c_1^{\alpha}}{\Gamma(c_0 + c_1)(c_0+c_1)^{\alpha + \beta}} \end{eqnarray}

Let $\hat{\pi}_1 = c_1/(c_0+c_1)$ and $N = c_0 + c_1$, we have \begin{eqnarray} p(X) & \approx & \frac{\Gamma(c_1)\Gamma(c_0)}{\Gamma(c_0 + c_1)} (1-\hat{\pi}_1) \hat{\pi}_1 \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (1-\hat{\pi}_1)^{\beta-1}\hat{\pi}_1^{\alpha-1} \end{eqnarray}
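Note that the factor $\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (1-\hat{\pi}_1)^{\beta-1}\hat{\pi}_1^{\alpha-1}$ is the prior density evaluated at the ML estimate, $\mathcal{B}(\hat{\pi}_1; \alpha, \beta)$. A short sketch comparing the exact log marginal likelihood with this large-count approximation (counts and hyperparameters are illustrative):

import numpy as np
from scipy.special import gammaln

alpha, beta = 2.0, 3.0     # illustrative hyperparameters
c_0, c_1 = 400, 600        # large illustrative counts
N = c_0 + c_1
pi_hat = c_1 / N

log_prior_const = gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)

# Exact log marginal likelihood
log_pX_exact = (log_prior_const + gammaln(alpha + c_1) + gammaln(beta + c_0)
                - gammaln(alpha + beta + N))

# Large-count approximation from the text
log_pX_approx = (gammaln(c_1) + gammaln(c_0) - gammaln(N)
                 + np.log(1 - pi_hat) + np.log(pi_hat)
                 + log_prior_const
                 + (beta - 1)*np.log(1 - pi_hat) + (alpha - 1)*np.log(pi_hat))

print(log_pX_exact, log_pX_approx)   # close when c_0 and c_1 are large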

Illustration: Bayesian update of a Beta Distribution


In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import gammaln

def log_beta_pdf(x, a, b):
    # Log density of the Beta(a, b) distribution evaluated at x
    return - gammaln(a) - gammaln(b) + gammaln(a+b) + np.log(x)*(a-1) + np.log(1-x)*(b-1)

x = np.arange(0.01, 1, 0.01)

# Prior hyperparameters and observed counts (one tail, one head)
a = 1
b = 1
c_0 = 1
c_1 = 1
N = c_0 + c_1

# Maximum likelihood estimate of pi_1
pi_ML = c_1/N

plt.figure(figsize=(12,8))
plt.plot(x, np.exp(log_beta_pdf(x, a, b)), 'b')           # prior
plt.plot(x, np.exp(log_beta_pdf(x, a+c_1, b+c_0)), 'r')   # posterior
yl = plt.gca().get_ylim()
plt.plot([pi_ML, pi_ML], yl, 'k:')                        # ML estimate
plt.legend([r"Prior $\cal B$ $(a={}, b={})$".format(a, b),
            r"Posterior $\cal B$ $(a={}, b={})$".format(a+c_1, b+c_0)], loc="best")
plt.show()


Illustration: Learning from a sequence of coin flips


In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.special import gammaln
import numpy as np
import scipy.special as sps


def log_beta_pdf(x, a, b):
    # Log density of Beta(a, b); the endpoints x=0 and x=1 are handled
    # separately to avoid taking log(0).
    with np.errstate(divide='ignore', invalid='ignore'):
        y = - gammaln(a) - gammaln(b) + gammaln(a+b) + np.log(x)*(a-1) + np.log(1-x)*(b-1)

    idx = np.where(x == 0)
    if a == 1:
        y[idx] = - gammaln(a) - gammaln(b) + gammaln(a+b)
    elif a < 1:
        y[idx] = np.inf
    else:
        y[idx] = -np.inf

    idx = np.where(x == 1)
    if b == 1:
        y[idx] = - gammaln(a) - gammaln(b) + gammaln(a+b)
    elif b < 1:
        y[idx] = np.inf
    else:
        y[idx] = -np.inf
    return y

a = 1
b = 1

xx = [1,1,1,1,1,0,1,1, 1,0,1,0,1,1,1,1,0, 1,1,1,1,1,1,1]

p = np.arange(0.0,1.01,0.01)

c = [0,0]
N = 0
plt.figure(figsize=(5,3))
plt.plot(p, np.exp(log_beta_pdf(p, a+c[1], b+c[0])), 'r')
plt.xticks([0,1])
plt.yticks([])
plt.xlim([-0.1,1.1])
plt.ylim([0,6])
pi_mean = (a+c[1])/(a+b+N)    # posterior mean of pi_1 (predictive probability of a one)
yl = plt.gca().get_ylim()
plt.plot([pi_mean, pi_mean], yl, 'k:')

plt.xlabel(r'$\pi_1$')
#plt.savefig('/Users/cemgil/Dropbox/tex/cam/talks/cmpe547/beta{n}.eps'.format(n=N), bbox_inches='tight')
plt.show()

for x in xx:
    c[x] += 1
    N += 1
    plt.figure(figsize=(5,3))
    plt.plot(p, np.exp(log_beta_pdf(p, a+c[1], b+c[0])), 'r')

    
    pi_mean = (a+c[1])/(a+b+N)               # posterior mean (predictive probability)
    pi_mode = (a+c[1]-1)/float(a+b+N-2)      # posterior mode (MAP estimate)
    tmp = str(int(a+c[1]-1))+'/'+str(a+b+N-2)
    plt.xticks([0, pi_mode], ('0', tmp))
    plt.yticks([])
    plt.xlim([-0.1, 1.1])
    plt.ylim([0, 6])

    yl = plt.gca().get_ylim()
    plt.plot([pi_mean, pi_mean], yl, 'k:')
    plt.plot([pi_mode, pi_mode], yl, 'b:')

    plt.xlabel(r'$\pi_1$')
    #plt.savefig('/Users/cemgil/Dropbox/tex/cam/talks/cmpe547/beta{n}.eps'.format(n=N), bbox_inches='tight')
    plt.show()



Deciding whether a Coin is Fair or Fake

We consider the following problem: given a sequence of coin tosses $X = \{x^{(n)}\}_{n=1\dots N}$, determine whether the coin is fair or fake.

This can be cast as a model selection problem:

\begin{eqnarray} \pi_1|m & \sim & \left\{ \begin{array}{cc} \delta(\pi_1 - 0.5) & m = 0\\ \mathcal{B}(\pi_1; a, b) & m = 1 \end{array} \right. \end{eqnarray}

For $n = 1\dots N$ \begin{eqnarray} x^{(n)}| \pi_1 & \sim & \mathcal{BE}(x^{(n)}; \pi_1) \end{eqnarray}

This model defines the following:

  • The indicator $m$, which denotes whether the coin is fake,
  • What a fake coin is: a fake coin has an arbitrary probability $\pi_1$ between $0$ and $1$,
  • What a fair coin is: a fair coin has $\pi_1 = 0.5$.

We need to calculate the marginal likelihoods for $m=0$ and $m=1$ \begin{eqnarray} p(X| m) & = & \int d\pi_1 p(X | \pi_1) p(\pi_1|m) \end{eqnarray}

Not Fake ($m=0$)
\begin{eqnarray} p(X| m=0) & = & \int d\pi_1 p(X| \pi_1) \delta(\pi_1 - 0.5) \\ & = & \prod_{n=1}^N \left(\frac{1}{2}\right)^{x^{(n)}} \left(\frac{1}{2}\right)^{1-x^{(n)}} = \frac{1}{2^N} \end{eqnarray}
Fake ($m=1$)
\begin{eqnarray} p(X| m=1) & = & \int d\pi_1 \mathcal{B}(\pi_1; a, b) \prod_{n=1}^{N} p(x^{(n)}| \pi_1) \\ & = & \int d\pi_1 \left(\prod_{n=1}^N \left(1-\pi_1\right)^{1-x^{(n)}} \pi_1^{x^{(n)}} \right) \mathcal{B}(\pi_1; a, b) \\ & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \int d\pi_1 \left(1-\pi_1\right)^{c_0+b-1} \pi_1^{c_1+a-1} \\ & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(c_1+a)\Gamma(c_0+b)}{\Gamma(c_0 + c_1 +a + b)} \end{eqnarray}

The log-odds is the logarithm of the ratio of the marginal likelihoods

$$ l(X) = \log\left( \frac{p(X|m = \text{Fair})}{p(X|m = \text{Fake})} \right) $$

If $l(X) > 0$, we may conclude that the coin is fair; if $l(X) < 0$, that it is fake (biased).


In [42]:
import numpy as np
import scipy.special as sps

def log_odds(c_0, c_1, a, b):
    # Total number of tosses
    N = c_0 + c_1

    # Log marginal likelihoods of the fair (m=0) and fake (m=1) models
    M_fair = N*np.log(0.5)
    M_fake = (sps.gammaln(a+b) - sps.gammaln(a) - sps.gammaln(b)
              + sps.gammaln(c_1+a) + sps.gammaln(c_0+b) - sps.gammaln(N+a+b))
    return M_fair - M_fake

# Number of Zeros observed
c_0 = 6
# Number of Ones
c_1 = 1

# Prior
a = 1
b = 1


print('log_odds = ', log_odds(c_0, c_1, a, b) )


log_odds =  -0.826678573184

In [43]:
a = 1
b = 1
N = 10

# Log-odds for every possible number of ones c_1 out of N tosses
l = np.zeros(N+1)

for c in range(0, N+1):
    l[c] = log_odds(N-c, c, a, b)

plt.plot(range(0,N+1), l, 'o')
plt.plot(range(0,N+1), np.zeros(N+1), 'k:')
ax = plt.gca()
ax.set_xlabel('Number of ones $c_1$')
ax.set_ylabel('log-odds $l(X)$')
plt.show()


We can visualize the region where we would decide that the coin is fake by plotting the points where the log-odds is negative.


In [50]:
a = 1
b = 1

for N in range(1, 25):

    # Log-odds for every possible number of ones c_1 out of N tosses
    l = np.zeros(N+1)

    for c in range(0, N+1):
        l[c] = log_odds(N-c, c, a, b)

    # Mark in red the fractions c_1/N for which the fake model is preferred
    idx = np.where(l < 0)
    p = np.arange(0, N+1)/N
    plt.plot(N*np.ones_like(p), p, '.k', markersize=4)
    plt.plot(N*np.ones_like(p[idx]), p[idx], '.r', markersize=20)
    

ax = plt.gca()
ax.set_ylim((0,1))
ax.set_xlabel('$N$')
ax.set_ylabel('$c_1/N$')
plt.show()