$\newcommand{\ind}[1]{\left[#1\right]}$
Prior $$ p(\theta) $$
Maximum a-posteriori (MAP) : Regularised Maximum Likelihood $$ \theta^{\text{MAP}} = \arg\max_{\theta} \log p({\cal D} | \theta) p(\theta) $$
Predictive distribution $$ p(x_{N+1} | {\cal D} ) \approx p(x_{N+1} | \theta^{\text{MAP}}) $$
The interpretation is that past data provide an 'update' to the prior, and the updated distribution is used for the current prediction.
Suppose we have a coin that is flipped several times independently. A natural question to ask is whether one can predict the outcome of the next flip.
It depends. If we already know that the coin is fair, there is nothing that we can learn from past data, and indeed the future flips are independent of the previous flips. However, if we don't know the probability of the coin, we could estimate the parameter from past data to form a better prediction. Mathematically, the model is as follows.
Here, the bias of the coin plays the role of the parameter $\theta$; below we denote it by $\pi_1$.
We observe the outcome of $N$ coin flips $\{x^{(n)}\}_{n=1\dots N}$ where $x^{(n)} \in \left\{0,1\right\}$. The model is a Bernoulli distribution with parameter $\pi = (\pi_0, \pi_1)$. We have $\pi_0 = 1 - \pi_1$ where $0 \leq \pi_1 \leq 1$.
\begin{eqnarray} x^{(n)} & \sim & p(x^{(n)}|\pi_1) = (1-\pi_1)^{1-x^{(n)} } \pi_1^{x^{(n)} } \end{eqnarray}The log-likelihood is
\begin{eqnarray} {\cal L}(\pi_1) & = & \sum_{n=1}^N (1- x^{(n)}) \log (1 - \pi_1) + \sum_{n=1}^N x^{(n)} \log (\pi_1) \\ & = & \log (1 - \pi_1) \sum_{n=1}^N (1- x^{(n)}) + \log (\pi_1) \sum_{n=1}^N x^{(n)} \end{eqnarray}We define the number of $0$'s
\begin{eqnarray}
c_0 = \sum_{n=1}^N (1- x^{(n)})
\end{eqnarray}
and $1$'s as
\begin{eqnarray}
c_1 = \sum_{n=1}^N x^{(n)}
\end{eqnarray}
We compute the gradient and set it to zero \begin{eqnarray} \frac{\partial}{\partial \pi_1} {\cal L}(\pi_1) & = & - \frac{c_0}{1 - \pi_1} + \frac{c_1}{\pi_1} = 0 \end{eqnarray}
The solution is quite intuitive: the maximum likelihood estimate is the relative frequency of ones \begin{eqnarray} \pi_1^{\text{ML}} & = &\frac{c_1}{c_0 + c_1} = \frac{c_1}{N} \end{eqnarray}
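As a quick illustration, the sketch below simulates coin flips with a hypothetical bias of $0.7$ and checks that the maximum likelihood estimate $c_1/N$ recovers it (the bias value and random seed are arbitrary choices).
In [ ]:
import numpy as np

# A minimal sketch: simulate N flips from a coin with a hypothetical bias of 0.7
# and verify that the ML estimate c_1/N approaches the bias as N grows.
rng = np.random.default_rng(0)
pi_true = 0.7
N = 1000
x = rng.binomial(1, pi_true, size=N)   # x[n] in {0, 1}

c_1 = x.sum()       # number of ones
c_0 = N - c_1       # number of zeros
pi_ML = c_1 / N     # maximum likelihood estimate
print('true pi_1 =', pi_true, ', ML estimate =', pi_ML)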
We need a prior over the probability parameter. One choice is the beta distribution
\begin{eqnarray} p(\pi_1) & = & \mathcal{B}(\pi_1; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta) } \pi_1^{\alpha-1} (1-\pi_1)^{\beta-1} \end{eqnarray}The log joint distribution of the data and the parameter is \begin{eqnarray} \log p(X, \pi_1) & = & \log p(\pi_1) + \sum_{n=1}^N \log p(x^{(n)}|\pi_1) \\ & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & + (\alpha-1) \log \pi_1 + (\beta-1) \log(1-\pi_1) \\ & & + c_1 \log (\pi_1) + c_0 \log (1 - \pi_1) \\ & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \end{eqnarray}
The gradient is
\begin{eqnarray} \frac{\partial}{\partial \pi_1} \log p(X, \pi_1) & = & - \frac{\beta + c_0 -1}{1 - \pi_1} + \frac{\alpha + c_1 -1}{\pi_1} = 0 \end{eqnarray}We can solve for the parameter. \begin{eqnarray} \pi_1 (\beta + c_0 -1) & = & (1 - \pi_1) (\alpha + c_1 -1) \\ \pi_1 \beta + \pi_1 c_0 - \pi_1 & = & \alpha + c_1 - 1 - \pi_1 \alpha - \pi_1 c_1 + \pi_1 \\ \pi_1 & = & \frac{\alpha - 1 + c_1}{\alpha + \beta - 2 + c_0 + c_1} \\ \end{eqnarray}
When the prior is flat, i.e., when $\alpha = \beta = 1$, MAP and ML solutions coincide.
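To see the effect of the prior, here is a minimal sketch comparing the ML and MAP estimates for some hypothetical counts and two choices of $(\alpha, \beta)$.
In [ ]:
# Sketch: ML versus MAP estimates for hypothetical counts and priors.
def pi_MAP(c_0, c_1, alpha, beta):
    # Mode of the Beta(alpha + c_1, beta + c_0) posterior
    return (alpha + c_1 - 1) / (alpha + beta + c_0 + c_1 - 2)

c_0, c_1 = 3, 7
print('ML                    :', c_1 / (c_0 + c_1))
print('MAP, flat prior (1, 1):', pi_MAP(c_0, c_1, 1, 1))   # coincides with ML
print('MAP, prior (5, 5)     :', pi_MAP(c_0, c_1, 5, 5))   # pulled towards 0.5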
We infer the posterior
\begin{eqnarray} p(\pi_1| X) & = & \frac{p(\pi_1, X)}{p(X)} \end{eqnarray}The log joint density is \begin{eqnarray} \log p(X, \pi_1) & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \end{eqnarray}
At this stage, we may try to evaluate the integral $$ p(X) = \int d\pi_1 p(X, \pi_1) $$
Rather than trying to evaluate this integral directly, a simple approach, akin to 'completing the square', is to add and subtract terms until we can identify a known, normalized density.
\begin{eqnarray} \log p(X, \pi_1) & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & - \log \Gamma(\alpha + \beta + c_0 + c_1) + \log \Gamma(\alpha + c_1) + \log \Gamma(\beta + c_0) \\ & & + \log \Gamma(\alpha + \beta + c_0 + c_1) - \log \Gamma(\alpha + c_1) - \log \Gamma(\beta + c_0) \\ & & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \\ & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\ & & - \log \Gamma(\alpha + \beta + c_0 + c_1) + \log \Gamma(\alpha + c_1) + \log \Gamma(\beta + c_0) \\ & & + \log \mathcal{B}(\pi_1; \alpha + c_1, \beta + c_0) \\ & = & \log p(X) + \log p(\pi_1| X) \end{eqnarray}Exponentiating both sides, we identify \begin{eqnarray} p(\pi_1| X) & = & \mathcal{B}(\pi_1; \alpha + c_1, \beta + c_0) \\ p(X) & = & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha + c_1)\Gamma(\beta + c_0)}{\Gamma(\alpha + \beta + c_0 + c_1)} \end{eqnarray}
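As a sanity check on this step, the short sketch below compares the closed-form $p(X)$ with a direct numerical integration of $p(X, \pi_1)$, for hypothetical prior parameters and counts.
In [ ]:
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

# Hypothetical prior parameters and counts
alpha, beta, c_0, c_1 = 2.0, 3.0, 4, 6

def joint(pi_1):
    # p(X, pi_1): prior normalizer times the Beta-shaped terms in pi_1
    return np.exp(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                  + (alpha + c_1 - 1)*np.log(pi_1)
                  + (beta + c_0 - 1)*np.log(1 - pi_1))

p_X_closed = np.exp(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                    + gammaln(alpha + c_1) + gammaln(beta + c_0)
                    - gammaln(alpha + beta + c_0 + c_1))
p_X_numeric, err = quad(joint, 0, 1)
print(p_X_closed, p_X_numeric)   # the two values should agree closely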
Predictive distribution: Let $a=\alpha + c_1$ and $b=\beta+c_0$
\begin{eqnarray} p(x|X) = \int d\pi_1 p(x|\pi_1) p(\pi_1| X) & = & \int d\pi_1 \mathcal{BE}(x; \pi_1) \mathcal{B}(\pi_1; a, b) \\ & = & \int d\pi_1 \pi_1^{\ind{x=1}}(1-\pi_1)^{\ind{x=0}}\mathcal{B}(\pi_1; a, b) \\ \end{eqnarray}\begin{eqnarray} p(x|X) & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + \ind{x=1})\Gamma(b + \ind{x=0})}{\Gamma(a + b + 1)} \end{eqnarray}\begin{eqnarray} p(x=1|X) & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + 1)\Gamma(b)}{\Gamma(a + b + 1)} \\ & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{a\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a + b)} \\ & = & \frac{a}{a+b} \end{eqnarray}Alternatively, we may directly write \begin{eqnarray} p(X, \pi_1) & = & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi_1^{(\alpha + c_1 -1)} (1-\pi_1)^{(\beta + c_0 -1)} \end{eqnarray}
\begin{eqnarray} p(X) &=& \int d\pi_1 p(X, \pi_1) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int d\pi_1 \pi_1^{(\alpha + c_1 -1)} (1-\pi_1)^{(\beta + c_0 -1)} \end{eqnarray}From the definition of the beta distribution, we can arrive at the 'formula' for the integral \begin{eqnarray} 1 &=& \int d\pi \mathcal{B}(\pi; a, b) \\ & = & \int d\pi \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \pi^{(a -1)} (1-\pi)^{(b -1)} \\ \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} & = & \int d\pi \pi^{(a -1)} (1-\pi)^{(b -1)} \end{eqnarray} Just substitute $a = \alpha + c_1$ and $b = \beta + c_0$
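With $a$ and $b$ defined this way, we can also check numerically that the predictive probability reduces to $p(x=1|X) = a/(a+b)$; a minimal sketch with hypothetical values of $a$ and $b$:
In [ ]:
import numpy as np
from scipy.special import gammaln

# Hypothetical posterior parameters a = alpha + c_1, b = beta + c_0
a, b = 8.0, 5.0
p_one = np.exp(gammaln(a + b) - gammaln(a) - gammaln(b)
               + gammaln(a + 1) + gammaln(b) - gammaln(a + b + 1))
print(p_one, a / (a + b))   # both expressions give the same number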
For large $x$, we have the following approximation \begin{eqnarray} \log \Gamma(x + a) - \log \Gamma(x) & \approx & a \log(x) \\ \Gamma(x + a) & \approx & \Gamma(x) x^a \\ \end{eqnarray}
When $c_0$ and $c_1$ are large, we obtain:
\begin{eqnarray} p(X) & \approx & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(c_1)\Gamma(c_0)c_0^{\beta}c_1^{\alpha}}{\Gamma(c_0 + c_1)(c_0+c_1)^{\alpha + \beta}} \end{eqnarray}Letting $\hat{\pi}_1 = c_1/(c_0+c_1)$ and $N = c_0 + c_1$, we have \begin{eqnarray} p(X) & \approx & \frac{\Gamma(c_1)\Gamma(c_0)}{\Gamma(c_0 + c_1)} (1-\hat{\pi}_1) \hat{\pi}_1 \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (1-\hat{\pi}_1)^{\beta-1}\hat{\pi}_1^{\alpha-1} \end{eqnarray}
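A quick numerical check of this approximation, assuming hypothetical prior parameters: the exact and approximate values of $\log p(X)$ should agree increasingly well as the counts grow.
In [ ]:
import numpy as np
from scipy.special import gammaln

# Hypothetical prior; counts grow so the large-count approximation should improve.
alpha, beta = 2.0, 3.0
for c_0, c_1 in [(5, 10), (50, 100), (500, 1000)]:
    exact = (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
             + gammaln(alpha + c_1) + gammaln(beta + c_0)
             - gammaln(alpha + beta + c_0 + c_1))
    approx = (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
              + gammaln(c_1) + gammaln(c_0) - gammaln(c_0 + c_1)
              + alpha*np.log(c_1) + beta*np.log(c_0)
              - (alpha + beta)*np.log(c_0 + c_1))
    print(c_0, c_1, exact, approx)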
In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import gammaln
def log_beta_pdf(x, a, b):
    # Log of the Beta(a, b) density evaluated at x
    return - gammaln(a) - gammaln(b) + gammaln(a+b) + np.log(x)*(a-1) + np.log(1-x)*(b-1)
x = np.arange(0.01,1,0.01)
a = 1
b = 1
c_0 = 1
c_1 = 1
N = c_0 + c_1
pi_ML = c_1/N
plt.figure(figsize=(12,8))
plt.plot(x, np.exp(log_beta_pdf(x, a, b)), 'b')
plt.plot(x, np.exp(log_beta_pdf(x, a+c_1, b+c_0)), 'r')
yl = plt.gca().get_ylim()
plt.plot([pi_ML, pi_ML], yl , 'k:')
plt.legend(["Prior $\cal B$ $(a={}, b={})$".format(a,b), "Posterior $\cal B$ $(a={}, b={})$".format(a+c_1, b+c_0)], loc="best")
plt.show()
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.special import gammaln
import numpy as np
import scipy.special as sps
def log_beta_pdf(x, a, b):
    # Log of the Beta(a, b) density; the endpoints x=0 and x=1 are handled separately
    y = - gammaln(a) - gammaln(b) + gammaln(a+b) + np.log(x)*(a-1) + np.log(1-x)*(b-1)
    idx = np.where(x == 0)
    if a == 1:
        y[idx] = - gammaln(a) - gammaln(b) + gammaln(a+b)
    elif a < 1:
        y[idx] = np.inf
    else:
        y[idx] = -np.inf
    idx = np.where(x == 1)
    if b == 1:
        y[idx] = - gammaln(a) - gammaln(b) + gammaln(a+b)
    elif b < 1:
        y[idx] = np.inf
    else:
        y[idx] = -np.inf
    return y
a = 1
b = 1
xx = [1,1,1,1,1,0,1,1, 1,0,1,0,1,1,1,1,0, 1,1,1,1,1,1,1]
p = np.arange(0.0,1.01,0.01)
c = [0,0]
N = 0
plt.figure(figsize=(5,3))
plt.plot(p, np.exp(log_beta_pdf(p, a+c[1], b+c[0])), 'r')
plt.xticks([0,1])
plt.yticks([])
plt.xlim([-0.1,1.1])
plt.ylim([0,6])
pi_ML = (a+c[1])/(a+b+N)  # posterior mean of pi_1 (the predictive probability of x=1)
yl = plt.gca().get_ylim()
plt.plot([pi_ML, pi_ML], yl , 'k:')
plt.xlabel(r'$\pi_1$')
#plt.savefig('/Users/cemgil/Dropbox/tex/cam/talks/cmpe547/beta{n}.eps'.format(n=N), bbox_inches='tight')
plt.show()
for x in xx:
    # Update the counts with the new observation and replot the posterior
    c[x] += 1
    N += 1
    plt.figure(figsize=(5,3))
    plt.plot(p, np.exp(log_beta_pdf(p, a+c[1], b+c[0])), 'r')
    pi_ML = (a+c[1])/(a+b+N)             # posterior mean of pi_1
    pi_mode = (a+c[1]-1)/float(a+b+N-2)  # posterior mode (MAP estimate)
    tmp = str(int(a+c[1]-1))+'/'+str(a+b+N-2)
    plt.xticks([0, pi_mode], ('0', tmp))
    plt.yticks([])
    plt.xlim([-0.1,1.1])
    plt.ylim([0,6])
    yl = plt.gca().get_ylim()
    plt.plot([pi_ML, pi_ML], yl, 'k:')
    plt.plot([pi_mode, pi_mode], yl, 'b:')
    plt.xlabel(r'$\pi_1$')
    #plt.savefig('/Users/cemgil/Dropbox/tex/cam/talks/cmpe547/beta{n}.eps'.format(n=N), bbox_inches='tight')
    plt.show()
We consider the following problem: given a sequence of coin tosses $X = \{x^{(n)}\}_{n=1\dots N}$, determine whether the coin is fair or fake.
This can be cast as a model selection problem:
\begin{eqnarray} \pi_1|m & \sim & \left\{ \begin{array}{cc} \delta(\pi_1 - 0.5) & m = 0\\ \mathcal{B}(\pi_1; a, b) & m = 1 \end{array} \right. \end{eqnarray}For $n = 1\dots N$, \begin{eqnarray} x^{(n)}| \pi_1 & \sim & \mathcal{BE}(x^{(n)}; \pi_1) \end{eqnarray}
This model defines a joint distribution over the model indicator $m$, the parameter $\pi_1$ and the observations $X$.
We need to calculate the marginal likelihoods for $m=0$ and $m=1$ \begin{eqnarray} p(X| m) & = & \int d\pi_1 p(X | \pi_1) p(\pi_1|m) \end{eqnarray}
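Both integrals are available in closed form: the point-mass prior simply plugs in $\pi_1 = 0.5$, and the beta prior yields the marginal likelihood derived earlier (with $a, b$ in place of $\alpha, \beta$):
\begin{eqnarray} p(X|m=0) & = & \int d\pi_1\, p(X|\pi_1)\, \delta(\pi_1 - 0.5) = \left(\frac{1}{2}\right)^{N} \\ p(X|m=1) & = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + c_1)\Gamma(b + c_0)}{\Gamma(a + b + N)} \end{eqnarray}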
The log-odds is the logarithm of the ratio of the marginal likelihoods
$$ l(X) = \log\left( \frac{p(X|m = \text{Fair})}{p(X|m = \text{Fake})} \right) $$If $l(X)>0$, we may conclude that the coin is fair; if $l(X)<0$, we conclude that it is fake (biased).
In [42]:
import numpy as np
import scipy.special as sps
def log_odds(c_0, c_1, a, b):
    # Total number of tosses
    N = c_0 + c_1
    # Log marginal likelihood of the fair coin: p(X|m=0) = 0.5**N
    M_fair = N*np.log(0.5)
    # Log marginal likelihood of the fake coin with a Beta(a, b) prior on pi_1
    # (a pairs with the count of ones c_1, b with the count of zeros c_0)
    M_fake = sps.gammaln(a+b) - sps.gammaln(a) - sps.gammaln(b) + sps.gammaln(c_1+a) + sps.gammaln(c_0+b) - sps.gammaln(N+a+b)
    return M_fair - M_fake
# Number of Zeros observed
c_0 = 6
# Number of Ones
c_1 = 1
# Prior
a = 1
b = 1
print('log_odds = ', log_odds(c_0, c_1, a, b) )
In [43]:
a = 1
b = 1
N = 10
l = np.zeros(N+1)
for c in range(0,N+1):
    l[c] = log_odds(N-c, c, a, b)
plt.plot(range(0,N+1), l, 'o')
plt.plot(range(0,N+1), np.zeros(N+1), 'k:')
ax = plt.gca()
ax.set_xlabel('Number of ones $c_1$')
ax.set_ylabel('log-odds $l(X)$')
plt.show()
We can visualize the region where we would decide that the coin is fake by plotting the points where the log-odds is negative.
In [50]:
a = 1
b = 1
for N in range(1, 25):
    l = np.zeros(N+1)
    for c in range(0,N+1):
        l[c] = log_odds(N-c, c, a, b)
    # Fractions c_1/N for which the log-odds favours the fake coin
    idx = np.where(np.array(l) < 0)
    p = np.arange(0,N+1)/N
    plt.plot(N*np.ones_like(p), p, '.k', markersize=4)
    plt.plot(N*np.ones_like(p[idx]), p[idx], '.r', markersize=20)
ax = plt.gca()
ax.set_ylim((0,1))
ax.set_xlabel('$N$')
ax.set_ylabel('$c_1/N$')
plt.show()