This notebook implements the example that I consider a classic for understanding Expectation Maximisation.
See: Do, C.B. & Batzoglou, S., "What is the expectation maximization algorithm?", Nature Biotechnology 26 (2008): http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
Notations:
\begin{align*} \theta_A &= \text{Probability of heads, given that the coin tossed is A}\\ \theta_B &= \text{Probability of heads, given that the coin tossed is B} \end{align*}
In [3]:
%matplotlib notebook
from __future__ import division
from collections import OrderedDict
from scipy.stats import binom as binomial
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from IPython.display import display, Image
from scipy.spatial.distance import euclidean
from sympy import init_printing, symbols, Eq
init_printing()
In [4]:
Image('images/nbt1406-F1.png')
Out[4]:
In [5]:
coin_toss = []
coin_toss.append('H T T T H H T H T H'.split())
coin_toss.append('H H H H T H H H H H'.split())
coin_toss.append('H T H H H H H T H H'.split())
coin_toss.append('H T H T T T H H T T'.split())
coin_toss.append('T H H H T H H H T H'.split())
In [7]:
columns = range(1, 11)
df = pd.DataFrame(coin_toss, index=None, columns=columns)
df.index.name = 'Toss'  # rename in place; Index.rename() alone returns a new index
Our configuration looks like this:
In [8]:
df
Out[8]:
If the identity of the coin being tossed is known and is observed to be ['B', 'A', 'A', 'B', 'A'],
it is straightforward to calculate the corresponding maximum-likelihood values of $\theta_A$ and $\theta_B$, as the worked estimate below shows:
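Each estimate is simply the fraction of heads among the tosses attributed to that coin. With sets 2, 3 and 5 assigned to $A$ (9, 8 and 7 heads out of 30 tosses) and sets 1 and 4 to $B$ (5 and 4 heads out of 20 tosses):
$$ \hat{\theta}_A = \frac{9+8+7}{30} = 0.80 \qquad \hat{\theta}_B = \frac{5+4}{20} = 0.45 $$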
In [9]:
thetaA, thetaB = symbols('theta_A theta_B')
a, b = thetaA, thetaB  # short aliases so (symbol, value) pairs display nicely
In [10]:
## Observed case: the coin labels are known
observed = ['B', 'A', 'A', 'B', 'A']
index_A = [i for i, x in enumerate(observed) if x == 'A']
index_B = [i for i, x in enumerate(observed) if x == 'B']
A_tosses = df.iloc[index_A].unstack()
B_tosses = df.iloc[index_B].unstack()
A_heads = A_tosses.value_counts()['H']
B_heads = B_tosses.value_counts()['H']
# Maximum-likelihood estimates: fraction of heads among each coin's tosses
theta_A = A_heads/A_tosses.size
theta_B = B_heads/B_tosses.size
In [11]:
(a, theta_A)
Out[11]:
In [12]:
(b, theta_B)
Out[12]:
When the identity of the coin being tossed is unknown, we rely on Expectation Maximisation to give us estimates of $\theta_A$ and $\theta_B$. We start with initial values of $\theta_A, \theta_B$ and then, given the observed data (the 50 coin tosses), run the 'E-step', calculating the probability of coin $A$ or $B$ having been used for each series of tosses. Remember that each of the 5 sets is 'in reality' generated using a single coin; the Expectation step simply treats each set as having come from a 'mixture' of coins $A$ and $B$. Given initial values $\theta_A=0.6$ and $\theta_B=0.5$, let's calculate the 'weights' associated with our mixture model. Rather than simply estimating which coin is more likely to have generated the tosses, we calculate the probability of each possible 'completion' of the missing data, the missing data here being, of course, the label of the coin the tosses came from.
Numerically this involves the following:
Consider the first series of tosses: H T T T H H T H T H.
$$
\begin{align*}
\theta_A &= 0.6\\
\theta_B &= 0.5\\
n_{heads} &= 5\\
n_{tails} &= 5\\
P(5H \mid \text{Coin A}) &= {10 \choose 5} \theta_A^5(1-\theta_A)^{10-5} = 0.2007\\
P(5H \mid \text{Coin B}) &= {10 \choose 5} \theta_B^5(1-\theta_B)^{10-5} = 0.2461\\
w_A &= \frac{P(5H \mid \text{Coin A})}{P(5H \mid \text{Coin A})+P(5H \mid \text{Coin B})} = 0.4492\\
w_B &= \frac{P(5H \mid \text{Coin B})}{P(5H \mid \text{Coin A})+P(5H \mid \text{Coin B})} = 0.5508
\end{align*}
$$
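As a quick numerical check, here is a minimal sketch of this weight calculation for the first set, using the `scipy.stats.binom` already imported above as `binomial` (values are rounded):
In [ ]:
# E-step weights for the first set: 5 heads out of 10 tosses
p_a = binomial.pmf(5, 10, 0.6)  # P(5H | coin A)
p_b = binomial.pmf(5, 10, 0.5)  # P(5H | coin B)
w_a, w_b = p_a/(p_a + p_b), p_b/(p_a + p_b)
(w_a, w_b)  # roughly (0.4492, 0.5508)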
Mathematically:
$(X_1, X_2, X_3 \dots X_{5})$ : Coin tosses (observed) for the five sets of 10 coin tosses, where $X_i$ is a $10 \times 1$ vector representing the 10 coin tosses in the $i^{th}$ set.
$(Z_1, Z_2, Z_3, Z_4, Z_5)$: Unobserved (hidden) labels of the coins
Complete Data: $Y=\{(X_1,Z_1), (X_2, Z_2) \dots (X_5, Z_5)\}$
Parameter $\theta = (\theta_A, \theta_B)$
Likelihood: $L(\theta|X) = P(X|\theta) = \sum_{Z}P(X,Z|\theta)$
Complete-data likelihood (when $Z_i$ is known): $L(\theta|(X,Z)) = \prod_{i=1}^5 P(X_i,Z_i | \theta)$
Since the $Z_i$ are not observed, we work with the observed-data log-likelihood:
$$ \log L(\theta|X) = \sum_{i=1}^5 \log\big( \sum_{z \in \{A,B\}} P(X_i,Z_i=z | \theta) \big) $$
Let $n_i$ be the number of heads in the $i^{th}$ set.
$$ \begin{align*} P(X_i,Z_i=z|\theta) &= {10 \choose n_i} \theta_z^{n_i}(1-\theta_z)^{10-n_i} \text{ where } z \in \{A,B\} \\ P(X_i|\theta) &= {10 \choose n_i} \theta_A^{n_i}(1-\theta_A)^{10-n_i} + {10 \choose n_i} \theta_B^{n_i}(1-\theta_B)^{10-n_i} \end{align*} $$
By Bayes' rule,
$$ \begin{align*} P(Z_i=z|X_i,\theta) &= \frac{P(X_i,Z_i=z|\theta)}{P(X_i|\theta)}\\ &= \frac{P(X_i,Z_i=z|\theta)}{{10 \choose n_i} \theta_A^{n_i}(1-\theta_A)^{10-n_i} + {10 \choose n_i} \theta_B^{n_i}(1-\theta_B)^{10-n_i}} \end{align*} $$
Thus,
$$ \begin{align*} P(Z_i=A|X_i,\theta) &= \frac{P(X_i,Z_i=A|\theta)}{{10 \choose n_i} \theta_A^{n_i}(1-\theta_A)^{10-n_i} + {10 \choose n_i} \theta_B^{n_i}(1-\theta_B)^{10-n_i}} \end{align*} $$
For a binomial distribution $Bin(N,p)$, $E[\text{heads}] = Np$, so
$$ E(\text{Number of heads of coin A} | X_i, \theta) = n_i \times P(Z_i=A|X_i,\theta) $$
Now, the $M$ step:
In order to maximise the expected complete-data log-likelihood with respect to $\theta_A$ and $\theta_B$, we set each parameter to the fraction of expected heads among the tosses attributed to that coin.
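Setting the derivative of the expected complete-data log-likelihood to zero gives the intuitive updates below; for the first set, for instance, the expected number of heads attributed to coin $A$ is $5 \times 0.4492 \approx 2.25$.
$$ \theta_A^{new} = \frac{\sum_{i=1}^5 n_i \, P(Z_i=A|X_i,\theta)}{\sum_{i=1}^5 10 \, P(Z_i=A|X_i,\theta)} \qquad \theta_B^{new} = \frac{\sum_{i=1}^5 n_i \, P(Z_i=B|X_i,\theta)}{\sum_{i=1}^5 10 \, P(Z_i=B|X_i,\theta)} $$
This is exactly what the code below implements: expected heads and tails are attributed to each coin in proportion to its posterior weight, then the ratio of heads to total tosses is taken.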
In [14]:
thetaA = 0.6
thetaB = 0.5

def em(theta_old):
    row_prob = []
    ## Expectation: posterior probability that each set came from coin A or B
    for row in coin_toss:
        count_heads = row.count('H')
        p_a = binomial.pmf(count_heads, len(row), theta_old['A'])
        p_b = binomial.pmf(count_heads, len(row), theta_old['B'])
        p_t = p_a + p_b
        p_a = p_a/p_t
        p_b = p_b/p_t
        row_prob.append({'A': p_a, 'B': p_b,
                         'count_heads': count_heads, 'total_tosses': len(row)})
    ## Maximisation: attribute expected head/tail counts to each coin
    new_coin_toss = []
    for row in row_prob:
        total_tosses = row['total_tosses']
        total_heads = row['count_heads']
        A_heads = row['A']*total_heads
        A_tails = row['A']*(total_tosses-total_heads)
        B_heads = row['B']*total_heads
        B_tails = row['B']*(total_tosses-total_heads)
        new_coin_toss.append([A_heads, A_tails, B_heads, B_tails])
    df = pd.DataFrame(new_coin_toss, columns=['A Heads', 'A Tails', 'B Heads', 'B Tails'])
    new_pa = df['A Heads'].sum()/(df['A Heads'].sum()+df['A Tails'].sum())
    new_pb = df['B Heads'].sum()/(df['B Heads'].sum()+df['B Tails'].sum())
    new_theta = OrderedDict([('A', new_pa), ('B', new_pb)])
    display(df)
    return new_theta

theta = OrderedDict([('A', thetaA), ('B', thetaB)])
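As a sanity check before iterating to convergence, a single EM step from the initial guess should reproduce the hand calculation above:
In [ ]:
# One EM iteration from (0.6, 0.5); expect roughly
# OrderedDict([('A', 0.713), ('B', 0.581)])
em(OrderedDict([('A', 0.6), ('B', 0.5)]))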
In [15]:
max_iterations = 10000
iterations = 0
diff = 1
tolerance = 1e-6
while (iterations < max_iterations) and (diff > tolerance):
    new_theta = em(theta)
    diff = euclidean(list(new_theta.values()), list(theta.values()))
    theta = new_theta
    iterations += 1
In [59]:
(a, new_theta['A'])
Out[59]:
In [60]:
(b, new_theta['B'])
Out[60]:
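With this data and starting point, the estimates settle within roughly ten iterations at about $\theta_A \approx 0.80$ and $\theta_B \approx 0.52$, consistent with the final values shown in the paper's figure.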