In [1]:
# Run this cell to set up the notebook.
import numpy as np
import pandas as pd
import seaborn as sns
import scipy as sci
import scipy.misc       # makes sci.misc available (used for factorial below)
import scipy.optimize   # makes sci.optimize available (used for minimize later)
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import patches, cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D
from client.api.notebook import Notebook
ok = Notebook('lab07.ok')
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
Today's lab reviews Maximum Likelihood Estimation and introduces interactive plotting in the Jupyter notebook.
Recall that the binomial distribution describes the chance of $x$ successes out of $n$ trials, where the trials are independent and each has a probability $p$ of success. For instance, the number of sixes rolled in ten rolls of a die is distributed $Binomial(10, \frac{1}{6})$.
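For reference, the binomial probability of exactly $x$ successes in $n$ trials is
$$P(X = x) = \binom{n}{x}\, p^x (1-p)^{n-x}.$$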
Given a draw from a $Binomial(n, p)$ distribution that resulted in $x$ successes out of $n$ trials, we wish to estimate the chance of success $p$ via maximum likelihood estimation.
What is the likelihood function for the binomial, $L(p)$? Remember, this is equal to the probability of the data occurring given some chance of success $p$.
As an aid, we provide a factorial(x) function below.
In [2]:
factorial = sci.misc.factorial # so you don't have to look it up
In [3]:
def likelihood(n, p, x):
    # Return L(p): the probability of observing exactly x successes
    # in n independent trials, each with success probability p.
    ...
In [4]:
def log_likelihood(n, p, x):
    # Return log(L(p)), the natural log of the likelihood above.
    ...
In [5]:
def highest_likelihood(n, x):
    # Return the value of p that maximizes the likelihood, given n trials and x successes.
    ...
Using the interact Jupyter notebook extension, we can create interactive plots. In this case, we create an interactive plot of likelihood as a function of $p$: interactive in the sense that we can plug in our own values of $n$ and $x$ and see how the plot changes. We can also choose our method of plotting: likelihood or log(likelihood).
We've provided code that creates sliders for n and x, and a checkbox to determine whether to plot the likelihood or the log-likelihood. Finish our code by defining the variable yvals, then run it and play around a bit with the output.
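If you haven't used interact before, here is a minimal sketch of how it works, using a made-up toy function (square is not part of the lab):
In [ ]:
import ipywidgets as widgets
from ipywidgets import interact

# Hypothetical toy function, just to illustrate the interact API.
def square(x):
    # interact re-runs this function each time the slider value changes.
    print(x ** 2)

# A FloatSlider becomes a slider control; a bool default would become a checkbox.
interact(square, x=widgets.FloatSlider(min=0, max=10, step=1, value=3));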
In [6]:
n_widget = widgets.FloatSlider(min=1, max=20, step=1, value=20)
x_widget = widgets.FloatSlider(min=0, max=20, step=1, value=5)

# We want to make sure x <= n, otherwise we get into trouble
def update_x_range(*args):
    x_widget.max = n_widget.value
n_widget.observe(update_x_range, 'value')

def plot_likelihood(n, x, plot_log=False):
    # Values of p are on the x-axis.
    # We plot every value from 0.01 to 0.99.
    pvals = np.arange(1, 100)/100
    # Values of either Likelihood(p) or log(Likelihood(p))
    # are on the y-axis, depending on the method.
    if plot_log:
        yvals = ...
    else:
        yvals = ...
    plt.plot(pvals, yvals)
    # Put a line where L(p) is maximized and print the value p*
    p_star = highest_likelihood(n, x)
    plt.axvline(p_star, lw=1.5, color='r', ls='dashed')
    plt.text(p_star + 0.01, min(yvals), 'p*=%.3f' % (p_star))
    plt.xlabel('p')
    if plot_log:
        plt.ylabel('log(L(p))')
        plt.title("log(Likelihood(p)), if X ~ bin(n, p) = x")
    else:
        plt.ylabel('L(p)')
        plt.title("Likelihood of p, if X ~ bin(n, p) = x")
    plt.show()

interact(plot_likelihood, n=n_widget, x=x_widget, plot_log=False);
Here's a more complex example, involving several variables. Recall the blood types experiment from lecture. We assume a model where a person's blood type is determined by two genes, each of which takes one of three alleles, identically distributed across the population. We call the alleles $a$, $b$, and $o$. For each person, the two alleles are random and independent of one another.
We know that, if a person has alleles $a$ and $b$, they have blood type $AB$. If they have alleles $a$ and $a$, or $a$ and $o$, they have blood type $A$. Similarly, if they have alleles $b$ and $b$, or $b$ and $o$, they have blood type $B$. Finally, if they have alleles $o$ and $o$, they have blood type $O$.
We measure the blood types of a group of people, and get counts of each type $A$, $B$, $AB$, and $O$. Using these counts, we wish to estimate the frequencies of alleles $a$, $b$, and $o$. We know that, under the assumption of Hardy-Weinberg equilibrium:
The frequency of type $O$ is $p_o^2$.
The frequency of type $A$ is $p_a^2 + 2p_op_a$.
The frequency of type $B$ is $p_b^2 + 2p_op_b$.
And the frequency of type $AB$ is $2p_ap_b$.
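For example, the type $A$ frequency comes from adding up the genotypes that give blood type $A$: both alleles $a$ (probability $p_a^2$), or one $a$ and one $o$ in either order (probability $2p_op_a$), so
$$P(\text{type } A) = p_a^2 + 2p_op_a.$$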
What's the likelihood of allele probabilities $p_a$, $p_b$, $p_o$, given sample counts $O$, $A$, $B$, $AB$?
Hint: Think about how the binomial formula can be extended. Don't worry about the $n$ choose $k$ term: we're only concerned with the specific values $O$, $A$, $B$, and $AB$ that we observed, so that term is the same regardless of $p_a, p_b, p_o$, and it can be ignored.
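As a sketch of the extension the hint describes (dropping the constant multinomial coefficient, which doesn't depend on the allele probabilities), the likelihood is proportional to a product of the four phenotype frequencies, each raised to its observed count:
$$L(p_a, p_b, p_o) \propto (p_o^2)^{O}\,(p_a^2 + 2p_op_a)^{A}\,(p_b^2 + 2p_op_b)^{B}\,(2p_ap_b)^{AB}.$$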
In [7]:
def btype_likelihood(pa, pb, po, O, A, B, AB):
    # Return the likelihood of allele probabilities pa, pb, po
    # given the observed phenotype counts O, A, B, AB.
    ...
What's the log-likelihood? As before, don't just use np.log(btype_likelihood).
In [8]:
def btype_log_likelihood(pa, pb, po, O, A, B, AB):
    # Return the log of the likelihood above, computed directly.
    ...
Fill in the function plot_btype_likelihood_3d, which plots the log-likelihood as $p_a$ and $p_b$ vary (since $p_o$ is a simple function of $p_a$ and $p_b$, this covers all possible triplets of values). You'll need to define four interact input widgets; we recommend sticking with FloatSlider. Allow for samples of up to 1000 people, with anywhere from 0 to 100% of the population having each phenotype $A$, $B$, $AB$, $O$.
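As a sketch, a slider covering the allowed range of counts might look like the line below; the variable name, step, and default value are placeholders, and the next cells ask you to define four such inputs yourself.
In [ ]:
# Hypothetical example widget: a phenotype count between 0 and 1000 people.
example_count_slider = widgets.FloatSlider(min=0, max=1000, step=1, value=100)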
First, run this cell to define a function for plotting 3D graphs:
In [9]:
def plot_surface_3d(X, Y, Z, orient_x=45, orient_y=45):
    highest_Z = max(Z.reshape(-1, 1))
    lowest_Z = min(Z.reshape(-1, 1))
    fig = plt.figure()
    ax = fig.gca(projection='3d')
    surf = ax.plot_surface(X, Y, Z,
                           cmap=cm.coolwarm,
                           linewidth=0,
                           antialiased=False,
                           rstride=5, cstride=5)
    ax.zaxis.set_major_locator(LinearLocator(5))
    ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))
    ax.view_init(orient_y, orient_x)
    fig.colorbar(surf, shrink=0.5, aspect=5)
    plt.title("log(Likelihood(p_a, p_b))")
    plt.xlabel("p_a")
    plt.ylabel("p_b")
    plt.show()
Now, complete the plot_btype_likelihood_3d function.
In [10]:
O = ...
A = ...
B = ...
AB = ...

def plot_btype_likelihood_3d(O, A, B, AB):
    pa = np.arange(1, 50)/100
    pb = np.arange(1, 50)/100
    pa, pb = np.meshgrid(pa, pb)  # get all pairs
    po = ...
    likelihoods = ...
    plot_surface_3d(pa, pb, likelihoods)

interact(plot_btype_likelihood_3d, O=O, A=A, B=B, AB=AB);
We can also rotate this 3D graphic by passing values orient_x and orient_y to the plot_surface_3d function. Add two new sliders, and fill in the plot_btype_likelihood_3d_oriented function. You may want to set the step size on the new sliders to a value greater than one, and make sure the max value is large enough that the plot can rotate all the way around. You should be able to copy-paste a good deal of code from above.
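For instance, an orientation slider covering a full rotation with a coarser step might look like the line below (the name, step, and default here are placeholders):
In [ ]:
# Hypothetical example widget: an angle from 0 to 360 degrees in 15-degree steps.
example_angle_slider = widgets.FloatSlider(min=0, max=360, step=15, value=45)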
In [11]:
O2 = ...
A2 = ...
B2 = ...
AB2 = ...
X = ...
Y = ...

def plot_btype_likelihood_3d_oriented(O, A, B, AB, X, Y):
    pa = np.arange(1, 50)/100
    pb = np.arange(1, 50)/100
    pa, pb = np.meshgrid(pa, pb)  # get all pairs
    po = ...
    likelihoods = ...
    plot_surface_3d(pa, pb, likelihoods, orient_x=X, orient_y=Y)

interact(plot_btype_likelihood_3d_oriented, O=O2, A=A2, B=B2, AB=AB2, X=X, Y=Y);
We can also make some 2D color plots, to get a better view of exactly where the likelihood is maximized. As in the 3D plots, redder colors indicate higher likelihoods.
In [ ]:
O3 = widgets.FloatSlider(min=1, max=200, step=1, value=120)
A3 = widgets.FloatSlider(min=1, max=200, step=1, value=100)
B3 = widgets.FloatSlider(min=1, max=200, step=1, value=30)
AB3 = widgets.FloatSlider(min=1, max=200, step=1, value=5)

def plot_btype_log_likelihood_heatmap(O, A, B, AB):
    pa = np.arange(1, 50)/100
    pb = np.arange(1, 50)/100
    pa, pb = np.meshgrid(pa, pb)  # get all possible pairs
    po = 1 - pa - pb
    likelihoods = btype_log_likelihood(pa, pb, po, O, A, B, AB)
    plt.pcolor(pa, pb, likelihoods, cmap=cm.coolwarm)
    plt.xlabel("p_a")
    plt.ylabel("p_b")
    plt.title("log(Likelihood(p_a, p_b))")
    plt.show()

interact(plot_btype_log_likelihood_heatmap, O=O3, A=A3, B=B3, AB=AB3);
As with the binomial, the likelihood is more sharply peaked than the log-likelihood, so plotting the likelihood shows the maximizing point with greater clarity.
In [ ]:
O4 = widgets.FloatSlider(min=1, max=200, step=1, value=120)
A4 = widgets.FloatSlider(min=1, max=200, step=1, value=100)
B4 = widgets.FloatSlider(min=1, max=200, step=1, value=30)
AB4 = widgets.FloatSlider(min=1, max=200, step=1, value=5)

def plot_btype_likelihood_heatmap(O, A, B, AB):
    pa = np.arange(1, 100)/100
    pb = np.arange(1, 100)/100
    pa, pb = np.meshgrid(pa, pb)  # get all possible pairs
    po = 1 - pa - pb
    likelihoods = btype_likelihood(pa, pb, po, O, A, B, AB)
    likelihoods[(pa + pb) > 1] = 0  # Don't plot impossible probability pairs
    plt.pcolor(pa, pb, likelihoods, cmap=cm.coolwarm)
    plt.xlabel("p_a")
    plt.ylabel("p_b")
    plt.title("Likelihood(p_a, p_b)")
    plt.show()

interact(plot_btype_likelihood_heatmap, O=O4, A=A4, B=B4, AB=AB4);
Finally, we want to get our actual estimates for $p_a, p_b, p_o$! However, unlike in the binomial example, we don't want to calculate our MLE by hand. Instead, we use a numerical function minimizer to find the maximum of the likelihood.
scipy's optimize.minimize function allows us to find the tuple of arguments that minimizes a function of $n$ variables, subject to desired constraints. Given any set of observed phenotype counts $O, A, B, AB$, we can thus find the specific values $p_a, p_b, p_o$ that maximize the log-likelihood function. Finish the nested function flipped_btype_fixed_params in order to do just that.
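If you haven't used sci.optimize.minimize before, here is a minimal, self-contained sketch on a made-up toy problem (none of the names or numbers below are part of the lab's answer): minimize $f(x, y) = (x-1)^2 + (y-2)^2$ subject to $x + y = 2$, with each variable bounded between 0 and 2.
In [ ]:
import scipy.optimize

# Hypothetical toy objective, unrelated to the blood-type problem.
def toy_objective(params):
    x, y = params
    return (x - 1)**2 + (y - 2)**2

# Equality constraint: the minimizer only considers points where this returns 0.
toy_constraint = ({'type': 'eq', 'fun': lambda params: params[0] + params[1] - 2},)

result = scipy.optimize.minimize(toy_objective,
                                 x0=[0.5, 0.5],
                                 bounds=((0, 2), (0, 2)),
                                 constraints=toy_constraint)
print(result.x)  # the minimizing (x, y), approximately (0.5, 1.5)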
In [ ]:
O5 = widgets.FloatSlider(min=1, max=200, step=1, value=120)
A5 = widgets.FloatSlider(min=1, max=200, step=1, value=100)
B5 = widgets.FloatSlider(min=1, max=200, step=1, value=30)
AB5 = widgets.FloatSlider(min=1, max=200, step=1, value=5)

def maximize_btype_likelihood(O, A, B, AB):
    def flipped_btype_fixed_params(params):
        # "params" is a list containing p_a, p_b, p_o
        pa, pb, po = params
        # We wish to return a value which is minimized when the log-likelihood is maximized...
        # What function would accomplish this?
        ...
    # We need to provide an initial guess at the solution
    initial_guess = [1/3, 1/3, 1/3]
    # Each variable is bounded between zero and one.
    # sci.optimize.minimize seems to dislike exact zero bounds, though, so we use 10^-6.
    bnds = ((1e-6, 1), (1e-6, 1), (1e-6, 1))
    # An additional constraint on our parameters: they must sum to one.
    # The minimizer will only check params where constraint_fn(params) = 0.
    def constraint_fn(params):
        # "params" is a list containing p_a, p_b, p_o
        return sum(params) - 1
    constraint = ({'type': 'eq', 'fun': constraint_fn},)
    pa, pb, po = sci.optimize.minimize(flipped_btype_fixed_params,
                                       x0=initial_guess,
                                       bounds=bnds,
                                       constraints=constraint).x
    return "pa* = %.3f, pb* = %.3f, po* = %.3f" % (pa, pb, po)

interact(maximize_btype_likelihood, O=O5, A=A5, B=B5, AB=AB5);
In [ ]:
i_finished_the_lab = False
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()
Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line runs git push, which will send your work to your personal GitHub repo.
# Tell git to commit your changes to this notebook
git add sp17/lab/lab07/lab07.ipynb
# Tell git to make the commit
git commit -m "lab07 finished"
# Send your updates to your personal private repo
git push origin master
In [ ]: