Lab 7: Maximum Likelihood Estimation


In [1]:
# Run this cell to set up the notebook.
import numpy as np
import pandas as pd
import seaborn as sns
import scipy as sci

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import patches, cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D

from client.api.notebook import Notebook
ok = Notebook('lab07.ok')

from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

Today's lab reviews Maximum Likelihood Estimation, and introduces interctive plotting in the jupyter notebook.

Part 1: Likelihood of the Binomial Distribution

Recall that the binomial distribution describes the chance of $x$ successes out of $n$ trials, where the trials are independent and each has a probability $p$ of success. For instance, the number of sixes rolled in ten rolls of a die is distributed $Binomial(10, \frac{1}{6})$.

Given $n$ draws from a $Binomial(n, p)$ distribution, which resulted in $x$ successes, we wish to find the chance $p$ of success via maximum likelihood estimation.

Question 1: Likelihood of the Binomial

What is the likelihood function for the binomial, L(p)? Remember, this is equal to the probability of the data occuring given some chance of success $p$.

As an aid, we provide a factorial(x) function below.


In [2]:
factorial = sci.misc.factorial # so you don't have to look it up

In [3]:
def likelihood(n, p, x):
    ...

Question 2: Log-likelihood of the Binomial

What is the log of the likelihood function for the binomial, $log(L(p)) = lik(p)$? Don't just use np.log(likelihood) - determine the value as a new function of n, x, and p.


In [4]:
def log_likelihood(n, p, x):
    ...

Question 3: Maximum Likelihood Estimate of the Binomial

Given $n$ samples from a binomial distribution $Bin(n, p)$, $x$ of which were successes, what is the value $p$ which maximizes the log-likelihood function?

Hint: Find $\frac{d}{dp}lik(p)$, set it equal to 0, and solve for p in terms of x and n.


In [5]:
def highest_likelihood(n, x):
    ...

Question 4: Interactive Plotting

Using the interact jupyter notebook extension, we can create interactive plots. In this case, we create an interactive plot of likelihood as a function of $p$ - interactive in the sense that we can plug in our own values of $n$ and $x$ and see how the plot changes. We can also choose our method of plotting - likelihood or log(likelihood).

We've provided code that creates sliders for n and x, and a checkbox to determine whether to plot the likelihood or the log-likelihood. Finish our code by defining the variable yvals, and then run it and play around a bit with the output.


In [6]:
n_widget = widgets.FloatSlider(min=1, max=20, step=1, value=20)
x_widget = widgets.FloatSlider(min=0, max=20, step=1, value=5)

# We want to make sure x <= n, otherwise we get into trouble
def update_x_range(*args):
    x_widget.max = n_widget.value
n_widget.observe(update_x_range, 'value')

def plot_likelihood(n, x, plot_log=False):
    
    # values of p are on the x-axis. 
    # We plot every value from 0.01 to 0.99
    pvals = np.arange(1, 100)/100
    
    # values of either Likelihood(p) or log(Likelihood(p))
    # are on the y-axis, depending on the method
    if plot_log:
        yvals = ...
    else:
        yvals = ...
    
    plt.plot(pvals, yvals)
    
    
    # Put a line where L(p) is maximized and print the value p*
    p_star = highest_likelihood(n, x)
    plt.axvline(p_star, lw=1.5, color='r', ls='dashed')
    plt.text(p_star + 0.01, min(yvals), 'p*=%.3f' % (p_star))
    
    plt.xlabel('p')
    if plot_log:
        plt.ylabel('lik(p)')
        plt.title("log(Likelihood(p)), if X ~ bin(n, p) = k")
    else:
        plt.ylabel('L(p)')
        plt.title("Likelihood of p, if X ~ bin(n, p) = k")
    plt.show()

interact(plot_likelihood, n=n_widget, x=x_widget, log=False);

Part 2: Likelihood of the Blood Types

Here's a more complex example, involving several variables. Recall the blood types experiment from lecture. We assume a model where a person's blood type is determined by two genes, each of which is identically-distributed between three alleles. We call the alleles $a$, $b$, and $o$. For each person, the two specific allele variants are random and independent of one another.

We know that, if a person has alleles $a$ and $b$, they have blood type $AB$. If the have alleles $a$ and $a$, or $a$ and $o$, they have blood type $A$. Similarly, if the have alleles $b$ and $b$, or $b$ and $o$, they have blood type $B$. Finally, if they have alleles $o$ and $o$, they have blood type $O$.

We measure the blood types of a group of people, and get counts of each type $A$, $B$, $AB$, and $O$. Using these counts, we wish to determine the frequency of alleles $a$, $b$, and $o$. We know that, under the assumption of Hardy-Weinberg equilibrium:

The frequency of type $O$ is $p_o^2$.

The frequency of type $A$ is $p_a^2 + 2p_op_a$.

The frequency of type $B$ is $p_b^2 + 2p_op_b$.

And the frequency of type $AB$ is $2p_ap_b$.

Question 5: blood type likelihood formulas

What's the likelihood of allele probabilities $p_a$, $p_b$, $p_o$, given sample counts O, A, B, AB?

Hint: Think about how the binomial formula can be extended. Don't worry about the $n$ choose $k$ bit - we're only concerned with the specific values O, A, B, and AB that we observed, so that term will be the same regardless of $p_a, p_b, p_o$, and it can be ignored.


In [7]:
def btype_likelihood(pa, pb, po, O, A, B, AB):
    ...

What's the log-likelihood? As before, don't just use np.log(btype_likelihood).


In [8]:
def btype_log_likelihood(pa, pb, po, O, A, B, AB):
    ...

Question 6: Interactive 3D Plots of Allele Distribution Likelihood

Fill in the function plot_btype_likelihood_3d, which plots the log-likelihood as $p_a$ and $p_b$ vary (since $p_o$ is a simple function of $p_a$ and $p_b$, this covers all possible triplets of values). You'll need to define four methods of interact input - we recommend sticking with FloatSlider. Allow for samples of up to 1000 people, with anywhere from 0 to 100% of the population having each phenotype $A$, $B$, $AB$, $O$.

First, run this cell to define a function for plotting 3D graphs:


In [9]:
def plot_surface_3d(X, Y, Z, orient_x = 45, orient_y = 45):
    highest_Z = max(Z.reshape(-1,1))
    lowest_Z = min(Z.reshape(-1,1))
    fig = plt.figure()
    ax = fig.gca(projection='3d')
    surf = ax.plot_surface(X, Y, Z, 
                           cmap=cm.coolwarm, 
                           linewidth=0, 
                           antialiased=False, 
                           rstride=5, cstride=5)
    ax.zaxis.set_major_locator(LinearLocator(5))
    ax.zaxis.set_major_formatter(FormatStrFormatter('%.1f'))
    ax.view_init(orient_y, orient_x)
    fig.colorbar(surf, shrink=0.5, aspect=5)
    plt.title("log(Likelihood(p_a, p_b))")
    plt.xlabel("p_a")
    plt.ylabel("p_b")
    plt.show()

Now, complete the plot_btype_likelihood_3d function.


In [10]:
O = ...
A = ...
B = ...
AB = ...

def plot_btype_likelihood_3d(O, A, B, AB):
        
    pa = np.arange(1, 50)/100
    pb = np.arange(1, 50)/100
    pa, pb = np.meshgrid(pa, pb) # get all pairs
    po = ...
    
    likelihoods = ...
    
    plot_surface_3d(pa, pb, likelihoods)

interact(plot_btype_likelihood_3d, O=O, A=A, B=B, AB=AB);

Question 7: Rotating 3D Plots of Allele Distribution Likelihood

We can also rotate this 3D graphic by passing values orient_x and orient_y to the plot_surface_3d function. Add two new sliders, and fill in the plot_btype_likelihood_3d_oriented function. You may want to set the step size on the new sliders to a value greater than one, and make sure the max value is large enough such that they can rotate all the way around. You should be able to copy-paste a good deal of code from above.


In [11]:
O2 = ...
A2 = ...
B2 = ...
AB2 = ...
X = ...
Y = ...
    
def plot_btype_likelihood_3d_oriented(O, A, B, AB, X, Y):
        
    pa = np.arange(1, 50)/100
    pb = np.arange(1, 50)/100
    pa, pb = np.meshgrid(pa, pb) # get all pairs
    po = ...
    
    likelihoods = ...
    
    plot_surface_3d(pa, pb, likelihoods, orient_x=X, orient_y=Y)

interact(plot_btype_likelihood_3d_oriented, O=O2, A=A2, B=B2, AB=AB2, X=X, Y=Y);

We also can make some 2d color plots, to get a better view of exactly where our values are maximized. As in the 3d plots, redder colors refer to higher likelihoods.


In [ ]:
O3 = widgets.FloatSlider(min=1, max=200, step=1, value=120)
A3 = widgets.FloatSlider(min=1, max=200, step=1, value=100)
B3 = widgets.FloatSlider(min=1, max=200, step=1, value=30)
AB3 = widgets.FloatSlider(min=1, max=200, step=1, value=5)

def plot_btype_log_likelihood_heatmap(O, A, B, AB):
        
    pa = np.arange(1, 50)/100
    pb = np.arange(1, 50)/100
    pa, pb = np.meshgrid(pa, pb) # get all possible pairs
    po = 1 - pa - pb
    
    likelihoods = btype_log_likelihood(pa, pb, po, O, A, B, AB)
    plt.pcolor(pa, pb, likelihoods, cmap=cm.coolwarm)
    plt.xlabel("p_a")
    plt.ylabel("p_b")
    plt.title("log(Likelihood(p_a, p_b))")
    plt.show()
    
interact(plot_btype_log_likelihood_heatmap, O=O3, A=A3, B=B3, AB=AB3);

As with the binomial, the likelihood has a "sharper" distribution than the log-likelihood. So, plotting the likelihood, we can see our maximal point with greater clarity.


In [ ]:
O4 = widgets.FloatSlider(min=1, max=200, step=1, value=120)
A4 = widgets.FloatSlider(min=1, max=200, step=1, value=100)
B4 = widgets.FloatSlider(min=1, max=200, step=1, value=30)
AB4 = widgets.FloatSlider(min=1, max=200, step=1, value=5)

def plot_btype_likelihood_heatmap(O, A, B, AB):
        
    pa = np.arange(1, 100)/100
    pb = np.arange(1, 100)/100
    pa, pb = np.meshgrid(pa, pb) # get all possible pairs
    po = 1 - pa - pb
    
    likelihoods = btype_likelihood(pa, pb, po, O, A, B, AB)
    likelihoods[(pa + pb) > 1] = 0 # Don't plot impossible probability pairs
    plt.pcolor(pa, pb, likelihoods, cmap=cm.coolwarm)
    plt.xlabel("p_a")
    plt.ylabel("p_b")
    plt.title("Likelihood(p_a, p_b)")
    plt.show()
    
interact(plot_btype_likelihood_heatmap, O=O4, A=A4, B=B4, AB=AB4);

Question 8: Getting the MLE for the blood-type question

Finally, we want to get our actual estimates for $p_a, p_b, p_o$! However, unlike in the Binomial example, we don't want to calculate our MLE by hand. So instead, we use function-minimizers to calculate the highest likelihood.

scipy's optimize.minimize function allows us to find the tuple of arguments that minimizes a function of $n$ variables, subject to desired constraints. Given any set of observed phenotype counts $O, A, B, AB$, we can thus find the specific values $p_a, p_b, p_o$ that maximize the log-likelihood function. Finish the nested function flipped_btype_fixed_params in order to do just that.


In [ ]:
O5 = widgets.FloatSlider(min=1, max=200, step=1, value=120)
A5 = widgets.FloatSlider(min=1, max=200, step=1, value=100)
B5 = widgets.FloatSlider(min=1, max=200, step=1, value=30)
AB5 = widgets.FloatSlider(min=1, max=200, step=1, value=5)

def maximize_btype_likelihood(O, A, B, AB):
    
    def flipped_btype_fixed_params(params):
        # "params" is a list containing p_a, p_b, p_o
        pa, pb, po = params
        # We wish to return a value which is minimized when the log-likelihood is maximized...
        # What function would accomplish this?
        ...
    
    # We need to provide an initial guess at the solution
    initial_guess = [1/3, 1/3, 1/3]
    
    # Each variable is bounded between zero and one
    # sci.optimize.minimize seems to dislike exact zero bounds, though, so we use 10^-6
    bnds = ((1e-6, 1), (1e-6, 1), (1e-6, 1))
    
    # An additional constraint on our parameters - they must sum to one
    # The minimizer will only check params where constraint_fn(params) = 0
    def constraint_fn(params):
        # "params" is a list containing p_a, p_b, p_o
        return sum(params) - 1
        
    constraint = ({'type': 'eq', 'fun': constraint_fn},)
    
    pa, pb, po = sci.optimize.minimize(flipped_btype_fixed_params, 
                                       x0=initial_guess, 
                                       bounds=bnds, 
                                       constraints=constraint).x
    
    return "pa* = %.3f, pb* = %.2f, po* = %.3f" % (pa, pb, po)
    
interact(maximize_btype_likelihood, O=O5, A=A5, B=B5, AB=AB5);

Submitting your assignment

If you made a good-faith effort to complete the lab, change i_finished_the_lab to True in the cell below. In any case, run the cells below to submit the lab.


In [ ]:
i_finished_the_lab = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()

Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line of the cell runs git push, which will send your work to your personal Github repo.

# Tell git to commit your changes to this notebook
git add sp17/lab/lab07/lab07.ipynb

# Tell git to make the commit
git commit -m "lab07 finished"

# Send your updates to your personal private repo
git push origin master

In [ ]: