Exercise 5.1: Likelihood model optimization

This exercise uses the Python programming language. We will use the scientific libraries scipy and numpy to work with data generated under a parametric model (a model whose output depends on one or more input parameters) and learn how likelihood can help us estimate the values of those parameters from observed data.


In [1]:
import scipy.optimize as so
import numpy
import toyplot

Optimization

Maximum likelihood optimization is a statistical method for finding the best-fitting set of parameters for a model. A likelihood function can be built from many independent likelihood terms, each describing the probability of one observed trial under a statistical distribution. Let's consider a simple example: flipping a coin and estimating whether the coin is fair (i.e., whether p=0.5).

Likelihood equation

The likelihood is a statement about probability. For a coin toss, there are two possible outcomes: heads or tails. If we observe 10 coin tosses and learn that heads came up 5 times and tails came up 5 times, we can try to learn from this what the probability of each outcome is. It sounds like the probability is around 0.5, right? But what if we observe 38 heads and 46 tails? Should we still think the true probability is around 0.5? We can use likelihood to tell us.

Since there are only two possible outcomes to a coin toss, we know that the probability of flipping heads (p) plus the probability of flipping tails (q) must sum to 1 (p + q = 1). Together these define the probability distribution of a single coin toss in our model. Because the probability of flipping tails is simply q = 1 - p, it is fully determined by p, and so in reality there is only one free parameter in this model. Our goal, put simply, is to estimate the probability that a single toss comes up heads (p).
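Concretely, if we observe $n_H$ heads and $n_T$ tails in a series of independent tosses, the likelihood of a given value of p is the product of the probabilities of the individual outcomes:

$L(p) = p^{n_H} (1-p)^{n_T}$

(The binomial coefficient that counts the possible orderings of heads and tails is constant with respect to p, so we drop it here; it does not change which value of p maximizes the likelihood.) This is exactly the quantity the code below computes.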

Calculating likelihoods

Let's say the true probability of flipping heads for a coin is 0.5. Without knowing this value in advance, we can try to estimate it by calculating the likelihood of a series of coin flips performed with this coin. In other words, we observe data and try to estimate the parameter value of the model that is most likely to have produced the results we observe.

Example

Probability theory tells us that the probability of many independent events is the product of their individual probabilities. In other words, the probability of five coin flips in a row coming up heads is p * p * p * p * p. Let's look at two examples. If the coin is fair (p=0.5) then the probability of five heads in a row is quite low (0.03), but if the coin is totally rigged (p=0.99) then it is very likely that we will see five heads in a row (0.95). From these observations we could probably make a guess as to whether the coin is fair or biased, but the more tosses we observe the more accurate our estimate will be.


In [2]:
# if the coin is fair (p=0.5) then the probability isn't very high
p = 0.5
p * p * p * p * p


Out[2]:
0.03125

In [3]:
# but if the coin is really unfair then the probability is quite high
p = 0.99
p * p * p * p * p


Out[3]:
0.9509900498999999

Making things more concise

We can describe the same calculation above more concisely using exponents. So from now on we'll describe the product of n independent trials with probability p as $p^n$.


In [4]:
# the probability of observing 20 heads for a coin with p=0.6
p = 0.6
n = 20
p**n


Out[4]:
3.6561584400629733e-05

In [5]:
# the probability of observing 10 heads and 10 tails for p=0.6
p = 0.6
q = 1 - p
np = 10
nq = 10

p**np * q**nq


Out[5]:
6.340338096537601e-07

The goal of maximum likelihood

We want to find the best possible parameter value to explain our observed data. This can be done in one of two ways: mathematically or empirically. Mathematically, if the likelihood equation is simple enough, we can take the derivative of the equation, set it equal to zero, and solve for our parameter. However, for many complex likelihood functions this is too difficult. A frequent alternative is to maximize the likelihood empirically by trying many different parameter values in a heuristic search, and computers are great for this. Since heuristic optimization is the standard practice in phylogenetics, we will focus on it here.
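As a side note, the coin-flip model is simple enough for the mathematical route. Taking the derivative of the log-likelihood $\log L(p) = n_H \log p + n_T \log(1-p)$ and setting it to zero gives

$\frac{n_H}{p} - \frac{n_T}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{n_H}{n_H + n_T}$

i.e., the maximum likelihood estimate is simply the observed proportion of heads. Keep this in mind as a check on the heuristic searches below.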


In [6]:
# our observed data
np = 62
nq = 40

Here we print the value of the parameter p used to calculate the likelihood, and the likelihood score, next to each other on each line. You can see that the value of 0.6 has the highest likelihood... but it's kind of hard to interpret because all of the likelihood values are such small numbers.


In [7]:
# let's see which value of p best fits the data
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    likelihood = p**np * (1-p)**nq
    print("p={}; likelihood={}".format(p, likelihood))


p=0.1; likelihood=1.478088294143466e-64
p=0.2; likelihood=6.12998216346359e-48
p=0.3; likelihood=2.4290664364642646e-39
p=0.4; likelihood=2.8429516759252963e-34
p=0.5; likelihood=1.9721522630525295e-31
p=0.6; likelihood=2.1270474435739034e-30
p=0.7; likelihood=3.026416316092381e-31
p=0.8; likelihood=1.0783978666860197e-34
p=0.9; likelihood=1.4555783429306787e-43

For this reason, people usually look at the negative log of the likelihood, which is easier to interpret: taking the log turns the product of many small probabilities into a sum, which avoids vanishingly small numbers. Although the method is called "maximum likelihood", when working with the negative log-likelihood we are actually trying to minimize this score, which still means finding the parameter that best fits the data. Below you can see that for p=0.6 the -loglik score is lowest (68.32).


In [8]:
# let's see which value of p best fits the data
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    likelihood = p**np * (1-p)**nq
    print("p={}; -loglik={:.2f}".format(p, -numpy.log(likelihood)))


p=0.1; -loglik=146.97
p=0.2; -loglik=108.71
p=0.3; -loglik=88.91
p=0.4; -loglik=77.24
p=0.5; -loglik=70.70
p=0.6; -loglik=68.32
p=0.7; -loglik=70.27
p=0.8; -loglik=78.21
p=0.9; -loglik=98.64

Functions

A cleaner and easier way to calculate the log-likelihood for our equation in a computer program is to write a function. A function is simply a reusable tool that computes a result given some input variables.


In [9]:
def coin_flip_log(p, nheads, ntails):
    ## calculate the log-likelihood of the observed flips
    logp = nheads*numpy.log(p) + ntails*numpy.log(1.-p)

    ## return the negative log-likelihood
    return -1*logp

In [10]:
coin_flip_log(0.5, 100, 100)


Out[10]:
138.62943611198907
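As a quick sanity check on this number: with 100 heads, 100 tails, and p=0.5 the negative log-likelihood is $-\ln(0.5^{200}) = 200\ln 2 \approx 138.63$, which matches the output above.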

There are much more advanced methods for finding optimized parameter values than simply plugging in every possible value by hand. The function below, fmin from the scipy library, is a general-purpose numerical optimizer (a downhill simplex search) that we can use for maximum likelihood optimization. It proposes new parameter values based on the scores of the ones it has already tried, and keeps going until it converges on the best value. For a simple problem like this it is super fast. For much more complex problems these computations can become quite computationally intensive.

In the example below we run the maximum likelihood (ML) optimization for our likelihood function when the data (args) is 50 heads and 200 tails. The ML optimized parameter value of p is 0.2, which sounds correct, since 50 heads out of 250 tosses is 20%.


In [11]:
# starting value=0.5; observed flips = (50, 200)
so.fmin(coin_flip_log, x0=(0.5), args=(50, 200), disp=0)[0]


Out[11]:
0.19999999999999973
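Since our model has only a single parameter, a bounded scalar optimizer is another reasonable choice. Here is a minimal sketch using scipy's minimize_scalar, assuming the coin_flip_log function and the so import from above; the explicit bounds keep p strictly between 0 and 1, where the log is defined.

res = so.minimize_scalar(
    coin_flip_log,             # the negative log-likelihood function defined above
    bounds=(1e-6, 1 - 1e-6),   # keep p strictly between 0 and 1
    args=(50, 200),            # observed flips: 50 heads, 200 tails
    method="bounded",
)
print(res.x)                   # should be close to 50 / 250 = 0.2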

Here is another trial where we enter a different set of observations. Now when the data is 133 heads and 385 tails the ML parameter estimate of p is about 0.2567, which again matches the simple proportion 133/518 ≈ 0.257.


In [12]:
# starting value=0.5; observed flips = (133, 385)
so.fmin(coin_flip_log, x0=(0.5), args=(133, 385), disp=0)[0]


Out[12]:
0.25673828124999976

Plot the likelihood over different parameter inputs

The first plot shows the negative log-likelihood calculated at different values of p between 0 and 1 when the observed data is heads=50, tails=200; the minimum (the optimum value of p) falls at around 0.2. The second plot shows the same curve when the observed data is heads=50, tails=50, and the optimum looks very close to 0.5.


In [13]:
## calculate the -log-likelihood across 100 equally spaced values of p
data = [coin_flip_log(p, 50, 200) for p in numpy.linspace(0.01, 0.99, 100)]

## plot the likelihood surface
toyplot.plot(
    b=numpy.log(data),   # note: the y-values are the log of the -loglik scores
    a=numpy.linspace(0.01, 0.99, 100),
    width=500, height=300,
    ylabel="-log-likelihood", 
    xlabel="probability of heads");


[Plot: x-axis: probability of heads (0.0 to 1.0); y-axis: -log-likelihood; minimum near p=0.2]

In [14]:
## calculate the -log-likelihood across 100 equally spaced values of p
data = [coin_flip_log(p, 50, 50) for p in numpy.linspace(0.01, 0.99, 100)]

## plot the likelihood surface
toyplot.plot(
    b=numpy.log(data),   # note: the y-values are the log of the -loglik scores
    a=numpy.linspace(0.01, 0.99, 100),
    width=500, height=300,
    ylabel="-log-likelihood", 
    xlabel="probability of heads");


[Plot: x-axis: probability of heads (0.0 to 1.0); y-axis: -log-likelihood; minimum near p=0.5]
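As a final check on where these curves bottom out, here is a minimal sketch (assuming coin_flip_log and the imports from above) that asks numpy for the grid point with the lowest -log-likelihood for each data set:

## find the grid value of p with the lowest -log-likelihood for each data set
grid = numpy.linspace(0.01, 0.99, 100)
for nheads, ntails in [(50, 200), (50, 50)]:
    scores = [coin_flip_log(p, nheads, ntails) for p in grid]
    best = grid[numpy.argmin(scores)]
    print("heads={}, tails={}; best grid value of p = {:.3f}".format(nheads, ntails, best))

The printed values should land near 0.2 and 0.5, agreeing with the plots and with the fmin results above.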