Maximum Likelihood Model Parameter Estimation

\begin{equation*} D = \text{data} \\ \theta = \text{model parameters} \\ y(\theta, x) = \text{model} \end{equation*}

A common parameter-estimation strategy is to minimize the sum of squared errors,

\begin{equation*} \sum_{i}(D_i - y_i(\theta, x))^2 \end{equation*}

E.g.

Create some phony straight line data with $\theta=(a, b)$ and normally distributed noise in $y$,


In [1]:
# assumes a pylab notebook environment (%pylab inline), so linspace, randn,
# scatter, plot, np and plt are already in scope
n_points = 10
x = linspace(0,10,n_points)
def y(x, theta=(2, 4), noise=0):
    return  theta[0] * x + theta[1] + noise*randn(n_points)
D = y(x, noise=1)
scatter(x,D)


Out[1]:
<matplotlib.collections.PathCollection at 0x1113cef50>

We are trying to discover the parameters $\theta$, so write the error in terms of $\theta$ only,

$SSE(\theta; y(x), D)$


In [2]:
# cheat, do this with global scope of y, x and D
def ssq_err(theta):
    return np.sum((D - y(x,theta))*(D - y(x,theta)).transpose())

from scipy.optimize import minimize
fit = minimize(ssq_err,[1,1])
fit = minimize(ssq_err,[1,1])
print fit
scatter(x, D)
plot(x,y(x, fit.x))


  status: 0
 success: True
    njev: 6
    nfev: 24
     fun: 6.857476349061612
       x: array([ 2.03497749,  3.43337585])
 message: 'Optimization terminated successfully.'
    hess: array([[ 0.00490909, -0.02454546],
       [-0.02454546,  0.17272728]])
     jac: array([ -1.19209290e-07,   5.96046448e-08])

Not amazing so far $\ldots$

Leading question: Is normally distributed noise "unusually compatible" with least-squares error minimization?

Maybe we should do something that takes into account the expected distribution of the model errors. If the errors are Poisson-distributed, for example, does sum-squared error still make sense?


In [3]:
from scipy.stats import norm, poisson
# continuous normal pdf vs. discrete Poisson pmf, both with mean 5
rvn, rvp = norm(5,1), poisson(5)
x = np.linspace(-2,10,140)
plot(x, rvn.pdf(x))
xd = np.arange(0, np.minimum(rvp.dist.b, 15))
vlines(xd, 0, rvp.pmf(xd), lw=2, color="red")


Out[3]:
<matplotlib.collections.LineCollection at 0x111ceca50>

New Idea: The parameter fitting problem can be framed as "Can we find the most likely value of the parameters $\theta$ of the model given the data?"

i.e. maximize $P(\theta|D)$

Frequentist Statistics

Significance is specified by the probability that the data was drawn from the “wrong” model (the null hypothesis): the p-value

  • Not the probability a model is correct
  • Only random variables have probabilities

That is, from a frequentist perspective, the PDF for normal noise is in terms of $P(D|\theta)$,

E.g. \begin{equation*} P(D|\theta) \propto \exp\left(-\sum_i \frac{(D_i - y_i(\theta))^2}{2\sigma^2}\right) \end{equation*}
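
As a quick check that this likelihood and least squares point the same way, here is a minimal sketch (assuming D, y, and n_points from In [1] and minimize from In [2] are still in scope; In [3] overwrote x, so it is rebuilt first, and $\sigma$ is taken to be 1). The minimizer should agree with fit.x from the least-squares fit above.

In [ ]:
from scipy.stats import norm
from scipy.optimize import minimize
# rebuild the straight-line x grid (In [3] replaced it with a finer plotting grid)
x = np.linspace(0, 10, n_points)
def neg_log_like(theta):
    # -log P(D|theta) with sigma = 1; minimizing this maximizes the likelihood
    return -np.sum(norm.logpdf(D, loc=y(x, theta), scale=1.0))
fit_nll = minimize(neg_log_like, [1, 1])
print fit_nll.x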

We need to flip this around.

Bayesian inference

  • Frequentist: Probability is a frequency (of a random variable)
  • Bayesian: Probability is a measure of belief or certainty

Some Statistics Rules

\begin{equation} P(A,B) = P(A|B)P(B) = P(B|A)P(A) \\ P(A) = \int_\Omega P(A,B)dB = \int_\Omega P(A|B)P(B)dB \\ \end{equation}

Bayesian statistics is the application of these rules
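
These rules are easy to verify numerically. A minimal sketch with a made-up 2×2 joint distribution over two binary variables $A$ and $B$:

In [ ]:
import numpy as np
# made-up joint distribution P(A, B); rows index A, columns index B
P_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])
P_B = P_AB.sum(axis=0)            # P(B) = sum_A P(A, B)
P_A_given_B = P_AB / P_B          # P(A|B) = P(A, B) / P(B)
# marginalization: sum_B P(A|B) P(B) should reproduce P(A) = sum_B P(A, B)
print np.allclose((P_A_given_B * P_B).sum(axis=1), P_AB.sum(axis=1))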

Back to data and parameters:

\begin{equation*} P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} \end{equation*}\begin{equation*} P(D|\theta) : \text{Likelihood} \\ P(\theta) : \text{Prior} \\ P(D) : \text{Evidence} \\ P(\theta|D) : \text{Posterior} \\ \end{equation*}\begin{equation*} P(D) = \int_\Omega P(D|\theta)P(\theta)\,d\theta \end{equation*}

Integration over the parameter space of the model.
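
For a one-parameter model, that integral is easy to approximate on a grid. A minimal sketch (the observations and the flat prior below are made up for illustration): infer the mean of a few normally distributed points, compute the evidence with np.trapz, and use it to normalize the posterior.

In [ ]:
from scipy.stats import norm
D_toy = np.array([4.8, 5.2, 5.9, 4.4])       # made-up observations
theta_grid = np.linspace(0, 10, 1001)         # candidate means
prior = np.ones_like(theta_grid) / 10.0       # flat prior on [0, 10]
# likelihood P(D|theta) at each grid point, with sigma fixed at 1
like = np.array([np.prod(norm.pdf(D_toy, loc=t, scale=1.0)) for t in theta_grid])
evidence = np.trapz(like * prior, theta_grid) # P(D)
posterior = like * prior / evidence           # P(theta|D), integrates to 1
plot(theta_grid, posterior)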

Simple Discrete Model: Coin Toss

In 10 trials, we get 7 heads: What's our best estimate of the probability of flipping heads with this coin?

The distribution of coin toss outcomes is the Binomial Distribution.

$\mathrm{pmf}(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}$, for $k$ heads in $n$ tosses with head-probability $p$.

Let's not forget how obvious this is and guess 0.7.


In [4]:
from scipy.stats import binom
rv = binom(10, 0.7)
# all possible outcomes k = 0..10 for n = 10 tosses
x = np.arange(0, 11)
h = plt.vlines(x, 0, rv.pmf(x), lw=2, color="orange")


"Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood (‘‘unnormalized probability’’) of a particular parameter value for a fixed data set." (.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100)


In [5]:
def f(w):
    rv = binom(10, w)
    return rv.pmf(7)
# Hold the data constant, vary the parameter
x = np.linspace(0,1,100)
plot(x, f(x))


/Library/Python/2.7/site-packages/scipy/stats/distributions.py:6602: RuntimeWarning: divide by zero encountered in log
  return combiln + k*np.log(p) + (n-k)*np.log(1-p)
Out[5]:
[<matplotlib.lines.Line2D at 0x1127a0bd0>]

It looks like the likelihood of the data, viewed as a function of the parameter, is maximized around 0.7.

Our data could just as easily have been 6 of 10 or 8 of 10 with a 0.7 coin. But we started by asking about the coin; to make any headway, we need to flip the question around.


In [6]:
dat = [10, 7]
def L_binom(w):
    rv = binom(dat[0], w)
    x = dat[1]
    # -1 to make max problem a min problem
    return -rv.pmf(x)

In [7]:
x = np.linspace(0.1, 0.9, 9)
scatter(x, L_binom(x))
from scipy.optimize import minimize_scalar
minimize_scalar(L_binom, bounds=[0,1], method="bounded")


Out[7]:
  status: 0
    nfev: 10
 success: True
     fun: -0.26682793199995286
       x: 0.69999991374473847
 message: 'Solution found.'
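
That matches the analytic answer: setting the derivative of the binomial log-likelihood to zero gives the sample fraction,

\begin{equation*} \frac{d}{dp}\left[ k\log p + (n-k)\log(1-p) \right] = \frac{k}{p} - \frac{n-k}{1-p} = 0 \implies \hat{p} = \frac{k}{n} = \frac{7}{10} \end{equation*}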

Slightly More Complicated Model: Exponential with Normal Error


In [9]:
n_points = 10
x = linspace(0,100,n_points)
def y(x, theta=(10, 0.05), noise=0):
    return  theta[0] * np.exp(-x * theta[1]) + noise*randn(n_points)
D = y(x, noise=1)
scatter(x,D)


Out[9]:
<matplotlib.collections.PathCollection at 0x1127a2750>

Likelihood (fixing $\theta_0 = 10$ and $\sigma = 1$, so the only free parameter is the decay rate $w$):

\begin{equation} P(D|\theta) \propto \prod_i \exp\left(-\frac{(D_i - y_i(\theta))^2}{2\sigma^2}\right) \propto \prod_i \exp\left(-\frac{(D_i - 10\exp(-w x_i))^2}{2}\right) \end{equation}

In [10]:
# use x, D from global scope
def L_cost(w):
    # 1/likelihood up to constants: exp(+SSE/2); minimizing it still maximizes the likelihood
    return np.prod(np.exp(((D - y(x,(10, w)))*(D - y(x,(10, w))).transpose())/2.))
w = np.linspace(0,0.1,10)
L_c = [L_cost(i) for i in w]
scatter(w,L_c)


Out[10]:
<matplotlib.collections.PathCollection at 0x112953950>

Numerically troubling! Turn the product into a sum with log!


In [11]:
# use x, D from global scope
def L_cost(w):
    # negative log-likelihood up to additive constants: SSE/2
    return np.sum(((D - y(x,(10, w)))*(D - y(x,(10, w))).transpose())/2.)
w = np.linspace(0,0.1,10)
L_c = [L_cost(i) for i in w]
scatter(w,L_c)
from scipy.optimize import minimize_scalar
minimize_scalar(L_cost, bounds=[0,0.1], method="bounded")


Out[11]:
  status: 0
    nfev: 12
 success: True
     fun: 7.8772450445969193
       x: 0.047362052034333135
 message: 'Solution found.'

So it might be useful to use the $\log$ likelihood:

\begin{equation} \log P(D|\theta) \propto \sum_i \log\exp\left(-\frac{(D_i - 10\exp(-w x_i))^2}{2}\right) = -\sum_i \frac{(D_i - 10\exp(-w x_i))^2}{2} \end{equation}

Normally distributed errors + maximum likelihood (i.e. minimizing the negative log-likelihood) gives the same solution as least squares.
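
As a minimal check of that claim (assuming the exponential-decay x, D, and y from In [9] are still in scope, with $\sigma = 1$), minimizing the full negative normal log-likelihood, constants and all, lands on the same $w$ as the sum-of-squared-errors cost in In [11]:

In [ ]:
from scipy.stats import norm
from scipy.optimize import minimize_scalar
def neg_log_like(w):
    # full -log P(D|theta); differs from L_cost only by constants, so same minimizer
    return -np.sum(norm.logpdf(D, loc=y(x, (10, w)), scale=1.0))
minimize_scalar(neg_log_like, bounds=[0, 0.1], method="bounded")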

Exponential Decay and Poisson Likelihood

\begin{equation} \log P \propto \sum_i \log\frac{\lambda^{x_i}e^{-\lambda}}{x_i!} = \sum_i \left( x_i \log\lambda - \lambda - \log(x_i!) \right) \end{equation}

Stirling's approximation $\ldots$ or just realize that $\log(x_i!)$ is a constant w.r.t. MLE estimation of $\lambda$.

\begin{equation} \log P(D|\theta) \propto \sum_i \log\frac{y_i(\theta)^{D_i}e^{-y_i(\theta)}}{D_i!} = \sum_i \left( D_i \log y_i(\theta) - y_i(\theta) \right) + \text{const} \end{equation}

In [12]:
from numpy.random import poisson
n_points = 20
x = linspace(0,300,n_points)
def y(x, theta=(110, 0.03), noise=0):
    ty = theta[0] * np.exp(-x * theta[1])
    if noise != 0:
        # any nonzero 'noise' value just switches Poisson sampling on
        ty = [poisson(k) for k in ty]
    return ty
D = y(x, noise=5)
scatter(x,D)
plot(x,y(x))


Out[12]:
[<matplotlib.lines.Line2D at 0x11298a850>]

In [13]:
# use x, D from global scope; negative Poisson log-likelihood from the equation above
def P_cost(*w):
    return -np.sum(D*np.log(y(x,*w)) - y(x,*w))
fit_ml = minimize(P_cost, (105,0.05))
print fit_ml


  status: 2
 success: False
    njev: 40
    nfev: 172
     fun: -901.2355508250871
       x: array([  1.09423618e+02,   2.74569336e-02])
 message: 'Desired error not necessarily achieved due to precision loss.'
    hess: array([[  1.84852921e-05,  -1.78246740e-05],
       [ -1.78246740e-05,   1.78203684e-05]])
     jac: array([  2.28881836e-05,   2.34222412e-03])

In [14]:
# cheat, do this with global scope of y, x and D
def ssq_err(*w):
    return np.sum((D - y(x,*w))*(D - y(x,*w)).transpose())
fit_ssq = minimize(ssq_err,[105,0.05])
print fit_ssq
scatter(x, D)
plot(x,y(x,fit_ml.x), color="green")
plot(x,y(x,fit_ssq.x), color="red")


  status: 2
 success: False
    njev: 180
    nfev: 794
     fun: 368.19874406314955
       x: array([  1.07977531e+02,   2.66191370e-02])
 message: 'Desired error not necessarily achieved due to precision loss.'
    hess: array([[  7.47600254e-08,  -2.28259586e-07],
       [ -2.28259586e-07,   6.98553118e-07]])
     jac: array([ 0.        ,  0.01343155])
Out[14]:
[<matplotlib.lines.Line2D at 0x11298e150>]

In [ ]: