Chapter 4 - Classical Statistical Inference

Maximum Likelihood Estimator (MLE)

Likelihood:

  • For a dataset $\{x_i\}$ drawn from model $M(\theta)$:

    $L\equiv p(\{x_i\}|M(\theta)) = \prod_{i=1}^n p(x_i|M(\theta))$

    Model $M$ -- $k$ parameters ($\theta$ is the vector of those parameters)

    Likelihood of the data given the model . . . but it isn't normalized as a function of $\theta$, so we often work with $\ln L$

Maximum Likelihood Approach

  1. Formulate the model $M(\theta)$ and the likelihood $p(D|M)$
  2. Search for the best-fit parameters that maximize $p(D|M)$
  3. Determine confidence intervals for those parameters
  4. Perform hypothesis tests to draw other conclusions (maybe this comes later?)

Homoscedastic Gaussian Likelihood

(the same Gaussian error distribution, with constant $\sigma$, for every measurement in the dataset)

example: measuring the length of a rod (length $\mu$) with a measurement error $\sigma$:

$L \equiv p(\{x_i\}|\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(x_i - \mu)^2}{2\sigma^2}\right)$

To maximize the likelihood, just take the derivative with respect to $\mu$ and set it to zero (here it is useful to work with $\ln L$, which always has its maximum in the same place as $L$):

$\ln L(\mu) = \mathrm{constant} - \sum_{i=1}^n\frac{(x_i - \mu)^2}{2\sigma^2}$

and you can show analytically that this is maximized when:

$\mu^0 = \frac{1}{N}\sum_{i=1}^N x_i$ (the best fit is the average of the measurements)
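
A quick numerical sanity check (my own sketch, not from the book): maximize the homoscedastic Gaussian $\ln L$ with scipy and compare to the sample mean.

In [ ]:
import numpy as np
from scipy.optimize import minimize_scalar

# simulated rod measurements: true length 10, known measurement error 0.2
np.random.seed(0)
sigma = 0.2
x = np.random.normal(10, sigma, 100)

# negative log-likelihood (up to a constant) for the homoscedastic Gaussian model
def neg_lnL(mu):
    return np.sum((x - mu) ** 2) / (2 * sigma ** 2)

mu_mle = minimize_scalar(neg_lnL).x
print("MLE of mu:  ", mu_mle)
print("sample mean:", np.mean(x))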

ML Estimators

  1. Consistent -- converge to the true parameter value given enough measurements
  2. Asymptotically normal -- the distribution of the estimator approaches a normal distribution (centered on the MLE) as the number of data points grows
  3. Achieve the Cramér-Rao bound -- the "best possible error given the data at hand" (see the quick check below)
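
A quick illustration (my own sketch) of points 2 and 3: repeat the rod experiment many times; the spread of the MLE across experiments matches the Cramér-Rao value $\sigma/\sqrt{N}$ for this Gaussian case.

In [ ]:
import numpy as np

# repeat the homoscedastic experiment many times and collect the MLE (the sample mean)
np.random.seed(2)
sigma, N, n_trials = 0.2, 100, 5000
mle = np.random.normal(10, sigma, size=(n_trials, N)).mean(axis=1)

print("std of the MLE across trials:  ", mle.std())
print("Cramer-Rao bound sigma/sqrt(N):", sigma / np.sqrt(N))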

MLE Confidence Intervals

covariance matrix: $\sigma_{jk} = \left(-\frac{\partial^2 \ln L}{\partial\theta_j \partial\theta_k}\Big|_{\theta=\theta^0}\right)^{-1/2}$ (the errors come from the curvature of $\ln L$ at its maximum; for multiple parameters one inverts the full matrix of second derivatives, the Fisher information matrix)

. . . for Gaussian errors:

$\sigma_\mu = \left(-\frac{d^2 \ln L}{d\mu^2}\Big|_{\mu^0}\right)^{-1/2}$, which for the homoscedastic case reduces to $\sigma_\mu = \sigma/\sqrt{N}$
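
A minimal numerical check (assumed toy data): estimate $\sigma_\mu$ from the curvature of $\ln L$ via a finite-difference second derivative and compare it to $\sigma/\sqrt{N}$.

In [ ]:
import numpy as np

np.random.seed(1)
sigma = 0.2
x = np.random.normal(10, sigma, 100)
N = len(x)

# log-likelihood (up to a constant) for the homoscedastic Gaussian model
def lnL(mu):
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2)

mu0 = np.mean(x)   # analytic MLE
h = 1e-3           # finite-difference step

# second derivative of lnL at its maximum, by central differences
d2lnL = (lnL(mu0 + h) - 2 * lnL(mu0) + lnL(mu0 - h)) / h ** 2

print("curvature-based sigma_mu:", (-d2lnL) ** -0.5)
print("analytic sigma/sqrt(N):  ", sigma / np.sqrt(N))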

MLE Heteroscedastic Gaussian Case

(different measurements have different errors)

$\ln L(\mu) = \mathrm{constant} - \sum_{i=1}^N\frac{(x_i - \mu)^2}{2\sigma_{i}^2}$

$\mu^0 = \frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i}$ (where $w_i = \sigma_i^{-2}$)

  • If the errors differ from point to point, the maximum likelihood estimate is the weighted mean

  • Confidence interval:

$\sigma_{\mu} = \left(\sum_{i=1}^N \frac{1}{\sigma_i^2}\right)^{-1/2} = \left(\sum_{i=1}^N w_i\right)^{-1/2}$

If there is intrinsic variation ($\sigma_o$) in the measured property in addition to measurement errors ($e_i$):

$\sigma_i = \sqrt{\sigma_o^2+e_i^2}$
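
A minimal sketch (toy data, not from the book) of the weighted mean and its uncertainty:

In [ ]:
import numpy as np

np.random.seed(3)
N = 20
mu_true = 10.0

# heteroscedastic errors: each measurement has its own sigma_i
e = np.random.uniform(0.1, 0.5, N)
x = np.random.normal(mu_true, e)

w = e ** -2                          # weights w_i = 1 / sigma_i^2
mu_hat = np.sum(w * x) / np.sum(w)   # weighted mean (the MLE)
sigma_mu = np.sum(w) ** -0.5         # its uncertainty

print("weighted mean: %.3f +/- %.3f" % (mu_hat, sigma_mu))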

MLE for Truncated and Censored Data

(What happens when the measurements don't properly sample the full distribution of values?)

Data are truncated when the selection function $S(x)$ goes to zero over some range of $x$.

Data are censored when a measurement is attempted but the value falls outside the observable range (upper/lower limits) -- this comes later in the book.

  • Truncated Example: S(x) = 1 for $x_{min}\le x\le x_{max}$ and S(x)=0 everywhere else

$p(x_i|\mu,\sigma,x_{min},x_{max}) = C(\mu,\sigma,x_{min},x_{max})\frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-(x_i - \mu)^2}{2\sigma^2}\right)$

$C(\mu,\sigma,x_{min},x_{max}) = (P(x_{max}|\mu,\sigma) - P(x_{min}|\mu,\sigma))^{-1}$

N.B. -- $P(x_{max}|\mu,\sigma)$ is the Gaussian cumulative distribution function evaluated at $x_{max}$

$\ln L(\mu) = \mathrm{constant} - \sum_{i=1}^N\frac{(x_i - \mu)^2}{2\sigma^2} + N\ln C(\mu,\sigma,x_{min},x_{max})$

  • Example:

"Karpathia" is a country with high IQ scores (with no IQ scores < 120). Given that generally IQ scores follow a gaussian distribution (NOT centered on 100 in this case) with $\sigma=15$. Given that you meet one student in Karpathia with an IQ score of $S_{IQ}$, what do you know about the mean IQ score in Karpathia?


In [2]:
%matplotlib inline

import numpy as np
from matplotlib import pyplot as plt

In [6]:
# simulate IQ scores: one population centered on 100 and one centered on 90,
# then truncate the second sample at the selection limit of 120
IQs = np.random.normal(loc=100, scale=15, size=10000)
IQs_2 = np.random.normal(loc=90, scale=15, size=10000)
IQs_trunc = IQs_2[IQs_2 >= 120]

print("Mean of IQs_trunc: ", np.mean(IQs_trunc))
ax = plt.hist(IQs, 50, density=True)
ax = plt.hist(IQs_trunc, 50, density=True)


Mean of IQs_trunc:  125.642146716
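
A sketch of the actual inference (my own illustration, with an assumed observed score $S_{IQ}=125$): evaluate the truncated-Gaussian $\ln L(\mu)$ from above on a grid of $\mu$, using scipy's Gaussian survival function for the truncation term. The curve falls off only slowly toward small $\mu$, so a single truncated measurement constrains the mean only weakly.

In [ ]:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

S_IQ = 125.0    # hypothetical observed score (assumption for illustration)
sigma = 15.0
x_min = 120.0   # truncation limit: no scores below 120 are observed

# truncated-Gaussian log-likelihood (up to a constant); N = 1 data point,
# C = 1 / (1 - P(x_min|mu,sigma)), so + N ln C = -ln(sf)
mu_grid = np.linspace(50, 200, 1000)
lnL = (-(S_IQ - mu_grid) ** 2 / (2 * sigma ** 2)
       - np.log(norm.sf(x_min, loc=mu_grid, scale=sigma)))

plt.plot(mu_grid, lnL - lnL.max())
plt.xlabel(r'$\mu$ (mean IQ in Karpathia)')
plt.ylabel(r'$\ln L(\mu) - \ln L_{\rm max}$')
plt.show()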

Mean Integrated Square Error

  • Cost function -- how close is the empirical estimate $f(x)$ to the true pdf $h(x)$?
  • MISE --> the "L2 norm", which minimizes the mean square deviation (written out below)
  • vs. the L1 norm, which minimizes the mean absolute deviation
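
For reference (the standard definition, not spelled out in these notes): with $f(x)$ the estimate and $h(x)$ the true pdf,

$\mathrm{MISE} \equiv E\left[\int \left(f(x) - h(x)\right)^2 \, dx\right]$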

Goodness of fit and model selection

-- comparing fits to different models.

Now think of the likelihood of the model given the data.

Gaussian:

$\ln L = \mathrm{const} - \frac{1}{2}\sum_{i=1}^N z_i^2 = \mathrm{const} - \frac{1}{2}\chi^2$, where $z_i = (x_i-\mu)/\sigma$

For a good fit, $\chi^2_{dof} = \frac{1}{N-k}\sum_{i=1}^N z_i^2 \approx 1$, where $k$ is the number of model parameters

Example from the book -- Luminosities are simulated with Gaussian errors

(a) the model predicts the correct luminosity and the errors are accurate -- $\chi^2_{dof}\approx 1$
(b) the model predicts the correct luminosity but the errors are overestimated -- $\chi^2_{dof}< 1$
(c) the model predicts the correct luminosity but the errors are underestimated -- $\chi^2_{dof}> 1$
(d) the model predicts the incorrect luminosity but the errors are accurate -- $\chi^2_{dof}> 1$


In [8]:
#------------------------------------------------------------
# Generate Dataset
np.random.seed(2)

N = 50
L0 = 10
dL = 0.2

t = np.linspace(0, 1, N)
L_obs = np.random.normal(L0, dL, N)

#------------------------------------------------------------
# Plot the results
fig = plt.figure(figsize=(5, 5))
fig.subplots_adjust(left=0.1, right=0.95, wspace=0.05,
                    bottom=0.1, top=0.95, hspace=0.05)

y_vals = [L_obs, L_obs, L_obs, L_obs + 0.5 - t ** 2]
y_errs = [dL, dL * 2, dL / 2, dL]
titles = ['correct errors',
          'overestimated errors',
          'underestimated errors',
          'incorrect model']

for i in range(4):
    ax = fig.add_subplot(2, 2, 1 + i, xticks=[])

    # compute the mean and the chi^2/dof
    mu = np.mean(y_vals[i])
    z = (y_vals[i] - mu) / y_errs[i]
    chi2 = np.sum(z ** 2)
    chi2dof = chi2 / (N - 1)

    # compute the standard deviations of chi^2/dof
    sigma = np.sqrt(2. / (N - 1))
    nsig = (chi2dof - 1) / sigma

    # plot the points with errorbars
    ax.errorbar(t, y_vals[i], y_errs[i], fmt='.k', ecolor='gray', lw=1)
    ax.plot([-0.1, 1.3], [L0, L0], ':k', lw=1)

    # Add labels and text
    ax.text(0.95, 0.95, titles[i], ha='right', va='top',
            transform=ax.transAxes,
            bbox=dict(boxstyle='round', fc='w', ec='k'))
    ax.text(0.02, 0.02, r'$\hat{\mu} = %.2f$' % mu, ha='left', va='bottom',
            transform=ax.transAxes)
    ax.text(0.98, 0.02,
            r'$\chi^2_{\rm dof} = %.2f\, (%.2g\,\sigma)$' % (chi2dof, nsig),
            ha='right', va='bottom', transform=ax.transAxes)

    # set axis limits
    ax.set_xlim(-0.05, 1.05)
    ax.set_ylim(8.6, 11.4)

    # set ticks and labels
    ax.yaxis.set_major_locator(plt.MultipleLocator(1))

    if i > 1:
        ax.set_xlabel('observations')

    if i % 2 == 0:
        ax.set_ylabel('Luminosity')
    else:
        ax.yaxis.set_major_formatter(plt.NullFormatter())

plt.show()


Comparing multiple models

  • Given $L^0(M)$, the maximum likelihood value, for a number of models, it isn't necessarily the case that the model with the highest likelihood is the best -- different models may have different numbers of free parameters, so we penalize unnecessary parameters.

One option = "Aikake information criterion" (AIC):

$AIC \equiv -2\ln(L^o(M)) + 2k + \frac{2k(k+1)}{N-k-1}$

k = number of parameters, N = number of data points

Given two models with equal likelihoods, the one with fewer parameters is better.
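
A minimal sketch (a toy comparison of my own, using the formula above) of how the AIC penalizes the extra parameter; the data and the two models here are assumptions for illustration.

In [ ]:
import numpy as np
from scipy.stats import norm

def aic(lnL_max, k, N):
    # corrected Akaike information criterion (formula above)
    return -2 * lnL_max + 2 * k + 2 * k * (k + 1) / (N - k - 1)

# toy data drawn from a Gaussian
np.random.seed(4)
x = np.random.normal(10, 2, 100)
N = len(x)

# Model 1: Gaussian with free mean, sigma fixed at 2   -> k = 1
lnL1 = np.sum(norm.logpdf(x, loc=np.mean(x), scale=2))
# Model 2: Gaussian with free mean and free sigma      -> k = 2
lnL2 = np.sum(norm.logpdf(x, loc=np.mean(x), scale=np.std(x)))

print("AIC, model 1 (k=1):", aic(lnL1, 1, N))
print("AIC, model 2 (k=2):", aic(lnL2, 2, N))
# the model with the smaller AIC is preferred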

----------------------------- we stopped here basically ---------------------------------

Gaussian Mixtures: Expectation Maximization

For a given datum $x_i$ drawn from a mixture of $M$ Gaussian components:

$p(x_i|\boldsymbol{\theta}) = \sum_{j=1}^{M}\alpha_j \mathcal{N}(\mu_j,\sigma_j)$

For that pdf to be normalized we need $\sum_{j=1}^{M}\alpha_j = 1$, and the log-likelihood for a dataset is:

$\ln L = \sum_{i=1}^N \ln \left[\sum_{j=1}^{M}\alpha_j \mathcal{N}(\mu_j,\sigma_j)\right]$

. . . for which you can no longer find the maximum analytically by taking partial derivatives with respect to each parameter, because the sum inside the logarithm couples all of the parameters.

  • If the different Gaussians just represent different subsamples ("classes"), such that each data point is produced by exactly one of the Gaussian components, the problem becomes easier. But those class labels are unknown. The probability that point $x_i$ belongs to class $j$ can be assigned as:

$p(j|x_i) = \frac{\alpha_j \mathcal{N}(\mu_j,\sigma_j)}{\sum_{k=1}^{M}\alpha_k \mathcal{N}(\mu_k,\sigma_k)}$
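
A minimal sketch (toy 1D data, with fixed mixture parameters assumed for illustration) of these class-membership probabilities, i.e. the expectation ("E") step of the EM algorithm:

In [ ]:
import numpy as np
from scipy.stats import norm

# toy 1D data drawn from two Gaussian classes
np.random.seed(5)
x = np.concatenate([np.random.normal(0, 1, 200),
                    np.random.normal(5, 1, 100)])

# assumed (fixed) mixture parameters
alpha = np.array([2. / 3., 1. / 3.])   # mixing weights, sum to 1
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 1.0])

# p(j|x_i): responsibility of component j for each point
num = alpha * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, M)
p_j_given_x = num / num.sum(axis=1, keepdims=True)

print(p_j_given_x[:5])   # each row sums to 1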


In [ ]: