Estimating parameters from data

We review the statistical view of parameter estimation from data, in increasing generality:

  • Least squares error
  • Maximum likelihood
  • Method of moments
  • M-estimation
  • Estimating equations

The method of moments turns out to be statistically inefficient; by weighting the moment conditions so that the resulting estimator has minimal variance, one obtains the Generalised method of moments.

When the labels are correlated in an unknown way, one can additionally estimate this correlation structure within the estimating equations, resulting in Generalised estimating equations.

General problem setup

Assume we have $D$-dimensional examples, i.e. $x_n\in\mathbb{R}^D$, and a scalar label $y_n\in\mathbb{R}$. We consider the supervised learning setting, where we obtain pairs $(x_1, y_1), \ldots, (x_N, y_N)$. Given this data, we would like to estimate a predictor $f(\cdot, \theta): \mathbb{R}^D \to \mathbb{R}$, parameterised by $\theta$. We hope to find a good parameter $\theta^*$ such that the predictor fits the data well, $$ f(x_n, \theta^*) \approx y_n \quad\mbox{for all}\quad n=1,\ldots, N. $$ Denote by $X$ and $Y$ the random variables corresponding to the examples $x_n$ and labels $y_n$ respectively.
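As a concrete running example (not part of the derivation above), the following sketch generates synthetic data from a linear predictor $f(x, \theta) = \theta^\top x$ with additive Gaussian noise; the sizes $N$, $D$, the parameter values and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3                                  # number of examples and input dimension (arbitrary)
theta_true = np.array([1.0, -2.0, 0.5])        # hypothetical "true" parameter, used only to simulate data

X = rng.normal(size=(N, D))                    # examples x_n stacked as the rows of X
y = X @ theta_true + 0.1 * rng.normal(size=N)  # labels y_n = theta^T x_n + noise


def f(x, theta):
    """Linear predictor f(x, theta) = theta^T x."""
    return x @ theta
```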

Least squares error

We measure error using the squared loss $\ell(y_n, f(x_n, \theta)) = (y_n - f(x_n, \theta))^2$ and minimise the empirical risk, which is the average of the losses over the data: $$ \min_\theta \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n, \theta))^2. $$ For a linear predictor $f(x_n, \theta) = \theta^\top x_n$ this has a closed form solution, obtained by solving the normal equations.
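A minimal sketch, reusing `X`, `y` and `f` from the synthetic-data cell above: the normal equations $X^\top X\,\theta = X^\top y$ give the least squares estimate directly.

```python
# Closed form least squares estimate via the normal equations X^T X theta = X^T y.
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Empirical risk (mean squared error) at the estimate.
print("estimate:", theta_ls)
print("empirical risk:", np.mean((y - f(X, theta_ls)) ** 2))
```

In practice `np.linalg.lstsq` is numerically preferable to forming $X^\top X$ explicitly, but the normal equations match the derivation above.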

Maximum likelihood

We specify the conditional probability distribution of the labels given the examples, for a particular parameter setting $\theta$. For example, we specify that the conditional distribution is Poisson with mean $\lambda := \exp(\theta^\top x)$; if we instead assume the conditional distribution is Gaussian, we recover the least squares approach above. $$ Y \mid X = x \sim \mathrm{Pois}(\lambda) $$ That is, we consider a generalised linear model with the logarithm as link function, $$ \mathbb{E}(Y|X=x_n) = \exp(\theta^\top x_n). $$ The probability mass function of the Poisson distribution is given by \begin{align*} p(y|x,\theta) &= \frac{\lambda^y}{y!}\exp(-\lambda) = \frac{\exp(\theta^\top x)^y}{y!}\exp(-\exp(\theta^\top x))\\ &= \frac{\exp(y\theta^\top x - \exp(\theta^\top x))}{y!} \end{align*} We assume that the data is independent, and hence the joint probability (likelihood) factorises $$ p(y_1, \ldots, y_N | x_1, \ldots, x_N, \theta) = \prod_{n=1}^N p(y_n|x_n, \theta). $$ The maximum likelihood parameter is the $\theta$ at which this likelihood is maximised. For optimisation reasons we (equivalently) minimise the negative log likelihood $$ \min_\theta - \log \prod_{n=1}^N p(y_n|x_n, \theta) = \min_\theta - \sum_{n=1}^N \log p(y_n|x_n, \theta). $$ Substituting the definition of the Poisson pmf, dropping the $\log y_n!$ terms (which do not depend on $\theta$), and a few steps of algebra result in $$ \min_\theta \sum_{n=1}^N \left(\exp(\theta^\top x_n) - y_n\theta^\top x_n\right). $$ Setting the gradient to zero has no closed form solution, and hence some numerical optimisation method needs to be applied.
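A minimal sketch of the Poisson maximum likelihood fit, minimising the negative log likelihood above with `scipy.optimize.minimize`; the synthetic data-generating parameters and sizes are arbitrary illustrations, not part of the derivation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

N, D = 200, 3                            # illustrative sizes
theta_true = np.array([0.5, -0.3, 0.2])  # hypothetical "true" parameter
X = rng.normal(size=(N, D))
y = rng.poisson(np.exp(X @ theta_true))  # y_n | x_n ~ Pois(exp(theta^T x_n))


def neg_log_likelihood(theta):
    """sum_n exp(theta^T x_n) - y_n theta^T x_n (log(y_n!) terms dropped)."""
    eta = X @ theta
    return np.sum(np.exp(eta) - y * eta)


def gradient(theta):
    """Gradient: sum_n (exp(theta^T x_n) - y_n) x_n."""
    return X.T @ (np.exp(X @ theta) - y)


result = minimize(neg_log_likelihood, x0=np.zeros(D), jac=gradient, method="BFGS")
print("estimate:", result.x)
```

Since the negative log likelihood is convex in $\theta$, a standard quasi-Newton method such as BFGS converges to the maximum likelihood estimate.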

Method of moments

Estimating equations

