Estimation

Point Estimation

Let $X=(X_1, X_2, ..., X_n)$ be iid random variables, and their common distribution depends on a parameter $\theta$.

Question: Find a function $T$ of the variable $X$ (or, more precisely, of the observed data $x=(x_1,x_2,...,x_n)$) that estimates $\theta$ (i.e., that guesses what $\theta$ is).

Notation: we write $\hat{\theta}$ or $\hat{\theta}(X)$ for this function $T$ when it estimates $\theta$ directly. More generally, $T$ could estimate $g(\theta)$ for some function $g$.

Idea: as $n\to \infty$, we hope $\hat{\theta}(X) \to \theta$

Definition: The mean square error (MSE) of $\hat{\theta}(X)$ is $\displaystyle \mathbf{E}_\theta\left[(\hat{\theta}(X) - \theta)^2\right]$ (here $\mathbf{E}_\theta$ means the expectation is evaluated assuming that $\theta$ is the true parameter).

Applying $\mathbf{E}[Y^2] = \mathbf{Var}[Y] + \left(\mathbf{E}[Y]\right)^2$ with $Y = \hat{\theta}(X) - \theta$, and noting that subtracting the constant $\theta$ does not change the variance:

$\displaystyle \mathbf{E}_\theta\left[(\hat{\theta}(X) - \theta)^2\right] = \mathbf{Var}_\theta\left[\hat{\theta}(X) - \theta\right] + \left(\mathbf{E}_\theta\left[\hat{\theta}(X) - \theta\right]\right)^2$

$\displaystyle \mathbf{E}_\theta\left[(\hat{\theta}(X) - \theta)^2\right] = \mathbf{Var}_\theta\left[\hat{\theta}(X)\right] + \left(\mathbf{E}_\theta\left[\hat{\theta}(X)\right] - \theta \right)^2$

We see that the MSE of the estimator $\hat{\theta}(X)$ is the sum of two nonnegative terms:

$$r_{\hat{\theta}(X)} (\theta) = \mathbf{Var}_\theta \left[\hat{\theta}(X)\right] ~+~ \left(\mathbf{E}_\theta \left[\hat{\theta}(X)\right] - \theta\right)^2$$

The quantity inside the square in the last term is called the bias: $b_{\hat{\theta}}(\theta) = \mathbf{E}_\theta \left[\hat{\theta}(X)\right] - \theta$.

Definition: When the bias $b_{\hat{\theta}}(\theta) = 0$ for all $\theta$, we say $\hat{\theta}$ is unbiased.

Example:

Estimate $\theta$ for $X_i \sim Uniform(0,\theta)$. We define two different estimators:

  • $T_1 = \displaystyle 2 \frac{X_1 + X_2 + ... + X_n}{n}$, so $\mathbf{E}_\theta[T_1] = \theta$ (unbiased, since $\mathbf{E}_\theta[X_i] = \theta/2$)

  • $T_2 = \max \left(X_1, X_2, ..., X_n\right)$; it turns out that $\mathbf{E}_\theta[T_2] = \frac{n}{n + 1}\theta$ (biased)

$$r_{T_1}(\theta) = \frac{\theta^2}{3n}$$

$$r_{T_2}(\theta) = \frac{2 \theta^2}{(n+1)(n+2)}$$

So, we see that $r_{T_2}$ is roughly of order $\frac{1}{n^2}$ while $r_{T_1}$ is of order $\frac{1}{n}$; for large $n$, $T_2$ has smaller MSE than $T_1$, even though $T_2$ is biased.
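
As a sanity check on these formulas, here is a minimal Monte Carlo sketch (assuming NumPy is available; the values $\theta=1$, $n=50$, and the number of replications are illustrative choices, not part of the notes):

```python
import numpy as np

# Monte Carlo estimate of the MSEs of T1 = 2 * sample mean and T2 = sample max
# for Uniform(0, theta) data; compare with the closed-form formulas above.
rng = np.random.default_rng(0)
theta, n, reps = 1.0, 50, 100_000

x = rng.uniform(0.0, theta, size=(reps, n))
t1 = 2.0 * x.mean(axis=1)   # unbiased estimator
t2 = x.max(axis=1)          # biased estimator

print(np.mean((t1 - theta) ** 2), theta**2 / (3 * n))                  # r_{T1}
print(np.mean((t2 - theta) ** 2), 2 * theta**2 / ((n + 1) * (n + 2)))  # r_{T2}
```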



Method of Moments (MME)

Example: for Gamma distribution: $X \sim \Gamma(\alpha, \theta)$.

We know that $\mathbf{E}[X] = \alpha \theta$ and $\mathbf{Var}[X] = \alpha \theta^2$

Therefore, if we knew $\mu = \mathbf{E}[X]$ and $\sigma^2 = \mathbf{Var}[X]$, we could solve the equations above for $\alpha$ and $\theta$:

  • $\displaystyle \theta = \frac{\sigma^2}{\mu}$

  • $\displaystyle \alpha = \frac{\mu^2}{\sigma^2}$

The method of moments estimator is obtained by replacing $\mu$ with the sample mean $\displaystyle \hat{\mu}_n(X) = \frac{X_1 + X_2 + ... + X_n}{n}$ and replacing $\sigma^2$ with the sample variance $\displaystyle \hat{\sigma}_n^2(X) = \frac{1}{n-1} \sum_{i=1}^n (X_i - \hat{\mu}_n)^2$.
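
As an illustration of this plug-in step, a minimal sketch (assuming NumPy; the true values $\alpha=2$, $\theta=3$ and the sample size are only illustrative):

```python
import numpy as np

# Method of moments for Gamma(alpha, theta): alpha = mu^2 / sigma^2, theta = sigma^2 / mu,
# with mu and sigma^2 replaced by the sample mean and sample variance.
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # simulated Gamma(2, 3) data

mu_hat = x.mean()
sigma2_hat = x.var(ddof=1)     # sample variance with the 1/(n-1) convention

alpha_mme = mu_hat**2 / sigma2_hat
theta_mme = sigma2_hat / mu_hat
print(alpha_mme, theta_mme)    # should be close to 2.0 and 3.0
```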

Formal definition of method of moments:

We seek to find parameters $\theta\in \mathbb{R}^k$ (so there are actually $k$ scalar parameters).
We have data $X_1,X_2,..X_n$ coming from the distribution with that parameter $\theta$.
We can compute the moments $\mu_j = \mathbf{E}_\theta\left[X_1^j\right]$ for $j=1,2,...,k$; these are functions $\mu_j = \mu_j(\theta)$ of the parameter.
Now, replace each $\mu_j$ by its empirical (sample) counterpart $\bar{\mu}_j = \frac{1}{n} \sum_{i=1}^n \left(X_i\right)^j$.

Therefore, we write down the following $k$ equations with the $k$-dimensional unknown $\theta$:
$$\mu_j(\theta) = \bar{\mu}_j, \quad j=1,...,k$$

The solution of this system is $\hat{\theta}$, the method of moments estimator of $\theta$.

Two problems with method of moments estimators

1) The above system may have no solution, or it may have more than one solution (the identifiability problem).

2) $\hat{\theta}$ can be too sensitive to the data $X$: it can change dramatically even under small experimental errors, i.e., small changes to $X$.



Maximum Likelihood Estimation (MLE)

Idea: if $X=x$ is observed, then calculate the density (the likelihood) or the probability mass function (the likelihood) of $X$ at the value $x$, namely $f_X(x)$ or $P_X(x)$,

under the assumption that the parameter equals a fixed value $\theta$.

Therefore, the likelihood is a function of $x$ and $\theta$.

The notation, therefore, is $$\text{Likelihood} = L_x(\theta) = \left\{\begin{array}{l}f_X(x;\theta) \\ \text{or} \\ P_X(x;\theta)\end{array}\right.$$

This $L_x(\theta)$ is better understood as a function of $\theta$ than of $x$ because our data $x$ is fixed and we are not sure what $\theta$ is.

Question: how do we make sure to pick a $\theta$ which is most consistent with the data?

Answer: find a $\theta$ which maximizes $L_x(\theta)$ as a function of $\theta$.

Mathematically: the $\text{MLE }\hat{\theta} = \hat{\theta}(x)$ is a value of $\theta$ which makes $L_x(\theta)$ as big as possible:

$$\hat{\theta}(x) = \arg\max_{\displaystyle \theta} L_x(\theta)$$

Note: many densities are more easily manipulated via their logarithm. So, we define the log-likelihood $\log L_x(\theta)$.

Then we can compute $\frac{d\,\log L_x(\theta)}{d\theta}$ instead of $\frac{dL_x(\theta)}{d\theta}$. This derivative is called the score function, and we write $l_x(\theta)$ for it.

So, since the $\log$ function is an increasing function: $\arg\max_{\displaystyle \theta} \log L_x(\theta) = \arg\max_{\displaystyle \theta} L_x(\theta) = \hat{\theta}(x)$
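
When no closed-form maximizer exists, the argmax can be found numerically by minimizing $-\log L_x(\theta)$. A minimal sketch (assuming NumPy and SciPy; the exponential model with rate $\theta$, whose MLE is $1/\bar{x}$, is used purely as an illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Generic recipe: maximize log L_x(theta) by minimizing the negative log-likelihood.
rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 2.5, size=1_000)   # Exponential data with true rate 2.5

def neg_log_likelihood(theta):
    # For Exponential(rate = theta): log L_x(theta) = n log(theta) - theta * sum(x_i)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())   # numerical maximizer vs. the closed-form MLE 1 / x_bar
```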

Example:

A box with $10$ balls, $\theta$ balls are white.

Pick $2$ balls without replacement.

Let $X=\text{# of white balls among the two we have picked}$

Then $X$ is hypergeometric, and its probability mass function $P_X(x;\theta)$ is

$$P_X(x;\theta) = \frac{1}{\displaystyle \left(\begin{array}{c}10\\2\end{array}\right)} \left(\begin{array}{c}\theta\\x\end{array}\right) \left(\begin{array}{c}10-\theta\\2-x\end{array}\right)$$

for $\theta=0,1,2,...,10$ and $x=0,1,2$
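
Because $\theta$ takes only the values $0,1,...,10$, the MLE can be found by direct enumeration of the likelihood. A minimal sketch (the observed value $x=1$ is just an illustration):

```python
from math import comb

# Likelihood of the hypergeometric example:
# P_X(x; theta) = C(theta, x) * C(10 - theta, 2 - x) / C(10, 2)
def likelihood(x, theta):
    return comb(theta, x) * comb(10 - theta, 2 - x) / comb(10, 2)

x_obs = 1   # suppose one of the two drawn balls was white (illustrative)
theta_hat = max(range(11), key=lambda t: likelihood(x_obs, t))
print(theta_hat)   # the theta in {0,...,10} that maximizes the likelihood
```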

Example:

For Bernoulli random variables, $X_i \sim Bernoulli(p=\theta)$, with data $x_1,x_2,...,x_n$:

PMF for one Bernoulli: $P(x; \theta) = (1-\theta)^{1-x}\theta^{x}$

For the whole data vector, $P(x_1,x_2,..,x_n; \theta) = \prod_{i=1}^n \displaystyle \left[(1-\theta)^{1-x_i} \theta^{x_i}\right]$

(Note: this $L_x(\theta)$ is the same as $P(x; \theta)$)

Now, maximize $L_x(\theta)$.

$$\log L_x(\theta) = \sum_{i=1}^n \left[(1-x_i)\log (1-\theta) + x_i \log \theta\right]$$

$$l_x(\theta)=\frac{d\,\log L_x(\theta)}{d\theta} = \sum_{i=1}^n \left[(1-x_i)\frac{-1}{1-\theta} + x_i\frac{1}{\theta}\right] \quad \text{(set this $=0$ and solve for $\theta$)}$$

$$\sum_i (1-x_i)\frac{1}{1-\theta} = \sum_i x_i \frac{1}{\theta}$$

$$\theta \sum_i (1-x_i) + (\theta - 1) \sum_i x_i = 0$$

$$\theta \times n - \theta \sum_i x_i + \theta \sum_i x_i - \sum_i x_i = 0$$

$$\theta \times n = \sum_i x_i \Longrightarrow \hat{\theta} = \frac{\sum_i x_i}{n} = \bar{x}$$

Finally, we would just need to check that when $\theta = \hat{\theta}$, the second derivative of $\log L_x(\theta)$ is negative, i.e., $$\frac{d\,l_x(\theta)}{d\theta} < 0,$$ so that $\hat{\theta}$ is indeed a maximum.
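
A quick numerical check of this derivation (assuming NumPy; the simulated Bernoulli(0.3) data are illustrative): the grid maximizer of the log-likelihood should agree with the closed form $\hat{\theta} = \bar{x}$.

```python
import numpy as np

# Verify numerically that theta_hat = x_bar maximizes the Bernoulli log-likelihood.
rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=500)   # iid Bernoulli(0.3) sample

def log_likelihood(theta):
    return np.sum((1 - x) * np.log(1 - theta) + x * np.log(theta))

grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_grid, x.mean())   # grid maximizer vs. closed-form MLE
```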

Example:

$X = (X_1, X_2, ..., X_n)$ iid $\sim \mathcal{N}(\mu, \sigma^2)$.

Let's assume $\mu = \mu_0$ is known, so the unknown parameter is $\theta = \sigma^2$.

Therefore, for observation $x=(x_1,x_2, ..., x_n)$

$$L_x(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \theta}} e^{\frac{-1}{2\theta} (x_i - \mu_0)^2}$$

$$\log L_x(\theta) = \sum_{i=1}^n \left[-\frac{1}{2} \log(2\pi) - \frac{1}{2}\log\theta - \frac{1}{2\theta} (x_i - \mu_0)^2\right]$$

$$l_x(\theta) = \frac{d}{d\theta} \log L_x(\theta) = -\frac{1}{2} \sum_{i=1}^n \left(\frac{1}{\theta} - \frac{1}{\theta^2}(x_i-\mu_0)^2\right) = -\frac{1}{2} \left(\frac{n}{\theta} - \frac{1}{\theta^2} \sum_i (x_i-\mu_0)^2\right) = 0 \quad \text{(solve for }\theta\text{)}$$

$$\theta\times n = \sum_i (x_i - \mu_0)^2 \Longrightarrow \text{MLE } \hat{\theta} = \frac{1}{n} \sum_i (x_i - \mu_0)^2 = \hat{\sigma}_n^2 \text{ (like a sample variance)}$$

The above is like a sample variance when the mean is known. It turns out this $\hat{\sigma}_n^2$ is unbiased. Indeed,

$$\mathbf{E}[\hat{\sigma}_n^2(X)] = \mathbf{E}\left[\frac{1}{n} \sum_{i=1}^n (X_i-\mu_0)^2\right] = \frac{1}{n} \sum_i \mathbf{E}\left[(X_i-\mu_0)^2\right] = \frac{1}{n} \cdot n\, \sigma^2 = \sigma^2$$
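
A simulation sketch of this unbiasedness claim (assuming NumPy; $\mu_0$, $\sigma$, $n$, and the number of replications are illustrative):

```python
import numpy as np

# Monte Carlo check that (1/n) * sum (X_i - mu_0)^2 is unbiased for sigma^2 when mu_0 is known.
rng = np.random.default_rng(4)
mu0, sigma, n, reps = 1.0, 2.0, 20, 200_000

x = rng.normal(mu0, sigma, size=(reps, n))
sigma2_hat = np.mean((x - mu0) ** 2, axis=1)   # the MLE with known mean, one value per replication

print(sigma2_hat.mean(), sigma**2)   # average of the estimator vs. sigma^2 = 4
```
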
Example:

$X = (X_1, ..., X_n)$ iid $Uniform[0,\theta]$.

It turns out that the MLE for $\theta$ is

$$\hat{\theta}(x) = \max (x_1,..., x_n)$$

Indeed, $L_x(\theta) = \theta^{-n}$ when $\theta \ge \max(x_1,...,x_n)$ and $L_x(\theta) = 0$ otherwise, so the likelihood is maximized by taking $\theta$ as small as the data allow, namely $\theta = \max(x_1,...,x_n)$.

Example:

$X = (X_1,X_2,.., X_n)$ iid $Poisson(\lambda=\theta)$

For a single observation, say $x_1$: $L_{x_1}(\theta) = \displaystyle e^{-\theta} \frac{\theta^{x_1}}{x_1!}$

For all the data: $L_{x_1,x_2,..,x_n}(\theta) = \prod_i e^{-\theta} \frac{\theta^{x_i}}{x_i!}$

$$\log L_x(\theta) = \sum_i \left(-\theta + x_i \log \theta - \log (x_i!)\right)$$

$$l_x(\theta) = \sum_i \left(-1 + \frac{x_i}{\theta}\right) = 0$$

$$\hat{\theta}(x) = \frac{1}{n}\sum_i x_i$$

This is the same result as we would get with the method of moments estimator (MME).
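
A brief check (assuming NumPy; the Poisson(4) sample is illustrative) that the score $l_x(\theta)$ indeed vanishes at $\hat{\theta} = \bar{x}$:

```python
import numpy as np

# The Poisson score sum_i (-1 + x_i / theta) should be zero at theta_hat = x_bar.
rng = np.random.default_rng(5)
x = rng.poisson(lam=4.0, size=1_000)   # iid Poisson(4) sample

def score(theta):
    return np.sum(-1.0 + x / theta)

theta_hat = x.mean()
print(theta_hat, score(theta_hat))   # the score at the MLE is (numerically) zero
```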

Example

MLE for the pair $\mu,\sigma^2$ when $x = (x_1,x_2,..,x_n) $ iid $\sim \mathcal{N}(\mu, \sigma^2)$

$$\log L_x(\mu, \sigma^2) = ... = -\frac{n}{2} \log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2$$

We see that $\mu$ appears only in the last term. Moreover, whatever the optimal $\sigma$ is, the question of how to choose $\mu$ to make $- \frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2$ as big as possible does not depend on $\sigma$.

Therefore, the optimal $\mu$ simply maximizes the negative sum, or minimizes $\sum_i (x_i-\mu)^2$.

We found in chapter 2 that the answer is to take $\mu=\bar{x} = \displaystyle \frac{x_1 + x_2 + ...+x_n}{n}$; that value minimizes this sum of squares.

Now that we know this, we are back to the previous example with $\mu=\bar{x}$ in place of $\mu_0$, and therefore we find $$\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i-\bar{x})^2$$

$$\hat{\theta}(x) = \left(\hat{\mu}(x), \hat{\sigma}^2(x)\right) = \left(\bar{x}, \frac{1}{n}\sum_i (x_i - \bar{x})^2\right)$$

Moreover, $\bar{X}$ and $\hat{\sigma}^2(X)$ are independent, and $\frac{1}{\sigma^2}\sum_i (X_i - \bar{X})^2$ is a chi-squared random variable with $n-1$ degrees of freedom.
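
Finally, a minimal sketch (assuming NumPy; the true parameters $\mu=5$, $\sigma=2$ are illustrative) computing the joint MLE from simulated normal data:

```python
import numpy as np

# Joint MLE for (mu, sigma^2) from iid N(mu, sigma^2) data: (x_bar, (1/n) * sum (x_i - x_bar)^2).
rng = np.random.default_rng(6)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)   # note the 1/n, not the 1/(n-1) of the sample variance
print(mu_mle, sigma2_mle)                 # close to 5.0 and 4.0
```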