Let $X_1, X_2, \dots$ be i.i.d. random variables, all with mean $\mu$ and variance $\sigma^2$.
Let the sample mean be denoted by $\bar{X_n} = \frac{1}{n}\sum_{j=1}^{n}X_j$, where $n$ is the sample size.
What can we say about $\bar{X_n}$ when $n$ gets large?
Another way to think about the Law of Large Numbers is to see that
\begin{align} \lim_{n \to \infty} \, \bar{X_n} - \mu &= 0 \end{align}However, this only tells us about the limiting value; it says nothing about the distribution of $\bar{X_n}$ around $\mu$ for large $n$.
One way to study the distribution of $\bar{X_n}$ is to multiply $(\bar{X_n} - \mu)$ by some factor that itself grows to $\infty$, so that the shrinking difference is magnified rather than washed out.
Consider:
\begin{align} n^{?} \, (\bar{X_n} - \mu) \end{align}We could learn more about the distribution of $\bar{X_n}$ by choosing some positive power of $n$ and thinking about what happens as $n$ grows.
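One way to see which exponent works is to compute the variance of the scaled quantity for a general exponent $c > 0$ (a short side calculation): \begin{align} \operatorname{Var}\left( n^{c} \, (\bar{X_n} - \mu) \right) = n^{2c} \operatorname{Var}(\bar{X_n}) = n^{2c} \cdot \frac{\sigma^2}{n} = n^{2c - 1} \sigma^2 \end{align}If $c < \frac{1}{2}$ this variance shrinks to $0$ (the distribution collapses to a point, which is just the Law of Large Numbers again), and if $c > \frac{1}{2}$ it blows up to $\infty$; only $c = \frac{1}{2}$ gives a finite, nonzero limiting variance, namely $\sigma^2$.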
Taking the power to be $\frac{1}{2}$ and dividing by $\sigma$ gives the standardized version, which the Central Limit Theorem says converges in distribution to a standard Normal: \begin{align} &\quad \frac{\sum_{j=1}^{n} X_j - n\mu}{\sqrt{n} \, \sigma} = \frac{\sqrt{n} \, (\bar{X_n} - \mu)}{\sigma} \rightarrow \mathcal{N}(0,1) \end{align}
To prove this, assume (without loss of generality, since we can always standardize) that $\mu = 0$ and $\sigma^2 = 1$, and assume the MGF of the $X_j$ exists. Let $S_n = \sum_{j=1}^{n} X_j$; we want to show $M \left[ \frac{S_n}{\sqrt{n}} \right] \rightarrow M \left[ \mathcal{N}(0,1) \right]$, where $M[\cdot]$ denotes the moment generating function.
Here are some quick facts about Moment Generating Functions to keep in mind as we go along the proof: the MGF of a sum of independent random variables is the product of their MGFs, so if $M(t)$ is the MGF of $X_1$, then the MGF of $S_n$ is $M(t)^n$; an MGF (when it exists) uniquely determines the distribution, and if a sequence of MGFs converges to the MGF of some distribution, then the corresponding distributions converge to that distribution; and, under our assumptions, $M(0) = 1$, $M'(0) = E[X_1] = 0$, and $M''(0) = E[X_1^2] = 1$.
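Here is a sketch of the computation, using one standard route (substituting $y = \frac{1}{\sqrt{n}}$ and applying L'Hôpital's rule twice): the MGF of $\frac{S_n}{\sqrt{n}}$ is $M\!\left(\frac{t}{\sqrt{n}}\right)^n$, so it suffices to show that its log converges to $\frac{t^2}{2}$. \begin{align} \lim_{n \to \infty} n \log M\!\left(\tfrac{t}{\sqrt{n}}\right) &= \lim_{y \to 0} \frac{\log M(ty)}{y^2} \\ &= \lim_{y \to 0} \frac{t \, M'(ty)}{2y \, M(ty)} && \text{(L'Hôpital, since } \log M(0) = 0\text{)} \\ &= \frac{t}{2} \lim_{y \to 0} \frac{M'(ty)}{y} && \text{(since } M(ty) \to M(0) = 1\text{)} \\ &= \frac{t}{2} \lim_{y \to 0} t \, M''(ty) && \text{(L'Hôpital again, since } M'(0) = 0\text{)} \\ &= \frac{t^2}{2} M''(0) = \frac{t^2}{2} \end{align}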
But $\frac{t^2}{2}$ is the log of $e^{t^2 / 2}$, and that is the MGF of $\mathcal{N}(0,1)$. QED.
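As an optional sanity check, here is a small simulation sketch (assuming `numpy` and `scipy` are available; the choice of Exponential(1) draws, the seed, and the sample sizes are all just illustrative): standardize many sample means and compare the empirical probabilities to $\Phi$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n, reps = 1_000, 10_000      # sample size and number of repeated samples
mu, sigma = 1.0, 1.0         # Expo(1) has mean 1 and variance 1

# Draw `reps` samples of size `n` from an Exponential(1) distribution
# (any i.i.d. distribution with finite variance would work), then
# standardize each sample mean: sqrt(n) * (xbar - mu) / sigma.
samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

# The empirical CDF of the standardized means should be close to Phi.
for c in (-1.96, 0.0, 1.96):
    print(f"P(Z <= {c:5.2f}): empirical {np.mean(z <= c):.4f}  vs  Phi {norm.cdf(c):.4f}")
```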
Let $X \sim \operatorname{Bin}(n,p)$, and think of $X = \sum_{j=1}^{n} X_j$, where the $X_j \sim \operatorname{Bern}(p)$ are i.i.d.
By the Central Limit Theorem, we can approximate $X$ with a Normal distribution if $n$ is large enough, and if we standardize $X$ first.
\begin{align} P(a \leq X \leq b) &= P\left( \frac{a - np}{\sqrt{npq}} \leq \frac{X - np}{\sqrt{npq}} \leq \frac{b - np}{\sqrt{npq}} \right) \\ &\approx \Phi\left( \frac{b - np}{\sqrt{npq}} \right) - \Phi\left( \frac{a - np}{\sqrt{npq}} \right) \end{align}... where $q = 1 - p$. Contrast the above Normal approximation of a binomial with a Poisson approximation. With a Poisson approximation of a binomial, we assumed that $n$ is large, $p$ is small, and $np$ is moderate.
But in the case of a Normal approximation, while we do wish $n$ to be large, it is best if $p$ is close to $\frac{1}{2}$. Why?
Remember that the Normal distribution in the CLT is symmetric about $\mu = 0$. If $p$ is too far from $\frac{1}{2}$, then the Binomial distribution gets very skewed, and a symmetric bell curve is a poor fit. If $n$ is really, really large, then the CLT will still work no matter what $p$ might be, but you will need to be careful when $n$ is not that large.
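Here is a small sketch for experimenting with this (assuming `scipy` is available; the helper functions, the values of $n$ and $p$, and the interval $[a, b]$ are all illustrative choices, not from the notes): it compares the exact Binomial probability of an interval with the Normal approximation above, once with $p$ near $\frac{1}{2}$ and once with $p$ far from it.

```python
import math
from scipy.stats import binom, norm

def normal_approx(a, b, n, p):
    """Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq)), with q = 1 - p."""
    sd = math.sqrt(n * p * (1 - p))
    return norm.cdf((b - n * p) / sd) - norm.cdf((a - n * p) / sd)

def exact(a, b, n, p):
    """P(a <= X <= b) for X ~ Bin(n, p), from the Binomial CDF."""
    return binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)

n = 100
for p in (0.5, 0.05):
    mean = n * p
    a, b = int(mean - 5), int(mean + 5)   # an interval around the mean
    print(f"p={p}: exact {exact(a, b, n, p):.4f}, "
          f"Normal approx {normal_approx(a, b, n, p):.4f}")
```

Varying $n$, $p$, and the interval shows how the quality of the fit changes.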
Now in the above example, we are approximating a discrete distribution using a continuous one.
What would we do if we instead started with something like $P(X=a)$, where $a$ is some integer?
\begin{align} P(X=a) &= P(a - \epsilon \leq X \leq a + \epsilon) \end{align}... which holds for any $0 < \epsilon < 1$, since $X$ only takes integer values. This lets us look at a small range centered at $a$ instead of the single value $a$, so we can apply the Normal approximation as before. The usual choice is $\epsilon = \frac{1}{2}$, which is known as the continuity correction.
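Here is a short sketch of that correction in code (again assuming `scipy`; the particular $n$, $p$, and $a$ are just illustrative): it compares $P(X = a)$ from the Binomial PMF with $\Phi\!\left(\frac{a + \frac{1}{2} - np}{\sqrt{npq}}\right) - \Phi\!\left(\frac{a - \frac{1}{2} - np}{\sqrt{npq}}\right)$.

```python
import math
from scipy.stats import binom, norm

n, p = 100, 0.5
q = 1 - p
a = 50                         # an integer value near the mean np

sd = math.sqrt(n * p * q)

exact = binom.pmf(a, n, p)     # P(X = a) for X ~ Bin(n, p)
# Continuity correction: treat {X = a} as the interval [a - 1/2, a + 1/2].
approx = norm.cdf((a + 0.5 - n * p) / sd) - norm.cdf((a - 0.5 - n * p) / sd)

print(f"exact P(X = {a}) = {exact:.5f},  Normal approximation = {approx:.5f}")
```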