Lecture 27: Conditional expectation (cont.); taking out what's known; Adam's Law, Eve's Law; projection picture

Stat 110, Prof. Joe Blitzstein, Harvard University


Conditioning on Random Variables

Ex. $\mathbb{E}(Y|X)$ where $X \sim N(0,1)$

Let $X \sim N(0,1)$ and $Y=X^2$.

Then

\begin{align} \mathbb{E}(Y|X) &= \mathbb{E}(X^2|X) \\ &= X^2 \\ &= Y \end{align}
  • this is simple enough, and very clear.

But how about the other way 'round?

Ex. $\mathbb{E}(X|Y)$ where $X \sim N(0,1)$

\begin{align} \mathbb{E}(X|Y) &= \mathbb{E}(X|X^2) \\ &= 0 \end{align}

Why?

  • we don't know $X$, but what we are given is $X^2$
  • if we observe $x^2 = a$, then we know $x = \pm \sqrt{a}$
  • by symmetry, both $x=-\sqrt{a}$ and $x=\sqrt{a}$ are equally likely
  • hence the best estimate of $X$ would be... 0!
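A minimal simulation sketch (not from the lecture) of this symmetry argument: draw many values of $X \sim N(0,1)$, then average $X$ over the draws where $X^2$ lands near some fixed value $a$; the choice $a = 1.5$ and the bin width are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)        # X ~ N(0, 1)

a = 1.5                               # condition on X^2 being near a (arbitrary choice)
near_a = np.abs(x**2 - a) < 0.01      # crude conditioning by binning on X^2
print(np.mean(x[near_a]))             # close to 0: the +sqrt(a) and -sqrt(a) draws cancel by symmetry
```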

Ex. Breaking a stick (Uniform)

Say we have a stick of length 1, and we break it at a uniformly random point $X$. Then we take the piece of length $X$, and break that at a uniformly random point $Y$.

What is $\mathbb{E}(Y|X)$?

  • $X \sim \operatorname{Unif}(0,1)$
  • $Y|X \sim \operatorname{Unif}(0,X)$
\begin{align} \mathbb{E}(Y|X=x) &= \frac{x}{2} \\ \\ \Rightarrow \mathbb{E}(Y|X) &= \frac{X}{2} \\ \\ \mathbb{E} \left( \mathbb{E}(Y|X) \right) &= \mathbb{E}\left(\frac{X}{2}\right) = \frac{1}{4} = \mathbb{E}(Y) \end{align}
  • the expected length $\mathbb{E}(Y) = \frac{1}{4}$ is pretty intuitive: take a stick, break it in half (on average), then break that half in half again (on average)
  • we will get more into that $\mathbb{E} \left( \mathbb{E}(Y|X) \right) = \mathbb{E}(Y)$ in a bit
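Here is a quick simulation sketch (assuming exactly the uniform model above) checking both $\mathbb{E}(Y) = \frac{1}{4}$ and $\mathbb{E}(Y|X=x) = \frac{x}{2}$ at one illustrative value of $x$ near 0.6.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10**6)                   # X ~ Unif(0, 1): first break point
y = rng.uniform(0, x)                          # Y | X ~ Unif(0, X): second break point

print(np.mean(y))                              # about 0.25 = E(Y)
print(np.mean(y[(0.59 < x) & (x < 0.61)]))     # about 0.30 = x/2 for x near 0.6
```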

Useful Properties

Here are some useful properties related to conditional expectation.

\begin{align} &\text{(1) } \mathbb{E}\left( h(X) \, Y|X \right) = h(X) \, \mathbb{E}(Y|X) &\text{"taking out what is known"} \\\\ &\text{(2) } \mathbb{E}(Y|X) = \mathbb{E}(Y) &\text{if } X,Y \text{ are independent} \\\\ &\text{(3) } \mathbb{E}\left( \mathbb{E}(Y|X) \right) = \mathbb{E}(Y) &\text{Iterated Expectation, or Adam's Law} \\\\ &\text{(4) } \mathbb{E}\left( \left(Y - \mathbb{E}(Y|X)\right) h(X) \right) = 0 &\text{residual is uncorrelated with } h(X) \\\\ &\text{(5) } \operatorname{Var}(Y) = \mathbb{E}\left( \operatorname{Var}(Y|X) \right) + \operatorname{Var}\left( \mathbb{E}(Y|X) \right) &\text{Eve's Law} \end{align}
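As a quick sanity check (not part of the lecture), here is a small simulation of properties (3) and (5), reusing the stick-breaking example, where $\mathbb{E}(Y|X) = X/2$ and $\operatorname{Var}(Y|X) = X^2/12$ are known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10**6)                        # X ~ Unif(0, 1)
y = rng.uniform(0, x)                               # Y | X ~ Unif(0, X)
g = x / 2                                           # g(X) = E(Y | X)

print(np.mean(g), np.mean(y))                       # Adam's Law: E(E(Y|X)) ~ E(Y) ~ 0.25
print(np.mean(x**2 / 12) + np.var(g), np.var(y))    # Eve's Law: E(Var(Y|X)) + Var(E(Y|X)) ~ Var(Y)
```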

Proof of Property 4

Here is a pictorial explanation to aid your intuition.

A vector could be anything (a point, a function, a cow); as long as it satisfies the axioms of a vector space, anything can be treated as a vector.

  • The "plane" in the picture represents all of the possible functions of $X$. As such, it necessarily passes through the origin (the zero function is a function of $X$).
  • Conditional expectation is simply projecting $Y$ onto the plane of all functions of $X$.
  • $\mathbb{E}(Y|X)$ is the point in that plane that is closest to $Y$.
  • If $Y$ is already a function of $X$, then $Y$ lies in the plane, and the projection is just $Y$ itself.
  • If $Y$ is not a function of $X$, then the vector from the projection $\mathbb{E}(Y|X)$ up to $Y$ is the residual $Y - \mathbb{E}(Y|X)$, which is orthogonal to the plane.
  • in this picture, we implicitly assume that all the random variables involved have finite variance.

So let us show that the residual $Y - \mathbb{E}(Y|X)$ is uncorrelated with any function $h(X)$:

\begin{align} \operatorname{Cov}\left( Y - \mathbb{E}(Y|X) , h(X) \right) &= \mathbb{E}\left( (Y - \mathbb{E}(Y|X)) \, h(X) \right) - \mathbb{E}\left(Y-\mathbb{E}(Y|X)\right) \, \mathbb{E}\left(h(X)\right) \\ &= \mathbb{E}\left( (Y - \mathbb{E}(Y|X)) \, h(X) \right) - \left[\mathbb{E}(Y) - \mathbb{E}(Y) \right] \, \mathbb{E}\left(h(X)\right) &\text{linearity, Adam's Law} \\ &= \mathbb{E}\left( (Y - \mathbb{E}(Y|X)) \, h(X) \right) - 0 \\ &= \mathbb{E}\left( (Y - \mathbb{E}(Y|X)) \, h(X) \right)\\ &= \mathbb{E}\left( Y \, h(X) \right) - \mathbb{E}\left( \mathbb{E}(Y|X) \, h(X) \right) \\ &= \mathbb{E}\left( Y \, h(X) \right) - \mathbb{E}\left( \mathbb{E}(Y \, h(X)|X) \right) & \text{if we can take it out, we can put it back} \\ &= \mathbb{E}\left( Y \, h(X) \right) - \mathbb{E}\left( Y \, h(X) \right) & \text{Adam's Law} \\ &= 0 &\quad \blacksquare \end{align}

And so the residual $Y - \mathbb{E}(Y|X)$ is indeed uncorrelated with any function $h(X)$.
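A simulation sketch of Property (4), again with the stick-breaking example (so the residual $Y - \mathbb{E}(Y|X) = Y - X/2$ is available in closed form); the function $h(x) = \sin x$ is just an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10**6)
y = rng.uniform(0, x)
residual = y - x / 2                          # Y - E(Y|X), since E(Y|X) = X/2 here

print(np.cov(residual, np.sin(x))[0, 1])      # approximately 0: residual uncorrelated with h(X)
```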

Proof of Property 3

Returning to Property 3, let's do the discrete case (but the continuous case is analogous).

Since $\mathbb{E}(Y|X)$ is just a function of $X$, we can call it by another name, say $g(X)$.

\begin{align} \mathbb{E}\left( \mathbb{E}(Y|X) \right) &= \mathbb{E}\left( g(X) \right) \\ &= \sum_x g(x) \, P(X=x) &\text{by LOTUS, definition} \\ &= \sum_x \mathbb{E}(Y|X=x) \, P(X=x) \\ &= \sum_x \left[ \sum_y y \, P(Y=y|X=x) \right] P(X=x) \\ &= \sum_y \sum_x y \, P(Y=y, X=x) \\ &= \sum_y y \sum_x P(Y=y, X=x) \\ &= \sum_y y P(Y=y) \\ &= \mathbb{E}(Y) &\quad \blacksquare \end{align}
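The discrete argument can also be checked exactly on a tiny made-up joint PMF (the numbers below are arbitrary, chosen only to sum to 1).

```python
import numpy as np

pmf = np.array([[0.10, 0.20, 0.10],    # rows: x = 0, 1; columns: y = 1, 2, 3
                [0.25, 0.05, 0.30]])
ys = np.array([1.0, 2.0, 3.0])

p_x = pmf.sum(axis=1)                            # P(X = x)
e_y_given_x = (pmf * ys).sum(axis=1) / p_x       # E(Y | X = x) for each x

print((e_y_given_x * p_x).sum())                 # E(E(Y|X))
print((pmf.sum(axis=0) * ys).sum())              # E(Y); the two agree exactly (2.05)
```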

Conditional Variance

Conditional variance is defined as follows:

\begin{align} \operatorname{Var}(Y|X) &= \mathbb{E}(Y^2|X) - \left(\mathbb{E}(Y|X)\right)^2 &\text{or alternately} \\\\ &= \mathbb{E}\left[ (Y - \mathbb{E}(Y|X))^2 \,\big|\, X \right] \end{align}

Proof

Let $g(X) = \mathbb{E}(Y|X)$; this will make things a bit clearer.

\begin{align} \operatorname{Var}(Y|X) &= \mathbb{E}\left[ (Y - g(X))^2 \,|\, X \right] \\ &= \mathbb{E}\left[ Y^2 - 2Y \, g(X) + g(X)^2 \,|\, X \right] \\ &= \mathbb{E}(Y^2|X) - 2\,\mathbb{E}(Y\,g(X)|X) + \mathbb{E}(g(X)^2 \,|\, X) \\ &= \mathbb{E}(Y^2|X) - 2 \, g(X) \, \mathbb{E}(Y|X) + g(X)^2 &\text{taking out what is known} \\ &= \mathbb{E}(Y^2|X) - 2 \, g(X) \, g(X) + g(X)^2 \\ &= \mathbb{E}(Y^2|X) - g(X)^2 \\ &= \mathbb{E}(Y^2|X) - \left(\mathbb{E}(Y|X)\right)^2 &\quad \blacksquare \\ \end{align}
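To make the definition concrete: $\operatorname{Var}(Y|X)$ is itself a random variable, a function of $X$. Here is a small sketch with the stick-breaking example (where $\operatorname{Var}(Y|X) = X^2/12$ exactly), estimating it by binning on $X$; the bin locations and widths are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10**6)
y = rng.uniform(0, x)

for lo in (0.2, 0.5, 0.8):
    near = (lo < x) & (x < lo + 0.02)      # crude conditioning on X near lo
    print(np.var(y[near]), lo**2 / 12)     # the two columns roughly agree
```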

Proof of Property 5

Eve's Law states that $\operatorname{Var}(Y) = \mathbb{E}\left( \operatorname{Var}(Y|X) \right) + \operatorname{Var}\left( \mathbb{E}(Y|X) \right)$.

Graphically, Eve's Law splits the variance of $Y$ into the average variance within sub-groups, $\mathbb{E}\left( \operatorname{Var}(Y|X) \right)$, plus the variance between the group means, $\operatorname{Var}\left( \mathbb{E}(Y|X) \right)$.

In order to prove Eve's Law, we will do the following to make things simpler:

  • let $g(X) = \mathbb{E}(Y|X)$
  • by Adam's Law, $\mathbb{E}(g(X))=\mathbb{E}(Y)$

Then:

\begin{align} \mathbb{E}\left( \operatorname{Var}(Y|X) \right) &= \mathbb{E}\left[ \mathbb{E}(Y^2|X) - \left(\mathbb{E}(Y|X)\right)^2 \right] &\text{for the first part} \\ &= \mathbb{E}(Y^2) - \mathbb{E}\left(g(X)^2\right) &\text{Adam's Law} \\ \\ \operatorname{Var}\left( \mathbb{E}(Y|X) \right) &= \operatorname{Var}\left(g(X)\right) &\text{for the second part} \\ &= \mathbb{E}\left(g(X)^2\right) - \left(\mathbb{E}\,g(X)\right)^2 \\ \\ \mathbb{E}\left( \operatorname{Var}(Y|X) \right) + \operatorname{Var}\left( \mathbb{E}(Y|X) \right) &= \mathbb{E}(Y^2) - \mathbb{E}\left(g(X)^2\right) + \mathbb{E}\left(g(X)^2\right) - \left(\mathbb{E}\,g(X)\right)^2 &\text{adding the two parts} \\ &= \mathbb{E}(Y^2) - \left(\mathbb{E}(Y)\right)^2 &\text{since } \mathbb{E}\,g(X) = \mathbb{E}(Y) \\ &= \operatorname{Var}(Y) &\quad \blacksquare \end{align}
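To see the within/between reading of Eve's Law numerically, here is a sketch (not from the lecture) with a discrete $X$: three "groups" with made-up means and spreads, with the group picked uniformly at random.

```python
import numpy as np

rng = np.random.default_rng(0)
mus    = np.array([0.0, 2.0, 5.0])      # group means (arbitrary choices)
sigmas = np.array([1.0, 0.5, 2.0])      # group standard deviations (arbitrary choices)

x = rng.integers(0, 3, 10**6)           # pick a group uniformly at random
y = rng.normal(mus[x], sigmas[x])       # Y | X = j ~ N(mu_j, sigma_j^2)

within  = np.mean(sigmas[x]**2)         # E(Var(Y|X)): average variance within groups
between = np.var(mus[x])                # Var(E(Y|X)): variance between the group means
print(np.var(y), within + between)      # Eve's Law: the two agree
```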

Example: Epidemiology and Conditional Variance

Suppose we are studying infectious disease in a certain state. Due to circumstances (lack of resources and/or time), rather than taking samples across the state, we will randomly select a city and study a random sample of $n$ people there.

Let $X$ be the number of infected people in the sample.

Let $Q$ be the proportion of infected people in the randomly selected city. Keep in mind that different cities will have different proportions, hence $Q$ is a random variable.

Find $\mathbb{E}(X)$ and $\operatorname{Var}(X)$.

But to do this, we need to make an assumption about the distribution of $Q$. Given its flexibility, computational convenience and the fact that it is the conjugate prior to the binomial distribution, we will assume $Q \sim \operatorname{Beta}(a,b)$.

It should be clear then that we are assuming that $X|Q \sim \operatorname{Bin}(n, Q)$. Strictly speaking, since we are sampling without replacement, the exact distribution would be Hypergeometric; but as long as $n$ is small compared to the city's population, the Binomial is an excellent approximation, and it pairs conveniently with the Beta distribution.

Remember that conditioning is the soul of statistics, and so we will condition on the proportion of infection $Q$ of our randomly selected city.

$\mathbb{E}(X)$ via Adam's Law

Thinking conditionally, we have:

\begin{align} \mathbb{E}(X) &= \mathbb{E}\left( \mathbb{E}(X|Q) \right) \\ &= \mathbb{E}( nQ) &\text{expected value of }\operatorname{Bin}(n,Q) \\ &= n \, \mathbb{E}(Q) \\ &= n \, \frac{a}{a+b} &\text{expected value of }\operatorname{Beta}(a,b) \end{align}
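A simulation sketch of this setup (the values $n = 50$, $a = 2$, $b = 8$ are just illustrative choices, not from the lecture) confirms the Adam's Law answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 50, 2.0, 8.0                  # illustrative values
q = rng.beta(a, b, 10**6)               # Q ~ Beta(a, b): one proportion per simulated city
x = rng.binomial(n, q)                  # X | Q ~ Bin(n, Q)

print(np.mean(x), n * a / (a + b))      # both about 10
```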

$\operatorname{Var}(X)$ via Eve's Law

Again thinking conditionally, we have:

\begin{align} \operatorname{Var}(X) &= \mathbb{E}\left( \operatorname{Var}(X|Q) \right) + \operatorname{Var}\left( \mathbb{E}(X|Q) \right) &\text{by Eve's Law} \\ \\ \mathbb{E}\left( \operatorname{Var}(X|Q) \right) &= \mathbb{E}\left( n \, Q \, (1-Q) \right) &\text{for the first part} \\ &= n \, \mathbb{E}\left( Q \, (1-Q) \right) \\ &= n \, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \, \int_{0}^{1} q \, (1-q) \, q^{a-1} \, (1-q)^{b-1} \, dq &\text{LOTUS} \\ &= n \, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \, \int_{0}^{1} q^{a} \, (1-q)^{b} \, dq \\ &= n \, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \, \frac{\Gamma(a+1)\Gamma(b+1)}{\Gamma(a+b+2)} &\text{that is a } \operatorname{Beta}(a+1,b+1) \text{ integral} \\ &= n \, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \, \frac{a\,\Gamma(a)\,b\,\Gamma(b)}{(a+b+1)(a+b)\Gamma(a+b)} \\ &= \frac{n \, a \, b}{(a+b+1)(a+b)} \\ \\ \operatorname{Var}\left( \mathbb{E}(X|Q) \right) &= \operatorname{Var}(n \, Q) &\text{for the second part} \\ &= n^2 \, \operatorname{Var}(Q) \\ &= n^2 \, \frac{\mu(1-\mu)}{a+b+1} &\text{where } \mu = \frac{a}{a+b} \\ \\ \Rightarrow \operatorname{Var}(X) &= \frac{n \, a \, b}{(a+b+1)(a+b)} + n^2 \, \frac{\mu(1-\mu)}{a+b+1} &\text{adding the two parts} \end{align}
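Continuing the same sketch (same illustrative $n$, $a$, $b$ as above), we can compare the simulated $\operatorname{Var}(X)$ with the sum of the two pieces from Eve's Law.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 50, 2.0, 8.0                               # same illustrative values as above
q = rng.beta(a, b, 10**6)
x = rng.binomial(n, q)

e_var = n * a * b / ((a + b + 1) * (a + b))          # E(Var(X|Q))
mu = a / (a + b)
var_e = n**2 * mu * (1 - mu) / (a + b + 1)           # Var(E(X|Q)) = n^2 Var(Q)
print(np.var(x), e_var + var_e)                      # both about 43.6 with these values
```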