Let $X$ and $Y$ be two discrete random variables, with probability mass functions: $P_X$ and $P_Y$. Then, the conditional probability mass function of $Y$ given $X$ is the following:
$$P_{Y|X} (y|x) = \mathbf{P}[Y = y | X = x] = \frac{\mathbf{P}[Y=y \text{ and }X = x]}{\mathbf{P}[X=x]} = \frac{P_{X,Y}(x,y)}{P_X(x)}$$where $P_{X,Y}(x,y)$ is the joint-probability mass function of $X,Y$.
Note: It is easy to find the joint density if we know the conditional density and marginal density.
Indeed: $f_{X,Y}(x,y) = f_X(x) f_{Y|X}(y|x)$
For now, let's concentrate on the discrete case.
Setup: Let $X_1\sim Poi(\lambda_1)$ and $X_2\sim Poi(\lambda_2)$ be independent, and let $T = X_1+X_2$, so that $T\sim Poi(\lambda_1+\lambda_2)$. Question: what is the conditional distribution of $X_1$ given $T=t$? This question applies to any scenario where we observe the total number of events but we don't know exactly how many events fall into each of several categories.
We compute $\displaystyle P_{X_1|T}(x_1 | t) = \mathbf{P}[X_1 = x_1 | T = t]$
$$= \frac{P_{X_1,T}(x_1,t)}{P_T(t)}$$A powerful little trick: on the event $X_1 = x_1$, we can replace $X_1 + X_2 = t$ with $X_2 = t - x_1$
$$=\frac{\mathbf{P}[X_1 = x_1 \text{ AND } X_2 = t - x_1]}{\displaystyle e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^t}{t!}}$$And remember that $X_1$ and $X_2$ are independent, so we write:
$$=\frac{\mathbf{P}[X_1 = x_1]\ \times \mathbf{P}[X_2 = t - x_1]}{\displaystyle e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^t}{t!}}$$And we know the probability mass functions of each of $X_1$ and $X_2$, which is Poisson:
$$=\frac{ \displaystyle e^{-\lambda_1} \frac{\lambda_1^{x_1}}{x_1!} \times e^{-\lambda_2} \frac{\lambda_2^{t-x_1}}{(t-x_1)!}}{\displaystyle e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^t}{t!}}$$$$=\frac{t!}{x_1! (t-x_1)!} \times \frac{\lambda_1^{x_1} \lambda_2^{t-x_1}}{(\lambda_1+\lambda_2)^t}$$The last result looks very much like a binomial distribution. Let's simplify it further:
$$\left(\begin{array}{c}t\\x_1\end{array}\right) \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{x_1} \left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{t-x_1}$$We can see that $p=\displaystyle \frac{\lambda_1}{\lambda_1 + \lambda_2}$ and $1-p=\displaystyle \frac{\lambda_2}{\lambda_1 + \lambda_2}$
We proved that $X_1 \text{ given } T=t ~~ \sim Binomial(n=t, p=\frac{\lambda_1}{\lambda_1 + \lambda_2})$
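As a sanity check, here is a small Monte Carlo sketch of this result; the values $\lambda_1=2$, $\lambda_2=3$, $t=4$ and the seed are arbitrary illustration choices, not from the notes.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
lam1, lam2, t = 2.0, 3.0, 4        # illustration values, not from the notes
n_sim = 1_000_000

x1 = rng.poisson(lam1, size=n_sim)
x2 = rng.poisson(lam2, size=n_sim)

# Condition on the total: keep only the samples where X1 + X2 == t
cond = x1[x1 + x2 == t]
emp = np.bincount(cond, minlength=t + 1)[: t + 1] / len(cond)

p = lam1 / (lam1 + lam2)
theo = [comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(t + 1)]

print(np.round(emp, 3))    # empirical conditional PMF of X1 given T = t
print(np.round(theo, 3))   # Binomial(t, p) PMF with p = lam1/(lam1 + lam2)
```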
Answer (Poisson thinning): if $T\sim Poi(\lambda)$ and, given $T$, $X\sim Binom(T, p)$, then unconditionally $X\sim Poi(p\lambda)$.
The reason that we say $X$ has a mixture distribution is that one of the parameters in its (conditional) distribution is a random variable. Symbolically, we can write: $$X\sim Binom(T, p)\text{ where }T\sim Poi(\lambda)$$
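A quick simulation sketch of this mixture (illustrative values $\lambda=5$, $p=0.3$, not from the notes): the simulated $X$ has mean and variance both close to $p\lambda=1.5$, as a $Poi(p\lambda)$ should.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p = 5.0, 0.3            # illustration values, not from the notes

T = rng.poisson(lam, size=1_000_000)
X = rng.binomial(T, p)       # X | T ~ Binomial(T, p)

# For a Poisson(p * lam), both mean and variance equal p * lam = 1.5
print(X.mean(), X.var())
```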
Construction: A Markov chain is a stochastic process $\{X(t): ~ t\in \mathbb{N}\}$. We want to define the joint distribution of all the $X(t)$'s, for all $t$, simultaneously.
Prescription: We prescribe the conditional distribution of $X(t+1)$ given $X(t)$, and we insist that this conditional distribution is the same as that of $X(t+1)$ given all the $X(s)$'s for $s\le t$ (the Markov property).
Notation: the $X(t)$'s all take values in the "state space" $I=\{x_1,x_2,...x_m\}$
We only need to prescribe these values:
$$\mathbf{P}[X(t+1) = x_j | X(t)=x_i] = P_{ij}$$
These $P_{ij}$ are called the transition probabilities. They are arranged in a matrix $\mathbf{P}$, called the transition matrix.
Let $Y_0,Y_1,Y_2,...,Y_n,...$ be i.i.d., with $\displaystyle Y_i = \left\{\begin{array}{lrr}1 & \text{with probability} & p\\-1 & \text{with probability} & 1-p\end{array}\right.$
Define $X(0)=0$ and $\forall t\ge 0, X(t+1) = X(t) + Y_{t}$. Therefore, we notice that $X(t) = Y_0+Y_1 + Y_2 + ... +Y_{t-1}$
From this definition, we see that $X(t)$ is a Markov chain.
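A short simulation sketch of this random walk; the choice $p=0.5$, the number of steps, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_steps = 0.5, 100                      # arbitrary illustration values

# Y_i = +1 with probability p, -1 with probability 1-p
Y = rng.choice([1, -1], size=n_steps, p=[p, 1 - p])

# X(0) = 0 and X(t) = Y_0 + Y_1 + ... + Y_{t-1}
X = np.concatenate(([0], np.cumsum(Y)))

print(X[:10])   # first few states of the chain
```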
[Most discrete results hold for the continuous case too.]
We understand conditional probability mass functions (PMFs). (Reminder: in the discrete case, $P_{Y|X}(y|x) = \mathbf{P}[Y = y | X=x]$; in the continuous case, the conditional density is $\displaystyle f_{Y|X}(y|x)=\frac{f_{X,Y}(x,y)}{f_X(x)}$.)
Now, define the conditional expectation:
$$\mathbf{E}[Y | X=x] = \left\{\begin{array}{lrr}\displaystyle \sum_y y~P_{Y|X}(y|x) & & \text{discrete}\\\displaystyle \int_{-\infty}^\infty y f_{Y|X}(y|x) dy && \text{continuous}\end{array}\right.$$Note that $\mathbf{E}[Y|X=x]$ is a non-random function of $x$; it is called the regression function of $Y$ on $X$ and is denoted $g(x)$.
Very important "meta-theorem" (definition): the notation $\mathbf{E}[Y|X]$ makes sense as $g(X)$, which is a random variable (a function of $X$). In particular (tower property), $\mathbf{E}[\mathbf{E}[Y|X]] = \mathbf{E}[g(X)] = \mathbf{E}[Y]$.
More generally, if $h$ is a non-random function, then $\mathbf{E}[h(X)Y] = \mathbf{E}[h(X) g(X)]$
Idea: when conditioning on $X$, any factor involving $X$ only pulls out like a constant from the expectation.
Note: we use the last idea to prove the previous result: $$\mathbf{E}[h(X)Y] = \mathbf{E}\left[\mathbf{E}[h(X)Y~|~X]\right] \quad\text{(tower property)} \\ = \mathbf{E}\left[h(X) ~\mathbf{E}[Y|X]\right] \quad\text{(pulling out the factor }h(X)\text{)} \\ = \mathbf{E}\left[h(X)~g(X)\right] \quad\text{(meta-theorem)}$$
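To make this concrete, here is a tiny numerical sketch with a made-up joint PMF (the table and the choice $h(x)=x^2$ are purely illustrative): it computes $g(x)=\mathbf{E}[Y|X=x]$ and checks that $\mathbf{E}[h(X)Y]=\mathbf{E}[h(X)g(X)]$.

```python
import numpy as np

# Made-up joint PMF P[X=x, Y=y]; rows index x in xs, columns index y in ys
xs = np.array([0, 1, 2])
ys = np.array([1, 2])
P = np.array([[0.10, 0.20],
              [0.25, 0.15],
              [0.20, 0.10]])        # entries sum to 1

P_X = P.sum(axis=1)                  # marginal PMF of X
g = (P * ys).sum(axis=1) / P_X       # g(x) = E[Y | X = x]

h = xs**2                            # an arbitrary non-random function h(x) = x^2

lhs = (P * (h[:, None] * ys)).sum()  # E[h(X) Y]
rhs = (P_X * h * g).sum()            # E[h(X) g(X)]
print(lhs, rhs)                      # the two values agree
```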
Useful notation: $v(x) = \mathbf{Var}[Y|X=x]=\mathbf{E}[(Y - \mathbf{E}[Y|X=x])^2~|~X=x] = \cdots =\mathbf{E}[Y^2|X=x] - (\mathbf{E}[Y|X=x])^2$
The second notation above is preferable, because $v$ and $g$ are not universal notations.
$$\begin{array}{ccccc}\mathbf{Var}[Y] &=& \mathbf{E}[\mathbf{Var}[Y|X]] &+& \mathbf{Var}[\mathbf{E}[Y|X]]\\ \textbf{Unconditional Variance}& =&\textbf{Expectation of Conditional Variance} &+& \textbf{Variance of Conditional Expectation}\end{array}$$
Example: A car dealer sells $N$ cars per day, where $\displaystyle N =\left\{\begin{array}{lrr}1 &\text{with probability}&0.3\\2&&0.5\\3&&0.2 \end{array}\right.$
Given that $N=i$, the profit from the $j$th car sold is $X_j$, for $j=1,\dots,i$, where the $X_j$'s are i.i.d. with distribution $\displaystyle \left\{\begin{array}{lrr}1 &\text{with probability}&0.4\\2&&0.3\\3&&0.2\\4&&0.1 \end{array}\right.$
Question: What is the total profit $T$? Answer: $T = \sum_{j=1}^N X_j$
Question: What is the expected profit?
For the calculation of the inner $\mathbf{E}$ on the right-hand side, we can treat $N$ as fixed.
$$\mathbf{E}[T] = \mathbf{E}\left[\mathbf{E}[T|N]\right] = \mathbf{E}\left[\mathbf{E}[\sum_{j=1}^N X_j|N]\right] \\ \longrightarrow \text{linearity of expectation and identical distribution of the }X_j\text{'s} \rightarrow = \mathbf{E}\left[\mathbf{E}[N ~ X_1|N]\right] \\ \longrightarrow \text{ pull out }N \rightarrow = \mathbf{E}\left[N ~\mathbf{E}[X_1|N]\right]$$Now, we assume that $X_1,X_2,\dots$ are independent of $N$.
$$\mathbf{E}[T] = .. = \mathbf{E}\left[N ~\mathbf{E}[X_1|N]\right] \\ \rightarrow \text{independence of }N\text{ and the }X_j\text{'s} \rightarrow \\ = \mathbf{E}\left[N ~\mathbf{E}[X_1]\right] =\mathbf{E}[N]~\mathbf{E}[X_1] = 1.9\times 2 = 3.8$$Using the theorem on conditional variance:
$$\mathbf{Var}[T] = \mathbf{E}[v(N)] + \mathbf{Var}[g(N)]$$The unconditional variance is the sum of the expected conditional variance and the variance of the conditional expectation.
$$\mathbf{Var}[T]=\mathbf{E}[\mathbf{Var}[T~|~N]] + \mathbf{Var}[\mathbf{E}[T~|~N]] = \mathbf{E}[N~\mathbf{Var}[X_1]] + \mathbf{Var}[N~\mathbf{E}[X_1]] = \mathbf{E}[N]~\mathbf{Var}[X_1] + (\mathbf{E}[X_1])^2~\mathbf{Var}[N]$$And therefore, since $\mathbf{Var}[X_1]=1.0$ and $\mathbf{Var}[N]=0.49$, we get $1.9\times 1.0 + 2^2\times 0.49 = 3.86$.
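A Monte Carlo sketch of this car-dealer example (seed and sample size are arbitrary), checking $\mathbf{E}[T]=3.8$ and $\mathbf{Var}[T]=3.86$:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sim = 1_000_000

# N = number of cars sold per day
N = rng.choice([1, 2, 3], size=n_sim, p=[0.3, 0.5, 0.2])

# Per-car profits X_j, iid on {1,2,3,4} with probabilities 0.4, 0.3, 0.2, 0.1.
# Draw 3 profits per day (the maximum possible N) and sum only the first N of them.
X = rng.choice([1, 2, 3, 4], size=(n_sim, 3), p=[0.4, 0.3, 0.2, 0.1])
mask = np.arange(3) < N[:, None]            # keep column j iff j < N
T = (X * mask).sum(axis=1)                  # total daily profit

print(T.mean(), T.var())   # should be close to 3.8 and 3.86
```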
This also proves the following theorem:
Let $N,X_1,X_2,\dots$ be independent random variables. Assume $N$ is nonnegative integer-valued. No assumptions on the $X_j$'s except that they are i.i.d. with the same distribution as some $X$.
Let $T = \sum_{j=1}^NX_j$.
Then $$\mathbf{E}[T] = \mathbf{E}[N]~\mathbf{E}[X]$$
And $$\mathbf{Var}[T] = \mathbf{E}[N]\mathbf{Var}[X] + \mathbf{Var}[N](\mathbf{E}[X])^2$$
We also get:
$$\mathbf{Cov}(N,T) = \mathbf{E}[X]\mathbf{Var}[N]$$$$\mathbf{Corr}(N,T) = \frac{1}{\sqrt{1 + \theta}} \text{ where } \theta= \frac{\mathbf{E}[N]~\mathbf{Var}[X]}{(\mathbf{E}[X])^2 ~\mathbf{Var}[N]}$$
The total population is $n=n_1+n_2+\dots+n_k$, where $k$ subpopulations of sizes $n_1,n_2,...,n_k$ have been identified. The data points are $x_{ij}$, where $i=1,...,k$ and $j$ runs over all data points from the $i$th subpopulation.
Let $Y$ represent a randomly selected data point.
Let $I=\text{the number of the subpopulation to which }Y\text{ belongs}$
Thus, $\mathbf{E}[Y|I=i] = \sum_{j}x_{ij} \frac{1}{n_i} = \bar{x}_i$
We assume that $\mathbf{P}[I = i] = \frac{n_i}{n}$
$$\Longrightarrow \mathbf{E}[Y] = \sum_{i=1}^k \frac{n_i}{n} \bar{x}_i$$Also, of course, $$\mathbf{Var}[Y] = \frac{1}{n}\sum_i\sum_j (x_{ij}-\bar{x})^2$$ where $\bar{x}$ is the population average $=\frac{1}{n}\sum_i\sum_jx_{ij}$
However, conditioning on $I$, we also have the following, by the theorem on conditional variance: $\mathbf{Var}[Y] = \mathbf{E}[v(I)] + \mathbf{Var}[g(I)] = \sum_{i=1}^k \frac{n_i}{n}~v(i) + \sum_{i=1}^k \frac{n_i}{n}(g(i)-\bar{x})^2$, where $g(i)=\bar{x}_i$ and $v(i) = \mathbf{Var}[Y|I=i] = \frac{1}{n_i}\sum_j (x_{ij}-\bar{x}_i)^2$.
Now, equate the two formulae for $\mathbf{Var}[Y]$ and multiply by $n$: $$\sum_i\sum_j (x_{ij}-\bar{x})^2 = \sum_i\sum_j (x_{ij}-\bar{x}_i)^2 + \sum_i n_i~(\bar{x}_i-\bar{x})^2$$This is the decomposition of the total sum of squares into within-subpopulation and between-subpopulation sums of squares.
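A quick numeric sketch of this identity on made-up data (three arbitrary groups with arbitrary sizes and means):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up subpopulations of sizes 30, 50, 20 (illustration only)
groups = [rng.normal(loc=m, scale=1.0, size=sz)
          for m, sz in [(0.0, 30), (2.0, 50), (5.0, 20)]]

all_x = np.concatenate(groups)
xbar = all_x.mean()

total_ss = ((all_x - xbar) ** 2).sum()
within_ss = sum(((g - g.mean()) ** 2).sum() for g in groups)
between_ss = sum(len(g) * (g.mean() - xbar) ** 2 for g in groups)

print(total_ss, within_ss + between_ss)   # the two numbers coincide
```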
Let $X$ be a random variable with a Geometric distribution with parameter $p$ (so $\mathbf{E}[X]=\frac{1}{p}$ and $\mathbf{Var}[X]=\frac{1-p}{p^2}$).
Let $D$ be a random variable which, given $X$, has a negative binomial distribution with parameter $p$ and $n=X$.
What are the expectation and variance of $Y=D+X$?
$D=Y-X$. We have that $Y=D+X$ and, conditioned on $X$, $D$ has a negative binomial distribution: $D~|~X\sim NegBinom(p,X)$
We also have $X\sim Geometric(p)$
We compute $\displaystyle \mathbf{E}[Y] = \mathbf{E}[X+D] = \mathbf{E}[X] + \mathbf{E}[\mathbf{E}[D|X]] = \frac{1}{p} + \mathbf{E}\left[\frac{X}{p}\right] = \frac{1}{p} + \frac{1}{p^2}$
For variance: $\mathbf{Var}[Y] = \mathbf{Var}[(D + X)] = \mathbf{E}[\mathbf{Var}[(D+X)|X]] + \mathbf{Var}[\mathbf{E}[(D+X)|X]]$
$\displaystyle \mathbf{E}[\mathbf{Var}[D+X|X]] = \mathbf{E}[\mathbf{Var}[D|X]] \\= \displaystyle \mathbf{E}[\frac{X(1-p)}{p^2}] = \frac{1-p}{p^3}$
$\displaystyle \mathbf{Var}[\mathbf{E}[D+X|X]] = \mathbf{Var}[\mathbf{E}[D|X] +\mathbf{E}[X|X]] \\= \mathbf{Var}[\mathbf{E}[D|X] ~+~ X] = \displaystyle \mathbf{Var}[\frac{X}{p} + X] = \left(\frac{1+p}{p}\right)^2 \mathbf{Var}[X] = \frac{(1+p)^2 (1-p)}{p^4}$
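A simulation sketch with an arbitrary illustration value $p=0.5$, for which the formulas above give $\mathbf{E}[Y]=\frac{1}{p}+\frac{1}{p^2}=6$ and $\mathbf{Var}[Y]=\frac{1-p}{p^3}+\frac{(1+p)^2(1-p)}{p^4}=4+18=22$. The code assumes the trial-counting conventions used above ($\mathbf{E}[X]=1/p$, $\mathbf{E}[D|X]=X/p$); numpy's negative binomial counts failures, so $X$ is added back.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_sim = 0.5, 1_000_000        # p is an arbitrary illustration value

# X ~ Geometric(p), counting the number of trials (support 1, 2, ...)
X = rng.geometric(p, size=n_sim)

# D | X ~ NegBinom(n=X, p), counting trials; numpy counts failures, so add X back
D = rng.negative_binomial(X, p) + X

Y = D + X
print(Y.mean(), Y.var())   # should be close to 6 and 22
```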
Recall $g(X) = \mathbf{E}[Y ~|~ X]$. This predicts $Y$ given $X$.
We write $Y = \mathbf{E}[Y|X] + "D"$ where $D$ is the error in predictions.
Is there a better way to predict $Y$ given $X$?
Let $h$ be non-random. Consider $h(X)$ as another predictor, and let us minimize its "error".
$Error = \mathbf{E}\left[(Y - h(X))^2\right]$. Conditioning on $X$ and using the formula $\mathbf{E}[Z^2|X] = \mathbf{Var}[Z|X] + (\mathbf{E}[Z|X])^2$ with $Z = Y - h(X)$, this becomes
$$=\mathbf{E}[\mathbf{Var}[Y|X]] + \mathbf{E}[(g(X) - h(X))^2]$$So to make the above minimal, we should choose $h(X)$ as close as possible to $g(X)$.
This proves that by choosing $h=g$, we get the predictor with the minimal (least-squares) error.
Let $h$ be a fixed function, and let $MSE(h) = \mathbf{E}[(Y-h(X))^2]$ be the mean square error of the predictor $h(X)$ for $Y$. This $MSE(h)$ is minimal for $h(X)=g(X)=\mathbf{E}[Y|X]$.
We know $g(x) = \mathbf{E}[Y|X=x]$ is the best least square predictor function for $Y$ against $X$. But, what if $g$ is too hard to compute?
Then, we can see what happens if $X$ and $Y$ are (approximately) linearly related: we look to minimize the prediction error over linear $h$. Therefore, our problem is to minimize $\mathbf{E}[(Y - (aX+b))^2]$ and pick the best $a,b$.
Idea: use standardized versions $Z_X = \frac{X - \mu_X}{\sigma_X}$ and $Z_Y = \frac{Y - \mu_Y}{\sigma_Y}$
Best value of $a=\mathbf{Corr}(X,Y)\displaystyle ~\frac{\sigma_Y}{\sigma_X}$ and best value for $b=\mu_Y - a \mu_X$.
This means that $h(x)$ is $$h(x) = \mu_Y + a~(x - \mu_X)$$
This is the linear function with slope $a$, through the point $(\mu_X, \mu_Y)$.
It also turns out that $$MSE(h) = {(1 - \rho^2)}{\sigma_Y^2}$$
This last fact says that while $Y$ has variance $\sigma_Y^2$, the proportion of that variance which is not explained by $X$ (or by $aX+b$) is $\displaystyle \frac{MSE(h)}{\sigma_Y^2} = 1 - \rho^2$
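A quick sketch on simulated data (the linear-plus-noise model and all parameters below are made up): compute $a$ and $b$ from the formulas above and check that the resulting MSE is close to $(1-\rho^2)\sigma_Y^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

# Made-up linear-plus-noise relationship, just to get a correlated (X, Y) pair
X = rng.normal(1.0, 2.0, size=n)
Y = 3.0 + 0.5 * X + rng.normal(0.0, 1.5, size=n)

mu_x, mu_y = X.mean(), Y.mean()
sig_x, sig_y = X.std(), Y.std()
rho = np.corrcoef(X, Y)[0, 1]

a = rho * sig_y / sig_x          # best slope
b = mu_y - a * mu_x              # best intercept

mse = ((Y - (a * X + b)) ** 2).mean()
print(mse, (1 - rho**2) * sig_y**2)   # the two values agree
```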
Let $X\sim\Gamma(\alpha, 1)$ and $Y\sim\Gamma(\beta, 1)$. Let $U=X+Y$, and $V=\frac{X}{X+Y}$.
We know that $U\sim\Gamma(\alpha+\beta,1)$ and $V\sim Beta(\alpha,\beta)$.
We can show that $U,V$ are independent.
Let $g(u) = \mathbf{E}[X | U=u]$. We can show that $\mathbf{E}[V] = \frac{\alpha}{\alpha + \beta}$. Since $X = UV$, we get $g(u) = \mathbf{E}[UV | U=u] = u~\mathbf{E}[V|U=u] = u~\mathbf{E}[V]$ (because $V$ is independent of $U$) $= u~ \frac{\alpha}{\alpha + \beta}$
This proves that $g$ is linear: $g(u) = u \frac{\alpha}{\alpha + \beta}$
This is one of the few nontrivial examples where the regression function is exactly linear, so it coincides with the best linear predictor.
Note: when $\alpha=\beta=1$, $X$ and $Y$ are $\sim Expon(1)$.
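A simulation sketch with illustrative values $\alpha=2$, $\beta=3$ (not from the notes): $\mathbf{E}[V]\approx\frac{\alpha}{\alpha+\beta}=0.4$, $U$ and $V$ come out uncorrelated, and averaging $X$ within slices of $U$ is consistent with $g(u)=u\frac{\alpha}{\alpha+\beta}$.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta_, n = 2.0, 3.0, 1_000_000   # illustration values, not from the notes

X = rng.gamma(alpha, 1.0, size=n)
Y = rng.gamma(beta_, 1.0, size=n)
U, V = X + Y, X / (X + Y)

print(V.mean())                  # close to alpha/(alpha+beta) = 0.4
print(np.corrcoef(U, V)[0, 1])   # close to 0, consistent with independence of U and V

# Check g(u) = E[X | U = u] ~ u * alpha/(alpha+beta) by averaging X within deciles of U
edges = np.quantile(U, np.linspace(0, 1, 11))
which = np.digitize(U, edges[1:-1])
for k in range(10):
    sel = which == k
    print(round(X[sel].mean(), 3), round(U[sel].mean() * alpha / (alpha + beta), 3))
```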
General statement: Let $I=i$ with probability $p_i$, for $i=1,\dots,k$. Assume that, given $I=i$, $X$ has CDF $F_i$, a given function.
Unconditionally: $$F_X(x) = \mathbf{P}[X \le x] = \mathbf{E}\left[\mathbf{P}[X \le x | I]\right] = \sum_i p_i F_i(x)$$
Now assume $\frac{d~F_i}{dx} = f_i$ (density). Then, $$f_X(x) = \sum_i p_i f_i(x)$$
In all cases, $$\mathbf{E}[X] = \mathbf{E}[\mathbf{E}[X|I]] = \sum_{i=1}^k p_i~ \mathbf{E}[X_i]$$ where $X_i$ has $CDF=F_i$.
Similarly, $$\mathbf{Var}[X] = \sum_i p_i~\sigma_i^2 + \sum_ip_i~(\mu_i - \mu)^2$$
Here I defined $\sigma_i^2 = \mathbf{Var}[X_i]$ and $\mu_i = \mathbf{E}[X_i]$, both computed under $F_i$ (i.e. for $X_i$ with CDF $F_i$), and $$\mu = \mathbf{E}[X] = \sum_i p_i \mu_i$$
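A small sketch with a made-up two-component normal mixture (the weights, means and variances below are illustration choices), checking the mixture mean and variance formulas above:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000

# Made-up mixture: component 0 w.p. 0.7, component 1 w.p. 0.3 (illustration only)
p = np.array([0.7, 0.3])
mu_i = np.array([0.0, 4.0])          # component means
sig_i = np.array([1.0, 2.0])         # component standard deviations

I = rng.choice([0, 1], size=n, p=p)
X = rng.normal(mu_i[I], sig_i[I])

mu = (p * mu_i).sum()
var = (p * sig_i**2).sum() + (p * (mu_i - mu) ** 2).sum()

print(X.mean(), mu)     # both close to 1.2
print(X.var(), var)     # both close to 1.9 + 3.36 = 5.26
```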