Let $X$ and $Y$ be two discrete random variables, with probability mass functions: $P_X$ and $P_Y$. Then, the conditional probability mass function of $Y$ given $X$ is the following:
$$P_{Y|X} (y|x) = \mathbf{P}[Y = y | X = x] = \frac{\mathbf{P}[Y=y \text{ and }X = x]}{\mathbf{P}[X=x]} = \frac{P_{X,Y}(x,y)}{P_X(x)}$$where $P_{X,Y}(x,y)$ is the joint-probability mass function of $X,Y$.
Note: It is easy to find the joint density if we know the conditional density and marginal density.
Indeed: $f_{X,Y}(x,y) = f_X(x) f_{Y|X}(y|x)$
For now, let's concentrate on the discrete case.
Setup: Let $X_1\sim Poi(\lambda_1)$ and $X_2\sim Poi(\lambda_2)$ be independent, and let $T = X_1+X_2$, so that $T\sim Poi(\lambda_1+\lambda_2)$. Question: what is the conditional distribution of $X_1$ given $T=t$? This question applies to any scenario where we observe the total number of events but we don't know exactly how many events fall into each of several categories.
We compute $\displaystyle P_{X_1|T}(x_1 | t) = \mathbf{P}[X_1 = x_1 | T = t]$
$$= \frac{P_{X_1,T}(x_1,t)}{P_T(t)}$$A powerful little trick: on the event $X_1 = x_1$, we can replace $X_1 + X_2 = t$ with $X_2 = t - x_1$
$$=\frac{\mathbf{P}[X_1 = x_1 \text{ AND } X_2 = t - x_1]}{\displaystyle e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^t}{t!}}$$And remember that $X_1$ and $X_2$ are independent, so we write:
$$=\frac{\mathbf{P}[X_1 = x_1]\ \times \mathbf{P}[X_2 = t - x_1]}{\displaystyle e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^t}{t!}}$$And we know the probability mass functions of each of $X_1$ and $X_2$, which is Poisson:
$$=\frac{ \displaystyle e^{-\lambda_1} \frac{\lambda_1^{x_1}}{x_1!} \times e^{-\lambda_2} \frac{\lambda_2^{t-x_1}}{(t-x_1)!}}{\displaystyle e^{-(\lambda_1 + \lambda_2)} \frac{(\lambda_1 + \lambda_2)^t}{t!}}$$$$=\frac{t!}{x_1! (t-x_1)!} \times \frac{\lambda_1^{x_1} \lambda_2^{t-x_1}}{(\lambda_1+\lambda_2)^t}$$The last result looks very much like a binomial distribution. Let's simplify it further:
$$\left(\begin{array}{c}t\\x_1\end{array}\right) \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{x_1} \left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{t-x_1}$$We can see that $p=\displaystyle \frac{\lambda_1}{\lambda_1 + \lambda_2}$ and $1-p=\displaystyle \frac{\lambda_2}{\lambda_1 + \lambda_2}$
We proved that $X_1 \text{ given } T=t ~~ \sim Binomial(n=t, p=\frac{\lambda_1}{\lambda_1 + \lambda_2})$
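As a sanity check, here is a small Monte Carlo sketch of this result; the values $\lambda_1=2$, $\lambda_2=3$, $t=4$ and the seed are arbitrary illustration choices, not from the notes.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
lam1, lam2, t = 2.0, 3.0, 4        # illustration values, not from the notes
n_sim = 1_000_000

x1 = rng.poisson(lam1, size=n_sim)
x2 = rng.poisson(lam2, size=n_sim)

# Condition on the total: keep only the samples where X1 + X2 == t
cond = x1[x1 + x2 == t]
emp = np.bincount(cond, minlength=t + 1)[: t + 1] / len(cond)

p = lam1 / (lam1 + lam2)
theo = [comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(t + 1)]

print(np.round(emp, 3))    # empirical conditional PMF of X1 given T = t
print(np.round(theo, 3))   # Binomial(t, p) PMF with p = lam1/(lam1 + lam2)
```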
Answer (Poisson thinning): if $T\sim Poi(\lambda)$ and, given $T$, $X\sim Binom(T, p)$, then unconditionally $X\sim Poi(p\lambda)$.
The reason that we say $X$ has a mixture distribution is that one of the parameters in its (conditional) distribution is a random variable. Symbolically, we can write: $$X\sim Binom(T, p)\text{ where }T\sim Poi(\lambda)$$
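A quick simulation sketch of this mixture (illustrative values $\lambda=5$, $p=0.3$, not from the notes): the simulated $X$ has mean and variance both close to $p\lambda=1.5$, as a $Poi(p\lambda)$ should.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p = 5.0, 0.3            # illustration values, not from the notes

T = rng.poisson(lam, size=1_000_000)
X = rng.binomial(T, p)       # X | T ~ Binomial(T, p)

# For a Poisson(p * lam), both mean and variance equal p * lam = 1.5
print(X.mean(), X.var())
```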
Construction: A Markov chain is a stochastic process $\{X(t): ~ t\in \mathbb{N}\}$. We want to define the joint distribution of all the $X(t)$'s, for all $t$, simultaneously.
Prescription: We prescribe the conditional distribution of $X(t+1)$ given $X(t)$, and we insist that this conditional distribution is the same as that of $X(t+1)$ given all the $X(s)$'s for $s\le t$ (the Markov property).
Notation: the $X(t)$'s all take values in the "state space" $I=\{x_1,x_2,...x_m\}$
We only need to prescribe these values:
$$\mathbf{P}[X(t+1) = x_j | X(t)=x_i] = P_{ij}$$
These $P_{ij}$ are called the transition probabilities. They are arranged in a matrix $\mathbf{P}$, called the transition matrix.
Let $Y_0,Y_1,Y_2,...,Y_n,...$ be i.i.d., with $\displaystyle Y_i = \left\{\begin{array}{lrr}1 & \text{with probability} & p\\-1 & \text{with probability} & 1-p\end{array}\right.$
Define $X(0)=0$ and $\forall t\ge 0, X(t+1) = X(t) + Y_{t}$. Therefore, we notice that $X(t) = Y_0+Y_1 + Y_2 + ... +Y_{t-1}$
From this definition, we see that $X(t)$ is a Markov chain.
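A short simulation sketch of this random walk; the choice $p=0.5$, the number of steps, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_steps = 0.5, 100                      # arbitrary illustration values

# Y_i = +1 with probability p, -1 with probability 1-p
Y = rng.choice([1, -1], size=n_steps, p=[p, 1 - p])

# X(0) = 0 and X(t) = Y_0 + Y_1 + ... + Y_{t-1}
X = np.concatenate(([0], np.cumsum(Y)))

print(X[:10])   # first few states of the chain
```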
[Most discrete results hold for the continuous case too.]
We understand conditional probability mass functions (PMFs). (Reminder: in the discrete case, $P_{Y|X}(y|x) = \mathbf{P}[Y = y | X=x]$; in the continuous case, the conditional density is $\displaystyle f_{Y|X}(y|x)=\frac{f_{X,Y}(x,y)}{f_X(x)}$.)
Now, define the conditional expectation:
$$\mathbf{E}[Y | X=x] = \left\{\begin{array}{lrr}\displaystyle \sum_y y~P_{Y|X}(y|x) & & \text{discrete}\\\displaystyle \int_{-\infty}^\infty y f_{Y|X}(y|x) dy && \text{continuous}\end{array}\right.$$Note that $\mathbf{E}[Y|X=x]$ is a non-random function of $x$; it is called the regression function of $Y$ on $X$ and is denoted $g(x)$.
Very important "meta-theorem" (definition): the notation $\mathbf{E}[Y|X]$ makes sense as $g(X)$, which is a random variable (a function of $X$). In particular (tower property), $\mathbf{E}[\mathbf{E}[Y|X]] = \mathbf{E}[g(X)] = \mathbf{E}[Y]$.
More generally, if $h$ is a non-random function, then $\mathbf{E}[h(X)Y] = \mathbf{E}[h(X) g(X)]$
Idea: when conditioning on $X$, any factor involving $X$ only pulls out like a constant from the expectation.
Note: we use the last idea to prove the previous result: $$\mathbf{E}[h(X)Y] = \mathbf{E}\left[\mathbf{E}[h(X)Y~|~X]\right] \quad\text{(tower property)} \\ = \mathbf{E}\left[h(X) ~\mathbf{E}[Y|X]\right] \quad\text{(pulling out the factor }h(X)\text{)} \\ = \mathbf{E}\left[h(X)~g(X)\right] \quad\text{(meta-theorem)}$$
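To make this concrete, here is a tiny numerical sketch with a made-up joint PMF (the table and the choice $h(x)=x^2$ are purely illustrative): it computes $g(x)=\mathbf{E}[Y|X=x]$ and checks that $\mathbf{E}[h(X)Y]=\mathbf{E}[h(X)g(X)]$.

```python
import numpy as np

# Made-up joint PMF P[X=x, Y=y]; rows index x in xs, columns index y in ys
xs = np.array([0, 1, 2])
ys = np.array([1, 2])
P = np.array([[0.10, 0.20],
              [0.25, 0.15],
              [0.20, 0.10]])        # entries sum to 1

P_X = P.sum(axis=1)                  # marginal PMF of X
g = (P * ys).sum(axis=1) / P_X       # g(x) = E[Y | X = x]

h = xs**2                            # an arbitrary non-random function h(x) = x^2

lhs = (P * (h[:, None] * ys)).sum()  # E[h(X) Y]
rhs = (P_X * h * g).sum()            # E[h(X) g(X)]
print(lhs, rhs)                      # the two values agree
```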
Useful notation: $v(x) = \mathbf{Var}[Y|X=x]=\mathbf{E}[(Y - \mathbf{E}[Y|X=x])^2~|~X=x] = \cdots =\mathbf{E}[Y^2|X=x] - (\mathbf{E}[Y|X=x])^2$
The second notation above is preferable, because $v$ and $g$ are not universal notations.
$$\begin{array}{ccccc}\mathbf{Var}[Y] &=& \mathbf{E}[\mathbf{Var}[Y|X]] &+& \mathbf{Var}[\mathbf{E}[Y|X]]\\ \textbf{Unconditional Variance}& =&\textbf{Expectation of Conditional Variance} &+& \textbf{Variance of Conditional Expectation}\end{array}$$
Example: A car dealer sells $N$ cars per day, where $\displaystyle N =\left\{\begin{array}{lrr}1 &\text{with probability}&0.3\\2&&0.5\\3&&0.2 \end{array}\right.$
Given that $N=i$, the profit from the $j$th car sold is $X_j$, for $j=1,\dots,i$, where the $X_j$'s are i.i.d. with distribution $\displaystyle \left\{\begin{array}{lrr}1 &\text{with probability}&0.4\\2&&0.3\\3&&0.2\\4&&0.1 \end{array}\right.$
Question: What is the total profit $T$? Answer: $T = \sum_{j=1}^N X_j$
Question: What is the expected profit?
For the calculation of the inner $\mathbf{E}$ on the right-hand side, we can treat $N$ as fixed.
$$\mathbf{E}[T] = \mathbf{E}\left[\mathbf{E}[T|N]\right] = \mathbf{E}\left[\mathbf{E}[\sum_{j=1}^N X_j|N]\right] \\ \longrightarrow \text{linearity of expectation and identical distribution of the }X_j\text{'s} \rightarrow = \mathbf{E}\left[\mathbf{E}[N ~ X_1|N]\right] \\ \longrightarrow \text{ pull out }N \rightarrow = \mathbf{E}\left[N ~\mathbf{E}[X_1|N]\right]$$Now, we assume that $X_1,X_2,\dots$ are independent of $N$.
$$\mathbf{E}[T] = .. = \mathbf{E}\left[N ~\mathbf{E}[X_1|N]\right] \\ \rightarrow \text{independence of }N\text{ and the }X_j\text{'s} \rightarrow \\ = \mathbf{E}\left[N ~\mathbf{E}[X_1]\right] =\mathbf{E}[N]~\mathbf{E}[X_1] = 1.9\times 2 = 3.8$$Using the theorem on conditional variance:
$$\mathbf{Var}[T] = \mathbf{E}[v(N)] + \mathbf{Var}[g(N)]$$The unconditional variance is the sum of the expected conditional variance and the variance of the conditional expectation.
$$\mathbf{Var}[T]=\mathbf{E}[\mathbf{Var}[T~|~N]] + \mathbf{Var}[\mathbf{E}[T~|~N]] = \mathbf{E}[N~\mathbf{Var}[X_1]] + \mathbf{Var}[N~\mathbf{E}[X_1]] = \mathbf{E}[N]~\mathbf{Var}[X_1] + (\mathbf{E}[X_1])^2~\mathbf{Var}[N]$$And therefore, since $\mathbf{Var}[X_1]=1.0$ and $\mathbf{Var}[N]=0.49$, we get $1.9\times 1.0 + 2^2\times 0.49 = 3.86$.
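A Monte Carlo sketch of this car-dealer example (seed and sample size are arbitrary), checking $\mathbf{E}[T]=3.8$ and $\mathbf{Var}[T]=3.86$:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sim = 1_000_000

# N = number of cars sold per day
N = rng.choice([1, 2, 3], size=n_sim, p=[0.3, 0.5, 0.2])

# Per-car profits X_j, iid on {1,2,3,4} with probabilities 0.4, 0.3, 0.2, 0.1.
# Draw 3 profits per day (the maximum possible N) and sum only the first N of them.
X = rng.choice([1, 2, 3, 4], size=(n_sim, 3), p=[0.4, 0.3, 0.2, 0.1])
mask = np.arange(3) < N[:, None]            # keep column j iff j < N
T = (X * mask).sum(axis=1)                  # total daily profit

print(T.mean(), T.var())   # should be close to 3.8 and 3.86
```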
This also proves the following theorem:
Let $N,X_1,X_2,\dots$ be independent random variables. Assume $N$ is nonnegative integer-valued. No assumptions on the $X_j$'s except that they are i.i.d. with the same distribution as some $X$.
Let $T = \sum_{j=1}^NX_j$.
Then $$\mathbf{E}[T] = \mathbf{E}[N]~\mathbf{E}[X]$$
And $$\mathbf{Var}[T] = \mathbf{E}[N]\mathbf{Var}[X] + \mathbf{Var}[N](\mathbf{E}[X])^2$$
We also get:
$$\mathbf{Cov}(N,T) = \mathbf{E}[X]\mathbf{Var}[N]$$$$\mathbf{Corr}(N,T) = \frac{1}{\sqrt{1 + \theta}} \text{ where } \theta= \frac{\mathbf{E}[N]~\mathbf{Var}[X]}{(\mathbf{E}[X])^2 ~\mathbf{Var}[N]}$$
The total population is $n=n_1+n_2+\dots+n_k$, where $k$ subpopulations of sizes $n_1,n_2,...,n_k$ have been identified. The data points are $x_{ij}$, where $i=1,...,k$ and $j$ runs over all data points from the $i$th subpopulation.
Let $Y$ represent a randomly selected data point.
Let $I=\text{the number of the subpopulation to which }Y\text{ belongs}$
Thus, $\mathbf{E}[Y|I=i] = \sum_{j}x_{ij} \frac{1}{n_i} = \bar{x}_i$
We assume that $\mathbf{P}[I = i] = \frac{n_i}{n}$
$$\Longrightarrow \mathbf{E}[Y] = \sum_{i=1}^k \frac{n_i}{n} \bar{x}_i$$Also, of course, $$\mathbf{Var}[Y] = \frac{1}{n}\sum_i\sum_j (x_{ij}-\bar{x})^2$$ where $\bar{x}$ is the population average $=\frac{1}{n}\sum_i\sum_jx_{ij}$
However, conditioning on $I$, we also have the following, by the theorem on conditional variance: $\mathbf{Var}[Y] = \mathbf{E}[v(I)] + \mathbf{Var}[g(I)] = \sum_{i=1}^k \frac{n_i}{n}~v(i) + \sum_{i=1}^k \frac{n_i}{n}(g(i)-\bar{x})^2$, where $g(i)=\bar{x}_i$ and $v(i) = \mathbf{Var}[Y|I=i] = \frac{1}{n_i}\sum_j (x_{ij}-\bar{x}_i)^2$.
Now, equate the two formulae for $\mathbf{Var}[Y]$ and multiply by $n$: $$\sum_i\sum_j (x_{ij}-\bar{x})^2 = \sum_i\sum_j (x_{ij}-\bar{x}_i)^2 + \sum_i n_i~(\bar{x}_i-\bar{x})^2$$This is the decomposition of the total sum of squares into within-subpopulation and between-subpopulation sums of squares.
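A quick numeric sketch of this identity on made-up data (three arbitrary groups with arbitrary sizes and means):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up subpopulations of sizes 30, 50, 20 (illustration only)
groups = [rng.normal(loc=m, scale=1.0, size=sz)
          for m, sz in [(0.0, 30), (2.0, 50), (5.0, 20)]]

all_x = np.concatenate(groups)
xbar = all_x.mean()

total_ss = ((all_x - xbar) ** 2).sum()
within_ss = sum(((g - g.mean()) ** 2).sum() for g in groups)
between_ss = sum(len(g) * (g.mean() - xbar) ** 2 for g in groups)

print(total_ss, within_ss + between_ss)   # the two numbers coincide
```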
Let $X$ be a random variable with a Geometric distribution with parameter $p$ (so $\mathbf{E}[X]=\frac{1}{p}$ and $\mathbf{Var}[X]=\frac{1-p}{p^2}$).
Let $D$ be a random variable which, given $X$, has a negative binomial distribution with parameter $p$ and $n=X$.
What are the expectation and variance of $Y=D+X$?
$D=Y-X$. We have that $Y=D+X$ and, conditioned on $X$, $D$ has a negative binomial distribution: $D~|~X\sim NegBinom(p,X)$
We also have $X\sim Geometric(p)$
We compute $\displaystyle \mathbf{E}[Y] = \mathbf{E}[X+D] = \mathbf{E}[X] + \mathbf{E}[\mathbf{E}[D|X]] = \frac{1}{p} + \mathbf{E}\left[\frac{X}{p}\right] = \frac{1}{p} + \frac{1}{p^2}$
For variance: $\mathbf{Var}[Y] = \mathbf{Var}[(D + X)] = \mathbf{E}[\mathbf{Var}[(D+X)|X]] + \mathbf{Var}[\mathbf{E}[(D+X)|X]]$
$\displaystyle \mathbf{E}[\mathbf{Var}[D+X|X]] = \mathbf{E}[\mathbf{Var}[D|X]] \\= \displaystyle \mathbf{E}[\frac{X(1-p)}{p^2}] = \frac{1-p}{p^3}$
$\displaystyle \mathbf{Var}[\mathbf{E}[D+X|X]] = \mathbf{Var}[\mathbf{E}[D|X] +\mathbf{E}[X|X]] \\= \mathbf{Var}[\mathbf{E}[D|X] ~+~ X] = \displaystyle \mathbf{Var}[\frac{X}{p} + X] = \left(\frac{1+p}{p}\right)^2 \mathbf{Var}[X] = \frac{(1+p)^2 (1-p)}{p^4}$
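A simulation sketch with an arbitrary illustration value $p=0.5$, for which the formulas above give $\mathbf{E}[Y]=\frac{1}{p}+\frac{1}{p^2}=6$ and $\mathbf{Var}[Y]=\frac{1-p}{p^3}+\frac{(1+p)^2(1-p)}{p^4}=4+18=22$. The code assumes the trial-counting conventions used above ($\mathbf{E}[X]=1/p$, $\mathbf{E}[D|X]=X/p$); numpy's negative binomial counts failures, so $X$ is added back.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_sim = 0.5, 1_000_000        # p is an arbitrary illustration value

# X ~ Geometric(p), counting the number of trials (support 1, 2, ...)
X = rng.geometric(p, size=n_sim)

# D | X ~ NegBinom(n=X, p), counting trials; numpy counts failures, so add X back
D = rng.negative_binomial(X, p) + X

Y = D + X
print(Y.mean(), Y.var())   # should be close to 6 and 22
```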
Recall $g(X) = \mathbf{E}[Y ~|~ X]$. This predicts $Y$ given $X$.
We write $Y = \mathbf{E}[Y|X] + "D"$ where $D$ is the error in predictions.
Is there a better way to predict $Y$ given $X$?
Let $h$ be non-random. Consider $h(X)$ as another predictor, and let us minimize its "error".
$Error = \mathbf{E}\left[(Y - h(X))^2\right]$. Conditioning on $X$ and using the formula $\mathbf{E}[Z^2|X] = \mathbf{Var}[Z|X] + (\mathbf{E}[Z|X])^2$ with $Z = Y - h(X)$, this becomes
$$=\mathbf{E}[\mathbf{Var}[Y|X]] + \mathbf{E}[(g(X) - h(X))^2]$$So to make the above minimal, we should choose $h(X)$ as close as possible to $g(X)$.
This proves that by choosing $h=g$, we get the predictor with the minimal (least-squares) error.
Let $h$ be a fixed function, and let $MSE(h) = \mathbf{E}[(Y-h(X))^2]$ be the mean square error of the predictor $h(X)$ for $Y$. This $MSE(h)$ is minimal for $h(X)=g(X)=\mathbf{E}[Y|X]$.
We know $g(x) = \mathbf{E}[Y|X=x]$ is the best least square predictor function for $Y$ against $X$. But, what if $g$ is too hard to compute?
Then, we can see what happens if $X$ and $Y$ are (approximately) linearly related: we look to minimize the prediction error over linear $h$. Therefore, our problem is to minimize $\mathbf{E}[(Y - (aX+b))^2]$ and pick the best $a,b$.
Idea: use standardized versions $Z_X = \frac{X - \mu_X}{\sigma_X}$ and $Z_Y = \frac{Y - \mu_Y}{\sigma_Y}$
Best value of $a=\mathbf{Corr}(X,Y)\displaystyle ~\frac{\sigma_Y}{\sigma_X}$ and best value for $b=\mu_Y - a \mu_X$.
This means that $h(x)$ is $$h(x) = \mu_Y + a~(x - \mu_X)$$
This is the linear function with slope $a$, through the point $(\mu_X, \mu_Y)$.
It also turns out that $$MSE(h) = {(1 - \rho^2)}{\sigma_Y^2}$$
This last fact says that while $Y$ has variance $\sigma_Y^2$, the proportion of that variance which is not explained by $X$ (or by $aX+b$) is $\displaystyle \frac{MSE(h)}{\sigma_Y^2} = 1 - \rho^2$
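A quick sketch on simulated data (the linear-plus-noise model and all parameters below are made up): compute $a$ and $b$ from the formulas above and check that the resulting MSE is close to $(1-\rho^2)\sigma_Y^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

# Made-up linear-plus-noise relationship, just to get a correlated (X, Y) pair
X = rng.normal(1.0, 2.0, size=n)
Y = 3.0 + 0.5 * X + rng.normal(0.0, 1.5, size=n)

mu_x, mu_y = X.mean(), Y.mean()
sig_x, sig_y = X.std(), Y.std()
rho = np.corrcoef(X, Y)[0, 1]

a = rho * sig_y / sig_x          # best slope
b = mu_y - a * mu_x              # best intercept

mse = ((Y - (a * X + b)) ** 2).mean()
print(mse, (1 - rho**2) * sig_y**2)   # the two values agree
```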
Let $X\sim\Gamma(\alpha, 1)$ and $Y\sim\Gamma(\beta, 1)$. Let $U=X+Y$, and $V=\frac{X}{X+Y}$.
We know that $U\sim\Gamma(\alpha+\beta,1)$ and $V\sim Beta(\alpha,\beta)$.
We can show that $U,V$ are independent.
Let $g(u) = \mathbf{E}[X | U=u]$. We can show that $\mathbf{E}[V] = \frac{\alpha}{\alpha + \beta}$. Since $X = UV$, we get $g(u) = \mathbf{E}[UV | U=u] = u~\mathbf{E}[V|U=u] = u~\mathbf{E}[V]$ (because $V$ is independent of $U$) $= u~ \frac{\alpha}{\alpha + \beta}$
This proves that $g$ is linear: $g(u) = u \frac{\alpha}{\alpha + \beta}$
This is one of the few nontrivial examples where the regression function is exactly linear, so it coincides with the best linear predictor.
Note: when $\alpha=\beta=1$, $X$ and $Y$ are $\sim Expon(1)$.
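A simulation sketch with illustrative values $\alpha=2$, $\beta=3$ (not from the notes): $\mathbf{E}[V]\approx\frac{\alpha}{\alpha+\beta}=0.4$, $U$ and $V$ come out uncorrelated, and averaging $X$ within slices of $U$ is consistent with $g(u)=u\frac{\alpha}{\alpha+\beta}$.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta_, n = 2.0, 3.0, 1_000_000   # illustration values, not from the notes

X = rng.gamma(alpha, 1.0, size=n)
Y = rng.gamma(beta_, 1.0, size=n)
U, V = X + Y, X / (X + Y)

print(V.mean())                  # close to alpha/(alpha+beta) = 0.4
print(np.corrcoef(U, V)[0, 1])   # close to 0, consistent with independence of U and V

# Check g(u) = E[X | U = u] ~ u * alpha/(alpha+beta) by averaging X within deciles of U
edges = np.quantile(U, np.linspace(0, 1, 11))
which = np.digitize(U, edges[1:-1])
for k in range(10):
    sel = which == k
    print(round(X[sel].mean(), 3), round(U[sel].mean() * alpha / (alpha + beta), 3))
```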
General statement: Let $I=i$ with probability $p_i$, for $i=1,\dots,k$. Assume that, given $I=i$, $X$ has CDF $F_i$, a given function.
Unconditionally: $$F_X(x) = \mathbf{P}[X \le x] = \mathbf{E}\left[\mathbf{P}[X \le x | I]\right] = \sum_i p_i F_i(x)$$
Now assume $\frac{d~F_i}{dx} = f_i$ (density). Then, $$f_X(x) = \sum_i p_i f_i(x)$$
In all cases, $$\mathbf{E}[X] = \mathbf{E}[\mathbf{E}[X|I]] = \sum_{i=1}^k p_i~ \mathbf{E}[X_i]$$ where $X_i$ has $CDF=F_i$.
Similarly, $$\mathbf{Var}[X] = \sum_i p_i~\sigma_i^2 + \sum_ip_i~(\mu_i - \mu)^2$$
Here I defined $\sigma_i^2 = \mathbf{Var}[X_i]$ and $\mu_i = \mathbf{E}[X_i]$, both computed under $F_i$ (i.e. for $X_i$ with CDF $F_i$), and $$\mu = \mathbf{E}[X] = \sum_i p_i \mu_i$$
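A small sketch with a made-up two-component normal mixture (the weights, means and variances below are illustration choices), checking the mixture mean and variance formulas above:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000

# Made-up mixture: component 0 w.p. 0.7, component 1 w.p. 0.3 (illustration only)
p = np.array([0.7, 0.3])
mu_i = np.array([0.0, 4.0])          # component means
sig_i = np.array([1.0, 2.0])         # component standard deviations

I = rng.choice([0, 1], size=n, p=p)
X = rng.normal(mu_i[I], sig_i[I])

mu = (p * mu_i).sum()
var = (p * sig_i**2).sum() + (p * (mu_i - mu) ** 2).sum()

print(X.mean(), mu)     # both close to 1.2
print(X.var(), var)     # both close to 1.9 + 3.36 = 5.26
```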