Expectations

Mathematical Expectation

The mathematical expectation of a random variable $X$ is the analog of what we would get by drawing a very large number of independent samples of $X$ and computing their empirical average.

Def: If $X$ is a random variable taking values $x_1, x_2, \dots, x_n, \dots$ with respective probabilities $P_1, P_2, \dots, P_n, \dots$, then $\mathbf{E}[X] = \sum_{i=1}^\infty P_i x_i$

Example:

$X=\text{value of a roll of a 6-sided die}$. Therefore, $x_i=i \ \text{ for }\ i=1,2,3,4,5,6$ and $P_i=\frac{1}{6}$. Thus $\mathbf{E}[X] = \frac{1}{6}\cdot 1 + \frac{1}{6}\cdot 2 + \dots + \frac{1}{6}\cdot 6 = \frac{1}{6}(1+2+3+4+5+6)=\frac{21}{6} = 3.5$
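As a sanity check, here is a minimal Python sketch (not part of the original notes; it only assumes the pmf written above) that evaluates the sum $\sum_i P_i x_i$ directly:

```python
# Expected value of a fair 6-sided die, computed directly from its pmf.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6          # P_i = 1/6 for every face

expectation = sum(p * x for p, x in zip(probs, values))
print(expectation)           # 3.5
```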

Example

$X \sim Bern(p)$. Therefore, $X=\left\{\begin{array}{lr}1 & \text{with probability }p\\0 & \text{with probability }1-p\end{array}\right.$

Thus, $\mathbf{E}[X] = 0\times (1-p) \ + \ 1 \times p = p$

Rules of Expectations

Linearity of Expectations:

Let $X$ and $Y$ be two random variables, and let $a$ and $b$ be two constants. Then $\mathbf{E}[aX + bY] = a\mathbf{E}[X] + b\mathbf{E}[Y]$

  • Example: $X \in \{0,1\}$ is a fair coin toss and $Y$ is a roll of a 6-sided die. You will be paid $\$Z$, where $Z = 2X + 3Y$. Then $\mathbf{E}[Z] = 2\mathbf{E}[X] + 3\mathbf{E}[Y] = 2\times \frac{1}{2} + 3\times \frac{7}{2} = \frac{23}{2} = 11.5$
  • Example: Recall the Binomial random variable. Let $X \sim Binom(n,p)$.
    Recall: $X$ can be represented as $X = X_1 + X_2 + \dots + X_n$ where the $X_i$'s are (independent) $Bern(p)$ random variables.
    Thus: $\mathbf{E}[X] = \mathbf{E}[X_1 + X_2 + \dots + X_n] = \mathbf{E}[X_1] + \mathbf{E}[X_2] + \dots + \mathbf{E}[X_n] = p + p + \dots + p = np$
  • Example: Geometric distribution. Let $X_1,X_2,\dots,X_m,\dots$ be an infinite sequence of independent Bernoulli trials, all with the same success probability $p$. Let $Y$ be the first time that $X_i = 1$. It turns out that $\mathbf{E}[Y] = \frac{1}{p}$.
    Note: $\mathbf{P}[Y = k] = (1-p)^{k-1} p$. Indeed, the event $\{Y=k\}$ is $k-1$ failures followed by one independent success.
    Therefore, $\mathbf{E}[Y] = \sum_{k=1}^\infty k (1-p)^{k-1} p = p \sum_{k=1}^\infty k (1-p)^{k-1} = p \cdot \frac{1}{(1-(1-p))^2} = \frac{1}{p}$, using $\sum_{k=1}^\infty k x^{k-1} = \frac{1}{(1-x)^2}$ for $|x|<1$.
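The formula $\mathbf{E}[Y] = 1/p$ can also be checked empirically; below is a small simulation sketch (my own illustration, with an arbitrary choice of $p$):

```python
import random

p = 0.25          # success probability (arbitrary choice for the illustration)
trials = 100_000  # number of simulated waiting times

def first_success(p):
    """Run Bernoulli(p) trials until the first success; return how many trials it took."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

empirical_mean = sum(first_success(p) for _ in range(trials)) / trials
print(empirical_mean, 1 / p)  # the empirical mean should be close to 4.0
```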

Theorem/Definition:

Let $X$ be a random variable, and let $f$ be a function from $\mathbb{R}\to \mathbb{R}$. Then $Y=f(X)$ is a random variable, and $$\mathbf{E}[Y] = \sum_{i=1}^\infty p_i \times f(x_i)$$ where $p_i = \mathbf{P}[X = x_i]$.

  • Example: $\mathbf{E}[X^2] = \sum_{i=1}^\infty p_i\times (x_i)^2$
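For example, applying this to the fair die with $f(x) = x^2$ (a small sketch reusing the pmf from the first example):

```python
# LOTUS: E[f(X)] = sum_i p_i * f(x_i), here with f(x) = x^2 for a fair 6-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

e_x_squared = sum(p * x ** 2 for p, x in zip(probs, values))
print(e_x_squared)  # 91/6 ≈ 15.1667
```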

Theorem/Definition:

Let $X$ and $Y$ be two independent random variables, so that the events $\{X=x_k\}$ and $\{Y=y_j\}$ are independent for all $k, j$. Then, if $f$ and $g$ are two functions, we get $\mathbf{E}[f(X)\times g(Y)] = \mathbf{E}[f(X)] \times \mathbf{E}[g(Y)]$

  • Proof: This works because $\mathbf{P}[\{X=x_k\}\cap \{Y=y_j\}] = \mathbf{P}[X=x_k]\times \mathbf{P}[Y=y_j]$, so the double sum $\sum_{k,j} f(x_k)\,g(y_j)\,\mathbf{P}[\{X=x_k\}\cap \{Y=y_j\}]$ factors as $\left(\sum_k f(x_k)\mathbf{P}[X=x_k]\right)\left(\sum_j g(y_j)\mathbf{P}[Y=y_j]\right)$.

Notation: The pair $(X,Y)$ is a 2-dimensional random variable. We say it is bivariate, and we define its probability mass function $$f_{X,Y}(x_k,y_j) = \mathbf{P}[\{X = x_k\}\cap\{Y=y_j\}]$$
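The product rule can be verified numerically on a small bivariate example. The sketch below (my own; it assumes an independent fair coin $X$ and fair die $Y$, with $f(x)=x$ and $g(y)=y^2$) builds the joint pmf as the product of the marginals and compares both sides of the identity:

```python
# Independent X ~ Bern(1/2) (fair coin) and Y uniform on {1,...,6} (fair die).
xs, px = [0, 1], [0.5, 0.5]
ys, py = list(range(1, 7)), [1 / 6] * 6

f = lambda x: x       # f(X)
g = lambda y: y ** 2  # g(Y)

# Left-hand side: E[f(X) g(Y)] from the joint pmf f_{X,Y}(x, y) = P[X=x] * P[Y=y].
lhs = sum(px[i] * py[j] * f(xs[i]) * g(ys[j])
          for i in range(len(xs)) for j in range(len(ys)))

# Right-hand side: E[f(X)] * E[g(Y)].
rhs = sum(p * f(x) for p, x in zip(px, xs)) * sum(p * g(y) for p, y in zip(py, ys))

print(lhs, rhs)  # both equal 91/12 ≈ 7.5833
```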

Theorem: Positivity property

Let $X$ be a random variable on the sample space $S$. Assume that for all $\omega \in S$, $X(\omega)\ge 0$. Then $\mathbf{E}[X]\ge 0$

Markov inequality

Let $X$ be a non-negative random variable ($X\ge 0$, i.e. $\mathbf{P}[X<0] = 0$). Then, for any constant $c > 0$, $\mathbf{P}[X \ge c] \le \mathbf{E}[X]/c$

  • Proof: Let $Y$ be the random variable which is $0$ when $X<c$ and $1$ when $X\ge c$. So $Y\sim Bern(p)$ with parameter $p=\mathbf{P}[X\ge c]$.
    Since $X \ge 0$ and $Y \le 1$, we have $XY \le X$. Therefore, $\mathbf{E}[XY] \le \mathbf{E}[X]$.
    Claim: $XY \ge cY$ (when $Y=1$ we have $X\ge c$, and when $Y=0$ both sides are $0$). Using the same argument: $\mathbf{E}[cY] \le \mathbf{E}[XY]$.
    $\Rightarrow \ \ \ \mathbf{E}[cY] \le \mathbf{E}[X] \Rightarrow \ \ \ c\mathbf{E}[Y]\le\mathbf{E}[X]$
    But since $Y\sim Bern(p)$, $\mathbf{E}[Y]=p=\mathbf{P}[X\ge c]$, so $c\mathbf{E}[Y] = c\mathbf{P}[X\ge c]$.
    Finally: $c\mathbf{P}[X\ge c] \le \mathbf{E}[X]$
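A quick numerical illustration of the inequality (my own sketch; the die and the choice $c = 5$ are arbitrary):

```python
# Markov's inequality: P[X >= c] <= E[X] / c for a non-negative X.
# Here X is a fair 6-sided die and c = 5.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
c = 5

tail = sum(p for p, x in zip(probs, values) if x >= c)   # P[X >= 5] = 2/6 ≈ 0.333
bound = sum(p * x for p, x in zip(probs, values)) / c    # E[X] / 5 = 3.5 / 5 = 0.7
print(tail, bound)                                       # 0.333... <= 0.7
```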

Variance

Definition:

Let $X$ be a random variable. Then, $\mathbf{Var}[X] = \mathbf{E}[\left(X - \mathbf{E}[X]\right)^2]$

  • Also, the quantity $\sqrt{\mathbf{Var}[X]}$ is called the standard deviation (typical deviation)
Notation:
  • $\mu = \mathbf{E}[X]$
  • $\sigma^2 = \mathbf{Var}[X]$

In practice, how do we compute the variance from a probability mass function?

  • First, compute $\mu = \sum_{i=1}^\infty P_X(x_i)\,x_i$
  • Then, $\mathbf{Var}[X] = \mathbf{E}[(X - \mu)^2] = \sum_{i=1}^\infty P_X(x_i)(x_i - \mu)^2$
Example

Let random variable $X$ have probability mass function $f(k) = k/10$ for $k = 1, 2, 3, 4$. Then $\mu = (1\cdot 1 + 2\cdot 2 + 3\cdot 3 + 4\cdot 4)/10 = 30/10 = 3$ and $\mathbf{Var}[X] = (1-3)^2\tfrac{1}{10} + (2-3)^2\tfrac{2}{10} + (3-3)^2\tfrac{3}{10} + (4-3)^2\tfrac{4}{10} = \tfrac{10}{10} = 1$
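The same two-step computation in Python (a sketch that just re-implements the formulas above for this pmf):

```python
# Variance from a pmf: f(k) = k/10 for k = 1, 2, 3, 4.
ks = [1, 2, 3, 4]
pk = [k / 10 for k in ks]

mu = sum(p * k for p, k in zip(pk, ks))               # 3.0
var = sum(p * (k - mu) ** 2 for p, k in zip(pk, ks))  # 1.0
print(mu, var)
```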

Properties of Variance

Let $c$ be a constant, and let $X, X_1, X_2, \dots, X_n$ be independent random variables.

  • $\mathbf{Var}[c] = 0$
  • $\mathbf{Var}[cX] = c^2 \mathbf{Var}[X]$
  • $\mathbf{Var}[X + c] = \mathbf{Var}[X]$
  • $\mathbf{Var}[X_1 + X_2] = \mathbf{Var}[X_1] + \mathbf{Var}[X_2]$ (because $X_1\&X_2$ are independent)
    • Proof: let $\mu_1=\mathbf{E}[X_1]$ and $\mu_2 = \mathbf{E}[X_2]$. Thus, $\mathbf{Var}[X_1 + X_2] = \mathbf{Var}[X_1 - \mu_1 + X_2 - \mu_2 + \mu_1 + \mu_2] = \mathbf{Var}[X_1 - \mu_1 + X_2 - \mu_2]$ Note: $\mathbf{E}[X_1 - \mu_1] = \mathbf{E}[X_2 - \mu_2] = 0$
      Let's call $D_1 = X_1 - \mu_1$ and $D_2=X_2 - \mu_2$. We can see that $\mathbf{E}[D_1]=\mathbf{E}[D_2] = \mathbf{E}[D_1 + D_2] = 0$.
      Then, $\mathbf{Var}[D_1 + D_2] = \mathbf{E}[(D_1 + D_2 - \mathbf{E}[D_1 + D_2])^2] = \mathbf{E}[(D_1 + D_2)^2]\\=\mathbf{E}[D_1^2 + D_2^2 + 2D_1 D_2] = \mathbf{E}[D_1^2] + \mathbf{E}[D_2^2] + 2\mathbf{E}[D_1D_2] \\= \mathbf{E}[D_1^2] + \mathbf{E}[D_2^2] \quad \text{(by independence, } \mathbf{E}[D_1D_2]=\mathbf{E}[D_1]\,\mathbf{E}[D_2]=0\text{)} \\ = \mathbf{Var}[D_1] + \mathbf{Var}[D_2] = \mathbf{Var}[X_1] + \mathbf{Var}[X_2]$
  • Assume $X_1, X_2, \dots, X_n$ all have the same distribution. Therefore:
    $\mathbf{Var}[X_1]=\dots=\mathbf{Var}[X_n]$
    $\mathbf{Var}[X_1 + X_2 + \dots + X_n] = n\mathbf{Var}[X_1]$
    • Example: Let $X$ be a $Binom(n,p)$. We saw that $\mathbf{E}[X] = np$.
      Let $X_1 \sim Bern(p)$. Then $\mathbf{Var}[X_1] = \mathbf{E}[(X_1 - p)^2] \\ = (0-p)^2(1-p) + (1-p)^2 p \\= p(1-p)$
    • Therefore, if $X\sim Binom(n,p)$, then $\mathbf{E}[X] = np$ and $\mathbf{Var}[X]=np(1-p)$
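A short simulation sketch (my own; $n$, $p$ and the number of repetitions are arbitrary) that represents $X$ as a sum of $n$ independent Bernoulli trials and compares the empirical mean and variance to $np$ and $np(1-p)$:

```python
import random

n, p, reps = 20, 0.3, 50_000  # arbitrary parameters for the illustration

# Each sample of Binom(n, p) is built as a sum of n independent Bernoulli(p) trials.
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]

mean = sum(samples) / reps
var = sum((x - mean) ** 2 for x in samples) / reps

print(mean, n * p)           # ≈ 6.0
print(var, n * p * (1 - p))  # ≈ 4.2
```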

Theorem:

$\mathbf{E}[(X-c)^2] = \mathbf{Var}[X] + (c-\mu)^2$ where $\mu=\mathbf{E}[X] $

  • Note: $\mathbf{E}[(X-c)^2]$ is minimal when $c=\mu$.
  • Note: with $c=0$ we get $\mathbf{Var}[X] = \mathbf{E}[X^2] - \mu^2$
    This formula is convenient for computing variances.
  • For uniform distribution on $\{1,2,3..N\}$, we get
    $\mathbf{E}[X] = \frac{N+1}{2}$
    $\mathbf{Var}[X] = \frac{N^2-1}{12}$
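Both closed forms can be checked directly from the pmf $\mathbf{P}[X=k]=1/N$ using $\mathbf{Var}[X]=\mathbf{E}[X^2]-\mu^2$ (a sketch, with an arbitrary $N$):

```python
# Uniform distribution on {1, ..., N}: E[X] = (N+1)/2 and Var[X] = (N^2 - 1)/12.
N = 10
ks = range(1, N + 1)

mu = sum(k / N for k in ks)                  # E[X]
var = sum(k * k / N for k in ks) - mu ** 2   # E[X^2] - mu^2

print(mu, (N + 1) / 2)        # 5.5   5.5
print(var, (N * N - 1) / 12)  # 8.25  8.25
```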

Chebyshev Inequality

Let $X$ be a random variable with $\mathbf{E}[X] = \mu$ and $\mathbf{Var}[X]=\sigma^2$. Then, for any $k > 0$,
$$\mathbf{P}[\vert X - \mu \vert \ge k] \le \frac{\sigma^2}{k^2}$$

  • To prove it, apply the Markov inequality to the non-negative random variable $\vert X - \mu\vert ^2$ with $c = k^2$.
Example:

Let $X_1,\dots,X_n$ be $n$ independent and identically distributed ($iid$) random variables, each with mean $\mu$ and variance $\sigma^2$. These could be samples in an experiment. Let $\bar{X}_n = \frac{X_1 + X_2 + \dots + X_n}{n}$ (called the "sample mean"). We see that $$\mathbf{E}[\bar{X}_n] = \mu$$ $$\mathbf{Var}[\bar{X}_n] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

  • By Chebyshev inequality: $\mathbf{P}[|\bar{X}_n - \mu|\ge k] \le \frac{\sigma^2}{n k^2}$
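A simulation sketch (my own; it uses fair-die rolls with arbitrary $n$ and $k$) comparing the empirical frequency of $|\bar{X}_n - \mu| \ge k$ with the bound $\frac{\sigma^2}{nk^2}$:

```python
import random

mu, sigma2 = 3.5, 35 / 12    # mean and variance of a single fair-die roll
n, k, reps = 30, 0.5, 20_000

def sample_mean(n):
    """Average of n independent fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

freq = sum(1 for _ in range(reps) if abs(sample_mean(n) - mu) >= k) / reps
bound = sigma2 / (n * k ** 2)

print(freq, bound)  # the empirical frequency stays below the bound ≈ 0.389
```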

Weak Law of Large Numbers

For every $\epsilon > 0$, $$\mathbf{P}[|\bar{X}_n - \mu| \ge \epsilon] \to_{n\to \infty} 0$$

  • This remains true even if $\sigma^2 = \infty$ (as long as $\mu$ is finite), although the Chebyshev-based argument above requires a finite variance.

  • Intuitively, it means that the empirical mean will be close to the actual mean, when $n \to \infty$
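To see this concretely, the sketch below (fair-die rolls, arbitrary sample sizes) prints the sample mean for increasing $n$; it drifts toward $\mu = 3.5$:

```python
import random

# Weak law of large numbers: the sample mean of fair-die rolls approaches mu = 3.5.
for n in [10, 100, 1_000, 10_000, 100_000]:
    xbar = sum(random.randint(1, 6) for _ in range(n)) / n
    print(n, xbar)
```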

Unbiased Estimation of Variance

$\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$ is called "sample variance".

  • Here, $\bar{X}_n = \text{ sample mean or empirical mean} = \frac{\sum_{k=1}^n X_k}{n} $
  • $\hat{\sigma}_n^2$ is an estimator of $\sigma^2$ but it is biased.
$$\mathbf{E}[\hat{\sigma}_n^2] = \sigma^2 (1 - \frac{1}{n})$$
  • To remove the bias, we use $$\bar{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$$

Proof of $\mathbf{E}[\hat{\sigma}_n^2] = \sigma^2 (1 - \frac{1}{n})$

Expanding the square, $\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}_n^2$. Taking expectations, $\mathbf{E}\left[\frac{1}{n}\sum_{i=1}^n X_i^2\right] = \mathbf{E}[X_1^2] = \sigma^2 + \mu^2$, while $\mathbf{E}[\bar{X}_n^2] = \mathbf{Var}[\bar{X}_n] + \mathbf{E}[\bar{X}_n]^2 = \frac{\sigma^2}{n} + \mu^2$. Subtracting, $\mathbf{E}[\hat{\sigma}_n^2] = (\sigma^2 + \mu^2) - (\frac{\sigma^2}{n} + \mu^2) = \sigma^2\left(1 - \frac{1}{n}\right)$.
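An empirical check of the bias factor (my own sketch, using fair-die rolls with arbitrary $n$ and number of repetitions): averaging $\hat{\sigma}_n^2$ over many samples gives roughly $\sigma^2(1-\frac{1}{n})$, while the $\frac{1}{n-1}$ version gives roughly $\sigma^2$.

```python
import random

n, reps = 5, 50_000   # sample size and number of repetitions (arbitrary)
sigma2 = 35 / 12      # true variance of one fair-die roll

biased_total, unbiased_total = 0.0, 0.0
for _ in range(reps):
    xs = [random.randint(1, 6) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    biased_total += ss / n          # sigma_hat^2 (divide by n)
    unbiased_total += ss / (n - 1)  # corrected estimator (divide by n-1)

print(biased_total / reps, sigma2 * (1 - 1 / n))  # ≈ 2.33
print(unbiased_total / reps, sigma2)              # ≈ 2.92
```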