The covariance of any two random variables $X, Y$ is defined as
\begin{align} \operatorname{Cov}(X, Y) &= \mathbb{E}\left( (X - \mathbb{E}(X)) (Y - \mathbb{E}(Y)) \right) \\ &= \mathbb{E}(XY) - \mathbb{E}(X) \, \mathbb{E}(Y) & \quad \text{similar to definition of variance} \end{align}Covariance is a measure of how $X, Y$ might vary in tandem.
If the product $(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))$ tends to be positive, then either both factors tend to be positive ($X$ and $Y$ are both above their respective means) or both tend to be negative ($X$ and $Y$ are both below their means); in other words, $X$ and $Y$ tend to move together.
Correlation is defined in terms of covariance, as you will see in a bit.
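As a quick sanity check, the two formulas above should give the same number on any data set. Here is a minimal simulation, assuming NumPy is available; the distributions and the dependence of `y` on `x` are arbitrary choices for illustration.

```python
# Numerical sanity check of the two equivalent covariance formulas,
# using simulated data (the distributions chosen here are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(2.0, 1.5, size=n)
y = 0.5 * x + rng.normal(0.0, 1.0, size=n)   # y partially depends on x

# Cov(X, Y) = E[(X - EX)(Y - EY)]
cov_centered = np.mean((x - x.mean()) * (y - y.mean()))

# Cov(X, Y) = E[XY] - E[X] E[Y]
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

print(cov_centered, cov_shortcut)   # both ≈ 0.5 * Var(X) ≈ 1.125
```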
If you hold one of the arguments fixed, covariance is linear in the other; it also distributes over sums, much like the distributive property. This is called bilinearity, and it lets us do the following:
\begin{align} \operatorname{Cov}(\lambda X, Y) &= \lambda \, \operatorname{Cov}(X, Y) = \operatorname{Cov}(X, \lambda Y) & \quad \text{freely move scaling factors} \\ \\ \operatorname{Cov}(X, Y_1 + Y_2) &= \operatorname{Cov}(X, Y_1) + \operatorname{Cov}(X, Y_2) \\ \operatorname{Cov}(X_1 + X_2, Y) &= \operatorname{Cov}(X_1, Y) + \operatorname{Cov}(X_2, Y) & \quad \text{covariance distributes over sums} \\ \end{align}
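These identities are easy to spot-check numerically. The sketch below (again assuming NumPy; the distributions and the constant $\lambda$ are arbitrary) verifies both the scaling and the distribution-over-sums properties on simulated data.

```python
# Simulation (not a proof) illustrating bilinearity of covariance:
# scaling factors pull out, and covariance distributes over sums.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x  = rng.exponential(1.0, size=n)
y1 = rng.normal(0.0, 1.0, size=n) + x
y2 = rng.uniform(-1.0, 1.0, size=n)
lam = 3.7

def cov(a, b):
    """Sample covariance, via E[ab] - E[a]E[b]."""
    return np.mean(a * b) - a.mean() * b.mean()

print(cov(lam * x, y1), lam * cov(x, y1))          # ≈ equal
print(cov(x, y1 + y2), cov(x, y1) + cov(x, y2))    # ≈ equal
```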
Another useful property: if $X$ and $Y$ are independent, then $\operatorname{Cov}(X,Y) = 0$, i.e. they are uncorrelated. The converse is not true, however: $\operatorname{Cov}(X,Y) = 0$ does not necessarily mean that $X,Y$ are independent.
Consider $Z \sim \mathcal{N}(0,1)$, and let $X = Z$ and $Y = Z^2$.
\begin{align} \operatorname{Cov}(X,Y) &= \mathbb{E}(XY) - \mathbb{E}(X)\,\mathbb{E}(Y) \\ &= \mathbb{E}(Z^3) - \mathbb{E}(Z) \, \mathbb{E}(Z^2) \\ &= 0 - 0 & \quad \text{odd moments of }Z \text{ are 0} \\ &= 0 \end{align}But given $X$, we know $Y$ exactly; and knowing $Y$ gives us $X$ up to sign.
So $X,Y$ are dependent, yet their covariance (and hence correlation) is 0.
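A short simulation of this counterexample (assuming NumPy): the sample covariance comes out near 0 even though $Y$ is a deterministic function of $X$.

```python
# Simulating the Z, Z^2 example: sample covariance ≈ 0 despite total dependence.
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)
x, y = z, z**2

cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(cov_xy)   # ≈ 0, yet Y = X^2 exactly
```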
Correlation is defined in terms of covariance.
\begin{align} \operatorname{Corr}(X,Y) &= \frac{\operatorname{Cov}(X,Y)}{\sigma_{X} \, \sigma_{Y}} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)} \, \sqrt{\operatorname{Var}(Y)}} \\ \\ &= \operatorname{Cov} \left( \frac{X-\mathbb{E}(X)}{\sigma_{X} }, \frac{Y-\mathbb{E}(Y)}{\sigma_{Y}} \right) & \quad \text{standardize first, then find covariance} \\ \end{align}Correlation is dimensionless: the units of $X$ and $Y$ cancel against those of $\sigma_X$ and $\sigma_Y$, so it is unit-less.
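The two expressions should agree numerically. The sketch below (assuming NumPy, with arbitrarily chosen distributions) computes the correlation both ways and compares against `np.corrcoef`.

```python
# Correlation computed two ways: covariance / (sigma_X * sigma_Y),
# and covariance of the standardized variables. They should match.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.gamma(2.0, 2.0, size=n)
y = x + rng.normal(0.0, 3.0, size=n)

def cov(a, b):
    return np.mean(a * b) - a.mean() * b.mean()

corr_ratio = cov(x, y) / (x.std() * y.std())

x_std = (x - x.mean()) / x.std()     # standardize first ...
y_std = (y - y.mean()) / y.std()
corr_standardized = cov(x_std, y_std)  # ... then take covariance

print(corr_ratio, corr_standardized, np.corrcoef(x, y)[0, 1])  # all ≈ equal
```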
Correlation also ranges from $-1$ to $1$; this is a consequence of dividing by $\sqrt{\operatorname{Var}(X)} \, \sqrt{\operatorname{Var}(Y)}$, as the following proof shows.
Without loss of generality, assume that $X,Y$ are already standardized (mean 0, variance 1).
\begin{align} \operatorname{Var}(X + Y) &=\operatorname{Var}(X) + \operatorname{Var}(Y) + 2\, \operatorname{Cov}(X,Y) \\ &= 1 + 1 + 2 \, \rho & \quad \text{where } \rho = \operatorname{Cov}(X,Y) = \operatorname{Corr}(X,Y) \text{, since } X,Y \text{ are standardized} \\ &= 2 + 2 \, \rho \\ 0 &\le 2 + 2 \, \rho & \quad \text{since }\operatorname{Var} \ge 0 \\ 0 &\le 1 + \rho & \Rightarrow \rho \text{ has a floor of }-1 \\ \\ \operatorname{Var}(X - Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\, \operatorname{Cov}(X,Y) \\ &= 1 + 1 - 2 \, \rho \\ &= 2 - 2 \, \rho \\ 0 &\le 2 - 2 \, \rho & \quad \text{since }\operatorname{Var} \ge 0 \\ 0 &\le 1 - \rho & \Rightarrow \rho \text{ has a ceiling of }1 \\ \\ &\therefore -1 \le \operatorname{Corr}(X,Y) \le 1 &\quad \blacksquare \end{align}Given $(X_1, \dots , X_k) \sim \operatorname{Mult}(n, \vec{p})$, find $\operatorname{Cov}(X_i, X_j)$ for all $i,j$.
\begin{align} \text{case 1, where } i=j \text{ ...} \\ \\ \operatorname{Cov}(X_i, X_j) &= \operatorname{Var}(X_i) \\ &= n \, p_i \, (1 - p_i) \\ \\ \\ \text{case 2, where } i \ne j \text{ ...} \\ \\ \operatorname{Var}(X_i + X_j) &= \operatorname{Var}(X_i) + \operatorname{Var}(X_j) + 2 \, \operatorname{Cov}(X_i, X_j) & \quad \text{property [7] above} \\ n \, (p_i + p_j) \, (1 - (p_i + p_j)) &= n \, p_i \, (1 - p_i) + n \, p_j \, (1 - p_j) + 2 \, \operatorname{Cov}(X_i, X_j) & \quad \text{lumping property} \\ 2 \, \operatorname{Cov}(X_i, X_j) &= n \, (p_i + p_j) \, (1 - (p_i + p_j)) - n \, p_i \, (1 - p_i) - n \, p_j \, (1 - p_j) \\ &= n \left( (p_i + p_j) - (p_i + p_j)^2 - p_i + p_i^2 - p_j + p_j^2 \right) \\ &= n ( -2 \, p_i \, p_j ) \\ \operatorname{Cov}(X_i, X_j) &= - n \, p_i \, p_j \end{align}Notice how case 2, where $i \ne j$, yields $\operatorname{Cov}(X_i, X_j) = - n \, p_i \, p_j$, which is negative. This should agree with your intuition that for $(X_1, \dots , X_k) \sim \operatorname{Mult}(n, \vec{p})$, categories $i$ and $j$ are competing with one another, and so they can be neither positively correlated nor independent of each other.
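Here is a quick numerical check of both cases (assuming NumPy; the choice of $n$ and $\vec{p}$ below is arbitrary). Cases 1 and 2 together say the covariance matrix of the counts is $n \, (\operatorname{diag}(\vec{p}) - \vec{p}\,\vec{p}^{\,T})$, which we compare against the sample covariance matrix of simulated multinomial draws.

```python
# Empirical vs. theoretical covariance matrix of a multinomial:
# diagonal ≈ n p_i (1 - p_i), off-diagonal ≈ -n p_i p_j.
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, np.array([0.2, 0.3, 0.5])              # arbitrary example values
samples = rng.multinomial(n, p, size=500_000)     # shape (500000, 3)

empirical_cov = np.cov(samples, rowvar=False)     # 3x3 sample covariance matrix
theoretical_cov = n * (np.diag(p) - np.outer(p, p))

print(np.round(empirical_cov, 3))
print(np.round(theoretical_cov, 3))
```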
Applying what we now know about covariance, we can obtain the variance of $X \sim \operatorname{Bin}(n, p)$.
We can write $X = X_1 + \dots + X_n$, where the $X_i$ are i.i.d. $\operatorname{Bern}(p)$.
Now consider indicator random variables. Let $I_A$ be the indicator of event $A$.
\begin{align} I_A &\in \{0, 1\} \\ \\ \Rightarrow I_A^2 &= I_A \\ I_A^3 &= I_A \\ & \text{...} \\ \\ \Rightarrow I_A \, I_B &= I_{A \cap B} & \quad \text{for another event } B \\ \\ \operatorname{Var}(X_i) &= \mathbb{E}X_i^2 - ( \mathbb{E}X_i)^2 \\ &= p - p^2 & \quad X_i \text{ is an indicator, so } \mathbb{E}X_i^2 = \mathbb{E}X_i = p \\ &= p \, (1 - p) \\ &= p \, q & \quad \text{variance of Bernoulli} \\ \\ \Rightarrow \operatorname{Var}(X) &= \operatorname{Var}(X_1 + \dots + X_n) \\ &= \operatorname{Var}(X_1) + \dots + \operatorname{Var}(X_n) + 2 \, \left( \sum_{i < j} \operatorname{Cov}(X_i, X_j) \right) \\ &= n \, p \, q + 2 \, (0) & \quad \operatorname{Cov}(X_i,X_j) = 0 \text{ since the } X_i \text{ are independent} \\ &= n \, p \, q \end{align}
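A minimal simulation of this result, assuming NumPy (the particular $n$ and $p$ are arbitrary): build $X$ as a sum of $n$ i.i.d. $\operatorname{Bern}(p)$ indicators and compare the sample variance to $n \, p \, q$.

```python
# Bin(n, p) as a sum of i.i.d. Bern(p) indicators: sample variance vs. n*p*q.
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 0.3           # arbitrary example values
q = 1 - p

bernoullis = rng.binomial(1, p, size=(1_000_000, n))   # i.i.d. Bern(p) draws
x = bernoullis.sum(axis=1)                             # X = X_1 + ... + X_n

print(x.var(), n * p * q)   # both ≈ 4.2
```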
Next, given $X \sim \operatorname{HGeom}(w, b, n)$, let's find $\operatorname{Var}(X)$, where
\begin{align} X &= X_1 + \dots + X_n \\ \\ X_j &= \begin{cases} 1, &\text{ if }j^{th} \text{ ball is White} \\ 0, &\text{ otherwise } \\ \end{cases} \\ \end{align}Recall that the balls are sampled without replacement, so the draws are not independent.
However, on any given draw, no particular ball is more or less likely to be chosen than any other, so every $X_j$ has the same distribution, and every pair $(X_i, X_j)$ with $i \ne j$ has the same joint distribution. So there is some symmetry we can leverage!
\begin{align} \operatorname{Var}(X) &= n \, \operatorname{Var}(X_1) + 2 \, \binom{n}{2} \operatorname{Cov}(X_1, X_2) & \quad \text{symmetry!} \\ \\ \text{now } \operatorname{Cov}(X_1, X_2) &= \mathbb{E}(X_1 X_2) - \mathbb{E}(X_1) \, \mathbb{E}(X_2) \\ &= \frac{w}{w+b} \, \frac{w-1}{w+b-1} - \left( \frac{w}{w+b} \right)^2 & \quad \text{recall } I_A \, I_B = I_{A \cap B} \text{; and symmetry} \\ \end{align}Prof. Blitzstein runs out of time, but the rest is just algebra. To be concluded in the next lecture.
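In the meantime, here is a numerical check of the pieces we already have, assuming NumPy's hypergeometric sampler and arbitrary values of $w$, $b$, $n$: the sample variance of $\operatorname{HGeom}(w, b, n)$ draws should match $n \, \operatorname{Var}(X_1) + 2 \binom{n}{2} \operatorname{Cov}(X_1, X_2)$ using the expressions above.

```python
# Finishing the algebra numerically: empirical Var(X) for HGeom(w, b, n)
# vs. n*Var(X_1) + 2*C(n,2)*Cov(X_1, X_2) with the expressions derived above.
import numpy as np
from math import comb

w, b, n = 8, 12, 5       # arbitrary example values
N = w + b

rng = np.random.default_rng(6)
samples = rng.hypergeometric(w, b, n, size=1_000_000)  # number of white balls drawn

p1 = w / N                                   # P(any given draw is white), by symmetry
var_x1 = p1 * (1 - p1)                       # variance of a single indicator
cov_x1x2 = (w / N) * ((w - 1) / (N - 1)) - p1**2

var_formula = n * var_x1 + 2 * comb(n, 2) * cov_x1x2
print(samples.var(), var_formula)            # both ≈ 0.95
```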
Let's relate what we know about variance, covariance, and correlation to the concept of the dot product in linear algebra, thinking of (centered) random variables $X, Y$ as vectors:
\begin{align} \operatorname{Cov}(X, Y) &= X \cdot Y \\ &= x_1 \, y_1 + \dots + x_n \, y_n \\ \\ \operatorname{Var}(X) &= \operatorname{Cov}(X, X) \\ &= X \cdot X \\ &= x_1 \, x_1 + \dots + x_n \, x_n \\ &= |X|^2 \\ \\ \Rightarrow |X| &= \sigma_{X} \\ &= \sqrt{\operatorname{Var}(X)} \\ &= \sqrt{x_1 \, x_1 + \dots + x_n \, x_n} & \quad \text{the size of }\vec{X} \text{ is its std. deviation} \\ \end{align}Now let's take this a bit further...
\begin{align} X \cdot Y &= |X| \, |Y| \, \cos \theta \\ \\ \Rightarrow \cos \theta &= \frac{X \cdot Y}{|X| \, |Y|} \\ &= \frac{\operatorname{Cov}(X, Y)}{\sigma_{X} \, \sigma_{Y}} \\ &= \rho & \quad \text{... otherwise known as }\operatorname{Corr}(X,Y) \\ \end{align}Since $\rho_{X,Y} = \cos \theta$, the correlation of $X,Y$ is the cosine of the angle between the two vectors, the same factor that appears when projecting $X$ onto $Y$, or vice versa. Now you can see why cosine similarity can be used as a metric for how close two vectors $X,Y$ are: for centered vectors, it is exactly $\operatorname{Corr}(X,Y)$.
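To see the analogy concretely for data vectors, the sketch below (assuming NumPy, with arbitrary simulated data) centers two vectors, computes the cosine of the angle between them, and compares it with the Pearson correlation from `np.corrcoef`.

```python
# Dot-product analogy: for data vectors, the cosine of the angle between
# the *centered* vectors equals the sample correlation.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(10.0, 2.0, size=1_000)
y = 0.7 * x + rng.normal(0.0, 3.0, size=1_000)

xc, yc = x - x.mean(), y - y.mean()            # center both vectors
cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(cos_theta, np.corrcoef(x, y)[0, 1])      # identical up to floating point
```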
View Lecture 21: Covariance and Correlation | Statistics 110 on YouTube.