In [5]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
y = np.array([2,3,1,2,3,4.5,3.5,4.5,6.5,7,5,1,3,2.8,5])
fig = plt.figure()
plt.scatter(x, y)
plt.xlim(0,7)
plt.ylim(0,8)
plt.show()
$x$ is the explanatory variable, and $y$ is the response variable.
$$y = b_0 ~+~ b_1 x$$
More generally, each $x_i$ is a $k$-dimensional vector.
That is, we want an explanation of the form:
$$y = b_0 ~+~ b_1x_1 + ... ~+~ b_kx_k$$
Each $x_j$ for $j=1,...,k$ is actually an $n$-dimensional vector of data. Thus, $x_j = \left(x_{i,j}\right)_{i=1,...,n}$.
Same question here: fit a hyperplane of dimension $k$ in $\mathbb{R}^{k+1}$ through the cloud of points $\displaystyle \left(x_{i,1}, ..., x_{i,k}, y_i\right)_{i=1,..,n}$.
The problem is that the question is not precise; we need a precise criterion. We want to minimize the error between $y_i$ and $b_0 + \displaystyle \sum_{j=1}^k b_j x_{ij}$.
We compute the Euclidean distance between the vector $y=\displaystyle \left(y_i\right)_{i=1}^n\in \mathbb{R}^n$ and the explanatory vector $\displaystyle \left(b_0 + \sum_{j=1}^k b_jx_{ij}\right)_{i=1}^n$.
$$Q(b) = \sum_{i=1}^n \left(y_i - \left(b_0 + \sum_{j=1}^k b_jx_{ij}\right)\right)^2 = \text{ squared distance between }y\in \mathbb{R}^n \text{ and }\left(b_0 + \sum_{j=1}^k b_jx_{ij}\right)_{i=1}^n \in \mathbb{R}^n$$
Finally, the answer to the question is: find $b$ to make $Q(b)$ as small as possible.
The criterion "find $b$ to minimize $Q(b)$" is the PRINCIPLE of LEAST SQUARES.
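As a quick numeric illustration for the $k=1$ case, here is a minimal sketch that evaluates $Q$ on the scatter data from the first cell (`np.polyfit` is used only to produce a reference fit):
In [ ]:
# Minimal numeric sketch of the least-squares criterion Q(b) for the k = 1 case,
# reusing the scatter data from the first cell (redefined here so the cell is self-contained).
import numpy as np
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
y = np.array([2,3,1,2,3,4.5,3.5,4.5,6.5,7,5,1,3,2.8,5])

def Q(b0, b1):
    """Squared distance between y and the line b0 + b1*x."""
    return np.sum((y - (b0 + b1 * x))**2)

b1_hat, b0_hat = np.polyfit(x, y, 1)   # polyfit returns [b1, b0] (highest degree first)
print(Q(b0_hat, b1_hat))               # the minimal value of Q
print(Q(b0_hat + 0.1, b1_hat))         # perturbing b strictly increases Q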
In [21]:
x = [1,5]
y = [1,5]
plt.plot(x, y)                        # the line y = x, standing in for the subspace V
plt.scatter(4, 2)                     # the data point y
plt.scatter(3.37, 3.37)               # a point on the line near the projection y-hat
plt.plot([4, 3.37], [2, 3.37], '--')  # dashed segment suggesting y - y-hat (schematic)
plt.show()
Then, we see that $\hat{y}$ should be exactly the orthogonal projection of the point $y$ onto the linear subspace $V$ generated by the data vectors $x_1,x_2,...,x_k$. This means $y - \hat{y} \perp V$.
This really means the following:
$$\forall j=1,...,k: ~~~ y_{\cdot}-\hat{y}_{\cdot} ~\perp~ x_{\cdot j}$$
Therefore, the only thing that we need to do is to solve the following for $b=(b_0,b_1,...,b_k)$:
$$\forall j=0,1,...,k: ~~~~ y-\sum_{j'=0}^k b_{j'}x_{j'} ~\perp~ x_j$$
Here $x_0 = \left(\begin{array}{c}1\\1\\ \vdots \\1\end{array}\right) = \text{the } n\text{-dimensional vector of }1\text{'s}$.
This is equivalent to
$$\forall j=0,1,...,k:~~~~ \sum_{i=1}^n \left(y_i - \sum_{j'=0}^k b_{j'}x_{ij'}\right)x_{ij} = 0$$
These $k+1$ equations are all linear in $b=\left(b_0, b_1, ..., b_k\right)^T$. There are $k+1$ unknowns in $b$.
Our system is
$$\forall j=0,...,k: ~~~~ y_{\cdot} \boldsymbol{\cdot} x_{\cdot j} = \sum_{j'=0}^k b_{j'} ~ x_{\cdot j}\boldsymbol{\cdot}x_{\cdot j'}$$
We want to write this system in matrix-vector notation:
$b=\left(b_0, b_1, ..., b_k\right)^T$ is a $(k+1)$-dimensional column vector.
Consider the matrix $X^T = \left[\begin{array}{ccc} \cdots & x_0 & \cdots\\ \cdots & x_1 & \cdots\\ & \vdots & \\ \cdots & x_k & \cdots\end{array}\right]$, a $(k+1)\times n$ matrix whose rows are the vectors $x_0, x_1, ..., x_k$.
Notice: $\displaystyle \left(y \boldsymbol{\cdot} x_j\right)_{j=0}^k = X^T y$
Also, notice the matrix $M = \displaystyle \left(x_j \boldsymbol{\cdot} x_{j'}\right)_{\begin{array}{c}j=0,...,k\\j'=0,...,k\end{array}}$ can be expressed as $M = X^T X$.
Therefore, the right-hand side of the system, $\displaystyle \sum_{j'=0}^k b_{j'} ~ x_{j}\boldsymbol{\cdot} x_{j'}$, is the $(k+1)$-dimensional vector $Mb$.
Summarizing, our system is:
$$X^T y = X^T X\, b ~~~~ \text{i.e.} ~~~~ Mb = X^T y$$
(this is called the "normal equation"; here, "normal" refers to orthogonality.)
If the vectors $x_0,x_1,...,x_k$ are linearly independent, then $M$ is invertible.
Therefore the solution is:
$$b = M^{-1} X^T y$$
The error in this regression is $e = y - \hat{y}$.
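Here is a minimal sketch of this recipe in numpy for the scatter data above: build the design matrix with a column of ones for $b_0$, form $M = X^TX$, and solve $Mb = X^Ty$ (`np.linalg.lstsq` is only a cross-check):
In [ ]:
# Sketch: build the design matrix X (column of 1's for b_0, column of x values)
# and solve the normal equations M b = X^T y directly.
import numpy as np
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
y = np.array([2,3,1,2,3,4.5,3.5,4.5,6.5,7,5,1,3,2.8,5])
X = np.column_stack([np.ones_like(x), x])    # n x (k+1) design matrix, here k = 1
M = X.T @ X
b = np.linalg.solve(M, X.T @ y)              # b = M^{-1} X^T y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)                                     # [b0, b1]
print(b_lstsq)                               # same answer from numpy's own routine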
Since, by definition, $e \perp x_j ~~\forall j$, we have $e \perp \hat{y}$, so by Pythagoras, $\displaystyle \|y\|^2 = \|y-\hat{y} + \hat{y}\|^2 = \|y-\hat{y}\|^2 + \|\hat{y}\|^2$.
$\|y\|^2 = \text{Total Sum of Squares}$
$\|y-\hat{y}\|^2 = \|e\|^2 = \text{Residual Sum of Squares (or Error sum of Squares)}$
$\|\hat{y}\|^2 = \text{Regression Sum of Squares}$
$$"TSS" = "ResidualSS" ~+~ "RegressionSS"$$
It turns out that $$\displaystyle \|\hat{y}\|^2 = U^T b = \left(X^T y\right)^T b, ~~~ \text{where } U := X^T y$$
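A quick numeric check of the sum-of-squares decomposition and of $\|\hat{y}\|^2 = (X^T y)^T b$ on the same scatter data (a sketch, not part of the derivation):
In [ ]:
# Check TSS = ResidualSS + RegressionSS and ||y_hat||^2 = (X^T y)^T b on the data above.
import numpy as np
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
y = np.array([2,3,1,2,3,4.5,3.5,4.5,6.5,7,5,1,3,2.8,5])
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
e = y - y_hat
print(np.sum(y**2), np.sum(e**2) + np.sum(y_hat**2))   # TSS = ResidualSS + RegressionSS
print(np.sum(y_hat**2), (X.T @ y) @ b)                 # ||y_hat||^2 = (X^T y)^T b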
In the case $k=1$ (a single explanatory vector $x$, no intercept), it turns out that $$b = \frac{1}{\|x\|^2}x^T y$$
Now, let's look at the $k=1$ case in more detail; write $\bar{x}=\text{mean of }x$ and $\bar{y} = \text{mean of }y$, and use the coefficients $b_0$ and $b_1$.
We know that $\hat{y} = b_0 + b_1 x~~~~ \left(x,y ~\in~\mathbb{R}\right)$
In other words, $$\begin{array}{ll} \forall i=1,...,n: ~~~~ \hat{y}_i & = b_0 + b_1x_i \\ &= a_0 + a_1(x_i - \bar{x})\end{array}$$
Now, find $a_0 \& a_1$.
It turns out that $a_1 = b_1$ and $a_0 = \bar{y}$.
Therefore, the regression line is $y = \bar{y} + b_1 (x- \bar{x})$,
and in particular, the "point of averages $(\bar{x},\bar{y})$" is on the line.
Also, turns out that $$b_1 = \frac{~~ S_{xy} ~~ }{ ~~ S_{xx} ~~ } = \frac{\displaystyle \sum_{i=1}^n (x_ i -\bar{x})y_i}{\displaystyle \sum_{i=1}^n (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
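A short sketch computing $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$ on the same data, checked against `np.polyfit`:
In [ ]:
# Sketch: b1 = S_xy / S_xx and b0 = ybar - b1 * xbar, checked against np.polyfit.
import numpy as np
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
y = np.array([2,3,1,2,3,4.5,3.5,4.5,6.5,7,5,1,3,2.8,5])
xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar)**2)
S_xy = np.sum((x - xbar) * y)        # same as np.sum((x - xbar) * (y - ybar))
b1 = S_xy / S_xx
b0 = ybar - b1 * xbar
print(b0, b1)
print(np.polyfit(x, y, 1))           # [b1, b0] -- should match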
Also, $$\begin{array}{lcl}\text{RegressionSS} &=& \displaystyle \|\hat{y}\|^2 = \bar{y}^2 n + \frac{~~ S_{xy}^2 ~~ }{ ~~ S_{xx} ~~ } \\ \text{ResidualSS} &=& \displaystyle \|y - \hat{y}\|^2 = \|y\|^2 - \|\hat{y}\|^2 = \sum_{i=1}^n y_i^2 - \bar{y}^2n - \frac{~~ S_{xy}^2 ~~ }{ ~~ S_{xx} ~~ } \\ &=& \sum_{i=1}^n (y_i - \bar{y})^2 - \frac{~~ S_{xy}^2 ~~ }{ ~~ S_{xx} ~~ }\end{array}$$
Correlation Coefficient: $$r = \frac{S_{xy}}{\displaystyle \sqrt{S_{xx} ~ S_{yy}}}$$
Note: we can find $b_1$ by converting $r$ into the right scale:
$$b_1 = r \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}$$
Coefficient of Determination:
$$R^2 = \frac{\displaystyle \text{Regression Sum of Squares (RegressionSS)}}{\displaystyle \text{Total Sum of Squares (TotalSS)}}$$
$R^2$ is interpreted as the proportion of variability in the data $y$ which is explained by the data $x$ via $\hat{y}$.
Note that $0 \le R^2 \le 1$; we want $R^2$ as large as possible.
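A sketch computing $r$ and $R^2$ for the same data; note that it uses the centered sums of squares ($\text{TSS} = S_{yy}$, a common convention and an assumption of this sketch), under which $R^2 = r^2$ when $k=1$:
In [ ]:
# Sketch: correlation coefficient r and R^2 for the same data, using the centered
# sums of squares (TSS = S_yy); in the k = 1 case R^2 equals r^2.
import numpy as np
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
y = np.array([2,3,1,2,3,4.5,3.5,4.5,6.5,7,5,1,3,2.8,5])
xbar, ybar = x.mean(), y.mean()
S_xx = np.sum((x - xbar)**2)
S_yy = np.sum((y - ybar)**2)
S_xy = np.sum((x - xbar) * (y - ybar))
r = S_xy / np.sqrt(S_xx * S_yy)
b1 = r * np.sqrt(S_yy) / np.sqrt(S_xx)       # same as S_xy / S_xx
b0 = ybar - b1 * xbar
y_hat = b0 + b1 * x
R2 = np.sum((y_hat - ybar)**2) / S_yy        # RegressionSS / TotalSS (centered)
print(r, r**2, R2)                           # R2 equals r^2 here
print(np.corrcoef(x, y)[0, 1])               # numpy's r, for comparison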
General case: $k \ge 1$:
$R^2 = \text{same definition as before}$
$R = \sqrt{R^2} = \text{Multiple Correlation Coefficient}$
Let's say $Y$ is explained as $Y = \beta x + \epsilon$
Here $\epsilon$ is a random vector in $\mathbb{R}^n$ whose components are (ideally) $iid$.
$Y = \left(Y_i\right)_{i=1}^n$ and $\epsilon = \left(\epsilon_i\right)_{i=1}^n$
Assume $\mathbf{E}[\epsilon_i] = 0$ and $\mathbf{Cov}[\epsilon_i, \epsilon_j] = 0 \text{ if } i\ne j$, and preferably that the $\epsilon_i$'s are all $iid$. In particular, $\mathbf{Var}[\epsilon_i] = \sigma^2 = \text{constant}$.
The hope: based on some initial data analysis with $k=1$, we found a "large" $R^2$ for the least-squares regression of the $y$ data against the $x$ data. Now, model $y$ as coming from a random variable $Y$ with, e.g., $\epsilon_i ~~ iid ~~ \mathcal{N}(0,\sigma^2)$.
Question: find properties of estimator for $\beta$. This $\beta$ is an unknown true parameter.
Strategy: use least-squares to estimate $\beta$
Since $k=1$, we know $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{x \cdot y}{\|x\|^2}$. This suggests that an estimator for $\beta$ should be
$$\hat{\beta} = \frac{x \cdot Y}{\|x\|^2} = \frac{x\cdot (\beta x + \epsilon)}{\|x\|^2} = \beta +\frac{x\cdot \epsilon}{\|x\|^2} $$
This $\hat{\beta}$ is the least-squares estimator for $\beta$ using the data $x$ and the model for $Y$.
Is it unbiased? Yes, because $\mathbf{E}[\epsilon_i] = 0 \forall i$
How about variance $\mathbf{Var}[\hat{\beta}]$? $$\mathbf{Var}[\hat{\beta}] = \mathbf{E}\left[\left(\frac{1}{\|x\|^2} x \cdot \epsilon\right)^2\right] = \frac{1}{\|x\|^4} \mathbf{E}\left[\left(\sum_{i=1}^n x_i \epsilon_i\right)^2\right] \\ = \frac{1}{\|x\|^4} \mathbf{E}\left[\sum_{i=1}^n \sum_{i'=1}^n x_i x_{i'} \epsilon_i \epsilon_{i'} \right]\\ = \frac{1}{\|x\|^4} \sum_{i=1}^n (x_i)^2 \mathbf{Var}[\epsilon_i]$$
here, we used the assumption that $\epsilon_i$ and $\epsilon_{i'}$ are uncorrelated if $i\ne i'$.
$$ = \sigma^2 \frac{\|x\|^2}{\|x\|^4}$$
We proved that if $\mathbf{Var}[\epsilon_i] = \sigma^2 ~(\text{constant}) ~\forall i$ and $\epsilon_i ~\&~ \epsilon_{i'}$ are uncorrelated for $i\ne i'$, then
$$\mathbf{Var}[\hat{\beta}] = \frac{\sigma^2}{\|x\|^2}$$
This is small if $n$ is big.
We found that $\hat{\beta}$, the LSE, is unbiased: $\mathbf{E}[\hat{\beta}] = \beta$, and has variance $\mathbf{Var}[\hat{\beta}] = \displaystyle \frac{\sigma^2}{\|x\|^2}$.
(By the way, $\hat{\beta} = \displaystyle \frac{x \cdot Y}{\|x\|^2}$.)
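A Monte Carlo sketch of these two properties (the "true" $\beta$, $\sigma$, and the number of replications below are arbitrary choices for the simulation):
In [ ]:
# Monte Carlo sketch: simulate Y = beta*x + eps many times and check that
# beta_hat = (x . Y) / ||x||^2 is unbiased with variance sigma^2 / ||x||^2.
import numpy as np
rng = np.random.default_rng(0)
x = np.array([1,2,3,3,4,3.5,3,4,5.5,5,6,2,3,4,4])
beta, sigma = 1.2, 0.5                           # hypothetical "true" parameters
n_rep = 20000
eps = rng.normal(0.0, sigma, size=(n_rep, x.size))
Y = beta * x + eps                               # each row is one simulated data set
beta_hat = Y @ x / np.dot(x, x)                  # LSE for each replication
print(beta_hat.mean(), beta)                     # ~ equal: unbiased
print(beta_hat.var(), sigma**2 / np.dot(x, x))   # ~ equal: Var = sigma^2 / ||x||^2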
Now consider the general case $Y = X\beta + \epsilon$, where $X$ is the design matrix and $\beta \in \mathbb{R}^k$. Here, $\epsilon$ is the same as before.
We try to repeat what we did before.
Define the Least Squares Estimate: $$\hat{\beta} = M^{-1}U$$ where $M = X^T X$ and $U = X^T Y$.
Now, we are interested in the mean and variance of $\hat{\beta}$. Write
$$\hat{\beta} = M^{-1}X^T Y = AY = A(X\beta + \epsilon) = \beta + A\epsilon$$
where $A=(X^TX)^{-1} X^T$.
Notice that $AX = I$, so $A$ is called the pseudo-inverse of $X$ (here $k=\text{rank}~X < n$).
In particular, $$\mathbf{E}[\hat{\beta}] = \beta + A\mathbf{E}[\epsilon] = \beta \Longrightarrow ~~ \hat{\beta}\text{ is unbiased}$$
Facts:
Since $\epsilon$ is normal and $\hat{\beta}$ is a linear function of $\epsilon$, thus $\hat{\beta}$ is also normal.
If the vectors $x_1, x_2, ..., x_k$ are mutually orthogonal, then the components of $\hat{\beta}$ are uncorrelated.
Thus, since $\hat{\beta}$ is a normal vector with uncorrelated components, its components are independent.
Similarly, let $\hat{Y} = X \hat{\beta}$ and let $e = Y - \hat{Y}$. Then $\hat{Y} ~\&~ e$ are uncorrelated, and since they are jointly normal, they are also independent.
$\hat{Y} = P_V Y = (\text{projection of }Y\text{ onto the vector space }V\text{ generated by the columns of }X ) = X(X^TX)^{-1} X^T Y$ (checked numerically in the sketch after this list).
Also, $\hat{\beta}$ is linear in $Y$ and uncorrelated with $e$, so $\hat{\beta} \& e$ are also normal and independent.
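A numeric sketch of the projection facts above (the small random design matrix and its dimensions are assumptions of the sketch):
In [ ]:
# Numeric check: P_V = X (X^T X)^{-1} X^T is symmetric and idempotent,
# Y_hat = P_V Y = X beta_hat, and e = Y - Y_hat is orthogonal to V.
import numpy as np
rng = np.random.default_rng(1)
n, k = 20, 3
X = rng.normal(size=(n, k))                  # hypothetical full-rank design matrix
Y = rng.normal(size=n)
P = X @ np.linalg.inv(X.T @ X) @ X.T         # projection onto V = column space of X
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e = Y - Y_hat
print(np.allclose(P, P.T), np.allclose(P @ P, P))   # symmetric, idempotent
print(np.allclose(P @ Y, Y_hat))                    # P_V Y = X beta_hat
print(np.allclose(X.T @ e, 0))                      # e is perpendicular to every x_j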
Some information about bilinear forms:
Any bilinear form $Q: (X,Y)\in \mathbb{R}^n\times \mathbb{R}^m \longrightarrow Q(X,Y) \in \mathbb{R}$ can be written as $Q(X,Y)=X^T B Y$ for some $n\times m$ matrix $B$.
Moreover, when $m=n$, $\mathbf{E}[Q(X,X)] = \text{trace}(B\, \mathbf{Cov}(X)) + Q(\mu, \mu)$, where $\mu = \mathbf{E}[X]$.
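A quick Monte Carlo check of this identity (the matrix $B$, mean $\mu$, and covariance $\Sigma$ below are arbitrary choices for the check):
In [ ]:
# Monte Carlo check of E[X^T B X] = trace(B Cov(X)) + Q(mu, mu).
import numpy as np
rng = np.random.default_rng(2)
n = 4
B = rng.normal(size=(n, n))                  # matrix of the bilinear form Q
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)                  # a valid covariance matrix
X = rng.multivariate_normal(mu, Sigma, size=200000)
lhs = np.mean(np.einsum('ij,jk,ik->i', X, B, X))   # Monte Carlo estimate of E[X^T B X]
rhs = np.trace(B @ Sigma) + mu @ B @ mu            # trace(B Cov(X)) + Q(mu, mu)
print(lhs, rhs)                                    # should be close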
Now, let's look for some quantities of interest:
$$\|e\|^2 = \|Y - \hat{Y}\|^2 = \| Y - P_VY\|^2 = \|(I - P_V)Y\|^2 = Y^T(I - P_V)Y$$
(the last equality uses that $I-P_V$ is symmetric and idempotent). Note, this formula is like $Q(Y,Y)$ where $Q$ is the bilinear form with matrix $B=I-P_V$.
So, from the previous formula: $$\mathbf{E}[\|e\|^2] = \text{trace}\left((I-P_V) \mathbf{Cov}(Y)\right) + \mathbf{E}[Y]^T(I-P_V)\mathbf{E}[Y]$$
The second term vanishes: under the model, $\mathbf{E}[Y] = X\beta$ lies in $V$, so $(I-P_V)\mathbf{E}[Y] = 0$.
$$\Longrightarrow ~~~~ \mathbf{E}[\|e\|^2] = \text{trace}\left(\mathbf{Cov}(Y) - P_V\, \mathbf{Cov}(Y)\right) = \sigma^2 (n - \text{trace}(P_V)) \\ = \sigma^2\left(n - \text{trace}\left(X(X^TX)^{-1}X^T\right)\right) = \sigma^2 \left(n - \text{trace}\left((X^TX)^{-1} (X^TX)\right)\right) = (n-k)\, \sigma^2$$
This uses the cyclic property of the trace: in a product of matrices, the factors can be permuted cyclically, e.g. $\text{trace}(ABC) = \text{trace}(CAB)$.
Notation: $\displaystyle S^2 := \frac{\|e\|^2}{n - k}$
Thus $S^2$ is an unbiased estimator of $\sigma^2$: $\mathbf{E}[S^2] = \sigma^2$.
The factor $\displaystyle \frac{1}{n-k}$ is reminiscent of the $\displaystyle \frac{1}{n-1}$ factor in the usual sample variance $S^2$. We expect that this new $S^2$, which generalizes the sample variance, should have some relation to $\chi^2_{n-k}$.
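A simulation sketch of both claims, with a hypothetical design matrix and normal errors (all sizes and parameters below are arbitrary choices):
In [ ]:
# Simulation sketch: S^2 = ||e||^2 / (n - k) averages to sigma^2, and
# (n - k) S^2 / sigma^2 has mean n - k, consistent with a chi^2_{n-k} law.
import numpy as np
rng = np.random.default_rng(3)
n, k, sigma = 25, 3, 0.7                         # hypothetical sizes and noise level
X = rng.normal(size=(n, k))
beta = rng.normal(size=k)
n_rep = 20000
eps = rng.normal(0.0, sigma, size=(n_rep, n))
Y = X @ beta + eps                               # one simulated data set per row
P = X @ np.linalg.inv(X.T @ X) @ X.T             # projection P_V (symmetric)
e = Y - Y @ P                                    # residuals (I - P_V) Y, row by row
S2 = np.sum(e**2, axis=1) / (n - k)
print(S2.mean(), sigma**2)                       # ~ equal: S^2 is unbiased
print(((n - k) * S2 / sigma**2).mean(), n - k)   # chi^2_{n-k} has mean n - k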
Lastly: we saw that $e \& \hat{\beta}$ are independent. Therefore, since $S^2$ is a deterministic function of $e$, then $S^2 \& \hat{\beta}$ are also independent.
We illustrate these properties by looking for confidence intervals for combinations of the $\beta$'s:
Setup: let $C_1, C_2, ..., C_k$ be fixed constants, and let $\eta = C_1\beta_1 + C_2\beta_2 + ... + C_k\beta_k$.
This $\eta$ is a linear combination of $\beta$'s of interest to you.
We start off with $\hat{\eta} : = C^T \hat{\beta}$. This is a LSE for $\eta$.
In matrix notation: $\eta = C^T \beta$
In fact, $\mathbf{E}[\hat{\eta}] = C^T \mathbf{E}[\hat{\beta}] = C^T \beta = \eta$ (since $\hat{\beta}$ is unbiased) $\Longrightarrow ~~ \hat{\eta}$ is also unbiased.
$\mathbf{Var}[\hat{\eta}] = C^T \mathbf{Cov}(\hat{\beta}) C $
Note: $\mathbf{Cov}(\hat{\beta}) = (X^TX)^{-1} \sigma^2$. So, $\mathbf{Var}(\hat{\eta}) = C^T (X^TX)^{-1} C ~ \sigma^2 = h(C)\,\sigma^2$, where we use the notation $h(C) = C^T (X^TX)^{-1} C$. Therefore, $$Z = \frac{\hat{\eta} - \eta}{\sqrt{h(C)\,\sigma^2}} \sim \mathcal{N}(0,1)$$
Also, consider $$U = \frac{(n-k) S^2}{\sigma^2}$$
it turns out that $U \sim \chi^2_{n-k}$
Moreover, from the previous results, $\hat{\beta} \& S^2$ are independent; therefore, $Z \& U$ are independent.
In particular, $T := \frac{Z}{\sqrt{U/(n-k)}} \sim t_{n-k}$ (Student's $t$-distribution).
We can compute this $T$ based only on the knowledge of $\eta, \hat{\eta}, h(C), ~\&~ S^2$: $$T = \frac{\hat{\eta} - \eta}{\sqrt{S^2~h(C)}}$$
so it does not depend on $\sigma^2$, only on $\eta$ and the observables $\hat{\eta}, S^2, h(C)$.
We want, at confidence level $\gamma$:
$$\gamma = \mathbf{P}\left[-t_{\frac{1+\gamma}{2}} \le T \le t_{\frac{1+\gamma}{2}}\right] \\ = \mathbf{P}\left[-t_{\frac{1+\gamma}{2}} \le \frac{\hat{\eta} - \eta}{\sqrt{h(C)\, S^2}}\le t_{\frac{1+\gamma}{2}}\right] \\ = \mathbf{P}\left[-t_{\frac{1+\gamma}{2}}~\sqrt{h(C)\, S^2} \le \hat{\eta} - \eta\le t_{\frac{1+\gamma}{2}} \sqrt{h(C)\, S^2}\right] \\ = \mathbf{P}\left[-t_{\frac{1+\gamma}{2}}~\sqrt{h(C)\, S^2} - \hat{\eta}\le - \eta\le t_{\frac{1+\gamma}{2}} \sqrt{h(C)\, S^2} - \hat{\eta}\right] \\ = \mathbf{P}\left[\hat{\eta}-t_{\frac{1+\gamma}{2}}~\sqrt{h(C)\, S^2} \le \eta\le \hat{\eta} + t_{\frac{1+\gamma}{2}} \sqrt{h(C)\, S^2} \right]$$
This means my $\gamma$-level confidence interval for $\eta$ is $\displaystyle \left[\hat{\eta}-t_{\frac{1+\gamma}{2}}~\sqrt{h(C)\, S^2}, ~~ \hat{\eta} + t_{\frac{1+\gamma}{2}} \sqrt{h(C)\, S^2} \right]$, where $t_{\frac{1+\gamma}{2}}$ denotes the $\frac{1+\gamma}{2}$ quantile of the $t_{n-k}$ distribution.
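A sketch of computing this interval in practice, on a hypothetical design matrix with simulated data; `scipy.stats.t.ppf` supplies the $t_{n-k}$ quantile:
In [ ]:
# Sketch: gamma-level confidence interval for eta = C^T beta, on a hypothetical
# design matrix and simulated data.
import numpy as np
from scipy import stats
rng = np.random.default_rng(4)
n, k, sigma, gamma = 30, 3, 0.6, 0.95
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -0.5, 2.0])       # hypothetical truth, only used to simulate Y
Y = X @ beta_true + rng.normal(0.0, sigma, size=n)
C = np.array([1.0, 1.0, 0.0])                # eta = beta_1 + beta_2
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
eta_hat = C @ beta_hat
S2 = np.sum((Y - X @ beta_hat)**2) / (n - k)
hC = C @ XtX_inv @ C                         # h(C) = C^T (X^T X)^{-1} C
t_q = stats.t.ppf((1 + gamma) / 2, df=n - k) # t_{(1+gamma)/2} quantile of t_{n-k}
half = t_q * np.sqrt(S2 * hC)
print(eta_hat - half, eta_hat + half)        # gamma-level CI for eta
print(C @ beta_true)                         # true eta, usually inside the interval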
Things which remain to be proved: $\frac{(n-k)S^2}{\sigma^2} \sim \chi^2_{n-k}$ and $\mathbf{Cov}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$.