In [1]:
# Copyright (c) Thalesians Ltd, 2018-2019. All rights reserved
# Copyright (c) Paul Alexander Bilokon, 2018-2019. All rights reserved
# Author: Paul Alexander Bilokon <paul@thalesians.com>
# Version: 2.0 (2019.04.19)
# Previous versions: 1.0 (2018.08.03)
# Email: education@thalesians.com
# Platform: Tested on Windows 10 with Python 3.6

Linear algebra

Motivation

In data science, machine learning (ML), and artificial intelligence (AI), we usually deal not with single numbers but with multivariate (i.e. containing multiple elements or entries) lists of numbers — mathematically speaking, vectors — and multivariate tables of numbers — mathematically speaking, matrices. Therefore we solve multivariate equations, apply multivariate calculus to find optima of multivariate functions, etc.

The branch of mathematics that studies vectors, matrices, and related mathematical objects is called linear algebra. It is one of the most practically useful areas of mathematics in applied work and a prerequisite for data science, machine learning (ML), and artificial intelligence (AI).

Objectives

  • To consider matrices.
  • To introduce the transpose of a matrix.
  • To demonstrate that matrices also form a vector space.
  • To introduce matrix multiplication and consider four different views of matrix multiplication.
  • To introduce the identity matrix.
  • To introduce linear systems.
  • To introduce matrix inverses.
  • To demonstrate that matrices "work" on vectors as linear operators.
  • To introduce the notion of a determinant.
  • To consider homogeneous and nonhomogeneous linear systems in more detail.
  • To consider when a square matrix is invertible and introduce the invertible matrix theorem.
  • To introduce gradients (grad), Jacobians, and Hessians.

In [2]:
%matplotlib inline

In [3]:
import math
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Question: Why are linear systems so important?

Richard Feynman's answer: Finally, we make some remarks on why linear systems are so important. The answer is simple: because we can solve them! (from The Feynman Lectures on Physics, Vol. I: Mainly Mechanics, Radiation, and Heat, 1963).

One of the most useful techniques in all of finance

Regression analysis is one of the most widely used techniques for analyzing multi-factor data. Its broad appeal and usefulness result from the conceptually logical process of using an equation to express the relationship between a variable of interest (the response) and a set of related predictor variables. Regression analysis is also interesting theoretically because of elegant underlying mathematics and a well-developed statistical theory. Successful use of regression requires an appreciation of both the theory and the practical problems that typically arise when the technique is employed with real-world data. \cite{montgomery-2012}

\begin{frame} \frametitle{Linear regression model in matrix form (i)} For $n \in \Nnz$ (the number of \defn{data points}), $p \in \Nnz$ (the number of \defn{features}), $i \in [1, \ldots, n]$, \begin{equation*} y_i = \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i, \end{equation*} or, in matrix form, \begin{equation*} \V{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} where \begin{small} \begin{equation*} \V{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \M{X} = \begin{pmatrix} \transpose{\V{x}_1} \\ \transpose{\V{x}_2} \\ \vdots \\ \transpose{\V{x}_n} \\ \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{pmatrix}, \quad \V{\beta} = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \V{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}. \end{equation*} \end{small} \begin{itemize} \item When $p = 1$ the model is referred to as a \defn{simple} linear regression model, when $p > 1$ it is referred to as a \defn{multivariate} linear regression model. \item $\V{x}_i$ are referred to as the \defn{inputs} (but also \defn{regressors}, \defn{independent variables}, and \defn{explanatory variables}). \item $y_i$ are referred to as the \defn{outputs} (but also \defn{regressands} and \defn{dependent variables}). \item $\epsilon_i$ are referred to as \defn{disturbance} (but also \defn{noise}, \defn{error}) terms. \end{itemize} \end{frame}
\begin{frame} \frametitle{Linear regression model in matrix form (ii)} \begin{itemize} \item In \begin{equation*} \V{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} $\M{X}$ and $\V{y}$ are \nb{given} and \nb{fixed}, perhaps as a table: \begin{center} \begin{tabular}{cccc|c} $x_{11}$ & $x_{12}$ & $\cdots$ & $x_{1p}$ & $y_1$ \\ $x_{21}$ & $x_{22}$ & $\cdots$ & $x_{2p}$ & $y_2$ \\ $\vdots$ & $\vdots$ & $\ddots$ & $\vdots$ & $\vdots$ \\ $x_{n1}$ & $x_{n2}$ & $\cdots$ & $x_{np}$ & $y_n$ \\ \end{tabular} \end{center} \item $\V{\epsilon}$ is a (vector) \nb{random variable}. \item $\V{\beta}$ is unknown. \end{itemize} \end{frame}
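To make the matrix form concrete, here is a minimal NumPy sketch (the values of n, p, beta and the noise scale are illustrative choices, not taken from the text) that builds a design matrix X, draws disturbances, and assembles y = X beta + epsilon.

In [ ]:
# A minimal sketch of the matrix form y = X beta + epsilon.
# The values of n, p, beta and the noise scale are illustrative.
import numpy as np

np.random.seed(42)                        # for reproducibility
n, p = 100, 3                             # number of data points and number of features
X = np.random.randn(n, p)                 # n x p design matrix (the regressors)
beta = np.array([1.5, -2.0, 0.5])         # coefficient vector (unknown in practice)
epsilon = 0.3 * np.random.randn(n)        # zero-mean, homoscedastic, uncorrelated disturbances
y = X @ beta + epsilon                    # the regressands
print(X.shape, beta.shape, y.shape)       # (100, 3) (3,) (100,)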
\begin{frame} \frametitle{Disturbance versus residual} \begin{itemize} \item Notice that, in \begin{equation*} \V{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} $\V{\epsilon}$ is a (vector) random variable modelling the difference between the regressand $\V{y}$ and the deterministic function $\M{X}\V{\beta}$ involving the regressors $\M{X}$. \item Since $\V{y}$ depends on $\V{\epsilon}$, it is itself a random variable. \item But when we are actually \alert{given} $\V{y}$ as well as $\M{X}$ as data, we can then compute the \alert{deterministic} quantity \begin{equation} \hat{\V{\epsilon}} = \V{y} - \M{X}\V{\beta}, \label{eq:residual-in-linear-regression-wrong} \end{equation} which we call the \alert{residual}. \item Confusingly enough, we use the same notation for $\V{y}$ when it is a random variable as for its fixed sample when it is given. \item To recap, the \alert{disturbance} is random, the \alert{residual} is deterministic. \item Equation (\ref{eq:residual-in-linear-regression-wrong}) is all very well, but we don't actually know $\V{\beta}$ in it! Given the data, $\V{y}$ and $\M{X}$, we need to \alert{fit} the linear regression model and find a ``good enough'' estimate for $\V{\beta}$, say $\hat{\V{\beta}}$. The equation (\ref{eq:residual-in-linear-regression-wrong}) is actually wrong; it should be \begin{equation*} \hat{\V{\epsilon}} = \V{y} - \M{X}\hat{\V{\beta}}. \label{eq:residual-in-linear-regression} \end{equation*} \item We find a ``good enough'' $\hat{\V{\beta}}$ by minimising some sensible function of the residual $\hat{\V{\epsilon}}$. We shall start by considering a particular method of finding the ``good enough'' $\hat{\V{\beta}}$ --- the \defn{ordinary least squares (OLS)}. \end{itemize} \end{frame}
\begin{frame} \frametitle{Who invented linear regression? (i)} \begin{columns} \begin{column}{0.3\textwidth} \begin{figure} \centering \includegraphics[width=1.\textwidth]{francis-galton.pdf} \end{figure} \begin{small} A photograph of Sir~Francis~Galton (1822--1911) from~\cite{pearson-1914} \end{small} \end{column} \begin{column}{0.7\textwidth} \begin{itemize} \item The term \defn{``regression''} was introduced in an 1886 paper on genetics, \atitle{Regression towards Mediocrity in Hereditary Stature}~\cite{galton-1886} by \nb{Sir~Francis~Galton}, a Victorian statistician and polymath interested, among many other things, in genetics, evolution, and natural selection. \item Galton used the word in the sense of \defn{``regression towards the mean''} or, put more bluntly, \defn{``regression towards mediocrity''}. \item His inquiry was prompted by his earlier interest in hereditary genius (Galton was himself a child prodigy): \begin{quote} If a man breeds from strong, well-shaped dogs, but of mixed pedigree, the puppies will be sometimes, but rarely, the equals of their parents. They will commonly be of a mongrel, nondescript type, because \nb{ancestral peculiarities are apt to crop out in the offspring}.~\cite{galton-1869} \end{quote} \end{itemize} \end{column} \end{columns} \end{frame}
\begin{frame} \frametitle{Who invented linear regression? (ii)} Galton was a pioneer in statistics. He computed correlations to make sense of empirical data and made wide use of what would later become known as \defn{regression lines}~\cite{bulmer-2003}. Here is his Plate~X from~\cite{galton-1886}. \begin{center} \begin{figure} \centering \includegraphics[width=.7\textwidth]{galton-1875.pdf} \end{figure} \end{center} \end{frame}
\begin{frame} \frametitle{The classical linear regression model} In the \alert{classical} linear regression model (CLRM), the following assumptions are made: \begin{itemize} \item for all $i \in [1, \ldots, n]$, $\epsilon_i$ have a mean of zero, $\E[\epsilon_i] = 0$; \item for all $i \in [1, \ldots, n]$, $\epsilon_i$ have the same finite variance, $\Var[\epsilon_i] = \sigma^2 < \infty$, in other words, the random variables $\epsilon_i$ are \alert{homoscedastic}; \item for all $i, j \in [1, \ldots, n]$, $i \neq j$, the $\epsilon_i$ are uncorrelated, $\Cov[\epsilon_i, \epsilon_j] = 0$; \item the regressors are \alert{non-random} and \alert{observable}; \item the regressors are not perfectly linearly correlated with each other. \end{itemize} \end{frame}
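The first three assumptions can be illustrated empirically on simulated disturbances; the sketch below (with an illustrative sigma) just inspects the sample mean, the sample variance, and the correlation of adjacent draws.

In [ ]:
# Sketch: sample moments of simulated i.i.d. disturbances (sigma is an illustrative choice).
import numpy as np

np.random.seed(0)
sigma = 0.5
eps = sigma * np.random.randn(100_000)        # i.i.d. draws standing in for epsilon_1, ..., epsilon_n

print(eps.mean())                             # close to 0     (E[epsilon_i] = 0)
print(eps.var())                              # close to 0.25  (Var[epsilon_i] = sigma^2)
print(np.corrcoef(eps[:-1], eps[1:])[0, 1])   # close to 0: a spot check of Cov[epsilon_i, epsilon_j] = 0 for adjacent pairs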

\section{Ordinary least squares}

\begin{frame} \frametitle{The ordinary least squares (OLS) approach} \begin{itemize} \item The idea is to find such a $\V{\beta}$ in \begin{equation*} \V{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} --- we'll refer to it as $\hat{\V{\beta}}$ --- that will give us the \defn{estimate} of $\V{y}$, \begin{equation*} \hat{\V{y}} \defeq \M{X}\hat{\V{\beta}}, \end{equation*} \item ...that will minimize a certain \nb{loss function} of the \defn{residual} \begin{equation*} \hat{\V{\epsilon}} \defeq \V{y} - \hat{\V{y}}, \end{equation*} \item ...namely the squared error, which we'll now call the \defn{residual sum of squares (RSS)}, \begin{equation*} \text{RSS} = \transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}} = \transpose{\left(\V{y} - \M{X}\hat{\V{\beta}}\right)} \left(\V{y} - \M{X}\hat{\V{\beta}}\right). \end{equation*} \end{itemize} \end{frame}
\begin{frame} \frametitle{Derivation of OLS} \begin{theorem} \begin{equation*} \hat{\V{\beta}} = \left(\transpose{\M{X}}\M{X}\right)^{-1}\transpose{\M{X}}\V{y}. \end{equation*} \begin{proof} \begin{equation*} \text{RSS} = \transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}} = \transpose{\left(\V{y} - \M{X}\V{\beta}\right)} \left(\V{y} - \M{X}\V{\beta}\right) = \transpose{\V{y}}\V{y} - 2\transpose{\V{y}}\M{X}\V{\beta} + \transpose{\V{\beta}}\transpose{\M{X}}\M{X}\V{\beta}. \end{equation*} Using the denominator notation for vector-by-vector and scalar-by-vector derivatives, \begin{equation*} \nabla_{\V{\beta}}\text{RSS} = \frac{\partial\text{RSS}}{\partial\V{\beta}} = -2\transpose{\M{X}}\V{y} + 2\transpose{\M{X}}\M{X}\V{\beta}. \end{equation*} When $\nabla_{\V{\beta}}\text{RSS} = 0$, $\transpose{\M{X}}\M{X}\V{\beta} = \transpose{\M{X}}\V{y}$ (this is the so-called \alert{normal equation}), hence $\V{\beta} = \left(\transpose{\M{X}}\M{X}\right)^{-1}\transpose{\M{X}}\V{y}$. To check that this critical point is a minimum, we examine the Hessian matrix \begin{equation*} \frac{\partial}{\partial{\V{\beta}}}\left(\frac{\partial\text{RSS}}{\partial\V{\beta}}\right) = 2\transpose{\M{X}}\M{X}. \end{equation*} If $\M{X}$ is full rank (i.e. has rank $p+1$ if we allow the intercept and $p$ if we don't; this implies that $\M{X}$ is skinny or square), then $\transpose{\M{X}}\M{X}$ is positive definite and the critical point is indeed a minimum. \end{proof} \end{theorem} \end{frame}
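A hedged NumPy sketch of the closed form just derived: it solves the normal equations directly (np.linalg.solve is used rather than forming the explicit inverse) and cross-checks the answer against np.linalg.lstsq. The simulated data and all variable names are illustrative.

In [ ]:
# Sketch: OLS via the normal equations, cross-checked against np.linalg.lstsq.
import numpy as np

np.random.seed(42)
n, p = 100, 3
X = np.random.randn(n, p)
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.3 * np.random.randn(n)

# beta_hat solves the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))      # True

residual = y - X @ beta_hat
print(residual @ residual)                    # the residual sum of squares (RSS)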

In [ ]:
\begin{frame}
\frametitle{The OLS estimator is unbiased}
\begin{align*}
\E[\hat{\V{\beta}} \given \M{X}] &= \E[\left(\transpose{\M{X}}\M{X}\right)^{-1}\transpose{\M{X}}\V{y} \given \M{X}] \\
&= \left(\transpose{\M{X}}\M{X}\right)^{-1}\transpose{\M{X}} \E[\V{y} \given \M{X}] \\
&= \left(\transpose{\M{X}}\M{X}\right)^{-1}\transpose{\M{X}}\M{X}\V{\beta} \\
&= \V{\beta}.
\end{align*}
\end{frame}
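The conditional unbiasedness can be illustrated by Monte Carlo: keep X fixed, redraw the disturbances many times, and average the OLS estimates. A sketch with illustrative sizes follows; the average should be close to the true beta, up to simulation error.

In [ ]:
# Sketch: Monte Carlo illustration that E[beta_hat | X] = beta.
import numpy as np

np.random.seed(1)
n, p, sigma = 50, 2, 0.5
X = np.random.randn(n, p)                    # fixed, non-random regressors
beta = np.array([2.0, -1.0])

n_rep = 10_000
estimates = np.empty((n_rep, p))
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T, reused across replications
for k in range(n_rep):
    y = X @ beta + sigma * np.random.randn(n)    # fresh disturbances each replication
    estimates[k] = XtX_inv_Xt @ y

print(estimates.mean(axis=0))                # close to [ 2., -1.]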

In [ ]:
\begin{frame}
\frametitle{Relationship with maximum likelihood estimation (i)}
\begin{itemize}
\item Now suppose that $\epsilon_1, \ldots, \epsilon_n \overset{\text{i.i.d.}}{\sim} \Normal{0}{\sigma^2}$. Then, taking $\V{x}_1, \ldots, \V{x}_n$ as data, the observations have density functions $\normalpdf[y_i]{\transpose{\V{x}_i}\V{\beta}}{\sigma^2}$, and the likelihood function of $\V{\beta}$ and $\sigma^2$, based on the sample, is
    \begin{equation*}
    \mathcal{L} = \prod_{i=1}^n \normalpdf[y_i]{\transpose{\V{x}_i}\V{\beta}}{\sigma^2} = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \transpose{\left(\V{y} - \M{X}\V{\beta}\right)} \left(\V{y} - \M{X}\V{\beta}\right) \right\}.
    \end{equation*}
\item Taking logarithms we obtain the log-likelihood
    \begin{align*}
    \ln\left(\mathcal{L}\right) &= -\frac{n}{2}\ln\left(2\pi\right) - \frac{n}{2}\ln\left(\sigma^2\right) - \frac{1}{2\sigma^2}\transpose{\left(\V{y} - \M{X}\V{\beta}\right)} \left(\V{y} - \M{X}\V{\beta}\right) \\
    &= -\frac{n}{2}\ln\left(2\pi\right) - \frac{n}{2}\ln\left(\sigma^2\right) - \frac{1}{2\sigma^2}\text{RSS}.
    \end{align*}
\end{itemize}
\end{frame}
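Since the only term of the log-likelihood that depends on beta is the residual sum of squares term, maximising over beta amounts to minimising RSS, i.e. to the OLS problem. The sketch below (simulated data, illustrative names) evaluates the log-likelihood formula and checks that it is larger at the OLS estimate than at a perturbed beta.

In [ ]:
# Sketch: the Gaussian log-likelihood of the linear model, evaluated on simulated data.
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    # ln L = -n/2 ln(2 pi) - n/2 ln(sigma^2) - RSS / (2 sigma^2)
    n = len(y)
    rss = (y - X @ beta) @ (y - X @ beta)
    return -0.5 * n * np.log(2.0 * np.pi) - 0.5 * n * np.log(sigma2) - rss / (2.0 * sigma2)

np.random.seed(42)
n, p = 100, 3
X = np.random.randn(n, p)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * np.random.randn(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # the OLS estimate
sigma2 = 0.09                                      # the variance used in the simulation
print(log_likelihood(beta_hat, sigma2, X, y) > log_likelihood(beta_hat + 0.1, sigma2, X, y))  # True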

In [ ]:
\begin{frame}
\frametitle{Relationship with maximum likelihood estimation (ii)}
To find the critical points, we compute the first partial derivatives
\begin{itemize}
\item $\frac{\partial\ln\left(\mathcal{L}\right)}{\partial\V{\beta}} = -\frac{1}{2\sigma^2}\frac{\partial\text{RSS}}{\partial\V{\beta}} = \frac{1}{\sigma^2} \left( \transpose{\M{X}}\V{y} - \transpose{\M{X}}\M{X}\V{\beta} \right)$,
\item $\frac{\partial\ln\left(\mathcal{L}\right)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\text{RSS}$.
\end{itemize}
\end{frame}
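These two partial derivatives can be sanity-checked numerically. At the OLS estimate of beta, and at sigma^2 = RSS/n (the value at which the sigma^2 derivative above vanishes), both components of the score should be zero up to floating-point error. A sketch on simulated data:

In [ ]:
# Sketch: both partial derivatives of ln L vanish at beta = beta_hat, sigma^2 = RSS/n.
import numpy as np

np.random.seed(42)
n, p = 100, 3
X = np.random.randn(n, p)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 * np.random.randn(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = (y - X @ beta_hat) @ (y - X @ beta_hat)
sigma2_hat = rss / n                               # the value at which d ln L / d sigma^2 = 0

score_beta = (X.T @ y - X.T @ X @ beta_hat) / sigma2_hat           # d ln L / d beta
score_sigma2 = -n / (2 * sigma2_hat) + rss / (2 * sigma2_hat**2)   # d ln L / d sigma^2
print(np.allclose(score_beta, 0.0), np.isclose(score_sigma2, 0.0)) # True True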

In [ ]:
\begin{frame}
\frametitle{Relationship with maximum likelihood estimation (iii)}
\begin{itemize}
\item $\frac{\partial}{\partial\V{\beta}}\left(\frac{\partial\ln\left(\mathcal{L}\right)}{\partial\V{\beta}}\right) = -\frac{1}{2\sigma^2} \frac{\partial}{\partial{\V{\beta}}}\left(\frac{\partial\text{RSS}}{\partial\V{\beta}}\right) = -\frac{1}{\sigma^2} \transpose{\M{X}}\M{X}$,
\item $\frac{\partial}{\partial\sigma^2}\left(\frac{\partial\ln\left(\mathcal{L}\right)}{\partial\sigma^2}\right) = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\text{RSS}$,
\item $\frac{\partial}{\partial\sigma^2}\left(\frac{\partial\ln\left(\mathcal{L}\right)}{\partial\V{\beta}}\right) = \frac{\partial}{\partial\sigma^2}\left(-\frac{1}{2\sigma^2} \frac{\partial\text{RSS}}{\partial\V{\beta}}\right) = \frac{1}{2\sigma^4}\frac{\partial\text{RSS}}{\partial\V{\beta}} = \frac{1}{\sigma^4}\left(-\transpose{\M{X}}\V{y} + \transpose{\M{X}}\M{X}\V{\beta}\right)$.
\item ...
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Who invented OLS? Was it Gau{\ss}?}
\begin{columns}
    \begin{column}{0.3\textwidth}
    \begin{figure}
    \centering
    \includegraphics[width=1.\textwidth]{carl-friedrich-gauss.pdf}
    \end{figure}
    \begin{small}
    Portrait of Carl~Friedrich~Gau{\ss} (1777--1855) by Christian~Albrecht~Jensen
    \end{small}
    \end{column}
    \begin{column}{0.7\textwidth}
    \begin{itemize}
    \item The invention of the OLS method was claimed by both \nb{Carl~Friedrich~Gau{\ss}} and \nb{Adrien-Marie Legendre}, which led to the ``most famous priority dispute in the history of statistics''~\cite{stigler-1981}.
    \item Gau{\ss} did not deny that Legendre published the method first (in~\nb{1805}), but he claimed to have been in possession of the method since~\nb{1795}.
    \item Both Gau{\ss} and Legendre discovered it when calculating the \nb{orbits of celestial bodies}. They had to make optimal use of an excess of information: they had more observations than they needed --- the system was \nb{overdetermined} --- but the observations were \nb{noisy}.
    \item In his work, Gau{\ss} also connected the method with ideas from \nb{probability} and \nb{normal distribution}.
    \end{itemize}
    \end{column}
\end{columns}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Or was it Legendre?}
\begin{columns}
    \begin{column}{0.3\textwidth}
    \begin{figure}
    \centering
    \includegraphics[width=1.\textwidth]{adrien-marie-legendre.pdf}
    \end{figure}
    \begin{small}
    1820 watercolour caricature of Adrien-Marie Legendre (1752--1833) by Julien-Leopold Boilly, the only known existing portrait of Legendre
    \end{small}
    \end{column}
    \begin{column}{0.7\textwidth}
    In 1805, Legendre, in a work dedicated to the determination of the orbits of comets, published~\cite{legendre-1805} the first clear explanation of the method of least squares with a worked example:
    \begin{small}
    \begin{quote}
    De tous les principes qu'on peut proposer pour cet objet, je pense qu'il n'en est pas de plus g{\'e}n{\'e}ral, de plus exact, ni d'une application plus facile, que celui dont nous avons fait usage dans les recherches pr{\'e}c{\'e}dentes, et qui consiste {\`a} rendre \nb{minimum la somme des carr{\'e}s des erreurs}. Par ce moyen il s'{\'e}tablit entre les erreurs une sorte d'{\'e}quilibre qui, \nb{emp{\^e}chant les extr{\^e}mes de pr{\'e}valoir}, est tr{\`e}s propre {\`a} faire conna{\^i}tre \nb{l'{\'e}tat du syst{\`e}me le plus proche de la v{\'e}rit{\'e}}.
    \end{quote}
    \end{small}
    Or, in English,
    \begin{small}
    \begin{quote}
    Of all the principles which may be proposed for this purpose, I think that there is none more general, more exact, or easier to apply than that which we have used in the preceding research, and which consists in making the \nb{sum of the squares of the errors a minimum}. By this means, a kind of equilibrium is established among the errors which, \nb{preventing the extremes from prevailing}, is very suitable for making known the \nb{state of the system closest to the truth}.
    \end{quote}
    \end{small}
    \end{column}
\end{columns}
\end{frame}
\begin{frame} \frametitle{The answer is...} ...\nb{yes}, it was Gau{\ss} \nb{or} Legendre. In~\cite{stigler-1981}, \begin{quote} It is argued (though not conclusively) that Gauss probably possessed the method well before Legendre, but that he was unsuccessful in communicating it to his contemporaries. \end{quote} \end{frame}

\section{The geometry of linear regression}

\begin{frame} \frametitle{A word to Sir Isaac Newton} \begin{columns} \begin{column}{0.3\textwidth} \begin{figure} \centering \includegraphics[width=1.\textwidth]{isaac-newton.pdf} \end{figure} \begin{small} Portrait of Sir~Isaac~Newton at~46 (1642--1726/7) by Godfrey~Kneller \end{small} \end{column} \begin{column}{0.7\textwidth} In~\atitle{Rules for methodizing the Apocalypse}: \begin{quote} Truth is ever to be found in \nb{simplicity}, and not in the multiplicity and confusion of things. \end{quote} So let us look at a simple, low-dimensional example. (We would like to thank \nb{Douglas Machado Vieira} for this suggestion.) \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{Let us understand the geometry of this linear regression problem} \begin{itemize} \item We are given the linear model \begin{equation*} \M{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} where \begin{equation*} \M{X} = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}, \V{y} = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \V{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix}. \end{equation*} \item We need to estimate $\V{\beta}$. \item Here $p = 1$ (we have a single feature) and $n = 3$ (we have three data points). Our dataset consists of three data points, three $(x_i, y_i)$ pairs: $(1, 2)$, $(3, 1)$, $(4, 2)$. Because $p = 1$, all our $x_i$ are univariate: 1, 3, and 4. \end{itemize} \end{frame}\begin{frame} \frametitle{What are we minimising? (i)} \begin{center} We are fitting the regression line (shown in \nb{red}) to minimize... \\ ...the sum of squares of the lengths of these \nb{vertical} dashed lines? \end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/0.5/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzDefPoints{3/1/B, 3/1.5/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} Given the linear model \begin{equation*} \M{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} where \begin{equation*} \M{X} = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}, \V{y} = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \V{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix}, \end{equation*} estimate $\V{\beta}$. \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{What are we minimising? (ii)} \begin{center} We are fitting the regression line (shown in \nb{red}) to minimize... \\ ...the sum of squares of the lengths of these \nb{orthogonal} (to the fitted line) dashed lines? 
\end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1.6/0.8/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzDefPoints{3/1/B, 2.8/1.4/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} Given the linear model \begin{equation*} \M{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} where \begin{equation*} \M{X} = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}, \V{y} = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \V{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix}, \end{equation*} estimate $\V{\beta}$. \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{Ordinary least squares (OLS)} \begin{align*} \hat{\V{\beta}} &= (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{y} \\ &= (1^2 + 3^2 + 4^2)^{-1} (1 \cdot 2 + 3 \cdot 1 + 4 \cdot 2) \\ &= \frac{13}{26} = \frac{1}{2}. \end{align*} \begin{align*} \hat{\V{y}} &= \M{X}\hat{\V{\beta}} \\ &= \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix} \cdot \frac{1}{2} \\ &= \begin{pmatrix} \sfrac{1}{2} \\ \sfrac{3}{2} \\ 2 \end{pmatrix}. \end{align*} \end{frame}\begin{frame} \frametitle{So indeed we are minimising...} \begin{center} We are fitting the regression line (shown in \nb{red}) to minimize... \\ ...the sum of squares of the \nb{vertical} dashed lines, $\V{y} - \hat{\V{y}} = \V{y} - \M{X}\hat{\V{\beta}}$. \end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/0.5/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,\frac{1}{2})$} \tkzDefPoints{3/1/B, 3/1.5/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,\frac{3}{2})$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} Given the linear model \begin{equation*} \M{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} where \begin{equation*} \M{X} = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}, \V{y} = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix}, \V{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{pmatrix}, \end{equation*} estimate $\V{\beta}$. 
\end{column} \end{columns} \end{frame}\begin{frame} \frametitle{Moore--Penrose pseudoinverse} Let's have a look at our $\hat{\V{\beta}}$: \begin{equation*} \hat{\V{\beta}} = \underbrace{(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}}_{\M{X}^+}\V{y} \end{equation*} \begin{itemize} \item $\M{X}^+$ is the \defn{Moore--Penrose pseudoinverse} matrix of $\M{X}$. \item In our example, \begin{align*} \M{X}^+ &= (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} \\ &= (1^2 + 3^2 + 4^2)^{-1} \begin{pmatrix} 1 & 3 & 4 \end{pmatrix} \\ &= \begin{pmatrix} \frac{1}{26} & \frac{3}{26} & \frac{4}{26} \end{pmatrix}. \end{align*} \item One of its properties is $(\M{X}\M{X}^+)\M{X} = \M{X}$; in other words, $\M{X}\M{X}^+$ maps all column vectors of $\M{X}$ to themselves, while $\M{X}\M{X}^+$ need not be the identity matrix. \item In our example, \begin{equation*} (\M{X}\M{X}^+)\M{X} = \left[ \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix} \begin{pmatrix} \frac{1}{26} & \frac{3}{26} & \frac{4}{26} \end{pmatrix} \right] \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix} = \underbrace{\begin{pmatrix} \sfrac{1}{26} & \sfrac{3}{26} & \sfrac{4}{26} \\ \sfrac{3}{26} & \sfrac{9}{26} & \sfrac{12}{26} \\ \sfrac{4}{26} & \sfrac{12}{26} & \sfrac{16}{26} \end{pmatrix}}_{\text{not an identity matrix, but...}} \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}. \end{equation*} \end{itemize} \end{frame}\begin{frame} \frametitle{The hat matrix} Let's have a look at our $\hat{\V{y}}$: \begin{align*} \hat{\V{y}} &= \M{X}\hat{\V{\beta}} \\ &= \underbrace{\M{X}\underbrace{(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}}_{\M{X}^+}}_{\M{H}}\V{y} \end{align*} \begin{itemize} \item The matrix $\M{H} \defeq \M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}$ is known as the \defn{hat matrix} --- because it takes $\V{y}$ and puts a hat on it, $\hat{\V{y}}$. \item In our example, \begin{equation*} \M{H} = \M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} = \M{X}\M{X}^+ = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix} \begin{pmatrix} \frac{1}{26} & \frac{3}{26} & \frac{4}{26} \end{pmatrix} = \begin{pmatrix} \sfrac{1}{26} & \sfrac{3}{26} & \sfrac{4}{26} \\ \sfrac{3}{26} & \sfrac{9}{26} & \sfrac{12}{26} \\ \sfrac{4}{26} & \sfrac{12}{26} & \sfrac{16}{26} \end{pmatrix}. \end{equation*} \item But this is just the matrix from the previous slide that pretended to be an identity matrix (although only for $\V{X}$)... \end{itemize} \end{frame}\begin{frame} \frametitle{Orthogonal projection matrices} \begin{itemize} \item We recognize $\M{H}$ as one of the so-called \defn{orthogonal projection matrices}. \item Let's check that it has the two properties of orthogonal projection matrices: \begin{itemize} \item it is \defn{symmetric}: \begin{align*} \transpose{\M{H}} &= \transpose{(\M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}})} \\ &= \transpose{(\transpose{\M{X}})}\transpose{[(\transpose{\M{X}}\M{X})^{-1}]}\transpose{\M{X}} \\ &= \M{X}[\transpose{(\transpose{\M{X}}\M{X})}]^{-1}\transpose{\M{X}} \\ &= \M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} \\ &= \M{H}; \end{align*} \item it is \defn{idempotent}: \begin{align*} \M{H}^2 &= (\M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}) (\M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}) \\ &= \M{X}[(\transpose{\M{X}}\M{X})^{-1}(\transpose{\M{X}} \M{X})](\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} \\ &= \M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} \\ &= \M{H}. 
\end{align*} \end{itemize} \item \nb{STOP!} Did we not decide that we were minimising the sum of squares of the lengths of the \nb{vertical} dashed lines and not of the \nb{orthogonal} dashed lines? \item \nb{Where is the orthogonal projection?} \end{itemize} \end{frame}\begin{frame} \frametitle{The mystery of the missing orthogonal projection (i)} \begin{center} We are fitting the regression line (shown in \nb{red} on the left plot) to minimize... \\ ...the sum of squares of the \nb{vertical} dashed lines, $\V{y} - \hat{\V{y}} = \V{y} - \M{X}\hat{\V{\beta}}$. \end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/0.5/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,\frac{1}{2})$} \tkzDefPoints{3/1/B, 3/1.5/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,\frac{3}{2})$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; \coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (0.5,1.5,2){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; 
\draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[below right] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat); \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \end{tikzpicture} \end{column} \end{columns} \begin{center} We find our (OLS) solution by orthogonally projecting $\V{y}$ (the \textcolor{green}{green} line) onto the column space of $\M{X}$ (the \textcolor{blue}{blue} line or hyperplane) --- see the \nb{dashed red line} on the right plot. \end{center} \end{frame}\begin{frame} \frametitle{The mystery of the missing orthogonal projection (ii)} \begin{itemize} \item We spotted the orthogonal projection by \nb{changing our point of view}. \item The left-hand plot represents the so-called \defn{variable space} view, $\R^p$, view of the linear regression: here each data point is indeed represented by a point and the dimensions represent the features. Here we can see the regression line. \item The right-hand plot represents the so-called \defn{subject space} view, $\R^n$. Here $y$ and $\hat{y}$ appear as vectors represented by lines and the dimensions represent the data points. \item The orthogonal projection was \nb{hidden} on the variable space view, but became \nb{revealed} on the subject space view. \item This in itself is a useful lesson in data science. \end{itemize} \end{frame}\begin{frame} \frametitle{The importance of point of view in data science} \begin{columns} \begin{column}{0.3\textwidth} \begin{figure} \centering \includegraphics[width=1.\textwidth]{obi-wan-kenobi.pdf} \end{figure} \end{column} \begin{column}{0.7\textwidth} You're going to find that many of the truths we cling to depend greatly on our own point of view. \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{The mystery of the missing orthogonal projection (iii)} \begin{itemize} \item The hat matrix, $\M{H}$ is an \defn{orthogonal projection matrix} that projects $\V{y}$ onto the \defn{column space} of $\M{X}$ --- the space spanned by the column vectors of $\M{X}$. \item The result, $\hat{\V{y}}$, is a \defn{linear combination} of the columns of $\M{X}$. \item The \defn{weights} in this linear combination are given by the elements of $\hat{\V{\beta}}$. \item In our example, $p = 1$, so $\M{X}$ has a single column. Thus the result, $\hat{\V{y}}$, is a scalar multiple of that column, the scalar being our $\hat{\beta} = \frac{1}{2}$: \begin{equation*} \hat{\V{y}} = \M{H}\V{y} = \begin{pmatrix} \sfrac{1}{26} & \sfrac{3}{26} & \sfrac{4}{26} \\ \sfrac{3}{26} & \sfrac{9}{26} & \sfrac{12}{26} \\ \sfrac{4}{26} & \sfrac{12}{26} & \sfrac{16}{26} \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix} = \begin{pmatrix} \sfrac{1}{2} \\ \sfrac{3}{2} \\ 2 \end{pmatrix} = \underbrace{\frac{1}{2}}_{\hat{\beta}} \underbrace{\begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}}_{\text{the single column of $\M{X}$}}. \end{equation*} \end{itemize} \end{frame}\begin{frame} \frametitle{Would we actually draw this line?} If we were drawing a line of best fit, would we actually draw it through $(x_1, \hat{y}_1) = (1, \sfrac{1}{2})$, $(x_2, \hat{y}_2) = (3, \sfrac{3}{2})$, $(x_3, \hat{y}_3) = (4, 2)$? Or would we draw the line on the right? 
\begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/0.5/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,\frac{1}{2})$} \tkzDefPoints{3/1/B, 3/1.5/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,\frac{3}{2})$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/1.7857/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){} \tkzDefPoints{3/1/B, 3/1.6428/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){} \tkzDefPoints{4/2/C, 4/1.5714/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){} \tkzDefPoints{0/1.8571/J, 5/1.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{The $\V{y}$-intercept} \begin{itemize} \item If we didn't have to start our line of best fit at the origin, we would draw the line on the right. \item Fortunately, it is easy to introduce the \defn{($\V{y}$-)~intercept} into our linear model without changing much of its structure: \begin{itemize} \item prepend a column of ones to $\M{X}$, so it becomes the $n \times (p+1)$-dimensional matrix \begin{equation*} \M{X} = \begin{pmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}; \end{equation*} \item prepend $\beta_0$ to the column vector $\hat{\V{\beta}}$, so it becomes the $p+1$-dimensional column vector \begin{equation*} \hat{\V{\beta}} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}. \end{equation*} \end{itemize} \item And then proceed as before... 
\end{itemize} \end{frame}\begin{frame} \frametitle{Linear regression model with $\V{y}$-intercept in matrix form} For $n \in \Nnz$, $p \in \Nnz$, $i \in [1, \ldots, n]$, \begin{equation*} y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i, \end{equation*} or, in matrix form, \begin{equation*} \V{y} = \M{X}\V{\beta} + \V{\epsilon}, \end{equation*} where \begin{small} \begin{equation*} \V{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \M{X} = \begin{pmatrix} 1 & \transpose{\V{x}_1} \\ 1 & \transpose{\V{x}_2} \\ \vdots & \vdots \\ 1 & \transpose{\V{x}_n} \\ \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{pmatrix}, \quad \V{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \V{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}. \end{equation*} \end{small} \end{frame}\begin{frame} \frametitle{An OLS solution with the $\V{y}$-intercept permitted} \begin{itemize} \item The Moore--Penrose pseudoinverse matrix of $\M{X}$ is given by \begin{align*} \M{X}^+ = (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} &= \left[ \begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \right]^{-1} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 4 \end{pmatrix} \\ &= \begin{pmatrix} 3 & 8 \\ 8 & 26 \end{pmatrix}^{-1} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 4 \end{pmatrix} \\ &= \frac{1}{14} \begin{pmatrix} 26 & -8 \\ -8 & 3 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 3 & 4 \end{pmatrix} \\ &= \frac{1}{14} \begin{pmatrix} 18 & 2 & -6 \\ -5 & 1 & 4 \end{pmatrix}. \end{align*} \item Hence \begin{equation*} \hat{\V{\beta}} = \M{X}^+ \V{y} = \frac{1}{14} \begin{pmatrix} 18 & 2 & -6 \\ -5 & 1 & 4 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix} = \begin{pmatrix} \sfrac{13}{7} \\ -\sfrac{1}{14} \end{pmatrix} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \end{equation*} \item and \begin{equation*} \hat{\V{y}} = \M{X}\hat{\V{\beta}} = \begin{pmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} \sfrac{13}{7} \\ -\sfrac{1}{14} \end{pmatrix} = \begin{pmatrix} \sfrac{25}{14} \\ \sfrac{23}{14} \\ \sfrac{11}{7} \end{pmatrix} = \hat{\beta}_0 \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \hat{\beta}_1 \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}. 
\end{equation*} \end{itemize} \end{frame}\begin{frame} \frametitle{The $\M{X}$-$\V{y}$ plot} \begin{columns} \begin{column}{0.5\textwidth} Without the intercept: \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/0.5/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,\frac{1}{2})$} \tkzDefPoints{3/1/B, 3/1.5/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,\frac{3}{2})$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} With the intercept: \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/1.7857/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,1\sfrac{11}{14})$} \tkzDefPoints{3/1/B, 3/1.6428/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,1\sfrac{9}{14})$} \tkzDefPoints{4/2/C, 4/1.5714/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,1\sfrac{4}{7})$} \tkzDefPoints{0/1.8571/J, 5/1.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{The orthogonal projection without the intercept} \begin{center} We are fitting the regression line (shown in \nb{red} on the left plot) to minimize... \\ ...the sum of squares of the \nb{vertical} dashed lines, $\V{y} - \hat{\V{y}} = \V{y} - \M{X}\hat{\V{\beta}}$. 
\end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/0.5/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,\frac{1}{2})$} \tkzDefPoints{3/1/B, 3/1.5/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,\frac{3}{2})$} \tkzDefPoints{4/2/C, 4/2/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,2)$} \tkzDefPoints{0/0/J, 5/2.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick, rotate around x=15, rotate around z=-3] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; \coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (0.5,1.5,2){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; \draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[below right] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat); \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \end{tikzpicture} \end{column} \end{columns} \begin{center} We find our (OLS) solution by orthogonally projecting $\V{y}$ (the \textcolor{green}{green} line) onto the column space of $\M{X}$ (the \textcolor{blue}{blue} line or hyperplane) --- see the \nb{dashed red line} on the right plot. 
\end{center} \end{frame}\begin{frame} \frametitle{The orthogonal projection with the intercept} \begin{center} We are fitting the regression line (shown in \nb{red} on the left plot) to minimize... \\ ...the sum of squares of the \nb{vertical} dashed lines, $\V{y} - \hat{\V{y}} = \V{y} - \M{X}\hat{\V{\beta}}$. \end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-1,ymin=-1,xmax=5,ymax=5] \tkzAxeXY \tkzGrid \tkzDefPoints{1/2/A, 1/1.7857/A1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](A,A1) \tkzDrawPoints[size=10,color=red](A, A1) \tkzLabelPoint[above,color=blue](A){$(1,2)$} \tkzLabelPoint[below,color=red](A1){$(1,1\sfrac{11}{14})$} \tkzDefPoints{3/1/B, 3/1.6428/B1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](B,B1) \tkzDrawPoints[size=10,color=red](B, B1) \tkzLabelPoint[above right,color=blue](B){$(3,1)$} \tkzLabelPoint[above left,color=red](B1){$(3,1\sfrac{9}{14})$} \tkzDefPoints{4/2/C, 4/1.5714/C1} \tkzDrawSegments[color=blue, ultra thick, densely dashed](C,C1) \tkzDrawPoints[size=10,color=red](C, C1) \tkzLabelPoint[above left,color=blue](C){$(4,2)$} \tkzLabelPoint[below right,color=red](C1){$(4,1\sfrac{4}{7})$} \tkzDefPoints{0/1.8571/J, 5/1.5/K} \tkzDrawSegments[color=red, ultra thick](J,K) \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick, rotate around x=15, rotate around z=-3] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; \coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (X0) at (1,1,1){}; \coordinate (X01) at (1,0,0){}; \coordinate (X02) at (0,1,0){}; \coordinate (X03) at (0,0,1){}; \coordinate (X012) at (1,1,0){}; \coordinate (X013) at (1,0,1){}; \coordinate (X023) at (0,1,1){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (1.7857,1.6428,1.5714){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}_{:,1}$}; \draw[color=blue, thick] (O)--(X0); \draw[color=blue, dashed, thin] (X01)--(X012); \draw[color=blue, dashed, thin] (X01)--(X013); \draw[color=blue, dashed, thin] (X02)--(X012); \draw[color=blue, dashed, thin] (X02)--(X023); \draw[color=blue, dashed, thin] (X03)--(X013); \draw[color=blue, dashed, thin] (X03)--(X023); \draw[color=blue, dashed, thin] (X012)--(X0); \draw[color=blue, dashed, thin] (X013)--(X0); \draw[color=blue, dashed, thin] (X023)--(X0); 
\draw[fill=black] (X0) circle (0.2em) node[below right] {$\V{X}_{:,0}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; \draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[above left] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat); \coordinate (aa) at (3,1,0){}; \filldraw[color=blue, fill opacity=0.3, dashed, thin] (O)--(aa)--(X); \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \end{tikzpicture} \end{column} \end{columns} \begin{center} We find our (OLS) solution by orthogonally projecting $\V{y}$ (the \textcolor{green}{green} line) onto the column space of $\M{X}$ (the \textcolor{blue}{blue} line or hyperplane) --- see the \nb{dashed red line} on the right plot. \end{center} \end{frame}\begin{frame} \frametitle{The hat matrix now projects onto a plane, not onto a line} \begin{itemize} \item When we allow the intercept, the hat matrix becomes \begin{align*} \M{H} &= \M{X}(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} \\ &= \M{X}\M{X}^+ \\ &= \begin{pmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} \left[ \frac{1}{14} \begin{pmatrix} 18 & 2 & -6 \\ -5 & 1 & 4 \end{pmatrix} \right] \\ &= \frac{1}{14} \begin{pmatrix} 13 & 3 & -2 \\ 3 & 5 & 6 \\ -2 & 6 & 10 \end{pmatrix}. \end{align*} \item As before, it projects $\V{y}$ onto the column space of $\M{X}$, the resulting projection being $\hat{\V{y}}$. \item But now $\M{X}$ has two (linearly independent) columns, not one, so the projection is onto a plane, not onto a line. If we increase $p$ and add more columns to $\M{X}$, if these columns are linearly independent, we'll further increase the dimension of the column space of $\M{X}$ onto which we project. \end{itemize} \end{frame}\begin{frame} \frametitle{The equation of a plane} \begin{itemize} \item Recall that the equation of a plane orthogonal to a vector $\V{a}$ and containing a point $\V{x}_0$ is given by $\transpose{\V{a}} (\V{x} - \V{x}_0) = 0$. (We are simply stating a fact: $\V{a}$ and $\V{x} - \V{x}_0$ (the latter lies on the plane) are orthogonal.) \item In our case, \begin{equation*} \hat{\V{\epsilon}} = \V{y} - \hat{\V{y}} = \begin{pmatrix} \sfrac{3}{14} \\ -\sfrac{9}{14} \\ \sfrac{3}{7} \end{pmatrix} \end{equation*} is orthogonal to the plane that contains the column vectors of $\M{X}$ --- $\transpose{\begin{pmatrix} 1 & 1 & 1 \end{pmatrix}}$ and $\transpose{\begin{pmatrix} 1 & 3 & 4 \end{pmatrix}}$. \item We can use either of them to work out the equation of the plane: \begin{equation*} \frac{3}{14} (x_1 - 1) - \frac{9}{14} (x_2 - 3) + \frac{3}{7} (x_3 - 4) = 0 \Rightarrow \frac{3}{14} x_1 - \frac{9}{14} x_2 + \frac{3}{7} x_3 = 0. 
\end{equation*} \end{itemize} \end{frame}\begin{frame} \frametitle{The OLS solution} \begin{itemize} \item The resulting OLS solution is a linear combination of the columns of $\M{X}$, the weights being given by the elements of $\hat{\V{\beta}}$: \begin{equation*} \hat{\V{y}} = \M{X}\hat{\V{\beta}} = \begin{pmatrix} \sfrac{25}{14} \\ \sfrac{23}{14} \\ \sfrac{11}{7} \end{pmatrix} = \hat{\beta}_0 \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \hat{\beta}_1 \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}. \end{equation*} \item Here $\hat{\beta}_0$ is, by construction, the $y$-intercept of the regression line and $\hat{\beta}_1$ is its gradient. \end{itemize} \end{frame}\begin{frame} \frametitle{So have we...} ...solved the mystery of the missing orthogonal projection (and understood the geometry of~OLS)? \end{frame}\begin{frame} \frametitle{Just one more thing} \begin{columns} \begin{column}{0.3\textwidth} \begin{figure} \centering \includegraphics[width=1.\textwidth]{columbo.pdf} \end{figure} \end{column} \begin{column}{0.7\textwidth} Another orthogonal projection is lurking right in front of us and we have nearly missed it! \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{The other projection matrix} \begin{itemize} \item The matrix $\M{M} = \M{I}_n - \M{H}$, known as the \defn{annihilator} matrix, is \nb{also} an orthogonal projection matrix, and it is closely linked to the hat matrix $\M{H}$. \item Notice that $\M{M}$ inherits the following properties from $\M{H}$: it is also \begin{itemize} \item \nb{symmetric}: \begin{equation*} \transpose{\M{M}} = \transpose{\M{I}_n - \M{H}} = \transpose{\M{I}_n} - \transpose{\M{H}} = \M{I}_n - \M{H} = \M{M}; \end{equation*} \item \nb{idempotent}: \begin{equation*} \M{M}^2 = (\M{I}_n - \M{H}) (\M{I}_n - \M{H}) = \M{I}_n \M{I}_n - \M{H} \M{I}_n - \M{I}_n \M{H} + \M{H} \M{H} = \M{I}_n - \M{H} - \M{H} + \M{H} = \M{I}_n - \M{H} = \M{M}. \end{equation*} \end{itemize} \item This matrix projects onto the space that is orthogonal\footnote{Two vector subspaces $A$ and $B$ are \defn{orthogonal} if each vector in $A$ is orthogonal to each vector in $B$.} to the column space of $\M{X}$ --- its so called \defn{orthogonal complement}. \item In fact, \begin{equation*} \hat{\V{\epsilon}} = \V{y} - \hat{\V{y}} = \V{y} - \M{H}\V{y} = (\M{I}_n - \M{H})\V{y} = \M{M}\V{y}, \end{equation*} and we see that the residuals lie in this vector space. \item The orthogonal projection matrix $\M{M}$ acts on $\M{y}$ to obtain the residuals. 
\end{itemize} \end{frame}\begin{frame} \frametitle{The orthogonal decomposition} Thus we have an \defn{orthogonal decomposition} of $\V{y}$: $\V{y} = \hat{\V{y}} + \V{y} - \hat{\V{y}} = \hat{\V{y}} + \hat{\V{\epsilon}} = \underbrace{\M{H}\V{y}}_{\hat{\V{y}}} + \underbrace{\M{M}\V{y}}_{\hat{\V{\epsilon}}}$: \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick, rotate around x=15, rotate around z=-3] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; \coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (0.5,1.5,2){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; \draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[below right] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat) node[midway, above, color=black] {$\hat{\V{\epsilon}}$}; \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \tkzFindAngle(yhat,O,y) \tkzGetAngle{angleyhatOy}; \tkzDrawArc[R with nodes, color=red](O,1cm)(yhat,y) \tkzLabelAngle[above, pos=-0.1](yhat,O,y){$\theta$} \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick, rotate around x=15, rotate around z=-3] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; \coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (X0) at (1,1,1){}; \coordinate (X01) at (1,0,0){}; \coordinate (X02) at 
(0,1,0){}; \coordinate (X03) at (0,0,1){}; \coordinate (X012) at (1,1,0){}; \coordinate (X013) at (1,0,1){}; \coordinate (X023) at (0,1,1){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (1.7857,1.6428,1.5714){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}_{:,1}$}; \draw[color=blue, thick] (O)--(X0); \draw[color=blue, dashed, thin] (X01)--(X012); \draw[color=blue, dashed, thin] (X01)--(X013); \draw[color=blue, dashed, thin] (X02)--(X012); \draw[color=blue, dashed, thin] (X02)--(X023); \draw[color=blue, dashed, thin] (X03)--(X013); \draw[color=blue, dashed, thin] (X03)--(X023); \draw[color=blue, dashed, thin] (X012)--(X0); \draw[color=blue, dashed, thin] (X013)--(X0); \draw[color=blue, dashed, thin] (X023)--(X0); \draw[fill=black] (X0) circle (0.2em) node[below right] {$\V{X}_{:,0}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; \draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[above right] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat) node[midway, above, color=black] {$\hat{\V{\epsilon}}$}; \coordinate (aa) at (3,1,0){}; \filldraw[color=blue, fill opacity=0.3, dashed, thin] (O)--(aa)--(X); \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \end{tikzpicture} \end{column} \end{columns} \end{frame}

\section{How well are we doing?}

\begin{frame} \frametitle{Variance partitioning (i)} \begin{itemize} \item Let $\bar{y} = \sum_{i=1}^n y_i$ be the sample mean. Let us define \begin{itemize} \item \defn{Total Sum of Squares (TSS)}, sometimes also referred to as \nb{Sum of Squares, Total (SST, SSTO)} and \defn{SYY}: $\sum_{i=1}^n (y_i - \bar{y})^2$; \item \defn{Estimated Sum of Squares (ESS)}, sometimes also referred to as \nb{Sum of Squares, Model (SSM)}, \nb{Estimated Sum of Squares (ESS)}, and \nb{Sum of Squares, Regression (SSR, SSreg)}: $\sum_{i=1}^n (\hat{y}_i - \bar{y})^2$; \item \defn{Residual Sum of Squares (RSS)}, sometimes also referred to as \nb{Sum of Squares, Error (SSE)}: $\sum_{i=1}^n (y_i - \hat{y}_i)^2$. \end{itemize} \item Notice that the abbreviations used in the literature can be quite \nb{confusing}: does ``E'' stand for ``estimated'' or for ``error''; does ``R'' stand for ``regression'' or for ``residual''? To avoid confusion, we shall stick with \defn{Total Sum of Squares (TSS)}, \defn{Estimated Sum of Squares (ESS)}, \defn{Residual Sum of Squares (RSS)}. \item If we define $\bar{\V{y}} \in \R^n$ as a vector all of whose elements are the sample mean $\bar{y}$, we can write these down in vector notation: \begin{itemize} \item $\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \transpose{(\V{y} - \bar{\V{y}})}(\V{y} - \bar{\V{y}})$; \item $\text{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \transpose{(\hat{\V{y}} - \bar{\V{y}})}(\hat{\V{y}} - \bar{\V{y}})$; \item $\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \transpose{(\V{y} - \hat{\V{y}})}(\V{y} - \hat{\V{y}})$. \end{itemize} \end{itemize} \end{frame}\begin{frame} \frametitle{Variance partitioning (ii)} \begin{itemize} \item Now notice that \begin{itemize} \item $\text{TSS} = \norm{\V{y} - \bar{\V{y}}}^2 = \norm{\V{y}}^2 - 2\transpose{\V{y}}\bar{\V{y}} + \norm{\bar{\V{y}}}^2$; \item $\text{ESS} = \norm{\hat{\V{y}} - \bar{\V{y}}}^2 = \norm{\hat{\V{y}}}^2 - 2\transpose{\hat{\V{y}}}\bar{\V{y}} + \norm{\bar{\V{y}}}^2$; \item $\text{RSS} = \norm{\V{y} - \hat{\V{y}}}^2 = \norm{\V{\epsilon}}^2$. \end{itemize} \item If the model includes the intercept, the first column of $\M{X}$ is a column of ones and, since $\transpose{\M{X}}\hat{\V{\epsilon}} = \V{0}$, the sum of the residuals is zero. This is equivalent to $\transpose{\V{y}}\bar{\V{y}} = \transpose{\hat{\V{y}}}\bar{\V{y}}$. \item By Pythagoras's theorem, \begin{equation*} \norm{\V{y}}^2 = \norm{\hat{\V{y}}}^2 + \norm{\hat{\V{\epsilon}}}^2. \end{equation*} \item Hence \begin{equation*} \text{TSS} = \text{ESS} + \text{RSS}, \end{equation*} a fundamental result known as \defn{variance partitioning}. \item Irrespective of whether our model includes the intercept, if we center $\V{y}$ by subtracting the sample mean $\bar{y} = \sum_{i=1}^n y_i$ from each element, we notice that $\text{TSS} = \text{ESS} + \text{RSS}$ simply amounts to the above application of Pythagoras's theorem. 
\end{itemize} \end{frame}\begin{frame} \frametitle{From Christopher Clavius's edition of Euclid's Elements} \begin{figure} \centering \includegraphics[width=.72\textwidth]{euclidis-elementorum.pdf} \end{figure} \begin{small} From \url{https://www.maa.org/press/periodicals/convergence/mathematical-treasures-christopher-claviuss-edition-of-euclids-elements} \end{small} \end{frame}\begin{frame} \frametitle{Pythagoras's theorem} \begin{figure} \centering \includegraphics[width=.72\textwidth]{pythagoras-theorem.pdf} \end{figure} \end{frame}\begin{frame} \frametitle{Question: can you spot $R^2$ on these graphs?} \begin{columns} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick, rotate around x=15, rotate around z=-3] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; \coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (0.5,1.5,2){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; \draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[below right] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat) node[midway, above, color=black] {$\hat{\V{\epsilon}}$}; \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \tkzFindAngle(yhat,O,y) \tkzGetAngle{angleyhatOy}; \tkzDrawArc[R with nodes, color=red](O,1cm)(yhat,y) \tkzLabelAngle[above, pos=-0.1](yhat,O,y){$\theta$} \end{tikzpicture} \end{column} \begin{column}{0.5\textwidth} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}, tdplot_main_coords, axis/.style={->,dashed}, thick, rotate around x=15, rotate around z=-3] % -- remove these 3 lines if no axis is preferred \draw[axis] (0, 0, 0) -- (3, 0, 0) node [below] {1st coordinate}; \draw[axis] (0, 0, 0) -- (0, 4, 0) node [below left] {2nd coordinate}; \draw[axis] (0, 0, 0) -- (0, 0, 5) node [above] {3rd coordinate}; \coordinate (O) at (0,0,0){}; 
\coordinate (X) at (1,3,4){}; \coordinate (X1) at (1,0,0){}; \coordinate (X2) at (0,3,0){}; \coordinate (X3) at (0,0,4){}; \coordinate (X12) at (1,3,0){}; \coordinate (X13) at (1,0,4){}; \coordinate (X23) at (0,3,4){}; \coordinate (X0) at (1,1,1){}; \coordinate (X01) at (1,0,0){}; \coordinate (X02) at (0,1,0){}; \coordinate (X03) at (0,0,1){}; \coordinate (X012) at (1,1,0){}; \coordinate (X013) at (1,0,1){}; \coordinate (X023) at (0,1,1){}; \coordinate (y) at (2,1,2){}; \coordinate (y1) at (2,0,0){}; \coordinate (y2) at (0,1,0){}; \coordinate (y3) at (0,0,2){}; \coordinate (y12) at (2,1,0){}; \coordinate (y13) at (2,0,2){}; \coordinate (y23) at (0,1,2){}; \coordinate (yhat) at (1.7857,1.6428,1.5714){}; \draw[fill=black] (O) circle (0.2em) node[below] {$0$}; \draw[color=blue, thick] (O)--(X); \draw[color=blue, dashed, thin] (X1)--(X12); \draw[color=blue, dashed, thin] (X1)--(X13); \draw[color=blue, dashed, thin] (X2)--(X12); \draw[color=blue, dashed, thin] (X2)--(X23); \draw[color=blue, dashed, thin] (X3)--(X13); \draw[color=blue, dashed, thin] (X3)--(X23); \draw[color=blue, dashed, thin] (X12)--(X); \draw[color=blue, dashed, thin] (X13)--(X); \draw[color=blue, dashed, thin] (X23)--(X); \draw[fill=black] (X) circle (0.2em) node[below right] {$\V{X}_{:,1}$}; \draw[color=blue, thick] (O)--(X0); \draw[color=blue, dashed, thin] (X01)--(X012); \draw[color=blue, dashed, thin] (X01)--(X013); \draw[color=blue, dashed, thin] (X02)--(X012); \draw[color=blue, dashed, thin] (X02)--(X023); \draw[color=blue, dashed, thin] (X03)--(X013); \draw[color=blue, dashed, thin] (X03)--(X023); \draw[color=blue, dashed, thin] (X012)--(X0); \draw[color=blue, dashed, thin] (X013)--(X0); \draw[color=blue, dashed, thin] (X023)--(X0); \draw[fill=black] (X0) circle (0.2em) node[below right] {$\V{X}_{:,0}$}; \draw[color=green, thick] (O)--(y); \draw[color=green, dashed, thin] (y1)--(y12); \draw[color=green, dashed, thin] (y1)--(y13); \draw[color=green, dashed, thin] (y2)--(y12); \draw[color=green, dashed, thin] (y2)--(y23); \draw[color=green, dashed, thin] (y3)--(y13); \draw[color=green, dashed, thin] (y3)--(y23); \draw[color=green, dashed, thin] (y12)--(y); \draw[color=green, dashed, thin] (y13)--(y); \draw[color=green, dashed, thin] (y23)--(y); \draw[fill=black] (y) circle (0.2em) node[above left] {$\V{y}$}; \draw[dashed, color=green] (O)--(yhat); \draw[fill=black] (yhat) circle (0.2em) node[above right] {$\hat{\V{y}}$}; \draw[dashed, color=red] (y)--(yhat) node[midway, above, color=black] {$\hat{\V{\epsilon}}$}; \coordinate (aa) at (3,1,0){}; \filldraw[color=blue, fill opacity=0.3, dashed, thin] (O)--(aa)--(X); \tkzMarkRightAngle[draw=black, thin](O,yhat,y) \end{tikzpicture} \end{column} \end{columns} \end{frame}\begin{frame} \frametitle{Coefficient of determination ($R^2$)} \begin{itemize} \item Recall that, in a right triangle, \begin{equation*} \cos(\theta) = \frac{|\text{adjacent}|}{|\text{hypotenuse}|}. \end{equation*} \item Thus \begin{equation*} (\cos \theta)^2 = \frac{\norm{\hat{\V{y}}}^2}{\norm{\V{y}}^2} = \frac{\text{ESS}}{\text{TSS}} = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \eqdef R^2. \end{equation*} \item This quantity is also known as the \defn{coefficient of determination}. \item It indicates how much better the function predicts the dependent variable than the default predictor in the absence of any inputs --- the mean value of the output. \item It is also sometimes described as the \defn{proportion of variance explained}. 
\end{itemize} \end{frame}\begin{frame} \frametitle{$R^2$ never decreases (increases) when features are added (removed)} \begin{itemize} \item Regression can be seen as an orthogonal projection of $\V{y}$ onto the column space of $\M{X}$. \item An orthogonal projection minimizes the length of the vector that represents the difference between $\V{y}$ and the fitted values, $\hat{\V{y}}$, i.e. it minimizes $\norm{\hat{\V{\epsilon}}}$. \item The fitted values are a vector in the column space of $\M{X}$. \item If you add a column to $\M{X}$, the column space of $\M{X}$ either stays the same or gets bigger. \item Therefore, the residual from the orthogonal projection onto this new space cannot possibly be longer than the original residual, since the original column space is a subspace of the new space. \end{itemize} \end{frame}\begin{frame} \frametitle{Theory and practice} \begin{itemize} \item So far we have been careful to address an academic economist's very valid concern: \begin{quote} That works very well in practice, but does it work in theory? \end{quote} \item This said, \begin{quote} The difference between theory and practice is considerably greater in practice than in theory. \end{quote} \item OK, it's time for some practice! \end{itemize} \end{frame}

In [ ]:
\begin{frame}
\frametitle{A word to Sir Isaac Newton}
\begin{columns}
    \begin{column}{0.3\textwidth}
    \begin{figure}
    \centering
    \includegraphics[width=1.\textwidth]{isaac-newton.pdf}
    \end{figure}
    \begin{small}
    Portrait of Sir~Isaac~Newton at~46 (1642--1726/7) by Godfrey~Kneller
    \end{small}
    \end{column}
    \begin{column}{0.7\textwidth}
    \begin{quote}
    A Vulgar Mechanick can \nb{practice what he has been taught or seen done}, but if he is in an error he knows not how to find it out and correct it, and if you put him out of his road he is at a stand. Whereas he that is able to \nb{reason nimbly and judiciously} about figure, force, and motion, is never at rest till he gets over every rub.~\cite{newton-1694}
    \end{quote}
    \end{column}
\end{columns}
\end{frame}

In [ ]:
\begin{frame}[fragile]
\frametitle{\texttt{lm} in R}
\begin{itemize}
\item While Python is now a \latin{de facto} \latin{lingua franca} in machine learning, R is still a \latin{de facto} \latin{lingua franca} in statistical analysis. Let us consider its implementation of linear regression.
\item Let us play with our example in R:
    \begin{snippet}{R}
x <- c(1., 3., 4.)
y <- c(2., 1., 2.)
regr <- lm(y ~ x)
regr
    \end{snippet}
\item This outputs
    \begin{snippet}{output}
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    1.85714     -0.07143
    \end{snippet}
\item The coefficients match $\hat{\beta}_0 = \sfrac{13}{7}$ and $\hat{\beta}_1 = -\sfrac{1}{14}$, respectively, which is encouraging. But we need to know more.
\end{itemize}
\end{frame}
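
As a cross-check, the same coefficients can be reproduced in Python, the language used elsewhere in this notebook. The minimal sketch below assumes only NumPy: it builds the design matrix with a leading column of ones and solves the least-squares problem.

In [ ]:
import numpy as np

x = np.array([1., 3., 4.])
y = np.array([2., 1., 2.])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem min ||y - X beta||^2
beta_hat, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # approximately [ 1.85714286 -0.07142857], i.e. 13/7 and -1/14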

\begin{frame}[fragile]
\frametitle{\texttt{summary} in R}
\begin{itemize}
\item Our next step is
    \begin{snippet}{R}
summary(regr)
    \end{snippet}
\item which outputs
    \begin{snippet}{output}
Call:
lm(formula = y ~ x)

Residuals:
      1       2       3
 0.2143 -0.6429  0.4286

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.85714    1.09265   1.700    0.339
x           -0.07143    0.37115  -0.192    0.879

Residual standard error: 0.8018 on 1 degrees of freedom
Multiple R-squared:  0.03571,   Adjusted R-squared:  -0.9286
F-statistic: 0.03704 on 1 and 1 DF,  p-value: 0.879
    \end{snippet}
\end{itemize}
\end{frame}

\begin{frame}[fragile]
\frametitle{\texttt{summary} in R: residuals}
\begin{itemize}
\item In this summary,
    \begin{snippet}{output}
Residuals:
      1       2       3
 0.2143 -0.6429  0.4286
    \end{snippet}
    can also be obtained through
    \begin{snippet}{R}
resid(regr)
    \end{snippet}
\item This is simply our vector of \defn{residuals}, i.e.
    \begin{snippet}{R}
y - regr$fitted.values
    \end{snippet}
\end{itemize}
\end{frame}
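
The residuals reported by R can be reproduced directly from the definition $\hat{\V{\epsilon}} = \V{y} - \M{X}\hat{\V{\beta}}$; a minimal NumPy check for the running example:

In [ ]:
import numpy as np

X = np.column_stack([np.ones(3), [1., 3., 4.]])
y = np.array([2., 1., 2.])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
residuals = y - X @ beta_hat

print(residuals)  # approximately [ 0.2143 -0.6429  0.4286], i.e. 3/14, -9/14, 3/7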

In [ ]:
\begin{frame}[fragile]
\frametitle{\texttt{summary} in R: coefficients}
\begin{itemize}
\item The table of coefficients, i.e. $\hat{\V{\beta}}$, follows:
    \begin{snippet}{output}
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.85714    1.09265   1.700    0.339
x           -0.07143    0.37115  -0.192    0.879
    \end{snippet}
\item The first column, \nb{Estimate}, lists our $\hat{\beta}_0 = \sfrac{13}{7} \approx 1.85714$ and $\hat{\beta}_1 = -\sfrac{1}{14} \approx -0.07143$.
\item The second column, \nb{Std. Error}, lists the standard error for each regression coefficient. Recall that $\hat{\V{\beta}}$ is a \nb{point} estimate of $\V{\beta}$. Because of sampling variability, this estimate may be too high or too low. The standard error gives us an indication of how much the point estimate is likely to vary from the corresponding population parameter. These standard errors can also be computed like this:
    \begin{snippet}{R}
sqrt(diag(vcov(regr)))
    \end{snippet}
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{$\hat{\V{\beta}}$ is an unbiased estimator of $\V{\beta}$}
\begin{itemize}
\item The Ordinary Least Squares (OLS) method has given us the estimator
    \begin{equation*}
    \hat{\V{\beta}} = \M{X}^+\V{y} = (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{y}.
    \end{equation*}
\item Let us compute its expected value:
    \begin{small}
    \begin{align*}
    \E[\hat{\V{\beta}}] &= \E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{y}] \\
    &= \E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}(\M{X}\V{\beta} + \V{\epsilon})] \quad \text{(substituting in the linear model $\V{y} = \M{X}\V{\beta} + \V{\epsilon}$)} \\
    &= \E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\M{X}\V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}] \quad{\text{(multiplying out)}} \\
    &= \E[(\transpose{\M{X}}\M{X})^{-1}(\transpose{\M{X}}\M{X})\V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}] \quad{\text{(associativity of matrix multiplication)}} \\
    &= \E[\M{I}_p\V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}] \quad{\text{(matrix inverse)}} \\
    &= \V{\beta} + \E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}] \quad{\text{(matrix identity; $\V{\beta}$ \nb{assumed} to be fixed)}} \\
    &= \V{\beta} + \E[\E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon} \given \M{X}]] \quad{\text{(Law of Iterated Expectation)}} \\
    &= \V{\beta} + \E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}} \E[\V{\epsilon} \given \M{X}]] \quad{\text{(taking $\M{X}$ factors out of $\E[\cdot \given \M{X}]$)}} \\
    &= \V{\beta} + \E[(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{0}_n] \quad{\text{(\nb{assuming} $\E[\V{\epsilon} \given \M{X}] = \V{0}_n$)}} \\
    &= \V{\beta} + \E[\V{0}_p] = \V{\beta}.
    \end{align*}
    \end{small}
\item Notice that, as in frequentist estimation, we assume that the true value of $\V{\beta}$ is fixed and non-random, so we can take it out of the expectation operator.
\item Also notice that we didn't need to assume that $\M{X}$ is deterministic, only that $\E[\V{\epsilon} \given \M{X}] = 0$.
\end{itemize}
\end{frame}
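
The unbiasedness of $\hat{\V{\beta}}$ can be illustrated numerically. The sketch below is a simulation under assumed values of $\V{\beta}$, $\sigma$ and a fixed random seed (none of which are part of the worked example): it averages the OLS estimates over many simulated noise vectors and compares the average to the true $\V{\beta}$.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)

X = np.column_stack([np.ones(3), [1., 3., 4.]])
beta_true = np.array([1.0, 0.5])   # assumed "true" coefficients for the simulation
sigma = 1.0                        # assumed noise standard deviation

n_sims = 100_000
estimates = np.empty((n_sims, 2))
for k in range(n_sims):
    eps = rng.normal(0.0, sigma, size=3)
    y = X @ beta_true + eps
    estimates[k] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))  # close to beta_true, illustrating E[beta_hat] = beta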

In [ ]:
\begin{frame}
\frametitle{Question}
\begin{itemize}
\item Is the \nb{assumption} $\E[\V{\epsilon} \given \M{X}] = \V{0}_n$ justified?
\item Does it follow from the fact that the residuals add up to zero (which would be enforced by the leading column of ones in $\M{X}$)?
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of the $\hat{\V{\beta}}$ estimator (i)}
\begin{itemize}
\item Now let us compute the variance of the $\hat{\V{\beta}}$ estimator. First, notice that
    \begin{align*}
    \hat{\V{\beta}} - \E[\hat{\V{\beta}}] &= \hat{\V{\beta}} - \V{\beta} \quad \text{($\hat{\V{\beta}}$ is unbiased)} \\
    &= (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{y} - \V{\beta} \quad \text{(substituting in $\hat{\V{\beta}} = (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{y}$)} \\
    &= (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}(\M{X}\V{\beta} + \V{\epsilon}) - \V{\beta} \quad \text{(substituting in $\V{y} = \M{X}\V{\beta} + \V{\epsilon}$)} \\
    &= (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\M{X}\V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon} - \V{\beta} \quad \text{(multiplying out)} \\
    &= (\transpose{\M{X}}\M{X})^{-1}(\transpose{\M{X}}\M{X})\V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon} - \V{\beta} \quad \text{(associativity of matrix multiplication)} \\
    &= \M{I}_p\V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon} - \V{\beta} \quad \text{(matrix inverses)} \\
    &= \V{\beta} + (\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon} - \V{\beta} \quad \text{(matrix identity)} \\
    &= \underbrace{(\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}}_{\M{X}^+}\V{\epsilon}.
    \end{align*}
\item Thus the Moore--Penrose pseudoinverse, $\M{X}^+$, acts as a linear transformation on $\V{\epsilon}$ giving $\hat{\V{\beta}} - \V{\beta}$ as a result (while it acts on $\V{y}$ giving $\hat{\V{\beta}}$ as a result, as we know).
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of the $\hat{\V{\beta}}$ estimator (ii)}
\begin{itemize}
\item Hence
    \begin{align*}
    \Var[\hat{\V{\beta}}] &\defeq \E[\left(\hat{\V{\beta}} - \E[\hat{\V{\beta}}]\right)\transpose{\left(\hat{\V{\beta}} - \E[\hat{\V{\beta}}]\right)}] \\
    &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}\right)\transpose{\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}\right)}] \\
    &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}\right)\left(\transpose{\V{\epsilon}}\transpose{(\transpose{\M{X}})}\transpose{((\transpose{\M{X}}\M{X})^{-1})}\right)] \quad{(\text{transpose of a product})} \\
    &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}\right)\left(\transpose{\V{\epsilon}}\M{X}(\transpose{(\transpose{\M{X}}\M{X})})^{-1}\right)] \quad{(\text{transposes and inverses commute})} \\
    &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\V{\epsilon}\right)\left(\transpose{\V{\epsilon}}\M{X}(\transpose{\M{X}}\M{X})^{-1}\right)] \quad{(\text{transpose of a product})} \\
    &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\right)(\V{\epsilon}\transpose{\V{\epsilon}})\left(\M{X}(\transpose{\M{X}}\M{X})^{-1}\right)]. \quad{(\text{associativity of matrix multiplication})} \\
    \end{align*}
\item To proceed, we make the \nb{assumptions} that $\E[\V{\epsilon}] = \V{0}_n$ and $\Var[\V{\epsilon}] = \sigma^2\M{I}_n$, for some constant $\sigma < \infty$, which is the case when the following three assumptions from the \nb{classical linear regression model (CLRM)} hold:
    \begin{itemize}
    \item for all $i \in [1, \ldots, n]$, $\epsilon_i$ have a mean of zero, $\E[\epsilon_i] = 0$;
    \item for all $i \in [1, \ldots, n]$, $\epsilon_i$ have the same finite variance, $\Var[\epsilon_i] = \sigma^2 < \infty$, in other words, the random variables $\epsilon_i$ are \alert{homoscedastic};
    \item for all $i, j \in [1, \ldots, n]$, $i \neq j$, the $\epsilon_i$ are uncorrelated, $\Cov[\epsilon_i, \epsilon_j] = 0$.
    \end{itemize}
\item Then $\E[\V{\epsilon}\transpose{\V{\epsilon}}] = \sigma^2\M{I}_n$ and...
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of the $\hat{\V{\beta}}$ estimator (iii)}
\begin{itemize}
\item ...hence
    \begin{align*}
    \Var[\hat{\V{\beta}}] &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\right)(\V{\epsilon}\transpose{\V{\epsilon}})\left(\M{X}(\transpose{\M{X}}\M{X})^{-1}\right)] \\
    &= \E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\right)(\sigma^2\M{I}_n)\left(\M{X}(\transpose{\M{X}}\M{X})^{-1}\right)] \\
    &= \sigma^2\E[\left((\transpose{\M{X}}\M{X})^{-1}\transpose{\M{X}}\right)\left(\M{X}(\transpose{\M{X}}\M{X})^{-1}\right)] \quad \text{(matrix identity, taking out the constant)} \\
    &= \sigma^2\E[(\transpose{\M{X}}\M{X})^{-1}(\transpose{\M{X}}\M{X})(\transpose{\M{X}}\M{X})^{-1}] \quad \text{(associativity of matrix multiplication)} \\
    &= \sigma^2\E[\M{I}_p(\transpose{\M{X}}\M{X})^{-1}] \quad \text{(matrix inverses)} \\
    &= \sigma^2\E[(\transpose{\M{X}}\M{X})^{-1}]. \quad \text{(matrix identity)} \end{align*}
\item If we were also to \nb{assume} that the regressors, $\M{X}$, are non-random and observable (another assumption from \nb{CLRM}), we could get rid of the expectation operator and write $\Var[\hat{\V{\beta}}] = \sigma^2(\transpose{\M{X}}\M{X})^{-1}$.
\item When we are writing $(\transpose{\M{X}}\M{X})^{-1}$ anywhere, we are also making another assumption from \nb{CLRM}: that the regressors are not perfectly linearly correlated with each other, thus $\M{X}$ is full rank and therefore the inverse of the Gram matrix does exist.
\end{itemize}
\end{frame}
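
Under the same simulation assumptions as in the sketch above (assumed beta_true, sigma, and seed), the empirical covariance of the estimates should be close to $\sigma^2(\transpose{\M{X}}\M{X})^{-1}$; a minimal check:

In [ ]:
import numpy as np

rng = np.random.default_rng(1)

X = np.column_stack([np.ones(3), [1., 3., 4.]])
beta_true = np.array([1.0, 0.5])   # assumed values, as in the previous sketch
sigma = 1.0

n_sims = 200_000
estimates = np.empty((n_sims, 2))
for k in range(n_sims):
    y = X @ beta_true + rng.normal(0.0, sigma, size=3)
    estimates[k] = np.linalg.solve(X.T @ X, X.T @ y)

empirical = np.cov(estimates, rowvar=False)
theoretical = sigma**2 * np.linalg.inv(X.T @ X)
print(empirical)
print(theoretical)  # the two matrices should agree closely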

In [ ]:
\begin{frame}
\frametitle{Recap: some statistical properties of $\hat{\V{\beta}}$}
\begin{itemize}
\item To recap, modulo the \nb{assumptions} stated above,
    \begin{itemize}
    \item $\hat{\V{\beta}}$ is unbiased, and so $\E[\hat{\V{\beta}}] = \V{\beta}$;
    \item $\Var[\hat{\V{\beta}}] = \sigma^2(\transpose{\M{X}}\M{X})^{-1}$.
    \end{itemize}
\item But in general, unless we have generated our inputs, $\M{X}$, and outputs, $\V{y}$, ourselves, we don't usually know $\sigma^2$.
\item So we need to estimate it.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{A \nb{biased} estimator for $\sigma^2$ (i)}
\begin{itemize}
\item How would we go about it?
\item We could reason (na{\"i}vely!) as follows: ``$\sigma^2$ is the variance of the disturbances $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$. $\hat{\V{\epsilon}}$ is an estimate of $\V{\epsilon}$. Perhaps we should take the uncorrected sample variance of $\hat{\epsilon}_1, \hat{\epsilon}_2, \ldots, \hat{\epsilon}_n$,
    \begin{equation*}
    S^2 = \frac{1}{n} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}},
    \end{equation*}
    or the corrected sample variance of $\hat{\epsilon}_1, \hat{\epsilon}_2, \ldots, \hat{\epsilon}_n$,
    \begin{equation*}
    \hat{S}^2 = \frac{1}{n-1} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n-1}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}},
    \end{equation*}
    as our estimate of $\sigma^2$.''
\item In general, these estimators would be \nb{biased}!
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{A \nb{biased} estimator for $\sigma^2$ (ii)}
\begin{itemize}
\item To see why, let us compute $\E[\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}}]$:
    \begin{align*}
    \E[\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}}] &= \E[\transpose{\left(\M{M}\V{y}\right)}\left(\M{M}\V{y}\right)] \\
    &= \E[\transpose{\V{y}}\transpose{\M{M}}\M{M}\V{y}] \quad \text{(transpose of a product)} \\
    &= \E[\transpose{\V{y}}(\transpose{\M{M}}\M{M})\V{y}] \quad \text{(associativity of matrix multiplication)} \\
    &= \E[\transpose{\V{y}}(\M{M}\M{M})\V{y}] \quad \text{(projection matrices are symmetric)} \\
    &= \E[\transpose{\V{y}}\M{M}\V{y}] \quad \text{(projection matrices are idempotent)} \\
    &= \E[\transpose{(\M{X}\V{\beta} + \V{\epsilon})}\M{M}(\M{X}\V{\beta} + \V{\epsilon})] \\
    &= \E[(\transpose{\V{\beta}}\transpose{\M{X}} + \transpose{\V{\epsilon}})\M{M}(\M{X}\V{\beta} + \V{\epsilon})] \quad \text{(transpose of a sum, product)} \\
    &= \E[\transpose{\V{\beta}}\transpose{\M{X}}\M{M}\M{X}\V{\beta} + \transpose{\V{\beta}}\transpose{\M{X}}\M{M}\V{\epsilon} + \transpose{\V{\epsilon}}\M{M}\M{X}\V{\beta} + \transpose{\V{\epsilon}}\M{M}\V{\epsilon}]. \quad \text{(multiplying out)} \\
    \end{align*}
\item Since $\M{M}$ projects onto the vector space orthogonal to the column space of $\M{X}$, $\M{M}\M{X} = \M{0}_{n \times p}$ and $\transpose{\M{X}}\M{M} = \M{0}_{p \times n}$, and so
    \begin{align*}
    \E[\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}}] &= \E[\transpose{\V{\epsilon}}\M{M}\V{\epsilon}] \\
    &= \E[\transpose{\V{\epsilon}}(\M{I}_n - \M{H})\V{\epsilon}] \quad \text{(the relationship between the hat and annihilator matrices)} \\
    &= \E[\transpose{\V{\epsilon}}\V{\epsilon} - \transpose{\V{\epsilon}}\M{H}\V{\epsilon}] \\
    &= \E[\transpose{\V{\epsilon}}\V{\epsilon}] - \E[\transpose{\V{\epsilon}}\M{H}\V{\epsilon}]. \quad \text{(linearity of expectations)}
    \end{align*}
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{A \nb{biased} estimator for $\sigma^2$ (iii)}
\begin{itemize}
\item Thus we end up with
    \begin{equation*}
    \E[\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}}] = \E[\transpose{\V{\epsilon}}\V{\epsilon}] - \E[\transpose{\V{\epsilon}}\M{H}\V{\epsilon}].
    \end{equation*}
\item The first term is
    \begin{equation*}
    \E[\transpose{\V{\epsilon}}\V{\epsilon}] = \E[\sum_{i=1}^n \epsilon_i^2] = \sum_{i=1}^n \E[\epsilon_i^2] = \sum_{i=1}^n \sigma^2 = n\sigma^2.
    \end{equation*}
\item To compute the second, we first simplify it as
    \begin{gather*}
    \E[\transpose{\V{\epsilon}}\M{H}\V{\epsilon}] = \E[\sum_{i=1}^n \sum_{j=1}^n H_{i,j} \epsilon_i \epsilon_j] = \sum_{i=1}^n \sum_{j=1}^n \E[H_{i,j} \epsilon_i \epsilon_j] = \sum_{i=1}^n H_{i,i} \E[\epsilon_i^2] \\
    = \sum_{i=1}^n H_{i,i} \sigma^2 = \sigma^2 \sum_{i=1}^n H_{i,i} = \sigma^2 \tr{\M{H}}.
    \end{gather*}
\item Then we recall that the matrices in a trace of a product can be switched without changing the result, i.e. if $\M{A} \in \R^{q \times r}, \M{B} \in \R^{r \times q}$, then $\tr(\M{A}\M{B}) = \tr(\M{B}\M{A})$. (Thus trace is \defn{``cyclic''}.) Hence
    \begin{equation*}
    \tr\M{H} = \tr([\M{X}(\transpose{\M{X}}\M{X})^{-1}]\transpose{\M{X}}) = \tr(\transpose{\M{X}}[\M{X}(\transpose{\M{X}}\M{X})^{-1}]) = \tr([\transpose{\M{X}}\M{X}](\transpose{\M{X}}\M{X})^{-1}) = \tr\M{I}_{p^*} = p^*,
    \end{equation*}
    where $p^*$ is the number of (independent) columns in $\M{X}$. And since there is one column of ones to enable the intercept and $p$ columns corresponding to the $p$ features, $p^* = p + 1$.
\item More generally, the trace of any idempotent matrix is equal to its rank.
\end{itemize}
\end{frame}
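
For the running example, the trace of the hat matrix can be checked directly; a short NumPy sketch using the $3 \times 2$ design matrix used throughout:

In [ ]:
import numpy as np

X = np.column_stack([np.ones(3), [1., 3., 4.]])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # the hat matrix

print(np.trace(H))               # 2.0, i.e. p* = p + 1 with p = 1
print(np.linalg.matrix_rank(H))  # also 2: the trace of an idempotent matrix equals its rank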

In [ ]:
\begin{frame}
\frametitle{An unbiased estimator for $\sigma^2$}
\begin{itemize}
\item Thus
    \begin{equation*}
    \E[\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}}] = \E[\transpose{\V{\epsilon}}\V{\epsilon}] - \E[\transpose{\V{\epsilon}}\M{H}\V{\epsilon}] = n\sigma^2 - p^*\sigma^2 = (n - p^*)\sigma^2.
    \end{equation*}
\item Therefore, an unbiased estimator of $\sigma^2$ is given \nb{neither} by
    \begin{equation*}
    S^2 = \frac{1}{n} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}},
    \end{equation*}
    \nb{nor} by
    \begin{equation*}
    \hat{S}^2 = \frac{1}{n-1} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n-1}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}},
    \end{equation*}
    but by
    \begin{equation*}
    \hat{\sigma}^2 \defeq \frac{1}{n-p^*} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n-p^*}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}},
    \end{equation*}
    where the quantity $n-p^*$ is known as the number of \defn{degrees of freedom}.
\end{itemize}
\end{frame}
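
The bias of the naive estimators and the unbiasedness of the $n - p^*$ version can be illustrated with the same kind of simulation as before (beta_true, sigma, and the seed are assumed values chosen for illustration):

In [ ]:
import numpy as np

rng = np.random.default_rng(2)

X = np.column_stack([np.ones(3), [1., 3., 4.]])
n, p_star = X.shape
beta_true = np.array([1.0, 0.5])   # assumed values for the simulation
sigma = 1.0

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                  # annihilator matrix

rss = np.empty(100_000)
for k in range(rss.size):
    y = X @ beta_true + rng.normal(0.0, sigma, size=n)
    rss[k] = y @ M @ y             # residual sum of squares

print((rss / n).mean())             # biased: roughly (n - p*)/n * sigma^2 = 1/3
print((rss / (n - p_star)).mean())  # roughly sigma^2 = 1: the unbiased estimator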

In [ ]:
\begin{frame}
\frametitle{In our example... (i)}
\begin{itemize}
\item In our example,
    \begin{equation*}
    \hat{\V{\epsilon}} = \V{y} - \hat{\V{y}} = \begin{pmatrix} 2 \\ 1 \\ 2 \end{pmatrix} - \begin{pmatrix} \sfrac{25}{14} \\ \sfrac{23}{14} \\ \sfrac{11}{7} \end{pmatrix} = \begin{pmatrix} \sfrac{3}{14} \\ -\sfrac{9}{14} \\ \sfrac{3}{7} \end{pmatrix},
    \end{equation*}
\item so
    \begin{equation*}
    \transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}} = \left(\sfrac{3}{14}\right)^2 + \left(-\sfrac{9}{14}\right)^2 + \left(\sfrac{3}{7}\right)^2 = \sfrac{9}{14}.
    \end{equation*}
\item $n = 3$ (we have 3 data points), $p = 1$ (we have one feature), the intercept is enabled, so there is an extra column in $\M{X}$, and so $p^* = p + 1 = 2$. Therefore
    \begin{equation*}
    \frac{1}{n - p^*} = \frac{1}{3 - 2} = 1.
    \end{equation*}
\item Hence
    \begin{equation*}
    \hat{\sigma}^2 = \frac{1}{n-p^*} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{n-p^*}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}} = 1 \cdot \sfrac{9}{14} = \sfrac{9}{14}.
    \end{equation*}
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{In our example... (ii)}
\begin{itemize}
\item The Gram matrix is
    \begin{equation*}
    \transpose{\M{X}}\M{X} = \begin{pmatrix} 3 & 8 \\ 8 & 26 \end{pmatrix},
    \end{equation*}
\item its inverse
    \begin{equation*}
    (\transpose{\M{X}}\M{X})^{-1} = \frac{1}{14} \begin{pmatrix} 26 & -8 \\ -8 & 3 \end{pmatrix},
    \end{equation*}
\item and so we can \nb{estimate}
    \begin{align*}
    \Var[\hat{\V{\beta}}] &= \sigma^2\E[(\transpose{\M{X}}\M{X})^{-1}] \\
        &= \sigma^2(\transpose{\M{X}}\M{X})^{-1} \\
        &\approx \hat{\sigma}^2(\transpose{\M{X}}\M{X})^{-1} \\
        &= \frac{9}{14} \cdot \frac{1}{14} \begin{pmatrix} 26 & -8 \\ -8 & 3 \end{pmatrix} \\
        &= \frac{9}{196} \begin{pmatrix} 26 & -8 \\ -8 & 3 \end{pmatrix}.
    \end{align*}
\item Now look at the square roots of the diagonal entries of this matrix:
    \begin{equation*}
    \sfrac{3}{14} \sqrt{26} \approx 1.0926470, \sfrac{3}{14} \sqrt{3} \approx 0.3711537.
    \end{equation*}
    These are precisely the entries in the \nb{Std. Error} column of the \nb{Coefficients} table.
\end{itemize}
\end{frame}
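
The standard errors reported by R can likewise be reproduced; a minimal NumPy check for the running example:

In [ ]:
import numpy as np

X = np.column_stack([np.ones(3), [1., 3., 4.]])
y = np.array([2., 1., 2.])
n, p_star = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p_star)   # 9/14

std_errors = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
print(std_errors)  # approximately [1.0926470, 0.3711537], matching R's Std. Error column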

In [ ]:
\begin{frame}
\frametitle{How many data points do we need?}
\begin{itemize}
\item Thus the standard errors are proportional to $\frac{1}{\sqrt{n - p^*}}$.
\item As in the case of the standard error of the mean, we need \nb{four times} as many data points to \nb{halve} the standard errors in the estimate of the elements of $\hat{\V{\beta}}$; \nb{100} times as many data points to decrease the standard errors by a factor of \nb{10}; \nb{10,000} times as many data points to decrease them by a factor of \nb{100}, etc.
\item This is, of course, assuming that $n \gg p^*$, which may not be the case if we are dealing with a \nb{small sample}.
\item Depending on the problem at hand, small samples may not be uncommon.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}[fragile]
\frametitle{The next column in the table of coefficients}
\begin{itemize}
\item In our table of coefficients,
    \begin{snippet}{output}
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.85714    1.09265   1.700    0.339
x           -0.07143    0.37115  -0.192    0.879
    \end{snippet}
    we have understood the columns \texttt{Estimate} and \texttt{Std. Error}.
\item Let us now turn to the \nb{\texttt{t value}} column and see if we can understand it.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Chi-squared ($\chi^2$) distribution}
\begin{itemize}
\item Suppose that $Y_1, Y_2, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} \Normal{\mu}{\sigma^2}$.
\item We shall soon show\footnote{Closely following the 2012 lecture notes by Dan Nettleton of Iowa State University for Statistics~611.} that the sample mean, $\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$, and the (corrected) sample variance,
    \begin{equation*}
    \hat{S}^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2
    \end{equation*}
    are \nb{independent} and the random variable
    \begin{equation*}
    \frac{(n-1)\hat{S}^2}{\sigma^2} = \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{\sigma^2} \eqdef U
    \end{equation*}
    has the \defn{chi-squared distribution}, $\chi^2_{\nu}$, with parameter $\nu = n - 1$.
\item Notice that $U$ depends on the true value $\sigma^2$, which often isn't known. In other words, the chi-squared distribution describes the (scaled) sample variance of \nb{normal} random variables when the variance $\sigma^2$ is \nb{known}.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of $\hat{S}^2$ for samples from normal distribution}
\begin{itemize}
\item The pdf of $\chi^2_{\nu}$ is
    \begin{equation*}
    f_U(u) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}u^{\nu/2-1}e^{-u/2}, \quad u \geq 0,
    \end{equation*}
\item its expected value and variance, respectively,
    \begin{equation*}
    \E[U] = \nu, \qquad \Var[U] = 2\nu.
    \end{equation*}
\item From this it follows that
    \begin{equation*}
    \E[\hat{S}^2] = \frac{\sigma^2}{n-1}\E[U] = \frac{\sigma^2}{n-1} \cdot (n-1) = \sigma^2,
    \end{equation*}
    we yet again confirm that $\hat{S}^2$ is an unbiased estimator, and
    \begin{equation*}
    \Var[\hat{S}^2] = \frac{\sigma^4}{(n-1)^2}\Var[U] = \frac{\sigma^4}{(n-1)^2} \cdot 2(n-1) = \frac{2\sigma^4}{n-1}.
    \end{equation*}
\end{itemize}
\end{frame}
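
These moments are easy to verify by simulation; the sketch below draws repeated normal samples (mu, sigma, n, and the seed are assumed values for illustration) and compares the empirical variance of $\hat{S}^2$ with $2\sigma^4/(n-1)$.

In [ ]:
import numpy as np

rng = np.random.default_rng(3)

mu, sigma, n = 1.0, 2.0, 10        # assumed parameters for the simulation
samples = rng.normal(mu, sigma, size=(200_000, n))
s2 = samples.var(axis=1, ddof=1)   # corrected sample variance of each sample

print(s2.mean())  # roughly sigma^2 = 4
print(s2.var())   # roughly 2 * sigma^4 / (n - 1) = 32 / 9 ≈ 3.56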

In [ ]:
\begin{frame}
\frametitle{The pdf of the chi-squared distribution}
\begin{figure}
\centering
\includegraphics[width=.72\textwidth]{chi-squared-distribution-pdf.pdf}
\end{figure}
\begin{small}
From Wikipedia \url{https://en.wikipedia.org/wiki/Chi-squared_distribution}
\end{small}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The noncentral chi-squared distribution}
\begin{itemize}
\item The chi-squared distribution can be generalized to the \defn{noncentral chi-squared distribution}, $\chi^2_{\nu}(\phi)$.
\item Its mean and variance are given by, respectively,
    \begin{equation*}
    \E[U] = \nu + \phi, \quad \Var[U] = 2(\nu + 2\phi).
    \end{equation*}
\item The additional (compared to the chi-squared distribution) parameter, $\phi$, is known as the \defn{noncentrality}.
\end{itemize}
\end{frame}
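
These moments can be checked against SciPy's implementation, assuming SciPy is available; scipy.stats.ncx2 calls the noncentrality parameter nc, which is identified here with $\phi$ so that the moments above are reproduced.

In [ ]:
import scipy.stats as st

nu, phi = 5.0, 3.0                  # assumed degrees of freedom and noncentrality
dist = st.ncx2(df=nu, nc=phi)

print(dist.mean())  # nu + phi = 8
print(dist.var())   # 2 * (nu + 2 * phi) = 22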

In [ ]:
\begin{frame}
\frametitle{The pdf of the noncentral chi-squared distribution}
\begin{figure}
\centering
\includegraphics[width=.72\textwidth]{noncentral-chi-squared-distribution-pdf.pdf}
\end{figure}
\begin{small}
From Wikipedia \url{https://en.wikipedia.org/wiki/Noncentral_chi-squared_distribution}
\end{small}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Cochran's theorem}
\begin{columns}
    \begin{column}{0.3\textwidth}
    \begin{figure}
    \centering
    \includegraphics[width=1.\textwidth]{william-gemmell-cochran.pdf}
    \end{figure}
    \begin{small}
    William Gemmell Cochran (1909--1980)
    \end{small}
    \end{column}
    \begin{column}{0.7\textwidth}
    Suppose $\V{y} \sim \Normal{\V{\mu}}{\sigma^2\M{I}_{n \times n}}$ is an $n$-dimensional multivariate normal random vector, $\M{Q}_1, \M{Q}_2, \ldots, \M{Q}_k$, $k \in \Nnz$, are symmetric matrices with $r_i = \rank(\M{Q}_i)$ for $i = 1, 2, \ldots, k$, such that
    \begin{equation*}
    \sum_{i=1}^k \M{Q}_i = \M{I}_{n \times n}.
    \end{equation*}
    Then any one of the following conditions implies the other two:
    \begin{enumerate}
    \item $\sum_{i=1}^k r_i = n$;
    \item for all $i = 1, 2, \ldots, k$, the \defn{quadratic forms} $\transpose{\V{y}}\M{Q}_i\V{y}$ are (possibly noncentrally) chi-squared distributed, $\transpose{\V{y}}\M{Q}_i\V{y} \sim \sigma^2 \chi^2_{r_i}(\phi_i)$, with noncentrality parameter $\phi_i = \frac{1}{2\sigma^2}\transpose{\V{\mu}}\M{Q}_i\V{\mu}$ (as a consequence, the $\M{Q}_i$ are positive semidefinite);
    \item $\transpose{\V{y}}\M{Q}_i\V{y}$ is independent of $\transpose{\V{y}}\M{Q}_j\V{y}$ for all $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, k$, $i \neq j$.
    \end{enumerate}
    Note that the notation $\transpose{\V{y}}\M{Q}_i\V{y} \sim \sigma^2 \chi^2_{r_i}(\phi_i)$ is ``syntactic sugar'' for $\frac{\transpose{\V{y}}\M{Q}_i\V{y}}{\sigma^2} \sim \chi^2_{r_i}(\phi_i)$.

    Cochran's theorem was published in~\cite{cochran-1934}. For proof, see~\cite{siotani-1985}.
    \end{column}
\end{columns}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of $\hat{S}^2$ for samples from normal distribution: proof (i)}
\begin{itemize}
\item Let $Y_1, Y_2, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} \Normal{\mu}{\sigma^2}$. Set $\V{y} \defeq \transpose{\begin{pmatrix} Y_1 & Y_2 & \ldots & Y_n \end{pmatrix}}$, $\V{\mu} \defeq \mu \V{1}_n$, where $\V{1}_n$ is an $n$-dimensional vector of ones.
\item Set $\M{Q}_1 = \V{1}_n(\transpose{\V{1}_n}\V{1}_n)^{-1}\transpose{\V{1}_n} = \frac{1}{n} \V{1}_n\transpose{\V{1}_n}$, $\M{Q}_2 = \M{I}_{n \times n} - \M{Q}_1$.
\item By construction, the matrices satisfy $\M{Q}_1 + \M{Q}_2 = \M{I}_{n \times n}$.
\item Both matrices are symmetric and idempotent. We know that the rank of an idempotent matrix is equal to its trace, so
    \begin{gather*}
    r_1 \defeq \rank{\M{Q}_1} = \tr{\M{Q}_1} = n \cdot \frac{1}{n} = 1, \\
    r_2 \defeq \rank{\M{Q}_2} = \tr{\M{Q}_2} = n \cdot \left( 1 - \frac{1}{n} \right) = n - 1.
    \end{gather*}
\item The ranks $r_1$ and $r_2$ add up to $n$, so we satisfy the first condition of Cochran's theorem.
\item As a consequence, the second and third conditions must hold, too.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of $\hat{S}^2$ for samples from normal distribution: proof (ii)}
\begin{itemize}
\item For $i = 1$,
    \begin{equation*}
    \transpose{\V{y}}\M{Q}_1\V{y} = \transpose{\V{y}}\left(\frac{1}{n}\V{1}\transpose{\V{1}}\right)\V{y} = n\bar{Y}^2,
    \end{equation*}
\item and
    \begin{equation*}
    \phi_1 = \frac{1}{2\sigma^2} \cdot \transpose{\left(\mu\V{1}\right)} \left(\frac{1}{n}\V{1}\transpose{\V{1}}\right)\left(\mu\V{1}\right) = \frac{n\mu^2}{2\sigma^2}.
    \end{equation*}
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of $\hat{S}^2$ for samples from normal distribution: proof (iii)}
\begin{small}
\begin{itemize}
\item For $i = 2$,
    \begin{align*}
    \transpose{\V{y}}\M{Q}_2\V{y} &= \transpose{\V{y}}\transpose{\M{Q}_2}\M{Q}_2\V{y} \quad \text{($\M{Q}_2$ is symmetric and idempotent)} \\
    &= \norm{\M{Q}_2\V{y}}^2 \\
    &= \norm{\left(\M{I}_{n \times n} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y}}^2 \\
    &= \norm{\V{y} - \V{1}_n\bar{Y}}^2 \\
    &= \sum_{i=1}^n (Y_i - \bar{Y})^2,
    \end{align*}
\item and
    \begin{align*}
    \phi_2 &= \frac{1}{2\sigma^2}\transpose{\V{\mu}}\M{Q}_2\V{\mu} \\
    &= \frac{1}{2\sigma^2} \transpose{(\mu \V{1}_n)}\M{Q}_2(\mu \V{1}_n) \\
    &= \frac{1}{2\sigma^2} \cdot \mu^2 \cdot \transpose{\V{1}_n} \left(\M{I}_{n \times n} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right) \V{1} \\
    &= \frac{1}{2\sigma^2} \cdot \mu^2 \cdot \left(\transpose{\V{1}_n}\M{I}_{n \times n}\V{1} - \frac{1}{n}\transpose{\V{1}_n}\V{1}_n\transpose{\V{1}_n}\V{1}\right) \\
    &= \frac{1}{2\sigma^2} \cdot \mu^2 \cdot \left(n - \frac{n^2}{n}\right) = 0.
    \end{align*}
\end{itemize}
\end{small}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{The variance of $\hat{S}^2$ for samples from normal distribution: proof (iv)}
\begin{itemize}
\item Thus, by Cochran's theorem,
    \begin{equation*}
    n\bar{Y}^2 \sim \sigma^2\chi^2_1\left(\frac{n\mu^2}{2\sigma^2}\right)
    \end{equation*}
\item and
    \begin{equation*}
    \sum_{i=1}^n(Y_i - \bar{Y})^2 \sim \sigma^2\chi^2_{n-1}
    \end{equation*}
\item are independent random variables.
\item As a consequence, the sample mean, $\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$, and the (corrected) sample variance,
    \begin{equation*}
    \hat{S}^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2
    \end{equation*}
    are independent and the random variable
    \begin{equation*}
    \frac{(n-1)\hat{S}^2}{\sigma^2} = \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{\sigma^2} \eqdef U
    \end{equation*}
    has the chi-squared distribution, $\chi^2_{\nu}$, with parameter $\nu = n - 1$.
\end{itemize}
\end{frame}
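
A quick simulation makes this conclusion tangible: for repeated normal samples (mu, sigma, n, and the seed below are assumed values), $(n-1)\hat{S}^2/\sigma^2$ has approximately the $\chi^2_{n-1}$ moments and is essentially uncorrelated with the sample mean.

In [ ]:
import numpy as np

rng = np.random.default_rng(4)

mu, sigma, n = 0.5, 1.5, 8          # assumed parameters for the simulation
samples = rng.normal(mu, sigma, size=(200_000, n))

ybar = samples.mean(axis=1)
u = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

print(u.mean(), u.var())           # roughly n - 1 = 7 and 2 * (n - 1) = 14
print(np.corrcoef(ybar, u)[0, 1])  # roughly 0, consistent with independence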

In [ ]:
\begin{frame}
\frametitle{Applying Cochran's theorem to partitioned variances in linear regression (i)}
Let $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ be the sample mean. Note that the partitioned variances can also be written as follows:
\begin{itemize}
\item \defn{Total Sum of Squares}:
    \begin{equation*}
    \text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \transpose{(\V{y} - \bar{\V{y}})}(\V{y} - \bar{\V{y}}) = \transpose{\V{y}}\left(\M{I}_{n \times n} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y};
    \end{equation*}
\item \defn{Estimated Sum of Squares}:
    \begin{equation*}
    \text{ESS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \transpose{(\hat{\V{y}} - \bar{\V{y}})}(\hat{\V{y}} - \bar{\V{y}}) = \transpose{\V{y}}\left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y};
    \end{equation*}
\item \defn{Residual Sum of Squares}:
    \begin{equation*}
    \text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \transpose{(\V{y} - \hat{\V{y}})}(\V{y} - \hat{\V{y}}) = \transpose{\V{y}}\M{M}\V{y} = \transpose{\V{y}}\left(\M{I}_{n \times n} - \M{H}\right)\V{y}.
    \end{equation*}
\end{itemize}
\end{frame}
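
For the running example these quadratic forms are easy to evaluate numerically. The sketch below also reproduces the Multiple R-squared and F-statistic reported by R earlier, taking the F statistic as the ratio of the ESS and RSS mean squares (one standard convention, consistent with the output shown above).

In [ ]:
import numpy as np

X = np.column_stack([np.ones(3), [1., 3., 4.]])
y = np.array([2., 1., 2.])
n, p_star = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n)) / n           # the matrix (1/n) 1 1^T

tss = y @ (np.eye(n) - J) @ y     # 2/3
ess = y @ (H - J) @ y             # 1/42
rss = y @ (np.eye(n) - H) @ y     # 9/14

print(tss, ess + rss)             # the partitioning TSS = ESS + RSS
print(ess / tss)                  # R^2 ≈ 0.03571
print((ess / (p_star - 1)) / (rss / (n - p_star)))  # F ≈ 0.03704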
\begin{frame} \frametitle{Applying Cochran's theorem to partitioned variances in linear regression (ii)} \begin{itemize} \item Recall our variance partitioning result, \begin{equation*} \text{TSS} = \text{ESS} + \text{RSS}. \end{equation*} \item It can be written as the following sum of matrix \nb{quadratic forms}: \begin{equation*} \transpose{\V{y}}\left(\M{I}_{n \times n} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y} = \transpose{\V{y}}\left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y} + \transpose{\V{y}}\left(\M{I}_{n \times n} - \M{H}\right)\V{y}. \end{equation*} \item Hence \begin{equation*} \transpose{\V{y}}\left(\M{I}_{n \times n}\right)\V{y} = \transpose{\V{y}}\left(\frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y} + \transpose{\V{y}}\left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)\V{y} + \transpose{\V{y}}\left(\M{I}_{n \times n} - \M{H}\right)\V{y} \end{equation*} \item and the matrix decomposition \begin{equation*} \M{I}_{n \times n} = \frac{1}{n}\V{1}_n\transpose{\V{1}_n} + \left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right) + \left(\M{I}_{n \times n} - \M{H}\right). \end{equation*} \end{itemize} \end{frame}

In [ ]:
\begin{frame}
\frametitle{Applying Cochran's theorem to partitioned variances in linear regression (iii)}
\begin{itemize}
\item In this matrix decomposition,
    \begin{equation*}
    \M{I}_{n \times n} = \frac{1}{n}\V{1}_n\transpose{\V{1}_n} + \left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right) + \left(\M{I}_{n \times n} - \M{H}\right),
    \end{equation*}
\item The rank of $\frac{1}{n}\V{1}_n\transpose{\V{1}_n}$ is clearly 1 (there is only one linearly independent column, all the columns are equal).
\item Since (as is easy to check) $\left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right)$ is idempotent (in fact, it is a projection matrix: it's both symmetric and idempotent), its rank is given by its trace:
    \begin{small}
    \begin{align*}
    \rank\left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right) &=
    \tr\left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right) \\
    &= \tr\left(\M{H}\right) - \frac{1}{n}\tr\left(\V{1}_n\transpose{\V{1}_n}\right) \quad \text{(trace is linear)} \\
    &= \rank\left(\M{H}\right) - \frac{1}{n} \cdot n \quad \text{$\M{H}$ is idempotent} \\
    &= p^* - 1.
    \end{align*}
    \end{small}
\item Since (as we know) $\left(\M{I}_{n \times n} - \M{H}\right)$ is idempotent (it is also, in fact, a projection matrix), its rank is given by its trace:
    \begin{small}
    \begin{align*}
    \rank\left(\M{I}_{n \times n} - \M{H}\right) &=
    \tr\left(\M{I}_{n \times n} - \M{H}\right) \\
    &= \tr\left(\M{I}_{n \times n}\right) - \tr\left(\M{H}\right) \quad \text{trace is linear} \\
    &= n - \rank\left(\M{H}\right) \quad \text{$\M{H}$ is idempotent} \\
    &= n - p^*.
    \end{align*}
    \end{small}
\end{itemize}
\end{frame}
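
These ranks can be confirmed numerically for the running example; a short sketch using np.linalg.matrix_rank:

In [ ]:
import numpy as np

X = np.column_stack([np.ones(3), [1., 3., 4.]])
n, p_star = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n)) / n

ranks = [np.linalg.matrix_rank(J),
         np.linalg.matrix_rank(H - J),
         np.linalg.matrix_rank(np.eye(n) - H)]
print(ranks, sum(ranks))  # [1, p* - 1, n - p*] = [1, 1, 1], summing to n = 3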

In [ ]:
\begin{frame}
\frametitle{Applying Cochran's theorem to partitioned variances in linear regression (iv)}
\begin{itemize}
\item So the ranks of the matrices in our decomposition, $1$, $p^* - 1$ and $n - p^*$, do add up to $n$.
\item Can we apply Cochran's theorem?
\item Only if we \nb{assume} that $\V{y} \sim \Normal{\V{\mu}}{\sigma_y^2\M{I}_{n \times n}}$ for some $\V{\mu} \in \R^n$.
\item If we were to \nb{assume}, as a \nb{null hypothesis}, that $\V{\beta} = \V{0}$, then
    \begin{equation*}
    \V{y} = \M{X}\V{\beta} + \V{\epsilon} = \V{\epsilon},
    \end{equation*}
    furthermore, \nb{assuming} (as in the CLRM) that $\V{\epsilon} \sim \Normal{\V{0}}{\sigma^2\M{I}_{n \times n}}$, we do get $\V{y} \sim \Normal{\V{0}_n}{\sigma^2\M{I}_{n \times n}}$. Modulo these assumptions, we are justified in applying Cochran's theorem.
\item If we do make this assumption, since $\V{\mu} = \V{0}_n$, all $\phi_i$ in Cochran's theorem are 0, i.e. all the chi-squared distributions are central.
\item This null hypothesis is basically saying that our linear model adds no value whatsoever.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Applying Cochran's theorem to partitioned variances in linear regression (v)}
\begin{itemize}
\item If Cochran's theorem is indeed applicable, then:
    \begin{itemize}
    \item $\text{ESS} = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 \sim \sigma^2 \chi^2_{p^* - 1}$,
    \item $\text{RSS} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 \sim \sigma^2 \chi^2_{n - p^*}$,
    \end{itemize}
\item moreover, the two random variables are independent.
\item Since the sum of two independent central chi-squared random variables with parameters $\nu_1$ and $\nu_2$ is again a central chi-squared random variable with parameter $\nu_1 + \nu_2$, and
    \begin{equation*}
    \text{TSS} = \text{ESS} + \text{RSS},
    \end{equation*}
    we have
    \begin{equation*}
    \text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 \sim \sigma^2 \chi^2_{n - 1}.
    \end{equation*}
\end{itemize}
\end{frame}
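
Under the null hypothesis $\V{\beta} = \V{0}$ and normal errors, these distributional claims can be checked by simulation. The sketch below uses the running design matrix with an assumed $\sigma$ and seed.

In [ ]:
import numpy as np

rng = np.random.default_rng(5)

X = np.column_stack([np.ones(3), [1., 3., 4.]])
n, p_star = X.shape
sigma = 1.0                               # assumed noise standard deviation

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n)) / n

ess = np.empty(100_000)
rss = np.empty(100_000)
for k in range(ess.size):
    y = rng.normal(0.0, sigma, size=n)    # beta = 0, so y is pure noise
    ess[k] = y @ (H - J) @ y
    rss[k] = y @ (np.eye(n) - H) @ y

# Compare with the chi-squared means (nu, scaled by sigma^2)
print(ess.mean(), rss.mean())       # roughly p* - 1 = 1 and n - p* = 1
print(np.corrcoef(ess, rss)[0, 1])  # roughly 0, consistent with independence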

In [ ]:
\begin{frame}
\frametitle{A cautionary quote on assuming normality}
\begin{columns}
    \begin{column}{0.3\textwidth}
    \begin{figure}
    \centering
    \includegraphics[width=1.\textwidth]{gabriel-lippmann.pdf}
    \end{figure}
    \begin{small}
    (Jonas Ferdinand) Gabriel Lippmann (1845--1921)
    \end{small}
    \end{column}
    \begin{column}{0.7\textwidth}
    As quoted in D'Arcy Thompson's \atitle{On Growth and Form} Volume I (p.~121):
    \begin{quote}
    Les exp{\'e}rimentateurs s'imaginent que [l'approximation normale] c'est un th{\'e}or{\`e}me de math{\'e}matique, et les math{\'e}maticiens d'{\'e}tre un fait exp{\'e}rimental!
    \end{quote}
    \begin{quote}
    Everybody believes in the [normal approximation], the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact!
    \end{quote}
    \end{column}
\end{columns}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Degrees of freedom (i)}
Recall that we were able to obtain a matrix decomposition
\begin{equation*}
 \frac{1}{n} \V{1}_n\transpose{\V{1}_n} + (\M{I}_{n \times n} - \frac{1}{n} \V{1}_n\transpose{\V{1}_n}) = \M{I}_{n \times n}
\end{equation*}
where the two matrices have rank 1 and \textcolor{red}{$n - 1$}, respectively, when applying Cochran's theorem to the sample variance $\hat{S}^2$ for samples from a normal distribution. The second matrix, corresponding to the $\sum_{i=1}^n (Y_i - \bar{Y})^2$ term, has rank \textcolor{red}{$n - 1$}. The sample variance has the form
\begin{equation*}
\hat{S}^2 = \frac{1}{\textcolor{red}{n - 1}} \sum_{i=1}^n (Y_i - \bar{Y})^2.
\end{equation*}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Degrees of freedom (ii)}
Recall that we were able to obtain a matrix decomposition
\begin{equation*}
\frac{1}{n}\V{1}_n\transpose{\V{1}_n} + \left(\M{H} - \frac{1}{n}\V{1}_n\transpose{\V{1}_n}\right) + \left(\M{I}_{n \times n} - \M{H}\right) = \M{I}_{n \times n}
\end{equation*}
when applying Cochran's theorem to partitioned variances in linear regression. We ended up finding that
\begin{align*}
\text{TSS} &= \sum_{i=1}^n (y_i - \bar{y})^2 \sim \sigma^2 \chi^2_{\textcolor{red}{n - 1}}, \\
\text{ESS} &= \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 \sim \sigma^2 \chi^2_{\textcolor{red}{p^* - 1}}, \\
\text{RSS} &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \sim \sigma^2 \chi^2_{\textcolor{red}{n - p^*}}, \\
\end{align*}
where the quantities \textcolor{red}{$n - 1$}, \textcolor{red}{$p^* - 1$} and \textcolor{red}{$n - p^*$} again correspond to the ranks of the corresponding matrices in the matrix decomposition.

We also found that the unbiased estimator for $\sigma^2$ based on the RSS is given by
\begin{equation*}
\hat{\sigma}^2 \defeq \frac{1}{\textcolor{red}{n-p^*}} \sum_{i=1}^n \hat{\epsilon}_i^2 = \frac{1}{\textcolor{red}{n-p^*}}\transpose{\hat{\V{\epsilon}}}\hat{\V{\epsilon}}.
\end{equation*}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Degrees of freedom (iii)}
\begin{itemize}
\item These quantities are referred to as \defn{degrees of freedom}.
\item As is now obvious, they correspond to matrix \nb{ranks}, i.e. \nb{dimensions} of linear spaces.
\item In our estimators for scalar, single-dimensional quantities we produce \nb{averages over degrees of freedom} in multidimensional spaces. Hence the factors such as $\sfrac{1}{(n-1)}$ in the estimator for sample variance, $\hat{S}^2$.
\end{itemize}
\end{frame}

In [ ]:
\begin{frame}
\frametitle{Degrees of freedom (iv)}
\begin{itemize}
\item To get an intuition for what is going on, consider decomposing an $n$-dimensional random vector into the sum of a sample mean and a vector of residuals:
    \begin{equation*}
    \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} =
    \bar{X} \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} +
    \begin{pmatrix} X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{pmatrix}.
    \end{equation*}
\item The first vector on the right-hand side is \nb{constrained} to be a multiple of the vector of 1's. The only quantity that is \nb{free to vary} is the scalar $\bar{X}$. Thus it has \nb{one degree of freedom}.
\item The second vector is also \nb{constrained} --- by the relation $\sum_{i=1}^n (X_i - \bar{X}) = 0$. If we fix any $n - 1$ components, say $X_1, X_2, \ldots, X_{n-1}$, the remaining component is determined by this relation and we are not \nb{free} to set it at will. Therefore the second vector has \nb{$n - 1$ degrees of freedom}.
\item The first vector on the right-hand side is a \nb{projection} of the data vector onto the subspace spanned by the vector of 1's. This subspace has the \nb{dimension 1}, hence \nb{1 degree of freedom}.
\item The second vector on the right-hand side is a \nb{projection} onto the \nb{$(n - 1)$-dimensional orthogonal complement} of that subspace. Hence \nb{$n - 1$ degrees of freedom}.
\end{itemize}
\end{frame}
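
A small numerical check of this decomposition (using an arbitrary illustrative data vector, not taken from the example): the residual part sums to zero and is orthogonal to the mean part.

In [ ]:
import numpy as np

x = np.array([2.0, -1.0, 4.0, 3.0])   # an arbitrary data vector for illustration
xbar = x.mean()

mean_part = xbar * np.ones_like(x)
residual_part = x - mean_part

print(np.allclose(mean_part + residual_part, x))  # the decomposition holds
print(residual_part.sum())                        # 0: the constraint removing one degree of freedom
print(mean_part @ residual_part)                  # 0: the two parts are orthogonal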

In [ ]:
\begin{frame}
\frametitle{Degrees of freedom (v)}
\begin{itemize}
\item We have now understood the degrees of freedom as \nb{dimensions} of \nb{orthogonal vector subspaces} onto which the data is being \nb{projected}.
\item We have already seen how the \nb{(central) chi-squared} distribution with parameter $\nu$ and the \nb{noncentral chi-squared} distribution with parameters $\nu$ and $\phi$ arise naturally in the context of estimated variances.
\item It is therefore not surprising that the parameter $\nu$ of both the central and noncentral chi-squared distribution is also referred to as \defn{degrees of freedom}.
\end{itemize}
\end{frame}

Eigenvalues and eigenvectors

Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be a square matrix, $n \in \mathbb{N}$.

Almost all vectors, when multiplied on the left by $\mathbf{A}$, change direction. For example, \begin{equation*} \underbrace{\begin{pmatrix} 2 & -4 \\ -1 & -1 \end{pmatrix}}_{\mathbf{A}} \begin{pmatrix} 2 \\ 3 \end{pmatrix} = \begin{pmatrix} -8 \\ -5 \end{pmatrix}. \end{equation*}

But there are a few vectors whose direction is preserved by $\mathbf{A}$, even though they may be rescaled. Under "preserved" we, somewhat paradoxically, include "reversed". The point is that the vector remains along the same line: \begin{equation*} \begin{pmatrix} 2 & -4 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} -4 \\ 1 \end{pmatrix} = \begin{pmatrix} -12 \\ 3 \end{pmatrix}, \quad \begin{pmatrix} 2 & -4 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} -2 \\ -2 \end{pmatrix}. \end{equation*}

These special vectors are known as the eigenvectors of $\mathbf{A}$, and the corresponding scalings, $3$ and $-2$, respectively, in \begin{equation*} \begin{pmatrix} 2 & -4 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} -4 \\ 1 \end{pmatrix} = 3 \cdot \begin{pmatrix} -4 \\ 1 \end{pmatrix}, \quad \begin{pmatrix} 2 & -4 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = -2 \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \end{equation*} their respective eigenvalues.

Thus the eigenvectors of $\mathbf{A}$ are those special vectors whose direction is preserved (or reversed) when they are multiplied by $\mathbf{A}$ on the left; the corresponding eigenvalues are the corresponding scalings.
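
NumPy recovers these eigenvalues and eigenvectors directly; note that np.linalg.eig returns eigenvectors scaled to unit length, so they are multiples of the vectors written above.

In [ ]:
import numpy as np

A = np.array([[2., -4.],
              [-1., -1.]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # 3 and -2 (possibly in a different order)
print(eigenvectors)  # columns are unit eigenvectors, proportional to (-4, 1) and (1, 1)

# Check A v = lambda v for each eigenpair
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))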

\begin{frame} \frametitle{Eigenvalues and eigenvectors} \begin{center} \end{center} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-13,ymin=-3,xmax=2,ymax=4] \tkzAxeXY \tkzGrid \tkzDefPoints{0/0/O, -4/1/A, -12/3/A1, 1/1/B, -2/-2/B1} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,A1) \tkzDrawPoints[size=10,color=red](O, A1) \tkzLabelPoint[above,color=green](A1){$\mathbf{A} \begin{pmatrix} -4 \\ 1 \end{pmatrix} = 3 \cdot \begin{pmatrix} -4 \\ 1 \end{pmatrix}$} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,B1) \tkzDrawPoints[size=10,color=red](O, B1) \tkzLabelPoint[below,color=green](B1){$\mathbf{A} \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} = -2 \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,A) \tkzDrawPoints[size=10,color=red](O, A) \tkzLabelPoint[above,color=blue](A){$\begin{pmatrix} -4 \\ 1 \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,B) \tkzDrawPoints[size=10,color=red](O, B) \tkzLabelPoint[above,color=blue](B){$\begin{pmatrix} 1 \\ 1 \end{pmatrix}$} \end{tikzpicture} \end{frame}

\begin{frame}
\frametitle{Unit eigenvectors}
\begin{itemize}
\item Since, for any matrix $\M{A}$, $\M{A} \V{0} = \lambda \cdot \V{0}$ for any scalar $\lambda$, we exclude $\V{0}$ when talking about eigenvectors; $\V{0}$ is not counted among the eigenvectors.
\item If $\V{v}$ is an eigenvector of $\M{A}$ and $\lambda$ the corresponding eigenvalue, then
    \begin{equation*}
    \M{A} (\alpha \V{v}) = \alpha (\M{A} \V{v}) = \alpha (\lambda \V{v}) = \lambda (\alpha \V{v}).
    \end{equation*}
\item So, for any nonzero scalar $\alpha$, $\alpha \V{v}$ is also an eigenvector with the same eigenvalue $\lambda$.
\item In particular, we can pick $\alpha = \frac{1}{\norm{\V{v}}}$.
\item Thus we can choose to always list the \defn{unit eigenvectors}, i.e. eigenvectors of unit length (see the NumPy check after this frame).
\end{itemize}
\end{frame}
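A small check of the rescaling argument above: normalizing the eigenvector $(-4, 1)$ does not change its eigenvalue, and `np.linalg.eig` in fact already returns eigenvectors normalized to unit length.

In [ ]:
import numpy as np

A = np.array([[2., -4.],
              [-1., -1.]])

# Rescaling an eigenvector by a nonzero scalar gives an eigenvector with the same eigenvalue,
# so we may normalize to unit length
v = np.array([-4., 1.])
unit_v = v / np.linalg.norm(v)                # (-4/sqrt(17), 1/sqrt(17))
print(np.allclose(A @ unit_v, 3. * unit_v))   # still an eigenvector with eigenvalue 3

# np.linalg.eig returns unit-length eigenvectors as the columns of the second array
_, eigenvectors = np.linalg.eig(A)
print(np.linalg.norm(eigenvectors, axis=0))   # [1. 1.]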

\begin{frame} \frametitle{Thus, in our example...} \begin{center} We could, for consistency, consider unit eigenvectors instead of the ones that we have originally written down: \end{center} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-13,ymin=-3,xmax=2,ymax=4] \tkzAxeXY \tkzGrid \tkzDefPoints{0/0/O, -4/1/A, -12/3/A1, 1/1/B, -2/-2/B1} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,A1) \tkzDrawPoints[size=10,color=red](O, A1) \tkzLabelPoint[right,color=green](A1){$\quad \mathbf{A} \begin{pmatrix} -4 \\ 1 \end{pmatrix} = 3 \cdot \begin{pmatrix} -4 \\ 1 \end{pmatrix}$} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,B1) \tkzDrawPoints[size=10,color=red](O, B1) \tkzLabelPoint[below,color=green](B1){$\M{A} \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} = -2 \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,A) \tkzDrawPoints[size=10,color=red](O, A) \tkzLabelPoint[above,color=blue](A){$\begin{pmatrix} -4 \\ 1 \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,B) \tkzDrawPoints[size=10,color=red](O, B) \tkzLabelPoint[above,color=blue](B){$\begin{pmatrix} 1 \\ 1 \end{pmatrix}$} \end{tikzpicture} \end{frame}

\begin{frame} \frametitle{Thus, in our example...} \begin{center} We could, for consistency, consider unit eigenvectors instead of the ones that we have originally written down: \end{center} \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] \tkzInit[xmin=-13,ymin=-3,xmax=2,ymax=4] \tkzAxeXY \tkzGrid \tkzDefPoints{0/0/O, -0.97014250014533188/0.24253562503633297/A, -2.9104275004359956/0.72760687510899891/A1, 0.70710678118654746/0.70710678118654746/B, -1.4142135623730949/-1.4142135623730949/B1} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,A1) \tkzDrawPoints[size=10,color=red](O, A1) \tkzLabelPoint[left,color=green](A1){$\M{A} \begin{pmatrix} -4/\sqrt{17} \\ 1/\sqrt{17} \end{pmatrix} = 3 \cdot \begin{pmatrix} -4/\sqrt{17} \\ 1/\sqrt{17} \end{pmatrix}$} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,B1) \tkzDrawPoints[size=10,color=red](O, B1) \tkzLabelPoint[below,color=green](B1){$\M{A} \cdot \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = -2 \cdot \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,A) \tkzDrawPoints[size=10,color=red](O, A) \tkzLabelPoint[above,color=blue](A){$\begin{pmatrix} -4/\sqrt{17} \\ 1/\sqrt{17} \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,B) \tkzDrawPoints[size=10,color=red](O, B) \tkzLabelPoint[above,color=blue](B){$\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$} \end{tikzpicture} \end{frame}

\begin{frame}
\frametitle{Symmetric and positive definite matrices}
A square matrix $\M{A} \in \R^{n \times n}$ is
\begin{itemize}
\item \defn{symmetric} if it is equal to its transpose:
    \begin{equation*}
    \M{A} = \transpose{\M{A}};
    \end{equation*}
\item \defn{positive definite} if, for each nonzero column vector $\V{z} \in \R^{n \times 1}$, the scalar
    \begin{equation*}
    \transpose{\V{z}} \M{A} \V{z} > 0;
    \end{equation*}
\item \defn{positive semidefinite} if, for each nonzero column vector $\V{z} \in \R^{n \times 1}$, the scalar
    \begin{equation*}
    \transpose{\V{z}} \M{A} \V{z} \geq 0.
    \end{equation*}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{Question: The diagonal entries of positive (semi)definite matrices}
Show that, if a matrix is positive (semi)definite, then all its diagonal entries are positive (nonnegative, if \nb{semi}definite).
\end{frame}

\begin{frame}
\frametitle{Solution: The diagonal entries of positive (semi)definite matrices}
For $i = 1, \ldots, n$, take $\V{z} = \V{e}_i$, the column vector with 1 in the $i$th row and all other elements equal to zero. Then
\begin{equation*}
\transpose{\V{z}}\M{A}\V{z} = \transpose{\V{e}_i}\M{A}\V{e}_i = A_{ii}.
\end{equation*}
Thus, for positive definite matrices, $A_{ii} > 0$, and, for positive semidefinite matrices, $A_{ii} \geq 0$. \qed
\end{frame}

\begin{frame}
\frametitle{Eigenvalues and eigenvectors of symmetric positive (semi)definite matrices}
\begin{itemize}
\item The eigenvalues of symmetric positive (semi)definite matrices are real and positive (nonnegative if \nb{semi}definite).
\item The eigenvectors of symmetric positive (semi)definite matrices can be chosen to be mutually orthogonal (perpendicular).
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{An example of a symmetric positive definite matrix}
\begin{itemize}
\item Consider the \nb{symmetric positive definite} matrix
    \begin{equation*}
    \M{A} = \begin{pmatrix} 9 & -4 \\ -4 & 15 \end{pmatrix} \in \R^{2 \times 2}.
    \end{equation*}
\item Its eigenvalues are
    \begin{itemize}
    \item $\lambda_1 = 7$, with the corresponding unit eigenvector $\V{v}_1 = \begin{pmatrix} -2/\sqrt{5} \\ -1/\sqrt{5} \end{pmatrix}$;
    \item $\lambda_2 = 17$, with the corresponding unit eigenvector $\V{v}_2 = \begin{pmatrix} 1/\sqrt{5} \\ -2/\sqrt{5} \end{pmatrix}$.
    \end{itemize}
\item Notice that the eigenvalues are both real and positive:
    \begin{equation*}
    \lambda_1 = 7 > 0, \quad \lambda_2 = 17 > 0.
    \end{equation*}
\item Also notice that the two eigenvectors are mutually orthogonal:
    \begin{equation*}
    \transpose{\V{v}_1} \V{v}_2 = \frac{-2}{\sqrt{5}} \cdot \frac{1}{\sqrt{5}} + \frac{-1}{\sqrt{5}} \cdot \frac{-2}{\sqrt{5}} = 0.
    \end{equation*}
\end{itemize}
\end{frame}

\begin{frame} \frametitle{Let us visualize the eigenvectors and eigenvalues in this example} \begin{center} Notice that the two eigenvectors are orthogonal (perpendicular): \end{center} \begin{center} \begin{tikzpicture}[scale=0.35, every node/.style={scale=0.35}] \tkzInit[xmin=-7,ymin=-16,xmax=8,ymax=1] \tkzAxeXY \tkzGrid \tkzDefPoints{0/0/O, -0.89442719099991586/-0.44721359549995793/A, -6.260990336999411/-3.1304951684997055/A1, 0.44721359549995793/-0.89442719099991586/B, 7.602631123499285/-15.20526224699857/B1} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,A1) \tkzDrawPoints[size=10,color=red](O, A1) \tkzLabelPoint[left,color=green](A1){$\M{A} \begin{pmatrix} -2/\sqrt{5} \\ -1/\sqrt{5} \end{pmatrix} = 7 \cdot \begin{pmatrix} -2/\sqrt{5} \\ -1/\sqrt{5} \end{pmatrix}$} \tkzDrawSegments[color=green, ultra thick, densely dashed](O,B1) \tkzDrawPoints[size=10,color=red](O, B1) \tkzLabelPoint[below,color=green](B1){$\M{A} \cdot \begin{pmatrix} 1/\sqrt{5} \\ -2/\sqrt{5} \end{pmatrix} = 17 \cdot \begin{pmatrix} 1/\sqrt{5} \\ -2/\sqrt{5} \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,A) \tkzDrawPoints[size=10,color=red](O, A) \tkzLabelPoint[above,color=blue](A){$\begin{pmatrix} -2/\sqrt{5} \\ -1/\sqrt{5} \end{pmatrix}$} \tkzDrawSegments[color=blue, ultra thick, densely dashed](O,B) \tkzDrawPoints[size=10,color=red](O, B) \tkzLabelPoint[below,color=blue](B){$\begin{pmatrix} 1/\sqrt{5} \\ -2/\sqrt{5} \end{pmatrix}$} \end{tikzpicture} \end{center} \end{frame}
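We can check the definitions and the diagonal-entry result numerically on the example matrix above. The sketch below evaluates the quadratic form $\mathbf{z}^T \mathbf{A} \mathbf{z}$ on a batch of random nonzero vectors (a spot check, not a proof) and also uses the standard practical test that a symmetric matrix is positive definite if and only if it admits a Cholesky factorization.

In [ ]:
import numpy as np

A = np.array([[9., -4.],
              [-4., 15.]])

# Symmetry: A equals its transpose
print(np.allclose(A, A.T))

# Spot check of positive definiteness: z^T A z > 0 for a batch of random nonzero vectors
np.random.seed(0)
print(all(z @ A @ z > 0. for z in np.random.normal(size=(1000, 2))))

# Taking z = e_i picks out the diagonal entry A_ii, so the diagonal entries must be positive
print(np.diag(A))   # [ 9. 15.]

# A symmetric matrix is positive definite if and only if it has a Cholesky factorization;
# np.linalg.cholesky raises LinAlgError otherwise
print(np.linalg.cholesky(A))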
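Similarly, we can confirm the eigenvalues and eigenvectors quoted in the example. For symmetric matrices `np.linalg.eigh` is the natural choice: it returns real eigenvalues (in ascending order) and unit eigenvectors as the columns of an orthogonal matrix.

In [ ]:
import numpy as np

A = np.array([[9., -4.],
              [-4., 15.]])

eigenvalues, eigenvectors = np.linalg.eigh(A)   # eigh is intended for symmetric (Hermitian) matrices
print(eigenvalues)                              # [ 7. 17.]

# The eigenvalues are real and strictly positive, as expected for a symmetric positive definite matrix
print(np.all(eigenvalues > 0.))

# The unit eigenvectors (the columns of `eigenvectors`) are mutually orthogonal: V^T V = I
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))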

Bibliography

  1. An excellent, deeper introduction to linear algebra can be found in Professor Gilbert Strang's video lectures for the 18.06 Linear Algebra course at MIT: https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/
  2. The supporting textbook for that course is Introduction to Linear Algebra, 5th edition, by Gilbert Strang: https://www.amazon.co.uk/Introduction-Linear-Algebra-Gilbert-Strang/dp/0980232775/
  3. A more recent version of this book updated for data science and deep learning is Linear Algebra and Learning from Data, by Gilbert Strang, published in 2019: https://www.amazon.co.uk/Linear-Algebra-Learning-Gilbert-Strang/dp/0692196382/
  4. Another good text on linear algebra is Linear Algebra, 3rd edition, by John B. Fraleigh and Raymond A. Beauregard: https://www.amazon.co.uk/Linear-Algebra-Third-Fraleigh-1995-01-10/dp/B01MXJ76GM/
  5. We also recommend Schaum's Outline of Linear Algebra, 6th edition, by Seymour Lipschutz and Marc Lipson: https://www.amazon.co.uk/Schaums-Outline-Linear-Algebra-Outlines/dp/1260011445/
  6. Finally, we recommend getting hold of M2N1 — Numerical Analysis lecture notes by Brad Baxter, which, in addition to theory, contain some useful exercises: http://www.cato.tzo.com/brad/M2N1/m2n1_notes.pdf