In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline

Introduction

Linear regression, like any model regression methods, is made of two parts:

  • a regression model = a structure with parameters
  • a regression method to estimate which parameters minimize a residual between the points and the model

When least squares methods are used, the residual is the sum of the squares of the distances between each data point and the corresponding model value.

The linear regression model

Simple linear regression

Let us consider a problem with only one independent (explanatory) variable $x$ and one dependent variable $y$.

With $n \in \mathbb{N}^*$, let $\{(x_i,y_i),\, i=1,\dots,n\}$ be a set of points.


In [2]:
n = 100
x = np.random.normal(1, 0.5, n)
noise = np.random.normal(0, 0.25, n)
y = 0.75*x + 1 + noise

In [3]:
fig, ax = plt.subplots(1, 1, figsize=(6,4))
ax.scatter(x, y)
ax.set_xlim([0,2])
ax.set_ylim([0,3.1])


Out[3]:
(0, 3.1)

Simple linear regression considers the model function

\begin{equation} y = \alpha + \beta x \end{equation}

We can write the relation between $y_i$ and $x_i$ as:

\begin{equation} \forall i, \quad y_i = \alpha + \beta x_i + \varepsilon_i \end{equation}

where the $\varepsilon_i$, called residuals, are what's missing between the model and the data.

(For the sake of demonstration, the points above were generated as a uniform sample of 50 $x$ values in $[0,1]$, and the $y$'s were computed as $y = \frac{3}{4}x+1+\varepsilon$, where $\varepsilon \sim \mathcal{N}(0,0.1)$.)

Now comes the fun part: we are looking for the $(\alpha,\beta)$ pair that provides the best fit between model and data. Best, but in which sense?

General form of linear regression

In the general case, we can write for each dependent variable $y_i, \, i = 1, \dots, n$ a set of $p$-vectors $\mathbf{x}_i$ called regressors.

The regression model above then takes the form

\begin{equation} y_i = \beta_0 1 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,\quad i = 1, \dots, n \end{equation}

which can be more concisely written as:

\begin{equation} y_i = \mathbf{x}_i^T \mathbf{\beta} + \varepsilon_i,\quad i = 1, \dots, n \end{equation}

and in vector form:

\begin{equation} \mathbf{y} = X \mathbf{\beta} + \mathbf{\varepsilon} \end{equation}

You may have noticed that there is no more $\alpha$ in this form: it has been replaced by $\beta_0$, which is the multiplication factor for the constant value $1$. This makes $X$ an $n \times (p+1)$ matrix and $\mathbf{\beta}$ a $(p+1)$ vector. This is more understandable with an explicit display of $X$, $\mathbf{y}$, $\mathbf{\beta}$ and $\mathbf{\epsilon}$:

$$\mathbf{y} = \left( \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n\\ \end{array}\right), \quad X = \left( \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p}\\ 1&x_{21}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&\cdots&x_{np}\\ \end{array}\right), \quad \mathbf{\beta} = \left( \begin{array}{c} \beta_0\\ \beta_1\\ \beta_2\\ \vdots\\ \beta_n\\ \end{array}\right), \quad \mathbf{\epsilon} = \left( \begin{array}{c} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n\\ \end{array}\right) $$

Fitting the model

The best known family of regression methods rely this definition of the 'best fit': the best fit is the one that minimizes the sum of the squares of the residuals $\varepsilon_i$.

In the case of the simple linear regression, that is:

\begin{equation} \textrm{Find} \min_{\alpha,\beta} Q(\alpha,\beta)\quad Q(\alpha,\beta) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \end{equation}

Ordinary / Linear least squares (OLS)

In the general case we are looking for $\hat{\beta}$ that minimizes:

\begin{equation} S(b) = (y - Xb)^T(y-Xb) \end{equation}

We are then looking for $\hat{\beta}$ such that

\begin{equation} 0 = \frac{dS}{db}(\hat{\beta}) = \frac{d}{db}\left.\left(y^T y - b^TX^Ty-y^TXb + b^TX^TXb\right)\right|_{b=\hat{\beta}} \end{equation}

By matrix calculus:

$$ \begin{array}{rcl} \frac{d}{db}(-b^TX^Ty) &=& -\frac{d}{db}(b^TX^Ty) = -X^Ty\\ \frac{d}{db}(-y^TXb) &=& -\frac{d}{db}(y^TXb) = -(y^TX)^T = -X^Ty\\ \frac{d}{db}(b^TX^TXb) &=& 2X^TXb\\ \end{array} $$

So that we get

$$-2X^Ty + 2X^TX\hat{\beta} = 0$$
Assumption: $X$ has full column rank
That is, features should not be linearly dependent. Under this assumption $X^TX$ is invertible, and we can write the least squares estimator for $\beta$: \begin{equation} \hat{\beta} = (X^TX)^{-1}X^Ty \end{equation}

For the simple linear regression, we get:

$$\hat{\beta} = \frac{\textrm{Cov}(x,y)}{\textrm{Var}(x)},\quad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$

In [4]:
def fit(x,y, with_constant=True):
    beta = np.cov(x,y)[0][1] / np.var(x)
    if with_constant:
        alpha = np.mean(y) - beta * np.mean(x)
    else:
        alpha = 0
    r = np.cov(x,y)[0][1] / (np.std(x)*np.std(y))
    mse = np.sum((y-alpha-beta*x)**2)/n
    
    return([beta,alpha,r,mse])

In [5]:
beta,alpha,r,mse = fit(x,y)
print('alpha: {:.2f}, beta: {:.2f}'.format(alpha,beta))
print('r squared: {:.2f}'.format(r*r))
print('MSE: {:.2f}'.format(mse))


alpha: 1.05, beta: 0.69
r squared: 0.61
MSE: 0.06

In [6]:
def fit_plot(x,y,noise,beta,alpha):
    fig, ax = plt.subplots(1, 3, figsize=(18,4))
    ax[0].scatter(x, y)
    x_ = np.linspace(0,2,5)
    ax[0].plot(x_, alpha + beta*x_, color='orange', linewidth=2)
    ax[0].set_xlim([0,2])
    ax[0].set_ylim([0,3.1])
    mse = np.sum((y-alpha-beta*x)**2)/n
    ax[0].set_title('MSE: {:.2f}'.format(mse), fontsize=14)

    ax[1].hist(noise, alpha=0.5, bins=np.arange(-0.7,0.7,0.1));
    ax[1].hist(y - alpha - beta*x, alpha=0.5, bins=np.arange(-0.7,0.7,0.1))
    ax[1].legend(['original noise', 'residual']);

    stats.probplot(y - alpha - beta*x, dist="norm", plot=ax[2]);

In [7]:
fit_plot(x,y,noise,beta,alpha)


Expected value

$$ \begin{array}{rcl} \mathbf{E}[\hat{\beta}] &=& \mathbf{E}\left[(X^TX)^{-1}X^T(X\beta + \varepsilon)\right]\\ &=& \beta + \mathbf{E}\left[(X^TX)^{-1}X^T \varepsilon\right]\\ &=& \beta + \mathbf{E}\left[\mathbf{E}\left[(X^TX)^{-1}X^T \varepsilon|X\right]\right] \quad \textit{(law of total expectation)}\\ &=& \beta + \mathbf{E}\left[(X^TX)^{-1}X^T \mathbf{E}\left[\varepsilon|X\right]\right]\\ \end{array} $$
Assumption: strict exogeneity: $\mathbf{E}\left[\varepsilon|X\right] = 0$
Under this assumption, $\mathbf{E}[\hat{\beta}] = \beta$, the ordinary least squares estimator is unbiased.

Variance

$$ \begin{array}{rcl} \textrm{Var}(\hat{\beta}) &=& \mathbf{E}\left[\left(\hat{\beta} - \mathbf{E}(\hat{\beta})\right)^2\right]\\ &=& \mathbf{E}\left[(\hat{\beta} - \beta)^2\right]\\ &=& \mathbf{E}\left[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T\right]\\ &=& \mathbf{E}\left[((X^TX)^{-1}X^T\varepsilon)((X^TX)^{-1}X^T\varepsilon)^T\right]\\ &=& \mathbf{E}\left[(X^TX)^{-1}X^T\varepsilon\varepsilon^TX(X^TX)^{-1}\right]\\ &=& (X^TX)^{-1}X^T\mathbf{E}\left[\varepsilon\varepsilon^T\right]X(X^TX)^{-1}\\ \end{array} $$
Assumption: homoskedasticity: $\mathbf{E}\left[\varepsilon_i^2|X\right] = \sigma^2$

Then we have:

$$ \begin{array}{rcl} \textrm{Var}(\hat{\beta}) &=& \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1}\\ &=& \sigma^2 (X^TX)^{-1}\\ \end{array} $$

For simple linear regression, this gives ${\displaystyle\textrm{Var}(\beta) = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \overline{x})}}$

Confidence intervals for coefficients

Here we will go through the case of the simple linear regression. There are two situations: either we can make the assumption that the residuals are normally distributed, or that the number of points in the dataset is "large enough", so that the law of large numbers and the central limit theorem become applicable.

Under normality assumption

Under a normality assumption for $\varepsilon$, $\hat{\beta}$ is also normally distributed.

Then,

$$Z = \frac{\hat{\beta}-\beta}{\frac{1}{\sqrt{n}}\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \overline{x})^2}}} \sim \mathcal{N}(0,1)$$

Let $S = \frac{n}{\sigma} \sum_{i=1}^n \hat{\varepsilon_i}^2$. $S$ has a chi-squared distribution with $n-2$ degrees of freedom.

Then $T = \frac{Z}{\sqrt{\frac{S}{n-2}}}$ has a Student's t-distribution with $n-2$ degrees of freedom.

$T$ will be easier to use if we write it as:

$$T = \frac{\hat{\beta} - \beta}{s_{\hat{\beta}}}$$

where

$$s_{\hat{\beta}} = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^n \hat{\varepsilon_i}^2}{\sum_{i=1}^n (x_i - \overline{x})^2}}$$

With this setting, the confidence interval for $\beta$ with confidence level $\alpha$ is

$$\left[\hat{\beta} - t_{1-\alpha/2}^{n-2} s_{\hat{\beta}}, \hat{\beta} + t_{1-\alpha/2}^{n-2} s_{\hat{\beta}}\right]$$

TODO: confidence interval for $\alpha$

Under asymptotic assumption

TODO

Using regression as a machine learning predictor


In [8]:
n = 200
x = np.random.normal(5, 2, n)
noise = np.random.normal(0, 0.25, n)
y = 0.75*x + 1 + noise

In [9]:
fig, ax = plt.subplots(1, 1, figsize=(6,4))
ax.scatter(x, y)
ax.set_xlim([0,10])
ax.set_ylim([0,10])


Out[9]:
(0, 10)

In [10]:
x_train = x[-40:]
x_test = x[:-40]
y_train = y[-40:]
y_test = y[:-40]

We can fit a linear regression model on the training data:


In [11]:
beta,alpha,r,mse = fit(x,y)

And use it to predict $y$ on test data:


In [12]:
y_pred = alpha + beta*x_test

In [13]:
def pred_plot(y_pred,y_test):
    fig, ax = plt.subplots(1, 2, figsize=(12,4))
    ax[0].scatter(y_test, y_pred)
    ax[0].plot([0,10],[0,10],color='g')
    ax[0].set_xlim([0,10])
    mse_pred = np.sum((y_test-y_pred)**2)/n
    ax[0].set_title('MSE: {:.2f}'.format(mse_pred), fontsize=14)

    ax[1].hist(y_test-y_pred);

In [14]:
pred_plot(y_pred,y_test)


What could possibly go wrong? Assumptions for OLS

$X$ has full column rank

If we break this assumption, we can't even compute the OLS estimator.

strict exogeneity: $\mathbf{E}\left[\varepsilon|X\right] = 0$

If we break this one, then the OLS estimator gets biased.


In [15]:
n = 100
x = np.random.normal(1, 0.5, n)
noise = np.random.normal(0.25, 0.25, n)
y = 0.75*x + noise

In [16]:
beta,alpha,r,mse = fit(x,y, with_constant=False)

In [17]:
fit_plot(x,y,noise,beta,alpha)


What would it mean for a prediction model?


In [18]:
n = 200
x = np.random.normal(5, 2, n)
noise = np.random.normal(1, 0.25, n)
y = 0.75*x + noise

In [19]:
x_train = x[-40:]
x_test = x[:-40]
y_train = y[-40:]
y_test = y[:-40]

In [20]:
beta,alpha,r,mse = fit(x,y, with_constant=False)

In [21]:
y_pred = beta*x_test

In [22]:
pred_plot(y_pred,y_test)


But if we have a constant in our model, it is unbiased:


In [23]:
n = 100
x = np.random.normal(1, 0.5, n)
noise = np.random.normal(0.25, 0.25, n)
y = 0.75*x + noise

In [24]:
beta,alpha,r,mse = fit(x,y, with_constant=True)

In [25]:
fit_plot(x,y,noise,beta,alpha)



In [26]:
n = 200
x = np.random.normal(5, 2, n)
noise = np.random.normal(1, 0.25, n)
y = 0.75*x + noise

In [27]:
x_train = x[-40:]
x_test = x[:-40]
y_train = y[-40:]
y_test = y[:-40]

In [28]:
beta,alpha,r,mse = fit(x,y, with_constant=True)

In [29]:
y_pred = alpha+beta*x_test

In [30]:
pred_plot(y_pred, y_test)


What happened?

Homoscedasticity: $\mathbf{E}\left[\varepsilon_i^2|X\right] = \sigma^2$

What happens if we break this assumption?


In [31]:
n = 100
x = np.linspace(0, 2, n)
sigma2 = 0.1*x**2
noise = np.random.normal(0, np.sqrt(sigma2), n)
y = 0.75*x + 1 + noise

In [32]:
beta,alpha,r,mse = fit(x,y)

In [33]:
fit_plot(x,y,noise,beta,alpha)


When this assumption is violated, we will turn to a weighted least squares estimator.

No autocorrelation: $\mathbf{E}\left[\varepsilon_i \varepsilon_j|X\right] = 0 \textrm{ if } i \neq j$

What happens if we break this assumption?


In [34]:
n = 100
x = np.linspace(0, 2, n)
sigma2 = 0.1*((6*x).astype(int) % 4)
noise = np.random.normal(0, np.sqrt(sigma2), n)
y = 0.75*x + 1 + noise

In [35]:
beta,alpha,r,mse = fit(x,y)

In [36]:
fit_plot(x,y,noise,beta,alpha)


When this assumption is violated, we will turn to a generalized least squares estimator.

  • (optional) normality assumption on noise

Weighted least squares

Generalized least squares

Regularization

Tikhonov

Lasso

Other regression methods

Least absolute deviations

Median slope regression: Theil-Sen estimator

Total least squares: Deming regression

Generalized linear models


In [ ]: