In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
Linear regression, like any regression method, is made of two parts: a model function, and a measure of how well the model fits the data.
When least squares methods are used, that measure is the sum of the squares of the residuals, i.e. of the distances between each data point and the corresponding model value.
In [2]:
n = 100
x = np.random.normal(1, 0.5, n)
noise = np.random.normal(0, 0.25, n)
y = 0.75*x + 1 + noise
In [3]:
fig, ax = plt.subplots(1, 1, figsize=(6,4))
ax.scatter(x, y)
ax.set_xlim([0,2])
ax.set_ylim([0,3.1])
Out[3]:
Simple linear regression considers the model function
\begin{equation} y = \alpha + \beta x \end{equation}We can write the relation between $y_i$ and $x_i$ as:
\begin{equation} \forall i, \quad y_i = \alpha + \beta x_i + \varepsilon_i \end{equation}where the $\varepsilon_i$, called residuals, are what's missing between the model and the data.
(For the sake of demonstration, the points above were generated as a sample of 100 $x$ values drawn from $\mathcal{N}(1, 0.5)$, and the $y$'s were computed as $y = \frac{3}{4}x+1+\varepsilon$, where $\varepsilon \sim \mathcal{N}(0,0.25)$; the second parameter of $\mathcal{N}$ is the standard deviation here.)
Now comes the fun part: we are looking for the $(\alpha,\beta)$ pair that provides the best fit between model and data. Best, but in which sense?
In the general case, each dependent variable $y_i, \, i = 1, \dots, n$ is associated with a $p$-vector $\mathbf{x}_i$ of regressors.
The regression model above then takes the form
\begin{equation} y_i = \beta_0 1 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,\quad i = 1, \dots, n \end{equation}which can be more concisely written as:
\begin{equation} y_i = \mathbf{x}_i^T \mathbf{\beta} + \varepsilon_i,\quad i = 1, \dots, n \end{equation}and in vector form:
\begin{equation} \mathbf{y} = X \mathbf{\beta} + \mathbf{\varepsilon} \end{equation}You may have noticed that there is no more $\alpha$ in this form: it has been replaced by $\beta_0$, which is the multiplication factor for the constant value $1$. This makes $X$ an $n \times (p+1)$ matrix and $\mathbf{\beta}$ a $(p+1)$ vector. This is more understandable with an explicit display of $X$, $\mathbf{y}$, $\mathbf{\beta}$ and $\mathbf{\epsilon}$:
$$\mathbf{y} = \left( \begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n\\ \end{array}\right), \quad X = \left( \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p}\\ 1&x_{21}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{n1}&\cdots&x_{np}\\ \end{array}\right), \quad \mathbf{\beta} = \left( \begin{array}{c} \beta_0\\ \beta_1\\ \beta_2\\ \vdots\\ \beta_p\\ \end{array}\right), \quad \mathbf{\epsilon} = \left( \begin{array}{c} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n\\ \end{array}\right) $$The best known family of regression methods relies on this definition of the 'best fit': the best fit is the one that minimizes the sum of the squares of the residuals $\varepsilon_i$.
In the case of the simple linear regression, that is:
\begin{equation} \textrm{Find } \min_{\alpha,\beta} Q(\alpha,\beta), \quad \textrm{where}\quad Q(\alpha,\beta) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \end{equation}In the general case we are looking for $\hat{\beta}$ that minimizes:
\begin{equation} S(b) = (y - Xb)^T(y-Xb) \end{equation}We are then looking for $\hat{\beta}$ such that
\begin{equation} 0 = \frac{dS}{db}(\hat{\beta}) = \frac{d}{db}\left.\left(y^T y - b^TX^Ty-y^TXb + b^TX^TXb\right)\right|_{b=\hat{\beta}} \end{equation}By matrix calculus:
$$ \begin{array}{rcl} \frac{d}{db}(-b^TX^Ty) &=& -\frac{d}{db}(b^TX^Ty) = -X^Ty\\ \frac{d}{db}(-y^TXb) &=& -\frac{d}{db}(y^TXb) = -(y^TX)^T = -X^Ty\\ \frac{d}{db}(b^TX^TXb) &=& 2X^TXb\\ \end{array} $$So that we get
$$-2X^Ty + 2X^TX\hat{\beta} = 0,$$so that, when $X^TX$ is invertible, $\hat{\beta} = (X^TX)^{-1}X^Ty$. For the simple linear regression, this reduces to:
$$\hat{\beta} = \frac{\textrm{Cov}(x,y)}{\textrm{Var}(x)},\quad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$
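As a quick sanity check on the matrix form, here is a minimal sketch (using the x and y arrays generated above) that builds the design matrix with a column of ones and solves the normal equations $X^TX\hat{\beta} = X^Ty$ directly; it should agree with the Cov/Var closed form.
In [ ]:
# Sketch: solve the normal equations directly (uses the x, y generated above)
X = np.column_stack([np.ones(len(x)), x])       # n x 2 design matrix: rows are (1, x_i)
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print('normal equations -> alpha: {:.3f}, beta: {:.3f}'.format(alpha_hat, beta_hat))
b = np.cov(x, y)[0][1] / np.var(x, ddof=1)      # closed form for the simple case
print('closed form      -> alpha: {:.3f}, beta: {:.3f}'.format(np.mean(y) - b*np.mean(x), b))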
In [4]:
def fit(x, y, with_constant=True):
    # slope from the closed form Cov(x,y)/Var(x); np.cov uses ddof=1 by default,
    # so use ddof=1 for the variance as well to stay consistent
    beta = np.cov(x, y)[0][1] / np.var(x, ddof=1)
    if with_constant:
        alpha = np.mean(y) - beta * np.mean(x)
    else:
        # keep the same slope but force the line through the origin
        alpha = 0
    # Pearson correlation coefficient (same ddof everywhere, so |r| <= 1)
    r = np.cov(x, y)[0][1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
    mse = np.sum((y - alpha - beta*x)**2) / len(x)
    return beta, alpha, r, mse
In [5]:
beta,alpha,r,mse = fit(x,y)
print('alpha: {:.2f}, beta: {:.2f}'.format(alpha,beta))
print('r squared: {:.2f}'.format(r*r))
print('MSE: {:.2f}'.format(mse))
In [6]:
def fit_plot(x, y, noise, beta, alpha):
    fig, ax = plt.subplots(1, 3, figsize=(18,4))
    # left panel: data and fitted line
    ax[0].scatter(x, y)
    x_ = np.linspace(0, 2, 5)
    ax[0].plot(x_, alpha + beta*x_, color='orange', linewidth=2)
    ax[0].set_xlim([0,2])
    ax[0].set_ylim([0,3.1])
    mse = np.sum((y - alpha - beta*x)**2) / len(x)
    ax[0].set_title('MSE: {:.2f}'.format(mse), fontsize=14)
    # middle panel: original noise vs. residuals of the fit
    ax[1].hist(noise, alpha=0.5, bins=np.arange(-0.7,0.7,0.1));
    ax[1].hist(y - alpha - beta*x, alpha=0.5, bins=np.arange(-0.7,0.7,0.1))
    ax[1].legend(['original noise', 'residual']);
    # right panel: normal probability plot of the residuals
    stats.probplot(y - alpha - beta*x, dist="norm", plot=ax[2]);
In [7]:
fit_plot(x,y,noise,beta,alpha)
What is the variance of this estimator? If we assume the residuals are uncorrelated and all have the same variance $\sigma^2$ (so that $\textrm{Var}(\mathbf{\varepsilon}) = \sigma^2 I$), then, using $\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$, we have:
$$ \begin{array}{rcl} \textrm{Var}(\hat{\beta}) &=& (X^TX)^{-1}X^T\,\textrm{Var}(\mathbf{y})\,X(X^TX)^{-1}\\ &=& \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1}\\ &=& \sigma^2 (X^TX)^{-1}\\ \end{array} $$For simple linear regression, this gives ${\displaystyle\textrm{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \overline{x})^2}}$
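A small simulation can check this formula empirically. The sketch below assumes the same data-generating process as above ($\sigma = 0.25$) and keeps a fixed set of $x$ values across trials:
In [ ]:
# Empirical check of Var(beta_hat) = sigma^2 / sum((x_i - x_bar)^2)
sigma, trials = 0.25, 2000
x_sim = np.random.normal(1, 0.5, 100)                  # x kept fixed across trials
betas = [np.cov(x_sim, 0.75*x_sim + 1 + np.random.normal(0, sigma, 100))[0][1]
         / np.var(x_sim, ddof=1) for _ in range(trials)]
print('empirical   Var(beta_hat): {:.5f}'.format(np.var(betas)))
print('theoretical Var(beta_hat): {:.5f}'.format(sigma**2 / np.sum((x_sim - x_sim.mean())**2)))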
Here we will go through the case of the simple linear regression. There are two situations: either we can make the assumption that the residuals are normally distributed, or that the number of points in the dataset is "large enough", so that the law of large numbers and the central limit theorem become applicable.
Under a normality assumption for $\varepsilon$, $\hat{\beta}$ is also normally distributed.
Then,
$$Z = \frac{\hat{\beta}-\beta}{\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \overline{x})^2}}} \sim \mathcal{N}(0,1)$$Let $S = \frac{1}{\sigma^2} \sum_{i=1}^n \hat{\varepsilon}_i^2$. $S$ has a chi-squared distribution with $n-2$ degrees of freedom.
Then $T = \frac{Z}{\sqrt{\frac{S}{n-2}}}$ has a Student's t-distribution with $n-2$ degrees of freedom.
$T$ will be easier to use if we write it as:
$$T = \frac{\hat{\beta} - \beta}{s_{\hat{\beta}}}$$where
$$s_{\hat{\beta}} = \sqrt{\frac{\frac{1}{n-2} \sum_{i=1}^n \hat{\varepsilon}_i^2}{\sum_{i=1}^n (x_i - \overline{x})^2}}$$With this setting, the confidence interval for $\beta$ at confidence level $1-\alpha$ (this $\alpha$ is the usual significance level, not the intercept) is
$$\left[\hat{\beta} - t_{1-\alpha/2}^{n-2} s_{\hat{\beta}}, \hat{\beta} + t_{1-\alpha/2}^{n-2} s_{\hat{\beta}}\right]$$TODO: confidence interval for $\alpha$
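Here is a minimal sketch of that interval in code, using the x, y, alpha, beta fitted above and a significance level of $0.05$ (95% confidence):
In [ ]:
# 95% t-based confidence interval for the slope (uses x, y, alpha, beta from above)
resid = y - alpha - beta*x
s_beta = np.sqrt((np.sum(resid**2) / (len(x) - 2)) / np.sum((x - np.mean(x))**2))
t_crit = stats.t.ppf(1 - 0.05/2, df=len(x) - 2)        # t_{1-alpha/2}^{n-2}
print('beta in [{:.3f}, {:.3f}]'.format(beta - t_crit*s_beta, beta + t_crit*s_beta))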
In [8]:
n = 200
x = np.random.normal(5, 2, n)
noise = np.random.normal(0, 0.25, n)
y = 0.75*x + 1 + noise
In [9]:
fig, ax = plt.subplots(1, 1, figsize=(6,4))
ax.scatter(x, y)
ax.set_xlim([0,10])
ax.set_ylim([0,10])
Out[9]:
In [10]:
x_train = x[-40:]
x_test = x[:-40]
y_train = y[-40:]
y_test = y[:-40]
We can fit a linear regression model on the training data:
In [11]:
beta,alpha,r,mse = fit(x_train, y_train)
And use it to predict $y$ on test data:
In [12]:
y_pred = alpha + beta*x_test
In [13]:
def pred_plot(y_pred, y_test):
    fig, ax = plt.subplots(1, 2, figsize=(12,4))
    # left panel: predicted vs. observed values, with the y = x line for reference
    ax[0].scatter(y_test, y_pred)
    ax[0].plot([0,10], [0,10], color='g')
    ax[0].set_xlim([0,10])
    mse_pred = np.sum((y_test - y_pred)**2) / len(y_test)
    ax[0].set_title('MSE: {:.2f}'.format(mse_pred), fontsize=14)
    # right panel: distribution of the prediction errors
    ax[1].hist(y_test - y_pred);
In [14]:
pred_plot(y_pred,y_test)
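As a quick cross-check (purely illustrative), NumPy's polyfit with degree 1 solves the same least squares problem and should return essentially the same coefficients as our fit on the training data:
In [ ]:
# Cross-check against np.polyfit (degree 1 = simple linear least squares)
slope, intercept = np.polyfit(x_train, y_train, 1)
print('polyfit -> alpha: {:.2f}, beta: {:.2f}'.format(intercept, slope))
print('fit()   -> alpha: {:.2f}, beta: {:.2f}'.format(alpha, beta))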
What happens if the noise has a nonzero mean and we fit a model without a constant term?
In [15]:
n = 100
x = np.random.normal(1, 0.5, n)
noise = np.random.normal(0.25, 0.25, n)
y = 0.75*x + noise
In [16]:
beta,alpha,r,mse = fit(x,y, with_constant=False)
In [17]:
fit_plot(x,y,noise,beta,alpha)
What would it mean for a prediction model?
In [18]:
n = 200
x = np.random.normal(5, 2, n)
noise = np.random.normal(1, 0.25, n)
y = 0.75*x + noise
In [19]:
x_train = x[-40:]
x_test = x[:-40]
y_train = y[-40:]
y_test = y[:-40]
In [20]:
beta,alpha,r,mse = fit(x_train, y_train, with_constant=False)
In [21]:
y_pred = beta*x_test
In [22]:
pred_plot(y_pred,y_test)
But if we include a constant in the model, the fit is no longer biased:
In [23]:
n = 100
x = np.random.normal(1, 0.5, n)
noise = np.random.normal(0.25, 0.25, n)
y = 0.75*x + noise
In [24]:
beta,alpha,r,mse = fit(x,y, with_constant=True)
In [25]:
fit_plot(x,y,noise,beta,alpha)
In [26]:
n = 200
x = np.random.normal(5, 2, n)
noise = np.random.normal(1, 0.25, n)
y = 0.75*x + noise
In [27]:
x_train = x[-40:]
x_test = x[:-40]
y_train = y[-40:]
y_test = y[:-40]
In [28]:
beta,alpha,r,mse = fit(x_train, y_train, with_constant=True)
In [29]:
y_pred = alpha+beta*x_test
In [30]:
pred_plot(y_pred, y_test)
What happened? The constant term absorbed the nonzero mean of the noise (the fitted $\alpha$ is close to the noise mean of $1$), so the predictions are no longer biased.
Ordinary least squares also assumes that all residuals have the same variance (homoscedasticity). What happens when the noise variance grows with $x$?
In [31]:
n = 100
x = np.linspace(0, 2, n)
sigma2 = 0.1*x**2
noise = np.random.normal(0, np.sqrt(sigma2), n)
y = 0.75*x + 1 + noise
In [32]:
beta,alpha,r,mse = fit(x,y)
In [33]:
fit_plot(x,y,noise,beta,alpha)
When this assumption of equal variances is violated, we will turn to a weighted least squares estimator.
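A minimal weighted least squares sketch, assuming the per-point variances are known: here we reuse the sigma2 array from the cell above (floored to avoid dividing by its zero first entry) and weight each point by the inverse variance.
In [ ]:
# Weighted least squares sketch: beta_hat = (X^T W X)^{-1} X^T W y, W = diag(1/sigma_i^2)
# (uses x, y, sigma2 from the cells above; treating the variances as known is an assumption)
w = 1.0 / np.maximum(sigma2, 1e-6)           # inverse-variance weights, floored at 1e-6
X = np.column_stack([np.ones(len(x)), x])
XtW = X.T * w                                # same as X.T @ np.diag(w), but cheaper
alpha_w, beta_w = np.linalg.solve(XtW @ X, XtW @ y)
print('WLS -> alpha: {:.2f}, beta: {:.2f}'.format(alpha_w, beta_w))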
Here is another example, where the noise variance changes by blocks along $x$:
In [34]:
n = 100
x = np.linspace(0, 2, n)
sigma2 = 0.1*((6*x).astype(int) % 4)
noise = np.random.normal(0, np.sqrt(sigma2), n)
y = 0.75*x + 1 + noise
In [35]:
beta,alpha,r,mse = fit(x,y)
In [36]:
fit_plot(x,y,noise,beta,alpha)
When the errors have a more general (but known) covariance structure, we will turn to a generalized least squares estimator.
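If the covariance matrix $\Omega$ of the errors is known (or modeled), generalized least squares solves $\hat{\beta} = (X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y$. Below is a minimal sketch, assuming a diagonal $\Omega$ built from the sigma2 array above (again floored to keep it invertible); this diagonal choice is an illustrative assumption.
In [ ]:
# Generalized least squares sketch with a known covariance matrix Omega
# (the diagonal Omega below is an assumption made for illustration)
Omega = np.diag(np.maximum(sigma2, 1e-6))
Oinv = np.linalg.inv(Omega)
X = np.column_stack([np.ones(len(x)), x])
alpha_g, beta_g = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)
print('GLS -> alpha: {:.2f}, beta: {:.2f}'.format(alpha_g, beta_g))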
In [ ]: