Linear regression is a method for studying the statistical relationship between independent (predictor) variables and a dependent (response/outcome) variable. It is simple to use, and a fitted linear regression model is straightforward to interpret. There are two variants: simple linear regression, with a single predictor, and multiple linear regression, with several predictors.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Simple linear regression assumes that there is a linear relationship between two continuous variables; the predictor variable $x$ and the response variable $y$. This can be described as:
\begin{equation*} y \thickapprox \beta_0 + \beta_1 x \end{equation*}where $\beta_0$ is the intercept and $\beta_1$ is the slope. Together, $\beta_0$ and $\beta_1$ are called the model coefficients or parameters.
Since these coefficients are unknown to begin with, we use our data to estimate (learn) them using the least squares criterion. Essentially, we find the best-fitting line by minimising the sum of squared errors, or residuals. An error, or residual, is the difference between an observed value and the value estimated by the model. We square each difference so that positive and negative residuals do not cancel out. The residual sum of squares (RSS) is defined as: \begin{equation*} RSS = \sum_{i=1}^n{(y_i - \hat{y}_i)^2} \end{equation*} where $y_i$ is the $i$-th observed value and $\hat{y}_i$ is the corresponding value predicted by the model.
We can illustrate this with a plot:
In [2]:
from sklearn.linear_model import LinearRegression
# Create artificial data
X = np.array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([-1.1, 4, 1, 6, 4, 2, 8, 5, 12, 7])
# Create a model
lrm = LinearRegression()
lrm.fit(X.reshape(-1, 1), Y)
model_line = lrm.intercept_ + X * lrm.coef_[0]
fig, ax = plt.subplots(figsize=(8,4))
# Plot the data
ax.plot(X, Y, 'go')
# Plot the regression line
ax.plot(X, model_line, 'bo-')
# Plot the residuals
ax.vlines(x=X, ymin=np.minimum(Y, model_line), ymax=np.maximum(Y, model_line), color='red')
Out[2]:
The green dots represent our artificial data, i.e. the observed values. The red vertical lines indicate the errors, or residuals. The blue line represents the regression line that best fits the observed data; it is found by minimising the residuals.
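To make the RSS definition concrete, here is a small sketch that computes the RSS for the fitted line, reusing the lrm, X, and Y objects from the cell above:
# Predictions of the fitted simple linear regression model for each x value
Y_hat = lrm.predict(X.reshape(-1, 1))
# Residual sum of squares: sum of squared differences between observed and predicted values
rss = np.sum((Y - Y_hat) ** 2)
print('RSS:', rss)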
Typically, we have more than one predictor variable (feature) in our data. In such instances, we use multiple linear regression:
\begin{equation*} y \thickapprox \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \end{equation*}
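As a brief illustration of the same API with more than one predictor, here is a minimal sketch on synthetic data (the coefficients 3.0, 1.5 and -2.0 below are arbitrary choices, not taken from any real dataset):
# Synthetic data: two predictors with a known linear relationship plus noise
rng = np.random.RandomState(0)
X_multi = rng.uniform(0, 10, size=(50, 2))
y_multi = 3.0 + 1.5 * X_multi[:, 0] - 2.0 * X_multi[:, 1] + rng.normal(0, 1, 50)

# Fit a multiple linear regression: one intercept, one coefficient per predictor
mlr = LinearRegression()
mlr.fit(X_multi, y_multi)
print('Intercept:', mlr.intercept_)
print('Coefficients:', mlr.coef_)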
In [3]:
from sklearn.datasets import load_boston
boston = load_boston()
In [4]:
boston.keys()
Out[4]:
In [5]:
print(boston['DESCR'])
In [6]:
df = pd.DataFrame(data=boston.data, columns=boston['feature_names'])
# Add a column for the target feature
df['MEDV'] = boston['target']
df.head()
Out[6]:
In [7]:
df.info()
Next, check the distributions of the data using summary statistics:
In [8]:
df.describe().transpose()
Out[8]:
We can check whether there are missing values in the dataset:
In [9]:
df.isnull().any()
Out[9]:
No missing data. If any of our features contained missing values, we could use a heatmap to get a quick overview of where the missing data are and a rough picture of how much data is missing.
In [10]:
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
Out[10]:
The next step is to identify which features are categorical:
In [11]:
# Features with number of unique values
unq = {
    column: df[column].nunique()
    for column in df.columns
}
unq
Out[11]:
In [12]:
import operator
sorted(unq.items(), key=operator.itemgetter(1))
Out[12]:
In [13]:
sorted(df['RAD'].unique())
Out[13]:
We see that CHAS is a Boolean feature. RAD (index of accessibility to radial highways) is also categorical since it only takes 9 distinct values. Let us convert both to the category dtype:
In [14]:
df['CHAS'] = df['CHAS'].astype('category')
df['RAD'] = df['RAD'].astype('category')
In [15]:
df.info()
Since CHAS is a binary feature, we can use the point-biserial correlation coefficient to measure the correlation between CHAS and MEDV:
In [16]:
from scipy import stats
stats.pointbiserialr(df['CHAS'], df['MEDV'])
Out[16]:
Before we can train a linear regression model, we must decide which feature to use as the independent variable; we want a good predictor of the dependent variable. One way to find such a candidate is to create a heatmap of the pairwise correlations between the features. Pearson's correlation tells us whether, as one variable increases, another variable tends to increase (positive correlation), decrease (negative correlation), or stay the same (no correlation). Read more on Cross Validated.
In [17]:
plt.figure(figsize = (10,6))
sns.heatmap(df.corr(), annot=True, linewidths=1)
Out[17]:
There is a strong positive correlation between our target feature MEDV (the house price) and RM, the average number of rooms per house. There are also negative correlations with LSTAT and PTRATIO: as one of these variables increases, MEDV tends to decrease, and vice versa.
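The same information can be read off without the heatmap by sorting the correlations of each numeric feature with MEDV (a small sketch using the df defined above):
# Correlation of every numeric feature with the target, from most negative to most positive
df.corr()['MEDV'].sort_values()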
For now, we focus on the relationship between MEDV and RM. Let us create a linear model plot of these two attributes:
In [18]:
# Scatter plot of RM against MEDV with a fitted regression line
sns.lmplot(data=df, x='RM', y='MEDV')
Out[18]:
From the plot, it looks like a linear fit is reasonable, since the confidence band around the regression line is fairly narrow.
Before we train a model, it is important to split the data into a training set and a testing set to avoid overfitting. Overfitting occurs when a model becomes too complex: it may achieve high accuracy on the training data but fail to generalise to new data.
Note that the train_test_split() method partitions the data randomly, so rerunning it may yield different coefficients, and the accuracy we measure on the test set can vary with the particular split. A better approach would therefore be k-fold cross-validation, although cross-validation may not work well when the data is not uniformly distributed.
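For reference, a k-fold cross-validation run could look roughly like the following sketch, which scores a simple linear regression on the RM feature; the choice of 5 folds and the R^2 scoring are arbitrary here:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# 5-fold cross-validation of a simple linear regression on the RM feature
scores = cross_val_score(LinearRegression(), df[['RM']], df['MEDV'], cv=5, scoring='r2')
print('R^2 per fold:', scores)
print('Mean R^2:', scores.mean())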
In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
In [20]:
X = df['RM']
y = df['MEDV']
# Partition the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
lrm = LinearRegression()
lrm.fit(X_train.values.reshape(-1, 1), y_train)
Out[20]:
In [21]:
intercept = lrm.intercept_
coefficient = lrm.coef_[0]
print("Intercept: {0}\nCoefficient: {1}".format(intercept, coefficient))
It is straightforward to interpret the coefficient:
In [22]:
print('A unit increase in the RM feature is associated with an increase of {0} units in the house price.'.format(coefficient))
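We can verify this interpretation by predicting at two hypothetical RM values one unit apart (the values 6 and 7 below are chosen purely for illustration):
# Predictions for two hypothetical houses that differ by exactly one room on average
pred_6, pred_7 = lrm.predict(np.array([[6.0], [7.0]]))
print('Predicted MEDV at RM = 6:', pred_6)
print('Predicted MEDV at RM = 7:', pred_7)
print('Difference (equal to the coefficient):', pred_7 - pred_6)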
Now we use the fitted model to perform predictions on our test set.
In [23]:
predicted = lrm.predict(X_test.values.reshape(-1, 1))
Let us check how far off the predicted values are from the observed values. We can visualise this with a scatter plot, together with a dashed line indicating where perfect predictions would lie.
In [24]:
fig, ax = plt.subplots()
ax.scatter(y_test, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', linewidth=3)
ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')
Out[24]:
Create a histogram of the residuals (errors). If the residuals are approximately normally distributed, that is a good sign that a linear model is appropriate for this data set. Otherwise, we may have to reconsider the choice of a linear regression model.
In [25]:
sns.distplot(y_test - predicted);
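A quantile-quantile plot gives another view of how close the residuals are to a normal distribution; the sketch below uses scipy.stats.probplot with the stats module imported earlier:
# Q-Q plot of the residuals against a normal distribution;
# points close to the straight line suggest approximately normal residuals
fig, ax = plt.subplots()
stats.probplot(y_test - predicted, dist='norm', plot=ax)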
Next, we evaluate the performance of our model. The scikit-learn library implements several metric functions:
Mean Absolute Error is the mean of the absolute value of the errors
\begin{equation*} \frac{1}{n} \sum_{i=1}^n \mid y_i - \hat{y}_i \mid \end{equation*}
Median Absolute Error is interesting because it is robust to outliers.
Explained Variance Score
\begin{equation*} 1- \frac{Var(y - \hat{y})}{Var(y)} \end{equation*}
Coefficient of Determination ($R^2$) is often used as a tool for comparing different models. Although we want an $R^2$ value close to 1, a high value alone is no guarantee of a good model, because $R^2$ is susceptible to overfitting.
Mean Squared Error tends to be useful with real-world data because it penalises larger errors more heavily
\begin{equation*} \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i )^2 \end{equation*}
Root Mean Squared Error is the square root of the MSE, which expresses the error in the same units as the response variable.
In [26]:
from sklearn import metrics
print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, predicted))
print('Median Absolute Error: ', metrics.median_absolute_error(y_test, predicted))
print('Explained Variance Score: ', metrics.explained_variance_score(y_test, predicted))
print('Coefficient of Determination (R^2): ', metrics.r2_score(y_test, predicted))
print('Mean Squared Error: ', metrics.mean_squared_error(y_test, predicted))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predicted)))
There are many potential challenges that arise when we fit a linear regression model to a particular data set. Here are a few common issues:
We say that we have collinearity in our data set when two or more predictor variables are highly correlated with one another. These variables are called collinear variables, and one collinear variable can predict another with high accuracy. The left plot of Figure 3.14 shows that there is no collinearity between Age and Limit, whereas the right plot shows high collinearity between Rating and Limit.
Collinearity can be a problem because it makes it difficult to separate the effects that the individual collinear variables have on the dependent variable. Another issue is that model predictions tend to be less precise when there is a high level of collinearity in the training set. Therefore, if two predictor variables are highly correlated, we should consider excluding one of them from the data set. An alternative approach is to use ridge regression.
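A minimal sketch of how ridge regression could be applied is shown below; it simply refits on the single RM feature from the earlier split to demonstrate the API, and the regularisation strength alpha=1.0 is an arbitrary choice:
from sklearn.linear_model import Ridge

# Ridge regression adds an L2 penalty on the coefficients to the least squares objective
ridge = Ridge(alpha=1.0)
ridge.fit(X_train.values.reshape(-1, 1), y_train)
print('Ridge intercept:', ridge.intercept_)
print('Ridge coefficient:', ridge.coef_[0])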
Outliers ...