Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!
The easiest regression model is called linear regression. The idea behind linear regression is to describe a target variable (such as Boston house pricing) with a linear combination of features.
If you want to understand how the math behind linear regression works, please refer to the book (p.65ff.).
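Concretely, for a data point with feature values $x_1, \dots, x_D$, a linear regression model predicts the target as a weighted sum of the features plus an offset (the weight notation $w_j$ below is ours, not the book's):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$$

Training the model amounts to finding the weights $w_j$ that minimize the squared error between the predicted and true target values.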
To get a better understanding of linear regression, we want to build a simple model that can be applied to one of the most famous machine learning datasets known as the Boston housing prices dataset. Here, the goal is to predict the value of homes in several Boston neighborhoods in the 1970s, using information such as crime rate, property tax rate, distance to employment centers, and highway accessibility.
In [1]:
import numpy as np
import cv2
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import linear_model
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16})
Then, loading the dataset is a one-liner:
In [2]:
boston = datasets.load_boston()
The structure of the boston object is identical to the iris
object. We can get more information about the dataset by looking at the fields of the boston
object:
DESCR: Get a description of the data
data: The actual data, <num_samples x num_features>
feature_names: The names of the features
target: The target values (here, the house prices), <num_samples x 1>
target_names: The names of the targets
In [3]:
dir(boston)
Out[3]:
The dataset contains a total of 506 data points, each of which has 13 features:
In [4]:
boston.data.shape
Out[4]:
Of course, we have only a single target value, which is the housing price:
In [5]:
boston.target.shape
Out[5]:
We first set up the model:
In [6]:
linreg = linear_model.LinearRegression()
Next, we want to split the data into training and test sets. We are free to make the split as we see fit, but usually it is a good idea to reserve between 10 percent and 30 percent for testing. Here, we choose 10 percent, using the test_size argument:
In [7]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
boston.data, boston.target, test_size=0.1, random_state=42
)
In scikit-learn, the train function is called fit, but otherwise behaves exactly the same as in OpenCV:
In [8]:
linreg.fit(X_train, y_train)
Out[8]:
We can look at the mean squared error of our predictions by comparing the true housing prices, y_train, to our predictions, linreg.predict(X_train):
In [9]:
metrics.mean_squared_error(y_train, linreg.predict(X_train))
Out[9]:
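Here, the mean squared error is simply the average of the squared differences between the true and predicted values,

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,$$

where $y_i$ are the true housing prices and $\hat{y}_i$ our predictions.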
The score method of the linreg object returns the coefficient of determination (R squared):
In [10]:
linreg.score(X_train, y_train)
Out[10]:
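The coefficient of determination relates the model's squared error to the total variance of the target values,

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},$$

where $\bar{y}$ is the mean of the true target values. A value of 1 means the model explains all of the variance, while a value of 0 means it does no better than always predicting the mean. Of course, what we really care about is how the model does on the held-out test set: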
In [11]:
y_pred = linreg.predict(X_test)
In [12]:
metrics.mean_squared_error(y_test, y_pred)
Out[12]:
We note that the mean squared error is a little lower on the test set than on the training set. This is good news, as we care mostly about the test error. However, from these numbers alone, it is hard to judge how good the model really is. Perhaps it's better to plot the data:
In [13]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, linewidth=3, label='ground truth')
plt.plot(y_pred, linewidth=3, label='predicted')
plt.legend(loc='best')
plt.xlabel('test data points')
plt.ylabel('target value')
Out[13]:
This makes more sense! Here, we see the ground truth housing prices for all test samples in red and our predicted housing prices in blue. Pretty close, if you ask me. It is interesting to note, though, that the model tends to be off the most for really high or really low housing prices, such as the peak values of data points 12, 18, and 42. We can formalize the amount of variance in the data that we were able to explain by calculating $R^2$:
In [14]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, y_pred, 'o')
plt.plot([-10, 60], [-10, 60], 'k--')
plt.axis([-10, 60, -10, 60])
plt.xlabel('ground truth')
plt.ylabel('predicted')
scorestr = r'R$^2$ = %.3f' % linreg.score(X_test, y_test)
errstr = 'MSE = %.3f' % metrics.mean_squared_error(y_test, y_pred)
plt.text(-5, 50, scorestr, fontsize=12)
plt.text(-5, 45, errstr, fontsize=12);
If our model were perfect, then all data points would lie on the dashed diagonal, since y_pred would always be equal to y_true. Deviations from the diagonal indicate that the model made some errors, or that there is some variance in the data that the model was not able to explain. Indeed, $R^2$ indicates that we were able to explain 76 percent of the scatter in the data, with a mean squared error of 15.011. These are some hard numbers we can use to compare the linear regression model to some more complicated ones.
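As a rough sketch of how such a comparison might look (this cell is our addition, not from the book; it assumes the train/test split and the imports from above are still in scope), we could evaluate a trivial baseline that always predicts the mean training price alongside our linear regression model:
In [15]:
from sklearn import dummy

# A trivial baseline that always predicts the mean of the training targets
baseline = dummy.DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)

# Compare test-set performance of the baseline and the fitted linear model
for name, model in [('baseline (mean)', baseline), ('linear regression', linreg)]:
    y_p = model.predict(X_test)
    print('%s: MSE = %.3f, R^2 = %.3f' % (
        name,
        metrics.mean_squared_error(y_test, y_p),
        model.score(X_test, y_test)))
Any model worth its salt should at least clearly beat this baseline.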