Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!
The easiest regression model is called linear regression. The idea behind linear regression is to describe a target variable (such as Boston house pricing) with a linear combination of features.
If you want to understand how the math behind linear regression works, please refer to the book (p.65ff.).
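Concretely, for a data point with feature values $x_1, \dots, x_D$, a linear regression model predicts the target as a weighted sum of the features plus an offset (the weight notation $w_j$ below is ours, not the book's):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$$

Training the model amounts to finding the weights $w_j$ that minimize the squared error between the predicted and true target values.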
To get a better understanding of linear regression, we want to build a simple model that can be applied to one of the most famous machine learning datasets known as the Boston housing prices dataset. Here, the goal is to predict the value of homes in several Boston neighborhoods in the 1970s, using information such as crime rate, property tax rate, distance to employment centers, and highway accessibility.
In [1]:
import numpy as np
import cv2
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import linear_model
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16})
Then, loading the dataset is a one-liner:
In [2]:
boston = datasets.load_boston()
The structure of the boston object is identical to the iris
object. We can get more information about the dataset by looking at the fields of the boston
object:
DESCR: Get a description of the data
data: The actual data, <num_samples x num_features>
feature_names: The names of the features
target: The target values (here, the house prices), <num_samples x 1>
target_names: The names of the targets
In [3]:
dir(boston)
Out[3]:
The dataset contains a total of 506 data points, each of which has 13 features:
In [4]:
boston.data.shape
Out[4]:
Of course, we have only a single target value, which is the housing price:
In [5]:
boston.target.shape
Out[5]:
We first set up the model:
In [6]:
linreg = linear_model.LinearRegression()
Next, we want to split the data into training and test sets. We are free to make the split as we see fit, but usually it is a good idea to reserve between 10 percent and 30 percent for testing. Here, we choose 10 percent, using the test_size argument:
In [7]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
boston.data, boston.target, test_size=0.1, random_state=42
)
In scikit-learn, the train function is called fit, but otherwise behaves exactly the same as in OpenCV:
In [8]:
linreg.fit(X_train, y_train)
Out[8]:
We can look at the mean squared error of our predictions by comparing the true housing prices, y_train, to our predictions, linreg.predict(X_train):
In [9]:
metrics.mean_squared_error(y_train, linreg.predict(X_train))
Out[9]:
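Here, the mean squared error is simply the average of the squared differences between the true and predicted values,

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,$$

where $y_i$ are the true housing prices and $\hat{y}_i$ our predictions.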
The score method of the linreg object returns the coefficient of determination (R squared):
In [10]:
linreg.score(X_train, y_train)
Out[10]:
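The coefficient of determination relates the model's squared error to the total variance of the target values,

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},$$

where $\bar{y}$ is the mean of the true target values. A value of 1 means the model explains all of the variance, while a value of 0 means it does no better than always predicting the mean. Of course, what we really care about is how the model does on the held-out test set: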
In [11]:
y_pred = linreg.predict(X_test)
In [12]:
metrics.mean_squared_error(y_test, y_pred)
Out[12]:
We note that the mean squared error is a little lower on the test set than on the training set. This is good news, as we care mostly about the test error. However, from these numbers alone, it is hard to judge how good the model really is. Perhaps it's better to plot the data:
In [13]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, linewidth=3, label='ground truth')
plt.plot(y_pred, linewidth=3, label='predicted')
plt.legend(loc='best')
plt.xlabel('test data points')
plt.ylabel('target value')
Out[13]:
This makes more sense! Here, we see the ground truth housing prices for all test samples in red and our predicted housing prices in blue. Pretty close, if you ask me. It is interesting to note, though, that the model tends to be off the most for really high or really low housing prices, such as the peak values of data points 12, 18, and 42. We can formalize the amount of variance in the data that we were able to explain by calculating $R^2$:
In [14]:
plt.figure(figsize=(10, 6))
plt.plot(y_test, y_pred, 'o')
plt.plot([-10, 60], [-10, 60], 'k--')
plt.axis([-10, 60, -10, 60])
plt.xlabel('ground truth')
plt.ylabel('predicted')
scorestr = r'R$^2$ = %.3f' % linreg.score(X_test, y_test)
errstr = 'MSE = %.3f' % metrics.mean_squared_error(y_test, y_pred)
plt.text(-5, 50, scorestr, fontsize=12)
plt.text(-5, 45, errstr, fontsize=12);
If our model were perfect, then all data points would lie on the dashed diagonal, since y_pred would always be equal to y_true. Deviations from the diagonal indicate that the model made some errors, or that there is some variance in the data that the model was not able to explain. Indeed, $R^2$ indicates that we were able to explain 76 percent of the scatter in the data, with a mean squared error of 15.011. These are some hard numbers we can use to compare the linear regression model to some more complicated ones.
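As a rough sketch of how such a comparison might look (this cell is our addition, not from the book; it assumes the train/test split and the imports from above are still in scope), we could evaluate a trivial baseline that always predicts the mean training price alongside our linear regression model:
In [15]:
from sklearn import dummy

# A trivial baseline that always predicts the mean of the training targets
baseline = dummy.DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)

# Compare test-set performance of the baseline and the fitted linear model
for name, model in [('baseline (mean)', baseline), ('linear regression', linreg)]:
    y_p = model.predict(X_test)
    print('%s: MSE = %.3f, R^2 = %.3f' % (
        name,
        metrics.mean_squared_error(y_test, y_p),
        model.score(X_test, y_test)))
Any model worth its salt should at least clearly beat this baseline.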