Linear Regression Tutorial

Some problems don't have discrete (categorical) labels (e.g. color, plant species), but rather a continuous range of numbers (e.g. length, price). For these types of problems, regression is usually a good choice. Rather than predicting a categorical label for each example, it fits a continuous line (or plane, or curve) to the data and uses it to predict a number for each example.

If you've ever found a "line of best fit" using Excel, you've already used regression!

Setup

Tell matplotlib to print figures in the notebook. Then import numpy (for numerical data), matplotlib.pyplot (for plotting figures), linear_model and datasets (for the scikit-learn linear regression algorithm and the diabetes dataset), train_test_split and KFold from sklearn.model_selection (to create training, testing, and validation splits), and mean_squared_error and r2_score from sklearn.metrics (to evaluate our models).


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets # Import the linear regression function and dataset from scikit-learn
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error, r2_score

# Print figures in the notebook
%matplotlib inline

Import the dataset

Import the dataset and store it in a variable called diabetes. This dataset is similar to a Python dictionary, with the keys: ['DESCR', 'target', 'data', 'feature_names']

The data features are stored in diabetes.data, where each row is an example from a single patient, and each column is a single feature. The feature names are stored in diabetes.feature_names. Target values are stored in diabetes.target.

Below, we load the labels into y, the data into X, and the names of the features into featureNames. We also print the description of the dataset.


In [ ]:
# Import some data to play with
diabetes = datasets.load_diabetes()

# Store the labels (y), features (X), and feature names
y = diabetes.target       # Labels are stored in y as numbers
X = diabetes.data
featureNames = diabetes.feature_names

print(diabetes.DESCR)
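
Before moving on, it can help to peek at the shape of the data. This is a quick optional check, not required for the rest of the tutorial:


In [ ]:
# The dataset has 442 patients, each with 10 standardized features
print('Data shape:   ', X.shape)
print('Target shape: ', y.shape)
print('Feature names:', featureNames)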

Create Training and Testing Sets

In order to see how well our model works, we need to divide our data into training and testing sets.


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
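
As a quick sanity check, we can print the sizes of the resulting sets; with test_size=0.3, roughly 70% of the examples should land in training and 30% in testing:


In [ ]:
# Roughly 70% of the examples go to training, 30% to testing
print('Training examples:', X_train.shape[0])
print('Testing examples: ', X_test.shape[0])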

Visualize The Data

There are too many features to visualize the whole training set, but we can plot a single feature (e.g. body mass index) against the quantified disease progression.

Remember that the features have been standardized: each one is mean-centered and rescaled (in scikit-learn's copy of this dataset, each column is scaled so that its sum of squares equals 1). So the values shown aren't the actual BMI levels, but a standardized version of them.


In [ ]:
bmi_ind = 2

plt.scatter(X_train[:,bmi_ind], y_train)
plt.ylabel('Disease Progression')
plt.xlabel('Body Mass Index')
plt.show()
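
As an optional check (assuming scikit-learn's standard copy of this dataset), we can confirm the standardization: the column's mean should be approximately 0 and its sum of squares approximately 1.


In [ ]:
# The mean should be ~0 and the sum of squares ~1 for each standardized column
print('Mean of BMI column:           %.6f' % X[:, bmi_ind].mean())
print('Sum of squares of BMI column: %.6f' % np.sum(X[:, bmi_ind]**2))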

Train A Toy Model

Next, we train a linear regression algorithm on our data.

In scikit-learn, we use linear_model.LinearRegression() to create our model. Models are trained using a fit() method. For linear regression, fit() expects two arguments: the training examples X_train and the corresponding labels y_train.

To start, we will train a toy model using only the body mass index feature.


In [ ]:
regr = linear_model.LinearRegression()
x_train = X_train[:,bmi_ind].reshape(-1, 1) # regression expects a (#examples,#features) array shape
regr.fit(x_train, y_train)

Making Predictions

To make predictions on new data we use the predict() method. It expects a single input: an array-like object containing the features for the examples we want to predict.

Here, we get the predictions for two body mass index values: -0.05 and 0.1.


In [ ]:
bmi = [[-0.05],[0.1]]
predictions = regr.predict(bmi)
print(predictions)
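
Under the hood, the model has learned a slope (stored in regr.coef_) and an intercept (stored in regr.intercept_), and predict() simply evaluates the line y = intercept + slope * x. Here is a minimal sketch reproducing the predictions by hand:


In [ ]:
# The fitted line is y = intercept + slope * x
print('Slope:     %.2f' % regr.coef_[0])
print('Intercept: %.2f' % regr.intercept_)

# These should match the predict() output above
for x in [-0.05, 0.1]:
    print('Prediction at %.2f: %.2f' % (x, regr.intercept_ + regr.coef_[0] * x))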

Visualize the Toy Model

Here we plot the linear regression line of our trained model on top of the data. We do this by predicting the output of the model on a range of values from -0.1 to 0.15. These predictions are plotted as a line on top of the training data.

This can't tell us how well it will perform on new, unseen data, but it can show us how well the line fits the training data.


In [ ]:
values = np.arange(-0.1, 0.15, 0.01).reshape(-1, 1) # Reshape so each value is its own example (row) with a single feature (column)

plt.scatter(x_train, y_train)
plt.plot(values, regr.predict(values), c='r')
plt.ylabel('Disease Progression')
plt.xlabel('Body Mass Index')
plt.title('Regression Line on Training Data')
plt.show()

Test the Toy Model

Next, we will test the ability of our model to predict the disease progression in our test set, using only the body mass index.

First, we get our predictions for the test data, and plot the fitted line on top of it. Since we trained our model using only one feature, we need to extract only that feature from the test set.


In [ ]:
x_test = X_test[:,bmi_ind].reshape(-1, 1) # regression expects a (#examples,#features) array shape
predictions = regr.predict(x_test)

plt.scatter(x_test, y_test)
plt.plot(x_test, predictions, c='r')
plt.ylabel('Disease Progression')
plt.xlabel('Body Mass Index')
plt.title('Regression Line on Test Data')
plt.show()

Evaluate the Toy Model

Next, we evaluate how well our model worked on the test dataset. Unlike with discrete classifiers (e.g. KNN, SVM), the number of examples it got "correct" isn't meaningful here. We certainly care if the predicted value is off by 100, but do we care as much if it is off by 1? What about 0.01?

There are many ways to evaluate a regression model, but one popular metric is the mean squared error, or MSE. As the name implies, you find the error for each example (the distance between the point and the predicted line), square each of them, and then take the mean of the squared errors.

Scikit-learn has a function that does this for you easily.


In [ ]:
mse = mean_squared_error(y_test, predictions)

print('The MSE is ' + '%.2f' % mse)
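
To make the formula concrete, here is a minimal sketch computing the same number by hand:


In [ ]:
# MSE by hand: the mean of the squared residuals
mse_manual = np.mean((y_test - predictions)**2)
print('The manually computed MSE is ' + '%.2f' % mse_manual)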

The MSE isn't as intuitive as the accuracy of a discrete classifier, but it is highly useful for comparing the effectiveness of different models. Another option is to look at the $R^2$ score, which you may already be familiar with if you've ever fit a line to data in Excel. A value of 1.0 means a perfect predictor, while 0.0 means the model does no better than always predicting the mean of the target values.


In [ ]:
r2score = r2_score(y_test, predictions)

print('The R^2 score is ' + '%.2f' % r2score)
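
The $R^2$ score can also be computed from its definition, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A minimal sketch:


In [ ]:
# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - predictions)**2)
ss_tot = np.sum((y_test - np.mean(y_test))**2)
print('The manually computed R^2 score is ' + '%.2f' % (1 - ss_res / ss_tot))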

Train A Model on All Features

Next we will train a model on all of the available features and use it to predict the progression of diabetes after a year. We can then see how this compares to using only a single feature.


In [ ]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

predictions = regr.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print('The MSE is ' + '%.2f' % mse)

r2score = r2_score(y_test, predictions)
print('The R^2 score is ' + '%.2f' % r2score)
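
With all ten features, the model learns one coefficient per feature. As an optional sketch, pairing the coefficients with featureNames shows how much each (standardized) feature contributes to the prediction:


In [ ]:
# One learned coefficient per (standardized) feature
for name, coef in zip(featureNames, regr.coef_):
    print('%s: %7.2f' % (name, coef))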

Using Cross-Validation

To properly compare these two models, we need to use cross-validation to select the best model. We can then get our final result using our test data.

First we create our two linear regression models, then we divide our training data into folds. We loop through the sets of training and validation folds. Each time, we train each model on the training folds and evaluate it on the validation fold. We store the MSE and $R^2$ score of each model on each fold so we can look at them later.


In [ ]:
# Create our regression models
regr_1 = linear_model.LinearRegression()
regr_all = linear_model.LinearRegression()

# Create arrays to store the MSE and R^2 score
mse_1 = []
r2score_1 = []
mse_all = []
r2score_all = []

# Loop through 5 folds
kf = KFold(n_splits=5)
for trainInd, valInd in kf.split(X_train):
    X_tr = X_train[trainInd,:]
    y_tr = y_train[trainInd]
    X_val = X_train[valInd,:]
    y_val = y_train[valInd]
    
    # Train our models
    regr_1.fit(X_tr[:,bmi_ind].reshape(-1, 1), y_tr) # Train on only one feature
    regr_all.fit(X_tr, y_tr) # Train on all features
    
    # Make our predictions
    pred_1 = regr_1.predict(X_val[:,bmi_ind].reshape(-1, 1))
    pred_all = regr_all.predict(X_val)
    
    # Calculate the MSE
    mse_1.append(mean_squared_error(y_val, pred_1))
    mse_all.append(mean_squared_error(y_val, pred_all))
    
    # Calculate the R^2 score
    r2score_1.append(r2_score(y_val, pred_1))
    r2score_all.append(r2_score(y_val, pred_all))
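
As a side note, scikit-learn can run this kind of loop for you: cross_val_score splits the data, fits a fresh copy of the model on each fold, and returns the per-fold scores in one call. A minimal sketch for the all-features model, using the same default unshuffled 5-fold split:


In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold MSE for the all-features model in a single call
# ('neg_mean_squared_error' returns negated MSEs, so we flip the sign)
scores = cross_val_score(linear_model.LinearRegression(), X_train, y_train,
                         cv=5, scoring='neg_mean_squared_error')
print('Mean MSE across folds: ' + '%.2f' % -np.mean(scores))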

Select a Model

To select a model, we look at the average $R^2$ score and MSE across all folds.


In [ ]:
print('One Feature:\nMSE: ' + '%.2f' % np.mean(mse_1))
print('R^2: ' + '%.2f' % np.mean(r2score_1))

print('\nAll Features:\nMSE: ' + '%.2f' % np.mean(mse_all))
print('R^2: ' + '%.2f' % np.mean(r2score_all))

Final Evaluation

The model using all features has a higher $R^2$ score and lower MSE, so we select it as our best model. Now we can evaluate it on our test set and get our final MSE and $R^2$ score values.


In [ ]:
regr_all.fit(X_train, y_train)
predictions = regr_all.predict(X_test)

mse = mean_squared_error(y_test, predictions)
r2score = r2_score(y_test, predictions)

print('The final MSE is ' + '%.2f' % mse)
print('The final R^2 score is ' + '%.2f' % r2score)

Sidenote: Randomness and Results

Every time you run this notebook, you will get slightly different results. Why? Because the data are randomly divided among the training, testing, and validation sets. Running the code again creates a different division of the data, making the results slightly different. However, the overall outcome should remain consistent, with approximately the same values. If you get drastically different results when running an analysis multiple times, it suggests a problem with your model, or that you need more data.

If it's important that you get the exact same results every time you run the code, you can specify a random state in the random_state argument of train_test_split() and KFold(). Note that for KFold, random_state only has an effect when shuffle=True.


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
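
For KFold, that looks like this:


In [ ]:
# shuffle=True is required for random_state to have any effect in KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)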
