Straight line fitting is a linear regression problem - and an example of predictive learning.
A predictive model will have been "fitted" to the data when an assumed loss function has been minimized. A popular choice of minimized loss function is the following, corresponding to the method of "ordinary least squares":
$$ \text{min}_{w, b} \sum_i || w^\mathsf{T}x_i + b - y_i||^2 $$Let's fit some test data with a straight line using the SciKit-Learn
library, and see how accurately we can make our predictions.
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import datasets, linear_model
In [2]:
# Code source: Jaques Grobler
# License: BSD 3 clause
# Load the boston dataset, and focus on just one attribute: LSTAT (attribute 13)
boston = datasets.load_boston()
X = np.atleast_2d(boston.data[:,12]).T
y = np.atleast_2d(boston.target).T
# Make a training/test split:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
labels = ['LSTAT','MEDV']
print X_train.shape, X_test.shape
print y_train.shape, y_test.shape
In [3]:
plt.scatter(X_test, y_test, color='black')
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
Out[3]:
In [4]:
# Create linear regression object:
model = linear_model.LinearRegression()
# Train the model using the training sets:
model.fit(X_train, y_train)
# The coefficients:
print("Coefficients:", model.coef_)
In [5]:
# The mean square prediction error:
print("Training data: MSE = %.2f"
% np.mean((model.predict(X_train) - y_train) ** 2))
print("Test data: MSE = %.2f"
% np.mean((model.predict(X_test) - y_test) ** 2))
print ""
# Explained variance score: 1 is perfect prediction:
print('Training data: Variance score = %.2f' % model.score(X_train, y_train))
print('Test data: Variance score = %.2f' % model.score(X_test, y_test))
In [6]:
# Plot outputs:
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, model.predict(X_test), color='blue', linewidth=3)
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
Out[6]:
In [7]:
from sklearn.cross_validation import cross_val_score
model = linear_model.LinearRegression()
cross_val_score(model, X, y, cv=5, scoring='r2')
Out[7]:
How we design the folds matters: we want each subset of the data to be a fair sample of the whole.
In this problem, we want to select the LSTAT values randomly (rather than sequentially).
In [8]:
from sklearn.cross_validation import ShuffleSplit
shuffle_split = ShuffleSplit(len(X), 20, test_size=0.4)
cross_val_score(model, X, y, cv=shuffle_split)
Out[8]:
In [9]:
MSE = cross_val_score(model, X, y, cv=shuffle_split, scoring='mean_squared_error')
GE,errGE = np.mean(MSE),np.std(MSE)
print "Generalization error:",GE,"+/-",errGE
In [10]:
R2 = cross_val_score(model, X, y, cv=shuffle_split, scoring='r2')
meanR2,errR2 = np.mean(R2),np.std(R2)
print "Mean score:",meanR2,"+/-",errR2
In [11]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
XX = poly.fit_transform(X)
XX
Out[11]:
In [12]:
polymodel = linear_model.LinearRegression()
poly_split = ShuffleSplit(len(XX), 20, test_size=0.4)
R2 = cross_val_score(polymodel, XX, y, cv=poly_split, scoring='r2')
meanR2,errR2 = np.mean(R2),np.std(R2)
print "Mean score:",meanR2,"+/-",errR2
In [13]:
# Make predictions with models:
from sklearn.cross_validation import cross_val_predict
y_straightline = cross_val_predict(model, X, y, cv=10)
y_polynomial = cross_val_predict(polymodel, XX, y, cv=10)
In [14]:
# Plot outputs:
plt.scatter(X, y, color='black', alpha=0.1)
plt.plot(X, y_straightline, color='blue', linewidth=2, alpha=0.4)
plt.plot(X, y_polynomial, color='green', linewidth=2, alpha=0.4)
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
Out[14]:
In [15]:
# Define a linear model:
supermodel = linear_model.LinearRegression()
# Use all the data, and set up a 20-fold cross validation run:
super_split = ShuffleSplit(len(boston.data), 20, test_size=0.4)
# Carry out the cross-validation of the model, training, testing and reporting:
R2 = cross_val_score(supermodel, boston.data, boston.target, cv=super_split, scoring='r2')
# Compute our model prediction accuracy score, for comparison with other models:
meanR2,errR2 = np.mean(R2),np.std(R2)
print "Mean score:",meanR2,"+/-",errR2