In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Regression

In regression we try to predict a continuous output variable. This is most easily visualized in one dimension. We will start with a very simple toy example: a dataset created from a sine curve with some noise added:


In [3]:
x = np.linspace(-3, 3, 100)

In [4]:
y = np.sin(4 * x) + x + np.random.uniform(size=len(x))

In [5]:
plt.plot(x, y, 'o')


Out[5]:
[<matplotlib.lines.Line2D at 0x7fac00b61438>]

Linear Regression

One of the simplest models is again a linear one, which simply tries to predict the data as lying on a line. One way to find such a line is LinearRegression (also known as ordinary least squares, OLS). The interface for LinearRegression is exactly the same as for the classifiers before, except that y now contains float values instead of class labels.
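Concretely, ordinary least squares picks the slope w and intercept b that minimize the sum of squared errors over the training data:

$$\min_{w,\,b} \; \sum_i \big(y_i - (w\,x_i + b)\big)^2$$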

To apply a scikit-learn model, X needs to be a 2d array of shape (n_samples, n_features):


In [6]:
print(x.shape)
X = x[:, np.newaxis]
print(X.shape)


(100,)
(100, 1)
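As an aside, x.reshape(-1, 1) produces the same column-shaped array as x[:, np.newaxis], if you prefer that spelling:

X = x.reshape(-1, 1)   # also shape (100, 1)
print(X.shape)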

We again split our data into a training and a test set:


In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)


(75, 1) (25, 1)

Then we can build our regression model:


In [14]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


Out[14]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

And predict. First let us try the training set:


In [15]:
y_pred_train = regressor.predict(X_train)

In [16]:
plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
x = np.linspace(-3, 3)
plt.plot(x, regressor.coef_*x + regressor.intercept_, '-r')
plt.legend(loc='best')


Out[16]:
<matplotlib.legend.Legend at 0x7fabf772ae80>

The line is able to capture the general slope of the data, but not many details.
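To see the fitted line's parameters directly, we can inspect the learned attributes of the regressor fitted above (coef_ holds the slope for our single feature, intercept_ the offset):

print("slope:     ", regressor.coef_[0])
print("intercept: ", regressor.intercept_)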

Let's try the test set:


In [17]:
y_pred_test = regressor.predict(X_test)

In [18]:
plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.legend(loc='best')


Out[18]:
<matplotlib.legend.Legend at 0x7fabf9a836d8>

Again, scikit-learn provides an easy way to evaluate the prediction quantitatively using the score method. For regression tasks, this is the R^2 score (the coefficient of determination). Another popular measure is the mean squared error.


In [19]:
print(regressor.score(X_test, y_test))
print(regressor.score(X_train, y_train))


0.746479700274
0.828404975002
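For reference, the same quantities can also be computed explicitly via sklearn.metrics; a minimal sketch using the predictions from above (r2_score reproduces what the score method returns, and mean_squared_error is the alternative measure mentioned earlier):

from sklearn.metrics import r2_score, mean_squared_error

print(r2_score(y_test, y_pred_test))            # same value as regressor.score(X_test, y_test)
print(mean_squared_error(y_test, y_pred_test))  # mean squared error on the test set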

KNeighborsRegression

As for classification, we can also use a neighbor-based method for regression: we simply take the target value of the nearest training point, or average the target values of several nearest points. This method is less popular for regression than for classification, but it is still a good baseline.
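To make the averaging explicit, here is a minimal sketch in plain numpy of predicting one query point by averaging the targets of its three nearest training points, which is what KNeighborsRegressor with uniform weights does (the helper knn_predict_one is purely illustrative, not part of scikit-learn):

def knn_predict_one(x_query, X_train, y_train, k=3):
    # distances from the query to every training point (single feature)
    distances = np.abs(X_train.ravel() - x_query)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # average of their target values
    return y_train[nearest].mean()

print(knn_predict_one(0.5, X_train, y_train))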


In [11]:
from sklearn.neighbors import KNeighborsRegressor
kneighbor_regression = KNeighborsRegressor(n_neighbors=3)
kneighbor_regression.fit(X_train, y_train)


Out[11]:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=3, p=2,
          weights='uniform')

Again, let us look at the behavior on training and test set:


In [12]:
y_pred_train = kneighbor_regression.predict(X_train)

plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
X_ = np.linspace(-3, 3, 1000)[:, np.newaxis]
y_ = kneighbor_regression.predict(X_)
plt.plot(X_, y_, 'r-')
plt.legend(loc='best')


Out[12]:
<matplotlib.legend.Legend at 0x7fabf7a31e80>

On the training set, we do a nearly perfect job: each point is among its own nearest neighbors, so the prediction follows the data very closely (with n_neighbors=3, the two other neighbors are averaged in, which is why the fit is not exactly perfect).


In [13]:
y_pred_test = kneighbor_regression.predict(X_test)

plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.legend(loc='best')


Out[13]:
<matplotlib.legend.Legend at 0x7fabf77a9518>

On the test set, we also do a better job of capturing the variation, but the estimates look much noisier than before. Let us look at the R^2 score:


In [16]:
print(kneighbor_regression.score(X_test, y_test))
print(kneighbor_regression.score(X_train, y_train))


0.960199988598
0.977174634189

Much better than before! Here, the linear model was not a good fit for our problem.

Exercise

Compare KNeighborsRegressor and LinearRegression on the Boston housing dataset. You can load the dataset using sklearn.datasets.load_boston.


In [17]:
from sklearn.datasets import load_boston
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

In [18]:
# Linear regression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print('train score: ', regressor.score(X_train, y_train))
print('test score:  ', regressor.score(X_test, y_test))


train score:  0.769744837056
test score:   0.635362078667

In [75]:
# KNN
kn_regressor = KNeighborsRegressor(n_neighbors=5)
kn_regressor.fit(X_train, y_train)
print('train score: ', kn_regressor.score(X_train, y_train))
print('test score:  ', kn_regressor.score(X_test, y_test))


train score:  0.706199043917
test score:   0.461638092461
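The k-nearest-neighbors score here depends strongly on feature scaling: the Boston features have very different ranges, so the distance computation is dominated by the largest ones. As one possible follow-up (a sketch, not part of the exercise solution), the features could be standardized before fitting:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
kn_scaled = KNeighborsRegressor(n_neighbors=5)
kn_scaled.fit(scaler.transform(X_train), y_train)
print('train score: ', kn_scaled.score(scaler.transform(X_train), y_train))
print('test score:  ', kn_scaled.score(scaler.transform(X_test), y_test))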
