In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Regression

In regression we try to predict a continuous output variable. This is most easily visualized in one dimension. We will start with a very simple toy example: a dataset created from a sine curve with some noise:


In [ ]:
x = np.linspace(-3, 3, 100)
print(x)

In [ ]:
y = np.sin(4 * x) + x + np.random.uniform(size=len(x))

In [ ]:
plt.plot(x, y, 'o')

Linear Regression

One of the simplest models is again a linear one, which simply tries to predict the data as lying on a line. One way to find such a line is LinearRegression (also known as ordinary least squares). The interface for LinearRegression is exactly the same as for the classifiers before, except that y now contains float values instead of classes.

To apply a scikit-learn model, we need X to be a 2D array:


In [ ]:
print(x.shape)
X = x[:, np.newaxis]
print(X.shape)
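
An equivalent way to get the 2D shape, if you prefer it, is reshape (shown only as an alternative to np.newaxis):


In [ ]:
# reshape(-1, 1) produces the same (100, 1) shape as x[:, np.newaxis]
print(x.reshape(-1, 1).shape)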

We split our data into a training and a test set again:


In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

Then we can build our regression model:


In [ ]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
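
The fitted model is simply a line, described by a slope and an intercept. As a quick check (coef_ and intercept_ are attributes that fit sets on a LinearRegression), we can inspect the learned parameters:


In [ ]:
# slope and intercept found by ordinary least squares
print("slope:", regressor.coef_)
print("intercept:", regressor.intercept_)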

And predict. First let us try the training set:


In [ ]:
y_pred_train = regressor.predict(X_train)

In [ ]:
plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
plt.legend(loc='best')

The line is able to capture the general slope of the data, but not much of the detail.

Let's try the test set:


In [ ]:
y_pred_test = regressor.predict(X_test)

In [ ]:
plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.legend(loc='best')

Again, scikit-learn provides an easy way to evaluate the prediction quantitatively using the score method. For regression tasks, this is the R² score (the coefficient of determination). Another popular metric is the mean squared error.


In [ ]:
regressor.score(X_test, y_test)
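
As a short sketch of the other metric mentioned above, the mean squared error can be computed with sklearn.metrics.mean_squared_error, here on the test predictions from above:


In [ ]:
from sklearn.metrics import mean_squared_error

# average squared difference between true targets and predictions
mean_squared_error(y_test, y_pred_test)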

KNeighborsRegression

As for classification, we can also use a neighbor-based method for regression. We can simply take the output of the nearest point, or we can average several nearest points. This method is less popular for regression than for classification, but it still makes a good baseline.


In [ ]:
from sklearn.neighbors import KNeighborsRegressor
kneighbor_regression = KNeighborsRegressor(n_neighbors=1)
kneighbor_regression.fit(X_train, y_train)

Again, let us look at the behavior on the training and test sets:


In [ ]:
y_pred_train = kneighbor_regression.predict(X_train)

plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
plt.legend(loc='best')

On the training set, we do a perfect job: each point is its own nearest neighbor!
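
We can confirm this with the score method; on the training data the R² score should be exactly 1.0, since every prediction coincides with its own target:


In [ ]:
# perfect fit on the training set: each point predicts itself
kneighbor_regression.score(X_train, y_train)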


In [ ]:
y_pred_test = kneighbor_regression.predict(X_test)

plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.legend(loc='best')

On the test set, we do a better job of capturing the variation than the linear model, but our estimates look much messier than before. Let us look at the R² score:


In [ ]:
kneighbor_regression.score(X_test, y_test)

Much better than before! Here, the linear model was not a good fit for our problem.
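
As mentioned above, instead of using only the single nearest point we can also average several nearest points. A quick sketch of that variant, reusing the same train/test split (the choice of n_neighbors=3 is arbitrary):


In [ ]:
# average the targets of the 3 nearest training points instead of just 1
kneighbor_regression_3 = KNeighborsRegressor(n_neighbors=3)
kneighbor_regression_3.fit(X_train, y_train)
kneighbor_regression_3.score(X_test, y_test)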

Exercise

Compare the KNeighborsRegressor and LinearRegression on the Boston housing dataset. You can load the dataset using sklearn.datasets.load_boston.


In [ ]:
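
One possible starting point for the exercise (only a sketch, reusing train_test_split and the estimators imported above; note that load_boston is deprecated and removed in recent scikit-learn versions, so this assumes a version where it is still available):


In [ ]:
from sklearn.datasets import load_boston

# load the features and target, then reuse the same workflow as above
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)

for model in [LinearRegression(), KNeighborsRegressor(n_neighbors=1)]:
    model.fit(X_train, y_train)
    print(model.__class__.__name__, model.score(X_test, y_test))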