In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
In [3]:
x = np.linspace(-3, 3, 100)
In [4]:
y = np.sin(4 * x) + x + np.random.uniform(size=len(x))
In [5]:
plt.plot(x, y, 'o')
Out[5]:
One of the simplest models is again a linear one, which tries to predict the data as lying on a line. One way to find such a line is LinearRegression (also known as ordinary least squares).
The interface of LinearRegression is exactly the same as for the classifiers before, except that y now contains float values instead of class labels.
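As a rough sketch of what ordinary least squares computes, the slope and intercept of the best-fitting line can also be obtained directly with NumPy's least-squares solver (this only illustrates the idea, not how scikit-learn implements it):
In [ ]:
# Sketch: ordinary least squares by hand with NumPy.
# Stack a column of ones so the solver also estimates the intercept.
A = np.vstack([x, np.ones_like(x)]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
print("slope:", slope, "intercept:", intercept)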
To apply a scikit-learn model, we need X to be a 2d array:
In [6]:
print(x.shape)
X = x[:, np.newaxis]
print(X.shape)
We split our data into a training and a test set again:
In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)
Then we can build our regression model:
In [14]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[14]:
And predict. First let us try the training set:
In [15]:
y_pred_train = regressor.predict(X_train)
In [16]:
plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
x = np.linspace(-3, 3)
plt.plot(x, regressor.coef_*x + regressor.intercept_, '-r')
plt.legend(loc='best')
Out[16]:
The line is able to capture the general slope of the data, but not many details.
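The fitted line is fully described by the learned parameters; regressor.coef_ and regressor.intercept_ (already used for the red line above) can be inspected directly:
In [ ]:
# Inspect the learned parameters that define the red line above.
print("slope:", regressor.coef_)        # a single coefficient, since X has one feature
print("intercept:", regressor.intercept_)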
Let's try the test set:
In [17]:
y_pred_test = regressor.predict(X_test)
In [18]:
plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.legend(loc='best')
Out[18]:
Again, scikit-learn provides an easy way to evaluate the prediction quantitatively using the score
method. For regression tasks, this is the R2 score. Another popular metric is the mean squared error.
In [19]:
print(regressor.score(X_test, y_test))
print(regressor.score(X_train, y_train))
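The same numbers can also be computed explicitly via sklearn.metrics; as a small sketch, r2_score reproduces what the score method returns, and mean_squared_error gives the other metric mentioned above.
In [ ]:
# Sketch: compute the metrics explicitly instead of via the score method.
from sklearn.metrics import r2_score, mean_squared_error
y_pred_test = regressor.predict(X_test)
print(r2_score(y_test, y_pred_test))            # same value as regressor.score(X_test, y_test)
print(mean_squared_error(y_test, y_pred_test))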
As a comparison, let us fit KNeighborsRegressor, a non-parametric model that predicts the average target value of the nearest training points.
In [11]:
from sklearn.neighbors import KNeighborsRegressor
kneighbor_regression = KNeighborsRegressor(n_neighbors=3)
kneighbor_regression.fit(X_train, y_train)
Out[11]:
Again, let us look at the behavior on training and test set:
In [12]:
y_pred_train = kneighbor_regression.predict(X_train)
plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
X_ = np.linspace(-3, 3, 1000)[:, np.newaxis]
y_ = kneighbor_regression.predict(X_)
plt.plot(X_, y_, 'r-')
plt.legend(loc='best')
Out[12]:
On the training set, we do a very good job: each training point is among its own nearest neighbors, so the prediction follows the data closely.
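As a small sketch, with a single neighbor the training fit becomes exact, because each training point is then predicted by its own target value:
In [ ]:
# Sketch: with n_neighbors=1 every training point predicts its own target,
# so the training R2 score is exactly 1.0.
knn_one = KNeighborsRegressor(n_neighbors=1)
knn_one.fit(X_train, y_train)
print(knn_one.score(X_train, y_train))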
In [13]:
y_pred_test = kneighbor_regression.predict(X_test)
plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.legend(loc='best')
Out[13]:
On the test set, we also do a better job of capturing the variation, but our estimates look much noisier than before. Let us look at the R2 score:
In [16]:
print(kneighbor_regression.score(X_test, y_test))
print(kneighbor_regression.score(X_train, y_train))
Much better than before! Here, the linear model was not a good fit for our problem.
Finally, let us compare both models on the Boston housing dataset:
In [17]:
from sklearn.datasets import load_boston
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
In [18]:
# Linear regression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
print('train score: ', regressor.score(X_train, y_train))
print('test score: ', regressor.score(X_test, y_test))
In [75]:
# KNN
kn_regressor = KNeighborsRegressor(n_neighbors=5)
kn_regressor.fit(X_train, y_train)
print('train score: ', kn_regressor.score(X_train, y_train))
print('test score: ', kn_regressor.score(X_test, y_test))
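As a complement to the R2 scores, the mean squared error mentioned earlier can be compared for both fitted models on the test set (a small sketch reusing the regressors above):
In [ ]:
# Sketch: compare test-set mean squared error for the two models fitted above.
from sklearn.metrics import mean_squared_error
print('linear regression MSE:', mean_squared_error(y_test, regressor.predict(X_test)))
print('knn MSE:', mean_squared_error(y_test, kn_regressor.predict(X_test)))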