In [2]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
As we did during the classification lab, we generate some data points uniformly at random using np.random.rand. In this example, we generate a set of points $x \in [0, 1]$ and their images $y$ under the function $f$ defined by
$$ y = f(x) = \cos \left( \dfrac{3\pi}{2} x \right)$$
which will be modeled by my_function. We also add some noise following a Gaussian distribution (np.random.randn) centered at 0.
In [3]:
import numpy as np
np.random.seed(0)
n_samples = 30
def my_function(X):
    return np.cos(1.5 * np.pi * X)
X = np.random.rand(n_samples)
y = my_function(X) + np.random.randn(n_samples) * 0.1
In [4]:
X.shape
Out[4]:
Let us also generate the test points needed to plot the result of our prediction. In this case, we generate 100 points evenly spaced between 0 and 1 using np.linspace.
In [5]:
X_test = np.linspace(0, 1, 100)
In [6]:
X_test
Out[6]:
With this, we can plot $f$ and the generated points.
In [7]:
import matplotlib.pyplot as plt
plt.plot(X_test, my_function(X_test), label="True function")
plt.scatter(X, y, label="Samples")
plt.legend(loc="best")
Out[7]:
Note that the samples are not on the function curve because of the additive noise.
In [8]:
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
In [9]:
my_regression = linear_regression.fit(X[:, np.newaxis], y)
In [10]:
X_test = np.linspace(0, 1, 100)
plt.plot(X_test, my_regression.predict(X_test[:, np.newaxis]), label="Model")
# my_regression.predict(X_test[:, np.newaxis]) gives y_test, the prediction for X_test
plt.plot(X_test, my_function(X_test), label="True function")
plt.scatter(X, y, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
Out[10]:
The previous example is a simple linear regression and, as we can see, the trained model fits the samples (and $f$) quite poorly. As we have seen during the lectures, we can use polynomial features to obtain a more accurate model.
Formally, this consists in defining a polynomial feature map $\phi$ of degree $d$ such that, for each $x \in [0, 1]$, we have
$$\phi(x) = \left[1, x, x^2, \dots, x^d\right]$$
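Before turning to sklearn, here is a minimal sketch (not part of the original lab) of this feature map applied by hand to a single point, just to make $\phi$ concrete:
In [ ]:
import numpy as np

def phi(x, d):
    """Return [1, x, x**2, ..., x**d] for a scalar x."""
    return np.array([x ** k for k in range(d + 1)])

phi(0.5, 3)  # array([1.   , 0.5  , 0.25 , 0.125])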
In sklearn, this transformation is provided by the sklearn.preprocessing.PolynomialFeatures class:
In [11]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=1)
polynomial_X = polynomial_features.fit_transform(X[:, np.newaxis])
Note that here we use np.newaxis to turn X into a column vector, i.e. an array of single-element rows. Applying polynomial_features.fit_transform directly to X will not lead to the desired result (try it by yourself).
In [12]:
X[:, np.newaxis]
Out[12]:
So far, we have only created features of degree 1, which won't change the data much. However, the shape of the polynomial features has changed:
In [13]:
polynomial_X.shape
Out[13]:
In [14]:
X[:, np.newaxis].shape
Out[14]:
Let's take a closer look at what is inside polynomial_X.
In [15]:
polynomial_X
Out[15]:
By default, PolynomialFeatures adds a constant column equal to 1. It corresponds to the intercept (or degree-0 term) of the model we are training.
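As an aside, if the constant column is not wanted (for example because the downstream estimator fits its own intercept), it can be dropped with the include_bias parameter:
In [ ]:
# Without the bias column, only the columns [x, x^2] remain for each sample.
PolynomialFeatures(degree=2, include_bias=False).fit_transform(X[:2, np.newaxis])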
Let's now define another linear regression and train it on the augmented features $\phi(x) = \left[1, x, x^2\right]$.
In [16]:
linear_regression = LinearRegression()
polynomial_features = PolynomialFeatures(degree=2)
polynomial_X = polynomial_features.fit_transform(X[:, np.newaxis])
my_regression = linear_regression.fit(polynomial_X, y)
And display the result:
In [17]:
X_test = np.linspace(0, 1, 100)
polynomial_X_test = polynomial_features.transform(X_test[:, np.newaxis])
plt.plot(X_test, my_regression.predict(polynomial_X_test), label="Model")
plt.plot(X_test, my_function(X_test), label="True function")
plt.scatter(X, y, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
Out[17]:
(Note that we need to apply $\phi$ to X_test because my_regression has been trained on polynomial features!)
As we can see, degree-2 polynomial features give a much better fit than a simple linear regression. In the next cell, we compare polynomial features of degrees 1, 4 and 15, chaining the feature map and the linear regression in a Pipeline and estimating the mean squared error by cross-validation.
In [18]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
degrees = [1, 4, 15]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i])
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, my_function(X_test), label="True function")
    plt.scatter(X, y, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()
The linear regression we have been working with solves the following problem: $$\min_w \sum_i \left(w^T \phi\left(x^{(i)}\right) - y^{(i)}\right)^2$$
We can add a regularization term to the optimization problem in order to penalise large coefficients in $w$ and, as a consequence, avoid overfitting: $$\min_w \sum_i \left(w^T \phi\left(x^{(i)}\right) - y^{(i)}\right)^2 + \alpha ||w||_2^2$$
Note that this adds an extra parameter, $\alpha$, to the model. As we will see in the following, this parameter has a big impact on how much we overfit or underfit the training data.
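As an aside (not part of the original lab), this ridge problem has the closed-form solution $w = \left(\Phi^T \Phi + \alpha I\right)^{-1} \Phi^T y$, where $\Phi$ is the matrix of polynomial features. The small sketch below checks it against sklearn's Ridge on degree-3 features; fit_intercept=False is used because the bias column of ones is already part of the features:
In [ ]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

alpha = 0.1
Phi = PolynomialFeatures(degree=3).fit_transform(X[:, np.newaxis])
# Closed-form ridge solution: solve (Phi^T Phi + alpha I) w = Phi^T y
w_closed_form = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(Phi.shape[1]), Phi.T @ y)
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(Phi, y).coef_
print(w_closed_form)
print(w_sklearn)  # the two should agree up to numerical precision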
In [19]:
from sklearn.linear_model import Ridge
In [20]:
polynomial_features = PolynomialFeatures(degree=15)
polynomial_X = polynomial_features.fit_transform(X[:, np.newaxis])
ridge_regression = Ridge(alpha = .1)
my_regression = ridge_regression.fit(polynomial_X, y)
In [21]:
X_test = np.linspace(0, 1, 100)
X_test_polynomial = polynomial_features.transform(X_test[:, np.newaxis])
plt.plot(X_test, my_regression.predict(X_test_polynomial), label="Model")
plt.plot(X_test, my_function(X_test), label="True function")
plt.scatter(X, y, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Ridge regression")
Out[21]:
Support vector regression (sklearn.svm.SVR in sklearn) is another way to train a regression model. The difference with the least-squares linear regression we have used previously lies in the loss function:
$$\min_w \sum_i \ell \left(w^T \phi\left(x^{(i)}\right), y^{(i)}\right)$$
For ordinary least squares, we had
$$\ell \left(w^T \phi\left(x^{(i)}\right), y^{(i)}\right) = \left(w^T \phi\left(x^{(i)}\right) - y^{(i)}\right)^2.$$ For SVR, we have
$$\ell \left(w^T \phi\left(x^{(i)}\right), y^{(i)}\right) =
\begin{cases}
0, & \text{if } -\epsilon \leq w^T \phi\left(x^{(i)}\right) - y^{(i)} \leq +\epsilon\\
\left| w^T \phi\left(x^{(i)}\right) - y^{(i)} \right| - \epsilon, & \text{otherwise.}
\end{cases}
$$
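To make this loss concrete, here is a small sketch (not from the original lab) of the $\epsilon$-insensitive loss as a NumPy function:
In [ ]:
import numpy as np

def epsilon_insensitive_loss(prediction, target, epsilon):
    """Zero inside the epsilon-tube, absolute error minus epsilon outside."""
    return np.maximum(np.abs(prediction - target) - epsilon, 0.0)

# An error of 0.05 is ignored, an error of 0.3 costs 0.3 - 0.1 = 0.2
epsilon_insensitive_loss(np.array([0.05, 0.3]), np.array([0.0, 0.0]), epsilon=0.1)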
Having this in mind, we can apply SVR by choosing an appropriate value for $\epsilon$.
In [22]:
from sklearn.svm import SVR
In [23]:
epsilon = .1
sv_regression = SVR(epsilon = epsilon)
my_svr = sv_regression.fit(polynomial_X, y)
Note that choosing $\epsilon$ with an order of magnitude close to that of the estimated noise makes sense, because we would then not penalize errors smaller than the noise.
In [24]:
plt.plot(X_test, my_svr.predict(X_test_polynomial), label="Model")
plt.plot(X_test, my_function(X_test), label="True function")
plt.scatter(X, y, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Support vector regression")
Out[24]:
As for classification, sklearn provides standard datasets to evaluate models. These datasets can be found here: http://scikit-learn.org/stable/datasets/. Two datasets are suited for regression problems; here, we use the Boston house prices dataset:
In [25]:
from sklearn import datasets
boston = datasets.load_boston()
In [26]:
print(boston.DESCR)
In [27]:
boston.feature_names
Out[27]:
In [28]:
print(boston.data.shape, boston.target.shape)
In [29]:
polynomial_features = PolynomialFeatures(degree=3)
polynomial_X = polynomial_features.fit_transform(boston.data)
alpha = .1
ridge_regression = Ridge(alpha = alpha)
my_regression = ridge_regression.fit(polynomial_X, boston.target)
In [30]:
predicted_prices = my_regression.predict(polynomial_X)
In [31]:
predicted_prices
Out[31]:
In [32]:
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(polynomial_X, boston.target, test_size = 0.5)
In [33]:
print(features_train.shape, features_test.shape, labels_train.shape, labels_test.shape)
In [34]:
from sklearn import metrics
In [35]:
predicted_labels_train = my_regression.predict(features_train)
print("MSE on the train set:", metrics.mean_squared_error(labels_train, predicted_labels_train))
predicted_labels_test = my_regression.predict(features_test)
print("MSE on the test set:", metrics.mean_squared_error(labels_test, predicted_labels_test))
Feel free to tune the parameters manually and see which one gives the best results.