Always start by importing everything in a separate code block. That way, if you forgot something, it's easy to add it and re-run that cell without re-running anything else.
In [1]:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import mltools as ml
np.random.seed(0)
%matplotlib inline
In [2]:
# First we create the "fake data" with xs from 0 to 10 and Y = X + 2
X = np.linspace(0, 10, 50)
Y = np.copy(X) + 2
In [3]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(X, Y, s=80, color='blue', alpha=0.75)
ax.set_xlim(-0.2, 10.2)
ax.set_ylim(1.8, 12.2)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
plt.show()
Now let's split it into train and test data.
In [4]:
X, Y = ml.shuffleData(X, Y)
This error is a result of assumptions in the code: all the mltools package code assumes that X is a 2D array, not 1D. This is a common assumption, and more popular packages make it too.
There are many ways to convert from 1D to 2D. The most common is np.atleast_2d.
In [5]:
_ = np.atleast_2d(X).T
Another option is to use reshape. Look at the documentation to see what the -1 is all about (it tells NumPy to infer that dimension from the array's size).
In [6]:
X = X.reshape(-1, 1)
Notice that I transposed the result of the atleast_2d call. That's because it is common to think of X as having the points in the rows and the features (dimensions) in the columns. Please play around with these methods to make sure you understand what they're doing.
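For instance, here is a quick sanity check that both approaches give the same shape (using the 1D array of 50 points from above):
X1 = np.linspace(0, 10, 50)           # 1D array, shape (50,)
print(np.atleast_2d(X1).T.shape)      # (50, 1): rows are points, one feature column
print(X1.reshape(-1, 1).shape)        # (50, 1): the -1 lets NumPy infer the number of rows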
Now let's continue from where we stopped.
In [8]:
X, Y = ml.shuffleData(X, Y)
Xtr, Xte, Ytr, Yte = ml.splitData(X, Y, 0.75)
In [9]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
ax.set_xlim(-0.2, 10.2)
ax.set_ylim(1.8, 12.2)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=4)
plt.show()
Now let's see how we can call the linear regression.
In [10]:
lr = ml.linear.linearRegress(Xtr, Ytr)
Boom, that's it. But you should go into the code and make sure you understand how it works. You will be asked on exams to derive linear regression.
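If you want to check your understanding, here is a minimal sketch of the closed-form least-squares solution via the normal equations (just an illustration; not necessarily how mltools implements linearRegress internally):
# Prepend a column of ones so the model can learn the constant (intercept) term.
Xtr1 = np.hstack([np.ones((Xtr.shape[0], 1)), Xtr])
# Normal equations: theta = (X^T X)^{-1} X^T y  (np.linalg.lstsq is the more stable choice in practice).
theta = np.linalg.solve(Xtr1.T.dot(Xtr1), Xtr1.T.dot(Ytr))
print(theta)  # should be close to [2, 1] here, since the data is exactly Y = 2 + 1 * X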
In [11]:
# We start by creating a set of xs over the range we want to predict on.
xs = np.linspace(0, 10, 200)
# Converting to the right shape
xs = np.atleast_2d(xs).T
# And now the prediction
ys = lr.predict(xs)
In [12]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
# Also plotting the regression line
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')
ax.set_xlim(-0.2, 10.2)
ax.set_ylim(1.8, 12.2)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=4)
plt.show()
We can also print the learned regression object. This will show us the coefficients for each feature. Notice that the regression model added a constant for us.
In [13]:
print(lr)
The print above shows that the linear regression learned the function Y = 2 + 1 * X.
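Since the fake data is exactly Y = X + 2 with no noise, the fit should be essentially exact. A quick sanity check (the exact output format depends on mltools, but the values should be very close to 2 and 3):
print(lr.predict(np.array([[0.], [1.]])))  # predictions at x = 0 and x = 1; expect ~2 and ~3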
In [14]:
path_to_file = 'poly_data.txt'
data = np.genfromtxt(path_to_file, delimiter='\t') # Read data from file
In [15]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(data[:, 0], data[:, 1], s=80, color='blue', alpha=0.75)
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
plt.show()
Now let's repeat everything on the real data.
In [16]:
X, Y = np.atleast_2d(data[:, 0]).T, data[:, 1]
X, Y = ml.shuffleData(X, Y)
Xtr, Xte, Ytr, Yte = ml.splitData(X, Y, 0.75)
lr = ml.linear.linearRegress(Xtr, Ytr)
# Make sure you use the correct space.
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T
ys = lr.predict(xs)
In [17]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
# Also plotting the regression line
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)
plt.show()
Meh, the predictions don't look that great. Why is that?
(Because we're fitting a straight line Y = aX + b, while it's clear that this data comes from a more complex model.)
So let's fit a more complex model. For that we can use the ml.transforms.fpoly method, which will transform the features for us.
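To build intuition for what the transform does, here is a rough hand-written equivalent (this assumes fpoly with bias=False returns the columns x, x^2, ..., x^degree; check the mltools source to confirm the exact behavior):
def fpoly_sketch(X, degree):
    # X is (m, 1); stack the powers x^1 ... x^degree as columns -> shape (m, degree)
    return np.hstack([X ** k for k in range(1, degree + 1)])
print(fpoly_sketch(np.array([[2.0]]), 3))  # [[2. 4. 8.]]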
In [19]:
degree = 12
XtrP = ml.transforms.fpoly(Xtr, degree, False)
lr = ml.linear.linearRegress(XtrP, Ytr)
# Make sure you use the correct space.
xs = np.linspace(0, 4.2, 200)
xs = np.atleast_2d(xs).T
# Notice that we have to transform the predicting xs too.
xsP = ml.transforms.fpoly(xs, degree, False)
ys = lr.predict(xsP)
In [20]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
# Also plotting the regression line. Note that we plot against xs, not xsP.
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=30, loc=0)
plt.show()
Feel free to play around with different degrees and see the differences. You should!
In [21]:
def MSE(y_true, y_hat):
    """Mock MSE method.
    You'll have to fill it in yourself with the true way of computing the MSE.
    """
    return np.random.rand() * 1000
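For reference, the real mean squared error is just the average of the squared residuals; once you replace the mock, it should look roughly like this:
def MSE_real(y_true, y_hat):
    """Mean squared error: average of the squared residuals."""
    y_true, y_hat = np.ravel(y_true), np.ravel(y_hat)  # guard against (m,) vs (m, 1) shape mismatches
    return np.mean((y_true - y_hat) ** 2)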
In [22]:
# Predicting on the test data - DO NOT FORGET TO TRANSFORM Xte TOO!!!
XteP = ml.transforms.fpoly(Xte, degree, False)
YteHat = lr.predict(XteP)
Adding the predicted Yhat to the plot. Notice that it sits on the regression line (as expected).
In [23]:
# Plotting the data
f, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.scatter(Xtr, Ytr, s=80, color='blue', alpha=0.75, label='Train')
ax.scatter(Xte, Yte, s=240, marker='*', color='red', alpha=0.75, label='Test')
ax.scatter(Xte, YteHat, s=80, marker='D', color='forestgreen', alpha=0.75, label='Yhat')
# Also plotting the regression line. Note that we plot against xs, not xsP.
ax.plot(xs, ys, lw=3, color='black', alpha=0.75, label='Prediction')
ax.set_xlim(-0.2, 4.3)
ax.set_ylim(-13, 18)
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
# Controlling the size of the legend and the location.
ax.legend(fontsize=20, loc=0)
plt.show()
Computing the MSE for the different degrees.
In [24]:
degrees = np.array([2, 4, 6, 8, 10, 20])
mse_error = np.zeros(degrees.shape[0])
for i, degree in enumerate(degrees):
    XtrP = ml.transforms.fpoly(Xtr, degree, False)
    lr = ml.linear.linearRegress(XtrP, Ytr)
    XteP = ml.transforms.fpoly(Xte, degree, False)
    YteHat = lr.predict(XteP)
    mse_error[i] = MSE(Yte, YteHat)
In [25]:
f, ax = plt.subplots(1, 1, figsize=(10, 8))
# Plotting a line with markers where there's an actual x value.
ax.semilogy(degrees, mse_error, lw=4, marker='d', markersize=20, alpha=0.75, label='MSE ERROR')
ax.set_xlim(1.2, 20.5)
ax.set_ylim(30, 1100)
# Setting the X-ticks manually.
ax.set_xticks(np.arange(2, 21, 2))
ax.set_xticklabels(ax.get_xticks(), fontsize=25)
ax.set_yticklabels(ax.get_yticks(), fontsize=25)
ax.legend(fontsize=20, loc=0)
plt.show()
Let’s now imagine that we do not have access to the target values of the test data we held out in the previous problem, and we wanted to decide on the best polynomial degree.
Cross-validation works by creating many training/validation splits, called folds, and using all of these splits to assess the “out-of-sample” (validation) performance by averaging them.
In [26]:
nFolds = 4
In [27]:
f, ax = plt.subplots(2, 2, figsize=(20, 20))
ax = ax.flatten()
for iFold in range(nFolds):
    Xti, Xvi, Yti, Yvi = ml.crossValidate(Xtr, Ytr, nFolds, iFold)
    ax[iFold].scatter(Xti, Yti, s=80, color='blue', alpha=0.75, label='Train')
    ax[iFold].scatter(Xvi, Yvi, s=240, marker='*', color='red', alpha=0.75, label='Validation')
    ax[iFold].set_xlim(-0.2, 4.3)
    ax[iFold].set_ylim(-13, 18)
    ax[iFold].set_xticklabels(ax[iFold].get_xticks(), fontsize=25)
    ax[iFold].set_yticklabels(ax[iFold].get_yticks(), fontsize=25)
plt.show()
This part of the HW is pretty straightforward: you basically just have to repeat what we did so far in a loop, as sketched below.
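A minimal sketch of that pattern, reusing the same crossValidate, fpoly, and linearRegress calls from above together with your own completed MSE:
degrees = np.array([2, 4, 6, 8, 10, 20])
cv_error = np.zeros(degrees.shape[0])
for i, degree in enumerate(degrees):
    fold_errors = np.zeros(nFolds)
    for iFold in range(nFolds):
        Xti, Xvi, Yti, Yvi = ml.crossValidate(Xtr, Ytr, nFolds, iFold)
        XtiP = ml.transforms.fpoly(Xti, degree, False)
        lr = ml.linear.linearRegress(XtiP, Yti)
        XviP = ml.transforms.fpoly(Xvi, degree, False)
        fold_errors[iFold] = MSE(Yvi, lr.predict(XviP))
    # Average the validation error over the folds for this degree.
    cv_error[i] = np.mean(fold_errors)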