Work on this before the next lecture. We will talk about questions, comments, and solutions during the exercise after the second lecture.
Please do form study groups! When you do, make sure you can explain everything in your own words; do not simply copy and paste from others.
The solutions to many of these problems can probably be found with Google. Please don't do that; you will not learn much by copying and pasting from the internet.
If you want to get credit/examination on this course, please upload your work to your GitHub repository for this course before the next lecture starts. If you worked together with others, please add their names to the notebook so we can see who formed groups.
There are two objectives for this set of exercises:
- get you started using python, scikit-learn, matplotlib, and GitHub. You will be using them a lot during the course, so make sure you get a good foundation to build on.
- work through the steps of opening a new dataset, plotting the data, fitting a model to it, evaluating your model, and deciding on model complexity.
Install python, scikit-learn (v0.18), matplotlib, jupyter and git.
Instructions for doing so: https://github.com/wildtreetech/advanced-comp-2017/blob/master/install.md
Documentation and guides for the various tools:
Read up on git clone, git pull, git push, git add and git commit. Once you master these five commands you should be good for this course. There is a whole universe of complex things that git can do for you; don't worry about them for now. Once you feel comfortable with the basics you can always step it up later.
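For orientation, a typical round trip with these commands looks roughly like the following (the repository URL and file name are placeholders for your own, not the actual course repository):

git clone https://github.com/<your-username>/<your-course-repo>.git
cd <your-course-repo>
# ... edit the notebook ...
git add exercise-1.ipynb
git commit -m "Add my solutions for exercise 1"
git pull    # merge any changes that are already on GitHub
git push    # upload your commits to GitHub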
These are some useful default imports for plotting and numpy.
In [1]:
%config InlineBackend.figure_format='retina'
%matplotlib inline
import numpy as np
np.random.seed(123)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14
from sklearn.utils import check_random_state
In the lecture we used the nearest neighbour classifier to classify points from a toy dataset into either "red" or "blue" classes. We investigated how the performance changes as a function of model complexity and what this means for the performance of our classifier on unseen data. Instead of using a linear model as in the lecture, use a k-nearest neighbour model.
- Plot the training and testing score as a function of n_neighbors.
- Find the best setting of n_neighbors for this dataset.

Use make_blobs(n_samples=400, centers=23, random_state=42) to create a simple dataset and use the KNeighborsClassifier classifier to answer the above questions.
In [2]:
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
labels = ["b", "r"]
X, y = make_blobs(n_samples=400, centers=23, random_state=42)
y = np.take(labels, (y < 10))
In [3]:
# Your solution
In [4]:
# Tim's solution
plt.scatter(X[:, 0], X[:, 1], c=y, lw=0)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
from sklearn.model_selection import train_test_split, validation_curve
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=243)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print("Training score:", clf.score(X_train, y_train))
print("Testing score:", clf.score(X_test, y_test))
# The test score is generally lower (worse) than the training score.
# This makes sense as the classifier tries to minimize the loss when fitting to the
# training data, so the score evaluated on the training set is an optimistic
# estimate. The test data set was not used for fitting and can therefore be treated
# as "unseen" data; using it we obtain a fair estimate of the generalisation error
# of the classifier, as long as we do not start using the score from the testing
# data set to inform decisions about the hyper-parameters of the model.
In [5]:
from sklearn.model_selection import learning_curve
# As this kind of question is very common, scikit-learn
# provides a helper function to perform (nearly) this task.
# It is a nice shortcut compared to having to write
# the for-loop yourself.
# However, this is a slight cheat from Tim's side as it uses
# cross-validation to estimate the scores, which we
# will only meet in the third lecture.
sizes, train_scores, test_scores = learning_curve(KNeighborsClassifier(), X, y)
train_scores = np.mean(train_scores, axis=1)
test_scores = np.mean(test_scores, axis=1)
plt.plot(sizes, train_scores, label="train")
plt.plot(sizes, test_scores, label="test")
plt.xlabel("training set size")
plt.ylabel("score")
plt.legend(loc='best');
# Because learning_curve uses three-fold cross-validation by default we do not
# arrive at the full training set size of 400 samples; at most two thirds of
# the data can be used for training.
print(2 * X.shape[0] / 3)
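For comparison, here is a rough sketch of the same idea written as an explicit for-loop (this is an illustrative addition, not part of the original solution; it uses the single train/test split from above instead of cross-validation, so the curve will be noisier):

# Manual learning curve: refit on growing subsets of the training data
# and score on a fixed held-out test set (single split, no cross-validation).
sizes_manual = [20, 50, 100, 150, 200]
train_scores_manual, test_scores_manual = [], []
for size in sizes_manual:
    clf = KNeighborsClassifier()
    clf.fit(X_train[:size], y_train[:size])
    train_scores_manual.append(clf.score(X_train[:size], y_train[:size]))
    test_scores_manual.append(clf.score(X_test, y_test))

plt.plot(sizes_manual, train_scores_manual, label="train (manual)")
plt.plot(sizes_manual, test_scores_manual, label="test (manual)")
plt.xlabel("training set size")
plt.ylabel("score")
plt.legend(loc='best');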
In [6]:
# validation_curve is another helper from scikit-learn
# to evaluate a model at several values of a single parameter
#
param_range = np.arange(1, 50, 1)
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(),
    X, y, param_name="n_neighbors", param_range=param_range, cv=2
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fig, ax = plt.subplots(1,1)
plt.title("Validation Curve for kNN's n_neighbors")
plt.xlabel("n_neighbors")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.grid()
lw = 2
plt.plot(param_range, train_scores_mean, label="Training score",
         color="darkorange", lw=lw)
plt.plot(param_range, test_scores_mean, label="Test score",
         color="navy", lw=lw)
plt.legend(loc="best");
# The best setting of n_neighbors is the one that maximises
# the score on the test data set. Note that because we used
# the testing data set to pick n_neighbors the score is not
# an unbiased estimate of the generalisation error anymore.
print("best training score at n_neighbors=", param_range[np.argmax(test_scores_mean)])
This is a regression problem. It mostly follows the setup of the classification problem so you should be able to reuse some of your work.
- Fit regressors for several values of n_neighbors and compare each regressor's predictions to the location of the training and testing points.
- Plot the mean squared error as a function of n_neighbors for both training and testing datasets.
- Find the best setting of n_neighbors for this dataset.
- Why does the testing error plateau from n_neighbors=5 to 15 at the value that it does?

Use make_regression() to create the dataset and use KNeighborsRegressor to answer the above questions. Take a look at scikit-learn's metrics module to compute the mean squared error.
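In case the metrics module is unfamiliar, here is a tiny illustration of the function you will most likely want, mean_squared_error from sklearn.metrics (the arrays are made up for demonstration):

from sklearn.metrics import mean_squared_error

# Mean squared error between true targets and predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
print(mean_squared_error(y_true, y_pred))  # mean of the squared differences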
In [7]:
def make_regression(n_samples=100, noise_level=0.8, random_state=2):
    rng = check_random_state(random_state)
    X = np.linspace(-2, 2, n_samples)
    y = 2 * X + np.sin(5 * X) + rng.randn(n_samples) * noise_level
    return X.reshape(-1, 1), y
In [8]:
# Your solution
In [9]:
# Tim's solution
from sklearn.neighbors import KNeighborsRegressor
X, y = make_regression()
plt.plot(X, y, 'xk')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)
fig, ax = plt.subplots(1,1)
line = np.linspace(-2, 2, 100).reshape(-1, 1)
ax.plot(X_test, y_test, 'xk')
for n in range(1, 20, 4):
    rgr = KNeighborsRegressor(n_neighbors=n)
    rgr.fit(X_train, y_train)
    ax.plot(line, rgr.predict(line), label='n_neighbors=%i' % n)
plt.legend(loc='best');
In [10]:
# What is the best value, and why is there a plateau?
# The parameter that maximises the test score (i.e. minimises the test error)
# is the best parameter to use. It appears as if there is a range of values
# that should work; we just pick the one at the maximum.
# The plateau is not nearly as visible as when Tim first solved the problem.
# He thinks its level reflects the intrinsic noise added to the observations
# (noise_level**2 = 0.64). This is more a curiosity than a general thing.
param_range = np.arange(1, 25, 1)
train_scores, test_scores = validation_curve(
    KNeighborsRegressor(),
    X, y, param_name="n_neighbors", param_range=param_range, cv=5,
    scoring='neg_mean_squared_error'
)
train_scores_mean = -np.mean(train_scores, axis=1)
test_scores_mean = -np.mean(test_scores, axis=1)
fig, ax = plt.subplots(1,1)
plt.title("Validation Curve for kNN's n_neighbors")
plt.xlabel("n_neighbors")
plt.ylabel("Mean squared error")
plt.hlines(0.8**2, xmin=0, xmax=25, label='noise**2')
plt.grid()
lw = 2
plt.plot(param_range, train_scores_mean, label="Training MSE",
         color="darkorange", lw=lw)
plt.plot(param_range, test_scores_mean, label="Test MSE",
         color="navy", lw=lw)
plt.legend(loc="best");
Logistic regression. Use a linear model to solve a two class classification problem.
- Use the LogisticRegression classifier to fit a model to your training data.
- Vary how much the two clusters overlap (the cluster_std argument) and plot the decision boundary for each case. What happens and why?
- Create several datasets by varying the random_state parameter and fit a model to each. Visualise the variation in the fitted parameters and the decision boundaries you obtain. Is this a high or low variance model?

Use make_two_blobs() to create a simple dataset and use the LogisticRegression classifier to answer the above questions.
In [11]:
from sklearn.linear_model import LogisticRegression
def make_two_blobs(n_samples=400, cluster_std=2., random_state=42):
    rng = check_random_state(random_state)
    X = rng.multivariate_normal([5, 0], [[cluster_std**2, 0], [0., cluster_std**2]],
                                size=n_samples//2)
    X2 = rng.multivariate_normal([0, 5.], [[cluster_std**2, 0], [0., cluster_std**2]],
                                 size=n_samples//2)
    X = np.vstack((X, X2))
    return X, np.hstack((np.ones(n_samples//2), np.zeros(n_samples//2)))
X, y = make_two_blobs()
labels = ['b', 'r']
y = np.take(labels, (y < 0.5))
In [12]:
# Your answer
In [13]:
# Tim's answer
plt.scatter(X[:,0], X[:,1], c=y, lw=0);
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.coef_, clf.intercept_)
from utils import plot_surface
plot_surface(clf, X, y, xlim=(-6, 10), ylim=(-6, 10))
In [14]:
# vary the cluster_std parameter from 1 to 5 (distinct clusters to very mixed clusters)
X, y = make_two_blobs(cluster_std=1.5)
labels = ['b', 'r']
y = np.take(labels, (y < 0.5))
clf = LogisticRegression()
clf.fit(X, y)
plot_surface(clf, X, y, xlim=(-6, 10), ylim=(-6, 10))
# The decision boundary should not move much as the centers of the
# clusters do not move. Even with more overlap between the clusters the best
# place to split the data remains halfway between the cluster centers.
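To see the effect directly, a sketch that loops over a few cluster_std values and draws the decision boundary for each, reusing plot_surface from the course utils (how plot_surface manages figures is assumed here; drop the plt.figure() call if it already opens its own figure):

# Fit and plot a logistic regression for increasingly overlapping clusters.
for std in [1, 2, 3, 4, 5]:
    X_std, y_std = make_two_blobs(cluster_std=std)
    y_std = np.take(labels, (y_std < 0.5))
    clf = LogisticRegression()
    clf.fit(X_std, y_std)
    plt.figure()  # one figure per value; remove if plot_surface opens its own
    plot_surface(clf, X_std, y_std, xlim=(-6, 10), ylim=(-6, 10))
    plt.title("cluster_std = %i" % std)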
In [15]:
# Variation in the parameters of the fitted model
fig, axs = plt.subplots(2,2)
intercepts = []
coeff1 = []
coeff2 = []
for n in range(50):
    X, y = make_two_blobs(random_state=n)
    labels = ['b', 'r']
    y = np.take(labels, (y < 0.5))
    clf = LogisticRegression()
    clf.fit(X, y)
    intercepts.append(clf.intercept_[0])
    coeff1.append(clf.coef_[0][0])
    coeff2.append(clf.coef_[0][1])
axs[0,0].hist(intercepts, bins=10);
axs[0,0].set_xlabel("intercept")
axs[0,1].hist(coeff1, bins=10);
axs[0,1].set_xlabel("coefficient 1")
axs[1,0].hist(coeff2, bins=10);
axs[1,0].set_xlabel("coefficient 2");
plt.tight_layout();
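The exercise also asks about the variation of the decision boundaries themselves. For logistic regression the boundary is the line where coef[0]*x1 + coef[1]*x2 + intercept = 0, so each fit's boundary can be drawn from the coefficients collected above (an added sketch, not part of the original solution):

# Draw all 50 decision boundaries: solve w1*x1 + w2*x2 + b = 0 for x2.
x1 = np.linspace(-6, 10, 100)
plt.figure()
for b, w1, w2 in zip(intercepts, coeff1, coeff2):
    plt.plot(x1, -(b + w1 * x1) / w2, color='grey', alpha=0.4)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.xlim(-6, 10)
plt.ylim(-6, 10);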
Logistic regression. Use a more complex linear model to create a two class classifier for the "circle inside a circle" problem. Think about how you can increase the complexity of a logistic regression model. Visualise the classification accuracy as a function of the model complexity.
Use make_circles(n_samples=400, factor=.3, noise=.1)
to create a simple dataset and use the LogisticRegression
classifier to answer the above question.
In [16]:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=400, factor=.3, noise=.1)
labels = ['b', 'r']
y = np.take(labels, (y < 0.5))
plt.scatter(X[:,0], X[:,1], c=y)
In [17]:
# Your answer
In [18]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)

# Fit on the training data and score on the held-out testing data for
# increasing polynomial degree (i.e. increasing model complexity).
for p in range(1, 5):
    clf = make_pipeline(PolynomialFeatures(p), LogisticRegression())
    clf.fit(X_train, y_train)
    print(p, clf.score(X_test, y_test))

clf = make_pipeline(PolynomialFeatures(2), LogisticRegression())
clf.fit(X_train, y_train)
plot_surface(clf, X, y)
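To visualise the accuracy as a function of model complexity, rather than just printing it, one possible sketch (an addition, not part of the original solution) collects the scores and plots them against the polynomial degree:

# Score a pipeline of PolynomialFeatures + LogisticRegression for several
# degrees and plot training and testing accuracy against the degree.
degrees = range(1, 6)
train_acc, test_acc = [], []
for p in degrees:
    clf = make_pipeline(PolynomialFeatures(p), LogisticRegression())
    clf.fit(X_train, y_train)
    train_acc.append(clf.score(X_train, y_train))
    test_acc.append(clf.score(X_test, y_test))

plt.figure()
plt.plot(list(degrees), train_acc, label="train")
plt.plot(list(degrees), test_acc, label="test")
plt.xlabel("polynomial degree")
plt.ylabel("accuracy")
plt.legend(loc='best');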