Work on this before the next lecture on 10 April. We will talk about questions, comments, and solutions during the exercise after the second lecture.
Please do form study groups! When you do, make sure you can explain everything in your own words; do not simply copy&paste from others.
The solutions to a lot of these problems can probably be found with Google. Please don't do that: you will not learn much by copy&pasting from the internet.
If you want to get credit/examination for this course, please upload your work to your GitHub repository for this course before the next lecture starts and post a link to your repository in this thread. If you worked together with others, please add their names to the notebook so we can see who formed groups.
There are two objectives for this set of exercises:
- get you started using python, scikit-learn, matplotlib, and GitHub. You will be using them a lot during the course, so make sure you get a good foundation to build on.
- work through the steps of opening a new dataset, plotting the data, fitting a model to it, evaluating your model, and deciding on model complexity.
Install python, scikit-learn (v0.18), matplotlib, jupyter and git.
Instructions for doing so: https://github.com/wildtreetech/advanced-comp-2017/blob/master/install.md
Documentation and guides for the various tools:
Read up on git clone, git pull, git push, git add and git commit. Once you master these five commands you should be good for this course. There is a whole universe of complex things that git can do for you; don't worry about them for now. Once you feel comfortable with the basics you can always step it up later.
These are some useful default imports for plotting and numpy
In [3]:
%config InlineBackend.figure_format='retina'
%matplotlib inline
import numpy as np
np.random.seed(123)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14
from sklearn.utils import check_random_state
In the lecture we used the nearest neighbour classifier to classify points from a toy dataset into either "red" or "blue" classes. We investigated how the performance changes as a function of model complexity and what this means for the performance of our classifier on unseen data. Instead of using a linear model as in the lecture, use a k-nearest neighbour model.
- Evaluate the performance of the classifier as a function of n_neighbors.
- What is the optimal value of n_neighbors for this dataset?

Use make_blobs(n_samples=400, centers=23, random_state=42) to create a simple dataset and use the KNeighborsClassifier classifier to answer the above questions.
In [13]:
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

labels = ["b", "darkorange"]
X, y = make_blobs(n_samples=400, centers=23, random_state=42)
# collapse the 23 blob labels into two colour classes:
# blobs 0-9 become "darkorange", blobs 10-22 become "b"
y = np.take(labels, (y < 10))
In [14]:
plt.scatter(X[:, 0], X[:, 1], facecolor=y, edgecolor="white", s=40)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
Out[14]:
The split can be done using different train/test ratios; I will come back to this later. For now, I decided to take half of the data as the training set and the other half as the test set. Intuitively, this is a tradeoff: devoting a much bigger portion of the data to training leaves too few points to reliably estimate the performance on unseen data, while a training set that is too small does not let us estimate parameters accurately enough to reproduce the trend observed in the test set.
For now I also fix the number of neighbours ("n_neighbors") of the model to 5, so that I can evaluate the performance (as asked in the next steps) for a fixed hyperparameter "n_neighbors".
In [31]:
from sklearn.model_selection import train_test_split

train_performance = []
test_performance = []
for i in range(500):
    # a fresh random 50/50 split on every iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
    kN_classifier = KNeighborsClassifier(n_neighbors=5)
    kN_classifier.fit(X_train, y_train)
    # store the performance of the kN_classifier on the train set
    train_performance.append(kN_classifier.score(X_train, y_train))
    # store the performance of the kN_classifier on the test set
    test_performance.append(kN_classifier.score(X_test, y_test))

train_test_performances = [train_performance, test_performance]
fig = plt.figure()
ax = fig.add_subplot(111)
medianprops = dict(linestyle=':', linewidth=2.5, color='grey')
meanlineprops = dict(linestyle='-', linewidth=2.5, color='black')
bp = ax.boxplot(train_test_performances, medianprops=medianprops, whis=[5, 95],
                meanprops=meanlineprops, meanline=True, showmeans=True)
bp['boxes'][0].set(color='g', lw=2)
bp['boxes'][1].set(color='r', lw=2)
ax.set_xticklabels(['Train', 'Test'])
ax.set_ylabel("Accuracy")
Out[31]:
The boxplots summarise the distribution of the scores for the training and test sets. The upper and lower edges of the boxes mark the first and third quartiles, while the whiskers outside the boxes mark the 5th and 95th percentiles. The points drawn as individual dots are outliers, i.e. values falling outside the 5th-95th percentile range. The black horizontal lines represent the means, the dotted grey lines the medians.
As we can see, the means (and medians) of the two distributions are far apart, and the 95th percentile of the test scores lies only slightly above the 5th percentile of the training scores. Overall, this means that the scores on the training set are systematically higher than those on the test set. Of course, some test scores exceed the lowest training scores, due to fluctuations in the score values.
In [38]:
train_performance_splits = []
test_performance_splits = []
range_ft = np.linspace(0.1, 0.9, 9)
for i in range(500):
    train_performance = []
    test_performance = []
    for ft in range_ft:
        # each call draws a new random split of the dataset into train and test set
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=ft)
        kN_classifier = KNeighborsClassifier(n_neighbors=5)
        kN_classifier.fit(X_train, y_train)
        # store the performance of the kN_classifier on the train set
        train_performance.append(kN_classifier.score(X_train, y_train))
        # store the performance of the kN_classifier on the test set
        test_performance.append(kN_classifier.score(X_test, y_test))
    train_performance_splits.append(train_performance)
    test_performance_splits.append(test_performance)

train_performance_splits = np.array(train_performance_splits)
test_performance_splits = np.array(test_performance_splits)
fig = plt.figure()
ax = fig.add_subplot(111)
mean_train_performance_splits = np.mean(train_performance_splits, axis=0)
std_train_performance_splits = np.std(train_performance_splits, axis=0)
mean_test_performance_splits = np.mean(test_performance_splits, axis=0)
std_test_performance_splits = np.std(test_performance_splits, axis=0)
ax.plot(range_ft, mean_train_performance_splits, color='g', label='Train', lw=4)
ax.plot(range_ft, mean_train_performance_splits + std_train_performance_splits,
        range_ft, mean_train_performance_splits - std_train_performance_splits,
        color='g', alpha=0.5, ls="--")
ax.plot(range_ft, mean_test_performance_splits, color='r', label='Test', lw=4)
ax.plot(range_ft, mean_test_performance_splits + std_test_performance_splits,
        range_ft, mean_test_performance_splits - std_test_performance_splits,
        color='r', alpha=0.5, ls="--")
ax.set_xlabel("Train set fraction")
ax.set_ylabel("Accuracy")
plt.legend(loc='best')
Out[38]:
At small training fractions the two scores differ considerably in their average values (thick lines): the training set is small, so few data points are available to train the classifier. Increasing the training fraction provides more points and improves the performance of the classifier on both the training and the test set, since the model is given more statistics.
In particular, the mean score (for both sets) rises steeply as a function of the training fraction up to about 0.4, after which it stays almost constant. This means that, beyond a certain point, increasing the training set size no longer significantly improves the accuracy (on average). Note, however, that the standard deviation of the test score increases, especially at high training fractions: there we use many points to fit the model but few to estimate its accuracy on the test set, so we are more prone to overfitting the training data. The test curve seems to show a maximum around 0.5, which suggests that a roughly optimal split is obtained by dividing the dataset in half. This is consistent with the earlier comment on splitting the data into training and test sets.
In [39]:
train_performance_nn = []
test_performance_nn = []
# we fix the ratio between train and test size at 0.5
range_nn = range(1, 26)
for i in range(500):
    train_performance = []
    test_performance = []
    for nn in range_nn:
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
        kN_classifier = KNeighborsClassifier(n_neighbors=nn)
        kN_classifier.fit(X_train, y_train)
        # store the performance of the kN_classifier on the train set
        train_performance.append(kN_classifier.score(X_train, y_train))
        # store the performance of the kN_classifier on the test set
        test_performance.append(kN_classifier.score(X_test, y_test))
    train_performance_nn.append(train_performance)
    test_performance_nn.append(test_performance)

train_performance_nn = np.array(train_performance_nn)
test_performance_nn = np.array(test_performance_nn)
fig = plt.figure()
ax = fig.add_subplot(111)
mean_train_performance_nn = np.mean(train_performance_nn, axis=0)
std_train_performance_nn = np.std(train_performance_nn, axis=0)
mean_test_performance_nn = np.mean(test_performance_nn, axis=0)
std_test_performance_nn = np.std(test_performance_nn, axis=0)
ax.plot(range_nn, mean_train_performance_nn, color='g', label='Train', lw=4)
ax.plot(range_nn, mean_train_performance_nn + std_train_performance_nn,
        range_nn, mean_train_performance_nn - std_train_performance_nn,
        color='g', alpha=0.5, ls="--")
ax.plot(range_nn, mean_test_performance_nn, color='r', label='Test', lw=4)
ax.plot(range_nn, mean_test_performance_nn + std_test_performance_nn,
        range_nn, mean_test_performance_nn - std_test_performance_nn,
        color='r', alpha=0.5, ls="--")
ax.set_xlabel("n_neighbors")
ax.set_ylabel("Accuracy")
plt.legend(loc='best')
Out[39]:
This is a regression problem. It mostly follows the setup of the classification problem so you should be able to reuse some of your work.
- Fit regressors with varying n_neighbors and compare each regressor's predictions to the location of the training and testing points.
- Evaluate the mean squared error as a function of n_neighbors for both training and testing datasets.
- What is the optimal value of n_neighbors for this dataset?
- Why does the error change from n_neighbors = 5 to 15 at the value that it does?

Use make_regression() to create the dataset and use KNeighborsRegressor to answer the above questions. Take a look at scikit-learn's metrics module to compute the mean squared error.
In [4]:
def make_regression(n_samples=100, noise_level=0.8, random_state=2):
    rng = check_random_state(random_state)
    # noisy samples scattered around the curve y = 2x + sin(5x)
    X = np.linspace(-2, 2, n_samples)
    y = 2 * X + np.sin(5 * X) + rng.randn(n_samples) * noise_level
    return X.reshape(-1, 1), y
In [5]:
# Your solution
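As a starting point (a sketch of one possible approach, not the official solution), the loop below mirrors the classification study above: it scans n_neighbors and tracks the mean squared error on the training and test sets; the variable names ks, train_mse and test_mse are mine.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X_reg, y_reg = make_regression()
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, train_size=0.5)

train_mse, test_mse = [], []
ks = range(1, 26)
for k in ks:
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    # mean squared error on the points the model was fit on ...
    train_mse.append(mean_squared_error(y_train, reg.predict(X_train)))
    # ... and on the held-out points
    test_mse.append(mean_squared_error(y_test, reg.predict(X_test)))

plt.plot(ks, train_mse, label="Train", c="g", lw=4)
plt.plot(ks, test_mse, label="Test", c="r", lw=4)
plt.xlabel("n_neighbors")
plt.ylabel("Mean squared error")
plt.legend(loc="best")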
Logistic regression. Use a linear model to solve a two class classification problem.
- Use the LogisticRegression classifier to fit a model to your training data.
- Vary the spread of the blobs (the cluster_std argument) and plot the decision boundary for each case. What happens and why?
- Create several datasets by changing the random_state parameter and fit a model to each. Visualise the variation in the fitted parameters and the decision boundaries you obtain. Is this a high or low variance model?

Use make_two_blobs() to create a simple dataset and use the LogisticRegression classifier to answer the above questions.
In [6]:
from sklearn.linear_model import LogisticRegression

def make_two_blobs(n_samples=400, cluster_std=2., random_state=42):
    rng = check_random_state(random_state)
    # two isotropic Gaussian blobs centred at (5, 0) and (0, 5)
    X = rng.multivariate_normal([5, 0], [[cluster_std**2, 0], [0., cluster_std**2]],
                                size=n_samples // 2)
    X2 = rng.multivariate_normal([0, 5.], [[cluster_std**2, 0], [0., cluster_std**2]],
                                 size=n_samples // 2)
    X = np.vstack((X, X2))
    return X, np.hstack((np.ones(n_samples // 2), np.zeros(n_samples // 2)))

X, y = make_two_blobs()
labels = ['b', 'r']
# map class 0 to 'r' and class 1 to 'b'
y = np.take(labels, (y < 0.5))
In [7]:
# Your answer
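A minimal sketch of the first step (one possible approach, not the official answer), assuming a 50/50 split as above. Since the model is linear, the decision boundary is the line w0*x0 + w1*x1 + b = 0, which we can solve for x1 and draw:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))

# solve w0*x0 + w1*x1 + b = 0 for x1 to draw the boundary
w = clf.coef_[0]
b = clf.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], facecolor=y, edgecolor="white", s=40)
plt.plot(xs, -(w[0] * xs + b) / w[1], "k--", label="decision boundary")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend(loc="best")

Repeating this for datasets generated with different cluster_std and random_state values then shows how the boundary moves around.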
Logistic regression. Use a more complex linear model to create a two class classifier for the "circle inside a circle" problem. Think about how you can increase the complexity of a logistic regression model. Visualise the classification accuracy as a function of the model complexity.
Use make_circles(n_samples=400, factor=.3, noise=.1) to create a simple dataset and use the LogisticRegression classifier to answer the above question.
In [8]:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=400, factor=.3, noise=.1)
labels = ['b', 'r']
y = np.take(labels, (y < 0.5))
plt.scatter(X[:,0], X[:,1], c=y)
Out[8]:
In [9]:
# Your answer
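One standard way to make a logistic regression model more complex is to expand the inputs with polynomial features: the boundary stays linear in the expanded space but becomes curved in the original two dimensions, which suits the circle-inside-a-circle geometry. A sketch under that assumption, using scikit-learn's PolynomialFeatures and make_pipeline and scanning the polynomial degree as the complexity knob:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

degrees = range(1, 6)
test_scores = []
for d in degrees:
    # degree-d polynomial expansion of the two features, then a linear classifier
    model = make_pipeline(PolynomialFeatures(degree=d), LogisticRegression())
    model.fit(X_train, y_train)
    test_scores.append(model.score(X_test, y_test))

plt.plot(degrees, test_scores, "o-")
plt.xlabel("polynomial degree (model complexity)")
plt.ylabel("test accuracy")

With degree 1 the model can only draw a straight line, so it should do no better than chance here; from degree 2 onwards the expanded features can represent a circular boundary.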