Exercise 1

Work on this before the next lecture on 10 April. We will talk about questions, comments, and solutions during the exercise after the second lecture.

Please do form study groups! When you do, make sure you can explain everything in your own words; do not simply copy&paste from others.

The solutions to a lot of these problems can probably be found with Google. Please don't. You will not learn a lot by copy&pasting from the internet.

If you want to get credit/examination for this course, please upload your work to your GitHub repository for this course before the next lecture starts and post a link to your repository in this thread. If you worked together with others, please add their names to the notebook so we can see who formed groups.

Objective

There are two objectives for this set of exercises:

  • get you started using python, scikit-learn, matplotlib, and GitHub. You will be using them a lot during the course, so make sure you get a good foundation to build on.

  • work through the steps of opening a new dataset, plotting the data, fitting a model to it, evaluating your model, and deciding on model complexity.

Question 0

Install python, scikit-learn (v0.18), matplotlib, jupyter and git.

Instructions for doing so: https://github.com/wildtreetech/advanced-comp-2017/blob/master/install.md

Documentation and guides for the various tools:

GitHub and git

Read up on git clone, git pull, git push, git add and git commit. Once you master these five commands you should be good for this course. There is a whole universe of complex things that git can do for you; don't worry about them for now. Once you feel comfortable with the basics you can always step it up later.
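For reference, a typical day-to-day sequence with these five commands might look like the following. The repository URL and file name are placeholders, not real course paths:

git clone https://github.com/<your-username>/<your-repo>.git   # one-off: get a local copy
git pull                           # fetch and merge the latest changes from GitHub
git add exercise-1.ipynb           # stage the notebook you worked on
git commit -m "Solve exercise 1"   # record the staged changes in your local history
git push                           # upload your commits to GitHub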


These are some useful default imports for plotting and numpy


In [3]:
%config InlineBackend.figure_format='retina'
%matplotlib inline

import numpy as np
np.random.seed(123)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14
from sklearn.utils import check_random_state

Question 1

In the lecture we used the nearest neighbour classifier to classify points from a toy dataset into either "red" or "blue" classes. We investigated how the performance changes as a function of model complexity and what this means for the performance of our classifier on unseen data. Instead of using a linear model as in the lecture, use a k-nearest neighbour model.

  • plot your dataset
  • split your dataset into a training and testing set. Comment on how you decided to split your data.
  • evaluate the performance of the classifier on your training dataset.
  • evaluate the performance of the classifier on your testing dataset.
  • repeat the above two steps for varying splits (10-90, 20-80, 30-70, ...) and comment on what you see. Is there a "best" way to split your data?
  • comment on why the two performance estimates agree or disagree.
  • plot the accuracy of the classifier as a function of n_neighbors.
  • comment on the similarities and differences between the performance on the testing and training dataset.
  • is a KNeighborsClassifier with 4 neighbors or with 10 neighbors the more complex model?
  • find the best setting of n_neighbors for this dataset.
  • why is this the best setting?

Use make_blobs(n_samples=400, centers=23, random_state=42) to create a simple dataset and use the KNeighborsClassifier classifier to answer the above questions.


In [13]:
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

labels = ["b", "darkorange"]
X, y = make_blobs(n_samples=400, centers=23, random_state=42)
y = np.take(labels, (y < 10))

Simply plot the dataset


In [14]:
plt.scatter(X[:, 0], X[:, 1], facecolor=y, edgecolor="white", s=40)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")


Out[14]:
<matplotlib.text.Text at 0x7f718de6fe80>

Splitting the data into a training set and a test set

The split can be done with different ratios of training to test data; I will comment on this later. For now, I decided to take half of the data as the training set and the other half as the test set. Intuitively, this is a trade-off: a larger training set gives the model more data to estimate parameters that reproduce the trend on unseen data, but leaves fewer points to reliably estimate the performance on the test set; a smaller training set leaves plenty of test data but may be too small to fit a model that generalises well.

For now, I fixed the number of neighbours of the model ("n_neighbors") to 5, so that I can evaluate the performance (as asked in the next steps) for a fixed hyperparameter.


In [31]:
from sklearn.model_selection import train_test_split

train_performance = []
test_performance = []


for i in range(500):
    X_train, X_test, y_train,y_test = train_test_split(X, y, train_size=0.5)
    kN_classifier = KNeighborsClassifier(n_neighbors=5)
    kN_classifier.fit(X_train, y_train)
    # store the performance of the kN_classifier on the train set
    train_performance.append(kN_classifier.score(X_train, y_train))
    # store the performance of the kN_classifier on the test set
    test_performance.append(kN_classifier.score(X_test, y_test))

train_test_performances = [train_performance, test_performance]

fig = plt.figure()
ax = fig.add_subplot(111)

medianprops = dict(linestyle=':', linewidth=2.5, color='grey')
meanlineprops = dict(linestyle='-', linewidth=2.5, color='black')


bp = ax.boxplot(train_test_performances, medianprops=medianprops, whis=[5,95], meanprops=meanlineprops, meanline=True, showmeans=True)

bp['boxes'][0].set(color = 'g', lw = 2)
bp['boxes'][1].set(color = 'r', lw = 2)

ax.set_xticklabels(['Train', 'Test'])
ax.set_ylabel("Accuracy")


Out[31]:
<matplotlib.text.Text at 0x7f7187d471d0>

The boxplots summarise the distribution of the scores for the training and test sets. The bottom and top of each box mark the first and third quartiles, while the whiskers outside the boxes mark the 5th and 95th percentiles. Points drawn as individual dots are outliers, falling outside the 5th-95th percentile range. The black horizontal lines represent the means, the dotted grey lines the medians.

As we can see, the means (and medians) of the two distributions are far apart. Moreover, the 95th percentile of the test scores is only slightly above the 5th percentile of the training scores. Overall, this means that the scores on the training set are systematically higher than those on the test set. Of course, some individual test scores are higher than the lowest training scores, due to fluctuations from split to split.

Varying the splits between datasets (10-90, 20-80, 30-70, ...)


In [38]:
train_performance_splits = []
test_performance_splits = []

range_ft = np.linspace(0.1,0.9,9)

for i in range(500):
    train_performance = []
    test_performance = []
    for ft in range_ft:
        # train_test_split without a fixed random_state draws a fresh random split each call
        X_train, X_test, y_train,y_test = train_test_split(X, y, train_size=ft)
        kN_classifier = KNeighborsClassifier(n_neighbors=5)
        kN_classifier.fit(X_train, y_train)
        # store the performance of the kN_classifier on the train set
        train_performance.append(kN_classifier.score(X_train, y_train))
        # store the performance of the kN_classifier on the test set
        test_performance.append(kN_classifier.score(X_test, y_test))
        #
    train_performance_splits.append(train_performance)
    test_performance_splits.append(test_performance)

train_performance_splits = np.array(train_performance_splits)
test_performance_splits = np.array(test_performance_splits)    

fig = plt.figure()
ax = fig.add_subplot(111)

mean_train_performance_splits = np.mean(train_performance_splits, axis=0)
std_train_performance_splits = np.std(train_performance_splits, axis=0)
mean_test_performance_splits = np.mean(test_performance_splits, axis=0)
std_test_performance_splits = np.std(test_performance_splits, axis=0)

ax.plot(range_ft, mean_train_performance_splits, color='g', label='Train', lw=4)
ax.plot(range_ft, mean_train_performance_splits + std_train_performance_splits, range_ft, mean_train_performance_splits - std_train_performance_splits, color='g', alpha=0.5, ls="--")
ax.plot(range_ft, mean_test_performance_splits, color='r', label='Test', lw=4)
ax.plot(range_ft, mean_test_performance_splits + std_test_performance_splits, range_ft, mean_test_performance_splits - std_test_performance_splits, color='r', alpha=0.5, ls="--")

ax.set_xlabel("Train set fraction")
ax.set_ylabel("Accuracy")
plt.legend(loc='best')


Out[38]:
<matplotlib.legend.Legend at 0x7f7187991eb8>

Here, the two scores start out with a big difference between their average values (thick lines). At low fractions the training set is small, so we are not using much data to train the classifier. Increasing the training fraction provides more points and improves the performance on both the training and the test set, since the classifier has more statistics to work with.

In particular, the mean score (for both sets) rises steeply as a function of the training fraction up to about 0.4, after which it stays almost constant. So beyond a certain point, increasing the training set size no longer significantly improves the accuracy (on average). Note, however, that the standard deviation of the test score grows, especially at high training fractions: there we use many points to fit the model but only a few to estimate its accuracy, so the test estimate becomes noisy and we are more prone to overfitting the training data. The test curve seems to show a maximum around 0.5, which suggests that a roughly optimal split is obtained by dividing the dataset in half. This is reasonable, as already argued in the section "Splitting the data into a training set and a test set".

Varying n_neighbors


In [39]:
train_performance_nn = []
test_performance_nn = []

# We fix the ratio between train and test size at 0.5

range_nn = range(1,26,1)

for i in range(500):
    train_performance = []
    test_performance = []
    for nn in range_nn:
        X_train, X_test, y_train,y_test = train_test_split(X, y, train_size=0.5)
        kN_classifier = KNeighborsClassifier(n_neighbors=nn)
        kN_classifier.fit(X_train, y_train)
        # store the performance of the kN_classifier on the train set
        train_performance.append(kN_classifier.score(X_train, y_train))
        # store the performance of the kN_classifier on the test set
        test_performance.append(kN_classifier.score(X_test, y_test))
        #
    train_performance_nn.append(train_performance)
    test_performance_nn.append(test_performance)

train_performance_nn = np.array(train_performance_nn)
test_performance_nn = np.array(test_performance_nn)    

fig = plt.figure()
ax = fig.add_subplot(111)

mean_train_performance_nn = np.mean(train_performance_nn, axis=0)
std_train_performance_nn = np.std(train_performance_nn, axis=0)
mean_test_performance_nn = np.mean(test_performance_nn, axis=0)
std_test_performance_nn = np.std(test_performance_nn, axis=0)

ax.plot(range_nn, mean_train_performance_nn, color='g', label='Train', lw=4)
ax.plot(range_nn, mean_train_performance_nn + std_train_performance_nn, range_nn, mean_train_performance_nn - std_train_performance_nn, color='g', alpha=0.5, ls="--")
ax.plot(range_nn, mean_test_performance_nn, color='r', label='Test', lw=4)
ax.plot(range_nn, mean_test_performance_nn + std_test_performance_nn, range_nn, mean_test_performance_nn - std_test_performance_nn, color='r', alpha=0.5, ls="--")

ax.set_xlabel("n_neighbors")
ax.set_ylabel("Accuracy")
plt.legend(loc='best')


Out[39]:
<matplotlib.legend.Legend at 0x7f7187896160>
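To read off the "best" setting of n_neighbors from these curves, one option is to pick the value that maximises the mean test accuracy. Below is a minimal sketch reusing the arrays computed above; note that it ignores the spread, so neighbouring values whose means lie within one standard deviation of each other are arguably just as good.

In [ ]:
# index of the highest mean test accuracy across the 500 repetitions
best_idx = np.argmax(mean_test_performance_nn)
best_nn = list(range_nn)[best_idx]
print("best n_neighbors:", best_nn,
      "with mean test accuracy:", mean_test_performance_nn[best_idx])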

Question 2

This is a regression problem. It mostly follows the setup of the classification problem so you should be able to reuse some of your work.

  • plot your dataset
  • fit a kNN regressor with varying number of n_neighbors and compare each regressors predictions to the location of the training and testing points.
  • plot the mean squared error of the regressor as a function of n_neighbors for both training and testing datasets.
  • comment on the similarities and differences between the performance on the testing and training dataset.
  • find the best setting of n_neighbors for this dataset.
  • why is this the best setting?
  • can you explain why the mean squared error on the training dataset plateaus between roughly n_neighbors=5 and 15 at the value that it does?

Use make_regression() to create the dataset and use KNeighborsRegressor to answer the above questions. Take a look at scikit-learn's metrics module to compute the mean squared error.


In [4]:
def make_regression(n_samples=100, noise_level=0.8, random_state=2):
    rng = check_random_state(random_state)
    X = np.linspace(-2, 2, n_samples)
    y = 2 * X + np.sin(5 * X) + rng.randn(n_samples) * noise_level
    
    return X.reshape(-1, 1), y

In [5]:
# Your solution
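As a starting point, here is a minimal sketch of the requested comparison: fit KNeighborsRegressor for a range of n_neighbors and record the mean squared error on both sets. The 50/50 split, the fixed random_state, and the range of k are my assumptions, not prescribed by the exercise.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X_reg, y_reg = make_regression()
# assumption: a 50/50 split, as in Question 1
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, train_size=0.5, random_state=0)

ks = range(1, 26)
train_mse, test_mse = [], []
for k in ks:
    reg = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, reg.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, reg.predict(X_test)))

plt.plot(ks, train_mse, label="Train", lw=4)
plt.plot(ks, test_mse, label="Test", lw=4)
plt.xlabel("n_neighbors")
plt.ylabel("Mean squared error")
plt.legend(loc="best")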

Question 3

Logistic regression. Use a linear model to solve a two class classification problem.

  • What is the difference between a linear regression model and a logistic regression model?
  • plot your data and split it into a training and test set
  • draw your guess for where the decision boundary will be on the plot. Why did you pick this one?
  • use the LogisticRegression classifier to fit a model to your training data
  • extract the fitted coefficients from the model and draw the fitted decision boundary
  • create a function to draw the decision surface (the classifier's prediction for every point in space)
  • why is the boundary where it is?
  • (bonus) create new datasets with increasingly large amounts of noise (increase the cluster_std argument) and plot the decision boundary for each case. What happens and why?
  • create 20 new datasets by changing the random_state parameter and fit a model to each. Visualise the variation in the fitted parameters and the decision boundaries you obtain. Is this a high or low variance model?

Use make_two_blobs() to create a simple dataset and use the LogisticRegression classifier to answer the above questions.


In [6]:
from sklearn.linear_model import LogisticRegression

def make_two_blobs(n_samples=400, cluster_std=2., random_state=42):
    rng = check_random_state(random_state)
    X = rng.multivariate_normal([5,0], [[cluster_std**2, 0], [0., cluster_std**2]],
                                size=n_samples//2)
    
    X2 = rng.multivariate_normal([0, 5.], [[cluster_std**2, 0], [0., cluster_std**2]],
                                 size=n_samples//2)
    X = np.vstack((X, X2))
    return X, np.hstack((np.ones(n_samples//2), np.zeros(n_samples//2)))

X, y = make_two_blobs()
labels = ['b', 'r']
y = np.take(labels, (y < 0.5))

In [7]:
# Your answer
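As a starting point, a minimal sketch of fitting LogisticRegression and drawing the decision boundary from the fitted coefficients. For a linear model the boundary is the line where w0*x0 + w1*x1 + b = 0, i.e. x1 = -(w0*x0 + b)/w1. For brevity this fits on the full dataset; the exercise asks you to split into train and test first.

In [ ]:
clf = LogisticRegression()
clf.fit(X, y)

# the fitted boundary satisfies w0*x0 + w1*x1 + b = 0
w0, w1 = clf.coef_[0]
b = clf.intercept_[0]

xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="white", s=40)
plt.plot(xs, -(w0 * xs + b) / w1, "k--", label="fitted boundary")
plt.legend(loc="best")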

Question 4

Logistic regression. Use a more complex linear model to create a two class classifier for the "circle inside a circle" problem. Think about how you can increase the complexity of a logistic regression model. Visualise the classification accuracy as a function of the model complexity.

Use make_circles(n_samples=400, factor=.3, noise=.1) to create a simple dataset and use the LogisticRegression classifier to answer the above question.


In [8]:
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=400, factor=.3, noise=.1)
labels = ['b', 'r']
y = np.take(labels, (y < 0.5))

plt.scatter(X[:,0], X[:,1], c=y)


Out[8]:
<matplotlib.collections.PathCollection at 0x11226ee48>

In [9]:
# Your answer
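One common way to give logistic regression more capacity, sketched below, is to expand the two input features with polynomial terms before fitting the linear model; the polynomial degree then acts as the complexity knob. The pipeline, the 50/50 split, and the degree range are my assumptions, not prescribed by the exercise.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=1)

degrees = range(1, 6)
accuracies = []
for d in degrees:
    # with degree >= 2 the expanded features include x0**2 and x1**2,
    # so a boundary that is linear in feature space can be a circle in input space
    model = make_pipeline(PolynomialFeatures(degree=d), LogisticRegression())
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

plt.plot(degrees, accuracies, "o-")
plt.xlabel("Polynomial degree (model complexity)")
plt.ylabel("Test accuracy")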