Q1

In this question, you'll design a very simple classifier to work on a real-world dataset.

The Iris dataset is a classic dataset for classification. It consists of 150 total data points, with 3 different classes (50 data points per class). Each data point is 4-dimensional.

The data are flower observations. The four dimensions, or features, are respectively:

  • sepal length (cm)
  • sepal width (cm)
  • petal length (cm)
  • petal width (cm)

The class is one of the following three: setosa, versicolor, or virginica. These classes are represented by 0, 1, and 2, respectively.

A

In this part, you'll plot the data that you're going to use in the classification task. This is purely to familiarize you with the data that you'll be classifying in the next step.

In the block of code below, all the imports you'll need are given, and the raw dataset is loaded using datasets.load_iris(). The resulting object iris is a dictionary-like object with lots of extraneous information; its keys are "target_names", "target", "DESCR", "feature_names", and "data". Note that for the classification task, we won't be using the full Iris dataset, but rather a 2D subset of it.

In your code:

  1. Define two variables: X (the data) and y (the target variables, or classes).
    • X should have shape (150, 2) and y should have shape (150,).
    • X will be all the rows and the first two columns of the full Iris dataset.
    • y should consist only of 150 integer numbers that are all either 0, 1, or 2.
  2. Create a 2D scatter plot of X.
  3. Color each dot in X of the scatter plot by the appropriate class, as specified in y.

You should already have all the import statements you need, but feel free to import anything else you may need.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets as datasets
import seaborn as sns

iris = datasets.load_iris()

### BEGIN SOLUTION

### END SOLUTION
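
For reference, here's a minimal sketch of one possible approach (the axis labels and plotting details are illustrative, not the only valid solution):

In [ ]:
# Sketch only: take all rows of the first two feature columns, then plot,
# coloring each point by its class label
X = iris["data"][:, :2]   # all 150 rows, first two columns -> shape (150, 2)
y = iris["target"]        # 150 integer class labels, each 0, 1, or 2

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel(iris["feature_names"][0])  # sepal length (cm)
plt.ylabel(iris["feature_names"][1])  # sepal width (cm)
plt.show()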

B

In this part, you'll write a classifier to predict the flower type (setosa, versicolor, or virginica) from its features. It's one of the simplest possible classifiers: a k-nearest neighbors classifier, or "KNN".

KNN classifiers work by classifying a new data point according to which existing data points it's closest to. If the majority of the nearby points belong to a particular class, the new data point takes that class.

How many neighbors do we consider? That's the "K" in "KNN". If we choose k = 5, then we look at the 5 nearest data points to the new one, take a majority vote of those 5 points' classes, and assign the majority class to the new data point.

Using the Iris dataset: imagine a new data point comes into the 2D plot we had in the last problem, and k = 3. If 2 of the 3 nearest neighbors belong to class 0 (setosa) and the other one belongs to class 1 (versicolor), then we assign the new data point to be class 0.
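
To make the voting rule concrete, here is a tiny hand-rolled illustration with made-up 2D points (purely illustrative; in this assignment you'll use scikit-learn's implementation instead):

In [ ]:
import numpy as np

# Five labeled points (made-up data) and their classes
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
cls = np.array([0, 0, 1, 1, 0])

new = np.array([0.15, 0.15])               # point to classify
dists = np.linalg.norm(pts - new, axis=1)  # distance to each labeled point
nearest3 = np.argsort(dists)[:3]           # indices of the 3 nearest points
votes = cls[nearest3]                      # their classes: [0, 0, 0] here
print(np.bincount(votes).argmax())         # majority vote -> class 0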

scikit-learn has already done the heavy lifting of writing all the code to implement nearest neighbors in Python. You just have to train and test the KNN classifier.

In your code:

  • Define a function train_knn().
  • train_knn should take 3 arguments: a data matrix X, a target array y, and a number of nearest neighbors k.
  • It should return a trained KNN model.

Your function should 1) create a KNN model, initialized with the correct number of neighbors, and 2) train (or fit) the model with a dataset and its labels. Look at the scikit-learn documentation for K-Nearest Neighbors.


In [ ]:
import sklearn.neighbors as neighbors

### BEGIN SOLUTION

### END SOLUTION
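
For reference, a minimal sketch of one possible implementation, using sklearn.neighbors.KNeighborsClassifier:

In [ ]:
import sklearn.neighbors as neighbors  # already imported above; repeated so this cell stands alone

def train_knn(X, y, k):
    """Create a KNN classifier with k neighbors and fit it to (X, y)."""
    model = neighbors.KNeighborsClassifier(n_neighbors=k)
    model.fit(X, y)  # "training" a KNN model essentially just stores the data
    return model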

In [ ]:
try:
    train_knn
except NameError:
    assert False, "train_knn is not defined"
else:
    assert True

In [ ]:
np.random.seed(23843)
X = np.random.random((10, 2))
y = np.random.randint(2, size=10)

# The model should store the number of neighbors it was given...
m1 = train_knn(X, y, 3)
assert m1.n_neighbors == 3

# ...and report the correct 3 nearest neighbors of the first training point
x = m1.kneighbors()[1]
assert set(x[0].tolist()) == set([3, 6, 8])

# With k = 11 but only 10 training points, querying neighbors must fail
m2 = train_knn(X, y, 11)
assert m2.n_neighbors == 11
try:
    m2.kneighbors()
except ValueError:
    assert True
else:
    assert False

C

In this part, you'll write an accompanying function to test the classification accuracy of your trained model.

In your code:

  • Define a function test_knn().
  • test_knn should take 3 arguments: a data matrix X, a target array y, and a trained KNN model.
  • It should return a prediction accuracy between 0 (completely incorrect) and 1 (completely correct).

Your function can use the score() method available on the KNN model. Look at the scikit-learn documentation for K-Nearest Neighbors.


In [ ]:
### BEGIN SOLUTION

### END SOLUTION
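
For reference, a minimal sketch of one possible implementation, using the score() method mentioned above:

In [ ]:
def test_knn(X, y, model):
    """Return the trained model's mean prediction accuracy on (X, y)."""
    return model.score(X, y)  # fraction of correct predictions, in [0, 1]
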
In [ ]:
try:
    test_knn
except NameError:
    assert False, "test_knn is not defined"
else:
    assert True

In [ ]:
np.random.seed(65456)
X = np.random.random((10, 2))
y = np.random.randint(2, size=10)

# Train on the first 8 points; both held-out points should be misclassified
m1 = train_knn(X[:8], y[:8], 3)
acc1 = test_knn(X[8:], y[8:], m1)
np.testing.assert_allclose(acc1, 0.0)

# Train on the last 8 points; exactly 1 of the 2 held-out points is correct
m2 = train_knn(X[2:], y[2:], 1)
acc2 = test_knn(X[:2], y[:2], m2)
np.testing.assert_allclose(acc2, 0.5)

D

In this part, you'll test the functions you just wrote and see what value of k gives you the best classifier!

Using the Iris data from Part A, slice out corresponding subsets from X and y for training and for testing. Change the relative sizes of the training and testing sets. Vary the value of k. Show your work!

Remember:

  • Any data you don't use in your training set goes into your testing set; that way, the whole dataset is used for training + testing.
  • You may want to focus on odd values of k; otherwise, you can end up with tie votes that are broken arbitrarily.
  • Feel free to import any packages you may want. No other packages are required, and no plots are needed, though you're welcome to make any you'd like.

In [ ]:

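For reference, here is a sketch of one possible experiment. The split sizes, values of k, and random seed are arbitrary choices, and it assumes the train_knn and test_knn functions from Parts B and C:

In [ ]:
# Re-extract the 2D Iris data from Part A (X and y were overwritten by the
# test cells above), then shuffle once so classes mix across the split
X = iris["data"][:, :2]
y = iris["target"]
np.random.seed(0)
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]

for n_train in (30, 75, 120):      # training set sizes to try
    for k in (1, 3, 5, 7, 9):      # odd values of k to try
        model = train_knn(X[:n_train], y[:n_train], k)
        acc = test_knn(X[n_train:], y[n_train:], model)
        print(f"n_train = {n_train:3d}, k = {k}: accuracy = {acc:.3f}")
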
E

Describe your findings from Part D. What effect did the size of the training set (and, therefore, the testing set) have on the classification accuracy of your KNN model? What effect did the value of k have on your classification accuracy? Explain these behaviors, along with any other thoughts you may have on the process.