K-nearest neighbors and scikit-learn

Review of the iris dataset



In [1]:

    
%matplotlib inline
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)



In [2]:

    
iris.head()









    Out[2]:






  
    
      
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Terminology

150 observations (n=150): each observation is one iris flower
4 features (p=4): sepal length, sepal width, petal length, and petal width
Response: iris species
Classification problem since response is categorical

Human learning on the iris dataset

How did we (as humans) predict the species of an iris?

We observed that the different species had (somewhat) dissimilar measurements.
We focused on features that seemed to correlate with the response.
We created a set of rules (using those features) to predict the species of an unknown iris.

We assumed that if an unknown iris has measurements similar to previous irises, then its species is most likely the same as those previous irises.
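
As a minimal sketch, hand-written rules of this kind might look like the function below. The function name and both thresholds (2.5 for petal length, 1.75 for petal width) are illustrative assumptions, not values stated in this lesson.

```python
def classify_iris_by_rules(petal_length, petal_width):
    """Hypothetical hand-written rules; thresholds are illustrative assumptions."""
    # Setosa petals are much shorter than those of the other two species
    if petal_length < 2.5:
        return 'Iris-setosa'
    # Assumed cutoff separating most versicolor from virginica
    if petal_width < 1.75:
        return 'Iris-versicolor'
    return 'Iris-virginica'


# First row of the dataset (petal_length=1.4, petal_width=0.2)
print(classify_iris_by_rules(1.4, 0.2))
```

Rules like these are easy to read, but every threshold has to be chosen by hand. KNN replaces them with a direct comparison against previously seen irises.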



In [3]:

    
import matplotlib.pyplot as plt

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14

# create a custom colormap
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])



In [4]:

    
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})



In [5]:

    
# box plot of all numeric columns grouped by species
iris.drop('species_num', axis=1).boxplot(by='species', rot=45)









    Out[5]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1126a1ba8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x112788198>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1127d7048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x116c30550>]], dtype=object)



In [6]:

    
# create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold)









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x1127e5860>



In [7]:

    
# create a scatter plot of SEPAL LENGTH versus SEPAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='sepal_length', y='sepal_width', c='species_num', colormap=cmap_bold)









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x117a97198>

K-nearest neighbors (KNN) classification

Pick a value for K.
Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
- Euclidean distance is often used as the distance metric, but other metrics are allowed.
Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
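
The three steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function name and the toy data are made up, and ties are broken by whichever label was counted first.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = [math.dist(row, x_new) for row in X_train]
    # Indices of the k smallest distances
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Most common response value among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): [petal_length, petal_width] with labels 0 and 1
X_toy = [[1.4, 0.2], [1.5, 0.2], [1.3, 0.3], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.6, 0.3], k=3))
print(knn_predict(X_toy, y_toy, [4.6, 1.3], k=3))
```

Note that there is no separate training computation here: all of the work happens at prediction time, when distances to every stored observation are calculated.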

KNN classification map for iris (K=1)

KNN classification map for iris (K=5)

KNN classification map for iris (K=15)

KNN classification map for iris (K=50)

Question: What's the "best" value for K in this case?

Answer: The value that produces the most accurate predictions on unseen data. We want to create a model that generalizes!



In [8]:

    
iris.head()









    Out[8]:






  
    
      
   sepal_length  sepal_width  petal_length  petal_width      species  species_num
0           5.1          3.5           1.4          0.2  Iris-setosa            0
1           4.9          3.0           1.4          0.2  Iris-setosa            0
2           4.7          3.2           1.3          0.2  Iris-setosa            0
3           4.6          3.1           1.5          0.2  Iris-setosa            0
4           5.0          3.6           1.4          0.2  Iris-setosa            0



In [9]:

    
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]



In [10]:

    
# alternative ways to create "X"
X = iris.drop(['species', 'species_num'], axis=1)
X = iris.loc[:, 'sepal_length':'petal_width']
X = iris.iloc[:, 0:4]



In [11]:

    
# store response vector in "y"
y = iris.species_num



In [12]:

    
# check X's type
print(type(X))
print(type(X.values))









    



<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>



In [13]:

    
# check y's type
print(type(y))
print(type(y.values))









    



<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>



In [14]:

    
# check X's shape (n = number of observations, p = number of features)
print(X.shape)

(150, 4)



In [15]:

    
# check y's shape (single dimension with length n)
print(y.shape)









    



(150,)

scikit-learn's 4-step modeling pattern

Step 1: Import the class you plan to use



In [16]:

    
from sklearn.neighbors import KNeighborsClassifier

Step 2: "Instantiate" the "estimator"

"Estimator" is scikit-learn's term for "model"
"Instantiate" means "make an instance of"



In [17]:

    
# make an instance of a KNeighborsClassifier object
knn = KNeighborsClassifier(n_neighbors=1)
type(knn)









    Out[17]:





sklearn.neighbors.classification.KNeighborsClassifier

Created an object that "knows" how to do K-nearest neighbors classification, and is just waiting for data
Name of the object does not matter
Can specify tuning parameters (aka "hyperparameters") during this step
All parameters not specified are set to their defaults



In [18]:

    
print(knn)









    



KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
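
In more recent scikit-learn releases, printing an estimator shows only the parameters that differ from their defaults, so the output may be shorter than the one above. As a small sketch, the full set of parameters can still be read with `get_params()`. The variable name `knn_example` is used here only to avoid overwriting `knn`.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_example = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict of every parameter, including defaults
params = knn_example.get_params()
print(params['n_neighbors'])
print(params['weights'])
print(params['metric'], params['p'])
```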

Step 3: Fit the model with data (aka "model training")

Model is "learning" the relationship between X and y in our "training data"
Process through which learning occurs varies by model
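Occurs in-place: calling fit() updates the estimator object itself, so the result does not need to be assigned to a new variable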



In [19]:

    
knn.fit(X, y)









    Out[19]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Once a model has been fit with data, it's called a "fitted model"

Step 4: Predict the response for a new observation

New observations are called "out-of-sample" data
Uses the information it learned during the model training process



In [20]:

    
knn.predict([[3, 5, 4, 2]])









    Out[20]:





array([2])

Returns a NumPy array, and we keep track of what the numbers "mean"
Can predict for multiple observations at once



In [21]:

    
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)









    Out[21]:





array([2, 1])
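
Because the predictions are numeric codes, one way to keep track of what they mean is to invert the mapping that was used to create `species_num`. The sketch below copies the two predictions from the output above instead of calling the model, and the name `num_to_species` is hypothetical.

```python
# Inverse of the mapping used to create species_num
num_to_species = {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}

# Predictions copied from the output above
predictions = [2, 1]

predicted_species = [num_to_species[code] for code in predictions]
print(predicted_species)
```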

Tuning a KNN model



In [22]:

    
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)









    Out[22]:





array([1, 1])

Question: Which model produced the correct predictions for the two unknown irises?

Answer: We don't know, because these are out-of-sample observations, meaning that we don't know the true response values. Our goal with supervised learning is to build models that generalize to out-of-sample data. However, we can't truly measure how well our models will perform on out-of-sample data.

Question: Does that mean that we have to guess how well our models are likely to do?

Answer: Thankfully, no. In the next class, we'll discuss model evaluation procedures, which allow us to use our existing labeled data to estimate how well our models are likely to perform on out-of-sample data. These procedures will help us to tune our models and choose between different types of models.



In [23]:

    
# calculate predicted probabilities of class membership
knn.predict_proba(X_new)









    Out[23]:





array([[ 0. ,  0.8,  0.2],
       [ 0. ,  1. ,  0. ]])
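
With uniform weights, each predicted probability is the share of the K nearest neighbors that belong to that class. The sketch below shows the arithmetic for K=5. The neighbor labels are hypothetical, chosen to be consistent with the first row of the output above, and the helper name is made up.

```python
from collections import Counter


def vote_fractions(neighbor_labels, classes):
    """Share of neighbors belonging to each class (uniform weights assumed)."""
    counts = Counter(neighbor_labels)
    return [counts[c] / len(neighbor_labels) for c in classes]


# Hypothetical neighbor labels consistent with the first row of the output above:
# four of the five nearest neighbors are class 1 and one is class 2
print(vote_fractions([1, 1, 1, 1, 2], classes=[0, 1, 2]))
```

This also shows why the probabilities for K=5 can only take values in steps of 0.2.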

Comparing KNN with other models

Advantages of KNN:

Simple to understand and explain
Model training is fast
Can be used for classification and regression
Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular
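
To illustrate the regression case, a common approach is to average the responses of the K nearest neighbors instead of taking a vote. This is a simplified sketch with made-up data and a hypothetical function name, not scikit-learn's implementation.

```python
import math


def knn_regress(X_train, y_train, x_new, k):
    """Predict a numeric response as the mean of the k nearest neighbors' responses."""
    # Sort training indices by Euclidean distance to x_new
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x_new))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k


# Toy data (made up): a single feature and a numeric response
X_toy = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = [0.2, 0.4, 1.0, 1.4, 2.0]

# The two nearest neighbors of 3.9 are the points at 4.0 and 3.0
print(knn_regress(X_toy, y_toy, [3.9], k=2))
```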

Disadvantages of KNN:

Must store all of the training data
Prediction phase can be slow when n is large
Sensitive to irrelevant features
Sensitive to the scale of the data
Accuracy is (generally) not competitive with the best supervised learning methods
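
The sensitivity to scale can be seen with a small made-up example: changing the units of one feature changes which training point is nearest. The data, labels, and function name below are assumptions for illustration only.

```python
import math


def nearest_label(X_train, y_train, x_new):
    """Label of the single nearest training point (K=1, Euclidean distance)."""
    best = min(range(len(X_train)), key=lambda i: math.dist(X_train[i], x_new))
    return y_train[best]


# Made-up data: first feature in cm, second feature in mm
X_mm = [[1.0, 500.0], [5.0, 510.0]]
y_labels = ['a', 'b']
query_mm = [1.2, 508.0]

# Same data with the second feature converted to cm
X_cm = [[row[0], row[1] / 10] for row in X_mm]
query_cm = [query_mm[0], query_mm[1] / 10]

print(nearest_label(X_mm, y_labels, query_mm))  # the mm feature dominates the distance
print(nearest_label(X_cm, y_labels, query_cm))  # both features now contribute
```

A common remedy is to standardize the features before fitting, for example with scikit-learn's `StandardScaler`.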
