Introduction to Machine Learning

Andreas C. Müller and Sarah Guido (2017, O'Reilly)

scikit-learn: K-Nearest Neighbors Classification for the Iris Data Set

1. Load Libraries and Data

  • Load sklearn, import iris data from datasets
  • Look at data keys, target names, and feature names

In [2]:
import sklearn
from sklearn.datasets import load_iris

iris_data = load_iris()

In [3]:
print(iris_data.keys())


dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [5]:
print(iris_data['target_names'])


['setosa' 'versicolor' 'virginica']

In [7]:
print(iris_data['feature_names'])


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
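
The object returned by load_iris behaves like a dictionary whose keys can also be read as attributes, which is why both iris_data['feature_names'] and iris_data.feature_names appear in this notebook. A minimal sketch, assuming scikit-learn and NumPy are installed:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()

# Key-style and attribute-style access return the same underlying values.
same_data = np.array_equal(iris_data['data'], iris_data.data)
same_names = list(iris_data['feature_names']) == list(iris_data.feature_names)

print(same_data, same_names)
```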

1.1 Data is contained in target and data fields

The data field contains numeric measurements in a NumPy array

  • Measurements: sepal_length, sepal_width, petal_length, petal_width
  • Rows represent flowers (n=150) and are called samples
  • Columns represent measurements (p=4) and are called features

In [8]:
print(type(iris_data['data']))


<class 'numpy.ndarray'>

In [11]:
print(iris_data['data'].shape)


(150, 4)

In [13]:
print(iris_data['data'][:5])


[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]

1.2 Target array contains Species of flowers

  • One-dimensional array, one entry per flower (150 samples)
  • Species are encoded as integers from 0 to 2 (50 per class)

In [14]:
print(iris_data['target'])


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
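
The integer encoding can be checked by counting the labels and pairing each one with its species name. This is a small sketch, assuming scikit-learn and NumPy are installed; the variable name counts is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(iris_data['target'])

# Pair each integer label with its species name.
for label, name in enumerate(iris_data['target_names']):
    print(label, name, counts[label])
```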

2. Create Training and Test sets from dataset

  • In scikit-learn, data is usually denoted by capital X and labels by lowercase y
  • From f(x)=y in mathematics; X is the input (matrix), y is the target array (vector)

train_test_split() function

  • Shuffles the data set using a pseudorandom number generator before splitting
  • By default, X_train contains 75% of the dataset, used as the training set to build the ML model
  • By default, X_test contains the remaining 25%, used as the test set to evaluate model accuracy
  • The random_state parameter sets a fixed seed so the same split is reproduced on every run

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_data['data'], iris_data['target'], random_state=0)

In [22]:
print(X_train.shape)
print(y_train.shape)


(112, 4)
(112,)

In [23]:
print(X_test.shape)
print(y_test.shape)


(38, 4)
(38,)
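
The 112/38 split follows from the default hold-out fraction of 25%, and the fixed seed makes the split repeatable. The sketch below illustrates both points on stand-in arrays (X_demo and y_demo are hypothetical names, not part of the iris data), assuming scikit-learn and NumPy are installed.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of samples as the iris data set.
X_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.arange(150)

# With no test_size given, scikit-learn holds out 25% of the samples, rounded up.
expected_test = math.ceil(0.25 * 150)
expected_train = 150 - expected_test

split_a = train_test_split(X_demo, y_demo, random_state=0)
split_b = train_test_split(X_demo, y_demo, random_state=0)

# The same seed yields the same shuffled split on every call.
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(split_a[0].shape, split_a[1].shape, identical)
```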

2.1 Inspect the Data

  • Visualizing the data can reveal abnormalities and outliers
  • A pair plot (scatter matrix) shows all possible pairs of features in a single figure

Create plot using the pandas scatter_matrix function

  • Convert the NumPy array into a pandas DataFrame
  • Label the columns using the strings in iris_data.feature_names
  • Create the scatter_matrix from the DataFrame, colored by y_train

In [33]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import mglearn
from IPython.display import display

In [35]:
iris_df = pd.DataFrame(X_train, columns=iris_data.feature_names)

grr = pd.plotting.scatter_matrix(iris_df, c=y_train, figsize=(15,15), marker='o', 
                       hist_kwds={'bins':20}, s=60, alpha=0.8, cmap=mglearn.cm3)
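
The mglearn package ships with the book and may not be installed everywhere. As a hedged alternative, the same plot can be sketched with one of matplotlib's built-in colormaps; the choice of 'viridis' and the Agg backend line are assumptions made here so the example also runs as a plain script.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line inside a notebook
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data['data'], iris_data['target'], random_state=0)

iris_df = pd.DataFrame(X_train, columns=iris_data.feature_names)

# Same call as above, but colored with matplotlib's built-in 'viridis' colormap.
axes = pd.plotting.scatter_matrix(
    iris_df, c=y_train, figsize=(15, 15), marker='o',
    hist_kwds={'bins': 20}, s=60, alpha=0.8, cmap='viridis')

# One panel per pair of features: a 4 x 4 grid of axes.
print(axes.shape)
```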


3. Build K-Nearest Neighbors (KNN) Classifier Model

  • Building the KNN model consists only of storing the training set
  • Considers a fixed number k of nearest neighbors in the training set (e.g., 3 or 5)

To make a prediction for a new data point:

  • With k=1, the KNN algorithm finds the point in the training set that is closest to the new point
  • It then assigns the label of this training point to the new data point
  • With k>1, it makes the prediction using the majority class among the k nearest neighbors
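
The prediction steps above can be sketched from scratch with NumPy. This is an illustration only, not scikit-learn's implementation: knn_predict is a hypothetical helper, it assumes Euclidean distance, and it breaks ties in favor of the class seen first among the nearest neighbors, which may differ from what a library does.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Hypothetical helper: majority vote among the k nearest training points."""
    # Euclidean distance from x_new to every stored training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points, nearest first.
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data for illustration: two well-separated groups.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
y_toy = np.array([0, 0, 1, 1, 1])

pred_k1 = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=1)
pred_k3 = knn_predict(X_toy, y_toy, np.array([3.0, 3.0]), k=3)
print(pred_k1, pred_k3)
```

Note that "training" here really is just keeping X_toy and y_toy around; all the work happens at prediction time.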

KNeighborsClassifier in scikit-learn

  • Before using the model, we need to instantiate the class into an object
  • The most important parameter for KNeighborsClassifier is the number of neighbors (n_neighbors)
  • For this example, set k=1

In [36]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [38]:
knn.fit(X_train, y_train)


Out[38]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
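
Because n_neighbors is fixed when the object is instantiated, trying a different k means creating a new classifier. The sketch below loops over a few values and records test accuracy using the score method covered in section 5; the values 1, 3, and 5 are example choices, and the resulting scores may vary with the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data['data'], iris_data['target'], random_state=0)

scores = {}
for k in (1, 3, 5):
    # A fresh classifier per value of k; fit only stores the training set.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

for k, score in scores.items():
    print("n_neighbors={}: test accuracy {:.2f}".format(k, score))
```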

4. Make Prediction

  • Use the knn model to predict the species of a new, unseen flower
  • Put the new measurements into a two-dimensional NumPy array with shape (number of samples, number of features)
  • To make a prediction, call the predict method of the knn object

In [40]:
import numpy as np

X_new = np.array([[5, 2.9, 1, 0.2]])
print(X_new.shape)


(1, 4)
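
scikit-learn expects two-dimensional input even for a single flower. If the measurements start out as a flat array, reshape can add the sample dimension; this is a minimal sketch in which measurements and X_single are names introduced for illustration.

```python
import numpy as np

# A single flower's measurements as a flat, one-dimensional array.
measurements = np.array([5, 2.9, 1, 0.2])
print(measurements.shape)

# reshape(1, -1) gives one row (sample) with as many columns (features) as needed.
X_single = measurements.reshape(1, -1)
print(X_single.shape)
```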

In [42]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_data['target_names'][prediction]))


Prediction: [0]
Predicted target name: ['setosa']

The KNN model predicts that the new iris belongs to class 0, which is the species 'setosa'.

5. Evaluate the Model

  • Make a prediction for each iris in the test set and compare it to its known label
  • Model accuracy is measured as the fraction of flowers for which the correct species is predicted
  • Use the score() method in scikit-learn to obtain this measure of model accuracy

In [44]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))


Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]

In [45]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))


Test set score: 0.97

In [47]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))


Test set score: 0.97

Model accuracy is 97%

  • This means that the model predicted the labels of 97% of the irises in the test set correctly
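
Accuracy is simply the fraction of matching labels. The sketch below shows the calculation on made-up labels (y_true_toy and y_pred_toy are hypothetical, not the iris results) and then checks the arithmetic for the 38-flower test set, where a single mistake is consistent with the roughly 97% accuracy.

```python
import numpy as np

# Hypothetical labels for illustration only (not the iris results).
y_true_toy = np.array([0, 1, 2, 2, 1, 0])
y_pred_toy = np.array([0, 1, 2, 1, 1, 0])

# Element-wise comparison gives True/False; the mean is the fraction correct.
accuracy_toy = np.mean(y_pred_toy == y_true_toy)
print(accuracy_toy)

# 37 correct predictions out of 38 test flowers.
fraction_iris = 37 / 38
print("{:.6f}".format(fraction_iris))
```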

ML Summary in scikit-learn: fit, predict, and score

  • The fit, predict, and score methods are the common interface to supervised models in scikit-learn

In [48]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    iris_data['data'], iris_data['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))


Test set score: 0.97
