K Nearest Neighbors method used on Iris dataset



In [1]:

    
# importing all required modules
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np

Scikit learn contains a database of pre-loaded datasets that can be accessed in the following way:



In [2]:

    
# importing datasets
from sklearn import datasets
iris = datasets.load_iris()

However the type is not a typical Pandas dataframe or Numpy array



In [3]:

    
type(iris)









    Out[3]:





sklearn.datasets.base.Bunch

The content of the data set can be accessed in the following way:



In [4]:

    
iris.keys()









    Out[4]:





dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])



In [5]:

    
# displaying the set first ten rows
iris.data[:10]









    Out[5]:





array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1]])



In [6]:

    
# assigning data and target to X and y variables that will be used in machine learning
X = iris.data
y = iris.target
y









    Out[6]:





array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

0 = iris-setosa
1 = iris-versicolor
2 = iris-virginica



In [7]:

    
iris.target_names









    Out[7]:





array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')

In order to faciliate the display of the data, a data frame can be created



In [8]:

    
df = pd.DataFrame(X, columns=iris.feature_names)
df.head()









    Out[8]:






  
    
      
      sepal length (cm)
      sepal width (cm)
      petal length (cm)
      petal width (cm)
    
  
  
    
      0
      5.1
      3.5
      1.4
      0.2
    
    
      1
      4.9
      3.0
      1.4
      0.2
    
    
      2
      4.7
      3.2
      1.3
      0.2
    
    
      3
      4.6
      3.1
      1.5
      0.2
    
    
      4
      5.0
      3.6
      1.4
      0.2



In [9]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB

KNN Method applied to Iris dataset



In [10]:

    
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split



In [11]:

    
# spliting the dataset between test and training data (using 40% for test data because of small size of dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42, stratify=y)



In [12]:

    
# Creating the knn classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)



In [13]:

    
# fitting the data
knn.fit(X_train, y_train)









    Out[13]:





KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')



In [14]:

    
# predicting the outcomes
y_pred = knn.predict(X_test)



In [15]:

    
y_pred









    Out[15]:





array([0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 1, 1, 1, 2, 0, 0, 0, 0, 2, 2, 1, 1, 2,
       2, 1, 2, 2, 0, 1, 0, 2, 1, 2, 0, 0, 2, 0, 1, 0, 2, 1, 2, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 2, 2, 0, 1, 0, 0])



In [16]:

    
# model accuracy
knn.score(X_test, y_test)









    Out[16]:





0.93333333333333335

Looking for the best number of neighbors for the model



In [17]:

    
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 15)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy');

It can be concluded the the best accuracies are obtained with 3 or from 7 to 10 neighbors



In [19]:

    
# using sklearn to obtain other validation of the model
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)









    Out[19]:





array([[20,  0,  0],
       [ 0, 20,  0],
       [ 0,  4, 16]])

The array above shows that iris-setosa (0) is classified well, the same for iris-versicolor (1) but in case of iris-virginica (2) only 16 are properly classified and 4 are misclassified as versicolor.



In [21]:

    
print(classification_report(y_test, y_pred))









    



             precision    recall  f1-score   support

          0       1.00      1.00      1.00        20
          1       0.83      1.00      0.91        20
          2       1.00      0.80      0.89        20

avg / total       0.94      0.93      0.93        60

pecision = TP/(TP+FP), recall = TP/(TP + FN), f1-score = 2 precision recall/(precision + recall)

Parameters hypertuning with sklearn



In [24]:

    
from sklearn.model_selection import GridSearchCV

# parameters grid (in our case just one parameter = n_neighbours)
param_grid = {'n_neighbors': np.arange(1,50)}
knn2 = KNeighborsClassifier()
knn_cv = GridSearchCV(knn2, param_grid, cv=5)
knn_cv.fit(X, y)









    Out[24]:





GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)



In [25]:

    
knn_cv.best_params_









    Out[25]:





{'n_neighbors': 6}



In [26]:

    
knn_cv.best_score_









    Out[26]:





0.97999999999999998

Using pipeline for classification



In [29]:

    
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# setting up pipeline steps
steps =[('scaler', StandardScaler()), ('knn3', KNeighborsClassifier())]

pipeline = Pipeline(steps)

# parameters grid is set up in the cell above but it must be redef
parameters = {'knn3__n_neighbors': np.arange(1,50)}

# using Grid search to build the model
cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_cv
y_pred_cv = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))









    



Accuracy: 0.9166666666666666
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        20
          1       0.83      1.00      0.91        20
          2       1.00      0.80      0.89        20

avg / total       0.94      0.93      0.93        60

Tuned Model Parameters: {'knn3__n_neighbors': 6}

Use of seaborn heat map to show correlation in the dataset



In [18]:

    
sns.heatmap(df.corr(), square=True, cmap='RdYlGn');



In [28]:

    
np.arange(1,50)









    Out[28]:





array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])



In [ ]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2