Reviewing the iris dataset

K-nearest neighbors (KNN) classification

  • Pick a value for K.
  • Search for the K observations in the training data that are "nearest" to the measurements of the unknown Iris.
  • Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown Iris.
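
The three steps above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements KNN: the function name `knn_predict`, the use of Euclidean distance, and the tiny made-up training set are all assumptions for the example (written for Python 3).

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k):
    """Predict a label for x_new by majority vote among its k nearest rows."""
    # Euclidean distance from x_new to every training observation
    distances = []
    for row, label in zip(X_train, y_train):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, x_new)))
        distances.append((dist, label))
    # Keep the k closest observations
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Most popular label among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data (not the iris data): two features, two classes
X_train = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]]
y_train = [0, 0, 1, 1]

prediction = knn_predict(X_train, y_train, [1.1, 0.9], k=3)
print(prediction)
```

With K = 3, two of the three nearest neighbors of the new point belong to class 0, so the vote returns 0.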

In [6]:
# import load_iris function from datasets module 
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
# print() with a single argument works in both Python 2 and Python 3
print(X.shape)
print(y.shape)


(150L, 4L)
(150L,)

scikit-learn 4-step modeling pattern

  • Step 1: Import the class you plan to use

In [7]:
from sklearn.neighbors import KNeighborsClassifier
  • Step 2: "Instantiate" the "estimator"
    • "Estimator" is scikit-learn's term for a model
    • "Instantiate" means "make an instance of"

In [8]:
knn = KNeighborsClassifier(n_neighbors=1)
  • The name of the object does not matter
  • Tuning parameters can be specified during this step
  • All parameters not specified are set to their defaults
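
One way to check which values an estimator ended up with is `get_params()`, which returns the parameters as a dictionary. A minimal sketch, assuming a reasonably recent scikit-learn release (the variable name `params` is just for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# Dictionary of every constructor parameter and its current value
params = knn.get_params()

print(params["n_neighbors"])  # the value passed explicitly above
print(params["weights"])      # not specified, so left at its default
```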

In [9]:
print(knn)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=1, p=2, weights='uniform')
  • Step 3: Fit the model with data ("model training")
    • Model is learning the relationship between X and y
    • Occurs in-place
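
"In place" here means that `fit` updates the estimator object itself, so there is no need to assign the result to a new variable. A small sketch of that behavior, assuming a reasonably recent scikit-learn release; the names `returned` and `same_object` are only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)
returned = knn.fit(X, y)

# fit() modifies knn and hands back that very same object
same_object = returned is knn
print(same_object)
```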

In [10]:
knn.fit(X,y)


Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=1, p=2, weights='uniform')
  • Step 4: Predict the response for a new observation
    • New observations are called "out-of-sample" data
    • The model uses the information it learned during training

In [11]:
# predict expects a 2D array-like: one row per observation
knn.predict([[3, 5, 4, 2]])


Out[11]:
array([2])

In [12]:
X_new = [[3,5,4,2], [5,4,3,2]]
knn.predict(X_new)


Out[12]:
array([2, 1])

Using a different value for K


In [13]:
knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X,y)
knn.predict(X_new)


Out[13]:
array([1, 1])

Using a different classification model


In [14]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X,y)

logreg.predict(X_new)


Out[14]:
array([2, 0])
