Training a machine learning model with scikit-learn

From the video series: Introduction to machine learning with scikit-learn

jupyter notebook 04_model_training.ipynb

Agenda

  • What is the K-nearest neighbors classification model?
  • What are the four steps for model training and prediction in scikit-learn?
  • How can I apply this pattern to other machine learning models?

K-nearest neighbors (KNN) classification

  1. Pick a value for K.
  2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
  3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris (a from-scratch sketch of these steps follows below).
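To make these three steps concrete, here is a minimal from-scratch sketch of the same procedure. It is not scikit-learn's implementation: the function name knn_predict and the choice of Euclidean distance are illustrative assumptions.

# a hypothetical helper, for illustration only
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_unknown, k):
    # step 1: K is passed in as the parameter k
    # step 2: compute Euclidean distances and find the k nearest observations
    distances = np.sqrt(((X_train - x_unknown) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # step 3: return the most popular response value among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]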

Example training data

KNN classification map (K=1)

KNN classification map (K=5)

Image Credits: Data3classes, Map1NN, Map5NN by Agor153. Licensed under CC BY-SA 3.0

Loading the data


In [1]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

In [2]:
# print the shapes of X and y
print(X.shape)
print(y.shape)


(150, 4)
(150,)
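The integers in y encode the three iris species. To see what they (and the columns of X) refer to, you can print the names stored on the bunch object; this exploratory cell is an addition, not part of the original notebook.

# names of the four features (the columns of X)
print(iris.feature_names)

# names of the three species that the integers 0, 1, and 2 encode
print(iris.target_names)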

scikit-learn 4-step modeling pattern

Step 1: Import the class you plan to use


In [3]:
from sklearn.neighbors import KNeighborsClassifier

Step 2: "Instantiate" the "estimator"

  • "Estimator" is scikit-learn's term for model
  • "Instantiate" means "make an instance of"

In [4]:
knn = KNeighborsClassifier(n_neighbors=1)
  • Name of the object does not matter
  • Can specify tuning parameters (aka "hyperparameters") during this step
  • All parameters not specified are set to their defaults

In [5]:
print(knn)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
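The same information is available programmatically: every scikit-learn estimator has a get_params method that returns its hyperparameters as a dictionary. This cell is an optional addition.

# dictionary of all hyperparameters and their current values
knn.get_params()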

Step 3: Fit the model with data (aka "model training")

  • The model learns the relationship between X and y
  • Training occurs in-place: the knn object itself is modified

In [6]:
knn.fit(X, y)


Out[6]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
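As the Out[6] line shows, fit returns the fitted estimator itself, so instantiation and fitting can optionally be chained into a single line. This is an equivalent style, not a required step.

# equivalent to the separate instantiate and fit steps above
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)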

Step 4: Predict the response for a new observation

  • New observations are called "out-of-sample" data
  • Uses the information it learned during the model training process

In [7]:
print(knn.predict([[3, 5, 4, 2]]))


[2]
  • Returns a NumPy array
  • Can predict for multiple observations at once

In [8]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)


Out[8]:
array([2, 1])
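The values in that array are class labels, not species names. If you want the names, you can index iris.target_names with the predictions; this optional extra assumes the iris bunch loaded earlier is still in scope.

# translate predicted class integers into species names
print(iris.target_names[knn.predict(X_new)])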

Using a different value for K


In [9]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)


Out[9]:
array([1, 1])
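With K=5, each prediction is a majority vote among five neighbors. KNeighborsClassifier also has a predict_proba method that reports the fraction of those neighbors in each class, which shows how close the vote was. This cell is an optional addition.

# fraction of the 5 nearest neighbors belonging to each class
knn.predict_proba(X_new)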

Using a different classification model


In [10]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)


Out[10]:
array([2, 0])
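The output above comes from an older scikit-learn release. In recent versions, LogisticRegression defaults to the lbfgs solver with max_iter=100, which can emit a ConvergenceWarning on this dataset; raising max_iter is a common fix. A sketch, and your predictions may differ slightly by version:

# raise the iteration cap so the lbfgs solver can converge cleanly
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)
logreg.predict(X_new)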

Resources

Credit

  • Kevin Markham