Training a machine learning model with scikit-learn

Agenda

  • What is the K-nearest neighbors classification model?
  • What are the four steps for model training and prediction in scikit-learn?
  • How can I apply this pattern to other machine learning models?

Reviewing the iris dataset


In [1]:
from IPython.display import HTML
HTML('<iframe src="http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" width=300 height=200></iframe>')


Out[1]:
(iframe rendering the raw iris.data file from the UCI repository)

  • 150 observations
  • 4 features (sepal length, sepal width, petal length, petal width)
  • Response variable is the iris species
  • Classification problem since response is categorical
  • More information in the UCI Machine Learning Repository

K-nearest neighbors (KNN) classification

  1. Pick a value for K.
  2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
  3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris (sketched in code below).
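
To make the three steps concrete, here is a minimal from-scratch sketch of KNN for a single unknown observation, assuming Euclidean distance and integer class labels in NumPy arrays (the helper name knn_predict_one is hypothetical, not part of scikit-learn):

import numpy as np

def knn_predict_one(X_train, y_train, x_unknown, k):
    # step 2: Euclidean distance from the unknown observation to every training observation
    distances = np.sqrt(((X_train - x_unknown) ** 2).sum(axis=1))
    # indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # step 3: the most popular response value among those K neighbors
    return np.bincount(y_train[nearest]).argmax()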

Loading the data


In [3]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target
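
As a quick sanity check (not in the original notebook), the "bunch" object also stores the feature and class names via its feature_names and target_names attributes:

# the four feature names and the three species names
print(iris.feature_names)
print(iris.target_names)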

In [4]:
# print the shapes of X and y
print(X.shape)
print(y.shape)


(150, 4)
(150,)

scikit-learn 4-step modeling pattern

Step 1: Import the class you plan to use


In [5]:
from sklearn.neighbors import KNeighborsClassifier

Step 2: "Instantiate" the "estimator"

  • "Estimator" is scikit-learn's term for model
  • "Instantiate" means "make an instance of"

In [6]:
knn = KNeighborsClassifier(n_neighbors=1)
  • Name of the object does not matter
  • Can specify tuning parameters (aka "hyperparameters") during this step, as shown after this list
  • All parameters not specified are set to their defaults
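
For example (an illustrative line, not from the original notebook), several hyperparameters can be set at once; the variable name knn_weighted is arbitrary:

# an alternative estimator whose neighbors vote weighted by inverse distance
knn_weighted = KNeighborsClassifier(n_neighbors=5, weights='distance')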

In [7]:
print(knn)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Step 3: Fit the model with data (aka "model training")

  • Model is learning the relationship between X and y
  • Occurs in place: fit modifies the knn object itself and also returns it, which is why the cell below echoes the estimator

In [8]:
knn.fit(X, y)


Out[8]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Step 4: Predict the response for a new observation

  • New observations are called "out-of-sample" data
  • Uses the information it learned during the model training process

In [9]:
# note the double brackets: predict expects a 2D array-like of shape (n_samples, n_features)
knn.predict([[3, 5, 4, 2]])


Out[9]:
array([2])
  • Returns a NumPy array
  • Can predict for multiple observations at once

In [10]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)


Out[10]:
array([2, 1])
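
To read those labels as species, the predictions can index the bunch's target_names array (a brief aside, not in the original notebook; for iris, 0, 1, and 2 are setosa, versicolor, and virginica):

# map numeric predictions back to species names
iris.target_names[knn.predict(X_new)]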

Using a different value for K


In [11]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)


Out[11]:
array([1, 1])

Using a different classification model


In [12]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)


Out[12]:
array([2, 0])
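
The same import/instantiate/fit/predict pattern carries over to any scikit-learn classifier. As one more illustration (not from the original notebook), a decision tree:

# import the class
from sklearn.tree import DecisionTreeClassifier

# instantiate the model (using the default parameters)
dtc = DecisionTreeClassifier()

# fit the model with data
dtc.fit(X, y)

# predict the response for new observations
dtc.predict(X_new)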
