Getting started with model selection

Those who have used Scikit-Learn before will no doubt already be familiar with the Choosing the Right Estimator flow chart. This diagram is handy for those who are just getting started, as it models a simplified decision-making process for selecting the machine learning algorithm that is best suited to one's dataset.

Imports



In [ ]:

    
from __future__ import print_function

import os
import numpy as np
import pandas as pd

from sklearn.preprocessing import scale
from sklearn.preprocessing import normalize

from sklearn import cross_validation as cv
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import r2_score, mean_squared_error as mse

from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression

from sklearn.neighbors import KNeighborsClassifier

Let's try it together.

Load the datasets



In [ ]:

    
# Load the room occupancy dataset
occupancy = os.path.join('data','occupancy_data','datatraining.txt')
occupancy = pd.read_csv(occupancy, sep=',')
occupancy.columns = [
    'date', 'temp', 'humid', 'light', 'co2', 'hratio', 'occupied'
]

More than 50 samples?

First we are asked whether we have more than 50 samples for our dataset.



In [ ]:

    
print(len(occupancy))

Predicting a quantity or a category?

Next we're asked if we're predicting a category. For the occupancy dataset, the answer is yes. For occupancy, we are predicting whether a room is occupied (0 for no, 1 for yes). Therefore, we will be looking for a classifier for our occupancy dataset.



In [ ]:

    
def classify(attributes, targets, model):
    """
    Executes classification using the specified model and returns
    a classification report.
    """
    # Split data into 'test' and 'train' for cross validation
    splits = cv.train_test_split(attributes, targets, test_size=0.2)
    X_train, X_test, y_train, y_test = splits

    model.fit(X_train, y_train)
    y_true = y_test
    y_pred = model.predict(X_test)
    print(classification_report(y_true, y_pred, target_names=list(occupancy)))

Since our categorical dataset has fewer than 100,000 instances, we are prompted to start with sklearn.svm.LinearSVC (which will map the data to a higher dimensional feature space), or failing that, sklearn.neighbors.KNeighborsClassifier (which will assign instances to the class most common among its k nearest neighbors).

In our feature exploration of the occupancy dataset, you'll remember that the different attributes were not all on the same scale, so in addition to the other steps, we import scale so that we can standardize all the features before we run fit-predict:



In [ ]:

    
features = occupancy[['temp', 'humid', 'light', 'co2', 'hratio']]
labels   = occupancy['occupied']

# Scale the features
stdfeatures = scale(features)

classify(stdfeatures, labels, LinearSVC())
classify(stdfeatures, labels, KNeighborsClassifier())



In [ ]:

    
# Load the concrete compression data set
concrete   = pd.read_excel(os.path.join('data','Concrete_Data.xls'))
concrete.columns = [
    'cement', 'slag', 'ash', 'water', 'splast',
    'coarse', 'fine', 'age', 'strength'
]

More than 50 samples?



In [ ]:

    
print(len(concrete))

Predicting a quantity or a category?

For the concrete dataset, the labels for the strength of the concrete are continuous, so we are predicting a quantity, not a category. Therefore, we will be looking for a regressor for our concrete dataset.



In [ ]:

    
def regress(attributes, targets, model):
    # Split data into 'test' and 'train' for cross validation
    splits = cv.train_test_split(attributes, targets, test_size=0.2)
    X_train, X_test, y_train, y_test = splits

    model.fit(X_train, y_train)
    y_true = y_test
    y_pred = model.predict(X_test)
    print("Mean squared error = {:0.3f}".format(mse(y_true, y_pred)))
    print("R2 score = {:0.3f}".format(r2_score(y_true, y_pred)))

Meanwhile for our concrete dataset, we must determine whether we think all of the features are important, or only a few of them. If we decide to keep all the features as is, the chart suggests using sklearn.linear_model.RidgeRegression (which will identify features that are less predictive and ensure they have less influence in the model) or possibly sklearn.svm.SVR with a linear kernel (which is similar to the LinearSVC classifier). If we guess that some of the features are not important, we might decide instead to choose sklearn.linear_model.Lasso (which will drop out any features that aren't predictive) or sklearn.linear_model.ElasticNet (which will try to find a happy medium between the Lasso and Ridge methods, taking the linear combination of their L1 and L2 penalties).

Let's try a few because, why not?



In [ ]:

    
features = concrete[[
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]]
labels   = concrete['strength']

regress(features, labels, Ridge())
regress(features, labels, Lasso())
regress(features, labels, ElasticNet())

As illustrated in the code above, the Scikit-Learn API allows us to rapidly deploy as many models as we want. This is an incredibly powerful feature of the Scikit-Learn library that cannot be understated.