Messing Around with Scikit-Learn


In [1]:
import sklearn
import numpy as np
%matplotlib
import matplotlib.pyplot as plt


Using matplotlib backend: Qt4Agg

The plan for today:

  • Part 1: I will regurgitate some of the scikit-learn documentation at you, in IPython notebook form
  • Part 2: Shannon will give an example of scikit-learn in the wild

The basics: What is scikit-learn?

Scikit-learn, or sklearn, is a machine learning package for Python. It is built on the familiar scientific Python tools (numpy, scipy, matplotlib, etc.) and thus plays very nicely with data in the form of numpy arrays (a cursory googling shows people are also working on better integration between sklearn and other popular Pythonic data structures, like pandas DataFrames).

This notebook is really just an aggregation of some of the things from the sklearn documentation that I thought were interesting and/or useful. The documentation is really quite well organized in my opinion, so if you are curious about the layout of scikit-learn or whether it has specific capabilities you're interested in, go check it out!

When should you use scikit-learn?

When you have data with many samples that you can formulate into a learning problem. In other words, you'd like to find patterns within that data and/or predict properties of unknown data. The documentation breaks down learning problems into the following categories:

  • Supervised learning

    You have input data and corresponding properties (labels), and you'd like to learn the relationship between them so you can predict the properties of unknown data. This can fall into a couple of different sub-categories:

    • Classification: When the data fall into two or more discrete categories; classifying hand-written digits, for example.

    • Regression: Similar to classification, but the output is a continuous variable rather than group membership. An example of a regression problem might be predicting height given age, weight, and gender.

  • Unsupervised learning

    You have input data, but no corresponding labels or dependent variable. Instead of using the data to predict some output for unknown data, you're interested in finding structure in the data that is not immediately apparent. For example, you might want to detect clusters in the input data, or use dimensionality reduction to identify the components that account for the most variability in your data or to aid in visualization. (A minimal sketch of both flavors follows this list.)
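
To make the distinction concrete, here is a minimal sketch that fits a supervised classifier and an unsupervised clustering on the same inputs, using sklearn's built-in iris data; the choice of KNeighborsClassifier and KMeans (and their parameter values) is purely illustrative, not a recommendation.


In [ ]:
# A minimal supervised vs. unsupervised sketch on the built-in iris data.
# The estimators and parameter values here are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the estimator sees the labels y and learns to predict them
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))    # predicted class labels

# Unsupervised: no labels at all; the estimator looks for cluster structure
km = KMeans(n_clusters=3, random_state=0).fit(X)
print(km.labels_[:5])        # cluster assignments, not class labels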

Other considerations

sklearn may be a good fit if:

  • You have moderately sized data (>50 samples, but not "big data")
    • For instance, the fit time of the svm.SVC classifier scales worse than quadratically with the number of samples, so the documentation recommends limiting this classifier to datasets with fewer than a few tens of thousands of samples.

  • You are interested in trying off-the-shelf machine learning algorithms to explore your data, with good performance and a low learning curve
    • sklearn implements quite a few regression, classification, and unsupervised learning methods, but it is not all-inclusive. The goal of sklearn is to make popular machine learning approaches accessible rather than to implement every algorithm out there.

Drawbacks and Caveats

sklearn is not a silver bullet for every type of data and every application. The documentation readily states that the goal of sklearn is to do a few things well rather than everything. A couple of specific thoughts:

  • Scalability: see above. Make sure to check the documentation for the specific algorithms you want to use. There are usually notes about how the algorithm will perform for data with different numbers of samples and features.

  • Hardware acceleration: While sklearn wraps fast C libraries such as libsvm and LIBLINEAR, it doesn't support GPUs. The sklearn developers prioritize portability over the complexity that supporting extra hardware would introduce.

  • Neural networks: If you're interested in another extremely popular buzzword, deep learning, you might be better served by projects that focus specifically on neural networks, like Pylearn2, PyBrain, or Caffe (developed at UC Berkeley). There is also scikit-neuralnetwork, which aims to add neural network capabilities while staying true to the sklearn API.

  • Black box: As a non-statistician, this is a big caveat for me. sklearn is almost criminally easy to use; the difficult part is interpreting your results. Fortunately, the documentation for each algorithm links to the relevant literature on which the implementation is based.

A Quick Look at some Classifiers

Shannon will focus on regression in part 2, so for now let's take a look at some of the classifiers available in sklearn. Fortunately, there is a great comparison between classifiers in the documentation.

The code from the above link is copied below so that you can run it in this notebook. Credit to Gaël Varoquaux, Andreas Müller, and Jaques Grobler.


In [2]:
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

from matplotlib.colors import ListedColormap
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "Linear Discriminant Analysis",
         "Quadratic Discriminant Analysis"]
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds in datasets:
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
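        # Not every classifier exposes decision_function; for those that
        # don't, fall back to the predicted probability of the second class.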
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

figure.subplots_adjust(left=.02, right=.98)
plt.show()

sklearn Roadmap

How do you decide which tool to use for your specific problem? Start with this chart. The image is included below, but at the previous link, clicking on specific bubbles in the flow chart will take you to the corresponding documentation.

Breakout - A Classification Example with the included "digits" dataset

sklearn includes some data in the datasets package so you can try some basic examples. Let's tackle an (apparently) classic problem in machine learning: hand-written digit classification.

This example is adapted from the Quick Start page in the sklearn documentation.

Load the data


In [3]:
import sklearn.datasets
digits = sklearn.datasets.load_digits()

Inspect the data


In [4]:
print type(digits)


<class 'sklearn.datasets.base.Bunch'>

In [5]:
sklearn.datasets.base.Bunch?

In [6]:
print digits.keys(), "\n"
# What are the types of the data stored in the bunch?
for k, v in digits.iteritems(): print "%s: %s" %(k, type(v))


['images', 'data', 'target_names', 'DESCR', 'target'] 

images: <type 'numpy.ndarray'>
data: <type 'numpy.ndarray'>
target_names: <type 'numpy.ndarray'>
DESCR: <type 'str'>
target: <type 'numpy.ndarray'>

In [7]:
print digits.DESCR


Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.


In [8]:
print "images: ", digits.images.shape
print "data (n_samples, n_features): ", digits["data"].shape


images:  (1797, 8, 8)
data (n_samples, n_features):  (1797, 64)
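
Incidentally, the data array is just each 8x8 image unraveled into a 64-element row, so the two arrays carry the same information; a quick sanity check:


In [ ]:
# Sanity check: "data" should be "images" flattened row-by-row into
# (n_samples, 64)
print(np.allclose(digits["images"].reshape((len(digits["images"]), -1)),
                  digits["data"]))    # expect True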

In [9]:
# Get a feel for the data
fig, ax = plt.subplots(2, 4, figsize=(13,9))
for a in ax.ravel():
    data_ind = int(digits["images"].shape[0] * np.random.rand())
    img = a.imshow(digits["images"][data_ind,...], interpolation="nearest", cmap="gray")
    a.set_title("Ground Truth: %s" %(digits["target"][data_ind]));
plt.show()

Refer to the Roadmap to pick our classifier: SVC


In [10]:
from sklearn.svm import SVC
SVC?

In [11]:
# Let's try a support-vector classifier with the default parameter values
clf = SVC()

Now, split up our dataset into train and test sets

For an explanation of why you should have separate train and test sets (to avoid overfitting), see this useful article from the sklearn docs.


In [12]:
from sklearn.cross_validation import train_test_split
test_size = 0.3      # Train on 70% of the data, test on the remaining 30%
X_train, X_test, y_train, y_test = train_test_split(digits["data"], digits["target"], test_size=test_size)
print X_train.shape


(1257, 64)

Train our classifier


In [13]:
clf.fit(X_train, y_train)


Out[13]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
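
One detail worth noticing in that repr: in this version of sklearn, gamma='auto' means the RBF kernel coefficient falls back to 1/n_features (and C defaults to 1.0). Just to see what that works out to for our 64-feature data:


In [ ]:
# In this sklearn version, gamma='auto' resolves to 1/n_features for the
# RBF kernel
print(1.0 / X_train.shape[1])    # 0.015625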

The moment of truth: how well did our classifier do?


In [14]:
# Look at some specific cases
fig, ax = plt.subplots(2, 4, figsize=(13,9))
for a in ax.ravel():
    data_ind = int(X_test.shape[0] * np.random.rand())
    img = X_test[data_ind,...].reshape((8, 8))
    prediction = clf.predict(np.array(X_test[data_ind], ndmin=2))
    actual = y_test[data_ind]
    a.imshow(img, interpolation="nearest", cmap="gray");
    a.set_title("Predicted: %s\nGround Truth: %s" %(prediction, actual));
plt.show()

In [15]:
# Evaluate the classifier on the entire test set
clf.score(X_test, y_test)


Out[15]:
0.49074074074074076
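
For classifiers, score reports the mean accuracy on the given data, so we can reproduce that number by hand:


In [ ]:
# score() for classifiers is mean accuracy; verify by hand
print(np.mean(clf.predict(X_test) == y_test))    # should match the score above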

In [16]:
# Let's get a little more detail
from sklearn.metrics import classification_report, confusion_matrix
predicted = clf.predict(X_test)
print classification_report(y_test, predicted)
print
print confusion_matrix(y_test, predicted)


             precision    recall  f1-score   support

          0       1.00      0.57      0.72        44
          1       1.00      0.60      0.75        50
          2       1.00      0.76      0.86        50
          3       1.00      0.12      0.21        67
          4       1.00      0.98      0.99        46
          5       1.00      0.23      0.38        60
          6       1.00      0.61      0.76        59
          7       1.00      0.24      0.39        58
          8       1.00      0.07      0.14        55
          9       0.16      1.00      0.27        51

avg / total       0.92      0.49      0.53       540


[[25  0  0  0  0  0  0  0  0 19]
 [ 0 30  0  0  0  0  0  0  0 20]
 [ 0  0 38  0  0  0  0  0  0 12]
 [ 0  0  0  8  0  0  0  0  0 59]
 [ 0  0  0  0 45  0  0  0  0  1]
 [ 0  0  0  0  0 14  0  0  0 46]
 [ 0  0  0  0  0  0 36  0  0 23]
 [ 0  0  0  0  0  0  0 14  0 44]
 [ 0  0  0  0  0  0  0  0  4 51]
 [ 0  0  0  0  0  0  0  0  0 51]]
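
Reading the confusion matrix: rows are the true digits and columns are the predictions, so nearly everything is being dumped into the last column (predicted as 9). The per-class recall in the report above is just each diagonal entry divided by its row sum:


In [ ]:
# Rows = true labels, columns = predictions; recall per class is the
# diagonal over the row sums, matching the recall column above
cm = confusion_matrix(y_test, predicted)
print(cm.diagonal().astype(float) / cm.sum(axis=1))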

Well, that was disappointing... what now?

The classifier we originally set up used the default parameter values... we didn't even attempt to add any input of our own. Perhaps we should try adjusting the penalty term (C) and the kernel coefficient (gamma) to see if we can get better results. We can use the grid_search module, which evaluates each candidate parameter combination by cross-validation, to find better hyperparameters for our classifier. We still don't even really need to know what we're doing!


In [ ]:
from sklearn import grid_search

# Construct a dictionary of parameters you'd like to search over
penalties = [1., 10., 100., 1000.]
gammas = [0.1, 0.01, 0.001, 0.0001]
parameters_to_search = {"C":penalties, "gamma":gammas}

# Search the parameter space
svm = SVC()
clf = grid_search.GridSearchCV(svm, parameters_to_search, n_jobs=1)
clf.fit(X_train, y_train)

NOTE: If this runs too slowly (it took ~5.75 sec on my computer) and you have multiple cores, you can try running GridSearchCV with the n_jobs keyword set to a value greater than 1. In that case, sklearn will use the multiprocessing module to parallelize the search.
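
For instance, n_jobs=-1 asks sklearn to use every available core (clf_parallel below is just an illustrative name):


In [ ]:
# Same search, parallelized across all available cores with n_jobs=-1
clf_parallel = grid_search.GridSearchCV(svm, parameters_to_search, n_jobs=-1)
clf_parallel.fit(X_train, y_train)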


In [ ]:
# Results
for score in clf.grid_scores_: print score
print "\nBest parameters: ", clf.best_params_

In [ ]:
y_true, y_pred = y_test, clf.predict(X_test)
print classification_report(y_true, y_pred)
print confusion_matrix(y_true, y_pred)

Part 1 Conclusion

sklearn is a super easy-to-use machine learning library for Python. If you'd like to quickly apply off-the-shelf machine learning algorithms to datasets without a big learning curve, it is a great tool. The above presentation only scratches the surface of what sklearn can do; if you're interested in learning more, the documentation is quite thorough and well organized. Links to the examples in the documentation on which this presentation is based are listed below: