In [1]:
import sklearn
import numpy as np
%matplotlib
import matplotlib.pyplot as plt
The plan for today:
Scikit-learn, or sklearn, is a machine learning package for Python. It is built on the familiar scientific Python tools (numpy, scipy, matplotlib, etc.) and thus plays very nicely with data in the form of numpy arrays (a cursory googling shows people are also working on better integrating sklearn with other popular pythonic data structures like pandas DataFrames).
This notebook is really just an aggregation of some of the things from the sklearn documentation that I thought were interesting and/or useful. The documentation is really quite well organized in my opinion, so if you are curious about the layout of scikit-learn or whether it has specific capabilities that you're interested in, go check it out!
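Because everything is built around numpy arrays, data you already have as an ndarray can be handed straight to an sklearn estimator. Below is a minimal sketch of that pattern (the array values are made up purely for illustration): a 2D array of shape (n_samples, n_features) goes in, and the estimator's fit/transform methods return numpy arrays back out.
In [ ]:
# Minimal sketch of sklearn's numpy-centric interface.
# The data values below are made up purely for illustration.
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 240.0]])   # shape (n_samples, n_features)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # returns another numpy array
print X_scaled.mean(axis=0), X_scaled.std(axis=0)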
sklearn is a good fit when you have data with many samples that you can formulate into a learning problem; in other words, when you'd like to find patterns within that data and/or predict properties of unknown data. The documentation breaks learning problems down into the following categories (a short code sketch of both flavors follows the list below):
Supervised learning
You have input data and corresponding properties that you'd like to learn from your data to predict the properties of unknown data. This can fall into a couple different sub-categories:
Classification: When the data fall into two or more discrete categories; classifying hand-written digits for example.
Regression: Similar to classification, but the output is a continuous variable rather than group membership. An example of a regression problem might be predicting height given age, weight, and gender.
Unsupervised learning
You have input data, but no corresponding labels or dependent variable. Instead of trying to use the data to predict some output for unknown data, you're more interested in finding structure in the data that is not immediately apparent. For example, you might be interested in detecting clusters in the input data, or dimensionality reduction to identify components that introduce the most variability in your data, or to aid in visualization.
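To make that taxonomy concrete, here is a small sketch of how the same fit-based API covers both flavors. The iris dataset and the particular estimators (KNeighborsClassifier, KMeans) are arbitrary choices for illustration, not part of the presentation below; a regressor would look identical with a continuous y.
In [ ]:
# Illustrative sketch of the supervised vs. unsupervised API.
# The dataset and estimators here are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target          # X: (n_samples, n_features), y: class labels

# Supervised: learn from (X, y), then predict labels for new samples
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print knn.predict(X[:5])

# Unsupervised: only X is provided; look for structure (here, 3 clusters)
km = KMeans(n_clusters=3)
km.fit(X)
print km.labels_[:5]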
sklearn may not be a good fit if:
sklearn implements quite a few regression, classification, and unsupervised learning methods, but it is not all-inclusive. The goal of sklearn is to make popular machine learning approaches accessible rather than to implement every algorithm out there. sklearn is not a silver bullet for all types of data and every application; the documentation readily states that the goal of sklearn is to do a few things well rather than everything. A couple of specific thoughts:
Scalability --- See above. Make sure to check the documentation for the specific algorithms you want to use; there are usually notes about how the algorithm will perform for data with different numbers of samples and features.
Hardware acceleration --- While sklearn wraps many C libraries such as libsvm and Liblinear for fast performance, it doesn't support GPUs. The sklearn developers prioritize portability over the complexity that supporting extra hardware introduces.
Neural networks --- If you're interested in another extremely popular buzzword, deep learning, you might be better served checking out projects that focus specifically on neural networks, like Pylearn2, PyBrain, or Caffe (developed at UC Berkeley). There is also scikit-neuralnetwork, which aims to expand neural network capabilities while staying true to the sklearn API.
Black box --- As a non-statistician, I find this to be a big caveat: sklearn is almost criminally easy to use, and the difficult part is interpreting your results. Fortunately, the documentation for each algorithm includes external links to the relevant literature on which the implementation is based.
Shannon will focus on regression in part 2, so for now let's take a look at some of the classifiers available in sklearn. Fortunately, there is a great comparison between classifiers in the documentation.
The code from the above link is copied below so that you can run it in this notebook. Credit to Gaël Varoquaux, Andreas Müller, and Jaques Grobler.
In [2]:
# Code source: Gaël Varoquaux
# Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
from matplotlib.colors import ListedColormap
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
h = .02 # step size in the mesh
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
"Random Forest", "AdaBoost", "Naive Bayes", "Linear Discriminant Analysis",
"Quadratic Discriminant Analysis"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
AdaBoostClassifier(),
GaussianNB(),
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis()]
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
datasets = [make_moons(noise=0.3, random_state=0),
make_circles(noise=0.2, factor=0.5, random_state=1),
linearly_separable
]
figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds in datasets:
# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# just plot the dataset first
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
i += 1
# iterate over classifiers
for name, clf in zip(names, classifiers):
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
# Plot also the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
# and testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
alpha=0.6)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
size=15, horizontalalignment='right')
i += 1
figure.subplots_adjust(left=.02, right=.98)
plt.show()
sklearn Roadmap
How do you decide which tool to use for your specific problem? Start with this chart. The image is included below, but at the previous link, clicking on specific bubbles in the flow chart will take you to the corresponding documentation.
sklearn includes some data in the datasets package to try some basic examples. Let's tackle an (apparently) classic problem in machine learning: hand-written digit classification.
This example is adapted from the Quick Start page in the sklearn documentation.
In [3]:
import sklearn.datasets
digits = sklearn.datasets.load_digits()
In [4]:
print type(digits)
In [5]:
sklearn.datasets.base.Bunch?
In [6]:
print digits.keys(), "\n"
# What are the types of the data stored in the Bunch?
for k, v in digits.iteritems(): print "%s: %s" %(k, type(v))
In [7]:
print digits.DESCR
In [8]:
print "images: ", digits.images.shape
print "data (n_samples, n_features): ", digits["data"].shape
In [9]:
# Get a feel for the data
fig, ax = plt.subplots(2, 4, figsize=(13,9))
for a in ax.ravel():
    data_ind = int(digits["images"].shape[0] * np.random.rand())
    img = a.imshow(digits["images"][data_ind, ...], interpolation="nearest", cmap="gray")
    a.set_title("Ground Truth: %s" % (digits["target"][data_ind]))
plt.show()
In [10]:
from sklearn.svm import SVC
SVC?
In [11]:
# Let's try a support-vector classifier with the default parameter values
clf = SVC()
For an explanation of why you should have separate train and test sets (to avoid overfitting), see this useful article from the sklearn docs.
In [12]:
from sklearn.cross_validation import train_test_split
test_size = 0.3 # Train on 70% of the data, test on the remaining 30%
X_train, X_test, y_train, y_test = train_test_split(digits["data"], digits["target"], test_size=test_size)
print X_train.shape
In [13]:
clf.fit(X_train, y_train)
Out[13]:
In [14]:
# Look at some specific cases
fig, ax = plt.subplots(2, 4, figsize=(13,9))
for a in ax.ravel():
    data_ind = int(X_test.shape[0] * np.random.rand())
    img = X_test[data_ind, ...].reshape((8, 8))
    prediction = clf.predict(np.array(X_test[data_ind], ndmin=2))
    actual = y_test[data_ind]
    a.imshow(img, interpolation="nearest", cmap="gray")
    a.set_title("Predicted: %s\nGround Truth: %s" % (prediction, actual))
plt.show()
In [15]:
# Evaluate the classifier on the entire test set
clf.score(X_test, y_test)
Out[15]:
In [16]:
# Let's get a little more detail
from sklearn.metrics import classification_report, confusion_matrix
predicted = clf.predict(X_test)
print classification_report(y_test, predicted)
print
print confusion_matrix(y_test, predicted)
The classifier we originally set up used the default values... we didn't even attempt to add any input of our own. Perhaps we should try adjusting the penalty term (C) and the kernel coefficient (gamma) to see if we can get better results. We can use the grid_search module to find better hyperparameters for our classifier by evaluating each candidate combination on our data with cross-validation. We still don't even really need to know what we're doing!
In [ ]:
from sklearn import grid_search
# Construct a dictionary of parameters you'd like to search over
penalties = [1., 10., 100., 1000.]
gammas = [0.1, 0.01, 0.001, 0.0001]
parameters_to_search = {"C":penalties, "gamma":gammas}
# Search the parameter space
svm = SVC()
clf = grid_search.GridSearchCV(svm, parameters_to_search, n_jobs=1)
clf.fit(X_train, y_train)
NOTE: If this runs too slowly (it took ~5.75 sec on my computer) and you have multiple cores, you can try running GridSearchCV with the n_jobs keyword set to a value greater than 1. In this case, sklearn will use the multiprocessing module to parallelize the search.
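For example, reusing the same svm and parameters_to_search objects from the cell above, the parallel version might look like this (clf_parallel is just a hypothetical name so we don't clobber clf; n_jobs=-1 asks for all available cores, and the speedup will depend on your machine):
In [ ]:
# Same grid search as above, but parallelized across cores.
# n_jobs=-1 uses all available cores; timing will vary by machine.
clf_parallel = grid_search.GridSearchCV(svm, parameters_to_search, n_jobs=-1)
clf_parallel.fit(X_train, y_train)
print clf_parallel.best_params_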
In [ ]:
# Results
for score in clf.grid_scores_: print score
print "\nBest parameters: ", clf.best_params_
In [ ]:
y_true, y_pred = y_test, clf.predict(X_test)
print classification_report(y_true, y_pred)
print confusion_matrix(y_true, y_pred)
sklearn is a super easy-to-use machine learning library for Python. If you'd like to quickly apply off-the-shelf machine learning algorithms to datasets without a big learning curve, it is a great tool. The above presentation only scratches the surface of what sklearn can do; if you're interested in learning more, the documentation is quite thorough and well organized. Links to several examples in the documentation on which this presentation is based are enumerated below: