This cookbook contains recipes for some common applications of machine learning. You'll need a working knowledge of pandas, matplotlib, numpy, and, of course, scikit-learn to benefit from it.
In [1]:
# <help:cookbook_setup>
%matplotlib inline
This recipe repeatedly trains a logistic regression classifier over different subsets (folds) of sample data. It attempts to match the percentage of each class in every fold to its percentage in the overall dataset (stratification). It evaluates each model against a test set and collects the confusion matrices for each test fold into a pandas.Panel.
This recipe defaults to using the Iris data set. To use your own data, set X to your instance feature vectors, y to the instance classes as a factor, and labels to the instance classes as human-readable names.
In [2]:
# <help:scikit_cross_validation>
import warnings
warnings.filterwarnings('ignore') #notebook outputs warnings, let's ignore them
import pandas
import sklearn
import sklearn.datasets
import sklearn.metrics as metrics
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import StratifiedKFold
# load the iris dataset
dataset = sklearn.datasets.load_iris()
# define feature vectors (X) and target (y)
X = dataset.data
y = dataset.target
labels = dataset.target_names
labels
Out[2]:
In [3]:
# <help:scikit_cross_validation>
# use log reg classifier
clf = LogisticRegression()
cms = {}
scores = []
cv = StratifiedKFold(y, n_folds=10)
for i, (train, test) in enumerate(cv):
    # train, then immediately predict the test set
    y_pred = clf.fit(X[train], y[train]).predict(X[test])
    # compute the confusion matrix for this fold, convert it to a DataFrame, and stash it for later
    cms[i] = pandas.DataFrame(metrics.confusion_matrix(y[test], y_pred), columns=labels, index=labels)
    # stash the overall accuracy on the test set for this fold too
    scores.append(metrics.accuracy_score(y[test], y_pred))
# Panel of all test set confusion matrices
pl = pandas.Panel(cms)
cm = pl.sum(axis=0) # sum the confusion matrices for a single view of classifier performance
cm
Out[3]:
In [4]:
# <help:scikit_cross_validation>
# accuracy predicting the test set for each fold
scores
Out[4]:
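If you want a single summary number instead of the per-fold list, the mean and standard deviation of the fold accuracies are a common sketch (this assumes numpy, which the later recipes import anyway):
In [ ]:
# <help:scikit_cross_validation>
# sketch: summarize the fold accuracies as mean +/- standard deviation
import numpy as np
print('accuracy: %0.3f (+/- %0.3f)' % (np.mean(scores), np.std(scores)))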
This recipe performs PCA and plots the data against the first two principal components in a scatter plot. It then prints the eigenvalues and eigenvectors of the covariance matrix and finally prints the percentage of total variance explained by each component.
This recipe defaults to using the Iris data set. To use your own data, set X to your instance feature vectors, y to the instance classes as a factor, and labels to human-readable names of the classes.
In [5]:
# <help:scikit_pca>
import warnings
warnings.filterwarnings('ignore') #notebook outputs warnings, let's ignore them
from __future__ import division
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.metrics as metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# load the iris dataset
dataset = sklearn.datasets.load_iris()
# define feature vectors (X) and target (y)
X = dataset.data
y = dataset.target
labels = dataset.target_names
In [6]:
# <help:scikit_pca>
# define the number of components to compute; recommend n_components < n_features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# plot the first two principal components
fig, ax = plt.subplots()
plt.scatter(X_pca[:,0], X_pca[:,1])
plt.grid()
plt.title('PCA of the dataset')
ax.set_xlabel('Component #1')
ax.set_ylabel('Component #2')
plt.show()
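The scatter plot above ignores the classes. To see how well the classes separate in the projected space, a minimal variant colors each point by class; this sketch assumes y holds integer class codes and labels their names, as set above.
In [ ]:
# <help:scikit_pca>
# sketch: the same projection, colored by class to show separation
fig, ax = plt.subplots()
for i, label in enumerate(labels):
    ax.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
               color=plt.cm.Set1(float(i) / len(labels)), label=label)
ax.grid()
ax.legend(loc='best')
ax.set_title('PCA of the dataset by class')
ax.set_xlabel('Component #1')
ax.set_ylabel('Component #2')
plt.show()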
In [7]:
# <help:scikit_pca>
# eigendecomposition on the covariance matrix
cov_mat = np.cov(X_pca.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
In [8]:
# <help:scikit_pca>
# prints the percentage of overall variance explained by each component
print(pca.explained_variance_ratio_)
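When deciding how many components to keep, the cumulative sum of these ratios is often more useful than the individual values; a quick sketch:
In [ ]:
# <help:scikit_pca>
# sketch: cumulative share of variance explained by the first k components
print(np.cumsum(pca.explained_variance_ratio_))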
This recipe performs K-means clustering for k=1..n-1 and prints and plots the within-cluster sum of squared errors (i.e., inertia) for each k as an indicator of what value of k might be appropriate for the given dataset.
This recipe defaults to using the Iris data set. To use your own data, set X to your instance feature vectors, y to the instance classes as a factor, and labels to human-readable names of the classes. To change the range of cluster counts tried, modify n.
In [9]:
# <help:scikit_k_means_cluster>
import warnings
warnings.filterwarnings('ignore') #notebook outputs warnings, let's ignore them
from time import time
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
from sklearn.cluster import KMeans
# load datasets and assign data and features
dataset = sklearn.datasets.load_iris()
# define feature vectors (X) and target (y)
X = dataset.data
y = dataset.target
# set the exclusive upper bound on the number of clusters to try, must be >= 2
n = 6
inertia = [np.NaN] # pad with NaN so the list index matches k in the plot
# perform k-means clustering for k=1..n-1
for k in range(1, n):
    k_means_ = KMeans(n_clusters=k)
    k_means_.fit(X)
    print('k = %d, inertia = %f' % (k, k_means_.inertia_))
    inertia.append(k_means_.inertia_)
# plot the SSE of the clusters for each value of k
ax = plt.subplot(111)
ax.plot(inertia, '-o')
plt.xticks(range(n))
plt.title("Inertia")
ax.set_ylabel('Inertia')
ax.set_xlabel('# Clusters')
plt.show()
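Once the elbow in the plot suggests a value of k, refit once at that value to get the actual cluster assignments. The k=3 below is only illustrative for the Iris data; substitute whatever the plot suggests for your dataset.
In [ ]:
# <help:scikit_k_means_cluster>
# sketch: refit at the chosen k (3 is illustrative) and inspect the assignments
k_means_ = KMeans(n_clusters=3)
cluster_assignments = k_means_.fit_predict(X)
print(cluster_assignments)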
This recipe performs a grid search for the best settings for a support vector machine, predicting the class of each flower in the dataset. It splits the dataset into training and test instances once.
This recipe defaults to using the Iris data set. To use your own data, set X to your instance feature vectors, y to the instance classes as a factor, and labels to human-readable names of the classes. Modify parameters to change the grid search space or the scoring='accuracy' value to optimize a different metric for the classifier (e.g., precision, recall).
In [10]:
# <help:scikit_grid_search>
import numpy as np
import sklearn.datasets
import sklearn.metrics as metrics
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
# load datasets and features
dataset = sklearn.datasets.load_iris()
# define feature vectors (X) and target (y)
X = dataset.data
y = dataset.target
labels = dataset.target_names
# separate datasets into training and test datasets once, no folding
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [11]:
# <help:scikit_grid_search>
# define the parameter dictionary with the kernels of SVCs
parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1e-2], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
    {'kernel': ['poly'], 'degree': [1, 3, 5], 'C': [1, 10, 100, 1000]}
]
# find the best parameters to optimize accuracy
svc_clf = SVC(C=1, probability=True)
clf = GridSearchCV(svc_clf, parameters, cv=5, scoring='accuracy') # 5 folds
clf.fit(X_train, y_train) # train the model
print("Best parameters found from SVM's:")
print(clf.best_params_)
print("Best score found from SVM's:")
print(clf.best_score_)
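The grid search above never touches the held-out test split, so a natural follow-up is to score the winning model on it. GridSearchCV refits the best estimator on the full training set by default, so clf can predict directly; a minimal sketch:
In [ ]:
# <help:scikit_grid_search>
# sketch: evaluate the best estimator on the held-out test set
y_pred = clf.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=labels))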
This recipe plots the receiver operating characteristic (ROC) curve for an SVM classifier trained over the given dataset.
This recipe defaults to using the Iris data set, which has three classes. The recipe uses a one-vs-the-rest strategy to create the binary classifications appropriate for ROC plotting. To use your own data, set X to your instance feature vectors, y to the instance classes as a factor, and labels to human-readable names of the classes.
Note that the recipe adds noise to the iris features to make the ROC plots more realistic. Otherwise, the classification is nearly perfect and the plot is hard to study. Remove the noise generator if you use your own data!
In [12]:
# <help:scikit_roc>
import warnings
warnings.filterwarnings('ignore') #notebook outputs warnings, let's ignore them
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.metrics as metrics
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import label_binarize
# load the iris dataset
dataset = sklearn.datasets.load_iris()
X = dataset.data
# binarize the output for binary classification
y = label_binarize(dataset.target, classes=[0, 1, 2])
labels = dataset.target_names
In [13]:
# <help:scikit_roc>
# add noise to the features so the plot is less ideal
# REMOVE ME if you use your own dataset!
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
In [14]:
# <help:scikit_roc>
# split data for cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# classify instances into more than two classes, one vs rest
# probability=True enables per-class probability estimates
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True))
# fit estimators and return the distance of each sample from the decision boundary
y_score = clf.fit(X_train, y_train).decision_function(X_test)
In [15]:
# <help:scikit_roc>
# plot the ROC curve; the closer it hugs the top-left corner, the better
plt.figure(figsize=(10,5))
plt.plot([0, 1], [0, 1], 'k--') # add a straight line representing a random model
for i, label in enumerate(labels):
    # false positive and true positive rate for each class
    fpr, tpr, _ = metrics.roc_curve(y_test[:, i], y_score[:, i])
    # area under the curve (auc) for each class
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label='ROC curve of {0} (area = {1:0.2f})'.format(label, roc_auc))
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('Receiver Operating Characteristic for Iris data set')
plt.xlabel('False Positive Rate') # 1- specificity
plt.ylabel('True Positive Rate') # sensitivity
plt.legend(loc="lower right")
plt.show()
This recipe builds a transformation and training pipeline for a model that can classify a snippet of text as belonging to one of 20 USENET newsgroups. It then prints the precision, recall, and F1-score for predictions over a held-out test set, as well as the confusion matrix.
This recipe defaults to using the 20 USENET newsgroups dataset. To use your own data, set X to your instance feature vectors, y to the instance classes as a factor, and labels to human-readable names of the classes. Then modify the pipeline components to perform appropriate transformations for your data.
In [16]:
# <help:scikit_pipeline>
import pandas
import sklearn.metrics as metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_20newsgroups
# download the newsgroup dataset
dataset = fetch_20newsgroups(subset='all')
# define feature vectors (X) and target (y)
X = dataset.data
y = dataset.target
labels = dataset.target_names
labels
Out[16]:
In [17]:
# <help:scikit_pipeline>
# split data holding out 30% for testing the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# pipelines concatenate functions serially, output of 1 becomes input of 2
clf = Pipeline([
    ('vect', HashingVectorizer(analyzer='word', ngram_range=(1,3))), # count word n-gram frequencies via the hashing trick
    ('tfidf', TfidfTransformer()), # transform counts to tf-idf values
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5))
])
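Several of the imports above (CountVectorizer, MultinomialNB, Perceptron) are alternatives you can swap into the pipeline. As one example, here is a sketch of a naive Bayes variant over explicit counts rather than hashed features; substitute it for clf if you prefer a probabilistic model:
In [ ]:
# <help:scikit_pipeline>
# sketch: an alternative pipeline using explicit counts and multinomial naive Bayes
nb_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='word', ngram_range=(1,3))), # count word n-grams explicitly
    ('tfidf', TfidfTransformer()), # transform counts to tf-idf values
    ('clf', MultinomialNB()) # naive Bayes suits non-negative tf-idf features
])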
In [18]:
# <help:scikit_pipeline>
# train the model and predict the test set
y_pred = clf.fit(X_train, y_train).predict(X_test)
# standard information retrieval metrics
print(metrics.classification_report(y_test, y_pred, target_names=labels))
In [19]:
# <help:scikit_pipeline>
# show the confusion matrix in a labeled dataframe for ease of viewing
index_labels = ['{} {}'.format(i, l) for i, l in enumerate(labels)]
pandas.DataFrame(metrics.confusion_matrix(y_test,y_pred), index=index_labels)
Out[19]: