Here we will load and split some data, classify it with an SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and assess the quality of the classification with a confusion matrix.
In [ ]:
import pandas as pd
# import model algorithm and data
from sklearn import svm, datasets
# import splitter
from sklearn.model_selection import train_test_split
# import metrics
from sklearn.metrics import confusion_matrix
# feature data (X) and labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target
# split data into training and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.70, random_state=42)
In [ ]:
# perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel='linear', C=0.01)
y_pred = clf.fit(X_train, y_train).predict(X_test)
pd.DataFrame({'Prediction': iris.target_names[y_pred],
              'Actual': iris.target_names[y_test]})
In [ ]:
# accuracy score
clf.score(X_test, y_test)
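A single train/test split can be sensitive to how the data happened to be divided; as a rough sketch, `cross_val_score` can estimate accuracy over several folds (the fold count here is an arbitrary choice):
In [ ]:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy over the full dataset
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())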
In [ ]:
# Define a plotting function for confusion matrices
# (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, target_names, title='The Confusion Matrix', cmap=plt.cm.YlOrRd):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    # Add class labels to x and y axes
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
The numbers in the confusion matrix count, for each true label (rows), how many test samples received each predicted label (columns); correct classifications fall on the diagonal:
In [ ]:
%matplotlib inline
cm = confusion_matrix(y_test, y_pred)
# see the actual counts
print(cm)
# visually inspect how well the classifier matched predictions to true labels
plot_confusion_matrix(cm, iris.target_names)
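Class sizes in the test set can differ, so it can also help to normalize each row of the matrix; the diagonal then shows the fraction of each true class that was predicted correctly (a variant from the scikit-learn example linked above):
In [ ]:
# normalize each row by the number of true samples in that class
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plot_confusion_matrix(cm_normalized, iris.target_names, title='Normalized Confusion Matrix')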
In [ ]:
from sklearn.metrics import classification_report
# Using the test and prediction sets from above
print(classification_report(y_test, y_pred, target_names = iris.target_names))
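As a quick cross-check of the report, the per-class precision and recall can also be computed directly (a minimal sketch):
In [ ]:
from sklearn.metrics import precision_score, recall_score
# per-class precision: of the samples predicted as each class, the fraction that were correct
print(precision_score(y_test, y_pred, average=None))
# per-class recall: of the samples truly in each class, the fraction that were found
print(recall_score(y_test, y_pred, average=None))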
In [ ]:
# Another example with some toy data
y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']
y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']
# How did our predictor do?
print(classification_report(y_test, ___, target_names = ___)) # <-- fill in the blanks
QUICK QUESTION: Is it better to have too many false positives or too many false negatives?
PARTING THOUGHT: When a parameter is increased or decreased, does it cause overfitting or underfitting? What are the implications of those cases? GridSearchCV can help explore this through parameter tuning.
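As a minimal sketch of `GridSearchCV` (the grid values below are arbitrary illustrative choices):
In [ ]:
from sklearn.model_selection import GridSearchCV
# candidate hyperparameters to search (values are arbitrary for illustration)
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.01, 0.1, 1, 10]}
# exhaustive search over the grid with 5-fold cross-validation on the training set
grid = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
# best parameter combination and its cross-validated accuracy
print(grid.best_params_, grid.best_score_)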
Created by a Microsoft Employee.
The MIT License (MIT)
Copyright (c) 2016 Micheleen Harris