Machine Learning with scikit-learn

Learning Objectives


  • Gain some high level knowledge around machine learning with a gentle/brief introduction
  • Learn the importance of pre-processing data and how scikit-learn expects data
  • See data transformations for machine learning in action
  • Get an idea of your options for learning on training sets and applying model for prediction
  • See what sort of metrics are commonly used in scikit-learn
  • Learn options for model evaluation
  • Become familiar with ways to make this process robust and simplified (pipelining and tuning parameters)

For workshop (reorg)

Sections:

  1. ML 101 w/ code examples (incl. taste of sklearn w/ logistic regression + accuracy scores and user entered sepal and petal measurements); intro to sklearn's approach
  2. Our data: iris
  3. Supervised
  4. Unsupervised
  • Evaluating a model
  • What next (GridSearch & Pipeline)

Flow:

  • a learner and visual (logistic regression, pairplot, accuracy, give some sepal and petal measurements)
    • what category of ML is logistic regression? (regression or classification?)
    • QUESTION: what should we have done first? Any ideas? (EDA, visualize, pre-process if need be)
  • flower pics
  • peek at data
  • preprocessing from sklearn
  • Supervised - decision tree in detail, random forest
  • Unsupervised - novelty detection aka anomaly detection (note that PCA & dimensionality reduction is unsupervised)
  • Evaluate - metrics
  • what do you do with this model? what next?
  • note: parameter tuning can be automated with GridSearch
  • note: can test many algorithms at once with Pipeline

some questions inline

Does this parameter when increased cause overfitting or underfitting? what are the implications of those cases?

Is it better to have too many false positives or too many false negatives?

What is the difference between outlier detection and anomaly detection?

Machine Learning 101

It's said in different ways, but I like the way Jake VanderPlas defines ML:

Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

He goes on to say:

Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow generalizing rather that just storing and retrieving data items like a database system would do.

(more here)

ML is much more than writing a program. ML experts write clever and robust algorithms which can generalize to answer different, but specific, questions. There are still types of questions that a certain algorithm cannot or should not be used to answer. I say answer instead of solve because, even with an answer, one should evaluate whether it is a good answer or a bad answer. Also, just as in statistics, one needs to be careful about the assumptions and limitations of an algorithm and of the model that is built from it. Here's my hand-drawn diagram of the machine learning process.



Examples

Below, we are going to show a simple case of classification. In the figure we show a collection of 2D data, colored by their class labels (imagine one class is labeled "red" and the other "blue").

The fig_code module is credited to Jake VanderPlas and was cloned from his github repo here - also on our repo is his license file since he asked us to include that if we use his source code. :)


In [ ]:
# Plot settings for notebook

# so that plots show up in notebook
%matplotlib inline

# seaborn here is used for aesthetics.
# here, setting seaborn plot defaults (this can be safely commented out)
import seaborn; seaborn.set()

In [ ]:
# Import an example plot from the figures directory
from fig_code import plot_sgd_separator
plot_sgd_separator()

Above is the line (in higher dimensions, a hyperplane) that separates the two classes, "red" and "blue", found by a linear classifier trained with Stochastic Gradient Descent (don't worry about the details yet). In the fig_code plotting code, the dashed lines appear to be the contours where the classifier's decision function equals -1 and +1 (the margin on either side of the boundary), not confidence intervals.

This demonstrates a very important aspect of ML: the model generalizes, i.e., if we add a new data point, it can predict whether that point should be in the "red" or "blue" category.

ML TIP: ML can only answer 5 questions:

  • How much/how many?
  • Which category?
  • Which group?
  • Is it weird?
  • Which action?

As far as algorithms for learning a model (i.e. running some training data through an algorithm), it's nice to think of them in two different ways (with the help of the machine learning Wikipedia article). The first way of thinking about ML is by the type of information or input given to a system. By that criterion, there are three classical categories:

  1. Supervised learning - we get the data and the labels
  2. Unsupervised learning - only get the data (no labels)
  3. Reinforcement learning - reward/penalty based information (feedback)

Another way of categorizing ML approaches, is to think of the desired output:

  1. Classification
  2. Regression
  3. Clustering
  4. Density estimation
  5. Dimensionality reduction

--> This second approach (by desired output) is how sklearn categorizes its ML algorithms.

The problem solved in supervised learning (e.g. classification, regression)

Supervised learning consists in learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.

All supervised estimators in sklearn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
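The fit/predict pattern described above can be sketched as follows. This is a minimal illustration, not part of the original workshop material: the choice of KNeighborsClassifier and the names model and y_hat are arbitrary assumptions, and the "new" observations are simply the first five iris rows reused.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target      # X: (150, 4) features, y: (150,) labels

# any supervised estimator would do; k-nearest neighbors is an arbitrary pick
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                    # supervised: fit() receives data AND labels

# predict() receives only features and returns one label per row
y_hat = model.predict(X[:5])
print(y_hat.shape)
```

Swapping in a different supervised estimator should leave the fit(X, y) and predict(X) calls unchanged, which is the point of the shared interface.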

The two common kinds of supervised task, where you train a model and then use it to predict the labels of unknown observations, are classification and regression. There are many classification and regression algorithms (for examples, check out the sklearn algorithm cheatsheet below).

The problem solved in unsupervised learning

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data.

Unsupervised models have a fit(), transform() and/or fit_transform() in sklearn.
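As a rough sketch of that interface, here is PCA used as the unsupervised model (an arbitrary choice for illustration; the variable names are assumptions). Note that fit() is given the data only, with no labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data               # labels are deliberately ignored

pca = PCA(n_components=2)
pca.fit(X)                         # unsupervised: fit() receives only X
X_2d = pca.transform(X)            # project the data onto 2 components

# fit_transform() performs both steps in one call
X_2d_again = PCA(n_components=2).fit_transform(X)
print(X_2d.shape, X_2d_again.shape)
```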

There are some instances where ML is just not needed or appropriate for solving a problem.

Some examples are pattern matching (e.g. regex), group-by and data mining in general (discovery vs. prediction).

EXERCISE: Should I use ML or can I get away with something else?

Looking back at previous years, by what percent did housing prices increase over each decade?
Looking back at previous years, and given the relationship between housing prices and mean income in my area, given my income how much will a house be in two years in my area?
A vacuum like roomba has to make a decision to vacuum the living room again or return to its base.
Is this image a cat or dog?
Are orange tabby cats more common than other breeds in Austin, Texas?
Using my SQL database on housing prices, group my housing prices by whether or not the house is under 10 miles from a school.
What is the weather going to be like tomorrow?
What is the purpose of life?

A very brief introduction to scikit-learn (aka sklearn)

This module is not meant to be a comprehensive introduction to ML, but rather an introduction to the current de facto tool for ML in python. As a gentle intro, it is helpful to think of the sklearn approach having layers of abstraction. This famous quote certainly applies:

Easy reading is damn hard writing, and vice versa.
--Nathaniel Hawthorne

In sklearn, you'll find you have a common programming choice: to do things very explicitly, e.g. pre-process data one step at a time, perhaps do a transformation like PCA, split data into training and test sets, define a classifier or learner with desired parameters, train the classifier, use the classifier to predict on a test set and then analyze how well it did.

A different approach, and something sklearn offers, is to combine some or all of the steps above into a pipeline, so to speak. For instance, one could define a pipeline which does all of these steps at one time and perhaps even pits multiple learners against one another or does some parameter tuning with a grid search (examples will be shown towards the end). This is what is meant here by layers of abstraction.

So, in this particular module, for the most part, we will try to be explicit regarding our process and give some useful tips on options for a more automated or pipelined approach. Just note, once you've mastered the explicit approaches you might want to explore sklearn's GridSearchCV and Pipeline classes.

Here is sklearn's algorithm diagram - (note, this is not an exhaustive list of model options offered in sklearn, but serves as a good algorithm guide).

Your first model - a multiclass logistic regression on the iris dataset

  • sklearn comes with this dataset ready-to-go for sklearn's algorithms

In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()

# Leave one value out from training set - that will be test later on
X_train, y_train = iris.data[:-1,:], iris.target[:-1]

In [ ]:
from sklearn.linear_model import LogisticRegression

# our model - a multiclass logistic regression (a classifier, despite the name)
logistic = LogisticRegression()

# train on iris training set
logistic.fit(X_train, y_train)

X_test = iris.data[-1,:].reshape(1, -1)

y_predict = logistic.predict(X_test)

print('Predicted class %s, real class %s' % (
 y_predict, iris.target[-1]))

print('Probabilities of membership in each class: %s' % 
      logistic.predict_proba(X_test))

QUESTION:

  • What would have been good to do before plunging right in to a logistic regression model?

Some terms you will encounter as a Machine Learnest

Term           Definition
Training set   set of data used to learn a model
Test set       set of data used to test a model
Feature        a variable (continuous, discrete, categorical, etc.), aka column
Target         label (associated with the dependent variable; what we predict)
Learner        model or algorithm
Fit, Train     learn a model with an ML algorithm using a training set
Predict        w/ supervised learning, give a label to an unknown datum (data); w/ unsupervised learning, decide if new data is weird, which group it is in, or what to do next with it
Accuracy       fraction of correct predictions ((TP + TN) / total)
Precision      fraction of positive predictions that are correct (TP / (FP + TP))
Recall         fraction of positive cases caught (TP / (FN + TP))

(TP, TN, FP, FN = counts of true positives, true negatives, false positives and false negatives)
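To make the three formulas concrete, here is a small worked example. The counts are made up for illustration (they do not come from any model in this notebook); the label lists are built to contain exactly those counts so that sklearn's metric functions can be compared against the hand calculation.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# hypothetical counts for a binary classifier
TP, FP, FN, TN = 8, 2, 4, 6

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (FP + TP)
recall = TP / (FN + TP)

# label lists containing exactly those counts (1 = positive, 0 = negative)
y_true = [1] * TP + [0] * FP + [1] * FN + [0] * TN
y_hat = [1] * TP + [1] * FP + [0] * FN + [0] * TN

print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```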

PRO TIP: Are you a statistician? Want to talk like a machine learning expert? Here you go (from the friendly people at SAS (here)):

A Statistician Would Say   A Machine Learnest Would Say
dependent variable         target
variable                   feature
transformation             feature creation

BREAK

ML TIP: Ask sharp questions.
e.g. What type of flower is this (pictured below) closest to of the three given classes?

(This links out to source)

Labels (species names/classes):

(This links out to source)

NOTE: sklearn needs data/features (aka columns) in numpy ndarrays and the optional labels also as numpy ndarrays.

TIP: Commonly, machine learning algorithms will require your data to be standardized and preprocessed. In sklearn the data must also take on a certain structure.


In [ ]:
print(type(iris.data))
print(type(iris.target))

Let's Dive In!


In [ ]:
import seaborn as sb
import pandas as pd
import numpy as np

#sb.set_context("notebook", font_scale=2.5)
%matplotlib inline

Features in the Iris dataset:

0 sepal length in cm
1 sepal width in cm
2 petal length in cm
3 petal width in cm

Target classes to predict:

0 Iris Setosa
1 Iris Versicolour
2 Iris Virginica

Get to know the data - visualize and explore

  • Features (columns/measurements) come from this diagram (links out to source on kaggle):
  • Shape
  • Peek at data
  • Summaries

Shape and representation


In [ ]:
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# How many data points (rows) x how many features (columns)
print(iris.data.shape)
print(iris.target.shape)

# What Python object represents the data and the targets
print(type(iris.data))
print(type(iris.target))

Sneak a peek at data (a reminder of your pandas dataframe methods)


In [ ]:
# convert to pandas df (adding real column names)
iris.df = pd.DataFrame(iris.data, 
                       columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])


# first few rows
iris.df.head()

Describe the dataset with some summary statistics


In [ ]:
# summary stats
iris.df.describe()

  • We don't have to do much with the iris dataset: it has no missing values, and it is already in numpy arrays with the shape sklearn expects. However, we could try standardization and/or normalization. (Later, in the transforms section, we will show one-hot encoding, another preprocessing step.)

Preprocessing (Bonus Material)

What you might have to do before using a learner in `sklearn`:

  • Transform non-numerics to numeric (tip: use the applymap() method from pandas; newer pandas versions rename it DataFrame.map())
  • Fill in missing values
  • Standardization
  • Normalization
  • Encoding categorical features (e.g. one-hot encoding or dummy variables)

Features should end up in a numpy.ndarray (hence numeric) and labels in a 1D array-like (a numpy.ndarray or a list).
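As one illustration of the "fill in missing values" step, the sketch below uses sklearn's SimpleImputer on a tiny made-up table. The column values and the mean strategy are assumptions chosen for the example; other strategies may suit real data better.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up measurements with one missing value (np.nan)
df = pd.DataFrame({'petal_length': [1.4, np.nan, 4.7, 5.1],
                   'petal_width': [0.2, 0.3, 1.4, 1.8]})

# replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(df)   # returns a numpy.ndarray

print(type(X_filled))
print(X_filled)
```

The result is already a numeric numpy.ndarray, which is the form sklearn learners expect.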

Data options:

If you use your own data or "real-world" data you will likely have to do some data wrangling and need to leverage pandas for some data manipulation.

Standardization - rescale each feature to zero mean and unit variance, as a standard Gaussian has (commonly needed for sklearn learners). Note that this only shifts and scales the data; it does not change the shape of its distribution.

FYI: you'll commonly see the data or feature set (ML word for data without its labels) represented as a capital X and the targets or labels (if we have them) represented as a lowercase y. This is because the data is a 2D array or list of lists and the targets are a 1D array or simple list.


In [ ]:
# Standardization aka scaling
from sklearn import preprocessing, datasets

# make sure we have iris loaded
iris = datasets.load_iris()

X, y = iris.data, iris.target

# scale each feature (column) to zero mean and unit variance
X_scaled = preprocessing.scale(X)

# how does it look now
pd.DataFrame(X_scaled).head()

In [ ]:
# let's just confirm our standardization worked (mean is 0 w/ unit variance)
pd.DataFrame(X_scaled).describe()

# also could:
#print(X_scaled.mean(axis = 0))
#print(X_scaled.std(axis = 0))

PRO TIP: To save our standardization and reapply later (say to the test set or some new data), create a transformer object like so:

scaler = preprocessing.StandardScaler().fit(X_train)
# apply to a new dataset (e.g. test set):
scaler.transform(X_test)

Normalization - scaling samples individually to have unit norm

  • This type of scaling is really important if doing some downstream transformations and learning (see sklearn docs here for more) where similarity of pairs of samples is examined
  • A basic intro to normalization and the unit vector can be found here

In [ ]:
# Normalization
from sklearn import preprocessing, datasets

# make sure we have iris loaded
iris = datasets.load_iris()

X, y = iris.data, iris.target

# scale each sample (row) to unit L1 norm
X_norm = preprocessing.normalize(X, norm='l1')

# how does it look now
pd.DataFrame(X_norm).tail()

In [ ]:
# summary stats of the normalized data (each row was rescaled, not each column)
pd.DataFrame(X_norm).describe()

# cumulative sum of normalized and original data:
#print(pd.DataFrame(X_norm.cumsum().reshape(X.shape)).tail())
#print(pd.DataFrame(X).cumsum().tail())

# unit norm (convert to unit vectors) - all row sums should be 1 now
X_norm.sum(axis = 1)

PRO TIP: To save our normalization (like standardization above) and reapply later (say to the test set or some new data), create a transformer object like so:

normalizer = preprocessing.Normalizer().fit(X_train)
# apply to a new dataset (e.g. test set):
normalizer.transform(X_test)
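To see what "unit norm" means, the following sketch normalizes a tiny made-up array by hand and compares the result with preprocessing.normalize. It uses the L2 (Euclidean) norm, which is the default for Normalizer; the array and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import preprocessing

X_small = np.array([[3.0, 4.0],
                    [1.0, 1.0]])

# L2 normalization by hand: divide each row by its Euclidean length
row_lengths = np.sqrt((X_small ** 2).sum(axis=1, keepdims=True))
X_l2_manual = X_small / row_lengths

# the same thing with sklearn
X_l2 = preprocessing.normalize(X_small, norm='l2')

print(X_l2)                                # first row becomes [0.6, 0.8]
print(np.allclose(X_l2, X_l2_manual))
```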

BREAK

Make the learning easier or better beforehand - feature creation/selection

  • PCA
  • SelectKBest
  • One-Hot Encoder

Principal component analysis (aka PCA) reduces the dimensions of a dataset down to get the most out of the information without a really big feature space

  • Useful for very large feature space (e.g. say the botanist in charge of the iris dataset measured 100 more parts of the flower and thus there were 104 columns instead of 4)
  • More about PCA on wikipedia here

In [ ]:
# PCA for dimensionality reduction

from sklearn import decomposition
from sklearn import datasets

iris = datasets.load_iris()

X, y = iris.data, iris.target

# perform principal component analysis
pca = decomposition.PCA(n_components = 3)
pca.fit(X)
X_t = pca.transform(X)
(X_t[:, 0])

# import numpy and matplotlib for plotting (and set some stuff)
import numpy as np
np.set_printoptions(suppress=True)
import matplotlib.pyplot as plt
%matplotlib inline

# let's separate out data based on first two principal components
x1, x2 = X_t[:, 0], X_t[:, 1]


# please don't worry about details of the plotting below 
#  (will introduce in different module)
#  (note: you can get the iris names below from iris.target_names, also in docs)

s1 = ['r' if v == 0 else 'b' if v == 1 else 'g' for v in y]
s2 = ['Setosa' if v == 0 else 'Versicolor' if v == 1 else 'Virginica' for v in y]
classes = s2
colors = s1
for (i, cla) in enumerate(set(classes)):
    xc = [p for (j, p) in enumerate(x1) if classes[j] == cla]
    yc = [p for (j, p) in enumerate(x2) if classes[j] == cla]
    cols = [c for (j, c) in enumerate(colors) if classes[j] == cla]
    plt.scatter(xc, yc, c = cols, label = cla)
plt.legend(loc = 4)
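A natural follow-up question is how much information each principal component keeps. One way to check, sketched here under the assumption that the same 3-component PCA on iris is used, is the explained_variance_ratio_ attribute of the fitted PCA object.

```python
from sklearn import datasets, decomposition

X_iris = datasets.load_iris().data
pca_check = decomposition.PCA(n_components=3).fit(X_iris)

# fraction of the total variance captured by each component, largest first
ratios = pca_check.explained_variance_ratio_
print(ratios)
print(ratios.sum())
```

If the first one or two ratios account for most of the variance, plotting only those components (as in the scatter plot) loses little information.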

Selecting k top scoring features (also dimensionality reduction)


In [ ]:
# SelectKBest for selecting top-scoring features

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)

# Do feature selection
#  input is scoring function (here chi2) to get univariate p-values
#  and number of top-scoring features (k) - here we get the top 2
X_t = SelectKBest(chi2, k = 2).fit_transform(X, y)

print(X_t.shape)

Note on scoring function selection in `SelectKBest` transformations:

  • For regression - f_regression
  • For classification - chi2, f_classif
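To see which features a scoring function prefers, one can inspect the fitted selector. This sketch uses f_classif on iris (an arbitrary pick from the classification options listed above) and prints each feature's score together with whether it was kept; the variable names are assumptions.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

selector = SelectKBest(f_classif, k=2).fit(X_iris, y_iris)

# scores_ holds one score per feature; get_support() marks the k kept features
kept_mask = selector.get_support()
for name, score, kept in zip(iris.feature_names, selector.scores_, kept_mask):
    print('%-20s score=%8.1f kept=%s' % (name, score, kept))
```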

One Hot Encoding

  • It's an operation on categorical feature values - a method of dummying variables
  • Expands the feature space by nature of the transform - later this can be processed further with a dimensionality reduction (the dummied variables are now their own features)
  • FYI: one-hot encoded categorical variables or labels are commonly expected by neural network libraries such as the Python ML module TensorFlow
  • The code cell below should help make this clear

In [ ]:
# OneHotEncoder for dummying variables

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import pandas as pd

data = pd.DataFrame({'index': range(1, 7),
                    'state': ['WA', 'NY', 'CO', 'NY', 'CA', 'WA']})
print(data)

# We encode both our categorical variable and its labels
enc = OneHotEncoder()
label_enc = LabelEncoder() # remember the labels here

# Encode labels (can use for discrete numerical values as well)
data_label_encoded = label_enc.fit_transform(data['state'])
data['state'] = data_label_encoded

# Encode and "dummy" variables
data_feature_one_hot_encoded = enc.fit_transform(data[['state']])

# Put into dataframe to look nicer and decode state dummy variables to original state values
# TRY:  compare the original input data (look at row numbers) to one hot encoding results
#   --> do they match??
pd.DataFrame(data_feature_one_hot_encoded.toarray(), columns = label_enc.inverse_transform(range(4)))

In [ ]:
# Label-encoded states (integer codes, not yet dummy variables)
print(data_label_encoded)

# Decoded
print(label_enc.inverse_transform(data_label_encoded))
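For quick exploration, pandas offers a shortcut to the same kind of dummy variables. The sketch below applies pd.get_dummies to the same made-up state column; it is an alternative for illustration, not a replacement for OneHotEncoder, which remembers its categories and can be reapplied to new data.

```python
import pandas as pd

states = pd.DataFrame({'state': ['WA', 'NY', 'CO', 'NY', 'CA', 'WA']})

# one indicator column per distinct state, in sorted order
dummies = pd.get_dummies(states['state'])
print(dummies)
```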

EXERCISE: Use one hot encoding to "recode" the iris data's extra surprise column (we are going to add a categorical variable here to play with...)


In [ ]:
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

a = pd.DataFrame(X, 
                columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])

col5 = pd.DataFrame(np.random.randint(1, 4, size = len(y)))

X_plus = pd.concat([a, col5], axis = 1)
X_plus.head(20)

# ...now one-hot-encode...

BREAK

Learning Algorithms - Supervised Learning

Reminder: All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y. (direct quote from sklearn docs)

  • Given that Iris is a fairly small, labeled dataset with relatively few features...what algorithm would you start with and why?

"Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."

"Different estimators are better suited for different types of data and different problems."

-Choosing the Right Estimator from sklearn docs

An estimator for recognizing a new iris from its measurements

Or, in machine learning parlance, we fit an estimator on known samples of the iris measurements to predict the class to which an unseen iris belongs.

Let's give it a try! (We are actually going to hold out a small percentage of the iris dataset and check our predictions against the labels)


In [ ]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # cross_validation was removed in scikit-learn 0.20
from sklearn import tree

# Let's load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# split data into training and test sets using the handy train_test_split func
# in this split, we are "holding out" only one value and label (placed into X_test and y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1)

In [ ]:
# Let's try a decision tree classification method
# (fit on the training set only, so the "held out" sample stays unseen)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [ ]:
# Let's predict on our "held out" sample
y_pred = clf.predict(X_test)

In [ ]:
# What was the label associated with this test sample? ("held out" sample's original label)
#  fill in the blank below

# how did our prediction do?
print("Prediction: %d, Original label: %d" % (y_pred[0], ___)) # <-- fill in blank

EXERCISE: enter in your own iris data point and see what the prediction is (what limitation do you think you might encounter here?) - if out of range


In [ ]:

What does the graph look like for this decision tree?


In [ ]:
from IPython.display import Image

from io import StringIO  # sklearn.externals.six was removed from scikit-learn
import pydot

# write the tree in Graphviz "dot" format to an in-memory buffer
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                         feature_names=iris.feature_names,
                         class_names=iris.target_names,
                         filled=True, rounded=True,
                         special_characters=True)

# recent pydot versions return a list of graphs, older ones a single graph
graphs = pydot.graph_from_dot_data(dot_data.getvalue())
graph = graphs[0] if isinstance(graphs, (list, tuple)) else graphs
Image(graph.create_png())

In [ ]:
from sklearn.tree import export_graphviz
import graphviz

export_graphviz(clf, out_file="mytree.dot",  
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
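If Graphviz is not installed, a plain-text view of the tree is another option. The sketch below uses sklearn.tree.export_text (available in more recent scikit-learn releases) on a deliberately shallow tree; max_depth=2, random_state=0 and the name clf_text are assumptions made to keep the printout short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# a shallow tree so the printed rules stay readable
clf_text = DecisionTreeClassifier(max_depth=2, random_state=0)
clf_text.fit(iris.data, iris.target)

# one indented line per split or leaf
rules = export_text(clf_text, feature_names=list(iris.feature_names))
print(rules)
```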

From Decision Tree to Random Forest


In [45]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

iris = load_iris()
X, y = iris.data, iris.target

# a dataframe view of the data with species names (pd.Factor no longer exists)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# a random forest is an ensemble of decision trees; n_jobs=2 fits trees in parallel
clf = RandomForestClassifier(n_jobs=2)
clf.fit(X_train, y_train)

# predict() takes only the features; map integer predictions to species names
preds = iris.target_names[clf.predict(X_test)]

df.head()
#pd.crosstab(iris.target_names[y_test], preds, rownames=['actual'], colnames=['preds'])
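One thing a random forest offers beyond a single tree is an averaged estimate of how much each feature contributes. The sketch below reads the feature_importances_ attribute of a fitted forest; n_estimators=100, random_state=0 and the name forest are assumptions for illustration, and the exact numbers will vary with the random seed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# one importance per feature; the importances sum to 1
importances = forest.feature_importances_
for name, importance in zip(iris.feature_names, importances):
    print('%-20s %.3f' % (name, importance))
```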

We can be explicit and use the train_test_split function in scikit-learn, as shown above for the iris data:

# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)

OR

Be more concise and let cross_val_score do the splitting, fitting and scoring:

import numpy as np
from sklearn import linear_model, model_selection
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
# cv = 2 because this toy dataset has only 5 samples
score = model_selection.cross_val_score(clf, X, y, cv = 2)

There is also a cross_val_predict function that returns the cross-validated predictions rather than scores.
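The difference between the two cross-validation helpers can be sketched on the iris data. The estimator (LogisticRegression with max_iter=1000) and cv=5 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
clf_cv = LogisticRegression(max_iter=1000)

# one score (accuracy, for a classifier) per fold
scores = cross_val_score(clf_cv, X_iris, y_iris, cv=5)

# one prediction per sample, each made by a model that did not train on that sample
y_cv = cross_val_predict(clf_cv, X_iris, y_iris, cv=5)

print(scores)
print(y_cv.shape)
```

The predictions from cross_val_predict can be fed to tools such as a confusion matrix, whereas cross_val_score only summarizes each fold as a single number.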

BREAK

Learning Algorithms - Unsupervised Learning

Reminder: In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the training set given to the learner is unlabeled, there is no error or reward signal to evaluate a potential solution. Basically, we are just finding a way to represent the data and get as much information from it that we can.

HEY! Remember PCA from above? PCA is actually considered unsupervised learning. We just put it up there because it's a good way to visualize data at the beginning of the ML process.

We are going to continue to use the iris dataset (however, we won't be needing the targets or labels)


In [ ]:
from sklearn import cluster, datasets

# data
iris = datasets.load_iris()
X, y = iris.data, iris.target

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)

# how do our original labels fit into the clusters we found?
print(k_means.labels_[::10])
print(y[::10])

EXERCISE IDEA: Iterate over different numbers of clusters (the n_clusters param) in KMeans
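One possible starting point for that exercise is to record the inertia (the within-cluster sum of squared distances) for each number of clusters. This is only a sketch: the range of k, n_init=10 and random_state=0 are assumptions, and inertia is just one of several ways to compare cluster counts.

```python
from sklearn import cluster, datasets

X_iris = datasets.load_iris().data

inertias = {}
for k in range(1, 7):
    km = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_iris)
    inertias[k] = km.inertia_      # lower means tighter clusters
    print(k, round(km.inertia_, 1))
```

Inertia generally keeps falling as k grows, so the usual heuristic is to look for the k after which the improvement levels off rather than for the minimum.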

BREAK

Evaluating - using metrics

  • Confusion matrix - visually inspect quality of a classifier's predictions (more here) - very useful to see if a particular class is problematic

Here, we will process some data, classify it with SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and view the quality of the classification with a confusion matrix.


In [38]:
import numpy as np

# import model algorithm and data
from sklearn import svm, datasets

# import splitter
from sklearn.model_selection import train_test_split  # cross_validation was removed in scikit-learn 0.20

# import metrics
from sklearn.metrics import confusion_matrix

# feature data (X) and labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 42)

In [39]:
# perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel = 'linear', C = 0.01)
y_pred = clf.fit(X_train, y_train).predict(X_test)

In [40]:
# Define a plotting function for confusion matrices
#  (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, target_names, title = 'The Confusion Matrix', cmap = plt.cm.YlOrRd):
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.tight_layout()
    
    # Add feature labels to x and y axes
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    
    plt.colorbar()

Numbers in confusion matrix:

  • on-diagonal - counts of points for which the predicted label is equal to the true label
  • off-diagonal - counts of mislabeled points

In [41]:
%matplotlib inline

cm = confusion_matrix(y_test, y_pred)

# see the actual counts
print(cm)

# visually inspect how the classifier did matching predictions to true labels
plot_confusion_matrix(cm, iris.target_names)


[[19  0  0]
 [ 0 12  1]
 [ 0  0 13]]

  • Classification reports - a text report with important classification metrics (e.g. precision, recall)

In [42]:
from sklearn.metrics import classification_report

# Using the test and prediction sets from above
print(classification_report(y_test, y_pred, target_names = iris.target_names))


             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        19
 versicolor       1.00      0.92      0.96        13
  virginica       0.93      1.00      0.96        13

avg / total       0.98      0.98      0.98        45
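The per-class numbers in the report can be recovered from the confusion matrix counts printed earlier. The sketch below copies that 3x3 matrix by hand into an array named cm_example (a name introduced here for illustration); in sklearn's confusion matrix, rows are true labels and columns are predicted labels.

```python
import numpy as np

# the confusion matrix printed above (rows: true labels, columns: predictions)
cm_example = np.array([[19, 0, 0],
                       [0, 12, 1],
                       [0, 0, 13]])

# accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm_example) / cm_example.sum()

# recall per class: diagonal over row sums (all true members of the class)
recall = np.diag(cm_example) / cm_example.sum(axis=1)

# precision per class: diagonal over column sums (all predictions of the class)
precision = np.diag(cm_example) / cm_example.sum(axis=0)

print(round(accuracy, 2))
print(recall.round(2))
print(precision.round(2))
```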


In [43]:
# Another example with some toy data

y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']
y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']

# How did our predictor do?
# (target_names must list each class once, in sorted label order)
print(classification_report(y_test, y_pred, target_names = sorted(set(y_test))))


             precision    recall  f1-score   support

        cat       0.50      0.33      0.40         3
        dog       1.00      1.00      1.00         1
      mouse       0.33      0.50      0.40         2

avg / total       0.53      0.50      0.50         6

Evaluating Models and Under/Over-Fitting

  • Over-fitting or under-fitting can be visualized as below and tuned, as we will see later, with GridSearchCV parameter tuning
  • A validation curve gives one an idea of the relationship of model complexity to model performance.
  • For this examination it would help to understand the idea of the bias-variance tradeoff.
  • A learning curve helps answer the question of if there is an added benefit to adding more training data to a model. It is also a tool for investigating whether an estimator is more affected by variance error or bias error.
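A validation curve can be computed with sklearn.model_selection.validation_curve. The sketch below varies the max_depth of a decision tree on iris; the estimator, the depth range and cv=5 are assumptions chosen for illustration, and the scores are printed rather than plotted to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
depths = np.arange(1, 8)

# rows: one per max_depth value, columns: one per cross-validation fold
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_iris, y_iris,
    param_name='max_depth', param_range=depths, cv=5)

mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for d, tr, va in zip(depths, mean_train, mean_valid):
    print('max_depth=%d  train=%.3f  validation=%.3f' % (d, tr, va))
```

A training score that keeps rising while the validation score stalls or drops suggests over-fitting; both scores being low suggests under-fitting.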

In [ ]:
import numpy as np

# Let's run a prediction on some test data given a trained model

# First, create some data
X = np.sort(np.random.rand(20))
func = lambda x: np.cos(1.5 * np.pi * x)
y = np.array([func(x) for x in X])

In [ ]:
# A plotting function

import matplotlib.pyplot as plt
%matplotlib inline

def plot_fit(X_train, y_train, X_test, y_pred):
    plt.plot(X_test, y_pred, label = "Model")
    plt.plot(X_test, func(X_test), label = "Function")
    plt.scatter(X_train, y_train,  label = "Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend()

BREAK

Easy reading...create and use a pipeline

Pipelining (as an aside to this section)

  • Pipeline(steps=[...]) - where steps is a list of (name, estimator) tuples through which the data is passed in order; every step but the last must be a transformer
  • For example, here we do a transformation (SelectKBest) and a classification (SVC) all at once in a pipeline we set up

from sklearn import svm
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# a feature selection instance
selection = SelectKBest(chi2, k = 2)

# classification instance
clf = svm.SVC(kernel = 'linear')

# make a pipeline
pipeline = Pipeline([("feature selection", selection), ("classification", clf)])

# train the model (X, y here are, e.g., the iris features and labels)
pipeline.fit(X, y)

See a full example here

Note: If you wish to apply several transformations in parallel and concatenate their outputs, try FeatureUnion
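A FeatureUnion might look like the sketch below, which concatenates two PCA components with the single best-scoring original feature before a linear SVM. The particular transformers, step names and parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# run both transformers on the same input and stack their outputs side by side
combined = FeatureUnion([("pca", PCA(n_components=2)),
                         ("kbest", SelectKBest(f_classif, k=1))])

union_pipeline = Pipeline([("features", combined),
                           ("svm", SVC(kernel="linear"))])
union_pipeline.fit(X_iris, y_iris)

# 2 PCA columns + 1 selected column
X_combined = union_pipeline.named_steps["features"].transform(X_iris)
print(X_combined.shape)
```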


In [ ]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree = 1, include_bias = False)
lm = LinearRegression()

In [ ]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("polynomial_features", poly),
                         ("linear_regression", lm)])
pipeline.fit(X[:, np.newaxis], y)


X_test = np.linspace(0, 1, 100)

y_pred = pipeline.predict(X_test[:, np.newaxis])

plot_fit(X, y, X_test, y_pred)

Last, but not least, Searching Parameter Space with GridSearchCV


In [ ]:
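from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(include_bias = False)
lm = LinearRegression()

pipeline = Pipeline([("polynomial_features", poly),
                         ("linear_regression", lm)])

# step name + double underscore + parameter name addresses a parameter inside the pipeline
# (LinearRegression's normalize option was removed in scikit-learn 1.2, so
#  fit_intercept is searched here instead)
param_grid = dict(polynomial_features__degree = list(range(1, 30, 2)),
                  linear_regression__fit_intercept = [False, True])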

grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X[:, np.newaxis], y)
print(grid_search.best_params_)

BREAK


In [ ]: