Learning Algorithms - Supervised Learning

Reminder: All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y. (direct quote from sklearn docs)
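
In code, that interface looks the same no matter which estimator you pick. A minimal sketch (the k-nearest-neighbors classifier here is just an arbitrary illustration of the fit/predict pattern):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)          # fit the estimator on labeled observations
knn.predict(X[:3])     # predict labels for (here, already-seen) observations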

  • Given that Iris is a fairly small, labeled dataset with relatively few features...what algorithm would you start with and why?

"Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."

"Different estimators are better suited for different types of data and different problems."

- from "Choosing the Right Estimator" in the scikit-learn docs

An estimator for recognizing a new iris from its measurements

Or, in machine learning parlance, we fit an estimator on known samples of the iris measurements to predict the class to which an unseen iris belongs.

Let's give it a try! (We are actually going to hold out a small percentage of the iris dataset and check our predictions against the labels)


In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Let's load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# split data into training and test sets using the handy train_test_split func
# in this split, we are "holding out" 30% of the samples and labels (placed into X_test and y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
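
# Optional sanity check: with test_size = 0.3 on the 150-sample iris set,
# we expect 105 training rows and 45 test rows
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)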

In [ ]:
# Let's try a decision tree classification method
from sklearn import tree

t = tree.DecisionTreeClassifier(max_depth = 4,
                                    criterion = 'entropy', 
                                    class_weight = 'balanced',
                                    random_state = 2)
t.fit(X_train, y_train)

t.score(X_test, y_test) # what performance metric is this?
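
# Hint: for classifiers, .score returns mean accuracy on the given test data.
# The same number computed explicitly (reusing the fitted tree t from above):
from sklearn.metrics import accuracy_score
accuracy_score(y_test, t.predict(X_test))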

In [ ]:
# What were the labels associated with the held-out test samples?
# Let's predict on our "held out" samples and compare
y_pred = t.predict(X_test)
print(y_pred)

#  fill in the blank below

# how did our prediction do for first sample in test dataset?
print("Prediction: %d, Original label: %d" % (y_pred[0], y_test[0])) # <-- fill in blank
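
# To compare all held-out predictions at once, a confusion matrix is handy
# (rows are true classes, columns are predicted classes); this reuses y_pred from above:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))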

In [ ]:
# Here's a nifty way to cross-validate (useful for quick model evaluation!)
from sklearn.model_selection import cross_val_score

t = tree.DecisionTreeClassifier(max_depth = 4,
                                    criterion = 'entropy', 
                                    class_weight = 'balanced',
                                    random_state = 2)

# splits, fits and predicts all in one with a score (does this multiple times)
score = cross_val_score(t, X, y)
score

QUESTIONS: What do these scores tell you? Do you think they are too high or too low? If a score is 1.0, what does that mean?
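
One way to read the scores is as a mean and a spread across folds; a short sketch (assumes the score array from the cell above):

import numpy as np

print("Per-fold accuracy:", score)
print("Mean: %.3f, Std: %.3f" % (np.mean(score), np.std(score)))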

What does the graph look like for this decision tree? That is, what are the "questions" and "decisions" for this tree...

  • Note: You need both the Graphviz application and the Python package graphviz (it's worth it for this cool decision tree graph, I promise!). An alternative that skips Graphviz entirely is sketched after this list.
  • To install both on OS X:
    sudo port install graphviz
    sudo pip install graphviz
  • For general installation instructions, see this guide
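
If installing Graphviz is not an option, newer scikit-learn releases (0.21+) can draw the same kind of tree directly with matplotlib via sklearn.tree.plot_tree; a minimal sketch (assumes the fitted classifier t from above and that matplotlib is installed):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(t, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, rounded=True)
plt.show()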

In [2]:
from sklearn.tree import export_graphviz
import graphviz

# Let's rerun the decision tree classifier
from sklearn import tree

t = tree.DecisionTreeClassifier(max_depth = 4,
                                    criterion = 'entropy', 
                                    class_weight = 'balanced',
                                    random_state = 2)
t.fit(X_train, y_train)

t.score(X_test, y_test) # what performance metric is this?

export_graphviz(t, out_file="mytree.dot",  
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)

with open("mytree.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph, format = 'png')


Out[2]:
[Rendered decision tree graph: the root node splits on petal width (cm) ≤ 0.8, sending all setosa samples down the True branch; the remaining samples split on petal width (cm) ≤ 1.65, then on petal length (cm) ≤ 4.95 / ≤ 4.85 and sepal width (cm) ≤ 3.1, separating versicolor from virginica.]

From Decision Tree to Random Forest


In [ ]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [ ]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(max_depth=4,
                                criterion = 'entropy', 
                                n_estimators = 100, 
                                class_weight = 'balanced',
                                n_jobs = -1,
                               random_state = 2)

#forest = RandomForestClassifier()
forest.fit(X_train, y_train)

y_preds = iris.target_names[forest.predict(X_test)]

forest.score(X_test, y_test)
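
# Random forests also report how much each feature contributed to the splits;
# a quick look at the fitted forest's feature importances (higher means more informative):
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print("%s: %.3f" % (name, importance))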

In [ ]:
# Here's a nifty way to cross-validate (useful for model evaluation!)
from sklearn.model_selection import cross_val_score

# reinitialize classifier
forest = RandomForestClassifier(max_depth=4,
                                criterion = 'entropy', 
                                n_estimators = 100, 
                                class_weight = 'balanced',
                                n_jobs = -1,
                               random_state = 2)

score = cross_val_score(forest, X, y)
score

QUESTION: Compared with the single decision tree, what do these accuracy scores tell you? Do they seem more reasonable?
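
A direct way to compare the two models is to cross-validate both on the same data and look at the mean scores; a small sketch (reuses the t and forest estimators defined above):

from sklearn.model_selection import cross_val_score

tree_scores = cross_val_score(t, X, y)
forest_scores = cross_val_score(forest, X, y)
print("Decision tree mean accuracy: %.3f" % tree_scores.mean())
print("Random forest mean accuracy: %.3f" % forest_scores.mean())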

Splitting into train and test set vs. cross-validation

We can be explicit and use scikit-learn's train_test_split function, as shown above for the iris data:

# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)

Or be more concise and let cross_val_score handle the splitting, fitting, and scoring in one call:

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
score = cross_val_score(clf, X, y)

There is also a cross_val_predict function, which returns the cross-validated predictions themselves rather than scores; it is very useful for evaluating models ( cross_val_predict )
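
A short sketch of cross_val_predict on the iris data (each sample's prediction comes from a model that never saw that sample during fitting):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
y_cv_pred = cross_val_predict(DecisionTreeClassifier(random_state=2), X, y, cv=5)
print(accuracy_score(y, y_cv_pred))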

Created by a Microsoft Employee.

The MIT License (MIT)
Copyright (c) 2016 Micheleen Harris