Reminder: All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y. (direct quote from the sklearn docs)
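As a minimal sketch of that shared interface (the estimator choice below is just an example; we'll work with the iris data properly in the cells that follow):
In [ ]:
# minimal sketch of the shared estimator API: fit(X, y), then predict(X)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier  # any supervised estimator follows the same pattern
iris = load_iris()
est = DecisionTreeClassifier()
est.fit(iris.data, iris.target)      # learn from labeled samples
print(est.predict(iris.data[:3]))    # return predicted labels for some observations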
"Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."
"Different estimators are better suited for different types of data and different problems."
An estimator for recognizing a new iris from its measurements
Or, in machine learning parlance, we fit an estimator on known samples of the iris measurements to predict the class to which an unseen iris belongs.
Let's give it a try! (We are actually going to hold out a small percentage of the iris dataset and check our predictions against the labels)
In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
# Let's load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# split data into training and test sets using the handy train_test_split func
# in this split, we are "holding out" 30% of the samples and their labels (placed into X_test and y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [ ]:
# Let's try a decision tree classification method
from sklearn import tree
t = tree.DecisionTreeClassifier(max_depth=4,
                                criterion='entropy',
                                class_weight='balanced',
                                random_state=2)
t.fit(X_train, y_train)
t.score(X_test, y_test) # what performance metric is this?
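One way to investigate the question above is to compare the output of score() against an explicit metric from sklearn.metrics (a small sketch; accuracy_score is used here as the candidate metric):
In [ ]:
# compare the classifier's score() output to an explicit metric on the same predictions
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, t.predict(X_test)))
print(t.score(X_test, y_test))  # do the two numbers match?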
In [ ]:
# What was the label associated with this test sample? ("held out" sample's original label)
# Let's predict on our "held out" sample
y_pred = t.predict(X_test)
print(y_pred)
# fill in the blank below
# how did our prediction do for first sample in test dataset?
print("Prediction: %d, Original label: %d" % (y_pred[0], y_test[0])) # <-- fill in blank
In [ ]:
# Here's a nifty way to cross-validate (useful for quick model evaluation!)
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
t = tree.DecisionTreeClassifier(max_depth=4,
                                criterion='entropy',
                                class_weight='balanced',
                                random_state=2)
# splits, fits and predicts all in one with a score (does this multiple times)
score = cross_val_score(t, X, y)
score
QUESTIONS: What do these scores tell you? Do you think they are too high or too low? If a score is 1.0, what does that mean?
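A small sketch for summarizing those cross-validation scores (cv=5 is an explicit assumption here; the default number of folds has varied across scikit-learn versions):
In [ ]:
# summarize cross-validation results with a mean and spread
from sklearn.model_selection import cross_val_score
score = cross_val_score(t, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))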
To draw the tree you'll need graphviz (it's worth it for this cool decision tree graph, I promise!):
sudo port install graphviz
sudo pip install graphviz
In [2]:
from sklearn.tree import export_graphviz
import graphviz
# Let's rerun the decision tree classifier
from sklearn import tree
t = tree.DecisionTreeClassifier(max_depth=4,
                                criterion='entropy',
                                class_weight='balanced',
                                random_state=2)
t.fit(X_train, y_train)
t.score(X_test, y_test)  # what performance metric is this?
export_graphviz(t, out_file="mytree.dot",
                feature_names=iris.feature_names,
                class_names=iris.target_names,
                filled=True, rounded=True,
                special_characters=True)
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph, format='png')
Out[2]: (rendered decision tree graph)
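If the inline rendering doesn't show up in your environment, one option (a sketch; the file name is just an example) is to write the graph out to a PNG file instead:
In [ ]:
# save the rendered tree to mytree.png (file name is arbitrary)
src = graphviz.Source(dot_graph, format='png')
src.render("mytree")  # writes mytree.png alongside the .dot source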
In [ ]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth=4,
                                criterion='entropy',
                                n_estimators=100,
                                class_weight='balanced',
                                n_jobs=-1,
                                random_state=2)
#forest = RandomForestClassifier()
forest.fit(X_train, y_train)
y_preds = iris.target_names[forest.predict(X_test)]
forest.score(X_test, y_test)
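To eyeball a few of those predictions against the true species names, a short sketch building on the y_preds line above:
In [ ]:
# compare predicted species names to the actual ones for the first few held-out samples
for pred, actual in zip(y_preds[:5], iris.target_names[y_test[:5]]):
    print("Predicted: %s, Actual: %s" % (pred, actual))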
In [ ]:
# Here's a nifty way to cross-validate (useful for model evaluation!)
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
# reinitialize classifier
forest = RandomForestClassifier(max_depth=4,
                                criterion='entropy',
                                n_estimators=100,
                                class_weight='balanced',
                                n_jobs=-1,
                                random_state=2)
score = cross_val_score(forest, X, y)
score
QUESTION: Compared to the decision tree method, what do these accuracy scores tell you? Do they seem more reasonable?
We can be explicit and use the train_test_split function in scikit-learn (as shown above for the iris data):
# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
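The fitted model can then be evaluated on the held-out portion (a short sketch continuing the example above):
# score the model on the 30% of the made-up data that was held out
print(clf.score(X_test, y_test))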
OR
Be more concise and let cross_val_score handle the splitting, fitting, and scoring in one call:
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
score = cross_val_score(clf, X, y)
There is also a cross_val_predict function, which returns the cross-validated predictions themselves rather than scores; it is very useful for evaluating models ( cross_val_predict )
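A short sketch of cross_val_predict on the same made-up data (cv=2 is an assumption chosen so each fold has more than one sample):
# cross-validated predictions, one per sample, compared against the original labels
from sklearn.model_selection import cross_val_predict
y_cv_pred = cross_val_predict(clf, X, y, cv=2)
print(y_cv_pred)
print(list(y))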
Created by a Microsoft Employee.
The MIT License (MIT)
Copyright (c) 2016 Micheleen Harris