An introductory example of decision trees, with a short k-nearest neighbors comparison at the end, using data from this interactive visualization. The example is deliberately simplified: it does not normalize the features as a pre-processing step, and it does not use cross-validation to tune the model.
In [1]:
# Load packages
from __future__ import division  # __future__ imports must come first; only needed on Python 2
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
In [2]:
# Read data
df = pd.read_csv('./data/housing-data.csv')
In [3]:
# What is the shape of our data?
df.shape
Out[3]:
In [4]:
# What variables are present in the dataset?
df.columns
Out[4]:
In [5]:
# What is the distribution of our outcome variable `in_sf`?
df.hist('in_sf')
Out[5]:
In [6]:
# How does elevation vary for houses in/not-in sf (I suggest an overlapping histogram)
plt.hist(df.elevation[df.in_sf == 1], alpha=0.5, label='San Francisco')
plt.hist(df.elevation[df.in_sf != 1], alpha=0.5, label='New York')
plt.legend(loc='upper right')
plt.show()
In [7]:
# Create variables to hold features and outcomes separately
features = df.drop('in_sf', axis=1)
outcome = df.in_sf
In [8]:
# Split data into testing and training sets
train_features, test_features, train_outcome, test_outcome = train_test_split(features, outcome, test_size=0.30)
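To make the split concrete, here is a minimal pure-Python sketch of what a random 70/30 split does. The helper `simple_split` and its `seed` argument are hypothetical names used only for illustration, and the rounding rule is an assumption rather than scikit-learn's exact behavior. `train_test_split` itself accepts a `random_state` argument if you want the same split on every run.

```python
import random

def simple_split(rows, test_size=0.30, seed=0):
    """Hypothetical helper: shuffle row indices, then hold out a test fraction."""
    rng = random.Random(seed)            # seeded so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]          # first chunk of shuffled indices is held out
    train_idx = indices[n_test:]         # the rest is used for training
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                   # stand-in for 10 houses
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))   # 7 training rows, 3 test rows
```

Every row lands in exactly one of the two sets, which is what lets the test set act as unseen data later on.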
In [10]:
# Create a classifier and fit your features to your outcome
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train_features, train_outcome)
clf
Out[10]:
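By default, `DecisionTreeClassifier` chooses splits using the Gini impurity criterion. The sketch below is a simplified, pure-Python illustration of that idea, not scikit-learn's implementation: it scores candidate thresholds on a single feature by the weighted impurity of the two resulting groups. The elevation values are made up for the example and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up elevations (not taken from the housing data) and in_sf labels
elevation = [2, 5, 8, 10, 40, 60, 75, 90]
in_sf = [0, 0, 0, 0, 1, 1, 1, 1]

before = gini(in_sf)                                # impurity with no split
clean_split = split_impurity(elevation, in_sf, 10)  # separates the two classes
poor_split = split_impurity(elevation, in_sf, 5)    # leaves a mixed right-hand group
print(before, clean_split, poor_split)
```

A lower weighted impurity means a better split, so a tree built this way would prefer the threshold of 10 over the threshold of 5 on this toy data.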
In [11]:
# Generate a set of predictions for your test data
test_preds = clf.predict(test_features)
test_preds
Out[11]:
In [12]:
# Calculate accuracy for our test set (percentage of the time that prediction == truth)
test_acc = (test_preds == test_outcome).sum()/len(test_outcome)
test_acc
Out[12]:
In [13]:
# By comparison, how well do we predict in our training data?
training_preds = clf.predict(train_features)
train_acc = (training_preds == train_outcome).sum()/len(train_outcome)
train_acc # Perfectly
Out[13]:
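A fully grown tree can fit its training data perfectly without generalizing equally well. One common way to limit this is to cap the depth of the tree. The sketch below compares an unrestricted tree with a shallow one; it uses synthetic data from `make_classification` as a stand-in (an assumption, since it is not the housing dataset), and the depth of 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the housing dataset)
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each depth setting
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(depth, results[depth])
```

Comparing the training and test accuracy for each setting shows how much of the training fit carries over to unseen data.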
Visualizing the tree is a little bit of a pain, though there are some alternatives to the documentation presented here. You may have to do the following:
# Install graphviz in your terminal
conda install graphviz
I then suggest the following solution:
import graphviz
tree.export_graphviz(clf, out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
In [15]:
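import graphviz
# class_names must be in ascending class order; assuming in_sf is coded 0 = NYC, 1 = San Francisco
tree.export_graphviz(clf, feature_names=list(features.columns), class_names=['NYC', 'San Fran'], out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)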
Out[15]:
In [16]:
# Create a knn classifier
knn_clf = KNeighborsClassifier(n_neighbors = 3)
In [17]:
# Fit our classifier to our training data
knn_fit = knn_clf.fit(train_features, train_outcome)
In [18]:
# Predict on our test data and assess accuracy
knn_test_preds = knn_fit.predict(test_features)
# Use a separate name so the decision tree's test_acc is not overwritten
knn_test_acc = (knn_test_preds == test_outcome).sum()/len(test_outcome)
knn_test_acc
Out[18]:
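k-nearest neighbors relies on distances, so features measured on large scales can dominate the result, and the choice of `n_neighbors` above is arbitrary. The following is a minimal sketch of the two steps this notebook skips: normalizing the features and using cross-validation to pick `n_neighbors`. It runs on synthetic data from `make_classification` (an assumption, not the housing dataset), and the candidate values of k are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the housing dataset)
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

cv_means = {}
for k in (1, 3, 5, 11):
    # Scaling lives inside the pipeline so it is re-fit on each training fold
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, X, y, cv=5)  # 5-fold cross-validated accuracy
    cv_means[k] = scores.mean()
    print(k, round(cv_means[k], 3))

best_k = max(cv_means, key=cv_means.get)  # k with the highest mean accuracy
print("best k:", best_k)
```

Putting the scaler in the pipeline, rather than scaling the full dataset up front, keeps information from the held-out fold from leaking into the fitted scaling parameters.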