Decision Trees

An introductory example of decision trees using data from this interactive visualization. This is an over-simplified example that doesn't use normalization as a pre-processing step, or cross validation as a mechanism for tuning the model.

Set up


In [1]:
# Load packages
import pandas as pd
from sklearn import tree
from __future__ import division
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# Read data
df = pd.read_csv('./data/housing-data.csv')

Data Exploration

Some basic exploratory analysis before creating a decision tree


In [3]:
# What is the shape of our data?
df.shape


Out[3]:
(492, 8)

In [4]:
# What variables are present in the dataset?
df.columns


Out[4]:
Index([u'in_sf', u'beds', u'bath', u'price', u'year_built', u'sqft',
       u'price_per_sqft', u'elevation'],
      dtype='object')

In [5]:
# What is the distribution of our outcome variable `in_sf`?
df.hist('in_sf')


Out[5]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x115bb79d0>]], dtype=object)

In [6]:
# How does elevation vary for houses in/not-in sf (I suggest an overlapping histogram)
plt.hist(df.elevation[df.in_sf == 1], alpha=0.5, label='San Francisco')
plt.hist(df.elevation[df.in_sf != 1], alpha=0.5, label='New York')
plt.legend(loc='upper right')
plt.show()


Build a decision tree using all variables


In [7]:
# Create variables to hold features and outcomes separately
features = df.drop('in_sf', axis=1)
outcome = df.in_sf

In [8]:
# Split data into testing and training sets
train_features, test_features, train_outcome, test_outcome = train_test_split(features, outcome, test_size=0.30)

In [10]:
# Create a classifier and fit your features to your outcome
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train_features, train_outcome)
clf


Out[10]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Assess Model Fit


In [11]:
# Generate a set of predictions for your test data
test_preds = clf.predict(test_features)
test_preds


Out[11]:
array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 1])

In [12]:
# Calculate accuracy for our test set (percentage of the time that prediction == truth)
test_acc = (test_preds == test_outcome).sum()/len(test_outcome)
test_acc


Out[12]:
0.90540540540540537

In [13]:
# By comparison, how well do we predict in our training data?
training_preds = clf.predict(train_features)
train_acc = (training_preds == train_outcome).sum()/len(train_outcome)
train_acc # Pefectly


Out[13]:
1.0

Show the tree

A little bit of a pain, though there are some alternatives to the documentation presented here. You may have to do the following:

# Install graphviz in your terminal
conda install graphviz

I then suggest the following solution:

tree.export_graphviz(clf, out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

In [15]:
import graphviz
tree.export_graphviz(clf, feature_names=features.columns, class_names=['San Fran', 'NYC'], out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)


Out[15]:
Tree 0 elevation <= 31.5 gini = 0.4957 samples = 344 value = [156, 188] class = NYC 1 price_per_sqft <= 1072.5 gini = 0.4138 samples = 212 value = [150, 62] class = San Fran 0->1 True 68 elevation <= 38.5 gini = 0.0868 samples = 132 value = [6, 126] class = NYC 0->68 False 2 year_built <= 1972.5 gini = 0.4999 samples = 87 value = [43, 44] class = NYC 1->2 29 elevation <= 4.5 gini = 0.2465 samples = 125 value = [107, 18] class = San Fran 1->29 3 price_per_sqft <= 597.0 gini = 0.3656 samples = 54 value = [41, 13] class = San Fran 2->3 24 sqft <= 833.5 gini = 0.1139 samples = 33 value = [2, 31] class = NYC 2->24 4 elevation <= 10.0 gini = 0.32 samples = 10 value = [2, 8] class = NYC 3->4 9 elevation <= 22.0 gini = 0.2014 samples = 44 value = [39, 5] class = San Fran 3->9 5 price_per_sqft <= 447.0 gini = 0.4444 samples = 3 value = [2, 1] class = San Fran 4->5 8 gini = 0.0 samples = 7 value = [0, 7] class = NYC 4->8 6 gini = 0.0 samples = 1 value = [0, 1] class = NYC 5->6 7 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 5->7 10 beds <= 3.5 gini = 0.1356 samples = 41 value = [38, 3] class = San Fran 9->10 21 beds <= 1.5 gini = 0.4444 samples = 3 value = [1, 2] class = NYC 9->21 11 price_per_sqft <= 1019.0 gini = 0.095 samples = 40 value = [38, 2] class = San Fran 10->11 20 gini = 0.0 samples = 1 value = [0, 1] class = NYC 10->20 12 price_per_sqft <= 708.0 gini = 0.0526 samples = 37 value = [36, 1] class = San Fran 11->12 17 price_per_sqft <= 1043.0 gini = 0.4444 samples = 3 value = [2, 1] class = San Fran 11->17 13 elevation <= 11.5 gini = 0.32 samples = 5 value = [4, 1] class = San Fran 12->13 16 gini = 0.0 samples = 32 value = [32, 0] class = San Fran 12->16 14 gini = 0.0 samples = 4 value = [4, 0] class = San Fran 13->14 15 gini = 0.0 samples = 1 value = [0, 1] class = NYC 13->15 18 gini = 0.0 samples = 1 value = [0, 1] class = NYC 17->18 19 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 17->19 22 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 21->22 23 gini = 0.0 samples = 2 value = [0, 2] class = NYC 21->23 25 sqft <= 759.0 gini = 0.4444 samples = 6 value = [2, 4] class = NYC 24->25 28 gini = 0.0 samples = 27 value = [0, 27] class = NYC 24->28 26 gini = 0.0 samples = 4 value = [0, 4] class = NYC 25->26 27 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 25->27 30 price_per_sqft <= 2430.0 gini = 0.5 samples = 14 value = [7, 7] class = San Fran 29->30 39 price_per_sqft <= 1386.5 gini = 0.1786 samples = 111 value = [100, 11] class = San Fran 29->39 31 year_built <= 1920.0 gini = 0.4861 samples = 12 value = [5, 7] class = NYC 30->31 38 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 30->38 32 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 31->32 33 elevation <= 2.5 gini = 0.42 samples = 10 value = [3, 7] class = NYC 31->33 34 year_built <= 2007.0 gini = 0.5 samples = 6 value = [3, 3] class = San Fran 33->34 37 gini = 0.0 samples = 4 value = [0, 4] class = NYC 33->37 35 gini = 0.0 samples = 3 value = [3, 0] class = San Fran 34->35 36 gini = 0.0 samples = 3 value = [0, 3] class = NYC 34->36 40 year_built <= 1986.0 gini = 0.375 samples = 36 value = [27, 9] class = San Fran 39->40 57 elevation <= 25.5 gini = 0.0519 samples = 75 value = [73, 2] class = San Fran 39->57 41 sqft <= 2533.5 gini = 0.142 samples = 26 value = [24, 2] class = San Fran 40->41 50 price_per_sqft <= 1365.5 gini = 0.42 samples = 10 value = [3, 7] class = NYC 40->50 42 year_built <= 1975.5 gini = 0.0799 samples = 24 value = [23, 1] class = San Fran 41->42 47 year_built <= 1920.5 gini = 0.5 samples = 2 value = [1, 1] class = San Fran 41->47 43 gini = 0.0 samples = 20 value = [20, 0] class = San Fran 42->43 44 year_built <= 1979.0 gini = 0.375 samples = 4 value = [3, 1] class = San Fran 42->44 45 gini = 0.0 samples = 1 value = [0, 1] class = NYC 44->45 46 gini = 0.0 samples = 3 value = [3, 0] class = San Fran 44->46 48 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 47->48 49 gini = 0.0 samples = 1 value = [0, 1] class = NYC 47->49 51 price_per_sqft <= 1108.0 gini = 0.2449 samples = 7 value = [1, 6] class = NYC 50->51 54 price <= 2350000.0 gini = 0.4444 samples = 3 value = [2, 1] class = San Fran 50->54 52 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 51->52 53 gini = 0.0 samples = 6 value = [0, 6] class = NYC 51->53 55 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 54->55 56 gini = 0.0 samples = 1 value = [0, 1] class = NYC 54->56 58 elevation <= 18.5 gini = 0.027 samples = 73 value = [72, 1] class = San Fran 57->58 65 price_per_sqft <= 2243.0 gini = 0.5 samples = 2 value = [1, 1] class = San Fran 57->65 59 gini = 0.0 samples = 62 value = [62, 0] class = San Fran 58->59 60 year_built <= 2007.5 gini = 0.1653 samples = 11 value = [10, 1] class = San Fran 58->60 61 gini = 0.0 samples = 9 value = [9, 0] class = San Fran 60->61 62 price <= 10050000.0 gini = 0.5 samples = 2 value = [1, 1] class = San Fran 60->62 63 gini = 0.0 samples = 1 value = [0, 1] class = NYC 62->63 64 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 62->64 66 gini = 0.0 samples = 1 value = [0, 1] class = NYC 65->66 67 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 65->67 69 beds <= 1.5 gini = 0.426 samples = 13 value = [4, 9] class = NYC 68->69 76 price <= 402000.0 gini = 0.033 samples = 119 value = [2, 117] class = NYC 68->76 70 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 69->70 71 price <= 1272500.0 gini = 0.2975 samples = 11 value = [2, 9] class = NYC 69->71 72 gini = 0.0 samples = 8 value = [0, 8] class = NYC 71->72 73 price_per_sqft <= 1231.5 gini = 0.4444 samples = 3 value = [2, 1] class = San Fran 71->73 74 gini = 0.0 samples = 2 value = [2, 0] class = San Fran 73->74 75 gini = 0.0 samples = 1 value = [0, 1] class = NYC 73->75 77 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 76->77 78 price <= 579450.0 gini = 0.0168 samples = 118 value = [1, 117] class = NYC 76->78 79 price <= 554950.0 gini = 0.2188 samples = 8 value = [1, 7] class = NYC 78->79 82 gini = 0.0 samples = 110 value = [0, 110] class = NYC 78->82 80 gini = 0.0 samples = 7 value = [0, 7] class = NYC 79->80 81 gini = 0.0 samples = 1 value = [1, 0] class = San Fran 79->81

Comparion to KNN

Purely out of curiosity, how well does this model fit with KNN (for K=3)


In [16]:
# Create a knn classifier
knn_clf = KNeighborsClassifier(n_neighbors = 3)

In [17]:
# Fit our classifier to our training data
knn_fit = knn_clf.fit(train_features, train_outcome)

In [18]:
# Predict on our test data and assess accuracy
knn_test_preds = knn_fit.predict(test_features)
test_acc = (knn_test_preds == test_outcome).sum()/len(test_outcome)
test_acc


Out[18]:
0.6216216216216216

In [ ]: