Decision Trees

An introductory example of decision trees using data from this interactive visualization. It is deliberately simplified: it skips normalization as a pre-processing step and does not use cross-validation to tune the model.

Set up



In [99]:

    
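# Load packages
from __future__ import division  # must be the first statement in the cell; has no effect on Python 3
import pandas as pd
from sklearn import tree
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np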



In [88]:

    
# Read data
df = pd.read_csv('./data/housing-data.csv')

Data Exploration

Some basic exploratory analysis before creating a decision tree



In [143]:

    
# What is the shape of our data?



In [144]:

    
# What variables are present in the dataset?



In [145]:

    
# What is the distribution of our outcome variable `in_sf`?
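
One possible way to answer the three questions above is sketched here. Because the real `housing-data.csv` is not reproduced in this notebook, the sketch builds a small synthetic stand-in: only `in_sf` and elevation come from the prompts, while the `sqft` column and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for housing-data.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
in_sf = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": np.where(in_sf == 1, rng.normal(60, 20, n), rng.normal(10, 5, n)),
    "sqft": rng.normal(1500, 400, n),  # hypothetical column
})

# Shape: (number of rows, number of columns)
n_rows, n_cols = df.shape

# Variables present in the dataset
column_names = list(df.columns)

# Distribution of the outcome: raw counts and proportions per class
outcome_counts = df["in_sf"].value_counts()
outcome_share = df["in_sf"].value_counts(normalize=True)

print(n_rows, n_cols)
print(column_names)
print(outcome_counts)
print(outcome_share)
```

With the real data, the same three calls (`df.shape`, `df.columns`, `value_counts`) apply unchanged after `pd.read_csv`.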



In [146]:

    
# How does elevation vary for houses in vs. not in SF? (I suggest an overlapping histogram)
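
A minimal sketch of the overlapping histogram, again on a synthetic stand-in for the housing data (the values are invented, and the column name `elevation` is assumed from the prompt). Both groups share the same bin edges so the bars line up, and `alpha` makes the overlap visible.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for housing-data.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
in_sf = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": np.where(in_sf == 1, rng.normal(60, 20, n), rng.normal(10, 5, n)),
})

# Split elevation by outcome class
sf = df.loc[df["in_sf"] == 1, "elevation"]
not_sf = df.loc[df["in_sf"] == 0, "elevation"]

# Shared bin edges so the two histograms are directly comparable
bins = np.histogram_bin_edges(df["elevation"], bins=20)

fig, ax = plt.subplots()
ax.hist(sf, bins=bins, alpha=0.5, label="in SF")
ax.hist(not_sf, bins=bins, alpha=0.5, label="not in SF")
ax.set_xlabel("elevation")
ax.set_ylabel("number of houses")
ax.legend()
```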

Build a decision tree using all variables



In [147]:

    
# Create variables to hold features and outcomes separately



In [148]:

    
# Split data into testing and training sets



In [149]:

    
# Create a classifier and fit your features to your outcome
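
The three steps above could look roughly like this. The sketch runs on a synthetic stand-in for the housing data; the `sqft` column, the 70/30 split, and `random_state=0` are illustrative assumptions, not choices made by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in for housing-data.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
in_sf = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": np.where(in_sf == 1, rng.normal(60, 20, n), rng.normal(10, 5, n)),
    "sqft": rng.normal(1500, 400, n),  # hypothetical column
})

# Features are every column except the outcome
features = df.drop(columns="in_sf")
outcome = df["in_sf"]

# Hold out 30% of the rows for testing (the fraction is an assumption)
train_features, test_features, train_outcome, test_outcome = train_test_split(
    features, outcome, test_size=0.3, random_state=0
)

# Fit a decision tree on the training rows only
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(train_features, train_outcome)
```

Fixing `random_state` only makes the split and the tree reproducible between runs; it is not needed for correctness.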

Assess Model Fit



In [150]:

    
# Generate a set of predictions for your test data



In [151]:

    
# Calculate accuracy for our test set (percentage of the time that prediction == truth)



In [152]:

    
# By comparison, how well do we predict in our training data?
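
A sketch of the assessment steps, using a synthetic stand-in for the housing data (invented values, hypothetical `sqft` column, assumed 70/30 split). Accuracy is computed directly as the share of predictions that equal the truth, and `clf.score` gives the same quantity for the training set.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in for housing-data.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
in_sf = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": np.where(in_sf == 1, rng.normal(60, 20, n), rng.normal(10, 5, n)),
    "sqft": rng.normal(1500, 400, n),  # hypothetical column
})
features = df.drop(columns="in_sf")
outcome = df["in_sf"]
train_features, test_features, train_outcome, test_outcome = train_test_split(
    features, outcome, test_size=0.3, random_state=0
)
clf = tree.DecisionTreeClassifier(random_state=0).fit(train_features, train_outcome)

# Predictions for the held-out test rows
test_preds = clf.predict(test_features)

# Test accuracy: fraction of rows where prediction == truth
test_accuracy = (test_preds == test_outcome).mean()

# Training accuracy, for comparison (score() returns mean accuracy)
train_accuracy = clf.score(train_features, train_outcome)

print("test accuracy: ", test_accuracy)
print("train accuracy:", train_accuracy)
```

An unpruned tree usually scores at or near 100% on its own training data, so a gap between the two numbers is a sign of overfitting rather than a bug.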

Show the tree

Displaying the tree is a little bit of a pain, though there are some alternatives to the approach in the documentation, presented here. You may first have to do the following:

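# Install graphviz in your terminal
conda install graphviz
# If `import graphviz` still fails, the Python bindings may need installing too
conda install python-graphviz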

I then suggest the following solution:

import graphviz

tree.export_graphviz(clf, out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)



In [153]:

    
# Create tree diagram
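
As a variation on the approach suggested above, `export_graphviz` can return the DOT source as a string when `out_file=None`, which avoids the temporary file. The sketch also prints a plain-text view with `export_text`, an alternative that needs no graphviz install. The data is a synthetic stand-in, and `max_depth=3`, the `sqft` column, and the class labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic stand-in for housing-data.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
in_sf = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": np.where(in_sf == 1, rng.normal(60, 20, n), rng.normal(10, 5, n)),
    "sqft": rng.normal(1500, 400, n),  # hypothetical column
})
features = df.drop(columns="in_sf")
outcome = df["in_sf"]

# A shallow tree keeps the diagram readable (depth limit is an assumption)
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(features, outcome)

# DOT source as a string; class_names follow ascending class order (0, then 1)
dot_graph = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=list(features.columns),
    class_names=["not in SF", "in SF"],
    filled=True,
)

# With the graphviz package installed, this renders inline in a notebook:
#   import graphviz
#   graphviz.Source(dot_graph)

# Plain-text alternative that needs no extra installs
text_view = tree.export_text(clf, feature_names=list(features.columns))
print(text_view)
```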

Comparison to KNN

Purely out of curiosity, how well does KNN (with K=3) fit the same data?



In [140]:

    
# Create a knn classifier



In [141]:

    
# Fit our classifier to our training data



In [154]:

    
# Predict on our test data and assess accuracy
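
The KNN comparison might be sketched as follows, on a synthetic stand-in for the housing data (invented values, hypothetical `sqft` column, assumed 70/30 split). Since KNN is distance-based, leaving the features unnormalized, as this notebook does, lets large-valued columns dominate the neighbor search.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for housing-data.csv (all values are invented)
rng = np.random.default_rng(0)
n = 200
in_sf = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": np.where(in_sf == 1, rng.normal(60, 20, n), rng.normal(10, 5, n)),
    "sqft": rng.normal(1500, 400, n),  # hypothetical column
})
features = df.drop(columns="in_sf")
outcome = df["in_sf"]
train_features, test_features, train_outcome, test_outcome = train_test_split(
    features, outcome, test_size=0.3, random_state=0
)

# Create a KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the classifier to the training data
knn.fit(train_features, train_outcome)

# Predict on the test data and compute accuracy
knn_preds = knn.predict(test_features)
knn_accuracy = (knn_preds == test_outcome).mean()
print("KNN test accuracy:", knn_accuracy)
```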