Decision Trees

An introductory example of decision trees, using data from this interactive visualization. The example is deliberately simplified: it skips normalization as a pre-processing step and does not use cross-validation to tune the model.

Set up


In [99]:
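# Load packages
from __future__ import division  # must be the first statement; only needed on Python 2
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation (removed in scikit-learn 0.20)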
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [88]:
# Read data
df = pd.read_csv('./data/housing-data.csv')

Data Exploration

Some basic exploratory analysis before creating a decision tree.


In [143]:
# What is the shape of our data?

In [144]:
# What variables are present in the dataset?

In [145]:
# What is the distribution of our outcome variable `in_sf`?
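The three questions above can be answered with a few pandas calls. The following is a minimal sketch rather than the intended solution: it assumes a synthetic stand-in for `housing-data.csv` that contains only the `in_sf` and `elevation` columns, with made-up values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data; the values are made up.
rng = np.random.default_rng(0)
in_sf = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": rng.normal(10, 5, size=200) + 5 * in_sf,
})

n_rows, n_cols = df.shape                    # shape of the data
variables = df.columns.tolist()              # variables present in the dataset
outcome_counts = df["in_sf"].value_counts()  # distribution of the outcome

print(n_rows, n_cols)
print(variables)
print(outcome_counts)
```

On the real file, the same calls apply to the `df` loaded with `pd.read_csv`.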

In [146]:
# How does elevation vary for houses in vs. not in SF? (I suggest an overlapping histogram)
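One way to draw the overlapping histogram is to plot each group on the same axes with shared bin edges and some transparency. This sketch again assumes synthetic data with made-up values; the bin count and `alpha` are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the housing data; the values are made up.
rng = np.random.default_rng(0)
in_sf = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": rng.normal(10, 5, size=200) + 5 * in_sf,
})

fig, ax = plt.subplots()

# Shared bin edges keep the two histograms directly comparable.
bins = np.linspace(df["elevation"].min(), df["elevation"].max(), 30)

# One semi-transparent histogram per value of the outcome.
for label, group in df.groupby("in_sf"):
    ax.hist(group["elevation"], bins=bins, alpha=0.5, label=f"in_sf = {label}")

ax.set_xlabel("elevation")
ax.set_ylabel("number of houses")
ax.legend()
```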

Build a decision tree using all variables


In [147]:
# Create variables to hold features and outcomes separately

In [148]:
# Split data into testing and training sets

In [149]:
# Create a classifier and fit your features to your outcome
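The three steps above might look like the sketch below. It assumes synthetic data: `in_sf` and `elevation` come from the notebook, while `price_per_sqft` is a hypothetical column added for illustration. The 25% test split and the `random_state` values are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; the values are made up.
rng = np.random.default_rng(0)
in_sf = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": rng.normal(10, 5, size=200) + 5 * in_sf,
    "price_per_sqft": rng.normal(600, 100, size=200) + 100 * in_sf,  # hypothetical column
})

# Hold features and outcome separately
features = df.drop(columns="in_sf")
outcome = df["in_sf"]

# Split into training and testing sets (25% held out for testing)
train_features, test_features, train_outcome, test_outcome = train_test_split(
    features, outcome, test_size=0.25, random_state=11
)

# Create the classifier and fit it on the training data only
clf = tree.DecisionTreeClassifier(random_state=11)
clf = clf.fit(train_features, train_outcome)
```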

Assess Model Fit


In [150]:
# Generate a set of predictions for your test data

In [151]:
# Calculate accuracy for our test set (percentage of the time that prediction == truth)

In [152]:
# By comparison, how well do we predict in our training data?
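Accuracy here is the fraction of predictions that match the truth, computed once on the held-out test set and once on the training set. The sketch below assumes synthetic data (with a hypothetical `price_per_sqft` column) and an unrestricted tree; it is one way to make the comparison, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; the values are made up.
rng = np.random.default_rng(0)
in_sf = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": rng.normal(10, 5, size=200) + 5 * in_sf,
    "price_per_sqft": rng.normal(600, 100, size=200) + 100 * in_sf,  # hypothetical column
})
features, outcome = df.drop(columns="in_sf"), df["in_sf"]
train_features, test_features, train_outcome, test_outcome = train_test_split(
    features, outcome, test_size=0.25, random_state=11
)
clf = tree.DecisionTreeClassifier(random_state=11).fit(train_features, train_outcome)

# Predictions for the held-out test data
predictions = clf.predict(test_features)

# Accuracy: share of cases where prediction == truth
test_accuracy = (predictions == test_outcome).mean()
train_accuracy = (clf.predict(train_features) == train_outcome).mean()

print("test accuracy: ", test_accuracy)
print("train accuracy:", train_accuracy)
```

An unrestricted tree keeps splitting until its leaves are pure, so it typically reproduces the training labels exactly. Training accuracy therefore overstates how well the model generalizes, which is why the test-set figure is the one to report.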

Show the tree

Rendering the tree takes some extra setup, though there are alternatives to the approach presented in the documentation. You may first have to do the following:

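# Install graphviz in your terminal
conda install graphviz
# The Python bindings that provide graphviz.Source are a separate package
conda install python-graphviz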

I then suggest the following solution:

import graphviz

tree.export_graphviz(clf, out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

In [153]:
# Create tree diagram
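If installing Graphviz is not an option, scikit-learn can also describe the tree without it. The sketch below assumes synthetic data (with a hypothetical `price_per_sqft` column) and limits the tree to `max_depth=2` purely to keep the output short. It builds the DOT source in memory with `out_file=None` and prints a plain-text rendering with `export_text`.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic stand-in for the housing data; the values are made up.
rng = np.random.default_rng(0)
in_sf = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": rng.normal(10, 5, size=200) + 5 * in_sf,
    "price_per_sqft": rng.normal(600, 100, size=200) + 100 * in_sf,  # hypothetical column
})
features, outcome = df.drop(columns="in_sf"), df["in_sf"]

# A shallow tree keeps the diagram readable
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=11).fit(features, outcome)
feature_names = list(features.columns)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_source = tree.export_graphviz(clf, out_file=None, feature_names=feature_names)

# Plain-text rendering that needs no Graphviz installation
text_tree = tree.export_text(clf, feature_names=feature_names)
print(text_tree)
```

With Graphviz installed, passing `dot_source` to `graphviz.Source` in a notebook cell displays the diagram.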

Comparison to KNN

Purely out of curiosity, how well does KNN (with K=3) fit the same data?


In [140]:
# Create a knn classifier

In [141]:
# Fit our classifier to our training data

In [154]:
# Predict on our test data and assess accuracy
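The KNN comparison follows the same fit-then-score pattern as the tree. The sketch below assumes synthetic data (with a hypothetical `price_per_sqft` column) and the same arbitrary 25% test split; only `n_neighbors=3` comes from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the housing data; the values are made up.
rng = np.random.default_rng(0)
in_sf = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "in_sf": in_sf,
    "elevation": rng.normal(10, 5, size=200) + 5 * in_sf,
    "price_per_sqft": rng.normal(600, 100, size=200) + 100 * in_sf,  # hypothetical column
})
features, outcome = df.drop(columns="in_sf"), df["in_sf"]
train_features, test_features, train_outcome, test_outcome = train_test_split(
    features, outcome, test_size=0.25, random_state=11
)

# Create a KNN classifier with K=3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the classifier to the training data
knn.fit(train_features, train_outcome)

# Predict on the test data and assess accuracy
knn_predictions = knn.predict(test_features)
knn_accuracy = (knn_predictions == test_outcome).mean()
print("KNN test accuracy:", knn_accuracy)
```

KNN is distance-based, so features measured on larger scales dominate the neighbor search. Since this example skips normalization, the comparison with the tree is only a rough one.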