An introductory example of decision trees, using data from this interactive visualization. This example is intentionally simplified: it skips normalization as a preprocessing step and cross-validation as a mechanism for tuning the model.
In [99]:
# Load packages
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
In [88]:
# Read data
df = pd.read_csv('./data/housing-data.csv')
In [143]:
# What is the shape of our data?
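One way to fill in this cell — a sketch with a small stand-in frame (column names assumed from the r2d3 housing data); in the notebook you would call `.shape` on the `df` loaded from the CSV:

```python
import pandas as pd

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})

print(df.shape)  # (n_rows, n_columns)
```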
In [144]:
# What variables are present in the dataset?
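A sketch for inspecting the columns, again on a stand-in frame with assumed column names:

```python
import pandas as pd

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})

print(list(df.columns))  # variable names
print(df.dtypes)         # and their types
```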
In [145]:
# What is the distribution of our outcome variable `in_sf`?
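`value_counts` gives the class balance of the outcome; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})

counts = df['in_sf'].value_counts()                 # raw counts per class
share = df['in_sf'].value_counts(normalize=True)    # proportions per class
print(counts)
print(share)
```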
In [146]:
# How does elevation vary for houses in/not-in sf (I suggest an overlapping histogram)
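An overlapping histogram can be drawn by calling `plt.hist` twice with `alpha` below 1; a sketch on stand-in data (the `Agg` backend is only so the snippet runs headless — in the notebook, `%matplotlib inline` already handles display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})

sf = df.loc[df['in_sf'] == 1, 'elevation']
nyc = df.loc[df['in_sf'] == 0, 'elevation']
plt.hist(sf, alpha=0.5, label='SF')
plt.hist(nyc, alpha=0.5, label='not SF')
plt.xlabel('elevation')
plt.legend()
```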
In [147]:
# Create variables to hold features and outcomes separately
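A common pattern is to drop the outcome column to get the feature matrix; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})

features = df.drop('in_sf', axis=1)  # everything except the outcome
outcome = df['in_sf']                # the label we want to predict
```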
In [148]:
# Split data into testing and training sets
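`train_test_split` from `sklearn.model_selection` handles the split; the `test_size` and `random_state` values below are arbitrary choices for the sketch:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']

# Hold out a portion of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)
```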
In [149]:
# Create a classifier and fit your features to your outcome
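A `DecisionTreeClassifier` with default settings is enough for this exercise; a self-contained sketch repeating the setup on stand-in data:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)  # learn splits from the training data
```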
In [150]:
# Generate a set of predictions for your test data
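Once fitted, `predict` returns one label per test row; a self-contained sketch on stand-in data:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)
clf = tree.DecisionTreeClassifier().fit(X_train, y_train)

preds = clf.predict(X_test)  # one predicted label per test row
```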
In [151]:
# Calculate accuracy for our test set (percentage of the time that prediction == truth)
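Accuracy is just the mean of the element-wise comparison between predictions and truth (equivalently, `sklearn.metrics.accuracy_score`); a self-contained sketch on stand-in data:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)
clf = tree.DecisionTreeClassifier().fit(X_train, y_train)

preds = clf.predict(X_test)
accuracy = (preds == y_test).mean()  # fraction where prediction == truth
print(accuracy)
```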
In [152]:
# By comparison, how well do we predict in our training data?
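`clf.score` computes accuracy directly; on the training data an unpruned tree will typically score at or near 1.0, which is the overfitting this comparison is meant to surface. A self-contained sketch on stand-in data:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)
clf = tree.DecisionTreeClassifier().fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)  # usually ~1.0 for an unpruned tree
print(train_accuracy)
```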
Rendering the tree is a little bit of a pain, though there are some alternatives to the approach presented here in the documentation. You may have to do the following:
# Install graphviz in your terminal
conda install graphviz
I then suggest the following solution:
import graphviz
tree.export_graphviz(clf, out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
In [153]:
# Create tree diagram
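`export_graphviz` can also return the DOT source directly with `out_file=None`, which avoids the intermediate file; passing the resulting string to `graphviz.Source` renders it inline in a notebook. A self-contained sketch on stand-in data:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)
clf = tree.DecisionTreeClassifier().fit(X_train, y_train)

# DOT source for the fitted tree; feature_names labels the split nodes
dot = tree.export_graphviz(clf, out_file=None,
                           feature_names=list(features.columns), filled=True)
# graphviz.Source(dot) displays the diagram in a notebook cell
```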
In [140]:
# Create a knn classifier
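The `n_neighbors=3` below is an arbitrary choice for the sketch (the default is 5; with a real dataset you would tune it):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # classify by majority vote of the 3 nearest points
```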
In [141]:
# Fit our classifier to our training data
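Fitting follows the same `fit(X, y)` pattern as the tree; a self-contained sketch on stand-in data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # kNN just stores the training points
```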
In [154]:
# Predict on our test data and assess accuracy
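`knn.score` combines predict-and-compare in one call; a self-contained sketch on stand-in data (note that without the normalization this example skips, the large `price` values would dominate kNN's distance computation on real data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the housing data; column names are assumed
df = pd.DataFrame({'in_sf':     [1, 1, 1, 1, 0, 0, 0, 0],
                   'elevation': [120, 60, 75, 30, 9, 5, 12, 2],
                   'price':     [999000, 725000, 880000, 650000,
                                 1250000, 1100000, 980000, 1400000]})
features, outcome = df.drop('in_sf', axis=1), df['in_sf']
X_train, X_test, y_train, y_test = train_test_split(
    features, outcome, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

knn_preds = knn.predict(X_test)
knn_accuracy = knn.score(X_test, y_test)  # accuracy on the held-out rows
print(knn_accuracy)
```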