Introduction to data analysis using machine learning

06. Classification with Decision Trees

by David Taylor, (blog) (hire me!)

For links to more material including a slideshow explaining all this stuff in further detail, please see the front page of this GitHub repo.

This is notebook 6 of 8. The next notebook is: [07. Classification with Random Forest]

We take a closer look at one of the classification algorithms from the previous notebook: the decision tree. Again, a single decision tree is not used much in practice, but it is very intuitive, which makes it useful for beginners. Don't worry, the next algorithm is one that's used a lot!

For the first time, we encounter an algorithm that is convenient to visualize with all five of our features, not just two that we can see in a scatter plot.

1. Import libraries and load data

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import tree
from io import StringIO # sklearn.externals.six is no longer available in newer scikit-learn releases
import re

df = pd.read_csv('fruit.csv')
fruitnames = {1: 'Orange', 2: 'Pear', 3: 'Apple'}
colors = {1: '#e09028', 2: '#55aa33', 3: '#cc3333'}
fruitlist = ['Orange', 'Pear', 'Apple']

df.sort_values('fruit_id', inplace=True) # This is important because the factorizer assigns numbers
    # based on the order in which each label is first encountered, e.g. if the first instance had
    # fruit_id = 3, its y value would be 0.
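To see why the sort matters, here is a minimal sketch of how `pd.factorize` numbers labels in order of first appearance. The label values are made up for illustration and are not taken from fruit.csv.

```python
import pandas as pd

# Made-up fruit_id values; in this order, 3 is the first label encountered.
unsorted_ids = pd.Series([3, 1, 2, 1])
codes, uniques = pd.factorize(unsorted_ids)
print(codes.tolist())    # [0, 1, 2, 1]
print(uniques.tolist())  # [3, 1, 2]

# After sorting, the codes follow the numeric order of the labels.
sorted_ids = unsorted_ids.sort_values()
sorted_codes, sorted_uniques = pd.factorize(sorted_ids)
print(sorted_codes.tolist())    # [0, 0, 1, 2]
print(sorted_uniques.tolist())  # [1, 2, 3]
```

With sorted data, code k corresponds to fruit_id k + 1, which is what the `fruitnames[x+1]` lookup used for the predictions relies on.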

2. Classify with a Decision Tree and view Confusion Matrix

With all five features used, the classifier should be perfect or near-perfect on the testing set, so almost all counts in the confusion matrix should fall on the diagonal.

In [2]:
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']
y, _ = pd.factorize(train['fruit_id'])
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train[features], y)
preds = clf.predict(test[features])
test_result = pd.crosstab(np.array([fruitnames[x] for x in test['fruit_id']]),
                      np.array([fruitnames[x+1] for x in preds]), rownames=['actual'], colnames=['predicted'])
test_result # display the confusion matrix

predicted  Apple  Orange  Pear
actual
Apple         11       0     0
Orange         1      13     0
Pear           0       0    15
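As a rough sketch of how to read this table, accuracy is the sum of the diagonal divided by the total count, and per-class recall is each diagonal entry divided by its row total. The counts below are copied from the output above; they will differ on every run because the split is random.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
labels = ['Apple', 'Orange', 'Pear']
confusion = [
    [11, 0, 0],
    [1, 13, 0],
    [0, 0, 15],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall per class: correctly predicted instances / all actual instances of that class.
recall = {label: confusion[i][i] / sum(confusion[i]) for i, label in enumerate(labels)}

print(correct, total)  # 39 40
print(accuracy)        # 0.975
print(recall)
```

Here the single error is an actual Orange predicted as Apple, so only the Orange recall drops below 1.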

3. View several different trees, each produced on a different randomly selected training set of roughly 75% of the data.

Observe the differences between the trees.
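The trees differ because each repetition draws a new random training mask. A minimal sketch of that idea follows; the row count, the seeds, and the helper name `random_train_mask` are assumptions for illustration (the notebook itself calls `np.random.uniform` without a seed).

```python
import numpy as np

n_rows = 160  # hypothetical dataset size, not read from fruit.csv

def random_train_mask(n, seed, train_fraction=0.75):
    """Return a boolean mask that marks roughly train_fraction of n rows as training rows."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0, 1, n) <= train_fraction

mask_a = random_train_mask(n_rows, seed=1)
mask_b = random_train_mask(n_rows, seed=2)
mask_a_again = random_train_mask(n_rows, seed=1)

# The training set size is only approximately 75% and varies from draw to draw.
print(mask_a.sum(), mask_b.sum())
# Different seeds select different rows; the same seed reproduces the same split.
print(np.array_equal(mask_a, mask_b))
print(np.array_equal(mask_a, mask_a_again))  # True
```

Fixing a seed is one way to make a particular tree reproducible when comparing runs.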


  1. You will need Graphviz installed and on the PATH environment variable to visualize the trees.

  2. I did not make this code block a function because of the IPython magic shell call and call to Image.

  3. I had some issues getting Pydot to work in Python 3, which is why I ran Graphviz in the shell and did a somewhat clunky regex sub to adjust the trees.
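One possible way around the shell magic, sketched here under the assumption that the Graphviz `dot` executable is on the PATH, is to call it through `subprocess`, which also works inside a function. The helper names `build_dot_command` and `render_png` are hypothetical and not part of this notebook.

```python
import shutil
import subprocess

def build_dot_command(dot_path, png_path, executable='dot'):
    """Build the Graphviz command line that renders dot_path to a PNG file."""
    return [executable, '-Tpng', dot_path, '-o', png_path]

def render_png(dot_path, png_path):
    """Render dot_path to png_path; return False if Graphviz is not on the PATH."""
    executable = shutil.which('dot')
    if executable is None:
        return False
    subprocess.run(build_dot_command(dot_path, png_path, executable), check=True)
    return True

# Show the command that would be run for the file names used in this notebook.
print(build_dot_command('simple.dotfile', 'simpletree.png'))
```

Calling `render_png('simple.dotfile', 'simpletree.png')` after the dot file has been written would then take the place of the `!dot.exe` line.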

In [3]:
# Repetition 1

df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']
y, _ = pd.factorize(train['fruit_id'])
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train[features], y)
dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data) 
tree_string = dot_data.getvalue()
# remove the gini impurity line from each node
tree_string = re.sub(r'gini = 0\.[0-9]+\\n', '', tree_string)
# replace feature numbers with feature names
for i, feature in enumerate(features):
    tree_string = re.sub(r'X\[{}\]'.format(i), feature, tree_string)
# replace lists of numeric label counts with the name of the majority label
for result in re.finditer(r'\[[ ]+([\d]+)\.[ ]+([\d]+)\.[ ]+([\d]+)\.\]', tree_string):
    nums = []
    for i in range(0,3):
        nums.append(int(result.group(i + 1)))
    if nums[0] > nums[1]:
        if nums[0] > nums[2]:
            tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[0], tree_string)
        else:
            tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[2], tree_string)
    elif nums[1] > nums[2]:
        tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[1], tree_string)
    else:
        tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[2], tree_string)
with open('simple.dotfile', 'w+') as f:
    f.write(tree_string)
# normally this would be done with libraries like pydot or networkx, but
# I'm having trouble getting them to work in Python 3.4.2 under Windows,
# so I'll just call the shell executable directly
!dot.exe -Tpng simple.dotfile > simpletree.png
from IPython.core.display import Image
Image( filename ='simpletree.png')
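As an alternative to the regex substitutions, recent scikit-learn versions let `export_graphviz` label the nodes directly through its `feature_names` and `class_names` parameters, and return the dot source as a string when `out_file=None`. The sketch below assumes such a version is installed; the six data rows are made up so the example is self-contained and are not taken from fruit.csv.

```python
from sklearn import tree

features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']
fruitlist = ['Orange', 'Pear', 'Apple']

# Made-up rows (two per class), NOT taken from fruit.csv.
X = [
    [1, 0.10, 150, 6.0, 3.0],
    [1, 0.12, 160, 6.5, 3.2],
    [2, 0.50, 180, 5.0, 1.0],
    [2, 0.55, 190, 5.2, 1.1],
    [3, 0.15, 120, 7.0, 2.0],
    [3, 0.14, 130, 7.2, 2.1],
]
y = [0, 0, 1, 1, 2, 2]  # codes in the same order as fruitlist

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the dot source as a string instead of writing a file.
dot_source = tree.export_graphviz(clf, out_file=None,
                                  feature_names=features,
                                  class_names=fruitlist)
print(dot_source[:200])
```

The resulting string can be written to simple.dotfile and rendered with Graphviz exactly as before. Unlike the regex version, the nodes still show the gini value and the per-class counts.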


In [4]:
# Repetition 2

df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']
y, _ = pd.factorize(train['fruit_id'])
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train[features], y)
dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data) 
tree_string = dot_data.getvalue()
# remove the gini impurity line from each node
tree_string = re.sub(r'gini = 0\.[0-9]+\\n', '', tree_string)
# replace feature numbers with feature names
for i, feature in enumerate(features):
    tree_string = re.sub(r'X\[{}\]'.format(i), feature, tree_string)
# replace lists of numeric label counts with the name of the majority label
for result in re.finditer(r'\[[ ]+([\d]+)\.[ ]+([\d]+)\.[ ]+([\d]+)\.\]', tree_string):
    nums = []
    for i in range(0,3):
        nums.append(int(result.group(i + 1)))
    if nums[0] > nums[1]:
        if nums[0] > nums[2]:
            tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[0], tree_string)
        else:
            tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[2], tree_string)
    elif nums[1] > nums[2]:
        tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[1], tree_string)
    else:
        tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[2], tree_string)
with open('simple.dotfile', 'w+') as f:
    f.write(tree_string)
# normally this would be done with libraries like pydot or networkx, but
# I'm having trouble getting them to work in Python 3.4.2 under Windows,
# so I'll just call the shell executable directly
!dot.exe -Tpng simple.dotfile > simpletree.png
from IPython.core.display import Image
Image( filename ='simpletree.png')


In [5]:
# Repetition 3

df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 # randomly assign training and testing set
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = ['color_id', 'elongatedness', 'weight', 'sweetness', 'acidity']
y, _ = pd.factorize(train['fruit_id'])
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train[features], y)
dot_data = StringIO() 
tree.export_graphviz(clf, out_file=dot_data) 
tree_string = dot_data.getvalue()
# remove the gini impurity line from each node
tree_string = re.sub(r'gini = 0\.[0-9]+\\n', '', tree_string)
# replace feature numbers with feature names
for i, feature in enumerate(features):
    tree_string = re.sub(r'X\[{}\]'.format(i), feature, tree_string)
# replace lists of numeric label counts with the name of the majority label
for result in re.finditer(r'\[[ ]+([\d]+)\.[ ]+([\d]+)\.[ ]+([\d]+)\.\]', tree_string):
    nums = []
    for i in range(0,3):
        nums.append(int(result.group(i + 1)))
    if nums[0] > nums[1]:
        if nums[0] > nums[2]:
            tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[0], tree_string)
        else:
            tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[2], tree_string)
    elif nums[1] > nums[2]:
        tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[1], tree_string)
    else:
        tree_string = re.sub(r'\[[ ]+{}\.[ ]+{}\.[ ]+{}\.\]'.format(nums[0], nums[1], nums[2]), fruitlist[2], tree_string)
with open('simple.dotfile', 'w+') as f:
    f.write(tree_string)
# normally this would be done with libraries like pydot or networkx, but
# I'm having trouble getting them to work in Python 3.4.2 under Windows,
# so I'll just call the shell executable directly
!dot.exe -Tpng simple.dotfile > simpletree.png
from IPython.core.display import Image
Image( filename ='simpletree.png')

