Decision Trees

By Parijat Mazumdar (GitHub ID: mazumdarparijat)

This notebook illustrates the use of decision trees in Shogun for classification and regression. Various decision tree learning algorithms, namely ID3, C4.5, CART and CHAID, are discussed in detail using both intuitive toy datasets and real-world datasets.

Decision Tree Basics

Decision trees are a non-parametric supervised learning method that can be used for both classification and regression. A decision tree essentially encodes a set of if-then-else rules which can be used to predict the target variable given the data features. These if-then-else rules are formed from the training dataset with the aim of satisfying as many training data instances as possible. The formation of these rules (i.e. the decision tree) from training data is called decision tree learning. Various decision tree learning algorithms have been developed, and they work best in different situations. An advantage of decision trees is that they can model arbitrary, non-linear relationships between features and the target without assuming a particular functional form. However, a decision tree is highly prone to overfitting and to bias towards the training data. Decision trees therefore work best on large datasets which are assumed to represent the underlying distribution well. Additionally, tree pruning algorithms are used to tackle overfitting.
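
To make the notion of if-then-else rules concrete, here is a tiny hand-written classifier of the kind a learned tree encodes. The feature names and thresholds below are made up purely for illustration; they are not learned from any dataset and have nothing to do with Shogun's API.


In [ ]:
# a hand-crafted decision "tree" purely for illustration: the feature names and
# thresholds below are made up and not learned from any dataset
def play_outside(temperature, is_raining):
    # root node: split on the rain attribute
    if is_raining:
        return 'No'
    # child node: split on a temperature threshold
    if temperature < 10:
        return 'No'
    return 'Yes'

print(play_outside(temperature=20, is_raining=False))  # Yes
print(play_outside(temperature=5, is_raining=False))   # No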

ID3 (Iterative Dichotomiser 3)

ID3 is a straightforward decision tree learning algorithm developed by Ross Quinlan. ID3 is applicable only in cases where the attributes (or features) defining the data examples are categorical in nature and the data examples belong to pre-defined, clearly distinguishable (i.e. well-defined) classes. ID3 is an iterative greedy algorithm which starts with the root node and eventually builds the entire tree. At each node, the "best" attribute to classify the data is chosen, where "best" is measured by the information gain metric. Once an attribute is chosen for a node, the data examples in the node are partitioned into sub-groups based on the attribute values that they have: all data examples having the same attribute value are put together in the same sub-group. These sub-groups form the children of the present node, and the algorithm is repeated for each of the newly formed child nodes. This goes on until all the data members of a node belong to the same class or all the attributes are exhausted. In the latter case, the class predicted may be erroneous and, generally, the mode of the classes appearing in the node is chosen as the predicted class.

Pseudocode for ID3 Algorithm

Inputs: (D, A, C), where:
    D is a dataset with only nominal instance attributes A
    C is the class attribute

Output: a decision tree T representing a sequential decision process for classifying instances (predicting the values of the class attribute C); each node of T is labeled with a non-class attribute of A

Informal Inductive Bias: minimize the average height of the tree

Procedure:
    if the set of remaining non-class attributes is empty or if all of the instances in D are in the same class {
        return an empty tree
    } else {
        compute the Information Gain for each attribute over the dataset D
        let a* be the attribute with maximum Information Gain
        create a root node for a tree T; label it with a*
        for each value b of attribute a* {
            let T(a*=b) be the tree computed recursively by ID3 on input (D|a*=b, A-a*, C), where:
                D|a*=b contains all instances of D for which a* has the value b
                A-a* consists of all attributes of A except a*
            attach T(a*=b) to the root of T as a subtree
        }
        return the resulting decision tree T
    }
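
Before turning to Shogun, the information gain criterion at the heart of ID3 can be sketched in a few lines of NumPy. This is a minimal illustration only and is independent of Shogun's ID3ClassifierTree; the toy attribute and labels at the end are made up.


In [ ]:
# minimal NumPy sketch of the information gain criterion used by ID3
# (for illustration only, independent of Shogun's ID3ClassifierTree)
from numpy import array, unique, log2

def entropy(labels):
    # H(D) = -sum_c p_c * log2(p_c)
    _, counts = unique(labels, return_counts=True)
    p = counts / float(labels.size)
    return -(p * log2(p)).sum()

def information_gain(attribute, labels):
    # IG(D,a) = H(D) - sum_b |D_b|/|D| * H(D_b)
    gain = entropy(labels)
    for value in unique(attribute):
        subset = labels[attribute == value]
        gain -= (float(subset.size) / labels.size) * entropy(subset)
    return gain

# made-up categorical attribute and class labels
attr = array([1, 1, 2, 2, 2, 3, 3, 3])
lab = array([0, 0, 1, 1, 1, 0, 1, 1])
print('information gain : ' + str(information_gain(attr, lab)))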

Example using a Simple dataset

In this section, we create a simple example where we try to predict the usage of mobile phones by individuals based on their income, age, education and marital status. Each of the attributes has been categorized into 2 or 3 types. Let us create the training dataset and tabulate it first.


In [ ]:
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../../data')
# training data
train_income=['Low','Medium','Low','High','Low','High','Medium','Medium','High','Low','Medium',
'Medium','High','Low','Medium']

train_age = ['Old','Young','Old','Young','Old','Young','Young','Old','Old','Old','Young','Old',
'Old','Old','Young']

train_education = ['University','College','University','University','University','College','College',
'High School','University','High School','College','High School','University','High School','College']

train_marital = ['Married','Single','Married','Single','Married','Single','Married','Single','Single',
'Married','Married','Single','Single','Married','Married']

train_usage = ['Low','Medium','Low','High','Low','Medium','Medium','Low','High','Low','Medium','Low',
'High','Low','Medium']

# print data
print 'Training Data Table : \n'
print 'Income \t\t Age \t\t Education \t\t Marital Status \t Usage'
for i in xrange(len(train_income)):
	print train_income[i]+' \t\t '+train_age[i]+' \t\t '+train_education[i]+' \t\t '+train_marital[i]+' \t\t '+train_usage[i]

We want to create a decision tree from the above training dataset. The first step for that is to encode the data into numeric values and bind them to Shogun's features and multiclass labels.


In [ ]:
from shogun import ID3ClassifierTree, RealFeatures, MulticlassLabels
from numpy import array, concatenate

# encoding dictionary
income = {'Low' : 1.0, 'Medium' : 2.0, 'High' : 3.0}
age = {'Young' : 1.0, 'Old' : 2.0}
education = {'High School' : 1.0, 'College' : 2.0, 'University' : 3.0}
marital_status = {'Married' : 1.0, 'Single' : 2.0}
usage = {'Low' : 1.0, 'Medium' : 2.0, 'High' : 3.0}


# encode training data
for i in xrange(len(train_income)):
	train_income[i] = income[train_income[i]]
	train_age[i] = age[train_age[i]]
	train_education[i] = education[train_education[i]]
	train_marital[i] = marital_status[train_marital[i]]
	train_usage[i] = usage[train_usage[i]]
    
# form Shogun feature matrix
train_data = array([train_income, train_age, train_education, train_marital])
train_feats = RealFeatures(train_data);

# form Shogun multiclass labels
labels = MulticlassLabels(array(train_usage));

Next, we learn our decision tree using the features and labels created.


In [ ]:
# create ID3ClassifierTree object
id3 = ID3ClassifierTree()

# set labels
id3.set_labels(labels)

# learn the tree from training features
is_successful = id3.train(train_feats)

Our decision tree is ready now and we want to use it to make some predictions over test data. So, let us create some test data examples first.


In [ ]:
# test data
test_income = ['Medium','Medium','Low','High','High']
test_age = ['Old','Young','Old','Young','Old']
test_education = ['University','College','High School','University','College']
test_marital = ['Married','Single','Married','Single','Married']
test_usage = ['Low','Medium','Low','High','High']

# tabulate test data
print 'Test Data Table : \n'
print 'Income \t\t Age \t\t Education \t\t Marital Status \t Usage'
for i in xrange(len(test_income)):
	print test_income[i]+' \t\t '+test_age[i]+' \t\t '+test_education[i]+' \t\t '+test_marital[i]+' \t\t ?'

Next, as with training data, we encode our test dataset and bind it to Shogun features. Then, we apply our decision tree to the test examples to obtain the predicted labels.


In [ ]:
# encode test data
for i in xrange(len(test_income)):
	test_income[i] = income[test_income[i]]
	test_age[i] = age[test_age[i]]
	test_education[i] = education[test_education[i]]
	test_marital[i] = marital_status[test_marital[i]]

# bind to shogun features    
test_data = array([test_income, test_age, test_education, test_marital])
test_feats = RealFeatures(test_data)

# apply decision tree classification
test_labels = id3.apply_multiclass(test_feats)

Finally let us tabulate the results obtained and compare them with our intuitive predictions.


In [ ]:
output = test_labels.get_labels();
output_labels=[0]*len(output)

# decode back test data for printing
for i in xrange(len(test_income)):
	test_income[i]=income.keys()[income.values().index(test_income[i])]
	test_age[i]=age.keys()[age.values().index(test_age[i])]
	test_education[i]=education.keys()[education.values().index(test_education[i])]
	test_marital[i]=marital_status.keys()[marital_status.values().index(test_marital[i])]
	output_labels[i]=usage.keys()[usage.values().index(output[i])]

# print output data
print 'Final Test Data Table : \n'
print 'Income \t Age \t Education \t Marital Status \t Usage(predicted)'
for i in xrange(len(test_income)):
	print test_income[i]+' \t '+test_age[i]+' \t '+test_education[i]+' \t '+test_marital[i]+' \t\t '+output_labels[i]

So, do the predictions made by our decision tree match our inferences from the training set? Yes! For example, from the training set we infer that individuals with low income have low usage and that all individuals who went to college have medium usage. The decision tree predicts the same for both cases.

Example using a real dataset

We choose the car evaluation dataset from the UCI Machine Learning Repository as our real-world dataset. The car.names file of the dataset enumerates the class categories as well as the non-class attributes. Each car is categorized into one of 4 classes: unacc, acc, good, vgood. Each car is judged using 6 attributes: buying, maint, doors, persons, lug_boot, safety. Each of these attributes can take 3-4 values. Let us first make a dictionary to encode strings to numeric values using information from the car.names file.


In [ ]:
# class attribute
evaluation = {'unacc' : 1.0, 'acc' : 2.0, 'good' : 3.0, 'vgood' : 4.0}

# non-class attributes
buying = {'vhigh' : 1.0, 'high' : 2.0, 'med' : 3.0, 'low' : 4.0}
maint = {'vhigh' : 1.0, 'high' : 2.0, 'med' : 3.0, 'low' : 4.0}
doors = {'2' : 1.0, '3' : 2.0, '4' : 3.0, '5more' : 4.0}
persons = {'2' : 1.0, '4' : 2.0, 'more' : 3.0}
lug_boot = {'small' : 1.0, 'med' : 2.0, 'big' : 3.0}
safety = {'low' : 1.0, 'med' : 2.0, 'high' : 3.0}

Next, let us read the file and form Shogun features and labels.


In [ ]:
f = open( os.path.join(SHOGUN_DATA_DIR, 'uci/car/car.data'), 'r')

features = []
labels = []

# read data from file and encode
for line in f:
    words = line.rstrip().split(',')
    words[0] = buying[words[0]]
    words[1] = maint[words[1]]
    words[2] = doors[words[2]]
    words[3] = persons[words[3]]
    words[4] = lug_boot[words[4]]
    words[5] = safety[words[5]]
    words[6] = evaluation[words[6]]
    features.append(words[0:6])
    labels.append(words[6])

f.close()

From the entire dataset, let us choose some test vectors to form our test dataset.


In [ ]:
from numpy import random, delete

features = array(features)
labels = array(labels)

# number of test vectors
num_test_vectors = 200;

test_indices = random.randint(features.shape[0], size = num_test_vectors)
test_features = features[test_indices]
test_labels = labels[test_indices]

# remove test vectors from training set
features = delete(features,test_indices,0)
labels = delete(labels,test_indices,0)

The next step is to train our decision tree using the training features and apply it to our test dataset to get the predicted output classes.


In [ ]:
# shogun test features and labels
test_feats = RealFeatures(test_features.T)
test_labels = MulticlassLabels(test_labels)

# method for ID3 training and testing bundled together
def ID3_routine(features, labels):

    # Shogun train features and labels
    train_feats = RealFeatures(features.T)
    train_lab = MulticlassLabels(labels)

    # create ID3ClassifierTree object
    id3 = ID3ClassifierTree()

    # set labels
    id3.set_labels(train_lab)

    # learn the tree from training features
    id3.train(train_feats)

    # apply to test dataset
    output = id3.apply_multiclass(test_feats)
    
    return output

output = ID3_routine(features, labels)

Finally, let us compare our predicted labels with test labels to find out the percentage error of our classification model.


In [ ]:
from shogun import MulticlassAccuracy

# Shogun object for calculating multiclass accuracy
accuracy = MulticlassAccuracy()
print 'Accuracy : ' + str(accuracy.evaluate(output, test_labels))

We see that the accuracy is reasonably high, so our decision tree can evaluate a car, given its features, with a good success rate. As a final exercise, let us examine the effect of training dataset size on the accuracy of the decision tree.


In [ ]:
# list of error rates for all training dataset sizes
error_rate = []

# number of error rate readings taken for each value of dataset size
num_repetitions = 3

# loop over training dataset size
for i in range(500,1600,200):
    indices = random.randint(features.shape[0], size = i)
    train_features = features[indices]
    train_labels = labels[indices]
    
    average_error = 0
    for j in xrange(num_repetitions):
        output = ID3_routine(train_features, train_labels)
        average_error = average_error + (1-accuracy.evaluate(output, test_labels))
        
    error_rate.append(average_error/num_repetitions)

# plot the error rates    
import matplotlib.pyplot as pyplot
% matplotlib inline
from scipy.interpolate import interp1d
from numpy import linspace, arange

fig,axis = pyplot.subplots(1,1)
x = arange(500,1600,200)
f = interp1d(x, error_rate)

xnew = linspace(500,1500,100)
pyplot.plot(x,error_rate,'o',xnew,f(xnew),'-')
pyplot.xlim([400,1600])
pyplot.xlabel('training dataset size')
pyplot.ylabel('Classification Error')
pyplot.title('Decision Tree Performance')
pyplot.show()
NOTE : The above code snippet takes about half a minute to execute. Please wait patiently.

From the above plot, we see that the error rate decreases steadily as we increase the training dataset size. Although the decrease in error rate is not very significant in this case, in many datasets it can be substantial.

C4.5

The C4.5 algorithm is essentially an extension of the ID3 algorithm for decision tree learning. It has the additional capability of handling continuous attributes and attributes with missing values. The tree growing process in C4.5 is the same as that of ID3, i.e. finding the best split at each node using the information gain metric. But in the case of a continuous attribute, C4.5 has to perform the additional step of converting it to a two-valued categorical attribute by splitting about a suitable threshold. This threshold is chosen so that the resulting split produces the maximum information gain.
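
The threshold-selection step can be illustrated with a minimal NumPy sketch (independent of Shogun's C45ClassifierTree): every midpoint between consecutive distinct attribute values is tried as a candidate threshold and the one giving maximum information gain is kept. The toy values below are made up.


In [ ]:
# minimal sketch of C4.5-style threshold selection for a continuous attribute
# (for illustration only, independent of Shogun's C45ClassifierTree)
from numpy import array, unique, log2, sort

def entropy(labels):
    _, counts = unique(labels, return_counts=True)
    p = counts / float(labels.size)
    return -(p * log2(p)).sum()

def best_threshold(values, labels):
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    distinct = sort(unique(values))
    # candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct[:-1], distinct[1:]):
        t = (lo + hi) / 2.0
        left, right = labels[values <= t], labels[values > t]
        gain = base - (float(left.size) / labels.size) * entropy(left)
        gain = gain - (float(right.size) / labels.size) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# made-up continuous attribute and class labels
vals = array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
labs = array([0, 0, 0, 1, 1, 1])
print('best threshold and its information gain : ' + str(best_threshold(vals, labs)))

Let us now start exploring Shogun's C4.5 algorithm API with a toy example.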

Example using toy dataset

Let us consider a 3-class classification problem using 2 attributes. One of the attributes (say attribute X) is a two-valued categorical attribute depicted by the values 1 and 2. The other attribute (say attribute Y) is a continuous attribute with values between 1 and 9. The simple classification rules are as follows: if X=1 and Y $\in$ [1,5), the data point belongs to class 1; if X=1 and Y $\in$ [5,9), the data point belongs to class 2; and if X=2, the data point belongs to class 3. Let us realize the toy dataset by plotting it.


In [ ]:
import matplotlib.pyplot as plt
from numpy import ones, zeros, random, concatenate
from shogun import RealFeatures, MulticlassLabels
% matplotlib inline

def create_toy_classification_dataset(ncat,do_plot):
    # create attribute values and labels for class 1
    x = ones((1,ncat))
    y = 1+random.rand(1,ncat)*4
    lab = zeros(ncat)

    # add attribute values and labels for class 2
    x = concatenate((x,ones((1,ncat))),1)
    y = concatenate((y,5+random.rand(1,ncat)*4),1)
    lab = concatenate((lab,ones(ncat)))

    # add attribute values and labels for class 3
    x = concatenate((x,2*ones((1,ncat))),1)
    y = concatenate((y,1+random.rand(1,ncat)*8),1)
    lab = concatenate((lab,2*ones(ncat)))

    # create test data
    ntest = 20
    x_t = concatenate((ones((1,3*ntest/4)),2*ones((1,ntest/4))),1)
    y_t = 1+random.rand(1,ntest)*8

    if do_plot:
        # plot training data
        c = ['r','g','b']
        for i in range(3):
            plt.scatter(x[0,lab==i],y[0,lab==i],color=c[i],marker='x',s=50)

        # plot test data
        plt.scatter(x_t[0,:],y_t[0,:],color='k',s=10,alpha=0.8)

        plt.xlabel('attribute X')
        plt.ylabel('attribute Y')
        plt.show()

    # form training feature matrix
    train_feats = RealFeatures(concatenate((x,y),0))

    # form training labels
    train_labels = MulticlassLabels(lab)

    # form test feature matrix
    test_feats = RealFeatures(concatenate((x_t,y_t),0))
    
    return (train_feats,train_labels,test_feats);

train_feats,train_labels,test_feats = create_toy_classification_dataset(20,True)

In the above plot the training data points are marked with different colours of crosses, where each colour corresponds to a particular label. The test data points are marked by black circles. For us it is a trivial task to assign the correct colours (i.e. labels) to the black points. Let us see how accurately C4.5 assigns colours to these test points.

Now let us train a decision tree using the C4.5 algorithm. We need to create a Shogun C4.5 tree object and supply training features and training labels to it. We also need to specify which attributes are categorical and which are continuous. The attribute types can be specified using the set_feature_types method, through which categorical attributes are set as True and continuous attributes as False.


In [ ]:
from numpy import array
from shogun import C45ClassifierTree

# steps in C4.5 Tree training bundled together in a python method
def train_tree(feats,types,labels):
    # C4.5 Tree object
    tree = C45ClassifierTree()
    # set labels
    tree.set_labels(labels)
    # supply attribute types
    tree.set_feature_types(types)
    # supply training matrix and train
    tree.train(feats)
    
    return tree

# specify attribute types X is categorical hence True, Y is continuous hence False
feat_types = array([True,False])

# get back trained tree
C45Tree = train_tree(train_feats,feat_types,train_labels)

Now that we have trained the decision tree, we can use it to classify our test vectors.


In [ ]:
def classify_data(tree,data):
    # get classification labels
    output = tree.apply_multiclass(data)   
    # get classification certainty
    output_certainty=tree.get_certainty_vector()
    
    return output,output_certainty

out_labels,out_certainty = classify_data(C45Tree,test_feats)

Let us use the output labels to colour our test data points to qualitatively judge the performance of the decision tree.


In [ ]:
from numpy import int32

# plot results
def plot_toy_classification_results(train_feats,train_labels,test_feats,test_labels):
    train = train_feats.get_feature_matrix()
    lab = train_labels.get_labels()
    test = test_feats.get_feature_matrix()
    out_labels = test_labels.get_labels()
    
    c = ['r','g','b']
    for i in range(out_labels.size):
        plt.scatter(test[0,i],test[1,i],color=c[int32(out_labels[i])],s=50)

    # plot training dataset for visual comparison
    for i in range(3):
        plt.scatter(train[0,lab==i],train[1,lab==i],color=c[i],marker='x',s=30,alpha=0.7)

    plt.show()
    
plot_toy_classification_results(train_feats,train_labels,test_feats,out_labels)

We see that the decision tree trained using the C4.5 algorithm works almost perfectly in this toy dataset. Now let us try this algorithm on a real world dataset.

Example using a real dataset

In this section we will investigate how accurately we can predict the species of an Iris flower using a C4.5-trained decision tree. In this example we will use petal length, petal width, sepal length and sepal width as our attributes to decide among 3 classes of Iris: Iris Setosa, Iris Versicolor and Iris Virginica. Let us start by suitably reading the dataset.


In [ ]:
import csv
from numpy import array

# dictionary to encode class names to class labels
to_label = {'Iris-setosa' : 0.0, 'Iris-versicolor' : 1.0, 'Iris-virginica' : 2.0}

# read csv file and separate out labels and features
lab = []
feat = []
with open( os.path.join(SHOGUN_DATA_DIR, 'uci/iris/iris.data')) as csvfile:
    csvread = csv.reader(csvfile,delimiter=',')
    for row in csvread:
        feat.append([float(i) for i in row[0:4]])
        lab.append(to_label[row[4]])

lab = array(lab)
feat = array(feat).T

Because there is no separate test dataset, we first divide the given dataset into training and testing subsets.


In [ ]:
from numpy import int32, random

# no.of vectors in test dataset
ntest = 25
# no. of vectors in train dataset
ntrain = 150-ntest

# randomize the order of vectors
subset = int32(random.permutation(150))

# choose 1st ntrain from randomized set as training vectors
feats_train = feat[:,subset[0:ntrain]]
# form training labels correspondingly
train_labels = lab[subset[0:ntrain]]

# form test features and labels (for accuracy evaluations)
feats_test = feat[:,subset[ntrain:ntrain+ntest]]
test_labels = lab[subset[ntrain:ntrain+ntest]]

Before marching forward with applying C4.5, let us plot the data to get a better understanding. The given data points are 4-D and hence cannot be conveniently plotted. We need to reduce the number of dimensions to 2. This reduction can be achieved using any dimension reduction algorithm like PCA. However for the sake of brevity, let us just choose two highly correlated dimensions, petal width and petal length (see summary statistics), right away for plotting.


In [ ]:
import matplotlib.pyplot as plt
% matplotlib inline

# plot training features
c = ['r', 'g', 'b']
for i in range(3):
    plt.scatter(feats_train[2,train_labels==i],feats_train[3,train_labels==i],color=c[i],marker='x')

# plot test data points in black
plt.scatter(feats_test[2,:],feats_test[3,:],color='k',marker='o')

plt.show()

First, let us create Shogun features and labels from the given data.


In [ ]:
from shogun import RealFeatures, MulticlassLabels

# training data
feats_train = RealFeatures(feats_train)
train_labels = MulticlassLabels(train_labels)

# test data
feats_test = RealFeatures(feats_test)
test_labels = MulticlassLabels(test_labels)

We know for a fact that decision trees tend to overfit; hence pruning becomes a necessary step. In the case of the toy dataset, we skipped the pruning step because the dataset was simple and noise-free. But for a real dataset like the Iris dataset, pruning cannot be skipped. So we have to partition the training dataset into a training subset and a validation subset.


In [ ]:
# randomize the order of vectors
subset = int32(random.permutation(ntrain))

nvalidation = 45

# form training subset and validation subset
train_subset = subset[0:ntrain-nvalidation]
validation_subset = subset[ntrain-nvalidation:ntrain]

Now we train the decision tree first, then prune it and finally use it to get output labels for test vectors.


In [ ]:
# set attribute types - all continuous
feature_types = array([False, False, False, False])

# remove validation subset before training the tree
feats_train.add_subset(train_subset)
train_labels.add_subset(train_subset)

# train tree
C45Tree = train_tree(feats_train,feature_types,train_labels)

# bring back validation subset
feats_train.remove_subset()
train_labels.remove_subset()

# remove data belonging to training subset
feats_train.add_subset(validation_subset)
train_labels.add_subset(validation_subset)

# prune the tree
C45Tree.prune_tree(feats_train,train_labels)

# bring back training subset
feats_train.remove_subset()
train_labels.remove_subset()

# get results
output, output_certainty = classify_data(C45Tree,feats_test)

Let us calculate the accuracy of the classification made by our tree as well as plot the results for qualitative evaluation.


In [ ]:
from shogun import MulticlassAccuracy

# Shogun object for calculating multiclass accuracy
accuracy = MulticlassAccuracy()
print 'Accuracy : ' + str(accuracy.evaluate(output, test_labels))

In [ ]:
# convert MulticlassLabels object to labels vector
output = output.get_labels()
test_labels = test_labels.get_labels()
train_labels = train_labels.get_labels()

# convert RealFeatures object to matrix
feats_test = feats_test.get_feature_matrix()
feats_train = feats_train.get_feature_matrix()

# plot ground truth
for i in range(3):
    plt.scatter(feats_test[2,test_labels==i],feats_test[3,test_labels==i],color=c[i],marker='x',s=100)

# plot predicted labels
for i in range(output.size):
    plt.scatter(feats_test[2,i],feats_test[3,i],color=c[int32(output[i])],marker='o',s=30*output_certainty[i])
        
plt.show()

From the evaluation of results, we infer that, with the help of a C4.5 trained decision tree, we can predict (with high accuracy) the type of Iris plant given its petal and sepal widths and lengths.

Classification and Regression Trees (CART)

The CART algorithm is a popular decision tree learning algorithm introduced by Breiman et al. Unlike ID3 and C4.5, the learnt decision tree in this case can be used for both multiclass classification and regression, depending on the type of the dependent variable. The tree growing process comprises recursive binary splitting of nodes. To find the best split at each node, all possible splits of all available predictive attributes are considered. The best split is the one that maximises some splitting criterion. For classification tasks, i.e. when the dependent attribute is categorical, the Gini index is used as the splitting criterion. For regression tasks, i.e. when the dependent variable is continuous, the least squares deviation is used.
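
The two splitting criteria can be written down in a few lines of NumPy. The sketch below is for illustration only and is independent of Shogun's CARTree; the toy label and target vectors are made up.


In [ ]:
# minimal NumPy sketch of the two CART splitting criteria
# (for illustration only, independent of Shogun's CARTree)
from numpy import array, unique, mean

def gini_index(labels):
    # Gini(D) = 1 - sum_c p_c^2, used for classification splits
    _, counts = unique(labels, return_counts=True)
    p = counts / float(labels.size)
    return 1.0 - (p * p).sum()

def least_squares_deviation(targets):
    # sum of squared deviations from the node mean, used for regression splits
    return ((targets - mean(targets)) ** 2).sum()

# made-up class labels and regression targets
print('Gini index : ' + str(gini_index(array([0, 0, 1, 1, 2]))))
print('LSD        : ' + str(least_squares_deviation(array([1.0, 2.0, 3.0]))))

Let us now learn about Shogun's CART implementation by working on two toy problems, one on classification and the other on regression.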

Classification example using toy data

Let us consider the same dataset as that in the C4.5 toy example. We re-create the dataset and plot it first.


In [ ]:
train_feats,train_labels,test_feats=create_toy_classification_dataset(20,True)

Next, we supply the necessary parameters to the CART algorithm and use it to train our decision tree.


In [ ]:
from shogun import PT_MULTICLASS, CARTree
from numpy import array

def train_carttree(feat_types,problem_type,num_folds,use_cv_pruning,labels,features):
    # create CART tree object
    c = CARTree(feat_types,problem_type,num_folds,use_cv_pruning)
    # set training labels
    c.set_labels(labels)
    # train using training features
    c.train(features)
    
    return c

# form feature types True for nominal (attribute X), False for ordinal/continuous (attribute Y)
ft = array([True, False])

# get back trained tree
cart = train_carttree(ft, PT_MULTICLASS, 5, True, train_labels, train_feats)

In the above code snippet, we see four parameters being supplied to the CART tree object. feat_types supplies knowledge of the attribute types of the training data to the CART algorithm, and problem_type specifies whether it is a multiclass classification problem (PT_MULTICLASS) or a regression problem (PT_REGRESSION). The boolean parameter use_cv_pruning switches on cross-validation pruning of the trained tree, and num_folds specifies the number of folds of cross-validation to be applied while pruning. At this point, let us divert ourselves briefly towards understanding what kind of pruning strategy is employed by Shogun's CART implementation. The CART algorithm uses the cost-complexity pruning strategy. Cost-complexity pruning yields a list of subtrees of varying depths using the complexity-normalized resubstitution error, $R_\alpha(T)$. The resubstitution error, $R(T)$, measures how well a decision tree fits the training data, but this measure favours larger trees over smaller ones. Hence the complexity-normalized resubstitution error metric is used, which adds a penalty for increased complexity and in turn counters overfitting.

$R_\alpha(T)=R(T)+\alpha \times (\text{number of leaves})$

The best subtree among the list of subtrees can be chosen using cross-validation or using the best-fit metric on a validation dataset. Setting use_cv_pruning in the above code snippet basically tells the CART object to use cross-validation to choose the best among the subtrees generated by cost-complexity pruning.
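
To make the penalty term concrete, here is a small numeric illustration of $R_\alpha(T)$ for a few hypothetical subtrees; the resubstitution errors, leaf counts and the value of $\alpha$ are made up and are not produced by Shogun.


In [ ]:
# small numeric illustration of cost-complexity pruning: the resubstitution
# errors, leaf counts and alpha below are hypothetical, chosen for illustration
candidate_subtrees = [
    {'resub_error': 0.02, 'num_leaves': 12},  # large tree, fits training data closely
    {'resub_error': 0.05, 'num_leaves': 6},
    {'resub_error': 0.12, 'num_leaves': 2},   # small tree, underfits
]

alpha = 0.01  # complexity penalty per leaf (hypothetical value)

for t in candidate_subtrees:
    r_alpha = t['resub_error'] + alpha * t['num_leaves']
    print('leaves = ' + str(t['num_leaves']) + ', R_alpha = ' + str(r_alpha))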

Let us now get back on track and use the trained tree to classify our test data.


In [ ]:
from numpy import int32

# get output labels
output_labels = cart.apply_multiclass(test_feats)

plot_toy_classification_results(train_feats,train_labels,test_feats,output_labels)

Regression example using toy data

In this example, we form the training dataset by sampling points from a sinusoidal curve and see how well a decision tree, trained using these samples, re-creates the actual sinusoid.


In [ ]:
from shogun import RegressionLabels, RealFeatures
from numpy import random, sin, linspace
import matplotlib.pyplot as plt
% matplotlib inline

def create_toy_regression_dataset(nsamples,noise_var):
    # randomly choose positions in X axis between 0 to 16
    samples_x = random.rand(1,nsamples)*16

    # find out y (=sin(x)) values for the sampled x positions and add noise to it
    samples_y = sin(samples_x)+(random.rand(1,nsamples)-0.5)*noise_var

    # plot the samples
    plt.scatter(samples_x,samples_y,color='b',marker='x')
    
    # create training features
    train_feats = RealFeatures(samples_x)
    # training labels
    train_labels = RegressionLabels(samples_y[0,:])

    return (train_feats,train_labels)

# plot the reference sinusoid
def plot_ref_sinusoid():
    plot_x = linspace(-2,18,100)
    plt.plot(plot_x,sin(plot_x),color='y',linewidth=1.5)
    plt.xlabel('Feature values')
    plt.ylabel('Labels')
    plt.xlim([-3,19])
    plt.ylim([-1.5,1.5])

# number of samples is 300, noise variance is 0.5
train_feats,train_labels = create_toy_regression_dataset(300,0.5)

plot_ref_sinusoid()
plt.show()

Next, we train our CART-tree.


In [ ]:
from shogun import PT_REGRESSION
from numpy import array

# feature type - continuous
feat_type = array([False])

# get back trained tree
cart = train_carttree(feat_type, PT_REGRESSION, 5, True, train_labels, train_feats)

Now let us use the trained decision tree to regress over the entire range of the previously depicted sinusoid.


In [ ]:
def plot_predicted_sinusoid(cart):
    # regression range - 0 to 16
    x_test = array([linspace(0,16,100)])

    # form Shogun features
    test_feats = RealFeatures(x_test)

    # apply regression using our previously trained CART-tree
    regression_output = cart.apply_regression(test_feats).get_labels()

    # plot the result
    plt.plot(x_test[0,:],regression_output,linewidth=2.0)

    # plot reference sinusoid
    plot_ref_sinusoid()

    plt.show()
    
plot_predicted_sinusoid(cart)

As we can see from the above plot, the CART-induced decision tree follows the reference sinusoid quite beautifully!

Classification example using real dataset

In this section, we will apply the CART algorithm to the Iris dataset. Remember that the Iris dataset provides us with just a training dataset and no separate test dataset. In the case of the C4.5 example discussed earlier, we ourselves divided the entire training dataset into a training subset and a test subset. In this section, we will employ a different strategy, i.e. cross-validation. In cross-validation, we divide the training dataset into n subsets, where n is a user-controlled parameter. We then perform n iterations of training and testing, in which, at each iteration, we choose one of the n subsets as our test dataset and the remaining n-1 subsets as our training dataset. The performance of the model is usually taken as the average of the performances over the iterations. Shogun's cross-validation class makes it really easy to apply cross-validation to any model of our choice. Let us realize this by applying cross-validation to a CART-tree trained on the Iris dataset.
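
The fold-splitting idea itself can be sketched in a few lines of NumPy; this is an illustration only, since Shogun's CrossValidationSplitting class (used below) handles the splitting for us.


In [ ]:
# minimal NumPy sketch of n-fold index splitting (for illustration only; the
# code below uses Shogun's CrossValidationSplitting class instead)
from numpy import random, array_split, concatenate

n_samples, n_folds = 20, 5
indices = random.permutation(n_samples)
folds = array_split(indices, n_folds)

for i in range(n_folds):
    test_idx = folds[i]
    train_idx = concatenate(folds[:i] + folds[i+1:])
    print('fold ' + str(i) + ' : train size = ' + str(train_idx.size) +
          ', test size = ' + str(test_idx.size))

With the splitting strategy clear, we start by reading the data.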


In [ ]:
import csv
from numpy import array
import matplotlib.pylab as plt 
% matplotlib inline

# dictionary to encode class names to class labels
to_label = {'Iris-setosa' : 0.0, 'Iris-versicolor' : 1.0, 'Iris-virginica' : 2.0}

# read csv file and separate out labels and features
lab = []
feat = []
with open( os.path.join(SHOGUN_DATA_DIR, 'uci/iris/iris.data')) as csvfile:
    csvread = csv.reader(csvfile,delimiter=',')
    for row in csvread:
        feat.append([float(i) for i in row[0:4]])
        lab.append(to_label[row[4]])

lab = array(lab)
feat = array(feat).T

# plot the dataset using two highly correlated attributes
c = ['r', 'g', 'b']
for i in range(3):
    plt.scatter(feat[2,lab==i],feat[3,lab==i],color=c[i],marker='x')
    
plt.show()

Next, we set up the model, which in this case is a CART-tree.


In [ ]:
from shogun import CARTree, PT_MULTICLASS

# set attribute types - all continuous
feature_types = array([False, False, False, False])

# setup CART-tree with cross validation pruning switched off
cart = CARTree(feature_types,PT_MULTICLASS,5,False)

Finally, we can use Shogun's cross-validation class to estimate the performance.


In [ ]:
from shogun import RealFeatures, MulticlassLabels
from shogun import CrossValidation, MulticlassAccuracy, CrossValidationSplitting, CrossValidationResult

# training features
feats_train = RealFeatures(feat)
# training labels
labels_train = MulticlassLabels(lab)

# set evaluation criteria - multiclass accuracy
accuracy = MulticlassAccuracy()

# set splitting criteria - 10 fold cross-validation
split = CrossValidationSplitting(labels_train,10)

# set cross-validation parameters 
cross_val = CrossValidation(cart,feats_train,labels_train,split,accuracy,False)

# run cross-validation multiple times - to get better estimate of accuracy
cross_val.set_num_runs(10)

# get cross validation result
result = cross_val.evaluate()

# print result
print('Mean Accuracy : ' + str(CrossValidationResult.obtain_from_generic(result).mean))

We get a mean accuracy of about 0.93-0.94. This number essentially means that a CART-tree trained using this dataset is expected to classify Iris flowers, given their required attributes, with an accuracy of 93-94% in a real-world scenario. The parameters required by Shogun's cross-validation class should be noted in the above code snippet: the class requires the model, training features, training labels, splitting strategy and evaluation method to be specified.

Regression using real dataset

In this section, we evaluate a CART-induced decision tree over the Servo dataset. Using this dataset, we essentially want to train a model which can predict the rise time of a servomechanism given the required parameters, which are two (integer) gain settings and two (nominal) choices of mechanical linkages. Let us read the dataset first.


In [ ]:
from numpy import array

# dictionary to convert string features to integer values
to_int = {'A' : 1, 'B' : 2, 'C' : 3, 'D' : 4, 'E' : 5}

# read csv file and separate out labels and features
lab = []
feat = []
with open( os.path.join(SHOGUN_DATA_DIR, 'uci/servo/servo.data')) as csvfile:
    csvread = csv.reader(csvfile,delimiter=',')
    for row in csvread:
        feat.append([to_int[row[0]], to_int[row[1]], float(row[2]), float(row[3])])
        lab.append(float(row[4]))

lab = array(lab)
feat = array(feat).T

The Servo dataset is a small training dataset (it contains just 167 training vectors) with no separate test dataset, like the Iris dataset. Hence we will apply the same cross-validation strategy we applied in the case of the Iris dataset. However, to make things interesting, let us play around with a yet-untouched parameter of the CART-induced tree, i.e. the maximum allowed tree depth. As the tree depth increases, the tree becomes more complex and hence fits the training data more closely. By setting a maximum allowed tree depth, we restrict the complexity of the trained tree and hence avoid over-fitting. But choosing a low value of the maximum allowed tree depth may lead to early stopping, i.e. under-fitting. Let us explore how we can decide the appropriate value of the max-allowed-tree-depth parameter. We create a method which takes the max-allowed-tree-depth parameter as input and returns the corresponding cross-validated error as output.


In [ ]:
from shogun import CARTree, RegressionLabels, PT_REGRESSION, MeanSquaredError
from shogun import CrossValidation, CrossValidationSplitting, CrossValidationResult

# form training features
feats_train = RealFeatures(feat)
# form training labels
labels_train = RegressionLabels(lab)

def get_cv_error(max_depth):
    # set attribute types - 2 nominal and 2 ordinal
    feature_types = array([True, True, False, False])
    # setup CART-tree with cross validation pruning switched off
    cart = CARTree(feature_types,PT_REGRESSION,5,False)
    # set max allowed depth
    cart.set_max_depth(max_depth)

    # set evaluation criteria - mean squared error
    accuracy = MeanSquaredError()
    # set splitting criteria - 10 fold cross-validation
    split = CrossValidationSplitting(labels_train,10)
    # set cross-validation parameters 
    cross_val = CrossValidation(cart,feats_train,labels_train,split,accuracy,False)

    # run cross-validation multiple times
    cross_val.set_num_runs(10)

    # return cross validation result
    return CrossValidationResult.obtain_from_generic(cross_val.evaluate()).mean

Next, let us supply a range of max_depth values to the above method and plot the returned cross-validated errors.


In [ ]:
import matplotlib.pyplot as plt

cv_errors = [get_cv_error(i) for i in range(1,15)]
plt.plot(range(1,15),cv_errors,'bo',range(1,15),cv_errors,'k')
plt.xlabel('max_allowed_depth')
plt.ylabel('cross-validated error')
plt.ylim(0,1.2)
plt.show()

The above plot quite clearly gives us the most appropriate value of the maximum allowed depth. We see that the first minimum occurs at a maximum allowed depth of 6-8, hence one of these values should be the desired choice. It is to be noted that the error metric we are discussing here is the mean squared error. Thus, from the above plot, we can also claim that, given the required parameters, our CART-flavoured decision tree can predict the rise time within an average error range of $\pm0.5$ (i.e. the square root of 0.25, which is the approximate minimum cross-validated error). The relative error, i.e. average_error/range_of_labels, comes out to be ~30%.

CHi-squared Automatic Interaction Detection (CHAID)

CHAID is an algorithm for decision tree learning proposed by Kass (1980). It is similar in functionality to CART in the sense that both can be used for classification as well as regression. But unlike CART, CHAID internally handles only categorical features. Continuous features are first converted into ordinal categorical features for the CHAID algorithm to be able to use them. This conversion is done by binning the feature values. The number of bins (K) has to be supplied by the user. Given K, a predictor is split in such a way that all bins get (more or less) the same number of distinct predictor values. The maximum feature value in each bin is used as a breakpoint.
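
The equal-frequency binning described above can be sketched in a few lines of NumPy. This is an illustration of the idea only and is independent of Shogun's CHAIDTree; the random values at the end are made up.


In [ ]:
# minimal NumPy sketch of the equal-frequency binning described above
# (for illustration only, independent of Shogun's CHAIDTree)
from numpy import array, sort, unique, array_split, searchsorted, random

def bin_continuous(values, num_bins):
    # split the sorted distinct values into num_bins roughly equal groups
    groups = array_split(sort(unique(values)), num_bins)
    # the maximum value of each group serves as a breakpoint
    breakpoints = array([g[-1] for g in groups])
    # map every value to the index of the first breakpoint covering it
    return searchsorted(breakpoints, values), breakpoints

# made-up continuous feature values
vals = random.rand(20) * 10
ordinal, breaks = bin_continuous(vals, num_bins=4)
print('breakpoints    : ' + str(breaks))
print('ordinal values : ' + str(ordinal))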

An important parameter in the CHAID tree growing process is the p-value. The p-value is the metric used for deciding which categories of predictor values to merge during the merging step, as well as for deciding the best attribute during the splitting step. The p-value is calculated using different hypothesis testing methods depending on the type of the dependent variable (nominal, ordinal or continuous). A more detailed discussion of the CHAID algorithm can be found in the documentation of the CCHAIDTree class in Shogun. Let us move on to a more interesting topic, which is learning to use CHAID through Shogun's Python API.

Classification example using toy dataset

Let us re-use the toy classification dataset used in C4.5 and CART to see the API usage of CHAID as well as to qualitatively compare the results of the CHAID algorithm with the other two.


In [ ]:
train_feats,train_labels,test_feats = create_toy_classification_dataset(20,True)

Now, we set up our CHAID-tree with appropriate parameters and train over given data.


In [ ]:
from shogun import PT_MULTICLASS, CHAIDTree
from numpy import array, dtype, int32

def train_chaidtree(dependent_var_type,feature_types,num_bins,features,labels):
    # create CHAID tree object
    c = CHAIDTree(dependent_var_type,feature_types,num_bins)
    # set training labels
    c.set_labels(labels)
    # train using training features
    c.train(features)
    
    return c

# form feature types 0 for nominal (attribute X), 2 for continuous (attribute Y)
ft = array([0, 2],dtype=int32)

# cache training matrix
train_feats_cache=RealFeatures(train_feats.get_feature_matrix())

# get back trained tree - dependent variable type is nominal (hence 0), number of bins for binning is 10  
chaid = train_chaidtree(0,ft,10,train_feats,train_labels)

print('updated_matrix')
print(train_feats.get_feature_matrix())
print('')
print('original_matrix')
print(train_feats_cache.get_feature_matrix())

An important point to be noted in the above code snippet is that CHAID training modifies the training data: the actual continuous feature values are replaced by the discrete ordinal values obtained during the continuous-to-ordinal conversion. Notice the difference between the original feature matrix and the updated matrix. The updated matrix contains only 10 distinct values along the feature dimension at row index 1, which replace the continuous values of the original matrix.

With a CHAID-trained decision tree at our disposal, it's time to apply it to colour our test points.


In [ ]:
# get output labels
output_labels = chaid.apply_multiclass(test_feats)

plot_toy_classification_results(train_feats_cache,train_labels,test_feats,output_labels)

Regression example with toy dataset

In this section, we re-work the sinusoid curve fitting example (earlier used in CART toy regression).


In [ ]:
train_feats,train_labels = create_toy_regression_dataset(300,0.5)
plot_ref_sinusoid()
plt.show()

As usual, we start by setting up our decision tree and training it.


In [ ]:
from numpy import dtype, int32, array

# feature type - continuous
feat_type = array([2],dtype=int32)

# get back trained tree
chaid = train_chaidtree(2,feat_type, 50, train_feats, train_labels)

Next, we use the trained decision tree to follow the reference sinusoid.


In [ ]:
plot_predicted_sinusoid(chaid)

A distinguishing feature of the predicted curve is the presence of steps. These steps are essentially an artifact of the continuous-to-ordinal conversion: if we decrease the number of bins for the conversion, the step widths will increase.

Classification example over real dataset

In this section, we will try to classify wines into one of three classes based on 13 attributes like alcohol content, malic acid, magnesium content, etc., using the UCI wine dataset. Let us first read the dataset using Shogun's CSV file reader.


In [ ]:
from shogun import CSVFile, RealFeatures, MulticlassLabels

train_feats=RealFeatures(CSVFile( os.path.join(SHOGUN_DATA_DIR, 'uci/wine/fm_wine.dat')))
train_labels=MulticlassLabels(CSVFile( os.path.join(SHOGUN_DATA_DIR, 'uci/wine/label_wine.dat')))

As in the case of CART, here we are also interested in finding out the approximate accuracy with which our CHAID tree, trained on this dataset, will perform in the real world. Hence, we will apply the cross-validation strategy. But first we specify the parameters of the CHAID tree.


In [ ]:
from shogun import CHAIDTree, MulticlassLabels

# set attribute types - all attributes are continuous(2)
feature_types = array([2 for i in range(13)],dtype=int32)    

# setup CHAID tree - dependent variable is nominal(0), feature types set, number of bins(20)
chaid = CHAIDTree(0,feature_types,20)

Next we set up the cross-validation class and get back the estimate we want, i.e. the mean classification accuracy.


In [ ]:
# set up cross validation class

from shogun import CrossValidation, CrossValidationSplitting, CrossValidationResult, MulticlassAccuracy

# set evaluation criteria - multiclass accuracy
accuracy = MulticlassAccuracy()
    
# set splitting criteria - 10 fold cross-validation
split = CrossValidationSplitting(train_labels,10)

# set cross-validation parameters 
cross_val = CrossValidation(chaid,train_feats,train_labels,split,accuracy,False)

# run cross-validation multiple times
cross_val.set_num_runs(10)

print('Mean classification accuracy : '+str(CrossValidationResult.obtain_from_generic(cross_val.evaluate()).mean*100)+' %')

Regression example using real dataset

In this section, we try to predict the value of houses in Boston using 13 attributes, like the per capita crime rate in the neighborhood, the number of rooms, the nitrous oxide concentration in the air, the proportion of non-retail business in the area, etc. Out of the 13 attributes, 12 are continuous and 1 (the Charles river dummy variable) is binary nominal. Let us load the dataset as our first step. For this, we can directly use Shogun's CSV file reader class.


In [ ]:
from shogun import CSVFile, RealFeatures, RegressionLabels
from numpy import ptp

train_feats=RealFeatures(CSVFile( os.path.join(SHOGUN_DATA_DIR, 'uci/housing/fm_housing.dat')))
train_labels=RegressionLabels(CSVFile( os.path.join(SHOGUN_DATA_DIR, 'uci/housing/housing_label.dat')))

# print range of regression labels - this is useful for calculating relative deviation later 
print('labels range : '+str(ptp(train_labels.get_labels())))

Next, we set up the parameters for the CHAID tree as well as the cross-validation class.


In [ ]:
from shogun import CHAIDTree, MeanSquaredError
from shogun import CrossValidation, CrossValidationSplitting, CrossValidationResult
from numpy import array, dtype, int32

def get_cv_error(max_depth):
    # set feature types - all continuous(2) except the 4th column which is nominal(0) and the 9th and 10th columns which are ordinal(1)
    feature_types = array([2]*13,dtype=int32)
    feature_types[3]=0
    feature_types[8]=1
    feature_types[9]=1    

    # setup CHAID-tree
    chaid = CHAIDTree(2,feature_types,10)
    # set max allowed depth
    chaid.set_max_tree_depth(max_depth)

    # set evaluation criteria - mean squared error
    accuracy = MeanSquaredError()
    # set splitting criteria - 5 fold cross-validation
    split = CrossValidationSplitting(train_labels,5)
    # set cross-validation parameters 
    cross_val = CrossValidation(chaid,train_feats,train_labels,split,accuracy,False)

    # run cross-validation multiple times
    cross_val.set_num_runs(3)

    # return cross validation result
    return CrossValidationResult.obtain_from_generic(cross_val.evaluate()).mean

In [ ]:
import matplotlib.pyplot as plt
% matplotlib inline
cv_errors = [get_cv_error(i) for i in range(1,10)]
plt.plot(range(1,10),cv_errors,'bo',range(1,10),cv_errors,'k')
plt.xlabel('max_allowed_depth')
plt.ylabel('cross-validated error')
plt.show()

From the above figure, we see that a tree depth of 2-4 is optimal and gives a mean squared error of ~25, which corresponds to a deviation of ~$\pm5$. We already calculated the range of the labels to be 45.0, hence the relative deviation comes out to be about 11.1%.
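
Written out, the arithmetic behind these figures is simply

$\sqrt{25} = 5 \;\Rightarrow\; \text{deviation} \approx \pm 5, \qquad 5/45.0 \approx 11.1\%$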

References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science

[2] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1: 1 (Mar. 1986), 81-106