Decision Trees are classifiers. They classify a the data item into a particular class based on the class column. For example, Considering the outlook, wind speed, temperature and humidity, the decision tree will help the user decide whether to play golf or no.
As the name suggests, the decision trees take a bunch of decisions to reach the final conclusion. Every current decision is based on the outcome of the pervious decisions. Considering each decision a node, All the decisions together will form a decision tree. Each node represents a question and number of branches from that node represent the number of answers for that question.
A decision tree is a flow chart like tree structure, where
Each internal node denotes a test on an attribute
Each branch denotes an outcome of the test
Each leaf node represent a class
In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree.
The nodes (Questions) are selected via the greedy approach. That is the root node (First Question) will be selected in such a way that it tries to split the maximum number of records into distintive classes based on the "Gain" of the question which we will discuss below. After the root node is selected, the next node is selected in the similar manner in a iterative process.
So the idea behind decision tree is to classify the record into a particular class based on the various features and conditions.
In a decision tree, the order in which the attributes are chosen determines how complicated the tree is. There are many algorithms to compute a decision tree, to name a few,
ID3 uses Information Theory to determine most informative attribute. A measure of the information content of a message is the inverse of the probability of receiving the message:
$$Information ( Message ) = \frac {1}{Probability( Message)} $$Node purity & impurity:
Node purity represents how well we split the data. A pure node is one which classifies the data completely i.e that node has all the data belonging to one class. Whereas a node which has data belonging more than class is an impure node. There are various measures to calculate the purity of the node. Some of the features are as follows:
A visual example
Entropy
In lay man terms Entropy is uncertainity or impurity. Uncertainity of a node to classify objects. The entropy of a node is based on the probability of the objects belonging to a particular class.
$$ Entropy (S) = - \sum_{i} P_i * Log_2(P_i) $$Based on Entropy(Uncertainity), We calculate the gain of a attribute. Information Gain also known as Gain of an attribute is used to select the attribute that will best separate the samples into individual classes. The attribute with the maximum gai,n will give us the maximum information and will split maximum number of records ie it will give us the most pure node.
Gain of example set S for a attribute A is given by:
$$ Gain( S, A) = Entropy (S)- S * \sum_{v} \frac{| S_v |}{ |S| }*Entropy(S_v)$$So as you can see the entropy of the attribute (A) is getting subtracted from the total entropy. Therefore we can say that, the attribute with the least entropy will have the maximum gain (Since it is getting subtracted from the overall entropy.). Therefore we can conclude that, the attribute with the maximum gain will be selected for splitting.
In this experiment, we will implement decision tree using Sci-kit learn's module : DecisionTreeClassifier So let's go ahead and import the packages required.
In [25]:
import numpy as np # for numeric operations
import pandas as pd # so store and use dataFrames
#ignoring warnings.
import warnings
warnings.simplefilter('ignore')
%matplotlib inline
Secondly let's import and analyse the dataset.
The Car Dataset
Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure:
Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples (for these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).
The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.
Because of known underlying concept structure, this database may be particularly useful for testing constructive induction andstructure discovery methods.
In [26]:
df=pd.read_csv('data/car.csv')
df.sample(5)
Out[26]:
So here you can see that based on the features, the model has to classify whether the car is unacceptable, acceptable, good and very good
In [27]:
# Let`s analyse the data and clean it
set(df['doors'])
Out[27]:
In [28]:
#finding the number of null values in the database
df.isnull().sum().sum()
Out[28]:
Since the model's cannot work with strings, we need to convert then into numerical values, So basically we need to encode them
In [29]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
#Encoding the data
columns=df.columns.tolist()
for c in columns:
df[c]=encoder.fit_transform(df[c])
# encoding the target data
#let`s view our encoding
print(df.sample(5))
In [30]:
target=df['class']
# droping the class column from the dataset.
df=df.drop('class',axis=1)
columns.remove('class')
In [31]:
columns
Out[31]:
So avoid overfitting of data, we use various methods of validations: In this experiment we will use 2:
In [32]:
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df,target,train_size=.60)
Let's implement our decision tree using entropy.
In [33]:
from sklearn.tree import DecisionTreeClassifier
dc=DecisionTreeClassifier(criterion='entropy')
dc.fit(X_train,y_train)
y_pred=dc.predict(X_test)
Now our decision tree is ready using Entropy and train_test_split, Let's have a look at the accuracy of the decision tree. To check the accuracy , we will be using
Accuray Score is the mean of the correct classifications. But is accuracy score the correct metric?
Before we compute the confusion matrix, Let us understand what it is.
A classifier is not 100% accurate, it should never be, cause that's just overfitting the data. So a classifier will get some samples right and some wrong. Generally, we check on the samples belonging to the test data. So, considering binary classes, there are 4 possible classifications.
Let's us see a visual representation of a confusion matrix.
Given these definitions, we typically calculate a few metrics for our classifier. First, the True Positive Rate: $$TPR = Recall = \frac{TP}{OP} = \frac{TP}{TP+FN},$$ also called the Hit Rate: the fraction of observed positives (1s) the classifier gets right, or how many true positives were recalled. Maximizing the recall towards 1 means keeping down the false negative rate. In a classifier try to find cancer patients, this is the number we want to maximize. The False Positive Rate is defined as $$FPR = \frac{FP}{ON} = \frac{FP}{FP+TN},$$ also called the False Alarm Rate, the fraction of observed negatives (0s) the classifier gets wrong. In general, you want this number to be low. Instead, you might want to maximize the Precision,which tells you how many of the predicted positive(1) hits were truly positive $$Precision = \frac{TP}{PP} = \frac{TP}{TP+FP}.$$ Finally the F1 score gives us the Harmonic Score of Precision and Recall. Many analysts will try and find a classifier that maximizes this score, since it tries to minimize both false positives and false negatives simultaneously, and is thus a bit more precise in what it is trying to do than the accuracy. $$F1 = \frac{2*Recall*Precision}{Recall + Precision}$$
https://github.com/cs109/2015lab6/blob/master/lab6-classification-redux.ipynb
In [34]:
from sklearn.metrics import confusion_matrix,accuracy_score, roc_curve,classification_report
import matplotlib.pyplot as plt
import itertools
#creating a function to plot confusion matrix.
def plot_confusion_matrix(cm,classes,title="Confusion Matrix",cmap=plt.cm.Blues):
plt.imshow(cm,interpolation='nearest',cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks=np.arange(len(classes))
plt.xticks(tick_marks,classes,rotation=45)
plt.yticks(tick_marks,classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True Class')
plt.xlabel('Predicted Class')
In [35]:
print("Accuracy Score of Decision tree with entropy is {0}".format(accuracy_score(y_test,y_pred)));
Accuracy_score_train_test_entropy=accuracy_score(y_test,y_pred)
# print("Confusion matrix of Decision tree with entropy is\n {0}".format(confusion_matrix(y_test,y_pred)))
print("\nClassification report of Decision tree with entropy is\n {0}".format(classification_report(y_test,y_pred)))
In [36]:
#ploting confusion matrix
plt.figure()
plot_confusion_matrix(confusion_matrix(y_test,y_pred), classes=set(encoder.inverse_transform(target)),
title='Confusion matrix of Decision tree with entropy')
plt.show()
So, we understood what's a confusion matrix in terms of a binary data, but what about multi-class data ?
As you can see in the above computed confusion matrix, we have a multiple classes. So on the X axis , we have predicted Class and on the Y axis, we have the True class. So we can see the number of records belonging that class and classified as that class by the classifier.
In this case, block (2,1) will represent the number of records belonging to the "good" class classified as good and the block (2,3) will represent the number of records belonging to the unacc class but classified as good
**NOTE** : The X and Y scale on the graph starts from 0, and the origin is the left bottom corner
Gini Impurity
Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability $f_i$ an item with label $i$ being chosen times the probability $ 1- f_i$ of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.
Gini Index
The Gini coefficient (or Gini ratio) G is a summary statistic of the Lorenz curve and a measure of inequality in a population
CART : Classification and Regression Tree uses Gini Index. It is algorithm which draws a decision tree which can be used to classify data into classes or predict a value.
Classification Tree:
When decision or target variable is categorical, the decision tree is classification decision tree. e.g. predicting whether customer will default or not (Binary Target variable). Or predicting food choices of the customers (nominal variable) using set of independent variable is an example of Classification Decision Tree.
Regression Tree:
When the decision or target variable is continuous variable, the decision tree is called regression decision tree. e.g. predicting house prices using attributes of houses such as size of a house, type of house (independent, apartment etc) and others. The independent variables can continuous and categorical variables.
The impurity (or purity) measure used in building decision tree in CART is Gini Index.
Gini Index
$$Gini =\sum_{i \ne j} p(i)p(j) $$
Consider an example, target variable is Binary variable which means it take two values (Yes and No). There can be 4 combinations.
|1|1|
|0|0|
$ P(Target=1).P(Target=1) + P(Target=1).P(Target=0) + P(Target=0).P(Target=1) + P(Target=0).P(Target=0) = 1$
$ P(Target=1).P(Target=0) + P(Target=0).P(Target=1) = 1 – P^2(Target=0) – P^2(Target=1)$
Gini Index for Binary Target variable is
$ = 1 – P^2(Target=0) – P^2(Target=1) $
$ = 1 - \sum_{t=0}^{t=1} P_t^2 $
pt : Proportion of observations with target variable value t. In Binary t takes value 0 and 1.
http://www.ke.tu-darmstadt.de/lehre/archiv/ws0809/mldm/dt.pdf
Now, let's create the decision tree using Gini Index.
In [37]:
dc=DecisionTreeClassifier(criterion='gini')
dc.fit(X_train,y_train)
y_pred=dc.predict(X_test)
Let' check the accuracy of the decision tree
In [38]:
print("Accuracy Score of Decision tree with Gini is {0}".format(accuracy_score(y_test,y_pred)));
Accuracy_score_train_test_Gini=accuracy_score(y_test,y_pred)
# print("Confusion matrix of Decision tree with Gini is\n {0}".format(confusion_matrix(y_test,y_pred)))
print("\nClassification report of Decision tree with Gini is\n {0}".format(classification_report(y_test,y_pred)))
In [39]:
#confusion matrix
plt.figure()
plot_confusion_matrix(confusion_matrix(y_test,y_pred), classes=set(encoder.inverse_transform(target)),
title='Confusion matrix of Decision tree with Gini')
plt.show()
In the above two cases, we used train_test_split, to overcome overfitting, now we will use cross validation
So let's understand why we need cross validation and what it is?
We basically have competing demands:
So inorder to balance these desires, we use cross-validation.
Thus for a given hypothesis set $\cal{H}_a$ with complexity parameter $d=a$ (the polynomial degree). We do the train/validate split, not once but multiple times.
In the figure below we create 4-folds from the training set part of our data set $\cal{D}$. By this we mean that we divide our set roughly into 4 equal parts. As illustrated below, this can be done in 4 different ways, or folds. In each fold we train a model on 3 of the parts. The model so trained is denotet as $g^-_{Fi}$, for example $g^-_{F3}$ . The minus sign in the superscript once again indicates that we are training on a reduced set. The $F3$ indicates that this model was trained on the third fold. Note that the model trained on each fold will be different! For each fold, after training the model, we calculate the risk or error on the remaining one validation part. We then add the validation errors together from the different folds, and divide by the number of folds to calculate an average error. Note again that this average error is an average over different models $g^-_{Fi}$. We use this error as the validation error for $d=a$ in the validation process described earlier.
https://github.com/cs109/2015lab5/blob/master/LearningModels.ipynb
Let's implement the decision trees using KFold cross validation
In [40]:
from sklearn.cross_validation import KFold
def run_cv(X,y,criterion):
# Construct a kfolds object
kf = KFold(len(y),n_folds=5,shuffle=True)
y_pred = y.copy()
dc=DecisionTreeClassifier(criterion=criterion)
for train_index, test_index in kf:
X_train,X_test=X[train_index],X[test_index]
y_train=y[train_index]
# print(X_train.shape,y_train.shape)
dc.fit(X_train.T,y_train)
y_pred[test_index]=dc.predict(X_test.T)
# print(X_train.shape,X_test.shape)
return y_pred
y_pred_entropy=run_cv(df.T,target,'entropy')
y_pred_gini=run_cv(df.T,target,'gini')
In [41]:
print("Accuracy Score of Decision tree with Entropy is {0}".format(accuracy_score(target,y_pred_entropy)));
Accuracy_score_Cross_validation_Entropy=accuracy_score(target,y_pred_entropy)
# print("Confusion matrix of Decision tree with Gini is\n {0}".format(confusion_matrix(target,y_pred_gini)))
print("\nClassification report of Decision tree with Entropy is\n {0}".format(classification_report(target,y_pred_entropy)))
In [42]:
plt.figure()
plot_confusion_matrix(confusion_matrix(target,y_pred_gini), classes=set(encoder.inverse_transform(target)),
title='Confusion matrix of Decision tree with Gini')
plt.show()
In [43]:
print("Accuracy Score of Decision tree with Gini is {0}".format(accuracy_score(target,y_pred_gini)));
Accuracy_score_Cross_validation_Gini=accuracy_score(target,y_pred_gini)
# print("Confusion matrix of Decision tree with Entropy is\n {0}".format(confusion_matrix(target,y_pred_entropy)))
print("\nClassification report of Decision tree with Gini is\n {0}".format(classification_report(target,y_pred_gini)))
In [44]:
plt.figure()
plot_confusion_matrix(confusion_matrix(target,y_pred_entropy), classes=set(encoder.inverse_transform(target)),
title='Confusion matrix of Decision tree with entropy')
plt.show()
Solution to the problem of overfitting
The problem of overfitting can be solved by using the following concepts:
So let's go ahead and tabulate our results
In [45]:
import prettytable as pt
from IPython.core.display import display, HTML
print("Accuracy Score Table")
table=pt.PrettyTable(['Model Selection','Gini Index','Entropy'])
table.add_row(["Train_test_Split",Accuracy_score_train_test_Gini,Accuracy_score_train_test_entropy])
table.add_row(["Cross Validation",Accuracy_score_Cross_validation_Gini,Accuracy_score_Cross_validation_Entropy])
display(HTML(table.get_html_string()))
Drawing our most accurate decision tree
In [46]:
from sklearn.tree import export_graphviz
In [47]:
class_names=list(set(encoder.inverse_transform(target)))
ddata=export_graphviz(dc,
out_file = None,
feature_names =columns,
class_names =class_names,
filled =True,
rounded =True,
special_characters = True
)
In [48]:
from IPython.display import Image
import pydotplus
graph = pydotplus.graph_from_dot_data(ddata)
Image(graph.create_png())
Out[48]: