Decision Tree Classification

Splitting

Splittin Criteria

Termination (When to stop splitting?)

  • Pure nodes ideally, stop when all the members belong to the same class, means that a node is pure.
  • Early termination to avoid overfitting

How to measure the purity/impurity of a node?

Entropy is a measure of impurity of a node. Higher entropy means the node has higher degree of impurity.

$$Entropy = -\displaystyle \sum_j p(j|t) \log p(j|t)$$

Find the best splitting attribute

  • For each candidate attribute
    • compute the entropy of each child node
    • compute the weighted entropy if the children $$E_{total} = \displaystyle \sum_{i=1}^{p} \frac{n_i}{n}E_i$$
    • choose the attribute that gives the smallest total entropy

How about splitting based on numeric attributes?

Assume we have two numeric attrbites $(x_1, x_2)$. We need to check all possible splitting points

Summary

  • Resulted boundaries are always parallel to attribute axes

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas

%matplotlib inline

In [32]:
sigma = np.array([[0.14,-0.1],[-0.1,0.14]])
mu1 = np.array([-0.4,0.2])
mu2 = np.array([0.4,0.5])

d1 = np.random.multivariate_normal(mean=mu1, cov=sigma, size=200)
d2 = np.random.multivariate_normal(mean=mu2, cov=sigma, size=200)

x1 = np.concatenate((d1[:,0], d2[:,0]))
x2 = np.concatenate((d1[:,1], d2[:,1]))
labels = np.concatenate((['Positive']*d1.shape[0], 
                         ['Negative']*d2.shape[0]))

df = pandas.DataFrame(dict(x1=x1, x2=x2, y=labels))

df = df.reindex(np.random.permutation(df.index))

sns.lmplot("x1", "x2", hue="y", data=df, fit_reg=False)


Out[32]:
<seaborn.axisgrid.FacetGrid at 0x7ff4074a40f0>

In [28]:



Out[28]:
array(['Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Negative', 'Negative', 'Negative', 'Negative', 'Negative',
       'Negative', 'Negative', 'Negative', 'Negative', 'Negative',
       'Negative', 'Negative', 'Negative', 'Negative', 'Negative',
       'Negative', 'Negative', 'Negative', 'Negative', 'Negative'], 
      dtype='<U8')

In [ ]: