In [1]:
import pandas as pd
from math import log2
%matplotlib inline
Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
Source
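The cells below compute the standard attribute-selection measures on a pandas DataFrame whose class column is named by the global `final`. For reference, the code follows the usual textbook definitions, with $p_o$ the fraction of rows of $S$ whose class is $o$ and $S_v$ the rows where attribute $A$ takes value $v$:

$$
H(S) = -\sum_o p_o \log_2 p_o, \qquad
\mathrm{Info}(S, A) = \sum_v \frac{|S_v|}{|S|}\, H(S_v), \qquad
\mathrm{Gain}(S, A) = H(S) - \mathrm{Info}(S, A)
$$

For example, a 9/5 class split has entropy $-\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$ bits.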
In [2]:
def entropy(S):
    # Entropy of the class column `final` within the subset S.
    outcomes = pd.unique(S[final])
    ent = lambda p: 0 if p == 0 else p * log2(p)
    return -sum([ent(S[S[final] == o].size / S.size) for o in outcomes])
In [3]:
def information(S, A):
    # Expected entropy remaining after splitting S on attribute A,
    # weighted by the relative size of each partition.
    partitions = pd.unique(S[A])
    return sum([(S[A][S[A] == p].size / S[A].size) *
                entropy(S[S[A] == p]) for p in partitions])
In [4]:
def gain(S, A):
    # Information gain: the entropy reduction achieved by splitting on A.
    return entropy(S) - information(S, A)
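Information gain tends to favour attributes with many distinct values, so C4.5-style selection normalises it by the intrinsic (split) information of the attribute itself. The next two cells implement that, following the usual definitions:

$$
\mathrm{SplitInfo}(S, A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}, \qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
$$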
In [5]:
def intrinsic_information(S, A):
    # Split information: the entropy of the partition sizes themselves.
    partitions = pd.unique(S[A])
    return -sum([(S[A][S[A] == p].size / S[A].size) *
                 log2(S[A][S[A] == p].size / S[A].size) for p in partitions])
In [6]:
def gain_ratio(S, A):
    # Gain ratio (C4.5): information gain normalised by split information.
    return gain(S, A) / intrinsic_information(S, A)
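The last measure is the Gini impurity used by CART. Called without an attribute, the function below gives the impurity of the class column; called with an attribute, it gives the size-weighted impurity of the induced partitions (lower is better, unlike the gain-based measures):

$$
\mathrm{Gini}(S) = 1 - \sum_o p_o^2, \qquad
\mathrm{Gini}(S, A) = \sum_v \frac{|S_v|}{|S|}\, \mathrm{Gini}(S_v)
$$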
In [7]:
def gini(S, A=None):
    # Without an attribute: Gini impurity of the class column.
    # With an attribute A: size-weighted Gini impurity of the partitions.
    if A is None:
        return 1 - sum([(S[S[final] == o].size / S.size) ** 2
                        for o in pd.unique(S[final])])
    return sum([(S[A][S[A] == p].size / S[A].size) * gini(S[S[A] == p])
                for p in pd.unique(S[A])])
In [8]:
# Apply a measure f to every attribute column of the global `data`, skipping the class column `final`.
_exec = lambda f: {col: f(data, col) for col in data.columns if col != final}
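Since the data files are not reproduced here, the following is a minimal sketch of how the helpers can be exercised on a hand-built frame; the column names and values are made up purely for illustration, and the real datasets are loaded in the cells below.

# Hypothetical toy frame, just to exercise the measures above.
toy = pd.DataFrame({
    'Windy':      ['yes', 'yes', 'no', 'no', 'no', 'yes'],
    'Outlook':    ['sunny', 'rain', 'rain', 'sunny', 'sunny', 'rain'],
    'Play Golf?': ['no', 'no', 'yes', 'yes', 'yes', 'no'],
})
final = 'Play Golf?'        # the functions read this module-level name

entropy(toy)                # 1.0  -- an even 3/3 class split
gain(toy, 'Windy')          # 1.0  -- 'Windy' separates the classes perfectly here
gain(toy, 'Outlook')        # ~0.082
gini(toy, 'Windy')          # 0.0  -- both partitions are pure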
In [9]:
data = pd.read_csv('playgolf.txt')
final = 'Play Golf?'
out = pd.DataFrame(dict(
    gain=_exec(gain),
    information=_exec(information),
    gain_ratio=_exec(gain_ratio),
    gini=_exec(gini)
))
out
Out[9]:
In [10]:
_ = out.plot(kind='bar')
In [11]:
data = pd.read_csv('lens24.dat')
final = 'class'
out = pd.DataFrame(dict(
    gain=_exec(gain),
    information=_exec(information),
    gain_ratio=_exec(gain_ratio),
    gini=_exec(gini)
))
out
Out[11]:
In [12]:
_ = out.plot(kind='bar')