Decision Tree visualization by Tony Chu and Stephanie Yee.
Classification and regression trees (CART) refers to a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. CART is both a generic term for decision tree algorithms and the specific name of Breiman's original algorithm for constructing classification and regression trees.
There are a handful of tree algorithms in addition to Breiman's original CART algorithm, namely ID3, C4.5, and C5.0, all created by Ross Quinlan. C5.0 is an improvement over C4.5; however, the C4.5 algorithm is still quite popular since the multi-threaded version of C5.0 is proprietary (although the single-threaded version is released under the GPL).
Here are some of the differences between CART and C4.5:
- CART always produces binary splits, while a C4.5 test can have two or more outcomes.
- CART ranks candidate splits using the Gini impurity, whereas C4.5 uses entropy-based information criteria.
- CART prunes trees using a cost-complexity model whose parameters are typically chosen by cross-validation; C4.5 uses pessimistic (error-based) pruning.
- CART handles both categorical and numeric dependent variables (classification and regression), while C4.5 is a classification algorithm.
Decision trees are formed by a collection of rules based on variables in the modeling data set:
- Rules based on the variables' values are selected to obtain the best split, i.e. to differentiate observations with respect to the dependent variable.
- Once a rule is selected and splits a node into two, the same process is applied recursively to each child node.
- Splitting stops when no further gain can be made or some pre-set stopping rule is met (alternatively, the tree is grown fully and then pruned back).
Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.
The original CART algorithm uses the Gini impurity, whereas ID3, C4.5, and C5.0 use entropy or information gain (which is derived from entropy).
Used by the CART algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability $f_i$ of each item being chosen times the probability $1 - f_i$ of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.
To compute Gini impurity for a set of items with $m$ classes, suppose $i \in \{1, 2, \ldots, m\}$, and let $f_i$ be the fraction of items labeled with class $i$ in the set.
$$ I_{G}(f)=\sum_{i=1}^{m}f_{i}(1-f_{i})=\sum_{i=1}^{m}(f_{i}-f_{i}^{2})=\sum_{i=1}^{m}f_{i}-\sum_{i=1}^{m}f_{i}^{2}=1-\sum_{i=1}^{m}f_{i}^{2}=\sum_{i\neq k}f_{i}f_{k} $$
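As a quick illustration of the formula above, here is a small sketch in R (the helper name `gini_impurity` is just an illustrative choice for this example) that computes $1-\sum_i f_i^2$ from a vector of class labels:

```r
# Gini impurity of a vector of class labels: I_G(f) = 1 - sum(f_i^2),
# where f_i is the fraction of items carrying label i
gini_impurity <- function(labels) {
  f <- table(labels) / length(labels)  # label proportions f_i
  1 - sum(f^2)
}

gini_impurity(c("a", "a", "b", "b"))  # 0.5 -- maximum impurity for two classes
gini_impurity(c("a", "a", "a", "a"))  # 0   -- pure node, single target category
```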
Entropy, $H(S)$, is a measure of the amount of uncertainty in the (data) set $S$ (i.e. entropy characterizes the (data) set $S$).

$$ H(S)=-\sum_{x\in X}p(x)\log_{2}p(x) $$

Where,
- $S$ is the current (data) set for which entropy is being calculated,
- $X$ is the set of classes in $S$, and
- $p(x)$ is the proportion of elements in class $x$ relative to the number of elements in $S$.
When $H(S)=0$, the set $S$ is perfectly classified (i.e. all elements in $S$ are of the same class).
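Continuing the same sketch, an entropy helper in R (again, `entropy` is just an illustrative name, not a function from any package used here) might look like:

```r
# Shannon entropy (base 2) of a vector of class labels:
# H(S) = -sum over classes x of p(x) * log2(p(x))
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p(x)
  -sum(p * log2(p))
}

entropy(c("a", "a", "b", "b"))  # 1 bit -- maximum uncertainty for two classes
entropy(rep("a", 4))            # 0     -- H(S) = 0, perfectly classified
```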
In ID3, the entropy of the partition induced by each remaining attribute is calculated. The attribute whose split yields the smallest (weighted) entropy is used to split the set $S$ on this iteration; minimizing the remaining entropy is equivalent to maximizing the information gain. The higher the entropy of a set, the greater the potential to improve the classification.
Information gain, $IG(A,S)$, measures the difference in entropy from before to after the set $S$ is split on an attribute $A$; in other words, it quantifies how much the uncertainty in $S$ was reduced by splitting on attribute $A$.
$$ IG(A,S)=H(S)-\sum_{t\in T}p(t)H(t) $$

Where,
- $H(S)$ is the entropy of set $S$,
- $T$ is the set of subsets created by splitting $S$ on attribute $A$ (so that $S=\bigcup_{t\in T}t$),
- $p(t)$ is the proportion of elements in $t$ relative to the number of elements in $S$, and
- $H(t)$ is the entropy of subset $t$.
In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set $S$ on this iteration.
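Building on the `entropy` helper sketched above, here is one way the information gain calculation and the resulting attribute choice could look in R. The toy `play`/`windy`/`sunny` vectors and the function name `info_gain` are made up purely for illustration:

```r
# Information gain from splitting a label vector S on an attribute A:
# IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)
# (reuses the entropy() helper defined earlier)
info_gain <- function(labels, attribute) {
  subsets <- split(labels, attribute)                  # partition S by the values of A
  p_t     <- sapply(subsets, length) / length(labels)  # p(t) = |t| / |S|
  entropy(labels) - sum(p_t * sapply(subsets, entropy))
}

# Toy example: which attribute best splits the 'play' labels?
play  <- c("yes", "yes", "no", "no", "yes", "no")
windy <- c("false", "false", "true", "true", "false", "true")
sunny <- c("true", "false", "true", "false", "true", "false")

gains <- c(windy = info_gain(play, windy), sunny = info_gain(play, sunny))
gains                    # windy = 1 bit, sunny ~= 0.08 bits
names(which.max(gains))  # "windy" -- the attribute ID3 would split on
```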
Tony Chu and Stephanie Yee designed an award-winning visualization of how decision trees work called "A Visual Introduction to Machine Learning." Their interactive D3 visualization is available here.
***
Since it's more common in machine learning to use trees in an ensemble, we'll skip the code tutorial for CART in R. For reference, trees can be grown using the rpart package, among others.
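For the curious, here is a minimal sketch (not a full tutorial) of fitting a single classification tree with rpart, using the built-in iris data as a stand-in data set:

```r
# A minimal single-tree example with rpart
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")  # classification tree
printcp(fit)                        # complexity-parameter table used for pruning
plot(fit); text(fit, use.n = TRUE)  # quick base-graphics rendering of the tree
```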