Tree-based Methods

Tree-based methods can be used to solve regression and classification problems.

Decision Trees

A decision tree is a tree structure that partitions data points into regions or categories. Each vertex represents a decision to be made, and the outgoing edges from a vertex represent the possible choices for that decision.
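
As a minimal hypothetical sketch (the feature name and class labels below are made up purely for illustration), a single vertex and its outgoing edges correspond to a conditional statement:


In [ ]:
# A hypothetical one-vertex tree: the vertex decides on the feature 'colour',
# and each outgoing edge leads to a leaf that predicts a category.
def classify(observation):
    if observation['colour'] == 'red':   # decision made at this vertex
        return 'class A'                 # leaf reached via the 'red' edge
    else:
        return 'class B'                 # leaf reached via the other edge

print(classify({'colour': 'red'}))    # class A
print(classify({'colour': 'blue'}))   # class B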

Image from #

For instance, the figure above illustrates a decision tree model for the Titanic data set from Kaggle. Let us load the data and look at how many passengers survived.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('data/titanic-train.csv')

In [3]:
df.head()


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [4]:
df['Survival State'] = df['Survived'].apply(lambda x: 'Survived' if x == 1 else 'Died')
df['Survival State'].value_counts()


Out[4]:
Died        549
Survived    342
Name: Survival State, dtype: int64

We can see that 549 people died while the remaining 342 survived. This means that the following histogram is associated with the root vertex:


In [5]:
sns.countplot(x='Survival State', order=['Died', 'Survived'], data=df)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb7aa584e0>

The histogram of the left child of the root vertex (female passengers) is:


In [6]:
sns.countplot(x='Survival State', order=['Died', 'Survived'], data=df[df['Sex'] == 'female'])


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb7ab57fd0>

The right child (male passengers) is associated with the following histogram:


In [7]:
sns.countplot(x='Survival State', order=['Died', 'Survived'], data=df[df['Sex'] == 'male'])


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fb7abc20f0>

Given a new observation $x'$, if we have reached the right child of the root vertex, then the probability of $x'$ belonging to the Died category is about 81 percent:


In [8]:
total_count = df[df['Sex'] == 'male'].shape[0]
died_count = df[(df['Sex'] == 'male') & (df['Survival State'] == 'Died')].shape[0]
probab_pct = round(died_count / total_count * 100, 2)
print('{0} percent'.format(probab_pct))


81.11 percent
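
The same probability can also be read off a fitted tree. The following is a minimal sketch (assuming scikit-learn is available) that uses only the Sex column, encoded as a single hypothetical is_male feature: a depth-one decision tree reproduces the split at the root vertex, and its predict_proba output for a male passenger matches the roughly 81 percent computed above.


In [ ]:
from sklearn.tree import DecisionTreeClassifier

# Encode Sex as a single numeric feature (0 = female, 1 = male).
X = (df['Sex'] == 'male').astype(int).to_frame('is_male')
y = df['Survival State']

# A depth-one tree can only split on is_male, i.e. the root vertex of the figure.
tree = DecisionTreeClassifier(max_depth=1)
tree.fit(X, y)

# Class probabilities in the leaf reached by a male passenger;
# columns follow tree.classes_, e.g. ['Died', 'Survived'].
print(tree.classes_)
print(tree.predict_proba(pd.DataFrame({'is_male': [1]})))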

Random Forests

Decision trees are easy to understand, visualise and interpret, but they don't generalise very well because trees have very high variance in their predictions. High variance means that small changes in the training data can lead to very different predictions, so the risk of predicting the wrong class label is high. For instance, assume that we randomly split our training set into two halves $X_1$ and $X_2$. For an unseen observation $x'$, a decision tree based on $X_1$ can predict a completely different class label than another decision tree learned from $X_2$.
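
A rough way to see this instability on the Titanic data is sketched below. The feature selection and missing-value handling are assumptions made only for illustration: two fully grown trees are trained on disjoint halves of the data and then compared on the same observations.


In [ ]:
from sklearn.tree import DecisionTreeClassifier

# Use a few numeric features and fill missing ages, purely for illustration.
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
data = df[features + ['Survived']].copy()
data['Age'] = data['Age'].fillna(data['Age'].median())

# Randomly split the training set into two halves X_1 and X_2.
X_1 = data.sample(frac=0.5, random_state=0)
X_2 = data.drop(X_1.index)

tree_1 = DecisionTreeClassifier(random_state=0).fit(X_1[features], X_1['Survived'])
tree_2 = DecisionTreeClassifier(random_state=0).fit(X_2[features], X_2['Survived'])

# Fully grown trees trained on different halves often disagree on the same observations.
disagreement = (tree_1.predict(data[features]) != tree_2.predict(data[features])).mean()
print('fraction of disagreeing predictions: {0:.2f}'.format(disagreement))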

Prediction accuracy can be improved by training multiple decision trees and averaging their results to get a single prediction. Averaging out the noise of the individual models yields a combined model with lower variance. A general technique that reduces variance by combining learning models is called bootstrap aggregating, commonly known as bagging. This technique creates $m$ randomly sampled subsets (bags) of the training data with replacement. Sampling with replacement means that any bag can contain multiple copies of the same item from the training data.

Image Source [#]
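
The bootstrap step itself is easy to carry out with pandas. The sketch below is only illustrative: it draws $m = 3$ bags from the Titanic data and shows that each bag contains duplicates, so the number of distinct passengers per bag is smaller than the total.


In [ ]:
# Draw m bootstrap samples (bags) from the training data, with replacement.
m = 3
bags = [df.sample(n=len(df), replace=True, random_state=i) for i in range(m)]

# Each bag has the same size as the original data but contains duplicate rows,
# so the number of distinct passengers in a bag is smaller than len(df).
for i, bag in enumerate(bags):
    print('bag {0}: {1} rows, {2} unique passengers'.format(
        i, len(bag), bag['PassengerId'].nunique()))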

Random forests build upon the bagging technique. They create an ensemble (collection) of decision trees, called a forest, using bootstrap samples of the training data. The algorithm works as follows:

  1. Draw $m$ bootstrap samples from the training data
  2. For each bootstrap sample, grow a decision tree. When choosing the best split at a vertex, consider only $k$ features randomly selected from all $l$ features. The reason for randomly choosing $k$ features is to avoid growing decision trees that are highly correlated: if the data set contains a single strong predictor $A_{strong}$, considering all features results in similar trees that all use $A_{strong}$ as the root vertex.
  3. Make predictions by combining the predictions of the $m$ decision trees: use a majority vote for classification problems and an average for regression problems.

For classification problems, $k$ is typically set to $\sqrt{l}$.
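
scikit-learn's RandomForestClassifier implements this algorithm directly. The sketch below is illustrative rather than a tuned model: the feature selection and missing-value handling are assumptions, n_estimators corresponds to the number of bootstrap samples $m$, and max_features='sqrt' considers roughly $\sqrt{l}$ features at each split.


In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative feature set; Age has missing values, filled with the median.
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].copy()
X['Age'] = X['Age'].fillna(X['Age'].median())
X['is_male'] = (df['Sex'] == 'male').astype(int)
y = df['Survived']

# n_estimators is the number of bootstrap samples m;
# max_features='sqrt' considers about sqrt(l) features at each split.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())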

