Tree-based methods can be used to solve regression and classification problems.
A decision tree is a tree structure that partitions data points into regions or categories. Each vertex represents a decision to be made, and the outgoing edges from a vertex represent the possible choices for that decision.
For instance, the figure above illustrates a decision tree model for the Titanic data set from Kaggle. Let us load the data and see how many passengers survived and how many died.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
df = pd.read_csv('data/titanic-train.csv')
In [3]:
df.head()
Out[3]:
In [4]:
df['Survival State'] = df['Survived'].apply(lambda x: 'Survived' if x == 1 else 'Died')
df['Survival State'].value_counts()
Out[4]:
We can see that 549 people died while the remaining 342 survived. This means that the following histogram is associated with the root node:
In [5]:
sns.countplot(x='Survival State', order=['Died', 'Survived'], data=df)
Out[5]:
The histogram of the left child of the root vertex (female passengers) is:
In [6]:
sns.countplot(x='Survival State', order=['Died', 'Survived'], data=df[df['Sex'] == 'female'])
Out[6]:
The right child (male passengers) is associated with the following histogram:
In [7]:
sns.countplot(x='Survival State', order=['Died', 'Survived'], data=df[df['Sex'] == 'male'])
Out[7]:
Given a new observation $x'$, if we have reached the right child of the root vertex (i.e. $x'$ is a male passenger), then the probability of $x'$ belonging to the 'Died' category is about 81 percent:
In [8]:
total_count = df[df['Sex'] == 'male'].shape[0]
died_count = df[(df['Sex'] == 'male') & (df['Survival State'] == 'Died')].shape[0]
probab_pct = round(died_count / total_count * 100, 2)
print('{0} percent'.format(probab_pct))
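The same split, and the same class probability, could be recovered automatically by fitting a very shallow decision tree. The following is a minimal sketch, assuming scikit-learn is installed; the feature encoding and the column name is_male are illustrative choices, not part of the original notebook.
In [9]:
# Minimal sketch (assumes scikit-learn): a depth-1 decision tree on the 'Sex'
# feature reproduces the male/female split and the class probabilities above.
from sklearn.tree import DecisionTreeClassifier

X = (df['Sex'] == 'male').astype(int).to_frame(name='is_male')
y = df['Survived']

stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

# Probability of class 0 ('Died') for a male passenger (is_male = 1)
print(stump.predict_proba(pd.DataFrame({'is_male': [1]}))[0][0])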
Decision trees are easy to understand, visualise and interpret, but they do not generalise very well because trees have very high variance in their predictions. High variance means that small changes in the training data can lead to very different predictions, so the chance of predicting the wrong class label for an unseen observation is high. For instance, assume that we randomly split our training set into two halves $X_1$ and $X_2$. A decision tree learned from $X_1$ can predict a completely different class label for an unseen observation $x'$ than a decision tree learned from $X_2$.
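We can see this behaviour on the Titanic data itself. The sketch below is illustrative only: the feature columns, the imputation of missing ages and the random seeds are assumptions, not part of the original notebook. It trains two fully grown trees on two random halves of the data and measures how often they disagree.
In [10]:
# Rough sketch of the high-variance behaviour of single decision trees
# (assumes scikit-learn is installed; feature choice is illustrative).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features = df[['Pclass', 'Fare', 'Age']].copy()
features['Age'] = features['Age'].fillna(features['Age'].median())
labels = df['Survived']

# Split the training data into two halves, X1 and X2
X1, X2, y1, y2 = train_test_split(features, labels, test_size=0.5, random_state=0)

tree1 = DecisionTreeClassifier(random_state=0).fit(X1, y1)
tree2 = DecisionTreeClassifier(random_state=0).fit(X2, y2)

# The two fully grown trees often assign different labels to the same passengers
disagreement = (tree1.predict(features) != tree2.predict(features)).mean()
print('The two trees disagree on {:.0%} of the passengers'.format(disagreement))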
Prediction accuracy can be improved by training multiple decision trees and averaging their results to obtain a single prediction. Averaging out the noise of the individual models yields a combined model with lower variance. A general technique for reducing variance by combining learning models is called bootstrap aggregating, commonly known as bagging. This technique creates $m$ randomly sampled subsets (bags) of the training data, drawn with replacement. 'With replacement' means that a bag can contain multiple copies of the same item from the training data.
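As a quick sketch of what a bag looks like, we can draw bootstrap samples of the Titanic data with pandas; the number of bags and the random seeds below are illustrative assumptions.
In [11]:
# Minimal sketch of bagging by hand: draw m bootstrap samples of the training
# data with replacement; a bag typically contains duplicate rows.
m = 5
bags = [df.sample(frac=1.0, replace=True, random_state=i) for i in range(m)]

first_bag = bags[0]
print('Rows in the first bag:', len(first_bag))
print('Distinct original rows in the first bag:', first_bag.index.nunique())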
Random forest builds upon the bagging technique. It creates an ensemble (a collection) of decision trees, called a forest, from bootstrap samples of the training data. Roughly, the algorithm works as follows: for each tree, a bootstrap sample of the training data is drawn; at every node of that tree, $k$ of the $l$ available features are chosen at random and the best split among those $k$ features is used; the predictions of the individual trees are then combined by majority vote (for classification) or by averaging (for regression). For classification problems, $k$ is typically set to $\sqrt{l}$.
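A minimal sketch of fitting a random forest with scikit-learn is shown below; the feature columns, the imputation, the number of trees and the cross-validation setup are assumptions for illustration, not part of the original notebook. Setting max_features='sqrt' corresponds to choosing $k = \sqrt{l}$ features at each split.
In [12]:
# Minimal random forest sketch (assumes scikit-learn is installed).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = pd.get_dummies(df[['Pclass', 'Sex', 'Fare', 'Age']], columns=['Sex'])
features['Age'] = features['Age'].fillna(features['Age'].median())
labels = df['Survived']

# 100 trees, each split considers sqrt(l) randomly chosen features
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
scores = cross_val_score(forest, features, labels, cv=5)
print('Cross-validated accuracy: {:.2f}'.format(scores.mean()))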