by Alejandro Correa Bahnsen & Iván Torroledo
version 1.2, Feb 2018
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks goes to Kevin Markham
Adapted from Chapter 8 of An Introduction to Statistical Learning
Why are we learning about decision trees?
Students will be able to:
Group exercise:
Rules for segmenting:
Above are the regions created by a computer:
Note: Years and Hits are both integers, but the convention is to use the midpoint between adjacent values to label a split.
These regions are used to make predictions on out-of-sample data. Thus, there are only three possible predictions! (Is this different from how linear regression makes predictions?)
Below is the equivalent regression tree:
The first split is Years < 4.5, thus that split goes at the top of the tree. When a splitting rule is True, you follow the left branch. When a splitting rule is False, you follow the right branch.
For players in the left branch, the mean Salary is \$166,000, thus you label it with that value. (Salary has been divided by 1000 and log-transformed to 5.11.)
For players in the right branch, there is a further split on Hits < 117.5, dividing players into two more Salary regions: \$403,000 (transformed to 6.00), and \$846,000 (transformed to 6.74).
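To make the "only three possible predictions" point concrete, here is a minimal sketch of the prediction rule this tree encodes (the salary values are taken from the description above; the function name is just for illustration):
# prediction rule implied by the regression tree above
def predict_log_salary(years, hits):
    # returns the predicted log(Salary in $1000s)
    if years < 4.5:         # left branch
        return 5.11         # roughly $166,000
    elif hits < 117.5:      # right branch, then Hits < 117.5
        return 6.00         # roughly $403,000
    else:                   # right branch, then Hits >= 117.5
        return 6.74         # roughly $846,000

# every player falls into exactly one of the three regions
predict_log_salary(years=2, hits=100), predict_log_salary(years=7, hits=150)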
What does this tree tell you about your data?
Question: What do you like and dislike about decision trees so far?
Your training data is a tiny dataset of used vehicle sale prices. Your goal is to predict the price of the vehicles in the testing data.
Explore the data by sorting, plotting, or split-apply-combine (aka group_by).
Ideal approach: Consider every possible partition of the feature space (computationally infeasible)
"Good enough" approach: recursive binary splitting
In [1]:
# vehicle data
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/vehicles_train.csv.zip', 'r') as z:
    f = z.open('vehicles_train.csv')
    train = pd.read_csv(f, index_col=False)
In [2]:
# before splitting anything, just predict the mean of the entire dataset
train['prediction'] = train.price.mean()
train
Out[2]:
In [3]:
# evaluate a candidate split on year: predict the mean price on each side of the cutpoint
year = 2010  # example cutpoint; any value within the range of the data could be tried
train['pred'] = train.loc[train.year < year, 'price'].mean()
train.loc[train.year >= year, 'pred'] = train.loc[train.year >= year, 'price'].mean()

# RMSE for this split
(((train['price'] - train['pred']) ** 2).mean()) ** 0.5
Out[3]:
In [4]:
# keep the left branch of the split ("izq" = izquierda, left) for further splitting
train_izq = train.loc[train.year < year].copy()
In [5]:
train_izq.year.unique()
Out[5]:
In [6]:
# RMSE after splitting the data on a given year cutpoint ("error_año" = year error)
def error_año(train, year):
    train['pred'] = train.loc[train.year < year, 'price'].mean()
    train.loc[train.year >= year, 'pred'] = train.loc[train.year >= year, 'price'].mean()
    return round((((train['price'] - train['pred']) ** 2).mean()) ** 0.5, 2)
In [7]:
# RMSE after splitting the data on a given miles cutpoint
def error_miles(train, miles):
    train['pred'] = train.loc[train.miles < miles, 'price'].mean()
    train.loc[train.miles >= miles, 'pred'] = train.loc[train.miles >= miles, 'price'].mean()
    return round((((train['price'] - train['pred']) ** 2).mean()) ** 0.5, 2)
Recap: Before every split, this process is repeated for every feature, and the feature/cutpoint combination that produces the lowest MSE is chosen.
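As a quick illustration of that recap on this dataset (the variable names below are just for illustration; the smallest unique value of each feature is skipped because it would leave one side of the split empty):
# sweep candidate cutpoints for each feature and record the resulting RMSE
year_rmse = {yr: error_año(train, yr) for yr in sorted(train.year.unique())[1:]}
miles_rmse = {m: error_miles(train, m) for m in sorted(train.miles.unique())[1:]}

# the first split is the feature/cutpoint combination with the lowest RMSE
print(min(year_rmse.items(), key=lambda kv: kv[1]))
print(min(miles_rmse.items(), key=lambda kv: kv[1]))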
In [8]:
# encode car as 0 and truck as 1
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
In [9]:
# define X and y
feature_cols = ['year', 'miles', 'doors', 'vtype']
X = train[feature_cols]
y = train.price
In [10]:
# instantiate a DecisionTreeRegressor (with random_state=1)
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=1)
treereg
Out[10]:
In [11]:
# use leave-one-out cross-validation (LOOCV) to estimate the RMSE for this model
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Out[11]:
The training error continues to go down as the tree size increases (due to overfitting), but the lowest cross-validation error occurs for a tree with 3 leaves.
In [12]:
# try different values one-by-one
treereg = DecisionTreeRegressor(max_depth=1, random_state=1)
scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Out[12]:
Or, we could write a loop to try a range of values:
In [13]:
# list of values to try
max_depth_range = range(1, 8)
# list to store the average RMSE for each value of max_depth
RMSE_scores = []
# use LOOCV with each value of max_depth
# use LOOCV with each value of max_depth
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
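For comparison with the claim above about training error, here is a minimal sketch that also records the training RMSE at each depth (fitting and scoring on the same data, so it is optimistic by construction):
# training RMSE at each value of max_depth
train_RMSE_scores = []
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    treereg.fit(X, y)
    train_RMSE_scores.append(np.sqrt(((y - treereg.predict(X)) ** 2).mean()))
train_RMSE_scores
The training RMSE can only fall (or stay flat) as the tree gets deeper, while the cross-validated RMSE bottoms out at a small depth.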
In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
Out[14]:
In [15]:
# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
Out[15]:
In [16]:
# "Gini importance" of each feature: the (normalized) total reduction of error brought by that feature
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
Out[16]:
In [17]:
# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='tree_vehicles.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_vehicles.dot -o tree_vehicles.png
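If the graphviz Python package and the Graphviz binaries are installed (an assumption; the original workflow uses the command line), the same .dot file can be rendered inline in the notebook:
# optional: render the exported tree inline
import graphviz
with open('tree_vehicles.dot') as f:
    dot_source = f.read()
graphviz.Source(dot_source)  # displays the tree when it is the last expression in a cell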
Reading the internal nodes:
Reading the leaves:
In [18]:
# read the testing data
with zipfile.ZipFile('../datasets/vehicles_test.csv.zip', 'r') as z:
    f = z.open('vehicles_test.csv')
    test = pd.read_csv(f, index_col=False)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
test
Out[18]:
Question: Using the tree diagram above, what predictions will the model make for each observation?
In [19]:
# use fitted model to make predictions on testing data
X_test = test[feature_cols]
y_test = test.price
y_pred = treereg.predict(X_test)
y_pred
Out[19]:
In [20]:
# calculate RMSE
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred))
Out[20]:
Questions:
| regression trees | classification trees |
|---|---|
| predict a continuous response | predict a categorical response |
| predict using mean response of each leaf | predict using most commonly occurring class of each leaf |
| splits are chosen to minimize MSE | splits are chosen to minimize Gini index (discussed below) |
Pretend we are predicting whether someone buys an iPhone or an Android:
Our goal in making splits is to reduce the classification error rate. Let's try splitting on gender:
Compare that with a split on age:
The decision tree algorithm will try every possible split across all features, and choose the split that reduces the error rate the most.
Calculate the Gini index before making a split:
$$1 - \left(\frac {iPhone} {Total}\right)^2 - \left(\frac {Android} {Total}\right)^2 = 1 - \left(\frac {10} {25}\right)^2 - \left(\frac {15} {25}\right)^2 = 0.48$$

Evaluating the split on gender using Gini index:

$$\text{Males: } 1 - \left(\frac {2} {14}\right)^2 - \left(\frac {12} {14}\right)^2 = 0.24$$
$$\text{Females: } 1 - \left(\frac {8} {11}\right)^2 - \left(\frac {3} {11}\right)^2 = 0.40$$
$$\text{Weighted Average: } 0.24 \left(\frac {14} {25}\right) + 0.40 \left(\frac {11} {25}\right) = 0.31$$

Evaluating the split on age using Gini index:

$$\text{30 or younger: } 1 - \left(\frac {4} {12}\right)^2 - \left(\frac {8} {12}\right)^2 = 0.44$$
$$\text{31 or older: } 1 - \left(\frac {6} {13}\right)^2 - \left(\frac {7} {13}\right)^2 = 0.50$$
$$\text{Weighted Average: } 0.44 \left(\frac {12} {25}\right) + 0.50 \left(\frac {13} {25}\right) = 0.47$$

Again, the decision tree algorithm will try every possible split, and will choose the split that reduces the Gini index (and thus increases the "node purity") the most.
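The same arithmetic in a few lines of Python (the counts come from the worked example above; the helper name gini is just for illustration):
def gini(counts):
    # Gini index: 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_before = gini([10, 15])                                 # no split: 10 iPhone vs 15 Android
gini_gender = 14/25 * gini([2, 12]) + 11/25 * gini([8, 3])   # weighted average for the gender split
gini_age = 12/25 * gini([4, 8]) + 13/25 * gini([6, 7])       # weighted average for the age split
round(gini_before, 2), round(gini_gender, 2), round(gini_age, 2)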
Note: There is another common splitting criterion called cross-entropy. It's numerically similar to the Gini index, but slower to compute, so it's not as popular as the Gini index.
We'll build a classification tree using the Titanic data:
In [21]:
# read in the data
with zipfile.ZipFile('../datasets/titanic.csv.zip', 'r') as z:
    f = z.open('titanic.csv')
    titanic = pd.read_csv(f, sep=',', index_col=0)
# encode female as 0 and male as 1
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})
# fill in the missing values for age with the median age
titanic.Age.fillna(titanic.Age.median(), inplace=True)
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# print the updated DataFrame
titanic.head()
Out[21]:
In [22]:
# define X and y
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
In [23]:
# fit a classification tree with max_depth=3 on all data
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)
Out[23]:
In [24]:
# create a Graphviz file
export_graphviz(treeclf, out_file='tree_titanic.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_titanic.dot -o tree_titanic.png
Notice the split in the bottom right: the same class is predicted in both of its leaves. That split didn't affect the classification error rate, though it did increase the node purity, which is important because it increases the accuracy of our predicted probabilities.
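Since purer leaves give sharper probabilities, here is a quick sketch of what those predicted probabilities look like for the fitted classifier (the choice of the first five passengers is arbitrary):
# each row is [P(not survived), P(survived)], i.e. the class fractions in that passenger's leaf
treeclf.predict_proba(X)[:5]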
In [25]:
# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})
Out[25]:
Advantages of decision trees:
Disadvantages of decision trees: