version 0.2, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.
Adapted from Chapter 8 of An Introduction to Statistical Learning
Why are we learning about decision trees?
Students will be able to:
Group exercise:
Rules for segmenting:
Above are the regions created by a computer:
Note: Years and Hits are both integers, but the convention is to use the midpoint between adjacent values to label a split.
These regions are used to make predictions on out-of-sample data. Thus, there are only three possible predictions! (Is this different from how linear regression makes predictions?)
Below is the equivalent regression tree:
The first split is Years < 4.5, thus that split goes at the top of the tree. When a splitting rule is True, you follow the left branch. When a splitting rule is False, you follow the right branch.
For players in the left branch, the mean Salary is \$166,000, thus you label it with that value. (Salary has been divided by 1000 and log-transformed to 5.11.)
For players in the right branch, there is a further split on Hits < 117.5, dividing players into two more Salary regions: \$403,000 (transformed to 6.00), and \$846,000 (transformed to 6.74).
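To make those three regions concrete, here is a small illustrative sketch (not from the original notebook) that hard-codes the prediction rule described above and undoes the log transform to recover dollars:

```python
import math

def predict_log_salary(years, hits):
    """Follow the two splits described above; returns log(Salary / 1000)."""
    if years < 4.5:
        return 5.11   # left branch: roughly $166,000
    elif hits < 117.5:
        return 6.00   # right branch, fewer hits: roughly $403,000
    else:
        return 6.74   # right branch, more hits: roughly $846,000

# back-transform: exp(log salary) * 1000 gives dollars
print(round(math.exp(predict_log_salary(years=6, hits=150)) * 1000))  # ~846,000
```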
What does this tree tell you about your data?
Question: What do you like and dislike about decision trees so far?
Your training data is a tiny dataset of used vehicle sale prices. Your goal is to predict price for the testing data.
(Hint: think about how grouping the data, e.g., with a pandas group_by, could give a prediction for each group.)
Ideal approach: consider every possible partition of the feature space (computationally infeasible).
"Good enough" approach: recursive binary splitting (a minimal sketch follows below).
In [5]:
# vehicle data
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/vehicles_train.csv.zip', 'r') as z:
    f = z.open('vehicles_train.csv')
    train = pd.read_csv(f, index_col=False)
In [7]:
# before splitting anything, just predict the mean of the entire dataset
train['prediction'] = train.price.mean()
train
Out[7]:
In [8]:
# split by hand on year < 2010 and compute the RMSE of the resulting predictions
year = 2010
train['pred'] = train.loc[train.year < year, 'price'].mean()
train.loc[train.year >= year, 'pred'] = train.loc[train.year >= year, 'price'].mean()
(((train['price'] - train['pred'])**2).mean()) ** 0.5
Out[8]:
In [9]:
# izq = izquierda (left branch): rows with year < 2010
train_izq = train.loc[train.year < 2010].copy()
In [10]:
train_izq.year.unique()
Out[10]:
In [22]:
def error_año(train, year):
    # RMSE after splitting on `year` (año = year): each side is predicted by its own mean price
    train['pred'] = train.loc[train.year < year, 'price'].mean()
    train.loc[train.year >= year, 'pred'] = train.loc[train.year >= year, 'price'].mean()
    return round((((train['price'] - train['pred'])**2).mean()) ** 0.5, 2)
In [23]:
def error_miles(train, miles):
    # RMSE after splitting on `miles`, analogous to error_año
    train['pred'] = train.loc[train.miles < miles, 'price'].mean()
    train.loc[train.miles >= miles, 'pred'] = train.loc[train.miles >= miles, 'price'].mean()
    return round((((train['price'] - train['pred'])**2).mean()) ** 0.5, 2)
In [24]:
for year in train_izq.year.unique():
    print(year, error_año(train_izq, year))
In [25]:
train_izq.miles.describe()
Out[25]:
In [26]:
for miles in [50000, 90000, 95000, 100000, 105000, 110000, 125000, 140000, 160000, 180000]:
    print(miles, error_miles(train_izq, miles))
In [30]:
# der = derecha (right branch): rows with year >= 2010
train_der = train.loc[train.year >= 2010].copy()
for year in train_der.year.unique():
    print(year, error_año(train_der, year))
print()
for miles in [50000, 90000, 95000, 100000, 105000, 110000, 125000, 140000, 160000, 180000]:
    print(miles, error_miles(train_der, miles))
In [29]:
train_der_izq = train_der.loc[train_der.year < 2012].copy()
for year in train_der_izq.year.unique():
    print(year, error_año(train_der_izq, year))
print()
for miles in [25000, 50000, 90000, 95000, 100000, 105000, 110000, 125000, 140000, 160000, 180000]:
    print(miles, error_miles(train_der_izq, miles))
In [31]:
train_der_izq
Out[31]:
In [32]:
train_izq_izq = train_izq.loc[train_izq.year<2007].copy()
train_izq_izq
Out[32]:
In [33]:
for year in train_izq_izq.year.unique():
    print(year, error_año(train_izq_izq, year))
for miles in [25000, 50000, 90000, 95000, 100000, 105000, 110000, 125000, 140000, 160000, 180000]:
    print(miles, error_miles(train_izq_izq, miles))
In [34]:
train_izq_der = train_izq.loc[train_izq.year>=2007].copy()
train_izq_der
Out[34]:
In [35]:
train_der_der = train_der.loc[train_der.year>=2012].copy()
train_der_der
Out[35]:
In [36]:
train_izq_izq_izq = train_izq_izq.loc[train_izq_izq.miles<125000].copy()
train_izq_izq_izq
Out[36]:
In [37]:
train_izq_izq_der = train_izq_izq.loc[train_izq_izq.miles>=125000].copy()
train_izq_izq_der
Out[37]:
In [38]:
for year in train_izq_izq_der.year.unique():
    print(year, error_año(train_izq_izq_der, year))
print()
for miles in [140000, 160000, 180000]:
    print(miles, error_miles(train_izq_izq_der, miles))
In [39]:
train_izq_izq_der_izq = train_izq_izq_der.loc[train_izq_izq_der.year<2003].copy()
train_izq_izq_der_izq
Out[39]:
In [40]:
train_izq_izq_der_der = train_izq_izq_der.loc[train_izq_izq_der.year>=2003].copy()
train_izq_izq_der_der
Out[40]:
In [41]:
for year in train_izq_izq_der_der.year.unique():
    print(year, error_año(train_izq_izq_der_der, year))
print()
for miles in [140000, 160000, 180000]:
    print(miles, error_miles(train_izq_izq_der_der, miles))
In [42]:
# calculate RMSE for those predictions
from sklearn import metrics
import numpy as np
np.sqrt(metrics.mean_squared_error(train.price, train.prediction))
Out[42]:
In [43]:
# define a function that calculates the RMSE for a given split of miles
def mileage_split(miles):
    lower_mileage_price = train[train.miles < miles].price.mean()
    higher_mileage_price = train[train.miles >= miles].price.mean()
    train['prediction'] = np.where(train.miles < miles, lower_mileage_price, higher_mileage_price)
    return np.sqrt(metrics.mean_squared_error(train.price, train.prediction))
In [44]:
# calculate RMSE for tree which splits on miles < 50000
print('RMSE:', mileage_split(50000))
train
Out[44]:
In [45]:
# calculate RMSE for tree which splits on miles < 100000
print('RMSE:', mileage_split(100000))
train
Out[45]:
In [46]:
# check all possible mileage splits
mileage_range = range(train.miles.min(), train.miles.max(), 1000)
RMSE = [mileage_split(miles) for miles in mileage_range]
In [47]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14
In [48]:
# plot mileage cutpoint (x-axis) versus RMSE (y-axis)
plt.plot(mileage_range, RMSE)
plt.xlabel('Mileage cutpoint')
plt.ylabel('RMSE (lower is better)')
Out[48]:
Recap: Before every split, this process is repeated for every feature, and the combination of feature and cutpoint that produces the lowest MSE is chosen.
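As a hedged sketch of that recap, reusing the `mileage_split` helper and `mileage_range` from above plus a hypothetical `year_split` analogue (and the `train`, `np`, and `metrics` objects already defined), the exhaustive search over both features would look roughly like this:

```python
# Hypothetical analogue of mileage_split that splits on year instead of miles.
def year_split(year):
    lower_year_price = train[train.year < year].price.mean()
    higher_year_price = train[train.year >= year].price.mean()
    train['prediction'] = np.where(train.year < year, lower_year_price, higher_year_price)
    return np.sqrt(metrics.mean_squared_error(train.price, train.prediction))

# try every candidate cutpoint for every feature, and keep the split with the lowest RMSE
candidates = [('miles', m, mileage_split(m)) for m in mileage_range]
candidates += [('year', yr, year_split(yr)) for yr in sorted(train.year.unique())]
best_feature, best_cutpoint, best_rmse = min(candidates, key=lambda c: c[2])
print(best_feature, best_cutpoint, round(best_rmse, 2))
```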
In [49]:
# encode car as 0 and truck as 1
train['vtype'] = train.vtype.map({'car':0, 'truck':1})
In [50]:
# define X and y
feature_cols = ['year', 'miles', 'doors', 'vtype']
X = train[feature_cols]
y = train.price
In [51]:
# instantiate a DecisionTreeRegressor (with random_state=1)
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=1)
treereg
Out[51]:
In [52]:
# use leave-one-out cross-validation (LOOCV) to estimate the RMSE for this model
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Out[52]:
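cv=14 acts as leave-one-out here because the training set has 14 rows. As a hedged equivalent (assuming the modern sklearn.model_selection API), the explicit LeaveOneOut splitter gives the same folds:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score

# LeaveOneOut() creates one fold per row, so it is equivalent to cv=14 on this 14-row dataset
scores = cross_val_score(treereg, X, y, cv=LeaveOneOut(), scoring='neg_mean_squared_error')
print(np.mean(np.sqrt(-scores)))
```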
The training error continues to go down as the tree size increases (due to overfitting), but the lowest cross-validation error occurs for a tree with 3 leaves.
In [53]:
# try different values one-by-one
treereg = DecisionTreeRegressor(max_depth=1, random_state=1)
scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Out[53]:
Or, we could write a loop to try a range of values:
In [54]:
# list of values to try
max_depth_range = range(1, 8)
# list to store the average RMSE for each value of max_depth
RMSE_scores = []
# use LOOCV with each value of max_depth
for depth in max_depth_range:
    treereg = DecisionTreeRegressor(max_depth=depth, random_state=1)
    MSE_scores = cross_val_score(treereg, X, y, cv=14, scoring='neg_mean_squared_error')
    RMSE_scores.append(np.mean(np.sqrt(-MSE_scores)))
In [55]:
%matplotlib inline
import matplotlib.pyplot as plt
# plot max_depth (x-axis) versus RMSE (y-axis)
plt.plot(max_depth_range, RMSE_scores)
plt.xlabel('max_depth')
plt.ylabel('RMSE (lower is better)')
Out[55]:
In [56]:
# max_depth=3 was best, so fit a tree using that parameter
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
Out[56]:
In [57]:
# "Gini importance" of each feature: the (normalized) total reduction of error brought by that feature
pd.DataFrame({'feature':feature_cols, 'importance':treereg.feature_importances_})
Out[57]:
In [21]:
# create a Graphviz file
from sklearn.tree import export_graphviz
export_graphviz(treereg, out_file='tree_vehicles.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_vehicles.dot -o tree_vehicles.png
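If Graphviz is not available at the command line, newer scikit-learn releases (0.21+, which may postdate this notebook) can draw the fitted tree inline; a hedged alternative:

```python
# Alternative: render the tree directly with matplotlib (requires scikit-learn >= 0.21).
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(treereg, feature_names=feature_cols, filled=True)
plt.show()
```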
Reading the internal nodes:
Reading the leaves:
In [58]:
# read the testing data
with zipfile.ZipFile('../datasets/vehicles_test.csv.zip', 'r') as z:
    f = z.open('vehicles_test.csv')
    test = pd.read_csv(f, index_col=False)
test['vtype'] = test.vtype.map({'car':0, 'truck':1})
test
Out[58]:
Question: Using the tree diagram above, what predictions will the model make for each observation?
In [59]:
# use fitted model to make predictions on testing data
X_test = test[feature_cols]
y_test = test.price
y_pred = treereg.predict(X_test)
y_pred
Out[59]:
In [61]:
# calculate RMSE
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred))
Out[61]:
Questions:
| regression trees | classification trees |
|---|---|
| predict a continuous response | predict a categorical response |
| predict using mean response of each leaf | predict using most commonly occurring class of each leaf |
| splits are chosen to minimize MSE | splits are chosen to minimize Gini index (discussed below) |
Pretend we are predicting whether someone buys an iPhone or an Android:
Our goal in making splits is to reduce the classification error rate. Let's try splitting on gender:
Compare that with a split on age:
The decision tree algorithm will try every possible split across all features, and choose the split that reduces the error rate the most.
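Using the class counts that appear in the Gini calculations below (10 iPhone vs. 15 Android overall, a 2/12 vs. 8/3 gender split, and a 4/8 vs. 6/7 age split), here is a small sketch of that error-rate comparison (not part of the original notebook):

```python
# classification error rate = fraction of observations not in their node's majority class
def error_rate(groups):
    """groups: one [iPhone, Android] count pair per node created by the split."""
    total = sum(sum(g) for g in groups)
    errors = sum(sum(g) - max(g) for g in groups)
    return errors / total

print(error_rate([[10, 15]]))           # 0.4 before any split
print(error_rate([[2, 12], [8, 3]]))    # 0.2 after splitting on gender
print(error_rate([[4, 8], [6, 7]]))     # 0.4 after splitting on age (no improvement)
```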
Calculate the Gini index before making a split:
$$1 - \left(\frac {iPhone} {Total}\right)^2 - \left(\frac {Android} {Total}\right)^2 = 1 - \left(\frac {10} {25}\right)^2 - \left(\frac {15} {25}\right)^2 = 0.48$$

Evaluating the split on gender using the Gini index:

$$\text{Males: } 1 - \left(\frac {2} {14}\right)^2 - \left(\frac {12} {14}\right)^2 = 0.24$$
$$\text{Females: } 1 - \left(\frac {8} {11}\right)^2 - \left(\frac {3} {11}\right)^2 = 0.40$$
$$\text{Weighted Average: } 0.24 \left(\frac {14} {25}\right) + 0.40 \left(\frac {11} {25}\right) = 0.31$$

Evaluating the split on age using the Gini index:

$$\text{30 or younger: } 1 - \left(\frac {4} {12}\right)^2 - \left(\frac {8} {12}\right)^2 = 0.44$$
$$\text{31 or older: } 1 - \left(\frac {6} {13}\right)^2 - \left(\frac {7} {13}\right)^2 = 0.50$$
$$\text{Weighted Average: } 0.44 \left(\frac {12} {25}\right) + 0.50 \left(\frac {13} {25}\right) = 0.47$$

Again, the decision tree algorithm will try every possible split, and will choose the split that reduces the Gini index (and thus increases the "node purity") the most.
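The same arithmetic as a small reusable helper (an illustrative sketch, not from the original notebook):

```python
def gini(counts):
    """Gini index of a node, given its class counts, e.g. [iPhone, Android]."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted-average Gini index of a candidate split, given per-node class counts."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(round(gini([10, 15]), 2))                      # 0.48 before any split
print(round(weighted_gini([[2, 12], [8, 3]]), 2))    # 0.31 for the gender split
print(round(weighted_gini([[4, 8], [6, 7]]), 2))     # 0.47 for the age split
```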
Note: There is another common splitting criterion called cross-entropy. It gives results that are numerically similar to the Gini index, but it is slower to compute, so it is not as popular as the Gini index.
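For comparison, a minimal sketch of the cross-entropy criterion on the same overall node counts as above (this helper is illustrative, not from the notebook):

```python
import numpy as np

def entropy(counts):
    """Cross-entropy (deviance) of a node with the given (non-zero) class counts."""
    p = np.array(counts) / sum(counts)
    return -(p * np.log2(p)).sum()

print(round(entropy([10, 15]), 2))   # about 0.97 bits before any split
```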
We'll build a classification tree using the Titanic data:
In [62]:
# read in the data
with zipfile.ZipFile('../datasets/titanic.csv.zip', 'r') as z:
    f = z.open('titanic.csv')
    titanic = pd.read_csv(f, sep=',', index_col=0)
# encode female as 0 and male as 1
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})
# fill in the missing values for age with the median age
titanic['Age'] = titanic.Age.fillna(titanic.Age.median())
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# print the updated DataFrame
titanic.head()
Out[62]:
In [63]:
# define X and y
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
In [64]:
# fit a classification tree with max_depth=3 on all data
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(X, y)
Out[64]:
In [29]:
# create a Graphviz file
export_graphviz(treeclf, out_file='tree_titanic.dot', feature_names=feature_cols)
# At the command line, run this to convert to PNG:
# dot -Tpng tree_titanic.dot -o tree_titanic.png
Notice the split in the bottom right: the same class is predicted in both of its leaves. That split didn't affect the classification error rate, though it did increase the node purity, which is important because it increases the accuracy of our predicted probabilities.
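To see the effect on probabilities, here is a hedged sketch (reusing the fitted `treeclf` and `X` from the cells above): both leaves of that split predict the same class, but the purer leaf yields a more confident probability.

```python
# predict() reports only the majority class of each leaf;
# predict_proba() exposes the leaf purity as [P(died), P(survived)]
print(treeclf.predict(X.iloc[:5]))
print(treeclf.predict_proba(X.iloc[:5]).round(3))
```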
In [65]:
# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_})
Out[65]:
Advantages of decision trees:
Disadvantages of decision trees: