Work on this before the next lecture on 24 April. We will talk about questions, comments, and solutions during the exercise after the third lecture.
Please do form study groups! When you do, make sure you can explain everything in your own words; do not simply copy&paste from others.
The solutions to a lot of these problems can probably be found with Google. Please don't. You will not learn a lot by copy&pasting from the internet.
If you want to get credit/examination on this course please upload your work to your GitHub repository for this course before the next lecture starts and post a link to your repository in this thread. If you worked on things together with others please add their names to the notebook so we can see who formed groups.
These are some useful default imports for plotting and numpy
In [16]:
%config InlineBackend.figure_format='retina'
%matplotlib inline
import numpy as np
np.random.seed(123)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14
from sklearn.utils import check_random_state
Correlation between trees. This question is about investigating the correlation between decision trees and how this affects an ensemble constructed from them. There are three methods for adding randomisation to the tree growing process:
1. growing each tree on a bootstrap sample of the training data (sampling with replacement),
2. considering only a random subset of the features (max_features) when searching for each split,
3. picking the split threshold at random instead of searching for the best one.
You can use RandomForestClassifier, BaggingClassifier, and ExtraTreesClassifier to realise different combinations of the three strategies above.
Show how the average correlation between the trees in the ensemble varies as a function of bootstrapping (yes/no), the number of features considered at each split (max_features), and whether the split point is picked at random or chosen as the best one.
Pick one of the classification datasets from http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets.
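To make the mapping concrete, here is one possible way of combining the three strategies with the estimators mentioned above (a sketch only; the parameter values are illustrative assumptions, not part of the exercise):
In [ ]:
# Illustrative (untuned) combinations of the three randomisation strategies
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier
# 1. only: bootstrap samples, every split considers all features
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True)
# 1. + 2.: bootstrap samples and a random feature subset at each split
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features="sqrt")
# 2. + 3.: random feature subsets and random split thresholds, no bootstrap by default
extra_trees = ExtraTreesClassifier(n_estimators=100, bootstrap=False, max_features="sqrt")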
In [29]:
from scipy.stats import pearsonr

def corr_from_trees(classifier, X_train, y_train, X_test):
    """Return the pairwise Pearson correlations between the test-set
    predictions of all trees in a fitted ensemble."""
    clf_fit = classifier.fit(X_train, y_train)
    num_est = len(clf_fit.estimators_)
    corr_array = np.zeros(num_est * (num_est - 1) // 2)
    for i in range(num_est):
        # Prediction of the i-th tree of the ensemble
        pr_i = clf_fit.estimators_[i].predict(X_test)
        for j in range(i + 1, num_est):
            pr_j = clf_fit.estimators_[j].predict(X_test)
            # Index of the (i, j) pair in the condensed (upper-triangular) array
            idx = num_est * (num_est - 1) // 2 - (num_est - i) * (num_est - i - 1) // 2 + j - i - 1
            # Pearson correlation between the predictions of trees i and j
            corr_array[idx] = pearsonr(pr_i, pr_j)[0]
    return corr_array
In [30]:
# Load the dataset and split it into a training and a test set
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=66)
In RandomForestClassifier each of the n_estimators decision trees is grown on a subsample of the training set drawn with replacement (1.) and, at each split, only a random subset of the features is considered (2.).
RandomForestClassifier then predicts the class by averaging the predictions of the individual trees.
We initialise two instances of this classifier, one with bootstrap=True and the other with bootstrap=False. Then, for each of the two classifiers, we vary the max_features parameter.
In [41]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
param_range = range(1, X.shape[1])
list_mean_corr_rfc = []
list_std_corr_rfc = []
list_mean_corr_rfc_no_bootstrap = []
list_std_corr_rfc_no_bootstrap = []
for mf in param_range:
    rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, max_features=mf)
    rfc_no_bootstrap = RandomForestClassifier(n_estimators=100, n_jobs=-1, bootstrap=False, max_features=mf)
    corr_rfc = corr_from_trees(rfc, X_train, y_train, X_test)
    corr_rfc_no_bootstrap = corr_from_trees(rfc_no_bootstrap, X_train, y_train, X_test)
    # Mean and spread of the pairwise tree correlations for this max_features value
    list_mean_corr_rfc.append(np.mean(corr_rfc))
    list_std_corr_rfc.append(np.std(corr_rfc))
    list_mean_corr_rfc_no_bootstrap.append(np.mean(corr_rfc_no_bootstrap))
    list_std_corr_rfc_no_bootstrap.append(np.std(corr_rfc_no_bootstrap))
list_mean_corr_rfc = np.array(list_mean_corr_rfc)
list_std_corr_rfc = np.array(list_std_corr_rfc)
list_mean_corr_rfc_no_bootstrap = np.array(list_mean_corr_rfc_no_bootstrap)
list_std_corr_rfc_no_bootstrap = np.array(list_std_corr_rfc_no_bootstrap)
plt.plot(param_range, list_mean_corr_rfc, label='Bootstrap True', lw=4, color='forestgreen')
plt.plot(param_range, list_mean_corr_rfc + list_std_corr_rfc, alpha=0.5, ls="--", color='forestgreen')
plt.plot(param_range, list_mean_corr_rfc - list_std_corr_rfc, alpha=0.5, ls="--", color='forestgreen')
plt.plot(param_range, list_mean_corr_rfc_no_bootstrap, label='Bootstrap False', lw=4, color='darkviolet')
plt.plot(param_range, list_mean_corr_rfc_no_bootstrap + list_std_corr_rfc_no_bootstrap, alpha=0.5, ls="--", color='darkviolet')
plt.plot(param_range, list_mean_corr_rfc_no_bootstrap - list_std_corr_rfc_no_bootstrap, alpha=0.5, ls="--", color='darkviolet')
plt.xlabel("Max features")
plt.ylabel("Correlation between trees")
plt.legend(loc='best')
Out[41]:
As we can see, with bootstrapping the average correlation between the predictions of different trees in the random forest stays roughly constant as max_features grows. Without bootstrapping, on the other hand, the correlation increases with max_features.
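The same helper can be reused for the other ensemble types. A minimal sketch (the parameter choices below are assumptions, not tuned values): pure bagging gives a single correlation value, since every split considers all features, while for ExtraTreesClassifier we can again vary max_features.
In [ ]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
# Pure bagging (strategy 1. only): one value, independent of max_features
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True, n_jobs=-1)
corr_bag = np.mean(corr_from_trees(bag, X_train, y_train, X_test))
# Extra trees (strategies 2. + 3.): random feature subsets and random split thresholds
mean_corr_ext = []
for mf in param_range:
    ext = ExtraTreesClassifier(n_estimators=100, max_features=mf, n_jobs=-1)
    mean_corr_ext.append(np.mean(corr_from_trees(ext, X_train, y_train, X_test)))
plt.axhline(corr_bag, ls='--', color='forestgreen', label='BaggingClassifier (bootstrap only)')
plt.plot(param_range, mean_corr_ext, lw=4, color='darkviolet', label='ExtraTreesClassifier')
plt.xlabel("Max features")
plt.ylabel("Correlation between trees")
plt.legend(loc='best')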
Compare the feature importances calculated by a RandomForestClassifier, an ExtraTreesClassifier and a GradientBoostingClassifier on the digits dataset. You might have to tune n_estimators to get good performance. Which parts of the images are the most important, and do you agree with the classifiers' interpretation? (Bonus) Do the importances change if you turn the problem into classifying odd vs. even digits?
You can load the data set with: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
In [ ]:
# your answer
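A possible starting point (a sketch, not a reference solution; the n_estimators values below are guesses and may need tuning):
In [ ]:
# Compare per-pixel feature importances of three tree ensembles on the digits data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
digits = load_digits()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(digits.data, digits.target, random_state=42)
models = {"RandomForest": RandomForestClassifier(n_estimators=200),
          "ExtraTrees": ExtraTreesClassifier(n_estimators=200),
          "GradientBoosting": GradientBoostingClassifier(n_estimators=100)}
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(Xd_train, yd_train)
    # Reshape the 64 per-pixel importances back into the 8x8 image grid
    ax.imshow(model.feature_importances_.reshape(8, 8), cmap='viridis')
    ax.set_title("%s (test acc. %.2f)" % (name, model.score(Xd_test, yd_test)))
# Bonus: repeat with yd_train % 2 and yd_test % 2 to classify odd vs. even digits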
This is a regression problem. Use a gradient boosted tree regressor (tune the max_depth, learning_rate and n_estimators parameters) to study the importance of the different features, as well as the partial dependence of the output on individual features and on pairs of features.
(Help: rgr = GradientBoostingRegressor(n_estimators=200, max_depth=2, learning_rate=0.1) seems to work quite well)
(Help: to produce 1D and 2D partial dependence plots pass [0,1, (0,1)] as the features argument of plot_partial_dependence. More details in the function's documentation.)
In [ ]:
from sklearn.ensemble import GradientBoostingRegressor

def make_data(n_samples=800, n_features=8, noise=0.2, random_state=2):
    # Only features 0-4 influence the target; features 5-7 are pure noise
    generator = check_random_state(random_state)
    X = generator.rand(n_samples, n_features)
    y = (10 * (X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 10 * X[:, 4] + noise * generator.randn(n_samples))
    return X, y

X, y = make_data()
In [ ]:
# your solution
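One way to approach it (a sketch; the import path for plot_partial_dependence matches older scikit-learn releases, where it lives in sklearn.ensemble.partial_dependence; in recent versions use sklearn.inspection.PartialDependenceDisplay instead):
In [ ]:
rgr = GradientBoostingRegressor(n_estimators=200, max_depth=2, learning_rate=0.1)
rgr.fit(X, y)
# Features 0-4 should carry essentially all of the importance; 5-7 are noise
plt.bar(np.arange(X.shape[1]), rgr.feature_importances_)
plt.xlabel("Feature index")
plt.ylabel("Importance")
# 1D partial dependence on features 0 and 1, plus their 2D interaction
from sklearn.ensemble.partial_dependence import plot_partial_dependence
plot_partial_dependence(rgr, X, [0, 1, (0, 1)])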
House prices in California. Use a gradient boosted regression tree model to build a model that can predict house prices in California (GradientBoostingRegressor is your friend).
Plot each of the features as a scatter plot with the target to learn about each variable. You can also make a plot of two features and use the target as colour.
Fit a model and tune the model complexity using a training and test data set.
Explore the feature importances and the partial dependence of the house price on the most important features.
In [ ]:
from sklearn.datasets import fetch_california_housing
cal_housing = fetch_california_housing()
# if the above doesn't work, download `cal_housing_py3.pkl` from the GitHub repository
# and adjust the path to the downloaded file which is passed to `load()`
# uncomment the following lines
#from sklearn.externals.joblib import load
#d = load('/home/username/Downloads/cal_housing_py3.pkz')
#X, y = d[:,1:], d[:,0]/100000
#X[:, 2] /= X[:, 5]
#X[:, 3] /= X[:, 5]
#X[:, 5] = X[:, 4] / X[:, 5]
In [ ]:
# your solution
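A possible sketch (the parameter values are assumptions and will likely need tuning against the test set):
In [ ]:
# Scatter plot of each feature against the target
X, y = cal_housing.data, cal_housing.target  # skip this line if you used the fallback loader above
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for i, ax in enumerate(axes.ravel()):
    ax.scatter(X[:, i], y, s=2, alpha=0.3)
    ax.set_xlabel(cal_housing.feature_names[i])
    ax.set_ylabel("House price")
# Fit a gradient boosted regression tree model and check train/test performance
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)
gbr.fit(X_train, y_train)
print("train R^2:", gbr.score(X_train, y_train))
print("test R^2:", gbr.score(X_test, y_test))
# Which features does the model rely on most?
for name, imp in sorted(zip(cal_housing.feature_names, gbr.feature_importances_), key=lambda t: -t[1]):
    print("%-12s %.3f" % (name, imp))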