Key Requirements for the iRF scikit-learn implementation

  • This notebook documents the main requirements for the iRF implementation

Typical Setup

Import the required dependencies

  • In particular irf_utils and irf_jupyter_utils

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
import numpy as np
from functools import reduce

# Needed for the scikit-learn wrapper function
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from math import ceil

# Import our custom utilities
from importlib import reload
from utils import irf_jupyter_utils
from utils import irf_utils
reload(irf_jupyter_utils)
reload(irf_utils)


Out[1]:
<module 'utils.irf_utils' from '/home/runjing_liu/Documents/iRF/scikit-learn-sandbox/jupyter/utils/irf_utils.py'>

Step 1: Fit the Initial Random Forest

  • Fit a standard random forest with equal weight on every feature, using the usual scikit-learn code, e.g. RandomForestClassifier (a minimal sketch of this step follows)
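
A minimal sketch of what this step amounts to, assuming generate_rf_example (used below) wraps an ordinary train/test split plus a plain RandomForestClassifier fit - the split sizes are taken from the shapes printed further down:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical re-creation of the example data and the initial
    # equal-weight forest (512 train / 57 test matches the shapes below)
    raw = load_breast_cancer()
    X_tr, X_te, y_tr, y_te = train_test_split(
        raw.data, raw.target, train_size=512, random_state=2018)
    rf0 = RandomForestClassifier(n_estimators=20, random_state=2018)
    rf0.fit(X=X_tr, y=y_tr)  # every feature enters with equal weight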

In [2]:
breast_cancer = load_breast_cancer()  # avoid shadowing the imported loader

In [3]:
X_train, X_test, y_train, y_test, rf = irf_jupyter_utils.generate_rf_example(n_estimators=20, 
                                                                             feature_weight=None)

Check out the data


In [4]:
print("Training feature dimensions", X_train.shape, sep = ":\n")
print("\n")
print("Training outcome dimensions", y_train.shape, sep = ":\n")
print("\n")
print("Test feature dimensions", X_test.shape, sep = ":\n")
print("\n")
print("Test outcome dimensions", y_test.shape, sep = ":\n")
print("\n")
print("first 2 rows of the training set features", X_train[:2], sep = ":\n")
print("\n")
print("first 2 rows of the training set outcomes", y_train[:2], sep = ":\n")


Training feature dimensions:
(512, 30)


Training outcome dimensions:
(512,)


Test feature dimensions:
(57, 30)


Test outcome dimensions:
(57,)


first 2 rows of the training set features:
[[  1.98900000e+01   2.02600000e+01   1.30500000e+02   1.21400000e+03
    1.03700000e-01   1.31000000e-01   1.41100000e-01   9.43100000e-02
    1.80200000e-01   6.18800000e-02   5.07900000e-01   8.73700000e-01
    3.65400000e+00   5.97000000e+01   5.08900000e-03   2.30300000e-02
    3.05200000e-02   1.17800000e-02   1.05700000e-02   3.39100000e-03
    2.37300000e+01   2.52300000e+01   1.60500000e+02   1.64600000e+03
    1.41700000e-01   3.30900000e-01   4.18500000e-01   1.61300000e-01
    2.54900000e-01   9.13600000e-02]
 [  2.01800000e+01   1.95400000e+01   1.33800000e+02   1.25000000e+03
    1.13300000e-01   1.48900000e-01   2.13300000e-01   1.25900000e-01
    1.72400000e-01   6.05300000e-02   4.33100000e-01   1.00100000e+00
    3.00800000e+00   5.24900000e+01   9.08700000e-03   2.71500000e-02
    5.54600000e-02   1.91000000e-02   2.45100000e-02   4.00500000e-03
    2.20300000e+01   2.50700000e+01   1.46000000e+02   1.47900000e+03
    1.66500000e-01   2.94200000e-01   5.30800000e-01   2.17300000e-01
    3.03200000e-01   8.07500000e-02]]


first 2 rows of the training set outcomes:
[0 0]

Step 2: Get all Random Forest and Decision Tree Data

  • Extract the random forest data, and the data for all of its decision trees, into a single dictionary
  • This is the format required for the RIT computations (a sketch of the underlying tree attributes follows)
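
For reference, here is a sketch of the public scikit-learn attributes that a helper like get_rf_tree_data presumably reads off each fitted tree (this is standard sklearn tree API, not the irf_utils code itself):

    # Per-tree structure exposed by scikit-learn. In the `feature`
    # array, -2 (TREE_UNDEFINED) marks a leaf; in `children_left` and
    # `children_right`, -1 (TREE_LEAF) does.
    dtree = rf.estimators_[0].tree_
    print(dtree.node_count)          # total number of nodes
    print(dtree.feature[:10])        # split feature per node
    print(dtree.children_left[:10])  # left child node ids
    print(dtree.value.shape)         # per-node class counts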

In [5]:
all_rf_tree_data = irf_utils.get_rf_tree_data(
    rf=rf, X_train=X_train, X_test=X_test, y_test=y_test)

Step 3: Get the RIT data and produce RITs


In [6]:
np.random.seed(12)
all_rit_tree_data = irf_utils.get_rit_tree_data(
    all_rf_tree_data=all_rf_tree_data,
    bin_class_type=1,
    M=100,
    max_depth=2,
    noisy_split=False,
    num_splits=2)
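
The parameters mirror the RIT algorithm: M intersection trees are grown, each to max_depth, with num_splits children per node, intersecting only the decision paths of class bin_class_type. Below is a minimal sketch of that recursion; rit_sketch is a hypothetical illustration, not the irf_utils implementation, and it omits the noisy_split option:

    import numpy as np

    def rit_sketch(paths, node=None, depth=0, max_depth=2,
                   num_splits=2, rng=None):
        # Root: a randomly drawn decision path (a set of feature ids);
        # each child: the intersection of its parent with a fresh
        # random draw. Non-empty survivors are candidate interactions.
        rng = np.random.RandomState(12) if rng is None else rng
        draw = set(paths[rng.randint(len(paths))])
        node = draw if node is None else node & draw
        out = [node]
        if depth < max_depth and node:
            for _ in range(num_splits):
                out += rit_sketch(paths, node, depth + 1,
                                  max_depth, num_splits, rng)
        return out

    # Toy usage on hand-written feature paths:
    paths = [[7, 23, 27], [7, 22], [7, 23, 21, 27], [7, 27, 22]]
    for interaction in rit_sketch(paths):
        print(sorted(interaction))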

Perform manual checks on irf_utils

  • These should be converted to unit tests and run with nosetests -v test_irf_utils.py (see the test sketch below)
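
For example, the leaf-count inspection done by hand in cell [11] below could become a test like this (a hypothetical test_irf_utils.py entry, reusing only the helpers already imported in this notebook):

    from utils import irf_jupyter_utils, irf_utils

    def test_dtree0_leaf_counts():
        # Every training sample should land in exactly one leaf of
        # the first decision tree.
        X_train, X_test, y_train, y_test, rf = \
            irf_jupyter_utils.generate_rf_example(n_estimators=20,
                                                  feature_weight=None)
        data = irf_utils.get_rf_tree_data(rf=rf, X_train=X_train,
                                          X_test=X_test, y_test=y_test)
        assert sum(data['dtree0']['tot_leaf_node_values']) == \
            X_train.shape[0]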

Step 4: Plot some Data

List Ranked Feature Importances


In [7]:
# Print the feature ranking
print("Feature ranking:")

feature_importances_rank_idx = all_rf_tree_data['feature_importances_rank_idx']
feature_importances = all_rf_tree_data['feature_importances']

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1
                                   , feature_importances_rank_idx[f]
                                   , feature_importances[feature_importances_rank_idx[f]]))


Feature ranking:
1. feature 20 (0.269505)
2. feature 23 (0.166292)
3. feature 22 (0.125083)
4. feature 27 (0.095908)
5. feature 7 (0.047362)
6. feature 6 (0.041892)
7. feature 3 (0.038964)
8. feature 26 (0.038608)
9. feature 0 (0.037853)
10. feature 24 (0.015405)
11. feature 28 (0.013661)
12. feature 21 (0.012163)
13. feature 1 (0.011433)
14. feature 25 (0.009299)
15. feature 2 (0.009179)
16. feature 10 (0.007583)
17. feature 13 (0.007535)
18. feature 29 (0.006910)
19. feature 12 (0.006683)
20. feature 4 (0.005703)
21. feature 17 (0.005578)
22. feature 15 (0.004319)
23. feature 16 (0.003881)
24. feature 9 (0.003759)
25. feature 5 (0.003555)
26. feature 19 (0.003329)
27. feature 14 (0.002782)
28. feature 8 (0.002763)
29. feature 18 (0.001688)
30. feature 11 (0.001323)

Plot Ranked Feature Importances


In [8]:
# Plot the feature importances of the forest
feature_importances_std = all_rf_tree_data['feature_importances_std']

plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]),
        feature_importances[feature_importances_rank_idx],
        color="r",
        yerr=feature_importances_std[feature_importances_rank_idx],
        align="center")
plt.xticks(range(X_train.shape[1]), feature_importances_rank_idx)
plt.xlim([-1, X_train.shape[1]])
plt.show()


Decision Tree 0 (First) - Get output

Check the output against the decision tree graph


In [9]:
# Now plot the trees individually
#irf_jupyter_utils.draw_tree(decision_tree = all_rf_tree_data['rf_obj'].estimators_[0])
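
If graphviz is not set up, scikit-learn's built-in exporter gives an equivalent visual check (a sketch; draw_tree above presumably wraps something similar):

    from sklearn.tree import export_graphviz

    # Write the first tree to a .dot file; render it afterwards with
    # e.g. `dot -Tpng dtree0.dot -o dtree0.png`.
    export_graphviz(all_rf_tree_data['rf_obj'].estimators_[0],
                    out_file='dtree0.dot', filled=True)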

Compare to our dict of extracted data from the tree


In [10]:
#irf_jupyter_utils.pretty_print_dict(inp_dict = all_rf_tree_data['dtree0'])

In [11]:
# Count the number of samples passing through the leaf nodes
sum(all_rf_tree_data['dtree0']['tot_leaf_node_values'])


Out[11]:
512

Check output against the diagram


In [12]:
#irf_jupyter_utils.pretty_print_dict(inp_dict = all_rf_tree_data['dtree0']['all_leaf_paths_features'])

Run the iRF function

We will run the iRF with the following parameters

Data:

  • breast cancer binary classification data
  • random state (for reproducibility): 2018

Weighted RFs

  • K: 5 iterations
  • number of trees: 20

Bootstrap RFs

  • proportion of bootstrap samples: 20%
  • B: 30 bootstrap samples
  • number of trees per bootstrap RF: 5

RITs (on the bootstrap RFs)

  • M: 20 RITs per forest
  • filter label type: class 1 only
  • Max Depth: 5
  • Noisy Split: False
  • Number of splits at Node: 2 splits

Running the iRF is easy - a single function call

  • All of the bootstrap and RIT complexity is handled internally via the key parameters listed above, which are passed straight through to the main algorithm
  • This function call returns the following data:
    1. all RF weights
    2. all the K RFs that are iterated over
    3. all of the B bootstrap RFs that are run
    4. all the B*M RITs that are run on the bootstrap RFs
    5. the stability score

This is a lot of returned data!

It will be useful when we build the interface later

Let's run it!


In [13]:
all_rf_weights, all_K_iter_rf_data, \
all_rf_bootstrap_output, all_rit_bootstrap_output, \
stability_score = irf_utils.run_iRF(X_train=X_train,
                                    X_test=X_test,
                                    y_train=y_train,
                                    y_test=y_test,
                                    K=5,
                                    n_estimators=20,
                                    B=30,
                                    random_state_classifier=2018,
                                    propn_n_samples=.2,
                                    bin_class_type=1,
                                    M=20,
                                    max_depth=5,
                                    noisy_split=False,
                                    num_splits=2,
                                    n_estimators_bootstrap=5)

Examine the stability scores


In [14]:
irf_utils._get_histogram(stability_score, sort=True)


That's interesting - features 22, 27, 20, and 23 keep popping up!

We should probably compare these against the feature importances to see whether there is a useful correlation
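
To read off the most stable interactions directly, the returned dictionary can be sorted (a sketch, assuming stability_score maps interaction strings such as '22_27' to the proportion of bootstrap RITs in which each interaction appears):

    top = sorted(stability_score.items(), key=lambda kv: kv[1],
                 reverse=True)
    for interaction, score in top[:6]:
        print(interaction, round(score, 3))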

Examine feature importances

In particular, let us see how they change over the K iterations of random forest


In [15]:
for k in range(5): 
    
    iteration = "rf_iter{}".format(k)
    
    feature_importances_std = all_K_iter_rf_data[iteration]['feature_importances_std']
    feature_importances_rank_idx = all_K_iter_rf_data[iteration]['feature_importances_rank_idx']
    feature_importances = all_K_iter_rf_data[iteration]['feature_importances']

    plt.figure(figsize=(8, 6))
    title = "Feature importances; iteration = {}".format(k)
    plt.title(title)
    plt.bar(range(X_train.shape[1]),
            feature_importances[feature_importances_rank_idx],
            color="r",
            yerr=feature_importances_std[feature_importances_rank_idx],
            align="center")
    plt.xticks(range(X_train.shape[1]), feature_importances_rank_idx,
               rotation='vertical')
    plt.xlim([-1, X_train.shape[1]])
    plt.show()


Some Observations

  • Note that after 5 iterations, the most important features were found to be 22, 27, 7, and 23
  • Recall that the most stable interactions were '22_27', '7_22', '7_22_27', '23_27', '7_27', and '22_23_27'
  • Given the overlap between the top-ranked features and the stable interactions, these results are consistent

Explore iRF Data Further

We can look at the decision paths of the Kth (final-iteration) RF

Let's start with its key validation metrics


In [16]:
irf_jupyter_utils.pretty_print_dict(all_K_iter_rf_data['rf_iter4']['rf_validation_metrics'])


{   'accuracy_score': 0.96491228070175439,
    'confusion_matrix': array([[12,  2],
       [ 0, 43]]),
    'f1_score': 0.97727272727272729,
    'hamming_loss': 0.035087719298245612,
    'log_loss': 1.2119149470996806,
    'precision_score': 0.9555555555555556,
    'recall_score': 1.0,
    'zero_one_loss': 0.035087719298245612}
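
These are standard scikit-learn metrics; a sketch of how such a dictionary is presumably assembled from test-set predictions:

    from sklearn import metrics

    rf4 = all_K_iter_rf_data['rf_iter4']['rf_obj']
    y_pred = rf4.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
    print(metrics.f1_score(y_test, y_pred))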

In [17]:
# Now plot the trees individually
irf_jupyter_utils.draw_tree(decision_tree=all_K_iter_rf_data['rf_iter4']['rf_obj'].estimators_[0])


We can get this data quite easily in a convenient format


In [18]:
irf_jupyter_utils.pretty_print_dict(
    all_K_iter_rf_data['rf_iter4']['dtree0']['all_leaf_paths_features'])


[   array([ 7, 23,  7, 21, 23]),
    array([ 7, 23,  7, 21, 23, 22]),
    array([ 7, 23,  7, 21, 23, 22]),
    array([ 7, 23,  7, 21, 27]),
    array([ 7, 23,  7, 21, 27]),
    array([ 7, 23,  7, 22]),
    array([ 7, 23,  7, 22,  7]),
    array([ 7, 23,  7, 22,  7]),
    array([ 7, 23, 27]),
    array([ 7, 23, 27, 21]),
    array([ 7, 23, 27, 21]),
    array([ 7, 27, 27]),
    array([ 7, 27, 27]),
    array([ 7, 27, 22]),
    array([ 7, 27, 22, 21]),
    array([ 7, 27, 22, 21, 26]),
    array([ 7, 27, 22, 21, 26, 23, 21]),
    array([ 7, 27, 22, 21, 26, 23, 21]),
    array([ 7, 27, 22, 21, 26, 23])]

This checks nicely against the plotted diagram above.

In fact, we can go further and extract some more interesting data from the decision trees

  • This can help us understand variable interactions better

In [19]:
irf_jupyter_utils.pretty_print_dict(
    all_K_iter_rf_data['rf_iter4']['dtree0']['all_leaf_node_values'])


[   array([[  0, 252]]),
    array([[2, 0]]),
    array([[ 0, 32]]),
    array([[ 0, 15]]),
    array([[7, 0]]),
    array([[0, 8]]),
    array([[7, 0]]),
    array([[0, 2]]),
    array([[0, 1]]),
    array([[0, 1]]),
    array([[9, 0]]),
    array([[1, 0]]),
    array([[0, 6]]),
    array([[0, 1]]),
    array([[0, 1]]),
    array([[0, 1]]),
    array([[0, 1]]),
    array([[6, 0]]),
    array([[159,   0]])]

We can also look at how frequently each feature appears along the decision paths


In [21]:
irf_utils._hist_features(all_K_iter_rf_data['rf_iter4'], n_estimators=20,
                         title='Frequency of features along decision paths: iteration = 4')


The most common features were 27, 22, 23, and 7. This matches well with the feature importance plot above.
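
A sketch of the underlying tally, assuming _hist_features counts feature occurrences across the leaf paths of every tree, and that the per-tree keys run dtree0 ... dtree19 following the dtree0 naming used above:

    from collections import Counter

    counts = Counter()
    for t in range(20):  # n_estimators = 20
        tree_data = all_K_iter_rf_data['rf_iter4']['dtree{}'.format(t)]
        for path in tree_data['all_leaf_paths_features']:
            counts.update(path)
    print(counts.most_common(4))  # expect 27, 22, 23, 7 near the top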

Run some Sanity Checks

Run iRF for just 1 iteration - this should reduce to the uniform-sampling version

This is just a sanity check: the feature importances from iRF after 1 iteration should match the feature importances from a standard RF


In [17]:
all_K_iter_rf_data.keys()
print(all_K_iter_rf_data['rf_iter0']['feature_importances'])


[ 0.0378531   0.01143301  0.00917885  0.0389645   0.00570286  0.00355542
  0.04189169  0.04736217  0.00276282  0.00375928  0.00758323  0.00132309
  0.00668335  0.00753545  0.0027825   0.0043194   0.00388132  0.00557761
  0.00168787  0.00332904  0.26950483  0.01216281  0.12508337  0.16629184
  0.01540501  0.00929946  0.03860752  0.0959075   0.01366094  0.00691016]

Compare to the original single fitted random forest


In [18]:
rf = RandomForestClassifier(n_estimators=20, random_state=2018)
rf.fit(X=X_train, y=y_train)
print(rf.feature_importances_)


[ 0.0378531   0.01143301  0.00917885  0.0389645   0.00570286  0.00355542
  0.04189169  0.04736217  0.00276282  0.00375928  0.00758323  0.00132309
  0.00668335  0.00753545  0.0027825   0.0043194   0.00388132  0.00557761
  0.00168787  0.00332904  0.26950483  0.01216281  0.12508337  0.16629184
  0.01540501  0.00929946  0.03860752  0.0959075   0.01366094  0.00691016]

And they match perfectly as expected.
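
The same visual comparison can be made programmatic with a one-line assertion (a sketch; exact equality holds because both fits share random_state=2018):

    assert np.array_equal(
        all_K_iter_rf_data['rf_iter0']['feature_importances'],
        rf.feature_importances_)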

End of wrapper test