Key Requirements for the iRF scikit-learn implementation

  • The following is a documentation of the main requirements for the iRF implementation

Typical Setup

Import the required dependencies

  • In particular irf_utils and irf_jupyter_utils

In [15]:
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
import numpy as np
from functools import reduce

# Needed for the scikit-learn wrapper function
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from math import ceil

# Import our custom utilities
from imp import reload
from utils import irf_jupyter_utils
from utils import irf_utils
reload(irf_jupyter_utils)
reload(irf_utils)


Out[15]:
<module 'utils.irf_utils' from '/Users/shamindras/PERSONAL/LEARNING/REPOS/scikit-learn-sandbox/jupyter/utils/irf_utils.py'>

Step 1: Fit the Initial Random Forest

  • Just fit every feature with equal weights per the usual random forest code e.g. DecisionForestClassifier in scikit-learn

In [16]:
load_breast_cancer = load_breast_cancer()

In [56]:
X_train, X_test, y_train, y_test, rf = irf_jupyter_utils.generate_rf_example(n_estimators=1000, 
                                                                             feature_weight=None)

Check out the data


In [57]:
print("Training feature dimensions", X_train.shape, sep = ":\n")
print("\n")
print("Training outcome dimensions", y_train.shape, sep = ":\n")
print("\n")
print("Test feature dimensions", X_test.shape, sep = ":\n")
print("\n")
print("Test outcome dimensions", y_test.shape, sep = ":\n")
print("\n")
print("first 5 rows of the training set features", X_train[:2], sep = ":\n")
print("\n")
print("first 5 rows of the training set outcomes", y_train[:2], sep = ":\n")


Training feature dimensions:
(512, 30)


Training outcome dimensions:
(512,)


Test feature dimensions:
(57, 30)


Test outcome dimensions:
(57,)


first 5 rows of the training set features:
[[  1.98900000e+01   2.02600000e+01   1.30500000e+02   1.21400000e+03
    1.03700000e-01   1.31000000e-01   1.41100000e-01   9.43100000e-02
    1.80200000e-01   6.18800000e-02   5.07900000e-01   8.73700000e-01
    3.65400000e+00   5.97000000e+01   5.08900000e-03   2.30300000e-02
    3.05200000e-02   1.17800000e-02   1.05700000e-02   3.39100000e-03
    2.37300000e+01   2.52300000e+01   1.60500000e+02   1.64600000e+03
    1.41700000e-01   3.30900000e-01   4.18500000e-01   1.61300000e-01
    2.54900000e-01   9.13600000e-02]
 [  2.01800000e+01   1.95400000e+01   1.33800000e+02   1.25000000e+03
    1.13300000e-01   1.48900000e-01   2.13300000e-01   1.25900000e-01
    1.72400000e-01   6.05300000e-02   4.33100000e-01   1.00100000e+00
    3.00800000e+00   5.24900000e+01   9.08700000e-03   2.71500000e-02
    5.54600000e-02   1.91000000e-02   2.45100000e-02   4.00500000e-03
    2.20300000e+01   2.50700000e+01   1.46000000e+02   1.47900000e+03
    1.66500000e-01   2.94200000e-01   5.30800000e-01   2.17300000e-01
    3.03200000e-01   8.07500000e-02]]


first 5 rows of the training set outcomes:
[0 0]

Step 2: Get all Random Forest and Decision Tree Data

  • Extract in a single dictionary the random forest data and for all of it's decision trees
  • This is as required for RIT purposes

In [58]:
all_rf_tree_data = irf_utils.get_rf_tree_data(rf=rf,
                                              X_train=X_train, y_train=y_train, 
                                              X_test=X_test, y_test=y_test)

STEP 3: Get the RIT data and produce RITs


In [20]:
all_rit_tree_data = irf_utils.get_rit_tree_data(
    all_rf_tree_data=all_rf_tree_data,
    bin_class_type=1,
    random_state=12,
    M=100,
    max_depth=2,
    noisy_split=False,
    num_splits=2)

In [21]:
#for i in range(100):
#    print(all_rit_tree_data['rit{}'.format(i)]['rit_leaf_node_union_value'])

Perform Manual CHECKS on the irf_utils

  • These should be converted to unit tests and checked with nosetests -v test_irf_utils.py

Step 4: Plot some Data

List Ranked Feature Importances


In [59]:
# Print the feature ranking
print("Feature ranking:")

feature_importances_rank_idx = all_rf_tree_data['feature_importances_rank_idx']
feature_importances = all_rf_tree_data['feature_importances']

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1
                                   , feature_importances_rank_idx[f]
                                   , feature_importances[feature_importances_rank_idx[f]]))


Feature ranking:
1. feature 22 (0.141571)
2. feature 23 (0.118066)
3. feature 27 (0.118023)
4. feature 20 (0.116496)
5. feature 7 (0.083462)
6. feature 3 (0.053727)
7. feature 2 (0.046495)
8. feature 0 (0.046333)
9. feature 6 (0.046172)
10. feature 13 (0.037305)
11. feature 26 (0.033145)
12. feature 21 (0.018278)
13. feature 25 (0.015116)
14. feature 5 (0.014100)
15. feature 10 (0.013705)
16. feature 1 (0.012798)
17. feature 24 (0.011960)
18. feature 12 (0.011029)
19. feature 28 (0.009534)
20. feature 29 (0.006791)
21. feature 4 (0.006139)
22. feature 16 (0.005647)
23. feature 17 (0.005003)
24. feature 14 (0.004568)
25. feature 11 (0.004414)
26. feature 18 (0.004412)
27. feature 19 (0.004295)
28. feature 15 (0.003972)
29. feature 9 (0.003792)
30. feature 8 (0.003654)

Plot Ranked Feature Importances


In [23]:
# Plot the feature importances of the forest
feature_importances_std = all_rf_tree_data['feature_importances_std']

plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1])
        , feature_importances[feature_importances_rank_idx]
        , color="r"
        , yerr = feature_importances_std[feature_importances_rank_idx], align="center")
plt.xticks(range(X_train.shape[1]), feature_importances_rank_idx)
plt.xlim([-1, X_train.shape[1]])
plt.show()


Decision Tree 0 (First) - Get output

Check the output against the decision tree graph


In [24]:
# Now plot the trees individually
#irf_jupyter_utils.draw_tree(decision_tree = all_rf_tree_data['rf_obj'].estimators_[0])

Compare to our dict of extracted data from the tree


In [25]:
#irf_jupyter_utils.pretty_print_dict(inp_dict = all_rf_tree_data['dtree0'])

In [26]:
# Count the number of samples passing through the leaf nodes
sum(all_rf_tree_data['dtree0']['tot_leaf_node_values'])


Out[26]:
512

Check output against the diagram


In [27]:
#irf_jupyter_utils.pretty_print_dict(inp_dict = all_rf_tree_data['dtree0']['all_leaf_paths_features'])

Wrapper function for iRF


In [45]:
def run_rit(X_train,
            X_test,
            y_train,
            y_test,
            K=7,
            n_estimators=20,
            B=10,
            random_state_classifier=2018,
            propn_n_samples=0.2,
            bin_class_type=1,
            random_state=12,
            M=4,
            max_depth=2,
            noisy_split=False,
            num_splits=2):
    """ This function will allow us to run the RIT
        for the given parameters
    """

    # Set the random state for reproducibility
    np.random.seed(random_state_classifier)

    # Convert the bootstrap resampling proportion to the number
    # of rows to resample from the training data
    n_samples = ceil(propn_n_samples * X_train.shape[0])

    # Initialize dictionary of rf weights
    # CHECK: change this name to be `all_rf_weights_output`
    all_rf_weights = {}

    # Initialize dictionary of bootstrap rf output
    all_rf_bootstrap_output = {}

    # Initialize dictionary of bootstrap RIT output
    all_rit_bootstrap_output = {}

    for k in range(K):
        if k == 0:

            # Initially feature weights are None
            feature_importances = None

            # Update the dictionary of all our RF weights
            all_rf_weights["rf_weight{}".format(k)] = feature_importances

            # fit RF feature weights i.e. initially None
            rf = RandomForestClassifier(n_estimators=n_estimators)

            # fit the classifier
            rf.fit(
                X=X_train,
                y=y_train,
                feature_weight=all_rf_weights["rf_weight{}".format(k)])

            # Update feature weights using the
            # new feature importance score
            feature_importances = rf.feature_importances_

            # Load the weights for the next iteration
            all_rf_weights["rf_weight{}".format(k + 1)] = feature_importances

        else:
            # fit weighted RF
            # Use the weights from the previous iteration
            rf = RandomForestClassifier(n_estimators=n_estimators)

            # fit the classifier
            rf.fit(
                X=X_train,
                y=y_train,
                feature_weight=all_rf_weights["rf_weight{}".format(k)])

            # Update feature weights using the
            # new feature importance score
            feature_importances = rf.feature_importances_

            # Load the weights for the next iteration
            all_rf_weights["rf_weight{}".format(k + 1)] = feature_importances

    # Run the RITs
    for b in range(B):

        # Take a bootstrap sample from the training data
        # based on the specified user proportion
        X_train_rsmpl, y_rsmpl = resample(
            X_train, y_train, n_samples=n_samples)

        # Set up the weighted random forest
        # Using the weight from the (K-1)th iteration i.e. RF(w(K))
        rf_bootstrap = RandomForestClassifier(
            #CHECK: different number of trees to fit for bootstrap samples
            n_estimators=n_estimators)

        # Fit RF(w(K)) on the bootstrapped dataset
        rf_bootstrap.fit(
            X=X_train_rsmpl,
            y=y_rsmpl,
            feature_weight=all_rf_weights["rf_weight{}".format(K)])

        # All RF tree data
        # CHECK: why do we need y_train here?
        all_rf_tree_data = irf_utils.get_rf_tree_data(
            rf=rf_bootstrap,
            X_train=X_train_rsmpl,
            y_train=y_rsmpl,
            X_test=X_test,
            y_test=y_test)

        # Update the rf bootstrap output dictionary
        all_rf_bootstrap_output['rf_bootstrap{}'.format(b)] = all_rf_tree_data

        # Run RIT on the interaction rule set
        # CHECK - each of these variables needs to be passed into
        # the main run_rit function
        all_rit_tree_data = irf_utils.get_rit_tree_data(
            all_rf_tree_data=all_rf_tree_data,
            bin_class_type=1,
            random_state=12,
            M=4,
            max_depth=2,
            noisy_split=False,
            num_splits=2)

        # Update the rf bootstrap output dictionary
        # We will reference the RIT for a particular rf bootstrap
        # using the specific bootstrap id - consistent with the
        # rf bootstrap output data
        all_rit_bootstrap_output['rf_bootstrap{}'.format(
            b)] = all_rit_tree_data

    return all_rf_weights, all_rf_bootstrap_output, all_rit_bootstrap_output

Run the iRF function


In [62]:
all_rf_weights, all_rf_bootstrap_output, all_rit_bootstrap_output =\
run_rit(X_train=X_train,
        X_test=X_test,
        y_train=y_train,
        y_test=y_test,
        K=6,
        n_estimators=20,
        B=10,
        random_state_classifier=2018,
        propn_n_samples=0.2,
        bin_class_type=1,
        random_state=12,
        M=4,
        max_depth=2,
        noisy_split=False,
        num_splits=2)

In [70]:
all_rit_bootstrap_output['rf_bootstrap1']


Out[70]:
{'rit0': {'rit': <utils.irf_utils.RITTree at 0x11e2c52e8>,
  'rit_intersected_values': [array([27]), array([], dtype=int64), array([27])],
  'rit_leaf_node_union_value': array([27]),
  'rit_leaf_node_values': [array([], dtype=int64), array([27])]},
 'rit1': {'rit': <utils.irf_utils.RITTree at 0x111ddafd0>,
  'rit_intersected_values': [array([23, 27]), array([27]), array([23, 27])],
  'rit_leaf_node_union_value': array([23, 27]),
  'rit_leaf_node_values': [array([27]), array([23, 27])]},
 'rit2': {'rit': <utils.irf_utils.RITTree at 0x11d7e5ac8>,
  'rit_intersected_values': [array([22, 23, 27]), array([27]), array([23])],
  'rit_leaf_node_union_value': array([23, 27]),
  'rit_leaf_node_values': [array([27]), array([23])]},
 'rit3': {'rit': <utils.irf_utils.RITTree at 0x11e29d390>,
  'rit_intersected_values': [array([22, 27]),
   array([22, 27]),
   array([22, 27])],
  'rit_leaf_node_union_value': array([22, 27]),
  'rit_leaf_node_values': [array([22, 27]), array([22, 27])]}}
all_rit_bootstrap_output

Run iRF for just 1 iteration - should be the uniform sampling version


In [49]:
all_rf_weights_1iter, all_rf_bootstrap_output_1iter, all_rit_bootstrap_output_1iter =\
run_rit(X_train=X_train,
        X_test=X_test,
        y_train=y_train,
        y_test=y_test,
        K=1,
        n_estimators=1000,
        B=10,
        random_state_classifier=2018,
        propn_n_samples=0.2,
        bin_class_type=1,
        random_state=12,
        M=4,
        max_depth=2,
        noisy_split=False,
        num_splits=2)

In [55]:
print(all_rf_weights_1iter['rf_weight1'])


[ 0.04633261  0.01279795  0.04649524  0.05372707  0.00613899  0.01410021
  0.04617194  0.08346155  0.00365378  0.00379191  0.01370514  0.00441408
  0.01102853  0.03730455  0.00456819  0.00397247  0.00564707  0.00500305
  0.00441179  0.00429459  0.11649582  0.01827778  0.14157085  0.11806595
  0.01195991  0.01511598  0.03314478  0.11802327  0.00953361  0.00679134]

Compare to the original single fitted random forest (top of the notebook)!


In [60]:
rf.feature_importances_


Out[60]:
array([ 0.04633261,  0.01279795,  0.04649524,  0.05372707,  0.00613899,
        0.01410021,  0.04617194,  0.08346155,  0.00365378,  0.00379191,
        0.01370514,  0.00441408,  0.01102853,  0.03730455,  0.00456819,
        0.00397247,  0.00564707,  0.00500305,  0.00441179,  0.00429459,
        0.11649582,  0.01827778,  0.14157085,  0.11806595,  0.01195991,
        0.01511598,  0.03314478,  0.11802327,  0.00953361,  0.00679134])

These look like they match as required!


In [61]:
rf_weight5 = np.ndarray.tolist(all_rf_weights['rf_weight1'])
rf_weight5


Out[61]:
[0.03785310470431957,
 0.011433014482771538,
 0.009178847665561797,
 0.038964497097170335,
 0.005702864300092174,
 0.0035554200305757616,
 0.041891686998428004,
 0.04736216645722789,
 0.0027628245777236277,
 0.003759277392257268,
 0.0075832267680531916,
 0.0013230930451214183,
 0.006683350681223972,
 0.007535448498366222,
 0.0027824980670987315,
 0.0043193967779257,
 0.0038813156721859963,
 0.005577613495435727,
 0.0016878723423762096,
 0.003329040950739148,
 0.2695048270955845,
 0.012162811436513673,
 0.1250833695561254,
 0.16629184413751788,
 0.015405008680302696,
 0.00929945634990437,
 0.03860751552331201,
 0.09590750167873455,
 0.013660944440826783,
 0.006910161096523803]

#


In [48]:
sorted([i for i, e in enumerate(rf_weight10) if e != 0])


Out[48]:
[1, 7, 12, 13, 20, 21, 22, 23, 26, 27]

In [83]:



Out[83]:
{'rf_weight0': None,
 'rf_weight1': array([ 0.10767154,  0.00899294,  0.04092015,  0.02001948,  0.00387857,
         0.00461486,  0.01078431,  0.04536844,  0.00452756,  0.00619205,
         0.00808735,  0.00440133,  0.0287609 ,  0.01211674,  0.00581382,
         0.00285006,  0.0041309 ,  0.00331072,  0.00168265,  0.00553938,
         0.14002331,  0.01362173,  0.17859298,  0.05971482,  0.00724069,
         0.01703418,  0.09629764,  0.1408401 ,  0.01194766,  0.00502317]),
 'rf_weight2': array([  4.44435012e-02,   3.40469419e-03,   4.96289277e-03,
          3.15592324e-03,   0.00000000e+00,   0.00000000e+00,
          2.34638107e-03,   8.55088015e-02,   0.00000000e+00,
          2.00859319e-03,   1.59775615e-03,   4.19349026e-04,
          7.67580520e-03,   2.78187346e-03,   1.09017353e-04,
          2.72543383e-04,   3.79613998e-04,   0.00000000e+00,
          2.23560029e-03,   1.36390833e-03,   1.44290207e-01,
          6.75257965e-03,   3.71650639e-01,   1.57358155e-01,
          1.61530818e-03,   1.85282571e-03,   2.52684472e-02,
          1.25694971e-01,   2.50338925e-03,   3.47222222e-04]),
 'rf_weight3': array([  1.17762395e-02,   6.48715393e-03,   2.72487516e-03,
          8.57126726e-04,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.28470191e-01,   0.00000000e+00,
          0.00000000e+00,   4.24679020e-04,   0.00000000e+00,
          3.41816566e-03,   2.75400676e-03,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.02065118e-03,   1.16568428e-04,   1.80568022e-02,
          5.96746789e-03,   4.55458453e-01,   1.22361870e-01,
          0.00000000e+00,   0.00000000e+00,   1.59286079e-02,
          2.22943752e-01,   8.22389199e-04,   4.11000690e-04]),
 'rf_weight4': array([  8.17518952e-03,   8.65331085e-03,   8.18913380e-04,
          2.86319839e-04,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.80483283e-01,   0.00000000e+00,
          0.00000000e+00,   6.10276905e-05,   0.00000000e+00,
          3.33487481e-03,   5.10494922e-03,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.59934073e-03,   0.00000000e+00,   4.76677734e-02,
          2.12361339e-03,   3.95643473e-01,   1.14104906e-01,
          0.00000000e+00,   0.00000000e+00,   7.87261664e-03,
          2.21135044e-01,   2.30587320e-03,   6.29491349e-04]),
 'rf_weight5': array([ 0.00096902,  0.01495549,  0.00103937,  0.        ,  0.        ,
         0.        ,  0.        ,  0.08071276,  0.        ,  0.        ,
         0.        ,  0.        ,  0.00323576,  0.00223953,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.01689564,  0.0032026 ,  0.47753928,  0.1439653 ,  0.        ,
         0.        ,  0.01072254,  0.24360832,  0.00091439,  0.        ])}