Key Requirements for the iRF scikit-learn implementation

The following is a documentation of the main requirements for the iRF implementation

Typical Setup

Import the required dependencies

In particular irf_utils and irf_jupyter_utils



In [15]:

    
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
import numpy as np
from functools import reduce

# Needed for the scikit-learn wrapper function
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from math import ceil

# Import our custom utilities
from imp import reload
from utils import irf_jupyter_utils
from utils import irf_utils
reload(irf_jupyter_utils)
reload(irf_utils)









    Out[15]:





<module 'utils.irf_utils' from '/Users/shamindras/PERSONAL/LEARNING/REPOS/scikit-learn-sandbox/jupyter/utils/irf_utils.py'>

Step 1: Fit the Initial Random Forest

Just fit every feature with equal weights per the usual random forest code e.g. DecisionForestClassifier in scikit-learn



In [16]:

    
load_breast_cancer = load_breast_cancer()



In [56]:

    
X_train, X_test, y_train, y_test, rf = irf_jupyter_utils.generate_rf_example(n_estimators=1000, 
                                                                             feature_weight=None)

Check out the data



In [57]:

    
print("Training feature dimensions", X_train.shape, sep = ":\n")
print("\n")
print("Training outcome dimensions", y_train.shape, sep = ":\n")
print("\n")
print("Test feature dimensions", X_test.shape, sep = ":\n")
print("\n")
print("Test outcome dimensions", y_test.shape, sep = ":\n")
print("\n")
print("first 5 rows of the training set features", X_train[:2], sep = ":\n")
print("\n")
print("first 5 rows of the training set outcomes", y_train[:2], sep = ":\n")









    



Training feature dimensions:
(512, 30)


Training outcome dimensions:
(512,)


Test feature dimensions:
(57, 30)


Test outcome dimensions:
(57,)


first 5 rows of the training set features:
[[  1.98900000e+01   2.02600000e+01   1.30500000e+02   1.21400000e+03
    1.03700000e-01   1.31000000e-01   1.41100000e-01   9.43100000e-02
    1.80200000e-01   6.18800000e-02   5.07900000e-01   8.73700000e-01
    3.65400000e+00   5.97000000e+01   5.08900000e-03   2.30300000e-02
    3.05200000e-02   1.17800000e-02   1.05700000e-02   3.39100000e-03
    2.37300000e+01   2.52300000e+01   1.60500000e+02   1.64600000e+03
    1.41700000e-01   3.30900000e-01   4.18500000e-01   1.61300000e-01
    2.54900000e-01   9.13600000e-02]
 [  2.01800000e+01   1.95400000e+01   1.33800000e+02   1.25000000e+03
    1.13300000e-01   1.48900000e-01   2.13300000e-01   1.25900000e-01
    1.72400000e-01   6.05300000e-02   4.33100000e-01   1.00100000e+00
    3.00800000e+00   5.24900000e+01   9.08700000e-03   2.71500000e-02
    5.54600000e-02   1.91000000e-02   2.45100000e-02   4.00500000e-03
    2.20300000e+01   2.50700000e+01   1.46000000e+02   1.47900000e+03
    1.66500000e-01   2.94200000e-01   5.30800000e-01   2.17300000e-01
    3.03200000e-01   8.07500000e-02]]


first 5 rows of the training set outcomes:
[0 0]

Step 2: Get all Random Forest and Decision Tree Data

Extract in a single dictionary the random forest data and for all of it's decision trees
This is as required for RIT purposes



In [58]:

    
all_rf_tree_data = irf_utils.get_rf_tree_data(rf=rf,
                                              X_train=X_train, y_train=y_train, 
                                              X_test=X_test, y_test=y_test)

STEP 3: Get the RIT data and produce RITs



In [20]:

    
all_rit_tree_data = irf_utils.get_rit_tree_data(
    all_rf_tree_data=all_rf_tree_data,
    bin_class_type=1,
    random_state=12,
    M=100,
    max_depth=2,
    noisy_split=False,
    num_splits=2)



In [21]:

    
#for i in range(100):
#    print(all_rit_tree_data['rit{}'.format(i)]['rit_leaf_node_union_value'])

Perform Manual CHECKS on the `irf_utils`

These should be converted to unit tests and checked with nosetests -v test_irf_utils.py

Step 4: Plot some Data

List Ranked Feature Importances



In [59]:

    
# Print the feature ranking
print("Feature ranking:")

feature_importances_rank_idx = all_rf_tree_data['feature_importances_rank_idx']
feature_importances = all_rf_tree_data['feature_importances']

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1
                                   , feature_importances_rank_idx[f]
                                   , feature_importances[feature_importances_rank_idx[f]]))









    



Feature ranking:
1. feature 22 (0.141571)
2. feature 23 (0.118066)
3. feature 27 (0.118023)
4. feature 20 (0.116496)
5. feature 7 (0.083462)
6. feature 3 (0.053727)
7. feature 2 (0.046495)
8. feature 0 (0.046333)
9. feature 6 (0.046172)
10. feature 13 (0.037305)
11. feature 26 (0.033145)
12. feature 21 (0.018278)
13. feature 25 (0.015116)
14. feature 5 (0.014100)
15. feature 10 (0.013705)
16. feature 1 (0.012798)
17. feature 24 (0.011960)
18. feature 12 (0.011029)
19. feature 28 (0.009534)
20. feature 29 (0.006791)
21. feature 4 (0.006139)
22. feature 16 (0.005647)
23. feature 17 (0.005003)
24. feature 14 (0.004568)
25. feature 11 (0.004414)
26. feature 18 (0.004412)
27. feature 19 (0.004295)
28. feature 15 (0.003972)
29. feature 9 (0.003792)
30. feature 8 (0.003654)

Plot Ranked Feature Importances



In [23]:

    
# Plot the feature importances of the forest
feature_importances_std = all_rf_tree_data['feature_importances_std']

plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1])
        , feature_importances[feature_importances_rank_idx]
        , color="r"
        , yerr = feature_importances_std[feature_importances_rank_idx], align="center")
plt.xticks(range(X_train.shape[1]), feature_importances_rank_idx)
plt.xlim([-1, X_train.shape[1]])
plt.show()

Decision Tree 0 (First) - Get output

Check the output against the decision tree graph



In [24]:

    
# Now plot the trees individually
#irf_jupyter_utils.draw_tree(decision_tree = all_rf_tree_data['rf_obj'].estimators_[0])

Compare to our dict of extracted data from the tree



In [25]:

    
#irf_jupyter_utils.pretty_print_dict(inp_dict = all_rf_tree_data['dtree0'])



In [26]:

    
# Count the number of samples passing through the leaf nodes
sum(all_rf_tree_data['dtree0']['tot_leaf_node_values'])









    Out[26]:





512

Check output against the diagram



In [27]:

    
#irf_jupyter_utils.pretty_print_dict(inp_dict = all_rf_tree_data['dtree0']['all_leaf_paths_features'])

Wrapper function for iRF



In [45]:

    
def run_rit(X_train,
            X_test,
            y_train,
            y_test,
            K=7,
            n_estimators=20,
            B=10,
            random_state_classifier=2018,
            propn_n_samples=0.2,
            bin_class_type=1,
            random_state=12,
            M=4,
            max_depth=2,
            noisy_split=False,
            num_splits=2):
    """ This function will allow us to run the RIT
        for the given parameters
    """

    # Set the random state for reproducibility
    np.random.seed(random_state_classifier)

    # Convert the bootstrap resampling proportion to the number
    # of rows to resample from the training data
    n_samples = ceil(propn_n_samples * X_train.shape[0])

    # Initialize dictionary of rf weights
    # CHECK: change this name to be `all_rf_weights_output`
    all_rf_weights = {}

    # Initialize dictionary of bootstrap rf output
    all_rf_bootstrap_output = {}

    # Initialize dictionary of bootstrap RIT output
    all_rit_bootstrap_output = {}

    for k in range(K):
        if k == 0:

            # Initially feature weights are None
            feature_importances = None

            # Update the dictionary of all our RF weights
            all_rf_weights["rf_weight{}".format(k)] = feature_importances

            # fit RF feature weights i.e. initially None
            rf = RandomForestClassifier(n_estimators=n_estimators)

            # fit the classifier
            rf.fit(
                X=X_train,
                y=y_train,
                feature_weight=all_rf_weights["rf_weight{}".format(k)])

            # Update feature weights using the
            # new feature importance score
            feature_importances = rf.feature_importances_

            # Load the weights for the next iteration
            all_rf_weights["rf_weight{}".format(k + 1)] = feature_importances

        else:
            # fit weighted RF
            # Use the weights from the previous iteration
            rf = RandomForestClassifier(n_estimators=n_estimators)

            # fit the classifier
            rf.fit(
                X=X_train,
                y=y_train,
                feature_weight=all_rf_weights["rf_weight{}".format(k)])

            # Update feature weights using the
            # new feature importance score
            feature_importances = rf.feature_importances_

            # Load the weights for the next iteration
            all_rf_weights["rf_weight{}".format(k + 1)] = feature_importances

    # Run the RITs
    for b in range(B):

        # Take a bootstrap sample from the training data
        # based on the specified user proportion
        X_train_rsmpl, y_rsmpl = resample(
            X_train, y_train, n_samples=n_samples)

        # Set up the weighted random forest
        # Using the weight from the (K-1)th iteration i.e. RF(w(K))
        rf_bootstrap = RandomForestClassifier(
            #CHECK: different number of trees to fit for bootstrap samples
            n_estimators=n_estimators)

        # Fit RF(w(K)) on the bootstrapped dataset
        rf_bootstrap.fit(
            X=X_train_rsmpl,
            y=y_rsmpl,
            feature_weight=all_rf_weights["rf_weight{}".format(K)])

        # All RF tree data
        # CHECK: why do we need y_train here?
        all_rf_tree_data = irf_utils.get_rf_tree_data(
            rf=rf_bootstrap,
            X_train=X_train_rsmpl,
            y_train=y_rsmpl,
            X_test=X_test,
            y_test=y_test)

        # Update the rf bootstrap output dictionary
        all_rf_bootstrap_output['rf_bootstrap{}'.format(b)] = all_rf_tree_data

        # Run RIT on the interaction rule set
        # CHECK - each of these variables needs to be passed into
        # the main run_rit function
        all_rit_tree_data = irf_utils.get_rit_tree_data(
            all_rf_tree_data=all_rf_tree_data,
            bin_class_type=1,
            random_state=12,
            M=4,
            max_depth=2,
            noisy_split=False,
            num_splits=2)

        # Update the rf bootstrap output dictionary
        # We will reference the RIT for a particular rf bootstrap
        # using the specific bootstrap id - consistent with the
        # rf bootstrap output data
        all_rit_bootstrap_output['rf_bootstrap{}'.format(
            b)] = all_rit_tree_data

    return all_rf_weights, all_rf_bootstrap_output, all_rit_bootstrap_output

Run the iRF function



In [62]:

    
all_rf_weights, all_rf_bootstrap_output, all_rit_bootstrap_output =\
run_rit(X_train=X_train,
        X_test=X_test,
        y_train=y_train,
        y_test=y_test,
        K=6,
        n_estimators=20,
        B=10,
        random_state_classifier=2018,
        propn_n_samples=0.2,
        bin_class_type=1,
        random_state=12,
        M=4,
        max_depth=2,
        noisy_split=False,
        num_splits=2)



In [70]:

    
all_rit_bootstrap_output['rf_bootstrap1']









    Out[70]:





{'rit0': {'rit': <utils.irf_utils.RITTree at 0x11e2c52e8>,
  'rit_intersected_values': [array([27]), array([], dtype=int64), array([27])],
  'rit_leaf_node_union_value': array([27]),
  'rit_leaf_node_values': [array([], dtype=int64), array([27])]},
 'rit1': {'rit': <utils.irf_utils.RITTree at 0x111ddafd0>,
  'rit_intersected_values': [array([23, 27]), array([27]), array([23, 27])],
  'rit_leaf_node_union_value': array([23, 27]),
  'rit_leaf_node_values': [array([27]), array([23, 27])]},
 'rit2': {'rit': <utils.irf_utils.RITTree at 0x11d7e5ac8>,
  'rit_intersected_values': [array([22, 23, 27]), array([27]), array([23])],
  'rit_leaf_node_union_value': array([23, 27]),
  'rit_leaf_node_values': [array([27]), array([23])]},
 'rit3': {'rit': <utils.irf_utils.RITTree at 0x11e29d390>,
  'rit_intersected_values': [array([22, 27]),
   array([22, 27]),
   array([22, 27])],
  'rit_leaf_node_union_value': array([22, 27]),
  'rit_leaf_node_values': [array([22, 27]), array([22, 27])]}}

all_rit_bootstrap_output

Run iRF for just 1 iteration - should be the uniform sampling version



In [49]:

    
all_rf_weights_1iter, all_rf_bootstrap_output_1iter, all_rit_bootstrap_output_1iter =\
run_rit(X_train=X_train,
        X_test=X_test,
        y_train=y_train,
        y_test=y_test,
        K=1,
        n_estimators=1000,
        B=10,
        random_state_classifier=2018,
        propn_n_samples=0.2,
        bin_class_type=1,
        random_state=12,
        M=4,
        max_depth=2,
        noisy_split=False,
        num_splits=2)



In [55]:

    
print(all_rf_weights_1iter['rf_weight1'])









    



[ 0.04633261  0.01279795  0.04649524  0.05372707  0.00613899  0.01410021
  0.04617194  0.08346155  0.00365378  0.00379191  0.01370514  0.00441408
  0.01102853  0.03730455  0.00456819  0.00397247  0.00564707  0.00500305
  0.00441179  0.00429459  0.11649582  0.01827778  0.14157085  0.11806595
  0.01195991  0.01511598  0.03314478  0.11802327  0.00953361  0.00679134]

Compare to the original single fitted random forest (top of the notebook)!



In [60]:

    
rf.feature_importances_









    Out[60]:





array([ 0.04633261,  0.01279795,  0.04649524,  0.05372707,  0.00613899,
        0.01410021,  0.04617194,  0.08346155,  0.00365378,  0.00379191,
        0.01370514,  0.00441408,  0.01102853,  0.03730455,  0.00456819,
        0.00397247,  0.00564707,  0.00500305,  0.00441179,  0.00429459,
        0.11649582,  0.01827778,  0.14157085,  0.11806595,  0.01195991,
        0.01511598,  0.03314478,  0.11802327,  0.00953361,  0.00679134])

These look like they match as required!



In [61]:

    
rf_weight5 = np.ndarray.tolist(all_rf_weights['rf_weight1'])
rf_weight5









    Out[61]:





[0.03785310470431957,
 0.011433014482771538,
 0.009178847665561797,
 0.038964497097170335,
 0.005702864300092174,
 0.0035554200305757616,
 0.041891686998428004,
 0.04736216645722789,
 0.0027628245777236277,
 0.003759277392257268,
 0.0075832267680531916,
 0.0013230930451214183,
 0.006683350681223972,
 0.007535448498366222,
 0.0027824980670987315,
 0.0043193967779257,
 0.0038813156721859963,
 0.005577613495435727,
 0.0016878723423762096,
 0.003329040950739148,
 0.2695048270955845,
 0.012162811436513673,
 0.1250833695561254,
 0.16629184413751788,
 0.015405008680302696,
 0.00929945634990437,
 0.03860751552331201,
 0.09590750167873455,
 0.013660944440826783,
 0.006910161096523803]



In [48]:

    
sorted([i for i, e in enumerate(rf_weight10) if e != 0])









    Out[48]:





[1, 7, 12, 13, 20, 21, 22, 23, 26, 27]



In [83]:









    Out[83]:





{'rf_weight0': None,
 'rf_weight1': array([ 0.10767154,  0.00899294,  0.04092015,  0.02001948,  0.00387857,
         0.00461486,  0.01078431,  0.04536844,  0.00452756,  0.00619205,
         0.00808735,  0.00440133,  0.0287609 ,  0.01211674,  0.00581382,
         0.00285006,  0.0041309 ,  0.00331072,  0.00168265,  0.00553938,
         0.14002331,  0.01362173,  0.17859298,  0.05971482,  0.00724069,
         0.01703418,  0.09629764,  0.1408401 ,  0.01194766,  0.00502317]),
 'rf_weight2': array([  4.44435012e-02,   3.40469419e-03,   4.96289277e-03,
          3.15592324e-03,   0.00000000e+00,   0.00000000e+00,
          2.34638107e-03,   8.55088015e-02,   0.00000000e+00,
          2.00859319e-03,   1.59775615e-03,   4.19349026e-04,
          7.67580520e-03,   2.78187346e-03,   1.09017353e-04,
          2.72543383e-04,   3.79613998e-04,   0.00000000e+00,
          2.23560029e-03,   1.36390833e-03,   1.44290207e-01,
          6.75257965e-03,   3.71650639e-01,   1.57358155e-01,
          1.61530818e-03,   1.85282571e-03,   2.52684472e-02,
          1.25694971e-01,   2.50338925e-03,   3.47222222e-04]),
 'rf_weight3': array([  1.17762395e-02,   6.48715393e-03,   2.72487516e-03,
          8.57126726e-04,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.28470191e-01,   0.00000000e+00,
          0.00000000e+00,   4.24679020e-04,   0.00000000e+00,
          3.41816566e-03,   2.75400676e-03,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.02065118e-03,   1.16568428e-04,   1.80568022e-02,
          5.96746789e-03,   4.55458453e-01,   1.22361870e-01,
          0.00000000e+00,   0.00000000e+00,   1.59286079e-02,
          2.22943752e-01,   8.22389199e-04,   4.11000690e-04]),
 'rf_weight4': array([  8.17518952e-03,   8.65331085e-03,   8.18913380e-04,
          2.86319839e-04,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.80483283e-01,   0.00000000e+00,
          0.00000000e+00,   6.10276905e-05,   0.00000000e+00,
          3.33487481e-03,   5.10494922e-03,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.59934073e-03,   0.00000000e+00,   4.76677734e-02,
          2.12361339e-03,   3.95643473e-01,   1.14104906e-01,
          0.00000000e+00,   0.00000000e+00,   7.87261664e-03,
          2.21135044e-01,   2.30587320e-03,   6.29491349e-04]),
 'rf_weight5': array([ 0.00096902,  0.01495549,  0.00103937,  0.        ,  0.        ,
         0.        ,  0.        ,  0.08071276,  0.        ,  0.        ,
         0.        ,  0.        ,  0.00323576,  0.00223953,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.01689564,  0.0032026 ,  0.47753928,  0.1439653 ,  0.        ,
         0.        ,  0.01072254,  0.24360832,  0.00091439,  0.        ])}

Key Requirements for the iRF scikit-learn implementation

Typical Setup

Import the required dependencies

Step 1: Fit the Initial Random Forest

Check out the data

Step 2: Get all Random Forest and Decision Tree Data

STEP 3: Get the RIT data and produce RITs

Perform Manual CHECKS on the irf_utils

Step 4: Plot some Data

List Ranked Feature Importances

Plot Ranked Feature Importances

Decision Tree 0 (First) - Get output

Check the output against the decision tree graph

Compare to our dict of extracted data from the tree

Check output against the diagram

Wrapper function for iRF

Run the iRF function

all_rit_bootstrap_output

Run iRF for just 1 iteration - should be the uniform sampling version

Compare to the original single fitted random forest (top of the notebook)!

These look like they match as required!

Perform Manual CHECKS on the `irf_utils`