RITs Pseudocode

RITs inputs

  • M: number of trees to build
  • D: max tree depth
  • p: probability threshold for the number of children sampled at a node, compared against a Uniform(0, 1) random draw (p = 0 turns the random choice off, so every node gets exactly n children)
  • n: min number of children to sample at each node (if p != 0, draw a split probability from Uniform(0, 1) at each node; if it is <= p, sample n children at that node, else sample n + 1 children)

For example, for a strictly binary RIT (always 2 children sampled at each node), set p = 0 and n = 2.
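
The rule for p and n can be sketched as a small helper. This is only one reading of the inputs above, not the final implementation; the name `num_children` and the use of Python's `random.Random` are assumptions made for illustration.

```python
import random

def num_children(n, p, rng):
    """Return how many children to sample at one RIT node.

    Assumed reading of the rule above: with p == 0 every node gets exactly
    n children; otherwise draw u ~ Uniform(0, 1) and use n children when
    u <= p and n + 1 children otherwise.
    """
    if p == 0:
        return n
    u = rng.random()
    return n if u <= p else n + 1

rng = random.Random(12)  # fixed seed so the draws are repeatable

# Binary RIT: p = 0 and n = 2 always gives 2 children
binary_counts = [num_children(n=2, p=0, rng=rng) for _ in range(5)]

# Noisy split: each node gets either 2 or 3 children
noisy_counts = [num_children(n=2, p=0.5, rng=rng) for _ in range(1000)]
```

With p = 0 the random draw is skipped entirely, which keeps the binary case deterministic.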

RITs outputs

Our version of the RITs should output the following:

  • The Node class and the RIT class
  • The list of randomly sampled nodes that we generated, exposed as a generator function (for reproducibility and testing)
  • The entire RITs (for all M trees)
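
As an illustration of the generator idea in the list above, here is a minimal sketch of a seeded generator that yields randomly sampled paths. The name `sample_paths`, the toy paths, and the weights are hypothetical and are not part of `irf_utils`.

```python
import numpy as np

def sample_paths(paths, weights, random_state=None):
    """Yield paths indefinitely, sampled with replacement in proportion to weights."""
    rng = np.random.default_rng(random_state)
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()  # normalise weights to probabilities
    while True:
        idx = rng.choice(len(paths), p=probs)
        yield paths[idx]

# Toy feature paths and weights (made up for the example)
paths = [[0, 3, 7], [3, 7], [1, 3, 7, 9]]
weights = [10, 5, 25]

# Two generators built with the same seed produce the same sequence
gen_a = sample_paths(paths, weights, random_state=12)
gen_b = sample_paths(paths, weights, random_state=12)
first_a = [next(gen_a) for _ in range(5)]
first_b = [next(gen_b) for _ in range(5)]
```

Because the sequence is fully determined by `random_state`, a test can replay exactly the nodes that were used to build a given tree.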

RIT Node class

  • We need to return the rich RIT object
    • The authors mention calculating prevalence and sparsity. How should we best calculate these metrics?
    • Needs to return clean attributes:
      • IsNode
      • HasChildren
      • NumChildren
      • IsLeafNode
      • getIntersectedPath
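
A minimal sketch of what such a node could look like, assuming each node stores the intersection of the feature paths seen on the way down from the root. The class name `RITNodeSketch` and the snake_case attribute names are hypothetical; IsNode is left out because its intended meaning is not spelled out above.

```python
import numpy as np

class RITNodeSketch:
    """Hypothetical RIT node holding the intersected feature path so far."""

    def __init__(self, path):
        self.path = np.asarray(path)  # intersected path at this node
        self.children = []

    @property
    def has_children(self):
        return len(self.children) > 0

    @property
    def num_children(self):
        return len(self.children)

    @property
    def is_leaf(self):
        return not self.has_children

    def get_intersected_path(self):
        return self.path

    def add_child(self, sampled_path):
        # The child keeps only the features shared by this node and the new path
        child = RITNodeSketch(np.intersect1d(self.path, sampled_path))
        self.children.append(child)
        return child

root = RITNodeSketch([1, 3, 5, 7])
child = root.add_child([3, 7, 9])
```

Here `child.get_intersected_path()` holds the features 3 and 7, the only ones common to both paths.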

Summary

  • At its core, the RIT comprises 3 main modules:
  • FILTERING: Subsetting to either the 1's or the 0's
  • RANDOM SAMPLING: Sampling the path-nodes in a weighted manner, with/without replacement, within tree/outside tree
  • INTERSECTION: Intersecting the selected node paths in a systematic manner
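
The three modules can be sketched end to end on toy data. This is a simplified illustration under stated assumptions: the function names, the toy leaf paths, and the use of per-leaf weights as sampling weights are all made up for the example, and the within tree/outside tree option is not covered.

```python
from functools import reduce
import numpy as np

def filter_leaves(paths, labels, weights, keep_class=1):
    """FILTERING: keep only the leaf paths whose predicted class is keep_class."""
    kept = [(p, w) for p, y, w in zip(paths, labels, weights) if y == keep_class]
    return [p for p, _ in kept], [w for _, w in kept]

def sample_weighted(paths, weights, size, rng, replace=True):
    """RANDOM SAMPLING: draw paths in proportion to their weights."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(paths), size=size, replace=replace, p=probs)
    return [paths[i] for i in idx]

def intersect_paths(paths):
    """INTERSECTION: features common to all the sampled paths."""
    return reduce(np.intersect1d, paths)

# Toy leaf data (hypothetical): feature paths, predicted class, weight per leaf
leaf_paths = [[0, 3, 7], [3, 7, 9], [1, 2], [3, 5, 7]]
leaf_labels = [1, 1, 0, 1]
leaf_weights = [10, 5, 20, 25]

rng = np.random.default_rng(12)
kept_paths, kept_weights = filter_leaves(leaf_paths, leaf_labels, leaf_weights)
sampled = sample_weighted(kept_paths, kept_weights, size=2, rng=rng)
common = intersect_paths(sampled)
```

Every class-1 path in the toy data contains features 3 and 7, so `common` always includes both regardless of which two paths are drawn.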

Pseudocode for iRFs and RITs

  • Question for SVW: How to specify random seeds for all K iterations?

from sklearn.ensemble import RandomForestClassifier
from utils import irf_utils

def iterative_random_forest(X_train, X_test, y_train, y_test,
                            rf_B,  # number of decision trees to fit for each random forest
                            K=4,   # number of iRF iterations
                            # RIT params
                            M_trees=20,
                            max_depth=5,
                            n_splits=2,
                            noisy_splits=False,
                            # RF params
                            **rf_params):

    every_irf_output = {}

    # rf_weights=None stands for uniform feature weights in the first iteration
    rf_weights = None

    for k in range(K):
        # NOTE: rf_weights is the proposed feature-weight argument; it is not
        # part of the stock scikit-learn RandomForestClassifier
        rf = RandomForestClassifier(n_estimators=rf_B,
                                    rf_weights=rf_weights,
                                    **rf_params)

        # Assumption: get_rf_tree_data expects an already fitted forest
        rf.fit(X_train, y_train)

        all_rf_tree_data = irf_utils.get_rf_tree_data(rf=rf,
                                                      X_train=X_train, y_train=y_train,
                                                      X_test=X_test, y_test=y_test)

        # Feature importances become the feature weights for the next iteration
        rf_weights = all_rf_tree_data['feature_importances']

        # Run the RIT using the decision tree outputs;
        # the result should be a dictionary
        all_rit_tree_data = irf_utils.get_rit_tree_data(
            all_rf_tree_data=all_rf_tree_data,
            bin_class_type=1,
            random_state=12,
            M=M_trees,
            max_depth=max_depth,
            noisy_split=noisy_splits,
            num_splits=n_splits)

        # should be able to access the rit_output
        stability_score = ...

        # Append the stability score to the RIT
        all_rit_tree_data['stability_score'] = stability_score

        # Store both the RF and the RIT outputs for this iteration
        every_irf_output["irf{}".format(k)] = (all_rf_tree_data, all_rit_tree_data)

    # return the dictionary
    return every_irf_output
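
On the open question of seeding all K iterations, one possible approach (a suggestion only, not a settled design) is to derive K integer seeds from a single master seed with NumPy's `SeedSequence`. The helper name `iteration_seeds` is hypothetical.

```python
import numpy as np

def iteration_seeds(master_seed, K):
    """Derive K reproducible integer seeds from one master seed."""
    seed_seq = np.random.SeedSequence(master_seed)
    # generate_state returns K unsigned 32-bit words, usable as integer seeds
    return [int(s) for s in seed_seq.generate_state(K)]

seeds = iteration_seeds(master_seed=12, K=4)
```

Each iteration k could then pass `seeds[k]` as its `random_state`, so the whole run is controlled by one number while the iterations still use different seeds.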

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
import numpy as np
from functools import reduce

# Import our custom utilities
from importlib import reload
from utils import irf_jupyter_utils
from utils import irf_utils
reload(irf_jupyter_utils)
reload(irf_utils)


Out[1]:
<module 'utils.irf_utils' from '/Users/shamindras/PERSONAL/LEARNING/REPOS/scikit-learn-sandbox/jupyter/utils/irf_utils.py'>

In [4]:
X_train, X_test, y_train, y_test, rf = irf_jupyter_utils.generate_rf_example(
    sklearn_ds=load_breast_cancer(), n_estimators=10)

In [5]:
all_rf_tree_data = irf_utils.get_rf_tree_data(rf=rf,
                                              X_train=X_train, y_train=y_train, 
                                              X_test=X_test, y_test=y_test)

In [6]:
rf_weights = all_rf_tree_data['feature_importances']

In [8]:
gen_random_leaf_paths = irf_utils.generate_rit_samples(all_rf_tree_data=all_rf_tree_data, 
                                                       bin_class_type=1)

rit0 = irf_utils.build_tree(feature_paths=gen_random_leaf_paths, 
                            max_depth=3, 
                            noisy_split=False, 
                            num_splits=5)
