Key Requirements for the iRF scikit-learn implementation

  • The following documents the main requirements for the iRF implementation

Pseudocode iRF implementation

Step 0: Setup

  • Import required libraries and set up the seed value for reproducibility
  • Keep all custom functions in utils/utils.py

Inputs:

  • $D = \left\{ (X_i, Y_i) \right\}$, where $X_i \in \mathbb{R}^{p}$ and $Y_i \in \{0, 1\}$; $C \in \{0, 1\}$; $B$; $K$
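
As a purely illustrative mapping of these inputs onto code (the name `irf` and its parameters are hypothetical, not an existing scikit-learn API; here $K$ is read as the number of iteratively reweighted forests and $B$ as the number of bootstrap samples):

In [ ]:
def irf(X, y, K = 5, B = 30, n_estimators = 20, random_state = None):
    """Hypothetical iRF entry point (interface sketch only).

    X, y         : training data D = {(X_i, Y_i)}
    K            : number of iteratively reweighted random forests
    B            : number of bootstrap samples for the interaction step
    n_estimators : number of trees per forest
    """
    raise NotImplementedError("interface sketch; see the steps below")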

In [1]:
# Setup
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn import tree
import numpy as np

# Define a function to draw the decision trees in IPython
# Adapted from: http://scikit-learn.org/stable/modules/tree.html
from IPython.display import display, Image
import pydotplus

# Custom util functions
from utils import utils

# Set seed for reproducibility
np.random.seed(1015)



Step 1: Fit the Initial Random Forest

  • Just fit every feature with equal weight, using the usual random forest code e.g. RandomForestClassifier in scikit-learn

In [2]:
# Load the iris data
iris = load_iris()

# Create the train-test datasets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

np.random.seed(1039)

# Just fit a simple random forest classifier with 2 decision trees
rf = RandomForestClassifier(n_estimators = 2)
rf.fit(X = X_train, y = y_train)

# Now plot the trees individually
for idx, dtree in enumerate(rf.estimators_):
    print(idx)
    utils.draw_tree(inp_tree = dtree)
    #utils.draw_tree(inp_tree = rf.estimators_[1])


0
1

Step 2: Get the Gini Feature Importances

  • For the first random forest we just need the Gini feature importances

Step 2.1 Get the importances numerically (most important)


In [3]:
importances = rf.feature_importances_
std = np.std([dtree.feature_importances_ for dtree in rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Check that the feature importances are standardized to 1
print(sum(importances))


1.0

Step 2.2 Display Feature Importances Graphically (just for interest)


In [4]:
# Print the feature ranking
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()


Feature ranking:
1. feature 3 (0.627351)
2. feature 2 (0.319113)
3. feature 1 (0.041706)
4. feature 0 (0.011829)

Step 3: For each Tree get core leaf node features

  • For each decision tree in the classifier, get (see the sketch after this list):
    • The list of leaf nodes
    • Depth of the leaf node
    • Leaf node predicted class i.e. {0, 1}
    • Probability of predicting class in leaf node
    • Number of observations in the leaf node i.e. weight of node
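
As a minimal sketch of how these leaf-level quantities might be pulled straight out of each fitted tree's `tree_` attribute (assuming the `rf` fitted in Step 1; the helper name `get_leaf_node_summary` is illustrative, not part of scikit-learn):

In [ ]:
def get_leaf_node_summary(dtree):
    tree = dtree.tree_
    children_left  = tree.children_left
    children_right = tree.children_right

    # Compute every node's depth with an explicit stack traversal
    node_depth = np.zeros(tree.node_count, dtype = np.int64)
    is_leaf    = np.zeros(tree.node_count, dtype = bool)
    stack = [(0, 0)]  # (node_id, depth) pairs, starting at the root
    while stack:
        node_id, depth = stack.pop()
        node_depth[node_id] = depth
        if children_left[node_id] != children_right[node_id]:
            # Internal node: push both children
            stack.append((children_left[node_id], depth + 1))
            stack.append((children_right[node_id], depth + 1))
        else:
            is_leaf[node_id] = True

    leaf_ids = np.where(is_leaf)[0]
    counts   = tree.value[leaf_ids, 0, :]  # per-class sample counts in each leaf
    return {"leaf_node_ids"        : leaf_ids,
            "leaf_depths"          : node_depth[leaf_ids],
            "leaf_predicted_class" : counts.argmax(axis = 1),
            "leaf_class_prob"      : counts.max(axis = 1) / counts.sum(axis = 1),
            "leaf_n_samples"       : tree.n_node_samples[leaf_ids]}

# Example: leaf summaries for every tree in the fitted forest
leaf_summaries = [get_leaf_node_summary(dtree) for dtree in rf.estimators_]
leaf_summaries[1]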

Name the Features


In [5]:
feature_names = ["X" + str(i) for i in range(X_train.shape[1])]
target_vals = list(np.sort(np.unique(y_train)))
target_names = ["y" + str(i) for i in target_vals]
print(feature_names)
print(target_names)


['X0', 'X1', 'X2', 'X3']
['y0', 'y1', 'y2']

Get the second Decision tree to use for testing


In [51]:
estimator = rf.estimators_[1]

In [7]:
from sklearn.tree import _tree

# Inspect the low-level tree structure (only the last expression is echoed)
estimator.tree_.node_count
estimator.tree_.children_left[0]
estimator.tree_.children_right[0]

# Sentinel value (-1) marking a leaf / missing child
_tree.TREE_LEAF


Out[7]:
-1

Write down an efficient Binary Tree Traversal Function


In [8]:
# Now plot the trees individually
utils.draw_tree(inp_tree = estimator)



In [9]:
def binaryTreePaths(dtree, root_node_id = 0):

    # Use these arrays to parse the tree structure
    children_left  = dtree.tree_.children_left
    children_right = dtree.tree_.children_right

    if root_node_id is None:
        return []

    if root_node_id == _tree.TREE_LEAF:
        raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)

    # Internal node: recurse into both children and prefix this node id
    if children_left[root_node_id] != _tree.TREE_LEAF:
        paths = [str(root_node_id) + '->' + str(l)
                 for l in binaryTreePaths(dtree, children_left[root_node_id]) +
                          binaryTreePaths(dtree, children_right[root_node_id])]
    else:
        # Leaf node: the path ends here
        paths = [root_node_id]

    return paths

In [10]:
x1 = binaryTreePaths(rf.estimators_[1], root_node_id = 0)
x1


Out[10]:
['0->1',
 '0->2->3->4->5',
 '0->2->3->4->6',
 '0->2->3->7',
 '0->2->8->9->10',
 '0->2->8->9->11',
 '0->2->8->12']

In [27]:
def binaryTreePaths2(dtree, root_node_id = 0):

    # Use these arrays to parse the tree structure
    children_left  = dtree.tree_.children_left
    children_right = dtree.tree_.children_right

    if root_node_id is None:
        return []

    if root_node_id == _tree.TREE_LEAF:
        raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)

    # Internal node: recurse into both children and prepend this node id
    if children_left[root_node_id] != _tree.TREE_LEAF:
        paths = [np.append(root_node_id, l)
                 for l in binaryTreePaths2(dtree, children_left[root_node_id]) +
                          binaryTreePaths2(dtree, children_right[root_node_id])]
    else:
        # Leaf node: the path ends here
        paths = [root_node_id]

    return paths

In [28]:
x = binaryTreePaths2(rf.estimators_[1], root_node_id = 0)
x


Out[28]:
[array([0, 1]),
 array([0, 2, 3, 4, 5]),
 array([0, 2, 3, 4, 6]),
 array([0, 2, 3, 7]),
 array([ 0,  2,  8,  9, 10]),
 array([ 0,  2,  8,  9, 11]),
 array([ 0,  2,  8, 12])]

In [30]:
leaf_nodes = [y[-1] for y in x]
leaf_nodes


Out[30]:
[1, 5, 6, 7, 10, 11, 12]

In [60]:
n_node_samples = estimator.tree_.n_node_samples
num_samples = [n_node_samples[y].astype(int) for y in leaf_nodes]
print(n_node_samples)
print(len(n_node_samples))
num_samples
print(num_samples)
print(sum(num_samples))
print(sum(n_node_samples))


[70 21 49 25 24 21  3  1 24  5  4  1 19]
13
[21, 21, 3, 1, 4, 1, 19]
70
267

In [46]:
X_train.shape


Out[46]:
(112, 4)

In [59]:
value  = estimator.tree_.value
values = [value[y].astype(int) for y in leaf_nodes]
print(values)
# This should match the number of rows in the training feature set
print(sum(values).sum())
values


[array([[39,  0,  0]]), array([[ 0, 31,  0]]), array([[0, 0, 4]]), array([[0, 0, 1]]), array([[0, 0, 6]]), array([[0, 1, 0]]), array([[ 0,  0, 30]])]
112
Out[59]:
[array([[39,  0,  0]]),
 array([[ 0, 31,  0]]),
 array([[0, 0, 4]]),
 array([[0, 0, 1]]),
 array([[0, 0, 6]]),
 array([[0, 1, 0]]),
 array([[ 0,  0, 30]])]

In [ ]:
feature_names = ["X" + str(i) for i in range(X_train.shape[1])]
print(type(feature_names))

print(feature_names[0])
print(feature_names[-2])

# Feature index used to split at each node;
# leaf nodes are marked with _tree.TREE_UNDEFINED (-2)
feature = estimator.tree_.feature

# Feature indices along each root-to-leaf path from binaryTreePaths2
z = [feature[y].astype(int) for y in x]
z
# Note: drop the leaf entries (-2) before mapping indices to names, e.g.
# [[feature_names[i] for i in path if i >= 0] for path in z]

In [ ]:
max_dpth = estimator.tree_.max_depth
max_dpth

In [ ]:
max_n_class = estimator.tree_.max_n_classes
max_n_class

In [ ]:
print("nodes", np.asarray(a = nodes, dtype = "int64"), sep = ":\n")
print("node_depth", node_depth, sep = ":\n")
print("leaf_node", is_leaves, sep = ":\n")
print("feature_names", used_feature_names, sep = ":\n")
print("feature", feature, sep = ":\n")

Step 4: For each tree get the paths to the leaf node from root node

  • For each decision tree in the classifier, get (see the sketch after this list):
    • Full path sequence to all leaf nodes i.e. SEQUENCE of all features that led to a leaf node
    • Path to all leaf nodes i.e. SET of all features that led to a leaf node i.e. remove duplicate features
    • Get the node_ids and the feature_ids at each node_id
    • Get the feature SET associated with each node along a path
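
Building on `binaryTreePaths2` above, a hedged sketch of the path-level quantities for a single tree (assumes `estimator` and `feature_names` from Step 3; `path_summaries` is an illustrative name):

In [ ]:
# Per-path feature SEQUENCEs and feature SETs for one tree
all_paths = binaryTreePaths2(estimator, root_node_id = 0)
feature   = estimator.tree_.feature

path_summaries = []
for path in all_paths:
    inner_nodes = path[:-1]  # drop the leaf; leaves carry no split feature
    feature_seq = [feature_names[feature[n]] for n in inner_nodes]
    path_summaries.append({"node_ids"         : list(path),
                           "leaf_node_id"     : int(path[-1]),
                           "feature_sequence" : feature_seq,            # with repeats
                           "feature_set"      : sorted(set(feature_seq))})  # duplicates removed
path_summaries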

Step 5: Refit the Random Forest but change the feature weighting procedure
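
Stock scikit-learn does not expose per-feature sampling weights in RandomForestClassifier, so a faithful Step 5 needs a modified splitter. As a placeholder, a minimal sketch of the outer iterative loop only (the iteration count and n_estimators are arbitrary here, and the weights are carried forward but not actually applied):

In [ ]:
# Sketch of the iterative refitting loop only.
# NOTE: the stock RandomForestClassifier cannot sample candidate split
# features with the weights below, so the reweighting is NOT applied here;
# it is tracked purely to show the intended flow.
K = 5  # number of iRF iterations (arbitrary for this sketch)
feature_weights = np.ones(X_train.shape[1]) / X_train.shape[1]

for k in range(K):
    rf_k = RandomForestClassifier(n_estimators = 20)
    rf_k.fit(X = X_train, y = y_train)
    # In iRF the next forest would sample split candidates proportionally
    # to these weights; here we only record them.
    feature_weights = rf_k.feature_importances_

print(feature_weights)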

Step 6: Use Export Graphviz code to obtain the node paths
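
A possible starting point, assuming export_graphviz is used only to read the parent->child edges back out of the dot source (the regular expression is a simplification of the dot edge syntax):

In [ ]:
from sklearn.tree import export_graphviz
import re

# Export one tree to Graphviz dot source and pull out the parent->child edges
dot_data = export_graphviz(estimator, out_file = None,
                           feature_names = feature_names)
edges = [(int(parent), int(child))
         for parent, child in re.findall(r"(\d+) -> (\d+)", dot_data)]
print(edges)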

Discussion with Karl

  • Check the RIT procedure
  • Check any additional data values required
  • OOB is tricky - the warm start aspect means that we can't easily parallelize this; tracking is also difficult
  • Go through the general structure of the code