Key Requirements for the iRF scikit-learn implementation

  • The following documents the main requirements for the iRF implementation

Pseudocode iRF implementation

Step 0: Setup

  • Import required libraries and set up the seed value for reproducibility
  • Keep all custom functions in utils/utils.py

Inputs:

  • $D = \left\{ (X_i, Y_i) \right\}$, where $X_i \in \mathbb{R}^{p}$ and $Y_i \in \{0, 1\}$; $C \in \{0, 1\}$; $B$; $K$
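
As a purely illustrative mapping of these inputs onto code (the name `irf` and its parameters are hypothetical, not an existing scikit-learn API; here $K$ is read as the number of iteratively reweighted forests and $B$ as the number of bootstrap samples):

In [ ]:
def irf(X, y, K = 5, B = 30, n_estimators = 20, random_state = None):
    """Hypothetical iRF entry point (interface sketch only).

    X, y         : training data D = {(X_i, Y_i)}
    K            : number of iteratively reweighted random forests
    B            : number of bootstrap samples for the interaction step
    n_estimators : number of trees per forest
    """
    raise NotImplementedError("interface sketch; see the steps below")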

In [1]:
# Setup
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn import tree
import numpy as np

# Define a function to draw the decision trees in IPython
# Adapted from: http://scikit-learn.org/stable/modules/tree.html
from IPython.display import display, Image
import pydotplus

# Custom util functions
from utils import utils

# Set seed for reproducibility
np.random.seed(1015)



Step 1: Fit the Initial Random Forest

  • Just fit every feature with equal weight, using the usual random forest code e.g. RandomForestClassifier in scikit-learn

In [2]:
# Load the iris data
iris = load_iris()

# Create the train-test datasets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

np.random.seed(1039)

# Just fit a simple random forest classifier with 2 decision trees
rf = RandomForestClassifier(n_estimators = 2)
rf.fit(X = X_train, y = y_train)

# Now plot the trees individually
for idx, dtree in enumerate(rf.estimators_):
    print(idx)
    utils.draw_tree(inp_tree = dtree)
    #utils.draw_tree(inp_tree = rf.estimators_[1])


0
1

Step 2: Get the Gini Feature Importances

  • For the first random forest we just need the Gini feature importances

Step 2.1 Get the importances numerically (most important)


In [3]:
importances = rf.feature_importances_
std = np.std([dtree.feature_importances_ for dtree in rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Check that the feature importances are standardized to 1
print(sum(importances))


1.0

Step 2.2 Display Feature Importances Graphically (just for interest)


In [4]:
# Print the feature ranking
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()


Feature ranking:
1. feature 3 (0.627351)
2. feature 2 (0.319113)
3. feature 1 (0.041706)
4. feature 0 (0.011829)

Step 3: For each Tree get core leaf node features

  • For each decision tree in the classifier, get (see the sketch after this list):
    • The list of leaf nodes
    • Depth of the leaf node
    • Leaf node predicted class i.e. {0, 1}
    • Probability of predicting class in leaf node
    • Number of observations in the leaf node i.e. weight of node
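
As a minimal sketch of how these leaf-level quantities might be pulled straight out of each fitted tree's `tree_` attribute (assuming the `rf` fitted in Step 1; the helper name `get_leaf_node_summary` is illustrative, not part of scikit-learn):

In [ ]:
def get_leaf_node_summary(dtree):
    tree = dtree.tree_
    children_left  = tree.children_left
    children_right = tree.children_right

    # Compute every node's depth with an explicit stack traversal
    node_depth = np.zeros(tree.node_count, dtype = np.int64)
    is_leaf    = np.zeros(tree.node_count, dtype = bool)
    stack = [(0, 0)]  # (node_id, depth) pairs, starting at the root
    while stack:
        node_id, depth = stack.pop()
        node_depth[node_id] = depth
        if children_left[node_id] != children_right[node_id]:
            # Internal node: push both children
            stack.append((children_left[node_id], depth + 1))
            stack.append((children_right[node_id], depth + 1))
        else:
            is_leaf[node_id] = True

    leaf_ids = np.where(is_leaf)[0]
    counts   = tree.value[leaf_ids, 0, :]  # per-class sample counts in each leaf
    return {"leaf_node_ids"        : leaf_ids,
            "leaf_depths"          : node_depth[leaf_ids],
            "leaf_predicted_class" : counts.argmax(axis = 1),
            "leaf_class_prob"      : counts.max(axis = 1) / counts.sum(axis = 1),
            "leaf_n_samples"       : tree.n_node_samples[leaf_ids]}

# Example: leaf summaries for every tree in the fitted forest
leaf_summaries = [get_leaf_node_summary(dtree) for dtree in rf.estimators_]
leaf_summaries[1]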

Name the Features


In [5]:
feature_names = ["X" + str(i) for i in range(X_train.shape[1])]
target_vals = list(np.sort(np.unique(y_train)))
target_names = ["y" + str(i) for i in target_vals]
print(feature_names)
print(target_names)


['X0', 'X1', 'X2', 'X3']
['y0', 'y1', 'y2']

Get the second Decision tree to use for testing


In [51]:
estimator = rf.estimators_[1]

In [7]:
from sklearn.tree import _tree

# Inspect the low-level tree structure (only the last expression is echoed)
estimator.tree_.node_count
estimator.tree_.children_left[0]
estimator.tree_.children_right[0]

# Sentinel value (-1) marking a leaf / missing child
_tree.TREE_LEAF


Out[7]:
-1

Write down an efficient Binary Tree Traversal Function


In [8]:
# Now plot the trees individually
utils.draw_tree(inp_tree = estimator)



In [9]:
def binaryTreePaths(dtree, root_node_id = 0):

    # Use these arrays to parse the tree structure
    children_left  = dtree.tree_.children_left
    children_right = dtree.tree_.children_right

    if root_node_id is None:
        return []

    if root_node_id == _tree.TREE_LEAF:
        raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)

    # Internal node: recurse into both children and prefix this node id
    if children_left[root_node_id] != _tree.TREE_LEAF:
        paths = [str(root_node_id) + '->' + str(l)
                 for l in binaryTreePaths(dtree, children_left[root_node_id]) +
                          binaryTreePaths(dtree, children_right[root_node_id])]
    else:
        # Leaf node: the path ends here
        paths = [root_node_id]

    return paths

In [10]:
x1 = binaryTreePaths(rf.estimators_[1], root_node_id = 0)
x1


Out[10]:
['0->1',
 '0->2->3->4->5',
 '0->2->3->4->6',
 '0->2->3->7',
 '0->2->8->9->10',
 '0->2->8->9->11',
 '0->2->8->12']

In [27]:
def binaryTreePaths2(dtree, root_node_id = 0):

    # Use these arrays to parse the tree structure
    children_left  = dtree.tree_.children_left
    children_right = dtree.tree_.children_right

    if root_node_id is None:
        return []

    if root_node_id == _tree.TREE_LEAF:
        raise ValueError("Invalid node_id %s" % _tree.TREE_LEAF)

    # Internal node: recurse into both children and prepend this node id
    if children_left[root_node_id] != _tree.TREE_LEAF:
        paths = [np.append(root_node_id, l)
                 for l in binaryTreePaths2(dtree, children_left[root_node_id]) +
                          binaryTreePaths2(dtree, children_right[root_node_id])]
    else:
        # Leaf node: the path ends here
        paths = [root_node_id]

    return paths

In [28]:
x = binaryTreePaths2(rf.estimators_[1], root_node_id = 0)
x


Out[28]:
[array([0, 1]),
 array([0, 2, 3, 4, 5]),
 array([0, 2, 3, 4, 6]),
 array([0, 2, 3, 7]),
 array([ 0,  2,  8,  9, 10]),
 array([ 0,  2,  8,  9, 11]),
 array([ 0,  2,  8, 12])]

In [30]:
leaf_nodes = [y[-1] for y in x]
leaf_nodes


Out[30]:
[1, 5, 6, 7, 10, 11, 12]

In [60]:
n_node_samples = estimator.tree_.n_node_samples
num_samples = [n_node_samples[y].astype(int) for y in leaf_nodes]
print(n_node_samples)
print(len(n_node_samples))
num_samples
print(num_samples)
print(sum(num_samples))
print(sum(n_node_samples))


[70 21 49 25 24 21  3  1 24  5  4  1 19]
13
[21, 21, 3, 1, 4, 1, 19]
70
267

In [46]:
X_train.shape


Out[46]:
(112, 4)

In [59]:
value  = estimator.tree_.value
values = [value[y].astype(int) for y in leaf_nodes]
print(values)
# This should match the number of rows in the training feature set
print(sum(values).sum())
values


[array([[39,  0,  0]]), array([[ 0, 31,  0]]), array([[0, 0, 4]]), array([[0, 0, 1]]), array([[0, 0, 6]]), array([[0, 1, 0]]), array([[ 0,  0, 30]])]
112
Out[59]:
[array([[39,  0,  0]]),
 array([[ 0, 31,  0]]),
 array([[0, 0, 4]]),
 array([[0, 0, 1]]),
 array([[0, 0, 6]]),
 array([[0, 1, 0]]),
 array([[ 0,  0, 30]])]

In [ ]:
feature_names = ["X" + str(i) for i in range(X_train.shape[1])]
print(type(feature_names))

print(feature_names[0])
print(feature_names[-2])

# Feature index used to split at each node;
# leaf nodes are marked with _tree.TREE_UNDEFINED (-2)
feature = estimator.tree_.feature

# Feature indices along each root-to-leaf path from binaryTreePaths2
z = [feature[y].astype(int) for y in x]
z
# Note: drop the leaf entries (-2) before mapping indices to names, e.g.
# [[feature_names[i] for i in path if i >= 0] for path in z]

In [ ]:
max_dpth = estimator.tree_.max_depth
max_dpth

In [ ]:
max_n_class = estimator.tree_.max_n_classes
max_n_class

In [ ]:
print("nodes", np.asarray(a = nodes, dtype = "int64"), sep = ":\n")
print("node_depth", node_depth, sep = ":\n")
print("leaf_node", is_leaves, sep = ":\n")
print("feature_names", used_feature_names, sep = ":\n")
print("feature", feature, sep = ":\n")

Step 4: For each tree get the paths to the leaf node from root node

  • For each decision tree in the classifier, get (see the sketch after this list):
    • Full path sequence to all leaf nodes i.e. SEQUENCE of all features that led to a leaf node
    • Path to all leaf nodes i.e. SET of all features that led to a leaf node i.e. remove duplicate features
    • Get the node_ids and the feature_ids at each node_id
    • Get the feature SET associated with each node along a path
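
Building on `binaryTreePaths2` above, a hedged sketch of the path-level quantities for a single tree (assumes `estimator` and `feature_names` from Step 3; `path_summaries` is an illustrative name):

In [ ]:
# Per-path feature SEQUENCEs and feature SETs for one tree
all_paths = binaryTreePaths2(estimator, root_node_id = 0)
feature   = estimator.tree_.feature

path_summaries = []
for path in all_paths:
    inner_nodes = path[:-1]  # drop the leaf; leaves carry no split feature
    feature_seq = [feature_names[feature[n]] for n in inner_nodes]
    path_summaries.append({"node_ids"         : list(path),
                           "leaf_node_id"     : int(path[-1]),
                           "feature_sequence" : feature_seq,            # with repeats
                           "feature_set"      : sorted(set(feature_seq))})  # duplicates removed
path_summaries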

Step 5: Refit the Random Forest but change the feature weighting procedure
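
Stock scikit-learn does not expose per-feature sampling weights in RandomForestClassifier, so a faithful Step 5 needs a modified splitter. As a placeholder, a minimal sketch of the outer iterative loop only (the iteration count and n_estimators are arbitrary here, and the weights are carried forward but not actually applied):

In [ ]:
# Sketch of the iterative refitting loop only.
# NOTE: the stock RandomForestClassifier cannot sample candidate split
# features with the weights below, so the reweighting is NOT applied here;
# it is tracked purely to show the intended flow.
K = 5  # number of iRF iterations (arbitrary for this sketch)
feature_weights = np.ones(X_train.shape[1]) / X_train.shape[1]

for k in range(K):
    rf_k = RandomForestClassifier(n_estimators = 20)
    rf_k.fit(X = X_train, y = y_train)
    # In iRF the next forest would sample split candidates proportionally
    # to these weights; here we only record them.
    feature_weights = rf_k.feature_importances_

print(feature_weights)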

Step 6: Use Export Graphviz code to obtain the node paths
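
A possible starting point, assuming export_graphviz is used only to read the parent->child edges back out of the dot source (the regular expression is a simplification of the dot edge syntax):

In [ ]:
from sklearn.tree import export_graphviz
import re

# Export one tree to Graphviz dot source and pull out the parent->child edges
dot_data = export_graphviz(estimator, out_file = None,
                           feature_names = feature_names)
edges = [(int(parent), int(child))
         for parent, child in re.findall(r"(\d+) -> (\d+)", dot_data)]
print(edges)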

Discussion with Karl

  • Check the RIT procedure
  • Check any additional data values required
  • OOB is tricky - the warm start aspect means that we can't easily parallelize this; tracking is also difficult
  • Go through the general structure of the code