Title: Handle Imbalanced Classes In Random Forest
Slug: handle_imbalanced_classes_in_random_forests
Summary: Handle imbalanced classes in random forests in scikit-learn.
Date: 2017-09-21 12:00
Category: Machine Learning
Tags: Trees And Forests
Authors: Chris Albon
In [5]:
# Load libraries
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import datasets
In [6]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
In [7]:
# Make class highly imbalanced by removing first 40 observations
X = X[40:,:]
y = y[40:]
# Create binary target vector: 0 if class 0, otherwise 1
y = np.where((y == 0), 0, 1)
When using `RandomForestClassifier`, a useful setting is `class_weight="balanced"`, wherein classes are automatically weighted inversely proportional to how frequently they appear in the data. Specifically:

$$w_j = \frac{n}{k n_j}$$

where $w_j$ is the weight of class $j$, $n$ is the number of observations, $n_j$ is the number of observations in class $j$, and $k$ is the total number of classes.
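The weights can be worked out by hand as a sanity check. The sketch below assumes the balanced weighting $w_j = n / (k n_j)$, which scikit-learn documents as `n_samples / (n_classes * np.bincount(y))`. The label vector is a hypothetical stand-in with the same class counts as the data above (10 observations of class 0 and 100 of class 1), not the actual iris targets.

```python
import numpy as np

# Hypothetical labels with the same class counts as the example:
# 10 observations of class 0 and 100 observations of class 1
y = np.array([0] * 10 + [1] * 100)

n = len(y)             # number of observations
n_j = np.bincount(y)   # number of observations in each class
k = len(n_j)           # total number of classes

# Balanced weight for each class: w_j = n / (k * n_j)
weights = n / (k * n_j)
print(weights)
```

Under these assumptions the rare class gets a weight ten times larger than the common class, which mirrors the ten-to-one imbalance in the counts.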
In [8]:
# Create random forest classifier object
clf = RandomForestClassifier(random_state=0, n_jobs=-1, class_weight="balanced")
# Train model
model = clf.fit(X, y)