Title: Handling Imbalanced Classes In Logistic Regression
Slug: handling_imbalanced_classes_in_logistic_regression
Summary: How to handle imbalanced classes in logistic regression in scikit-learn.
Date: 2017-09-21 12:00
Category: Machine Learning
Tags: Logistic Regression
Authors: Chris Albon

Like many other learning algorithms in scikit-learn, LogisticRegression comes with a built-in method of handling imbalanced classes. If we have highly imbalanced classes and have no addressed it during preprocessing, we have the option of using the class_weight parameter to weight the classes to make certain we have a balanced mix of each class. Specifically, the balanced argument will automatically weigh classes inversely proportional to their frequency:

$$w_j = \frac{n}{kn_{j}}$$

where $w_j$ is the weight to class $j$, $n$ is the number of observations, $n_j$ is the number of observations in class $j$, and $k$ is the total number of classes.

Preliminaries


In [1]:
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import numpy as np

Load Iris Flower Dataset


In [2]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

Make Classes Imbalanced


In [3]:
# Make class highly imbalanced by removing first 40 observations
X = X[40:,:]
y = y[40:]

# Create target vector indicating if class 0, otherwise 1
y = np.where((y == 0), 0, 1)

Standardize Features


In [4]:
# Standarize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

Train A Logistic Regression With Weighted Classes


In [5]:
# Create decision tree classifer object
clf = LogisticRegression(random_state=0, class_weight='balanced')

# Train model
model = clf.fit(X_std, y)