Title: Logistic Regression On Very Large Data
Slug: logistic_regression_on_very_large_data
Summary: How to train a logistic regression on very large data in scikit-learn.
Date: 2017-09-21 12:00
Category: Machine Learning
Tags: Logistic Regression
Authors: Chris Albon

scikit-learn's LogisticRegression offers a number of techniques for training a logistic regression, called solvers. Most of the time the default solver works well, and scikit-learn will warn us or raise an error when a solver does not support a particular option. However, there is one particular case we should be aware of.

While an exact explanation is beyond the scope of this post, stochastic average gradient descent (SAG) often allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important. We can set our learning algorithm to use this solver by setting solver='sag'.

Preliminaries


In [1]:
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

Load Iris Flower Data


In [2]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

Standardize Features


In [3]:
# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
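
Because the SAG solver depends on the features being on a common scale, it can be worth confirming that standardization did what we expect. The following is a minimal sketch that reloads the same iris data and checks that every column has a mean of roughly zero and a standard deviation of roughly one; the names col_means and col_stds are introduced here for illustration only.

```python
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load data and standardize features, as above
X = datasets.load_iris().data
X_std = StandardScaler().fit_transform(X)

# Per-feature mean and (population) standard deviation after scaling
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)

# Each mean should be close to 0 and each standard deviation close to 1
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```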

Train Logistic Regression Using SAG Solver


In [4]:
# Create logistic regression object using sag solver
clf = LogisticRegression(random_state=0, solver='sag')

# Train model
model = clf.fit(X_std, y)
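
Since the solver is sensitive to scaling, any new observations must be standardized with the same scaler before prediction. One way to keep the two steps together is a scikit-learn pipeline. The sketch below is one possible arrangement, not the only one; the value max_iter=1000 is an assumption chosen to give the solver room to converge, and pipe, predictions, and accuracy are illustrative names.

```python
# Load libraries
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Chain standardization and the sag solver so scaling is always applied
# (max_iter=1000 is an assumed value, not a recommendation)
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, solver='sag', max_iter=1000),
)

# Train on the raw features; the pipeline standardizes them internally
pipe.fit(X, y)

# Predict on raw features; the fitted scaler is reused automatically
predictions = pipe.predict(X)
accuracy = pipe.score(X, y)
```

With this arrangement the scaler's means and standard deviations are learned once during fit and reused on every later call to predict.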