Title: Logistic Regression On Very Large Data
Slug: logistic_regression_on_very_large_data
Summary: How to train a logistic regression on very large data in scikit-learn.
Date: 2017-09-21 12:00
Category: Machine Learning
Tags: Logistic Regression
Authors: Chris Albon
scikit-learn's LogisticRegression offers a number of techniques for training a logistic regression, called solvers. Most of the time the default solver works well, and scikit-learn will raise an error if we ask a solver to do something it does not support. However, there is one particular case we should be aware of.
While an exact explanation is beyond the scope of this book, stochastic average gradient (SAG) descent allows us to train a model much faster than other solvers when our data is very large. However, it is also very sensitive to feature scaling, so standardizing our features is particularly important. We can set our learning algorithm to use this solver by setting solver='sag'.
In [1]:
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
In [2]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
In [3]:
# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
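Because the sag solver depends on well-scaled features, it can be worth confirming that standardization did what we expect. The following is a minimal sketch of such a check, not part of the original recipe: it reloads the iris data so it runs on its own, and the names col_means and col_stds are illustrative.

```python
# Load libraries
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load data and standardize features, as in the cells above
X = datasets.load_iris().data
X_std = StandardScaler().fit_transform(X)

# After standardization, each feature column should have mean ~0 and
# standard deviation ~1 (StandardScaler uses the population standard
# deviation, which matches NumPy's default ddof=0)
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)

# Report whether every column is centered and has unit variance
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```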
In [4]:
# Create logistic regression object using sag solver
clf = LogisticRegression(random_state=0, solver='sag')
# Train model
model = clf.fit(X_std, y)
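Since the sag solver should always see standardized features, one option is to bundle the scaler and the classifier into a single pipeline so that new observations are scaled the same way at prediction time. The sketch below is one possible arrangement rather than the original recipe: the max_iter=1000 setting is an assumption chosen to give the solver room to converge, and new_observation is a made-up flower measurement used only for illustration.

```python
# Load libraries
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Chain standardization and the sag-based logistic regression so the
# scaler is always applied before the solver sees any data
# (max_iter=1000 is an assumed value, not a recommendation)
sag_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, solver='sag', max_iter=1000),
)

# Train on the raw features; the pipeline standardizes them internally
sag_pipeline.fit(X, y)

# Predict the class of a hypothetical new observation
new_observation = [[5.0, 3.5, 1.4, 0.2]]
predicted_class = sag_pipeline.predict(new_observation)

# Inspect how many iterations the solver actually used
iterations_used = sag_pipeline.named_steps['logisticregression'].n_iter_
print(predicted_class, iterations_used)
```

If the iteration count reported in n_iter_ equals max_iter, the solver stopped before converging, and raising max_iter or re-checking the feature scaling is a reasonable next step.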