Logistic Regression Based on Extracted Features

Author(s): bfoo@google.com, kozyr@google.com

In this notebook, we will train models on the features extracted in step 4's image and feature analysis. Two tools will be used in this demo:

  • scikit-learn: the widely used, single-machine Python machine learning library
  • TensorFlow: Google's home-grown machine learning library, which also supports distributed training

Setup

You need to have worked through the feature engineering notebook for this one to work, since we'll be loading the pickled datasets we saved in step 4. You might have to adjust the directories below if you changed the save directory in that notebook.


In [0]:
# Enter your username:
YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address

In [0]:
import cv2
import numpy as np
import os
import pickle
import shutil
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from random import random
from scipy import stats
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve

import tensorflow as tf
from tensorflow.contrib.learn import LinearClassifier
from tensorflow.contrib.learn import Experiment
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib.layers import real_valued_column
from tensorflow.contrib.learn import RunConfig

In [0]:
# Directories:
PREPROC_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/')
OUTPUT_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/logreg/')  # Does not need to exist yet.

Load stored features and labels

Load from the pkl files saved in step 4 and confirm that the feature length is correct.


In [0]:
training_std = pickle.load(open(PREPROC_DIR + 'training_std.pkl', 'rb'))
debugging_std = pickle.load(open(PREPROC_DIR + 'debugging_std.pkl', 'rb'))
training_labels = pickle.load(open(PREPROC_DIR + 'training_labels.pkl', 'rb'))
debugging_labels = pickle.load(open(PREPROC_DIR + 'debugging_labels.pkl', 'rb'))

FEATURE_LENGTH = training_std.shape[1]  # Number of features per image.
print(FEATURE_LENGTH)

In [0]:
# Examine the shape of the feature data we loaded:
print(type(training_std))  # Type will be numpy array.
print(np.shape(training_std))  # Rows, columns.

In [0]:
# Examine the label data we loaded:
print(type(training_labels))  # Type will be numpy array.
print(np.shape(training_labels)) # How many datapoints?
training_labels[:3]  # First 3 training labels.

Step 5: Enabling Logistic Regression to Run

Logistic regression is a generalized linear model that predicts the probability that each picture is a cat. scikit-learn has a very easy interface for training a logistic regression model.
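Concretely, the model learns one weight per feature plus a bias, and maps their weighted sum through the sigmoid function to get a probability. Here is a minimal sketch of that formula (an illustration only, not scikit-learn's internals; the function name is ours):

In [0]:
# Sketch of the logistic regression formula: sigmoid of a linear score.
def predict_cat_probability(x, weights, bias):
  # Values near 1 mean "probably a cat"; values near 0 mean "probably not".
  return 1.0 / (1.0 + np.exp(-(np.dot(weights, x) + bias)))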

Logistic Regression in scikit-learn

In logistic regression, one of the hyperparameters to choose is the regularization strength, which scikit-learn controls through the parameter C (the inverse of the regularization strength). Regularization is a penalty on the complexity of the model itself, such as the magnitude of its weights. The example below uses "L1" regularization, which has the following behavior: as C decreases (i.e., regularization strengthens), the number of non-zero weights also decreases (complexity decreases).

A high complexity model (high C) will fit the training data very well, but it will also capture the noise inherent in the training set. This can lead to poor performance when predicting labels on the debugging set.

A low complexity model (low C) does not fit the training data as well, but it will generalize better to unseen data. There is a delicate balance here, as oversimplifying the model also hurts its performance.


In [0]:
# Plug into scikit-learn for logistic regression training.
model = LogisticRegression(penalty='l1', C=0.2)  # C is the inverse of the regularization strength.
model.fit(training_std, training_labels)

# Count the non-zero coefficients to check the regularization strength.
print('Non-zero weights: ' + str(sum(model.coef_[0] != 0)))
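To see the complexity trade-off described above in action, one option is to sweep over a few values of C and watch how the number of non-zero weights and the debugging accuracy change. This is only a sketch (the C values are arbitrary, and it retrains the model several times):

In [0]:
# Sketch: sweep the inverse regularization strength C and watch model complexity.
for c in [0.01, 0.1, 1.0, 10.0]:
  m = LogisticRegression(penalty='l1', C=c)
  m.fit(training_std, training_labels)
  print('C=' + str(c) +
        '  non-zero weights=' + str(sum(m.coef_[0] != 0)) +
        '  debugging accuracy=' + str(round(m.score(debugging_std, debugging_labels), 2)))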

Step 6: Train Logistic Regression with scikit-learn

Let's train!


In [0]:
# Get the predicted probabilities for the training and debugging inputs.
# Column 1 of predict_proba is the probability of the positive (cat) class.
training_predictions = model.predict_proba(training_std)[:, 1]
debugging_predictions = model.predict_proba(debugging_std)[:, 1]

That was easy! But how well did it do? Let's check the accuracy of the model we just trained.


In [0]:
# Accuracy metric:
def get_accuracy(truth, predictions, threshold=0.5, roundoff=2):
  """Computes the fraction of correct predictions at the given threshold.

  Args:
    truth: can be Boolean (False, True), int (0, 1), or float (0, 1)
    predictions: number between 0 and 1, inclusive
    threshold: we convert predictions to 1s if they're above this value
    roundoff: report accuracy to how many decimal places?

  Returns:
    accuracy: number correct divided by total predictions
  """
  truth = np.array(truth) == 1  # True and 1 both count as the positive class.
  predicted = np.array(predictions) >= threshold
  matches = sum(predicted == truth)
  accuracy = float(matches) / len(truth)
  return round(accuracy, roundoff)

# Compute our accuracy metric for training and debugging
print('Training accuracy is ' + str(get_accuracy(training_labels, training_predictions)))
print('Debugging accuracy is ' + str(get_accuracy(debugging_labels, debugging_predictions)))
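Accuracy at a fixed 0.5 threshold is only one view of performance. Since average_precision_score is already imported above, here is a quick sketch of a threshold-free summary (average precision, which approximates the precision-recall AUC mentioned later for the TensorFlow model):

In [0]:
# Sketch: summarize ranking quality with average precision (~ precision-recall AUC).
print('Training average precision: ' + str(round(average_precision_score(training_labels, training_predictions), 2)))
print('Debugging average precision: ' + str(round(average_precision_score(debugging_labels, debugging_predictions), 2)))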

Step 5: Enabling Logistic Regression to Run v2.0

Tensorflow Model

TensorFlow is a Google home-grown tool that lets you define a model and run distributed training on it. In this notebook, we focus on the atomic pieces for building a TensorFlow model; however, everything will be trained locally.

Input functions

TensorFlow requires the user to define input functions: functions that return rows of feature vectors and their corresponding labels. TensorFlow calls these functions periodically to obtain data as model training progresses.

Why not just provide the feature vectors and labels upfront? Again, this comes down to the distributed aspect of TensorFlow, where data can arrive from various sources and not all of it fits on a single machine. For instance, you may have several million rows distributed across a cluster, but any one machine can only provide a few thousand at a time. TensorFlow lets the input function pull data from a queue rather than a numpy array, and that queue holds whatever training data is available at the time.

Another practical reason for supplying limited training data is that feature vectors are sometimes very long, so only a few rows fit in memory at once. Finally, complex ML models (such as deep neural networks) take a long time to train and use a lot of resources, so limiting the training samples on each machine lets us train faster and without memory issues.

The features returned by an input function form a dictionary of scalar, categorical, or tensor-valued features; the labels are returned as a single tensor. In this notebook, we will simply return the entire set of features and labels with every function call (a queue-based variant is sketched after the cell below).


In [0]:
def train_input_fn():
  training_X_tf = tf.convert_to_tensor(training_std, dtype=tf.float32)
  training_y_tf = tf.convert_to_tensor(training_labels, dtype=tf.float32)
  return {'features': training_X_tf}, training_y_tf

def eval_input_fn():
  debugging_X_tf = tf.convert_to_tensor(debugging_std, dtype=tf.float32)
  debugging_y_tf = tf.convert_to_tensor(debugging_labels, dtype=tf.float32)
  return {'features': debugging_X_tf}, debugging_y_tf
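For reference, if the dataset did not fit in memory, the input function could instead pull shuffled mini-batches from a queue. The cell below is only a sketch of that idea using TF 1.x queue helpers; it is not used by the rest of this notebook, and the batch size is arbitrary:

In [0]:
def batched_train_input_fn(batch_size=128):
  # Sketch: stream shuffled mini-batches through TF 1.x input queues
  # instead of handing the estimator the full dataset at once.
  features = tf.convert_to_tensor(training_std, dtype=tf.float32)
  labels = tf.convert_to_tensor(training_labels, dtype=tf.float32)
  feature_row, label_row = tf.train.slice_input_producer([features, labels], shuffle=True)
  feature_batch, label_batch = tf.train.batch([feature_row, label_row], batch_size=batch_size)
  return {'features': feature_batch}, label_batch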

Logistic Regression with TensorFlow

TensorFlow's linear classifiers, such as logistic regression, are structured as estimators. An estimator can compute the objective function of the ML model and take a step towards reducing it. TensorFlow has built-in estimators such as LinearClassifier, which is just a logistic regression trainer. These estimators also report additional metrics, such as the accuracy at a threshold of 0.5.


In [0]:
# Tweak these hyperparameters to improve debugging precision-recall AUC.
REG_L1 = 5.0  # L1 regularization strength; the analogue of 1/C in sklearn.
LEARNING_RATE = 2.0  # How aggressively to adjust coefficients during optimization.
TRAINING_STEPS = 20000

# The estimator takes a list of feature columns matching the keys of the features dictionary returned by the input functions.
feature_columns = [real_valued_column('features', dimension=FEATURE_LENGTH)]

# We use Tensorflow's built-in LinearClassifier estimator, which implements a logistic regression.
# You can go to the model_dir below to see what Tensorflow leaves behind during training.
# Delete the directory if you wish to retrain.
estimator = LinearClassifier(feature_columns=feature_columns,
                             optimizer=tf.train.FtrlOptimizer(
                               learning_rate=LEARNING_RATE,
                               l1_regularization_strength=REG_L1),
                             model_dir=OUTPUT_DIR + '-model-reg-' + str(REG_L1)
                            )

Experiments and Runners

An experiment is a TensorFlow object that stores the estimator along with several other parameters. It also periodically writes the model's progress to checkpoints, which can be loaded later if you would like to continue training from where it last left off.

Some of the parameters are:

  • train_steps: how many times to adjust model weights before stopping
  • eval_steps: when a summary is written, the model, in its current state of progress, predicts the debugging data and calculates its accuracy. eval_steps is set to 1 because we only need to call the input function once (it already returns the entire evaluation dataset).
  • The rest of the parameters just boil down to "do evaluation once".

(If you run the script below multiple times without changing REG_L1 or train_steps, you will notice that the model does not train any further, because you have already trained it for that many steps under the given configuration.)
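If you do want to retrain from scratch, one option is to delete the model directory before rerunning (a sketch; shutil and os are already imported above):

In [0]:
# Sketch: wipe the checkpoint directory so the next run starts from step 0.
model_dir = OUTPUT_DIR + '-model-reg-' + str(REG_L1)
if os.path.exists(model_dir):
  shutil.rmtree(model_dir)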


In [0]:
def generate_experiment_fn():
  def _experiment_fn(output_dir):
    return Experiment(estimator=estimator,
                      train_input_fn=train_input_fn,
                      eval_input_fn=eval_input_fn,
                      train_steps=TRAINING_STEPS,
                      eval_steps=1,
                      min_eval_frequency=1)
  return _experiment_fn

Step 6: Train Logistic Regression with TensorFlow

Unless you change TensorFlow's verbosity, a lot of text is printed. Such output can be useful when debugging a distributed training pipeline, but it is pretty noisy when running locally from a notebook. The line to look for is the chunk at the end where "accuracy" is reported; this is the final result of the model.
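If you prefer a quieter run, one option is to lower TensorFlow's logging level before launching training (optional; this uses the TF 1.x logging module):

In [0]:
# Optional: show only warnings and errors from TensorFlow during training.
tf.logging.set_verbosity(tf.logging.WARN)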


In [0]:
learn_runner.run(generate_experiment_fn(), OUTPUT_DIR + '-model-reg-' + str(REG_L1))