Author(s): bfoo@google.com, kozyr@google.com
In this notebook, we will train classifiers on the features collected in step 4's image and feature analysis. Two tools will be used in this demo: scikit-learn and TensorFlow.
In [0]:
# Enter your username:
YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address
In [0]:
import cv2
import numpy as np
import os
import pickle
import shutil
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from random import random
from scipy import stats
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import tensorflow as tf
from tensorflow.contrib.learn import LinearClassifier
from tensorflow.contrib.learn import Experiment
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib.layers import real_valued_column
from tensorflow.contrib.learn import RunConfig
In [0]:
# Directories:
PREPROC_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/')
OUTPUT_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/logreg/') # Does not need to exist yet.
In [0]:
training_std = pickle.load(open(PREPROC_DIR + 'training_std.pkl', 'rb'))
debugging_std = pickle.load(open(PREPROC_DIR + 'debugging_std.pkl', 'rb'))
training_labels = pickle.load(open(PREPROC_DIR + 'training_labels.pkl', 'rb'))
debugging_labels = pickle.load(open(PREPROC_DIR + 'debugging_labels.pkl', 'rb'))
FEATURE_LENGTH = training_std.shape[1]
print(FEATURE_LENGTH)
In [0]:
# Examine the shape of the feature data we loaded:
print(type(training_std)) # Type will be numpy array.
print(np.shape(training_std)) # Rows, columns.
In [0]:
# Examine the label data we loaded:
print(type(training_labels)) # Type will be numpy array.
print(np.shape(training_labels)) # How many datapoints?
training_labels[:3] # First 3 training labels.
In logistic regression, one of the hyperparameters is the regularization term C, which in scikit-learn is the inverse of the regularization strength. Regularization is a penalty associated with the complexity of the model itself, such as the magnitude of its weights. The example below uses "L1" regularization, which has the following behavior: as C decreases, the number of non-zero weights also decreases (complexity decreases).
A high complexity model (high C) will fit the training data very well, but will also capture the noise inherent in the training set. This could lead to poor performance when predicting labels on the debugging set.
A low complexity model (low C) does not fit the training data as well, but will generalize better to unseen data. There is a delicate balance here, as oversimplifying the model also hurts its performance.
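To see this effect directly, here is an optional sketch (the particular C values are illustrative, not part of the original exercise) that refits the model at a few settings of C and counts how many weights L1 regularization leaves non-zero.
In [0]:
# Optional sketch: sweep a few illustrative values of C and count the weights
# that L1 regularization drives to exactly zero as C shrinks.
for c in [0.01, 0.1, 1.0, 10.0]:
  sweep_model = LogisticRegression(penalty='l1', C=c)
  sweep_model.fit(training_std, training_labels)
  print('C = %5.2f -> non-zero weights: %d' % (c, np.sum(sweep_model.coef_[0] != 0)))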
In [0]:
# Plug into scikit-learn for logistic regression training
model = LogisticRegression(penalty='l1', C=0.2) # C is inverse of the regularization strength
model.fit(training_std, training_labels)
# Count the non-zero coefficients to check the regularization strength
print('Non-zero weights: %d' % np.sum(model.coef_[0] != 0))
In [0]:
# Get the output predictions of the training and debugging inputs
training_predictions = model.predict_proba(training_std)[:, 1]
debugging_predictions = model.predict_proba(debugging_std)[:, 1]
That was easy! But how well did it do? Let's check the accuracy of the model we just trained.
In [0]:
# Accuracy metric:
def get_accuracy(truth, predictions, threshold=0.5, roundoff=2):
  """Fraction of predictions that match the true labels.

  Args:
    truth: can be Boolean (False, True), int (0, 1), or float (0, 1)
    predictions: number between 0 and 1, inclusive
    threshold: we convert predictions to 1s if they're above this value
    roundoff: report accuracy to how many decimal places?

  Returns:
    accuracy: number correct divided by total predictions
  """
  truth = np.array(truth) == 1  # Handles Boolean, int, and float labels.
  predicted = np.array(predictions) >= threshold
  matches = np.sum(predicted == truth)
  accuracy = float(matches) / len(truth)
  return round(accuracy, roundoff)
# Compute our accuracy metric for training and debugging
print('Training accuracy is ' + str(get_accuracy(training_labels, training_predictions)))
print('Debugging accuracy is ' + str(get_accuracy(debugging_labels, debugging_predictions)))
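The 0.5 threshold is baked into that accuracy number. Since average_precision_score is already imported and the TensorFlow section below tunes for precision-recall AUC, here is an optional sketch that reports the threshold-free PR AUC for the same predictions.
In [0]:
# Optional sketch: threshold-free summary of the same predictions via the area
# under the precision-recall curve (average precision).
print('Training PR AUC is ' + str(round(average_precision_score(training_labels, training_predictions), 2)))
print('Debugging PR AUC is ' + str(round(average_precision_score(debugging_labels, debugging_predictions), 2)))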
TensorFlow requires the user to define input functions: functions that return rows of feature vectors and their corresponding labels. TensorFlow will periodically call these functions to obtain data as model training progresses.
Why not just provide the feature vectors and labels upfront? Again, this comes down to the distributed aspect of TensorFlow, where data can arrive from various sources and not all of it fits on a single machine. For instance, you may have several million rows spread across a cluster, but any one machine can only provide a few thousand rows. TensorFlow lets you define the input function to pull data from a queue rather than a numpy array, and that queue can contain whatever training data is available at the time.
Another practical reason for supplying limited training data is that sometimes the feature vectors are very long, and only a few rows fit in memory at a time. Finally, complex ML models (such as deep neural networks) take a long time to train and use a lot of resources, so limiting the training samples at each machine lets us train faster and without memory issues.
An input function returns features as a dictionary of scalar, categorical, or tensor-valued features, and labels as a single tensor. In this notebook, we will simply return the entire set of features and labels with every call.
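As an aside, here is a rough sketch (not used elsewhere in this notebook) of what a mini-batch input function could look like with TF 1.x queue runners, for the case where you do not want to hand over every row at once.
In [0]:
# Rough, optional sketch (not used below): an input function that streams
# shuffled mini-batches through a queue instead of returning every row at once.
def batched_train_input_fn(batch_size=128):
  features = tf.convert_to_tensor(training_std, dtype=tf.float32)
  labels = tf.convert_to_tensor(training_labels, dtype=tf.float32)
  # slice_input_producer enqueues one (row, label) pair at a time;
  # tf.train.batch then dequeues batch_size pairs per training step.
  feature_row, label_row = tf.train.slice_input_producer([features, labels], shuffle=True)
  feature_batch, label_batch = tf.train.batch([feature_row, label_row], batch_size=batch_size)
  return {'features': feature_batch}, label_batch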
In [0]:
def train_input_fn():
  training_X_tf = tf.convert_to_tensor(training_std, dtype=tf.float32)
  training_y_tf = tf.convert_to_tensor(training_labels, dtype=tf.float32)
  return {'features': training_X_tf}, training_y_tf

def eval_input_fn():
  debugging_X_tf = tf.convert_to_tensor(debugging_std, dtype=tf.float32)
  debugging_y_tf = tf.convert_to_tensor(debugging_labels, dtype=tf.float32)
  return {'features': debugging_X_tf}, debugging_y_tf
TensorFlow's linear classifiers, such as logistic regression, are structured as estimators. An estimator can compute the objective function of the ML model and take a step towards reducing it. TensorFlow has built-in estimators such as LinearClassifier, which is essentially a logistic regression trainer. These estimators also calculate additional metrics, such as accuracy at a threshold of 0.5.
In [0]:
# Tweak this hyperparameter to improve debugging precision-recall AUC.
REG_L1 = 5.0 # Use the inverse of C in sklearn, i.e., 1/C.
LEARNING_RATE = 2.0 # How aggressively to adjust coefficients during optimization?
TRAINING_STEPS = 20000
# The estimator requires a list of feature columns describing the features returned by the input functions.
feature_columns = [real_valued_column('features', dimension=FEATURE_LENGTH)]
# We use Tensorflow's built-in LinearClassifier estimator, which implements a logistic regression.
# You can go to the model_dir below to see what Tensorflow leaves behind during training.
# Delete the directory if you wish to retrain.
estimator = LinearClassifier(
    feature_columns=feature_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=LEARNING_RATE,
        l1_regularization_strength=REG_L1),
    model_dir=OUTPUT_DIR + '-model-reg-' + str(REG_L1))
An Experiment is a TensorFlow object that wraps the estimator together with several other parameters. It can also periodically write the model's progress to checkpoints, which can be loaded later if you would like to continue training from where it last left off.
Some of the parameters are the estimator itself, the training and evaluation input functions, the number of training steps (train_steps), the number of evaluation steps (eval_steps), and how frequently evaluation is run (min_eval_frequency).
(If you run the script below multiple times without changing REG_L1 or TRAINING_STEPS, you will notice that the model does not train further, since the checkpoint in model_dir has already reached that many steps for the given configuration.)
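As an optional aside, the sketch below lists whatever the estimator has written to its model_dir so far; the commented-out line shows how you could clear it to retrain from scratch.
In [0]:
# Optional sketch: inspect the checkpoint directory the estimator writes to.
model_dir = OUTPUT_DIR + '-model-reg-' + str(REG_L1)
if os.path.exists(model_dir):
  print('\n'.join(sorted(os.listdir(model_dir))))
  # shutil.rmtree(model_dir)  # Uncomment to delete checkpoints and retrain from scratch.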
In [0]:
def generate_experiment_fn():
  def _experiment_fn(output_dir):
    return Experiment(estimator=estimator,
                      train_input_fn=train_input_fn,
                      eval_input_fn=eval_input_fn,
                      train_steps=TRAINING_STEPS,
                      eval_steps=1,
                      min_eval_frequency=1)
  return _experiment_fn
Unless you change TensorFlow's verbosity, a lot of text is printed. Such text can be useful when debugging a distributed training pipeline, but it is pretty noisy when running from a notebook locally. The line to look for is the chunk at the end where "accuracy" is reported; this is the final result of the model.
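For reference, TF 1.x verbosity is controlled with tf.logging.set_verbosity; a minimal sketch (the level you pick is up to you):
In [0]:
# Optional: control how chatty TensorFlow 1.x is. INFO shows the evaluation
# metrics; WARN or ERROR suppresses most of the progress messages.
tf.logging.set_verbosity(tf.logging.INFO)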
In [0]:
learn_runner.run(generate_experiment_fn(), OUTPUT_DIR + '-model-reg-' + str(REG_L1))