Introducing tf.estimator

Learning Objectives

  • Understand where the tf.estimator module sits in the hierarchy of TensorFlow APIs
  • Understand the workflow of creating a tf.estimator model
    1. Create Feature Columns
    2. Create Input Functions
    3. Create Estimator
    4. Train/Evaluate/Predict
  • Understand how to swap in/out different types of Estimators

Introduction

TensorFlow is a hierarchical framework. The further down the hierarchy you go, the more flexibility you have, but the more code you have to write. Generally, one starts at the highest level of abstraction, then drops down one layer at a time as additional flexibility is needed.

(Image: the TensorFlow API hierarchy; see https://www.tensorflow.org/guide/premade_estimators)

In this notebook we will be operating at the highest level of TensorFlow abstraction, using the Estimator API to predict taxi fare prices on the sampled dataset we created previously.


In [ ]:
import tensorflow as tf
import pandas as pd
import shutil

print(tf.__version__)

Load raw data

First, let's download the raw .csv data. These are the same files created in the create_datasets.ipynb notebook.


In [ ]:
!gsutil cp gs://cloud-training-demos/taxifare/small/*.csv .
!ls -l *.csv

Because the files are small, we can load them into in-memory Pandas dataframes.


In [ ]:
df_train = pd.read_csv(filepath_or_buffer = "./taxi-train.csv")
df_valid = pd.read_csv(filepath_or_buffer = "./taxi-valid.csv")
df_test = pd.read_csv(filepath_or_buffer = "./taxi-test.csv")

CSV_COLUMN_NAMES = list(df_train) # list(df) yields the dataframe's column names, in order
print(CSV_COLUMN_NAMES)

FEATURE_NAMES = CSV_COLUMN_NAMES[1:] # all but first column
LABEL_NAME = CSV_COLUMN_NAMES[0] # first column

Create feature columns

Feature columns make it easy to perform common types of feature engineering on your raw data. For example, you can one-hot encode categorical data, create feature crosses, create embeddings, and more. We'll cover these later in the course, but if you want a sneak peek, browse the official TensorFlow feature columns guide.

In our case we won't do any feature engineering. However, we still need to create a list of feature columns because the Estimator we will use requires one. To specify that numeric values should be passed on without modification, we use tf.feature_column.numeric_column().
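
For example, a single numeric feature column looks like this ("pickuplon" here is just an illustrative column name):


In [ ]:
# Illustration only: a numeric column that passes the raw value through unchanged.
example_column = tf.feature_column.numeric_column(key = "pickuplon")
print(example_column)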

Exercise 1

Use a Python list comprehension or a for loop to create the feature columns for all features in FEATURE_NAMES.


In [ ]:
feature_columns = # TODO: Your code goes here
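
If you get stuck, one possible completion (a sketch using a list comprehension) is:


In [ ]:
# Possible solution sketch: one numeric column per feature name.
feature_columns = [tf.feature_column.numeric_column(key = k) for k in FEATURE_NAMES]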

Define input function

Now that your estimator knows what type of data to expect and how to interpret it, you need to actually pass the data to it! This is the job of the input function.

The input function returns a new batch of (features, label) tuples each time it is called by the Estimator.

  • features: A Python dictionary. Each key is a feature column name and its value is a tensor containing the data for that feature
  • label: A Tensor containing the labels

So how do we get from our current Pandas dataframes to (features, label) tuples that return one batch at a time?

The tf.data module contains a collection of classes that allow you to easily load data, manipulate it, and pipe it into your model: https://www.tensorflow.org/guide/datasets_for_estimators
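
As a standalone toy illustration (the numbers are made up), the snippet below builds a Dataset from an in-memory dict of features and a list of labels, then batches it. Each element the Dataset produces is a (features, label) tuple:


In [ ]:
# Toy example: each element of toy_dataset is a ({"x": ...}, label) pair,
# produced two at a time because of the batch size.
toy_features = {"x": [1.0, 2.0, 3.0, 4.0]}
toy_labels = [10.0, 20.0, 30.0, 40.0]
toy_dataset = tf.data.Dataset.from_tensor_slices((toy_features, toy_labels)).batch(2)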

Exercise 2

The code cell below has a few TODOs for you to complete.

The first TODO in the train_input_fn asks you to create a tf.data.Dataset using the tf.data API for input pipelines. Complete the code so that the variable dataset creates a tf.data.Dataset element using the tf.data.Dataset.from_tensor_slices method. The argument tensors should be a tuple of a dict of the features and the label taken from the Pandas dataframe.

The second TODO in the train_input_fn asks you to add a shuffle, repeat, and batch operation to the dataset object you created above. Have a look at the usage of these methods in the tf.data.Dataset API.

The next TODO is in the eval_input_fn. Here you are asked to create a dataset object for the validation data. It should look similar to the pipeline you created for the train_input_fn. Note that for the eval_input_fn we don't add a shuffle or repeat step, as we'll just evaluate a given batch during each validation step.

The last TODO is in the predict_input_fn, where you are asked to once again use the TensorFlow Dataset API to set up a dataset for the prediction stage using the same from_tensor_slices as before. Note that during PREDICT we don't have the label, only features.


In [ ]:
def train_input_fn(df, batch_size = 128):
    #1. Convert dataframe into correct (features, label) format for Estimator API
    dataset = # TODO: Your code goes here
    
    # Note:
    # If we returned now, the Dataset would iterate over the data once  
    # in a fixed order, and only produce a single element at a time.
    
    #2. Shuffle, repeat, and batch the examples.
    dataset = # TODO: Your code goes here
   
    return dataset

def eval_input_fn(df, batch_size = 128):
    #1. Convert dataframe into correct (features, label) format for Estimator API
    dataset = # TODO: Your code goes here
    
    #2. Batch the examples.
    dataset = dataset.batch(batch_size = batch_size)
   
    return dataset

def predict_input_fn(df, batch_size = 128):
    #1. Convert dataframe into correct (features) format for Estimator API
    dataset = # TODO: Your code goes here

    #2. Batch the examples.
    dataset = dataset.batch(batch_size = batch_size)
   
    return dataset
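
If you want to check your work, one possible completion of all three input functions is below (a sketch; the shuffle buffer size of 1000 and the indefinite repeat are arbitrary but common choices, and the functions carry a _solution suffix so they don't shadow your versions):


In [ ]:
def train_input_fn_solution(df, batch_size = 128):
    # Convert the dataframe into a Dataset of (features_dict, label) tuples.
    dataset = tf.data.Dataset.from_tensor_slices(tensors = (dict(df[FEATURE_NAMES]), df[LABEL_NAME]))
    # Shuffle, repeat indefinitely, and batch the examples.
    dataset = dataset.shuffle(buffer_size = 1000).repeat(count = None).batch(batch_size = batch_size)
    return dataset

def eval_input_fn_solution(df, batch_size = 128):
    # Same conversion as training, but no shuffle or repeat.
    dataset = tf.data.Dataset.from_tensor_slices(tensors = (dict(df[FEATURE_NAMES]), df[LABEL_NAME]))
    dataset = dataset.batch(batch_size = batch_size)
    return dataset

def predict_input_fn_solution(df, batch_size = 128):
    # During prediction there is no label, so we slice features only.
    dataset = tf.data.Dataset.from_tensor_slices(tensors = dict(df[FEATURE_NAMES]))
    dataset = dataset.batch(batch_size = batch_size)
    return dataset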

Choose Estimator

TensorFlow has several premade estimators for you to choose from:

  • LinearClassifier/Regressor
  • BoostedTreesClassifier/Regressor
  • DNNClassifier/Regressor
  • DNNLinearCombinedClassifier/Regressor

If none of these meet your needs, you can implement a custom estimator using tf.keras. We'll cover that later in the course.

For now we will use the premade LinearRegressor. To instantiate an estimator, simply pass it the feature columns to expect and specify a directory for it to output checkpoint files to.

Exercise 3

Complete the code in the cell below to define a linear regression model using the TF Estimator API. Have a look at the documentation to see what variables you must pass to initialize a LinearRegressor instance. You'll want to add values for feature_columns, model_dir and config. When setting up config, have a look at the documentation for tf.estimator.RunConfig and be sure to set tf_random_seed to ensure reproducibility.


In [ ]:
OUTDIR = "taxi_trained"

model = tf.estimator.LinearRegressor(
# TODO: Your code goes here
)
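
One possible completion (a sketch; the seed value of 1 is an arbitrary choice, matching the DNNRegressor cell later in this notebook):


In [ ]:
# Possible solution sketch: a LinearRegressor with a fixed random seed.
model = tf.estimator.LinearRegressor(
    feature_columns = feature_columns,
    model_dir = OUTDIR,
    config = tf.estimator.RunConfig(tf_random_seed = 1)
)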

Train

Simply invoke the estimator's train() function. Specify the input_fn, which tells it how to load in data, and specify the number of steps to train for.

By default, estimators check the output directory for checkpoint files before beginning training so they can pick up where they last left off. To prevent this, we'll delete the output directory before starting training each time.


In [ ]:
%%time
tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time

model.train(
    input_fn = lambda: train_input_fn(df = df_train), 
    steps = 500)

Evaluate

Estimators similarly have an evaluate() function. In this case we don't need to specify a number of steps because we didn't tell our input function to repeat the data: once the input function reaches the end of the data, evaluation ends.

Loss is reported as MSE by default, so we take the square root before printing.

Exercise 4

Complete the code in the cell below to run evaluation on the model you just trained. You'll use the evaluate method of the LinearRegressor model you created and trained above. Have a look at the documentation of the evaluate method to see what it expects. Note that you'll need to pass the evaluation input function as a lambda function processing the Pandas dataframe df_valid.


In [ ]:
def print_rmse(model, df):
    metrics = model.evaluate(
        # TODO: Your code goes here
    )
    print("RMSE on dataset = {}".format(metrics["average_loss"]**.5))
print_rmse(model = model, df = df_valid)
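
For reference, one possible completion of the evaluate call (a sketch, named differently so it doesn't shadow the exercise version):


In [ ]:
# Possible solution sketch: evaluate using the eval input function on
# whatever dataframe is passed in (df_valid above).
def print_rmse_solution(model, df):
    metrics = model.evaluate(input_fn = lambda: eval_input_fn(df = df))
    print("RMSE on dataset = {}".format(metrics["average_loss"]**.5))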

An RMSE of 9.43 is worse than our rules-based benchmark (RMSE of $7.70). However, given that we haven't done any feature engineering or hyperparameter tuning, and that we're training on a small dataset with a simple linear model, we shouldn't expect good performance yet.

The goal at this point is to demonstrate the mechanics of the Estimator API. In subsequent notebooks we'll improve on the model.

Predict

To run prediction on the test set df_test, we use the predict_input_fn you created above, passing the df_test dataframe for prediction. We'll use our model to make predictions on the first 10 elements of the df_test dataframe.


In [ ]:
predictions = model.predict(input_fn = lambda: predict_input_fn(df = df_test[:10]))
for items in predictions:
    print(items)
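
Each element yielded by predict() is a Python dict; for a regressor, the prediction array is stored under the "predictions" key. To pull out just the predicted fare, you could do something like this variant of the cell above:


In [ ]:
# Re-run prediction and extract the scalar fare from each dict.
predictions = model.predict(input_fn = lambda: predict_input_fn(df = df_test[:10]))
for items in predictions:
    print(items["predictions"][0])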

As further evidence of the primitiveness of our model, it predicts almost the same amount for every trip!

Change Estimator type

One of the payoffs of using the Estimator API is that we can swap in a different model type with just a few lines of code. Let's try a DNN. Note how we now need to specify the number of neurons in each hidden layer. Have a look at the documentation for the DNNRegressor to see what other variables you can set.


In [ ]:
%%time
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(path = OUTDIR, ignore_errors = True)

model = tf.estimator.DNNRegressor(
    hidden_units = [10,10], # specify neural architecture
    feature_columns = feature_columns, 
    model_dir = OUTDIR,
    config = tf.estimator.RunConfig(tf_random_seed = 1)
)
model.train(
    input_fn = lambda: train_input_fn(df = df_train), 
    steps = 500)
print_rmse(model = model, df = df_valid)

Our performance is only slightly better at 9.26, and still far worse than our rules-based model. This illustrates an important tenet of machine learning: a more complex model can't outrun bad data.

Since we're not currently doing any feature engineering, our input data has very little signal to learn from, so using a DNN doesn't help much.

Results summary

We can summarize our results in a table here.

Exercise 5

Insert the results you found for the LinearRegressor and DNNRegressor model performance here.

Model                    RMSE on validation set
---------------------    ---------------------------
Rules Based Benchmark    7.76
Linear Model             TODO: Your results go here
DNN Model                TODO: Your results go here

Challenge exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Simulate the necessary training dataset.

Hint (highlight to see):

The input features will be r and h, and the label will be $\pi r^2 h$. Create random values for r and h and compute V. Your dataset will consist of r, h and V. Then use a DNN regressor. Make sure to generate enough data.
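
A minimal sketch of the data-simulation step (assuming numpy for random generation; the example count of 50,000 is an arbitrary choice):


In [ ]:
import numpy as np

# Simulate training data: r and h uniform in [0.5, 2.0], label V = pi * r^2 * h.
N = 50000
r = np.random.uniform(low = 0.5, high = 2.0, size = N)
h = np.random.uniform(low = 0.5, high = 2.0, size = N)
df_cylinder = pd.DataFrame({"V": np.pi * r**2 * h, "r": r, "h": h}) # label first, then features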

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.