TF-TRT Inference From Saved Model with TensorFlow <= 1.13

This notebook is based on https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/TF-TRT-inference-from-saved-model.ipynb as committed on Jun 19, 2019.

Notebook Content

  1. Pre-requisite: data and model
  2. Verifying the original FP32 model
  3. Creating TF-TRT FP32 model
  4. Creating TF-TRT FP16 model
  5. Creating TF-TRT INT8 model
  6. Calibrating TF-TRT INT8 model with raw JPEG images

Quick start

We will run this demonstration with a ResNet-50 model. The INT8 calibration process requires access to a small but representative sample of real training or validation data.

We will use the ImageNet dataset stored in TFRecord format. Google provides an all-in-one script for downloading and preparing the ImageNet dataset at

https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh.
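
If you need to prepare the dataset yourself, the script can be invoked along the lines of the sketch below; the output directory is a placeholder, and downloading ImageNet requires the access credentials described in the script itself.


In [ ]:
# Hypothetical invocation (output directory is a placeholder): downloads ImageNet
# and writes TFRecord shards such as validation-00000-of-00128 into the given directory.
!bash download_and_preprocess_imagenet.sh /data/imagenet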

1. Pre-requisite: data and model


In [1]:
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
import time
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)

config = tf.ConfigProto()
config.gpu_options.allow_growth=True

Data

We verify that the correct ImageNet data folder has been mounted and that validation data files of the form validation-00xxx-of-00128 are available.


In [2]:
def get_files(data_dir, filename_pattern):
    if data_dir is None:
        return []
    files = tf.gfile.Glob(os.path.join(data_dir, filename_pattern))
    if files == []:
        raise ValueError('Can not find any files in {} with '
                         'pattern "{}"'.format(data_dir, filename_pattern))
    return files

In [3]:
VALIDATION_DATA_DIR = "gs://sandbox-kathryn-data/imagenet"

calibration_files = get_files(VALIDATION_DATA_DIR, 'validation*')
print('There are %d calibration files. \n%s\n%s\n...'%(len(calibration_files), calibration_files[0], calibration_files[-1]))


There are 128 calibration files. 
gs://sandbox-kathryn-data/imagenet/validation-00000-of-00128
gs://sandbox-kathryn-data/imagenet/validation-00127-of-00128
...

TF saved model

We download a ResNet-50 model trained following the Cloud TPU tutorials.


In [4]:
!gsutil cp -R gs://sandbox-kathryn-resnet/export/1564508810/* models/resnet/original/00001


Copying gs://sandbox-kathryn-resnet/export/1564508810/saved_model.pb...
Copying gs://sandbox-kathryn-resnet/export/1564508810/variables/variables.data-00000-of-00001...
/ [2 files][ 98.2 MiB/ 98.2 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://sandbox-kathryn-resnet/export/1564508810/variables/variables.index...
- [3 files][ 98.2 MiB/ 98.2 MiB]                                                
Operation completed over 3 objects/98.2 MiB.                                     
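
As the gsutil note above suggests, the same copy can be parallelized with the -m flag, for example:


In [ ]:
# Parallel copy of the exported model (same source and destination as above).
!gsutil -m cp -R gs://sandbox-kathryn-resnet/export/1564508810/* models/resnet/original/00001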

In [5]:
SAVED_MODEL_DIR='{}/models/'.format(os.getcwd())
RESNET_MODEL_DIR=os.path.join(SAVED_MODEL_DIR, 'resnet')
ORIGINAL_MODEL_DIR=os.path.join(RESNET_MODEL_DIR, 'original', '00001')

Helper functions

We define a few helper functions to read and preprocess ImageNet data from TFRecord files.


In [6]:
def deserialize_image_record(record):
    keys_to_features = {
        'image/encoded': tf.FixedLenFeature((), tf.string, ''),
        'image/class/label': tf.FixedLenFeature([], tf.int64, -1),
    }

    with tf.name_scope('deserialize_image_record'):
        parsed = tf.parse_single_example(record, keys_to_features)
        image_bytes = tf.reshape(parsed['image/encoded'], shape=[])
        label = tf.cast(tf.reshape(parsed['image/class/label'], shape=[]), dtype=tf.int32)
        
        # 'image/class/label' is encoded as an integer from 1 to num_label_classes.
        # To generate the correct zero-based label from this number,
        # we subtract 1 so that labels lie in [0, num_label_classes).
        label -= 1
        return image_bytes, label

In [7]:
image_size=224
CROP_PADDING=32

def preprocess(record):
    image_bytes, label = deserialize_image_record(record)
    
    shape = tf.image.extract_jpeg_shape(image_bytes)
    image_height = shape[0]
    image_width = shape[1]
    
    padded_center_crop_size = tf.cast(
        ((image_size / (image_size + CROP_PADDING)) *
         tf.cast(tf.minimum(image_height, image_width), tf.float32)),
        tf.int32)

    offset_height = ((image_height - padded_center_crop_size) + 1) // 2
    offset_width = ((image_width - padded_center_crop_size) + 1) // 2
    crop_window = tf.stack([offset_height, offset_width,
                            padded_center_crop_size, padded_center_crop_size])
    image = tf.image.decode_and_crop_jpeg(image_bytes, crop_window, channels=3)
    image = tf.image.resize_bicubic([image], [image_size, image_size])[0]
    image = tf.reshape(image, [image_size, image_size, 3])
    image = tf.image.convert_image_dtype(image, tf.float32)
    
    return image, label

In [8]:
#Define some global variables
BATCH_SIZE = 64

dataset = tf.data.TFRecordDataset(calibration_files)    
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=BATCH_SIZE, num_parallel_calls=20))

2. Verifying the original FP32 model

We demonstrate the conversion process with a ResNet-50 v1 model. First, we inspect the original TensorFlow model.

We employ saved_model_cli to inspect the inputs and outputs of the model.


In [24]:
!saved_model_cli show --all --dir $ORIGINAL_MODEL_DIR


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['classify']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict

This gives us the names of the input tensor (input_tensor:0) and of the probabilities output tensor (softmax_tensor:0), which we will use below. Note that some ResNet-50 checkpoints are exported with 1001 output classes because the network was trained with an extra background class; the model used here exposes the standard 1000 ImageNet classes.


In [10]:
INPUT_TENSOR = 'input_tensor:0'
OUTPUT_TENSOR = 'softmax_tensor:0'
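
The same names can also be read programmatically from the SavedModel's SignatureDef, which avoids hard-coding them; a minimal sketch, assuming the serving_default signature shown above:


In [ ]:
# Sketch: read the input/output tensor names from the serving_default signature.
with tf.Session(graph=tf.Graph(), config=config) as sess:
    meta_graph = tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], ORIGINAL_MODEL_DIR)
    signature = meta_graph.signature_def['serving_default']
    print('Input tensor: ', signature.inputs['input'].name)           # input_tensor:0
    print('Output tensor:', signature.outputs['probabilities'].name)  # softmax_tensor:0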

Next, we define a function to read in a SavedModel and measure its speed and accuracy on the validation data.


In [11]:
def benchmark_saved_model(SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=64):
    with tf.Session(graph=tf.Graph(), config=config) as sess:
        # prepare dataset iterator
        iterator = dataset.make_one_shot_iterator()
        next_element = iterator.get_next()

        tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], SAVED_MODEL_DIR)

        print('Warming up for 50 batches...')
        for _ in range (50):
            sess.run(OUTPUT_TENSOR, feed_dict={INPUT_TENSOR: sess.run(next_element)[0]})

        print('Benchmarking inference engine...')
        num_hits = 0
        num_predict = 0
        start_time = time.time()
        try:
            while True:        
                image_data = sess.run(next_element)    
                img = image_data[0]
                label = image_data[1].squeeze()
                output = sess.run([OUTPUT_TENSOR], feed_dict={INPUT_TENSOR: img})            
                prediction = np.argmax(output[0], axis=1)
                num_hits += np.sum(prediction == label)
                num_predict += len(prediction)
        except tf.errors.OutOfRangeError as e:
            pass

        print('Accuracy: %.2f%%'%(100*num_hits/num_predict))
        print('Inference speed: %.2f samples/s'%(num_predict/(time.time()-start_time)))

In [12]:
benchmark_saved_model(ORIGINAL_MODEL_DIR, dataset=dataset, BATCH_SIZE=BATCH_SIZE)


Warming up for 50 batches...
Benchmarking inference engine...
Accuracy: 61.38%
Inference speed: 151.48 samples/s

3. Creating TF-TRT FP32 model

Next, we convert the native TensorFlow FP32 model to TF-TRT FP32, then verify model accuracy and inference speed.


In [13]:
FP32_SAVED_MODEL_DIR = os.path.join(RESNET_MODEL_DIR, "FP32", "00001")
!rm -rf $FP32_SAVED_MODEL_DIR

#Now we create the TFTRT FP32 engine
_ = trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=ORIGINAL_MODEL_DIR,
    output_saved_model_dir=FP32_SAVED_MODEL_DIR,
    precision_mode="FP32")

In [14]:
benchmark_saved_model(FP32_SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=BATCH_SIZE)


Warming up for 50 batches...
Benchmarking inference engine...
Accuracy: 61.38%
Inference speed: 308.16 samples/s

4. Creating TF-TRT FP16 model

Next, we convert the native TensorFlow FP32 model to TF-TRT FP16, then verify model accuracy and inference speed.


In [15]:
FP16_SAVED_MODEL_DIR = os.path.join(RESNET_MODEL_DIR, "FP16", "00001")
!rm -rf $FP16_SAVED_MODEL_DIR

#Now we create the TFTRT FP16 engine
_ = trt.create_inference_graph(
    input_graph_def=None,
    outputs=None,
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=ORIGINAL_MODEL_DIR,
    output_saved_model_dir=FP16_SAVED_MODEL_DIR,
    precision_mode="FP16")

In [16]:
benchmark_saved_model(FP16_SAVED_MODEL_DIR, dataset=dataset, BATCH_SIZE=BATCH_SIZE)


Warming up for 50 batches...
Benchmarking inference engine...
Accuracy: 61.38%
Inference speed: 357.50 samples/s

5. Creating TF-TRT INT8 model

Creating a TF-TRT INT8 inference model requires two steps:

  • Step 1: create the calibration graph and run representative calibration data through it to collect INT8 calibration statistics.

  • Step 2: convert the calibration graph into the TF-TRT INT8 inference engine.

Step 1: Creating the calibration graph


In [17]:
#Now we create the TFTRT INT8 calibration graph
trt_int8_calib_graph = trt.create_inference_graph(
    input_graph_def=None,
    outputs=[OUTPUT_TENSOR],
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=ORIGINAL_MODEL_DIR,    
    precision_mode="INT8")

In [18]:
#Then calibrate it with 10 batches of examples
N_runs=10
with tf.Session(graph=tf.Graph(), config=config) as sess:
    print('Preparing calibration data...')
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    print('Loading INT8 calibration graph...')
    output_node = tf.import_graph_def(
        trt_int8_calib_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    print('Calibrate model on calibration data...')    
    for _ in range(N_runs):
        sess.run(output_node, feed_dict={INPUT_TENSOR: sess.run(next_element)[0]})


Preparing calibration data...
Loading INT8 calibration graph...
Calibrate model on calibration data...

Step 2: Converting the calibration graph to inference graph

Now we convert the INT8 calibration graph to the final TF-TRT INT8 inference engine, then save this engine as a SavedModel, ready to be served elsewhere.


In [19]:
#Create the INT8 inference graph from the calibration graph
print('Creating TF-TRT INT8 inference engine...')
trt_int8_calibrated_graph=trt.calib_graph_to_infer_graph(trt_int8_calib_graph)


Creating TF-TRT INT8 inference engine...

In [20]:
# Copy MetaGraph from base model.
with tf.Graph().as_default():
    with tf.Session(config=config) as sess:
        base_model = tf.saved_model.loader.load(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            ORIGINAL_MODEL_DIR)

        # Copy information.
        metagraph = tf.MetaGraphDef()
        metagraph.graph_def.CopyFrom(trt_int8_calibrated_graph)
        for key in base_model.collection_def:
            if key not in [
                'variables', 'local_variables', 'model_variables',
                'trainable_variables', 'train_op', 'table_initializer'
            ]:
                metagraph.collection_def[key].CopyFrom(
                    base_model.collection_def[key])

        metagraph.meta_info_def.CopyFrom(base_model.meta_info_def)
        for key in base_model.signature_def:
            metagraph.signature_def[key].CopyFrom(
                base_model.signature_def[key])

In [21]:
INT8_SAVED_MODEL_DIR =  os.path.join(RESNET_MODEL_DIR, "INT8/00001")
!rm -rf $INT8_SAVED_MODEL_DIR

saved_model_builder = tf.saved_model.builder.SavedModelBuilder(INT8_SAVED_MODEL_DIR)
with tf.Graph().as_default():
    tf.graph_util.import_graph_def(
        trt_int8_calibrated_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')
    # We don't use TRT here.
    with tf.Session(config=config) as sess:
        saved_model_builder.add_meta_graph_and_variables(
            sess,
            ('serve',),
            signature_def_map=metagraph.signature_def)
# Ignore other meta graphs from the input SavedModel.
saved_model_builder.save()


Out[21]:
b'/home/jupyter/gcp-getting-started-lab-jp/machine_learning/ml_infrastructure/inference-server-performance/server/models/resnet/INT8/00001/saved_model.pb'

Benchmarking INT8 saved model

Finally, we reload the INT8 SavedModel from disk and verify its accuracy and performance.


In [22]:
benchmark_saved_model(INT8_SAVED_MODEL_DIR)


Warming up for 50 batches...
Benchmarking inference engine...
Accuracy: 61.25%
Inference speed: 373.62 samples/s

In [23]:
!saved_model_cli show --all --dir $INT8_SAVED_MODEL_DIR


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['classify']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: input_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict
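
The exported SavedModel can be served with TensorFlow Serving, or loaded directly for ad-hoc prediction. A minimal sketch using tf.contrib.predictor, assuming a preprocessed float32 batch of shape [N, 224, 224, 3] (the input/output keys match the serving_default signature shown above):


In [ ]:
# Sketch: load the INT8 SavedModel and run one placeholder batch through the
# serving_default signature.
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(INT8_SAVED_MODEL_DIR)
batch = np.random.rand(1, 224, 224, 3).astype(np.float32)  # placeholder input
result = predict_fn({'input': batch})
print(result['classes'], result['probabilities'].shape)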

6. Calibrating TF-TRT INT8 model with raw JPEG images

As an alternative to reading data in TFRecord format, in this section we demonstrate how to calibrate a TF-TRT INT8 model from a directory of raw JPEG images. We assume that the raw images have been mounted at /data/Calibration_data.

As a rule of thumb, the calibration set should be small but representative of the images expected in deployment. Empirically, for common network architectures trained on ImageNet, 500-1000 calibration images provide good accuracy. A good strategy for a dataset such as ImageNet is therefore to choose one sample from each class, as sketched below.
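
For example, if the raw images were organized into one subdirectory per class (an assumption about the directory layout, not something the rest of this section requires), a one-image-per-class calibration list could be built as follows; the cell after it simply collects every image found under /data/Calibration_data.


In [ ]:
# Sketch: pick one image per class, assuming /data/Calibration_data/<class>/<image>.jpg.
import glob

class_dirs = sorted(glob.glob(os.path.join("/data/Calibration_data", "*")))
one_per_class = []
for class_dir in class_dirs:
    images = sorted(glob.glob(os.path.join(class_dir, "*")))
    if images:
        one_per_class.append(images[0])
print('Sampled %d calibration images (one per class).' % len(one_per_class))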


In [ ]:
data_directory = "/data/Calibration_data"
calibration_files = [os.path.join(path, name) for path, _, files in os.walk(data_directory) for name in files]
print('There are %d calibration files. \n%s\n%s\n...'%(len(calibration_files), calibration_files[0], calibration_files[-1]))

We define a helper function to read and preprocess an image from a JPEG file.


In [ ]:
def parse_file(filepath):
    image = tf.read_file(filepath)
    image = tf.image.decode_jpeg(image, channels=3)
    # Resize to the network input resolution (plain resize, no center crop).
    image = tf.image.resize_images(image, [224, 224])
    image = tf.reshape(image, [224, 224, 3])
    return image

In [ ]:
dataset = tf.data.Dataset.from_tensor_slices(calibration_files)
dataset = dataset.apply(tf.contrib.data.map_and_batch(map_func=parse_file, batch_size=BATCH_SIZE, num_parallel_calls=20))
dataset = dataset.repeat(count=1)

Next, we proceed with the same two-step process of creating and calibrating the TF-TRT INT8 model.

Step 1: Creating the calibration graph


In [ ]:
#Now we create the TFTRT INT8 calibration graph
trt_int8_calib_graph = trt.create_inference_graph(
    input_graph_def=None,
    outputs=[OUTPUT_TENSOR],
    max_batch_size=BATCH_SIZE,
    input_saved_model_dir=ORIGINAL_MODEL_DIR,
    precision_mode="INT8")

#Then calibrate it with 10 batches of examples
N_runs=10
with tf.Session(graph=tf.Graph(), config=config) as sess:
    print('Preparing calibration data...')
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    print('Loading INT8 calibration graph...')
    output_node = tf.import_graph_def(
        trt_int8_calib_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    print('Calibrate model on calibration data...')    
    for _ in range(N_runs):
        sess.run(output_node, feed_dict={INPUT_TENSOR: sess.run(next_element)})

Step 2: Converting the calibration graph to inference graph


In [ ]:
#Create the INT8 inference graph from the calibration graph and save it as a SavedModel
print('Creating TF-TRT INT8 inference engine...')
trt_int8_calibrated_graph=trt.calib_graph_to_infer_graph(trt_int8_calib_graph)

#set a directory to write the saved model
INT8_SAVED_MODEL_DIR = os.path.join(RESNET_MODEL_DIR, "INT8_JPEG", "00001")
!rm -rf $INT8_SAVED_MODEL_DIR

with tf.Session(graph=tf.Graph()) as sess:
    print('Loading TF-TRT INT8 inference engine...')
    output_node = tf.import_graph_def(
        trt_int8_calibrated_graph,
        return_elements=[OUTPUT_TENSOR],
        name='')

    #Save model for serving
    print('Saving INT8 model to %s'%INT8_SAVED_MODEL_DIR)
    tf.saved_model.simple_save(
        session=sess,
        export_dir=INT8_SAVED_MODEL_DIR,
        inputs={"input":tf.get_default_graph().get_tensor_by_name(INPUT_TENSOR)},
        outputs={"softmax":tf.get_default_graph().get_tensor_by_name(OUTPUT_TENSOR),
                 "classes":tf.get_default_graph().get_tensor_by_name("ArgMax:0")},
        legacy_init_op=None
     )

As before, we can benchmark the speed and accuracy of the resulting model.


In [ ]:
benchmark_saved_model(INT8_SAVED_MODEL_DIR)

Conclusion

In this notebook, we have demonstrated the process of creating TF-TRT inference models from an original TensorFlow FP32 SavedModel. In each case, we have also verified the accuracy and speed of the resulting model.