Post-Training Integer Quantization

Overview

TensorFlow Lite now supports converting an entire model (weights and activations) to 8-bit integers during model conversion from TensorFlow to TensorFlow Lite's flatbuffer format. This results in a 4x reduction in model size and a 3-4x speedup on CPU. In addition, the fully quantized model can be consumed by integer-only hardware accelerators.

In contrast to post-training "on-the-fly" quantization, which only stores the weights as 8-bit integers, this technique statically quantizes all weights and activations during model conversion.
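
To make the distinction concrete, the sketch below contrasts the two converter configurations using the same TFLiteConverter API demonstrated in the rest of this tutorial. It is illustrative only: saved_model_dir and representative_data_gen stand in for the path and generator built later in the tutorial, and the output variable names are placeholders.

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)

# Weight-only ("on-the-fly") quantization: only the weights are stored as
# 8-bit integers; activations stay in float at runtime.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
weight_quantized_model = converter.convert()

# Full integer quantization: additionally supply a representative dataset so
# the converter can calibrate the ranges of the activations.
converter.representative_dataset = representative_data_gen
fully_quantized_model = converter.convert()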

In this tutorial, we train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the saved model into a TensorFlow Lite flatbuffer with full quantization. Finally, we check the accuracy of the converted model and compare it to the original saved model. Training uses the mnist.py script from the official TensorFlow models repository.

Building an MNIST model

Setup


In [0]:
! pip uninstall -y tensorflow
! pip install -U tf-nightly

In [0]:
import tensorflow as tf
tf.enable_eager_execution()

In [0]:
! git clone --depth 1 https://github.com/tensorflow/models

In [0]:
import sys
import os

if sys.version_info.major >= 3:
    import pathlib
else:
    import pathlib2 as pathlib

# Add `models` to the python path.
models_path = os.path.join(os.getcwd(), "models")
sys.path.append(models_path)

Train and export the model


In [0]:
saved_models_root = "/tmp/mnist_saved_model"

In [0]:
# The above path addition is not visible to subprocesses, so add the path for the subprocess as well.
# Note: channels_last is required here, or the conversion may fail.
!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last

For this example, we train the model for only a single epoch, so it only reaches ~96% accuracy.

Convert to a TensorFlow Lite model

The SavedModel directory is named with a timestamp. Select the most recent one:


In [0]:
saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob("*"))[-1])
saved_model_dir

Using the Python TFLiteConverter, the saved model can be converted into a TensorFlow Lite model.

First load the model using the TFLiteConverter:


In [0]:
import tensorflow as tf
tf.enable_eager_execution()
tf.logging.set_verbosity(tf.logging.DEBUG)

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

Write it out to a .tflite file:


In [0]:
tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)

In [0]:
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)

To instead quantize the model on export, first set the optimizations flag to optimize for size:


In [0]:
tf.logging.set_verbosity(tf.logging.INFO)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

Now, construct and provide a representative dataset; the converter uses it to estimate the dynamic range of the activations.


In [0]:
mnist_train, _ = tf.keras.datasets.mnist.load_data()
images = tf.cast(mnist_train[0], tf.float32)/255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images)).batch(1)
def representative_data_gen():
  for input_value in mnist_ds.take(100):
    yield [input_value]

converter.representative_dataset = representative_data_gen

Finally, convert the model as usual. Note that, by default, the converted model still uses float inputs and outputs for invocation convenience.


In [0]:
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)
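
To confirm this, the quantized model can be loaded into an interpreter and its reported input and output types inspected (a minimal sketch using the in-memory model produced above; interpreter_check is just an illustrative name):


In [0]:
# The interface tensors remain float32 by default, even though the
# internals of the model are fully quantized.
interpreter_check = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter_check.allocate_tensors()
print(interpreter_check.get_input_details()[0]["dtype"])
print(interpreter_check.get_output_details()[0]["dtype"])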

Note how the resulting file is approximately 1/4 the size of the float model.


In [0]:
!ls -lh {tflite_models_dir}
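
The size reduction can also be checked programmatically (a minimal sketch using the file paths created above):


In [0]:
# Compare the on-disk sizes of the float and quantized models.
float_size = os.path.getsize(str(tflite_model_file))
quant_size = os.path.getsize(str(tflite_model_quant_file))
print("Float model:     %.1f KiB" % (float_size / 1024.0))
print("Quantized model: %.1f KiB" % (quant_size / 1024.0))
print("Size ratio:      %.2fx" % (float_size / float(quant_size)))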

Run the TensorFlow Lite models

We can run the TensorFlow Lite models using the Python TensorFlow Lite Interpreter.

Load the test data

First, let's load the MNIST test data to feed to the model:


In [0]:
import numpy as np
_, mnist_test = tf.keras.datasets.mnist.load_data()
images, labels = tf.cast(mnist_test[0], tf.float32)/255.0, mnist_test[1]

mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)

Load the models into the interpreters


In [0]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()

In [0]:
interpreter_quant = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter_quant.allocate_tensors()

Test the models on one image


In [0]:
# Grab a single image/label pair from the test dataset.
for img, label in mnist_ds:
  break

interpreter.set_tensor(interpreter.get_input_details()[0]["index"], img)
interpreter.invoke()
predictions = interpreter.get_tensor(
    interpreter.get_output_details()[0]["index"])

In [0]:
import matplotlib.pylab as plt

plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)

In [0]:
interpreter_quant.set_tensor(
    interpreter_quant.get_input_details()[0]["index"], img)
interpreter_quant.invoke()
predictions = interpreter_quant.get_tensor(
    interpreter_quant.get_output_details()[0]["index"])

In [0]:
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)

Evaluate the models


In [0]:
def eval_model(interpreter, mnist_ds):
  """Evaluates a TFLite interpreter on the dataset and returns its accuracy."""
  total_seen = 0
  num_correct = 0

  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]
  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    # Count the prediction as correct if it matches the true label.
    if predictions == label.numpy():
      num_correct += 1

    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))

  return float(num_correct) / float(total_seen)

In [0]:
print(eval_model(interpreter, mnist_ds))

We can repeat the evaluation on the fully quantized model:


In [0]:
# NOTE: Colab runs on server CPUs. At the time of writing, TensorFlow Lite
# does not have highly optimized server CPU kernels, so this may run slower
# than the float interpreter above. On mobile CPUs, however, a considerable
# speedup can be observed.
print(eval_model(interpreter_quant, mnist_ds))

In this example, we have fully quantized a model with no loss in accuracy.