In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Post-training integer quantization

Overview

TensorFlow Lite now supports converting all model values (weights and activations) to 8-bit integers when converting from TensorFlow to TensorFlow Lite's FlatBuffer format. Because each 32-bit float is replaced by an 8-bit integer, this results in about a 4x reduction in model size, along with a 3 to 4x improvement in CPU performance. In addition, the fully quantized model can be consumed by integer-only hardware accelerators.

In contrast to post-training "on-the-fly" quantization, which stores only the weights as 8-bit integers, this technique statically quantizes all weights and activations during model conversion.
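
As a rough illustration of what storing a value as an 8-bit integer means, the sketch below applies the affine mapping real_value ≈ scale * (quantized_value - zero_point) that 8-bit quantization is based on. The function names and the scale and zero_point values are hypothetical choices for this example; this is not TensorFlow Lite's implementation.

```python
# Illustrative sketch of 8-bit affine quantization (not TensorFlow Lite's code).
# `scale` and `zero_point` are assumed values chosen only for this example.

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value: q = round(x / scale) + zero_point."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the representable int8 range


def dequantize(q, scale, zero_point):
    """Recover an approximate float: x ~= scale * (q - zero_point)."""
    return scale * (q - zero_point)


scale, zero_point = 0.05, 10  # assumed parameters for a single tensor
for x in [-1.0, 0.0, 0.37, 2.5]:
    q = quantize(x, scale, zero_point)
    x_hat = dequantize(q, scale, zero_point)
    print("x=%6.2f  q=%4d  x_hat=%6.2f  error=%.3f" % (x, q, x_hat, abs(x - x_hat)))
```

For values inside the representable range, the round-trip error is at most half of scale; values outside the range are clamped, which is why the range chosen for each tensor matters.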

In this tutorial, you'll train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the saved model into a TensorFlow Lite FlatBuffer with full quantization. Finally, you'll check the accuracy of the converted model and compare it to the original float model.

The training script, mnist.py, is available from the TensorFlow official MNIST tutorial.

Build an MNIST model

Setup


In [0]:
! pip uninstall -y tensorflow
! pip install -U tf-nightly

In [0]:
import tensorflow as tf
tf.enable_eager_execution()

In [0]:
! git clone --depth 1 https://github.com/tensorflow/models

In [0]:
import sys
import os

if sys.version_info.major >= 3:
    import pathlib
else:
    import pathlib2 as pathlib

# Add `models` to the python path.
models_path = os.path.join(os.getcwd(), "models")
sys.path.append(models_path)

Train and export the model


In [0]:
saved_models_root = "/tmp/mnist_saved_model"

In [0]:
# The path added above is not visible to subprocesses, so set PYTHONPATH for the subprocess as well.
# Note: channels_last is required here or the conversion may fail.
!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last

Training won't take long because the model trains for just a single epoch, which reaches about 96% accuracy.

Convert to a TensorFlow Lite model

Using the Python TFLiteConverter, you can now convert the trained model into a TensorFlow Lite model.

The trained model is saved in a subdirectory of saved_models_root that is named with a timestamp, so select the most recent subdirectory:


In [0]:
saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob("*"))[-1])
saved_model_dir

Now load the model using the TFLiteConverter:


In [0]:
import tensorflow as tf
tf.enable_eager_execution()
tf.logging.set_verbosity(tf.logging.DEBUG)

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

Write it out to a .tflite file:


In [0]:
tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)

In [0]:
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)

Now you have a trained MNIST model that's converted to a .tflite file, but it's still using 32-bit float values for all parameter data.

So let's convert the model again, this time using quantization...

Convert using quantization

First, set the optimizations flag to use the default optimization strategy:


In [0]:
tf.logging.set_verbosity(tf.logging.INFO)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

Now, in order to create quantized values with an accurate dynamic range of activations, you need to provide a representative dataset:


In [0]:
mnist_train, _ = tf.keras.datasets.mnist.load_data()
images = tf.cast(mnist_train[0], tf.float32)/255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images)).batch(1)
def representative_data_gen():
  for input_value in mnist_ds.take(100):
    yield [input_value]

converter.representative_dataset = representative_data_gen
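
To see why a representative dataset matters, the sketch below derives a scale and zero point from the minimum and maximum values observed across a few sample inputs. It is a simplified, hypothetical min/max calibration; the `calibrate` helper and the activation values are made up for illustration, and the converter's actual calibration procedure may differ.

```python
# Simplified min/max calibration sketch (an assumption, not the converter's code).

def calibrate(samples, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the min/max seen across all samples."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / float(qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all observed values were zero; any positive scale works
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


# Hypothetical activation values recorded from three calibration inputs.
activations = [[0.0, 1.2, 3.4], [0.5, 2.0, 5.1], [0.1, 0.9, 4.0]]
scale, zero_point = calibrate(activations)
print("scale=%.4f zero_point=%d" % (scale, zero_point))
```

If the samples do not cover the values seen at inference time, the derived range is too narrow and out-of-range activations are clamped, which is the accuracy risk that a representative dataset is meant to reduce.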

Finally, convert the model to TensorFlow Lite format:


In [0]:
tflite_model_quant = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)

Note how the resulting file is approximately 1/4 the size:


In [0]:
!ls -lh {tflite_models_dir}
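
The same comparison can be made in Python. The sketch below uses a hypothetical `size_report` helper (not part of TensorFlow) and assumes the two files were written to the paths used earlier in this tutorial; it prints nothing if they are missing.

```python
import os


def size_report(float_path, quant_path):
    """Return (float_bytes, quant_bytes, quant/float ratio) for two model files."""
    float_bytes = os.path.getsize(str(float_path))
    quant_bytes = os.path.getsize(str(quant_path))
    return float_bytes, quant_bytes, float(quant_bytes) / float_bytes


# These paths match the files written earlier in this tutorial.
float_file = "/tmp/mnist_tflite_models/mnist_model.tflite"
quant_file = "/tmp/mnist_tflite_models/mnist_model_quant.tflite"
if os.path.exists(float_file) and os.path.exists(quant_file):
    f_bytes, q_bytes, ratio = size_report(float_file, quant_file)
    print("float: %d bytes, quantized: %d bytes, ratio: %.2f" % (f_bytes, q_bytes, ratio))
```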

Your model should now be fully quantized. However, if you convert a model that includes any operations that TensorFlow Lite cannot quantize, those ops are left in floating point. This allows the conversion to complete, so you still get a smaller and more efficient model, but the model won't be compatible with some ML accelerators that require full integer quantization. Also, by default, the converted model still uses float inputs and outputs, which is also incompatible with some accelerators.

To ensure that the converted model is fully quantized (that is, to make the converter throw an error if it encounters an operation it cannot quantize), and to use integers for the model's input and output, you need to convert the model again using these additional configurations:


In [0]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model_quant = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant_io.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)

In this example, the resulting model size remains the same because all operations successfully quantized to begin with. However, this new model now uses quantized input and output, making it compatible with more accelerators, such as the Coral Edge TPU.

In the following sections, notice that we are now handling two TensorFlow Lite models: tflite_model_file is the converted model that still uses floating-point parameters, and tflite_model_quant_file is the same model converted with full integer quantization, including uint8 input and output.

Run the TensorFlow Lite models

Run the TensorFlow Lite model using the Python TensorFlow Lite Interpreter.

Load the test data

First, let's load the MNIST test data to feed to the model. Because the quantized model expects uint8 input data, we need to create a separate dataset for that model:


In [0]:
import numpy as np
_, mnist_test = tf.keras.datasets.mnist.load_data()
labels = mnist_test[1]

# Load data for float model
images = tf.cast(mnist_test[0], tf.float32)/255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)

# Load data for quantized model
images_uint8 = tf.cast(mnist_test[0], tf.uint8)
mnist_ds_uint8 = tf.data.Dataset.from_tensor_slices((images_uint8, labels)).batch(1)
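
The uint8 dataset above feeds raw pixel values rather than normalized floats. The sketch below shows why that lines up, under the assumption that the converter gave the input tensor a scale of about 1/255 and a zero point of 0 (the actual values can be read from `interpreter_quant.get_input_details()[0]["quantization"]` once the interpreter is loaded). The `quantize_pixel` helper is hypothetical.

```python
# Sketch: with scale ~= 1/255 and zero_point = 0 (assumed values), quantizing a
# normalized pixel gives back the original 0-255 pixel value.

def quantize_pixel(x, scale, zero_point):
    """Quantize a normalized float pixel to the uint8 range [0, 255]."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(255, q))


scale, zero_point = 1.0 / 255.0, 0  # assumed input quantization parameters
for raw in [0, 1, 128, 255]:
    normalized = raw / 255.0  # what the float model receives
    print(raw, normalized, quantize_pixel(normalized, scale, zero_point))
```

If the input tensor reports different quantization parameters, casting raw pixels directly would not be equivalent, and the inputs would need to be quantized with the reported scale and zero point instead.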

Load the model into the interpreters


In [0]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()

In [0]:
interpreter_quant = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter_quant.allocate_tensors()

Test the models on one image

First, test the float model:


In [0]:
for img, label in mnist_ds:
  break

interpreter.set_tensor(interpreter.get_input_details()[0]["index"], img)
interpreter.invoke()
predictions = interpreter.get_tensor(
    interpreter.get_output_details()[0]["index"])

In [0]:
import matplotlib.pylab as plt

plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)

Now test the quantized model (using the uint8 data):


In [0]:
for img, label in mnist_ds_uint8:
  break

interpreter_quant.set_tensor(
    interpreter_quant.get_input_details()[0]["index"], img)
interpreter_quant.invoke()
predictions = interpreter_quant.get_tensor(
    interpreter_quant.get_output_details()[0]["index"])

In [0]:
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)

Evaluate the models


In [0]:
def eval_model(interpreter, mnist_ds):
  total_seen = 0
  num_correct = 0

  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    if predictions == label.numpy():
      num_correct += 1

    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))

  return float(num_correct) / float(total_seen)

In [0]:
# Create smaller dataset for demonstration purposes
mnist_ds_demo = mnist_ds.take(2000)

print(eval_model(interpreter, mnist_ds_demo))

Repeat the evaluation on the fully quantized model using the uint8 data:


In [0]:
# NOTE: Colab runs on server CPUs, and TensorFlow Lite currently
# doesn't have highly optimized server CPU kernels, so this part may be
# slower than the float interpreter above. On mobile CPUs, however, a
# considerable speedup can be observed.
mnist_ds_demo_uint8 = mnist_ds_uint8.take(2000)

print(eval_model(interpreter_quant, mnist_ds_demo_uint8))

In this example, you have fully quantized a model with almost no difference in accuracy compared to the float model above.