TensorFlow Lite now supports converting an entire model (weights and activations) to 8-bit integers during conversion from TensorFlow to TensorFlow Lite's FlatBuffer format. Because 32-bit float weights and activations become 8-bit integers, this yields roughly a 4x reduction in model size and a 3 to 4x improvement in CPU inference speed. In addition, the fully quantized model can be consumed by integer-only hardware accelerators.
In contrast to post-training "on-the-fly" quantization, which only stores weights as 8-bit integers, this technique statically quantizes all weights and activations during model conversion.
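To make the mapping concrete: TensorFlow Lite's 8-bit scheme represents each real value as scale * (quantized_value - zero_point), where the scale and zero point are chosen from the observed range of each tensor. The cell below is a purely illustrative NumPy sketch of that affine mapping (not the converter's internal code); the range, scale, and zero point are made-up example values.
In [0]:
import numpy as np

def quantize(x, scale, zero_point):
  # Affine mapping real -> int8: q = round(x / scale) + zero_point, clamped to the int8 range.
  q = np.round(x / scale) + zero_point
  return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
  # Inverse mapping int8 -> real: x ~= scale * (q - zero_point).
  return scale * (q.astype(np.float32) - zero_point)

# Toy example: a tensor whose observed values lie in [0, 6] (e.g. a ReLU6 output).
x = np.array([0.0, 0.5, 3.0, 6.0], dtype=np.float32)
scale = 6.0 / 255.0   # real range divided by the number of 8-bit steps
zero_point = -128     # chosen so that real 0.0 is exactly representable
q = quantize(x, scale, zero_point)
print(q)
print(dequantize(q, scale, zero_point))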
In this tutorial, we train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the saved model into a TensorFlow Lite FlatBuffer with full quantization. Finally, we check the accuracy of the converted model and compare it to the original saved model. The training script mnist.py comes from the official TensorFlow MNIST tutorial.
In [0]:
! pip uninstall -y tensorflow
! pip install -U tf-nightly
In [0]:
import tensorflow as tf
tf.enable_eager_execution()
In [0]:
! git clone --depth 1 https://github.com/tensorflow/models
In [0]:
import sys
import os
if sys.version_info.major >= 3:
  import pathlib
else:
  import pathlib2 as pathlib
# Add `models` to the python path.
models_path = os.path.join(os.getcwd(), "models")
sys.path.append(models_path)
In [0]:
saved_models_root = "/tmp/mnist_saved_model"
In [0]:
# The above path addition is not visible to subprocesses, so add the path for the subprocess as well.
# Note: channels_last is required here or the conversion may fail.
!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last
For this example, we train the model for only a single epoch, so it only reaches ~96% accuracy.
In [0]:
saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob("*"))[-1])
saved_model_dir
Using the Python TFLiteConverter, the saved model can be converted into a TensorFlow Lite model.
First load the model using the TFLiteConverter:
In [0]:
import tensorflow as tf
tf.enable_eager_execution()
tf.logging.set_verbosity(tf.logging.DEBUG)
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()
Write it out to a .tflite file:
In [0]:
tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
In [0]:
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)
To instead quantize the model on export, first set the optimizations flag to optimize for size:
In [0]:
tf.logging.set_verbosity(tf.logging.INFO)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
Now, construct and provide a representative dataset; the converter uses it to estimate the dynamic range of the activations so it can pick quantization parameters.
In [0]:
mnist_train, _ = tf.keras.datasets.mnist.load_data()
images = tf.cast(mnist_train[0], tf.float32)/255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images)).batch(1)
def representative_data_gen():
  for input_value in mnist_ds.take(100):
    yield [input_value]
converter.representative_dataset = representative_data_gen
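As a quick sanity check (a small illustrative sketch, not part of the conversion itself), you can peek at what the generator yields; each element is a list containing one batched input tensor:
In [0]:
# Each yielded element is a list with a single tensor of shape (1, 28, 28).
sample = next(iter(representative_data_gen()))
print(len(sample), sample[0].shape, sample[0].dtype)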
Finally, convert the model as usual. Note that by default the converted model still uses float inputs and outputs for invocation convenience.
In [0]:
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)
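As noted above, the quantized model still uses float inputs and outputs. Here is a minimal sketch to confirm this (interpreter_check is just a throwaway name): load the quantized model and print the input and output dtypes.
In [0]:
interpreter_check = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter_check.allocate_tensors()
# The interface stays float32 even though weights and activations are
# stored and computed as 8-bit integers internally.
print(interpreter_check.get_input_details()[0]["dtype"])
print(interpreter_check.get_output_details()[0]["dtype"])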
Note how the resulting file is approximately 1/4 the size.
In [0]:
!ls -lh {tflite_models_dir}
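Equivalently, here is a small sketch that compares the two file sizes in Python using the pathlib objects defined above:
In [0]:
# Compare on-disk sizes of the float and quantized models.
float_size = tflite_model_file.stat().st_size
quant_size = tflite_model_quant_file.stat().st_size
print("float: %d bytes, quantized: %d bytes, ratio: %.2fx" %
      (float_size, quant_size, float(float_size) / quant_size))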
In [0]:
import numpy as np
_, mnist_test = tf.keras.datasets.mnist.load_data()
images, labels = tf.cast(mnist_test[0], tf.float32)/255.0, mnist_test[1]
mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)
In [0]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
In [0]:
interpreter_quant = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter_quant.allocate_tensors()
In [0]:
for img, label in mnist_ds:
  break

interpreter.set_tensor(interpreter.get_input_details()[0]["index"], img)
interpreter.invoke()
predictions = interpreter.get_tensor(
    interpreter.get_output_details()[0]["index"])
In [0]:
import matplotlib.pylab as plt
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)
In [0]:
interpreter_quant.set_tensor(
    interpreter_quant.get_input_details()[0]["index"], img)
interpreter_quant.invoke()
predictions = interpreter_quant.get_tensor(
    interpreter_quant.get_output_details()[0]["index"])
In [0]:
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)
In [0]:
def eval_model(interpreter, mnist_ds):
  total_seen = 0
  num_correct = 0
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]
  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    if predictions == label.numpy():
      num_correct += 1
    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))
  return float(num_correct) / float(total_seen)
In [0]:
print(eval_model(interpreter, mnist_ds))
We can repeat the evaluation on the fully quantized model:
In [0]:
# NOTE: Colab runs on server CPUs. At the time of writing, TensorFlow Lite
# doesn't have highly optimized server CPU kernels, so this may be slower than
# the float interpreter above. On mobile CPUs, however, a considerable speedup
# can be observed.
print(eval_model(interpreter_quant, mnist_ds))
In this example, we have fully quantized the model with no difference in accuracy.