In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
TensorFlow Lite now supports converting all model values (weights and activations) to 8-bit integers when converting from TensorFlow to TensorFlow Lite's flat buffer format. This results in a 4x reduction in model size and a 3 to 4x improvement in CPU inference latency. In addition, the fully quantized model can be consumed by integer-only hardware accelerators.
In contrast to post-training "on-the-fly" quantization, which stores only the weights as 8-bit integers, this technique statically quantizes all weights and activations during model conversion.
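Schematically, the two modes differ only in how the converter is configured. In this sketch, `converter` and `representative_data_gen` stand in for the objects built later in this tutorial:

```python
# Weight-only ("on-the-fly") quantization: weights are stored as 8-bit
# integers, but activations are still computed in float at runtime.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization: additionally supply calibration data so the
# converter can fix the dynamic range of every activation at convert time.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
```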
In this tutorial, you'll train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the saved model into a TensorFlow Lite flatbuffer with full quantization. Finally, you'll check the accuracy of the converted model and compare it to the original float model.
The training script, mnist.py, is available from the TensorFlow official MNIST tutorial.
In [0]:
! pip uninstall -y tensorflow
! pip install -U tf-nightly
In [0]:
import tensorflow as tf
tf.enable_eager_execution()
In [0]:
! git clone --depth 1 https://github.com/tensorflow/models
In [0]:
import sys
import os
if sys.version_info.major >= 3:
  import pathlib
else:
  import pathlib2 as pathlib
# Add `models` to the python path.
models_path = os.path.join(os.getcwd(), "models")
sys.path.append(models_path)
In [0]:
saved_models_root = "/tmp/mnist_saved_model"
In [0]:
# The above path addition is not visible to subprocesses, so add the path for the subprocess as well.
# Note: channels_last is required here or the conversion may fail.
!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last
This training won't take long because you're training the model for just a single epoch, which reaches about 96% accuracy.
Using the Python TFLiteConverter, you can now convert the trained model into a TensorFlow Lite model.
The trained model is saved in a timestamped subdirectory of saved_models_root, so select the most recent one:
In [0]:
saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob("*"))[-1])
saved_model_dir
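The lexicographic sort above happens to work because current Unix timestamps all have the same number of digits; sorting numerically is more robust. A small self-contained sketch (with made-up timestamp names, not the actual export dirs):

```python
import pathlib
import tempfile

# Create a throwaway root with hypothetical timestamped export dirs,
# mimicking what the MNIST exporter writes under saved_models_root.
root = pathlib.Path(tempfile.mkdtemp())
for ts in ("1549000000", "1548999999", "1549099999"):
    (root / ts).mkdir()

# Pick the newest export numerically, so digit count never matters.
latest = max(root.glob("*"), key=lambda p: int(p.name))
print(latest.name)  # 1549099999
```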
Now load the model using the TFLiteConverter:
In [0]:
import tensorflow as tf
tf.enable_eager_execution()
tf.logging.set_verbosity(tf.logging.DEBUG)
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()
Write it out to a .tflite file:
In [0]:
tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
In [0]:
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)
In [0]:
tf.logging.set_verbosity(tf.logging.INFO)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
Now, in order to quantize the activations with an accurate dynamic range, you need to provide a representative dataset:
In [0]:
mnist_train, _ = tf.keras.datasets.mnist.load_data()
images = tf.cast(mnist_train[0], tf.float32)/255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images)).batch(1)
def representative_data_gen():
  for input_value in mnist_ds.take(100):
    yield [input_value]
converter.representative_dataset = representative_data_gen
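Under the hood, each calibrated tensor gets a per-tensor scale and zero point, and float values are mapped to uint8 by an affine transform. A minimal numpy sketch of that idea (illustrative only; not TensorFlow Lite's exact rounding rules):

```python
import numpy as np

def quantize_uint8(x):
    # Affine quantization: map the observed range [min, max] onto [0, 255].
    x_min = min(float(x.min()), 0.0)  # keep zero exactly representable
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Invert the affine map; the result differs from x by at most ~scale.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, scale, zp = quantize_uint8(x)
x_hat = dequantize(q, scale, zp)
```

The reconstruction error of each value is bounded by the scale, which is why an accurate dynamic range from calibration matters: an overly wide range inflates the scale and loses precision.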
Finally, convert the model to TensorFlow Lite format:
In [0]:
tflite_model_quant = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)
Note how the resulting file is approximately 1/4 the size:
In [0]:
!ls -lh {tflite_models_dir}
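The roughly 4x ratio falls out of the storage types: a float32 weight takes 4 bytes, an int8 weight takes 1, and the weights dominate the file. A quick back-of-the-envelope check with a hypothetical parameter count:

```python
num_weights = 3_000_000           # hypothetical parameter count
float_size = num_weights * 4      # float32: 4 bytes per weight
int8_size = num_weights * 1       # int8: 1 byte per weight
ratio = float_size / int8_size
print(ratio)  # 4.0
```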
Your model should now be fully quantized. However, if you convert a model that includes any operations that TensorFlow Lite cannot quantize, those ops are left in floating point. This allows the conversion to complete, so you still get a smaller and more efficient model, but the model won't be compatible with ML accelerators that require full integer quantization. Also, by default, the converted model still uses float inputs and outputs, which are likewise incompatible with some accelerators.
So to ensure that the converted model is fully quantized (make the converter throw an error if it encounters an operation it cannot quantize), and to use integers for the model's input and output, you need to convert the model again using these additional configurations:
In [0]:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model_quant = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant_io.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)
In this example, the resulting model size remains the same because all operations were successfully quantized to begin with. However, this new model uses quantized input and output, making it compatible with more accelerators, such as the Coral Edge TPU.
In the following sections, notice that we are now handling two TensorFlow Lite models: tflite_model_file is the converted model that still uses floating-point parameters, and tflite_model_quant_file is the same model converted with full integer quantization, including uint8 input and output.
In [0]:
import numpy as np
_, mnist_test = tf.keras.datasets.mnist.load_data()
labels = mnist_test[1]
# Load data for float model
images = tf.cast(mnist_test[0], tf.float32)/255.0
mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)
# Load data for quantized model
images_uint8 = tf.cast(mnist_test[0], tf.uint8)
mnist_ds_uint8 = tf.data.Dataset.from_tensor_slices((images_uint8, labels)).batch(1)
In [0]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
In [0]:
interpreter_quant = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter_quant.allocate_tensors()
In [0]:
for img, label in mnist_ds:
  break

interpreter.set_tensor(interpreter.get_input_details()[0]["index"], img)
interpreter.invoke()
predictions = interpreter.get_tensor(
    interpreter.get_output_details()[0]["index"])
In [0]:
import matplotlib.pylab as plt
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)
Now test the quantized model (using the uint8 data):
In [0]:
for img, label in mnist_ds_uint8:
  break

interpreter_quant.set_tensor(
    interpreter_quant.get_input_details()[0]["index"], img)
interpreter_quant.invoke()
predictions = interpreter_quant.get_tensor(
    interpreter_quant.get_output_details()[0]["index"])
In [0]:
plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true=str(label[0].numpy()),
                              predict=str(predictions[0])))
plt.grid(False)
In [0]:
def eval_model(interpreter, mnist_ds):
  total_seen = 0
  num_correct = 0
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]
  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    if predictions == label.numpy():
      num_correct += 1
    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))
  return float(num_correct) / float(total_seen)
In [0]:
# Create smaller dataset for demonstration purposes
mnist_ds_demo = mnist_ds.take(2000)
print(eval_model(interpreter, mnist_ds_demo))
Repeat the evaluation on the fully quantized model using the uint8 data:
In [0]:
# NOTE: Colab runs on server CPUs, and TensorFlow Lite currently doesn't
# have highly optimized server CPU kernels, so this part may run slower
# than the float interpreter above. On mobile CPUs, however, you can see
# a considerable speedup.
mnist_ds_demo_uint8 = mnist_ds_uint8.take(2000)
print(eval_model(interpreter_quant, mnist_ds_demo_uint8))
In this example, you have fully quantized a model with almost no loss in accuracy compared to the float model above.