In [ ]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Text classification with TensorFlow Lite Model Maker

The TensorFlow Lite Model Maker library simplifies the process of adapting and converting a TensorFlow neural-network model to particular input data when deploying this model for on-device ML applications.

This notebook shows an end-to-end example that uses the Model Maker library to adapt and convert a commonly used text classification model to classify movie reviews on a mobile device.

Prerequisites

To run this example, we first need to install several required packages, including the Model Maker package from the GitHub repo.


In [ ]:
!pip install git+https://github.com/tensorflow/examples.git#egg=tensorflow-examples[model_maker]

Import the required packages.


In [ ]:
import numpy as np
import os

import tensorflow as tf
assert tf.__version__.startswith('2')

from tensorflow_examples.lite.model_maker.core.data_util.text_dataloader import TextClassifierDataLoader
from tensorflow_examples.lite.model_maker.core.task.model_spec import AverageWordVecModelSpec
from tensorflow_examples.lite.model_maker.core.task.model_spec import BertClassifierModelSpec
from tensorflow_examples.lite.model_maker.core.task import text_classifier

Simple End-to-End Example

Get the data path

Let's get some texts to play with this simple end-to-end example.


In [ ]:
data_path = tf.keras.utils.get_file(
      fname='aclImdb',
      origin='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
      untar=True)

You could replace it with your own text folders. To upload data to Colab, use the upload button in the left sidebar: upload a zip file and unzip it. The root file path is the current path.
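
As a minimal sketch of the unzip step, the standard-library zipfile module could be used. The archive name `my_texts.zip` and the helper `unzip_dataset` are hypothetical, and the archive is assumed to contain one subfolder per class.

```python
import os
import zipfile


def unzip_dataset(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted top-level entries."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))


# Hypothetical usage after uploading `my_texts.zip` to the current directory:
# print(unzip_dataset('my_texts.zip', 'my_texts'))
```

The returned names are the class subfolders that could then be passed to the data loader.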

If you prefer not to upload your data to the cloud, you could run the library locally by following the guide on GitHub.

Run the example

The example consists of just 6 lines of code, shown below, representing 5 steps of the overall process.

Step 0. Choose a model_spec that represents a model for the text classifier.


In [ ]:
model_spec = AverageWordVecModelSpec()

Step 1. Load train and test data specific to an on-device ML app and preprocess the data according to the chosen model_spec.


In [ ]:
train_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'train'), model_spec=model_spec, class_labels=['pos', 'neg'])
test_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'test'), model_spec=model_spec, is_training=False, shuffle=False)

Step 2. Customize the TensorFlow model.


In [ ]:
model = text_classifier.create(train_data, model_spec=model_spec)

Step 3. Evaluate the model.


In [ ]:
loss, acc = model.evaluate(test_data)

Step 4. Export to a TensorFlow Lite model. You could download it from the left sidebar, the same place used for uploading.


In [ ]:
model.export(export_dir='.')

After these 5 simple steps, we could further use the TensorFlow Lite model file and label file in on-device applications such as the text classification reference app.

Detailed Process

In the above, we tried the simple end-to-end example. The following walks through the example step by step to show more detail.

Step 0: Choose a model_spec that represents a model for the text classifier.

Each model_spec object represents a specific model for the text classifier. Currently, we support an averaging word embedding model and a BERT-base model.


In [ ]:
model_spec = AverageWordVecModelSpec()

Step 1: Load Input Data Specific to an On-device ML App

The IMDB dataset contains 25000 movie reviews for training and 25000 movie reviews for testing from the Internet Movie Database. The dataset has two classes: positive and negative movie reviews.

Download the archive version of the dataset and untar it.

The IMDB dataset has the following directory structure:

aclImdb
|__ train
    |______ pos: [1962_10.txt, 2499_10.txt, ...]
    |______ neg: [104_3.txt, 109_2.txt, ...]
    |______ unsup: [12099_0.txt, 1424_0.txt, ...]
|__ test
    |______ pos: [1384_9.txt, 191_9.txt, ...]
    |______ neg: [1629_1.txt, 21_1.txt, ...]

Note that the text data under the train/unsup folder are unlabeled documents for unsupervised learning; such data should be ignored in this tutorial.


In [ ]:
data_path = tf.keras.utils.get_file(
      fname='aclImdb',
      origin='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
      untar=True)

Use TextClassifierDataLoader to load data.

The from_folder() method loads data from a folder. It assumes that the text data of the same class are in the same subdirectory and that the subfolder name is the class name. Each text file contains one movie review sample.

The class_labels parameter specifies which subfolders should be considered. For the train folder, this parameter is used to skip the unsup subfolder.
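
To make the folder convention concrete, here is a minimal sketch of a loader that follows it. This is not the library's implementation: `list_labeled_texts` is a hypothetical helper, and assigning label indices in sorted folder-name order is an assumption.

```python
import os


def list_labeled_texts(data_root, class_labels=None):
    """Return (label_names, samples); samples is a list of (text, label_index)."""
    label_names = sorted(
        name for name in os.listdir(data_root)
        if os.path.isdir(os.path.join(data_root, name)))
    if class_labels is not None:
        # Keep only the requested subfolders, e.g. drop `unsup`.
        label_names = [name for name in label_names if name in class_labels]

    samples = []
    for label_index, name in enumerate(label_names):
        class_dir = os.path.join(data_root, name)
        # Each file in a class subfolder holds one sample of that class.
        for filename in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, filename), encoding='utf-8') as f:
                samples.append((f.read(), label_index))
    return label_names, samples
```

Calling it with class_labels=['pos', 'neg'] on the train folder would skip every file under unsup.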


In [ ]:
train_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'train'), model_spec=model_spec, class_labels=['pos', 'neg'])
test_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'test'), model_spec=model_spec, is_training=False, shuffle=False)
train_data, validation_data = train_data.split(0.9)

Step 2: Customize the TensorFlow Model

Create a custom text classifier model based on the loaded data. Currently, we support an averaging word embedding model and a BERT-base model.


In [ ]:
model = text_classifier.create(train_data, model_spec=model_spec, validation_data=validation_data)

Have a look at the detailed model structure.


In [ ]:
model.summary()

Step 3: Evaluate the Customized Model

Evaluate the model to get its loss and accuracy on test_data. If no data is given, the results are evaluated on the data that was split off in the create method.
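
As a rough illustration of what these two numbers mean, the sketch below computes them from per-class probabilities. It assumes the loss is the mean cross-entropy of the true class, which this notebook does not state. The helper name and the sample values are made up.

```python
import math


def evaluate_predictions(probabilities, labels):
    """Return (mean cross-entropy loss, accuracy) for per-class probabilities."""
    total_loss = 0.0
    correct = 0
    for probs, label in zip(probabilities, labels):
        # Cross-entropy of one sample: negative log-probability of the true class.
        total_loss += -math.log(probs[label])
        # The predicted class is the one with the highest probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == label)
    count = len(labels)
    return total_loss / count, correct / count


example_loss, example_acc = evaluate_predictions(
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],  # made-up model outputs
    [0, 1, 1])                             # made-up true labels
print('loss = %.4f, accuracy = %.4f' % (example_loss, example_acc))
```

Here two of the three samples are classified correctly, and the misclassified one contributes most of the loss.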


In [ ]:
loss, acc = model.evaluate(test_data)

Step 4: Export to TensorFlow Lite Model

Convert the existing model to the TensorFlow Lite model format so that it can later be used in an on-device ML application. Meanwhile, save the text labels in a label file and the vocabulary in a vocab file. The default TFLite filename is model.tflite, the default label filename is labels.txt (the name read by the code further down), and the default vocab filename is vocab.


In [ ]:
model.export(export_dir='.')

The TensorFlow Lite model file and label file could be used in the text classification reference app.

In detail, we could add movie_review_classifier.tflite, text_label.txt and vocab.txt to the assets directory, and change the filenames in the code accordingly.

Here, we also demonstrate how to use the exported files to run and evaluate the TensorFlow Lite model.


In [ ]:
# Read TensorFlow Lite model from TensorFlow Lite file.
with tf.io.gfile.GFile('model.tflite', 'rb') as f:
  model_content = f.read()

# Read label names from label file.
with tf.io.gfile.GFile('labels.txt', 'r') as f:
  label_names = f.read().split('\n')

# Initialize TensorFlow Lite interpreter.
interpreter = tf.lite.Interpreter(model_content=model_content)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]['index']
output = interpreter.tensor(interpreter.get_output_details()[0]["index"])

# Run predictions on each test data and calculate accuracy.
accurate_count = 0
for text, label in test_data.dataset:
    # Add batch dimension to match the model's input data format.
    text = tf.expand_dims(text, 0)

    # Run inference.
    interpreter.set_tensor(input_index, text)
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the label with highest
    # probability.
    predict_label = np.argmax(output()[0])
    # Get label name with label index.
    predict_label_name = label_names[predict_label]
    accurate_count += (predict_label == label.numpy())

accuracy = accurate_count * 1.0 / test_data.size
print('TensorFlow Lite model accuracy = %.4f' % accuracy)

Note that preprocessing for inference should be the same as for training. Currently, preprocessing consists of splitting the text into tokens on '\W', encoding the tokens to ids, then padding the text with pad_id to the length seq_len.
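
The preprocessing just described could be sketched as follows. This is not the library's code: the helper name, the lowercasing, the truncation of long texts, and the ids chosen for padding and unknown tokens are all assumptions.

```python
import re


def preprocess(text, vocab, seq_len, pad_id=0, unknown_id=1):
    """Tokenize text, map tokens to ids, then pad or truncate to seq_len."""
    # Split on runs of non-word characters and drop empty strings.
    tokens = [token for token in re.split(r'\W+', text.lower()) if token]
    # Tokens missing from the vocabulary map to unknown_id (an assumption).
    ids = [vocab.get(token, unknown_id) for token in tokens]
    # Truncate long texts, then pad short ones with pad_id.
    ids = ids[:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))


toy_vocab = {'great': 2, 'movie': 3}  # made-up vocabulary
print(preprocess('Great movie, truly!', toy_vocab, seq_len=5))
```

With this toy vocabulary, "truly" is out of vocabulary and the remaining two positions are padding.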

Advanced Usage

The create function is the critical part of this library. Its model_spec parameter defines the specification of the model; currently AverageWordVecModelSpec and BertClassifierModelSpec are supported. The create function contains the following steps for AverageWordVecModelSpec:

  1. Tokenize the text and select the top num_words most frequent words to generate the vocabulary. The default value of num_words in the AverageWordVecModelSpec object is 10000.
  2. Encode the text string tokens to int ids.
  3. Create the text classifier model. Currently, this library supports one model: average the word embeddings of the text with ReLU activation, then use a softmax dense layer for classification. For the Embedding layer, the input dimension is the size of the vocabulary, the output dimension is the AverageWordVecModelSpec object's variable wordvec_dim, whose default value is 16, and the input length is the AverageWordVecModelSpec object's variable seq_len, whose default value is 256.
  4. Train the classifier model. The default number of epochs is 2 and the default batch size is 32.
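
Step 1 above, building the vocabulary from the most frequent words, could be sketched with collections.Counter. The helper name, the reserved `<PAD>` and `<UNKNOWN>` entries, and the choice not to count them toward num_words are assumptions, not the library's documented behavior.

```python
import re
from collections import Counter


def build_vocab(texts, num_words, reserved=('<PAD>', '<UNKNOWN>')):
    """Map the num_words most frequent tokens to ids placed after reserved ids."""
    counter = Counter()
    for text in texts:
        counter.update(t for t in re.split(r'\W+', text.lower()) if t)
    vocab = {token: index for index, token in enumerate(reserved)}
    # most_common returns tokens ordered from most to least frequent.
    for token, _ in counter.most_common(num_words):
        vocab[token] = len(vocab)
    return vocab


print(build_vocab(['good good good movie movie', 'bad'], num_words=2))
```

Only "good" and "movie" receive their own ids here; a rarer word such as "bad" would later be encoded as unknown.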

In this section, we describe several advanced topics, including adjusting the model and changing the training hyperparameters.

Adjust the model

We could adjust the model structure through variables such as wordvec_dim and seq_len in the AverageWordVecModelSpec class.

  • wordvec_dim: Dimension of word embedding.
  • seq_len: Length of the sequence.
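
To show where these two variables enter the computation, here is a plain-Python sketch of an averaged-embedding forward pass with random weights. It is an illustration only: the hidden layer of size wordvec_dim, the omitted biases, and the names `make_model`, `matvec` and `predict` are assumptions rather than the library's actual architecture.

```python
import math
import random


def make_model(vocab_size, wordvec_dim, num_classes, seed=0):
    """Create random weights; wordvec_dim sets the width of every layer."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        'embedding': matrix(vocab_size, wordvec_dim),
        'hidden': matrix(wordvec_dim, wordvec_dim),
        'output': matrix(wordvec_dim, num_classes),
    }


def matvec(vector, weights):
    """Multiply a row vector by a weight matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, weights))
            for j in range(len(weights[0]))]


def predict(model, token_ids):
    # Look up one wordvec_dim-sized vector per id, then average over seq_len.
    vectors = [model['embedding'][i] for i in token_ids]
    averaged = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Dense layer with ReLU activation.
    hidden = [max(0.0, x) for x in matvec(averaged, model['hidden'])]
    # Dense layer with softmax activation yields class probabilities.
    logits = matvec(hidden, model['output'])
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return [e / sum(exps) for e in exps]


model_sketch = make_model(vocab_size=10, wordvec_dim=4, num_classes=2)
probabilities = predict(model_sketch, [2, 3, 1, 0, 0, 0])  # seq_len = 6
print(probabilities)
```

In this sketch a larger wordvec_dim widens every weight matrix, while seq_len only changes how many embedding vectors are averaged.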

For example, we could train with a larger wordvec_dim. If we change the model, we first need to construct the new model_spec.


In [ ]:
new_model_spec = AverageWordVecModelSpec(wordvec_dim=32)

Second, we should preprocess the data according to the new model_spec.


In [ ]:
new_train_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'train'), model_spec=new_model_spec, class_labels=['pos', 'neg'])
new_train_data, new_validation_data = new_train_data.split(0.9)

Finally, we could train the new model.


In [ ]:
model = text_classifier.create(new_train_data, model_spec=new_model_spec, validation_data=new_validation_data)

Change the training hyperparameters

We could also change training hyperparameters such as epochs and batch_size, which could affect the model accuracy. For instance,

  • epochs: more epochs could achieve better accuracy, but may lead to overfitting.
  • batch_size: number of samples to use in one training step.
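
A quick calculation shows how these two hyperparameters determine the amount of training work. The helper name is hypothetical, and counting a final partial batch as one step is an assumption; a pipeline that drops the remainder would run one step fewer per epoch.

```python
import math


def training_steps(num_samples, batch_size, epochs):
    """Return (steps per epoch, total steps), counting a final partial batch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 90% of the 25000 IMDB training reviews, matching the 0.9 split used earlier.
num_train = 25000 * 9 // 10
print(training_steps(num_train, batch_size=32, epochs=2))  # (704, 1408)
print(training_steps(num_train, batch_size=32, epochs=5))  # (704, 3520)
```

Going from 2 to 5 epochs multiplies the total number of steps by 2.5 without changing the steps per epoch.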

For example, we could train with more epochs.


In [ ]:
model = text_classifier.create(train_data, model_spec=model_spec, validation_data=validation_data, epochs=5)

Evaluate the newly retrained model with 5 training epochs.


In [ ]:
loss, accuracy = model.evaluate(test_data)

Change the Model

We could change the model by changing the model_spec. The following shows how to change to the BERT-base model.

First, we could change model_spec to BertClassifierModelSpec.


In [ ]:
model_spec = BertClassifierModelSpec()

The remaining steps remain the same.

Load data and preprocess the data according to model_spec.


In [ ]:
train_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'train'), model_spec=model_spec, class_labels=['pos', 'neg'])
test_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'test'), model_spec=model_spec, is_training=False, shuffle=False)

Then retrain the model. Note that it could take a long time to retrain the BERT model. We just set epochs to 1 to demonstrate it.


In [ ]:
model = text_classifier.create(train_data, model_spec=model_spec, epochs=1)