Joint Intent Classification and Slot Filling with Transformers

The goal of this notebook is to fine-tune a pretrained transformer-based neural network model to convert a user query expressed in English into a representation that is structured enough to be processed by an automated service.

Here is an example of interpretation computed by such a Natural Language Understanding system:

>>> nlu("Book a table for two at Le Ritz for Friday night",
        tokenizer, joint_model, intent_names, slot_names)
{
    'intent': 'BookRestaurant',
    'slots': {
        'party_size_number': 'two',
        'restaurant_name': 'Le Ritz',
        'timeRange': 'Friday night'
    }
}

Intent classification is a simple sequence classification problem. The trick is to treat the structured knowledge extraction part ("Slot Filling") as a token-level classification problem using BIO annotations:

>>> show_predictions("Book a table for two at Le Ritz for Friday night!",
...                  tokenizer, joint_model, intent_names, slot_names)
## Intent: BookRestaurant
## Slots:
      Book : O
         a : O
     table : O
       for : O
       two : B-party_size_number
        at : O
        Le : B-restaurant_name
         R : I-restaurant_name
     ##itz : I-restaurant_name
       for : O
    Friday : B-timeRange
     night : I-timeRange
         ! : O

We will show how to train such a joint "sequence classification" and "token classification" model on a voice command dataset published by snips.ai.

This notebook is a partial reproduction of some of the results presented in this paper:

"BERT for Joint Intent Classification and Slot Filling", Qian Chen, Zhu Zhuo, Wen Wang (2019)

https://arxiv.org/abs/1902.10909


In [ ]:
%tensorflow_version 2.x

In [ ]:
!nvidia-smi

In [ ]:
%pip install -q transformers

The Data

We will use a speech command dataset collected, annotated and published by the French startup SNIPS.ai (acquired in 2019 by the audio device manufacturer Sonos).

The original dataset comes in YAML format with inline markdown annotations.

Instead we will use a preprocessed variant with token-level B-I-O annotations that is closer to the representation our model will predict. This variant of the SNIPS dataset was prepared by Su Zhu.


In [ ]:
from urllib.request import urlretrieve
from pathlib import Path


SNIPS_DATA_BASE_URL = (
    "https://github.com/ogrisel/slot_filling_and_intent_detection_of_SLU/blob/"
    "master/data/snips/"
)
for filename in ["train", "valid", "test", "vocab.intent", "vocab.slot"]:
    path = Path(filename)
    if not path.exists():
        print(f"Downloading {filename}...")
        urlretrieve(SNIPS_DATA_BASE_URL + filename + "?raw=true", path)

Let's have a look at the first lines from the training set:


In [ ]:
lines_train = Path("train").read_text("utf-8").strip().splitlines()
lines_train[:5]

Some remarks:

  • The class label for the voice command appears at the end of each line (after the "<=>" marker).
  • Each word-level token is annotated with B-I-O labels using the ":" separator.
  • B/I/O stand for "Beginning" / "Inside" / "Outside"
  • "Add:O" means that the token "Add" is "Outside" of any annotation span
  • "Don:B-entity_name" means that "Don" is the "Beginning" of an annotation of type "entity-name".
  • "and:I-entity_name" means that "and" is "Inside" the previously started annotation of type "entity-name".

Let's write a parsing function and test it on the first line:


In [ ]:
def parse_line(line):
    utterance_data, intent_label = line.split(" <=> ")
    items = utterance_data.split()
    words = [item.rsplit(":", 1)[0]for item in items]
    word_labels = [item.rsplit(":", 1)[1]for item in items]
    return {
        "intent_label": intent_label,
        "words": " ".join(words),
        "word_labels": " ".join(word_labels),
        "length": len(words),
    }

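Note that we use rsplit(":", 1), which splits on the last colon only, so that a word which itself contains a colon (for instance a time of day) is kept intact. A quick illustration on a hypothetical token:


In [ ]:
# Hypothetical token where the word itself contains a colon: rsplit with
# maxsplit=1 keeps the word intact and only splits off the BIO label.
"3:30:B-timeRange".rsplit(":", 1)  # expected: ['3:30', 'B-timeRange']
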
In [ ]:
parse_line(lines_train[0])

This utterance is a voice command of type "AddToPlaylist" with two annotations:

  • an entity_name: "Don and Sherri",
  • a playlist: "Medidate to Sounds of Nature".

The goal of this project is to build a baseline Natural Language Understanding model to analyse such voice commands and predict:

  • the intent of the speaker: the sentence-level class label ("AddToPlaylist");
  • the interesting "slots" (typed named entities) in the sentence, by performing word-level classification using the B-I-O tags as target classes. This second task is often referred to as "NER" (Named Entity Recognition) in the Natural Language Processing literature. Alternatively it is known as "slot filling" when we expect a fixed set of named entities per sentence of a given class.

The lists of possible classes for the sentence-level and the word-level classification problems are given in the following files:


In [ ]:
print(Path("vocab.intent").read_text("utf-8"))

In [ ]:
print(Path("vocab.slot").read_text("utf-8"))

"POI" stands for "Point of Interest".

Let's parse all the lines and store the results in pandas DataFrames:


In [ ]:
import pandas as pd

parsed = [parse_line(line) for line in lines_train]

df_train = pd.DataFrame([p for p in parsed if p is not None])
df_train

In [ ]:
df_train.groupby("intent_label").count()

In [ ]:
df_train.hist("length", bins=30);

In [ ]:
lines_valid = Path("valid").read_text("utf-8").strip().splitlines()
lines_test = Path("test").read_text("utf-8").strip().splitlines()

df_valid = pd.DataFrame([parse_line(line) for line in lines_valid])
df_test = pd.DataFrame([parse_line(line) for line in lines_test])

A First Model: Intent Classification (Sentence Level)

Let's ignore the slot filling task for now and try to build a sentence-level classifier by fine-tuning a pre-trained Transformer-based model using the huggingface/transformers package, which provides both TF2/Keras and PyTorch APIs.

The BERT Tokenizer

First let's load a pre-trained tokenizer and try it on a sentence from the training set:


In [ ]:
from transformers import BertTokenizer

model_name = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)

In [ ]:
first_sentence = df_train.iloc[0]["words"]
first_sentence

In [ ]:
tokenizer.tokenize(first_sentence)

Notice that BERT uses subword tokens, so the length of the tokenized sentence is likely to be larger than the number of words in the sentence.
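
For instance, we can quickly compare the number of words with the number of BERT tokens for this sentence:


In [ ]:
# Compare the word count with the subword token count for the first sentence.
len(first_sentence.split()), len(tokenizer.tokenize(first_sentence))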

Question:

  • why is it particularly interesting to use subword tokenization for general purpose language models such as BERT?

Each token string is mapped to a unique integer id, which makes it fast to look up the right column in the token embedding of the input layer:


In [ ]:
tokenizer.encode(first_sentence)

In [ ]:
tokenizer.decode(tokenizer.encode(first_sentence))

Remarks:

  • The first token [CLS] is used by the pre-training task for sequence classification.
  • The last token [SEP] is a separator for the pre-training task that classifies whether a pair of sentences are consecutive in a corpus or not (next sentence prediction).
  • Here we want to use BERT to compute a representation of a single voice command at a time.
  • We could reuse the representation of the [CLS] token for sequence classification.
  • Alternatively we can pool the representations of all the tokens of the voice command (e.g. global average) and use that as the input of the final sequence classification layer (see the sketch below).
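
Here is a minimal sketch (not used by the models below) of such a masked global average pooling, assuming a sequence_output tensor of shape (batch_size, seq_len, output_dim) and a matching attention mask:


In [ ]:
import tensorflow as tf


def masked_average_pool(sequence_output, attention_mask):
    # sequence_output: (batch_size, seq_len, output_dim) token-wise features
    # attention_mask: (batch_size, seq_len), 1 for real tokens and 0 for padding
    mask = tf.cast(attention_mask, sequence_output.dtype)[:, :, None]
    summed = tf.reduce_sum(sequence_output * mask, axis=1)
    n_tokens = tf.reduce_sum(mask, axis=1)
    return summed / tf.maximum(n_tokens, 1.0)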

In [ ]:
import matplotlib.pyplot as plt

train_sequence_lengths = [len(tokenizer.encode(text))
                          for text in df_train["words"]]
plt.hist(train_sequence_lengths, bins=30)
plt.title(f"max sequence length: {max(train_sequence_lengths)}");

To perform transfer learning, we will need to work with padded sequences so that they all have the same size. The above histogram shows that after tokenization, 43 tokens are enough to represent all the voice commands in the training set.

The mapping can be introspected in the tokenizer.vocab attribute:


In [ ]:
tokenizer.vocab_size

In [ ]:
bert_vocab_items = list(tokenizer.vocab.items())
bert_vocab_items[:10]

In [ ]:
bert_vocab_items[100:110]

In [ ]:
bert_vocab_items[900:910]

In [ ]:
bert_vocab_items[1100:1110]

In [ ]:
bert_vocab_items[20000:20010]

In [ ]:
bert_vocab_items[-10:]

A couple of remarks:

  • 30K is a reasonable vocabulary size and is small enough to be used in a softmax output layer;
  • it can represent multi-lingual sentences, including non-Western alphabets;
  • subword tokenization makes it possible to deal with typos and morphological variations with a small vocabulary size and without any language-specific preprocessing;
  • subword tokenization makes it unlikely to need the [UNK] special token, as rare words can often be represented as a sequence of frequent enough short subwords in a meaningful way (see the quick check below).
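
For instance, we can check how the tokenizer decomposes a rare or misspelled word into more frequent subword pieces (output not shown here):


In [ ]:
# A misspelled word is typically decomposed into known subword pieces instead
# of being mapped to the "[UNK]" token.
tokenizer.tokenize("unbelievabel")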

Encoding the Dataset with the Tokenizer

Let's now encode the full train / valid / test sets with our tokenizer to get padded integer numpy arrays:


In [ ]:
import numpy as np


def encode_dataset(tokenizer, text_sequences, max_length):
    token_ids = np.zeros(shape=(len(text_sequences), max_length),
                         dtype=np.int32)
    for i, text_sequence in enumerate(text_sequences):
        encoded = tokenizer.encode(text_sequence)
        token_ids[i, 0:len(encoded)] = encoded
    attention_masks = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_masks": attention_masks}


encoded_train = encode_dataset(tokenizer, df_train["words"], 45)
encoded_train["input_ids"]

In [ ]:
encoded_train["attention_masks"]

In [ ]:
encoded_valid = encode_dataset(tokenizer, df_valid["words"], 45)
encoded_test = encode_dataset(tokenizer, df_test["words"], 45)

Encoding the Sequence Classification Targets

To encode the intent labels as integers, we build a simple mapping from the auxiliary files:


In [ ]:
intent_names = Path("vocab.intent").read_text("utf-8").split()
intent_map = dict((label, idx) for idx, label in enumerate(intent_names))
intent_map

In [ ]:
intent_train = df_train["intent_label"].map(intent_map).values
intent_train

In [ ]:
intent_valid = df_valid["intent_label"].map(intent_map).values
intent_test = df_test["intent_label"].map(intent_map).values

Loading and Feeding a Pretrained BERT model

Let's load a pretrained BERT model using the huggingface transformers package:


In [ ]:
from transformers import TFBertModel

base_bert_model = TFBertModel.from_pretrained("bert-base-cased")
base_bert_model.summary()

In [ ]:
encoded_valid

In [ ]:
outputs = base_bert_model(encoded_valid)
len(outputs)

The first output of the BERT model is a tensor with shape (batch_size, seq_len, output_dim), which contains features for each token in the input sequence:


In [ ]:
outputs[0].shape

The second output of the BERT model is a tensor with shape (batch_size, output_dim): it is the vector representation of the special token [CLS]. This vector is typically used as a pooled representation for the sequence as a whole, and it will be used as the features of our intent classifier:


In [ ]:
outputs[1].shape

Exercise

Use the following code template to build and train a sequence classification model to predict the intent class.

Use the self.bert pre-trained model in the call method and only consider the pooled features (ignore the token-wise features for now).


In [ ]:
import tensorflow as tf
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy


class IntentClassificationModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, model_name="bert-base-cased",
                 dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        # Let's preload the pretrained model BERT in the constructor of our
        # classifier model
        self.bert = TFBertModel.from_pretrained(model_name)

        # TODO: define a (Dense) classification layer to compute logits
        # for each sequence in the batch. The number of output classes is
        # given by the intent_num_labels parameter.

        # Use the default linear activation (no softmax) to compute logits.
        # The softmax normalization will be computed in the loss function
        # instead of the model itself.

    def call(self, inputs, **kwargs):
        # Use the pretrained model to extract features from our encoded inputs:
        sequence_output, pooled_output = self.bert(inputs, **kwargs)

        # The second output of the main BERT layer has shape:
        # (batch_size, output_dim)
        # and gives a "pooled" representation for the full sequence from the
        # hidden state that corresponds to the "[CLS]" token.
        
        # TODO: use the classifier layer to compute the logits from the pooled
        # features.
        intent_logits = None
        return intent_logits


intent_model = IntentClassificationModel(intent_num_labels=len(intent_map))

intent_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                     loss=SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[SparseCategoricalAccuracy('accuracy')])

# TODO: uncomment to train the model:

# history = intent_model.fit(encoded_train, intent_train, epochs=2, batch_size=32,
#                            validation_data=(encoded_valid, intent_valid))

In [ ]:

Solution


In [ ]:
import tensorflow as tf
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense


class IntentClassificationModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, model_name="bert-base-cased",
                 dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)

        # Use the default linear activation (no softmax) to compute logits.
        # The softmax normalization will be computed in the loss function
        # instead of the model itself.
        self.intent_classifier = Dense(intent_num_labels)

    def call(self, inputs, **kwargs):
        sequence_output, pooled_output = self.bert(inputs, **kwargs)

        pooled_output = self.dropout(pooled_output,
                                     training=kwargs.get("training", False))
        
        intent_logits = self.intent_classifier(pooled_output)
        return intent_logits


intent_model = IntentClassificationModel(intent_num_labels=len(intent_map))

Our classification model outputs logits instead of probabilities: the final softmax normalization is not part of the model itself but is included in the loss function instead.

We therefore need to configure the loss function with SparseCategoricalCrossentropy(from_logits=True) accordingly:


In [ ]:
intent_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                     loss=SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[SparseCategoricalAccuracy('accuracy')])

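As a quick sanity check, here is a minimal sketch (with made-up logit values) showing that from_logits=True makes the loss apply the softmax normalization internally: both calls below compute the same cross-entropy value.


In [ ]:
import tensorflow as tf
from tensorflow.keras.losses import SparseCategoricalCrossentropy

logits = tf.constant([[2.0, 0.5, -1.0]])  # made-up logits for a 3-class problem
labels = tf.constant([0])

loss_from_logits = SparseCategoricalCrossentropy(from_logits=True)
loss_from_probas = SparseCategoricalCrossentropy(from_logits=False)

print(float(loss_from_logits(labels, logits)))
print(float(loss_from_probas(labels, tf.nn.softmax(logits))))
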
In [ ]:
history = intent_model.fit(encoded_train, intent_train, epochs=2, batch_size=32,
                           validation_data=(encoded_valid, intent_valid))

In [ ]:
def classify(text, tokenizer, model, intent_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    class_id = model(inputs).numpy().argmax(axis=1)[0]
    return intent_names[class_id]


classify("Book a table for two at La Tour d'Argent for Friday night.",
         tokenizer, intent_model, intent_names)

In [ ]:
classify("I would like to listen to Anima by Thom Yorke.",
         tokenizer, intent_model, intent_names)

In [ ]:
classify("Will it snow tomorrow in Saclay?",
         tokenizer, intent_model, intent_names)

In [ ]:
classify("Where can I see to the last Star Wars near Odéon tonight?",
         tokenizer, intent_model, intent_names)

Joint Intent Classification and Slot Filling

Let's now refine our Natural Language Understanding system by trying to retrieve the important structured elements of each voice command.

To do so we will perform word level (or token level) classification of the BIO labels.

Since we have word level tags but BERT uses a wordpiece tokenizer, we need to align the BIO labels with the BERT tokens.

Let's load the list of possible word token labels and augment it with an additional padding label to be able to ignore special tokens:


In [ ]:
slot_names = ["[PAD]"]
slot_names += Path("vocab.slot").read_text("utf-8").strip().splitlines()
slot_map = {}
for label in slot_names:
    slot_map[label] = len(slot_map)
slot_map

The following function generates token-aligned integer labels from the word-level BIO annotations. In particular, if a specific word is split into several subword tokens, we expand its label to all the tokens of that word, taking care of using the "B-" label only for the first token and then the "I-" label of the matching slot type for the subsequent tokens of the same word:


In [ ]:
def encode_token_labels(text_sequences, slot_names, tokenizer, slot_map,
                        max_length):
    encoded = np.zeros(shape=(len(text_sequences), max_length), dtype=np.int32)
    for i, (text_sequence, word_labels) in enumerate(
            zip(text_sequences, slot_names)):
        encoded_labels = []
        for word, word_label in zip(text_sequence.split(), word_labels.split()):
            tokens = tokenizer.tokenize(word)
            encoded_labels.append(slot_map[word_label])
            expand_label = word_label.replace("B-", "I-")
            if expand_label not in slot_map:
                expand_label = word_label
            encoded_labels.extend([slot_map[expand_label]] * (len(tokens) - 1))
        encoded[i, 1:len(encoded_labels) + 1] = encoded_labels
    return encoded


slot_train = encode_token_labels(
    df_train["words"], df_train["word_labels"], tokenizer, slot_map, 45)
slot_valid = encode_token_labels(
    df_valid["words"], df_valid["word_labels"], tokenizer, slot_map, 45)
slot_test = encode_token_labels(
    df_test["words"], df_test["word_labels"], tokenizer, slot_map, 45)

In [ ]:
slot_train[0]

In [ ]:
slot_valid[0]

Note that the special tokens such as "[CLS]" and "[SEP]" as well as all the padded positions receive a 0 ("[PAD]") label.

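To double check the alignment, we can map the slot ids of the first training sequence back to their label names next to the corresponding BERT tokens (a quick sanity check, not required for training):


In [ ]:
# Map the slot ids of the first training sequence back to label names,
# aligned with the BERT tokens of the first utterance.
first_tokens = tokenizer.tokenize(df_train.iloc[0]["words"])
first_labels = [slot_names[slot_id]
                for slot_id in slot_train[0][1:len(first_tokens) + 1]]
list(zip(first_tokens, first_labels))
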
Exercise

Use the following code template to build a joint sequence and token classification model suitable for training on our encoded dataset with slot labels:


In [ ]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense


class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 model_name="bert-base-cased", dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        # TODO: define all the needed layers here.

    def call(self, inputs, **kwargs):
        # TODO: extract the features from the inputs using the pre-trained
        # BERT model here.

        # TODO: use the new layers to predict slot class (logits) for each
        # token position in the input sequence:
        slot_logits = None  # (batch_size, seq_len, slot_num_labels)

        # TODO: define a second classification head for the sequence-wise
        # predictions:
        intent_logits = None  # (batch_size, intent_num_labels)

        return slot_logits, intent_logits


joint_model = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

# Define one classification loss for each output:
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]
joint_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                    loss=losses)

# TODO: uncomment to train the model:
# history = joint_model.fit(
#     encoded_train, (slot_train, intent_train),
#     validation_data=(encoded_valid, (slot_valid, intent_valid)),
#     epochs=2, batch_size=32)

In [ ]:

Solution:


In [ ]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense


class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 model_name="bert-base-cased", dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.intent_classifier = Dense(intent_num_labels,
                                       name="intent_classifier")
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        sequence_output, pooled_output = self.bert(inputs, **kwargs)

        # The first output of the main BERT layer has shape:
        # (batch_size, max_length, output_dim)
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        # The second output of the main BERT layer has shape:
        # (batch_size, output_dim)
        # and gives a "pooled" representation for the full sequence from the
        # hidden state that corresponds to the "[CLS]" token.
        pooled_output = self.dropout(pooled_output,
                                     training=kwargs.get("training", False))
        intent_logits = self.intent_classifier(pooled_output)

        return slot_logits, intent_logits


joint_model = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

In [ ]:
opt = Adam(learning_rate=3e-5, epsilon=1e-08)
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]
metrics = [SparseCategoricalAccuracy('accuracy')]
joint_model.compile(optimizer=opt, loss=losses, metrics=metrics)

In [ ]:
history = joint_model.fit(
    encoded_train, (slot_train, intent_train),
    validation_data=(encoded_valid, (slot_valid, intent_valid)),
    epochs=2, batch_size=32)

The following function uses our trained model to make a prediction on a single text sequence and displays both the sequence-wise and the token-wise class labels:


In [ ]:
def show_predictions(text, tokenizer, model, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits, intent_logits = outputs
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]
    print("## Intent:", intent_names[intent_id])
    print("## Slots:")
    for token, slot_id in zip(tokenizer.tokenize(text), slot_ids):
        print(f"{token:>10} : {slot_names[slot_id]}")

In [ ]:
show_predictions("Book a table for two at Le Ritz for Friday night!",
                 tokenizer, joint_model, intent_names, slot_names)

In [ ]:
show_predictions("Will it snow tomorrow in Saclay?",
                 tokenizer, joint_model, intent_names, slot_names)

In [ ]:
show_predictions("I would like to listen to Anima by Thom Yorke.",
                 tokenizer, joint_model, intent_names, slot_names)

Decoding Predictions into Structured Knowledge

For completeness, here is a minimal function that naively decodes the predicted BIO slot ids and converts them into a structured representation of the detected slots as a Python dictionary:


In [ ]:
def decode_predictions(text, tokenizer, intent_names, slot_names,
                       intent_id, slot_ids):
    info = {"intent": intent_names[intent_id]}
    collected_slots = {}
    active_slot_words = []
    active_slot_name = None
    for word in text.split():
        tokens = tokenizer.tokenize(word)
        current_word_slot_ids = slot_ids[:len(tokens)]
        slot_ids = slot_ids[len(tokens):]
        current_word_slot_name = slot_names[current_word_slot_ids[0]]
        if current_word_slot_name == "O":
            if active_slot_name:
                collected_slots[active_slot_name] = " ".join(active_slot_words)
                active_slot_words = []
                active_slot_name = None
        else:
            # Naive BIO handling: treat B- and I- labels the same...
            new_slot_name = current_word_slot_name[2:]
            if active_slot_name is None:
                active_slot_words.append(word)
                active_slot_name = new_slot_name
            elif new_slot_name == active_slot_name:
                active_slot_words.append(word)
            else:
                collected_slots[active_slot_name] = " ".join(active_slot_words)
                active_slot_words = [word]
                active_slot_name = new_slot_name
    if active_slot_name:
        collected_slots[active_slot_name] = " ".join(active_slot_words)
    info["slots"] = collected_slots
    return info

In [ ]:
def nlu(text, tokenizer, model, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits, intent_logits = outputs
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]

    return decode_predictions(text, tokenizer, intent_names, slot_names,
                              intent_id, slot_ids)

nlu("Book a table for two at Le Ritz for Friday night",
    tokenizer, joint_model, intent_names, slot_names)

In [ ]:
nlu("Will it snow tomorrow in Saclay",
    tokenizer, joint_model, intent_names, slot_names)

In [ ]:
nlu("I would like to listen to Anima by Thom Yorke",
    tokenizer, joint_model, intent_names, slot_names)

Limitations

Language

BERT is pretrained primarily on English content. It can therefore only extract meaningful features from text written in English.

Note that there exist alternative pretrained models that use a mix of different languages (e.g. XLM) and others that have been pretrained on other languages. For instance CamemBERT is pretrained on French text. Both kinds of models are available in the transformers package:

https://github.com/huggingface/transformers#model-architectures
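
As an illustration, here is a hedged sketch of how one could load such an alternative pretrained model with the generic auto classes (assuming the installed transformers version ships TF weights for this model and that its tokenizer dependencies, e.g. sentencepiece, are available; not run in this notebook):


In [ ]:
from transformers import AutoTokenizer, TFAutoModel

# "camembert-base" is a French pretrained model published on the huggingface
# model hub (illustrative choice; other model names would work similarly).
french_tokenizer = AutoTokenizer.from_pretrained("camembert-base")
french_model = TFAutoModel.from_pretrained("camembert-base")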

The public snips.ai dataset used for fine-tuning is English only. To build a model for another language we would need to collect and annotate a similar corpus with tens of thousands of diverse, representative samples.

Biases Embedded in the Pre-Trained Model

The original data used to pre-train BERT was collected from the Internet and contains all kinds of text, including offensive and hateful speech.

While our voice command understanding system is quite unlikely to be significantly impacted by those biases, they could be a serious problem for other kinds of applications, such as Machine Translation for instance.

It is therefore strongly recommended to spend time auditing the biases embedded in such pre-trained models before deciding to deploy systems derived from them.

Computational Resources

The original BERT model has many parameters, which use a lot of memory and can make it prohibitive to deploy on small devices such as mobile phones. It is also computationally intensive and typically requires powerful GPUs or TPUs to process text data at a reasonable speed (both for training and at inference time).

Designing alternative architectures with fewer parameters or more efficient training and inference procedures is still a very active area of research.
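
For instance, smaller distilled models published on the huggingface model hub can be loaded the same way as BERT (a hedged example, assuming the installed transformers version provides this model; "distilbert-base-uncased" is only one possible choice):


In [ ]:
from transformers import AutoTokenizer, TFAutoModel

# A distilled model has fewer parameters and runs faster than bert-base, at the
# cost of some accuracy.
small_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
small_model = TFAutoModel.from_pretrained("distilbert-base-uncased")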

Depending on the problem, simpler architectures based on convolutional neural networks or LSTMs might offer a better speed / accuracy trade-off.


In [ ]: