The goal of this notebook is to fine-tune a pretrained transformer-based neural network model to convert a user query expressed in English into a representation that is structured enough to be processed by an automated service.
Here is an example of an interpretation computed by such a Natural Language Understanding (NLU) system:
>>> nlu("Book a table for two at Le Ritz for Friday night",
tokenizer, joint_model, intent_names, slot_names)
{
'intent': 'BookRestaurant',
'slots': {
'party_size_number': 'two',
'restaurant_name': 'Le Ritz',
'timeRange': 'Friday night'
}
}
Intent classification is a simple sequence classification problem. The trick is to treat the structured knowledge extraction part ("slot filling") as a token-level classification problem using BIO annotations:
>>> show_predictions("Book a table for two at Le Ritz for Friday night!",
... tokenizer, joint_model, intent_names, slot_names)
## Intent: BookRestaurant
## Slots:
Book : O
a : O
table : O
for : O
two : B-party_size_number
at : O
Le : B-restaurant_name
R : I-restaurant_name
##itz : I-restaurant_name
for : O
Friday : B-timeRange
night : I-timeRange
! : O
We will show how to train such a joint "sequence classification" and "token classification" model on a voice command dataset published by snips.ai.
This notebook is a partial reproduction of some of the results presented in this paper:
BERT for Joint Intent Classification and Slot Filling, Qian Chen, Zhu Zhuo, Wen Wang
In [4]:
%tensorflow_version 2.x
In [5]:
!nvidia-smi
In [6]:
%pip install -q transformers
We will use a speech command dataset collected, annotated and published by the French startup SNIPS.ai (bought in 2019 by the audio device manufacturer Sonos).
The original dataset comes in YAML format with inline markdown annotations.
Instead we will use a preprocessed variant with token-level B-I-O annotations, closer to the representation our model will predict. This variant of the SNIPS dataset was prepared by Su Zhu.
In [7]:
from urllib.request import urlretrieve
from pathlib import Path
SNIPS_DATA_BASE_URL = (
"https://github.com/ogrisel/slot_filling_and_intent_detection_of_SLU/blob/"
"master/data/snips/"
)
for filename in ["train", "valid", "test", "vocab.intent", "vocab.slot"]:
    path = Path(filename)
    if not path.exists():
        print(f"Downloading {filename}...")
        urlretrieve(SNIPS_DATA_BASE_URL + filename + "?raw=true", path)
Let's have a look at the first lines from the training set:
In [8]:
lines_train = Path("train").read_text("utf-8").strip().splitlines()
lines_train[:5]
Out[8]:
Some remarks: each line contains an annotated utterance as space-separated word:label pairs, followed by the separator " <=> " and the intent name.
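For illustration (this is a made-up example in the same format, not necessarily a line from the actual file):
play:O some:O jazz:B-genre on:O spotify:B-service <=> PlayMusic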
Let's write a parsing function and test it on the first line:
In [ ]:
def parse_line(line):
    utterance_data, intent_label = line.split(" <=> ")
    items = utterance_data.split()
    words = [item.rsplit(":", 1)[0] for item in items]
    word_labels = [item.rsplit(":", 1)[1] for item in items]
    return {
        "intent_label": intent_label,
        "words": " ".join(words),
        "word_labels": " ".join(word_labels),
        "length": len(words),
    }
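As a quick illustration (the line below is made up, not taken from the dataset), parse_line turns one annotated line into a small dictionary:

# Illustrative usage of parse_line on a hypothetical annotated line:
example_line = "play:O some:O jazz:B-genre <=> PlayMusic"
parse_line(example_line)
# expected result:
# {'intent_label': 'PlayMusic',
#  'words': 'play some jazz',
#  'word_labels': 'O O B-genre',
#  'length': 3}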
In [10]:
parse_line(lines_train[0])
Out[10]:
This utterance is a voice command of type "AddToPlaylist" with two annotations:
The goal of this project is to build a baseline Natural Language Understanding model to analyse such voice commands and predict:
The lists of possible classes for the sentence-level and word-level classification problems are given as:
In [11]:
print(Path("vocab.intent").read_text("utf-8"))
In [12]:
print(Path("vocab.slot").read_text("utf-8"))
"POI" stands for "Point of Interest".
Let's parse all the lines and store the results in pandas DataFrames:
In [13]:
import pandas as pd
parsed = [parse_line(line) for line in lines_train]
df_train = pd.DataFrame([p for p in parsed if p is not None])
df_train
Out[13]:
In [15]:
df_train.groupby("intent_label").count()
Out[15]:
In [16]:
df_train.hist("length", bins=30);
In [ ]:
lines_valid = Path("valid").read_text("utf-8").strip().splitlines()
lines_test = Path("test").read_text("utf-8").strip().splitlines()
df_valid = pd.DataFrame([parse_line(line) for line in lines_valid])
df_test = pd.DataFrame([parse_line(line) for line in lines_test])
Let's ignore the slot filling task for now and try to build a sentence-level classifier by fine-tuning a pre-trained Transformer-based model using the huggingface/transformers
package, which provides both TF2/Keras and PyTorch APIs.
First let's load a pre-trained tokenizer and try it on a sentence from the training set:
In [ ]:
from transformers import BertTokenizer
model_name = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
In [19]:
first_sentence = df_train.iloc[0]["words"]
first_sentence
Out[19]:
In [20]:
tokenizer.tokenize(first_sentence)
Out[20]:
Notice that BERT uses subword tokens so the length of the tokenized sentence is likely to be larger than the number of words in the sentence.
Question:
Each token string is mapped to a unique integer id that makes it fast to look up the corresponding entry in the token embedding table of the input layer:
In [21]:
tokenizer.encode(first_sentence)
Out[21]:
In [22]:
tokenizer.decode(tokenizer.encode(first_sentence))
Out[22]:
Remarks:
The [CLS] token is used by the pre-training task for sequence classification.
The [SEP] token is a separator for the pre-training task that classifies whether a pair of sentences are consecutive in a corpus or not (next sentence prediction).
Our intent classifier will reuse the model output for the [CLS] token for sequence classification.
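As a quick check that encode adds these special tokens (a small sketch reusing the tokenizer and first_sentence defined above):

# tokenize() returns the wordpiece tokens only, while encode() also adds the
# ids of the special tokens; decoding the ids back to strings makes this visible:
print(tokenizer.tokenize(first_sentence))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(first_sentence)))
# The second list should be the first one wrapped between [CLS] and [SEP].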
In [23]:
import matplotlib.pyplot as plt
train_sequence_lengths = [len(tokenizer.encode(text))
for text in df_train["words"]]
plt.hist(train_sequence_lengths, bins=30)
plt.title(f"max sequence length: {max(train_sequence_lengths)}");
To perform transfer learning, we will need to work with padded sequences so that they all have the same size. The above histogram shows that after tokenization, 43 tokens are enough to represent all the voice commands in the training set.
The mapping can be introspected in the tokenizer.vocab
attribute:
In [24]:
tokenizer.vocab_size
Out[24]:
In [25]:
bert_vocab_items = list(tokenizer.vocab.items())
bert_vocab_items[:10]
Out[25]:
In [26]:
bert_vocab_items[100:110]
Out[26]:
In [27]:
bert_vocab_items[900:910]
Out[27]:
In [28]:
bert_vocab_items[1100:1110]
Out[28]:
In [29]:
bert_vocab_items[20000:20010]
Out[29]:
In [30]:
bert_vocab_items[-10:]
Out[30]:
A couple of remarks: the vocabulary mixes special tokens, individual characters, frequent full words and subword units (prefixed with "##"). Thanks to this subword tokenization, the tokenizer rarely needs to resort to the [UNK]
special token, as rare words can often be represented as a sequence of frequent enough short subwords in a meaningful way.
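As a small illustrative check of this behavior (the exact subword split depends on the bert-base-cased vocabulary), we can tokenize a word that is very unlikely to be in the vocabulary as a whole:

# A rare word is decomposed into known subword pieces rather than mapped to [UNK]:
print(tokenizer.tokenize("Anticonstitutionnellement"))
print(tokenizer.unk_token, tokenizer.unk_token_id)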
In [31]:
import numpy as np
def encode_dataset(tokenizer, text_sequences, max_length):
    token_ids = np.zeros(shape=(len(text_sequences), max_length),
                         dtype=np.int32)
    for i, text_sequence in enumerate(text_sequences):
        encoded = tokenizer.encode(text_sequence)
        token_ids[i, 0:len(encoded)] = encoded
    attention_masks = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_masks": attention_masks}
encoded_train = encode_dataset(tokenizer, df_train["words"], 45)
encoded_train["input_ids"]
Out[31]:
In [32]:
encoded_train["attention_masks"]
Out[32]:
In [ ]:
encoded_valid = encode_dataset(tokenizer, df_valid["words"], 45)
encoded_test = encode_dataset(tokenizer, df_test["words"], 45)
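Note that recent versions of the transformers tokenizers can compute this padding and the attention mask directly; a sketch (the exact keyword arguments depend on the installed version, and the returned key is "attention_mask" rather than our "attention_masks"):

# Built-in padding and truncation, assuming a recent transformers release:
batch = tokenizer(list(df_train["words"][:3]), padding="max_length",
                  truncation=True, max_length=45, return_tensors="np")
batch["input_ids"].shape, batch["attention_mask"].shape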
In [34]:
intent_names = Path("vocab.intent").read_text("utf-8").split()
intent_map = dict((label, idx) for idx, label in enumerate(intent_names))
intent_map
Out[34]:
In [35]:
intent_train = df_train["intent_label"].map(intent_map).values
intent_train
Out[35]:
In [ ]:
intent_valid = df_valid["intent_label"].map(intent_map).values
intent_test = df_test["intent_label"].map(intent_map).values
Let's load a pretrained BERT model using the huggingface transformers package:
In [37]:
from transformers import TFBertModel
base_bert_model = TFBertModel.from_pretrained("bert-base-cased")
base_bert_model.summary()
In [38]:
encoded_valid
Out[38]:
In [39]:
outputs = base_bert_model(encoded_valid)
len(outputs)
Out[39]:
The first output of the BERT model is a tensor with shape (batch_size, seq_len, output_dim), which contains the features computed for each token in the input sequence:
In [40]:
outputs[0].shape
Out[40]:
The second output of the BERT model is a tensor with shape (batch_size, output_dim), which is the vector representation of the special [CLS]
token. This vector is typically used as a pooled representation of the sequence as a whole. It will be used as the features of our intent classifier:
In [41]:
outputs[1].shape
Out[41]:
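The pooled vector is computed by passing the hidden state of the first ([CLS]) token through an additional dense layer with a tanh activation. If needed, that raw [CLS] hidden state can also be sliced directly from the first output (a quick sketch reusing the outputs computed above):

# Hidden state of the [CLS] token (position 0) for each sequence in the batch:
cls_hidden_state = outputs[0][:, 0, :]
cls_hidden_state.shape  # (batch_size, output_dim), same shape as outputs[1]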
In [ ]:
import tensorflow as tf
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy
class IntentClassificationModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, model_name="bert-base-cased",
                 dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        # Let's preload the pretrained BERT model in the constructor of our
        # classifier model.
        self.bert = TFBertModel.from_pretrained(model_name)

        # TODO: define a (Dense) classification layer to compute the intent
        # logits for each sequence in the batch. The number of output classes
        # is given by the intent_num_labels parameter.
        # Use the default linear activation (no softmax) to compute logits.
        # The softmax normalization will be computed in the loss function
        # instead of the model itself.

    def call(self, inputs, **kwargs):
        # Use the pretrained model to extract features from our encoded inputs:
        sequence_output, pooled_output = self.bert(inputs, **kwargs)

        # The second output of the main BERT layer has shape:
        # (batch_size, output_dim)
        # and gives a "pooled" representation of the full sequence from the
        # hidden state that corresponds to the "[CLS]" token.

        # TODO: use the classification layer to compute the logits from the
        # pooled features.
        intent_logits = None
        return intent_logits


intent_model = IntentClassificationModel(intent_num_labels=len(intent_map))

intent_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                     loss=SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[SparseCategoricalAccuracy('accuracy')])

# TODO: uncomment to train the model:
# history = intent_model.fit(encoded_train, intent_train, epochs=2, batch_size=32,
#                            validation_data=(encoded_valid, intent_valid))
In [ ]:
In [ ]:
import tensorflow as tf
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
class IntentClassificationModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, model_name="bert-base-cased",
                 dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        # Use the default linear activation (no softmax) to compute logits.
        # The softmax normalization will be computed in the loss function
        # instead of the model itself.
        self.intent_classifier = Dense(intent_num_labels)

    def call(self, inputs, **kwargs):
        sequence_output, pooled_output = self.bert(inputs, **kwargs)
        pooled_output = self.dropout(pooled_output,
                                     training=kwargs.get("training", False))
        intent_logits = self.intent_classifier(pooled_output)
        return intent_logits


intent_model = IntentClassificationModel(intent_num_labels=len(intent_map))
Our classification model outputs logits instead of probabilities: the final softmax normalization is not included in the model itself but in the loss function. We therefore need to configure the loss function SparseCategoricalCrossentropy(from_logits=True)
accordingly:
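If actual probabilities are needed at prediction time, the softmax can be applied explicitly to the logits returned by the model. A minimal sketch on a single made-up query (meaningful probabilities of course require the model to be trained first, as done below):

# The model outputs one logit per intent class; an explicit softmax turns
# them into a normalized probability distribution:
query = tf.constant(tokenizer.encode("Play some jazz music"))[None, :]  # batch of 1
intent_probas = tf.nn.softmax(intent_model(query), axis=-1)
intent_probas.shape  # (1, number of intent classes); each row sums to 1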
In [ ]:
intent_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
loss=SparseCategoricalCrossentropy(from_logits=True),
metrics=[SparseCategoricalAccuracy('accuracy')])
In [46]:
history = intent_model.fit(encoded_train, intent_train, epochs=2, batch_size=32,
validation_data=(encoded_valid, intent_valid))
In [47]:
def classify(text, tokenizer, model, intent_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    class_id = model(inputs).numpy().argmax(axis=1)[0]
    return intent_names[class_id]
classify("Book a table for two at La Tour d'Argent for Friday night.",
tokenizer, intent_model, intent_names)
Out[47]:
In [48]:
classify("I would like to listen to Anima by Thom Yorke.",
tokenizer, intent_model, intent_names)
Out[48]:
In [49]:
classify("Will it snow tomorrow in Saclay?",
tokenizer, intent_model, intent_names)
Out[49]:
In [50]:
classify("Where can I see to the last Star Wars near Odéon tonight?",
tokenizer, intent_model, intent_names)
Out[50]:
Let's now refine our Natural Language Understanding system by trying to retrieve the important structured elements of each voice command.
To do so we will perform word level (or token level) classification of the BIO labels.
Since we have word level tags but BERT uses a wordpiece tokenizer, we need to align the BIO labels with the BERT tokens.
Let's load the list of possible word token labels and augment it with an additional padding label to be able to ignore special tokens:
In [51]:
slot_names = ["[PAD]"]
slot_names += Path("vocab.slot").read_text("utf-8").strip().splitlines()
slot_map = {}
for label in slot_names:
slot_map[label] = len(slot_map)
slot_map
Out[51]:
The following function generates token-aligned integer labels from the BIO word-level annotations. In particular, if a specific word is split into several wordpiece tokens, we expand its label to all the tokens of that word, taking care to use a "B-" label only for the first token and the matching "I-" label for the subsequent tokens of the same word:
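For instance (an illustrative example, not taken from the dataset), the word-level annotation "Le:B-restaurant_name Ritz:I-restaurant_name" becomes the following token-level labels once "Ritz" is split into two wordpieces:

[CLS]   Le                  R                   ##itz               [SEP]
0       B-restaurant_name   I-restaurant_name   I-restaurant_name   0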
In [ ]:
def encode_token_labels(text_sequences, slot_names, tokenizer, slot_map,
                        max_length):
    encoded = np.zeros(shape=(len(text_sequences), max_length), dtype=np.int32)
    for i, (text_sequence, word_labels) in enumerate(
            zip(text_sequences, slot_names)):
        encoded_labels = []
        for word, word_label in zip(text_sequence.split(), word_labels.split()):
            tokens = tokenizer.tokenize(word)
            encoded_labels.append(slot_map[word_label])
            expand_label = word_label.replace("B-", "I-")
            if expand_label not in slot_map:
                expand_label = word_label
            encoded_labels.extend([slot_map[expand_label]] * (len(tokens) - 1))
        encoded[i, 1:len(encoded_labels) + 1] = encoded_labels
    return encoded


slot_train = encode_token_labels(
    df_train["words"], df_train["word_labels"], tokenizer, slot_map, 45)
slot_valid = encode_token_labels(
    df_valid["words"], df_valid["word_labels"], tokenizer, slot_map, 45)
slot_test = encode_token_labels(
    df_test["words"], df_test["word_labels"], tokenizer, slot_map, 45)
In [53]:
slot_train[0]
Out[53]:
In [54]:
slot_valid[0]
Out[54]:
Note that the special tokens such as "[CLS]" and "[SEP]" and all the padded positions receive a 0 label.
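As a quick sanity check (a small sketch reusing the variables defined above), we can map the integer labels of the first training utterance back to slot names and print them next to the corresponding tokens:

# Align the wordpiece tokens of the first training utterance with the
# integer slot labels computed for it (position 0 is the [CLS] token):
first_tokens = tokenizer.tokenize(df_train.iloc[0]["words"])
for token, label_id in zip(first_tokens, slot_train[0, 1:len(first_tokens) + 1]):
    print(f"{token:>15} : {slot_names[label_id]}")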
In [ ]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 model_name="bert-base-cased", dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        # TODO: define all the needed layers here.

    def call(self, inputs, **kwargs):
        # TODO: extract the features from the inputs using the pre-trained
        # BERT model here.

        # TODO: use the new layers to predict slot classes (logits) for each
        # token position in the input sequence:
        slot_logits = None  # (batch_size, seq_len, slot_num_labels)

        # TODO: define a second classification head for the sequence-wise
        # predictions:
        intent_logits = None  # (batch_size, intent_num_labels)

        return slot_logits, intent_logits


joint_model = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))

# Define one classification loss for each output:
losses = [SparseCategoricalCrossentropy(from_logits=True),
          SparseCategoricalCrossentropy(from_logits=True)]
joint_model.compile(optimizer=Adam(learning_rate=3e-5, epsilon=1e-08),
                    loss=losses)

# TODO: uncomment to train the model:
# history = joint_model.fit(
#     encoded_train, (slot_train, intent_train),
#     validation_data=(encoded_valid, (slot_valid, intent_valid)),
#     epochs=2, batch_size=32)
In [ ]:
In [ ]:
from transformers import TFBertModel
from tensorflow.keras.layers import Dropout, Dense
class JointIntentAndSlotFillingModel(tf.keras.Model):

    def __init__(self, intent_num_labels=None, slot_num_labels=None,
                 model_name="bert-base-cased", dropout_prob=0.1):
        super().__init__(name="joint_intent_slot")
        self.bert = TFBertModel.from_pretrained(model_name)
        self.dropout = Dropout(dropout_prob)
        self.intent_classifier = Dense(intent_num_labels,
                                       name="intent_classifier")
        self.slot_classifier = Dense(slot_num_labels,
                                     name="slot_classifier")

    def call(self, inputs, **kwargs):
        sequence_output, pooled_output = self.bert(inputs, **kwargs)

        # The first output of the main BERT layer has shape:
        # (batch_size, max_length, output_dim)
        sequence_output = self.dropout(sequence_output,
                                       training=kwargs.get("training", False))
        slot_logits = self.slot_classifier(sequence_output)

        # The second output of the main BERT layer has shape:
        # (batch_size, output_dim)
        # and gives a "pooled" representation of the full sequence from the
        # hidden state that corresponds to the "[CLS]" token.
        pooled_output = self.dropout(pooled_output,
                                     training=kwargs.get("training", False))
        intent_logits = self.intent_classifier(pooled_output)

        return slot_logits, intent_logits


joint_model = JointIntentAndSlotFillingModel(
    intent_num_labels=len(intent_map), slot_num_labels=len(slot_map))
In [ ]:
opt = Adam(learning_rate=3e-5, epsilon=1e-08)
losses = [SparseCategoricalCrossentropy(from_logits=True),
SparseCategoricalCrossentropy(from_logits=True)]
metrics = [SparseCategoricalAccuracy('accuracy')]
joint_model.compile(optimizer=opt, loss=losses, metrics=metrics)
In [58]:
history = joint_model.fit(
encoded_train, (slot_train, intent_train),
validation_data=(encoded_valid, (slot_valid, intent_valid)),
epochs=2, batch_size=32)
The following function uses our trained model to make a prediction on a single text sequence and display both the sequence-wise and the token-wise class labels:
In [ ]:
def show_predictions(text, tokenizer, model, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits, intent_logits = outputs
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]
    print("## Intent:", intent_names[intent_id])
    print("## Slots:")
    for token, slot_id in zip(tokenizer.tokenize(text), slot_ids):
        print(f"{token:>10} : {slot_names[slot_id]}")
In [60]:
show_predictions("Book a table for two at Le Ritz for Friday night!",
tokenizer, joint_model, intent_names, slot_names)
In [61]:
show_predictions("Will it snow tomorrow in Saclay?",
tokenizer, joint_model, intent_names, slot_names)
In [62]:
show_predictions("I would like to listen to Anima by Thom Yorke.",
tokenizer, joint_model, intent_names, slot_names)
In [ ]:
def decode_predictions(text, tokenizer, intent_names, slot_names,
                       intent_id, slot_ids):
    info = {"intent": intent_names[intent_id]}
    collected_slots = {}
    active_slot_words = []
    active_slot_name = None
    for word in text.split():
        tokens = tokenizer.tokenize(word)
        current_word_slot_ids = slot_ids[:len(tokens)]
        slot_ids = slot_ids[len(tokens):]
        current_word_slot_name = slot_names[current_word_slot_ids[0]]
        if current_word_slot_name == "O":
            if active_slot_name:
                collected_slots[active_slot_name] = " ".join(active_slot_words)
                active_slot_words = []
                active_slot_name = None
        else:
            # Naive BIO handling: treat B- and I- prefixes the same...
            new_slot_name = current_word_slot_name[2:]
            if active_slot_name is None:
                active_slot_words.append(word)
                active_slot_name = new_slot_name
            elif new_slot_name == active_slot_name:
                active_slot_words.append(word)
            else:
                collected_slots[active_slot_name] = " ".join(active_slot_words)
                active_slot_words = [word]
                active_slot_name = new_slot_name
    if active_slot_name:
        collected_slots[active_slot_name] = " ".join(active_slot_words)
    info["slots"] = collected_slots
    return info
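The decoding above deliberately treats "B-" and "I-" prefixes the same way. A slightly stricter variant (an illustrative sketch, not part of the original notebook; the function name is hypothetical) starts a new slot whenever a "B-" tag is seen, which matters when two entities of the same type are adjacent. Like decode_predictions, it stores slots in a dict, so only the last entity of a given type is kept:

def decode_predictions_strict(text, tokenizer, intent_names, slot_names,
                              intent_id, slot_ids):
    # Illustrative variant of decode_predictions where a "B-" tag always
    # starts a new slot instead of extending the previous one.
    info = {"intent": intent_names[intent_id]}
    collected_slots = {}
    active_words, active_name = [], None
    for word in text.split():
        tokens = tokenizer.tokenize(word)
        word_slot_ids, slot_ids = slot_ids[:len(tokens)], slot_ids[len(tokens):]
        tag = slot_names[word_slot_ids[0]]
        is_slot = tag.startswith("B-") or tag.startswith("I-")
        # Close the currently open slot if this word does not continue it:
        if active_name and (not is_slot or tag.startswith("B-")
                            or tag[2:] != active_name):
            collected_slots[active_name] = " ".join(active_words)
            active_words, active_name = [], None
        if is_slot:
            active_name = tag[2:]
            active_words.append(word)
    if active_name:
        collected_slots[active_name] = " ".join(active_words)
    info["slots"] = collected_slots
    return info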
In [64]:
def nlu(text, tokenizer, model, intent_names, slot_names):
    inputs = tf.constant(tokenizer.encode(text))[None, :]  # batch_size = 1
    outputs = model(inputs)
    slot_logits, intent_logits = outputs
    slot_ids = slot_logits.numpy().argmax(axis=-1)[0, 1:-1]
    intent_id = intent_logits.numpy().argmax(axis=-1)[0]
    return decode_predictions(text, tokenizer, intent_names, slot_names,
                              intent_id, slot_ids)


nlu("Book a table for two at Le Ritz for Friday night",
    tokenizer, joint_model, intent_names, slot_names)
Out[64]:
In [65]:
nlu("Will it snow tomorrow in Saclay",
tokenizer, joint_model, intent_names, slot_names)
Out[65]:
In [66]:
nlu("I would like to listen to Anima by Thom Yorke",
tokenizer, joint_model, intent_names, slot_names)
Out[66]:
BERT is pretrained primarily on English content. It can therefore only extract meaningful features from text written in English.
Note that there exist alternative pretrained models that use a mix of different languages (e.g. XLM) and others that have been trained on other languages; for instance, CamemBERT is pretrained on French text. Both kinds of models are available in the transformers package:
https://github.com/huggingface/transformers#model-architectures
The public snips.ai dataset used for fine-tuning is English only. To build a model for another language we would need to collect and annotate a similar corpus with tens of thousands of diverse, representative samples.
The original data used to pre-train BERT was collected from the Internet and contains all kinds of content, including offensive and hateful speech.
While a voice command understanding system built on BERT is quite unlikely to be significantly impacted by those biases, they could be a serious problem for other kinds of applications, such as Machine Translation for instance.
It is therefore strongly recommended to spend time auditing the biases embedded in such pre-trained models before deciding to deploy systems derived from them.
The original BERT model has a large number of parameters, which requires a lot of memory and can make deployment on small devices such as mobile phones prohibitive. It is also computationally intensive and typically requires powerful GPUs or TPUs to process text data at a reasonable speed (both for training and at inference time).
Designing alternative architectures with fewer parameters or more efficient training and inference procedures is still a very active area of research.
Depending on the problem, simpler architectures based on convolutional neural networks or LSTMs might offer a better speed / accuracy trade-off.
In [ ]: