We use a small training corpus (botchan.txt) in this example. (Botchan is a novel written by Natsume Sōseki in 1906; the sample is an English translation.)
In [0]:
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
In [0]:
import sentencepiece as spm
# train a sentencepiece model from `botchan.txt` and produce `m.model` and `m.vocab`
# `m.vocab` is just a reference; it is not used during segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# make a segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
In [0]:
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbols.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
Sentencepiece's model file is just a serialized protocol buffer. We can instantiate a sentencepiece processor from a bytes object with the load_from_serialized_proto method.
In [0]:
import tensorflow as tf
# Assumes that m.model is stored in a non-POSIX file system.
serialized_model_proto = tf.io.gfile.GFile('m.model', 'rb').read()
sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)
print(sp.encode_as_pieces('this is a test'))
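If TensorFlow is not needed, the same method also accepts bytes read with plain Python file I/O; a minimal sketch:
In [0]:
# a minimal sketch: read the model file with plain Python I/O and pass the
# bytes to load_from_serialized_proto, without depending on TensorFlow.
serialized_model_proto = open('m.model', 'rb').read()
sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)
print(sp.encode_as_pieces('this is a test'))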
We can define special tokens (symbols) to tweak the DNN behavior through these tokens. Typical examples are BERT's special symbols, e.g., [SEP] and [CLS].
There are two types of special tokens:
- user defined symbols: always handled as a single token in any context, and allowed to appear in the input text.
- control symbols: only their ids are reserved. Even if these tokens appear in the input text, they are not handled as a single token; the user inserts their ids explicitly after encoding.
For experimental purposes, user defined symbols are easier to use, since the behavior can be changed just by modifying the input text. In a production setting, however, control symbols are preferable, because they prevent users from altering the behavior by injecting these special symbols into their input text.
In [0]:
## Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes:
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
In [0]:
## Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
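Because the segmenter never emits control symbols, their ids are inserted by the application after encoding. A minimal sketch, reusing the m_ctrl model above, of joining two segments with <sep>:
In [0]:
# control symbol ids are added explicitly by the application, not by the segmenter
sep_id = sp_ctrl.piece_to_id('<sep>')
ids = sp_ctrl.encode_as_ids('this is a test') + [sep_id] + sp_ctrl.encode_as_ids('hello world')
print(ids)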
BOS/EOS (<s>, </s>) are defined as control symbols by default, but we can define them as user defined symbols instead.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id()) # disabled by default
print(sp.encode_as_ids('Hello world'))
# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
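If your sentencepiece build exposes set_encode_extra_options (the Python counterpart of spm_encode's --extra_options flag), BOS/EOS can also be added automatically. A hedged sketch, assuming that method is available and that an empty string restores the defaults:
In [0]:
# assumption: set_encode_extra_options is available in this sentencepiece build.
sp.set_encode_extra_options('bos:eos')  # wrap every encoded sentence with <s> ... </s>
print(sp.encode_as_ids('Hello world'))
sp.set_encode_extra_options('')         # assumption: empty string restores the default behavior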
By default, UNK/BOS/EOS/PAD tokens and their ids are defined as follows:
token | UNK | BOS | EOS | PAD
---|---|---|---|---
surface | <unk> | <s> | </s> | <pad>
id | 0 | 1 | 2 | undefined (-1)
We can change these mappings with --{unk|bos|eos|pad}_id and --{unk|bos|eos|pad}_piece flags.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
When an id is set to -1, the corresponding special symbol is disabled. Note that UNK must not be disabled.
In [0]:
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
The UNK id is decoded into U+2047 (⁇) by default. We can change the UNK surface with the --unk_surface=<STR> flag.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()])) # default is U+2047
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --unk_surface=__UNKNOWN__')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()]))
When --model_type=unigram (the default) is used, we can perform sampling and n-best segmentation for data augmentation. See the subword regularization paper [kudo18] for more detail.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# Can obtain different segmentations per request.
# There are two hyperparameters for sampling: nbest_size and alpha (inverse temperature).
# nbest_size=-1 samples from all segmentation hypotheses. See the paper [kudo18] for detail.
for n in range(10):
    print(sp.sample_encode_as_pieces('hello world', -1, 0.1))
for n in range(10):
    print(sp.sample_encode_as_ids('hello world', -1, 0.1))
In [0]:
# get 10 best
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
Sentencepiece supports BPE (byte-pair encoding) for subword segmentation with the --model_type=bpe flag. We did not find empirical differences in translation quality between BPE and the unigram model, but the unigram model supports sampling and n-best segmentation. See the subword regularization paper [kudo18] for more detail.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5)) # returns an empty list.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')
print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
Sentencepiece supports character and word segmentation with the --model_type=char and --model_type=word flags.
In word segmentation, sentencepiece just segments tokens on whitespace, so the input text must be pre-tokenized.
We can apply a different segmentation algorithm transparently without changing the pre/post processors.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')
sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')
print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')
sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')
print(sp_word.encode_as_pieces('this is a test.')) # '.' is not split off as its own token.
print(sp_word.encode_as_ids('this is a test.'))
Sentencepiece provides several general pre-defined normalization rules, e.g., nmt_nfkc (the default) and nfkc_cf (NFKC plus case folding). We can change the normalizer with the --normalization_rule_name=<NAME> flag.
In [0]:
import sentencepiece as spm
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
Normalization is performed with user-defined string-to-string mappings and leftmost-longest matching. We can also define custom normalization rules as a TSV file. The TSV files for the pre-defined normalization rules can be found in the data directory (sample). The normalization rules are compiled into an FST and embedded in the model file, so we don't need to specify any normalization configuration in the segmentation phase.
Here is an example of custom normalization. The TSV file is passed with the --normalization_rule_tsv=<FILE> flag.
In [0]:
def tocode(s):
    out = []
    for c in s:
        out.append(str(hex(ord(c))).replace('0x', 'U+'))
    return ' '.join(out)

# TSV format: source Unicode code points <tab> target code points
# normalize "don't => do not, I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
    f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
    f.write(tocode("don't") + '\t' + tocode("do not") + '\n')

print(open('normalization_rule.tsv', 'r').read())

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')
sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rules compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy")) # normalized to 'I am busy'
print(sp.encode_as_pieces("I don't know it.")) # normalized to 'I do not know it.'
Sentencepiece loads all the lines of the training data into memory to train the model. Larger training data increases training time and memory usage, both roughly linear in the data size. When --input_sentence_size=<SIZE> is specified, Sentencepiece randomly samples <SIZE> lines from the whole training data. --shuffle_input_sentence=false disables the random shuffling and takes the first <SIZE> lines instead.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces('this is a test.')
We can restrict the encoder to use only the tokens specified with the set_vocabulary method. The background of this feature is described on the subword-nmt page.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
# Gets all tokens as a Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# Aggregates the frequency of each token in the training data.
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in sp.encode_as_pieces(line):
            freq.setdefault(piece, 0)
            freq[piece] += 1

# only use tokens that appear more than 1000 times in the training data.
vocabs = list(filter(lambda x: x in freq and freq[x] > 1000, vocabs))
sp.set_vocabulary(vocabs)
print(sp.encode_as_pieces('this is a test.'))

# reset the restriction
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test.'))
By default, Sentencepiece does not extract pieces that cross word boundaries (here a word means a space-delimited token), so a piece never contains the whitespace marker (▁) in the middle.
--split_by_whitespace=false disables this restriction and allows extracting pieces that cross multiple words. In CJK (Chinese/Japanese/Korean), this flag does not affect the final segmentation results much, since CJK words are not delimited by whitespace.
In [0]:
import re
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --split_by_whitespace=false')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# Gets all tokens as a Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# pieces containing ▁ in the middle cross a word boundary
for piece in vocabs[0:500]:
    if re.match(r'\w+▁\w+', piece):
        print(piece)
We can also train the sentencepiece model from <word, frequency> pairs. First, make a TSV file where the first column is the word and the second column is the frequency. Then feed this TSV file with the --input_format=tsv flag. Note that when feeding TSV data for training, --split_by_whitespace=true is implicitly assumed.
In [0]:
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
Sentencepiece keeps track of the byte offsets (spans) of each token, which is useful for highlighting tokens on top of the unnormalized text.
We first need to install the protobuf module and sentencepiece_pb2.py, as the byte offsets and all other segmentation metadata are encoded in a protocol buffer. The encode_as_serialized_proto method returns a serialized SentencePieceText proto. You can get the deserialized object by calling the ParseFromString method.
The definition of the SentencePieceText proto is found here.
In [0]:
!pip install protobuf
!wget https://raw.githubusercontent.com/google/sentencepiece/master/python/sentencepiece_pb2.py
In [0]:
import sentencepiece_pb2
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# One best result
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto('ｈｅｌｌｏ')) # full-width "hello"
# begin/end offsets point into the original (unnormalized) input.
print(spt)
# Nbest results
nspt = sentencepiece_pb2.NBestSentencePieceText()
nspt.ParseFromString(sp.nbest_encode_as_serialized_proto('hello', 5))
# print(nspt)
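A short sketch of consuming the offsets, assuming the pieces field of SentencePieceText (with piece, begin, and end) as defined in the sentencepiece proto; begin/end are byte offsets into the original UTF-8 input:
In [0]:
# iterate over the pieces and recover each surface span from the raw input bytes
text = 'hello world'
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto(text))
raw = text.encode('utf-8')
for piece in spt.pieces:
    print(piece.piece, piece.begin, piece.end, raw[piece.begin:piece.end].decode('utf-8'))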