We use a small training corpus (botchan.txt) in this example. (Botchan is a novel written by Natsume Sōseki in 1906; the sample is an English translation.)
In [0]:
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
In [0]:
import sentencepiece as spm
# train a sentencepiece model from `botchan.txt` and produce `m.model` and `m.vocab`
# `m.vocab` is just a reference; it is not used during segmentation.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
# make a segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
print(sp.decode_ids([209, 31, 9, 375, 586]))
In [0]:
# returns vocab size
print(sp.get_piece_size())
# id <=> piece conversion
print(sp.id_to_piece(209))
print(sp.piece_to_id('▁This'))
# returns 0 for unknown tokens (we can change the id for UNK)
print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbols.
for id in range(3):
    print(sp.id_to_piece(id), sp.is_control(id))
Sentencepiece's model file is just a serialized protocol buffer. We can instantiate a sentencepiece processor from a bytes object with the load_from_serialized_proto method.
In [0]:
import tensorflow as tf
# Assumes that m.model is stored in a non-POSIX file system.
serialized_model_proto = tf.io.gfile.GFile('m.model', 'rb').read()
sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)
print(sp.encode_as_pieces('this is a test'))
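If TensorFlow is not needed, the same method also accepts bytes read with plain Python file I/O; a minimal sketch:
In [0]:
# a minimal sketch: read the model file with plain Python I/O and pass the
# bytes to load_from_serialized_proto, without depending on TensorFlow.
serialized_model_proto = open('m.model', 'rb').read()
sp = spm.SentencePieceProcessor()
sp.load_from_serialized_proto(serialized_model_proto)
print(sp.encode_as_pieces('this is a test'))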
We can define special tokens (symbols) to tweak the DNN behavior through these tokens. Typical examples are BERT's special symbols, e.g., [SEP] and [CLS].
There are two types of special tokens:
- user defined symbols: always handled as a single token in any context, and allowed to appear in the input text.
- control symbols: only their ids are reserved. Even if these tokens appear in the input text, they are not handled as a single token; the user inserts their ids explicitly after encoding.
For experimental purposes, user defined symbols are easier to use, since the behavior can be changed just by modifying the input text. In a production setting, however, control symbols are preferable, because they prevent users from altering the behavior by injecting these special symbols into their input text.
In [0]:
## Example of user defined symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_user --user_defined_symbols=<sep>,<cls> --vocab_size=2000')
sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
# ids are reserved in both modes:
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# user defined symbols allow these symbols to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_user.piece_to_id('<sep>')) # 3
print(sp_user.piece_to_id('<cls>')) # 4
print('3=', sp_user.decode_ids([3])) # decoded to <sep>
print('4=', sp_user.decode_ids([4])) # decoded to <cls>
In [0]:
## Example of control symbols
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_ctrl --control_symbols=<sep>,<cls> --vocab_size=2000')
sp_ctrl = spm.SentencePieceProcessor()
sp_ctrl.load('m_ctrl.model')
# control symbols just reserve ids.
print(sp_ctrl.encode_as_pieces('this is a test<sep> hello world<cls>'))
print(sp_ctrl.piece_to_id('<sep>')) # 3
print(sp_ctrl.piece_to_id('<cls>')) # 4
print('3=', sp_ctrl.decode_ids([3])) # decoded to empty
print('4=', sp_ctrl.decode_ids([4])) # decoded to empty
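Because the segmenter never emits control symbols, their ids are inserted by the application after encoding. A minimal sketch, reusing the m_ctrl model above, of joining two segments with <sep>:
In [0]:
# control symbol ids are added explicitly by the application, not by the segmenter
sep_id = sp_ctrl.piece_to_id('<sep>')
ids = sp_ctrl.encode_as_ids('this is a test') + [sep_id] + sp_ctrl.encode_as_ids('hello world')
print(ids)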
BOS/EOS (<s>, </s>) are defined as control symbols by default, but we can define them as user defined symbols instead.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bos_as_user --user_defined_symbols=<s>,</s> --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are segmented. (default behavior)
sp = spm.SentencePieceProcessor()
sp.load('m_bos_as_user.model')
print(sp.encode_as_pieces('<s> hello</s>')) # <s>,</s> are handled as one token.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id()) # disabled by default
print(sp.encode_as_ids('Hello world'))
# Prepend or append bos/eos ids.
print([sp.bos_id()] + sp.encode_as_ids('Hello world') + [sp.eos_id()])
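If your sentencepiece build exposes set_encode_extra_options (the Python counterpart of spm_encode's --extra_options flag), BOS/EOS can also be added automatically. A hedged sketch, assuming that method is available and that an empty string restores the defaults:
In [0]:
# assumption: set_encode_extra_options is available in this sentencepiece build.
sp.set_encode_extra_options('bos:eos')  # wrap every encoded sentence with <s> ... </s>
print(sp.encode_as_ids('Hello world'))
sp.set_encode_extra_options('')         # assumption: empty string restores the default behavior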
By default, UNK/BOS/EOS/PAD tokens and their ids are defined as follows:
token | UNK | BOS | EOS | PAD
---|---|---|---|---
surface | <unk> | <s> | </s> | <pad>
id | 0 | 1 | 2 | undefined (-1)
We can change these mappings with --{unk|bos|eos|pad}_id and --{unk|bos|eos|pad}_piece flags.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
for id in range(4):
    print(sp.id_to_piece(id), sp.is_control(id))
When an id is set to -1, the corresponding special symbol is disabled. Note that UNK must not be disabled.
In [0]:
# Disable BOS/EOS
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --bos_id=-1 --eos_id=-1')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# <s>, </s> are UNK.
print(sp.unk_id())
print(sp.piece_to_id('<s>'))
print(sp.piece_to_id('</s>'))
The UNK id is decoded into U+2047 (⁇) by default. We can change the UNK surface with the --unk_surface=<STR> flag.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()])) # default is U+2047
spm.SentencePieceTrainer.train('--input=botchan.txt --vocab_size=2000 --model_prefix=m --unk_surface=__UNKNOWN__')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.decode_ids([sp.unk_id()]))
When --model_type=unigram (the default) is used, we can perform sampling and n-best segmentation for data augmentation. See the subword regularization paper [kudo18] for more detail.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# Can obtain different segmentations per request.
# There are two hyperparameters for sampling: nbest_size and alpha (inverse temperature).
# nbest_size=-1 samples from all segmentation hypotheses. See the paper [kudo18] for detail.
for n in range(10):
    print(sp.sample_encode_as_pieces('hello world', -1, 0.1))
for n in range(10):
    print(sp.sample_encode_as_ids('hello world', -1, 0.1))
In [0]:
# get 10 best
print(sp.nbest_encode_as_pieces('hello world', 10))
print(sp.nbest_encode_as_ids('hello world', 10))
Sentencepiece supports BPE (byte-pair encoding) for subword segmentation with the --model_type=bpe flag. We did not find empirical differences in translation quality between BPE and the unigram model, but the unigram model supports sampling and n-best segmentation. See the subword regularization paper [kudo18] for more detail.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_bpe --vocab_size=2000 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('m_bpe.model')
print('*** BPE ***')
print(sp_bpe.encode_as_pieces('thisisatesthelloworld'))
print(sp_bpe.nbest_encode_as_pieces('hello world', 5)) # returns an empty list.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_unigram --vocab_size=2000 --model_type=unigram')
sp_unigram = spm.SentencePieceProcessor()
sp_unigram.load('m_unigram.model')
print('*** Unigram ***')
print(sp_unigram.encode_as_pieces('thisisatesthelloworld'))
print(sp_unigram.nbest_encode_as_pieces('thisisatesthelloworld', 5))
Sentencepiece supports character and word segmentation with the --model_type=char and --model_type=word flags.
In word segmentation, sentencepiece just segments tokens on whitespace, so the input text must be pre-tokenized.
We can apply a different segmentation algorithm transparently without changing the pre/post processors.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')
sp_char = spm.SentencePieceProcessor()
sp_char.load('m_char.model')
print(sp_char.encode_as_pieces('this is a test.'))
print(sp_char.encode_as_ids('this is a test.'))
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_word --model_type=word --vocab_size=2000')
sp_word = spm.SentencePieceProcessor()
sp_word.load('m_word.model')
print(sp_word.encode_as_pieces('this is a test.')) # '.' is not split off as its own token.
print(sp_word.encode_as_ids('this is a test.'))
Sentencepiece provides several general pre-defined normalization rules, e.g., nmt_nfkc (the default) and nfkc_cf (NFKC plus case folding). We can change the normalizer with the --normalization_rule_name=<NAME> flag.
In [0]:
import sentencepiece as spm
# NFKC normalization and lower casing.
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_name=nfkc_cf')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('HELLO WORLD.')) # lower casing and normalization
Normalization is performed with user-defined string-to-string mappings and leftmost-longest matching. We can also define custom normalization rules as a TSV file. The TSV files for the pre-defined normalization rules can be found in the data directory (sample). The normalization rules are compiled into an FST and embedded in the model file, so we don't need to specify any normalization configuration in the segmentation phase.
Here is an example of custom normalization. The TSV file is passed with the --normalization_rule_tsv=<FILE> flag.
In [0]:
def tocode(s):
    out = []
    for c in s:
        out.append(str(hex(ord(c))).replace('0x', 'U+'))
    return ' '.join(out)

# TSV format: source Unicode code points <tab> target code points
# normalize "don't => do not, I'm => I am"
with open('normalization_rule.tsv', 'w') as f:
    f.write(tocode("I'm") + '\t' + tocode("I am") + '\n')
    f.write(tocode("don't") + '\t' + tocode("do not") + '\n')

print(open('normalization_rule.tsv', 'r').read())

spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --normalization_rule_tsv=normalization_rule.tsv')
sp = spm.SentencePieceProcessor()
# m.model embeds the normalization rules compiled into an FST.
sp.load('m.model')
print(sp.encode_as_pieces("I'm busy")) # normalized to 'I am busy'
print(sp.encode_as_pieces("I don't know it.")) # normalized to 'I do not know it.'
Sentencepiece loads all the lines of the training data into memory to train the model. Larger training data increases training time and memory usage, both roughly linear in the data size. When --input_sentence_size=<SIZE> is specified, Sentencepiece randomly samples <SIZE> lines from the whole training data. --shuffle_input_sentence=false disables the random shuffling and takes the first <SIZE> lines instead.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --input_sentence_size=1000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
sp.encode_as_pieces('this is a test.')
We can restrict the encoder to use only the tokens specified with the set_vocabulary method. The background of this feature is described on the subword-nmt page.
In [0]:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
# Gets all tokens as a Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# Aggregates the frequency of each token in the training data.
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in sp.encode_as_pieces(line):
            freq.setdefault(piece, 0)
            freq[piece] += 1

# only use tokens that appear more than 1000 times in the training data.
vocabs = list(filter(lambda x: x in freq and freq[x] > 1000, vocabs))
sp.set_vocabulary(vocabs)
print(sp.encode_as_pieces('this is a test.'))

# reset the restriction
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test.'))
By default, Sentencepiece does not extract pieces that cross word boundaries (here a word means a space-delimited token), so a piece never contains the whitespace marker (▁) in the middle.
--split_by_whitespace=false disables this restriction and allows extracting pieces that cross multiple words. In CJK (Chinese/Japanese/Korean), this flag does not affect the final segmentation results much, since CJK words are not delimited by whitespace.
In [0]:
import re
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000 --split_by_whitespace=false')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# Gets all tokens as a Python list.
vocabs = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

# pieces containing ▁ in the middle cross a word boundary
for piece in vocabs[0:500]:
    if re.match(r'\w+▁\w+', piece):
        print(piece)
We can also train the sentencepiece model from <word, frequency> pairs. First, make a TSV file where the first column is the word and the second column is the frequency. Then feed this TSV file with the --input_format=tsv flag. Note that when feeding TSV data for training, --split_by_whitespace=true is implicitly assumed.
In [0]:
freq = {}
with open('botchan.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        for piece in line.split():
            freq.setdefault(piece, 0)
            freq[piece] += 1

with open('word_freq_list.tsv', 'w') as f:
    for k, v in freq.items():
        f.write('%s\t%d\n' % (k, v))
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=word_freq_list.tsv --input_format=tsv --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
print(sp.encode_as_pieces('this is a test.'))
Sentencepiece keeps track of the byte offsets (spans) of each token, which is useful for highlighting tokens on top of the unnormalized text.
We first need to install the protobuf module and sentencepiece_pb2.py, as the byte offsets and all other segmentation metadata are encoded in a protocol buffer. The encode_as_serialized_proto method returns a serialized SentencePieceText proto. You can get the deserialized object by calling the ParseFromString method.
The definition of the SentencePieceText proto is found here.
In [0]:
!pip install protobuf
!wget https://raw.githubusercontent.com/google/sentencepiece/master/python/sentencepiece_pb2.py
In [0]:
import sentencepiece_pb2
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')
sp = spm.SentencePieceProcessor()
sp.load('m.model')
# One best result
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto('ｈｅｌｌｏ')) # full-width "hello"
# begin/end offsets point into the original (unnormalized) input.
print(spt)
# Nbest results
nspt = sentencepiece_pb2.NBestSentencePieceText()
nspt.ParseFromString(sp.nbest_encode_as_serialized_proto('hello', 5))
# print(nspt)
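A short sketch of consuming the offsets, assuming the pieces field of SentencePieceText (with piece, begin, and end) as defined in the sentencepiece proto; begin/end are byte offsets into the original UTF-8 input:
In [0]:
# iterate over the pieces and recover each surface span from the raw input bytes
text = 'hello world'
spt = sentencepiece_pb2.SentencePieceText()
spt.ParseFromString(sp.encode_as_serialized_proto(text))
raw = text.encode('utf-8')
for piece in spt.pieces:
    print(piece.piece, piece.begin, piece.end, raw[piece.begin:piece.end].decode('utf-8'))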