In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Tokenizing text and creating sequences for sentences

This colab shows you how to tokenize text and create sequences for sentences as the first stage of preparing text for use with TensorFlow models.

Import the Tokenizer


In [0]:
# Import the Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer

Write some sentences

Feel free to change and add sentences as you like.


In [0]:
sentences = [
    'My favorite food is ice cream',
    'do you like ice cream too?',
    'My dog likes ice cream!',
    "your favorite flavor of icecream is chocolate",
    "chocolate isn't good for dogs",
    "your dog, your cat, and your parrot prefer broccoli"
]

Tokenize the words

The first step in preparing text for use in a machine learning model is to tokenize the text, in other words, to generate a number for each word.


In [0]:
# Optionally set the max number of words to keep, based on word frequency.
# The out-of-vocabulary (OOV) token represents words that are not in the index.
# Call fit_on_texts() on the tokenizer to generate a unique number for each word
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
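
As a quick check that fitting worked, the tokenizer's word_counts dictionary records how many times each word appeared during fitting.


In [0]:
# word_counts records how many times each word appeared during fitting
print(tokenizer.word_counts)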

View the word index

After you tokenize the text, the tokenizer has a word index that contains key-value pairs for all the words and their numbers.

The word is the key, and the number is the value.

Notice that the OOV token is the first entry.


In [0]:
# Examine the word index
word_index = tokenizer.word_index
print(word_index)

In [0]:
# Get the number for a given word
print(word_index['favorite'])
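
You can also go the other way: the tokenizer keeps a reverse mapping, index_word, from numbers back to words.


In [0]:
# Get the word for a given number using the reverse mapping
print(tokenizer.index_word[word_index['favorite']])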

Create sequences for the sentences

After you tokenize the words, the word index contains a unique number for each word. However, the word index alone says nothing about word order, and the order of words in a sentence matters. So after tokenizing the words, the next step is to generate sequences for the sentences, representing each sentence as the list of its words' numbers, in order.


In [0]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
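
To confirm that each sequence preserves the word order of its sentence, you can decode the sequences back into text with sequences_to_texts. Note that the decoded text comes back lowercased and without punctuation, because that is how the tokenizer processed it.


In [0]:
# Decode the sequences back into text; word order is preserved,
# but the text is lowercased and punctuation is stripped
print(tokenizer.sequences_to_texts(sequences))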

Sequence sentences that contain words that are not in the word index

Let's take a look at what happens if the sentence being sequenced contains words that are not in the word index.

The out-of-vocabulary (OOV) token is the first entry in the word index. You will see it show up in the sequences in place of any word that is not in the word index.


In [0]:
sentences2 = ["I like hot chocolate",
              "My dogs and my hedgehog like kibble but my squirrel prefers grapes and my chickens like ice cream, preferably vanilla"]

sequences2 = tokenizer.texts_to_sequences(sentences2)
print(sequences2)
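
Because the OOV token is the first entry in the word index, it maps to the number 1, so every unseen word shows up as a 1 in the sequences above. Decoding the sequences makes the substitutions easy to see.


In [0]:
# The OOV token is entry 1 in the word index
print(word_index['<OOV>'])

# Decoding shows the OOV token standing in for words the tokenizer never saw
print(tokenizer.sequences_to_texts(sequences2))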