In [32]:
import shutil
import os
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import callbacks, layers, models, utils
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow_hub import KerasLayer
In [2]:
!head ./data/babyweight_sample.csv
In [3]:
df = pd.read_csv("./data/babyweight_sample.csv")
df.plurality.head(5)
Out[3]:
In [4]:
df.plurality.unique()
Out[4]:
In [5]:
CLASSES = {
    'Single(1)': 0,
    'Multiple(2+)': 1,
    'Twins(2)': 2,
    'Triplets(3)': 3,
    'Quadruplets(4)': 4,
    'Quintuplets(5)': 5
}
N_CLASSES = len(CLASSES)
Convert the plurality to a numeric index.
In [6]:
plurality_class = [CLASSES[plurality] for plurality in df.plurality]
In [7]:
print(df.plurality[:5])
print(plurality_class[:5])
Create an embedding layer. Supply the arguments input_dim and output_dim: input_dim indicates the size of the vocabulary (for plurality this is 6), and output_dim indicates the dimension of the dense embedding.
In [8]:
EMBED_DIM = 2
embedding_layer = layers.Embedding(input_dim=N_CLASSES,
output_dim=EMBED_DIM)
embeds = embedding_layer(tf.constant(plurality_class))
The variable embeds contains the two-dimensional embedding vector for each plurality class.
In [9]:
embeds.shape
Out[9]:
In [10]:
embeds[:5]
Out[10]:
In this section, we will implement text models to recognize the probable source (GitHub, TechCrunch, or The New York Times) of the titles in the dataset we constructed in the previous lab.
As a first step, we will load and pre-process the texts and labels so that they can be fed to a Keras model. For the title texts, we will learn how to split them into lists of tokens and then map each token to an integer using the Keras Tokenizer class. What will be fed to our Keras models are batches of padded lists of integers representing the text. For the labels, we will learn how to one-hot encode each of the 3 classes into a 3-dimensional basis vector.
Then we will explore a few possible models for the title classification. All models will be fed padded lists of integers, and all models will start with a Keras Embedding layer that transforms the integers representing the words into dense vectors.
Our model will be a simple bag-of-words DNN that averages the word vectors and feeds the resulting tensor to further dense layers. Doing so means we forget the word order (and hence treat sentences as a "bag of words"). Using an RNN or a 1-dimensional CNN would allow us to maintain the order of the word embeddings; a sketch of such a variant follows the DNN training below.
Let's start by specifying where the information about the trained models will be saved as well as where our dataset is located:
In [11]:
LOGDIR = "./text_models"
DATA_DIR = "./data"
Our dataset consists of article titles along with a label indicating the source from which each article was taken (GitHub, TechCrunch, or The New York Times).
In [12]:
DATASET_NAME = "titles_full.csv"
TITLE_SAMPLE_PATH = os.path.join(DATA_DIR, DATASET_NAME)
COLUMNS = ['title', 'source']
titles_df = pd.read_csv(TITLE_SAMPLE_PATH, header=None, names=COLUMNS)
titles_df.head()
Out[12]:
First, we'll find how many distinct words we have in our dataset (VOCAB_SIZE), how many titles we have (DATASET_SIZE), and what the maximum length of the titles is (MAX_LEN). Keras offers the Tokenizer class in its keras.preprocessing.text module to help us with this.
In [13]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(titles_df.title)
integerized_titles = tokenizer.texts_to_sequences(titles_df.title)
The variable 'integerized_titles' contains the integer representation of each article title in our dataset.
In [14]:
integerized_titles[:3]
Out[14]:
From this and the tokenizer we can extract VOCAB_SIZE, DATASET_SIZE, and MAX_LEN.
In [15]:
VOCAB_SIZE = len(tokenizer.index_word)
VOCAB_SIZE
Out[15]:
In [16]:
DATASET_SIZE = tokenizer.document_count
DATASET_SIZE
Out[16]:
In [17]:
MAX_LEN = max(len(sequence) for sequence in integerized_titles)
MAX_LEN
Out[17]:
We'll need to pad the title sequences to a fixed length to feed them into the model. Keras has the helper function pad_sequences for that, on top of the tokenizer methods. The function create_sequences below integerizes the titles with the tokenizer and pads the resulting sequences to max_len.
In [18]:
def create_sequences(texts, max_len=MAX_LEN):
    # Integerize the texts with the fitted tokenizer, then pad to a fixed length.
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = pad_sequences(sequences,
                                     maxlen=max_len,
                                     padding='post')
    return padded_sequences
In [19]:
sample_titles = create_sequences(["holy cash cow batman - content is back",
"close look at a flu outbreak upends some common wisdom"])
sample_titles
Out[19]:
Next, we'll convert our labels into numeric, one-hot-encoded categorical variables.
In [20]:
CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}
N_CLASSES = len(CLASSES)
In [21]:
def encode_labels(sources):
    # Map each source to its class index, then one-hot encode.
    # num_classes is passed explicitly so the output width is always N_CLASSES,
    # even if a batch does not contain every class.
    classes = [CLASSES[source] for source in sources]
    one_hots = utils.to_categorical(classes, num_classes=N_CLASSES)
    return one_hots
Create train/validation split
In [22]:
N_TRAIN = int(DATASET_SIZE * 0.80)
titles_df = pd.read_csv(TITLE_SAMPLE_PATH, header=None, names=COLUMNS)
titles_train, sources_train = (
titles_df.title[:N_TRAIN], titles_df.source[:N_TRAIN])
titles_valid, sources_valid = (
titles_df.title[N_TRAIN:], titles_df.source[N_TRAIN:])
In [23]:
sources_train.value_counts()
Out[23]:
Then, prepare the data for the model.
In [24]:
X_train, Y_train = create_sequences(titles_train), encode_labels(sources_train)
X_valid, Y_valid = create_sequences(titles_valid), encode_labels(sources_valid)
In [25]:
X_train[:3], Y_train[:3]
Out[25]:
The build_dnn_model function below returns a compiled Keras model that implements a simple embedding layer transforming the word integers into dense vectors, followed by a Dense softmax layer that returns the probabilities for each class.
Note that we need to put a custom Keras Lambda layer between the Embedding layer and the Dense softmax layer to average the word vectors returned by the embedding layer. It is this average that is fed to the dense softmax layer. By doing so, we create a model that is simple but loses information about the word order, i.e. a model that sees sentences as a "bag of words".
In [29]:
def build_dnn_model(embed_dim):
    model = models.Sequential([
        # Map word integers to dense embedding vectors.
        layers.Embedding(VOCAB_SIZE + 1,
                         embed_dim,
                         input_shape=[MAX_LEN]),
        # Average the word vectors across the title ("bag of words").
        layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
        layers.Dense(N_CLASSES, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
In [30]:
Y_train.shape
Out[30]:
In [31]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'dnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
BATCH_SIZE = 300
EPOCHS = 100
EMBED_DIM = 10
PATIENCE = 0
dnn_model = build_dnn_model(embed_dim=EMBED_DIM)
dnn_history = dnn_model.fit(
X_train, Y_train,
epochs=EPOCHS,
batch_size=BATCH_SIZE,
validation_data=(X_valid, Y_valid),
callbacks=[callbacks.EarlyStopping(patience=PATIENCE),
callbacks.TensorBoard(MODEL_DIR)],
)
pd.DataFrame(dnn_history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(dnn_history.history)[['accuracy', 'val_accuracy']].plot()
dnn_model.summary()
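As noted above, replacing the averaging Lambda layer with layers that respect word order, such as an RNN or a 1-dimensional CNN, is a natural extension. Below is a minimal sketch of such a variant; it is not part of the original lab, and the filters and kernel_size values are arbitrary choices for illustration.
def build_cnn_model(embed_dim, filters=64, kernel_size=3):
    # Same embedding as the DNN above, but a Conv1D layer scans windows of
    # consecutive word vectors, so local word order is preserved.
    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE + 1,
                         embed_dim,
                         input_shape=[MAX_LEN]),
        layers.Conv1D(filters, kernel_size, activation='relu'),
        layers.GlobalMaxPooling1D(),
        layers.Dense(N_CLASSES, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
Such a model consumes the same padded integer sequences as the DNN, so it could be trained with the same fit call.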
We can also use a word embedding from a pre-trained model, in this case a Neural Probabilistic Language Model. TF-Hub has a 50-dimensional one called nnlm-en-dim50; a variant, nnlm-en-dim50-with-normalization, additionally normalizes the vectors it produces.
Once loaded from its URL, the TF-Hub module can be used as a normal Keras layer in a sequential or functional model. Since we have enough data to fine-tune the parameters of the pre-trained embedding itself, we will set trainable=True in the KerasLayer that loads the pre-trained embedding:
In [30]:
NNLM = "https://tfhub.dev/google/nnlm-en-dim50/2"
nnlm_module = KerasLayer(
handle=NNLM,
output_shape=[50],
input_shape=[],
dtype=tf.string,
trainable=True)
With this module, we do not need to pad our inputs. The NNLM module returns a 50-dimensional vector given a word or sentence.
In [31]:
nnlm_module(tf.constant(["holy cash cow batman - content is back",
"close look at a flu outbreak upends some common wisdom"]))
Out[31]:
With this in mind, we can simplify our data inputs, since we do not need to integerize or pad the titles.
In [32]:
X_train, Y_train = titles_train.values, encode_labels(sources_train)
X_valid, Y_valid = titles_valid.values, encode_labels(sources_valid)
In [33]:
X_train[:3]
Out[33]:
In [34]:
def build_hub_model():
    model = models.Sequential([
        # Pre-trained NNLM sentence embedding, fine-tuned during training.
        KerasLayer(handle=NNLM,
                   output_shape=[50],
                   input_shape=[],
                   dtype=tf.string,
                   trainable=True),
        layers.Dense(N_CLASSES, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
In [38]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'hub')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
BATCH_SIZE = 300
EPOCHS = 100
EMBED_DIM = 10
PATIENCE = 3
hub_model = build_hub_model()
hub_history = hub_model.fit(
X_train, Y_train,
epochs=EPOCHS,
batch_size=BATCH_SIZE,
validation_data=(X_valid, Y_valid),
callbacks=[callbacks.EarlyStopping(patience=PATIENCE),
callbacks.TensorBoard(MODEL_DIR)],
)
pd.DataFrame(hub_history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(hub_history.history)[['accuracy', 'val_accuracy']].plot()
hub_model.summary()
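Because the hub model consumes raw strings, titles can be passed to it directly for prediction. A quick illustrative check, reusing the two sample titles from earlier (not part of the original lab):
sample_predictions = hub_model.predict(tf.constant([
    "holy cash cow batman - content is back",
    "close look at a flu outbreak upends some common wisdom"]))
# Each row holds the predicted probabilities for github, nytimes, and techcrunch.
sample_predictions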
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License