Learning Objectives
In this notebook, we will implement text models to recognize the probable source (GitHub, TechCrunch, or The New York Times) of the titles in the title dataset we constructed in the first task of the lab.
In the next step, we will load and pre-process the texts and labels so that they are suitable to be fed to a Keras model. For the texts of the titles, we will learn how to split them into a list of tokens, and then how to map each token to an integer using the Keras Tokenizer class. What will be fed to our Keras models are batches of padded lists of integers representing the text. For the labels, we will learn how to one-hot-encode each of the 3 classes into a 3-dimensional basis vector.
Then we will explore a few possible models to do the title classification. All models will be fed padded lists of integers, and all models will start with a Keras Embedding layer that transforms the integers representing the words into dense vectors.
The first model will be a simple bag-of-words DNN model that averages the word vectors and feeds the resulting tensor to further dense layers. Doing so means that we forget the word order (and hence that we consider sentences as a “bag-of-words”). In the second and third models, we will keep the information about the word order using a simple RNN and a simple CNN, allowing us to achieve the same performance as with the DNN model but in far fewer epochs.
In [ ]:
import os
from google.cloud import bigquery
import pandas as pd
In [ ]:
%load_ext google.cloud.bigquery
Replace the variable values in the cell below:
In [ ]:
PROJECT = "cloud-training-demos" # Replace with your PROJECT
BUCKET = PROJECT # defaults to PROJECT
REGION = "us-central1" # Replace with your REGION
SEED = 0
Hacker News headlines are available as a BigQuery public dataset. The dataset contains all headlines from the site's inception in October 2006 until October 2015.
Here is a sample of the dataset:
In [ ]:
%%bigquery --project $PROJECT
SELECT
url, title, score
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
LENGTH(title) > 10
AND score > 10
AND LENGTH(url) > 0
LIMIT 10
Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the URL is http://mobile.nytimes.com/...., we want to be left with nytimes.
In [ ]:
%%bigquery --project $PROJECT
SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
COUNT(title) AS num_articles
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
GROUP BY
source
ORDER BY num_articles DESC
LIMIT 100
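The parsing logic of the query above can also be sketched in plain Python, which makes it easy to try on a single URL. This is only an illustration of the same steps (extract the host, split on dots, reverse, take the second element); the function name extract_source and the example URLs are hypothetical and not part of the lab.

```python
import re

# Same pattern as in the BigQuery query: capture the host part of the URL.
HOST_PATTERN = re.compile(r'.*://(.[^/]+)/')


def extract_source(url):
    """Return the second-level domain of a '.com' URL, or None otherwise."""
    match = HOST_PATTERN.search(url)
    if match is None:
        return None
    host = match.group(1)
    # Mirrors REGEXP_CONTAINS(host, '.com$') from the query.
    if re.search(r'.com$', host) is None:
        return None
    # Mirrors ARRAY_REVERSE(SPLIT(host, '.'))[OFFSET(1)].
    parts = host.split('.')[::-1]
    return parts[1] if len(parts) > 1 else None


nytimes_source = extract_source('http://mobile.nytimes.com/2015/01/01/story')
github_source = extract_source('https://github.com/user/repo')
```

Here nytimes_source is 'nytimes' and github_source is 'github'. A URL without a slash after the host does not match the pattern and yields None.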
Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.
In [ ]:
regex = '.*://(.[^/]+)/'
sub_query = """
SELECT
title,
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '{0}'), '.'))[OFFSET(1)] AS source
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '{0}'), '.com$')
AND LENGTH(title) > 10
""".format(regex)
query = """
SELECT
LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title,
source
FROM
({sub_query})
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
""".format(sub_query=sub_query)
print(query)
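The title clean-up done by LOWER(REGEXP_REPLACE(...)) in the query above can be reproduced locally to see its effect on one title. The sketch below assumes Python's re module treats this character class the same way as BigQuery does; the helper name clean_title and the sample title are made up for illustration.

```python
import re

# Same character class as in the query: keep letters, digits, space, '$', '.' and '-'.
DISALLOWED = re.compile(r'[^a-zA-Z0-9 $.-]')


def clean_title(title):
    """Replace every disallowed character by a space, then lowercase."""
    return DISALLOWED.sub(' ', title).lower()


cleaned = clean_title('Show HN: My App!')
```

Each removed character becomes a space, so cleaned is 'show hn  my app ' (note the double space and the trailing space). Splitting on whitespace later absorbs these extra spaces.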
For ML training, we usually need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset). AutoML, however, figures out on its own how to create these splits, so we won't need to do that here.
In [ ]:
bq = bigquery.Client(project=PROJECT)
title_dataset = bq.query(query).to_dataframe()
title_dataset.head()
AutoML for text classification requires that
The dataset we pulled from BigQuery satisfies these requirements.
In [ ]:
print("The full dataset contains {n} titles".format(n=len(title_dataset)))
Let's make sure we have roughly the same number of labels for each of our three labels:
In [ ]:
title_dataset.source.value_counts()
Finally we will save our data, which is currently in-memory, to disk.
We will create a csv file containing the full dataset and another containing only 1000 articles for development.
Note: It may take a long time to train AutoML on the full dataset, so we recommend using the sample dataset for the purpose of learning the tool.
In [ ]:
DATADIR = './data/'
if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)
In [ ]:
FULL_DATASET_NAME = 'titles_full.csv'
FULL_DATASET_PATH = os.path.join(DATADIR, FULL_DATASET_NAME)
# Let's shuffle the data before writing it to disk.
title_dataset = title_dataset.sample(n=len(title_dataset))
title_dataset.to_csv(
    FULL_DATASET_PATH, header=False, index=False, encoding='utf-8')
Now let's sample 1000 articles from the full dataset and make sure we have enough examples for each label in our sample dataset (see here for further details on how to prepare data for AutoML).
In [ ]:
sample_title_dataset = title_dataset.sample(n=1000)
sample_title_dataset.source.value_counts()
Let's write the sample dataset to disk.
In [ ]:
SAMPLE_DATASET_NAME = 'titles_sample.csv'
SAMPLE_DATASET_PATH = os.path.join(DATADIR, SAMPLE_DATASET_NAME)
sample_title_dataset.to_csv(
    SAMPLE_DATASET_PATH, header=False, index=False, encoding='utf-8')
In [ ]:
sample_title_dataset.head()
In [ ]:
# Ensure the right version of Tensorflow is installed.
!pip freeze | grep tensorflow==2.1 || pip install tensorflow==2.1
Note: You can simply ignore the incompatibility errors related to tensorflow-serving-api and tensorflow-io.
When you re-run the above cell, you will see the output tensorflow==2.1.0, which is the installed version of TensorFlow.
In [ ]:
import os
import shutil
import pandas as pd
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
from tensorflow.keras.layers import (
Embedding,
Flatten,
GRU,
Conv1D,
Lambda,
Dense,
)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
print(tf.__version__)
In [ ]:
%matplotlib inline
Let's start by specifying where the information about the trained models will be saved as well as where our dataset is located:
In [ ]:
LOGDIR = "./text_models"
DATA_DIR = "./data"
Our dataset consists of titles of articles along with a label indicating the source each article was taken from (GitHub, TechCrunch, or The New York Times).
In [ ]:
DATASET_NAME = "titles_full.csv"
TITLE_SAMPLE_PATH = os.path.join(DATA_DIR, DATASET_NAME)
COLUMNS = ['title', 'source']
titles_df = pd.read_csv(TITLE_SAMPLE_PATH, header=None, names=COLUMNS)
titles_df.head()
The first thing we need to do is to find how many words we have in our dataset (VOCAB_SIZE), how many titles we have (DATASET_SIZE), and what the maximum length of the titles is (MAX_LEN). Keras offers the Tokenizer class in its keras.preprocessing.text module to help us with that:
In [ ]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(titles_df.title)
In [ ]:
integerized_titles = tokenizer.texts_to_sequences(titles_df.title)
integerized_titles[:3]
In [ ]:
VOCAB_SIZE = len(tokenizer.index_word)
VOCAB_SIZE
In [ ]:
DATASET_SIZE = tokenizer.document_count
DATASET_SIZE
In [ ]:
MAX_LEN = max(len(sequence) for sequence in integerized_titles)
MAX_LEN
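To make the word-to-integer mapping more concrete, here is a simplified pure-Python sketch of the idea. It is an approximation for illustration only: the real Tokenizer also filters punctuation and offers many options, and the helper names and toy titles below are hypothetical.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def texts_to_integers(texts, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


toy_titles = ['the cat sat', 'the dog sat down', 'a dog']
toy_index = fit_word_index(toy_titles)
toy_sequences = texts_to_integers(toy_titles, toy_index)

toy_vocab_size = len(toy_index)                    # number of distinct words
toy_max_len = max(len(s) for s in toy_sequences)   # longest title, in tokens
```

On these toy titles, toy_sequences is [[1, 4, 2], [1, 3, 2, 5], [6, 3]], with a vocabulary of 6 words and a maximum length of 4. Because index 0 is reserved, a model embedding these integers needs room for the vocabulary size plus one.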
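Let's now implement a function create_sequences that takes our titles and the maximum sentence length as input, and returns the lists of integers corresponding to our tokens, padded to the maximum sentence length.
Keras has the helper function pad_sequences for that, on top of the tokenizer methods.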
In [ ]:
# TODO 1
def create_sequences(texts, max_len=MAX_LEN):
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = pad_sequences(sequences, max_len, padding='post')
    return padded_sequences
In [ ]:
sequences = create_sequences(titles_df.title[:3])
sequences
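What post-padding does to the integer lists can be sketched without Keras. The helper pad_post below is hypothetical and only mimics the behavior we rely on here: zeros are appended at the end, and a sequence longer than max_len keeps its last max_len tokens, which to our understanding matches the default truncation of pad_sequences.

```python
def pad_post(sequences, max_len, pad_value=0):
    """Pad each list at the end with pad_value up to max_len."""
    padded = []
    for seq in sequences:
        kept = seq[-max_len:]  # keep only the last max_len tokens if too long
        padded.append(kept + [pad_value] * (max_len - len(kept)))
    return padded


toy_padded = pad_post([[1, 4, 2], [6, 3]], max_len=4)
toy_truncated = pad_post([[1, 2, 3, 4, 5]], max_len=3)
```

Here toy_padded is [[1, 4, 2, 0], [6, 3, 0, 0]] and toy_truncated is [[3, 4, 5]]. Every row now has the same length, which is what allows the titles to be stacked into a single batch.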
In [ ]:
titles_df.source[:4]
We now need to write a function that takes the title sources and returns the corresponding one-hot-encoded vectors. Keras to_categorical is handy for that.
In [ ]:
CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}
N_CLASSES = len(CLASSES)
In [ ]:
# TODO 2
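def encode_labels(sources):
    classes = [CLASSES[source] for source in sources]
    # Fix the width to N_CLASSES so that a batch missing a class still gets 3 columns.
    one_hots = to_categorical(classes, num_classes=N_CLASSES)
    return one_hots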
In [ ]:
encode_labels(titles_df.source[:4])
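For intuition, one-hot encoding can be written by hand in a few lines. This sketch uses a hypothetical helper one_hot and a copy of the class mapping named TOY_CLASSES; it illustrates the encoding itself rather than how to_categorical is implemented.

```python
TOY_CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}


def one_hot(sources, classes=TOY_CLASSES):
    """Turn each source name into a basis vector of length len(classes)."""
    n_classes = len(classes)
    vectors = []
    for source in sources:
        vector = [0.0] * n_classes
        vector[classes[source]] = 1.0  # single 1 at the class index
        vectors.append(vector)
    return vectors


toy_labels = one_hot(['github', 'techcrunch', 'github'])
```

Here toy_labels is [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]. Each vector always has three entries, even when some class does not appear in the input.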
Let's split our data into training and validation splits:
In [ ]:
N_TRAIN = int(DATASET_SIZE * 0.80)
titles_train, sources_train = (
    titles_df.title[:N_TRAIN], titles_df.source[:N_TRAIN])
titles_valid, sources_valid = (
    titles_df.title[N_TRAIN:], titles_df.source[N_TRAIN:])
To be on the safe side, we verify that the training and validation splits have roughly the same number of examples per class.
Since this is the case, accuracy will be a good metric to measure the performance of our models.
In [ ]:
sources_train.value_counts()
In [ ]:
sources_valid.value_counts()
Using create_sequences and encode_labels, we can now prepare the training and validation data to feed our models.
The features will be padded lists of integers and the labels will be one-hot-encoded 3-dimensional vectors.
In [ ]:
X_train, Y_train = create_sequences(titles_train), encode_labels(sources_train)
X_valid, Y_valid = create_sequences(titles_valid), encode_labels(sources_valid)
In [ ]:
X_train[:3]
In [ ]:
Y_train[:3]
The build_dnn_model function below returns a compiled Keras model that implements a simple embedding layer transforming the word integers into dense vectors, followed by a Dense softmax layer that returns the probabilities for each class.
Note that we need to put a custom Keras Lambda layer in between the Embedding layer and the Dense softmax layer to do an average of the word vectors returned by the embedding layer. This is the average that's fed to the dense softmax layer. By doing so, we create a model that is simple but that loses information about the word order, creating a model that sees sentences as "bag-of-words".
In [ ]:
def build_dnn_model(embed_dim):
    model = Sequential([
        Embedding(VOCAB_SIZE + 1, embed_dim, input_shape=[MAX_LEN]), # TODO 3
        Lambda(lambda x: tf.reduce_mean(x, axis=1)), # TODO 4
        Dense(N_CLASSES, activation='softmax') # TODO 5
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
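Why averaging the word vectors discards word order can be checked on a toy example. The sketch below uses a tiny hand-written embedding table (toy_embedding, with made-up values) instead of a trained Embedding layer, and a hypothetical helper average_vectors that plays the role of the mean over the word axis.

```python
def average_vectors(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n_vectors = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n_vectors for d in range(dim)]


# Made-up 2-dimensional embeddings for three word indices.
toy_embedding = {1: [1.0, 0.0], 2: [0.0, 2.0], 3: [2.0, 4.0]}

sentence = [1, 2, 3]
shuffled = [3, 1, 2]  # same words, different order

mean_sentence = average_vectors([toy_embedding[i] for i in sentence])
mean_shuffled = average_vectors([toy_embedding[i] for i in shuffled])
```

Both averages are equal, so any layer that only sees the average cannot tell the two word orders apart. This is exactly the "bag-of-words" behavior described above.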
Below we train the model for up to 100 epochs, adding an EarlyStopping callback that stops the training as soon as the validation loss has not improved for the number of epochs specified by PATIENCE. Note that we also give the model.fit method a TensorBoard callback so that we can later compare all the models using TensorBoard.
In [ ]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'dnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
BATCH_SIZE = 300
EPOCHS = 100
EMBED_DIM = 10
PATIENCE = 0
dnn_model = build_dnn_model(embed_dim=EMBED_DIM)
dnn_history = dnn_model.fit(
    X_train, Y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, Y_valid),
    callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
)
pd.DataFrame(dnn_history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(dnn_history.history)[['accuracy', 'val_accuracy']].plot()
dnn_model.summary()
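The effect of the patience setting used in the training cell above can be illustrated with a small simulation. This is a simplified sketch of the early-stopping idea, not the Keras implementation: the counting rule is an assumption for illustration, the exact behavior of the real callback may differ between versions, and the validation losses are made up.

```python
def epochs_before_stop(val_losses, patience):
    """Return how many epochs run before a simple patience rule stops training."""
    best = float('inf')
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop at the end of this epoch
    return len(val_losses)


toy_val_losses = [1.0, 0.8, 0.7, 0.75, 0.6]
stopped_early = epochs_before_stop(toy_val_losses, patience=0)
ran_to_end = epochs_before_stop(toy_val_losses, patience=2)
```

With patience 0, this rule stops at epoch 4, the first epoch whose validation loss fails to improve, and therefore misses the better loss at epoch 5. With patience 2, the same run reaches the end. A small patience is fast but can stop on a temporary bump in the validation loss.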
The build_rnn_model function below returns a compiled Keras model that implements a simple RNN model with a single GRU layer, which now takes into account the word order in the sentence.
The first and last layers are the same as for the simple DNN model.
Note that we set mask_zero=True in the Embedding layer so that the padded words (represented by a zero) are ignored by this and the subsequent layers.
In [ ]:
def build_rnn_model(embed_dim, units):
    model = Sequential([
        Embedding(VOCAB_SIZE + 1, embed_dim, input_shape=[MAX_LEN], mask_zero=True), # TODO 3
        GRU(units), # TODO 5
        Dense(N_CLASSES, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
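To see why a recurrent layer is sensitive to word order, consider a minimal scalar recurrence. This is deliberately much simpler than a GRU (no gates, a single hidden unit), and the weights and inputs are arbitrary values chosen for illustration.

```python
import math


def final_state(inputs, w_in=0.5, w_rec=0.9):
    """Run h_t = tanh(w_in * x_t + w_rec * h_{t-1}) and return the last h."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


toy_inputs = [1.0, -1.0]
state_forward = final_state(toy_inputs)
state_reversed = final_state(toy_inputs[::-1])
```

The two final states differ, because each step mixes the current input with the state produced by all previous inputs. An average of the same two inputs would be identical in both orders.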
Let's train the model with early stopping as above.
Observe that we obtain the same type of accuracy as with the DNN model, but in fewer epochs (~3 vs. ~20 epochs):
In [ ]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'rnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
EPOCHS = 100
BATCH_SIZE = 300
EMBED_DIM = 10
UNITS = 16
PATIENCE = 0
rnn_model = build_rnn_model(embed_dim=EMBED_DIM, units=UNITS)
history = rnn_model.fit(
    X_train, Y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, Y_valid),
    callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
)
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
rnn_model.summary()
The build_cnn_model function below returns a compiled Keras model that implements a simple CNN model with a single Conv1D layer, which now takes into account the word order in the sentence.
The first and last layers are the same as for the simple DNN model, but we need to add a Flatten layer between the convolution and the softmax layer.
Note that we set mask_zero=True in the Embedding layer so that the padded words (represented by a zero) are ignored by this and the subsequent layers.
In [ ]:
def build_cnn_model(embed_dim, filters, ksize, strides):
    model = Sequential([
        Embedding(
            VOCAB_SIZE + 1,
            embed_dim,
            input_shape=[MAX_LEN],
            mask_zero=True), # TODO 3
        Conv1D( # TODO 5
            filters=filters,
            kernel_size=ksize,
            strides=strides,
            activation='relu',
        ),
        Flatten(), # TODO 5
        Dense(N_CLASSES, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model
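The size of the tensor that Flatten hands to the softmax layer follows from the convolution arithmetic. The sketch below assumes the default 'valid' padding of Conv1D (no padding, no dilation); the helper names are hypothetical, and the input length of 26 is a made-up value, not the MAX_LEN of our dataset.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Number of positions a 'valid' 1D convolution produces."""
    return (input_length - kernel_size) // strides + 1


def flattened_size(input_length, kernel_size, strides, filters):
    """Number of features after flattening the convolution output."""
    return conv1d_output_length(input_length, kernel_size, strides) * filters


# Hypothetical title length of 26 tokens, with kernel size 3, stride 2, 200 filters.
toy_steps = conv1d_output_length(26, kernel_size=3, strides=2)
toy_features = flattened_size(26, kernel_size=3, strides=2, filters=200)
```

With these values the convolution yields 12 positions, so the flattened vector has 12 * 200 = 2400 features. The model summary can be used to compare this arithmetic against the actual output shapes.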
Let's train the model.
Again, we observe that we get the same kind of accuracy as with the DNN model, but in far fewer epochs.
In [ ]:
%%time
tf.random.set_seed(33)
MODEL_DIR = os.path.join(LOGDIR, 'cnn')
shutil.rmtree(MODEL_DIR, ignore_errors=True)
EPOCHS = 100
BATCH_SIZE = 300
EMBED_DIM = 5
FILTERS = 200
STRIDES = 2
KSIZE = 3
PATIENCE = 0
cnn_model = build_cnn_model(
    embed_dim=EMBED_DIM,
    filters=FILTERS,
    strides=STRIDES,
    ksize=KSIZE,
)
cnn_history = cnn_model.fit(
    X_train, Y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_valid, Y_valid),
    callbacks=[EarlyStopping(patience=PATIENCE), TensorBoard(MODEL_DIR)],
)
pd.DataFrame(cnn_history.history)[['loss', 'val_loss']].plot()
pd.DataFrame(cnn_history.history)[['accuracy', 'val_accuracy']].plot()
cnn_model.summary()
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License