This lab relies on files created in the content_based_preproc.ipynb notebook. Be sure to complete the TODOs in that notebook and run the code there before completing this lab.
Also, we'll be using the python3 kernel from here on out so don't forget to change the kernel if it's still python2.
This lab illustrates:
Tensorflow Hub should already be installed. You can check using pip freeze.
In [ ]:
%%bash
pip freeze | grep tensor
If 'tensorflow-hub' isn't one of the outputs above, then you'll need to install it by executing the commands in the cell below. After the pip install finishes, click "Reset Session" on the notebook so that the Python environment picks up the new packages.
In [ ]:
!pip3 install tensorflow-hub==0.4.0
!pip3 install --upgrade tensorflow==1.13.1
In [ ]:
import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.13.1'
In [ ]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION
To start, we'll load the list of categories, authors and article ids we created in the previous Create Datasets notebook.
In [ ]:
categories_list = open("categories.txt").read().splitlines()
authors_list = open("authors.txt").read().splitlines()
content_ids_list = open("content_ids.txt").read().splitlines()
mean_months_since_epoch = 523
In the cell below we'll define the feature columns to use in our model. If necessary, remind yourself of the various feature columns available.
For the embedded_title_column feature column, use a Tensorflow Hub Module to create an embedding of the article title. Since the articles and titles are in German, you'll want to use a German language embedding module.
Explore the text embedding TensorFlow Hub modules available at https://alpha.tfhub.dev/. Filter by setting the language to 'German'. The 50-dimensional embedding should be sufficient for our purposes.
In [ ]:
embedded_title_column = #TODO: use a Tensorflow Hub module to create a text embedding column for the article "title".
# Use the module available at https://alpha.tfhub.dev/ filtering by German language.
embedded_content_column = #TODO: create an embedded categorical feature column for the article id; i.e. "content_id".
embedded_author_column = #TODO: create an embedded categorical feature column for the article "author"
category_column = #TODO: create a categorical feature column for the article "category"
months_since_epoch_boundaries = list(range(400,700,20))
months_since_epoch_bucketized = #TODO: create a bucketized numeric feature column of values for the "months since epoch"
crossed_months_since_category_column = #TODO: create a crossed feature column using the "category" and "months since epoch" values
feature_columns = [embedded_content_column,
                   embedded_author_column,
                   category_column,
                   embedded_title_column,
                   crossed_months_since_category_column]
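To make the bucketized column more concrete, the sketch below mimics in plain Python how a "months since epoch" value would be assigned to a bucket given months_since_epoch_boundaries. It assumes the usual convention for bucketized feature columns, in which n boundaries yield n + 1 left-inclusive buckets. The helper name bucket_index is hypothetical and not part of TensorFlow.

```python
import bisect

# Same boundaries as in the cell above: 400, 420, ..., 680 (15 boundaries).
months_since_epoch_boundaries = list(range(400, 700, 20))

def bucket_index(value, boundaries):
    """Return the bucket a value falls into (hypothetical helper).

    Assumes buckets (-inf, b0), [b0, b1), ..., [b_last, +inf),
    so n boundaries produce n + 1 buckets.
    """
    return bisect.bisect_right(boundaries, value)

# 15 boundaries give 16 buckets.
num_buckets = len(months_since_epoch_boundaries) + 1

for months in (399, 400, 523, 700):
    print(months, "->", bucket_index(months, months_since_epoch_boundaries))
```

Crossing this bucket index with the category lets the model learn a separate weight for each (category, time period) combination rather than one weight per category overall.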
In [ ]:
record_defaults = [["Unknown"], ["Unknown"],["Unknown"],["Unknown"],["Unknown"],[mean_months_since_epoch],["Unknown"]]
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"
def read_dataset(filename, mode, batch_size = 512):
    def _input_fn():
        def decode_csv(value_column):
            columns = tf.decode_csv(value_column,record_defaults=record_defaults)
            features = dict(zip(column_keys, columns))
            label = features.pop(label_key)
            return features, label

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(filename)

        # Create dataset from file list
        dataset = tf.data.TextLineDataset(file_list).map(decode_csv)

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)
        return dataset.make_one_shot_iterator().get_next()
    return _input_fn
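As an illustration of what decode_csv does to each line, here is a minimal plain-Python sketch of the same idea: empty fields fall back to the record defaults, each default fixes its column's type, and the label is popped out of the feature dictionary. The function decode_line and the example row are made up for this sketch, and the defaults are written as a flat list rather than the nested lists that tf.decode_csv expects.

```python
import csv

# Flat version of record_defaults from the cell above (523 is mean_months_since_epoch).
record_defaults = ["Unknown", "Unknown", "Unknown", "Unknown", "Unknown", 523, "Unknown"]
column_keys = ["visitor_id", "content_id", "category", "title", "author",
               "months_since_epoch", "next_content_id"]
label_key = "next_content_id"

def decode_line(line):
    """Parse one CSV line into (features, label); hypothetical helper."""
    fields = next(csv.reader([line]))
    values = []
    for raw, default in zip(fields, record_defaults):
        if raw == "":
            # Missing value: use the column's default.
            values.append(default)
        else:
            # Cast to the type of the default (str or int here).
            values.append(type(default)(raw))
    features = dict(zip(column_keys, values))
    label = features.pop(label_key)
    return features, label

# Made-up example row; the author field is left empty on purpose.
features, label = decode_line('1001,299836255,News,"Some title",,574,299826767')
print(features)
print(label)
```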
Next, we'll build our model, which recommends an article for a visitor to the Kurier.at website. Look through the code below. We use the input_layer function from tf.feature_column to create the dense input layer to our network. This is a simple feed-forward network in which the number and size of the hidden layers can be adjusted through the hidden_units parameter.
Currently, we compute the accuracy between our predicted 'next article' and the article actually read next by the visitor. Resolve the TODOs in the cell below by adding additional performance metrics to assess our model. You will need to compute the top 10 accuracy, add it to the metrics dictionary, and add it to the TensorBoard summary.
In [ ]:
def model_fn(features, labels, mode, params):
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    for units in params['hidden_units']:
        net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
    # Compute logits (1 per class).
    logits = tf.layers.dense(net, params['n_classes'], activation=None)

    predicted_classes = tf.argmax(logits, 1)
    from tensorflow.python.lib.io import file_io

    with file_io.FileIO('content_ids.txt', mode='r') as ifp:
        content = tf.constant([x.rstrip() for x in ifp])
    predicted_class_names = tf.gather(content, predicted_classes)
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'class_ids': predicted_classes[:, tf.newaxis],
            'class_names' : predicted_class_names[:, tf.newaxis],
            'probabilities': tf.nn.softmax(logits),
            'logits': logits,
        }
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)
    table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
    labels = table.lookup(labels)
    # Compute loss.
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Compute evaluation metrics.
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predicted_classes,
                                   name='acc_op')
    top_10_accuracy = #TODO: Compute the top_10 accuracy, using the tf.nn.in_top_k and tf.metrics.mean functions in Tensorflow

    metrics = {
        'accuracy': accuracy,
        #TODO: Add top_10_accuracy to the metrics dictionary
    }

    tf.summary.scalar('accuracy', accuracy[1])
    #TODO: Add the top_10_accuracy metric to the Tensorboard summary

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(
            mode, loss=loss, eval_metric_ops=metrics)

    # Create training op.
    assert mode == tf.estimator.ModeKeys.TRAIN

    optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
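Before filling in the top 10 accuracy TODO, it may help to see what the metric measures on a toy example. The plain-Python sketch below counts a prediction as correct when the true class is among the k highest-scoring classes, then averages over the batch. The helper names and the toy logits are made up for illustration; this is not the TensorFlow implementation, and tie handling may differ from tf.nn.in_top_k.

```python
def in_top_k(logits, label, k):
    """True if the true class index is among the k highest-scoring classes."""
    # Class indices sorted by descending score.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return label in ranked[:k]

def top_k_accuracy(batch_logits, labels, k):
    """Fraction of examples whose true class is in the top k."""
    hits = [in_top_k(row, y, k) for row, y in zip(batch_logits, labels)]
    return sum(hits) / len(hits)

# Toy batch: 3 examples, 4 classes.
batch_logits = [
    [0.1, 2.0, 0.3, 1.5],  # ranking: 1, 3, 2, 0
    [3.0, 0.2, 0.1, 0.4],  # ranking: 0, 3, 1, 2
    [0.5, 0.4, 0.9, 0.8],  # ranking: 2, 3, 0, 1
]
labels = [3, 1, 2]

top_1 = top_k_accuracy(batch_logits, labels, k=1)
top_2 = top_k_accuracy(batch_logits, labels, k=2)
print(top_1, top_2)
```

With thousands of candidate articles, plain accuracy is a strict measure. Top-k accuracy is more forgiving and arguably closer to how a recommender is used, since several recommendations are usually shown at once.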
In [ ]:
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
        'feature_columns': feature_columns,
        'hidden_units': [200, 100, 50],
        'n_classes': len(content_ids_list)
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 200)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
With the model now trained, we can make predictions by calling the predict method on the estimator. Let's look at how our model predicts on the first five examples of the training set.
To start, we'll create a new file 'first_5.csv' which contains the first five elements of our training set. We'll also save the content ids of the articles currently being read (the second column) to a file 'first_5_content_ids' so we can compare them with our recommendations.
In [ ]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids
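For readers less familiar with awk, the following sketch shows roughly what the last command does to each line: it splits on commas, allowing optional double quotes around them, and keeps the second field, which is the content_id according to column_keys. The helper second_field and the example row are made up for illustration.

```python
import re

def second_field(line):
    # Split on a comma with any number of surrounding double quotes,
    # mirroring the awk field separator "\"*,\"*".
    return re.split(r'"*,"*', line)[1]

# Made-up row in the same column order as training_set.csv.
row = '1001,"299836255","News","Some title","Some author",574,"299826767"'
print(second_field(row))
```

Like the awk command, this simple split would miscount fields if an earlier field contained a comma. That is not a concern here because only visitor_id precedes content_id.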
Recall that to make predictions on the trained model we pass a list of examples through the input function. Complete the code below to make predictions on the examples contained in the "first_5.csv" file we created above.
In [ ]:
output = #TODO: Use the predict method on our trained model to find the predictions for the examples contained in "first_5.csv".
In [ ]:
import numpy as np
recommended_content_ids = [d["class_names"].item().decode('UTF-8') for d in output]  # .item() replaces the deprecated np.asscalar
content_ids = open("first_5_content_ids").read().splitlines()
Finally, we'll map the content id back to the article title. We can then compare our model's recommendation for the first of our examples. This can all be done in BigQuery. Look through the query below and make sure it is clear what is being returned.
In [ ]:
from google.cloud import bigquery
recommended_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,
UNNEST(hits) AS hits
WHERE
# only include hits on pages
hits.type = "PAGE"
AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(recommended_content_ids[0])
current_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,
UNNEST(hits) AS hits
WHERE
# only include hits on pages
hits.type = "PAGE"
AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(content_ids[0])
# Keep the titles as str; encoding to bytes would print as b'...' under Python 3.
recommended_title = bigquery.Client().query(recommended_title_sql).to_dataframe()['title'].tolist()[0].strip()
current_title = bigquery.Client().query(current_title_sql).to_dataframe()['title'].tolist()[0].strip()
print("Current title: {} ".format(current_title))
print("Recommended title: {}".format(recommended_title))