In this notebook, we'll walk you through building a model to predict the genres of a movie given its description. The emphasis here is not on accuracy, but instead how to use TF Hub layers in a text classification model.
To start, import the necessary dependencies for this project.
In [0]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import json
import pickle
import urllib
from sklearn.preprocessing import MultiLabelBinarizer
print(tf.__version__)
We need a lot of text inputs to train our model. For this model we'll use this awesome movies dataset from Kaggle. To simplify things I've made the movies_metadata.csv
file available in a public Cloud Storage bucket so we can download it with wget
. I've preprocessed the dataset already to limit the number of genres we'll use for our model, but first let's take a look at the original data so we can see what we're working with.
In [0]:
# Download the data from GCS
!wget 'https://storage.googleapis.com/movies_data/movies_metadata.csv'
Next we'll convert the dataset to a Pandas dataframe and print the first 5 rows. For this model we're only using 2 of these columns: genres
and overview
.
In [0]:
data = pd.read_csv('movies_metadata.csv')
data.head()
I've done some preprocessing to limit the dataset to the top 9 genres, and I've saved the Pandas dataframes as public Pickle files in GCS. Here we download those files. The resulting descriptions
and genres
variables are Pandas Series containing all descriptions and genres from our dataset respectively.
In [0]:
urllib.request.urlretrieve('https://storage.googleapis.com/bq-imports/descriptions.p', 'descriptions.p')
urllib.request.urlretrieve('https://storage.googleapis.com/bq-imports/genres.p', 'genres.p')
descriptions = pickle.load(open('descriptions.p', 'rb'))
genres = pickle.load(open('genres.p', 'rb'))
In [0]:
train_size = int(len(descriptions) * .8)
train_descriptions = descriptions[:train_size].astype('str')
train_genres = genres[:train_size]
test_descriptions = descriptions[train_size:].astype('str')
test_genres = genres[train_size:]
When we train our model we'll provide the labels (in this case genres) associated with each movie. We can't pass the genres in as strings directly, we'll transform them into multi-hot vectors. Since we have 9 genres, we'll have a 9 element vector for each movie with 0s and 1s indicating which genres are present in each description.
In [0]:
encoder = MultiLabelBinarizer()
encoder.fit_transform(train_genres)
train_encoded = encoder.transform(train_genres)
test_encoded = encoder.transform(test_genres)
num_classes = len(encoder.classes_)
# Print all possible genres and the labels for the first movie in our training dataset
print(encoder.classes_)
print(train_encoded[0])
TF Hub provides a library of existing pre-trained model checkpoints for various kinds of models (images, text, and more) In this model we'll use the TF Hub universal-sentence-encoder
module for our pre-trained word embeddings. We only need one line of code to instantiate module. When we train our model, it'll convert our array of movie description strings to embeddings. When we train our model, we'll use this as a feature column.
In [0]:
description_embeddings = hub.text_embedding_column("descriptions", module_spec="https://tfhub.dev/google/universal-sentence-encoder/2", trainable=False)
The first parameter we pass to our DNNEstimator is called a head, and defines the type of labels our model should expect. Since we want our model to output multiple labels, we’ll use multi_label_head here. Then we'll convert our features and labels to numpy arrays and instantiate our Estimator. batch_size
and num_epochs
are hyperparameters - you should experiment with different values to see what works best on your dataset.
In [0]:
multi_label_head = tf.contrib.estimator.multi_label_head(
num_classes,
loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE
)
In [0]:
features = {
"descriptions": np.array(train_descriptions).astype(np.str)
}
labels = np.array(train_encoded).astype(np.int32)
train_input_fn = tf.estimator.inputs.numpy_input_fn(features, labels, shuffle=True, batch_size=32, num_epochs=25)
estimator = tf.contrib.estimator.DNNEstimator(
head=multi_label_head,
hidden_units=[64,10],
feature_columns=[description_embeddings])
To train our model, we simply call train()
passing it the input function we defined above. Once our model is trained, we'll define an evaluation input function similar to the one above and call evaluate()
. When this completes we'll get a few metrics we can use to evaluate our model's accuracy.
In [0]:
estimator.train(input_fn=train_input_fn)
In [0]:
# Define our eval input_fn and run eval
eval_input_fn = tf.estimator.inputs.numpy_input_fn({"descriptions": np.array(test_descriptions).astype(np.str)}, test_encoded.astype(np.int32), shuffle=False)
estimator.evaluate(input_fn=eval_input_fn)
Now for the most fun part! Let's generate predictions on movie descriptions our model hasn't seen before. We'll define an array of 3 new description strings (the comments indicate the correct genres) and create a predict_input_fn
. Then we'll display the top 2 genres along with their confidence percentages for each of the 3 movies.
In [0]:
# Test our model on some raw description data
raw_test = [
"An examination of our dietary choices and the food we put in our bodies. Based on Jonathan Safran Foer's memoir.", # Documentary
"After escaping an attack by what he claims was a 70-foot shark, Jonas Taylor must confront his fears to save those trapped in a sunken submersible.", # Action, Adventure
"A teenager tries to survive the last week of her disastrous eighth-grade year before leaving to start high school.", # Comedy
]
In [0]:
# Generate predictions
predict_input_fn = tf.estimator.inputs.numpy_input_fn({"descriptions": np.array(raw_test).astype(np.str)}, shuffle=False)
results = estimator.predict(predict_input_fn)
In [0]:
# Display predictions
for movie_genres in results:
top_2 = movie_genres['probabilities'].argsort()[-2:][::-1]
for genre in top_2:
text_genre = encoder.classes_[genre]
print(text_genre + ': ' + str(round(movie_genres['probabilities'][genre] * 100, 2)) + '%')
print('')
In [0]: