In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
|
|
In this notebook we will see how AutoML Tables can be used to make music recommendations to users. AutoML Tables is a supervised learning service for structured data that can vastly simplify the model building process.
AutoML Tables allows data to be imported from either GCS or BigQuery. This tutorial uses the ListenBrainz dataset from Cloud Marketplace, hosted in BigQuery.
The ListenBrainz dataset is a log of songs played by users, some notable pieces of the schema include:
Field name | Description |
---|---|
user_name | a user id |
track_name | a song id |
release_name | the album of the song |
artist_name | the artist of the song |
tags | the genres of the song |
The goal of this notebook is to demonstrate how to create a lookup table in BigQuery of songs to recommend to users using a log of user-song listens and AutoML Tables. This will be done by training a binary classification model to predict whether or not a user will like a given song. In the training data, liking a song was defined as having listened to a song more than twice. Using the predictions for every (user, song)
pair to generate a ranking of the most similar songs for each user.
As the number of (user, song)
pairs grows exponentially with the number of unique users and songs, this approach may not be optimal for extremely large datasets. One workaround would be to train a model that learns to embed users and songs in the same embedding space, and use a nearest-neighbors algorithm to get recommendations for users. Unfortunately, AutoML Tables does not expose any feature for training and using embeddings, so a custom ML model would need to be used instead.
Another recommendation approach that is worth mentioning is using extreme multiclass classification, as that also circumvents storing every possible pair of users and songs. Unfortunately, AutoML Tables does not support the multiclass classification of more than 100 classes.
This tutorial uses billable components of Google Cloud Platform (GCP):
Learn about Cloud AI Platform pricing, Bigquery pricing, AutoML Tables pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. If you are using AI Platform Notebook, make sure the machine configuration type is 1 vCPU, 3.75 GB RAM or above. You can skip this step.
Otherwise, make sure your environment meets this notebook's requirements. You need the following:
The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
Install virtualenv and create a virtual environment that uses Python 3.
Activate that environment and run pip install jupyter
in a shell to install
Jupyter.
Run jupyter notebook
in a shell to launch Jupyter.
Open this notebook in the Jupyter Notebook Dashboard.
The following steps are required, regardless of your notebook environment.
Select or create a GCP project.. When you first create an account, you get a $300 free credit towards your compute/storage costs.
In [ ]:
# Use the latest major GA version of the framework.
! pip install --upgrade --quiet --user google-cloud-automl
! pip install --upgrade --quiet --user google-cloud-bigquery
! pip install --upgrade --quiet --user seaborn
Note: Try installing using sudo
, if the above command throw any permission errors.
Restart
the kernel to allow automl_v1beta1 to be imported for Jupyter Notebooks.
In [ ]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
COMPUTE_REGION = "us-central1" # Currently only supported region.
Otherwise, follow these steps:
In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select AutoML > AutoML Admin and BigQuery > BigQuery Admin.
Click Create. A JSON file that contains your key downloads to your local environment.
Note: Jupyter runs lines prefixed with !
as shell commands, and it interpolates Python variables prefixed with $
into these commands.
In [ ]:
# Upload the downloaded JSON file that contains your key.
import sys
if 'google.colab' in sys.modules:
from google.colab import files
keyfile_upload = files.upload()
keyfile = list(keyfile_upload.keys())[0]
%env GOOGLE_APPLICATION_CREDENTIALS $keyfile
! gcloud auth activate-service-account --key-file $keyfile
If you are running the notebook locally, enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS
variable in the cell below and run the cell
In [ ]:
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account
! gcloud auth activate-service-account --key-file '/path/to/service/account'
Import relevant packages.
In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
In [ ]:
from google.cloud import automl_v1beta1 as automl
from google.cloud import bigquery
from google.cloud import exceptions
import seaborn as sns
%matplotlib inline
Populate the following cell with the necessary constants and run it to initialize constants.
In [ ]:
#@title Constants { vertical-output: true }
# A name for the AutoML tables Dataset to create.
DATASET_DISPLAY_NAME = "music_rec" #@param {type: 'string'}
# The BigQuery dataset to import data from (doesn't need to exist).
BQ_DATASET_NAME = "music_rec_dataset" #@param {type: 'string'}
# The BigQuery table to import data from (doesn't need to exist).
BQ_TABLE_NAME = "music_rec_table" #@param {type: 'string'}
# A name for the AutoML tables model to create.
MODEL_DISPLAY_NAME = "music_rec_model" #@param {type: 'string'}
assert all([
PROJECT_ID,
COMPUTE_REGION,
DATASET_DISPLAY_NAME,
BQ_DATASET_NAME,
BQ_TABLE_NAME,
MODEL_DISPLAY_NAME,
])
Initialize the clients for AutoML, AutoML Tables and BigQuery.
In [ ]:
# Initialize the clients.
automl_client = automl.AutoMlClient()
tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)
bq_client = bigquery.Client()
In [ ]:
# List the datasets.
list_datasets = tables_client.list_datasets()
datasets = { dataset.display_name: dataset.name for dataset in list_datasets }
datasets
You can also print the list of your models by running the following cell.
If no model has previously trained using AutoML Tables, you shall expect an empty return.
In [ ]:
# List the models.
list_models = tables_client.list_models()
models = { model.display_name: model.name for model in list_models }
models
In order to train a model, a structured dataset must be injested into AutoML tables from either BigQuery or Google Cloud Storage. Once injested, the user will be able to cherry pick columns to use as features, labels, or weights and configure the loss function.
First, do some feature engineering on the original ListenBrainz dataset to turn it into a dataset for training and export it into a seperate BigQuery table:
1. Make each sample a unique `(user, song)` pair.
2. For features, use the user's top 10 songs ever played and the song's number of albums, artist, and genres.
3. For a label, use the number of times the user has listened to the song, normalized by dividing by the maximum number of times that user has listened to any song. Normalizing the listen counts ensures active users don't have disproportionate effect on the model error.
4. Add a weight equal to the label to give songs more popular with the user higher weights. This is to help account for the skew in the label distribution.
In [ ]:
query = """
WITH
songs AS (
SELECT CONCAT(track_name, " by ", artist_name) AS song,
MAX(tags) as tags
FROM `listenbrainz.listenbrainz.listen`
GROUP BY song
HAVING tags != ""
ORDER BY COUNT(*) DESC
LIMIT 10000
),
user_songs AS (
SELECT user_name AS user, ANY_VALUE(artist_name) AS artist,
CONCAT(track_name, " by ", artist_name) AS song,
SPLIT(ANY_VALUE(songs.tags), ",") AS tags,
COUNT(*) AS user_song_listens
FROM `listenbrainz.listenbrainz.listen`
JOIN songs ON songs.song = CONCAT(track_name, " by ", artist_name)
GROUP BY user_name, song
),
user_tags AS (
SELECT user, tag, COUNT(*) AS COUNT
FROM user_songs,
UNNEST(tags) tag
WHERE tag != ""
GROUP BY user, tag
),
top_tags AS (
SELECT tag
FROM user_tags
GROUP BY tag
ORDER BY SUM(count) DESC
LIMIT 20
),
tag_table AS (
SELECT user, b.tag
FROM user_tags a, top_tags b
GROUP BY user, b.tag
),
user_tag_features AS (
SELECT user,
ARRAY_AGG(IFNULL(count, 0) ORDER BY tag) as user_tags,
SUM(count) as tag_count
FROM tag_table
LEFT JOIN user_tags USING (user, tag)
GROUP BY user
), user_features AS (
SELECT user, MAX(user_song_listens) AS user_max_listen,
ANY_VALUE(user_tags)[OFFSET(0)]/ANY_VALUE(tag_count) as user_tags0,
ANY_VALUE(user_tags)[OFFSET(1)]/ANY_VALUE(tag_count) as user_tags1,
ANY_VALUE(user_tags)[OFFSET(2)]/ANY_VALUE(tag_count) as user_tags2,
ANY_VALUE(user_tags)[OFFSET(3)]/ANY_VALUE(tag_count) as user_tags3,
ANY_VALUE(user_tags)[OFFSET(4)]/ANY_VALUE(tag_count) as user_tags4,
ANY_VALUE(user_tags)[OFFSET(5)]/ANY_VALUE(tag_count) as user_tags5,
ANY_VALUE(user_tags)[OFFSET(6)]/ANY_VALUE(tag_count) as user_tags6,
ANY_VALUE(user_tags)[OFFSET(7)]/ANY_VALUE(tag_count) as user_tags7,
ANY_VALUE(user_tags)[OFFSET(8)]/ANY_VALUE(tag_count) as user_tags8,
ANY_VALUE(user_tags)[OFFSET(9)]/ANY_VALUE(tag_count) as user_tags9,
ANY_VALUE(user_tags)[OFFSET(10)]/ANY_VALUE(tag_count) as user_tags10,
ANY_VALUE(user_tags)[OFFSET(11)]/ANY_VALUE(tag_count) as user_tags11,
ANY_VALUE(user_tags)[OFFSET(12)]/ANY_VALUE(tag_count) as user_tags12,
ANY_VALUE(user_tags)[OFFSET(13)]/ANY_VALUE(tag_count) as user_tags13,
ANY_VALUE(user_tags)[OFFSET(14)]/ANY_VALUE(tag_count) as user_tags14,
ANY_VALUE(user_tags)[OFFSET(15)]/ANY_VALUE(tag_count) as user_tags15,
ANY_VALUE(user_tags)[OFFSET(16)]/ANY_VALUE(tag_count) as user_tags16,
ANY_VALUE(user_tags)[OFFSET(17)]/ANY_VALUE(tag_count) as user_tags17,
ANY_VALUE(user_tags)[OFFSET(18)]/ANY_VALUE(tag_count) as user_tags18,
ANY_VALUE(user_tags)[OFFSET(19)]/ANY_VALUE(tag_count) as user_tags19
FROM user_songs
LEFT JOIN user_tag_features USING (user)
GROUP BY user
HAVING COUNT(*) < 5000 AND user_max_listen > 2
),
item_features AS (
SELECT CONCAT(track_name, " by ", artist_name) AS song,
COUNT(DISTINCT(release_name)) AS albums
FROM `listenbrainz.listenbrainz.listen`
WHERE track_name != ""
GROUP BY song
)
SELECT user, song, artist, tags, albums, user_tags0, user_tags1, user_tags2,
user_tags3, user_tags4, user_tags5, user_tags6, user_tags7, user_tags8,
user_tags9, user_tags10, user_tags11, user_tags12, user_tags13, user_tags14,
user_tags15, user_tags16, user_tags17, user_tags18, user_tags19,
IF(user_song_listens > 2,
SQRT(user_song_listens/user_max_listen),
.5/user_song_listens) AS weight,
IF(user_song_listens > 2, 1, 0) as label
FROM user_songs
JOIN user_features USING(user)
JOIN item_features USING(song)
"""
In [ ]:
def create_table_from_query(query, table):
"""Creates a new table using the results from the given query.
Args:
query: a query string.
table: a name to give the new table.
"""
job_config = bigquery.QueryJobConfig()
bq_dataset = bigquery.Dataset("{0}.{1}".format(
PROJECT_ID, BQ_DATASET_NAME))
bq_dataset.location = "US"
try:
bq_dataset = bq_client.create_dataset(bq_dataset)
except exceptions.Conflict:
pass
table_ref = bq_client.dataset(BQ_DATASET_NAME).table(table)
job_config.destination = table_ref
query_job = bq_client.query(query,
location=bq_dataset.location,
job_config=job_config)
query_job.result()
print('Query results loaded to table {}'.format(table_ref.path))
In [ ]:
create_table_from_query(query, BQ_TABLE_NAME)
In [ ]:
# Create dataset.
dataset = tables_client.create_dataset(
dataset_display_name=DATASET_DISPLAY_NAME)
dataset_name = dataset.name
dataset
In [ ]:
# Read the data source from BigQuery.
dataset_bq_input_uri = 'bq://{0}.{1}.{2}'.format(
PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME)
import_data_response = tables_client.import_data(
dataset=dataset, bigquery_input_uri=dataset_bq_input_uri)
print('Dataset import operation: {}'.format(import_data_response.operation))
# Synchronous check of operation status. Wait until import is done.
print('Dataset import response: {}'.format(import_data_response.result()))
# Verify the status by checking the example_count field.
dataset = tables_client.get_dataset(dataset_name=dataset_name)
dataset
Inspect the datatypes assigned to each column. In this case, the song
and artist
should be categorical, not textual.
In [ ]:
list_column_specs_response = tables_client.list_column_specs(
dataset_display_name=DATASET_DISPLAY_NAME)
column_specs = {s.display_name: s for s in list_column_specs_response}
def print_column_specs(column_specs):
"""Parses the given specs and prints each column and column type."""
data_types = automl.proto.data_types_pb2
return [(x, data_types.TypeCode.Name(
column_specs[x].data_type.type_code)) for x in column_specs.keys()]
print_column_specs(column_specs)
In [ ]:
type_code='CATEGORY' #@param {type:'string'}
for col in ["song", "artist"]:
tables_client.update_column_spec(dataset_display_name=DATASET_DISPLAY_NAME,
column_spec_display_name=col,
type_code=type_code)
list_column_specs_response = tables_client.list_column_specs(
dataset_display_name=DATASET_DISPLAY_NAME)
column_specs = {s.display_name: s for s in list_column_specs_response}
print_column_specs(column_specs)
Not all columns are feature columns, in order to train a model, we need to tell Tables which column should be used as the target variable and, optionally, which column should be used as sample weights.
In [ ]:
tables_client.set_target_column(dataset_display_name=DATASET_DISPLAY_NAME,
column_spec_display_name="label")
tables_client.set_weight_column(dataset_display_name=DATASET_DISPLAY_NAME,
column_spec_display_name="weight")
Once the Dataset has been configured correctly, we can tell AutoML Tables to train a new model. The amount of resources spent to train this model can be adjusted using a parameter called 'train_budget_milli_node_hours'
. As the name implies, this puts a maximum budget on how many resources a training job can use up before exporting a servable model.
For demonstration purpose, the following command sets the budget as 1 node hour ('train_budget_milli_node_hours': 1000)
. You can increase that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000)
for the best model performance.
Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.
You can also select the objective to optimize your model training by setting optimization_objective. This solution optimizes the model by using default optimization objective. Refer link for more details.
In [ ]:
# The number of hours to train the model.
model_train_hours = 1 #@param {type:'integer'}
create_model_response = tables_client.create_model(
model_display_name=MODEL_DISPLAY_NAME,
dataset_display_name=DATASET_DISPLAY_NAME,
train_budget_milli_node_hours=model_train_hours*1000)
operation_id = create_model_response.operation.name
print('Create model operation: {}'.format(create_model_response.operation))
In [ ]:
# Wait until model training is done.
model = create_model_response.result()
model_name = model.name
model
Because we are optimizing a surrogate problem (predicting the similarity between (user, song)
pairs) in order to achieve our final objective of producing a list of recommended songs for a user, it's difficult to tell how well the model performs by looking only at the final loss function. Instead, an evaluation metric we can use for our model is recall@n
for the top m
most listened to songs for each user. This metric will give us the probability that one of a user's top m
most listened to songs will appear in the top n
recommendations we make.
In order to get the top recommendations for each user, we need to create a batch job to predict similarity scores between each user and item pair. These similarity scores would then be sorted per user to produce an ordered list of recommended songs.
Instead of creating a lookup table for all users, let's just focus on the performance for a few users for this demo. We will focus especially on recommendations for the user rob
, and demonstrate how the others can be included in an overall evaluation metric for the model. We start by creatings a dataset for prediction to feed into the trained model; this is a table of every possible (user, song)
pair containing the users and corresponding features.
In [ ]:
users = ["rob", "fiveofoh", "Aerion"]
training_table = "{}.{}.{}".format(
PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME)
query = """
WITH user as (
SELECT user,
user_tags0, user_tags1, user_tags2, user_tags3, user_tags4,
user_tags5, user_tags6, user_tags7, user_tags8, user_tags9,
user_tags10,user_tags11, user_tags12, user_tags13, user_tags14,
user_tags15, user_tags16, user_tags17, user_tags18, user_tags19, label
FROM `{0}`
WHERE user in ({1})
)
SELECT ANY_VALUE(a).*, song, ANY_VALUE(artist) as artist,
ANY_VALUE(tags) as tags, ANY_VALUE(albums) as albums
FROM `{0}`, user a
GROUP BY song
""".format(training_table, ",".join(["\"{}\"".format(x) for x in users]))
In [ ]:
eval_table = "{}_example".format(BQ_TABLE_NAME)
create_table_from_query(query, eval_table)
In [ ]:
preds_bq_input_uri = "bq://{}.{}.{}".format(
PROJECT_ID, BQ_DATASET_NAME, eval_table)
preds_bq_output_uri = "bq://{}".format(PROJECT_ID)
response = tables_client.batch_predict(model_display_name=MODEL_DISPLAY_NAME,
bigquery_input_uri=preds_bq_input_uri,
bigquery_output_uri=preds_bq_output_uri)
print('Prediction response: {}'.format(response.result()))
output_uri = response.metadata.batch_predict_details\
.output_info.bigquery_output_dataset
print('Output URI: {}'.format(output_uri))
With the similarity predictions for rob, we can order by the predictions to get a ranked list of songs to recommend to rob
.
In [ ]:
n = 10
query = """
SELECT user, song, tables.score as score, a.label as pred_label,
b.label as true_label
FROM `{}.predictions` a, UNNEST(predicted_label)
LEFT JOIN `{}` b USING(user, song)
WHERE user = "{}" AND CAST(tables.value AS INT64) = 1
ORDER BY score DESC
LIMIT {}
""".format(output_uri[5:].replace(":", "."), training_table, users[0], n)
query_job = bq_client.query(query)
print("Top {} song recommended for {}:".format(n, users[0]))
for idx, row in enumerate(query_job):
print("{}.".format(idx + 1), row["song"])
In [ ]:
query = """
WITH
top_k AS (
SELECT user, song, label,
ROW_NUMBER() OVER (PARTITION BY user ORDER BY label + weight DESC) as user_rank
FROM `{0}`
)
SELECT user, song, tables.score as score, b.label,
ROW_NUMBER() OVER (ORDER BY tables.score DESC) as rank, user_rank
FROM `{1}.predictions` a, UNNEST(predicted_label)
LEFT JOIN top_k b USING(user, song)
WHERE CAST(tables.value AS INT64) = 1
ORDER BY score DESC
""".format(training_table, output_uri[5:].replace(":", "."))
df = bq_client.query(query).result().to_dataframe()
df.head()
In [ ]:
precision_at_k = {}
recall_at_k = {}
for user in users:
precision_at_k[user] = []
recall_at_k[user] = []
for k in range(1, 1000):
precision = df["label"][:k].sum() / k
recall = df["label"][:k].sum() / df["label"].sum()
precision_at_k[user].append(precision)
recall_at_k[user].append(recall)
# plot the precision-recall curve.
ax = sns.lineplot(recall_at_k[users[0]], precision_at_k[users[0]])
ax.set_title("precision-recall curve for varying k")
ax.set_xlabel("recall@k")
ax.set_ylabel("precision@k")
Achieving a high precision@k means a large proportion of top-k recommended items are relevant to the user. Recall@k shows what proportion of all relevant items appeared in the top-k recommendations.
Mean Average Precision (MAP)
Precision@k is a good metric for understanding how many relevant recommendations we might make at each top-k. However, we would prefer relevant items to be recommended first when possible and should encode that into our evaluation metric. Average Precision (AP) is a running average of precision@k, rewarding recommendations where the revelant items are seen earlier rather than later. When the averaged across all users for some k, the AP metric is called MAP.
In [ ]:
def calculate_ap(precision):
ap = [precision[0]]
for p in precision[1:]:
ap.append(ap[-1] + p)
ap = [x / (n + 1) for x, n in zip(ap, range(len(ap)))]
return ap
ap_at_k = {user: calculate_ap(pk)
for user, pk in precision_at_k.items()}
num_k = 500
map_at_k = [sum([ap_at_k[user][k] for user in users]) / len(users)
for k in range(num_k)]
print("MAP@50: {}".format(map_at_k[49]))
# plot average precision.
ax = sns.lineplot(range(num_k), map_at_k)
ax.set_title("MAP@k for varying k")
ax.set_xlabel("k")
ax.set_ylabel("MAP")
To clean up all GCP resources used in this project, you can delete the GCP project you used for the tutorial.
Delete BigQuery datasets
In order to delete BigQuery tables, make sure the service account linked to this notebook has a role with the bigquery.tables.delete permission such as Big Query Data Owner. The following command displays the current service account.
IAM permissions can be adjusted here.
In [ ]:
# Delete model resource.
tables_client.delete_model(model_name=model_name)
# Delete dataset resource.
tables_client.delete_dataset(dataset_name=dataset_name)
# Delete the prediction dataset.
dataset_id = str(output_uri[5:].replace(":", "."))
bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
# Delete the training dataset.
dataset_id = "{0}.{1}".format(PROJECT_ID, BQ_DATASET_NAME)
bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
# If training model is still running, cancel it.
automl_client.transport._operations_client.cancel_operation(operation_id)