Hybrid Recommendations with the MovieLens Dataset

Note: Please complete the als_bqml.ipynb notebook before continuing.

Learning Objectives

  1. Know how to extract user and product factors from a BigQuery Matrix Factorization model
  2. Know how to format inputs for a BigQuery Hybrid Recommendation Model

Incorporating user and movie information

The matrix factorization approach does not use any information about users or movies beyond what is available from the ratings matrix. However, we will often have user information (such as the city they live in, their annual income, their annual expenditure, etc.) and we will almost always have more information about the products in our catalog. How do we incorporate this information into our recommendation model?

The answer lies in recognizing that the user factors and product factors that result from the matrix factorization approach end up being a concise representation of the information about users and products available from the ratings matrix. We can concatenate this information with other information we have available and train a regression model to predict the rating.

Obtaining user and product factors

We can get the user factors or product factors from ML.WEIGHTS. For example, to get the product factors for movieId=96481 and the user factors for userId=54192, we would do:


In [ ]:
import os
PROJECT = "your-project-id-here" # REPLACE WITH YOUR PROJECT ID

# Do not change these
os.environ["PROJECT"] = PROJECT

In [ ]:
%%bigquery --project $PROJECT
SELECT 
    processed_input,
    feature,
    TO_JSON_STRING(factor_weights),
    intercept
FROM ML.WEIGHTS(MODEL movielens.recommender_16)
WHERE
    (processed_input = 'movieId' AND feature = '96481')
    OR (processed_input = 'userId' AND feature = '54192')

Multiplying these factor weights and adding the intercepts is how the matrix factorization approach computes the predicted rating for this combination of movieId and userId.
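
For instance, the query below reconstructs that prediction by hand. This is a sketch rather than part of the lab: it assumes each factor_weights element is a struct with a factor index field alongside the weight (the shape ML.WEIGHTS returns for matrix factorization models), and that the prediction is the dot product of the two factor arrays plus both intercepts.


In [ ]:
%%bigquery --project $PROJECT
WITH w AS (
  SELECT processed_input, factor_weights, intercept
  FROM ML.WEIGHTS(MODEL movielens.recommender_16)
  WHERE (processed_input = 'movieId' AND feature = '96481')
     OR (processed_input = 'userId' AND feature = '54192')
)
SELECT
  -- Dot product of the user and movie factor arrays, plus both intercepts.
  -- `factor` is assumed to be the index field of each weight struct.
  (SELECT SUM(u.weight * m.weight)
   FROM UNNEST((SELECT factor_weights FROM w WHERE processed_input = 'userId')) u
   JOIN UNNEST((SELECT factor_weights FROM w WHERE processed_input = 'movieId')) m
   ON u.factor = m.factor)
  + (SELECT intercept FROM w WHERE processed_input = 'userId')
  + (SELECT intercept FROM w WHERE processed_input = 'movieId')
  AS predicted_rating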

These weights also serve as a low-dimensional representation of the movie and user behavior. We can create a regression model to predict the rating given the user factors, product factors, and any other information we know about our users and products.

Creating input features

The MovieLens dataset does not have any user information, and has very little information about the movies themselves. To illustrate the concept, therefore, let’s create some synthetic information about users:


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE TABLE movielens.users AS
SELECT
    userId,
    RAND() * COUNT(rating) AS loyalty,  -- synthetic loyalty score
    SUBSTR(CAST(userId AS STRING), 0, 2) AS postcode  -- synthetic postcode from userId digits
FROM
  movielens.ratings
GROUP BY userId
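
A quick preview confirms the shape of this synthetic table:


In [ ]:
%%bigquery --project $PROJECT
SELECT *
FROM movielens.users
LIMIT 5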

Input features about users can be obtained by joining the user table with the ML weights and selecting all the user information and the user factors from the weights array.


In [ ]:
%%bigquery --project $PROJECT
WITH userFeatures AS (
  SELECT 
     u.*,
     (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights)) AS user_factors
  FROM movielens.users u
  JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w
  ON processed_input = 'userId' AND feature = CAST(u.userId AS STRING)
)

SELECT * FROM userFeatures
LIMIT 5

Similarly, we can get product features for the movies data, except that we have to decide how to handle genres, since a movie can have more than one genre. If we decide to create a separate training row for each genre, we can construct the product features as follows:


In [ ]:
%%bigquery --project $PROJECT
WITH productFeatures AS (
  SELECT 
      p.* EXCEPT(genres),
      g, (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights))
            AS product_factors
  FROM movielens.movies p, UNNEST(genres) g
  JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w
  ON processed_input = 'movieId' AND feature = CAST(p.movieId AS STRING)
)

SELECT * FROM productFeatures
LIMIT 5
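
If the comma-join UNNEST pattern is unfamiliar, the standalone query below (illustrative only, reusing the movieId from earlier) shows how a movie with multiple genres expands into one row per genre:


In [ ]:
%%bigquery --project $PROJECT
-- Each element of the genres array becomes its own output row.
SELECT p.* EXCEPT(genres), g AS genre
FROM movielens.movies p, UNNEST(genres) g
WHERE p.movieId = 96481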

Combining these two WITH clauses and pulling in the rating corresponding to the movieId-userId combination (if it exists in the ratings table), we can create the training dataset.

TODO 1: Combine the above two queries to get the user factors and product factors for each rating.


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE TABLE movielens.hybrid_dataset AS

    WITH userFeatures AS (
      # TODO: Place the user features query here
    ),

    productFeatures AS (
      # TODO: Place the product features query here
    )

    SELECT
        p.* EXCEPT(movieId),
        u.* EXCEPT(userId),
        rating 
    FROM productFeatures p, userFeatures u
    JOIN movielens.ratings r
    ON r.movieId = p.movieId AND r.userId = u.userId
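
If you get stuck, one possible completion simply pastes the two feature queries from above into the WITH clauses unchanged:


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE TABLE movielens.hybrid_dataset AS

    WITH userFeatures AS (
      SELECT
         u.*,
         (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights)) AS user_factors
      FROM movielens.users u
      JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w
      ON processed_input = 'userId' AND feature = CAST(u.userId AS STRING)
    ),

    productFeatures AS (
      SELECT
          p.* EXCEPT(genres),
          g, (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights))
                AS product_factors
      FROM movielens.movies p, UNNEST(genres) g
      JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w
      ON processed_input = 'movieId' AND feature = CAST(p.movieId AS STRING)
    )

    SELECT
        p.* EXCEPT(movieId),
        u.* EXCEPT(userId),
        rating
    FROM productFeatures p, userFeatures u
    JOIN movielens.ratings r
    ON r.movieId = p.movieId AND r.userId = u.userId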

One of the rows of this table looks like this:


In [ ]:
%%bigquery --project $PROJECT
SELECT *
FROM movielens.hybrid_dataset
LIMIT 1

Essentially, we have a couple of attributes about the movie, the product factors array corresponding to the movie, a couple of attributes about the user, and the user factors array corresponding to the user. These form the inputs to our “hybrid” recommendations model that builds off the matrix factorization model and adds in metadata about users and movies.

Training hybrid recommendation model

At the time of writing, BigQuery ML cannot handle arrays as inputs to a regression model. Let's therefore define a function that converts an array into a struct whose fields are the array elements:


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE FUNCTION movielens.arr_to_input_16_users(u ARRAY<FLOAT64>)
RETURNS 
    STRUCT<
        u1 FLOAT64,
        u2 FLOAT64,
        u3 FLOAT64,
        u4 FLOAT64,
        u5 FLOAT64,
        u6 FLOAT64,
        u7 FLOAT64,
        u8 FLOAT64,
        u9 FLOAT64,
        u10 FLOAT64,
        u11 FLOAT64,
        u12 FLOAT64,
        u13 FLOAT64,
        u14 FLOAT64,
        u15 FLOAT64,
        u16 FLOAT64
    > AS (STRUCT(
        u[OFFSET(0)],
        u[OFFSET(1)],
        u[OFFSET(2)],
        u[OFFSET(3)],
        u[OFFSET(4)],
        u[OFFSET(5)],
        u[OFFSET(6)],
        u[OFFSET(7)],
        u[OFFSET(8)],
        u[OFFSET(9)],
        u[OFFSET(10)],
        u[OFFSET(11)],
        u[OFFSET(12)],
        u[OFFSET(13)],
        u[OFFSET(14)],
        u[OFFSET(15)]
));

which gives:


In [ ]:
%%bigquery --project $PROJECT
SELECT movielens.arr_to_input_16_users(u).*
FROM (SELECT
    [0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15.] AS u)

We can create a similar function named movielens.arr_to_input_16_products to convert the product factor array into named columns.

TODO 2: Create a function that returns named columns from a size 16 product factor array.


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE FUNCTION movielens.arr_to_input_16_products(p ARRAY<FLOAT64>)
RETURNS 
    STRUCT<
        p1 FLOAT64,
        p2 FLOAT64,
        p3 FLOAT64,
        p4 FLOAT64,
        p5 FLOAT64,
        p6 FLOAT64,
        p7 FLOAT64,
        p8 FLOAT64,
        p9 FLOAT64,
        p10 FLOAT64,
        p11 FLOAT64,
        p12 FLOAT64,
        # TODO: Finish building this struct
    > AS (STRUCT(
        p[OFFSET(0)],
        p[OFFSET(1)],
        p[OFFSET(2)],
        p[OFFSET(3)],
        p[OFFSET(4)],
        p[OFFSET(5)],
        p[OFFSET(6)],
        p[OFFSET(7)],
        p[OFFSET(8)],
        p[OFFSET(9)],
        p[OFFSET(10)],
        p[OFFSET(11)],
        p[OFFSET(12)],
        # TODO: Finish building this struct
));
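
For reference, the completed function just continues the same pattern through p16 and OFFSET(15):


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE FUNCTION movielens.arr_to_input_16_products(p ARRAY<FLOAT64>)
RETURNS
    STRUCT<
        p1 FLOAT64, p2 FLOAT64, p3 FLOAT64, p4 FLOAT64,
        p5 FLOAT64, p6 FLOAT64, p7 FLOAT64, p8 FLOAT64,
        p9 FLOAT64, p10 FLOAT64, p11 FLOAT64, p12 FLOAT64,
        p13 FLOAT64, p14 FLOAT64, p15 FLOAT64, p16 FLOAT64
    > AS (STRUCT(
        p[OFFSET(0)], p[OFFSET(1)], p[OFFSET(2)], p[OFFSET(3)],
        p[OFFSET(4)], p[OFFSET(5)], p[OFFSET(6)], p[OFFSET(7)],
        p[OFFSET(8)], p[OFFSET(9)], p[OFFSET(10)], p[OFFSET(11)],
        p[OFFSET(12)], p[OFFSET(13)], p[OFFSET(14)], p[OFFSET(15)]
));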

Then, we can tie together metadata about users and products with the user factors and product factors obtained from the matrix factorization approach to create a regression model to predict the rating:


In [ ]:
%%bigquery --project $PROJECT
CREATE OR REPLACE MODEL movielens.recommender_hybrid 
OPTIONS(model_type='linear_reg', input_label_cols=['rating'])
AS

SELECT
  * EXCEPT(user_factors, product_factors),
    movielens.arr_to_input_16_users(user_factors).*,
    movielens.arr_to_input_16_products(product_factors).*
FROM
  movielens.hybrid_dataset

There is no point looking at the evaluation metrics of this model because the user information used to create the training dataset was synthetic (note the RAND() in the creation of the loyalty column); we did this exercise to demonstrate how it could be done. Of course, we could train a dnn_regressor model and optimize the hyperparameters if we want a more sophisticated model, but if we are going to go that far, it might be better to consider using AutoML Tables, covered in the next section.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.