Explicit Feedback Neural Recommender Systems

Goals:

  • Understand recommender data
  • Build different models architectures using Keras
  • Retrieve Embeddings and visualize them
  • Add metadata information as input to the model

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os.path as op

from zipfile import ZipFile
try:
    from urllib.request import urlretrieve
except ImportError:  # Python 2 compat
    from urllib import urlretrieve


ML_100K_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
ML_100K_FILENAME = ML_100K_URL.rsplit('/', 1)[1]
ML_100K_FOLDER = 'ml-100k'

if not op.exists(ML_100K_FILENAME):
    print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))
    urlretrieve(ML_100K_URL, ML_100K_FILENAME)

if not op.exists(ML_100K_FOLDER):
    print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))
    ZipFile(ML_100K_FILENAME).extractall('.')

Ratings file

Each line contains a rated movie:

  • a user
  • an item
  • a rating from 1 to 5 stars

In [2]:
import pandas as pd

raw_ratings = pd.read_csv(op.join(ML_100K_FOLDER, 'u.data'), sep='\t',
                      names=["user_id", "item_id", "rating", "timestamp"])
raw_ratings.head()


Out[2]:
user_id item_id rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

Item metadata file

The item metadata file contains metadata like the name of the movie or the date it was released. The movies file contains columns indicating the movie's genres. Let's only load the first five columns of the file with usecols.


In [3]:
m_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
items = pd.read_csv(op.join(ML_100K_FOLDER, 'u.item'), sep='|',
                    names=m_cols, usecols=range(5), encoding='latin-1')
items.head()


Out[3]:
item_id title release_date video_release_date imdb_url
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 2 GoldenEye (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(...
2 3 Four Rooms (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%...
3 4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%...
4 5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995)

Let's write a bit of Python preprocessing code to extract the release year as an integer value:


In [4]:
def extract_year(release_date):
    if hasattr(release_date, 'split'):
        components = release_date.split('-')
        if len(components) == 3:
            return int(components[2])
    # Missing value marker
    return 1920


items['release_year'] = items['release_date'].map(extract_year)
items.hist('release_year', bins=50);


Enrich the raw ratings data with the collected items metadata:


In [5]:
all_ratings = pd.merge(items, raw_ratings)

In [6]:
all_ratings.head()


Out[6]:
item_id title release_date video_release_date imdb_url release_year user_id rating timestamp
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 308 4 887736532
1 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 287 5 875334088
2 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 148 4 877019411
3 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 280 4 891700426
4 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 66 3 883601324

Data preprocessing

To understand well the distribution of the data, the following statistics are computed:

  • the number of users
  • the number of items
  • the rating distribution
  • the popularity of each movie

In [7]:
min_user_id = all_ratings['user_id'].min()
min_user_id


Out[7]:
1

In [8]:
max_user_id = all_ratings['user_id'].max()
max_user_id


Out[8]:
943

In [9]:
min_item_id = all_ratings['item_id'].min()
min_item_id


Out[9]:
1

In [10]:
max_item_id = all_ratings['item_id'].max()
max_item_id


Out[10]:
1682

In [11]:
all_ratings['rating'].describe()


Out[11]:
count    100000.000000
mean          3.529860
std           1.125674
min           1.000000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

Let's do a bit more pandas magic compute the popularity of each movie (number of ratings):


In [12]:
popularity = all_ratings.groupby('item_id').size().reset_index(name='popularity')
items = pd.merge(popularity, items)
items.nlargest(10, 'popularity')


Out[12]:
item_id popularity title release_date video_release_date imdb_url release_year
49 50 583 Star Wars (1977) 01-Jan-1977 NaN http://us.imdb.com/M/title-exact?Star%20Wars%2... 1977
257 258 509 Contact (1997) 11-Jul-1997 NaN http://us.imdb.com/Title?Contact+(1997/I) 1997
99 100 508 Fargo (1996) 14-Feb-1997 NaN http://us.imdb.com/M/title-exact?Fargo%20(1996) 1997
180 181 507 Return of the Jedi (1983) 14-Mar-1997 NaN http://us.imdb.com/M/title-exact?Return%20of%2... 1997
293 294 485 Liar Liar (1997) 21-Mar-1997 NaN http://us.imdb.com/Title?Liar+Liar+(1997) 1997
285 286 481 English Patient, The (1996) 15-Nov-1996 NaN http://us.imdb.com/M/title-exact?English%20Pat... 1996
287 288 478 Scream (1996) 20-Dec-1996 NaN http://us.imdb.com/M/title-exact?Scream%20(1996) 1996
0 1 452 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995
299 300 431 Air Force One (1997) 01-Jan-1997 NaN http://us.imdb.com/M/title-exact?Air+Force+One... 1997
120 121 429 Independence Day (ID4) (1996) 03-Jul-1996 NaN http://us.imdb.com/M/title-exact?Independence%... 1996

In [13]:
items["title"][181]


Out[13]:
'GoodFellas (1990)'

In [14]:
indexed_items = items.set_index('item_id')
indexed_items["title"][181]


Out[14]:
'Return of the Jedi (1983)'

In [15]:
all_ratings = pd.merge(popularity, all_ratings)
all_ratings.describe()


Out[15]:
item_id popularity video_release_date release_year user_id rating timestamp
count 100000.000000 100000.000000 0.0 100000.000000 100000.00000 100000.000000 1.000000e+05
mean 425.530130 168.071900 NaN 1987.950100 462.48475 3.529860 8.835289e+08
std 330.798356 121.784558 NaN 14.169558 266.61442 1.125674 5.343856e+06
min 1.000000 1.000000 NaN 1920.000000 1.00000 1.000000 8.747247e+08
25% 175.000000 71.000000 NaN 1986.000000 254.00000 3.000000 8.794487e+08
50% 322.000000 145.000000 NaN 1994.000000 447.00000 4.000000 8.828269e+08
75% 631.000000 239.000000 NaN 1996.000000 682.00000 4.000000 8.882600e+08
max 1682.000000 583.000000 NaN 1998.000000 943.00000 5.000000 8.932866e+08

In [16]:
all_ratings.head()


Out[16]:
item_id popularity title release_date video_release_date imdb_url release_year user_id rating timestamp
0 1 452 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 308 4 887736532
1 1 452 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 287 5 875334088
2 1 452 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 148 4 877019411
3 1 452 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 280 4 891700426
4 1 452 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 1995 66 3 883601324

Later in the analysis we will assume that this popularity does not come from the ratings themselves but from an external metadata, e.g. box office numbers in the month after the release in movie theaters.

Let's split the enriched data in a train / test split to make it possible to do predictive modeling:


In [17]:
from sklearn.model_selection import train_test_split

ratings_train, ratings_test = train_test_split(
    all_ratings, test_size=0.2, random_state=0)

user_id_train = np.array(ratings_train['user_id'])
item_id_train = np.array(ratings_train['item_id'])
rating_train = np.array(ratings_train['rating'])

user_id_test = np.array(ratings_test['user_id'])
item_id_test = np.array(ratings_test['item_id'])
rating_test = np.array(ratings_test['rating'])

Explicit feedback: supervised ratings prediction

For each pair of (user, item) try to predict the rating the user would give to the item.

This is the classical setup for building recommender systems from offline data with explicit supervision signal.

Predictive ratings as a regression problem

The following code implements the following architecture:


In [18]:
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.layers import Dot
from tensorflow.keras.models import Model

In [19]:
# For each sample we input the integer identifiers
# of a single user and a single item
class RegressionModel(Model):
    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_user_id + 1,
                                        input_length=1,
                                        name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_item_id + 1,
                                        input_length=1,
                                        name='item_embedding')
        
        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.dot = Dot(axes=1)
        
    def call(self, inputs):
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        
        y = self.dot([user_vecs, item_vecs])
        return y


model = RegressionModel(64, max_user_id, max_item_id)
model.compile(optimizer="adam", loss='mae')

In [20]:
# Useful for debugging the output shape of model
initial_train_preds = model.predict([user_id_train, item_id_train])
initial_train_preds.shape


Out[20]:
(80000, 1)

Model error

Using initial_train_preds, compute the model errors:

  • mean absolute error
  • mean squared error

Converting a pandas Series to numpy array is usually implicit, but you may use rating_train.values to do so explicitly. Be sure to monitor the shapes of each object you deal with by using object.shape.


In [23]:
# %load solutions/compute_errors.py
squared_differences = np.square(initial_train_preds[:,0] - rating_train)
absolute_differences = np.abs(initial_train_preds[:,0] - rating_train)

print("Random init MSE: %0.3f" % np.mean(squared_differences))
print("Random init MAE: %0.3f" % np.mean(absolute_differences))

# You may also use sklearn metrics to do so using scikit-learn:

from sklearn.metrics import mean_absolute_error, mean_squared_error

print("Random init MSE: %0.3f" % mean_squared_error(initial_train_preds, rating_train))
print("Random init MAE: %0.3f" % mean_absolute_error(initial_train_preds, rating_train))


Random init MSE: 13.720
Random init MAE: 3.529
Random init MSE: 13.720
Random init MAE: 3.529

Monitoring runs

Keras enables to monitor various variables during training.

history.history returned by the model.fit function is a dictionary containing the 'loss' and validation loss 'val_loss' after each epoch


In [24]:
%%time

# Training the model
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)


Train on 72000 samples, validate on 8000 samples
Epoch 1/10
72000/72000 [==============================] - 3s 44us/sample - loss: 2.6014 - val_loss: 1.0323
Epoch 2/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.8467 - val_loss: 0.7943
Epoch 3/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.7600 - val_loss: 0.7668
Epoch 4/10
72000/72000 [==============================] - 2s 31us/sample - loss: 0.7403 - val_loss: 0.7594
Epoch 5/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.7263 - val_loss: 0.7569
Epoch 6/10
72000/72000 [==============================] - 2s 30us/sample - loss: 0.7092 - val_loss: 0.7503
Epoch 7/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.6888 - val_loss: 0.7447
Epoch 8/10
72000/72000 [==============================] - 2s 32us/sample - loss: 0.6669 - val_loss: 0.7383
Epoch 9/10
72000/72000 [==============================] - 2s 32us/sample - loss: 0.6421 - val_loss: 0.7389
Epoch 10/10
72000/72000 [==============================] - 2s 33us/sample - loss: 0.6163 - val_loss: 0.7362
CPU times: user 43.9 s, sys: 2.12 s, total: 46 s
Wall time: 24.4 s

In [25]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.title('Loss');


Questions:

  • Why is the train loss higher than the first loss in the first few epochs?
  • Why is Keras not computing the train loss on the full training set at the end of each epoch as it does on the validation set?

Now that the model is trained, the model MSE and MAE look nicer:


In [26]:
def plot_predictions(y_true, y_pred):
    plt.figure(figsize=(4, 4))
    plt.xlim(-1, 6)
    plt.xlabel("True rating")
    plt.ylim(-1, 6)
    plt.xlabel("Predicted rating")
    plt.scatter(y_true, y_pred, s=60, alpha=0.01)

In [27]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

test_preds = model.predict([user_id_test, item_id_test])
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))
plot_predictions(rating_test, test_preds)


Final test MSE: 0.902
Final test MAE: 0.733

In [28]:
train_preds = model.predict([user_id_train, item_id_train])
print("Final train MSE: %0.3f" % mean_squared_error(train_preds, rating_train))
print("Final train MAE: %0.3f" % mean_absolute_error(train_preds, rating_train))
plot_predictions(rating_train, train_preds)


Final train MSE: 0.652
Final train MAE: 0.594

Model Embeddings

  • It is possible to retrieve the embeddings by simply using the Keras function model.get_weights which returns all the model learnable parameters.
  • The weights are returned the same order as they were build in the model
  • What is the total number of parameters?

In [29]:
# weights and shape
weights = model.get_weights()
[w.shape for w in weights]


Out[29]:
[(944, 64), (1683, 64)]

In [30]:
# Solution: 
# model.summary()

In [31]:
user_embeddings = weights[0]
item_embeddings = weights[1]

In [32]:
item_id = 181
print(f"Title for item_id={item_id}: {indexed_items['title'][item_id]}")


Title for item_id=181: Return of the Jedi (1983)

In [33]:
print(f"Embedding vector for item_id={item_id}")
print(item_embeddings[item_id])
print("shape:", item_embeddings[item_id].shape)


Embedding vector for item_id=181
[ 0.48151043  0.36075953 -0.5152434  -0.23560835  0.02258038 -0.35015777
 -0.39849016  0.3608066   0.25714108 -0.16833186  0.4536687  -0.23835418
  0.22830537 -0.38697934 -0.39666864  0.08538622 -0.40700766 -0.45461214
  0.29057175  0.4556055  -0.29594952  0.0113386   0.40053546  0.20864698
 -0.28388047  0.06081908 -0.07695611  0.4026974  -0.4370351   0.10940211
  0.38689417 -0.27938834  0.21300362  0.16662993 -0.38443014 -0.4884122
  0.4386094  -0.42546353  0.1434935   0.00376552 -0.3089447  -0.09152184
  0.13884781 -0.47453746  0.35158294 -0.3211391   0.03171986 -0.41499725
 -0.15330398 -0.28007802 -0.3319683   0.54434323 -0.3486536  -0.08418526
 -0.24189633  0.08783964  0.05250944  0.31410423  0.34306523  0.18519272
 -0.55053395 -0.4381623   0.41965964 -0.09234481]
shape: (64,)

Finding most similar items

Finding k most similar items to a point in embedding space

  • Write in numpy a function to compute the cosine similarity between two points in embedding space.
  • Test it on the following cells to check the similarities between popular movies.
  • Bonus: try to generalize the function to compute the similarities between one movie and all the others and return the most related movies.

Notes:

  • you may use np.linalg.norm to compute the norm of vector, and you may specify the axis=
  • the numpy function np.argsort(...) enables to compute the sorted indices of a vector
  • items["name"][idxs] returns the names of the items indexed by array idxs

In [36]:
EPSILON = 1e-07  # to avoid division by 0.


def cosine(x, y):
    # TODO: implement me!
    return 0.

In [37]:
# %load solutions/similarity.py
EPSILON = 1e-07


def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)

In [38]:
def print_similarity(item_a, item_b, item_embeddings, titles):
    print(titles[item_a])
    print(titles[item_b])
    similarity = cosine(item_embeddings[item_a],
                        item_embeddings[item_b])
    print(f"Cosine similarity: {similarity:.3}")
    
print_similarity(50, 181, item_embeddings, indexed_items["title"])


Star Wars (1977)
Return of the Jedi (1983)
Cosine similarity: 0.932

In [39]:
print_similarity(181, 288, item_embeddings, indexed_items["title"])


Return of the Jedi (1983)
Scream (1996)
Cosine similarity: 0.764

In [40]:
print_similarity(181, 1, item_embeddings, indexed_items["title"])


Return of the Jedi (1983)
Toy Story (1995)
Cosine similarity: 0.833

In [41]:
print_similarity(181, 181, item_embeddings, indexed_items["title"])


Return of the Jedi (1983)
Return of the Jedi (1983)
Cosine similarity: 1.0

In [42]:
def cosine_similarities(item_id, item_embeddings):
    """Compute similarities between item_id and all items embeddings"""
    query_vector = item_embeddings[item_id]
    dot_products = item_embeddings @ query_vector

    query_vector_norm = np.linalg.norm(query_vector)
    all_item_norms = np.linalg.norm(item_embeddings, axis=1)
    norm_products = query_vector_norm * all_item_norms
    return dot_products / (norm_products + EPSILON)


similarities = cosine_similarities(181, item_embeddings)
similarities


Out[42]:
array([0.12287419, 0.8332738 , 0.71978253, ..., 0.7245761 , 0.7401352 ,
       0.72493505], dtype=float32)

In [43]:
plt.hist(similarities, bins=30);



In [44]:
def most_similar(item_id, item_embeddings, titles,
                 top_n=30):
    sims = cosine_similarities(item_id, item_embeddings)
    # [::-1] makes it possible to reverse the order of a numpy
    # array, this is required because most similar items have
    # a larger cosine similarity value
    sorted_indexes = np.argsort(sims)[::-1]
    idxs = sorted_indexes[0:top_n]
    return list(zip(idxs, titles[idxs], sims[idxs]))


most_similar(50, item_embeddings, indexed_items["title"], top_n=10)


Out[44]:
[(50, 'Star Wars (1977)', 0.99999994),
 (181, 'Return of the Jedi (1983)', 0.9318021),
 (172, 'Empire Strikes Back, The (1980)', 0.9285119),
 (1550, 'Destiny Turns on the Radio (1995)', 0.9011149),
 (1586, 'Lashou shentan (1992)', 0.8954362),
 (174, 'Raiders of the Lost Ark (1981)', 0.8927318),
 (1554, 'Safe Passage (1994)', 0.890608),
 (186, 'Blues Brothers, The (1980)', 0.88655555),
 (96, 'Terminator 2: Judgment Day (1991)', 0.87996095),
 (1582, 'T-Men (1947)', 0.8792745)]

In [45]:
# items[items['title'].str.contains("Star Trek")]

In [46]:
most_similar(227, item_embeddings, indexed_items["title"], top_n=10)


Out[46]:
[(227, 'Star Trek VI: The Undiscovered Country (1991)', 1.0000001),
 (228, 'Star Trek: The Wrath of Khan (1982)', 0.93749136),
 (1076, 'Pagemaster, The (1994)', 0.90108544),
 (230, 'Star Trek IV: The Voyage Home (1986)', 0.8826339),
 (431, 'Highlander (1986)', 0.87688845),
 (502, 'Bananas (1971)', 0.8761338),
 (79, 'Fugitive, The (1993)', 0.874859),
 (1540, 'Amazing Panda Adventure, The (1995)', 0.8742083),
 (586, 'Terminal Velocity (1994)', 0.8717097),
 (1539, 'Being Human (1993)', 0.86765033)]

The similarities do not always make sense: the number of ratings is low and the embedding does not automatically capture semantic relationships in that context. Better representations arise with higher number of ratings, and less overfitting in models or maybe better loss function, such as those based on implicit feedback.

Visualizing embeddings using TSNE

  • we use scikit learn to visualize items embeddings
  • Try different perplexities, and visualize user embeddings as well
  • What can you conclude ?

In [47]:
from sklearn.manifold import TSNE

item_tsne = TSNE(perplexity=30).fit_transform(item_embeddings)

In [48]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.scatter(item_tsne[:, 0], item_tsne[:, 1]);
plt.xticks(()); plt.yticks(());
plt.show()



In [49]:
%pip install -q plotly


Note: you may need to restart the kernel to use updated packages.

In [50]:
import plotly.express as px

tsne_df = pd.DataFrame(item_tsne, columns=["tsne_1", "tsne_2"])
tsne_df["item_id"] = np.arange(item_tsne.shape[0])
tsne_df = tsne_df.merge(items.reset_index())

px.scatter(tsne_df, x="tsne_1", y="tsne_2",
           color="popularity",
           hover_data=["item_id", "title",
                       "release_year", "popularity"])



In [51]:
# %pip install umap-learn

In [52]:
# import umap

# item_umap = umap.UMAP().fit_transform(item_embeddings)
# plt.figure(figsize=(10, 10))
# plt.scatter(item_umap[:, 0], item_umap[:, 1]);
# plt.xticks(()); plt.yticks(());
# plt.show()

A Deep recommender model

Using a similar framework as previously, the following deep model described in the course was built (with only two fully connected)

To build this model we will need a new kind of layer:


In [53]:
from tensorflow.keras.layers import Concatenate

Exercise

  • The following code has 4 errors that prevent it from working correctly. Correct them and explain why they are critical.

In [54]:
class DeepRegressionModel(Model):

    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_user_id + 1,
                                        input_length=1,
                                        name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_item_id + 1,
                                        input_length=1,
                                        name='item_embedding')
        
        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        self.dropout = Dropout(0.99)
        self.dense1 = Dense(64, activation="relu")
        self.dense2 = Dense(2, activation="tanh")
        
    def call(self, inputs):
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        
        input_vecs = self.concat([user_vecs, item_vecs])
        
        y = self.dropout(input_vecs)
        y = self.dense1(y)
        y = self.dense2(y)
        
        return y
        
model = DeepRegressionModel(64, max_user_id, max_item_id)
model.compile(optimizer='adam', loss='binary_crossentropy')

initial_train_preds = model.predict([user_id_train, item_id_train])


WARNING:tensorflow:Large dropout rate: 0.99 (>0.5). In TensorFlow 2.x, dropout() uses dropout rate instead of keep_prob. Please ensure that this is intended.

In [55]:
# %load solutions/deep_explicit_feedback_recsys.py
# For each sample we input the integer identifiers
# of a single user and a single item
class DeepRegressionModel(Model):

    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()

        self.user_embedding = Embedding(
            output_dim=embedding_size,
            input_dim=max_user_id + 1,
            input_length=1,
            name='user_embedding'
        )
        self.item_embedding = Embedding(
            output_dim=embedding_size,
            input_dim=max_item_id + 1,
            input_length=1,
            name='item_embedding'
        )

        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()

        ## Error 1: Dropout was too high, preventing any training
        self.dropout = Dropout(0.5)
        self.dense1 = Dense(64, activation="relu")
        ## Error 2: output dimension was 2 where we predict only 1-d rating
        ## Error 3: tanh activation squashes the outputs between -1 and 1
        ## when we want to predict values between 1 and 5
        self.dense2 = Dense(1)

    def call(self, inputs, training=False):
        user_inputs = inputs[0]
        item_inputs = inputs[1]

        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))

        input_vecs = self.concat([user_vecs, item_vecs])

        y = self.dropout(input_vecs, training=training)
        y = self.dense1(y)
        y = self.dropout(y, training=training)
        y = self.dense2(y)

        return y


model = DeepRegressionModel(64, max_user_id, max_item_id)
## Error 4: A binary crossentropy loss is only useful for binary
## classification, while we are in regression (use mse or mae)
model.compile(optimizer='adam', loss='mae')

initial_train_preds = model.predict([user_id_train, item_id_train])

In [56]:
%%time
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)


Train on 72000 samples, validate on 8000 samples
Epoch 1/10
72000/72000 [==============================] - 3s 44us/sample - loss: 1.1222 - val_loss: 0.7812
Epoch 2/10
72000/72000 [==============================] - 3s 39us/sample - loss: 0.8731 - val_loss: 0.7547
Epoch 3/10
72000/72000 [==============================] - 3s 46us/sample - loss: 0.8405 - val_loss: 0.7531
Epoch 4/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.8139 - val_loss: 0.7481
Epoch 5/10
72000/72000 [==============================] - 3s 38us/sample - loss: 0.7947 - val_loss: 0.7511
Epoch 6/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.7770 - val_loss: 0.7476
Epoch 7/10
72000/72000 [==============================] - 3s 35us/sample - loss: 0.7648 - val_loss: 0.7422
Epoch 8/10
72000/72000 [==============================] - 3s 45us/sample - loss: 0.7553 - val_loss: 0.7397
Epoch 9/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.7453 - val_loss: 0.7371
Epoch 10/10
72000/72000 [==============================] - 3s 40us/sample - loss: 0.7368 - val_loss: 0.7364
CPU times: user 53.1 s, sys: 2.76 s, total: 55.9 s
Wall time: 30.2 s

In [57]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.title('Loss');



In [58]:
train_preds = model.predict([user_id_train, item_id_train])
print("Final train MSE: %0.3f" % mean_squared_error(train_preds, rating_train))
print("Final train MAE: %0.3f" % mean_absolute_error(train_preds, rating_train))


Final train MSE: 0.827
Final train MAE: 0.701

In [59]:
test_preds = model.predict([user_id_test, item_id_test])
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))


Final test MSE: 0.890
Final test MAE: 0.734

The performance of this model is not necessarily significantly better than the previous model but you can notice that the gap between train and test is lower, probably thanks to the use of dropout.

Furthermore this model is more flexible in the sense that we can extend it to include metadata for hybrid recsys as we will see in the following.

Home assignment:

  • Add another layer, compare train/test error.
  • Can you improve the test MAE?
  • Try adding more dropout and change layer sizes.

Manual tuning of so many hyperparameters is tedious. In practice it's better to automate the design of the model using an hyperparameter search tool such as:

Using item metadata in the model

Using a similar framework as previously, we will build another deep model that can also leverage additional metadata. The resulting system is therefore an Hybrid Recommender System that does both Collaborative Filtering and Content-based recommendations.


In [60]:
from sklearn.preprocessing import QuantileTransformer

meta_columns = ['popularity', 'release_year']

scaler = QuantileTransformer()
item_meta_train = scaler.fit_transform(ratings_train[meta_columns])
item_meta_test = scaler.transform(ratings_test[meta_columns])

In [61]:
class HybridModel(Model):

    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_user_id + 1,
                                        input_length=1,
                                        name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_item_id + 1,
                                        input_length=1,
                                        name='item_embedding')
        
        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        self.dense1 = Dense(64, activation="relu")
        self.dropout = Dropout(0.3)
        self.dense2 = Dense(64, activation='relu')
        self.dense3 = Dense(1)
        
    def call(self, inputs, training=False):
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        meta_inputs = inputs[2]

        user_vecs = self.flatten(self.user_embedding(user_inputs))
        user_vecs = self.dropout(user_vecs, training=training)

        item_vecs = self.flatten(self.item_embedding(item_inputs))
        item_vecs = self.dropout(item_vecs, training=training)

        input_vecs = self.concat([user_vecs, item_vecs, meta_inputs])

        y = self.dense1(input_vecs)
        y = self.dropout(y, training=training)
        y = self.dense2(y)
        y = self.dropout(y, training=training)
        y = self.dense3(y)
        return y
        
model = HybridModel(64, max_user_id, max_item_id)
model.compile(optimizer='adam', loss='mae')

initial_train_preds = model.predict([user_id_train,
                                     item_id_train,
                                     item_meta_train])

In [62]:
%%time
history = model.fit([user_id_train, item_id_train, item_meta_train],
                    rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)


Train on 72000 samples, validate on 8000 samples
Epoch 1/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.9812 - val_loss: 0.7589
Epoch 2/10
72000/72000 [==============================] - 3s 43us/sample - loss: 0.8295 - val_loss: 0.7495
Epoch 3/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.8024 - val_loss: 0.7473
Epoch 4/10
72000/72000 [==============================] - 3s 43us/sample - loss: 0.7868 - val_loss: 0.7394
Epoch 5/10
72000/72000 [==============================] - 3s 45us/sample - loss: 0.7681 - val_loss: 0.7388
Epoch 6/10
72000/72000 [==============================] - 3s 46us/sample - loss: 0.7555 - val_loss: 0.7310
Epoch 7/10
72000/72000 [==============================] - 3s 47us/sample - loss: 0.7431 - val_loss: 0.7291
Epoch 8/10
72000/72000 [==============================] - 3s 49us/sample - loss: 0.7336 - val_loss: 0.7226
Epoch 9/10
72000/72000 [==============================] - 4s 59us/sample - loss: 0.7234 - val_loss: 0.7214
Epoch 10/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.7149 - val_loss: 0.7178
CPU times: user 1min 1s, sys: 3.36 s, total: 1min 4s
Wall time: 34.4 s

In [63]:
test_preds = model.predict([user_id_test, item_id_test, item_meta_test])
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))


Final test MSE: 0.863
Final test MAE: 0.718

The additional metadata seems to improve the predictive power of the model a bit but this should be re-run several times to see the impact of the random initialization of the model.

A recommendation function for a given user

Once the model is trained, the system can be used to recommend a few items for a user, that he/she hasn't already seen:

  • we use the model.predict to compute the ratings a user would have given to all items
  • we build a reco function that sorts these items and exclude those the user has already seen

In [64]:
def recommend(user_id, top_n=10):
    item_ids = range(1, max_item_id)
    seen_mask = all_ratings["user_id"] == user_id
    seen_movies = set(all_ratings[seen_mask]["item_id"])
    item_ids = list(filter(lambda x: x not in seen_movies, item_ids))

    print("User %d has seen %d movies, including:" % (user_id, len(seen_movies)))
    for title in all_ratings[seen_mask].nlargest(20, 'popularity')['title']:
        print("   ", title)
    print("Computing ratings for %d other movies:" % len(item_ids))
    
    item_ids = np.array(item_ids)
    user_ids = np.zeros_like(item_ids)
    user_ids[:] = user_id
    items_meta = scaler.transform(indexed_items[meta_columns].loc[item_ids])
    
    rating_preds = model.predict([user_ids, item_ids, items_meta])
    
    item_ids = np.argsort(rating_preds[:, 0])[::-1].tolist()
    rec_items = item_ids[:top_n]
    return [(items["title"][movie], rating_preds[movie][0])
            for movie in rec_items]

In [65]:
for title, pred_rating in recommend(5):
    print("    %0.1f: %s" % (pred_rating, title))


User 5 has seen 175 movies, including:
    Star Wars (1977)
    Fargo (1996)
    Return of the Jedi (1983)
    Toy Story (1995)
    Independence Day (ID4) (1996)
    Raiders of the Lost Ark (1981)
    Silence of the Lambs, The (1991)
    Empire Strikes Back, The (1980)
    Star Trek: First Contact (1996)
    Back to the Future (1985)
    Mission: Impossible (1996)
    Fugitive, The (1993)
    Indiana Jones and the Last Crusade (1989)
    Willy Wonka and the Chocolate Factory (1971)
    Princess Bride, The (1987)
    Forrest Gump (1994)
    Monty Python and the Holy Grail (1974)
    Men in Black (1997)
    E.T. the Extra-Terrestrial (1982)
    Birdcage, The (1996)
Computing ratings for 1506 other movies:
    4.5: Boys of St. Vincent, The (1993)
    4.5: Richard III (1995)
    4.4: Robocop 3 (1993)
    4.4: August (1996)
    4.3: Madness of King George, The (1994)
    4.3: Raising Arizona (1987)
    4.3: Romy and Michele's High School Reunion (1997)
    4.2: Boogie Nights (1997)
    4.2: Under Siege (1992)
    4.2: Raging Bull (1980)

Home assignment: Predicting ratings as a classification problem

In this dataset, the ratings all belong to a finite set of possible values:


In [66]:
import numpy as np

np.unique(rating_train)


Out[66]:
array([1, 2, 3, 4, 5])

Maybe we can help the model by forcing it to predict those values by treating the problem as a multiclassification problem. The only required changes are:

  • setting the final layer to output class membership probabities using a softmax activation with 5 outputs;
  • optimize the categorical cross-entropy classification loss instead of a regression loss such as MSE or MAE.

In [68]:
# %load solutions/classification.py
class ClassificationModel(Model):
    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()

        self.user_embedding = Embedding(output_dim=embedding_size, input_dim=max_user_id + 1,
                                        input_length=1, name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size, input_dim=max_item_id + 1,
                                        input_length=1, name='item_embedding')

        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()

        self.dropout1 = Dropout(0.5)
        self.dense1 = Dense(128, activation="relu")
        self.dropout2 = Dropout(0.2)
        self.dense2 = Dense(128, activation='relu')
        self.dense3 = Dense(5, activation="softmax")

    def call(self, inputs):
        user_inputs = inputs[0]
        item_inputs = inputs[1]

        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))

        input_vecs = self.concat([user_vecs, item_vecs])

        y = self.dropout1(input_vecs)
        y = self.dense1(y)
        y = self.dropout2(y)
        y = self.dense2(y)
        y = self.dense3(y)

        return y

model = ClassificationModel(16, max_user_id, max_item_id)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

initial_train_preds = model.predict([user_id_train, item_id_train]).argmax(axis=1) + 1
print("Random init MSE: %0.3f" % mean_squared_error(initial_train_preds, rating_train))
print("Random init MAE: %0.3f" % mean_absolute_error(initial_train_preds, rating_train))

history = model.fit([user_id_train, item_id_train], rating_train - 1,
                    batch_size=64, epochs=15, validation_split=0.1,
                    shuffle=True)

plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.title('loss');

test_preds = model.predict([user_id_test, item_id_test]).argmax(axis=1) + 1
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))


Random init MSE: 1.989
Random init MAE: 1.071
Train on 72000 samples, validate on 8000 samples
Epoch 1/15
72000/72000 [==============================] - 4s 52us/sample - loss: 1.3698 - val_loss: 1.2793
Epoch 2/15
72000/72000 [==============================] - 2s 34us/sample - loss: 1.2840 - val_loss: 1.2622
Epoch 3/15
72000/72000 [==============================] - 3s 35us/sample - loss: 1.2620 - val_loss: 1.2535
Epoch 4/15
72000/72000 [==============================] - 2s 33us/sample - loss: 1.2513 - val_loss: 1.2493
Epoch 5/15
72000/72000 [==============================] - 2s 31us/sample - loss: 1.2453 - val_loss: 1.2440
Epoch 6/15
72000/72000 [==============================] - 2s 31us/sample - loss: 1.2385 - val_loss: 1.2445
Epoch 7/15
72000/72000 [==============================] - 2s 30us/sample - loss: 1.2334 - val_loss: 1.2424
Epoch 8/15
72000/72000 [==============================] - 2s 30us/sample - loss: 1.2298 - val_loss: 1.2418
Epoch 9/15
72000/72000 [==============================] - 2s 31us/sample - loss: 1.2275 - val_loss: 1.2392
Epoch 10/15
72000/72000 [==============================] - 3s 40us/sample - loss: 1.2239 - val_loss: 1.2397
Epoch 11/15
72000/72000 [==============================] - 3s 43us/sample - loss: 1.2189 - val_loss: 1.2356
Epoch 12/15
72000/72000 [==============================] - 3s 41us/sample - loss: 1.2183 - val_loss: 1.2395
Epoch 13/15
72000/72000 [==============================] - 3s 48us/sample - loss: 1.2165 - val_loss: 1.2372
Epoch 14/15
72000/72000 [==============================] - 3s 40us/sample - loss: 1.2147 - val_loss: 1.2384
Epoch 15/15
72000/72000 [==============================] - 3s 42us/sample - loss: 1.2114 - val_loss: 1.2377
Final test MSE: 1.144
Final test MAE: 0.717

In [ ]: