Explicit Feedback Neural Recommender Systems

Goals:

Understand recommender data
Build different models architectures using Keras
Retrieve Embeddings and visualize them
Add metadata information as input to the model



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os.path as op

from zipfile import ZipFile
try:
    from urllib.request import urlretrieve
except ImportError:  # Python 2 compat
    from urllib import urlretrieve


ML_100K_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
ML_100K_FILENAME = ML_100K_URL.rsplit('/', 1)[1]
ML_100K_FOLDER = 'ml-100k'

if not op.exists(ML_100K_FILENAME):
    print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))
    urlretrieve(ML_100K_URL, ML_100K_FILENAME)

if not op.exists(ML_100K_FOLDER):
    print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))
    ZipFile(ML_100K_FILENAME).extractall('.')

Ratings file

Each line contains a rated movie:

a user
an item
a rating from 1 to 5 stars



In [2]:

    
import pandas as pd

raw_ratings = pd.read_csv(op.join(ML_100K_FOLDER, 'u.data'), sep='\t',
                      names=["user_id", "item_id", "rating", "timestamp"])
raw_ratings.head()

Item metadata file

The item metadata file contains metadata like the name of the movie or the date it was released. The movies file contains columns indicating the movie's genres. Let's only load the first five columns of the file with usecols.



In [3]:

    
m_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
items = pd.read_csv(op.join(ML_100K_FOLDER, 'u.item'), sep='|',
                    names=m_cols, usecols=range(5), encoding='latin-1')
items.head()









    Out[3]:







  
    
      
      item_id
      title
      release_date
      video_release_date
      imdb_url
    
  
  
    
      0
      1
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
    
    
      1
      2
      GoldenEye (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?GoldenEye%20(...
    
    
      2
      3
      Four Rooms (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Four%20Rooms%...
    
    
      3
      4
      Get Shorty (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Get%20Shorty%...
    
    
      4
      5
      Copycat (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Copycat%20(1995)

Let's write a bit of Python preprocessing code to extract the release year as an integer value:



In [4]:

    
def extract_year(release_date):
    if hasattr(release_date, 'split'):
        components = release_date.split('-')
        if len(components) == 3:
            return int(components[2])
    # Missing value marker
    return 1920


items['release_year'] = items['release_date'].map(extract_year)
items.hist('release_year', bins=50);

Enrich the raw ratings data with the collected items metadata:



In [5]:

    
all_ratings = pd.merge(items, raw_ratings)



In [6]:

    
all_ratings.head()









    Out[6]:







  
    
      
      item_id
      title
      release_date
      video_release_date
      imdb_url
      release_year
      user_id
      rating
      timestamp
    
  
  
    
      0
      1
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      308
      4
      887736532
    
    
      1
      1
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      287
      5
      875334088
    
    
      2
      1
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      148
      4
      877019411
    
    
      3
      1
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      280
      4
      891700426
    
    
      4
      1
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      66
      3
      883601324

Data preprocessing

To understand well the distribution of the data, the following statistics are computed:

the number of users
the number of items
the rating distribution
the popularity of each movie



In [7]:

    
min_user_id = all_ratings['user_id'].min()
min_user_id









    Out[7]:





1



In [8]:

    
max_user_id = all_ratings['user_id'].max()
max_user_id









    Out[8]:





943



In [9]:

    
min_item_id = all_ratings['item_id'].min()
min_item_id









    Out[9]:





1



In [10]:

    
max_item_id = all_ratings['item_id'].max()
max_item_id









    Out[10]:





1682



In [11]:

    
all_ratings['rating'].describe()









    Out[11]:





count    100000.000000
mean          3.529860
std           1.125674
min           1.000000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

Let's do a bit more pandas magic compute the popularity of each movie (number of ratings):



In [12]:

    
popularity = all_ratings.groupby('item_id').size().reset_index(name='popularity')
items = pd.merge(popularity, items)
items.nlargest(10, 'popularity')









    Out[12]:







  
    
      
      item_id
      popularity
      title
      release_date
      video_release_date
      imdb_url
      release_year
    
  
  
    
      49
      50
      583
      Star Wars (1977)
      01-Jan-1977
      NaN
      http://us.imdb.com/M/title-exact?Star%20Wars%2...
      1977
    
    
      257
      258
      509
      Contact (1997)
      11-Jul-1997
      NaN
      http://us.imdb.com/Title?Contact+(1997/I)
      1997
    
    
      99
      100
      508
      Fargo (1996)
      14-Feb-1997
      NaN
      http://us.imdb.com/M/title-exact?Fargo%20(1996)
      1997
    
    
      180
      181
      507
      Return of the Jedi (1983)
      14-Mar-1997
      NaN
      http://us.imdb.com/M/title-exact?Return%20of%2...
      1997
    
    
      293
      294
      485
      Liar Liar (1997)
      21-Mar-1997
      NaN
      http://us.imdb.com/Title?Liar+Liar+(1997)
      1997
    
    
      285
      286
      481
      English Patient, The (1996)
      15-Nov-1996
      NaN
      http://us.imdb.com/M/title-exact?English%20Pat...
      1996
    
    
      287
      288
      478
      Scream (1996)
      20-Dec-1996
      NaN
      http://us.imdb.com/M/title-exact?Scream%20(1996)
      1996
    
    
      0
      1
      452
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
    
    
      299
      300
      431
      Air Force One (1997)
      01-Jan-1997
      NaN
      http://us.imdb.com/M/title-exact?Air+Force+One...
      1997
    
    
      120
      121
      429
      Independence Day (ID4) (1996)
      03-Jul-1996
      NaN
      http://us.imdb.com/M/title-exact?Independence%...
      1996



In [13]:

    
items["title"][181]









    Out[13]:





'GoodFellas (1990)'



In [14]:

    
indexed_items = items.set_index('item_id')
indexed_items["title"][181]









    Out[14]:





'Return of the Jedi (1983)'



In [15]:

    
all_ratings = pd.merge(popularity, all_ratings)
all_ratings.describe()









    Out[15]:







  
    
      
      item_id
      popularity
      video_release_date
      release_year
      user_id
      rating
      timestamp
    
  
  
    
      count
      100000.000000
      100000.000000
      0.0
      100000.000000
      100000.00000
      100000.000000
      1.000000e+05
    
    
      mean
      425.530130
      168.071900
      NaN
      1987.950100
      462.48475
      3.529860
      8.835289e+08
    
    
      std
      330.798356
      121.784558
      NaN
      14.169558
      266.61442
      1.125674
      5.343856e+06
    
    
      min
      1.000000
      1.000000
      NaN
      1920.000000
      1.00000
      1.000000
      8.747247e+08
    
    
      25%
      175.000000
      71.000000
      NaN
      1986.000000
      254.00000
      3.000000
      8.794487e+08
    
    
      50%
      322.000000
      145.000000
      NaN
      1994.000000
      447.00000
      4.000000
      8.828269e+08
    
    
      75%
      631.000000
      239.000000
      NaN
      1996.000000
      682.00000
      4.000000
      8.882600e+08
    
    
      max
      1682.000000
      583.000000
      NaN
      1998.000000
      943.00000
      5.000000
      8.932866e+08



In [16]:

    
all_ratings.head()









    Out[16]:







  
    
      
      item_id
      popularity
      title
      release_date
      video_release_date
      imdb_url
      release_year
      user_id
      rating
      timestamp
    
  
  
    
      0
      1
      452
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      308
      4
      887736532
    
    
      1
      1
      452
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      287
      5
      875334088
    
    
      2
      1
      452
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      148
      4
      877019411
    
    
      3
      1
      452
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      280
      4
      891700426
    
    
      4
      1
      452
      Toy Story (1995)
      01-Jan-1995
      NaN
      http://us.imdb.com/M/title-exact?Toy%20Story%2...
      1995
      66
      3
      883601324

Later in the analysis we will assume that this popularity does not come from the ratings themselves but from an external metadata, e.g. box office numbers in the month after the release in movie theaters.

Let's split the enriched data in a train / test split to make it possible to do predictive modeling:



In [17]:

    
from sklearn.model_selection import train_test_split

ratings_train, ratings_test = train_test_split(
    all_ratings, test_size=0.2, random_state=0)

user_id_train = np.array(ratings_train['user_id'])
item_id_train = np.array(ratings_train['item_id'])
rating_train = np.array(ratings_train['rating'])

user_id_test = np.array(ratings_test['user_id'])
item_id_test = np.array(ratings_test['item_id'])
rating_test = np.array(ratings_test['rating'])

Explicit feedback: supervised ratings prediction

For each pair of (user, item) try to predict the rating the user would give to the item.

This is the classical setup for building recommender systems from offline data with explicit supervision signal.

Predictive ratings as a regression problem

The following code implements the following architecture:



In [18]:

    
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.layers import Dot
from tensorflow.keras.models import Model



In [19]:

    
# For each sample we input the integer identifiers
# of a single user and a single item
class RegressionModel(Model):
    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_user_id + 1,
                                        input_length=1,
                                        name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_item_id + 1,
                                        input_length=1,
                                        name='item_embedding')
        
        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.dot = Dot(axes=1)
        
    def call(self, inputs):
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        
        y = self.dot([user_vecs, item_vecs])
        return y


model = RegressionModel(64, max_user_id, max_item_id)
model.compile(optimizer="adam", loss='mae')



In [20]:

    
# Useful for debugging the output shape of model
initial_train_preds = model.predict([user_id_train, item_id_train])
initial_train_preds.shape









    Out[20]:





(80000, 1)

Model error

Using initial_train_preds, compute the model errors:

mean absolute error
mean squared error

Converting a pandas Series to numpy array is usually implicit, but you may use rating_train.values to do so explicitly. Be sure to monitor the shapes of each object you deal with by using object.shape.



In [23]:

    
# %load solutions/compute_errors.py
squared_differences = np.square(initial_train_preds[:,0] - rating_train)
absolute_differences = np.abs(initial_train_preds[:,0] - rating_train)

print("Random init MSE: %0.3f" % np.mean(squared_differences))
print("Random init MAE: %0.3f" % np.mean(absolute_differences))

# You may also use sklearn metrics to do so using scikit-learn:

from sklearn.metrics import mean_absolute_error, mean_squared_error

print("Random init MSE: %0.3f" % mean_squared_error(initial_train_preds, rating_train))
print("Random init MAE: %0.3f" % mean_absolute_error(initial_train_preds, rating_train))









    



Random init MSE: 13.720
Random init MAE: 3.529
Random init MSE: 13.720
Random init MAE: 3.529

Monitoring runs

Keras enables to monitor various variables during training.

history.history returned by the model.fit function is a dictionary containing the 'loss' and validation loss 'val_loss' after each epoch



In [24]:

    
%%time

# Training the model
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)









    



Train on 72000 samples, validate on 8000 samples
Epoch 1/10
72000/72000 [==============================] - 3s 44us/sample - loss: 2.6014 - val_loss: 1.0323
Epoch 2/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.8467 - val_loss: 0.7943
Epoch 3/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.7600 - val_loss: 0.7668
Epoch 4/10
72000/72000 [==============================] - 2s 31us/sample - loss: 0.7403 - val_loss: 0.7594
Epoch 5/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.7263 - val_loss: 0.7569
Epoch 6/10
72000/72000 [==============================] - 2s 30us/sample - loss: 0.7092 - val_loss: 0.7503
Epoch 7/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.6888 - val_loss: 0.7447
Epoch 8/10
72000/72000 [==============================] - 2s 32us/sample - loss: 0.6669 - val_loss: 0.7383
Epoch 9/10
72000/72000 [==============================] - 2s 32us/sample - loss: 0.6421 - val_loss: 0.7389
Epoch 10/10
72000/72000 [==============================] - 2s 33us/sample - loss: 0.6163 - val_loss: 0.7362
CPU times: user 43.9 s, sys: 2.12 s, total: 46 s
Wall time: 24.4 s



In [25]:

    
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.title('Loss');

Questions:

Why is the train loss higher than the first loss in the first few epochs?
Why is Keras not computing the train loss on the full training set at the end of each epoch as it does on the validation set?

Now that the model is trained, the model MSE and MAE look nicer:



In [26]:

    
def plot_predictions(y_true, y_pred):
    plt.figure(figsize=(4, 4))
    plt.xlim(-1, 6)
    plt.xlabel("True rating")
    plt.ylim(-1, 6)
    plt.xlabel("Predicted rating")
    plt.scatter(y_true, y_pred, s=60, alpha=0.01)



In [27]:

    
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

test_preds = model.predict([user_id_test, item_id_test])
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))
plot_predictions(rating_test, test_preds)









    



Final test MSE: 0.902
Final test MAE: 0.733



In [28]:

    
train_preds = model.predict([user_id_train, item_id_train])
print("Final train MSE: %0.3f" % mean_squared_error(train_preds, rating_train))
print("Final train MAE: %0.3f" % mean_absolute_error(train_preds, rating_train))
plot_predictions(rating_train, train_preds)









    



Final train MSE: 0.652
Final train MAE: 0.594

Model Embeddings

It is possible to retrieve the embeddings by simply using the Keras function model.get_weights which returns all the model learnable parameters.
The weights are returned the same order as they were build in the model
What is the total number of parameters?



In [29]:

    
# weights and shape
weights = model.get_weights()
[w.shape for w in weights]









    Out[29]:





[(944, 64), (1683, 64)]



In [30]:

    
# Solution: 
# model.summary()



In [31]:

    
user_embeddings = weights[0]
item_embeddings = weights[1]



In [32]:

    
item_id = 181
print(f"Title for item_id={item_id}: {indexed_items['title'][item_id]}")









    



Title for item_id=181: Return of the Jedi (1983)



In [33]:

    
print(f"Embedding vector for item_id={item_id}")
print(item_embeddings[item_id])
print("shape:", item_embeddings[item_id].shape)









    



Embedding vector for item_id=181
[ 0.48151043  0.36075953 -0.5152434  -0.23560835  0.02258038 -0.35015777
 -0.39849016  0.3608066   0.25714108 -0.16833186  0.4536687  -0.23835418
  0.22830537 -0.38697934 -0.39666864  0.08538622 -0.40700766 -0.45461214
  0.29057175  0.4556055  -0.29594952  0.0113386   0.40053546  0.20864698
 -0.28388047  0.06081908 -0.07695611  0.4026974  -0.4370351   0.10940211
  0.38689417 -0.27938834  0.21300362  0.16662993 -0.38443014 -0.4884122
  0.4386094  -0.42546353  0.1434935   0.00376552 -0.3089447  -0.09152184
  0.13884781 -0.47453746  0.35158294 -0.3211391   0.03171986 -0.41499725
 -0.15330398 -0.28007802 -0.3319683   0.54434323 -0.3486536  -0.08418526
 -0.24189633  0.08783964  0.05250944  0.31410423  0.34306523  0.18519272
 -0.55053395 -0.4381623   0.41965964 -0.09234481]
shape: (64,)

Finding most similar items

Finding k most similar items to a point in embedding space

Write in numpy a function to compute the cosine similarity between two points in embedding space.
Test it on the following cells to check the similarities between popular movies.
Bonus: try to generalize the function to compute the similarities between one movie and all the others and return the most related movies.

Notes:

you may use np.linalg.norm to compute the norm of vector, and you may specify the axis=
the numpy function np.argsort(...) enables to compute the sorted indices of a vector
items["name"][idxs] returns the names of the items indexed by array idxs



In [36]:

    
EPSILON = 1e-07  # to avoid division by 0.


def cosine(x, y):
    # TODO: implement me!
    return 0.



In [37]:

    
# %load solutions/similarity.py
EPSILON = 1e-07


def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)



In [38]:

    
def print_similarity(item_a, item_b, item_embeddings, titles):
    print(titles[item_a])
    print(titles[item_b])
    similarity = cosine(item_embeddings[item_a],
                        item_embeddings[item_b])
    print(f"Cosine similarity: {similarity:.3}")
    
print_similarity(50, 181, item_embeddings, indexed_items["title"])









    



Star Wars (1977)
Return of the Jedi (1983)
Cosine similarity: 0.932



In [39]:

    
print_similarity(181, 288, item_embeddings, indexed_items["title"])









    



Return of the Jedi (1983)
Scream (1996)
Cosine similarity: 0.764



In [40]:

    
print_similarity(181, 1, item_embeddings, indexed_items["title"])









    



Return of the Jedi (1983)
Toy Story (1995)
Cosine similarity: 0.833



In [41]:

    
print_similarity(181, 181, item_embeddings, indexed_items["title"])









    



Return of the Jedi (1983)
Return of the Jedi (1983)
Cosine similarity: 1.0



In [42]:

    
def cosine_similarities(item_id, item_embeddings):
    """Compute similarities between item_id and all items embeddings"""
    query_vector = item_embeddings[item_id]
    dot_products = item_embeddings @ query_vector

    query_vector_norm = np.linalg.norm(query_vector)
    all_item_norms = np.linalg.norm(item_embeddings, axis=1)
    norm_products = query_vector_norm * all_item_norms
    return dot_products / (norm_products + EPSILON)


similarities = cosine_similarities(181, item_embeddings)
similarities









    Out[42]:





array([0.12287419, 0.8332738 , 0.71978253, ..., 0.7245761 , 0.7401352 ,
       0.72493505], dtype=float32)



In [43]:

    
plt.hist(similarities, bins=30);



In [44]:

    
def most_similar(item_id, item_embeddings, titles,
                 top_n=30):
    sims = cosine_similarities(item_id, item_embeddings)
    # [::-1] makes it possible to reverse the order of a numpy
    # array, this is required because most similar items have
    # a larger cosine similarity value
    sorted_indexes = np.argsort(sims)[::-1]
    idxs = sorted_indexes[0:top_n]
    return list(zip(idxs, titles[idxs], sims[idxs]))


most_similar(50, item_embeddings, indexed_items["title"], top_n=10)









    Out[44]:





[(50, 'Star Wars (1977)', 0.99999994),
 (181, 'Return of the Jedi (1983)', 0.9318021),
 (172, 'Empire Strikes Back, The (1980)', 0.9285119),
 (1550, 'Destiny Turns on the Radio (1995)', 0.9011149),
 (1586, 'Lashou shentan (1992)', 0.8954362),
 (174, 'Raiders of the Lost Ark (1981)', 0.8927318),
 (1554, 'Safe Passage (1994)', 0.890608),
 (186, 'Blues Brothers, The (1980)', 0.88655555),
 (96, 'Terminator 2: Judgment Day (1991)', 0.87996095),
 (1582, 'T-Men (1947)', 0.8792745)]



In [45]:

    
# items[items['title'].str.contains("Star Trek")]



In [46]:

    
most_similar(227, item_embeddings, indexed_items["title"], top_n=10)









    Out[46]:





[(227, 'Star Trek VI: The Undiscovered Country (1991)', 1.0000001),
 (228, 'Star Trek: The Wrath of Khan (1982)', 0.93749136),
 (1076, 'Pagemaster, The (1994)', 0.90108544),
 (230, 'Star Trek IV: The Voyage Home (1986)', 0.8826339),
 (431, 'Highlander (1986)', 0.87688845),
 (502, 'Bananas (1971)', 0.8761338),
 (79, 'Fugitive, The (1993)', 0.874859),
 (1540, 'Amazing Panda Adventure, The (1995)', 0.8742083),
 (586, 'Terminal Velocity (1994)', 0.8717097),
 (1539, 'Being Human (1993)', 0.86765033)]

The similarities do not always make sense: the number of ratings is low and the embedding does not automatically capture semantic relationships in that context. Better representations arise with higher number of ratings, and less overfitting in models or maybe better loss function, such as those based on implicit feedback.

Visualizing embeddings using TSNE

we use scikit learn to visualize items embeddings
Try different perplexities, and visualize user embeddings as well
What can you conclude ?



In [47]:

    
from sklearn.manifold import TSNE

item_tsne = TSNE(perplexity=30).fit_transform(item_embeddings)



In [48]:

    
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.scatter(item_tsne[:, 0], item_tsne[:, 1]);
plt.xticks(()); plt.yticks(());
plt.show()



In [49]:

    
%pip install -q plotly









    



Note: you may need to restart the kernel to use updated packages.



In [50]:

    
import plotly.express as px

tsne_df = pd.DataFrame(item_tsne, columns=["tsne_1", "tsne_2"])
tsne_df["item_id"] = np.arange(item_tsne.shape[0])
tsne_df = tsne_df.merge(items.reset_index())

px.scatter(tsne_df, x="tsne_1", y="tsne_2",
           color="popularity",
           hover_data=["item_id", "title",
                       "release_year", "popularity"])

Alternatively with Uniform Manifold Approximation and Projection:



In [51]:

    
# %pip install umap-learn



In [52]:

    
# import umap

# item_umap = umap.UMAP().fit_transform(item_embeddings)
# plt.figure(figsize=(10, 10))
# plt.scatter(item_umap[:, 0], item_umap[:, 1]);
# plt.xticks(()); plt.yticks(());
# plt.show()

A Deep recommender model

Using a similar framework as previously, the following deep model described in the course was built (with only two fully connected)

To build this model we will need a new kind of layer:



In [53]:

    
from tensorflow.keras.layers import Concatenate

Exercise

The following code has 4 errors that prevent it from working correctly. Correct them and explain why they are critical.



In [54]:

    
class DeepRegressionModel(Model):

    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_user_id + 1,
                                        input_length=1,
                                        name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_item_id + 1,
                                        input_length=1,
                                        name='item_embedding')
        
        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        self.dropout = Dropout(0.99)
        self.dense1 = Dense(64, activation="relu")
        self.dense2 = Dense(2, activation="tanh")
        
    def call(self, inputs):
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        
        input_vecs = self.concat([user_vecs, item_vecs])
        
        y = self.dropout(input_vecs)
        y = self.dense1(y)
        y = self.dense2(y)
        
        return y
        
model = DeepRegressionModel(64, max_user_id, max_item_id)
model.compile(optimizer='adam', loss='binary_crossentropy')

initial_train_preds = model.predict([user_id_train, item_id_train])









    



WARNING:tensorflow:Large dropout rate: 0.99 (>0.5). In TensorFlow 2.x, dropout() uses dropout rate instead of keep_prob. Please ensure that this is intended.



In [55]:

    
# %load solutions/deep_explicit_feedback_recsys.py
# For each sample we input the integer identifiers
# of a single user and a single item
class DeepRegressionModel(Model):

    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()

        self.user_embedding = Embedding(
            output_dim=embedding_size,
            input_dim=max_user_id + 1,
            input_length=1,
            name='user_embedding'
        )
        self.item_embedding = Embedding(
            output_dim=embedding_size,
            input_dim=max_item_id + 1,
            input_length=1,
            name='item_embedding'
        )

        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()

        ## Error 1: Dropout was too high, preventing any training
        self.dropout = Dropout(0.5)
        self.dense1 = Dense(64, activation="relu")
        ## Error 2: output dimension was 2 where we predict only 1-d rating
        ## Error 3: tanh activation squashes the outputs between -1 and 1
        ## when we want to predict values between 1 and 5
        self.dense2 = Dense(1)

    def call(self, inputs, training=False):
        user_inputs = inputs[0]
        item_inputs = inputs[1]

        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))

        input_vecs = self.concat([user_vecs, item_vecs])

        y = self.dropout(input_vecs, training=training)
        y = self.dense1(y)
        y = self.dropout(y, training=training)
        y = self.dense2(y)

        return y


model = DeepRegressionModel(64, max_user_id, max_item_id)
## Error 4: A binary crossentropy loss is only useful for binary
## classification, while we are in regression (use mse or mae)
model.compile(optimizer='adam', loss='mae')

initial_train_preds = model.predict([user_id_train, item_id_train])



In [56]:

    
%%time
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)









    



Train on 72000 samples, validate on 8000 samples
Epoch 1/10
72000/72000 [==============================] - 3s 44us/sample - loss: 1.1222 - val_loss: 0.7812
Epoch 2/10
72000/72000 [==============================] - 3s 39us/sample - loss: 0.8731 - val_loss: 0.7547
Epoch 3/10
72000/72000 [==============================] - 3s 46us/sample - loss: 0.8405 - val_loss: 0.7531
Epoch 4/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.8139 - val_loss: 0.7481
Epoch 5/10
72000/72000 [==============================] - 3s 38us/sample - loss: 0.7947 - val_loss: 0.7511
Epoch 6/10
72000/72000 [==============================] - 2s 34us/sample - loss: 0.7770 - val_loss: 0.7476
Epoch 7/10
72000/72000 [==============================] - 3s 35us/sample - loss: 0.7648 - val_loss: 0.7422
Epoch 8/10
72000/72000 [==============================] - 3s 45us/sample - loss: 0.7553 - val_loss: 0.7397
Epoch 9/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.7453 - val_loss: 0.7371
Epoch 10/10
72000/72000 [==============================] - 3s 40us/sample - loss: 0.7368 - val_loss: 0.7364
CPU times: user 53.1 s, sys: 2.76 s, total: 55.9 s
Wall time: 30.2 s



In [57]:

    
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.title('Loss');



In [58]:

    
train_preds = model.predict([user_id_train, item_id_train])
print("Final train MSE: %0.3f" % mean_squared_error(train_preds, rating_train))
print("Final train MAE: %0.3f" % mean_absolute_error(train_preds, rating_train))









    



Final train MSE: 0.827
Final train MAE: 0.701



In [59]:

    
test_preds = model.predict([user_id_test, item_id_test])
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))









    



Final test MSE: 0.890
Final test MAE: 0.734

The performance of this model is not necessarily significantly better than the previous model but you can notice that the gap between train and test is lower, probably thanks to the use of dropout.

Furthermore this model is more flexible in the sense that we can extend it to include metadata for hybrid recsys as we will see in the following.

Home assignment:

Add another layer, compare train/test error.
Can you improve the test MAE?
Try adding more dropout and change layer sizes.

Manual tuning of so many hyperparameters is tedious. In practice it's better to automate the design of the model using an hyperparameter search tool such as:

https://keras-team.github.io/keras-tuner/ (Keras specific)
https://optuna.org/ (any machine learning framework, Keras included)

Using item metadata in the model

Using a similar framework as previously, we will build another deep model that can also leverage additional metadata. The resulting system is therefore an Hybrid Recommender System that does both Collaborative Filtering and Content-based recommendations.



In [60]:

    
from sklearn.preprocessing import QuantileTransformer

meta_columns = ['popularity', 'release_year']

scaler = QuantileTransformer()
item_meta_train = scaler.fit_transform(ratings_train[meta_columns])
item_meta_test = scaler.transform(ratings_test[meta_columns])



In [61]:

    
class HybridModel(Model):

    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_user_id + 1,
                                        input_length=1,
                                        name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                        input_dim=max_item_id + 1,
                                        input_length=1,
                                        name='item_embedding')
        
        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        self.dense1 = Dense(64, activation="relu")
        self.dropout = Dropout(0.3)
        self.dense2 = Dense(64, activation='relu')
        self.dense3 = Dense(1)
        
    def call(self, inputs, training=False):
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        meta_inputs = inputs[2]

        user_vecs = self.flatten(self.user_embedding(user_inputs))
        user_vecs = self.dropout(user_vecs, training=training)

        item_vecs = self.flatten(self.item_embedding(item_inputs))
        item_vecs = self.dropout(item_vecs, training=training)

        input_vecs = self.concat([user_vecs, item_vecs, meta_inputs])

        y = self.dense1(input_vecs)
        y = self.dropout(y, training=training)
        y = self.dense2(y)
        y = self.dropout(y, training=training)
        y = self.dense3(y)
        return y
        
model = HybridModel(64, max_user_id, max_item_id)
model.compile(optimizer='adam', loss='mae')

initial_train_preds = model.predict([user_id_train,
                                     item_id_train,
                                     item_meta_train])



In [62]:

    
%%time
history = model.fit([user_id_train, item_id_train, item_meta_train],
                    rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)









    



Train on 72000 samples, validate on 8000 samples
Epoch 1/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.9812 - val_loss: 0.7589
Epoch 2/10
72000/72000 [==============================] - 3s 43us/sample - loss: 0.8295 - val_loss: 0.7495
Epoch 3/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.8024 - val_loss: 0.7473
Epoch 4/10
72000/72000 [==============================] - 3s 43us/sample - loss: 0.7868 - val_loss: 0.7394
Epoch 5/10
72000/72000 [==============================] - 3s 45us/sample - loss: 0.7681 - val_loss: 0.7388
Epoch 6/10
72000/72000 [==============================] - 3s 46us/sample - loss: 0.7555 - val_loss: 0.7310
Epoch 7/10
72000/72000 [==============================] - 3s 47us/sample - loss: 0.7431 - val_loss: 0.7291
Epoch 8/10
72000/72000 [==============================] - 3s 49us/sample - loss: 0.7336 - val_loss: 0.7226
Epoch 9/10
72000/72000 [==============================] - 4s 59us/sample - loss: 0.7234 - val_loss: 0.7214
Epoch 10/10
72000/72000 [==============================] - 3s 48us/sample - loss: 0.7149 - val_loss: 0.7178
CPU times: user 1min 1s, sys: 3.36 s, total: 1min 4s
Wall time: 34.4 s



In [63]:

    
test_preds = model.predict([user_id_test, item_id_test, item_meta_test])
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))









    



Final test MSE: 0.863
Final test MAE: 0.718

The additional metadata seems to improve the predictive power of the model a bit but this should be re-run several times to see the impact of the random initialization of the model.

A recommendation function for a given user

Once the model is trained, the system can be used to recommend a few items for a user, that he/she hasn't already seen:

we use the model.predict to compute the ratings a user would have given to all items
we build a reco function that sorts these items and exclude those the user has already seen



In [64]:

    
def recommend(user_id, top_n=10):
    item_ids = range(1, max_item_id)
    seen_mask = all_ratings["user_id"] == user_id
    seen_movies = set(all_ratings[seen_mask]["item_id"])
    item_ids = list(filter(lambda x: x not in seen_movies, item_ids))

    print("User %d has seen %d movies, including:" % (user_id, len(seen_movies)))
    for title in all_ratings[seen_mask].nlargest(20, 'popularity')['title']:
        print("   ", title)
    print("Computing ratings for %d other movies:" % len(item_ids))
    
    item_ids = np.array(item_ids)
    user_ids = np.zeros_like(item_ids)
    user_ids[:] = user_id
    items_meta = scaler.transform(indexed_items[meta_columns].loc[item_ids])
    
    rating_preds = model.predict([user_ids, item_ids, items_meta])
    
    item_ids = np.argsort(rating_preds[:, 0])[::-1].tolist()
    rec_items = item_ids[:top_n]
    return [(items["title"][movie], rating_preds[movie][0])
            for movie in rec_items]



In [65]:

    
for title, pred_rating in recommend(5):
    print("    %0.1f: %s" % (pred_rating, title))









    



User 5 has seen 175 movies, including:
    Star Wars (1977)
    Fargo (1996)
    Return of the Jedi (1983)
    Toy Story (1995)
    Independence Day (ID4) (1996)
    Raiders of the Lost Ark (1981)
    Silence of the Lambs, The (1991)
    Empire Strikes Back, The (1980)
    Star Trek: First Contact (1996)
    Back to the Future (1985)
    Mission: Impossible (1996)
    Fugitive, The (1993)
    Indiana Jones and the Last Crusade (1989)
    Willy Wonka and the Chocolate Factory (1971)
    Princess Bride, The (1987)
    Forrest Gump (1994)
    Monty Python and the Holy Grail (1974)
    Men in Black (1997)
    E.T. the Extra-Terrestrial (1982)
    Birdcage, The (1996)
Computing ratings for 1506 other movies:
    4.5: Boys of St. Vincent, The (1993)
    4.5: Richard III (1995)
    4.4: Robocop 3 (1993)
    4.4: August (1996)
    4.3: Madness of King George, The (1994)
    4.3: Raising Arizona (1987)
    4.3: Romy and Michele's High School Reunion (1997)
    4.2: Boogie Nights (1997)
    4.2: Under Siege (1992)
    4.2: Raging Bull (1980)

Home assignment: Predicting ratings as a classification problem

In this dataset, the ratings all belong to a finite set of possible values:



In [66]:

    
import numpy as np

np.unique(rating_train)









    Out[66]:





array([1, 2, 3, 4, 5])

Maybe we can help the model by forcing it to predict those values by treating the problem as a multiclassification problem. The only required changes are:

setting the final layer to output class membership probabities using a softmax activation with 5 outputs;
optimize the categorical cross-entropy classification loss instead of a regression loss such as MSE or MAE.



In [68]:

    
# %load solutions/classification.py
class ClassificationModel(Model):
    def __init__(self, embedding_size, max_user_id, max_item_id):
        super().__init__()

        self.user_embedding = Embedding(output_dim=embedding_size, input_dim=max_user_id + 1,
                                        input_length=1, name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size, input_dim=max_item_id + 1,
                                        input_length=1, name='item_embedding')

        # The following two layers don't have parameters.
        self.flatten = Flatten()
        self.concat = Concatenate()

        self.dropout1 = Dropout(0.5)
        self.dense1 = Dense(128, activation="relu")
        self.dropout2 = Dropout(0.2)
        self.dense2 = Dense(128, activation='relu')
        self.dense3 = Dense(5, activation="softmax")

    def call(self, inputs):
        user_inputs = inputs[0]
        item_inputs = inputs[1]

        user_vecs = self.flatten(self.user_embedding(user_inputs))
        item_vecs = self.flatten(self.item_embedding(item_inputs))

        input_vecs = self.concat([user_vecs, item_vecs])

        y = self.dropout1(input_vecs)
        y = self.dense1(y)
        y = self.dropout2(y)
        y = self.dense2(y)
        y = self.dense3(y)

        return y

model = ClassificationModel(16, max_user_id, max_item_id)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

initial_train_preds = model.predict([user_id_train, item_id_train]).argmax(axis=1) + 1
print("Random init MSE: %0.3f" % mean_squared_error(initial_train_preds, rating_train))
print("Random init MAE: %0.3f" % mean_absolute_error(initial_train_preds, rating_train))

history = model.fit([user_id_train, item_id_train], rating_train - 1,
                    batch_size=64, epochs=15, validation_split=0.1,
                    shuffle=True)

plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.title('loss');

test_preds = model.predict([user_id_test, item_id_test]).argmax(axis=1) + 1
print("Final test MSE: %0.3f" % mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % mean_absolute_error(test_preds, rating_test))









    



Random init MSE: 1.989
Random init MAE: 1.071
Train on 72000 samples, validate on 8000 samples
Epoch 1/15
72000/72000 [==============================] - 4s 52us/sample - loss: 1.3698 - val_loss: 1.2793
Epoch 2/15
72000/72000 [==============================] - 2s 34us/sample - loss: 1.2840 - val_loss: 1.2622
Epoch 3/15
72000/72000 [==============================] - 3s 35us/sample - loss: 1.2620 - val_loss: 1.2535
Epoch 4/15
72000/72000 [==============================] - 2s 33us/sample - loss: 1.2513 - val_loss: 1.2493
Epoch 5/15
72000/72000 [==============================] - 2s 31us/sample - loss: 1.2453 - val_loss: 1.2440
Epoch 6/15
72000/72000 [==============================] - 2s 31us/sample - loss: 1.2385 - val_loss: 1.2445
Epoch 7/15
72000/72000 [==============================] - 2s 30us/sample - loss: 1.2334 - val_loss: 1.2424
Epoch 8/15
72000/72000 [==============================] - 2s 30us/sample - loss: 1.2298 - val_loss: 1.2418
Epoch 9/15
72000/72000 [==============================] - 2s 31us/sample - loss: 1.2275 - val_loss: 1.2392
Epoch 10/15
72000/72000 [==============================] - 3s 40us/sample - loss: 1.2239 - val_loss: 1.2397
Epoch 11/15
72000/72000 [==============================] - 3s 43us/sample - loss: 1.2189 - val_loss: 1.2356
Epoch 12/15
72000/72000 [==============================] - 3s 41us/sample - loss: 1.2183 - val_loss: 1.2395
Epoch 13/15
72000/72000 [==============================] - 3s 48us/sample - loss: 1.2165 - val_loss: 1.2372
Epoch 14/15
72000/72000 [==============================] - 3s 40us/sample - loss: 1.2147 - val_loss: 1.2384
Epoch 15/15
72000/72000 [==============================] - 3s 42us/sample - loss: 1.2114 - val_loss: 1.2377
Final test MSE: 1.144
Final test MAE: 0.717



In [ ]:

	user_id	item_id	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

	item_id	title	release_date	video_release_date	imdb_url
0	1	Toy Story (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?Toy%20Story%2...
1	2	GoldenEye (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?GoldenEye%20(...
2	3	Four Rooms (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?Four%20Rooms%...
3	4	Get Shorty (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?Get%20Shorty%...
4	5	Copycat (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?Copycat%20(1995)

	item_id	popularity	title	release_date	video_release_date	imdb_url	release_year
49	50	583	Star Wars (1977)	01-Jan-1977	NaN	http://us.imdb.com/M/title-exact?Star%20Wars%2...	1977
257	258	509	Contact (1997)	11-Jul-1997	NaN	http://us.imdb.com/Title?Contact+(1997/I)	1997
99	100	508	Fargo (1996)	14-Feb-1997	NaN	http://us.imdb.com/M/title-exact?Fargo%20(1996)	1997
180	181	507	Return of the Jedi (1983)	14-Mar-1997	NaN	http://us.imdb.com/M/title-exact?Return%20of%2...	1997
293	294	485	Liar Liar (1997)	21-Mar-1997	NaN	http://us.imdb.com/Title?Liar+Liar+(1997)	1997
285	286	481	English Patient, The (1996)	15-Nov-1996	NaN	http://us.imdb.com/M/title-exact?English%20Pat...	1996
287	288	478	Scream (1996)	20-Dec-1996	NaN	http://us.imdb.com/M/title-exact?Scream%20(1996)	1996
0	1	452	Toy Story (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?Toy%20Story%2...	1995
299	300	431	Air Force One (1997)	01-Jan-1997	NaN	http://us.imdb.com/M/title-exact?Air+Force+One...	1997
120	121	429	Independence Day (ID4) (1996)	03-Jul-1996	NaN	http://us.imdb.com/M/title-exact?Independence%...	1996

	item_id	popularity	video_release_date	release_year	user_id	rating	timestamp
count	100000.000000	100000.000000	0.0	100000.000000	100000.00000	100000.000000	1.000000e+05
mean	425.530130	168.071900	NaN	1987.950100	462.48475	3.529860	8.835289e+08
std	330.798356	121.784558	NaN	14.169558	266.61442	1.125674	5.343856e+06
min	1.000000	1.000000	NaN	1920.000000	1.00000	1.000000	8.747247e+08
25%	175.000000	71.000000	NaN	1986.000000	254.00000	3.000000	8.794487e+08
50%	322.000000	145.000000	NaN	1994.000000	447.00000	4.000000	8.828269e+08
75%	631.000000	239.000000	NaN	1996.000000	682.00000	4.000000	8.882600e+08
max	1682.000000	583.000000	NaN	1998.000000	943.00000	5.000000	8.932866e+08