Using SimEc for recommender systems

Tasks like product recommendation or drug-target interaction prediction essentially boil down to predicting missing entries in a large matrix of pairwise relations, e.g., the ratings users gave to some items, or whether or not a drug interacts with a certain protein. Besides the sparse matrix containing the pairwise relations, one can usually also construct feature vectors for the items and users (or drugs and proteins), e.g., based on textual descriptions. These come in especially handy when predictions need to be made for, e.g., new items that have not yet received any user ratings. In the following we will only talk about items and users, but of course this extends to other problem setups such as drug-target interaction prediction as well. We distinguish three tasks of increasing difficulty:

  • T1: Predict missing ratings for existing items and users
  • T2a and T2b: Predict ratings for new items and existing users (a) or new users and existing items (b)
  • T3: Predict ratings for new items and new users

For tasks T2a/b and T3, feature vectors describing items and/or users are required.

There are several methods that can be used to solve some or all of the above tasks. These include:

Baseline Methods
  • Predict average: The simplest possible baseline: just fill in all missing values with averages. For example, an item rating from a user can be predicted from the average rating that user usually gives (they might in general be more or less critical than other users) and the average rating the item got from other users (it might be better or worse than the average item); for new items and users, simply predict the overall average rating (solves T1, T2a/b, T3).
  • SVD of the ratings matrix: By factorizing the ratings matrix with an (iterative) singular value decomposition (SVD), one can compute a low-rank approximation of the ratings matrix and use these approximate values as predictions for the missing entries (solves T1). This can also be combined with the average ratings from above, i.e., the low-rank approximation can be used to predict the residuals.
  • SVD + Regression: Given some feature vectors for the items or users and the low-rank factorization of the ratings matrix computed above, a regression model can learn the mapping from the items' feature vectors to their latent factor vectors (or analogously for users). This extends the above method to additionally solve either T2a or T2b, or T3 if models are learned for both sides of the factorization.
  • Regression/Classification model: This approach is completely different from the latent factor models discussed above. Here we train an ordinary regression or classification model (depending on the form of the pairwise data, e.g., continuous ratings or binary interactions), using the concatenation of an item's and a user's feature vectors as input and their rating as the target. One possible realization of such a model could involve two neural networks that map the individual feature vectors into some lower dimensional embedding space. This approach can solve all tasks T1, T2a/b, and T3, provided the corresponding feature vectors are available.
Similarity Encoder Models
  • Factorization of the ratings matrix using the identity matrix as input: By training a SimEc to factorize the ratings matrix with the identity matrix as input, we can recreate the solution obtained with SVD (while possibly handling the missing values better when computing the decomposition). Correspondingly, this only solves T1.
  • Factorization of the ratings matrix using feature vectors as input: By using either item or user feature vectors as input when factorizing the ratings matrix, we can additionally solve T2a or T2b.
  • Train a second SimEc with feature vectors and fixed last-layer weights: After training, e.g., a SimEc with item feature vectors as input to decompose the ratings matrix, we can use this SimEc to compute the item embeddings $Y$. We can then construct a second SimEc, which uses user feature vectors as input to factorize the ratings matrix, but whose last-layer weights are fixed to the transpose of the item embedding matrix $Y$. After this second SimEc is trained, we can use both SimEcs to compute item and user embeddings respectively and predict a rating as the scalar product of the two embedding vectors. Since this also works given only the feature vectors of new items and users, it can be used to solve all tasks T1, T2a/b, and T3 (see the sketch below).
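
The tied-weights construction from the last bullet can be written down directly in plain Keras (a minimal sketch, not the simec package's API; build_tied_simec, n_user_feats, and Y_items are hypothetical names): the user-side SimEc is an ordinary feed-forward network whose frozen output layer contains the transposed item embedding matrix.

from keras.models import Model
from keras.layers import Input, Dense

def build_tied_simec(n_user_feats, e_dim, Y_items):
    # Y_items: (n_items, e_dim) item embedding matrix computed by the first SimEc
    inp = Input(shape=(n_user_feats,))
    emb = Dense(e_dim, use_bias=False)(inp)  # user embeddings
    # output layer fixed to Y_items^T, so each output is the scalar product
    # of a user embedding and an item embedding, i.e. a predicted rating
    out = Dense(Y_items.shape[0], use_bias=False, trainable=False,
                weights=[Y_items.T])(emb)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model  # fit on (user feature matrix, transposed ratings matrix)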

If the ratings matrix contains explicit feedback (i.e., both positive and negative ratings), all available entries can be used to train the above models. If the pairwise relations only represent implicit feedback or binary interactions (e.g., a user listens to music by certain artists, which suggests they like them, but we don't know whether they avoid other artists because they don't know them or because they don't like them), then we can use the given entries in the matrix as positive examples and additionally take a random sample of the missing entries as negative examples. In the latter case it might be more appropriate to use classification instead of regression models, and when training the SimEc it could also be helpful to apply a non-linearity to the output before computing the error of the model.
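
For the implicit-feedback case, the negative sampling described above could look roughly like this (a sketch; sample_negatives is a hypothetical helper, not part of the notebook's pipeline):

import numpy as np

def sample_negatives(tuple_ratings, movieids, userids, n_samples, seed=0):
    # draw random (movie, user) pairs and keep those without an observed interaction
    rng = np.random.RandomState(seed)
    movieids, userids = list(movieids), list(userids)
    negatives = set()
    while len(negatives) < n_samples:
        m = movieids[rng.randint(len(movieids))]
        u = userids[rng.randint(len(userids))]
        if (m, u) not in tuple_ratings:
            negatives.add((m, u))
    # observed interactions become positive targets, sampled pairs negative ones
    return dict.fromkeys(tuple_ratings, 1.), dict.fromkeys(negatives, 0.)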

Dataset

In this notebook we work with the MovieLens dataset and additionally pull some information about the individual movies from The Movie Database (TMDb) using their API.

Since we only have additional information about the movies, not the users, we focus on solving tasks T1 and T2a here.


In [1]:
from __future__ import unicode_literals, division, print_function, absolute_import
from builtins import range, str
import os
import json
import requests
import numpy as np
np.random.seed(28)
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
tf.set_random_seed(28)
import keras
import keras.backend as K

from collections import defaultdict, Counter
from scipy.sparse import lil_matrix, dok_matrix, csr_matrix, coo_matrix, hstack, vstack, diags
from scipy.sparse.linalg import svds
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import Ridge as rreg
from sklearn.model_selection import GridSearchCV
from simec import SimilarityEncoder

%matplotlib inline
%load_ext autoreload
%autoreload 2

savefigs = True


Using TensorFlow backend.

Loading Data


In [2]:
if not os.path.exists("data/recsys/tmdb_data"):
    os.mkdir("data/recsys/tmdb_data")

def parse_tmdb(tmdbid, apikey):
    movie_data = {}
    if os.path.exists("data/recsys/tmdb_data/%r.json" % tmdbid):
        with open("data/recsys/tmdb_data/%r.json" % tmdbid) as f:
            movie_data = json.load(f)
    # for a movie with tmdbid get:
    # genres:name, original language en y/n, id, title, overview, release_date-->year,
    # keywords:keywords:name, credits:cast:name[:10], credits:crew:("job": "Director"):name
    if not movie_data:
        r = requests.get("https://api.themoviedb.org/3/movie/%r?api_key=%s&language=en-US&append_to_response=keywords,credits" % (tmdbid, apikey))
        if r.status_code != 200:
            print("something went wrong when accessing tmdb with id %r!" % tmdbid)
            print(r.text)
        else:
            movie_json = r.json()
            movie_data['tmdbid'] = movie_json['id']
            movie_data['title'] = movie_json['title']
            movie_data['overview'] = movie_json['overview']
            movie_data['release_date'] = movie_json['release_date']
            movie_data['year'] = movie_json["release_date"].split("-")[0]
            movie_data['original_en'] = str(movie_json['original_language'] == "en")
            movie_data['genres'] = [g["name"] for g in movie_json['genres']]
            movie_data['keywords'] = [k["name"] for k in movie_json['keywords']['keywords']]
            movie_data['cast'] = [c["name"] for c in movie_json['credits']['cast'][:10]]
            movie_data['directors'] = [c["name"] for c in movie_json['credits']['crew'] if c["job"] == "Director"]
            print("got data for %s" % movie_json['title'])
            with open("data/recsys/tmdb_data/%r.json" % tmdbid, "w") as f:
                json.dump(movie_data, f, indent=2)
    return movie_data
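
For reference, a hypothetical one-off call (assuming a valid API key; 603 should be the TMDb id of "The Matrix"):

apikey = open("data/recsys/tmdb_apikey.txt").read().strip()
movie = parse_tmdb(603, apikey)
print(movie["title"], movie["year"], movie["genres"])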

In [3]:
# get movielens data from: https://grouplens.org/datasets/movielens/10m/
# load all possible movies (in the 10m dataset)
df_movies = pd.read_csv("data/recsys/ml-10M100K/movies.dat", sep="::", engine="python", names=["movieId","title","genres"])
# get corresponding tmdbids (only in 20m dataset)
df_links = pd.read_csv("data/recsys/ml-20m/links.csv")
df_links = df_links.dropna()
df_links = df_links.astype(int) 
map_movieids = dict(zip(df_links.movieId, df_links.tmdbId))
# get additional details from themoviedb.org (assumes api key is stored at data/recsys/tmdb_apikey.txt)
if os.path.exists("data/recsys/tmdb_data.json"):
    with open("data/recsys/tmdb_data.json") as f:
        movies_data = json.load(f)
else:
    movies_data = {}
    with open('data/recsys/tmdb_apikey.txt') as f:
        apikey = f.read().strip()
    for movieid in df_movies.movieId:
        if movieid in map_movieids:
            m = parse_tmdb(map_movieids[movieid], apikey)
            if m:
                # careful: when loading the json later, the ids will be strings as well anyway
                movies_data[str(movieid)] = m
            else:
                print("error with movie id: %i" % movieid)
    with open("data/recsys/tmdb_data.json", "w") as f:
        json.dump(movies_data, f, indent=2)
print("got data for %i movies" % len(movies_data))


got data for 10608 movies

In [4]:
# load pairwise data and generate a dict with {(movieid, userid): rating}
movieids = set()
userids = set()
tuple_ratings = {}
rating_pairs = []
with open("data/recsys/ml-10M100K/ratings.dat") as f:
    for i, l in enumerate(f.readlines()):
        if not i % 1000000:
            print("parsed %i lines" % i)
        u, m, r, t = l.strip().split("::")
        # only consider ratings for movies where we have external data available
        if m in movies_data:
            # in addition to the ratings, also get a list of all users and movies
            if u not in userids:
                userids.add(u)
            if m not in movieids:
                movieids.add(m)
            tuple_ratings[(m,u)] = float(r)
            rating_pairs.append((m,u))
        #else:
        #    print("warning, skipping rating for movie with id %r" % m)
# shuffle all movie and user ids (important so we can split data into train and test sets)
# this list additionally functions as a mapping from a (matrix) index to the actual id
np.random.seed(13)
map_index2movieid = np.random.permutation(sorted(movieids))
map_index2userid = np.random.permutation(sorted(userids))
# also get a shuffled list of all rating pairs
rating_pairs = np.random.permutation(rating_pairs)
print("%i movies, %i users, and %i ratings" % (len(movieids), len(userids), len(rating_pairs)))


parsed 0 lines
parsed 1000000 lines
parsed 2000000 lines
parsed 3000000 lines
parsed 4000000 lines
parsed 5000000 lines
parsed 6000000 lines
parsed 7000000 lines
parsed 8000000 lines
parsed 9000000 lines
parsed 10000000 lines
10604 movies, 69878 users, and 9989664 ratings

Ratings Prediction

Overview of prediction results (RMSE)

         mean      mean+SVD   mean+SimEc(I)   mean+SVD+regression   mean+SimEc(X)
T1       0.88614   0.85891    0.87660         0.88014               0.86796
T2a      0.97610   -          -               0.97332               0.96897

While the SVD of the residual ratings matrix gives the best approximation of the ratings for known movies and users (T1), learning the connection between the movies' feature vectors and the pairwise relations with a SimEc enables us to make better predictions for new movies (T2a).


In [5]:
# we have different scenarios: either we're only missing some individual ratings or entire movies/users
def split_traintest(scenario):
    print("generating train/test splits for scenario %r" % scenario)
    if scenario == "T1":
        # missing ratings
        rating_pairs_train = rating_pairs[:int(0.7*len(rating_pairs))]
        rating_pairs_test = rating_pairs[int(0.7*len(rating_pairs)):]
        map_index2movieid_train = map_index2movieid
        map_index2userid_train = map_index2userid
    else:
        if scenario == "T2a":
            # missing movies
            map_index2movieid_train = map_index2movieid[:int(0.7*len(map_index2movieid))]
            map_index2userid_train = map_index2userid
        elif scenario == "T2b":
            # missing users
            map_index2movieid_train = map_index2movieid
            map_index2userid_train = map_index2userid[:int(0.7*len(map_index2userid))]
        elif scenario == "T3":
            # missing movies and users
            map_index2movieid_train = map_index2movieid[:int(0.85*len(map_index2movieid))]
            map_index2userid_train = map_index2userid[:int(0.8*len(map_index2userid))]
        else:
            raise Exception("unknown scenario %r, use either T1, T2a, T2b, or T3!" % scenario)
        movieids_train_set = set(map_index2movieid_train)
        userids_train_set = set(map_index2userid_train)
        rating_pairs_train = []
        rating_pairs_test = []
        for (m, u) in rating_pairs:
            if u in userids_train_set and m in movieids_train_set:
                rating_pairs_train.append((m, u))
            else:
                rating_pairs_test.append((m, u))
    print("got %i training and %i test ratings" % (len(rating_pairs_train), len(rating_pairs_test)))
    # create mappings from the actual id to the index
    map_movieid2index_train = {m: i for i, m in enumerate(map_index2movieid_train)}
    map_userid2index_train = {u: i for i, u in enumerate(map_index2userid_train)}
    return rating_pairs_train, rating_pairs_test, map_index2userid_train,\
           map_index2movieid_train, map_userid2index_train, map_movieid2index_train

def make_train_matrix(tuple_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train):
    # transform training ratings into a sparse matrix for convenience
    print("transforming dict with %i ratings into sparse matrix" % len(rating_pairs_train))
    ratings_matrix = lil_matrix((len(map_movieid2index_train),len(map_userid2index_train)))
    for (m, u) in rating_pairs_train:
        ratings_matrix[map_movieid2index_train[m],map_userid2index_train[u]] = tuple_ratings[(m, u)]
    ratings_matrix = csr_matrix(ratings_matrix)
    return ratings_matrix

Baseline model: predict mean


In [6]:
class MeansModel():
    """
    A very simple baseline model, which predicts the rating a user would give to a movie as:
        mean + user_mean + movie_mean
    """
    def __init__(self, shrinkage=1.):
        self.mean = None
        self.mean_users = {}
        self.mean_movies = {}
        # shrinkage decreases the influence of the individual user/movie means
        # --> mean + shrinkage*user_mean + shrinkage*movie_mean
        # it should always be between 0 and 1; 0 means individual means are ignored
        self.shrinkage = max(0., min(1., shrinkage))

    def fit(self, tuple_ratings, rating_pairs_train):
        # overall mean based on all training ratings
        self.mean = np.mean([tuple_ratings[(m, u)] for (m, u) in rating_pairs_train])
        # means for movies and users
        if self.shrinkage:
            mean_users = defaultdict(list)
            mean_movies = defaultdict(list)
            for (m, u) in rating_pairs_train:
                mean_users[u].append(tuple_ratings[(m, u)])
                mean_movies[m].append(tuple_ratings[(m, u)])
            self.mean_users = {u: np.mean(mean_users[u])-self.mean for u in mean_users}
            self.mean_movies = {m: np.mean(mean_movies[m])-self.mean for m in mean_movies}
    
    def predict(self, m, u, residuals=None):
        """
        generate rating prediction for a user u and movie m
        """
        rating = self.mean
        if u in self.mean_users:
            rating += self.shrinkage*self.mean_users[u]
        if m in self.mean_movies:
            rating += self.shrinkage*self.mean_movies[m]
        if residuals and (m, u) in residuals:
            rating += residuals[(m, u)]
        return rating
    
    def compute_residuals(self, tuple_ratings, rating_pairs_train):
        """
        for all ratings, subtract the respective average ratings to get residuals
        """
        return {(m, u): tuple_ratings[(m, u)] - (self.mean+self.shrinkage*(self.mean_users[u]+self.mean_movies[m]))
                                  for (m, u) in rating_pairs_train}
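
As a quick sanity check of the prediction formula (a toy example, not part of the notebook's pipeline; with shrinkage=1 the prediction is mean + user offset + movie offset):

mm = MeansModel(shrinkage=1.)
mm.fit({("m1", "u1"): 5., ("m1", "u2"): 3., ("m2", "u1"): 4.},
       [("m1", "u1"), ("m1", "u2"), ("m2", "u1")])
# overall mean is 4.0, u1's offset is +0.5, m1's offset is 0.0
print(mm.predict("m1", "u1"))  # -> 4.5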

In [7]:
for scenario in ["T1", "T2a", "T2b", "T3"]:
    # get train/test data
    rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest(scenario)
    for shrinkage in [0., 0.1, 0.5, 0.9, 1.]:
        # initialize means model
        mmodel = MeansModel(shrinkage)
        print("fitting model with shrinkage=%.1f" % shrinkage)
        mmodel.fit(tuple_ratings, rating_pairs_train)
        # get a vector with target ratings for test tuples
        y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
        # get the corresponding predictions
        y_pred = np.array([mmodel.predict(m, u) for (m, u) in rating_pairs_test])
        print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


generating train/test splits for scenario 'T1'
got 6992764 training and 2996900 test ratings
fitting model with shrinkage=0.0
Scenario T1: RMSE: 1.06044; MAE: 0.85552
fitting model with shrinkage=0.1
Scenario T1: RMSE: 1.02330; MAE: 0.82574
fitting model with shrinkage=0.5
Scenario T1: RMSE: 0.91333; MAE: 0.71937
fitting model with shrinkage=0.9
Scenario T1: RMSE: 0.88074; MAE: 0.68157
fitting model with shrinkage=1.0
Scenario T1: RMSE: 0.88614; MAE: 0.68404
generating train/test splits for scenario 'T2a'
got 7064138 training and 2925526 test ratings
fitting model with shrinkage=0.0
Scenario T2a: RMSE: 1.05938; MAE: 0.85143
fitting model with shrinkage=0.1
Scenario T2a: RMSE: 1.04342; MAE: 0.83821
fitting model with shrinkage=0.5
Scenario T2a: RMSE: 0.99569; MAE: 0.79322
fitting model with shrinkage=0.9
Scenario T2a: RMSE: 0.97628; MAE: 0.76558
fitting model with shrinkage=1.0
Scenario T2a: RMSE: 0.97610; MAE: 0.76332
generating train/test splits for scenario 'T2b'
got 7015197 training and 2974467 test ratings
fitting model with shrinkage=0.0
Scenario T2b: RMSE: 1.05763; MAE: 0.85382
fitting model with shrinkage=0.1
Scenario T2b: RMSE: 1.03627; MAE: 0.83724
fitting model with shrinkage=0.5
Scenario T2b: RMSE: 0.97058; MAE: 0.77542
fitting model with shrinkage=0.9
Scenario T2b: RMSE: 0.94101; MAE: 0.73810
fitting model with shrinkage=1.0
Scenario T2b: RMSE: 0.93976; MAE: 0.73544
generating train/test splits for scenario 'T3'
got 6827065 training and 3162599 test ratings
fitting model with shrinkage=0.0
Scenario T3: RMSE: 1.05580; MAE: 0.85198
fitting model with shrinkage=0.1
Scenario T3: RMSE: 1.03854; MAE: 0.83850
fitting model with shrinkage=0.5
Scenario T3: RMSE: 0.98629; MAE: 0.78893
fitting model with shrinkage=0.9
Scenario T3: RMSE: 0.96390; MAE: 0.75888
fitting model with shrinkage=1.0
Scenario T3: RMSE: 0.96327; MAE: 0.75657

Baseline model: SVD of residuals


In [8]:
scenario = "T1"
# get train/test data
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest(scenario)
# initialize and fit means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
# get sparse matrix with residuals
print("computing residuals")
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
ratings_matrix = make_train_matrix(residual_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train)


generating train/test splits for scenario 'T1'
got 6992764 training and 2996900 test ratings
computing residuals
transforming dict with 6992764 ratings into sparse matrix

In [9]:
# inspect the singular value spectrum of the matrix (svds returns singular values, not eigenvalues)
sv = svds(ratings_matrix, k=1000, return_singular_vectors=False)
sv = sorted(sv, reverse=True)
plt.figure()
plt.plot(list(range(1, len(sv)+1)), sv)
plt.ylabel("singular value")
plt.xlabel("index")


Out[9]:
Text(0.5, 0, 'index')

In [10]:
# compute a truncated SVD with the e_dim largest singular values/vectors
e_dim = 100
U, s, Vh = svds(ratings_matrix, k=e_dim)
S = np.diag(s)
U.shape, S.shape, Vh.shape


Out[10]:
((10604, 100), (100, 100), (100, 69878))

In [11]:
# construct approximation of residual ratings
print("get approximations")
temp = np.dot(U, np.dot(S, Vh))
# get dict with residuals for missing test ratings
print("get residual ratings")
residual_ratings_test = {(m, u): temp[map_movieid2index_train[m], map_userid2index_train[u]]
                                 for (m, u) in rating_pairs_test}
del temp
print("predict test ratings")
# get a vector with target ratings for test tuples
y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
# get the corresponding predictions
y_pred = np.array([mmodel.predict(m, u, residual_ratings_test) for (m, u) in rating_pairs_test])
print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


get approximations
get residual ratings
predict test ratings
Scenario T1: RMSE: 0.85891; MAE: 0.65994

SimEc model: identity matrix as input


In [12]:
# get a vector with target ratings for test tuples
y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
# train simec with the identity matrix as input to predict the residuals
e_dim = 100
X = np.eye(ratings_matrix.shape[0], dtype=np.float16)
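# with the identity matrix as input, each movie gets its own free embedding
# (a row of the first-layer weights), so this SimEc is an unconstrained
# low-rank factorization of the residual matrix, analogous to the SVD above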
model = SimilarityEncoder(ratings_matrix.shape[0], e_dim, ratings_matrix.shape[1], opt=0.05)
model.fit(X, ratings_matrix, epochs=20)
print("get approximations")
temp = np.array(model.predict(X), dtype=np.float16)
# get dict with residuals for missing test ratings
print("get residual ratings")
residual_ratings_test = {(m, u): temp[map_movieid2index_train[m], map_userid2index_train[u]]
                                 for (m, u) in rating_pairs_test}
del temp
print("predict test ratings")
# get the corresponding predictions
y_pred = np.array([mmodel.predict(m, u, residual_ratings_test) for (m, u) in rating_pairs_test])
print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


Warning: For best results, S (and X) should be normalized (try S /= np.max(np.abs(S))).
Epoch 1/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 2/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 3/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 4/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 5/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 6/20
10604/10604 [==============================] - 23s 2ms/step - loss: 0.0073
Epoch 7/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 8/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 9/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 10/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 11/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 12/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 13/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 14/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 15/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 16/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 17/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 18/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 19/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
Epoch 20/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0073
get approximations
get residual ratings
predict test ratings
Scenario T1: RMSE: 0.87660; MAE: 0.67578

Construct feature vectors for movies


In [13]:
def features2mat(movies_data, movieids, feature, featurenames=[]):
    if not featurenames:
        featurenames = sorted(set(word for m in movieids for word in movies_data[m][feature]))
    fnamedict = {feat: i for i, feat in enumerate(featurenames)}
    featmat = dok_matrix((len(movieids), len(featurenames)), dtype=np.float16)
    for i, m in enumerate(movieids):
        for word in movies_data[m][feature]:
            try:
                featmat[i, fnamedict[word]] = 1.
            except KeyError:
                pass
    featmat = csr_matrix(featmat)
    return featmat, featurenames

def get_features_mats(movieids_train, movieids_test):
    featurenames = []
    genres_mat_train, genres_names = features2mat(movies_data, movieids_train, "genres")
    featurenames.extend(genres_names)
    genres_mat_test, _ = features2mat(movies_data, movieids_test, "genres", genres_names)
    keywords_mat_train, keywords_names = features2mat(movies_data, movieids_train, "keywords")
    featurenames.extend(keywords_names)
    keywords_mat_test, _ = features2mat(movies_data, movieids_test, "keywords", keywords_names)
    directors_mat_train, directors_names = features2mat(movies_data, movieids_train, "directors")
    featurenames.extend(directors_names)
    directors_mat_test, _ = features2mat(movies_data, movieids_test, "directors", directors_names)
    feat_mat_train = hstack([genres_mat_train, keywords_mat_train, directors_mat_train], format="csr", dtype=np.float16)
    feat_mat_test = hstack([genres_mat_test, keywords_mat_test, directors_mat_test], format="csr", dtype=np.float16)
    return feat_mat_train, feat_mat_test, featurenames

Baseline model: SVD of residuals with regression


In [14]:
## for T1: same scenario as above
scenario = "T1"
# get train/test data
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest(scenario)
# train and test movies are the same, i.e., no unknown movies
feat_mat_train, _, _ = get_features_mats(map_index2movieid_train, [])
# initialize and fit means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
# get sparse matrix with residuals
print("computing residuals")
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
ratings_matrix = make_train_matrix(residual_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train)
# compute a truncated SVD with the e_dim largest singular values/vectors
e_dim = 100
U, s, Vh = svds(ratings_matrix, k=e_dim)
S = np.diag(s)
# train regression model to map from feat_mat to U
print("train regression model")
alpha = 250.  # set to None to do a grid search instead
if alpha is None:
    m = rreg()
    rrm = GridSearchCV(m, {'alpha': [0.01, 0.1, 0.25, 0.5, 0.75, 1., 2.5 , 5., 7.5, 10., 25., 50., 75., 100., 250., 500., 750., 1000.]})
    rrm.fit(feat_mat_train, U)
    print("best alpha: ", rrm.best_params_)
else:
    rrm = rreg(alpha=alpha)
    rrm.fit(feat_mat_train, U)
U_pred = rrm.predict(feat_mat_train)
# construct approximation of residual ratings
print("get approximations")
temp = np.dot(U_pred, np.dot(S, Vh))
# get dict with residuals for missing test ratings
print("get residual ratings")
residual_ratings_test = {(m, u): temp[map_movieid2index_train[m], map_userid2index_train[u]]
                                 for (m, u) in rating_pairs_test}
del temp
print("predict test ratings")
# get a vector with target ratings for test tuples
y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
# get the corresponding predictions
y_pred = np.array([mmodel.predict(m, u, residual_ratings_test) for (m, u) in rating_pairs_test])
print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


generating train/test splits for scenario 'T1'
got 6992764 training and 2996900 test ratings
computing residuals
transforming dict with 6992764 ratings into sparse matrix
train regression model
get approximations
get residual ratings
predict test ratings
Scenario T1: RMSE: 0.88014; MAE: 0.67895

In [15]:
scenario = "T2a"
# get train/test data
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest(scenario)
movieids_test = sorted(set(m for (m, u) in rating_pairs_test if m not in map_movieid2index_train))
map_movieid2index_all = {m : i for i, m in enumerate(movieids_test, len(map_movieid2index_train))}
map_movieid2index_all.update(map_movieid2index_train)
feat_mat_train, feat_mat_test, _ = get_features_mats(map_index2movieid_train, movieids_test)
# initialize and fit means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
# get sparse matrix with residuals
print("computing residuals")
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
ratings_matrix = make_train_matrix(residual_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train)
# compute a truncated SVD with the e_dim largest singular values/vectors
e_dim = 100
U, s, Vh = svds(ratings_matrix, k=e_dim)
S = np.diag(s)
# train regression model to map from feat_mat to U
print("train regression model")
rrm = rreg(alpha=250.)
rrm.fit(feat_mat_train, U)
# stack train and test feature matrices to make predictions for all
feat_mat_all = vstack([feat_mat_train, feat_mat_test], format="csr", dtype=np.float16)
del feat_mat_train, feat_mat_test
U_pred = rrm.predict(feat_mat_all)
# construct approximation of residual ratings
print("get approximations")
temp = np.dot(U_pred, np.dot(S, Vh))
# get dict with residuals for missing test ratings
print("get residual ratings")
residual_ratings_test = {(m, u): temp[map_movieid2index_all[m], map_userid2index_train[u]]
                                 for (m, u) in rating_pairs_test}
del temp
print("predict test ratings")
# get a vector with target ratings for test tuples
y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
# get the corresponding predictions
y_pred = np.array([mmodel.predict(m, u, residual_ratings_test) for (m, u) in rating_pairs_test])
print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


generating train/test splits for scenario 'T2a'
got 7064138 training and 2925526 test ratings
computing residuals
transforming dict with 7064138 ratings into sparse matrix
train regression model
/home/franzi/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:36: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims, initial)
get approximations
get residual ratings
predict test ratings
Scenario T2a: RMSE: 0.97332; MAE: 0.76108

SimEc model: feature matrix as input


In [16]:
## for T1: same scenario as above
scenario = "T1"
# get train/test data
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest(scenario)
# train and test movies are the same, i.e., no unknown movies
feat_mat_train, _, _ = get_features_mats(map_index2movieid_train, [])
# initialize and fit means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
# get sparse matrix with residuals
print("computing residuals")
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
ratings_matrix = make_train_matrix(residual_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train)
# get a vector with target ratings for test tuples
y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
# train simec with the movie feature matrix as input to predict the residuals
e_dim = 100
model = SimilarityEncoder(feat_mat_train.shape[1], e_dim, ratings_matrix.shape[1], sparse_inputs=True,
                          opt=0.0075)
model.fit(feat_mat_train, ratings_matrix, epochs=20)
print("get approximations")
temp = np.array(model.predict(feat_mat_train), dtype=np.float16)
# get dict with residuals for missing test ratings
print("get residual ratings")
residual_ratings_test = {(m, u): temp[map_movieid2index_train[m], map_userid2index_train[u]]
                                 for (m, u) in rating_pairs_test}
del temp
print("predict test ratings")
# get the corresponding predictions
y_pred = np.array([mmodel.predict(m, u, residual_ratings_test) for (m, u) in rating_pairs_test])
print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


generating train/test splits for scenario 'T1'
got 6992764 training and 2996900 test ratings
computing residuals
transforming dict with 6992764 ratings into sparse matrix
Warning: For best results, S (and X) should be normalized (try S /= np.max(np.abs(S))).
Epoch 1/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0073
Epoch 2/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0072
Epoch 3/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 4/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0072
Epoch 5/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 6/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 7/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 8/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 9/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0072
Epoch 10/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 11/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 12/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0072
Epoch 13/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0072
Epoch 14/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0071
Epoch 15/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0071
Epoch 16/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0071
Epoch 17/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0071
Epoch 18/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0071
Epoch 19/20
10604/10604 [==============================] - 24s 2ms/step - loss: 0.0071
Epoch 20/20
10604/10604 [==============================] - 25s 2ms/step - loss: 0.0071
get approximations
get residual ratings
predict test ratings
Scenario T1: RMSE: 0.86796; MAE: 0.66840

In [17]:
scenario = "T2a"
# get train/test data
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest(scenario)
movieids_test = sorted(set(m for (m, u) in rating_pairs_test if m not in map_movieid2index_train))
map_movieid2index_all = {m : i for i, m in enumerate(movieids_test, len(map_movieid2index_train))}
map_movieid2index_all.update(map_movieid2index_train)
feat_mat_train, feat_mat_test, _ = get_features_mats(map_index2movieid_train, movieids_test)
# stack train and test feature matrices to make predictions for all
feat_mat_all = vstack([feat_mat_train, feat_mat_test], format="csr", dtype=np.float16)
del feat_mat_test
# initialize and fit means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
# get sparse matrix with residuals
print("computing residuals")
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
ratings_matrix = make_train_matrix(residual_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train)
# get a vector with target ratings for test tuples
y_true = np.array([tuple_ratings[(m, u)] for (m, u) in rating_pairs_test])
# train simec with the movie feature matrix as input to predict the residuals
e_dim = 100
model = SimilarityEncoder(feat_mat_train.shape[1], e_dim, ratings_matrix.shape[1], sparse_inputs=True,
                          opt=0.005)
model.fit(feat_mat_train, ratings_matrix, epochs=20)
print("get approximations")
temp = np.array(model.predict(feat_mat_all), dtype=np.float16)
# get dict with residuals for missing test ratings
print("get residual ratings")
residual_ratings_test = {(m, u): temp[map_movieid2index_all[m], map_userid2index_train[u]]
                                 for (m, u) in rating_pairs_test}
del temp
print("predict test ratings")
# get the corresponding predictions
y_pred = np.array([mmodel.predict(m, u, residual_ratings_test) for (m, u) in rating_pairs_test])
print("Scenario %s: RMSE: %.5f; MAE: %.5f" % (scenario, np.sqrt(mean_squared_error(y_true, y_pred)), mean_absolute_error(y_true, y_pred)))


generating train/test splits for scenario 'T2a'
got 7064138 training and 2925526 test ratings
computing residuals
transforming dict with 7064138 ratings into sparse matrix
Warning: For best results, S (and X) should be normalized (try S /= np.max(np.abs(S))).
Epoch 1/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0104
Epoch 2/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0104
Epoch 3/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0104
Epoch 4/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0103
Epoch 5/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0103
Epoch 6/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0102
Epoch 7/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0102
Epoch 8/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0102
Epoch 9/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0101
Epoch 10/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0101
Epoch 11/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0101
Epoch 12/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0100
Epoch 13/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0100
Epoch 14/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0099
Epoch 15/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0099
Epoch 16/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0099
Epoch 17/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0098
Epoch 18/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0098
Epoch 19/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0097
Epoch 20/20
7422/7422 [==============================] - 17s 2ms/step - loss: 0.0097
get approximations
get residual ratings
predict test ratings
Scenario T2a: RMSE: 0.96897; MAE: 0.75744

Interpretation of ratings

In addition to accurately predicting a user's rating for a certain item and therefore generating valuable recommendations, it might also be interesting to understand why a user might like a certain item. For this, we can use layerwise relevance propagation to identify the features of an item that most contributed to a positive or negative predicted rating.
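
Since the SimEc trained below appears to be linear (a single embedding layer followed by a linear output layer), the predicted residual rating decomposes additively over the input features, and the relevance of a feature is simply its term in this sum. A minimal sketch under that assumption (using the same layer indices as the cells below; feature_relevances is a hypothetical helper):

def feature_relevances(model, x, uid):
    # x: dense 1d feature vector of one movie; uid: output column of one user
    W_in, b_in = model.model.layers[1].get_weights()  # features -> embedding
    W_out = model.model.layers[2].get_weights()[0]    # embedding -> users
    path = np.dot(W_in, W_out[:, uid])                # per-feature weight path
    # these terms sum (up to the bias contribution) to the predicted residual
    return x * path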


In [18]:
# get all the ratings
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest("T1")
rating_pairs_train, rating_pairs_test = list(rating_pairs_train), list(rating_pairs_test)
rating_pairs_train.extend(rating_pairs_test)
feat_mat_train, _, featurenames = get_features_mats(map_index2movieid_train, [])
# initialize and fit means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
# get sparse matrix with residuals
print("computing residuals")
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
ratings_matrix = make_train_matrix(residual_ratings, rating_pairs_train, map_userid2index_train, map_movieid2index_train)
# get dense ratings matrix with missing values = -100
# R = -100*np.ones(ratings_matrix.shape, dtype=np.float16)
# R[ratings_matrix.nonzero()] = ratings_matrix[ratings_matrix.nonzero()]
# a simec with the movie feature matrix as input is trained in the cells below


generating train/test splits for scenario 'T1'
got 6992764 training and 2996900 test ratings
computing residuals
transforming dict with 9989664 ratings into sparse matrix

In [19]:
# select an interesting movie that appeals to different kinds of audiences
# e.g. 8874 Shaun of the Dead
for mid in movies_data:
    if "Comedy" in movies_data[mid]["genres"] and "Horror" in movies_data[mid]["genres"] and movies_data[mid]["year"] > "2000":
        print(mid, movies_data[mid]["title"])


1115 The Sleepover
4996 Little Otik
5478 Eight Legged Freaks
5837 Legion of the Dead
5909 Visitor Q
6755 Bubba Ho-tep
7202 Beyond Re-Animator
7266 The Lost Skeleton of Cadavra
7319 Club Dread
7846 Tremors 3: Back to Perfection
8258 Citizen Toxie: The Toxic Avenger IV
8578 Undead
8874 Shaun of the Dead
27563 The Happiness of the Katakuris
32011 Cursed
32239 Save the Green Planet!
32314 Incident at Loch Ness
44777 Evil Aliens
44828 Slither
47937 Severance
48231 Taxidermia
48678 Feast
51498 2001 Maniacs
53468 Fido
54910 Behind the Mask: The Rise of Leslie Vernon
55553 Black Sheep
57910 Teeth
60044 Baghead
60363 Zombie Strippers!
60522 The Machine Girl
62008 Dead Fury
62788 Lost Boys: The Tribe

In [20]:
# check which users gave the movie a 5 star rating with a clearly positive residual
users = {}
for (m, u) in tuple_ratings:
    if m == "8874" and tuple_ratings[(m, u)] >= 5. and residual_ratings[(m, u)] >= 1.:
        users[u] = 0
# count how many ratings each of these users has given in total
for (m, u) in tuple_ratings:
    if u in users:
        users[u] += 1
for u in sorted(users, key=users.get, reverse=True):
    if users[u] > 1000:
        print(u, users[u])
    else:
        break


3817 4159
14134 2750
38928 2552
43992 2444
67542 2333
33399 2257
55005 2176
57481 2129
69714 1863
63905 1804
6845 1662
54922 1582
3810 1560
36827 1512
9181 1473
44041 1406
37412 1396
9139 1381
12367 1361
23988 1333
48717 1291
19111 1205
28855 1195
35015 1191
18870 1125
42242 1121
27313 1088
34324 1077
70116 1076
59934 1051
35779 1047
3171 1028
32501 1013
41130 1005

In [21]:
# get the top 10 movies (based on residual ratings) for our selected users
users_of_interest = ["3817", "54922"]
user_ratings = {u: {} for u in users_of_interest}
for (m, u) in residual_ratings:
    if u in users_of_interest:
        user_ratings[u][m] = residual_ratings[(m, u)]
for u in users_of_interest:
    top_genres = []
    for i, m in enumerate(sorted(user_ratings[u], key=user_ratings[u].get, reverse=True)[:100]):
        if i < 10:
            print(u, m, residual_ratings[(m, u)], movies_data[m]["title"], movies_data[m]["genres"])
        top_genres.extend(movies_data[m]["genres"])
    print()
    top_genres = Counter(top_genres)
    for g in sorted(top_genres):
        print(g, top_genres[g])
    print("\n\n")


3817 1984 2.5422939267471487 Halloween III: Season of the Witch ['Horror', 'Mystery', 'Science Fiction']
3817 3973 2.4734053932297706 Book of Shadows: Blair Witch 2 ['Mystery', 'Thriller', 'Horror']
3817 1981 2.1731847712586183 Friday the 13th Part VIII: Jason Takes Manhattan ['Horror', 'Thriller']
3817 3024 2.124114605600765 Piranha ['Horror', 'Thriller', 'Science Fiction', 'Comedy']
3817 3564 2.0833480091834837 The Flintstones in Viva Rock Vegas ['Science Fiction', 'Comedy', 'Family', 'Romance']
3817 6872 2.078144878980354 House of the Dead ['Horror', 'Action', 'Thriller']
3817 4388 2.038376940429229 Scary Movie 2 ['Comedy']
3817 1891 1.9651718305234342 The Ugly ['Horror', 'Thriller']
3817 2907 1.922202662817143 Superstar ['Comedy', 'Family']
3817 5585 1.8453051479856413 Ernest Scared Stupid ['Horror', 'Comedy', 'Family']

Action 15
Adventure 11
Comedy 41
Crime 8
Documentary 4
Drama 24
Family 9
Fantasy 3
History 2
Horror 51
Music 1
Mystery 13
Romance 16
Science Fiction 17
Thriller 36
War 2



54922 2163 2.401613780913019 Attack of the Killer Tomatoes! ['Comedy', 'Horror', 'Science Fiction']
54922 5303 2.148589861486617 Joe Versus the Volcano ['Comedy', 'Romance']
54922 7104 2.1312929126254225 1941 ['Comedy']
54922 2123 2.059386558110372 All Dogs Go to Heaven ['Drama', 'Animation', 'Family', 'Comedy', 'Fantasy']
54922 3689 1.9206702020027118 Porky's II: The Next Day ['Comedy']
54922 7367 1.8938075682812312 The Ladykillers ['Comedy', 'Crime', 'Thriller']
54922 2826 1.8764928154082634 The 13th Warrior ['Action', 'Adventure', 'Fantasy']
54922 52722 1.873172918274114 Spider-Man 3 ['Fantasy', 'Action', 'Adventure']
54922 1030 1.8650212279811686 Pete's Dragon ['Fantasy', 'Animation', 'Comedy', 'Family']
54922 2053 1.8584703941256704 Honey I Blew Up the Kid ['Adventure', 'Comedy', 'Family', 'Science Fiction']

Action 12
Adventure 19
Animation 10
Comedy 61
Crime 8
Drama 36
Family 23
Fantasy 30
History 1
Horror 12
Music 6
Mystery 12
Romance 27
Science Fiction 15
Thriller 16
Western 2




In [22]:
## get only users with at least 900 ratings (~ 1000 targets)
user_ratings = {u: [] for u in map_userid2index_train}
for (m, u) in tuple_ratings:
    user_ratings[u].append(m)
user1000_idx = sorted([map_userid2index_train[u] for u in user_ratings if len(user_ratings[u]) >= 900])
print(len(map_userid2index_train), len(user1000_idx))


69878 1063

In [23]:
## train a simec model on all ratings of the selected users
e_dim = 100
model = SimilarityEncoder(feat_mat_train.shape[1], e_dim, len(user1000_idx), sparse_inputs=True,
                          opt=0.005)
model.fit(feat_mat_train, ratings_matrix[:, user1000_idx], epochs=100)


Epoch 1/100
10604/10604 [==============================] - 5s 477us/step - loss: 0.0873
Epoch 2/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0842
Epoch 3/100
10604/10604 [==============================] - 5s 454us/step - loss: 0.0820
Epoch 4/100
10604/10604 [==============================] - 5s 457us/step - loss: 0.0802
Epoch 5/100
10604/10604 [==============================] - 5s 453us/step - loss: 0.0788
Epoch 6/100
10604/10604 [==============================] - 5s 455us/step - loss: 0.0774
Epoch 7/100
10604/10604 [==============================] - 5s 455us/step - loss: 0.0763
Epoch 8/100
10604/10604 [==============================] - 5s 454us/step - loss: 0.0753
Epoch 9/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0744
Epoch 10/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0736
Epoch 11/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0729
Epoch 12/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0723
Epoch 13/100
10604/10604 [==============================] - 5s 455us/step - loss: 0.0717
Epoch 14/100
10604/10604 [==============================] - 5s 454us/step - loss: 0.0712
Epoch 15/100
10604/10604 [==============================] - 5s 453us/step - loss: 0.0708
Epoch 16/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0704
Epoch 17/100
10604/10604 [==============================] - 5s 455us/step - loss: 0.0700
Epoch 18/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0697
Epoch 19/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0694
Epoch 20/100
10604/10604 [==============================] - 5s 458us/step - loss: 0.0692
Epoch 21/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0689
Epoch 22/100
10604/10604 [==============================] - 5s 455us/step - loss: 0.0687
Epoch 23/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0685
Epoch 24/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0683
Epoch 25/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0682
Epoch 26/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0679
Epoch 27/100
10604/10604 [==============================] - 5s 453us/step - loss: 0.0678
Epoch 28/100
10604/10604 [==============================] - 5s 472us/step - loss: 0.0677
Epoch 29/100
10604/10604 [==============================] - 5s 464us/step - loss: 0.0676
Epoch 30/100
10604/10604 [==============================] - 5s 476us/step - loss: 0.0675
Epoch 31/100
10604/10604 [==============================] - 5s 461us/step - loss: 0.0674
Epoch 32/100
10604/10604 [==============================] - 5s 458us/step - loss: 0.0673
Epoch 33/100
10604/10604 [==============================] - 5s 454us/step - loss: 0.0672
...
Epoch 100/100
10604/10604 [==============================] - 5s 456us/step - loss: 0.0658

In [24]:
# look through all users of interest that made the cut (i.e., are contained in user1000_idx)
users_of_interest = ["3817", "14134", "38928", "43992", "67542", "33399", "55005", "57481", "69714", "63905", "6845", "54922", "3810", "36827", "9181", "44041", "37412", "9139", "12367", "23988", "48717", "19111", "28855", "35015", "18870", "42242", "27313", "34324", "70116", "59934", "35779", "3171", "32501", "41130"]
m = "8874"
mid = map_movieid2index_train[m]
for u1 in users_of_interest:
    uid1 = user1000_idx.index(map_userid2index_train[u1]) # map_userid2index_train[u1]
    # check regular prediction scores
    print(u1, residual_ratings[(m,u1)], model.predict(feat_mat_train[mid,:])[:,uid1])
    f_movie = csr_matrix(diags(feat_mat_train[mid,:].toarray()[0], 0, shape=(feat_mat_train.shape[1], feat_mat_train.shape[1])))
    # featurewise predictions need to be corrected for the embedding-layer bias (see the derivation below)
    temp = np.dot((model.transform(f_movie) + (1/f_movie.count_nonzero() - 1.) * model.model.layers[1].get_weights()[1]), model.model.layers[2].get_weights()[0][:, [uid1]])
    uid1_scores = {f: temp[i,0] for i, f in enumerate(featurenames) if f_movie[i,i]}
    del temp
    # only show the detailed breakdown for comedy-affine users, i.e., where the
    # Comedy feature scores positively while Horror and zombie score negatively
    if uid1_scores["Comedy"] > 0.01 and uid1_scores["Horror"] < -0.011 and uid1_scores["zombie"] < -0.01:
        # this should be the same as the original predictions
        print(sum(uid1_scores.values()), tuple_ratings[(m, u1)], residual_ratings[(m, u1)])
        for f in sorted(uid1_scores, key=uid1_scores.get, reverse=True):
            print(f, uid1_scores[f])
        print()


3817 1.5081229063347688 [0.7279849]
14134 1.6716881864501811 [1.3890111]
38928 1.3442179670144445 [0.82094383]
43992 1.2236161733568913 [1.5427803]
67542 1.106614890265012 [0.63286096]
33399 1.9380311195472126 [1.13463]
55005 1.13232605409724 [1.1954658]
57481 1.4555266082444507 [1.0935943]
69714 1.1623612943406805 [0.8275987]
63905 1.5606527097317775 [1.2182063]
6845 1.5592188723707587 [1.4585662]
54922 1.0400421687510661 [0.53259724]
3810 2.043085622347617 [1.1533103]
36827 1.309172313434308 [0.39910054]
9181 1.8202991844135212 [0.6309608]
44041 1.1353354126237232 [0.5756967]
37412 1.270476152066227 [0.1324903]
9139 1.153721495646416 [0.89275867]
12367 2.3878733444222604 [1.9464864]
23988 1.2933566035544573 [0.14259747]
48717 1.4109352817251617 [1.036586]
19111 1.0889827922593098 [-0.11167835]
28855 1.5213618266175453 [0.5396532]
35015 1.7502507389270914 [0.5413419]
18870 1.2961326308946255 [0.39439023]
42242 1.4751743595099494 [0.6422601]
27313 1.452408774685475 [-0.11228691]
34324 1.9353799227547306 [0.6819677]
70116 1.5142049150747163 [0.44670537]
59934 1.4393837145186872 [0.67911905]
35779 1.1960969734606874 [0.21757182]
3171 1.1279566689404534 [0.46096477]
32501 1.1305470215933204 [0.4331051]
41130 1.5087429128183403 [1.3436705]
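
Why does the diagonal trick used above (and again in the next cell) work? For a linear SimEc, the predicted (residual) rating is (x·W + b)·v_u, where x is the movie's binary feature vector, W and b are the weights and bias of the embedding layer, and v_u is user u's column of the output layer weights. Transforming a diagonal matrix that contains x on its diagonal yields one row x_i·W + b per feature; summing over the k active rows therefore gives x·W + k·b instead of x·W + b, so each row has to be corrected by (1/k - 1)·b, which is exactly the bias term used in the code. A minimal NumPy sketch of this decomposition (all names here are hypothetical toy stand-ins, not variables from the notebook):

import numpy as np

# toy dimensions: d features, e embedding dimensions; W, b, v_u stand in for
# the embedding layer weights/bias and one user's output weight column
rng = np.random.default_rng(0)
d, e = 5, 3
W, b, v_u = rng.normal(size=(d, e)), rng.normal(size=e), rng.normal(size=e)
x = np.array([1., 0., 1., 1., 0.])             # binary feature vector with k = 3 active features

pred = (x @ W + b) @ v_u                       # regular (linear) prediction
F = np.diag(x)                                 # one active feature per row
k = np.count_nonzero(x)
rows = F @ W + b + (1. / k - 1.) * b           # bias-corrected per-row transform
contrib = rows @ v_u                           # per-feature contribution to the prediction
assert np.isclose(contrib[x > 0].sum(), pred)  # active contributions sum to the full prediction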

In [25]:
# for a certain movie, create a new feature matrix with its features on the diagonal
# to get relevance scores for each individual feature (this only works for a linear SimEc)
m = "8874"
u1 = "3817"
u2 = "54922"
mid = map_movieid2index_train[m]
uid1 = user1000_idx.index(map_userid2index_train[u1]) # map_userid2index_train[u1]
uid2 = user1000_idx.index(map_userid2index_train[u2]) # map_userid2index_train[u2]
# check regular prediction scores
print(residual_ratings[(m,u1)], residual_ratings[(m,u2)])
print(model.predict(feat_mat_train[mid,:])[:,[uid1, uid2]])
f_movie = csr_matrix(diags(feat_mat_train[mid,:].toarray()[0], 0, shape=(feat_mat_train.shape[1], feat_mat_train.shape[1])))
# featurewise predictions need to be corrected for the embedding-layer bias (see the sketch above)
temp = np.dot((model.transform(f_movie) + (1/f_movie.count_nonzero() - 1.) * model.model.layers[1].get_weights()[1]), model.model.layers[2].get_weights()[0][:, [uid1, uid2]])
uid1_scores = {f: temp[i,0] for i, f in enumerate(featurenames) if f_movie[i,i]}
uid2_scores = {f: temp[i,1] for i, f in enumerate(featurenames) if f_movie[i,i]}
del temp
# this should be the same as the original predictions
print(sum(uid1_scores.values()), sum(uid2_scores.values()))


1.5081229063347688 1.0400421687510661
[[0.7279849  0.53259724]]
0.7279849220067263 0.532597238663584

In [26]:
# the horror-affine user seems to like this movie mostly because it's a horror movie
print(sum(uid1_scores.values()), tuple_ratings[(m, u1)], residual_ratings[(m, u1)])
for f in sorted(uid1_scores, key=uid1_scores.get, reverse=True):
    print(f, uid1_scores[f])


0.7279849220067263 5.0 1.5081229063347688
british pub 0.15882124
survival horror 0.10006458
Horror 0.08814314
zombie 0.0868675
Edgar Wright 0.07315873
friends 0.058685143
pub 0.05762812
surrey 0.042946897
satire 0.028254418
cult film 0.024366971
Comedy 0.01897367
mother 0.014142217
dark comedy -0.02406771

In [27]:
# the comedy-affine user likes it because it's a (dark) comedy cult film, while the zombie and Horror features score negatively
print(sum(uid2_scores.values()), tuple_ratings[(m, u2)], residual_ratings[(m, u2)])
for f in sorted(uid2_scores, key=uid2_scores.get, reverse=True):
    print(f, uid2_scores[f])


0.532597238663584 5.0 1.0400421687510661
british pub 0.1660256
Edgar Wright 0.15661564
survival horror 0.1318464
dark comedy 0.09435095
surrey 0.020742737
Comedy 0.020615192
mother 0.006690478
cult film 0.00036479626
zombie -0.0027495015
pub -0.0035401683
friends -0.009514988
Horror -0.020013636
satire -0.028836256

Content-based recommendations

In the following, we discuss recommendations on an item basis, e.g., suggesting similar items alongside the item a user is currently looking at. With content-based recommendations, these suggestions are based on the items' features, not on the user ratings, which means that suggestions can also be made for new items that have not received any ratings yet. However, a similarity score computed directly from the items' feature vectors does not correspond well to what users perceive as similar items: movies that got similar ratings from users are not necessarily similar with respect to their feature vectors.

By using SimEc to learn a mapping from the items' feature vectors into an embedding space where the user-ratings-based similarities are preserved, more useful suggestions can be generated even for new items. By first projecting an item's feature vector into the embedding space and then computing the dot product with all other items' embedding vectors, the most similar items can be identified and recommended alongside the item of interest:
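
As a minimal sketch of this lookup, the two steps could be wrapped in a small helper, assuming a trained SimilarityEncoder model and the precomputed item embedding matrix X_embed from the cells below (the helper recommend_similar is hypothetical, not part of the SimEc library):

import numpy as np

def recommend_similar(model, X_embed, x_new, top_k=5):
    # project the (possibly new) item's feature vector into the embedding space
    e_new = model.transform(x_new)
    # dot product with all embedded items = predicted similarities
    scores = X_embed.dot(np.ravel(e_new))
    # indices of the top_k most similar items (highest score first)
    return np.argsort(scores)[::-1][:top_k]

If x_new belongs to an item that is itself part of X_embed, the top hit will simply be that item, so in practice one would skip the first result.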


In [28]:
# load data
rating_pairs_train, rating_pairs_test, map_index2userid_train, map_index2movieid_train, map_userid2index_train, map_movieid2index_train = split_traintest("T1")
rating_pairs_train, rating_pairs_test = list(rating_pairs_train), list(rating_pairs_test)
rating_pairs_train.extend(rating_pairs_test)
# get feature vectors for all movies
print("computing content based similarities")
feat_mat_train, _, featurenames = get_features_mats(map_index2movieid_train, [])
# normalize them to have length 1 (since features are binary, norm is just the sqrt(sum()))
N = np.sqrt(feat_mat_train.sum(1))
print("%i movies have no features...:/" % np.sum(N==0))
N[N==0] = 1
feat_mat_train_normed = feat_mat_train/N
# compute content based similarity matrix ("similar movies have similar genres/directors/topics")
S_content = feat_mat_train_normed.dot(feat_mat_train_normed.T).A  # cosine similarity
# get sparse matrix with residuals
print("computing user ratings based similarities")
# initialize and fit the means model
mmodel = MeansModel()
mmodel.fit(tuple_ratings, rating_pairs_train)
residual_ratings = mmodel.compute_residuals(tuple_ratings, rating_pairs_train)
# transform residual_ratings into a dict with {movieid: {userid: rating}}
residual_ratings_movies = defaultdict(dict)
for (m, u) in residual_ratings:
    residual_ratings_movies[m][u] = residual_ratings[(m, u)]
# normalize rating vectors to have unit length
for m in residual_ratings_movies:
    N = np.linalg.norm(list(residual_ratings_movies[m].values())) 
    residual_ratings_movies[m] = {u: residual_ratings_movies[m][u]/N for u in residual_ratings_movies[m]}
# transform into sparse matrix
ratings_matrix = dok_matrix((len(map_movieid2index_train), len(map_userid2index_train)), dtype=float)
for i, m in enumerate(map_index2movieid_train):
    for u in residual_ratings_movies[m]:
        ratings_matrix[i, map_userid2index_train[u]] = residual_ratings_movies[m][u]
ratings_matrix = csr_matrix(ratings_matrix)
# compute user ratings based similarity matrix ("similar movies get similar ratings")
S_user = ratings_matrix.dot(ratings_matrix.T).A
# get movies with at least 1000 ratings
movie1000_idx = sorted([map_movieid2index_train[m] for m in residual_ratings_movies if len(residual_ratings_movies[m]) >= 1000])
print(len(map_movieid2index_train), len(movie1000_idx))
S_user = S_user[movie1000_idx][:,movie1000_idx]
S_content = S_content[movie1000_idx][:,movie1000_idx]
# check correlation between both similarity scores (--> if movies with similar content also 
# get similar ratings, the content based similarity score can be used for recommendations)
s_content_flat = np.asarray(S_content[np.triu_indices_from(S_content, 1)])
s_user_flat = np.asarray(S_user[np.triu_indices_from(S_user, 1)])
corr_pear = pearsonr(s_content_flat, s_user_flat)[0]
corr_spear = spearmanr(s_content_flat, s_user_flat)[0]
print("Correlation between ratings:", corr_pear, corr_spear)
ridx = np.random.permutation(len(s_content_flat))[:5000]
plt.figure()
plt.scatter(s_user_flat[ridx], s_content_flat[ridx], s=2)
plt.xlabel("User ratings based similarities")
plt.ylabel("Content based similarities");
plt.title("Movie similarities (Pearson corr: %.3f)" % corr_pear)
if savefigs: plt.savefig("ratings_corr_basic.png", dpi=300, bbox_inches="tight")


generating train/test splits for scenario 'T1'
got 6992764 training and 2996900 test ratings
computing content based similarities
11 movies have no features...:/
computing user ratings based similarities
10604 2031
Correlation between ratings: 0.18514933950049306 0.1504878927838488

In [29]:
# train a simec to map the movie feature vectors into an embedding space,
# where the user rating based similarities are preserved
e_dim = 100
n_targets = len(movie1000_idx)
feat_mat_train = feat_mat_train[movie1000_idx]
model = SimilarityEncoder(feat_mat_train.shape[1], e_dim, n_targets, [(500, "tanh")], sparse_inputs=True, opt=0.001,
                          s_ll_reg=10., S_ll=S_user[:n_targets, :n_targets])
model.fit(feat_mat_train, S_user[:, :n_targets], epochs=20)
# compute embeddings for movies
X_embed = model.transform(feat_mat_train)
# compute simec similarity and check correlation
print(S_user[0, :6])
print(model.predict(feat_mat_train)[0, :6])
S_simec = X_embed.dot(X_embed.T)
print(S_simec[0, :6])
s_simec_flat = np.asarray(S_simec[np.triu_indices_from(S_simec, 1)])
corr_pear = pearsonr(s_simec_flat, s_user_flat)[0]
corr_spear = spearmanr(s_simec_flat, s_user_flat)[0]
print("Correlation between ratings:", corr_pear, corr_spear)
plt.figure()
plt.scatter(s_user_flat[ridx], s_simec_flat[ridx], s=2)
plt.xlabel("User ratings based similarities")
plt.ylabel("SimEc embedding similarities");
plt.title("Movie similarities (Pearson corr: %.3f)" % corr_pear)
if savefigs: plt.savefig("ratings_corr_simec.png", dpi=300, bbox_inches="tight")


Epoch 1/20
2031/2031 [==============================] - 7s 3ms/step - loss: 0.0090
Epoch 2/20
2031/2031 [==============================] - 6s 3ms/step - loss: 0.0064
...
Epoch 20/20
2031/2031 [==============================] - 6s 3ms/step - loss: 0.0046
[ 1.00000000e+00  2.73197152e-02  1.42990715e-02  1.18468596e-02
 -4.98530284e-04  3.39386392e-03]
[ 0.07369027  0.01827954 -0.00049409  0.00426445 -0.01001609  0.00814696]
[ 0.07516888  0.01427721  0.00067488  0.00507359 -0.01216156  0.0072187 ]
Correlation between ratings: 0.8664695261724725 0.8428527595705543
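
Using the hypothetical recommend_similar helper sketched above, concrete suggestions could now be generated for a single movie, e.g., "Shaun of the Dead" (note that feat_mat_train was re-indexed to movie1000_idx in the cell above, so the returned indices are positions in movie1000_idx and the first hit is the movie itself):

i = movie1000_idx.index(map_movieid2index_train["8874"])
top = recommend_similar(model, X_embed, feat_mat_train[i, :], top_k=6)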

In [30]:
# e.g. 8874 Shaun of the Dead
# other movies from 2004:
# 7451 Mean Girls
# 7293 50 First Dates
# 8957 Saw
# 7360 Dawn of the Dead
movies_of_interest = ["8874", "7293", "7360"]
moi_ratings = {m: [] for m in movies_of_interest}
moi_users = {m: set() for m in movies_of_interest}
for (m, u) in tuple_ratings:
    if m in movies_of_interest:
        moi_ratings[m].append(tuple_ratings[(m, u)])
        moi_users[m].add(u)
for m in moi_ratings:
    print(m, movies_data[m]["title"], len(moi_ratings[m]), np.mean(moi_ratings[m]), "\n", movies_data[m]["genres"], "\n", movies_data[m]["keywords"], "\n", movies_data[m]["directors"])


8874 Shaun of the Dead 3964 3.8925327951564075 
 ['Horror', 'Comedy'] 
 ['mother', 'pub', 'surrey', 'satire', 'friends', 'dark comedy', 'zombie', 'cult film', 'survival horror', 'british pub'] 
 ['Edgar Wright']
7293 50 First Dates 2164 3.4450092421441774 
 ['Comedy', 'Romance'] 
 ['deja vu', 'amnesia', 'hawaii', 'ladykiller', 'romantic comedy'] 
 ['Peter Segal']
7360 Dawn of the Dead 1742 3.5539609644087258 
 ['Fantasy', 'Horror', 'Action', 'Drama'] 
 ['refugee', 'mass murder', 'habor', 'car journey', 'department store', 'blackout', 'bus ride', 'pregnancy and birth', 'dying and death', 'bite', 'to shoot dead', 'lorry', 'munition', 'basement garage', 'guard', 'zombie', 'dog', 'car', 'duringcreditsstinger'] 
 ['Zack Snyder']

In [31]:
# content based and simec similarities for our 3 movies of interest
m1, m2, m3 = "8874", "7293", "7360"
for i, j in [(m1, m2), (m1, m3), (m2, m3)]:
    print("Similarities between", movies_data[i]["title"], movies_data[j]["title"])
    i, j = movie1000_idx.index(map_movieid2index_train[i]), movie1000_idx.index(map_movieid2index_train[j])
    print("Content based", S_content[i,j])
    print("User ratings based", S_user[i,j])
    print("SimEc", S_simec[i,j])


Similarities between Shaun of the Dead 50 First Dates
Content based 0.09805807
User ratings based -0.013588011525843567
SimEc 0.0016836281
Similarities between Shaun of the Dead Dawn of the Dead
Content based 0.1132277
User ratings based 0.1013175233895677
SimEc 0.05162354
Similarities between 50 First Dates Dawn of the Dead
Content based 0.0
User ratings based 0.009332836728617075
SimEc 0.0050364695
