Recommender Engine

Perhaps the most famous example of a recommender engine in the data science world is the Netflix Prize, a competition started in 2006 in which teams from around the world competed to improve on Netflix's recommendation algorithm. The final prize of $1,000,000 was awarded to a team whose solution improved on the accuracy of Netflix's own algorithm by about 10%. The competition also produced several new techniques which are still in use today. For more reading on this topic, see Simon Funk's blog post.

In this exercise, you will build a collaborative-filtering recommendation engine using both a cosine similarity approach and the SVD (singular value decomposition). Before proceeding, download the MovieLens dataset.

Importing and Pre-processing the Data

First, familiarize yourself with the data you downloaded, then import the u.data file and take a look at the first few rows.


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('./data/ml-100k/u.data', sep='\t', header=None)
df.columns = ['userid', 'itemid', 'rating', 'timestamp']
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df.head()


Out[2]:
userid itemid rating timestamp
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16

Before building any recommendation engines, we'll have to get the data into a useful form. Do this by first splitting the data into training and test sets, and then by constructing two new dataframes whose columns are the unique movies and whose rows are the unique users, filling in 0 for missing values.


In [3]:
len(df['userid'].unique()), len(df['itemid'].unique())


Out[3]:
(943, 1682)

In [4]:
from sklearn.model_selection import train_test_split
dftrain, dftest = train_test_split(df, test_size=0.2)

In [5]:
len(dftrain['userid'].unique()), len(dftrain['itemid'].unique())


Out[5]:
(943, 1645)

In [6]:
missing_items = set(df['itemid'].unique()) - set(dftrain['itemid'].unique())
items = []
for item in missing_items:
    # Dummy row (rating 0) so every item gets a column in the pivot below
    items.append([1, item, 0, np.nan])

dftrain_filled = pd.concat([dftrain, pd.DataFrame(items, columns=dftrain.columns)]).reset_index(drop=True)

In [7]:
len(dftrain_filled['userid'].unique()), len(dftrain_filled['itemid'].unique())


Out[7]:
(943, 1682)

In [8]:
len(dftest['userid'].unique()), len(dftest['itemid'].unique())


Out[8]:
(942, 1406)

In [9]:
missing_items = set(df['itemid'].unique()) - set(dftest['itemid'].unique())
items = []
for item in missing_items:
    # Dummy row (rating 0) so every item gets a column in the pivot below
    items.append([1, item, 0, np.nan])

dftest_filled = pd.concat([dftest, pd.DataFrame(items, columns=dftest.columns)]).reset_index(drop=True)

In [10]:
len(dftest_filled['userid'].unique()), len(dftest_filled['itemid'].unique())


Out[10]:
(942, 1682)

In [11]:
# X = pd.pivot_table(df, values='rating', index='userid', columns='itemid').fillna(0)
# X.head()

Xtrain = pd.pivot_table(dftrain_filled, values='rating', index='userid', columns='itemid').fillna(0)
Xtest = pd.pivot_table(dftest_filled, values='rating', index='userid', columns='itemid').fillna(0)
Xtrain.head()


Out[11]:
itemid 1 2 3 4 5 6 7 8 9 10 ... 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682
userid
1 5.0 3.0 4.0 3.0 0.0 5.0 4.0 1.0 5.0 3.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 4.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1682 columns


In [12]:
print(Xtrain.shape)
print(Xtest.shape)


(943, 1682)
(942, 1682)
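
As an aside, the dummy-row trick above can be replaced by reindexing the pivoted matrix against the full sets of users and items. A minimal sketch on a toy dataframe (all values hypothetical):

```python
import pandas as pd

# Toy ratings table standing in for u.data (hypothetical values)
df_toy = pd.DataFrame({'userid': [1, 1, 2, 3],
                       'itemid': [10, 20, 10, 30],
                       'rating': [5, 3, 4, 2]})
train_toy = df_toy.iloc[:3]  # pretend split: user 3 and item 30 are missing

all_users = sorted(df_toy['userid'].unique())
all_items = sorted(df_toy['itemid'].unique())

# Pivot only the training ratings, then force the full user/item grid
X_alt = (pd.pivot_table(train_toy, values='rating', index='userid', columns='itemid')
           .reindex(index=all_users, columns=all_items)
           .fillna(0))
# X_alt is 3x3 even though user 3 and item 30 never appear in train_toy
```

The same pattern applied to dftrain and dftest would make the dummy-row cells above unnecessary.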

Note that the data was split 80/20 into training and test sets before pivoting, so that the dummy-row trick above could guarantee both matrices have the same columns. (Splitting the pivoted matrix instead, as in the commented-out cell below, would assign whole users to train or test rather than individual ratings.)


In [13]:
# from sklearn.model_selection import train_test_split
# Xtrain, Xtest = train_test_split(X, test_size=0.2)

Cosine Similarity

Building a recommendation engine can be thought of as "filling in the holes" in the sparse matrix you made above. For example, take a look at the MovieLens data. You'll see that the matrix is mostly zeros. Our task here is to predict what rating a given user will give a given movie, based on that user's tastes. To quantify how similar two users' tastes are, we can use cosine similarity, which is given by $$s_u^{cos}(u_k,u_a) = \frac{ u_k \cdot u_a }{ \left \| u_k \right \| \left \| u_a \right \| } = \frac{ \sum x_{k,m}x_{a,m} }{ \sqrt{\sum x_{k,m}^2\sum x_{a,m}^2} }$$ for users $u_k$ and $u_a$ with ratings $x_{k,m}$ and $x_{a,m}$. This is just the cosine of the angle between the two rating vectors. Likewise, the similarity between two items, $i_m$ and $i_b$, is given by $$s_u^{cos}(i_m,i_b) = \frac{ i_m \cdot i_b }{ \left \| i_m \right \| \left \| i_b \right \| } = \frac{ \sum x_{a,m} x_{a,b} }{ \sqrt{ \sum x_{a,m}^2 \sum x_{a,b}^2 } }$$ where the sums now run over users $a$.
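
As a quick sanity check of the formula, the cosine similarity of two small rating vectors can be computed directly with NumPy (a toy example, not drawn from the MovieLens data):

```python
import numpy as np

# Two users' ratings over the same five items (0 = unrated)
u_k = np.array([5.0, 3.0, 0.0, 1.0, 4.0])
u_a = np.array([4.0, 0.0, 0.0, 1.0, 5.0])

# Dot product over the product of the norms
cos_sim = np.dot(u_k, u_a) / (np.linalg.norm(u_k) * np.linalg.norm(u_a))
# cos_sim is close to 1, so these two users have similar tastes
```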

The predicted rating of user $u_k$ for item $m$ is then, for user-based filtering, $$\hat{x}_{k,m} = \bar{x}_{k} + \frac{\sum\limits_{u_a} s_u^{cos}(u_k,u_a) (x_{a,m})}{\sum\limits_{u_a}|s_u^{cos}(u_k,u_a)|}$$ and, for item-based filtering, $$\hat{x}_{k,m} = \frac{\sum\limits_{i_b} s_u^{cos}(i_m,i_b) (x_{k,b}) }{\sum\limits_{i_b}|s_u^{cos}(i_m,i_b)|}$$ where the sums run over the nearest neighbours.
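
To make the item-based prediction formula concrete, here is a minimal worked example with made-up similarities and ratings (all values hypothetical):

```python
import numpy as np

# Hypothetical cosine similarities between the target item and 3 neighbours
sims = np.array([0.9, 0.7, 0.4])
# The user's ratings of those 3 neighbouring items
ratings = np.array([5.0, 3.0, 4.0])

# Similarity-weighted average of the neighbours' ratings
pred = np.sum(sims * ratings) / np.sum(np.abs(sims))
# pred = (0.9*5 + 0.7*3 + 0.4*4) / 2.0 = 4.1
```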

Use these ideas to construct a class cos_engine which can be used to recommend movies for a given user. Be sure to also test your algorithm, reporting its accuracy. Bonus: Use adjusted cosine similarity.


In [14]:
class cos_engine():
    def __init__(self, k, how='user'):
        self.k = k
        self.how = how
    
    def _cos_sim(self, x, y):
        # Cosine similarity; treat all-zero vectors as similarity 0 to avoid 0/0
        num = np.sum(x * y)
        den = np.sqrt(np.sum(x**2) * np.sum(y**2))
        return num / den if den > 0 else 0.0
    
    def _user_based_filtering(self, user, X):
        sim = X.apply(lambda row: self._cos_sim(X.loc[user, :].values, row.values), axis=1)
        sim = sim.drop(user, axis=0)
        rec = sim.sort_values(ascending=False).head(self.k)
        return rec
    
    def _item_based_filtering(self, item, X):
        Xt = X.T
        sim = Xt.apply(lambda row: self._cos_sim(Xt.loc[item, :].values, row.values), axis=1)
        sim = sim.drop(item, axis=0)
        rec = sim.sort_values(ascending=False).head(self.k)
        return rec
    
    def fit(self, X):
        if self.how == 'user':
            Xfilled = []
            for user in X.index.values:
                ubf = self._user_based_filtering(user, X)
                weighted_sum = np.sum(X.loc[ubf.index, :].values * ubf.values.reshape(-1, 1), axis=0)
                Xfilled.append(weighted_sum / np.sum(np.abs(ubf.values)))
            self.Xfilled = pd.DataFrame(np.array(Xfilled), index=X.index, columns=X.columns)
        elif self.how == 'item':
            Xtfilled = []
            for item in X.columns.values:
                ibf = self._item_based_filtering(item, X)
                Xt = X.T
                weighted_sum = np.sum(Xt.loc[ibf.index, :].values * ibf.values.reshape(-1, 1), axis=0)
                Xtfilled.append(weighted_sum / np.sum(np.abs(ibf.values)))
            self.Xfilled = pd.DataFrame(np.array(Xtfilled).T, index=X.index, columns=X.columns)
        else:
            print('Use either "user" or "item" for the parameter "how".')
    
    def predict(self, user, X):
        try:
            ratings = self.Xfilled.loc[user, :]
            rated_items = X.loc[user, :] > 0
            ratings = ratings[~rated_items]
            return ratings.sort_values(ascending=False)
        except AttributeError:
            print('You have to fit the recommender first!')
    
    def score(self, X):
        try:
            return np.sum(np.sum(np.abs(self.Xfilled - X.where(X > 0)))) / X.where(X > 0).count().sum()
        except AttributeError:
            print('You have to fit the recommender first!')

In [15]:
rec_ub = cos_engine(k=7)
rec_ub.fit(Xtrain)

In [16]:
rec_ub.Xfilled.loc[100, 258]


Out[16]:
3.4191191865071575

In [19]:
rec_ub.predict(100, Xtrain + Xtest).head(10)


Out[19]:
itemid
327    2.433366
748    1.723340
343    1.607300
303    1.549017
245    1.387070
873    1.301103
331    1.180993
329    1.141351
322    1.116204
882    1.050156
Name: 100, dtype: float64

In [20]:
rec_ub.score(Xtest)


Out[20]:
2.161008381608144

In [21]:
rec_ib = cos_engine(k=7, how='item')
rec_ib.fit(Xtrain)



In [22]:
rec_ib.Xfilled.loc[100, 258]


Out[22]:
1.5603889490414089

In [24]:
rec_ib.predict(100, Xtrain + Xtest).head(10)


Out[24]:
itemid
301    3.699117
312    3.259807
748    2.870880
307    2.591593
304    2.553904
245    2.426965
311    2.312615
327    2.293961
343    2.199110
319    2.109540
Name: 100, dtype: float64

In [25]:
rec_ib.score(Xtest)


Out[25]:
2.009026847268654


In [19]:
# A vectorized alternative: scikit-learn computes the full user-user
# cosine-similarity matrix in a single call
from sklearn.metrics.pairwise import cosine_similarity

np.abs(cosine_similarity(Xtrain)).sum(axis=1).reshape(943, 1).shape


Out[19]:
(943, 1)

SVD

Above we used cosine similarity to fill the holes in our sparse matrix. Another, and much more popular, method for matrix completion is the singular value decomposition (SVD). The SVD factors our data matrix into three matrices, given by $$\textbf{M} = \textbf{U} \bf{\Sigma} \textbf{V}^*$$ where $\textbf{M}$ is our data matrix, $\textbf{U}$ is a unitary matrix containing the latent variables in the user space, $\bf{\Sigma}$ is a diagonal matrix containing the singular values of $\textbf{M}$, and $\textbf{V}$ is a unitary matrix containing the latent variables in the item space. For more information on the SVD, see the Wikipedia article.

NumPy's np.linalg.svd can compute the SVD of our ratings matrix (for large sparse matrices, scipy.sparse.linalg.svds computes a truncated SVD directly). By keeping only the largest singular values in $\textbf{U}$, $\bf{\Sigma}$, and $\textbf{V}$ and multiplying the truncated factors back together, we can reconstruct a low-rank estimate of the matrix $\textbf{M}$ with all the holes filled in.
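
The truncation-and-reconstruction step can be sketched on a small dense matrix (a toy example; k is the number of singular values kept):

```python
import numpy as np

# Toy ratings matrix with a "hole" (the zero at row 1, column 2)
M = np.array([[5.0, 3.0, 4.0],
              [4.0, 2.0, 0.0],
              [1.0, 1.0, 2.0]])

# Compact SVD, then keep only the top-k singular values
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# M_hat is the best rank-k approximation of M in the least-squares sense;
# its entry at the hole serves as the predicted rating
```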

Use these ideas to construct a class svd_engine which can be used to recommend movies for a given user. Be sure to also test your algorithm, reporting its accuracy. Bonus: Tune any parameters.


In [26]:
class svd_engine():
    
    def __init__(self, k):
        self.k = k
        
    def fit(self, X):
        if self.k <= np.min(X.shape):
            # full_matrices=False gives the compact SVD, which is all we need
            # for a rank-k truncation
            U, S, V = np.linalg.svd(X, full_matrices=False)
            # Split sqrt(S) between the two factors: X ~ (U sqrt(S))(sqrt(S) V)
            Uk = np.dot(U[:, :self.k], np.sqrt(np.diag(S[:self.k])))
            Vk = np.dot(np.sqrt(np.diag(S[:self.k])), V[:self.k, :])
            self.Xfilled = pd.DataFrame(np.dot(Uk, Vk), index=X.index, columns=X.columns)
        else:
            print('k has to be smaller than the minimum dimension of X')
    
    def predict(self, user, X):
        try:
            ratings = self.Xfilled.loc[user, :]
            rated_items = X.loc[user, :] > 0
            ratings = ratings[~rated_items]
            return ratings.sort_values(ascending=False)
        except AttributeError:
            print('You have to fit the recommender first!')
    
    def score(self, X):
        try:
            return np.sum(np.sum(np.abs(self.Xfilled - X.where(X > 0)))) / X.where(X > 0).count().sum()
        except AttributeError:
            print('You have to fit the recommender first!')

In [27]:
rec_svd = svd_engine(k=100)
rec_svd.fit(Xtrain)

In [28]:
rec_svd.Xfilled.loc[100, 258]


Out[28]:
3.0395589377080148

In [30]:
rec_svd.predict(100, Xtrain + Xtest).head(10)


Out[30]:
itemid
307    1.791480
312    1.339279
748    1.328989
301    1.317524
343    1.289623
311    1.228762
331    1.191659
332    1.140870
327    1.135568
245    1.032280
Name: 100, dtype: float64

In [31]:
rec_svd.score(Xtest)


Out[31]:
2.917198706745735

In [32]:
scores = []
for k in np.arange(10, 950, 10):
    rec_svd = svd_engine(k=k)
    rec_svd.fit(Xtrain)
    scores.append([k, rec_svd.score(Xtest)])

In [35]:
pd.DataFrame(scores, columns=['k', 'MAE']).sort_values(by='MAE').head()


Out[35]:
k MAE
0 10 2.340430
1 20 2.365737
2 30 2.449003
3 40 2.527992
4 50 2.608181

In [36]:
scores = []
for k in np.arange(2, 11):
    rec_svd = svd_engine(k=k)
    rec_svd.fit(Xtrain)
    scores.append([k, rec_svd.score(Xtest)])

In [37]:
pd.DataFrame(scores, columns=['k', 'MAE']).sort_values(by='MAE')


Out[37]:
k MAE
8 10 2.340430
7 9 2.346365
6 8 2.354198
5 7 2.371857
4 6 2.389727
3 5 2.422451
2 4 2.460086
1 3 2.496903
0 2 2.579625

In [42]:
rec_svd = svd_engine(k=10)
rec_svd.fit(Xtrain)

In [43]:
rec_svd.Xfilled.loc[100, 258]


Out[43]:
3.2613236758469246

In [44]:
rec_svd.predict(100, Xtrain + Xtest).head(10)


Out[44]:
itemid
748    2.103909
307    1.999920
301    1.888959
332    1.752203
245    1.635943
327    1.577040
303    1.488295
322    1.477374
331    1.462423
304    1.313905
Name: 100, dtype: float64

In [45]:
rec_svd.score(Xtest)


Out[45]:
2.3404299215316