Recommender Engine

Perhaps the most famous example of a recommender engine in the data science world is the Netflix Prize, a competition started in 2006 in which teams from around the world competed to improve on Netflix's recommendation algorithm. The final prize of $1,000,000 was awarded to a team whose solution improved on the accuracy of Netflix's own algorithm by about 10%. The competition also produced several new techniques which are still in use today. For more reading on this topic, see Simon Funk's blog post.

In this exercise, you will build a collaborative-filtering recommendation engine using both a cosine similarity approach and the SVD (singular value decomposition). Before proceeding, download the MovieLens dataset.

Importing and Pre-processing the Data

First, familiarize yourself with the data you downloaded, then import the u.data file and take a look at the first few rows.


In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('./data/ml-100k/u.data', sep='\t', header=None)
df.columns = ['userid', 'itemid', 'rating', 'timestamp']
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df.head()


Out[2]:
userid itemid rating timestamp
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16

Before building any recommendation engines, we'll have to get the data into a useful form. Do this by first splitting the data into training and test sets, and then by constructing two new dataframes whose columns are the unique movies and whose rows are the unique users, filling in 0 for missing values.


In [3]:
len(df['userid'].unique()), len(df['itemid'].unique())


Out[3]:
(943, 1682)

In [4]:
from sklearn.model_selection import train_test_split
dftrain, dftest = train_test_split(df, test_size=0.2)

In [5]:
len(dftrain['userid'].unique()), len(dftrain['itemid'].unique())


Out[5]:
(943, 1645)

In [6]:
missing_items = set(df['itemid'].unique()) - set(dftrain['itemid'].unique())
items = []
for item in missing_items:
    # Dummy row (rating 0) so every item gets a column in the pivot below
    items.append([1, item, 0, np.nan])

dftrain_filled = pd.concat([dftrain, pd.DataFrame(items, columns=dftrain.columns)]).reset_index(drop=True)

In [7]:
len(dftrain_filled['userid'].unique()), len(dftrain_filled['itemid'].unique())


Out[7]:
(943, 1682)

In [8]:
len(dftest['userid'].unique()), len(dftest['itemid'].unique())


Out[8]:
(942, 1406)

In [9]:
missing_items = set(df['itemid'].unique()) - set(dftest['itemid'].unique())
items = []
for item in missing_items:
    # Dummy row (rating 0) so every item gets a column in the pivot below
    items.append([1, item, 0, np.nan])

dftest_filled = pd.concat([dftest, pd.DataFrame(items, columns=dftest.columns)]).reset_index(drop=True)

In [10]:
len(dftest_filled['userid'].unique()), len(dftest_filled['itemid'].unique())


Out[10]:
(942, 1682)

In [11]:
# X = pd.pivot_table(df, values='rating', index='userid', columns='itemid').fillna(0)
# X.head()

Xtrain = pd.pivot_table(dftrain_filled, values='rating', index='userid', columns='itemid').fillna(0)
Xtest = pd.pivot_table(dftest_filled, values='rating', index='userid', columns='itemid').fillna(0)
Xtrain.head()


Out[11]:
itemid 1 2 3 4 5 6 7 8 9 10 ... 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682
userid
1 5.0 3.0 4.0 3.0 0.0 5.0 4.0 1.0 5.0 3.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 4.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1682 columns


In [12]:
print(Xtrain.shape)
print(Xtest.shape)


(943, 1682)
(942, 1682)
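
As an aside, the dummy-row trick above can be replaced by reindexing the pivoted matrix against the full sets of users and items. A minimal sketch on a toy dataframe (all values hypothetical):

```python
import pandas as pd

# Toy ratings table standing in for u.data (hypothetical values)
df_toy = pd.DataFrame({'userid': [1, 1, 2, 3],
                       'itemid': [10, 20, 10, 30],
                       'rating': [5, 3, 4, 2]})
train_toy = df_toy.iloc[:3]  # pretend split: user 3 and item 30 are missing

all_users = sorted(df_toy['userid'].unique())
all_items = sorted(df_toy['itemid'].unique())

# Pivot only the training ratings, then force the full user/item grid
X_alt = (pd.pivot_table(train_toy, values='rating', index='userid', columns='itemid')
           .reindex(index=all_users, columns=all_items)
           .fillna(0))
# X_alt is 3x3 even though user 3 and item 30 never appear in train_toy
```

The same pattern applied to dftrain and dftest would make the dummy-row cells above unnecessary.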

Note that the data was split 80/20 into training and test sets before pivoting, so that the dummy-row trick above could guarantee both matrices have the same columns. (Splitting the pivoted matrix instead, as in the commented-out cell below, would assign whole users to train or test rather than individual ratings.)


In [13]:
# from sklearn.model_selection import train_test_split
# Xtrain, Xtest = train_test_split(X, test_size=0.2)

Cosine Similarity

Building a recommendation engine can be thought of as "filling in the holes" in the sparse matrix you made above. For example, take a look at the MovieLens data. You'll see that the matrix is mostly zeros. Our task here is to predict what rating a given user will give a given movie, based on that user's tastes. To quantify how similar two users' tastes are, we can use cosine similarity, which is given by $$s_u^{cos}(u_k,u_a) = \frac{ u_k \cdot u_a }{ \left \| u_k \right \| \left \| u_a \right \| } = \frac{ \sum x_{k,m}x_{a,m} }{ \sqrt{\sum x_{k,m}^2\sum x_{a,m}^2} }$$ for users $u_k$ and $u_a$ with ratings $x_{k,m}$ and $x_{a,m}$. This is just the cosine of the angle between the two rating vectors. Likewise, the similarity between two items, $i_m$ and $i_b$, is given by $$s_u^{cos}(i_m,i_b) = \frac{ i_m \cdot i_b }{ \left \| i_m \right \| \left \| i_b \right \| } = \frac{ \sum x_{a,m} x_{a,b} }{ \sqrt{ \sum x_{a,m}^2 \sum x_{a,b}^2 } }$$ where the sums now run over users $a$.
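
As a quick sanity check of the formula, the cosine similarity of two small rating vectors can be computed directly with NumPy (a toy example, not drawn from the MovieLens data):

```python
import numpy as np

# Two users' ratings over the same five items (0 = unrated)
u_k = np.array([5.0, 3.0, 0.0, 1.0, 4.0])
u_a = np.array([4.0, 0.0, 0.0, 1.0, 5.0])

# Dot product over the product of the norms
cos_sim = np.dot(u_k, u_a) / (np.linalg.norm(u_k) * np.linalg.norm(u_a))
# cos_sim is close to 1, so these two users have similar tastes
```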

The predicted rating of user $u_k$ for item $m$ is then, for user-based filtering, $$\hat{x}_{k,m} = \bar{x}_{k} + \frac{\sum\limits_{u_a} s_u^{cos}(u_k,u_a) (x_{a,m})}{\sum\limits_{u_a}|s_u^{cos}(u_k,u_a)|}$$ and, for item-based filtering, $$\hat{x}_{k,m} = \frac{\sum\limits_{i_b} s_u^{cos}(i_m,i_b) (x_{k,b}) }{\sum\limits_{i_b}|s_u^{cos}(i_m,i_b)|}$$ where the sums run over the nearest neighbours.
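
To make the item-based prediction formula concrete, here is a minimal worked example with made-up similarities and ratings (all values hypothetical):

```python
import numpy as np

# Hypothetical cosine similarities between the target item and 3 neighbours
sims = np.array([0.9, 0.7, 0.4])
# The user's ratings of those 3 neighbouring items
ratings = np.array([5.0, 3.0, 4.0])

# Similarity-weighted average of the neighbours' ratings
pred = np.sum(sims * ratings) / np.sum(np.abs(sims))
# pred = (0.9*5 + 0.7*3 + 0.4*4) / 2.0 = 4.1
```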

Use these ideas to construct a class cos_engine which can be used to recommend movies for a given user. Be sure to also test your algorithm, reporting its accuracy. Bonus: Use adjusted cosine similarity.


In [14]:
class cos_engine():
    def __init__(self, k, how='user'):
        self.k = k
        self.how = how
    
    def _cos_sim(self, x, y):
        # Cosine similarity; treat all-zero vectors as similarity 0 to avoid 0/0
        num = np.sum(x * y)
        den = np.sqrt(np.sum(x**2) * np.sum(y**2))
        return num / den if den > 0 else 0.0
    
    def _user_based_filtering(self, user, X):
        sim = X.apply(lambda row: self._cos_sim(X.loc[user, :].values, row.values), axis=1)
        sim = sim.drop(user, axis=0)
        rec = sim.sort_values(ascending=False).head(self.k)
        return rec
    
    def _item_based_filtering(self, item, X):
        Xt = X.T
        sim = Xt.apply(lambda row: self._cos_sim(Xt.loc[item, :].values, row.values), axis=1)
        sim = sim.drop(item, axis=0)
        rec = sim.sort_values(ascending=False).head(self.k)
        return rec
    
    def fit(self, X):
        if self.how == 'user':
            Xfilled = []
            for user in X.index.values:
                ubf = self._user_based_filtering(user, X)
                weighted_sum = np.sum(X.loc[ubf.index, :].values * ubf.values.reshape(-1, 1), axis=0)
                Xfilled.append(weighted_sum / np.sum(np.abs(ubf.values)))
            self.Xfilled = pd.DataFrame(np.array(Xfilled), index=X.index, columns=X.columns)
        elif self.how == 'item':
            Xtfilled = []
            for item in X.columns.values:
                ibf = self._item_based_filtering(item, X)
                Xt = X.T
                weighted_sum = np.sum(Xt.loc[ibf.index, :].values * ibf.values.reshape(-1, 1), axis=0)
                Xtfilled.append(weighted_sum / np.sum(np.abs(ibf.values)))
            self.Xfilled = pd.DataFrame(np.array(Xtfilled).T, index=X.index, columns=X.columns)
        else:
            print('Use either "user" or "item" for the parameter "how".')
    
    def predict(self, user, X):
        try:
            ratings = self.Xfilled.loc[user, :]
            rated_items = X.loc[user, :] > 0
            ratings = ratings[~rated_items]
            return ratings.sort_values(ascending=False)
        except AttributeError:
            print('You have to fit the recommender first!')
    
    def score(self, X):
        try:
            return np.sum(np.sum(np.abs(self.Xfilled - X.where(X > 0)))) / X.where(X > 0).count().sum()
        except AttributeError:
            print('You have to fit the recommender first!')

In [15]:
rec_ub = cos_engine(k=7)
rec_ub.fit(Xtrain)

In [16]:
rec_ub.Xfilled.loc[100, 258]


Out[16]:
3.4191191865071575

In [19]:
rec_ub.predict(100, Xtrain + Xtest).head(10)


Out[19]:
itemid
327    2.433366
748    1.723340
343    1.607300
303    1.549017
245    1.387070
873    1.301103
331    1.180993
329    1.141351
322    1.116204
882    1.050156
Name: 100, dtype: float64

In [20]:
rec_ub.score(Xtest)


Out[20]:
2.161008381608144

In [21]:
rec_ib = cos_engine(k=7, how='item')
rec_ib.fit(Xtrain)



In [22]:
rec_ib.Xfilled.loc[100, 258]


Out[22]:
1.5603889490414089

In [24]:
rec_ib.predict(100, Xtrain + Xtest).head(10)


Out[24]:
itemid
301    3.699117
312    3.259807
748    2.870880
307    2.591593
304    2.553904
245    2.426965
311    2.312615
327    2.293961
343    2.199110
319    2.109540
Name: 100, dtype: float64

In [25]:
rec_ib.score(Xtest)


Out[25]:
2.009026847268654


In [19]:
# A vectorized alternative: scikit-learn computes the full user-user
# cosine-similarity matrix in a single call
from sklearn.metrics.pairwise import cosine_similarity

np.abs(cosine_similarity(Xtrain)).sum(axis=1).reshape(943, 1).shape


Out[19]:
(943, 1)

SVD

Above we used cosine similarity to fill the holes in our sparse matrix. Another, and much more popular, method for matrix completion is the singular value decomposition (SVD). The SVD factors our data matrix into three matrices, given by $$\textbf{M} = \textbf{U} \bf{\Sigma} \textbf{V}^*$$ where $\textbf{M}$ is our data matrix, $\textbf{U}$ is a unitary matrix containing the latent variables in the user space, $\bf{\Sigma}$ is a diagonal matrix containing the singular values of $\textbf{M}$, and $\textbf{V}$ is a unitary matrix containing the latent variables in the item space. For more information on the SVD, see the Wikipedia article.

NumPy's np.linalg.svd can compute the SVD of our ratings matrix (for large sparse matrices, scipy.sparse.linalg.svds computes a truncated SVD directly). By keeping only the largest singular values in $\textbf{U}$, $\bf{\Sigma}$, and $\textbf{V}$ and multiplying the truncated factors back together, we can reconstruct a low-rank estimate of the matrix $\textbf{M}$ with all the holes filled in.
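
The truncation-and-reconstruction step can be sketched on a small dense matrix (a toy example; k is the number of singular values kept):

```python
import numpy as np

# Toy ratings matrix with a "hole" (the zero at row 1, column 2)
M = np.array([[5.0, 3.0, 4.0],
              [4.0, 2.0, 0.0],
              [1.0, 1.0, 2.0]])

# Compact SVD, then keep only the top-k singular values
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# M_hat is the best rank-k approximation of M in the least-squares sense;
# its entry at the hole serves as the predicted rating
```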

Use these ideas to construct a class svd_engine which can be used to recommend movies for a given user. Be sure to also test your algorithm, reporting its accuracy. Bonus: Tune any parameters.


In [26]:
class svd_engine():
    
    def __init__(self, k):
        self.k = k
        
    def fit(self, X):
        if self.k <= np.min(X.shape):
            # full_matrices=False gives the compact SVD, which is all we need
            # for a rank-k truncation
            U, S, V = np.linalg.svd(X, full_matrices=False)
            # Split sqrt(S) between the two factors: X ~ (U sqrt(S))(sqrt(S) V)
            Uk = np.dot(U[:, :self.k], np.sqrt(np.diag(S[:self.k])))
            Vk = np.dot(np.sqrt(np.diag(S[:self.k])), V[:self.k, :])
            self.Xfilled = pd.DataFrame(np.dot(Uk, Vk), index=X.index, columns=X.columns)
        else:
            print('k has to be smaller than the minimum dimension of X')
    
    def predict(self, user, X):
        try:
            ratings = self.Xfilled.loc[user, :]
            rated_items = X.loc[user, :] > 0
            ratings = ratings[~rated_items]
            return ratings.sort_values(ascending=False)
        except AttributeError:
            print('You have to fit the recommender first!')
    
    def score(self, X):
        try:
            return np.sum(np.sum(np.abs(self.Xfilled - X.where(X > 0)))) / X.where(X > 0).count().sum()
        except AttributeError:
            print('You have to fit the recommender first!')

In [27]:
rec_svd = svd_engine(k=100)
rec_svd.fit(Xtrain)

In [28]:
rec_svd.Xfilled.loc[100, 258]


Out[28]:
3.0395589377080148

In [30]:
rec_svd.predict(100, Xtrain + Xtest).head(10)


Out[30]:
itemid
307    1.791480
312    1.339279
748    1.328989
301    1.317524
343    1.289623
311    1.228762
331    1.191659
332    1.140870
327    1.135568
245    1.032280
Name: 100, dtype: float64

In [31]:
rec_svd.score(Xtest)


Out[31]:
2.917198706745735

In [32]:
scores = []
for k in np.arange(10, 950, 10):
    rec_svd = svd_engine(k=k)
    rec_svd.fit(Xtrain)
    scores.append([k, rec_svd.score(Xtest)])

In [35]:
pd.DataFrame(scores, columns=['k', 'MAE']).sort_values(by='MAE').head()


Out[35]:
k MAE
0 10 2.340430
1 20 2.365737
2 30 2.449003
3 40 2.527992
4 50 2.608181

In [36]:
scores = []
for k in np.arange(2, 11):
    rec_svd = svd_engine(k=k)
    rec_svd.fit(Xtrain)
    scores.append([k, rec_svd.score(Xtest)])

In [37]:
pd.DataFrame(scores, columns=['k', 'MAE']).sort_values(by='MAE')


Out[37]:
k MAE
8 10 2.340430
7 9 2.346365
6 8 2.354198
5 7 2.371857
4 6 2.389727
3 5 2.422451
2 4 2.460086
1 3 2.496903
0 2 2.579625

In [42]:
rec_svd = svd_engine(k=10)
rec_svd.fit(Xtrain)

In [43]:
rec_svd.Xfilled.loc[100, 258]


Out[43]:
3.2613236758469246

In [44]:
rec_svd.predict(100, Xtrain + Xtest).head(10)


Out[44]:
itemid
748    2.103909
307    1.999920
301    1.888959
332    1.752203
245    1.635943
327    1.577040
303    1.488295
322    1.477374
331    1.462423
304    1.313905
Name: 100, dtype: float64

In [45]:
rec_svd.score(Xtest)


Out[45]:
2.3404299215316