Matrix Factorization via Singular Value Decomposition

Matrix factorization is the breaking down of one matrix in a product of multiple matrices. It's extremely well studied in mathematics, and it's highly useful. There are many different ways to factor matrices, but singular value decomposition is particularly useful for making recommendations.

So what is singular value decomposition (SVD)? At a high level, SVD is an algorithm that decomposes a matrix $R$ into the best lower rank (i.e. smaller/simpler) approximation of the original matrix $R$. Mathematically, it decomposes R into a two unitary matrices and a diagonal matrix:

$$\begin{equation} R = U\Sigma V^{T} \end{equation}$$

where R is users's ratings matrix, $U$ is the user "features" matrix, $\Sigma$ is the diagonal matrix of singular values (essentially weights), and $V^{T}$ is the movie "features" matrix. $U$ and $V^{T}$ are orthogonal, and represent different things. $U$ represents how much users "like" each feature and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower rank approximation, we take these matrices and keep only the top $k$ features, which we think of as the underlying tastes and preferences vectors.



In [17]:

    
import pandas as pd
import numpy as np



In [18]:

    
movies_df = pd.read_csv('movies.csv')
movies_df['movie_id'] = movies_df['movie_id'].apply(pd.to_numeric)
movies_df.head(3)









    Out[18]:







  
    
      
      movie_id
      title
      genres
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance



In [19]:

    
ratings_df=pd.read_csv('ratings.csv')
ratings_df.head(3)

These look good, but I want the format of my ratings matrix to be one row per user and one column per movie. I'll pivot ratings_df to get that and call the new variable R.



In [20]:

    
R_df = ratings_df.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
R_df.head()









    Out[20]:







  
    
      movie_id
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      ...
      161084
      161155
      161594
      161830
      161918
      161944
      162376
      162542
      162672
      163949
    
    
      user_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      2
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      4.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      3
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      4
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      4.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      5
      0.0
      0.0
      4.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
  

5 rows × 9066 columns

The last thing I need to do is de-mean the data (normalize by each users mean) and convert it from a dataframe to a numpy array.



In [21]:

    
R = R_df.as_matrix()
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

Singular Value Decomposition

Scipy and Numpy both have functions to do the singular value decomposition. I'm going to use the Scipy function svds because it let's me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it after).



In [22]:

    
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)

Done. The function returns exactly what I detailed earlier in this post, except that the $\Sigma$ returned is just the values instead of a diagonal matrix. This is useful, but since I'm going to leverage matrix multiplication to get predictions I'll convert it to the diagonal matrix form.



In [23]:

    
sigma = np.diag(sigma)

Making Predictions from the Decomposed Matrices

I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and matrix multiply $U$, $\Sigma$, and $V^{T}$ back to get the rank $k=50$ approximation of $R$.

I also need to add the user means back to get the actual star ratings prediction.



In [24]:

    
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)



In [25]:

    
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df.head()









    Out[25]:







  
    
      movie_id
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      ...
      161084
      161155
      161594
      161830
      161918
      161944
      162376
      162542
      162672
      163949
    
  
  
    
      0
      -0.054239
      0.045130
      -0.004835
      -0.019817
      -0.011284
      0.041373
      -0.007822
      -0.017188
      0.012246
      0.037670
      ...
      -0.005258
      -0.005453
      0.012369
      -0.004991
      -0.004639
      -0.019055
      0.021402
      -0.006365
      -0.006098
      -0.004819
    
    
      1
      0.419835
      1.406440
      -0.188807
      0.156658
      0.268032
      0.414698
      0.052172
      0.044728
      -0.020198
      2.220256
      ...
      -0.005909
      -0.003974
      -0.012555
      -0.003555
      -0.002711
      -0.071621
      -0.016212
      0.001047
      -0.001468
      -0.006577
    
    
      2
      1.345619
      0.266505
      -0.011962
      0.012278
      0.079508
      0.090960
      -0.122094
      0.031327
      -0.018023
      0.141176
      ...
      -0.002647
      -0.002364
      -0.010153
      0.000277
      -0.000116
      -0.018063
      -0.015761
      0.010611
      0.006792
      -0.006357
    
    
      3
      1.133455
      1.046982
      0.141275
      0.081841
      -0.339675
      -1.484659
      -0.263096
      -0.169750
      -0.021862
      1.611664
      ...
      0.020805
      0.000410
      0.056040
      -0.002817
      -0.000767
      0.159159
      0.087519
      -0.030854
      -0.021279
      0.048529
    
    
      4
      1.389578
      1.466495
      0.605557
      -0.029647
      0.729380
      -0.118539
      -0.026017
      0.065577
      -0.156655
      0.307926
      ...
      -0.007422
      -0.011810
      0.006644
      -0.005159
      -0.001249
      -0.034658
      0.016456
      0.001710
      -0.004166
      -0.001864
    
  

5 rows × 9066 columns



In [26]:

    
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False) # UserID starts at 1
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.user_id == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movie_id'].isin(user_full['movie_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations



In [27]:

    
already_rated, predictions = recommend_movies(preds_df,11, movies_df, ratings_df, 10)









    



User 11 has already rated 38 movies.
Recommending highest 10 predicted ratings movies not already rated.



In [28]:

    
predictions









    Out[28]:







  
    
      
      movie_id
      title
      genres
    
  
  
    
      279
      318
      Shawshank Redemption, The (1994)
      Crime|Drama
    
    
      6894
      58559
      Dark Knight, The (2008)
      Action|Crime|Drama|IMAX
    
    
      2359
      2959
      Fight Club (1999)
      Action|Crime|Drama|Thriller
    
    
      530
      608
      Fargo (1996)
      Comedy|Crime|Drama|Thriller
    
    
      1356
      1732
      Big Lebowski, The (1998)
      Comedy|Crime
    
    
      959
      1213
      Goodfellas (1990)
      Crime|Drama
    
    
      7264
      70286
      District 9 (2009)
      Mystery|Sci-Fi|Thriller
    
    
      2025
      2542
      Lock, Stock & Two Smoking Barrels (1998)
      Comedy|Crime|Thriller
    
    
      520
      593
      Silence of the Lambs, The (1991)
      Crime|Horror|Thriller
    
    
      871
      1089
      Reservoir Dogs (1992)
      Crime|Mystery|Thriller



In [29]:

    
already_rated.head(10)









    Out[29]:







  
    
      
      user_id
      movie_id
      rating
      title
      genres
    
  
  
    
      0
      11
      50
      5.0
      Usual Suspects, The (1995)
      Crime|Mystery|Thriller
    
    
      7
      11
      923
      5.0
      Citizen Kane (1941)
      Drama|Mystery
    
    
      36
      11
      104841
      5.0
      Gravity (2013)
      Action|Sci-Fi|IMAX
    
    
      18
      11
      26614
      5.0
      Bourne Identity, The (1988)
      Action|Adventure|Drama|Mystery|Thriller
    
    
      17
      11
      6598
      5.0
      Step Into Liquid (2002)
      Documentary
    
    
      10
      11
      1408
      5.0
      Last of the Mohicans, The (1992)
      Action|Romance|War|Western
    
    
      9
      11
      1201
      5.0
      Good, the Bad and the Ugly, The (Buono, il bru...
      Action|Adventure|Western
    
    
      19
      11
      48516
      5.0
      Departed, The (2006)
      Crime|Drama|Thriller
    
    
      37
      11
      106487
      5.0
      The Hunger Games: Catching Fire (2013)
      Action|Adventure|Sci-Fi|IMAX
    
    
      4
      11
      296
      5.0
      Pulp Fiction (1994)
      Comedy|Crime|Drama|Thriller

Conclusion

We've seen that we can make good recommendations with raw data based collaborative filtering methods (neighborhood models) and latent features from low-rank matrix factorization methods (factorization models).

Low-dimensional matrix recommenders try to capture the underlying features driving the raw data (which we understand as tastes and preferences). From a theoretical perspective, if we want to make recommendations based on people's tastes, this seems like the better approach. This technique also scales significantly better to larger datasets.

However, we still likely lose some meaningful signals by using a lower-rank matrix. And though these factorization based techniques work extremely well, there's research being done on new methods. These efforts have resulted in various types probabilistic matrix factorization (which works and scales even better) and many other approaches.

	movie_id	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance

movie_id	1	2	3	4	5	6	7	8	9	10	...	161084	161155	161594	161830	161918	161944	162376	162542	162672	163949
user_id
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	4.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	0.0	0.0	4.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

movie_id	1	2	3	4	5	6	7	8	9	10	...	161084	161155	161594	161830	161918	161944	162376	162542	162672	163949
0	-0.054239	0.045130	-0.004835	-0.019817	-0.011284	0.041373	-0.007822	-0.017188	0.012246	0.037670	...	-0.005258	-0.005453	0.012369	-0.004991	-0.004639	-0.019055	0.021402	-0.006365	-0.006098	-0.004819
1	0.419835	1.406440	-0.188807	0.156658	0.268032	0.414698	0.052172	0.044728	-0.020198	2.220256	...	-0.005909	-0.003974	-0.012555	-0.003555	-0.002711	-0.071621	-0.016212	0.001047	-0.001468	-0.006577
2	1.345619	0.266505	-0.011962	0.012278	0.079508	0.090960	-0.122094	0.031327	-0.018023	0.141176	...	-0.002647	-0.002364	-0.010153	0.000277	-0.000116	-0.018063	-0.015761	0.010611	0.006792	-0.006357
3	1.133455	1.046982	0.141275	0.081841	-0.339675	-1.484659	-0.263096	-0.169750	-0.021862	1.611664	...	0.020805	0.000410	0.056040	-0.002817	-0.000767	0.159159	0.087519	-0.030854	-0.021279	0.048529
4	1.389578	1.466495	0.605557	-0.029647	0.729380	-0.118539	-0.026017	0.065577	-0.156655	0.307926	...	-0.007422	-0.011810	0.006644	-0.005159	-0.001249	-0.034658	0.016456	0.001710	-0.004166	-0.001864

	movie_id	title	genres
279	318	Shawshank Redemption, The (1994)	Crime\|Drama
6894	58559	Dark Knight, The (2008)	Action\|Crime\|Drama\|IMAX
2359	2959	Fight Club (1999)	Action\|Crime\|Drama\|Thriller
530	608	Fargo (1996)	Comedy\|Crime\|Drama\|Thriller
1356	1732	Big Lebowski, The (1998)	Comedy\|Crime
959	1213	Goodfellas (1990)	Crime\|Drama
7264	70286	District 9 (2009)	Mystery\|Sci-Fi\|Thriller
2025	2542	Lock, Stock & Two Smoking Barrels (1998)	Comedy\|Crime\|Thriller
520	593	Silence of the Lambs, The (1991)	Crime\|Horror\|Thriller
871	1089	Reservoir Dogs (1992)	Crime\|Mystery\|Thriller

	user_id	movie_id	rating	title	genres
0	11	50	5.0	Usual Suspects, The (1995)	Crime\|Mystery\|Thriller
7	11	923	5.0	Citizen Kane (1941)	Drama\|Mystery
36	11	104841	5.0	Gravity (2013)	Action\|Sci-Fi\|IMAX
18	11	26614	5.0	Bourne Identity, The (1988)	Action\|Adventure\|Drama\|Mystery\|Thriller
17	11	6598	5.0	Step Into Liquid (2002)	Documentary
10	11	1408	5.0	Last of the Mohicans, The (1992)	Action\|Romance\|War\|Western
9	11	1201	5.0	Good, the Bad and the Ugly, The (Buono, il bru...	Action\|Adventure\|Western
19	11	48516	5.0	Departed, The (2006)	Crime\|Drama\|Thriller
37	11	106487	5.0	The Hunger Games: Catching Fire (2013)	Action\|Adventure\|Sci-Fi\|IMAX
4	11	296	5.0	Pulp Fiction (1994)	Comedy\|Crime\|Drama\|Thriller