Matrix Factorization via Singular Value Decomposition

Matrix factorization is the breaking down of one matrix into a product of multiple matrices. It's extremely well studied in mathematics, and it's highly useful. There are many different ways to factor matrices, but singular value decomposition is particularly useful for making recommendations.

So what is singular value decomposition (SVD)? At a high level, SVD decomposes a matrix $R$ in a way that lets us keep the best lower-rank (i.e. smaller/simpler) approximation of the original matrix $R$. Mathematically, it decomposes $R$ into two unitary matrices and a diagonal matrix:

$$\begin{equation} R = U\Sigma V^{T} \end{equation}$$

where $R$ is the users' ratings matrix, $U$ is the user "features" matrix, $\Sigma$ is the diagonal matrix of singular values (essentially weights), and $V^{T}$ is the movie "features" matrix. $U$ and $V^{T}$ are orthogonal, and they represent different things: $U$ represents how much users "like" each feature, and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower rank approximation, we take these matrices and keep only the top $k$ features, which we think of as the underlying tastes and preferences vectors.
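
To make this concrete, here's a minimal NumPy sketch (separate from the MovieLens analysis below) of truncating an SVD to a rank-$k$ approximation; the toy matrix and variable names are purely illustrative.

import numpy as np

# A tiny "ratings" matrix: two groups of users with opposite tastes
R_toy = np.array([[5., 4., 0., 1.],
                  [4., 5., 1., 0.],
                  [0., 1., 5., 4.],
                  [1., 0., 4., 5.]])

# Full SVD: R_toy = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R_toy, full_matrices=False)

# Keep only the top k singular values/vectors to get the best rank-k approximation
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 2))  # close to R_toy even with only 2 latent features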


In [87]:
import pandas as pd
import numpy as np

r_cols = ['user_id', 'movie_id', 'rating']
m_cols = ['movie_id', 'title', 'genres']

# Load the '::'-separated MovieLens ratings and movies files
ratings_df = pd.read_csv('ratings.dat', sep='::', names=r_cols, engine='python', usecols=range(3), dtype=int)
movies_df = pd.read_csv('movies.dat', sep='::', names=m_cols, engine='python')
movies_df['movie_id'] = movies_df['movie_id'].apply(pd.to_numeric)

In [88]:
movies_df.head(3)


Out[88]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance

In [89]:
ratings_df.head(3)


Out[89]:
user_id movie_id rating
0 1 1193 5
1 1 661 3
2 1 914 3

These look good, but I want the format of my ratings matrix to be one row per user and one column per movie. I'll pivot ratings_df to get that and call the new variable R_df.


In [90]:
R_df = ratings_df.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
R_df.head()


Out[90]:
movie_id 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
user_id
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 3706 columns

The last thing I need to do is de-mean the data (normalize by each user's mean) and convert it from a dataframe to a numpy array.


In [91]:
R = R_df.values  # convert the dataframe to a numpy array (as_matrix() is deprecated)
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)  # subtract each user's mean rating

Singular Value Decomposition

SciPy and NumPy both have functions to compute the singular value decomposition. I'm going to use the SciPy function svds because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate the full decomposition afterwards).


In [92]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k=50)  # k = number of latent factors to keep

Done. The function returns exactly what I detailed earlier in this post, except that the $\Sigma$ returned is just a 1-D array of singular values instead of a diagonal matrix. That's compact, but since I'm going to leverage matrix multiplication to get predictions, I'll convert it to the diagonal matrix form.


In [93]:
sigma = np.diag(sigma)  # turn the 1-D array of singular values into a diagonal matrix

Making Predictions from the Decomposed Matrices

I now have everything I need to make movie rating predictions for every user. I can do it all at once by following the math and matrix multiplying $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $R$.

I also need to add the user means back to get the actual star ratings prediction.


In [94]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

In [95]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df.head()


Out[95]:
movie_id 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
0 4.288861 0.143055 -0.195080 -0.018843 0.012232 -0.176604 -0.074120 0.141358 -0.059553 -0.195950 ... 0.027807 0.001640 0.026395 -0.022024 -0.085415 0.403529 0.105579 0.031912 0.050450 0.088910
1 0.744716 0.169659 0.335418 0.000758 0.022475 1.353050 0.051426 0.071258 0.161601 1.567246 ... -0.056502 -0.013733 -0.010580 0.062576 -0.016248 0.155790 -0.418737 -0.101102 -0.054098 -0.140188
2 1.818824 0.456136 0.090978 -0.043037 -0.025694 -0.158617 -0.131778 0.098977 0.030551 0.735470 ... 0.040481 -0.005301 0.012832 0.029349 0.020866 0.121532 0.076205 0.012345 0.015148 -0.109956
3 0.408057 -0.072960 0.039642 0.089363 0.041950 0.237753 -0.049426 0.009467 0.045469 -0.111370 ... 0.008571 -0.005425 -0.008500 -0.003417 -0.083982 0.094512 0.057557 -0.026050 0.014841 -0.034224
4 1.574272 0.021239 -0.051300 0.246884 -0.032406 1.552281 -0.199630 -0.014920 -0.060498 0.450512 ... 0.110151 0.046010 0.006934 -0.015940 -0.050080 -0.052539 0.507189 0.033830 0.125706 0.199244

5 rows × 3706 columns


In [96]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations):

    # Get and sort the user's predictions (userID starts at 1, not 0)
    user_row_number = userID - 1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.user_id == userID]
    user_full = (user_data.merge(movies_df, how='left', on='movie_id')
                          .sort_values(['rating'], ascending=False))

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))

    # Recommend the highest-predicted-rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movie_id'].isin(user_full['movie_id'])]
                       .merge(pd.DataFrame(sorted_user_predictions).reset_index(),
                              how='left', on='movie_id')
                       .rename(columns={user_row_number: 'Predictions'})
                       .sort_values('Predictions', ascending=False)
                       .iloc[:num_recommendations, :-1])

    return user_full, recommendations

In [97]:
already_rated, predictions = recommend_movies(preds_df, 837, movies_df, ratings_df, 10)


User 837 has already rated 69 movies.
Recommending highest 10 predicted ratings movies not already rated.

In [98]:
predictions


Out[98]:
movie_id title genres
516 527 Schindler's List (1993) Drama|War
1848 1953 French Connection, The (1971) Action|Crime|Drama|Thriller
596 608 Fargo (1996) Crime|Drama|Thriller
1235 1284 Big Sleep, The (1946) Film-Noir|Mystery
2085 2194 Untouchables, The (1987) Action|Crime|Drama
1188 1230 Annie Hall (1977) Comedy|Romance
1198 1242 Glory (1989) Action|Drama|War
897 922 Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) Film-Noir
1849 1954 Rocky (1976) Action|Drama
581 593 Silence of the Lambs, The (1991) Drama|Thriller

In [99]:
already_rated.head(10)


Out[99]:
user_id movie_id rating title genres
36 837 858 5 Godfather, The (1972) Action|Crime|Drama
35 837 1387 5 Jaws (1975) Action|Horror
65 837 2028 5 Saving Private Ryan (1998) Action|Drama|War
63 837 1221 5 Godfather: Part II, The (1974) Action|Crime|Drama
11 837 913 5 Maltese Falcon, The (1941) Film-Noir|Mystery
20 837 3417 5 Crimson Pirate, The (1952) Adventure|Comedy|Sci-Fi
34 837 2186 4 Strangers on a Train (1951) Film-Noir|Thriller
55 837 2791 4 Airplane! (1980) Comedy
31 837 1188 4 Strictly Ballroom (1992) Comedy|Romance
28 837 1304 4 Butch Cassidy and the Sundance Kid (1969) Action|Comedy|Western

Conclusion

We've seen that we can make good recommendations both with collaborative filtering methods that work on the raw ratings (neighborhood models) and with latent features from low-rank matrix factorization methods (factorization models).

Low-dimensional matrix recommenders try to capture the underlying features driving the raw data (which we understand as tastes and preferences). From a theoretical perspective, if we want to make recommendations based on people's tastes, this seems like the better approach. This technique also scales significantly better to larger datasets.

However, we still likely lose some meaningful signal by using a lower-rank matrix. And though these factorization-based techniques work extremely well, there's research being done on new methods. These efforts have resulted in various types of probabilistic matrix factorization (which work and scale even better) and many other approaches.
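
For the curious, here's a minimal sketch (not from this post) of the idea behind probabilistic matrix factorization: its MAP estimate is equivalent to L2-regularized matrix factorization over the observed ratings, trainable with stochastic gradient descent. The function name and all hyperparameters below are illustrative assumptions.

import numpy as np

def mf_sgd(ratings, n_users, n_items, k=20, lr=0.005, reg=0.05, n_epochs=20, seed=0):
    # ratings: list of (user, item, rating) triples for the observed entries only
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))  # user latent factors
    Q = 0.1 * rng.standard_normal((n_items, k))  # item latent factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            p_u, q_i = P[u].copy(), Q[i].copy()
            err = r - p_u @ q_i                   # prediction error on this rating
            P[u] += lr * (err * q_i - reg * p_u)  # SGD step; the L2 terms play the
            Q[i] += lr * (err * p_u - reg * q_i)  # role of PMF's Gaussian priors
    return P, Q

# Toy usage: a few observed ratings for 3 users and 3 movies
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 2, 1)]
P, Q = mf_sgd(triples, n_users=3, n_items=3, k=2)
print(P @ Q.T)  # predicted ratings for every (user, movie) pair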