Matrix Factorization via Singular Value Decomposition

Matrix factorization is the decomposition of one matrix into a product of several matrices. It's extremely well studied in mathematics, and it's highly useful. There are many different ways to factor matrices, but singular value decomposition is particularly useful for making recommendations.

So what is singular value decomposition (SVD)? At a high level, SVD is an algorithm that factors a matrix $R$ in a way that makes it easy to build the best lower-rank (i.e. smaller/simpler) approximation of $R$. Mathematically, it decomposes $R$ into two unitary matrices and a diagonal matrix:

$$\begin{equation} R = U\Sigma V^{T} \end{equation}$$

where $R$ is the users' ratings matrix, $U$ is the user "features" matrix, $\Sigma$ is the diagonal matrix of singular values (essentially weights), and $V^{T}$ is the movie "features" matrix. $U$ and $V^{T}$ are orthogonal, and represent different things: $U$ represents how much each user "likes" each feature, and $V^{T}$ represents how relevant each feature is to each movie.

To get the lower-rank approximation, we take these matrices and keep only the top $k$ features (the ones with the $k$ largest singular values), which we think of as the underlying taste and preference vectors.
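To make the truncation concrete, here is a minimal NumPy sketch on a made-up 4×5 matrix (the values and the names `A`, `k`, and `A_k` are illustrative and not taken from the MovieLens data). It keeps the top $k$ singular values and checks the Eckart-Young property: the Frobenius error of the rank-$k$ approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

# Toy 4x5 matrix (illustrative values only, not MovieLens data)
A = np.array([[5.0, 3.0, 0.0, 1.0, 4.0],
              [4.0, 0.0, 0.0, 1.0, 5.0],
              [1.0, 1.0, 0.0, 5.0, 2.0],
              [0.0, 1.0, 5.0, 4.0, 1.0]])

# Thin SVD: np.linalg.svd returns singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent features to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the rank-k approximation
error = np.linalg.norm(A - A_k, ord='fro')
# Eckart-Young: the error equals the norm of the discarded singular values
expected_error = np.sqrt(np.sum(s[k:] ** 2))

print(np.linalg.matrix_rank(A_k))
print(error, expected_error)
```

Increasing `k` shrinks the error, at the cost of keeping more features.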



In [1]:

    
import pandas as pd
import numpy as np

r_cols = ['user_id', 'movie_id', 'rating']
m_cols = ['movie_id', 'title']

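# u.data is tab-separated: user id, movie id, rating, timestamp (the timestamp is skipped)
ratings_df = pd.read_csv('u.data', sep='\t', names=r_cols, usecols=range(3), dtype=int)
# u.item is pipe-separated and not UTF-8 encoded, so the encoding is set explicitly
movies_df = pd.read_csv('u.item', sep='|', names=m_cols, usecols=range(2), encoding='latin-1')
movies_df['movie_id'] = movies_df['movie_id'].apply(pd.to_numeric)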



In [2]:

    
movies_df.head(3)









    Out[2]:

	movie_id	title
0	1	Toy Story (1995)
1	2	GoldenEye (1995)
2	3	Four Rooms (1995)



In [3]:

    
ratings_df.head(3)

These look good, but I want my ratings matrix to have one row per user and one column per movie. I'll pivot ratings_df to get that and call the new variable R_df.



In [4]:

    
R_df = ratings_df.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
R_df.head()









    Out[4]:

movie_id	1	2	3	4	5	6	7	8	9	10	...	1673	1674	1675	1676	1677	1678	1679	1680	1681	1682
user_id
1	5.0	3.0	4.0	3.0	3.0	5.0	4.0	1.0	5.0	3.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	4.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5	4.0	3.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 1682 columns
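As a small illustration of what the pivot does, the sketch below applies the same call to a made-up long-format table (the `toy_ratings` frame and its values are hypothetical). Each (user, movie) pair becomes one cell, and pairs with no rating come out as NaN before `fillna(0)` turns them into zeros.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating':   [5, 3, 4, 2],
})

# Same reshaping as above: users as rows, movies as columns
toy_R = toy_ratings.pivot(index='user_id', columns='movie_id', values='rating')
n_missing = int(toy_R.isna().sum().sum())  # cells with no rating yet

toy_R = toy_R.fillna(0)
print(toy_R)
print('missing cells filled with 0:', n_missing)
```

A 0 in the pivoted matrix therefore means "not rated", not a rating of zero stars.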

The last thing I need to do is de-mean the data (normalize by each user's mean) and convert it from a DataFrame to a NumPy array.



In [5]:

    
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
R = R_df.to_numpy()
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
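
The de-meaning step relies on NumPy broadcasting: reshaping the per-user means into a column lets them be subtracted from every entry in the matching row. The sketch below shows this on a made-up 2×4 matrix (the values are illustrative). It also highlights one consequence of the approach used here: because unrated movies are stored as 0, they are included in each user's mean.

```python
import numpy as np

# Toy ratings for 2 users x 4 movies; 0 means "not rated" (illustrative values)
toy_R = np.array([[5.0, 3.0, 0.0, 0.0],
                  [0.0, 4.0, 0.0, 0.0]])

# Mean over all movies, as in the cell above (zeros count toward the mean)
toy_mean = np.mean(toy_R, axis=1)                # shape (2,)
toy_demeaned = toy_R - toy_mean.reshape(-1, 1)   # (2, 4) minus (2, 1) broadcasts per row

print(toy_mean)                  # [2. 1.]
print(toy_demeaned.sum(axis=1))  # each row now sums to zero
```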

Singular Value Decomposition

SciPy and NumPy both have functions to do the singular value decomposition. I'm going to use the SciPy function svds because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).



In [6]:

    
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)

Done. The function returns exactly what I detailed earlier in this post, except that the $\Sigma$ returned is just the values instead of a diagonal matrix. This is useful, but since I'm going to leverage matrix multiplication to get predictions I'll convert it to the diagonal matrix form.



In [7]:

    
sigma = np.diag(sigma)
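
As a sanity check on what `svds` returns, the following sketch compares it with a truncated `np.linalg.svd` on a small random matrix (the matrix `A`, the seed, and `k = 3` are arbitrary choices for illustration). Since the ordering of the singular values returned by `svds` may differ from NumPy's descending order, the sketch sorts them before comparing.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 12))   # arbitrary dense test matrix
k = 3                           # svds requires k < min(A.shape)

# Truncated SVD: only the k largest singular values/vectors are computed
U_k, s_k, Vt_k = svds(A, k=k)

# Full SVD, truncated afterwards (singular values in descending order)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Same singular values once both are sorted the same way
same_values = np.allclose(np.sort(s_k)[::-1], s[:k])

# Same rank-k approximation, regardless of ordering or sign of the vectors
A_svds = U_k @ np.diag(s_k) @ Vt_k
A_full = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
same_approx = np.allclose(A_svds, A_full)

print(same_values, same_approx)
```

The individual columns of `U_k` may differ from NumPy's by sign or order, which is why the comparison is made on the reconstructed matrices rather than on the factors themselves.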

Making Predictions from the Decomposed Matrices

I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and multiplying $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $R$.

I also need to add the user means back to get the actual star ratings prediction.



In [8]:

    
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
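
The same reconstruction can also be written with the `@` operator, and the diagonal matrix is not strictly required: scaling the columns of $U$ by the singular values gives the same product. A minimal sketch on random stand-in factors (the shapes and the names `U_toy`, `s_toy`, `Vt_toy`, and `mean_toy` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, k = 6, 8, 3

# Stand-ins for the factors (random values, for shape illustration only)
U_toy = rng.normal(size=(n_users, k))
s_toy = np.array([3.0, 2.0, 1.0])            # 1-D vector of singular values
Vt_toy = rng.normal(size=(k, n_movies))
mean_toy = rng.uniform(1, 5, size=n_users)   # one mean per user

# Form used above: explicit diagonal matrix and nested np.dot
pred_dot = np.dot(np.dot(U_toy, np.diag(s_toy)), Vt_toy) + mean_toy.reshape(-1, 1)

# Equivalent form: scale column j of U by s[j], then one matrix product
pred_bcast = (U_toy * s_toy) @ Vt_toy + mean_toy[:, np.newaxis]

print(pred_dot.shape, np.allclose(pred_dot, pred_bcast))
```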



In [9]:

    
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df.head()









    Out[9]:

movie_id	1	2	3	4	5	6	7	8	9	10	...	1673	1674	1675	1676	1677	1678	1679	1680	1681	1682
0	6.488436	2.959503	1.634987	3.024467	1.656526	1.659506	3.630469	0.240669	1.791518	3.347816	...	0.011976	-0.092017	-0.074553	-0.060985	0.009427	-0.035641	-0.039227	-0.037434	-0.025552	0.023513
1	2.347262	0.129689	-0.098917	0.328828	0.159517	0.481361	0.213002	0.097908	1.892100	0.671000	...	0.003943	-0.026939	-0.035460	-0.029883	-0.027153	-0.015244	-0.008277	-0.011760	0.011639	-0.046924
2	0.291905	-0.263830	-0.151454	-0.179289	0.013462	-0.088309	-0.057624	0.568764	-0.018506	0.280742	...	-0.028964	-0.031622	0.045513	0.026089	-0.021705	0.002282	0.032363	0.017322	-0.006644	-0.009480
3	0.366410	-0.443535	0.041151	-0.007616	0.055373	-0.080352	0.299015	-0.010882	-0.160888	-0.118834	...	0.020069	0.015981	-0.000182	0.005593	0.026634	0.023562	0.036405	0.029984	0.015612	-0.008713
4	4.263488	1.937122	0.052529	1.049350	0.652765	0.002836	1.730461	0.870584	0.341027	0.569055	...	0.019973	-0.053521	-0.017242	-0.007137	-0.038987	0.010338	0.004869	0.007603	-0.020575	0.003330

5 rows × 1682 columns



In [10]:

    
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False) # use the argument, not the global preds_df
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.user_id == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movie_id'].isin(user_full['movie_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations
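
The core of the function is "drop what the user has already rated, then take the highest predictions". The sketch below shows that step in isolation with plain NumPy on made-up numbers (the arrays `toy_pred` and `toy_rated` and the helper `top_n_unseen` are hypothetical and not part of the code above).

```python
import numpy as np

def top_n_unseen(pred_row, rated_mask, n):
    """Return indices of the n highest predictions among unrated items."""
    # Push already-rated items to the bottom so they are never selected
    scores = np.where(rated_mask, -np.inf, pred_row)
    order = np.argsort(scores)[::-1]          # descending by predicted rating
    n_unrated = int((~rated_mask).sum())
    return order[:min(n, n_unrated)]

# Made-up predictions for one user over 6 movies
toy_pred = np.array([4.2, 0.3, 3.9, 1.1, 4.8, 2.5])
toy_rated = np.array([True, False, False, False, True, False])  # movies 0 and 4 already rated

print(top_n_unseen(toy_pred, toy_rated, 2))  # [2 5]
```

The returned values are zero-based column positions, so they would need to be mapped back to the one-based `movie_id` values before looking up titles.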



In [31]:

    
already_rated, predictions = recommend_movies(preds_df,276, movies_df, ratings_df, 10)









    



User 276 has already rated 518 movies.
Recommending highest 10 predicted ratings movies not already rated.



In [32]:

    
predictions









    Out[32]:

	movie_id	title
25	93	Welcome to the Dollhouse (1995)
295	642	Grifters, The (1990)
215	515	Boot, Das (1981)
24	90	So I Married an Axe Murderer (1993)
10	32	Crumb (1994)
209	509	My Left Foot (1989)
15	48	Hoop Dreams (1994)
43	132	Wizard of Oz, The (1939)
406	824	Great White Hype, The (1996)
183	480	North by Northwest (1959)



In [34]:

    
already_rated.head(10)









    Out[34]:

	user_id	movie_id	rating	title
225	276	124	5	Lone Star (1996)
416	276	173	5	Princess Bride, The (1987)
207	276	204	5	Back to the Future (1985)
209	276	223	5	Sling Blade (1996)
95	276	272	5	Good Will Hunting (1997)
293	276	100	5	Fargo (1996)
213	276	423	5	E.T. the Extra-Terrestrial (1982)
292	276	603	5	Rear Window (1954)
387	276	174	5	Raiders of the Lost Ark (1981)
83	276	853	5	Braindead (1992)

Conclusion

We've seen that we can make good recommendations both with collaborative filtering methods based on raw data (neighborhood models) and with latent features from low-rank matrix factorization methods (factorization models).

Low-dimensional matrix recommenders try to capture the underlying features driving the raw data (which we understand as tastes and preferences). From a theoretical perspective, if we want to make recommendations based on people's tastes, this seems like the better approach. This technique also scales significantly better to larger datasets.

However, we still likely lose some meaningful signals by using a lower-rank matrix. And though these factorization-based techniques work extremely well, there's research being done on new methods. These efforts have resulted in various types of probabilistic matrix factorization (which works and scales even better) and many other approaches.
