Assignment 3: User-User Collaborative Filtering




Importing Libraries



In [1]:

    
import numpy as np
import pandas as pd

Loading the Data

Loading the movie data from Excel into a DataFrame.



In [2]:

    
mov_user_data = pd.read_excel('Assign_3_data.xlsx')

Converting the column labels of the DataFrame from int to str, so that users can be selected by string ID (for example '3712').



In [3]:

    
mov_user_data.columns = [str(val) for val in list(mov_user_data.columns)]
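As a minimal illustration of why the labels are converted, the sketch below builds a tiny frame with integer column labels (the values are made up, not taken from the assignment data) and shows that a string lookup only succeeds after the conversion. `Index.astype(str)` is used here as a shorter spelling of the list comprehension above.

```python
import pandas as pd

# Toy frame with integer column labels; values are illustrative only.
toy = pd.DataFrame({1648: [4.0, 3.5], 5136: [5.0, 4.5]})

# Before the conversion the labels are ints, so the string '1648' is not found.
has_str_label_before = '1648' in toy.columns

# Same effect as [str(val) for val in toy.columns].
toy.columns = toy.columns.astype(str)

# After the conversion the string label is present.
has_str_label_after = '1648' in toy.columns
print(has_str_label_before, has_str_label_after)
```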

Looking at the head of the dataframe.



In [4]:

    
mov_user_data.head()









    Out[4]:






  
    
      
                                                                   1648  5136   918  2824  3867   860  3712  2968  3525  4323   ...  3556  5261  2492  5062  2486  4942  2267  4809  3853  2288
11: Star Wars: Episode IV - A New Hope (1977)                       NaN   4.5   5.0   4.5   4.0   4.0   NaN   5.0   4.0     5   ...     4   NaN   4.5   4.0   3.5   NaN   NaN   NaN   NaN   NaN
12: Finding Nemo (2003)                                             NaN   5.0   5.0   NaN   4.0   4.0   4.5   4.5   4.0     5   ...     4   NaN   3.5   4.0   2.0   3.5   NaN   NaN   NaN   3.5
13: Forrest Gump (1994)                                             NaN   5.0   4.5   5.0   4.5   4.5   NaN   5.0   4.5     5   ...     4   5.0   3.5   4.5   4.5   4.0   3.5   4.5   3.5   3.5
14: American Beauty (1999)                                          NaN   4.0   NaN   NaN   NaN   NaN   4.5   2.0   3.5     5   ...     4   NaN   3.5   4.5   3.5   4.0   NaN   3.5   NaN   NaN
22: Pirates of the Caribbean: The Curse of the Black Pearl (2003)     4   5.0   3.0   4.5   4.0   2.5   NaN   5.0   3.0     4   ...     3   1.5   4.0   4.0   2.5   3.5   NaN   5.0   NaN   3.5
    
  

5 rows × 25 columns

Creating a correlation matrix dataframe

Computing the pairwise Pearson correlation coefficients between users (the columns of the DataFrame).



In [5]:

    
corr_df = mov_user_data.corr()
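To make explicit what this call computes, here is a small sketch on made-up ratings (the frame `toy` and the helper `pearson_on_shared` are hypothetical names, not part of the assignment). It assumes the default behaviour of `DataFrame.corr()`: Pearson correlation between columns, computed for each pair of users only over the movies that both have rated.

```python
import numpy as np
import pandas as pd

# Toy ratings: rows are movies, columns are users; NaN means "not rated".
toy = pd.DataFrame(
    {'a': [5.0, 4.0, np.nan, 2.0, 1.0],
     'b': [4.5, np.nan, 3.0, 2.5, 1.0]},
    index=['m1', 'm2', 'm3', 'm4', 'm5'])

def pearson_on_shared(x, y):
    """Pearson correlation using only the rows where both users have a rating."""
    shared = x.notna() & y.notna()
    xs, ys = x[shared], y[shared]
    xc, yc = xs - xs.mean(), ys - ys.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

manual = pearson_on_shared(toy['a'], toy['b'])
builtin = toy.corr().loc['a', 'b']
print(manual, builtin)
```

The two printed values should agree up to floating-point error, which is why users with few co-rated movies can end up with correlations based on very little data.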

Looking at the head of the correlation coefficient dataframe



In [6]:

    
corr_df.head()









    Out[6]:






  
    
      
           1648       5136        918       2824       3867        860       3712       2968       3525       4323        ...       3556       5261       2492       5062       2486       4942       2267       4809       3853       2288
1648   1.000000   0.402980  -0.142206   0.517620   0.300200   0.480537  -0.312412   0.383348   0.092775   0.098191        ...  -0.191988   0.493008   0.360644   0.551089   0.002544   0.116653  -0.429183   0.394371  -0.304422   0.245048
5136   0.402980   1.000000   0.118979   0.057916   0.341734   0.241377   0.131398   0.206695   0.360056   0.033642        ...   0.488607   0.328120   0.422236   0.226635   0.305803   0.037769   0.240728   0.411676   0.189234   0.390067
918   -0.142206   0.118979   1.000000  -0.317063   0.294558   0.468333   0.092037  -0.045854   0.367568  -0.035394        ...   0.373226   0.470972   0.069956  -0.054762   0.133812   0.015169  -0.273096   0.082528   0.667168   0.119162
2824   0.517620   0.057916  -0.317063   1.000000  -0.060913  -0.008066   0.462910   0.214760   0.169907   0.119350        ...  -0.201275   0.228341   0.238700   0.259660   0.247097   0.149247  -0.361466   0.474974  -0.262073   0.166999
3867   0.300200   0.341734   0.294558  -0.060913   1.000000   0.282497   0.400275   0.264249   0.125193  -0.333602        ...   0.174085   0.297977   0.476683   0.293868   0.438992  -0.162818  -0.295966   0.054518   0.464110   0.379856
    
  

5 rows × 25 columns

Checking the correlation coefficients against the test values given in the assignment.



In [7]:

    
# Compare against the assignment's reference values with an absolute tolerance of 1e-3.
print(abs(corr_df['1648'].loc['5136'] - 0.40298) < 1.0e-3)
print(abs(corr_df['918'].loc['2824'] - -0.31706) < 1.0e-3)









    



True
True
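A check like this is only meaningful when the tolerance is small compared with the values being compared; a tolerance that is too loose passes trivially. The sketch below wraps the comparison in a hypothetical helper, `matches_reference`, using `numpy.isclose` with an absolute tolerance; the two correlation values are copied from the table above and the third call uses a deliberately wrong value.

```python
import numpy as np

def matches_reference(value, reference, tol=1.0e-3):
    """True when `value` is within `tol` (absolute) of the reference value."""
    return bool(np.isclose(value, reference, rtol=0.0, atol=tol))

check_a = matches_reference(0.402980, 0.40298)     # corr(1648, 5136) from the table
check_b = matches_reference(-0.317063, -0.31706)   # corr(918, 2824) from the table
check_c = matches_reference(0.5, 0.40298)          # deliberately wrong value
print(check_a, check_b, check_c)
```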

Listing the top 5 neighbors of user 3712 as a consistency check against the values given in the assignment. The check passes.



In [8]:

    
corr_df['3712'].sort_values(ascending=False)[1:6]









    Out[8]:





2824    0.462910
3867    0.400275
5062    0.247693
442     0.227130
3853    0.193660
Name: 3712, dtype: float64
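The slice `[1:6]` relies on the user's own correlation of 1.0 sorting into first place. A slightly more defensive variant drops the user explicitly before sorting. The helper name `top_k_neighbors` and the toy matrix below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def top_k_neighbors(corr, user, k=5):
    """Return the k users most correlated with `user`, excluding `user` itself."""
    return corr[user].drop(user).sort_values(ascending=False).head(k)

# Toy symmetric correlation matrix; values are illustrative only.
users = ['u1', 'u2', 'u3', 'u4']
toy_corr = pd.DataFrame(
    [[1.0, 0.2, 0.9, -0.4],
     [0.2, 1.0, 0.1, 0.3],
     [0.9, 0.1, 1.0, 0.5],
     [-0.4, 0.3, 0.5, 1.0]],
    index=users, columns=users)

neighbors = top_k_neighbors(toy_corr, 'u1', k=2)
print(neighbors)
```

On the notebook's data, `top_k_neighbors(corr_df, '3712')` would be expected to return the same five users as the cell above, provided no other user has a correlation of exactly 1.0 with user 3712.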

Top 5 neighbors for user 3867.



In [9]:

    
corr_df['3867'].sort_values(ascending=False)[1:6]









    Out[9]:





2492    0.476683
3853    0.464110
2486    0.438992
3712    0.400275
2288    0.379856
Name: 3867, dtype: float64

Top 5 neighbors for user 89.



In [10]:

    
corr_df['89'].sort_values(ascending=False)[1:6]









    Out[10]:





4809    0.668516
5136    0.562449
860     0.539066
5062    0.525990
3525    0.475495
Name: 89, dtype: float64

Computing predictions of users' movie ratings from the 5 nearest neighbors, both without and with normalization.
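Written out, the two prediction rules computed in the cells below are as follows. The symbols are notation introduced here for illustration: $N_i(u)$ is the set of the 5 nearest neighbors of user $u$ who rated movie $i$, $w_{uv}$ is the correlation between users $u$ and $v$, $r_{vi}$ is the rating of user $v$ for movie $i$, and $\bar{r}_v$ is the mean of all ratings given by user $v$.

```latex
% Without normalization: correlation-weighted average of the neighbors' ratings.
p_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv} \, r_{vi}}{\sum_{v \in N_i(u)} w_{uv}}

% With normalization: the user's own mean plus the weighted average of the
% neighbors' mean-centered ratings.
p_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{uv} \, (r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} w_{uv}}
```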

Predictions for User 3712

Initializing empty Series to store movie predictions (without and with normalization) for user 3712 from 5 nearest neighbors.



In [11]:

    
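# dtype=float keeps the empty Series numeric (filled with NaN until assigned).
pred_3712_no_norm = pd.Series(index=mov_user_data.index, dtype=float)
pred_3712_wi_norm = pd.Series(index=mov_user_data.index, dtype=float)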

Storing the labels of the 5 nearest neighbor users and the correlation coefficients between each of these 5 nearest neighbors and user 3712.



In [12]:

    
fiv_nn_3712 = list(corr_df['3712'].sort_values(ascending=False)[1:6].index)
fiv_nn_3712_corr = corr_df['3712'].sort_values(ascending=False)[1:6].values

Using the ratings of the 5 nearest neighbors, weighted by their correlations with user 3712, to predict the rating of user 3712 for each movie.



In [13]:

    
# Mean rating of each of the 5 nearest neighbors (NaN ratings are skipped).
nn_means_3712 = np.array([mov_user_data[nn].mean() for nn in fiv_nn_3712])

for movie in pred_3712_no_norm.index:
    # Ratings given to this movie by the 5 nearest neighbors (NaN if unrated).
    ratings = np.array([mov_user_data[nn].loc[movie] for nn in fiv_nn_3712], dtype=float)
    # Positions of the neighbors who actually rated this movie.
    ind_slice = [i for i, rat_val in enumerate(ratings) if not np.isnan(rat_val)]
    # Correlation-weighted average of the available neighbor ratings.
    pred_3712_no_norm.loc[movie] = np.sum(fiv_nn_3712_corr[ind_slice]*ratings[ind_slice])/np.sum(fiv_nn_3712_corr[ind_slice])

    # Neighbor ratings centered on each neighbor's own mean rating.
    rat_norm = ratings - nn_means_3712

    # User's own mean rating plus the correlation-weighted average offset.
    pred_3712_wi_norm.loc[movie] = ( mov_user_data['3712'].mean() +
                                     np.sum(rat_norm[ind_slice]*fiv_nn_3712_corr[ind_slice])/np.sum(fiv_nn_3712_corr[ind_slice]) )
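The same computation is repeated for each user in this notebook, so it can be useful to see it as one reusable function. The sketch below is a vectorized variant under the same assumptions as the loop (top-k neighbors by correlation, weights summed only over neighbors who rated the movie). The function name `predict_from_neighbors` and the toy data are hypothetical, and the toy correlations are set by hand rather than computed.

```python
import numpy as np
import pandas as pd

def predict_from_neighbors(ratings, corr, user, k=5):
    """Predict every movie's rating for `user` from the k most correlated users.

    Returns (no_norm, with_norm), two Series indexed like `ratings`.
    """
    weights = corr[user].drop(user).sort_values(ascending=False).head(k)
    nn_ratings = ratings[weights.index]            # movies x k, NaN where unrated
    rated = nn_ratings.notna().astype(float)       # 1.0 where a rating exists

    # Sum of weights over the neighbors who rated each movie.
    denom = rated.mul(weights, axis=1).sum(axis=1)

    # sum(axis=1) skips NaN, so unrated entries contribute nothing.
    no_norm = nn_ratings.mul(weights, axis=1).sum(axis=1) / denom

    centered = nn_ratings - nn_ratings.mean()      # subtract each neighbor's mean
    with_norm = ratings[user].mean() + centered.mul(weights, axis=1).sum(axis=1) / denom
    return no_norm, with_norm

# Toy data: rows are movies, columns are users; values are illustrative only.
toy_ratings = pd.DataFrame(
    {'u': [4.0, 2.0, np.nan, 5.0],
     'a': [5.0, 3.0, 4.0, np.nan],
     'b': [2.0, np.nan, 5.0, 1.0]},
    index=['m1', 'm2', 'm3', 'm4'])
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.1],
     [0.2, 0.1, 1.0]],
    index=['u', 'a', 'b'], columns=['u', 'a', 'b'])

no_norm, with_norm = predict_from_neighbors(toy_ratings, toy_corr, 'u', k=2)
print(no_norm)
print(with_norm)
```

With the notebook's data, a call such as `predict_from_neighbors(mov_user_data, corr_df, '3712')` would be expected to reproduce the two Series filled in by the loop above.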

Printing the top 5 movies for user 3712 based on the ratings from the 5 nearest neighbors with no normalization.



In [14]:

    
pred_3712_no_norm.sort_values(ascending=False)[0:5]









    Out[14]:





641: Requiem for a Dream (2000)    5.000000
603: The Matrix (1999)             4.855924
105: Back to the Future (1985)     4.739173
107: Snatch (2000)                 4.651432
155: The Dark Knight (2008)        4.622564
dtype: float64

Printing the top 5 movies for user 3712 based on the ratings from the 5 nearest neighbors with normalization.



In [15]:

    
pred_3712_wi_norm.sort_values(ascending=False)[0:5]









    Out[15]:





641: Requiem for a Dream (2000)                      5.900000
603: The Matrix (1999)                               5.545567
105: Back to the Future (1985)                       5.500585
155: The Dark Knight (2008)                          5.312207
121: The Lord of the Rings: The Two Towers (2002)    5.306559
dtype: float64
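Note that the normalized predictions above go past 5.0, because the neighbors' mean-centered offsets are added to the user's own mean without any bound. If the predictions need to stay on the rating scale, one option is to clip them afterwards. The sketch below assumes a 0.5 to 5.0 scale, which the notebook does not state; the values are copied from the output above.

```python
import pandas as pd

# Normalized top-5 predictions for user 3712, copied from the output above.
preds = pd.Series({
    '641: Requiem for a Dream (2000)': 5.900000,
    '603: The Matrix (1999)': 5.545567,
    '105: Back to the Future (1985)': 5.500585,
    '155: The Dark Knight (2008)': 5.312207,
    '121: The Lord of the Rings: The Two Towers (2002)': 5.306559,
})

# Assumed rating scale; adjust to the scale actually used by the data set.
RATING_MIN, RATING_MAX = 0.5, 5.0

clipped = preds.clip(lower=RATING_MIN, upper=RATING_MAX)
print(clipped)
```

Clipping creates ties among the clipped items, so the ranking should be taken from the unclipped predictions and clipping applied only to the values that are reported.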

Predictions for User 3867

Initializing empty Series to store movie predictions (without and with normalization) for user 3867 from 5 nearest neighbors.



In [16]:

    
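# dtype=float keeps the empty Series numeric (filled with NaN until assigned).
pred_3867_no_norm = pd.Series(index=mov_user_data.index, dtype=float)
pred_3867_wi_norm = pd.Series(index=mov_user_data.index, dtype=float)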

Storing the labels of the 5 nearest neighbor users and the correlation coefficients between each of these 5 nearest neighbors and user 3867.



In [17]:

    
fiv_nn_3867 = list(corr_df['3867'].sort_values(ascending=False)[1:6].index)
fiv_nn_3867_corr = corr_df['3867'].sort_values(ascending=False)[1:6].values

Using the ratings of the 5 nearest neighbors, weighted by their correlations with user 3867, to predict the rating of user 3867 for each movie.



In [18]:

    
# Mean rating of each of the 5 nearest neighbors (NaN ratings are skipped).
nn_means_3867 = np.array([mov_user_data[nn].mean() for nn in fiv_nn_3867])

for movie in pred_3867_no_norm.index:
    # Ratings given to this movie by the 5 nearest neighbors (NaN if unrated).
    ratings = np.array([mov_user_data[nn].loc[movie] for nn in fiv_nn_3867], dtype=float)
    # Positions of the neighbors who actually rated this movie.
    ind_slice = [i for i, rat_val in enumerate(ratings) if not np.isnan(rat_val)]
    # Correlation-weighted average of the available neighbor ratings.
    pred_3867_no_norm.loc[movie] = np.sum(fiv_nn_3867_corr[ind_slice]*ratings[ind_slice])/np.sum(fiv_nn_3867_corr[ind_slice])

    # Neighbor ratings centered on each neighbor's own mean rating.
    rat_norm = ratings - nn_means_3867

    # User's own mean rating plus the correlation-weighted average offset.
    pred_3867_wi_norm.loc[movie] = ( mov_user_data['3867'].mean() +
                                     np.sum(rat_norm[ind_slice]*fiv_nn_3867_corr[ind_slice])/np.sum(fiv_nn_3867_corr[ind_slice]) )

Printing the top 5 movies for user 3867 based on the ratings from the 5 nearest neighbors with no normalization.



In [19]:

    
pred_3867_no_norm.sort_values(ascending=False)[0:5]









    Out[19]:





1891: Star Wars: Episode V - The Empire Strikes Back (1980)    4.760291
155: The Dark Knight (2008)                                    4.551454
122: The Lord of the Rings: The Return of the King (2003)      4.507637
77: Memento (2000)                                             4.472487
121: The Lord of the Rings: The Two Towers (2002)              4.400194
dtype: float64

Printing the top 5 movies for user 3867 based on the ratings from the 5 nearest neighbors with normalization.



In [20]:

    
pred_3867_wi_norm.sort_values(ascending=False)[0:5]









    Out[20]:





1891: Star Wars: Episode V - The Empire Strikes Back (1980)    5.245509
155: The Dark Knight (2008)                                    4.856770
77: Memento (2000)                                             4.777803
275: Fargo (1996)                                              4.771538
807: Seven (a.k.a. Se7en) (1995)                               4.655569
dtype: float64

Predictions for User 89

Initializing empty Series to store movie predictions (without and with normalization) for user 89 from 5 nearest neighbors.



In [21]:

    
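# dtype=float keeps the empty Series numeric (filled with NaN until assigned).
pred_89_no_norm = pd.Series(index=mov_user_data.index, dtype=float)
pred_89_wi_norm = pd.Series(index=mov_user_data.index, dtype=float)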

Storing the labels of the 5 nearest neighbor users and the correlation coefficients between each of these 5 nearest neighbors and user 89.



In [22]:

    
fiv_nn_89 = list(corr_df['89'].sort_values(ascending=False)[1:6].index)
fiv_nn_89_corr = corr_df['89'].sort_values(ascending=False)[1:6].values

Using the ratings of the 5 nearest neighbors, weighted by their correlations with user 89, to predict the rating of user 89 for each movie.



In [23]:

    
# Mean rating of each of the 5 nearest neighbors (NaN ratings are skipped).
nn_means_89 = np.array([mov_user_data[nn].mean() for nn in fiv_nn_89])

for movie in pred_89_no_norm.index:
    # Ratings given to this movie by the 5 nearest neighbors (NaN if unrated).
    ratings = np.array([mov_user_data[nn].loc[movie] for nn in fiv_nn_89], dtype=float)
    # Positions of the neighbors who actually rated this movie.
    ind_slice = [i for i, rat_val in enumerate(ratings) if not np.isnan(rat_val)]
    # Correlation-weighted average of the available neighbor ratings.
    pred_89_no_norm.loc[movie] = np.sum(fiv_nn_89_corr[ind_slice]*ratings[ind_slice])/np.sum(fiv_nn_89_corr[ind_slice])

    # Neighbor ratings centered on each neighbor's own mean rating.
    rat_norm = ratings - nn_means_89

    # User's own mean rating plus the correlation-weighted average offset.
    pred_89_wi_norm.loc[movie] = ( mov_user_data['89'].mean() +
                                   np.sum(rat_norm[ind_slice]*fiv_nn_89_corr[ind_slice])/np.sum(fiv_nn_89_corr[ind_slice]) )

Printing the top 5 movies for user 89 based on the ratings from the 5 nearest neighbors with no normalization.



In [24]:

    
pred_89_no_norm.sort_values(ascending=False)[0:5]









    Out[24]:





238: The Godfather (1972)               4.894124
278: The Shawshank Redemption (1994)    4.882194
807: Seven (a.k.a. Se7en) (1995)        4.774093
275: Fargo (1996)                       4.770944
424: Schindler's List (1993)            4.729056
dtype: float64

Printing the top 5 movies for user 89 based on the ratings from the 5 nearest neighbors with normalization.



In [25]:

    
pred_89_wi_norm.sort_values(ascending=False)[0:5]









    Out[25]:





238: The Godfather (1972)               5.322015
278: The Shawshank Redemption (1994)    5.261424
275: Fargo (1996)                       5.241111
807: Seven (a.k.a. Se7en) (1995)        5.201984
424: Schindler's List (1993)            5.199223
dtype: float64


