Assignment 3: User-User Collaborative Filtering


Importing Libraries


In [1]:
import numpy as np
import pandas as pd

Loading the Data

Loading the movie data from Excel into a DataFrame.


In [2]:
mov_user_data = pd.read_excel('Assign_3_data.xlsx')

Converting the column labels (user IDs) of the DataFrame from integers to strings.


In [3]:
mov_user_data.columns = [str(val) for val in list(mov_user_data.columns)]
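
The conversion matters because header values read from a spreadsheet may be integers, in which case a string lookup such as `mov_user_data['3712']` would not find the column. A minimal sketch on a made-up two-user frame (the ratings here are illustrative, not taken from the assignment data):

```python
import pandas as pd

# Toy frame with integer column labels, as a header row of user IDs might produce.
toy = pd.DataFrame({1648: [4.0, 3.5], 5136: [5.0, 4.5]})

# A string lookup misses integer labels.
int_lookup_fails = '1648' not in toy.columns

# Equivalent to the list comprehension above: cast every label to str.
toy.columns = toy.columns.astype(str)

str_lookup_works = '1648' in toy.columns
print(int_lookup_fails, str_lookup_works)
```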

Looking at the head of the dataframe.


In [4]:
mov_user_data.head()


Out[4]:
1648 5136 918 2824 3867 860 3712 2968 3525 4323 ... 3556 5261 2492 5062 2486 4942 2267 4809 3853 2288
11: Star Wars: Episode IV - A New Hope (1977) NaN 4.5 5.0 4.5 4.0 4.0 NaN 5.0 4.0 5 ... 4 NaN 4.5 4.0 3.5 NaN NaN NaN NaN NaN
12: Finding Nemo (2003) NaN 5.0 5.0 NaN 4.0 4.0 4.5 4.5 4.0 5 ... 4 NaN 3.5 4.0 2.0 3.5 NaN NaN NaN 3.5
13: Forrest Gump (1994) NaN 5.0 4.5 5.0 4.5 4.5 NaN 5.0 4.5 5 ... 4 5.0 3.5 4.5 4.5 4.0 3.5 4.5 3.5 3.5
14: American Beauty (1999) NaN 4.0 NaN NaN NaN NaN 4.5 2.0 3.5 5 ... 4 NaN 3.5 4.5 3.5 4.0 NaN 3.5 NaN NaN
22: Pirates of the Caribbean: The Curse of the Black Pearl (2003) 4 5.0 3.0 4.5 4.0 2.5 NaN 5.0 3.0 4 ... 3 1.5 4.0 4.0 2.5 3.5 NaN 5.0 NaN 3.5

5 rows × 25 columns

Creating a correlation matrix dataframe

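Computing the pairwise Pearson correlation coefficient between users (the columns). For each pair of users, `corr()` uses only the movies that both users rated.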


In [5]:
corr_df = mov_user_data.corr()
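
To illustrate how missing ratings are handled, the sketch below compares `DataFrame.corr()` with a manual Pearson correlation computed on co-rated movies only. The two users `a` and `b` and their ratings are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy ratings (made up): rows are movies, columns are users, NaN = not rated.
toy = pd.DataFrame({
    'a': [5.0, 4.0, np.nan, 2.0, 1.0],
    'b': [4.0, 5.0, 3.0, np.nan, 2.0],
})

# DataFrame.corr() defaults to Pearson and skips rows where either value is missing.
pandas_corr = toy.corr().loc['a', 'b']

# Manual check: keep only the movies both users rated, then correlate.
both = toy.dropna()
manual_corr = np.corrcoef(both['a'], both['b'])[0, 1]

print(pandas_corr, manual_corr)
```

With only a handful of co-rated movies the coefficient can be unstable, which is one reason to sanity-check a few values against known references.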

Looking at the head of the correlation coefficient dataframe


In [6]:
corr_df.head()


Out[6]:
1648 5136 918 2824 3867 860 3712 2968 3525 4323 ... 3556 5261 2492 5062 2486 4942 2267 4809 3853 2288
1648 1.000000 0.402980 -0.142206 0.517620 0.300200 0.480537 -0.312412 0.383348 0.092775 0.098191 ... -0.191988 0.493008 0.360644 0.551089 0.002544 0.116653 -0.429183 0.394371 -0.304422 0.245048
5136 0.402980 1.000000 0.118979 0.057916 0.341734 0.241377 0.131398 0.206695 0.360056 0.033642 ... 0.488607 0.328120 0.422236 0.226635 0.305803 0.037769 0.240728 0.411676 0.189234 0.390067
918 -0.142206 0.118979 1.000000 -0.317063 0.294558 0.468333 0.092037 -0.045854 0.367568 -0.035394 ... 0.373226 0.470972 0.069956 -0.054762 0.133812 0.015169 -0.273096 0.082528 0.667168 0.119162
2824 0.517620 0.057916 -0.317063 1.000000 -0.060913 -0.008066 0.462910 0.214760 0.169907 0.119350 ... -0.201275 0.228341 0.238700 0.259660 0.247097 0.149247 -0.361466 0.474974 -0.262073 0.166999
3867 0.300200 0.341734 0.294558 -0.060913 1.000000 0.282497 0.400275 0.264249 0.125193 -0.333602 ... 0.174085 0.297977 0.476683 0.293868 0.438992 -0.162818 -0.295966 0.054518 0.464110 0.379856

5 rows × 25 columns

Checking selected correlation coefficients against the test values given in the assignment.


In [7]:
print(abs(corr_df['1648'].loc['5136'] - 0.40298) < 1.0e-3)
print(abs(corr_df['918'].loc['2824'] - -0.31706) < 1.0e-3)


True
True
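
A check like this is only informative when the tolerance is small compared with the values being compared. The sketch below uses `np.isclose` with an explicit absolute tolerance; the "computed" numbers are copied from the printed correlation matrix, and the tolerance of `1e-3` is an assumption about the intended precision.

```python
import numpy as np

# Values copied from the printed matrix and the reference values quoted above.
computed = {('1648', '5136'): 0.402980, ('918', '2824'): -0.317063}
expected = {('1648', '5136'): 0.40298, ('918', '2824'): -0.31706}

results = {}
for pair in computed:
    # atol is an absolute tolerance; rtol=0 disables the relative term.
    results[pair] = bool(np.isclose(computed[pair], expected[pair], rtol=0.0, atol=1e-3))
    print(pair, results[pair])

# A tolerance much larger than the data accepts even a clearly wrong value.
loose = bool(np.isclose(0.9, 0.40298, rtol=0.0, atol=1.0e3))
print(loose)
```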

As a further consistency check against the values given in the assignment, the top 5 neighbors for user 3712 are listed below. The check passes.


In [8]:
corr_df['3712'].sort_values(ascending=False)[1:6]


Out[8]:
2824    0.462910
3867    0.400275
5062    0.247693
442     0.227130
3853    0.193660
Name: 3712, dtype: float64
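
The slice `[1:6]` relies on the user's own entry (correlation 1.0) sorting first. A sketch of an alternative that removes the user by label instead of by position is shown below; `top_neighbors` is a hypothetical helper and the correlation matrix is made up.

```python
import pandas as pd

def top_neighbors(corr, user, k=5):
    """Return the k users most correlated with `user`, excluding `user` itself."""
    return corr[user].drop(user).nlargest(k)

# Made-up symmetric correlation matrix for four users.
users = ['u1', 'u2', 'u3', 'u4']
toy_corr = pd.DataFrame(
    [[1.0, 0.2, 0.7, -0.1],
     [0.2, 1.0, 0.4, 0.5],
     [0.7, 0.4, 1.0, 0.3],
     [-0.1, 0.5, 0.3, 1.0]],
    index=users, columns=users)

neighbors = top_neighbors(toy_corr, 'u1', k=2)
print(neighbors)
```

Dropping by label keeps the result correct even if another user happens to have a correlation of exactly 1.0 with the target user.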

Top 5 neighbors for user 3867.


In [9]:
corr_df['3867'].sort_values(ascending=False)[1:6]


Out[9]:
2492    0.476683
3853    0.464110
2486    0.438992
3712    0.400275
2288    0.379856
Name: 3867, dtype: float64

Top 5 neighbors for user 89.


In [10]:
corr_df['89'].sort_values(ascending=False)[1:6]


Out[10]:
4809    0.668516
5136    0.562449
860     0.539066
5062    0.525990
3525    0.475495
Name: 89, dtype: float64

Computing predictions of each user's movie ratings from that user's 5 nearest neighbors, both without and with normalization (mean-centering of the neighbors' ratings).
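
For reference, the two predictors computed in the cells that follow can be written as below. The notation is an assumption introduced here, not taken from the assignment: $r_{v,i}$ is the rating of user $v$ for movie $i$, $\bar{r}_v$ is the mean rating of user $v$, $w_{uv}$ is the correlation between users $u$ and $v$, and $N_u(i)$ is the subset of the 5 nearest neighbors of $u$ who rated movie $i$.

```latex
% Without normalization: correlation-weighted average of the neighbors' ratings
p_{u,i} = \frac{\sum_{v \in N_u(i)} w_{uv}\, r_{v,i}}{\sum_{v \in N_u(i)} w_{uv}}

% With normalization: the user's own mean plus the weighted average of mean-centered ratings
p_{u,i} = \bar{r}_u + \frac{\sum_{v \in N_u(i)} w_{uv}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v \in N_u(i)} w_{uv}}
```

Because the second form adds deviations to the user's mean, its predictions are not confined to the original rating scale.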

Predictions for User 3712

Initializing empty Series to store movie predictions (without and with normalization) for user 3712 from 5 nearest neighbors.


In [11]:
pred_3712_no_norm = pd.Series(index=mov_user_data.index, dtype=float)
pred_3712_wi_norm = pd.Series(index=mov_user_data.index, dtype=float)

Storing the labels of the 5 nearest neighbor users and the correlation coefficients between each of these 5 nearest neighbors and user 3712.


In [12]:
fiv_nn_3712 = list(corr_df['3712'].sort_values(ascending=False)[1:6].index)
fiv_nn_3712_corr = corr_df['3712'].sort_values(ascending=False)[1:6].values

Using the ratings of the 5 nearest neighbors, weighted by their correlations with user 3712, to predict the rating of user 3712 for each movie.


In [13]:
for movie in pred_3712_no_norm.index:
    # Ratings of the 5 nearest neighbors for this movie (NaN where a neighbor did not rate it).
    ratings = np.array([ mov_user_data[fiv_nn_3712[0]].loc[movie], mov_user_data[fiv_nn_3712[1]].loc[movie],
                         mov_user_data[fiv_nn_3712[2]].loc[movie], mov_user_data[fiv_nn_3712[3]].loc[movie],
                         mov_user_data[fiv_nn_3712[4]].loc[movie] ])
    # Positions of the neighbors who actually rated this movie.
    ind_slice = [i for i, rat_val in enumerate(ratings) if not np.isnan(rat_val)]
    # Correlation-weighted average of the neighbors' raw ratings.
    pred_3712_no_norm.loc[movie] = np.sum(fiv_nn_3712_corr[ind_slice]*ratings[ind_slice])/np.sum(fiv_nn_3712_corr[ind_slice])

    # Neighbors' ratings centered on each neighbor's own mean rating.
    rat_norm = np.array([ mov_user_data[fiv_nn_3712[0]].loc[movie] - mov_user_data[fiv_nn_3712[0]].mean(),
                          mov_user_data[fiv_nn_3712[1]].loc[movie] - mov_user_data[fiv_nn_3712[1]].mean(),
                          mov_user_data[fiv_nn_3712[2]].loc[movie] - mov_user_data[fiv_nn_3712[2]].mean(),
                          mov_user_data[fiv_nn_3712[3]].loc[movie] - mov_user_data[fiv_nn_3712[3]].mean(),
                          mov_user_data[fiv_nn_3712[4]].loc[movie] - mov_user_data[fiv_nn_3712[4]].mean() ])

    # User's own mean plus the correlation-weighted average of the centered ratings.
    pred_3712_wi_norm.loc[movie] = ( mov_user_data['3712'].mean() +
                                    np.sum(rat_norm[ind_slice]*fiv_nn_3712_corr[ind_slice])/np.sum(fiv_nn_3712_corr[ind_slice]) )
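
The same computation is repeated for each user in this notebook. As a sketch of how the loop above could be generalized, the function below computes both predictors for any user with vectorized pandas operations. `predict_from_neighbors` is a hypothetical helper, and the toy ratings and correlation weights are made up for the example.

```python
import numpy as np
import pandas as pd

def predict_from_neighbors(ratings, corr, user, k=5, normalize=False):
    """Predict every movie's rating for `user` from the k most correlated users."""
    weights = corr[user].drop(user).nlargest(k)    # neighbor -> correlation
    nbr = ratings[weights.index]                   # movies x neighbors
    if normalize:
        nbr = nbr - nbr.mean()                     # subtract each neighbor's mean rating
    rated = nbr.notna()                            # True where a neighbor rated the movie
    num = (nbr.fillna(0.0) * weights).sum(axis=1)  # weighted sum over raters only
    den = (rated * weights).sum(axis=1)            # sum of weights over raters only
    pred = num / den                               # NaN when no neighbor rated the movie
    if normalize:
        pred = pred + ratings[user].mean()         # add back the target user's mean
    return pred

# Made-up ratings: rows are movies, columns are users.
toy = pd.DataFrame({
    'u1': [5.0, np.nan, 2.0, 4.0],
    'u2': [4.0, 3.0, np.nan, 5.0],
    'u3': [5.0, 4.0, 1.0, np.nan],
}, index=['m1', 'm2', 'm3', 'm4'])

# Made-up correlation weights (not toy.corr()), chosen to keep the arithmetic simple.
toy_corr = pd.DataFrame(
    [[1.0, 0.5, 0.25], [0.5, 1.0, 0.1], [0.25, 0.1, 1.0]],
    index=toy.columns, columns=toy.columns)

raw = predict_from_neighbors(toy, toy_corr, 'u1', k=2)
centered = predict_from_neighbors(toy, toy_corr, 'u1', k=2, normalize=True)
print(raw)
print(centered)
```

For movie `m1` both neighbors contribute, so the raw prediction is (0.5·4 + 0.25·5) / 0.75; for `m3` and `m4` only one neighbor rated the movie, so the raw prediction equals that neighbor's rating.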

Printing the top 5 movies for user 3712 based on the ratings from the 5 nearest neighbors, without normalization.


In [14]:
pred_3712_no_norm.sort_values(ascending=False)[0:5]


Out[14]:
641: Requiem for a Dream (2000)    5.000000
603: The Matrix (1999)             4.855924
105: Back to the Future (1985)     4.739173
107: Snatch (2000)                 4.651432
155: The Dark Knight (2008)        4.622564
dtype: float64

Printing the top 5 movies for user 3712 based on the ratings from the 5 nearest neighbors, with normalization.


In [15]:
pred_3712_wi_norm.sort_values(ascending=False)[0:5]


Out[15]:
641: Requiem for a Dream (2000)                      5.900000
603: The Matrix (1999)                               5.545567
105: Back to the Future (1985)                       5.500585
155: The Dark Knight (2008)                          5.312207
121: The Lord of the Rings: The Two Towers (2002)    5.306559
dtype: float64

Predictions for User 3867

Initializing empty Series to store movie predictions (without and with normalization) for user 3867 from 5 nearest neighbors.


In [16]:
pred_3867_no_norm = pd.Series(index=mov_user_data.index, dtype=float)
pred_3867_wi_norm = pd.Series(index=mov_user_data.index, dtype=float)

Storing the labels of the 5 nearest neighbor users and the correlation coefficients between each of these 5 nearest neighbors and user 3867.


In [17]:
fiv_nn_3867 = list(corr_df['3867'].sort_values(ascending=False)[1:6].index)
fiv_nn_3867_corr = corr_df['3867'].sort_values(ascending=False)[1:6].values

Using the ratings of the 5 nearest neighbors, weighted by their correlations with user 3867, to predict the rating of user 3867 for each movie.


In [18]:
for movie in pred_3867_no_norm.index:
    # Ratings of the 5 nearest neighbors for this movie (NaN where a neighbor did not rate it).
    ratings = np.array([ mov_user_data[fiv_nn_3867[0]].loc[movie], mov_user_data[fiv_nn_3867[1]].loc[movie],
                         mov_user_data[fiv_nn_3867[2]].loc[movie], mov_user_data[fiv_nn_3867[3]].loc[movie],
                         mov_user_data[fiv_nn_3867[4]].loc[movie] ])
    # Positions of the neighbors who actually rated this movie.
    ind_slice = [i for i, rat_val in enumerate(ratings) if not np.isnan(rat_val)]
    # Correlation-weighted average of the neighbors' raw ratings.
    pred_3867_no_norm.loc[movie] = np.sum(fiv_nn_3867_corr[ind_slice]*ratings[ind_slice])/np.sum(fiv_nn_3867_corr[ind_slice])

    # Neighbors' ratings centered on each neighbor's own mean rating.
    rat_norm = np.array([ mov_user_data[fiv_nn_3867[0]].loc[movie] - mov_user_data[fiv_nn_3867[0]].mean(),
                          mov_user_data[fiv_nn_3867[1]].loc[movie] - mov_user_data[fiv_nn_3867[1]].mean(),
                          mov_user_data[fiv_nn_3867[2]].loc[movie] - mov_user_data[fiv_nn_3867[2]].mean(),
                          mov_user_data[fiv_nn_3867[3]].loc[movie] - mov_user_data[fiv_nn_3867[3]].mean(),
                          mov_user_data[fiv_nn_3867[4]].loc[movie] - mov_user_data[fiv_nn_3867[4]].mean() ])

    # User's own mean plus the correlation-weighted average of the centered ratings.
    pred_3867_wi_norm.loc[movie] = ( mov_user_data['3867'].mean() +
                                    np.sum(rat_norm[ind_slice]*fiv_nn_3867_corr[ind_slice])/np.sum(fiv_nn_3867_corr[ind_slice]) )

Printing the top 5 movies for user 3867 based on the ratings from the 5 nearest neighbors, without normalization.


In [19]:
pred_3867_no_norm.sort_values(ascending=False)[0:5]


Out[19]:
1891: Star Wars: Episode V - The Empire Strikes Back (1980)    4.760291
155: The Dark Knight (2008)                                    4.551454
122: The Lord of the Rings: The Return of the King (2003)      4.507637
77: Memento (2000)                                             4.472487
121: The Lord of the Rings: The Two Towers (2002)              4.400194
dtype: float64

Printing the top 5 movies for user 3867 based on the ratings from the 5 nearest neighbors, with normalization.


In [20]:
pred_3867_wi_norm.sort_values(ascending=False)[0:5]


Out[20]:
1891: Star Wars: Episode V - The Empire Strikes Back (1980)    5.245509
155: The Dark Knight (2008)                                    4.856770
77: Memento (2000)                                             4.777803
275: Fargo (1996)                                              4.771538
807: Seven (a.k.a. Se7en) (1995)                               4.655569
dtype: float64

Predictions for User 89

Initializing empty Series to store movie predictions (without and with normalization) for user 89 from 5 nearest neighbors.


In [21]:
pred_89_no_norm = pd.Series(index=mov_user_data.index, dtype=float)
pred_89_wi_norm = pd.Series(index=mov_user_data.index, dtype=float)

Storing the labels of the 5 nearest neighbor users and the correlation coefficients between each of these 5 nearest neighbors and user 89.


In [22]:
fiv_nn_89 = list(corr_df['89'].sort_values(ascending=False)[1:6].index)
fiv_nn_89_corr = corr_df['89'].sort_values(ascending=False)[1:6].values

Using the ratings of the 5 nearest neighbors, weighted by their correlations with user 89, to predict the rating of user 89 for each movie.


In [23]:
for movie in pred_89_no_norm.index:
    # Ratings of the 5 nearest neighbors for this movie (NaN where a neighbor did not rate it).
    ratings = np.array([ mov_user_data[fiv_nn_89[0]].loc[movie], mov_user_data[fiv_nn_89[1]].loc[movie],
                         mov_user_data[fiv_nn_89[2]].loc[movie], mov_user_data[fiv_nn_89[3]].loc[movie],
                         mov_user_data[fiv_nn_89[4]].loc[movie] ])
    # Positions of the neighbors who actually rated this movie.
    ind_slice = [i for i, rat_val in enumerate(ratings) if not np.isnan(rat_val)]
    # Correlation-weighted average of the neighbors' raw ratings.
    pred_89_no_norm.loc[movie] = np.sum(fiv_nn_89_corr[ind_slice]*ratings[ind_slice])/np.sum(fiv_nn_89_corr[ind_slice])

    # Neighbors' ratings centered on each neighbor's own mean rating.
    rat_norm = np.array([ mov_user_data[fiv_nn_89[0]].loc[movie] - mov_user_data[fiv_nn_89[0]].mean(),
                          mov_user_data[fiv_nn_89[1]].loc[movie] - mov_user_data[fiv_nn_89[1]].mean(),
                          mov_user_data[fiv_nn_89[2]].loc[movie] - mov_user_data[fiv_nn_89[2]].mean(),
                          mov_user_data[fiv_nn_89[3]].loc[movie] - mov_user_data[fiv_nn_89[3]].mean(),
                          mov_user_data[fiv_nn_89[4]].loc[movie] - mov_user_data[fiv_nn_89[4]].mean() ])

    # User's own mean plus the correlation-weighted average of the centered ratings.
    pred_89_wi_norm.loc[movie] = ( mov_user_data['89'].mean() +
                                    np.sum(rat_norm[ind_slice]*fiv_nn_89_corr[ind_slice])/np.sum(fiv_nn_89_corr[ind_slice]) )

Printing the top 5 movies for user 89 based on the ratings from the 5 nearest neighbors, without normalization.


In [24]:
pred_89_no_norm.sort_values(ascending=False)[0:5]


Out[24]:
238: The Godfather (1972)               4.894124
278: The Shawshank Redemption (1994)    4.882194
807: Seven (a.k.a. Se7en) (1995)        4.774093
275: Fargo (1996)                       4.770944
424: Schindler's List (1993)            4.729056
dtype: float64

Printing the top 5 movies for user 89 based on the ratings from the 5 nearest neighbors, with normalization.


In [25]:
pred_89_wi_norm.sort_values(ascending=False)[0:5]


Out[25]:
238: The Godfather (1972)               5.322015
278: The Shawshank Redemption (1994)    5.261424
275: Fargo (1996)                       5.241111
807: Seven (a.k.a. Se7en) (1995)        5.201984
424: Schindler's List (1993)            5.199223
dtype: float64
