BUILDING A RECOMMENDER SYSTEM ON USER-USER COLLABORATIVE FILTERING (MOVIELENS DATASET)

We will load the data sets first.


In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

#column headers for the dataset
data_cols = ['user id','movie id','rating','timestamp']
item_cols = ['movie id','movie title','release date','video release date','IMDb URL','unknown','Action',
'Adventure','Animation','Childrens','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror',
'Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']
user_cols = ['user id','age','gender','occupation','zip code']

#importing the data files onto dataframes
users = pd.read_csv('ml-100k/u.user', sep='|', names=user_cols, encoding='latin-1')
item = pd.read_csv('ml-100k/u.item', sep='|', names=item_cols, encoding='latin-1')
data = pd.read_csv('ml-100k/u.data', sep='\t', names=data_cols, encoding='latin-1')

We will use the file u.data first, as it contains user IDs, movie IDs and ratings. These three elements are all we need to determine how similar users are, based on their ratings of the same movies. I will first sort the DataFrame by user ID and then split the data set into a training set and a test set (I just need one user, the last one, for the test set).


In [13]:
utrain = (data.sort_values('user id'))[:99832]
print(utrain.tail())


       user id  movie id  rating  timestamp
73676      942       479       4  891283118
67222      942       604       4  891283139
95675      942       478       5  891283017
85822      942       659       5  891283161
68192      942       487       4  891282985

In [14]:
# Note: the row at position 99832 falls in neither utrain nor utest
utest = (data.sort_values('user id'))[99833:]
print(utest.head())


       user id  movie id  rating  timestamp
91841      943       132       3  888639093
91810      943       204       3  888639117
77956      943        94       4  888639929
87415      943        53       3  888640067
77609      943       124       3  875501995
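
The positional slices above depend on knowing exactly how many rows belong to the last user. As a minimal sketch of an alternative, assuming the held-out user is identified by ID, the rows can be selected with a boolean mask instead. The helper name `split_by_user` and the toy `ratings` frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def split_by_user(ratings, test_user_id):
    """Return (train, test) where test holds every row of test_user_id."""
    is_test = ratings['user id'] == test_user_id
    return ratings[~is_test], ratings[is_test]

# Toy data standing in for u.data (hypothetical values).
ratings = pd.DataFrame({
    'user id':  [1, 1, 2, 3, 3, 3],
    'movie id': [10, 11, 10, 10, 12, 13],
    'rating':   [4, 3, 5, 2, 4, 1],
})

train, test = split_by_user(ratings, test_user_id=3)
print(len(train), len(test))
```

With this approach no row can fall between the two sets, and the split does not depend on the sort order.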

We convert them to NumPy arrays for ease of iteration.


In [15]:
# DataFrame.as_matrix() was removed from pandas; select the columns and call to_numpy() instead
utrain = utrain[['user id', 'movie id', 'rating']].to_numpy()
utest = utest[['user id', 'movie id', 'rating']].to_numpy()

Create users_list, a list with one entry per training user, where each entry is the list of that user's rating rows. Unfortunately, this part adds considerably to the running time of the program.


In [16]:
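users_list = []
for i in range(1,943):
    user_ratings = []    # rating rows of user i (avoids shadowing the built-in list)
    for j in range(0,len(utrain)):
        if utrain[j][0] == i:
            user_ratings.append(utrain[j])
        else:
            break
    # Drop the rows already consumed; this relies on utrain being sorted by user id
    utrain = utrain[j:]
    users_list.append(user_ratings)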
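
Since the nested loop above is the slow part, here is a hedged sketch of how the same per-user structure could be built with a pandas `groupby`, assuming the training set is still available as a DataFrame. The function name `build_users_list` and the toy `train_df` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def build_users_list(train_df, n_users):
    """Entry k holds the [user id, movie id, rating] rows of user k+1."""
    cols = ['user id', 'movie id', 'rating']
    by_user = {uid: grp[cols].to_numpy()
               for uid, grp in train_df.groupby('user id')}
    empty = np.empty((0, 3), dtype=int)   # for users without any rating
    return [by_user.get(uid, empty) for uid in range(1, n_users + 1)]

# Toy training data (hypothetical values); user 2 has no ratings.
train_df = pd.DataFrame({
    'user id':  [1, 1, 3],
    'movie id': [10, 11, 10],
    'rating':   [4, 3, 5],
})

users_list_demo = build_users_list(train_df, n_users=3)
print([len(rows) for rows in users_list_demo])
```

Each entry is a 2-D array rather than a list of rows, but iterating over it still yields one `[user id, movie id, rating]` row at a time.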

Define a function named EucledianScore. It measures the similarity between two users as the Euclidean distance between the ratings they gave to the movies they have both rated, so a lower score means more similar users. But what if the users have just one movie in common? In my opinion, having more movies in common is a strong sign of similarity. So if two users have fewer than 4 movies in common, we assign them a high EucledianScore (1000), which pushes them to the bottom of the ranking.


In [17]:
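def EucledianScore(train_user, test_user):
    # Rows are [user id, movie id, rating]; a lower score means more similar users.
    total = 0      # running sum of squared rating differences
    count = 0      # number of movies both users have rated
    for i in test_user:
        score = 0
        for j in train_user:
            if int(i[1]) == int(j[1]):
                score = (float(i[2]) - float(j[2])) ** 2
                count = count + 1
            # Runs on every pass of the inner loop, so a matched score is
            # added again for each remaining rating of train_user.
            total = total + score
    if count < 4:
        total = 1000000    # too few common movies: the square root gives 1000
    return math.sqrt(total)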
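
In EucledianScore the accumulation step sits inside the inner loop, so once a common movie is found its squared difference is added again for every remaining rating of the training user. The sketch below is a variant in which each common movie contributes exactly once. It assumes each user rated a given movie at most once; the names `euclidean_distance`, `min_common` and `penalty` are hypothetical. The scores printed in this notebook come from EucledianScore, so this variant would produce different numbers.

```python
import math

def euclidean_distance(train_user, test_user, min_common=4, penalty=1000.0):
    """Distance over co-rated movies; rows are (user id, movie id, rating)."""
    # Index the training user's ratings by movie id for constant-time lookup.
    train_ratings = {int(row[1]): float(row[2]) for row in train_user}
    total = 0.0
    count = 0
    for row in test_user:
        movie = int(row[1])
        if movie in train_ratings:
            diff = float(row[2]) - train_ratings[movie]
            total += diff * diff
            count += 1
    if count < min_common:
        return penalty      # too few common movies to trust the distance
    return math.sqrt(total)

# Toy users (hypothetical ratings) sharing movies 10, 11, 12 and 13.
a = [(1, 10, 5), (1, 11, 3), (1, 12, 4), (1, 13, 2)]
b = [(2, 10, 4), (2, 11, 3), (2, 12, 2), (2, 13, 2), (2, 14, 5)]
print(euclidean_distance(a, b))   # squared differences 1, 0, 4, 0
```

The dictionary lookup also avoids the nested loop, so the cost per pair of users grows with the sum of their rating counts rather than the product.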

Now we iterate over users_list, compute each user's similarity to the test user with this function, and append the user ID together with the EucledianScore to a separate list, score_list. We then convert it to a DataFrame, sort it by score in ascending order so that the most similar users come first, and finally convert it to a NumPy array, score_matrix, for ease of iteration.


In [18]:
score_list = []               
for i in range(0,942):
    score_list.append([i+1,EucledianScore(users_list[i], utest)])

score = pd.DataFrame(score_list, columns = ['user id','Eucledian Score'])
score = score.sort_values(by = 'Eucledian Score')
print(score)
score_matrix = score.to_numpy()


     user id  Eucledian Score
309      310         1.732051
138      139         3.872983
45        46         4.000000
208      209         4.242641
557      558         4.582576
724      725         4.690416
305      306         5.000000
241      242         5.000000
676      677         5.099020
265      266         5.196152
303      304         5.656854
753      754         5.744563
3          4         5.830952
798      799         6.000000
375      376         6.164414
796      797         6.244998
28        29         6.403124
799      800         6.557439
463      464         6.633250
515      516         6.708204
227      228         6.928203
438      439         7.000000
743      744         7.348469
580      581         7.416198
648      649         7.483315
203      204         7.483315
894      895         7.745967
875      876         7.810250
364      365         8.000000
52        53         8.000000
..       ...              ...
650      651      1000.000000
651      652      1000.000000
655      656      1000.000000
148      149      1000.000000
661      662      1000.000000
146      147      1000.000000
145      146      1000.000000
672      673      1000.000000
142      143      1000.000000
674      675      1000.000000
280      281      1000.000000
281      282      1000.000000
139      140      1000.000000
680      681      1000.000000
133      134      1000.000000
283      284      1000.000000
684      685      1000.000000
686      687      1000.000000
687      688      1000.000000
132      133      1000.000000
128      129      1000.000000
125      126      1000.000000
383      384      1000.000000
712      713      1000.000000
111      112      1000.000000
110      111      1000.000000
719      720      1000.000000
106      107      1000.000000
104      105      1000.000000
694      695      1000.000000

[942 rows x 2 columns]

We see that the user with ID 310 has the lowest Euclidean score and hence the highest similarity to the test user. Next we need the movies that user 310 has rated but the test user has not. Collect two sets of movie IDs: full_list, with every movie rated by user 310, and common_list, with the movies both users have rated. Their set difference is the list of movies to recommend.


In [19]:
user = int(score_matrix[0][0])    # ID of the most similar training user

# Movie IDs rated by the test user and by the most similar user
test_movies = {int(i[1]) for i in utest}
full_list = {int(j[1]) for j in users_list[user-1]}
common_list = full_list & test_movies

# Movies the similar user has rated that the test user has not
recommendation = full_list.difference(common_list)

Now we need a compiled list of the movies along with their mean ratings. Merge the item and data DataFrames, group by movie title, select the columns you need and compute the mean rating of each movie. Then express the DataFrame as a NumPy array.


In [20]:
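# Merge on the shared 'movie id' column, then average the numeric columns per title
merged = pd.merge(item, data).sort_values(by='movie id')
item_list = merged.groupby('movie title')[['movie id', 'rating']].mean()
item_list['movie title'] = item_list.index
# Columns: movie id, mean rating, movie title; rows follow the groupby order (sorted by title)
item_list = item_list.to_numpy()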

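Now we look up each movie in recommendation, using its ID minus one as a row position in item_list, and append the rows to a separate list. This lookup returns the intended movie only when the rows of item_list are ordered by movie ID.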


In [21]:
recommendation_list = []
for i in recommendation:
    recommendation_list.append(item_list[i-1])
    
recommendation = (pd.DataFrame(recommendation_list,columns = ['movie id','mean rating' ,'movie title'])).sort_values(by = 'mean rating', ascending = False)
print(recommendation[['mean rating','movie title']])


    mean rating                                        movie title
9      4.292929                                Citizen Kane (1941)
8      4.125000                              A Chef in Love (1996)
15     4.000000                            Butcher Boy, The (1998)
6      3.930514          Indiana Jones and the Last Crusade (1989)
4      3.839050                                 Chasing Amy (1997)
3      3.792899                         In the Line of Fire (1993)
10     3.648352                                      Casino (1995)
12     3.600000                         Murder in the First (1995)
5      3.545455                                     Stalker (1979)
14     3.166667  Flower of My Secret, The (Flor de mi secreto, ...
11     3.105263                                    Bad Boys (1995)
16     2.802632                      Brady Bunch Movie, The (1995)
0      2.750000                           Ladybird Ladybird (1994)
13     2.720930                               Pete's Dragon (1977)
2      2.413793                              Canadian Bacon (1994)
7      2.285714          Last Time I Committed Suicide, The (1997)
1      2.000000                               Calendar Girl (1993)
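
The lookup `item_list[i-1]` is positional, while grouping by movie title orders the rows alphabetically by title. As a hedged sketch of an ID-keyed alternative, the mean ratings can be indexed by movie ID and filtered with `Index.isin`. The function name `mean_ratings_by_id` and the toy frames are hypothetical, and the output printed above was produced by the positional lookup, not by this variant.

```python
import pandas as pd

def mean_ratings_by_id(item_df, data_df):
    """Frame indexed by movie id with 'movie title' and 'mean rating'."""
    mean_rating = data_df.groupby('movie id')['rating'].mean().rename('mean rating')
    titles = item_df.set_index('movie id')['movie title']
    # Inner join keeps only movies that have at least one rating.
    return titles.to_frame().join(mean_rating, how='inner')

# Toy stand-ins for u.item and u.data (hypothetical values).
item_df = pd.DataFrame({'movie id': [1, 2, 3],
                        'movie title': ['Zulu', 'Alpha', 'Mid']})
data_df = pd.DataFrame({'movie id': [1, 1, 2, 3, 3, 3],
                        'rating':   [4, 5, 2, 3, 3, 4]})

table = mean_ratings_by_id(item_df, data_df)
wanted = {1, 3}                         # movie IDs to recommend
picks = table[table.index.isin(wanted)].sort_values('mean rating', ascending=False)
print(picks)
```

Because the frame is keyed by movie ID, the result does not depend on how the rows happen to be ordered.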
