In this lab, you will build a simple movie recommender using $k$ nearest neighbours regression. At the end of the lab, you should be able to load and prepare a ratings data set, build a $k$ nearest neighbours regression model with scikit-learn, and use it to recommend films.
Let's start by importing the packages we'll need. This week, we're going to use the neighbors subpackage from scikit-learn to build $k$ nearest neighbours models.
In [ ]:
%matplotlib inline
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
Next, let's load the data. Write the path to your ml-100k.csv file in the cell below:
In [ ]:
path = 'data/ml-100k.csv'
Execute the cell below to load the CSV data into a pandas data frame indexed by the user_id field in the CSV file.
In [ ]:
df = pd.read_csv(path, index_col='user_id')
df.head()
We can also generate summary statistics for each film's ratings using the describe method:
In [ ]:
stats = df.describe()
stats
As can be seen, the data consists of film ratings in the range [1, 5] for 1664 films. Some films have been rated by many users, but the vast majority have been rated by only a few (i.e. there are many missing values):
In [ ]:
ax = stats.loc['count'].hist(bins=30)
ax.set(
xlabel='Number of ratings',
ylabel='Frequency'
);
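If you'd like to quantify this sparsity directly, a quick optional check (not part of the original lab) is:
In [ ]:
# Optional: measure how sparse the ratings matrix is
print('Fraction of missing ratings overall: %.2f' % df.isna().mean().mean())
print('Films rated by 5 or fewer users: %d' % (stats.loc['count'] <= 5).sum())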
We'll need to replace the missing values with appropriate substitutions before we can build our model. One way to do this is to replace each instance where a user didn't see a film with the average rating of that film (although there are others, e.g. the median or the mode). We can compute the average rating of each film via the mean method of the data frame:
In [ ]:
average_ratings = df.mean()
average_ratings.head()
Next, let's substitute these values everywhere there is a missing value. With pandas, you can do this with the fillna method, as follows:
In [ ]:
df = df.fillna(value=average_ratings)
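As a quick sanity check (an optional step, not in the original lab), you can confirm that the imputation removed every missing value:
In [ ]:
# After fillna, every cell should hold a rating: check that nothing is still missing
assert not df.isna().any().any()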
Let's build a movie recommender using user-based collaborative filtering. For this, we'll need to build a model that can identify the most similar users to a given user and use that relationship to predict ratings for new movies. We can use $k$ nearest neighbours regression for this.
Before we build the model, let's specify ratings for some of the films in the data set. This gives us a target variable to fit our model to. The values below are just examples - feel free to add your own ratings or change the films.
In [ ]:
y = pd.Series({
'L.A. Confidential (1997)': 3.5,
'Jaws (1975)': 3.5,
'Evil Dead II (1987)': 4.5,
'Fargo (1996)': 5.0,
'Naked Gun 33 1/3: The Final Insult (1994)': 2.5,
'Wings of Desire (1987)': 5.0,
'North by Northwest (1959)': 5.0,
"Monty Python's Life of Brian (1979)": 4.5,
'Raiders of the Lost Ark (1981)': 4.0,
'Annie Hall (1977)': 5.0,
'True Lies (1994)': 3.0,
'GoldenEye (1995)': 2.0,
'Good, The Bad and The Ugly, The (1966)': 4.0,
'Empire Strikes Back, The (1980)': 4.0,
'Godfather, The (1972)': 4.5,
'Waterworld (1995)': 1.0,
'Blade Runner (1982)': 4.0,
'Seven (Se7en) (1995)': 3.5,
'Alien (1979)': 4.0,
'Free Willy (1993)': 1.0
})
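If you change the films above, make sure each title exactly matches a column name in the data frame, or indexing will fail later. A quick hedge against typos (not part of the original lab):
In [ ]:
# Any title listed here does not match a column name exactly
[film for film in y.index if film not in df.columns]  # should be an empty list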
Next, let's select the features to learn from. In user-based collaborative filtering, we need to identify the users that are most similar to us. Consequently, we need to transpose our data matrix (with the T attribute of the data frame) so that its columns (i.e. features) represent users and its rows (i.e. samples) represent films. We'll also need to select just the films that we specified above, as our target variable consists of these only.
In [ ]:
X = df.T.loc[y.index]
X.head()
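The next step compares against a dummy model. If your notebook doesn't already have one from earlier in the lab, a minimal baseline sketch using scikit-learn's DummyRegressor (which here always predicts the mean of the training ratings) would be:
In [ ]:
from sklearn.dummy import DummyRegressor

# Baseline: always predict the mean rating, evaluated with the same style of CV
dummy = DummyRegressor(strategy='mean')
y_dummy = cross_val_predict(dummy, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print('Mean absolute error: %f' % mean_absolute_error(y, y_dummy))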
Let's build a $k$ nearest neighbours regression model to see what improvement can be made over the dummy model:
In [ ]:
algorithm = KNeighborsRegressor()
parameters = {
'n_neighbors': [2, 5, 10, 15],
'weights': ['uniform', 'distance'],
'metric': ['manhattan', 'euclidean']
}
# Use inner CV to select the best model
inner_cv = KFold(n_splits=10, shuffle=True, random_state=0) # K = 10
clf = GridSearchCV(algorithm, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
# Print the results
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
title='Distribution of errors for the nearest neighbours regression model',
xlabel='Error'
);
As can be seen, the $k$ nearest neighbours model is able to predict ratings to within ±0.88 on average, with a standard deviation of 0.97. While this error is not small, it is small enough for the predictions to be useful. Further improvements can be made by filling in the missing values in a different way or by providing more ratings.
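Because clf was fitted via grid search, you can also inspect which hyperparameter combination it selected:
In [ ]:
# GridSearchCV exposes the winning hyperparameter combination after fitting
clf.best_params_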
Now that we have a final model, we can make recommendations about films we haven't rated:
In [ ]:
predictions = pd.Series(dtype=float)  # An explicit dtype avoids pandas' object-dtype default for empty series
for film in df.columns:
    if film in y.index:
        continue  # If we've already rated the film, skip it
    predictions[film] = clf.predict(df.loc[:, [film]].T)[0]
predictions.sort_values(ascending=False).head(10)
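As an optional variant (not part of the original lab), the same predictions can be computed in a single vectorised call, which avoids looping over 1,600+ films one at a time:
In [ ]:
# Optional, vectorised variant: predict all unseen films in one call
# (predictions_fast is a hypothetical name, not from the original lab)
unseen = df.columns.difference(y.index)
predictions_fast = pd.Series(clf.predict(df[unseen].T), index=unseen)
predictions_fast.sort_values(ascending=False).head(10)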