In [3]:
import numpy as np
from polara.recommender.data import RecommenderData
from polara.recommender.models import RecommenderModel
from polara.tools.movielens import get_movielens_data
This will download the Movielens-1M dataset from http://grouplens.org/datasets/movielens/:
In [4]:
data, genres = get_movielens_data(get_genres=True)
In [5]:
data.head()
Out[5]:
In [6]:
data.info()
In [7]:
genres.head()
Out[7]:
In [8]:
%matplotlib inline
Rating distribution in the dataset:
In [9]:
data.rating.value_counts().sort_index().plot.bar()
Out[9]:
The RecommenderData class provides a set of tools for manipulating the data and preparing it for experimentation. Its input parameters are the data itself (a pandas dataframe) and a mapping of the data fields (column names) to the internal representation: userid, itemid and feedback:
In [30]:
data_model = RecommenderData(data, userid='userid', itemid='movieid', feedback='rating')
Verify correct mapping:
In [33]:
data.columns
Out[33]:
In [34]:
data_model.fields
Out[34]:
The RecommenderData class has a number of parameters to control how the data is processed. The defaults are fine to start with:
In [11]:
data_model.get_configuration()
Out[11]:
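These parameters can be changed before the data is prepared. A minimal sketch, assuming the configuration values are plain writable attributes (as the holdout_size example later in this tutorial suggests):
# sketch: both attributes are discussed later in this tutorial;
# the values below are just the defaults, shown for illustration
data_model.test_ratio = 0.2    # fraction of users held out as test data
data_model.holdout_size = 3    # number of held-out items per test user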
Use the prepare method to split the dataset into two parts: training data and test data.
In [12]:
data_model.prepare()
As the original data may contain gaps in user and item indices, the data preparation process cleans this up: items from the training data are reindexed starting from zero with no gaps, and the result is stored in:
In [26]:
data_model.index.itemid.head()
Out[26]:
Similarly, all userids from both the training and the test set are reindexed and stored in:
In [25]:
data_model.index.userid.training.head()
Out[25]:
In [27]:
data_model.index.userid.test.head()
Out[27]:
Internally, only the new indices are used. This ensures consistency across the various methods used by the model.
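If you ever need to translate an internal id back to the original one, the stored index supports a reverse lookup. A hypothetical sketch; the 'old' and 'new' column names are an assumption, check the head() outputs above for the actual names:
# sketch: map internal movie id 0 back to its original id
# (the 'old'/'new' column names are assumed, not confirmed by this tutorial)
item_index = data_model.index.itemid
original_id = item_index.set_index('new').loc[0, 'old']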
The dataset is split according to the test_fold and test_ratio attributes. By default, the first 80% of users are used for training and the last 20% of users as test data.
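Presumably, with the default test_ratio of 0.2 there are five possible non-overlapping folds of users, and test_fold selects which one serves as the test data. A sketch (whether the split must be regenerated by calling prepare() again is an assumption about the API):
# sketch: pick another 20% of users as the test fold and rebuild the split
data_model.test_fold = 2
data_model.prepare()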
In [13]:
data_model.training.head()
Out[13]:
In [14]:
data_model.training.shape
Out[14]:
The test data is further split into a testset and an evaluation set (evalset). The testset is used to generate recommendations, which are then evaluated against the evaluation set.
In [15]:
data_model.test.testset.head()
Out[15]:
In [16]:
data_model.test.testset.shape
Out[16]:
In [17]:
data_model.test.evalset.head()
Out[17]:
In [18]:
data_model.test.evalset.shape
Out[18]:
The users in the test and evaluation sets are the same (but these users are not in the training set!).
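A quick sanity check of this claim, as a sketch using plain pandas:
# sketch: the two sets should contain exactly the same users
testset_users = set(data_model.test.testset.userid.unique())
evalset_users = set(data_model.test.evalset.userid.unique())
assert testset_users == evalset_users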
For every test user, the evaluation set contains a fixed number of items which are held out from the original test data. The number of holdout items is controlled by the holdout_size parameter. By default it is set to 3:
In [19]:
data_model.holdout_size
Out[19]:
In [20]:
data_model.test.evalset.groupby('userid').movieid.count().head()
Out[20]:
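The same check can be made exhaustive. A small sketch verifying that every test user has exactly holdout_size held-out items:
# sketch: each test user should contribute exactly holdout_size items to evalset
holdout_counts = data_model.test.evalset.groupby('userid').movieid.count()
assert (holdout_counts == data_model.holdout_size).all()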
You can create your own model by subclassing the RecommenderModel class and defining two required methods: self.build() and self.get_recommendations():
In [21]:
class TopMovies(RecommenderModel):
    def build(self):
        self._recommendations = None  # this line is required to ensure consistency in experiments
        itemid = self.data.fields.itemid  # get the name of the column that corresponds to movieid
        # calculate popularity of the movies based on the number of ratings
        item_scores = self.data.training[itemid].value_counts().sort_index().values
        # store it in an attribute for later use
        self.item_scores = item_scores

    def get_recommendations(self):
        userid = self.data.fields.userid  # get the name of the column that corresponds to userid
        # get the number of test users;
        # we expect that userid has no gaps in numbering (there might be gaps in the
        # original dataset, but the RecommenderData class takes care of that)
        num_users = self.data.test.testset[userid].max() + 1
        # repeat the computed popularity scores for every test user
        scores = np.repeat(self.item_scores[None, :], num_users, axis=0)
        # we have the scores, but what we actually need are item ids;
        # we also need only the top-k items, not all of them (for the top-k recommendation task)
        # here's how to get them:
        top_recs = self.get_topk_items(scores)
        # the leftmost items are those with the highest scores
        return top_recs
Note that the recommendations generated by this model do not take into account the fact that some of the recommended items may already be present in the test set and thus should not be recommended (they are considered seen by a test user). To fix that, you can use the filter_seen parameter along with the downvote_seen_items method as follows:
if self.filter_seen:
    # prevent seen items from appearing in recommendations;
    # here test_data refers to self.data.test.testset
    itemid = self.data.fields.itemid
    test_idx = (test_data[userid].values.astype(np.int64),
                test_data[itemid].values.astype(np.int64))
    self.downvote_seen_items(scores, test_idx)
With this procedure, "seen" items get the lowest scores and are sorted out. Place this code snippet inside the get_recommendations routine before handing the scores to get_topk_items. This will improve the baseline.
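For reference, here is how the full get_recommendations routine might look with the filtering in place. This is a sketch: it assumes test_data refers to self.data.test.testset, as in the rest of this tutorial:
def get_recommendations(self):
    userid = self.data.fields.userid
    itemid = self.data.fields.itemid
    test_data = self.data.test.testset  # assumption: the testset holds the "seen" interactions
    num_users = test_data[userid].max() + 1
    scores = np.repeat(self.item_scores[None, :], num_users, axis=0)
    if self.filter_seen:
        # prevent seen items from appearing in recommendations
        test_idx = (test_data[userid].values.astype(np.int64),
                    test_data[itemid].values.astype(np.int64))
        self.downvote_seen_items(scores, test_idx)
    return self.get_topk_items(scores)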
Another way is to define a slice_recommendations method instead of get_recommendations. With slice_recommendations defined, the model will scale better on huge datasets. The method takes a piece of the test data slice by slice instead of processing it as a whole. A slice is defined by the start and stop parameters (which are simply the userid to start with and the userid to stop at). Slicing the data avoids memory overhead and leads to a faster evaluation of models. Slicing is done automatically behind the scenes and you don't have to specify anything else. Another advantage: seen items will be automatically sorted out of the recommendations as long as the filter_seen attribute is set to True (it is by default), so it requires fewer lines of code.
In [ ]:
class TopMoviesALT(RecommenderModel):
    def build(self):
        # should be the same as in TopMovies
        self._recommendations = None
        itemid = self.data.fields.itemid
        self.item_scores = self.data.training[itemid].value_counts().sort_index().values

    def slice_recommendations(self, test_data, shape, start, stop):
        # the current implementation requires handing the slice data further in a specific
        # format, and the easiest way to get it is via the get_test_matrix method; it also
        # returns the test data in sparse matrix format, but as our recommender model is
        # non-personalized we don't actually need it (see the SVDModel implementation
        # for a case where it is useful)
        test_matrix, slice_data = self.get_test_matrix(test_data, shape, (start, stop))
        nusers = stop - start
        scores = np.repeat(self.item_scores[None, :], nusers, axis=0)
        return scores, slice_data
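The alternative model is used exactly like TopMovies; a short usage sketch:
# sketch: slicing happens behind the scenes, the public interface is unchanged
top_alt = TopMoviesALT(data_model)
top_alt.build()
recs_alt = top_alt.get_recommendations()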
Now everything is set to create an instance of the recommender model and produce recommendations.
In [20]:
top = TopMovies(data_model) # the model takes as input parameter the recommender data model
In [21]:
top.build()
In [22]:
recs = top.get_recommendations()
In [23]:
recs
Out[23]:
In [24]:
recs.shape
Out[24]:
In [25]:
top.topk
Out[25]:
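The number of recommended items per user is controlled by topk. A sketch, assuming it is a plain writable attribute:
# sketch: ask for more items per user and regenerate the recommendations
top.topk = 20
recs20 = top.get_recommendations()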
You can evaluate your model before submitting the results (to ensure that you have improved over the baseline):
In [26]:
top.evaluate()
Out[26]:
Try to change your model to maximize the true_positive score.
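One possible direction, as a sketch only: rank items by the sum of their ratings instead of the raw rating count, so highly rated movies rise above merely popular ones (the class name TopRatedMovies is made up for this example):
class TopRatedMovies(TopMovies):
    def build(self):
        self._recommendations = None
        itemid = self.data.fields.itemid
        feedback = self.data.fields.feedback
        # score items by their total rating instead of the number of ratings
        self.item_scores = (self.data.training
                                .groupby(itemid)[feedback]
                                .sum()
                                .sort_index()
                                .values)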
After you have created your perfect recsys model, first save your recommendations to a file. Please use your full name as the filename (it will be displayed on the leaderboard):
In [27]:
np.savez('your_full_name', recs=recs)
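Note that np.savez automatically appends the .npz extension. A quick round-trip check before uploading, as a sketch:
# sketch: make sure the saved recommendations load back intact
loaded = np.load('your_full_name.npz')
assert (loaded['recs'] == recs).all()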
Now you can upload your results:
In [28]:
import requests
files = {'upload': open('your_full_name.npz','rb')}
url = "http://isp2017.azurewebsites.net/upload"
r = requests.post(url, files=files)
Verify that the upload was successful:
In [29]:
print(r.status_code, r.reason)
You can also do it manually at http://isp2017.azurewebsites.net/upload
Check out how your results compare to others at http://isp2017.azurewebsites.net