Recommenders have been around since at least 1992. Today we see many different flavours of recommenders deployed across different verticals.
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al., 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
The recommendation problem in its most basic form is quite simple to define:
| user \ movie | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------|-----|-----|-----|-----|-----|
| u_1          |  ?  |  ?  |  4  |  ?  |  1  |
| u_2          |  3  |  ?  |  ?  |  2  |  2  |
| u_3          |  3  |  ?  |  ?  |  ?  |  ?  |
| u_4          |  ?  |  1  |  2  |  1  |  1  |
| u_5          |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_6          |  2  |  ?  |  2  |  ?  |  ?  |
| u_7          |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_8          |  3  |  1  |  5  |  ?  |  ?  |
| u_9          |  ?  |  ?  |  ?  |  ?  |  2  |
Given a partially filled ratings matrix of size $|U| \times |I|$, estimate the missing values.
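To make this concrete, here is a minimal sketch (not part of the original notebook; it uses a made-up toy matrix mirroring the table above) of the matrix-completion view: missing entries are NaN, a naive baseline fills each one with that user's mean rating, and we recommend the unseen item with the highest estimate.
In [ ]:
import numpy as np
import pandas as pd

# Toy user x movie ratings; NaN marks an unobserved rating (hypothetical data)
toy = pd.DataFrame(
    [[np.nan, np.nan, 4, np.nan, 1],
     [3, np.nan, np.nan, 2, 2],
     [3, np.nan, np.nan, np.nan, np.nan]],
    index=['u_1', 'u_2', 'u_3'],
    columns=['m_1', 'm_2', 'm_3', 'm_4', 'm_5'])

# Naive baseline: estimate every missing rating with that user's mean rating
estimates = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

# Recommend, per user, the unseen item with the highest estimated rating
best_unseen = estimates.where(toy.isnull()).idxmax(axis=1)
best_unseen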
Content-based techniques are limited by the amount of metadata available to describe an item. There are domains in which feature extraction is expensive or time-consuming, e.g., processing multimedia data such as graphics or audio/video streams. In the context of grocery items, for example, item information is often partial or missing entirely.
This is the new user problem: a user has to have rated a sufficient number of items before a recommender system can form a good idea of their preferences. In a content-based system, the aggregation function needs ratings to aggregate.
Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings, so a new item cannot be recommended until enough users have rated it. Think of this new item problem as the exact counterpart of the new user problem for content-based systems.
When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of observed ratings to $|U| \times |I|$. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
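As a quick sketch (not from the original notebook; it assumes a long-format frame of observed ratings with user and item id columns, like the ratings frame loaded below), sparsity can be quantified like this:
In [ ]:
def sparsity(ratings, user_col='user_id', item_col='course_id'):
    """Fraction of the |U| x |I| matrix with no observed rating."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    return 1.0 - len(ratings) / float(n_users * n_items)

# e.g. sparsity(ratings), once the ratings frame is loaded below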
In [4]:
from IPython.display import Image
# display a diagram of a typical recommender system architecture
Image(filename='/Users/chengjun/GitHub/cjc2016/figure/recsys_arch.png')
Out[4]:
In [5]:
import pandas as pd
unames = ['user_id', 'username']
users = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/users_set.dat',
                      sep='|', header=None, names=unames)

rnames = ['user_id', 'course_id', 'rating']
ratings = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/ratings.dat',
                        sep='|', header=None, names=rnames)

mnames = ['course_id', 'title', 'avg_rating', 'workload', 'university', 'difficulty', 'provider']
courses = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/cursos.dat',
                        sep='|', header=None, names=mnames)

# show how the ratings frame looks
ratings.head(10)
Out[5]:
In [6]:
# show the first five users
users[:5]
Out[6]:
In [7]:
courses[:5]
Out[7]:
Using pd.merge, we get it all into one big DataFrame.
In [8]:
coursetalk = pd.merge(pd.merge(ratings, courses), users)
coursetalk
Out[8]:
In [9]:
coursetalk.iloc[0]
Out[9]:
To get mean course ratings grouped by the provider, we can use the pivot_table method:
In [14]:
from pandas import pivot_table
# inspect the attributes of the pivot_table function
dir(pivot_table)
Out[14]:
In [15]:
mean_ratings = pivot_table(coursetalk, values='rating', index='provider', aggfunc='mean')
mean_ratings['rating'].sort_values(ascending=False)
Out[15]:
Now let's filter down to courses that received at least 20 ratings (a completely arbitrary cutoff). To do this, I group the data by title and use size() to get a Series of group sizes:
In [16]:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title[:10]
Out[16]:
In [17]:
active_titles = ratings_by_title.index[ratings_by_title >= 20]
active_titles[:10]
Out[17]:
The index of titles receiving at least 20 ratings can then be used to select rows from a per-title mean_ratings table:
In [18]:
mean_ratings = coursetalk.pivot_table(values='rating', index='title', aggfunc='mean')['rating']
mean_ratings
Out[18]:
Having computed the mean rating for each course, we can order them with the highest-rated listed first.
In [19]:
mean_ratings.loc[active_titles].sort_values(ascending=False)
Out[19]:
To see the top courses among Coursera students, we can sort by the 'coursera' column in descending order:
In [20]:
mean_ratings = coursetalk.pivot_table(values='rating', index='title', columns='provider', aggfunc='mean')
mean_ratings[:10]
Out[20]:
In [21]:
mean_ratings['coursera'].loc[active_titles].sort_values(ascending=False)[:10]
Out[21]:
Now, let's go further! How about ranking the courses by the percentage of their ratings that are 4 or higher?
Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:
In [23]:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = coursetalk.pivot_table(values='rating',
                                        index='user_id',
                                        columns='title')
ratings_mtx_df.iloc[:15, :15]
Out[23]:
Let's extract only the ratings that are 4 or higher.
In [24]:
ratings_gte_4 = ratings_mtx_df[ratings_mtx_df >= 4.0]
# ratings below 4 become NaN, so count() below tallies only the 4+ ratings
ratings_gte_4.iloc[:15, :15]
Out[24]:
Now, taking the total number of ratings for each course and the count of ratings of 4 or higher, we can combine them into one DataFrame.
In [25]:
ratings_gte_4_pd = pd.DataFrame({'total': ratings_mtx_df.count(), 'gte_4': ratings_gte_4.count()})
ratings_gte_4_pd.head(10)
Out[25]:
In [26]:
ratings_gte_4_pd['gte_4_ratio'] = ratings_gte_4_pd['gte_4'] / ratings_gte_4_pd['total']
ratings_gte_4_pd.head(10)
Out[26]:
In [27]:
ranking = [(title, total, gte_4, score)
           for title, total, gte_4, score in ratings_gte_4_pd.itertuples()]
for title, total, gte_4, score in sorted(ranking, key=lambda x: (x[3], x[2], x[1]), reverse=True)[:10]:
    print(title, total, gte_4, score)
Now for something simpler: let's count the number of ratings for each course, and order by the most rated.
In [28]:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title.sort_values(ascending=False)[:10]
Out[28]:
With this information, we can sort the most-rated courses by their percentage of 4+ ratings.
In [29]:
for title, total, gte_4, score in sorted(ranking, key=lambda x: (x[2], x[3], x[1]), reverse=True)[:10]:
    print(title, total, gte_4, score)
Finally, let's find the courses that most often co-occur with the popular MOOC An Introduction to Interactive Programming in Python. For each course, we calculate the percentage of the Python course's raters who also rated that course. Order with the highest percentage first, and voilà: we have the top MOOCs.
In [31]:
course_users = coursetalk.pivot_table(values='rating', index='title', columns='user_id')
course_users.iloc[:15, :15]
Out[31]:
First, let's get only the users that rated the course An Introduction to Interactive Programming in Python.
In [32]:
ratings_by_course = coursetalk[coursetalk.title == 'An Introduction to Interactive Programming in Python']
ratings_by_course.set_index('user_id', inplace=True)
Now, for all other courses, let's keep only the ratings from users that rated the Python course.
In [33]:
their_ids = ratings_by_course.index
their_ratings = course_users[their_ids]
their_ratings.iloc[:15, :15]
Out[33]:
Dividing the number of users who rated both the Python course and a given course by the total number of users who rated the Python course gives us our percentage.
In [34]:
course_count = their_ratings.loc['An Introduction to Interactive Programming in Python'].count()
# for each course, the fraction of Python-course raters who also rated it
sims = their_ratings.apply(lambda profile: profile.count() / float(course_count), axis=1)
Ordering by the score, highest first, and skipping the first entry, which is the course itself:
In [35]:
sims.sort_values(ascending=False)[1:][:10]
Out[35]: