Website: https://googledrive.com/host/0BxV_WlGqTmvrWXpzaUJESnRSTEk/
Online reviews are a valuable source of information for tasks such as personalization, recommendation, and sentiment analysis. We wanted to make effective use of these reviews and build a robust system that gives a good recommendation to a user regardless of the structure and sparsity of the data. This meant using not only item-item similarity scores but also user-user similarity, which gives the system more information to work with and a backup recommendation in edge cases such as a unique item that has no closely related products. To make the system more robust still, we decided to integrate a user-centric textual analysis recommender. In looking for a dataset, we found BeerAdvocate to be a good repository, with a large number of reviews, users, and beers to work with. We noticed that BeerAdvocate.com itself has no personal recommendation system, only global lists of ratings. People want to drink beers that they know they'll personally like, and we aim to provide that for them.
We were inspired by Homework 4 to implement a good collaborative filtering recommender, but to push it further in terms of features, functionality, and scale. We were inspired to implement the textual analysis engine by our fiery interest in machine learning, math, and sentiment analysis and also by Learning Attitudes and Attributes from Multi-Aspect Reviews by Julian McAuley, Jure Leskovec, and Dan Jurafsky at Stanford.
We are trying to create a very robust recommendation system that will always come up with a good recommendation. How can we take the item-item collaborative filtering system from Homework 4 and beef it up? How can we inject more information into the recommendation engine and make it robust to sparsity in item-item similarities? How can we use anti-correlations and interpret similar items to items we know the user dislikes? Should we predict individual aspects and combine them or just the overall score? How do we combine aspects we predict into an overall score that is personalized per user? What is the best way to do sentiment and textual analysis on the text of a review and relate it to individual user preferences and numerical ratings? How do we scale everything we discussed above to millions of reviews, hundreds of millions of similarity relationships, and gigabytes of data? How can we combine user-user, item-item, and textual analysis predictions into a single prediction?
We scraped data from beeradvocate.com. Our scraping code is included with this notebook.
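We based our collaborative filtering model on the Homework 4 model, with several modifications: we predict a rating for each aspect before combining them, we use user-user similarity in addition to item-item similarity, and we use dissimilar (anti-correlated) neighbors as well as similar ones.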
We created a textual analysis engine as well for another layer of information to leverage. We analyzed each sentence in a textual review in order to identify which aspect the sentence talks about. This allows us to measure the sentiment in specific words in relation to an aspect rating. We could also use this to correlate and analyze sentences with more advanced aspects that aren't summed up in the Look, Taste, Feel, Smell categories.
We achieved this by modeling two parameter vectors that respectively encode words that discuss an aspect and words that express the associated sentiment. We can use these parameter vectors to calculate the probability that a sentence discusses a particular aspect given the ratings, as well as the probability of an entire review and of the entire corpus. We fit the two parameter vectors and the latent aspect assignments by maximizing the log-likelihood of the corpus. We optimize this by coordinate ascent, alternating between choosing the most likely latent aspect for each sentence (by calculating probabilities) and updating the parameter vectors through gradient ascent.
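As a rough illustration of the aspect-assignment step, the sketch below scores a sentence against each aspect and normalizes the scores with a softmax. It assumes the per-word scores are simply summed, in the spirit of the McAuley et al. model cited above. The names `aspect_probabilities`, `theta`, `phi`, and all parameter values are hypothetical and hand-picked, not taken from our trained model.

```python
import math

def aspect_probabilities(words, ratings, theta, phi):
    """Return P(aspect | sentence) as a softmax over summed word scores.

    theta[aspect][word]        -- how strongly a word signals an aspect
    phi[aspect][rating][word]  -- how strongly a word signals a rating for an aspect
    ratings[aspect]            -- the rating the review gave to each aspect
    """
    scores = {}
    for aspect in theta:
        total = 0.0
        for word in words:
            total += theta[aspect].get(word, 0.0)
            total += phi[aspect].get(ratings[aspect], {}).get(word, 0.0)
        scores[aspect] = total
    # subtract the largest score before exponentiating for numerical stability
    max_score = max(scores.values())
    exp_scores = dict((a, math.exp(s - max_score)) for a, s in scores.items())
    norm = sum(exp_scores.values())
    return dict((a, e / norm) for a, e in exp_scores.items())

# Toy, hand-picked parameters (purely illustrative)
theta = {'look': {'golden': 2.0, 'hoppy': -1.0},
         'taste': {'golden': -1.0, 'hoppy': 2.0}}
phi = {'look': {4.0: {'beautiful': 1.0}},
       'taste': {4.5: {'delicious': 1.0}}}
ratings = {'look': 4.0, 'taste': 4.5}

probs = aspect_probabilities(['beautiful', 'golden'], ratings, theta, phi)
best_aspect = max(probs, key=probs.get)
```

In the coordinate ascent described above, a step like this would assign each sentence to its most likely aspect before the parameter vectors are updated.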
We predict how a user would rate a particular beer by analyzing the text of the reviews the user has posted as well as the textual reviews posted by other users of the beer. We predict this by looking at the words other users reviewed the beer with and consider how the user associates these words with ratings in their own reviews.
In more detail, from the aspect parameter vectors we modeled we can get the probabilities that the particular user associates a word to a particular aspect and rating. For every aspect, we look at every review of the particular beer and get the sentence that is most likely to be about that aspect. We then store, for the particular user, the most likely rating the word implies for the aspect given the user's reviews and the probability that the user associates the word with the aspect. We then take a weighted average of how the user would give ratings in the existing reviews weighted by how strongly the user associates the word with the aspect to predict how the user would rate that aspect of the beer.
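The weighted average at the end of this procedure can be sketched as follows. This is a minimal illustration, assuming each word has already been reduced to an (implied rating, association weight) pair. The function name `predict_aspect_from_words` and the numbers are hypothetical.

```python
def predict_aspect_from_words(word_evidence):
    """Weighted average of implied ratings.

    word_evidence is a list of (implied_rating, weight) pairs, where the weight
    is how strongly the user associates the word with the aspect.
    """
    num = sum(rating * weight for rating, weight in word_evidence)
    denom = sum(weight for _, weight in word_evidence)
    if denom == 0:
        return None  # no usable evidence
    return num / denom

# three words from the most aspect-relevant sentences (made-up values)
evidence = [(4.5, 0.6), (3.0, 0.2), (4.0, 0.2)]
predicted = predict_aspect_from_words(evidence)
```

Words that the user strongly associates with the aspect dominate the prediction, while weakly associated words contribute little.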
In [ ]:
%matplotlib inline
from local_config import REVIEWS_FILE_PATH, BEERS_FILE_PATH, BEER_SIM_DB_FILE_PATH, USER_SIM_DB_FILE_PATH
import matplotlib.pyplot as plt
from multiprocessing import Process, Queue, Pool
from munkres import Munkres
import numpy as np
import nltk
import pandas as pd
import re
import random
import scipy as sp
from scipy import stats
from scipy.stats.stats import pearsonr
import sqlite3
import time
from matplotlib import rcParams
import matplotlib.cm as cm
import matplotlib as mpl
# colorbrewer2 Dark2 qualitative color table
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 400
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
# rcParams['font.family'] = 'StixGeneral'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chart junk by stripping out unnecessary plot borders and axis ticks.
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn.
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    # turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    # now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
def autolabel(rects, height_offset, fontsize):
    """Label rects with their height"""
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2.0,
                 height + height_offset,
                 '%d' % int(height),
                 ha='center',
                 va='bottom',
                 rotation='vertical',
                 fontsize=fontsize)
In [ ]:
####################
# Define constants #
####################
DEFAULT_K = 7
DEFAULT_REG = 3.0
DEFAULT_ASPECT_REG_PARAM = 3.0
ASPECTS = ['look', 'smell', 'taste', 'feel', 'overall']
ASPECTS_MINUS_OVERALL = ['look', 'smell', 'taste', 'feel']
RATINGS = [1.0 + 0.25 * x for x in range(17)] # 1-5, in steps of 0.25
WORD_SPLIT_REGEX = re.compile(r"[\w']+")
SENTENCE_TOKENIZER = nltk.data.load('tokenizers/punkt/english.pickle')
EXCLUDED_WORDS = set()
with open('excluded_words.txt', 'r') as f:
    for line in f:
        EXCLUDED_WORDS.add(line.strip().lower())
BEER_DB_CURSOR = sqlite3.connect(BEER_SIM_DB_FILE_PATH).cursor()
USER_DB_CURSOR = sqlite3.connect(USER_SIM_DB_FILE_PATH).cursor()
Our data is split across two dataframes -- one for beer reviews and one for beer information (e.g. name, brewery, etc.) -- to minimize memory usage, so we define some functions that convert between IDs and names.
Also, because our recommender depends on aspect ratings instead of a single overall rating, we filter out reviews that don't have aspect ratings. Below we've plotted some attributes of the dataset to get a sense of what it contains.
In [ ]:
# Read in reviews and beers data
reviews_df_raw = pd.read_csv(REVIEWS_FILE_PATH)
beer_df = pd.read_csv(BEERS_FILE_PATH)
In [ ]:
# filter out reviews with no username and without a text review;
# this restricts the dataset to only those reviews with aspect subratings
reviews_df = reviews_df_raw[pd.notnull(reviews_df_raw['username'])]
reviews_df = reviews_df[pd.notnull(reviews_df['text'])]
#
# The code below is used to create a smaller dataset for testing purposes
#
# filter out users with a small number of reviews
# filtered_usernames = []
# username_groups = reviews_df.groupby('username')
# for username, reviews in username_groups:
# if len(reviews) > 100:
# filtered_usernames.append(username)
# reviews_df = reviews_df[reviews_df['username'].isin(filtered_usernames)]
# filter out beers with a small number of reviews
# filtered_beer_ids = []
# beer_groups = reviews_df.groupby('beer_id')
# for beer_id, reviews in beer_groups:
# if len(reviews) > 500:
# filtered_beer_ids.append(beer_id)
# reviews_df = reviews_df[reviews_df['beer_id'].isin(filtered_beer_ids)]
We define utility functions that convert between IDs and names for beers and breweries, so that the rest of the project's code stays readable and reusable.
In [ ]:
####################################
# Data lookup/conversion functions #
####################################
"""
Beers
"""
def beer_id_to_name(beer_id, beer_df):
    return beer_df[beer_df['beer_id'] == beer_id].iloc[0]['beer_name']

def beer_id_to_brewery_id(beer_id, beer_df):
    return beer_df[beer_df['beer_id'] == beer_id].iloc[0]['brewery_id']

def beer_name_to_id(beer_name, beer_df):
    return beer_df[beer_df['beer_name'] == beer_name].iloc[0]['beer_id']

"""
Breweries
"""
def brewery_id_to_name(brewery_id, beer_df):
    return beer_df[beer_df['brewery_id'] == brewery_id].iloc[0]['brewery_name']

def brewery_name_to_id(brewery_name, beer_df):
    return beer_df[beer_df['brewery_name'] == brewery_name].iloc[0]['brewery_id']

"""
Cross-category
"""
def beer_id_to_brewery_name(beer_id, beer_df):
    brewery_id = beer_id_to_brewery_id(beer_id, beer_df)
    return brewery_id_to_name(brewery_id, beer_df)
We explore the data to get a sense of what the dataset is like. We look at high level statistics such as mean and median about the number of reviews per beer, user, and brewery. We also look at the number of reviews, users, beers, and breweries in the full dataset.
In [ ]:
####################
# Data exploration #
####################
def get_reviews_by(df, column_name):
    """Return a dict mapping each value of column_name to its number of rows."""
    num_reviews_by_item = {}
    for value, indices in df.groupby(column_name).groups.iteritems():
        num_reviews_by_item[value] = len(indices)
    return num_reviews_by_item

def display_stats_by(df, column_name, description, max_val_lookup_df, max_val_lookup_key):
    num_reviews_by_item = get_reviews_by(df, column_name)
    num_reviews = np.array(num_reviews_by_item.values())
    # get the max number of reviews, and look up the human-readable name if required
    max_name, max_num = max(num_reviews_by_item.iteritems(), key=lambda x: x[1])
    # compare against None explicitly: the truth value of a DataFrame is ambiguous
    if max_val_lookup_df is not None:
        max_name = max_val_lookup_df[max_val_lookup_df[column_name] == max_name].iloc[0][max_val_lookup_key]
    print 'Reviews per %s stats:' % description.lower()
    print ' Mean: %s' % num_reviews.mean()
    print ' Median: %s' % int(np.median(num_reviews))
    print ' Mode: %s (%s occurrences)' % tuple([int(x[0]) for x in stats.mode(num_reviews)])
    print ' Min: %s' % num_reviews.min()
    print " Max: %s (%s)" % (max_num, max_name)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.hist(num_reviews, bins=45, log=True)
    ax.set_title('Number of Reviews per %s' % description.title())
    ax.set_ylabel('Occurrences')
    ax.set_xlabel('Number of Reviews')

def display_stats(df):
    total_num_reviews = len(df)
    num_text_reviews = len(df[pd.notnull(df['text'])])
    num_nontext_reviews = len(df[pd.isnull(df['text'])])
    print 'Total reviews: %s' % total_num_reviews
    print 'Text reviews: %s (%s%%)' % (num_text_reviews, float(num_text_reviews) / total_num_reviews * 100.0)
    print 'Non-text reviews: %s (%s%%)' % (num_nontext_reviews, float(num_nontext_reviews) / total_num_reviews * 100.0)
    print
    print 'Number of users: %s' % len(df['username'].unique())
    print 'Number of beers: %s' % len(df['beer_id'].unique())
    print 'Number of breweries: %s' % len(df['brewery_id'].unique())
    print
    display_stats_by(df, 'username', 'User', None, None)
    print
    display_stats_by(df, 'beer_id', 'Beer', beer_df, 'beer_name')
    print
    display_stats_by(df, 'brewery_id', 'Brewery', beer_df, 'brewery_name')

print '-------FULL DATASET-------'
print 'Number of reviews: %s' % len(reviews_df_raw)
print 'Number of users: %s' % len(reviews_df_raw['username'].unique())
print 'Number of beers: %s' % len(reviews_df_raw['beer_id'].unique())
print 'Number of breweries: %s' % len(reviews_df_raw['brewery_id'].unique())
print '\n'
print '-------FILTERED DATASET-------'
display_stats(reviews_df)
#
# Plot number of beers by style
#
# filter out aliases, and also beers with no ratings or reviews
reviewed_beers = beer_df[pd.isnull(beer_df['alias_id']) & (beer_df['num_ratings'] > 0)]
# get the number of reviews for each style, and sort them
num_reviews_by_style = get_reviews_by(reviewed_beers, 'style')
sorted_num = sorted([(k, v) for k, v in num_reviews_by_style.iteritems()], key=lambda x: x[1], reverse=True)
# construct x (evenly spaced coords), y (num reviews), and label (style name) arrays for plotting
num_bars = len(sorted_num)
x = np.array(range(num_bars))
y = [count for _, count in reversed(sorted_num)]
labels = [unicode(style, 'utf_8') for style, _ in reversed(sorted_num)]
# plot review count by style
fig = plt.figure(figsize=(20, 15))
fig.subplots_adjust(bottom=0.16)
ax = fig.add_subplot(1, 1, 1)
rects = ax.bar(x, y, align='center', width=.6)
ax.set_xlim([-1, num_bars])
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation='vertical', fontsize=8)
ax.set_xlabel('Style')
ax.set_ylabel('Beers')
ax.set_title('Number of Beers by Style')
# put count above each bar
autolabel(rects, 40, 7)
We see that, as expected, the majority of users and beers have a small number of reviews. Also, the most reviewed beer is Dogfish Head's 90 Minute IPA! Pretty cool!
To predict the overall score a user will give a beer he hasn't tried before, our recommender first predicts how the user would rate the beer on individual aspects: look, smell, taste, and feel. Different users weigh these aspects differently. For example, one user might put more emphasis on feel than on look, while another might do exactly the opposite. Our model must account for these differences.
In our data exploration, we found that the Pearson correlation coefficient between a user's ratings for an aspect and that user's overall ratings was a good measure of how much the user emphasizes that aspect. We thus use these correlations, normalized to sum to one, as the aspect weights in our model.
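The arithmetic behind the aspect weights can be sketched in a few lines. This is a simplified illustration that starts from already-computed correlations. The function name `aspect_weights_from_correlations` and the correlation values are hypothetical, and the regularizer of 3.0 mirrors the default used in this notebook.

```python
def aspect_weights_from_correlations(correlations, n_reviews, reg=3.0):
    """Shrink each aspect/overall correlation by n / (n + reg), then normalize."""
    n = float(n_reviews)
    shrunk = [n * corr / (n + reg) for corr in correlations]
    total = sum(shrunk)
    if total == 0.0:
        return shrunk
    return [w / total for w in shrunk]

# made-up correlations for one user, in the order look, smell, taste, feel
correlations = [0.2, 0.4, 0.9, 0.5]
weights = aspect_weights_from_correlations(correlations, n_reviews=12)

# combine predicted aspect ratings into an overall rating
aspect_ratings = [4.0, 3.5, 4.5, 4.0]
overall = sum(r * w for r, w in zip(aspect_ratings, weights))
```

Note that every aspect of a given user is shrunk by the same factor n / (n + reg), so the shrinkage cancels out in the normalization. The final weights depend only on the relative sizes of the correlations.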
In [ ]:
def normalize(a):
    s = float(sum(a))
    if s == 0.0:
        return a
    return [0.0 if x == 0.0 else float(x) / s for x in a]

# cache aspect weights so we don't continually re-compute them for the same users
CACHED_ASPECT_WEIGHTS = {}

def get_aspect_weights(username, df, reg=DEFAULT_ASPECT_REG_PARAM):
    if username in CACHED_ASPECT_WEIGHTS:
        return CACHED_ASPECT_WEIGHTS[username]
    user_reviews = df[df['username'] == username]
    overall_ratings = user_reviews['overall']
    weights = [shrunk_sim(pearsonr(user_reviews[aspect], overall_ratings)[0], float(len(user_reviews)), reg) for aspect in ASPECTS_MINUS_OVERALL]
    CACHED_ASPECT_WEIGHTS[username] = normalize(weights)
    return CACHED_ASPECT_WEIGHTS[username]
In [ ]:
#
# Compute aspect weights for a subset of users and use them to reconstruct overall ratings in order to check model validity
#
# choose a random set of usernames for which to compute aspect weights
random_usernames = random.sample(reviews_df['username'].unique(), 75)
correlations = []
for username in random_usernames:
    reviews = reviews_df[reviews_df['username'] == username]
    look, smell, taste, feel = get_aspect_weights(username, reviews_df)
    predicted = np.array(reviews['look']) * look \
        + np.array(reviews['smell']) * smell \
        + np.array(reviews['taste']) * taste \
        + np.array(reviews['feel']) * feel
    actual = np.array(reviews['overall'])
    corr = pearsonr(predicted, actual)[0]
    if pd.notnull(corr):
        correlations.append(corr)
plt.hist(correlations, bins=40)
plt.title('Correlation Between Actual and Reconstructed Ratings')
plt.xlabel('correlation')
plt.ylabel('frequency')
Because of the scale of the data we are working with, it was vital to optimize the runtime of our code as much as possible. One optimization was to cache the beer and user averages so that they are available whenever we look them up throughout the recommender. We also define utility functions for getting sets of data from a dataframe.
In [ ]:
############################
# Data filtering functions #
############################
# used to cache user and beer averages so we don't have to re-compute them
USER_AVERAGES = {}
BEER_AVERAGES = {}
def get_user_averages(df, rating_col_name):
    return dict(df.groupby('username')[rating_col_name].mean())

def get_single_user_average(df, username, aspect):
    # compute the average on first request, then serve it from the cache
    if username not in USER_AVERAGES:
        USER_AVERAGES[username] = {}
    if aspect not in USER_AVERAGES[username]:
        USER_AVERAGES[username][aspect] = df[df.username == username][aspect].mean()
    return USER_AVERAGES[username][aspect]

def get_beer_averages(df, rating_col_name):
    return dict(df.groupby('beer_id')[rating_col_name].mean())

def get_single_beer_average(df, beer_id, aspect):
    # compute the average on first request, then serve it from the cache
    if beer_id not in BEER_AVERAGES:
        BEER_AVERAGES[beer_id] = {}
    if aspect not in BEER_AVERAGES[beer_id]:
        BEER_AVERAGES[beer_id][aspect] = df[df.beer_id == beer_id][aspect].mean()
    return BEER_AVERAGES[beer_id][aspect]

def get_user_reviewed(username, df):
    return set(df[df['username'] == username]['beer_id'])

def get_beer_reviewers(beer_id, df):
    return set(df[df['beer_id'] == beer_id]['username'])

def get_user_top_rated(username, rating_col_name, df, numchoices=5):
    "Return the sorted top numchoices beers for a user by the given rating column name."
    return df[df['username'] == username][['beer_id', rating_col_name]].sort([rating_col_name], ascending=False).head(numchoices)
We define utility functions for getting certain types of data from sets of data and working with common support sets.
In [ ]:
#########################################################
# Define functions used for common support calculations #
#########################################################
def get_common_reviewers(beer_id_1, beer_id_2, df):
    beer_1_reviewers = df[df['beer_id'] == beer_id_1]['username'].unique()
    beer_2_reviewers = df[df['beer_id'] == beer_id_2]['username'].unique()
    return set(beer_1_reviewers).intersection(beer_2_reviewers)

def get_common_reviewed(username_1, username_2, df):
    user_1_reviewed = df[df['username'] == username_1]['beer_id'].unique()
    user_2_reviewed = df[df['username'] == username_2]['beer_id'].unique()
    return set(user_1_reviewed).intersection(user_2_reviewed)

def get_common_support(df):
    beers = df.beer_id.unique()
    print len(beers)
    supports = []
    for i, beer_id_1 in enumerate(beers):
        for j, beer_id_2 in enumerate(beers):
            if i < j:
                common_reviewers = get_common_reviewers(beer_id_1, beer_id_2, df)
                supports.append(len(common_reviewers))
    return supports

def get_reviews_for_beer_and_users(beer_id, user_set, df):
    """Given a beer ID and a set of usernames, return the sub-dataframe of the users' reviews of the beer."""
    mask = (df['username'].isin(user_set)) & (df['beer_id'] == beer_id)
    reviews = df[mask]
    return reviews[reviews['username'].duplicated() == False]

def get_reviews_for_user_and_beers(username, beer_set, df):
    """Given a username and a set of beer IDs, return the sub-dataframe of the user's reviews of the beers."""
    mask = (df['beer_id'].isin(beer_set)) & (df['username'] == username)
    reviews = df[mask]
    return reviews[reviews['beer_id'].duplicated() == False]
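We define functions for computing similarities between pairs of beers and between pairs of users, and for shrinking those similarities according to their common support.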
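To make the combination of common support, mean-centering, and shrinkage concrete, here is a small pandas-free sketch. The function name `shrunk_pearson` and the toy ratings are hypothetical. The logic follows the same steps as the functions in the next cell, but on plain dicts.

```python
import math

def shrunk_pearson(ratings_1, ratings_2, averages, reg=3.0):
    """Return (shrunk similarity, common support) for two beers.

    ratings_1, ratings_2 -- dicts mapping reviewer -> rating for each beer
    averages             -- dict mapping reviewer -> that reviewer's mean rating
    """
    common = sorted(set(ratings_1) & set(ratings_2))
    n = len(common)
    if n == 0:
        return 0.0, 0
    # subtract each reviewer's average so generous and harsh raters are comparable
    d1 = [ratings_1[u] - averages[u] for u in common]
    d2 = [ratings_2[u] - averages[u] for u in common]
    m1 = sum(d1) / n
    m2 = sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    var1 = sum((a - m1) ** 2 for a in d1)
    var2 = sum((b - m2) ** 2 for b in d2)
    if var1 == 0 or var2 == 0:
        return 0.0, n  # correlation is undefined, so treat as uncorrelated
    sim = cov / math.sqrt(var1 * var2)
    # shrink toward zero when the common support is small
    return n * sim / (n + reg), n

beer_a = {'ann': 5.0, 'bob': 3.0, 'cat': 4.0, 'dan': 2.0}
beer_b = {'ann': 4.5, 'bob': 2.5, 'cat': 3.5}
averages = {'ann': 4.0, 'bob': 4.0, 'cat': 4.0, 'dan': 4.0}

sim, support = shrunk_pearson(beer_a, beer_b, averages)
```

Here the three common reviewers rate the two beers in perfect agreement, yet the shrunk similarity is only half of the raw correlation because it rests on just three reviews.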
In [ ]:
####################################
# Similarity calculation functions #
####################################
def pearson_sim(reviews_df_1, reviews_df_2, averages, num_common, rating_col_name, avg_lookup_col_name):
    """
    Given 2 subframes of reviews, return the Pearson correlation coefficient between the reviews.
    * Also subtract an average rating value from the reviews before calculating correlation.
    * If there is no common support between the review sets, return 0.
    * If variances are 0, NaN may be returned.
    """
    if num_common == 0:
        return 0.0
    diff1 = reviews_df_1.apply(lambda x: x[rating_col_name] - averages[x[avg_lookup_col_name]], axis=1)
    diff2 = reviews_df_2.apply(lambda x: x[rating_col_name] - averages[x[avg_lookup_col_name]], axis=1)
    return pearsonr(diff1, diff2)[0]

def calculate_beer_similarity(beer_id_1, beer_id_2, user_averages, similarity_func, rating_col_name, df):
    common_reviewers = get_common_reviewers(beer_id_1, beer_id_2, df)
    beer_1_reviews = get_reviews_for_beer_and_users(beer_id_1, common_reviewers, df)
    beer_2_reviews = get_reviews_for_beer_and_users(beer_id_2, common_reviewers, df)
    sim = similarity_func(beer_1_reviews, beer_2_reviews, user_averages, len(common_reviewers), rating_col_name, 'username')
    return (0.0 if np.isnan(sim) else sim, len(common_reviewers))

def calculate_user_similarity(username_1, username_2, beer_averages, similarity_func, rating_col_name, df):
    common_beers = get_common_reviewed(username_1, username_2, df)
    user_1_reviews = get_reviews_for_user_and_beers(username_1, common_beers, df)
    user_2_reviews = get_reviews_for_user_and_beers(username_2, common_beers, df)
    sim = similarity_func(user_1_reviews, user_2_reviews, beer_averages, len(common_beers), rating_col_name, 'beer_id')
    return (0.0 if np.isnan(sim) else sim, len(common_beers))

def shrunk_sim(sim, n_common, reg=DEFAULT_REG):
    "Shrink the similarity with the regularizer."
    return (n_common * sim) / (n_common + reg)
We generalize the k_nearest function so that object_id and search_set can refer to either beers or users, which lets us use the same code for both item-item and user-user similarity.
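The generalization amounts to never assuming what kind of ID is being compared. A minimal sketch follows, assuming a hypothetical `get_similarity(id_1, id_2)` callable in place of the database object and a toy similarity table with made-up values.

```python
def k_nearest(object_id, search_set, get_similarity, k=7, reg=3.0):
    """Return up to k (id, shrunk similarity, support) tuples, most similar first."""
    scored = []
    for other_id in search_set:
        if other_id == object_id:
            continue
        sim, support = get_similarity(object_id, other_id)
        scored.append((other_id, support * sim / (support + reg), support))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# toy user-user similarities: (raw similarity, common support), keyed by sorted pair
TOY_SIMILARITIES = {
    ('ann', 'bob'): (0.9, 2),
    ('ann', 'cat'): (0.6, 30),
    ('ann', 'dan'): (-0.4, 10),
}

def lookup(id_1, id_2):
    return TOY_SIMILARITIES[tuple(sorted((id_1, id_2)))]

neighbors = k_nearest('ann', ['ann', 'bob', 'cat', 'dan'], lookup, k=2)
neighbor_ids = [n[0] for n in neighbors]
```

The same function works unchanged if the IDs are beer IDs and the lookup returns beer-beer similarities. In this toy example 'cat' outranks 'bob' despite a lower raw similarity, because 'bob' shares only two reviews with 'ann'.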
In [ ]:
##################################################
# Functions for k-nearest neighbors calculations #
##################################################
def k_nearest(object_id, search_set, aspect, db, k=DEFAULT_K, reg=DEFAULT_REG):
    similar = []
    for current_object_id in search_set:
        if current_object_id != object_id:
            sim, support = db.get(object_id, current_object_id, aspect)
            similar.append((current_object_id, shrunk_sim(sim, support, reg=reg), support))
    similar.sort(key=lambda x: x[1], reverse=True)
    return similar[:k]
In [ ]:
############################
# Recommendation functions #
############################
def get_top_recos_for_user(username, rating_col_name, df, db, n, k=DEFAULT_K, reg=DEFAULT_REG):
    # we'll get similar beers from all those in the dataset
    unique_beer_ids = df['beer_id'].unique()
    neighbors = set()
    # for each of the user's top-rated beers...
    for i, top_beer_id in get_user_top_rated(username, rating_col_name, df, numchoices=n)['beer_id'].iteritems():
        # ...get similar beers (k_nearest needs the rating column as its aspect argument)
        for near_beer_id, _, _ in k_nearest(top_beer_id, unique_beer_ids, rating_col_name, db, k=k, reg=reg):
            neighbors.add(near_beer_id)
    # only use beers that the user has not reviewed
    neighbors = neighbors - get_user_reviewed(username, df)
    result = [(beer_id, get_single_beer_average(df, beer_id, rating_col_name)) for beer_id in neighbors]
    return sorted(result, key=lambda x: x[1], reverse=True)[:n]
Things we did below:
Our first attempt at predicting a user's overall rating of a new beer is based on the restaurant recommender developed in Homework 4, with several key modifications designed to improve accuracy and precision.
The first modification was possible due to the nature of our data: in a review, users provide not only an overall rating for the beer but also ratings for four aspects (taste, smell, look, and feel). Rather than trying to predict the overall rating directly, we first predict a rating for each of the four aspects and then combine them into an overall rating, using weights based on the correlations between each aspect and the overall rating for the given user.
The second modification was to incorporate user-user similarity information in addition to the information gleaned from item-item similarities. This should help the predictor in situations where the user tries a unique beer that isn't really similar to any other beer: we can look at how similar users reacted to the beer and predict the rating from that.
The final modification was to look not just for similar things but also for dissimilar ones, which are arguably just as useful for predicting a rating. For example, if the user really liked a beer that is the exact opposite of the given beer, he probably will not like the given beer very much. In other words, only things that are uncorrelated with the target (i.e. neither similar nor dissimilar) give no useful information to the rating predictor.
Our final implementation hence works as follows. First, calculate all beer-beer and user-user similarities. Sort these by the absolute value of the similarity and pick the k nearest things (beers or users). Incorporate the information from these neighbors with the proper sign (positive for similar, negative for dissimilar) into the rating. Do this separately for each aspect. Finally, combine the aspect ratings according to user-specific weights to get the overall rating.
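The signed-neighbor step can be sketched with plain numbers. This is a simplified illustration, assuming each neighbor has already been reduced to a (similarity, observed rating, neighbor baseline) triple. The function name `predict_from_neighbors` and all values are hypothetical.

```python
def baseline(global_avg, user_avg, item_avg):
    """Global average plus the user's and the item's offsets from it."""
    return global_avg + (user_avg - global_avg) + (item_avg - global_avg)

def predict_from_neighbors(base, neighbors, k=7):
    """Adjust a baseline using the k neighbors with the largest |similarity|.

    neighbors is a list of (similarity, observed_rating, neighbor_baseline).
    A negative similarity flips the sign of that neighbor's deviation.
    """
    ranked = sorted(neighbors, key=lambda n: abs(n[0]), reverse=True)[:k]
    num = sum(sim * (rating - nb) for sim, rating, nb in ranked)
    denom = sum(abs(sim) for sim, _, _ in ranked)
    if denom == 0:
        return base
    return base + num / denom

base = baseline(3.8, 4.0, 3.6)
neighbors = [
    (0.8, 4.5, 4.0),   # similar beer, rated above its baseline: pushes up
    (-0.6, 4.5, 3.5),  # dissimilar beer, rated above its baseline: pushes down
    (0.1, 2.0, 3.0),   # nearly uncorrelated: dropped when k=2
]
prediction = predict_from_neighbors(base, neighbors, k=2)
```

Dividing by the sum of absolute similarities keeps the adjustment on the same scale as the ratings even when positive and negative neighbors are mixed.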
For predict_aspect_rating_from_text, i.e. the rating predictor based on text (and hence more of a bottom-up approach):
In [ ]:
############################
# Aspect rating prediction #
############################
# pre-compute global averages over the filtered reviews
GLOBAL_AVG = {}
for aspect in ASPECTS:
    GLOBAL_AVG[aspect] = reviews_df[aspect].mean()
def baseline(global_avg, user_avg, beer_avg):
return global_avg + (user_avg - global_avg) + (beer_avg - global_avg)
def predict_aspect_rating(beer_id, username, aspect, beer_db, user_db, df, k=DEFAULT_K, reg=DEFAULT_REG):
    BEER = 0
    USER = 1
    beer_avg = get_single_beer_average(df, beer_id, aspect)
    user_avg = get_single_user_average(df, username, aspect)
    # get k nearest beers among those this user has reviewed
    nearest_beers = k_nearest(beer_id, get_user_reviewed(username, df), aspect, beer_db, k=k, reg=reg)
    # get k nearest users who have reviewed this beer
    nearest_users = k_nearest(username, df[df['beer_id'] == beer_id]['username'].unique(), aspect, user_db, k=k, reg=reg)
    # k_nearest already returns shrunk similarities, so they are not shrunk again here
    nearest = []
    for beer, sim, support in nearest_beers:
        nearest.append((BEER, beer, sim, support))
    for user, sim, support in nearest_users:
        nearest.append((USER, user, sim, support))
    # rank beers and users together by the magnitude of their similarity
    nearest.sort(key=lambda x: abs(x[2]), reverse=True)
    num = 0.0
    denom = 0.0
    for id_type, object_id, sim, support in nearest[:k]:
        if id_type == BEER:
            # get the user's review of the similar beer
            reviews = df[(df['username'] == username) & (df['beer_id'] == object_id)]
            assert(reviews.shape[0] == 1)
            # get average for the similar beer
            similar_beer_avg = get_single_beer_average(df, object_id, aspect)
            num += sim * (float(reviews.iloc[0][aspect]) - baseline(GLOBAL_AVG[aspect], user_avg, similar_beer_avg))
            denom += abs(sim)
        elif id_type == USER:
            # get the similar user's review of the beer
            reviews = df[(df['username'] == object_id) & (df['beer_id'] == beer_id)]
            assert(reviews.shape[0] == 1)
            # get average for the similar user
            similar_user_avg = get_single_user_average(df, object_id, aspect)
            num += sim * (float(reviews.iloc[0][aspect]) - baseline(GLOBAL_AVG[aspect], similar_user_avg, beer_avg))
            denom += abs(sim)
    if denom != 0:
        return baseline(GLOBAL_AVG[aspect], user_avg, beer_avg) + num / denom
    else:
        return baseline(GLOBAL_AVG[aspect], user_avg, beer_avg)
def predict_overall_rating(beer_id, username, beer_db, user_db, df, k=DEFAULT_K, reg=DEFAULT_REG):
aspect_ratings = []
for aspect in ASPECTS_MINUS_OVERALL:
aspect_ratings.append(
predict_aspect_rating(beer_id, username, aspect, beer_db, user_db, df, k=k, reg=reg))
user_aspect_weights = get_aspect_weights(username, df)
return np.array(aspect_ratings).dot(user_aspect_weights)
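def predict_aspect_rating_from_text(beer_id, username, aspect, df):
    # train the sentence model on the target user's own reviews
    sentence_model = SentenceModel(df[df['username'] == username])
    sentence_model.train()
    predicted_ratings_sum = 0.0
    count = 0
    # iterrows() yields (index, row) pairs
    for _, review in df[df['beer_id'] == beer_id].iterrows():
        # find the sentence most likely to discuss this aspect
        max_prob = float('-inf')
        max_words = []
        for sentence in SentenceModel.get_sentences(review['text']):
            # assumption: get_sentences() yields raw sentence strings, which we tokenize here
            words = WORD_SPLIT_REGEX.findall(sentence.lower())
            prob = sentence_model.get_sentence_prob_from_words(words, aspect)
            if prob > max_prob:
                max_prob = prob
                max_words = words
        num = 0.0
        denom = 0.0
        for word in max_words:
            if word in sentence_model.words_set:
                # most likely rating the user associates with this word for this aspect
                max_phi = float('-inf')
                max_rating = None
                for r in RATINGS:
                    phi = sentence_model.phi[aspect][r][word]
                    if phi > max_phi:
                        max_phi = phi
                        max_rating = r
                # weight the rating by how strongly the user ties the word to the aspect
                num += sentence_model.theta[aspect][word] * max_rating
                denom += sentence_model.theta[aspect][word]
        # skip reviews with no usable words to avoid dividing by zero
        if denom != 0:
            predicted_ratings_sum += num / denom
            count += 1
    # return None when no review of the beer yields a prediction
    return predicted_ratings_sum / float(count) if count > 0 else None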
We store the similarities between users and between beers in databases for the recommenders to query. The similarities were computed at scale with MapReduce on Amazon EC2. The resulting data was very large: there were 242 million entries in the beer-beer similarity database and 75 million entries in the user-user database. The beer-beer database was 16GB in size, which was too large to fit into memory, so we couldn't just create a Database object. Instead, we used SQLite to store the similarities, indexed by item ID, which allows fast indexed lookups without loading the data into memory.
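A minimal sketch of this kind of lookup is shown below, using an in-memory SQLite database. The table layout, the `taste` and `support` columns, and the `get_similarity` helper are hypothetical stand-ins and not the actual schema of our similarity databases. The two ideas it illustrates are storing each pair once under its sorted IDs and querying through a composite primary key.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE similarities ("
    " object_id_1 TEXT, object_id_2 TEXT, taste REAL, support INTEGER,"
    " PRIMARY KEY (object_id_1, object_id_2))"
)
# each unordered pair is stored once, with the smaller ID first
conn.execute(
    "INSERT INTO similarities VALUES (?, ?, ?, ?)",
    ('beer_a', 'beer_b', 0.75, 12),
)
conn.commit()

def get_similarity(conn, id_1, id_2):
    """Return (similarity, support) for a pair, or None if the pair is absent."""
    # sort the IDs so the lookup matches the stored ordering
    id_1, id_2 = sorted((id_1, id_2))
    return conn.execute(
        "SELECT taste, support FROM similarities"
        " WHERE object_id_1 = ? AND object_id_2 = ?",
        (id_1, id_2),
    ).fetchone()

result = get_similarity(conn, 'beer_b', 'beer_a')
missing = get_similarity(conn, 'beer_a', 'beer_z')
```

Passing the IDs as bound parameters (the `?` placeholders) rather than formatting them into the SQL string also avoids quoting problems with IDs such as usernames that contain apostrophes.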
In [ ]:
class Database(object):
    """A class representing a database of similarities and common supports."""
    def __init__(self, df, id_col, average_function):
        self.df = df
        self.average_function = average_function
        self.id_col = id_col
        self.opposite_id_col = None
        if id_col == 'beer_id':
            self.opposite_id_col = 'username'
        else:
            self.opposite_id_col = 'beer_id'
        # map each unique ID to a row/column index in the matrices below
        self.unique_ids = {v: k for (k, v) in enumerate(df[id_col].unique())}
        keys = self.unique_ids.keys()
        num_keys = len(keys)
        self.similarities = np.zeros([num_keys, num_keys])
        self.supports = np.zeros([num_keys, num_keys], dtype=int)

    def calculate_similarity(self, id_1, id_2, averages, similarity_func, rating_col_name, df):
        raise NotImplementedError

    def populate_by_calculating(self, similarity_func, rating_col_name):
        averages = self.average_function(self.df, rating_col_name)
        items = self.unique_ids.items()
        count = 0
        for id_1, i in items:
            print '%s (i = %s)' % (count, i)
            count += 1
            for id_2, j in items:
                if i < j:
                    sim, nsup = self.calculate_similarity(id_1, id_2, averages, similarity_func, rating_col_name, self.df)
                    self.similarities[i][j] = sim
                    self.similarities[j][i] = sim
                    self.supports[i][j] = nsup
                    self.supports[j][i] = nsup
                elif i == j:
                    nsup = self.df[self.df[self.id_col] == id_1][self.opposite_id_col].count()
                    self.similarities[i][i] = 1.0
                    self.supports[i][i] = nsup

    def get(self, id_1, id_2, aspect=None):
        "Return a (similarity, common_support) tuple for the given IDs"
        # aspect is accepted so that k_nearest can call this like SQLDatabase.get;
        # it is ignored because this object holds similarities for one rating column
        return (
            self.similarities[self.unique_ids[id_1]][self.unique_ids[id_2]],
            self.supports[self.unique_ids[id_1]][self.unique_ids[id_2]]
        )

class BeerDatabase(Database):
    def __init__(self, df):
        super(BeerDatabase, self).__init__(df, 'beer_id', get_user_averages)
        self.calculate_similarity = calculate_beer_similarity

class UserDatabase(Database):
    def __init__(self, df):
        super(UserDatabase, self).__init__(df, 'username', get_beer_averages)
        self.calculate_similarity = calculate_user_similarity
"""
# The classes below retrieve similarities from sqlite3 databases of beer and user similarities.
"""
class SQLDatabase(object):
    def __init__(self):
        self.cursor = None

    def get(self, object_id_1, object_id_2, aspect):
        # we need to sort the object IDs because the database has one row
        # for each object pair, indexed by the *sorted* pair
        object_id_1, object_id_2 = sorted((object_id_1, object_id_2))
        # the aspect is a column name, so it cannot be bound as a parameter;
        # the IDs are bound as strings so that quotes in usernames cannot break the query
        query = "SELECT %s FROM similarities WHERE object_id_1=? AND object_id_2=?" % aspect
        params = tuple('%s' % x for x in (object_id_1, object_id_2))
        return self.cursor.execute(query, params).fetchone()[0]

class SQLBeerDatabase(SQLDatabase):
    def __init__(self):
        super(SQLBeerDatabase, self).__init__()
        self.cursor = BEER_DB_CURSOR

class SQLUserDatabase(SQLDatabase):
    def __init__(self):
        super(SQLUserDatabase, self).__init__()
        self.cursor = USER_DB_CURSOR
# def get_sim_sql(object_id_1, object_id_2, aspect, cursor):
# table_name = c.execute("SELECT table_name FROM object_lookup WHERE object_id=?", (object_id_1,)).fetchone()[0]
# print c.execute("SELECT %s FROM %s WHERE object_id='%s'" % (aspect, table_name, object_id_2)).fetchall()
# def get_beer_sim_sql(beer_id_1, beer_id_2, aspect):
# return get_sim_sql(beer_id_1, beer_id_2, aspect, BEER_DB_CURSOR)
# def get_user_sim_sql(username_1, username_2, aspect):
# return get_sim_sql(username_1, username_2, aspect, USER_DB_CURSOR)
In [ ]:
beer_db = BeerDatabase(reviews_df)
# beer_db.populate_by_calculating(pearson_sim, 'rating')
sql_beer_db = SQLBeerDatabase()
user_db = UserDatabase(reviews_df)
# user_db.populate_by_calculating(pearson_sim, 'rating')
sql_user_db = SQLUserDatabase()
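Because of the scale, we compute similarities with MapReduce, so the code below exports our DataFrame to a MapReduce-ready format: one line per review containing the username, the beer ID, and each aspect rating centered on the relevant average (the user's average for beer-beer similarities, the beer's average for user-user similarities).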
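The MapReduce job itself is not shown in this section. As a rough, in-process sketch of how lines in the exported format could be turned into beer-beer similarities, the example below groups centered ratings by user (the "map" side) and accumulates per-pair sums (the "reduce" side). The function names, the use of only the first aspect column, the cosine-of-centered-ratings formula, and the assumption that usernames contain no spaces are all illustrative choices, not the project's actual job.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def map_user(lines):
    """Group centered ratings by user: {username: [(beer_id, centered_rating)]}."""
    by_user = defaultdict(list)
    for line in lines:
        # assumed column order: username, beer_id, first centered aspect, ...
        username, beer_id, centered = line.split()[:3]
        by_user[username].append((beer_id, float(centered)))
    return by_user

def reduce_pairs(by_user):
    """Return {(beer_a, beer_b): (similarity, common_support)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])  # sum xy, sum xx, sum yy, count
    for ratings in by_user.values():
        # every pair of beers rated by the same user contributes one term
        for (a, x), (b, y) in combinations(sorted(ratings), 2):
            acc = sums[(a, b)]
            acc[0] += x * y
            acc[1] += x * x
            acc[2] += y * y
            acc[3] += 1
    result = {}
    for pair, (xy, xx, yy, n) in sums.items():
        denom = sqrt(xx * yy)
        result[pair] = (xy / denom if denom > 0 else 0.0, n)
    return result

# Made-up input lines in the exported format
lines = ["alice 1 0.5", "alice 2 -0.5", "bob 1 1.0", "bob 2 -1.0"]
sims = reduce_pairs(map_user(lines))
```

For user-user similarities the same sketch would apply with the roles of the first two columns swapped.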
In [ ]:
# get a copy of the raw df, but with null usernames and review text filtered out
full_reviews_df = reviews_df_raw[pd.notnull(reviews_df_raw['username'])]
full_reviews_df = full_reviews_df[pd.notnull(full_reviews_df['text'])]
def convert_to_mr_format_beer(df):
    # get DataFrame-wide user averages
    user_averages = {}
    for aspect in ASPECTS:
        user_averages[aspect] = get_user_averages(df, aspect)
    with open('mr_input/mr_input_all_full_beer.txt', 'w') as f:
        for i, review in df.iterrows():
            username = review['username']
            beer_id = review['beer_id']
            things = [username, beer_id]
            for aspect in ASPECTS:
                things.append(review[aspect] - user_averages[aspect][username])
            f.write(' '.join([str(x) for x in things]) + '\n')

def convert_to_mr_format_user(df):
    # get DataFrame-wide beer averages
    beer_averages = {}
    for aspect in ASPECTS:
        beer_averages[aspect] = get_beer_averages(df, aspect)
    with open('mr_input/mr_input_all_full_user.txt', 'w') as f:
        for i, review in df.iterrows():
            username = review['username']
            beer_id = review['beer_id']
            things = [username, beer_id]
            for aspect in ASPECTS:
                things.append(review[aspect] - beer_averages[aspect][beer_id])
            f.write(' '.join([str(x) for x in things]) + '\n')
# convert_to_mr_format_beer(full_reviews_df)
# convert_to_mr_format_user(full_reviews_df)
In [ ]:
# predict ratings for all reviews
reviews_df_copy = reviews_df.copy(deep=True)
reviews_df_copy['predicted'] = reviews_df.apply(lambda x: predict_overall_rating(x['beer_id'], x['username'], sql_beer_db, sql_user_db, reviews_df), axis=1)
In [ ]:
test_beer_id = 58577
nearest_beers = k_nearest(test_beer_id, reviews_df['beer_id'].unique(), beer_db)
print 'Top matches for %s (%s):' % (beer_id_to_name(test_beer_id, beer_df), test_beer_id)
for i, (beer_id, sim, support) in enumerate(nearest_beers):
    print i, beer_id_to_name(beer_id, beer_df), "| Sim", sim, "| Support", support
In [ ]:
test_username = 'Sammy'
nearest_users = k_nearest(test_username, reviews_df['username'].unique(), user_db)
print 'Top matches for %s:' % test_username
for i, (username, sim, support) in enumerate(nearest_users):
    print i, username, "| Sim", sim, "| Support", support
The code below uses the approach described in the related work section above to learn which words correspond to which aspects. We then use the rating prediction code above to leverage what we've learned to predict ratings based on textual analysis.
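To make the scoring concrete before the full model, here is a minimal sketch of how a sentence's aspect probabilities can be computed as a softmax over summed word weights, in the spirit of `get_sentence_prob` below. The aspect names, the toy weights, and the function `aspect_probabilities` are made up for illustration. Unlike the model below, this sketch gives unseen words a weight of zero and subtracts the largest score before exponentiating for numerical stability.

```python
import math

def aspect_probabilities(words, ratings, theta, phi, aspects):
    """Softmax over aspects of the summed word weights for one sentence."""
    scores = {}
    for k in aspects:
        # aspect weight plus sentiment weight (selected by the review's rating for k)
        scores[k] = sum(theta[k].get(w, 0.0) + phi[k][ratings[k]].get(w, 0.0)
                        for w in words)
    # subtract the largest score before exponentiating to avoid overflow
    top = max(scores.values())
    exps = dict((k, math.exp(s - top)) for k, s in scores.items())
    z = sum(exps.values())
    return dict((k, e / z) for k, e in exps.items())

# Toy weights, made up for illustration only
aspects = ['look', 'taste']
theta = {'look': {'golden': 2.0}, 'taste': {'bitter': 2.0}}
phi = {'look': {4.0: {'golden': 0.5}}, 'taste': {4.0: {'bitter': 0.5}}}
ratings = {'look': 4.0, 'taste': 4.0}

probs = aspect_probabilities(set(['golden', 'hazy']), ratings, theta, phi, aspects)
```

With these toy weights the sentence is far more likely to describe the look of the beer than its taste, because "golden" carries weight only under the `look` aspect.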
In [ ]:
class SentenceModel(object):
    def __init__(self, df):
        self.df = df
        self.sentences = {}
        self.words_set = set()
        self.ratings = {}
        # used for sentence aspect assignment
        self.NUM_EXTRA_NODES = 2
        # some useful numbers
        self.num_reviews = 0
        self.num_rating_possibilites = len(ASPECTS) * len(RATINGS)
        # initialize sentences and words
        for i, review in df.iterrows():
            self.num_reviews += 1
            for s in SENTENCE_TOKENIZER.tokenize(review['text']):
                # Don't use sentences that just describe the serving type
                if s.startswith('Serving type: '):
                    break
                # get all words in the sentence
                words = set([w.lower() for w in WORD_SPLIT_REGEX.findall(s)])
                # remove common "function" words that aren't useful for our analysis
                words -= EXCLUDED_WORDS
                # add the sentence to the sentence dict
                if i not in self.sentences:
                    self.sentences[i] = []
                self.sentences[i].append(['', words, None])
                # add the words to the global word set
                self.words_set.update(words)
            # record this review's ratings
            self.ratings[i] = {}
            for k in ASPECTS:
                self.ratings[i][k] = review[k]
        # initialize aspect weights
        self.theta = {}
        for aspect in ASPECTS:
            self.theta[aspect] = {}
            for w in self.words_set:
                self.theta[aspect][w] = random.random() * 0.5
        for aspect in ASPECTS:
            self.theta[aspect][aspect] = 1.0
        # initialize sentiment weights
        self.phi = {}
        for aspect in ASPECTS:
            self.phi[aspect] = {}
            for r in RATINGS:
                self.phi[aspect][r] = {}
                for w in self.words_set:
                    self.phi[aspect][r][w] = random.random() * 0.5
        for aspect in ASPECTS:
            self.phi[aspect][3.0][aspect] = 1
        # will be used later for gradient ascent
        self.gradient_theta = {}
        self.gradient_phi = {}

    def __iter__(self):
        """Allow for iteration through all sentences with one loop."""
        for df_index, sentences in self.sentences.iteritems():
            for sentence_num, sentence_data in enumerate(sentences):
                yield (df_index, sentence_num, sentence_data[0], sentence_data[1], sentence_data[2])

    @staticmethod
    def get_sentences(text):
        sentences = []
        for s in SENTENCE_TOKENIZER.tokenize(text):
            # get all words in the sentence
            words = set([w.lower() for w in WORD_SPLIT_REGEX.findall(s)])
            # remove common "function" words
            words -= EXCLUDED_WORDS
            sentences.append(words)
        return sentences
    def save_model(self, filename):
        # pickle data is binary, so open the file in binary mode
        with open(filename, 'wb') as f:
            pickle.dump([self.words_set, self.theta, self.phi], f)

    def load_model(self, filename):
        with open(filename, 'rb') as f:
            data = pickle.load(f)
            self.words_set = data[0]
            self.theta = data[1]
            self.phi = data[2]

    def get_sentence_prob(self, i, j, aspect):
        # softmax over aspects of the summed aspect and sentiment weights
        words = self.get_words(i, j)
        z = 0.0
        this_aspect_sum = None
        for k in ASPECTS:
            weight_sum = 0.0
            for w in words:
                weight_sum += self.theta[k][w] + self.phi[k][self.ratings[i][k]][w]
            if k == aspect:
                this_aspect_sum = weight_sum
            z += np.exp(weight_sum)
        return np.exp(this_aspect_sum) / z
    def get_sentence_prob_from_words(self, words, aspect):
        # free text has no review index, so no ratings are available to select
        # the sentiment weights (phi); only the aspect weights (theta) are used
        z = 0.0
        this_aspect_sum = None
        for k in ASPECTS:
            weight_sum = 0.0
            for w in words:
                if w in self.words_set:
                    weight_sum += self.theta[k][w]
            if k == aspect:
                this_aspect_sum = weight_sum
            z += np.exp(weight_sum)
        return np.exp(this_aspect_sum) / z
    def get_sentence_compatability(self, i, j, aspect, words=None):
        if not words:
            words = self.get_words(i, j)
        weight_sum = 0.0
        for w in words:
            weight_sum += self.theta[aspect][w] + self.phi[aspect][self.ratings[i][aspect]][w]
        return weight_sum

    def _update_assignments(self):
        def worker(i):
            sentences = self.sentences[i]
            changed = False
            num_sentences = len(sentences)
            best_aspects = [None for _ in range(num_sentences)]
            matrix = np.zeros((num_sentences, num_sentences + self.NUM_EXTRA_NODES))
            for j, sentence_data in enumerate(sentences):
                # assign an aspect to the current sentence based on its compatibility score with each aspect
                max_compatability = float('-inf')
                for k in ASPECTS:
                    compatability = self.get_sentence_compatability(i, j, k, words=sentence_data[1])
                    if compatability > max_compatability:
                        max_compatability = compatability
                        best_aspects[j] = k
                # fill in the matrix
                for k in range(num_sentences + self.NUM_EXTRA_NODES):
                    if k < len(ASPECTS) and num_sentences >= len(ASPECTS):
                        matrix[j][k] = -self.get_sentence_compatability(i, j, ASPECTS[k])
                    else:
                        matrix[j][k] = -max_compatability
            # update sentence aspect assignments based on the results of the Kuhn-Munkres algorithm
            m = Munkres()
            for row, col in m.compute(matrix):
                if col < len(ASPECTS) and num_sentences >= len(ASPECTS):
                    best_aspects[row] = ASPECTS[col]
                    # print '(%d, %d) -> %d, %s' % (row, col, matrix[row][col], ASPECTS[col])
                # else:
                    # print '(%d, %d) -> %d' % (row, col, matrix[row][col])
                if self.get_aspect(i, row) != best_aspects[row]:
                    changed = True
                    self.set_aspect(i, row, best_aspects[row])
            return changed
        # the single-loop version of this computation is "embarrassingly parallel," so we parallelize it
        return any(parmap(worker, self.sentences.keys()))

    def _init_gradient_dicts(self):
        self.gradient_theta = {}
        for aspect in ASPECTS:
            self.gradient_theta[aspect] = {}
            for w in self.words_set:
                self.gradient_theta[aspect][w] = 0.0
        self.gradient_phi = {}
        for aspect in ASPECTS:
            self.gradient_phi[aspect] = {}
            for r in RATINGS:
                self.gradient_phi[aspect][r] = {}
                for w in self.words_set:
                    self.gradient_phi[aspect][r][w] = 0

    def _compute_gradient(self):
        # start each gradient entry from the regularization term
        for k in ASPECTS:
            for w in self.words_set:
                self.gradient_theta[k][w] = -1.0 * float(self.num_rating_possibilites) * self.theta[k][w]
                for r in RATINGS:
                    self.gradient_phi[k][r][w] = -1.0 * float(self.num_rating_possibilites) * self.phi[k][r][w]
        # add the contribution of every sentence that has an assigned aspect
        for i, j, s, words, curr_aspect in self:
            if not curr_aspect:
                continue
            curr_aspect_rating = self.ratings[i][curr_aspect]
            num = 0.0
            denom = 0.0
            for k in ASPECTS:
                exp_score = 0.0
                for w in words:
                    exp_score += self.theta[k][w] + self.phi[k][self.ratings[i][k]][w]
                exp_score = np.exp(exp_score)
                if k == curr_aspect:
                    num = exp_score
                denom += exp_score
            frac = num / denom
            for w in words:
                self.gradient_theta[curr_aspect][w] += 1.0 - frac
                self.gradient_phi[curr_aspect][curr_aspect_rating][w] += 1.0 - frac

    def _compute_log_likelihood(self):
        likelihood = 0.0
        for i, j, s, words, curr_aspect in self:
            denom = 0.0
            for k in ASPECTS:
                exp_score = 0.0
                for w in words:
                    exp_score += self.theta[k][w] + self.phi[k][self.ratings[i][k]][w]
                if k == curr_aspect:
                    likelihood += exp_score
                exp_score = np.exp(exp_score)
                denom += exp_score
            likelihood -= np.log(denom)
        return likelihood
    def train(self, learning_rate=None, iterations=10, gradient_ascent_iterations=5):
        overall_start = time.time()
        if not learning_rate:
            learning_rate = float(self.num_rating_possibilites) * 0.01 / float(self.num_reviews)
            print 'Defaulting to learning rate of %s' % learning_rate
        self._init_gradient_dicts()
        likelihood = 0.0
        prev_likelihood = 0.0
        for iter_num in range(iterations):
            main_iter_start = time.time()
            print 'Main iter %s' % iter_num
            print ' Updating assignments...'
            update_assignments_start = time.time()
            changed = self._update_assignments()
            print ' Time: %s s' % (time.time() - update_assignments_start)
            # if the model didn't change, no need to keep going
            if (not changed):
                print ' Assignments did not change; breaking'
                break
            else:
                print ' Assignments changed'
            likelihood = self._compute_log_likelihood()
            if iter_num == 0:
                prev_likelihood = likelihood
                print ' Starting likelihood: %s' % likelihood
            for g_iter_num in range(gradient_ascent_iterations):
                g_iter_start = time.time()
                print ' Gradient ascent iter %s' % g_iter_num
                prev_likelihood = likelihood
                self._compute_gradient()
                # do the actual gradient ascent
                for k in ASPECTS:
                    for w in self.words_set:
                        self.theta[k][w] += learning_rate * self.gradient_theta[k][w]
                        for r in RATINGS:
                            self.phi[k][r][w] += learning_rate * self.gradient_phi[k][r][w]
                likelihood = self._compute_log_likelihood()
                print ' Likelihood: %s' % likelihood
                # undo the last operation and break if likelihood didn't improve
                if not (likelihood > prev_likelihood):
                    print ' Likelihood did not improve; undoing and breaking'
                    for k in ASPECTS:
                        for w in self.words_set:
                            self.theta[k][w] -= learning_rate * self.gradient_theta[k][w]
                            for r in RATINGS:
                                self.phi[k][r][w] -= learning_rate * self.gradient_phi[k][r][w]
                    likelihood = prev_likelihood
                    break
                print ' Time: %s s' % (time.time() - g_iter_start)
            # move the mean sentiment weight of each word into its aspect weight
            pi = {}
            for k in ASPECTS:
                pi[k] = {}
                for w in self.words_set:
                    pi[k][w] = 0.0
                    for r in RATINGS:
                        pi[k][w] += self.phi[k][r][w]
                    pi[k][w] /= len(RATINGS)
            for k in ASPECTS:
                for w in self.words_set:
                    for r in RATINGS:
                        self.phi[k][r][w] -= pi[k][w]
                    self.theta[k][w] += pi[k][w]
            prev_likelihood = likelihood
            print ' Likelihood: %s' % likelihood
            print ' Time: %s s' % (time.time() - main_iter_start)
        print 'Total time: %s s' % (time.time() - overall_start)

    def get_aspect(self, i, j):
        return self.sentences[i][j][2]

    def set_aspect(self, i, j, aspect):
        self.sentences[i][j][2] = aspect

    def get_words(self, i, j):
        return self.sentences[i][j][1]
small_df = reviews_df.iloc[0:100]
sentence_model = SentenceModel(small_df)
print len(sentence_model.words_set), 'words'
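The assignment step in `_update_assignments` can be illustrated on a toy review. The sketch below builds the same kind of cost matrix (one column per aspect plus "free" columns that let a sentence keep its individually best aspect) but solves it by brute force over all column assignments instead of the Kuhn-Munkres algorithm, so it is only practical for a handful of sentences. The aspect names, the compatibility numbers, and the function `assign_aspects` are made up for illustration.

```python
from itertools import permutations

NUM_EXTRA_NODES = 2  # same value as SentenceModel.NUM_EXTRA_NODES

def assign_aspects(compat, aspects):
    """compat[j][k] is the compatibility of sentence j with aspects[k]."""
    n = len(compat)
    n_cols = n + NUM_EXTRA_NODES
    use_aspect_cols = n >= len(aspects)
    # default: every sentence takes its individually best aspect
    best = [max(range(len(aspects)), key=lambda k: row[k]) for row in compat]
    # cost matrix: negative compatibility for aspect columns,
    # negative best compatibility for the remaining "free" columns
    cost = []
    for j, row in enumerate(compat):
        cost_row = []
        for k in range(n_cols):
            if k < len(aspects) and use_aspect_cols:
                cost_row.append(-row[k])
            else:
                cost_row.append(-row[best[j]])
        cost.append(cost_row)
    # brute force: try every way to give each sentence a distinct column
    best_cols = min(permutations(range(n_cols), n),
                    key=lambda cols: sum(cost[j][c] for j, c in enumerate(cols)))
    result = []
    for j, c in enumerate(best_cols):
        if c < len(aspects) and use_aspect_cols:
            result.append(aspects[c])
        else:
            result.append(aspects[best[j]])
    return result

# Four sentences that all individually prefer 'taste' (made-up scores)
aspects = ['look', 'smell', 'taste', 'feel']
compat = [
    [0.1, 0.2, 3.0, 0.0],
    [0.0, 0.1, 2.0, 0.3],
    [1.9, 0.0, 2.0, 0.1],
    [0.0, 0.0, 1.0, 0.2],
]
labels = assign_aspects(compat, aspects)
```

Here only three sentences can be labeled `taste` at no cost (one via the `taste` column, two via the free columns), so the sentence that loses the least by switching, the third one, is reassigned to `look`.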
In [ ]:
# train the model
sentence_model.train(iterations=15, gradient_ascent_iterations=7)
In [ ]:
# print out the learned words
for k, values in sentence_model.theta.iteritems():
    print k.upper() + ':'
    for word, weight in sorted(values.iteritems(), key=lambda x: x[1], reverse=True)[:10]:
        print ' %s: %s' % (word, weight)
In [ ]: