Beer Recommender

Marcin Kostecki, Lucas Lin, Michael Traver, Michael Wee

Website: https://googledrive.com/host/0BxV_WlGqTmvrWXpzaUJESnRSTEk/

Overview and Motivation

Online reviews are a very valuable source of information for tasks such as personalization, recommendation, and sentiment analysis. We wanted to make effective use of these reviews and build a robust system that can give a good recommendation to a user no matter what the structure and sparsity of the data look like. This meant we would use not only item-item similarity scores but also user-user similarity, providing more information for the system to work with and a backup recommendation in edge cases such as a unique item that has no closely related products. To make the system even more robust, we decided to integrate a user-centric textual analysis recommender. In looking for a dataset, we found BeerAdvocate to be a good repository with a large number of reviews and many users and beers to work with. We noticed that BeerAdvocate.com itself has no personal recommendation system, only global lists of ratings. People want to drink beers that they know they'll personally like -- and we will provide that for them.

We were inspired by Homework 4 to implement a good collaborative filtering recommender, but to push it further in terms of features, functionality, and scale. We were inspired to implement the textual analysis engine by our fiery interest in machine learning, math, and sentiment analysis and also by Learning Attitudes and Attributes from Multi-Aspect Reviews by Julian McAuley, Jure Leskovec, and Dan Jurafsky at Stanford.

Initial Questions

We are trying to create a very robust recommendation system that will always come up with a good recommendation. How can we take the item-item collaborative filtering system from Homework 4 and beef it up? How can we inject more information into the recommendation engine and make it robust to sparsity in item-item similarities? How can we use anti-correlations and leverage items that are similar to items we know the user dislikes? Should we predict individual aspects and combine them, or just the overall score? How do we combine the aspects we predict into an overall score that is personalized per user? What is the best way to do sentiment and textual analysis on the text of a review and relate it to individual user preferences and numerical ratings? How do we scale everything we discussed above to millions of reviews, hundreds of millions of similarity relationships, and gigabytes of data? How can we combine user-user, item-item, and textual analysis predictions into a single prediction?

Data

We scraped data from beeradvocate.com. Our scraping code is included with this notebook.

Collaborative Filtering Model

We based our collaborative filtering model on the Homework 4 model, but added several key modifications:

  • Modification 1: We transformed the Pearson correlation coefficient to [0,1] with the absolute value function, but kept track of which correlations were originally negative (see the sketch after this list)
    • This avoids the problem of the denominator getting too close to 0, which sometimes produces unstable results
    • It allows us to take advantage of both similar and dissimilar beers in filtering, predicting, and ranking (for dissimilar items, multiply the deviation term by -1)
  • Modification 2: We use user-user similarities in our model as well
    • If there are no similar beers, or the similarity is low, we can rely on similar users!
    • We take the k highest similarities/dissimilarities overall and keep track of the type (i.e. user vs. item, similar vs. dissimilar)
  • Modification 3: We predict several aspects such as look, feel, and taste and use them to predict the overall score with different weights customized to each user!
    • This makes the recommendation engine more robust and less likely to produce a poor recommendation, because an error in one aspect can be averaged out by the other aspects
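
Below is a minimal sketch of how Modification 1 plays out in the rating formula: neighbors are ranked by the absolute value of their (shrunken) similarity, but the sign is kept so that dissimilar neighbors contribute a flipped deviation term. The predict_deviation helper and its inputs are illustrative only; the real implementation is predict_aspect_rating later in this notebook.


In [ ]:
def predict_deviation(neighbors, baseline, k=7):
    """neighbors: list of (signed_similarity, neighbor_rating, neighbor_baseline) tuples."""
    num, denom = 0.0, 0.0
    # keep the k neighbors with the largest |similarity|, so strong dissimilarities count too
    for sim, rating, neighbor_baseline in sorted(neighbors, key=lambda x: abs(x[0]), reverse=True)[:k]:
        num += sim * (rating - neighbor_baseline)  # a negative sim flips the deviation term
        denom += abs(sim)                          # absolute values keep the denominator away from 0
    return baseline + (num / denom if denom else 0.0)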

Textual Analysis - Aspects

We also created a textual analysis engine to provide another layer of information to leverage. We analyze each sentence in a textual review in order to identify which aspect the sentence talks about. This allows us to measure the sentiment of specific words in relation to an aspect rating. We could also use this approach to analyze sentences about more advanced aspects that aren't captured by the Look, Taste, Feel, and Smell categories.

We achieved this by modeling two parameter vectors that respectively encode words that discuss an aspect and words that discuss the associated sentiment. We can use these parameter vectors to calculate the probability that an individual sentence discusses a particular aspect given the ratings, as well as the corresponding probabilities for an entire review and for the entire corpus. We fit the two parameter vectors and the latent aspect assignments by maximizing the log-likelihood of the corpus. We optimize this by coordinate ascent: we alternate between choosing the most likely latent aspect assignments (by calculating probabilities) and updating the parameter vectors through gradient ascent.
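
As a rough illustration (a minimal sketch using the theta/phi notation above, with the review's per-aspect ratings assumed known; the actual implementation is SentenceModel.get_sentence_prob later in this notebook): the probability that a sentence discusses aspect k is a softmax over per-aspect scores, where each score sums the aspect weights (theta) and the sentiment weights (phi, indexed by the review's rating for that aspect) over the words of the sentence.


In [ ]:
import numpy as np

def sentence_aspect_prob(words, aspect, aspects, theta, phi, review_ratings):
    """Softmax probability that a sentence (a set of words) discusses the given aspect."""
    scores = {}
    for k in aspects:
        # score = sum of aspect-word and sentiment-word weights over the sentence
        scores[k] = sum(theta[k][w] + phi[k][review_ratings[k]][w] for w in words)
    z = sum(np.exp(s) for s in scores.values())
    return np.exp(scores[aspect]) / z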

Textual Analysis - Sentiment

We predict how a user would rate a particular beer by analyzing the text of the reviews the user has posted, as well as the textual reviews of that beer posted by other users. Specifically, we look at the words other users used to review the beer and consider how the user associates those words with ratings in their own reviews.

In more detail, from the aspect parameter vectors we modeled, we can get the probability that the particular user associates a word with a particular aspect and rating. For every aspect, we look at every review of the particular beer and find the sentence that is most likely to be about that aspect. For each word in that sentence, we take the rating the word most likely implies for the aspect (based on the user's own reviews) and how strongly the user associates the word with the aspect. To predict how the user would rate that aspect of the beer, we average the implied ratings, weighted by those association strengths, over the words of each sentence and then over all of the beer's reviews.
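
As a toy illustration of that final weighted average (all numbers and variable names below are made up, not taken from the data): each word in the aspect's most likely sentence contributes its most likely implied rating, weighted by how strongly the user associates the word with the aspect.


In [ ]:
# toy numbers for one aspect of one review
word_weights = {'hoppy': 0.8, 'bitter': 0.3}    # theta-style association strengths for the user
implied_rating = {'hoppy': 4.5, 'bitter': 3.0}  # argmax over phi for each word

num = sum(word_weights[w] * implied_rating[w] for w in word_weights)
denom = sum(word_weights.values())
predicted_aspect_rating = num / denom           # (0.8*4.5 + 0.3*3.0) / 1.1 ~= 4.09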


In [ ]:
%matplotlib inline

from local_config import REVIEWS_FILE_PATH, BEERS_FILE_PATH, BEER_SIM_DB_FILE_PATH, USER_SIM_DB_FILE_PATH

import matplotlib.pyplot as plt
from multiprocessing import Process, Queue, Pool
from munkres import Munkres
import numpy as np
import nltk
import pandas as pd
import pickle
import re
import random
import scipy as sp
from scipy import stats
from scipy.stats.stats import pearsonr
import sqlite3
import time

from matplotlib import rcParams
import matplotlib.cm as cm
import matplotlib as mpl

# colorbrewer2 Dark2 qualitative color table
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 400
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
# rcParams['font.family'] = 'StixGeneral'

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chart junk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    # turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    # now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
        
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

def autolabel(rects, height_offset, fontsize):
    """Label rects with their height"""
    for rect in rects:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2.0,
                 height + height_offset,
                 '%d' % int(height),
                 ha='center',
                 va='bottom',
                 rotation='vertical',
                 fontsize=fontsize)

In [ ]:
####################
# Define constants #
####################

DEFAULT_K = 7
DEFAULT_REG = 3.0
DEFAULT_ASPECT_REG_PARAM = 3.0

ASPECTS = ['look', 'smell', 'taste', 'feel', 'overall']
ASPECTS_MINUS_OVERALL = ['look', 'smell', 'taste', 'feel']
RATINGS = [1.0 + 0.25 * x for x in range(17)] # 1-5, in steps of 0.25

WORD_SPLIT_REGEX = re.compile(r"[\w']+")
SENTENCE_TOKENIZER = nltk.data.load('tokenizers/punkt/english.pickle')

EXCLUDED_WORDS = set()
with open('excluded_words.txt', 'r') as f:
    for line in f:
        EXCLUDED_WORDS.add(line.strip().lower())

BEER_DB_CURSOR = sqlite3.connect(BEER_SIM_DB_FILE_PATH).cursor()
USER_DB_CURSOR = sqlite3.connect(USER_SIM_DB_FILE_PATH).cursor()

Data Exploration

Our data is split across two dataframes -- one for beer reviews and one for beer information (e.g. name, brewery, etc.) -- to minimize memory usage, so we define some functions that convert between IDs and names.

Also, because our recommender depends on aspect ratings instead of a single overall rating, we filter out reviews that don't have aspect ratings. Below we've plotted some attributes of the dataset to get a sense of what it contains.


In [ ]:
# Read in reviews and beers data
reviews_df_raw = pd.read_csv(REVIEWS_FILE_PATH)
beer_df = pd.read_csv(BEERS_FILE_PATH)

In [ ]:
# filter out reviews with no username and without a text review;
# this restricts the dataset to only those reviews with aspect subratings
reviews_df = reviews_df_raw[pd.notnull(reviews_df_raw['username'])]
reviews_df = reviews_df[pd.notnull(reviews_df['text'])]


#
# The code below is used to create a smaller dataset for testing purposes
#

# filter out users with a small number of reviews
# filtered_usernames = []
# username_groups = reviews_df.groupby('username')
# for username, reviews in username_groups:
#     if len(reviews) > 100:
#         filtered_usernames.append(username)
# reviews_df = reviews_df[reviews_df['username'].isin(filtered_usernames)]

# filter out beers with a small number of reviews
# filtered_beer_ids = []
# beer_groups = reviews_df.groupby('beer_id')
# for beer_id, reviews in beer_groups:
#     if len(reviews) > 500:
#         filtered_beer_ids.append(beer_id)
# reviews_df = reviews_df[reviews_df['beer_id'].isin(filtered_beer_ids)]

We define utility functions that map between the dataset's identifier types (beer IDs, beer names, brewery IDs, and brewery names), with the goal of readable and reusable code throughout the project.


In [ ]:
####################################
# Data lookup/conversion functions #
####################################

"""
Beers
"""
def beer_id_to_name(beer_id, beer_df):
    return beer_df[beer_df['beer_id'] == beer_id].iloc[0]['beer_name']

def beer_id_to_brewery_id(beer_id, beer_df):
    return beer_df[beer_df['beer_id'] == beer_id].iloc[0]['brewery_id']

def beer_name_to_id(beer_name, beer_df):
    return beer_df[beer_df['beer_name'] == beer_name].iloc[0]['beer_id']

"""
Breweries
"""
def brewery_id_to_name(brewery_id, beer_df):
    return beer_df[beer_df['brewery_id'] == brewery_id].iloc[0]['brewery_name']

def brewery_name_to_id(brewery_name, beer_df):
    return beer_df[beer_df['brewery_name'] == brewery_name].iloc[0]['brewery_id']

"""
Cross-category
"""
def beer_id_to_brewery_name(beer_id, beer_df):
    brewery_id = beer_id_to_brewery_id(beer_id, beer_df)
    return brewery_id_to_name(brewery_id, beer_df)

We explore the data to get a sense of what the dataset is like. We look at high level statistics such as mean and median about the number of reviews per beer, user, and brewery. We also look at the number of reviews, users, beers, and breweries in the full dataset.


In [ ]:
####################
# Data exploration #
####################

def get_reviews_by(df, column_name):
    num_reviews_by_item = {}
    for value, indices in df.groupby(column_name).groups.iteritems():
        num_reviews_by_item[value] = len(indices)
    return num_reviews_by_item

def display_stats_by(df, column_name, description, max_val_lookup_df, max_val_lookup_key):
    num_reviews_by_item = get_reviews_by(df, column_name)
    num_reviews = np.array(num_reviews_by_item.values())
    
    # get the max number of reviews, and look up the human-readable name if required
    max_name, max_num = max(num_reviews_by_item.iteritems(), key=lambda x: x[1])
    if max_val_lookup_df:
        max_name = max_val_lookup_df[max_val_lookup_df[column_name] == max_name].iloc[0][max_val_lookup_key]
    
    print 'Reviews per %s stats:' % description.lower()
    print '    Mean:   %s' % num_reviews.mean()
    print '    Median: %s' % int(np.median(num_reviews))
    print '    Mode:   %s (%s occurrences)' % tuple([int(x[0]) for x in stats.mode(num_reviews)])
    print '    Min:    %s' % num_reviews.min()
    print "    Max:    %s (%s)" % (max_num, max_name)

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.hist(num_reviews, bins=45, log=True)
    ax.set_title('Number of Reviews per %s' % description.title())
    ax.set_ylabel('Occurrences')
    ax.set_xlabel('Number of Reviews')

def display_stats(df):
    total_num_reviews = len(df)
    num_text_reviews = len(df[pd.notnull(df['text'])])
    num_nontext_reviews = len(df[pd.isnull(df['text'])])

    print 'Total reviews:    %s' % total_num_reviews
    print 'Text reviews:     %s (%s%%)' % (num_text_reviews, float(num_text_reviews) / total_num_reviews * 100.0)
    print 'Non-text reviews: %s (%s%%)' % (num_nontext_reviews, float(num_nontext_reviews) / total_num_reviews * 100.0)

    print
    print 'Number of users:     %s' % len(df['username'].unique())
    print 'Number of beers:     %s' % len(df['beer_id'].unique())
    print 'Number of breweries: %s' % len(df['brewery_id'].unique())

    print
    display_stats_by(df, 'username', 'User', None, None)
    print
    display_stats_by(df, 'beer_id', 'Beer', beer_df, 'beer_name')
    print
    display_stats_by(df, 'brewery_id', 'Brewery', beer_df, 'brewery_name')

print '-------FULL DATASET-------'
print 'Number of reviews:   %s' % len(reviews_df_raw)
print 'Number of users:     %s' % len(reviews_df_raw['username'].unique())
print 'Number of beers:     %s' % len(reviews_df_raw['beer_id'].unique())
print 'Number of breweries: %s' % len(reviews_df_raw['brewery_id'].unique())

print '\n'
print '-------FILTERED DATASET-------'
display_stats(reviews_df)

#
# Plot number of beers by style
#

# filter out aliases, and also beers with no ratings or reviews
reviewed_beers = beer_df[pd.isnull(beer_df['alias_id']) & (beer_df['num_ratings'] > 0)]

# get the number of reviews for each style, and sort them
num_reviews_by_style = get_reviews_by(reviewed_beers, 'style')
sorted_num = sorted([(k, v) for k, v in num_reviews_by_style.iteritems()], key=lambda x: x[1], reverse=True)

# construct x (evenly spaced coords), y (num reviews), and label (style name) arrays for plotting
num_bars = len(sorted_num)
x = np.array(range(num_bars))
y = [count for _, count in reversed(sorted_num)]
labels = [unicode(style, 'utf_8') for style, _ in reversed(sorted_num)]

# plot review count by style
fig = plt.figure(figsize=(20, 15))
fig.subplots_adjust(bottom=0.16)
ax = fig.add_subplot(1, 1, 1)
rects = ax.bar(x, y, align='center', width=.6)
ax.set_xlim([-1, num_bars])
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation='vertical', fontsize=8)
ax.set_xlabel('Style')
ax.set_ylabel('Beers')
ax.set_title('Number of Beers by Style')

# put count above each bar
autolabel(rects, 40, 7)

We see that, as expected, the majority of users and beers have a small number of reviews. Also, the most reviewed beer is Dogfish Head's 90 Minute IPA! Pretty cool!

Building Our Recommender

To predict the overall score a user will give a beer he hasn't tried before, our recommender first predicts what the user would rate the beer on aspects such as look, taste, feel, and smell. Different users value different things about a beer -- for example, one user might put more emphasis on feel than on look, while another might be exactly the opposite. Our model must account for these differences.

In our data exploration, we found that the Pearson correlation coefficient between each aspect rating and the overall rating was a good measure of how much a user emphasizes that aspect. We therefore use these correlations to set the per-user aspect weights in our model.


In [ ]:
def normalize(a):
    s = float(sum(a))
    if s == 0.0:
        return a
    
    return [0.0 if x == 0.0 else float(x) / s for x in a]

# cache aspect weights so we don't continually re-compute them for the same users
CACHED_ASPECT_WEIGHTS = {}

def get_aspect_weights(username, df, reg=DEFAULT_ASPECT_REG_PARAM):
    if username in CACHED_ASPECT_WEIGHTS:
        return CACHED_ASPECT_WEIGHTS[username]
    
    user_reviews = df[df['username'] == username]
    overall_ratings = user_reviews['overall']
    
    weights = [shrunk_sim(pearsonr(user_reviews[aspect], overall_ratings)[0], float(len(user_reviews)), reg) for aspect in ASPECTS_MINUS_OVERALL]
    CACHED_ASPECT_WEIGHTS[username] = normalize(weights)
    
    return CACHED_ASPECT_WEIGHTS[username]

In [ ]:
#
# Compute aspect weights for a subset of users and use them to reconstruct overall ratings in order to check model validity
#

# choose a random set of usernames for which to compute aspect weights
random_usernames = random.sample(reviews_df['username'].unique(), 75)

correlations = []
for username in random_usernames:    
    reviews = reviews_df[reviews_df['username'] == username]
    
    look, smell, taste, feel = get_aspect_weights(username, reviews_df)
    
    predicted = np.array(reviews['look']) * look \
        + np.array(reviews['smell']) * smell \
        + np.array(reviews['taste']) * taste \
        + np.array(reviews['feel']) * feel
    actual = np.array(reviews['overall'])
    
    corr = pearsonr(predicted, actual)[0]
    if pd.notnull(corr):
        correlations.append(corr)

plt.hist(correlations, bins=40)
plt.title('Correlation Between Actual and Reconstructed Ratings')
plt.xlabel('correlation')
plt.ylabel('frequency')

Because of the huge scale of data we are working with, it was vital to optimize the runtime of our code as much as possible. One optimization opportunity we found was to cache the beer and user averages so they are available whenever we look them up throughout the recommender. We also define utility functions for getting subsets of data from a dataframe.


In [ ]:
############################
# Data filtering functions #
############################

# used to cache user and beer averages so we don't have to re-compute them
USER_AVERAGES = {}
BEER_AVERAGES = {}

def get_user_averages(df, rating_col_name):
    return dict(df.groupby('username')[rating_col_name].mean())

def get_single_user_average(df, username, aspect):
    if username in USER_AVERAGES:
        if aspect in USER_AVERAGES[username]:
            return USER_AVERAGES[username][aspect]
    else:
        USER_AVERAGES[username] = {}
    
    USER_AVERAGES[username][aspect] = df[df.username == username][aspect].mean()
    return USER_AVERAGES[username][aspect]

def get_beer_averages(df, rating_col_name):
    return dict(df.groupby('beer_id')[rating_col_name].mean())

def get_single_beer_average(df, beer_id, aspect):
    if beer_id in BEER_AVERAGES:
        if aspect in BEER_AVERAGES[beer_id]:
            return BEER_AVERAGES[beer_id][aspect]
    else:
        BEER_AVERAGES[beer_id] = {}
    
    BEER_AVERAGES[beer_id][aspect] = df[df.beer_id == beer_id][aspect].mean()
    return BEER_AVERAGES[beer_id][aspect]
    

def get_user_reviewed(username, df):
    return set(df[df['username'] == username]['beer_id'])

def get_beer_reviewers(beer_id, df):
    return set(df[df['beer_id'] == beer_id]['username'])
    
def get_user_top_rated(username, rating_col_name, df, numchoices=5):
    "Return the sorted top numchoices beers for a user by the given rating column name."
    return df[df['username'] == username][['beer_id', rating_col_name]].sort([rating_col_name], ascending=False).head(numchoices)

We define utility functions for computing common support sets (the reviewers two beers share, or the beers two users have both reviewed) and for extracting the corresponding reviews.


In [ ]:
#########################################################
# Define functions used for common support calculations #
#########################################################

def get_common_reviewers(beer_id_1, beer_id_2, df):
    beer_1_reviewers = df[df['beer_id'] == beer_id_1]['username'].unique()
    beer_2_reviewers = df[df['beer_id'] == beer_id_2]['username'].unique()
    return set(beer_1_reviewers).intersection(beer_2_reviewers)

def get_common_reviewed(username_1, username_2, df):
    user_1_reviewed = df[df['username'] == username_1]['beer_id'].unique()
    user_2_reviewed = df[df['username'] == username_2]['beer_id'].unique()
    return set(user_1_reviewed).intersection(user_2_reviewed)

def get_common_support(df):
    beers = df.beer_id.unique()
    print len(beers)
    supports = []
    for i, beer_id_1 in enumerate(beers):
        for j, beer_id_2 in enumerate(beers):
            if  i < j:
                common_reviewers = get_common_reviewers(beer_id_1, beer_id_2, df)
                supports.append(len(common_reviewers))
    return supports

def get_reviews_for_beer_and_users(beer_id, user_set, df):
    """Given a beer ID and a set of usernames, return the sub-dataframe of the users' reviews of the beer."""
    mask = (df['username'].isin(user_set)) & (df['beer_id'] == beer_id)
    reviews = df[mask]
    return reviews[reviews['username'].duplicated() == False]

def get_reviews_for_user_and_beers(username, beer_set, df):
    """Given a username and a set of beer IDs, return the sub-dataframe of the user's reviews of the beers."""
    mask = (df['beer_id'].isin(beer_set)) & (df['username'] == username)
    reviews = df[mask]
    return reviews[reviews['beer_id'].duplicated() == False]

We define functions for computing and shrinking similarity scores.


In [ ]:
####################################
# Similarity calculation functions #
####################################

def pearson_sim(reviews_df_1, reviews_df_2, averages, num_common, rating_col_name, avg_lookup_col_name):
    """
    Given 2 subframes of reviews, return the Pearson correlation coefficient between the reviews.

    * Also subtract an average rating value from the reviews before calculating correlation.
    * If there is no common support between the review sets, return 0.
    * If variances are 0, NaN may be returned.
    """
    if num_common == 0:
        return 0.0
    
    diff1 = reviews_df_1.apply(lambda x: x[rating_col_name] - averages[x[avg_lookup_col_name]], axis=1)
    diff2 = reviews_df_2.apply(lambda x: x[rating_col_name] - averages[x[avg_lookup_col_name]], axis=1)
    return pearsonr(diff1, diff2)[0]

def calculate_beer_similarity(beer_id_1, beer_id_2, user_averages, similarity_func, rating_col_name, df):
    common_reviewers = get_common_reviewers(beer_id_1, beer_id_2, df)
    beer_1_reviews = get_reviews_for_beer_and_users(beer_id_1, common_reviewers, df)
    beer_2_reviews = get_reviews_for_beer_and_users(beer_id_2, common_reviewers, df)
    sim = similarity_func(beer_1_reviews, beer_2_reviews, user_averages, len(common_reviewers), rating_col_name, 'username')
    return (0.0 if np.isnan(sim) else sim, len(common_reviewers))

def calculate_user_similarity(username_1, username_2, beer_averages, similarity_func, rating_col_name, df):
    common_beers = get_common_reviewed(username_1, username_2, df)
    user_1_reviews = get_reviews_for_user_and_beers(username_1, common_beers, df)
    user_2_reviews = get_reviews_for_user_and_beers(username_2, common_beers, df)
    sim = similarity_func(user_1_reviews, user_2_reviews, beer_averages, len(common_beers), rating_col_name, 'beer_id')
    return (0.0 if np.isnan(sim) else sim, len(common_beers))

def shrunk_sim(sim, n_common, reg=DEFAULT_REG):
    "Shrink the similarity with the regularizer."
    return (n_common * sim) / (n_common + reg)

We generalize the k-nearest function so that it can take either beers or users (as an object_id and a search_set), since we compute both item-item and user-user similarities.


In [ ]:
##################################################
# Functions for k-nearest neighbors calculations #
##################################################

def k_nearest(object_id, search_set, aspect, db, k=DEFAULT_K, reg=DEFAULT_REG):
    similar = []
    for current_object_id in search_set:
        if current_object_id != object_id:
            sim, support = db.get(object_id, current_object_id, aspect)
            similar.append((current_object_id, shrunk_sim(sim, support, reg=reg), support))
    similar.sort(key=lambda x: x[1], reverse=True)
    return similar[:k]

In [ ]:
############################
# Recommendation functions #
############################

def get_top_recos_for_user(username, rating_col_name, df, db, n, k=DEFAULT_K, reg=DEFAULT_REG):
    # we'll get similar beers from all those in the dataset
    unique_beer_ids = df['beer_id'].unique()

    neighbors = set()
    
    # for each of the user's top-rated beers...
    for i, top_beer_id in get_user_top_rated(username, rating_col_name, df, numchoices=n)['beer_id'].iteritems():
        # ...get similar beers
        for near_beer_id, _, _ in k_nearest(top_beer_id, unique_beer_ids, rating_col_name, db, k=k, reg=reg):
            neighbors.add(near_beer_id)

    # only use beers that the user has not reviewed
    neighbors = neighbors - get_user_reviewed(username, df)
    
    result = [(beer_id, get_single_beer_average(df, beer_id, rating_col_name)) for beer_id in neighbors]
    return sorted(result, key=lambda x: x[1], reverse=True)[:n]

Things we did below:

  • sort by abs of shrunken sim
  • use regular, non-abs shrunken sim for num, but abs sim for denom
  • this is so that the denom doesn't get really small
  • this allows us to leverage the expressive power of both similar beers and dissimilar beers

Our first attempt at predicting a user's overall rating of a new beer is based on the restaurant recommender developed in Homework 4, with several key modifications designed to improve accuracy and precision.

The first major modification was possible due to the nature of our data: in a review, users provide not only an overall rating for the beer, but also a rating for four aspects, namely taste, smell, look, and feel. Rather than trying to predict the overall rating directly, we decided to add an intuitive and defensible (and hence likely useful) layer of complexity to our model by first predicting a rating for each of the four aspects, and then combining them into an overall rating using weights based on the correlations between each of the aspects and the overall rating for the given user.

The next modification was to incorporate user-user similarity information in addition to the information gleaned from item-item similarities. Intuitively, this helps our rating predictor perform better when the user decides to try a unique beer that isn't really similar to any other beers: we can look at the reaction of similar users to the beer and predict the rating based on that information.

The final modification was to look for not just similar things, but also dissimilar ones, which arguably are just as useful for predicting a rating. For example, if the user really liked a beer that is the exact opposite of the given beer, he probably will not like the given beer very much. In other words, only things that are uncorrelated with other things (i.e. neither similar nor dissimilar) give no useful information to the rating predictor.

Our final implementation hence goes as follows. First, calculate all beer-beer similarities and user-user similarities. Sort these by the absolute value of the similarity and pick the k nearest things (beers or users). Incorporate the information from these similarities into the rating with the proper direction (i.e. positive for similar, negative for dissimilar). Do this separately for each aspect. Finally, combine the aspect ratings according to user-specific weights to get the overall rating.

For predict_aspect_rating_from_text, i.e. the rating predictor based on text (and hence more of a bottom-up approach):

  • objective:
    • given (user, beer, aspect) return predicted rating using text analysis
  • procedure:
    1. train text analysis model using all of user's reviews, which gives:
      • theta(aspect, word): increases ==> the probability that a sentence written by user and containing word was used to describe aspect increases ==> the probability that user writes word to describe aspect increases
      • phi(aspect, rating, word): increases ==> the probability that a sentence written by user, corresponding to rating and containing word was used to describe aspect increases ==> the probability that user writes word to describe aspect with rating increases
    2. loop through all of beer's reviews
      1. loop through all sentences in review
        • calculate the probability that the user would have written this sentence to describe aspect (expression for this probability is given in the paper)
      2. identify the sentence that maximizes this probability, call it sentence_aspect
      3. loop through all words in sentence_aspect
        1. loop through all possible values of rating
          • calculate phi(aspect, rating, word)
        2. identify the value of rating that maximizes phi(aspect, rating, word), call it rating_aspect(word)
      4. calculate sum over words in sentence_aspect of theta(aspect, word) * rating_aspect(word), then divide by sum over words in sentence_aspect of theta(aspect, word) (i.e. calculate average over words in sentence_aspect of rating_aspect(word), weighted by proxy for probability that user writes word to describe aspect), call this rating_aspect_review
    3. return average over review in beer's reviews of rating_aspect_review

In [ ]:
############################
# Aspect rating prediction #
############################

# pre-compute global averages
GLOBAL_AVG = {}
for aspect in ASPECTS:
    GLOBAL_AVG[aspect] = reviews_df[aspect].mean()

def baseline(global_avg, user_avg, beer_avg):
    return global_avg + (user_avg - global_avg) + (beer_avg - global_avg)

def predict_aspect_rating(beer_id, username, aspect, beer_db, user_db, df, k=DEFAULT_K, reg=DEFAULT_REG):
    BEER = 0
    USER = 1
    
    beer_avg = get_single_beer_average(df, beer_id, aspect)
    user_avg = get_single_user_average(df, username, aspect)
    
    nearest_beers = k_nearest(beer_id, get_user_reviewed(username, df), aspect, beer_db, k=k, reg=reg)
    
    # get k nearest users who have reviewed this beer
    nearest_users = k_nearest(username, df[df['beer_id'] == beer_id]['username'].unique(), aspect, user_db, k=k, reg=reg)
    
    # k_nearest already returns shrunken similarities, so just tag each neighbor with its type
    nearest = []
    for beer, sim, support in nearest_beers:
        nearest.append((BEER, beer, sim, support))
    for user, sim, support in nearest_users:
        nearest.append((USER, user, sim, support))
    
    # sort by the absolute value of the shrunken similarity, largest first
    nearest.sort(key=lambda x: abs(x[2]), reverse=True)
    
    num = 0.0
    denom = 0.0
    for id_type, object_id, sim, support in nearest[:k]:
        if id_type == BEER:
            # get the user's review of the similar beer
            reviews = df[(df['username'] == username) & (df['beer_id'] == object_id)]
            assert(reviews.shape[0] == 1)
            
            # get average for the similar beer
            similar_beer_avg = get_single_beer_average(df, object_id, aspect)
            
            num += sim * (float(reviews.irow(0)[aspect]) - baseline(GLOBAL_AVG[aspect], user_avg, similar_beer_avg))
            denom += abs(sim)
        elif id_type == USER:
            # get the similar user's review of the beer
            reviews = df[(df['username'] == object_id) & (df['beer_id'] == beer_id)]
            assert(reviews.shape[0] == 1)
            
            # get average for the similar user
            similar_user_avg = get_single_user_average(df, object_id, aspect)
            
            num += sim * (float(reviews.irow(0)[aspect]) - baseline(GLOBAL_AVG[aspect], similar_user_avg, beer_avg))
            denom += abs(sim)
    
    if denom != 0:
        return baseline(GLOBAL_AVG[aspect], user_avg, beer_avg) + num / denom
    else:
        return baseline(GLOBAL_AVG[aspect], user_avg, beer_avg)

def predict_overall_rating(beer_id, username, beer_db, user_db, df, k=DEFAULT_K, reg=DEFAULT_REG):
    aspect_ratings = []
    for aspect in ASPECTS_MINUS_OVERALL:
        aspect_ratings.append(
            predict_aspect_rating(beer_id, username, aspect, beer_db, user_db, df, k=k, reg=reg))
    
    user_aspect_weights = get_aspect_weights(username, df)
    
    return np.array(aspect_ratings).dot(user_aspect_weights)
    
def predict_aspect_rating_from_text(beer_id, username, aspect, df):
    sentence_model = SentenceModel(df[df['username'] == username])
    sentence_model.train()
    
    predicted_ratings_sum = 0.0
    count = 0
    
    for _, review in df[df['beer_id'] == beer_id].iterrows():
        # find the sentence in this review that is most likely to be about the aspect
        max_prob = float('-inf')
        max_sentence = None
        for sentence in SentenceModel.get_sentences(review['text']):
            prob = sentence_model.get_sentence_prob_from_words(sentence, aspect)
            if prob > max_prob:
                max_prob = prob
                max_sentence = sentence
        
        if max_sentence is None:
            continue
        
        num = 0.0
        denom = 0.0
        for word in max_sentence:
            if word in sentence_model.words_set:
                # find the rating this word most strongly implies for the aspect
                max_phi = float('-inf')
                max_rating = None
                for r in RATINGS:
                    phi = sentence_model.phi[aspect][r][word]
                    if phi > max_phi:
                        max_phi = phi
                        max_rating = r
                
                # weight the word's implied rating by how strongly the user associates it with the aspect
                num += sentence_model.theta[aspect][word] * max_rating
                denom += sentence_model.theta[aspect][word]
        
        # average over words to get this review's implied aspect rating
        if denom != 0.0:
            predicted_ratings_sum += num / denom
            count += 1
        
    return predicted_ratings_sum / float(count) if count > 0 else None

For the recommenders, we have databases that store the beer-beer and user-user similarities. The similarities were computed at scale with MapReduce on Amazon EC2. However, the scale of the data was very large: there were 242 million entries in the beer-beer similarity database and 75 million entries in the user-user database. The beer-beer database was 16GB, too large to fit into memory, so we couldn't just create a Database object. Instead, we used SQLite to store the similarities, indexed by object ID pair so that lookups are fast.
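
For reference, each similarity table is laid out roughly as sketched below (the exact loading code is not included in this notebook, and the file name 'example_similarities.db' is just an example): one row per sorted object pair, one column per aspect, with an index on the pair so that pair lookups are fast.


In [ ]:
import sqlite3

# illustrative schema only -- the real databases were built from the MapReduce output
conn = sqlite3.connect('example_similarities.db')
conn.execute('''CREATE TABLE IF NOT EXISTS similarities
                (object_id_1 TEXT, object_id_2 TEXT,
                 look REAL, smell REAL, taste REAL, feel REAL, overall REAL)''')
conn.execute('CREATE INDEX IF NOT EXISTS pair_idx ON similarities (object_id_1, object_id_2)')
conn.commit()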


In [ ]:
class Database(object):
    """A class representing a database of similaries and common supports."""
    def __init__(self, df, id_col, average_function):
        self.df = df
        self.average_function = average_function
        
        self.id_col = id_col
        self.opposite_id_col = None
        if id_col == 'beer_id':
            self.opposite_id_col = 'username'
        else:
            self.opposite_id_col = 'beer_id'
        
        self.unique_ids = {v: k for (k, v) in enumerate(df[id_col].unique())}
        keys = self.unique_ids.keys()
        num_keys = len(keys)
        self.similarities = np.zeros([num_keys, num_keys])
        self.supports = np.zeros([num_keys, num_keys], dtype=np.int)
    
    def calculate_similarity(self, id_1, id_2, averages, similarity_func, rating_col_name, df):
        raise NotImplementedError

    def populate_by_calculating(self, similarity_func, rating_col_name):
        averages = self.average_function(self.df, rating_col_name)
        items = self.unique_ids.items()
        
        count = 0
        for id_1, i in items:
            print '%s (i = %s)' % (count, i)
            count += 1
            for id_2, j in items:
                if i < j:
                    sim, nsup = self.calculate_similarity(id_1, id_2, averages, similarity_func, rating_col_name, self.df)
                    self.similarities[i][j] = sim
                    self.similarities[j][i] = sim
                    self.supports[i][j] = nsup
                    self.supports[j][i] = nsup
                elif i == j:
                    nsup = self.df[self.df[self.id_col] == id_1][self.opposite_id_col].count()
                    self.similarities[i][i] = 1.0
                    self.supports[i][i] = nsup

    def get(self, id_1, id_2, aspect=None):
        """Return a (similarity, common_support) tuple for the given IDs.
        
        The aspect argument is accepted (and ignored) so that this class is interchangeable with
        the SQL-backed databases below; each in-memory database holds a single rating column.
        """
        return (
            self.similarities[self.unique_ids[id_1]][self.unique_ids[id_2]],
            self.supports[self.unique_ids[id_1]][self.unique_ids[id_2]]
        )
        
class BeerDatabase(Database):
    def __init__(self, df):
        super(BeerDatabase, self).__init__(df, 'beer_id', get_user_averages)
        self.calculate_similarity = calculate_beer_similarity

class UserDatabase(Database):
    def __init__(self, df):
        super(UserDatabase, self).__init__(df, 'username', get_beer_averages)
        self.calculate_similarity = calculate_user_similarity

        
"""
# The classes below retrieve similarities from sqlite3 databases of beer and user similarities.
"""

class SQLDatabase(object):
    def __init__(self):
        self.cursor = None
    
    def get(self, object_id_1, object_id_2, aspect):
        # we need to sort the object IDs because the database has one row
        # for each object pair, indexed by the *sorted* pair
        object_id_1, object_id_2 = sorted((object_id_1, object_id_2))
        
        # parameterize the ID values so that usernames containing quotes don't break the query
        query = "SELECT %s FROM similarities WHERE object_id_1=? AND object_id_2=?" % aspect
        return self.cursor.execute(query, (object_id_1, object_id_2)).fetchone()[0]

class SQLBeerDatabase(SQLDatabase):
    def __init__(self):
        super(SQLBeerDatabase, self).__init__()
        self.cursor = BEER_DB_CURSOR

class SQLUserDatabase(SQLDatabase):
    def __init__(self):
        super(SQLUserDatabase, self).__init__()
        self.cursor = USER_DB_CURSOR
        
# def get_sim_sql(object_id_1, object_id_2, aspect, cursor):
#     table_name = c.execute("SELECT table_name FROM object_lookup WHERE object_id=?", (object_id_1,)).fetchone()[0]
#     print c.execute("SELECT %s FROM %s WHERE object_id='%s'" % (aspect, table_name, object_id_2)).fetchall()
        
# def get_beer_sim_sql(beer_id_1, beer_id_2, aspect):
#     return get_sim_sql(beer_id_1, beer_id_2, aspect, BEER_DB_CURSOR)
        
# def get_user_sim_sql(username_1, username_2, aspect):
#     return get_sim_sql(username_1, username_2, aspect, USER_DB_CURSOR)

In [ ]:
beer_db = BeerDatabase(reviews_df)
# beer_db.populate_by_calculating(pearson_sim, 'rating')
sql_beer_db = SQLBeerDatabase()

user_db = UserDatabase(reviews_df)
# user_db.populate_by_calculating(pearson_sim, 'rating')
sql_user_db = SQLUserDatabase()

We compute similarities with MapReduce because of the scale of the data, so the code below exports our dataframe to a MapReduce-ready format.


In [ ]:
# get a copy of the raw df, but with null usernames and review text filtered out
full_reviews_df = reviews_df_raw[pd.notnull(reviews_df_raw['username'])]
full_reviews_df = full_reviews_df[pd.notnull(full_reviews_df['text'])]

def convert_to_mr_format_beer(df):
    # get DataFrame-wide user averages
    user_averages = {}
    for aspect in ASPECTS:
        user_averages[aspect] = get_user_averages(df, aspect)
        
    with open('mr_input/mr_input_all_full_beer.txt', 'w') as f:
        for i, review in df.iterrows():
            username = review['username']
            beer_id = review['beer_id']
            
            things = [username, beer_id]
            for aspect in ASPECTS:
                things.append(review[aspect] - user_averages[aspect][username])
            
            f.write(' '.join([str(x) for x in things]) + '\n')

def convert_to_mr_format_user(df):
    # get DataFrame-wide beer averages
    beer_averages = {}
    for aspect in ASPECTS:
        beer_averages[aspect] = get_beer_averages(df, aspect)
        
    with open('mr_input/mr_input_all_full_user.txt', 'w') as f:
        for i, review in df.iterrows():
            username = review['username']
            beer_id = review['beer_id']
            
            things = [username, beer_id]
            for aspect in ASPECTS:
                things.append(review[aspect] - beer_averages[aspect][beer_id])
            
            f.write(' '.join([str(x) for x in things]) + '\n')

# convert_to_mr_format_beer(full_reviews_df)
# convert_to_mr_format_user(full_reviews_df)

In [ ]:
# predict ratings for all reviews
reviews_df_copy = reviews_df.copy(deep=True)
reviews_df_copy['predicted'] = reviews_df.apply(lambda x: predict_overall_rating(x['beer_id'], x['username'], sql_beer_db, sql_user_db, reviews_df), axis=1)

In [ ]:
test_beer_id = 58577
nearest_beers = k_nearest(test_beer_id, reviews_df['beer_id'].unique(), 'overall', beer_db)

print 'Top matches for %s (%s):' % (beer_id_to_name(test_beer_id, beer_df), test_beer_id)
for i, (beer_id, sim, support) in enumerate(nearest_beers):
    print i, beer_id_to_name(beer_id, beer_df), "| Sim", sim, "| Support", support

In [ ]:
test_username = 'Sammy'
nearest_users = k_nearest(test_username, reviews_df['username'].unique(), 'overall', user_db)

print 'Top matches for %s:' % test_username
for i, (username, sim, support) in enumerate(nearest_users):
    print i, username, "| Sim", sim, "| Support", support

Text Analysis

The code below uses the approach from the McAuley, Leskovec, and Jurafsky paper referenced in the overview to learn which words correspond to which aspects. We then use the rating prediction code above to leverage what we've learned to predict ratings based on textual analysis.
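
Note: SentenceModel._update_assignments below calls a parmap helper that is not defined elsewhere in this notebook. The cell below provides a minimal serial stand-in so the code runs as written; a multiprocessing-based map can be swapped in for speed.


In [ ]:
def parmap(f, xs):
    """Serial stand-in for a parallel map: apply f to each element and collect the results."""
    return [f(x) for x in xs]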


In [ ]:
class SentenceModel(object):
    def __init__(self, df):
        self.df = df
        self.sentences = {}
        self.words_set = set()
        self.ratings = {}

        # used for sentence aspect assignment
        self.NUM_EXTRA_NODES = 2
        
        # some useful numbers
        self.num_reviews = 0
        self.num_rating_possibilites = len(ASPECTS) * len(RATINGS)
        
        # initialize sentences and words
        for i, review in df.iterrows():
            self.num_reviews += 1
            
            for s in SENTENCE_TOKENIZER.tokenize(review['text']):
                # Don't use sentences that just describe the serving type
                if s.startswith('Serving type: '):
                    break
                    
                # get all words in the sentence
                words = set([w.lower() for w in WORD_SPLIT_REGEX.findall(s)])
                
                # remove common "function" words that aren't useful for our analysis
                words -= EXCLUDED_WORDS
                
                # add the sentence to the sentence dict
                if i not in self.sentences:
                    self.sentences[i] = []
                self.sentences[i].append(['', words, None])
                
                # add the words to the global word set
                self.words_set.update(words)

            # record this review's ratings
            self.ratings[i] = {}
            for k in ASPECTS:
                self.ratings[i][k] = review[k]
        
        # initialize aspect weights
        self.theta = {}
        for aspect in ASPECTS:
            self.theta[aspect] = {}
            for w in self.words_set:
                self.theta[aspect][w] = random.random() * 0.5
        for aspect in ASPECTS:
            self.theta[aspect][aspect] = 1.0
        
        # initialize sentiment weights
        self.phi = {}
        for aspect in ASPECTS:
            self.phi[aspect] = {}
            for r in RATINGS:
                self.phi[aspect][r] = {}
                for w in self.words_set:
                   self.phi[aspect][r][w] = random.random() * 0.5
        for aspect in ASPECTS:
            self.phi[aspect][3.0][aspect] = 1
        
        # will be used later for gradient ascent
        self.gradient_theta = {}
        self.gradient_phi = {}
        
    def __iter__(self):
        """Allow for iteration through all sentences with one loop."""
        for df_index, sentences in self.sentences.iteritems():
            for sentence_num, sentence_data in enumerate(sentences):
                yield (df_index, sentence_num, sentence_data[0], sentence_data[1], sentence_data[2])

    @staticmethod
    def get_sentences(text):
        sentences = []
        for s in SENTENCE_TOKENIZER.tokenize(text):
            # get all words in the sentence
            words = set([w.lower() for w in WORD_SPLIT_REGEX.findall(s)])
            
            # remove common "function" words
            words -= EXCLUDED_WORDS

            sentences.append(words)

        return sentences

    def save_model(self, filename):
        with open(filename, 'w') as f:
            pickle.dump([self.words_set, self.theta, self.phi], f)

    def load_model(self, filename):
        with open(filename, 'r') as f:
            data = pickle.load(f)
            self.words_set = data[0]
            self.theta = data[1]
            self.phi = data[2]

    def get_sentence_prob(self, i, j, aspect):
        words = self.get_words(i, j)
        
        z = 0.0
        this_aspect_sum = None
        for k in ASPECTS:
            weight_sum = 0.0
            for w in words:
                weight_sum += self.theta[k][w] + self.phi[k][self.ratings[i][k]][w]
            
            if k == aspect:
                this_aspect_sum = weight_sum
                
            z += np.exp(weight_sum)
                
        return np.exp(this_aspect_sum) / z

    def get_sentence_prob_from_words(self, words, aspect):
        # Unlike get_sentence_prob, this is called on sentences from reviews that are not part of
        # this model, so there is no rating to index phi with; we score using theta only.
        z = 0.0
        this_aspect_sum = None
        for k in ASPECTS:
            weight_sum = 0.0
            for w in words:
                if w in self.words_set:
                    weight_sum += self.theta[k][w]
            
            if k == aspect:
                this_aspect_sum = weight_sum
                
            z += np.exp(weight_sum)
                
        return np.exp(this_aspect_sum) / z
    
    def get_sentence_compatability(self, i, j, aspect, words=None):
        if not words:
            words = self.get_words(i, j)
            
        weight_sum = 0.0
        for w in words:
            weight_sum += self.theta[aspect][w] + self.phi[aspect][self.ratings[i][aspect]][w]
        return weight_sum

    def _update_assignments(self):        
        def worker(i):
            sentences = self.sentences[i]
            
            changed = False

            num_sentences = len(sentences)
            
            best_aspects = [None for _ in range(num_sentences)]
            matrix = np.zeros((num_sentences, num_sentences + self.NUM_EXTRA_NODES))
            
            for j, sentence_data in enumerate(sentences):
                # assign an aspect to the current sentence based on its compatability score with each aspect
                max_compatability = float('-inf')
                for k in ASPECTS:
                    compatability = self.get_sentence_compatability(i, j, k, words=sentence_data[1])
                    if compatability > max_compatability:
                        max_compatability = compatability
                        best_aspects[j] = k
            
                # fill in the matrix
                for k in range(num_sentences + self.NUM_EXTRA_NODES):
                    if k < len(ASPECTS) and num_sentences >= len(ASPECTS):
                        matrix[j][k] = -self.get_sentence_compatability(i, j, ASPECTS[k])
                    else:
                        matrix[j][k] = -max_compatability
            
            # update sentence aspect assignments based on the results of the Kuhn-Munkres algorithm
            m = Munkres()
            for row, col in m.compute(matrix):
                if col < len(ASPECTS) and num_sentences >= len(ASPECTS):
                    best_aspects[row] = ASPECTS[col]
                 
                    # print '(%d, %d) -> %d, %s' % (row, col, matrix[row][col], ASPECTS[col])
                # else:
                   # print '(%d, %d) -> %d' % (row, col, matrix[row][col])
                
                if self.get_aspect(i, row) != best_aspects[row]:
                    changed = True
                self.set_aspect(i, row, best_aspects[row])

            return changed

        # the single-loop version of this computation is "embarrassingly parallel," so we parallelize it
        return any(parmap(worker, self.sentences.keys()))
    
    def _init_gradient_dicts(self):
        self.gradient_theta = {}
        for aspect in ASPECTS:
            self.gradient_theta[aspect] = {}
            for w in self.words_set:
                self.gradient_theta[aspect][w] = 0.0
        
        self.gradient_phi = {}
        for aspect in ASPECTS:
            self.gradient_phi[aspect] = {}
            for r in RATINGS:
                self.gradient_phi[aspect][r] = {}
                for w in self.words_set:
                   self.gradient_phi[aspect][r][w] = 0
    
    def _compute_gradient(self):
        for k in ASPECTS:
            for w in self.words_set:
                self.gradient_theta[k][w] = -1.0 * float(self.num_rating_possibilites) * self.theta[k][w]
                
                for r in RATINGS:
                    self.gradient_phi[k][r][w] = -1.0 * float(self.num_rating_possibilites) * self.phi[k][r][w] 
    
        for i, j, s, words, curr_aspect in self:
            if not curr_aspect:
                continue

            curr_aspect_rating = self.ratings[i][curr_aspect]
            
            num = 0.0
            denom = 0.0
            for k in ASPECTS:
                exp_score = 0.0
                for w in words:
                    exp_score += self.theta[k][w] + self.phi[k][self.ratings[i][k]][w]
                exp_score = np.exp(exp_score)
                
                if k == curr_aspect:
                    num = exp_score
                
                denom += exp_score
            
            frac = num / denom
            for w in words:
                self.gradient_theta[curr_aspect][w] += 1.0 - frac
                self.gradient_phi[curr_aspect][curr_aspect_rating][w] += 1.0 - frac
    
    def _compute_log_likelihood(self):
        likelihood = 0.0
        
        for i, j, s, words, curr_aspect in self:
            denom = 0.0
            for k in ASPECTS:
                exp_score = 0.0
                for w in words:
                    exp_score += self.theta[k][w] + self.phi[k][self.ratings[i][k]][w]
                
                if k == curr_aspect:
                    likelihood += exp_score
                exp_score = np.exp(exp_score)
                denom += exp_score
            likelihood -= np.log(denom)
        
        return likelihood
    
    def train(self, learning_rate=None, iterations=10, gradient_ascent_iterations=5):
        overall_start = time.time()

        if not learning_rate:
            learning_rate = float(self.num_rating_possibilites) * 0.01 / float(self.num_reviews)
            print 'Defaulting to learning rate of %s' % learning_rate
        
        self._init_gradient_dicts()
        
        likelihood = 0.0
        prev_likelihood = 0.0
        
        for iter_num in range(iterations):
            main_iter_start = time.time()
            
            print 'Main iter %s' % iter_num
            print '    Updating assignments...'
            update_assignments_start = time.time()
            changed = self._update_assignments()
            print '        Time: %s s' % (time.time() - update_assignments_start)
            
            # if the model didn't change, no need to keep going
            if (not changed):
                print '    Assignments did not change; breaking'
                break
            else:
                print '    Assignments changed'
            
            likelihood = self._compute_log_likelihood()
            if iter_num == 0:
                prev_likelihood = likelihood
            print '    Starting likelihood: %s' % likelihood
            
            for g_iter_num in range(gradient_ascent_iterations):
                g_iter_start = time.time()
                
                print '    Gradient ascent iter %s' % g_iter_num
                prev_likelihood = likelihood
                self._compute_gradient()
                
                # do the actual gradient ascent
                for k in ASPECTS:
                    for w in self.words_set:
                        self.theta[k][w] += learning_rate * self.gradient_theta[k][w]
                        for r in RATINGS:
                            self.phi[k][r][w] += learning_rate * self.gradient_phi[k][r][w]
                
                likelihood = self._compute_log_likelihood()
                print '        Likelihood: %s' % likelihood
                
                # undo the last operation and break if likelihood didn't improve
                if not (likelihood > prev_likelihood):
                    print '        Likelihood did not improve; undoing and breaking'
                    for k in ASPECTS:
                        for w in self.words_set:
                            self.theta[k][w] -= learning_rate * self.gradient_theta[k][w]
                            for r in RATINGS:
                                self.phi[k][r][w] -= learning_rate * self.gradient_phi[k][r][w]
                    likelihood = prev_likelihood
                    break
                
                print '        Time: %s s' % (time.time() - g_iter_start)
            
            pi = {}
            for k in ASPECTS:
                pi[k] = {}
                for w in self.words_set:
                    pi[k][w] = 0.0
                    for r in RATINGS:
                        pi[k][w] += self.phi[k][r][w]
                    pi[k][w] /= len(RATINGS)
            
            for k in ASPECTS:
                for w in self.words_set:
                    for r in RATINGS:
                        self.phi[k][r][w] -= pi[k][w]
                    self.theta[k][w] += pi[k][w]
            
            prev_likelihood = likelihood
            
            print '    Likelihood: %s' % likelihood
            print '    Time: %s s' % (time.time() - main_iter_start)

        print 'Total time: %s s' % (time.time() - overall_start)
    
    def get_aspect(self, i, j):
        return self.sentences[i][j][2]
    
    def set_aspect(self, i, j, aspect):
        self.sentences[i][j][2] = aspect
        
    def get_words(self, i, j):
        return self.sentences[i][j][1]

small_df = reviews_df.iloc[0:100]
sentence_model = SentenceModel(small_df)
print len(sentence_model.words_set), 'words'

In [ ]:
# train the model
sentence_model.train(iterations=15, gradient_ascent_iterations=7)

In [ ]:
# print out the learned words
for k, values in sentence_model.theta.iteritems():
    print k.upper() + ':'
    for word, weight in sorted(values.iteritems(), key=lambda x: x[1], reverse=True)[:10]:
        print '    %s: %s' % (word, weight)

In [ ]: