HW4: Do we really need Chocolate Recommendations?

Before You Start

This is a long homework. Please start early. It uses a lot of different (and sometimes complex) concepts, so you might find yourself reading a lot. So, please, give yourself a lot of time.

Also, please see this link on getting an Amazon Web Services account soon, so that you dont delay its creation. This class gives you $100 in credits which you will use for this homework, possibly your project, and any other projects you might like.

Finally, please go to the labs. The one on 18th October (Today) will cover Gibbs Sampling and Bayesian Normal distributions. The one on the 25th will cover Map-Reduce. Both will help on the homework.

Collaborative Filtering systems

In this homework, you will create a recommendation system for restaurants using collaborative filtering (CF). The general structure of a recommendation system is that there are users and there are items. Users express explicit or implicit preferences towards certain items. CF thus relies on users' past behavior.

There are two primary approaches to CF: neighboorhood and latent factor model. The former is concerned with computing the relationships between items or between users. In the latter approach you have a model of hidden factors through which users and items are transformed to the same space. For example, if you are rating movies we may transform items into genre factors, and users into their preference for a particular genre.

Factor models generally lead to more accurate recommenders. One of the reasons for this is the sparsity of the item-user matrix. Most users tend to rate barely one or two items. Latent factor models are more expressive, and fit fewer parameters. However, neighborhood models are more prevalent, as they have an intuitive aspect that appeals to users(if you liked this you will like that) and online(a new preference can be incorporated very quickly).

Most recommenders today combine neighboorhood CF with model based CF, and SVD based matrix factorization approaches.

To see the example of a simple beer recommender, go here. This homework is inspired by the one there but we go after food instead, and go deeper into the problem of making recommendations.

User and Item based approaches

Original approaches to neighborhood based CF used user-user models. By this we mean that rating estimates are made from recorded ratings of like minded users. However, since most users tend to rate very few items, this is usually a losing proposition for explicit-rating based recommenders. Thus, most neighborhood based systems such as Amazon these days rely on item-item approaches. In these methods, a rating is estimated by other ratings made by the user on "similar" or "nearby" items: we have a K-Nearest-Neighbors algorithm, in effect.

Outline of this Homework

The outline of this homework is as follows:

  1. Create a database of item-item similarities. Use this to implement a neighborhood-based CF recommender that can answer simple questions like "give me more restaurants like this one". This part of the homework assumes that the similaties calculated make good "global recommendations".

  2. In the second part, we go one step further and attempt to predict the rating that a user will give an item they have not seen before. This requires that we find the restaurants that this user would rate as similar (not just those which are globally similar).

  3. In the third part, we implement a factor-based CF recommender using a Bayesian model. While quite a bit more complex, this allows us to pool information both about similar users and about similar restaurants.

  4. We will scale up our system by creating a recommender on the lines of Q1 and Q2 that works on the entire data set. We will use the map-reduce paradigm to split the computation over multiple machines.

You will start simply, by working on a subset of the restaurant data before generalizing to the entire data set in Problem 4. The complete data set has 150,000 reviews, but we shall start with just about 7000. You will create this smaller set by taking all the users who had rated more than 60 restaurants, and all the businesses which had greater than 150 reviews from the larger data set. This is not a random set: indeed we use it as it a computationally tractable set that is a bit less sparse than the entire data set.


In [11]:
%matplotlib inline
from collections import defaultdict
import json

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd

from matplotlib import rcParams
import matplotlib.cm as cm
import matplotlib as mpl

#colorbrewer2 Dark2 qualitative color table
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'


def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
        
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

#utility functions

def histogram_style():
    remove_border(left=False)
    plt.grid(False)
    plt.grid(axis='y', color='w', linestyle='-', lw=1)
    
def histogram_labels(xlabel, ylabel, title, loc):    
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend(frameon=False, loc=loc)
    
def histogram_settings(xlabel, ylabel, title, loc = 'upper left'):
    histogram_style() 
    histogram_labels(xlabel, ylabel, title, loc)

Description of the data set

The data set has been extracted from the Yelp Phoenix restaurants dataset. It is available here.


In [12]:
fulldf=pd.read_csv("bigdf.csv")
fulldf.head(2)


Out[12]:
user_id business_id date review_id stars usefulvotes_review user_name categories biz_name latitude longitude business_avg business_review_count user_avg user_review_count
0 rLtl8ZkDX5vH5nAx9C3q5Q 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 00:00:00 fWKvX83p0-ka4JS3dc6E5A 5 5 Jason [Breakfast & Brunch, Restaurants] Morning Glory Cafe 33.390792 -112.012504 3.87156 109 3.796954 197
1 SBbftLzfYYKItOMFwOTIJg 9yKzy9PApeiPPOUJEtnvkg 2008-05-04 00:00:00 DASdFe-g0BgfN9J2tanStg 5 1 Jennifer [Breakfast & Brunch, Restaurants] Morning Glory Cafe 33.390792 -112.012504 3.87156 109 3.473684 57

The data frame is a frame of reviews. We have joined in information about users and businesses into this frame so that you have only one frame to work with.

This information is for the reviews themselves:

'stars': (star rating, integer 1-5), 'date': (date, formatted like '2011-04-19'), 'review_id': (unique id for the review).

Here is a description of the data fields in this dataframe, on the business side

'business_id': (a unique identifier for this business), 'biz_name': (the full business name), 'latitude': (latitude), 'longitude': (longitude), 'business_review_count': (review count for the restaurant[this is a repeated field for all reviews of the restaurant]), 'categories': [(localized category names)], 'business_avg': (average stars over all users reviews for business[this is a repeated field for all reviews of the restaurant]).

And Finally, a set of fields for users

'user_id': (unique user identifier), 'user_name': (first name, last initial, like 'Matt J.'), 'user_review_count': (count of restaurants reviewed by user[this is a repeated field for all reviews by the user]), 'user_avg': (floating point average of users reviews over all businesses, like 4.31[this is a repeated field for all reviews by the user]).

In this data set, every user has only one review for each restaurant. Convince yourself of this. (This answer does not need to be submitted).

Our Recommender

To motivate our recommendation system, consider the follwing example. Let's pretend we are in Boston for a second. Lets say the average rating of restaurants here by all the users is 3.5. Sandrine's at Harvard square is better than an average restaurant, so it tends to be rated 0.5 stars above the average (over all the users). However, you are a curmudgeon, who tends to rate 0.2 stars below the average. Then a baseline estimate for the recommendation for Sandrine's, for you, is 3.5+0.5-0.2=3.8.

These baseline estimates thus adjust the data by accounting for the systematic tendencies for some users who give higher ratings than others, and for some restaurants to recieve higher ratings than others. We can write the baseline estimate $\hat Y_{um}^{baseline}$ for an unknown rating $Y_{um}$ for user $u$ and restaurant or business $m$ as:

$$ \hat Y_{um}^{baseline} = \hat \mu + \hat \theta_{u0} + \hat \gamma_{m0} $$

where the unknown parameters $\theta_{u0}$ and $\gamma_{m0}$ indicate the deviations, or biases, of user $u$ and item $m$, respectively, from some intercept parameter $\mu$. (The reason for the strange notation with 0s will become clear in Problem 3)

Notice that the $\theta_{u0}$ and $\gamma_{m0}$ are parameters which need to be fit. The simplest thing to start with, and something we will do for Problems 1 and 2 (but not 3), is to replace them by their "mean" estimates from the data. Thus:

$$ \hat Y^{baseline}_{um} = \bar Y + (\bar Y_u - \bar Y) + (\bar Y_m - \bar Y)$$

where $\bar Y_u$ = user_avg, the average of all a user $u$'s ratings and $\bar Y_m$ = business_avg, the average of all ratings for a restaurant $m$. $\bar Y$ is the average rating over all reviews.

The final two terms correspond to the user-specific and item-specific bias in ratings, that is, how their ratings tend to systematically diverge from the global average. This is the simplest possible way to predict a rating, based only on information about this user and this restaurant.

Can we do a better job of predicting the rating $Y_{um}$ user $u$ would give to restaurant $r$? According to the central dogma of CF, we ought to be able to use the responses of similar users regarding similar restaurants to get a better prediction.

We can make an estimate of $Y_{um}$ as:

$$ \hat{Y_{um}} = \hat Y_{um}^{baseline}\, + \,\frac{\sum\limits_{j \in S^{k}(m)} s_{mj} ( Y_{uj} - \hat Y_{uj}^{baseline} )}{\sum\limits_{j \in S^{k}(m)} s_{mj} } $$

where $s^{k}(m)$ is the $k$ neighbor items of item $m$ based on some pooling criterion, for example, those items which have been rated by user $u$.

In the next two problems, we will focus on using similar restaurants, or the item neighborhood. To do this, we compute a similarity measure $s_{mj}$ between the $m$th and $j$th items. This similarity might be measured via cosine similarity, pearson co-efficient or using other distance based measures. Here we shall use the Pearson coefficient. This measures the tendency of users to rate items similarly. Since most ratings are unknown, it is computed on the "common user support" (n_common), which is the set of common raters of both items.

In the first problem we shall set $S$ to the global neighborhood of the item, and in the second we shall set it to those items which have been rated by user $u$.

Q1. Writing a simple "global" recommender

Now we have a way to pool information between similar restaurants to try to predict a user's recommendation. But how do we choose the neighborhood to pool over? We begin with the simplest choice. We calculate the similarity between items using their entire common user support, and rank the nearest neighbors of an item by this similarity. We call this a "global" recommender because it assumes that every user perceives the similarity between restaurants in the same way. Later on, we will implement a more specific recommender that pools information based on which items seem the most similar to this user.

The global recommender does have the advantage of dealing with the possible sparsity of the user's rated items, but also the disadvantage of giving one answer for all users, without taking the user's preferences into account. This is a classic case of bias-variance tradeoff.(what does it mean?)

Lets implement this simpler global recommender first.

Exploratory Data Analysis

1.1 Visualize the sparsity of the full data set by plotting two histograms of the review count grouped by the user_id and business_id respectively. Are there more users or more businesses?


In [13]:
fulldf.groupby(fulldf.user_id).review_id.count().describe()


Out[13]:
count    34789.000000
mean         4.292133
std         10.876321
min          1.000000
25%          1.000000
50%          1.000000
75%          3.000000
max        332.000000
Name: review_id, dtype: float64

In [14]:
fulldf.user_id.unique().size


Out[14]:
34789

In [15]:
#your code here

plt.hist(fulldf.groupby(fulldf.user_id).review_id.count(), log = True, bins= 20)
plt.xlabel("Reviews per user")
plt.ylabel("review count")
histogram_style()



In [16]:
fulldf.groupby(fulldf.business_id).review_id.count().describe()


Out[16]:
count    4503.000000
mean       33.159893
std        56.449513
min         1.000000
25%         5.000000
50%        12.000000
75%        37.000000
max       742.000000
Name: review_id, dtype: float64

In [17]:
#your code here

plt.hist(fulldf.groupby(fulldf.business_id).review_id.count(), log =True, bins= 20)
plt.xlabel("Reviews per restaurant")
plt.ylabel("review count")
histogram_style()



In [18]:
#your code here
print "No of users: %d" %fulldf.groupby(fulldf.user_id).review_id.count().size
print "No of businesses: %d" %fulldf.groupby(fulldf.business_id).review_id.count().size


No of users: 34789
No of businesses: 4503

Of the total 4503 restaurants in the data, about 75% of the users have reviewed only fewer than 3 of them. Also, it seems a minority of the users contribute the most number of reviews. A similar trend is observed in the number of reviews per restaurant as well. About 75% of the restaurants have fewer than 37 reviews.

1.2 Compute the average rating of reviews in the data set and a histogram of all the ratings in the dataset.


In [19]:
#your code here
plt.hist(fulldf.stars, bins =5 , label ='Ratings')
plt.axvline(fulldf.stars.mean(), 0, 1, color='r', label='Average Rating: %.2f'%fulldf.stars.mean())
plt.legend(frameon=False, loc='upper left')
histogram_style()


The following function is used to re-compute review counts and averages whenever you subset a reviews data frame. We'll use it soon to construct a smaller, more computationally tractable data frame.


In [20]:
def recompute_frame(ldf):
    """
    takes a dataframe ldf, makes a copy of it, and returns the copy
    with all averages and review counts recomputed
    this is used when a frame is subsetted.
    """
    ldfu=ldf.groupby('user_id')
    ldfb=ldf.groupby('business_id')
    user_avg=ldfu.stars.mean()
    user_review_count=ldfu.review_id.count()
    business_avg=ldfb.stars.mean()
    business_review_count=ldfb.review_id.count()
    nldf=ldf.copy()
    nldf.set_index(['business_id'], inplace=True)
    nldf['business_avg']=business_avg
    nldf['business_review_count']=business_review_count
    nldf.reset_index(inplace=True)
    nldf.set_index(['user_id'], inplace=True)
    nldf['user_avg']=user_avg
    nldf['user_review_count']=user_review_count
    nldf.reset_index(inplace=True)
    return nldf

1.3 Create a smaller data set in dataframe smalldf by looking for those businesses with more than 150 reviews and those users with more than 60 reviews. Include all the columns that were there in the parent dataframe. Since you have created a subset of the data set, use the method provided above to recalculate the averages. Print the number of unique users and items in this data set.

Note that while this cut makes sure we have prolific users, the cut on businesses restores sparsity by reducing the number of reviews per user.


In [21]:
#your code here
smalldf = fulldf[(fulldf.business_review_count > 150) & (fulldf.user_review_count > 60)] 
print "smalldf as a percentage of fulldf: %0.2f" % (float(smalldf.shape[0])/float(fulldf.shape[0])*100)


smalldf as a percentage of fulldf: 4.13

In [22]:
smalldf_new = recompute_frame(smalldf)

print "No of unique users: %d" %smalldf.user_id.unique().size
print "No of unique restaurants: %d" %smalldf.business_id.unique().size


No of unique users: 240
No of unique restaurants: 172

How does this compare to the parent data set, in terms of size and sparsity? Once again, plot histograms of the review count grouped by user, and by the review count grouped by business, respectively, and describe the results


In [23]:
# histogram of the review count grouped by user
plt.figure()
count_by_user = smalldf.groupby(smalldf.user_id).review_id.count()
count_by_business = smalldf.groupby(smalldf.business_id).review_id.count()
plt.hist(count_by_user, bins =10)
plt.axvline(count_by_user.mean(), 0, 1, color='r', label='Avg Reviews per user: %.2f'%count_by_user.mean())
histogram_settings("Reviews per user","review count", "Reviews count per user", "upper right")

# histogram of the review count grouped by restaurant
plt.figure()
plt.hist(count_by_business, bins =10)
plt.axvline(count_by_business.mean(), 0, 1, color='r', label='Avg Reviews per restaurant: %.2f'%count_by_business.mean())
histogram_settings("Reviews per restaurant","review count", "Reviews count per restaurant", "upper right")


Data in smalldf looks less sparse than the fulldf.

1.4 Compute histograms of the average user rating in the smaller data set, and the average business rating in the smaller data set. Print the overall mean.


In [24]:
#histogram of the average user rating
plt.figure()
avg_user_rating = smalldf.groupby('user_id').stars.mean()
plt.hist(avg_user_rating, bins =5 , label ='Avg User Ratings')
plt.axvline(avg_user_rating.mean(), 0, 1, color='r', label='Avg User Rating: %.2f'%avg_user_rating.mean())
histogram_settings("Avg User Rating", "Count", "Avg User Rating Count")

plt.figure()
avg_bus_rating = smalldf.groupby('business_id').stars.mean()
plt.hist(avg_bus_rating, bins =5 , label ='Avg Restaurant Ratings')
plt.axvline(avg_bus_rating.mean(), 0, 1, color='r', label='Avg Restaurant Rating: %.2f'%avg_bus_rating.mean())
histogram_settings("Avg Restaurant Rating", "Count", "Avg Restaurant Rating Count")


print "Overall average rating in smalldf: %0.2f"%smalldf.stars.mean()


Overall average rating in smalldf: 3.87

Common Support

Lets now make a histogram of the common user support (the number of common reviewers) of each pair of restaurants on the smaller set, and print the mean. Pay attention to the code, as you will use parts of it later. (This code takes a bit of time to run, so be patient).

The common support is an important concept, as for each pair of restaurants, its the number of people who reviewed both. It will be used to modify similarity between restaurants. If the common support is low, the similarity is less believable.


In [25]:
restaurants=smalldf.business_id.unique()  #get all the unique restaurant ids
supports=[]
for i,rest1 in enumerate(restaurants):  #loop for first restaurant (rest1) in the pair in the f
    for j,rest2 in enumerate(restaurants): # loop for second restaurant(rest2) in the pair
        if  i < j: #skip pairing same restaurants and forming duplicate pairs
            rest1_reviewers = smalldf[smalldf.business_id==rest1].user_id.unique() #find all unique users who reviewed restaurant 1
            rest2_reviewers = smalldf[smalldf.business_id==rest2].user_id.unique() #find all unique users who reviewed restaurant 2
            common_reviewers = set(rest1_reviewers).intersection(rest2_reviewers) # find common reviewers by taking intersection
            supports.append(len(common_reviewers)) # add the no of common reviewers to list
print "Mean support is:",np.mean(supports)
plt.hist(supports) #plot hist of list
histogram_settings("Common reviewers per each pair of restaurants", "Count", "Common User Support")


Mean support is: 6.84679722562
/Users/apple/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "

As you can see, even though we chose a subset of the dataframe in which every restaurant had 150 reviews and every user had atleast made 60, the common support of most pairs of restaurants is really low, indeed less than 10!.

Calculating Similarity

Users rate restaurants on a scale of 1-5. Even though this rating is integer valued, for the purposes of this assignment we shall treat it as a real number.

Even though each reviewer uses the same 5-star scale when rating restaurants, comparing two users by comparing their raw user ratings can be problematic. Consider a user whose average rating is 2. This is a curmudgeonly user. Consider another whose average rating is 4. This is a rather enthusiastic one. How should we compare a 3 rating by the curmudgeonly one to a 5 rating of the enthusiastic one?

It is for this purpose that we must subtract the average rating of the user from the actual rating of the restaurants in computing the similarity of two restaurants. This makes the above ratings by the two users comparable. We do this in the function pearson_sim defined below.

If there is no common support (n_common=0), we have no basis for making a similarity estimate, and so we set the similarity to 0. In the case that the individual restaurant rating variance is 0, such as in the case where there is only one common reviewer (n_common=1), we return the NaN that the scipy pearsonr returns. We will deal with it soon,


In [26]:
from scipy.stats.stats import pearsonr
def pearson_sim(rest1_reviews, rest2_reviews, n_common):
    """
    Given a subframe of restaurant 1 reviews and a subframe of restaurant 2 reviews,
    where the reviewers are those who have reviewed both restaurants, return 
    the pearson correlation coefficient between the user average subtracted ratings.
    The case for zero common reviewers is handled separately. Its
    ok to return a NaN if any of the individual variances are 0.
    """
    if n_common==0:
        rho=0.
    else:
        diff1=rest1_reviews['stars']-rest1_reviews['user_avg']
        diff2=rest2_reviews['stars']-rest2_reviews['user_avg']
        rho=pearsonr(diff1, diff2)[0]
        if np.isnan(rho):
            rho=0.
    return rho

The function get_restaurant_reviews defined below takes a restaurant business_id and a set of users, and returns the reviews of that restaurant by those users. You will use this function in calculating a similarity function, in 1.5.


In [27]:
def get_restaurant_reviews(restaurant_id, df, set_of_users):
    """
    given a resturant id and a set of reviewers, return the sub-dataframe of their
    reviews.
    """
    mask = (df.user_id.isin(set_of_users)) & (df.business_id==restaurant_id)
    reviews = df[mask]
    reviews = reviews[reviews.user_id.duplicated()==False]
    return reviews

1.5 Write a function calculate_similarity that operates between two restaurants and calculates a similarity for them, taking a dataframe and a similarity function similarity_func. An example of the similarity_func is the pearson_sim we defined above. calculate_similarity operates as follows:

  1. For each of the two restaurants, get the set of reviewers who have reviewed the restaurant and compute the intersection of these two sets. Also compute the number of common reviewers n_common.

  2. Use the function get_restaurant_reviews defined below to get the reviews for each restaurant as made by these common reviewers. Notice that get_restaurant_reviews returns a sub data frame of reviews.

  3. Calculate the similarity using similarity_func which takes the two reviews dataframes from part 2 and the number of common reviewers n_common as arguments

  4. Return the similarity and n_common in a tuple (sim, n_common). If the similarity is a NaN, set the similarity to 0.


In [28]:
"""
Function
--------
calculate_similarity

Parameters
----------
rest1 : string
    The id of restaurant 1
rest2 : string
    The id of restaurant 2
df : DataFrame
  A dataframe of reviews, such as the smalldf above
similarity_func : func
  A function like pearson_sim above which takes two dataframes of individual
  restaurant reviews made by a common set of reviewers, and the number of
  common reviews. This function returns the similarity of the two restaurants
  based on the common reviews.
  
Returns
--------
A tuple
  The first element of the tuple is the similarity and the second the
  common support n_common. If the similarity is a NaN, set it to 0
"""
#your code here
def calculate_similarity(rest1, rest2, df, similarity_func):
    rest1_reviewers = df[df.business_id==rest1].user_id.unique() #find all unique users who reviewed restaurant 1
    rest2_reviewers = df[smalldf.business_id==rest2].user_id.unique() #find all unique users who reviewed restaurant 2
    common_reviewers = set(rest1_reviewers).intersection(rest2_reviewers) #
    n_common = len(common_reviewers)
    
    reviews1 = get_restaurant_reviews(rest1, df, common_reviewers)
    reviews2 = get_restaurant_reviews(rest2, df, common_reviewers)
    sim_coef = similarity_func(reviews1, reviews2, n_common)
    if np.isnan(sim_coef):
        sim_coef = 0
    return sim_coef,n_common

Making a database of similarities

We now move to calculating a global database of pairwise restaurant similarities. We provide you here with a function to make a database of the similarities for each pair of restaurants in the database. The class Database is initialized in its constructor by taking as arguments a dataframe of reviews. The method populate_by calculating iterates over every possible pair of business_id's in the dataframe and populates the database with similarities and common supports. It takes as arguments a function the similarity function similarity_func like pearson_sim (calculate_similarity then uses this to calculate the similarity). The get method on the database can be used to retrieve the similarity for two business ids.

(See Thu Oct 17th's class video for information about classes)


In [29]:
class Database:
    "A class representing a database of similaries and common supports"
    
    def __init__(self, df):
        "the constructor, takes a reviews dataframe like smalldf as its argument"
        database={}
        self.df=df
        self.uniquebizids={v:k for (k,v) in enumerate(df.business_id.unique())}
        keys=self.uniquebizids.keys()
        l_keys=len(keys)
        self.database_sim=np.zeros([l_keys,l_keys])
        self.database_sup=np.zeros([l_keys, l_keys], dtype=np.int)
        
    def populate_by_calculating(self, similarity_func):
        """
        a populator for every pair of businesses in df. takes similarity_func like
        pearson_sim as argument
        """
        items=self.uniquebizids.items()
        for b1, i1 in items:
            for b2, i2 in items:
                if i1 < i2:
                    sim, nsup=calculate_similarity(b1, b2, self.df, similarity_func)
                    self.database_sim[i1][i2]=sim
                    self.database_sim[i2][i1]=sim
                    self.database_sup[i1][i2]=nsup
                    self.database_sup[i2][i1]=nsup
                elif i1==i2:
                    nsup=self.df[self.df.business_id==b1].user_id.count()
                    self.database_sim[i1][i1]=1.
                    self.database_sup[i1][i1]=nsup
                    

    def get(self, b1, b2):
        "returns a tuple of similarity,common_support given two business ids"
        sim=self.database_sim[self.uniquebizids[b1]][self.uniquebizids[b2]]
        nsup=self.database_sup[self.uniquebizids[b1]][self.uniquebizids[b2]]
        return (sim, nsup)

Lets run make_database and store the result in the global variable db. Lets print out an example entry. Running this function will take a bit of time.


In [30]:
db=Database(smalldf)
db.populate_by_calculating(pearson_sim)

In [31]:
db.get("z3yFuLVrmH-3RJruPEMYKw", "zruUQvFySeXyEd7_rQixBg")


Out[31]:
(0.34844426604207018, 7)

K-Nearest restaurants (in similarity)

We are now going to find the k-nearest restaurants to a given restaurant based on the database of similarities that we calculated. But we have a problem.

Consider the two cases where there is just one common reviewer, and where there are 40. In the former case, we might get a artificially high similarity based on the tastes of just this user, and thus we must reduce its importance in the nearest-neighbor calculation. In the latter case, we would get a much more unbiased estimator of the similarity of the two restaurants.

To control the effect of small common supports, we can shrink our pearson co-efficients. We shall do this by using the "regularization" parameter reg:

$$s_{mj} = \frac{N_{common}\, \rho_{mj}}{N_{common}+reg} $$

where $N_{common}$ (n_common) is the common reviewer support and $\rho_{ij}$ is the pearson co-relation coefficient.

Recall the notions of regularization introduced in class. We want to reduce the variance in our estimates, so we pull our estimates in toward a conservative point in a way that strongly corrals in estimates when there is very little data, but allows the data to speak when there is a lot. This can be shown as equivalent to adding in a reg amount of bayesian prior, as Joe has alluded to in class.

A good value of the regularizer is intuitively one that doesn't affect the similarity when the common support is high ~10, but has a large effect when the support is small. In this case, values of 2-4 are good. Usually, the value of reg is determined using cross-validation, but for the sake of simplicity we will generally set it to 3.

We define a function shrunk_sim which takes the sim and n_common obtained from the database, and shrinks the similarity down using the regularizer reg.


In [32]:
def shrunk_sim(sim, n_common, reg=3.):
    "takes a similarity and shrinks it down by using the regularizer"
    ssim=(n_common*sim)/(n_common+reg)
    return ssim

1.6 Now we can move to writing a knearest function, which finds the k nearest neighbors of a given restaurant based on the shrunk similarities we calculate. Note that as defined here, the nearest neighbors are global over the entire set of restaurants, as opposed to being restricted to the restaurants a user has reviewed(we shall do that in the next problem). Thus, this is an expensive function!

Write a knearest that returns a k-length sorted list of 3-tuples each corresponding to a restaurant. The tuple structure is (business_id, shrunken similarity score, common support) where the similarity score and common support are with respect to the restaurant whose neighbors we are finding, and the business_id is the id of the "nearby" restaurant found. The nearby restaurants are found from a supplied numpy array of restaurants set_of_restaurants. The spec for the function is given below. HINT: use itemgetter from the operator module to do the sorting.


In [33]:
"""
Function
--------
knearest

Parameters
----------
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
set_of_restaurants : array
    The set of restaurants from which we want to find the nearest neighbors
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businessed. e.g. dbase.get(rid1,rid2)
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top k similar restaurants. The list is a list of tuples
    (business_id, shrunken similarity, common support).
"""
#your code here
from operator import itemgetter
def knearest(restaurant_id, set_of_restaurants, dbase, k =7, reg =3):
    
    similar = []
    for rest_id in set_of_restaurants: #loop to find similarity of restaurant_id with the rest of restaurants
        if rest_id != restaurant_id:
            sim, n_common = dbase.get(restaurant_id, rest_id) 
            ssm = shrunk_sim(sim, n_common, reg)
            similar.append((rest_id,ssm,n_common))
    similar = sorted(similar,key =itemgetter(1), reverse = True)
    return similar[0:k]

Ok it's time to recommend!

Lets choose the two very different businesses in the dataframe


In [34]:
#testbizid="eIxSLxzIlfExI6vgAbn2JA"
#testbizid2="L-uPZxooP_ziXCtRrWi8Pw"
testbizid="53YGfwmbW73JhFiemNeyzQ"
testbizid2="KoIRdcIfh3XWxiCeV1BDmA"

We provide functions to look up a business name given a business id, and a username given a user id.


In [35]:
def biznamefromid(df, theid):
    return df['biz_name'][df['business_id']==theid].values[0]
def usernamefromid(df, theid):
    return df['user_name'][df['user_id']==theid].values[0]

In [36]:
print testbizid, biznamefromid(smalldf,testbizid)
print testbizid2, biznamefromid(smalldf, testbizid2)


53YGfwmbW73JhFiemNeyzQ Olive & Ivy
KoIRdcIfh3XWxiCeV1BDmA Cien Agaves Tacos & Tequila

In [37]:
smalldf['business_id'][smalldf['biz_name']=="Olive & Ivy"].values[0]


Out[37]:
'53YGfwmbW73JhFiemNeyzQ'

Get top matches

Its now time to answer the question: "if you liked this, you might also like these". We use our testbizid and testbizid2 to compute the k=7 nearest neighbors with a regularization of 3. . We print these top 7 matches names, along with their similarity coefficient and common support.


In [38]:
tops=knearest(testbizid, smalldf.business_id.unique(), db, k=7, reg=3.)
print "For ",biznamefromid(smalldf, testbizid), ", top matches are: \n"
print"No\t    Restaurant Name\t Similarity\t    Support" 
for i, (biz_id, sim, nc) in enumerate(tops): 
    print "%2s" %i,"%-30s" %biznamefromid(smalldf,biz_id), "%-15.4f" %sim,"%5d" %nc # -30s -> left aligns, 30 spaces


For  Olive & Ivy , top matches are: 

No	    Restaurant Name	 Similarity	    Support
 0 Sapporo                        0.6562              7
 1 Sens Asian Tapas & Sake Bar    0.5652              4
 2 Oregano's Pizza Bistro         0.5565              5
 3 Roaring Fork                   0.4892              9
 4 Beckett's Table                0.4873              8
 5 Coup Des Tartes                0.4542              6
 6 Kabuki Japanese Restaurant     0.4251              8

In [39]:
tops2=knearest(testbizid2, smalldf.business_id.unique(), db, k=7, reg=3.)
print "For ",biznamefromid(smalldf, testbizid2), ", top matches are:\n"
print"No Restaurant Name\t\t\t\t Similarity\t    Support" 
for i, (biz_id, sim, nc) in enumerate(tops2): 
    print "%2s" %i,"%-45s" %biznamefromid(smalldf,biz_id), "%-15.4f" %sim,"%5d" %nc


For  Cien Agaves Tacos & Tequila , top matches are:

No Restaurant Name				 Similarity	    Support
 0 Yupha's Thai Kitchen                          0.5714              4
 1 Fuego Bistro                                  0.5139              5
 2 Fogo de Chao                                  0.5000              3
 3 Café Monarch                                 0.5000              3
 4 Lee's Sandwiches                              0.5000              3
 5 Lalibela Ethiopian Cafe                       0.5000              3
 6 Pita Jungle                                   0.5000              3

We can see that these two restaurants are in somewhat different orbits :-).

Lets now turn our attention to another question: what are the top recommendations for a user? To answer this we must find the user's top rated restaurants, find the nearest neighbors of these restaurants, merge these lists while removing the duplicates and the ones that the user has already rated, and sort by the restaurant's average rating. We provide the code to get the user's top choices in a subset data frame.


In [40]:
def get_user_top_choices(user_id, df, numchoices=5):
    "get the sorted top 5 restaurants for a user by the star rating the user gave them"
    udf=df[df.user_id==user_id][['business_id','stars']].sort(['stars'], ascending=False).head(numchoices)
    return udf
testuserid="7cR92zkDv4W3kqzii6axvg"
#testuserid="sALB8-F04S9VcZMF_GkGUA"
print "For user", usernamefromid(smalldf,testuserid), "top choices are:" 
bizs=get_user_top_choices(testuserid, smalldf)['business_id'].values
[biznamefromid(smalldf, biz_id) for biz_id in bizs]


For user Vern top choices are:
/Users/apple/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  app.launch_new_instance()
Out[40]:
['Tee Pee Mexican Food',
 'Local Breeze',
 "Carly's Bistro",
 'District American Kitchen and Wine Bar',
 'Sonora Mesquite Grill']

Get top recommendations for user.

1.7 Its your job now to write a function get_top_recos_for_user which takes as arguments a userid, the n top choices for the user, the dataframe, k, and a regularizer, and returns the top recommendations obtained from combining the restaurants that are neighbors of each of the n choices, in the way described in the previous paragraph. This returned list is a list of tuples (restaurant_id, business_avg) sorted by business_avg where business_avg is the average rating of the restaurant over the dataframe.


In [41]:
"""
Function
--------
get_top_recos_for_user

Parameters
----------
userid : string
    The id of the user for whom we want the top recommendations
df : Dataframe
    The dataframe of restaurant reviews such as smalldf
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businesses. e.g. dbase.get(rid1,rid2)
n: int
    the n top choices of the user by star rating
k : int
    the number of nearest neighbors desired, default 8
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top recommendations. The list is a list of tuples
    (business_id, business_avg). You are combining the k-nearest recommendations 
    for each of the user's n top choices, removing duplicates and the ones the user
    has already rated.
"""

#your code here
def get_top_recos_for_user(userid,df,dbase,n=5,k=7,reg=3):

    rest_ids = get_user_top_choices(userid, df, numchoices =n)["business_id"].values
    rests_rated_by_user = df[df.user_id==userid].business_id.values
 
    top_recos =[]
    for rest in rest_ids:
        tops=knearest(rest, df.business_id.unique(), dbase, k=k, reg=reg)
        for top in tops:
            if top[0] not in rests_rated_by_user:
                top_recos.append(top)    
        #remove duplicates
        ids=[e[0] for e in top_recos]
        #print ids
        uids={k:0 for k in list(set(ids))} # add 0 to identify the unique ones in the full list.
    
        
        top_unique =[]
        for e in top_recos:
            if uids[e[0]] == 0:
                top_unique.append(e)
                uids[e[0]] = 1
       
        top_restaurants =[]
        for r,s,nc in top_unique:
            avg_rating = df[df.business_id ==r].stars.mean()
            top_restaurants.append((r,avg_rating))
        
        top_restaurants = sorted(top_restaurants, key = itemgetter(1), reverse = True)   
       
   
        if n < len(top_restaurants):
            return top_restaurants[0:n]
        else:
            return top_restaurants

Lets print the top recommendations for testuserid, with a regularization of 3.


In [42]:
print "For user", usernamefromid(smalldf,testuserid), "the top recommendations are:"
toprecos=get_top_recos_for_user(testuserid, smalldf, db, n=5, k=7, reg=3.)
for biz_id, biz_avg in toprecos:
    print biznamefromid(smalldf,biz_id), "| Average Rating |", biz_avg


For user Vern the top recommendations are:
Rokerij | Average Rating | 4.37931034483
Cornish Pasty Company | Average Rating | 4.20689655172
Carolina's Mexican Food | Average Rating | 3.91176470588
Malee's Thai Bistro | Average Rating | 3.875
Delux | Average Rating | 3.77611940299
/Users/apple/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  app.launch_new_instance()

Problem 2: A user based recommender with predicted ratings

This is all very nice. We can provide ratings based on global similarities to a restaurant. However, in many cases this is not enough.

For example, it is hard to judge if the above recommendations are any good. In the usual testing paradigm, say that we break the dataframe into train and test. Based on the training set, I am recommended restaurant B. Now, I have rated B, but that information is in the testing set. I have no way of comparing the rating I give B in the testing set, to the similarity computed from the training set that was used to make the recomendation. The best I could do is to compare the average rating of restaurant B in the training set to my rating of restaurant B in the test set.

In this section, we shift our focus to more fine-grained predictions about each user, and try to predict what rating a user would give to a restaurant they have never tried before. To do this, we will try to personalize the information we use even further, and only pool information from restaurants that the user has rated.

This allows us to return to the original problem of prediction $Y_{um}$ for a restaurant $m$ that user $u$ has never rated before. Using our newly computed similarity metrics, we can modify our original baseline estimate by pulling in information from the user's neighborhood of the restaurant $m$, and predict $Y_{um}$ as:

$$ \hat{Y_{um}} = \hat Y^{baseline}_{um}\, + \,\frac{\sum\limits_{j \in S^{k}(m;u)} s_{mj} ( Y_{uj} - \hat Y^{baseline}_{uj} )}{\sum\limits_{j \in S^{k}(m;u)} s_{mj} } $$

where $s^{k}(m;u)$ is the $k$ neighbor items of item $m$ which have been rated by user $u$.

Now, this is not a particularly good assumption, especially in the situation where a restaurant is new (new item problem) or a user is new (cold start problem), or in the case when there are very few reviewers of a restaurant, or very few reviews by a user respectively. However, one must start somewhere!

Notice that in adding in the similarity term, we subtract the baseline estimate from the observed rating of the user's neighbor items.

Defining the predicted rating

2.1 Write a function knearest_amongst_userrated, analogous to the knearest function we defined above, to find the nearest k neighbors to a given restaurant from the restaurants that the user has already rated. This function will take as arguments the restaurant_id, the user_id, the dataframe of reviews, the database, the k, and the regularizer reg. Just like before, return a k-length sorted list of 3-tuples each corresponding to a restaurant. HINT: use the knearest function you defined earlier


In [49]:
"""
Function
--------
knearest_amongst_userrated

Parameters
----------
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
user_id : string
    The id of the user, in whose reviewed restaurants we want to find the neighbors
df: Dataframe
    The dataframe of reviews such as smalldf
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businessed. e.g. dbase.get(rid1,rid2)
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A sorted list
    of the top k similar restaurants. The list is a list of tuples
    (business_id, shrunken similarity, common support).
"""
#your code here
from operator import itemgetter
def knearest_amongst_userrated(restaurant_id,user_id, df, dbase, k =7, reg =3):
    
    similar = []
    rests_rated_by_user = df[df.user_id==user_id].business_id.unique()
    return knearest(restaurant_id, rests_rated_by_user, dbase, k, reg)

2.2 Now write a function that returns the predicted rating for a user and an item using the formula at the beginning of this problem. Include code to deal with the possibility that the sum of scores that goes in the denominator is 0: return a predicted rating of the baseline portion of the formula in that case. This function rating takes as arguments the dataframe, the database, the wanted restaurant_id and user_id, and k as well as the regularizer.


In [50]:
"""
Function
--------
rating

Parameters
----------
df: Dataframe
    The dataframe of reviews such as smalldf
dbase : instance of Database class.
    A database of similarities, on which the get method can be used to get the similarity
  of two businessed. e.g. dbase.get(rid1,rid2)
restaurant_id : string
    The id of the restaurant whose nearest neighbors we want
user_id : string
    The id of the user, in whose reviewed restaurants we want to find the neighbors
k : int
    the number of nearest neighbors desired, default 7
reg: float
    the regularization.
    
  
Returns
--------
A float
    which is the impued rating that we predict that user_id will make for restaurant_id
"""


#your code here
def rating(df, dbase, restaurant_id, user_id, k, reg):
    #calculate base line
    overall_avg_rating = df.stars.mean()
    avg_user_rating = df[df.user_id == user_id].user_avg.values[0]
    avg_biz_rating = df[df.business_id == restaurant_id].business_avg.values[0]
    
    prod_similarities_and_rating = 0.
    sum_of_similarities = 0.
   
    #print  baseline_rating
   
    #find knearest and loop through them
    similars = knearest_amongst_userrated(restaurant_id,user_id, df, dbase, k =k, reg =reg)
    for r_id,sim,cs in similars:
        #print("For restaurant: %s" %restaurant_j[0])
        
        rest_reviews = df[(df.user_id == user_id) & (df.business_id == r_id)]
        user_rating = rest_reviews.stars.values[0]
        rest_avg = rest_reviews.business_avg.values[0]
        prod_similarities_and_rating += (sim * (user_rating - (rest_avg + avg_user_rating - overall_avg_rating  )))
        sum_of_similarities += sim
        #print  user_rating_j, base_rating_j,prod_similarities_and_rating, sum_of_similarities 
    baseline_rating =  avg_user_rating + avg_biz_rating -  overall_avg_rating    
    if(sum_of_similarities <= 0):
        return baseline_rating
    else:
        return baseline_rating + (prod_similarities_and_rating/ sum_of_similarities)
      
rating(smalldf,db, '53YGfwmbW73JhFiemNeyzQ', '7cR92zkDv4W3kqzii6axvg', k =7, reg = 3 )


Out[50]:
3.7913871099950494

For the top-recommendations in the variable toprecos from the previous section, we compute the predicted rating and compare it with the average rating over all users available inside the tuples that make up toprecos. We use a k of 7 and regularization 3. For comparision we also print this users' average rating. Do you notice anything interesting about how the order has changed from when we did this with the global similarities? (for you to think, not to answer)


In [51]:
print "User Average", smalldf[smalldf.user_id==testuserid].stars.mean(),"for",usernamefromid(smalldf,testuserid)
print "Predicted ratings for top choices calculated earlier:"
for biz_id,biz_avg in toprecos:
    print biznamefromid(smalldf, biz_id),"|",rating(smalldf, db, biz_id, testuserid, k=7, reg=3.),"|","Average",biz_avg


User Average 3.5652173913 for Vern
Predicted ratings for top choices calculated earlier:
Rokerij | 4.58626646139 | Average 4.37931034483
Cornish Pasty Company | 4.53369320845 | Average 4.20689655172
Carolina's Mexican Food | 4.41500397034 | Average 3.91176470588
Malee's Thai Bistro | 4.37715367449 | Average 3.875
Delux | 3.89089858909 | Average 3.77611940299

Testing the ratings

Let us compare the predicted ratings with a user's ratings. Note that we are doing this on the same set that we constructed the predictions with, so this is not a validation of the procedure, but simply a check of the procedure's fit. We first write a helper function to return the user score for a restaurant, and the restaurant's average score over all users.


In [52]:
def get_other_ratings(restaurant_id, user_id, df):
    "get a user's rating for a restaurant and the restaurant's average rating"
    choice=df[(df.business_id==restaurant_id) & (df.user_id==user_id)]
    users_score=choice.stars.values[0]
    average_score=choice.business_avg.values[0]
    return users_score, average_score

For the user testuserid, we loop over the variable bizs (which is a set of restaurants the user has rated) and print the predicted rating, and the actual rating and restaurant average rating obtained using the function above. We again use k=7 and a regularization of 3.


In [53]:
print "for user",usernamefromid(smalldf,testuserid), 'avg', smalldf[smalldf.user_id==testuserid].stars.mean() 
for biz_id in bizs:
    print "----------------------------------"
    print biznamefromid(smalldf, biz_id)
    print "Predicted Rating:",rating(smalldf, db, biz_id, testuserid, k=7, reg=3.) 
    u,a=get_other_ratings(biz_id, testuserid, smalldf)
    print "Actual User Rating:",u,"Avg Rating",a


for user Vern avg 3.5652173913
----------------------------------
Tee Pee Mexican Food
Predicted Rating: 3.55142662202
Actual User Rating: 5 Avg Rating 3.13636363636
----------------------------------
Local Breeze
Predicted Rating: 4.09134326313
Actual User Rating: 5 Avg Rating 3.86363636364
----------------------------------
Carly's Bistro
Predicted Rating: 4.05263567539
Actual User Rating: 5 Avg Rating 3.65079365079
----------------------------------
District American Kitchen and Wine Bar
Predicted Rating: 3.51619144197
Actual User Rating: 4 Avg Rating 3.575
----------------------------------
Sonora Mesquite Grill
Predicted Rating: 3.93936363386
Actual User Rating: 4 Avg Rating 4.51948051948

2.3 Explain in words why the predicted ratings are lower than the actual ratings. How do the user average rating and restaurant average rating affect this? How does sparsity affect the predicted ratings?

your answer here

Recall that bizs (defined just above question 1.7) has restaurants sorted by Vern's actual star ratings. This means that in this sample, we are looking at Vern's top rated restaurants.

The predicted ratings are lower because these are Vern's top 5 choices, which represent the largest positive deviations away from Vern's mean rating of 3.57. Because we are looking at the upper tail of Vern's rating distribution, but pooling information together from the K nearest neighbors among Vern's rated restaurants to construct the predicted rating, the predicted ratings should fall closer to Vern's user mean than the true ones do. Taking into account the average restaurant rating helps a little bit here because we can adjust the predicted rating to reflect an overall very good restaurant, but it does not counteract the effect of looking at the upper tail of Vern's ratings.

Note that if we were to take Vern's bottom 5 restaurants, we would see the opposite effect. In general, the larger K is (assuming that the similarities within this neighborhood are positive), the closer the predicted rating will be to Vern's user average (this is the bias limit in the bias-variance tradeoff). Similarly, the smaller K is, the more likely we are to have user ratings that are close to the observed rating (the variance limit). The sparsity of the data affects how quickly we move from the variance limit to the bias limit as we increase K. If there were a lot of very similar restaurants in the dataset that Vern had ranked very highly, even with K relatively large, it would be possible to see a predicted rating much closer to the extremely positive ratings we see here in Vern's top 5 (see the results in question 4.4). As these data are now, however, even the most similar 7 restaurants to these that Vern rated so highly lie closer to Vern's mean.

Error Analysis

This next function takes a set of actual ratings, and a set of predicted ratings, and plots the latter against the former. We can use a graph of this kind to see how well or badly we do in our predictions. Since the nearest neighbor models can have alternating positive and negative similarities (the sum of similarity weights in the denominator can get large), the ratings can get very large. Thus we restrict ourselves to be between -10 and 15 in our ratings and calculate the fraction within these bounds. We also plot the line with unit slope, line segments joining the means, and a filled in area representing one standard deviation from the mean.

The first argument to compare_results is a numpy array of the actual star ratings obtained from the dataframe, while the second argument is the numpy array of the predicted ones. (Feel free to improve this function for your display)


In [54]:
def compare_results(stars_actual, stars_predicted, ylow=-10, yhigh=15, title=""):
    """
    plot predicted results against actual results. Takes 2 arguments: a
    numpy array of actual ratings and a numpy array of predicted ratings
    scatterplots the predictions, a unit slope line, line segments joining the mean,
    and a filled in area of the standard deviations."
    """
    fig=plt.figure()
    df=pd.DataFrame(dict(actual=stars_actual, predicted=stars_predicted))
    ax=plt.scatter(df.actual, df.predicted, alpha=0.2, s=30, label="predicted")
    plt.ylim([ylow,yhigh])
    plt.plot([1,5],[1,5], label="slope 1")
    xp=[1,2,3,4,5]
    yp=df.groupby('actual').predicted.mean().values
    plt.plot(xp,yp,'k', label="means")
    sig=df.groupby('actual').predicted.std().values
    plt.fill_between(xp, yp - sig, yp + sig, 
                 color='k', alpha=0.2)
    plt.xlabel("actual")
    plt.ylabel("predicted")
    plt.legend(frameon=False)
    remove_border()
    plt.grid(False)
    plt.title(title)
    print np.mean(np.abs(df.predicted) < 15)

2.4 For each review in the data set, obtain a prediction from the entire dataframe smalldf. Use the function compare_results above to plot the predicted ratings against the observed ones. Make 4 such graphs, at k=3 and k=10, and for reg=3. and reg=15.

Note that this analysis is not strictly a model check because we are testing on the training set. However, since the user averages would change each time a cross-validation split was done on the set, we would incur the prohibitive expense of redoing the database each time. This would be better done on a cluster, using map-reduce or other techniques. While we explore map-reduce later in this homework, we shall not do any cross-validation.

Explain the results you get in the graphs in words.


In [55]:
#your code here
def plot_results(df, k,reg):
    user_id = df.user_id.values
    rest_id = df.business_id.values
    actual = df.stars.values
    predicted= np.zeros(len(actual))
    count =0
    for u_id, rest_id in zip(user_id, rest_id):
        predicted[count]= rating(df, db, rest_id, u_id, k, reg) 
        count+=1
    compare_results(actual, predicted)

In [56]:
#your code here
print "k=3, reg=3."
plot_results(smalldf,3,3.)
plt.title("k=3, reg=3.")

print "k=3, reg=15."
plot_results(smalldf,3,15.,)
plt.title("k=3, reg=15.")

print "k=10, reg=3."
plot_results(smalldf,10,3.)
plt.title("k=10, reg=3.")

print "k=10, reg=15."
plot_results(smalldf,10,15.,)
plt.title("k=10, reg=15.")


k=3, reg=3.
1.0
k=3, reg=15.
1.0
k=10, reg=3.
0.998540145985
k=10, reg=15.
0.997404703974
Out[56]:
<matplotlib.text.Text at 0x15a0adc90>

your answer here If you are a bit confused by the look of these graphs, you should be!

For k=3, the predicted values are quite well-behaved, with the exception of several predictions that have extremely large magnitudes. It appears that the predicted values are pulled into the mean star rating, which sits somewhere around 3.8, so ratings on the low end are overestimated, and similarly ratings on the high end underestimated. The regularization does not appear to have a strong effect when k = 3.

For k=10, the predicted values are much less stable, with many more extreme predictions. The means appear to track better with the true means. The regularization has a much more extreme, although indirect, effect on the appearance of the plot. Since regularization has stronger effects on similarity scores between restaurants that have small common support, we can see that increasing k makes the predictions more sensitive to the regularization because in a small dataset, the common support between a restaurant and it's 10-nearest one will be quite small.

Note that this example does not seem to follow the standard bias-variance tradeoff, where we would expect small k to give a unbiased estimates that capture the extremes, while we would expect large k to give biased estimates that pull extreme values toward the mean. A large reason for this failure for this example to capture this behavior is that we have defined similarity scores that can be positive or negative, and the bias-variance logic is based on the more standard setting where we average together values that have strictly positive weights. When you have negative weights, it's possible that the sum of sij's in a neighborhood can get close to zero, making our estimator Ŷ um explode in the positive or negative direction since this would entail dividing by (nearly) zero. Thus for those restaurants where the denominator goes to 0 (more likely to happen at larger k as you have more chances of it there being more weights to add), the ratings are unstable, even numerically!

This problem is less pronounced in large datasets or with small k because the k-nearest restaurants are likely to have positive similarity with the current one. However, with small datasets, we can find that even with k relatively small (in this case, around 10), there are negative similarities in the k-neighborhood that make the estimator unstable. This sort of instability would be much less pronounced in the large dataset.

If we were to rescale the similarities to be positive, say between 0 and 1, the behavior of the estimator with respect to k and reg would be quite different. (SEE BELOW!)

2.5 Outline a process, in words, for choosing the nearest neighbor parameter k. For this question fix the regularization parameter reg at 3.

your answer here

We could use $F$-fold cross-validation (usually called $k$-fold cross-validation, but we've already used $k$ here to mean something else!). Specifically, we could randomly partition the data into $F$ equally sized folds (pieces), making sure that each fold includes at least one rating for each restaurant and one rating by each user (it would probably best to make sure that if $K$ were the maximum $k$ you would be considering, that each user and each restaurant appear at least $K$ times in each fold). For each value of $k$, and for each fold, we could repeat the procedure above for predicting user ratings by computing similarities using $F-1$ of the folds to compute similarities and computing the prediction error in the held out fold. We could then choose the $k$ with the smallest average prediction error across the folds, and recompute the recommender on the whole dataset using the chosen value of $k$.

If we wanted to both choose a good value for $k$ and check how well we could expect the result to generalize, we could divide the dataset into $F+1$ folds, and keep the last fold out as a verification set. We could perform the cross-validation above on $F$ folds to select $k$, then upon selecting $k$ use all $F$ folds to create a recommender, and see how well this recommender predicted ratings in the final validation set.

Q3 Bayesian Chocolates: Model based recommendations

In this part of the homework, you will use your newly minted Bayesian and Gibbs sampler skills to write a recommender that uses Bayesian techniques to impute ratings.

Model-Based Recommendations

A Note on Frequentist and Bayesian Procedures

In the previous section we implemented a procedure (a set of instructions for processing data) for giving recommendations and predicting user ratings for restaurants. This procedure involved a number of arbitrary choices -- for example, the particular measure of similarity between restaurants, or the weighting scheme for constructing a predicted rating. It also gave no sense of uncertainty -- in the case of giving recommendations, there was no statement about how we would expect the ranking from the procedure to compare to the user's true opinions of restaurants, and in the case of predicting ratings, there was no confidence interval for the prediction.

It is possible in repeated applications of the above procedure to see how it performs in the long run. Based on this long-run performance we could potentially justify certain functional choices and compute measurements of uncertainty. This framework of proposing a procedure first, then evaluating its performance in real or hypothetical replications of the experiment is an example of a frequentist approach to a problem. One aspect of the frequentist approach is that the proposed procedure does not necessarily have to be derived from a model (although it often is). While this means that a proposed procedure may be more flexible or robust than a model-based procedure, it also means that there is no natural way to justify certain functional choices or construct uncertainty estimates.

In contrast, the Bayesian approach to a problem always begins with a probablistic model for how the data were generated. Assuming this model is true, the posterior distribution over unknown quantities (either parameters to be estimated or unobserved data to be predicted) gives a single coherent expression of what the observed data tell us about the unknowns. By summarizing the posterior distribution, we can derive the exact functional form of a procedure for constructing estimates or predictions. We call a procedure derived from this Bayesian approach a Bayes rule (not to be confused with Bayes' Theorem). Using the posterior distribution, we can also give a sense of how uncertain we are about the estimate or prediction we have constructed.

Outline for this Problem

In this section, we construct a model of how ratings are generated, and use this model to build a recommendation and ratings prediction system. We will take a Bayesian approach here, and construct our estimates and predictions from summaries of the posterior distribution of the model's parameters, which we will compute using a Gibbs sampler. We will also give measures of uncertainty based on the posterior distribution. We will evaluate predictions from this approach in the same way we evalutated predictions from the KNN procedure above.

The Latent Factor Model

Model Overview

The central dogma in constructing a recommendation system using collaborative filtering is that similar users will rate similar restaurants similarly. In the previous section, we explicitly encoded this idea by using a similarity function to identify similar restaurants. We also assumed that either all users were the same (the global approach) or that only the current user was similar enough to make a recommendation (the user-specific approach). In this section, we will use a model that allows us to identify both similar users and similar restaurants as a function of latent factors.

We can think of latent factors as properties of restaurants (e.g., spiciness of food or price) that users have a positive or negative preference for. We do not observe these factors or the users' preferences directly, but we assume that they affect how users tend to rate restaurants. For example, if a restaurant serves a lot of spicy food and a user dislikes spicy food, then the restaurant would have a high "spiciness" factor, and the user would have a strongly negative preference, resulting in a prediction of a low rating. Note that if users have similar preferences, then according to the model, they will behave similarly, and likewise, if restaurants have similar latent factors, they will be rated similarly by similar users. Latent factors thus give us an intuitive way to specify a generative model the obeys the central dogma.

One issue that comes up with latent factor models is determining how many latent factors to include. There may be a number of different unmeasured properties that affect ratings in different ways -- for example, in addition to the spiciness factor above, there may also be a price factor that affects how users rate a restaurant. We deal with the problem of choosing the number of latent factors to include in the same way we deal with choosing $K$ in a $K$-nearest neighbors problem.

Rating Model Specification

To make this model concrete, we can write down our probability model as a generative process. First, we define the following quantities:

Counts:

  • $L$: The number of latent factors.

  • $U$: The number of users.

  • $M$: The number of items (restaurants).

  • $N$: The number of observed ratings.

Data:

  • $Y_{um}$: The star rating given to restaurant $m$ by user $u$.
  • $Y$: The full collection of observed star ratings.

Item-specific quantities:

  • $\gamma_m$: An item-specific parameter vector of length $L+1$. The first element of $\gamma_m$, denoted $\gamma_m[0]$ is the item-specific bias. The remaining $L$ elements of $\gamma_m$, denoted $\gamma_m[1:]$, are the latent factors associated with item $m$.

  • $\Gamma$: An $M$ by $L+1$ matrix where the $m$th row is $\gamma_m$.

User-specific quantities:

  • $\theta_u$: A user-specific parameter vector of length $L+1$. The first element of $\theta_u$, denoted $\theta_u[0]$ is the user-specific bias. The remaining $L$ elements of $\theta_u$, denoted $\theta_u[1:]$, are user $u$'s preferences for the latent factors.

  • $\Theta$: A $U$ by $L+1$ matrix where the $u$th row is $\theta_u$.

Global quantities:

  • $\mu$: The overall ratings mean.

  • $\sigma$: The residual variance of ratings after the mean, bias terms, and latent factors have been taken into account.

Using these quantities, we can specify our model for each rating $Y_{um}$ similarly to a linear regression:

$$Y_{um} = \mu + \theta_{u}[0] + \gamma_{m}[0] + \theta_{u}[1:]^{\top}\gamma_{m}[1:] + \epsilon_{um}$$

where

$$\epsilon_{um} \sim N(0, \sigma).$$

Note that while this looks like a linear regression, it is of a slightly different form because the latent factor term involves the product of two unknowns. This is like a linear regression where we forgot to measure some covariates.

We also assume the following priors on the user-specific and item-specific parameters:

$$ \begin{align*} \gamma_m &\sim MVN(\mathbf 0, \Lambda_\gamma^{-1})\\ \theta_u &\sim MVN(\mathbf 0, \Lambda_\theta^{-1}), \end{align*} $$

where $MVN$ means multivariate normal, $\mathbf 0$ is vector of length $L+1$ filled with zeros, and $\Lambda_\theta^{-1}$ and $\Lambda_\gamma^{-1}$ are $L+1 \times L+1$ covariance matrices. $\mu$ and $\sigma$ also have priors, but they are not relevant to your task so we won't write them here.

Goal for this Model

Using this model, we want to make inference about all of the quantities that, if we knew them, would allow us to sample $Y_{um}$ for any user and any item. These quantities are $\mu$, $\sigma$, and the elements of $\Theta$ and $\Gamma$.

3.1: Given the goal specified above, how many quantities (counting a vector of $L$ items as $L$ quantities) are we trying to make inference about? Express your answer in terms of the variables in the "Counts" section above.

your answer here

There are $$1+1+ M\times (L+1)+ U\times (L+1)$$

Gibbs Sampling from the Posterior

Our goal is to compute the posterior distribution over the unknowns $\mu$, $\sigma$, $\Gamma$, and $\Theta$ given $Y$, which reflects how much we know about these quantities given the data we have observed. We write this distribution as $P(\mu, \sigma, \Gamma, \Theta \mid Y)$.

The most general way to learn about the posterior distribution is to sample from it. This can be challenging, particularly in problems that are very high dimensional (see your answer to the question above). One strategy for for sampling from high-dimensional distributions is Gibbs sampling, which we discussed in class and lab.

Gibbs sampling breaks down the posterior probability distribution into blocks of unknowns, and samples iteratively from each block assuming that the values of the other blocks (and the data) are known and fixed. In this case, we will break down the posterior distribution into blocks of $\mu$, $\sigma$, each vector $\gamma_m$, and each vector $\theta_u$. We have already implemented the draws for $\mu$ and $\sigma$. You will need to implement the draws for each $\gamma_m$ and each $\theta_u$. Luckily, the structures of these draws are similar, so you will only need to implement two functions.

First, we'll derive the form of the draws below. Note that you don't need to be able to follow these derivations fully -- you'll just need to be able to use the result at the end.

Distribution of $\gamma_{m'}$ given $Y, \mu, \sigma, \Gamma_{-m'}, \Theta$

Intuitively, this is the distribution of the item-specific parameters for item $m'$, imagining that all of the other unknowns are fixed.

More precisely, we want to draw from the distribution of $\gamma_{m'}$ conditional on the data $Y$ and all other unknowns -- that is, $\mu$, $\sigma$, all of $\Theta$, and all of $\Gamma$ except for $\gamma_{m'}$, which we denote $\Gamma_{-m}$.

Note that in the model specification above, the only places that $\gamma_{m'}$ appears are in the regression equations for each $Y_{um}$ that involves item $m'$. If we write out just these equations, we get a system of the following form,

$$Y_{um'} = \mu + \theta_{u}[0] + \gamma_{m'}[0] + \theta_{u}[1:]^{\top}\gamma_{m'}[1:] + \epsilon_{um'},$$

with one equation for each $u$ that rated item $m'$. Now, because

If we move all of the fully known terms to the left-hand side, we obtain the system:

$$Y_{um'} - \mu - \theta_{u}[0] = \gamma_{m'}[0] + \theta_{u}[1:]^{\top}\gamma_{m'}[1:] + \epsilon_{um'}.$$

Notice that, because we assume that $\theta_{u}$ is known, this equation now fits cleanly into the form of a linear regression, where $\gamma_{m'}$ is the vector of unknown coefficients. This means that the posterior distribution for $\gamma_{m'}$ conditional on everything else is the same as the posterior for the coefficients of a Bayesian linear regression of $(Y_{um'} - \mu - \theta_{u}[0])$ on $\theta_{u}[1:]$ and an intercept.

Let's denote the set of users who rated item $m'$ as $(u_1, \cdots, u_g)$. Then, we can define the following vector and matrix:

\begin{align*} Y_{m'} = \left(\begin{array}{c} Y_{u_1m'}-\mu-\theta_{u_1}[0]\\ \vdots \\ Y_{u_gm'}-\mu-\theta_{u_g}[0]\end{array}\right), \qquad X_{m'} &= \left(\begin{array}{cc} 1 & \theta_{u_1}[1:]^\top \\ \vdots & \vdots \\ 1 & \theta_{u_g}[1:]^\top\end{array}\right), \end{align*}

where $Y_{m'}$ is a vector of length $g$ and $X_{m'}$ is a $g \times L+1$ matrix.

The draw from $\gamma_{m'}$ given everything else then has the form: $$ \gamma_{m'} \mid Y, \mu, \sigma, \Gamma_{-m'}, \Theta \sim MVN\left(Q_{m'}^{-1} \frac{1}{\sigma^2}X_{m'}^\top Y_{m'}, Q_{m'}^{-1}\right)$$ where $$ Q_{m'} = \left(\frac{1}{\sigma^2}X_{m'}^\top X_{m'} + \Lambda_\gamma\right).$$

Distribution of $\theta_{u'}$ given $Y, \mu, \sigma, \Gamma, \Theta_{-u'}$

Intuitively, this is the distribution of the user-specific parameters for user $u'$, imagining that all of the other unknowns are fixed.

We can use a very similar argument to the one above. We can denote the set of items rated by user $u'$ as $(m_1, \cdots, m_g)$ and define the vector and matrix: \begin{align*} Y_{u'} = \left(\begin{array}{c} Y_{u'm_1}-\mu-\gamma_{m_1}[0] \\ \vdots \\ Y_{u'm_g}-\mu-\gamma_{m_g}[0]\end{array}\right), \qquad X_{u'} &= \left(\begin{array}{cc} 1 & \gamma_{m_1}[1:]^\top \\ \vdots & \vdots \\ 1 & \gamma_{m_g}[1:]^\top\end{array}\right), \end{align*}

where $Y_{u'}$ is a vector of length $g$ and $X_{u'}$ is a $g \times L+1$ matrix.

the draw from $\theta_{u'}$ given everything else has the form: $$ \theta_{u'} \mid Y, \mu, \sigma, \Gamma, \Theta_{-u'} \sim MVN\left(Q_{u'}^{-1} \frac{1}{\sigma^2}X_{u'}^\top Y_{u'}, Q_{u'}^{-1}\right)$$ where $$ Q_{u'}= \left(\frac{1}{\sigma^2}X_{u'}^\top X_{u'} + \Lambda_\theta\right).$$

3.2 We will only ask you to implement a tiny portion of the Gibbs sampler. Complete the following functions that implement the conditional posterior draws for $\gamma_m$ and $\theta_u$ derived above.

Hint: np.random.multivariate_normal is a good function to know.


In [370]:
"""
Function
--------
gamma_m_draw

Draw a single sample from the conditional posterior distribution
of gamma_m.

Inputs
-------
X_m: A g-by-L+1 matrix, defined above. 
Y_m: A 1D vector of length g, defined above.
sig2: Residual _variance_, as defined above.
Lambda_gamma: Prior precision matrix.

Outputs
--------
Single draw from conditional posterior, defined above.
"""
#Item-specific parameters given all else
#your code here
def gamma_m_draw(X_m,Y_m,sig2,Lambda_gamma):
    Q_m_inv = np.linalg.inv(np.dot(X_m.T,X_m)/sig2 + Lambda_gamma )
    XtY = np.dot(X_m.T, Y_m)
    return np.random.multivariate_normal(np.dot(Q_m_inv,XtY)/sig2 , Q_m_inv)

In [371]:
"""
Function
--------
theta_u_draw

Draw a single sample from the conditional posterior distribution
of gamma_m.

Inputs
-------
X_u: A g-by-L+1 matrix, defined above. 
Y_u: A 1D vector of length g, defined above.
sig2: Residual _variance_, as defined above.
Lambda_theta: Prior precision matrix.

Outputs
--------
Single draw from conditional posterior, defined above.
"""
#User-specific parameters given all else
#your code here
def theta_u_draw(X_u,Y_u,sig2,Lambda_theta):
    Q_u_inv = np.linalg.inv(np.dot(X_u.T,X_u)/sig2 + Lambda_theta)
    XtY = np.dot(X_u.T, Y_u)
    return np.random.multivariate_normal(np.dot(Q_u_inv,XtY)/sig2 , Q_u_inv)

Here is the Gibbs sampler skeleton that your functions fit into. Look over the structure to see how for each draw from the posterior, the sampler iterates through $\mu$, $\sigma$, $\gamma_m$ for each item, and $\theta_u$ for each user.


In [372]:
"""
Function
--------
factor_gibbs

Runs a gibbs sampler to infer mean, variance, user-specific, and item-specific
parameters.

Inputs
-------
data: A dataframe containing ratings data.
L: Dimension of latent factors.
maxit: Number of samples to draw from posterior.
Lambda_theta_diag: Hyperparameter controlling regularization of Theta.
Lambda_gamma_diag: Hyperparameter controlling regularization of Gamma.
progress: if true, print iteration number every 100 iterations.

Outputs
--------
Dictionary with elements
mu: Draws of mu. 1D array of length maxiter.
sig2: Draws of sig2, residual _variance_. 1D array of length maxiter.
theta: Draws of Theta. U-by-L-by-maxiter array.
gamma: Draws of Gamma. M-by-L-by-maxiter array.
EY: Draws of fitted values of Y. N-by-maxiter array.
"""
def factor_gibbs(data, L, maxit, Lambda_theta_diag, Lambda_gamma_diag, progress=True):
    data = data.copy()
    N = data.shape[0]

    #Create indices that allow us to map users and restaurants to rows
    #in parameter vectors.
    uusers, uidx = np.unique(data.user_id, return_inverse=True)
    uitems, midx = np.unique(data.business_id, return_inverse=True)

    nusers = uusers.size
    nitems = uitems.size

    #Add numerical indices to dataframe.
    data["uidx"] = uidx
    data["midx"] = midx

    #Group observations by user and by business.
    ugroups = data.groupby("uidx")
    mgroups = data.groupby("midx")

    all_avg = data.stars.mean()
    u_avg = ugroups.stars.mean()
    m_avg = mgroups.stars.mean()

    #Initialize parameters and set up data structures for
    #holding draws.
    #Overall mean
    mu = all_avg
    mu_draws = np.zeros(maxit)
    #Residual variance
    sig2 = 0.5
    sig2_draws = np.zeros(maxit)

    #Matrix of user-specific bias and L latent factors.
    theta = np.zeros([nusers, L+1])
    theta[:,0] = u_avg-all_avg
    theta_draws = np.zeros([nusers, L+1, maxit])

    #Matrix of item-specific bias and L latent factors.
    gamma = np.zeros([nitems, L+1])
    gamma[:,0] = m_avg-all_avg
    gamma_draws = np.zeros([nitems, L+1, maxit])

    #Matrix for holding the expected number of stars
    #for each observation at each draw from the posterior.
    EY_draws = np.zeros([data.shape[0], maxit])

    #Inverse covariance matrices from the prior on each theta_u
    #and gamma_b. These are diagonal, like Ridge regression.
    Lambda_theta = np.eye(L+1)*Lambda_theta_diag
    Lambda_gamma = np.eye(L+1)*Lambda_gamma_diag

    #Main sampler code
    for i in range(maxit):
        if i%100==0 and progress:
            print i

        #The entire regression equation except for the overall mean.
        nomu = np.sum(theta[data.uidx,1:]*gamma[data.midx,1:], axis=1) +\
                  theta[data.uidx,0] + gamma[data.midx,0]

        #Compute the expectation of each observation given the current
        #parameter values.
        EY_draws[:,i]=mu+nomu

        #Draw overall mean from a normal distribution
        mu = np.random.normal(np.mean(data.stars-nomu), np.sqrt(sig2/N))
        #Draw overall residual variance from a scaled inverse-Chi squared distribution.
        sig2 = np.sum(np.power(data.stars-nomu-mu,2))/np.random.chisquare(N-2)
        
        #For each item
        for mi,itemdf in mgroups:
            #Gather relevant observations, and subtract out overall mean and
            #user-specific biases, which we are holding fixed.
            Y_m = itemdf.stars-mu-theta[itemdf.uidx,0]
            #Build the regression design matrix implied by holding user factors
            #fixed.
            X_m = np.hstack((np.ones([itemdf.shape[0],1]),
                             theta[itemdf.uidx,1:]))
            gamma[mi,:] = gamma_m_draw(X_m, Y_m, sig2, Lambda_gamma)
            
        #For each user
        for ui,userdf in ugroups:
            #Gather relevant observations, and subtract out overall mean and
            #business-specific biases, which we are holding fixed.
            Y_u = userdf.stars-mu-gamma[userdf.midx,0]
            #Build the regression design matrix implied by holding business factors
            #fixed.
            X_u = np.hstack((np.ones([userdf.shape[0],1]),
                             gamma[userdf.midx,1:]))
            
            theta[ui,:] = theta_u_draw(X_u, Y_u, sig2, Lambda_theta)

        #Record draws
        mu_draws[i] = mu
        sig2_draws[i] = sig2
        theta_draws[:,:,i] = theta
        gamma_draws[:,:,i] = gamma

    return {"mu": mu_draws, "sig2": sig2_draws,
            "theta": theta_draws, "gamma": gamma_draws,
            "EY": EY_draws}

Posterior Summaries

Once you have posterior draws from the sampler, the most natural thing to do is to compute the posterior mean of each quantity you are intersted in. To do this, we simply need to take the average value of each quantity across the samples drawn from the sampler. Before taking the average, however, we will want to ignore the first 20-30% of samples because these correspond the burnin period, the time during which the sampler is still looking for the main meat of the distribution.

Ok it's time to recommend!

3.3 Now that you have the Gibbs sampler, draw 1000 samples from the posterior distribution using a two-dimensional latent factor and prior precisions Lambda_theta_diag and Lambda_gamma_diag both equal to 0.1.

Compute the posterior mean of the fitted values for each $Y_{um}$, eliminating the first 200 samples. Call these the prediction. These constitute our recommendations. True to the bayesian paradigm, we dont just have mean predictions, but entire distributions. But currently we are only interested in the means.


In [373]:
#your code here
gibbs_sample = factor_gibbs(smalldf, 2, 1000, 0.1, 0.1, progress=True)


0
100
200
300
400
500
600
700
800
900

In [393]:
burnin = 200
prediction =np.mean(gibbs_sample["EY"][:,burnin:],axis = 1)

Plot the predictions against the observed data.You can use the compare_results function defined in the previous section. How do the fitted values compare to those from the KNN procedure?


In [390]:
#your code here

compare_results(smalldf.stars.values, prediction, ylow=1, yhigh=5, title="From Gibbs Sampler")


1.0

your answer here

The results from the latent factor model appear to be better-behaved than those from the KNN procedure (with negative similarities) since there are no extreme values. In terms of the non-extreme values from the KNN procedure, the latent factor model appears to make similar predictions to the KNN procedure when k = 3., Specifically, the average prediction line for each class is too compressed toward the total data mean of around 3.8, meaning that we are in the bias limit where we are pooling too much information between ratings. Thus, we are overpredicting low ratings and underpredicting high ratings.

(If we compare to the KNN procedure with strictly positive weights, as in the homework addendum, we see again that the recommenders make comparable predictions, although the latent factor model again appears to fit slightly better than at both the bias or variance limits of the KNN procedure).

There is also a bias-variance tradeoff here. In this case, it appears that we have proposed a model that is too simple (thus close to the bias limit) because of how the ratings are pulled in toward the data mean. In this case, proposing more latent factors increases the flexibility of the model, and thus moves us toward the variance limit. Thus, the plot suggests that we may want to reduce the bias at the cost of increasing variance, for example by considering a latent factor model with more factors (say, L=15) to obtain a better fit (see ADDENDUM below).


In [394]:
gibbs_out = factor_gibbs(smalldf, 15, 1000, 0.1, 0.1)
burnin = 200
predicted=np.mean(gibbs_out['EY'][:,burnin:], axis=1)
compare_results(smalldf.stars.values, predicted, ylow=1, yhigh=5, title="From Gibbs Sampler")


0
100
200
300
400
500
600
700
800
900
1.0

Q4 Scaling Up

All our recommenders suffer from problems having to do with the fact that we subsetted an already sparse user-item matrix. The more items we have, the more items we may find in the vicinity of a given item, and thus we are likely to give a more robust average rating to the given item.

In this problem we shall use Amazon Elastic Map-Reduce to tackle the entire user-restaurant matrix. We shall do this in two parts: we'll use MRJob locally on your machine to on the smaller data set to calclate the pearson database, and then we'll tackle the entire data set on Amazon.

The larger set has 35000 users and 4500 items. Computing the 4500X4500 similarity matrix on one machine will be prohibitively expensive. Thus we'll adopt a strategy where we'll split the calculation over multiple machines using the map-reduce paradigm, with mappers and reducers working on multiple machines

Then we calculate the k-nearest neighbors in the 'space' of the user: this involves a database lookup and an iteration over the items a user has rated. Since the latter is usually not a very large number, this computation can be managed on a front end machine (even if storing the database will take a lot of memory).

We'll first create subset data frames, which have just those columns which we will send to the map-reduce. We'll also strip out the header and index of the frame. The reason for doing this is: unless we pre-populate the machines on Amazon with software, we can rely only on the regular python library, numpy, and scipy being there (and at python 2.6), and thus we will need to parse the csv file, line by line (mrjob uses hadoop's stream protocol and thus needs to be fed line by line).


In [6]:
subsetoffull=fulldf[['user_id','business_id', 'stars','business_avg','user_avg']]
subsetoffull.to_csv("subset-full.csv", index=False, header=False)
subsetofsmall=smalldf[['user_id','business_id', 'stars','business_avg','user_avg']]
subsetofsmall.to_csv("subset-small.csv", index=False, header=False)

Running mrjob locally

mrjob scripts cannot be run from the ipython notebook, as they fork themselves on execution. Thus you must write the code for mrjob in a separate file which you must submit along with this homework, in the same folder as the python notebook file.

If you have not done so already (you were supposed to do this as part of HW 0), you will first need to install mrjob. The appropriate equivalent of the following incantation should do the job:

~/anaconda/bin/pip install mrjob



To familiarize yourself with the structure of an mrjob script, please read this . Run the examples in that document to familiarize yourself with mrjob.

The kind of script you will be writing is in the section "Writing your second job" in that document.

All mrjob tasks use the map-reduce strategy to divide up computation across computers. You should work through the mrjob tutorial to gain familiarity with this, but we’ll also outline the basic process here:

  1. During the first map step, mrjob calls a mapper function with a key (which for the first step is None), and a value (which for the first step is a line of data from an input file). This function does whatever it wants with this data, and yields a key and value. The key is used in step 2 to gather up the values from all the different mappers into groups

  2. mrjob collects the outputs from all the mappers, and gathers them into subsets with the same key value (this is similar to what pandas.groupby does). It passes each of these subsets to a reducer (or “collector”) function, whose job is to synthesize this list of grouped data into something useful (e.g., computing the mean of all the inputs). It then yields the key and reduced value.

  3. If there are any additional steps, mrjob feeds each output from a reducer function in step 2 to the next mapper. Otherwise, it prints the output.

The point behind map-reduce is to agree upon a common framework to split up a large computational job into smaller tasks. mrjob then has a lot of freedom to organize how these tasks run in parallel, on many machines

Writing your script

4.1 Write a MRJOB script, called computesim.py. The object of this script is to take a csv file and return a tuple (rho, n_common) as calculate_similarity for pairs of restaurants. See skeleton.py below for the SPEC of this file. Your job is to fill in those methods. You MUST use this skeleton.

This script is to be run like so (substitute your own operating system's call):

~/anaconda/bin/python computesim.py subset-small.csv > output.small.local.txt

Thus, when the script below is run in this fashion, mrjob will read the data line-by-line from subset-small.csv, and pass it to the first "step".

Algorithm to calculate pearson similarities

Here is the description of the algorithm for RestaurantSimilarities.

Your code will have two steps. Each step will have a mapper and a reducer. These are described in turn here:

  1. line_mapper will split the line, yielding the user_id as key, and the rest as value. This method's implementation is provided for you.

  2. users_items_collector is a reducer. It is passed ALL mapper outputs corresponding to a particular user_id. Put these emissions into a list, and re-emit the user_id with this list.

  3. pair_items_mapper takes the user_id and the list. It dosent do anything with the user_id, however, it takes every combination (thus len(list) choose 2) of 2 business_ids from the passed on list (see combinations in itertools in the python documentation) and sends on the remaining information keyed on the tuple (restaurant1, restaurant2). Be sure to handle the case where the restaurant id's are flipped: include them somehow under the same key.

  4. calc_sim_collector is passed ALL sent on list information for the pair of restaurants that was emitted in the previous step. Note that thse will come from different user_ids. This sort of collection is key to this style of programming. This list information should now correspond to all the common support of the two restaurants. Use this information to calculate this common support and the pearson similarity. Return the aforementioned tuple by yielding it keyed by the tuple of restaurants. This information will be sent to the output file. The output keys and values will both be in JSON format, separated by a tab.

The output should be saved in a file via redirection as output.small.local.txt

Skeleton File for this problem

You ca access it here or just run the next cell to see it.


In [396]:
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from IPython.display import HTML
import urllib
skelcode = urllib.urlopen("https://raw.github.com/cs109/content/master/skeleton.py").read()
skelhtml=highlight(skelcode, PythonLexer(), HtmlFormatter())
HTML(skelhtml)


Out[396]:
import numpy as np

from mrjob.job import MRJob
from itertools import combinations, permutations

from scipy.stats.stats import pearsonr


class RestaurantSimilarities(MRJob):

    def steps(self):
        "the steps in the map-reduce process"
        thesteps = [
            self.mr(mapper=self.line_mapper, reducer=self.users_items_collector),
            self.mr(mapper=self.pair_items_mapper, reducer=self.calc_sim_collector)
        ]
        return thesteps

    def line_mapper(self,_,line):
        "this is the complete implementation"
        user_id,business_id,stars,business_avg,user_avg=line.split(',')
        yield user_id, (business_id,stars,business_avg,user_avg)


    def users_items_collector(self, user_id, values):
        """
        #iterate over the list of tuples yielded in the previous mapper
        #and append them to an array of rating information
        """
        pass


    def pair_items_mapper(self, user_id, values):
        """
        ignoring the user_id key, take all combinations of business pairs
        and yield as key the pair id, and as value the pair rating information
        """
	   pass #your code here

    def calc_sim_collector(self, key, values):
        """
        Pick up the information from the previous yield as shown. Compute
        the pearson correlation and yield the final information as in the
        last line here.
        """
        (rest1, rest2), common_ratings = key, values
	    #your code here
        yield (rest1, rest2), (rho, n_common)


#Below MUST be there for things to work
if __name__ == '__main__':
    RestaurantSimilarities.run()

Explanation for those funny yield keywords

The functions above “yield” values, and do not “return” them. They are generators. Here is an example:


In [399]:
def upper_generator(words):
    for word in words:
        yield word.upper()

words = ['a', 'couple', 'of', 'words', 'to', 'process']

print upper_generator(words)
print list(upper_generator(words))
for u in upper_generator(words):
     print u


<generator object upper_generator at 0x113857e10>
['A', 'COUPLE', 'OF', 'WORDS', 'TO', 'PROCESS']
A
COUPLE
OF
WORDS
TO
PROCESS

You can read more here. Also see Thu Oct 17th's class video for information about classes and generators.

Include computesim.py in your submission in the same folder as the notebook. Uncommenting and running the following cell should output your code in here.


In [ ]:
#thecode = open("computesim.py").read()
#thehtml=highlight(thecode, PythonLexer(), HtmlFormatter())
#HTML(thehtml)

Checking the results

Let us load the data from the file


In [403]:
output_small_local=[[json.loads(j) for j in line.strip().split("\t")] for line in open("./output.small.local.txt")]
output_small_local[0]


Out[403]:
[[u'-4A5xmN21zi_TXnUESauUQ', u'-AAig9FG0s8gYE4f8GfowQ'],
 [0.41420943050302544, 5]]

We will Implement a function make_database_from_pairs which takes a dataframe of restaurants smalldf and the output parsed in the previous command to create the database like before. By the nature of the map-reduce algorithms these only contain those restaurant pairs with common support. The Database constructor initializes the remaining similarities to 0.

The function will take the dataframe and bizpairs obtained by parsing the EMR output file which have the key of business pairs and value the pair of pearson correlation and n_common. It will return an instance of the Database class.

This function will take a long time to run on large data sets.


In [46]:
def make_database_from_pairs(df, bizpairs):
    """
    make the database from the pairs returned from mrjob.
    df is the dataframe, smalldf or fulldf.
    bizpairs are a list of elements, each of which is a list of two
        lists. The first of these lists has the two business id's, while
        the second has the similarity and the common support
    Returns an instance of the Database class.
    """
    dbase=Database(df)
    cache={}
    for bp,corrs in bizpairs:
        b1,b2=bp
        i1=dbase.uniquebizids[b1]
        i2=dbase.uniquebizids[b2]
        sim,nsup=corrs
        dbase.database_sim[i1][i2]=sim
        dbase.database_sim[i2][i1]=sim
        dbase.database_sup[i1][i2]=nsup
        dbase.database_sup[i2][i1]=nsup
        if cache.has_key(b1):
            nsup1=cache[b1]
        else:
            nsup1=dbase.df[dbase.df.business_id==b1].user_id.count()
            cache[b1]=nsup1
        if cache.has_key(b2):
            nsup2=cache[b2]
        else:
            nsup2=dbase.df[dbase.df.business_id==b2].user_id.count()
            cache[b2]=nsup2
        dbase.database_sim[i1][i1]=1.0
        dbase.database_sim[i2][i2]=1.0
        dbase.database_sup[i1][i1]=nsup1
        dbase.database_sup[i2][i2]=nsup2
    return dbase

We will store the output in variable db_mrjob_local.


In [405]:
db_mrjob_local=make_database_from_pairs(smalldf, output_small_local)

We print a pair to see that our answers are identical.


In [406]:
print db.get("zruUQvFySeXyEd7_rQixBg", "z3yFuLVrmH-3RJruPEMYKw")
print db_mrjob_local.get("zruUQvFySeXyEd7_rQixBg", "z3yFuLVrmH-3RJruPEMYKw")


(0.34844426604207018, 7)
(0.34844426604467255, 7)

4.2 Lets test that our results are overall the same as before


In [407]:
sums=0.
count=0
for k in db.uniquebizids.keys():
    for k2 in db.uniquebizids.keys():
        count=count+1
        sums=sums+db.get(k,k2)[0]-db_mrjob_local.get(k,k2)[0]
print sums, count


-3.63036600781e-10 29584

Running on Amazon Elastic Map Reduce(EMR)

At this point, we shall shift to running on Amazon EMR.

Follow the instruction below (Successfully ran by Manoj)

  1. Make sure the AWS account is opened and the Key file (CS109.pem in this case) is generated.
  2. Run chmod og-rwx /Users/apple/Dropbox/DataScience/ML/Admin/CS109.pem so that ssh will be happy
  3. Create an mr config file 'mrjob.config' in /etc (Go to folder /etc) with the below contents:
    runners:
    emr:
     aws_access_key_id:AKIAJ3NXQU4EPLLSE6CQ
     aws_secret_access_key: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
     ec2_key_pair: CS109
     ec2_key_pair_file: /Users/apple/Dropbox/DataScience/ML/Admin/CS109.pem # ~/ and $$ENV_VARS allowed here
     ssh_tunnel: true
  4. Create boto config file (touch ~/.boto, vi ~/.boto) with the following contents:
    [Credentials]
    aws_access_key_id:AKIAJ3NXQU4EPLLSE6CQ
    aws_secret_access_key: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
  5. Run
    ~/anaconda/bin/python computesim.py -r emr --num-ec2-instances 2 subset-small.csv > output.small.emr.txt

See the log trail upon running the above:

No configs found; falling back on auto-configuration
num_ec2_instances is deprecated; set num_core_instances to 1 instead
Auto-created temp S3 bucket mrjob-5179031e01afd90b
Using s3://mrjob-5179031e01afd90b/tmp/ as our temp dir on S3
Creating temp directory /var/folders/x5/650ndbx116g04pzh2xwxtrp40000gn/T/computesim.apple.20160919.043438.080233
Copying local files to s3://mrjob-5179031e01afd90b/tmp/computesim.apple.20160919.043438.080233/files/...
Auto-created instance profile mrjob-604f15e493b94887
Auto-created service role mrjob-0de5ff0e438b1170
Created new cluster j-NI45QLK9ZDGD
Waiting for step 1 of 2 (s-Z0UTL6J582TX) to complete...
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING)
  PENDING (cluster is STARTING: Configuring cluster software)
  PENDING (cluster is STARTING: Configuring cluster software)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  PENDING (cluster is BOOTSTRAPPING: Running bootstrap actions)
  RUNNING for 8.1s
  RUNNING for 39.7s
  RUNNING for 71.1s
  RUNNING for 102.6s
  COMPLETED
Attempting to fetch counters from logs...
Waiting 10 minutes for logs to transfer to S3... (ctrl-c to skip)

To fetch logs immediately next time, set up SSH. See:
https://pythonhosted.org/mrjob/guides/emr-quickstart.html#configuring-ssh-credentials

Looking for step log in s3://mrjob-5179031e01afd90b/tmp/logs/j-NI45QLK9ZDGD/steps/s-Z0UTL6J582TX...
  Parsing step log: s3://mrjob-5179031e01afd90b/tmp/logs/j-NI45QLK9ZDGD/steps/s-Z0UTL6J582TX/syslog.gz
Counters: 54
    File Input Format Counters 
        Bytes Read=472919
    File Output Format Counters 
        Bytes Written=323603
    File System Counters
        FILE: Number of bytes read=161546
        FILE: Number of bytes written=607833
        FILE: Number of large read operations=0
        FILE: Number of read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=292
        HDFS: Number of bytes written=323603
        HDFS: Number of large read operations=0
        HDFS: Number of read operations=7
        HDFS: Number of write operations=2
        S3: Number of bytes read=472919
        S3: Number of bytes written=0
        S3: Number of large read operations=0
        S3: Number of read operations=0
        S3: Number of write operations=0
    Job Counters 
        Data-local map tasks=2
        Launched map tasks=2
        Launched reduce tasks=1
        Total megabyte-seconds taken by all map tasks=33032448
        Total megabyte-seconds taken by all reduce tasks=12291072
        Total time spent by all map tasks (ms)=43011
        Total time spent by all maps in occupied slots (ms)=129033
        Total time spent by all reduce tasks (ms)=12003
        Total time spent by all reduces in occupied slots (ms)=48012
        Total vcore-seconds taken by all map tasks=43011
        Total vcore-seconds taken by all reduce tasks=12003
    Map-Reduce Framework
        CPU time spent (ms)=14890
        Combine input records=0
        Combine output records=0
        Failed Shuffles=0
        GC time elapsed (ms)=954
        Input split bytes=292
        Map input records=6165
        Map output bytes=553754
        Map output materialized bytes=135599
        Map output records=6165
        Merged Map outputs=2
        Physical memory (bytes) snapshot=899678208
        Reduce input groups=240
        Reduce input records=6165
        Reduce output records=240
        Reduce shuffle bytes=135599
        Shuffled Maps =2
        Spilled Records=12330
        Total committed heap usage (bytes)=598155264
        Virtual memory (bytes) snapshot=3936567296
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
Waiting for step 2 of 2 (s-1CBLWBN4DEEWY) to complete...
  COMPLETED
Attempting to fetch counters from logs...
Looking for step log in s3://mrjob-5179031e01afd90b/tmp/logs/j-NI45QLK9ZDGD/steps/s-1CBLWBN4DEEWY...
  Parsing step log: s3://mrjob-5179031e01afd90b/tmp/logs/j-NI45QLK9ZDGD/steps/s-1CBLWBN4DEEWY/syslog.gz
Counters: 54
    File Input Format Counters 
        Bytes Read=358410
    File Output Format Counters 
        Bytes Written=1096478
    File System Counters
        FILE: Number of bytes read=1863884
        FILE: Number of bytes written=4177449
        FILE: Number of large read operations=0
        FILE: Number of read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=358720
        HDFS: Number of bytes written=0
        HDFS: Number of large read operations=0
        HDFS: Number of read operations=4
        HDFS: Number of write operations=0
        S3: Number of bytes read=0
        S3: Number of bytes written=1096478
        S3: Number of large read operations=0
        S3: Number of read operations=0
        S3: Number of write operations=0
    Job Counters 
        Data-local map tasks=2
        Launched map tasks=2
        Launched reduce tasks=1
        Total megabyte-seconds taken by all map tasks=27780096
        Total megabyte-seconds taken by all reduce tasks=25753600
        Total time spent by all map tasks (ms)=36172
        Total time spent by all maps in occupied slots (ms)=108516
        Total time spent by all reduce tasks (ms)=25150
        Total time spent by all reduces in occupied slots (ms)=100600
        Total vcore-seconds taken by all map tasks=36172
        Total vcore-seconds taken by all reduce tasks=25150
    Map-Reduce Framework
        CPU time spent (ms)=12680
        Combine input records=0
        Combine output records=0
        Failed Shuffles=0
        GC time elapsed (ms)=1149
        Input split bytes=310
        Map input records=240
        Map output bytes=10193826
        Map output materialized bytes=2002985
        Map output records=100689
        Merged Map outputs=2
        Physical memory (bytes) snapshot=902324224
        Reduce input groups=14348
        Reduce input records=100689
        Reduce output records=14348
        Reduce shuffle bytes=2002985
        Shuffled Maps =2
        Spilled Records=201378
        Total committed heap usage (bytes)=598155264
        Virtual memory (bytes) snapshot=3940614144
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
Streaming final output from s3://mrjob-5179031e01afd90b/tmp/computesim.apple.20160919.043438.080233/output/...
Removing s3 temp directory s3://mrjob-5179031e01afd90b/tmp/computesim.apple.20160919.043438.080233/...
Removing temp directory /var/folders/x5/650ndbx116g04pzh2xwxtrp40000gn/T/computesim.apple.20160919.043438.080233...
Removing log files in s3://mrjob-5179031e01afd90b/tmp/logs/j-NI45QLK9ZDGD/...
Terminating cluster: j-NI45QLK9ZDGD
xpm-mac-mini:content apple$$
  1. Run the script on the larger file subset-full.csv. Use between 4-8 instances on EMR on Amazon. Save the output in output.full.emr.txt. Your incantation will be something like:(By default, AWS assigns to region West Coast (Oregon). So clusters won't be seen under Singapore unless chosen explicitly)
    ~/anaconda/bin/python computesim.py -r emr --num-ec2-instances 5 subset-full.csv > output.full.emr.txt

[HW4 instructions]


Read this document for instructions on how to set yourself up on Amazon.


Reproduce the results with the smaller file on EMR

Test the smaller file and make sure it has the same results. For example, you could use the incantation:

~/anaconda/bin/python computesim.py -r emr --num-ec2-instances 2 subset-small.csv > output.small.emr.txt

You do NOT need to submit any results from that exploration to us.

Important: Please always make sure that your code is bug free, before actually submitting it to amazon. Try to run the job locally first and see if it produces the desired result. Then, if this worked, you are ready to proceed to the cloud. The homework problems are small and your free credit should provide you with a lot of room for running and testing on Amazon. However, it is your responsibility to make sure the jobs terminate properly and do not cause excessive costs.

You can always monitor your currently running jobs (in the US-East sector) using this overview at region US-EAST-1 of your MapReduce job flows.

Running the larger job

4.3 Run the script on the larger file subset-full.csv. Use between 4-8 instances on EMR on Amazon. Save the output in output.full.emr.txt. Your incantation will be something like:

~/anaconda/bin/python computesim.py -r emr --num-ec2-instances 5 subset-full.csv > output.full.emr.txt

You might elect to save the file on S3 and bring it over manually.

Try and think about what size job would be best to run on Amazon, given that there is a setup time. There is a way to persistently set up machines (the mrjob documentation provides the details), but then remember you will be billed for that setup and need to monitor it. However, a persistent setup might come useful for your projects.

Loading the full output from EMR

Lets load the output in. CAUTION The next two cells will also take a lot of time to run and load.


In [4]:
output_full_emr=[[json.loads(j) for j in l.strip().split("\t")] for l in open("./output.full.emr.txt")]

This function will take a very long time to run, on the order of 5 minutes or more, depending on your computer


In [47]:
dbfull=make_database_from_pairs(fulldf, output_full_emr)

4.4 For testuserid, once again, print out the ratings using the bizs list as before. How have they changed with respect to Question 2? Why might this be?


In [57]:
#your code here
#your code here
print "for user",usernamefromid(fulldf,testuserid), 'avg', fulldf[fulldf.user_id==testuserid].stars.mean() 
for i in bizs:
    print "========="
    print biznamefromid(fulldf, i), i
    print rating(fulldf, dbfull, i, testuserid, k=7, reg=3.) 
    u,a=get_other_ratings(i, testuserid, fulldf)
    print "User Score:",u,"Avg score",a


for user Vern avg 3.58227848101
=========
Tee Pee Mexican Food soiGohHtWOltGeomkSxzEw
4.16222806092
User Score: 5 Avg score 3.13636363636
=========
Local Breeze GAPqG0WNBBidKeZTMpEZ-w
4.81062901451
User Score: 5 Avg score 3.86363636364
=========
Carly's Bistro zmFc8M-hS4uuyY0hklIpoQ
4.79797424984
User Score: 5 Avg score 3.65079365079
=========
District American Kitchen and Wine Bar 9ziO3NpoNTKHvIKCBFB_fQ
4.25508726374
User Score: 4 Avg score 3.575
=========
Sonora Mesquite Grill KGX7O-_WqOIy9o7u9NOa9A
3.93085441638
User Score: 4 Avg score 4.51948051948

your answer here

Copy the smalldf ratings below for easy comparison:


In [58]:
print "for user",usernamefromid(smalldf,testuserid), 'avg', smalldf[smalldf.user_id==testuserid].stars.mean() 
for biz_id in bizs:
    print "----------------------------------"
    print biznamefromid(smalldf, biz_id)
    print "Predicted Rating:",rating(smalldf, db, biz_id, testuserid, k=7, reg=3.) 
    u,a=get_other_ratings(biz_id, testuserid, smalldf)
    print "Actual User Rating:",u,"Avg Rating",a


for user Vern avg 3.5652173913
----------------------------------
Tee Pee Mexican Food
Predicted Rating: 3.55142662202
Actual User Rating: 5 Avg Rating 3.13636363636
----------------------------------
Local Breeze
Predicted Rating: 4.09134326313
Actual User Rating: 5 Avg Rating 3.86363636364
----------------------------------
Carly's Bistro
Predicted Rating: 4.05263567539
Actual User Rating: 5 Avg Rating 3.65079365079
----------------------------------
District American Kitchen and Wine Bar
Predicted Rating: 3.51619144197
Actual User Rating: 4 Avg Rating 3.575
----------------------------------
Sonora Mesquite Grill
Predicted Rating: 3.93936363386
Actual User Rating: 4 Avg Rating 4.51948051948

One can see that the predicted rating is much closer to the actual one in the larger set. The reason for this: We are still working with user Vern from the previous set, so we are choosing a "prolific" user who is likely to have a dense neighborhood of restaurants in the larger set (by the choice of our cut to create the small set).

Note that if we evalusted our predictions on the entire data set, rather than cherry-picking users like Vern, our average prediction error may become worse. There is because the larger data set is more likely to be of higher sparsity, as we will include more users who rated very few restaurants.

4.5 Outline another step (in words) in the mrjob map-reduce class to implement a simple but scalable recommender of the global type that we did in Question 1.5 to 1.7.

your answer here

Submission Instructions:

Restart and run your notebook one last time (you do not have to rerun the Amazon EMR script computesim.py), to make sure the output from each cell is up to date. To submit your homework, create a folder named lastname_firstinitial_hw4 and place your solutions in the folder. Double check that the file is still called HW4.ipynb, and that it contains your code. Also include the computesim.py script and the output.small.local.txt data file. Do NOT include the data file output.full.emr.txt from the larger run (its huge, so we will check your answers to 4.4 instead). Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work!

FINI

You have developed all kinds of recommenders. We hope it was fun. Time constraints prevented us from going into model checking, but perhaps you would like to try that on your own. Or use S3 or a hosted database as a place to store sharded similarities. You might want to take a gander at Yelp's entire Phoenix dataset, or use the other attributes present in the data set. So many possibilities!

If you'd like to learn more, please read Chris Volinksy's papers on the Netflix prize. There are also comprehensive reviews here and here.

css tweaks in this cell