Recommender

This tutorial is based on the work of Marcel Caraciolo at http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html

Our goal is to calculate how similar pairs of movies are, so that we can recommend movies similar to the ones you liked. Using correlation, we can:

For every pair of movies A and B, find all the people who rated both A and B.
Use these ratings to form a Movie A vector and a Movie B vector.
Calculate the correlation between those two vectors.
When someone watches a movie, recommend the movies most correlated with it.

We are going to work with a data set of movie ratings from http://grouplens.org/datasets/movielens/. For this task we will use the MovieLens data set with 100,000 ratings from 1000 users on 1700 movies (you can download it at http://www.grouplens.org/node/73).

The first step is to get the ratings file, from which we use three columns: (user, movie, rating).
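As a minimal sketch of reading one ratings line, assuming the raw `u.data` layout of tab-separated fields where the first three are user, movie and rating, and anything after them (such as a timestamp) can be ignored. The helper name `parse_rating_line` and the sample line are illustrative, not part of the original job.

```python
def parse_rating_line(line, sep="\t"):
    """Return (user_id, item_id, rating) from one line of a ratings file.

    Assumption: the first three fields are user, item and rating;
    any further fields are dropped.
    """
    fields = line.rstrip("\n").split(sep)
    user_id, item_id, rating = fields[:3]
    return user_id, item_id, float(rating)


# An illustrative line in the assumed layout (user, item, rating, extra field).
sample = "196\t242\t3\t881250949\n"
print(parse_rating_line(sample))  # ('196', '242', 3.0)
```

Slicing with `[:3]` keeps the parser working whether the file has exactly three columns or more.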

We want to compute how similar pairs of movies are, so that if someone watches The Matrix, we can recommend movies like Blade Runner. So how should we define the similarity between two movies?

One possibility is to compute their correlation. The basic idea is: for every pair of movies A and B, find all the people who rated both A and B. Use these ratings to form a Movie A vector and a Movie B vector, then calculate the correlation between these two vectors. When someone watches a movie, you can then recommend the movies most correlated with it.
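The idea of forming a Movie A vector and a Movie B vector can be sketched in plain Python. This is only an illustration on the small sample used in the docstrings of this tutorial; the function name `corating_vectors` is hypothetical.

```python
# (user, movie, rating) triples taken from the sample in the docstrings.
ratings = [
    ("17", "70", 3.0), ("35", "21", 1.0),
    ("49", "19", 2.0), ("49", "21", 1.0), ("49", "70", 4.0),
    ("87", "19", 1.0), ("87", "21", 2.0), ("98", "19", 2.0),
]


def corating_vectors(ratings, movie_a, movie_b):
    """Return the two rating vectors restricted to users who rated both movies."""
    by_user = {}
    for user, movie, rating in ratings:
        by_user.setdefault(user, {})[movie] = rating

    vec_a, vec_b = [], []
    for user in sorted(by_user):  # fixed user order keeps the vectors aligned
        seen = by_user[user]
        if movie_a in seen and movie_b in seen:
            vec_a.append(seen[movie_a])
            vec_b.append(seen[movie_b])
    return vec_a, vec_b


print(corating_vectors(ratings, "19", "21"))  # ([2.0, 1.0], [1.0, 2.0])
```

Only users 49 and 87 rated both movies 19 and 21, so each vector has two entries.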

So let's divide and conquer. Our first task is, for each user, to emit a row containing their 'postings' (item, rating pairs). The reducer then emits each user's rating count and rating sum for use in later steps.
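To see what this first step produces without running a MapReduce framework, here is a rough simulation of map, group-by-key and reduce in plain Python. The names `map_line` and `reduce_user` are illustrative, and the input lines are assumed to hold exactly three tab-separated fields.

```python
from collections import defaultdict

# Sample input in the assumed three-column, tab-separated layout.
lines = [
    "17\t70\t3", "35\t21\t1", "49\t19\t2", "49\t21\t1",
    "49\t70\t4", "87\t19\t1", "87\t21\t2", "98\t19\t2",
]


def map_line(line):
    """Mapper: key each rating by its user."""
    user_id, item_id, rating = line.split("\t")
    return user_id, (item_id, float(rating))


def reduce_user(user_id, postings):
    """Reducer: emit rating count, rating sum and the postings themselves."""
    postings = list(postings)
    rating_sum = sum(rating for _, rating in postings)
    return user_id, (len(postings), rating_sum, postings)


# The shuffle phase: collect all mapper values that share a key.
grouped = defaultdict(list)
for line in lines:
    user_id, posting = map_line(line)
    grouped[user_id].append(posting)

step1 = dict(reduce_user(user, postings) for user, postings in grouped.items())
print(step1["49"])  # (3, 7.0, [('19', 2.0), ('21', 1.0), ('70', 4.0)])
```

The result for user 49 matches the `49    3,7,(19,2 21,1 70,4)` row shown in the reducer's docstring.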



In [2]:

    
def group_by_user_rating(self, key, line):
    """
    Emit the user_id and group by their ratings (item and rating)
    17  70,3
    35  21,1
    49  19,2
    49  21,1
    49  70,4
    87  19,1
    87  21,2
    98  19,2
    """
    # u.data is tab-separated; keep the first three fields
    # (user, item, rating) and ignore anything after them
    user_id, item_id, rating = line.split('\t')[:3]
    yield user_id, (item_id, float(rating))

def count_ratings_users_freq(self, user_id, values):
    """
    For each user, emit a row containing their "postings"
    (item,rating pairs)
    Also emit user rating sum and count for use later steps.
    17    1,3,(70,3)
    35    1,1,(21,1)
    49    3,7,(19,2 21,1 70,4)
    87    2,3,(19,1 21,2)
    98    1,2,(19,2)
    """
    item_count = 0
    item_sum = 0
    final = []
    for item_id, rating in values:
        item_count += 1
        item_sum += rating
        final.append((item_id, rating))

    yield user_id, (item_count, item_sum, final)

Before using these rating pairs to calculate the correlation, let's see how to compute it. Each movie's ratings form a vector, so we can use linear algebra to compute norms and dot products, as well as the length of each vector or the sum over all of its elements. By representing the ratings as matrices, we can perform several operations on those movies.
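As a small illustration of those vector operations, the sketch below computes a dot product, a norm and an element sum for two rating vectors using only the standard library. The vectors are the co-ratings of movies 19 and 21 from the sample data; the helper names are illustrative.

```python
from math import sqrt

movie_a = [2.0, 1.0]  # ratings of movie 19 by users 49 and 87
movie_b = [1.0, 2.0]  # ratings of movie 21 by the same users


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm (square root of the sum of squares)."""
    return sqrt(dot(u, u))


print(dot(movie_a, movie_b))  # 4.0
print(sum(movie_a))           # 3.0
print(len(movie_a))           # 2
print(norm(movie_a))
```

In the similarity reducer further down, `sum_xy` plays the role of the dot product, `sum_xx` and `sum_yy` are the squared norms, and `n` is the vector length.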



In [3]:

    
from itertools import combinations


def pairwise_items(self, user_id, values):
    '''
    The output drops the user from the key entirely, instead it emits
    the pair of items as the key:
    19,21  2,1
    19,70  2,4
    21,70  1,4
    19,21  1,2
    This mapper is the main performance bottleneck.  One improvement
    would be to create a java Combiner to aggregate the
    outputs by key before writing to hdfs, another would be to use
    a vector format and SequenceFiles instead of streaming text
    for the matrix data.
    '''
    item_count, item_sum, ratings = values
    # combinations() is the bottleneck: a user with k ratings
    # yields k * (k - 1) / 2 pairs
    for item1, item2 in combinations(ratings, 2):
        yield (item1[0], item2[0]), (item1[1], item2[1])


def calculate_similarity(self, pair_key, lines):
    '''
    Sum components of each corating pair across all users who rated both
    item x and item y, then calculate pairwise Pearson similarity and
    corating counts.  The similarities are normalized to the [0,1] scale
    because we do a numerical sort.
    19,21   0.4,2
    21,19   0.4,2
    19,70   0.6,1
    70,19   0.6,1
    21,70   0.1,1
    70,21   0.1,1
    '''
    sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
    item_xname, item_yname = pair_key
    for item_x, item_y in lines:
        sum_xx += item_x * item_x
        sum_yy += item_y * item_y
        sum_xy += item_x * item_y
        sum_y += item_y
        sum_x += item_x
        n += 1
    # normalized_correlation is not defined in this cell; it must be
    # defined or imported before this function is called
    similarity = normalized_correlation(n, sum_xy, sum_x, sum_y,
                                        sum_xx, sum_yy)
    yield (item_xname, item_yname), (similarity, n)
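
The pair-emitting mapper relies on `itertools.combinations`, which keeps the order of its input. A hedged sketch of one way to make the pair keys independent of the order in which a user's ratings arrive is to sort the postings first. This is an illustrative variation (the function name `item_pairs` is hypothetical), not what the original job does.

```python
from itertools import combinations


def item_pairs(postings):
    """Yield ((item_a, item_b), (rating_a, rating_b)) for one user's postings.

    Sorting first means the same two items always produce the same key,
    whatever order the postings arrived in.
    """
    for (item_a, rating_a), (item_b, rating_b) in combinations(sorted(postings), 2):
        yield (item_a, item_b), (rating_a, rating_b)


# Postings of user 49 from the sample data, deliberately out of order.
user_49 = [("70", 4.0), ("19", 2.0), ("21", 1.0)]
pairs = list(item_pairs(user_49))
print(pairs)
```

Item ids are compared as strings here, so the order is lexicographic rather than numeric; that is still enough to give every pair a single canonical key.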

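To summarize, for each pair of movies, calculate_similarity computes the number of people who rated both movies (n), the sum over all elements of each ratings vector (sum_x, sum_y), the sum of squares of each vector (sum_xx, sum_yy) and the dot product of the two vectors (sum_xy). With these we can calculate the correlation between the movies. The correlation can be expressed as:

$$\mathrm{corr}(x, y) = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2}}$$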
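A minimal sketch of how the Pearson correlation can be computed from exactly those running sums. The function names are illustrative, and returning 0.0 when the denominator vanishes (for example with a single co-rating) is an assumption rather than something stated in the text.

```python
from math import sqrt


def accumulate(co_ratings):
    """Build the running sums from (rating_x, rating_y) pairs."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in co_ratings:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    return n, sum_xy, sum_x, sum_y, sum_xx, sum_yy


def correlation_from_sums(n, sum_xy, sum_x, sum_y, sum_xx, sum_yy):
    """Pearson correlation expressed through sums only."""
    numerator = n * sum_xy - sum_x * sum_y
    # max() guards against tiny negative values caused by rounding.
    var_x = max(n * sum_xx - sum_x ** 2, 0.0)
    var_y = max(n * sum_yy - sum_y ** 2, 0.0)
    denominator = sqrt(var_x) * sqrt(var_y)
    return numerator / denominator if denominator else 0.0


# Movies 19 and 21 were co-rated (2, 1) and (1, 2) in the sample data.
sums = accumulate([(2.0, 1.0), (1.0, 2.0)])
print(correlation_from_sums(*sums))  # -1.0
```

Because only sums are needed, each reducer can process its co-ratings as a stream without keeping the full vectors in memory.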

So that's it! The last step of the job sorts the top-correlated items for each item and prints them to the output.
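Outside of MapReduce, the ranking step can be sketched with `heapq.nlargest`. The function name `top_k_similar` and the parameter `k` are illustrative; the sample similarities are the ones listed in the similarity reducer's docstring.

```python
import heapq
from collections import defaultdict


def top_k_similar(similarities, k=2):
    """Map each item to its k most similar items.

    similarities: iterable of ((item_x, item_y), (similarity, n)).
    """
    by_item = defaultdict(list)
    for (item_x, item_y), (similarity, n) in similarities:
        if n > 0:  # skip pairs that nobody co-rated
            by_item[item_x].append((similarity, item_y, n))
    return {item: heapq.nlargest(k, neighbours)
            for item, neighbours in by_item.items()}


sample = [
    (("19", "21"), (0.4, 2)), (("21", "19"), (0.4, 2)),
    (("19", "70"), (0.6, 1)), (("70", "19"), (0.6, 1)),
    (("21", "70"), (0.1, 1)), (("70", "21"), (0.1, 1)),
]
print(top_k_similar(sample, k=1)["19"])  # [(0.6, '70', 1)]
```

Putting the similarity first in each tuple lets `nlargest` rank by it directly, which mirrors how the job places the similarity in the key so the framework's sort does the ranking.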



In [ ]:

    
def calculate_ranking(self, item_keys, values):
    '''
    Emit items with similarity in key for ranking:
    19,0.4    70,1
    19,0.6    21,2
    21,0.6    19,2
    21,0.9    70,1
    70,0.4    19,1
    70,0.9    21,1
    '''
    similarity, n = values
    item_x, item_y = item_keys
    if int(n) > 0:
        yield (item_x, similarity), (item_y, n)


def top_similar_items(self, key_sim, similar_ns):
    '''
    For each item emit K closest items in a semicolon-separated file:
    De La Soul;A Tribe Called Quest;0.6;1
    De La Soul;2Pac;0.4;2
    '''
    item_x, similarity = key_sim
    for item_y, n in similar_ns:
        # the parenthesized form works under both Python 2 and Python 3
        print('%s;%s;%f;%d' % (item_x, item_y, similarity, n))

All of it goes in one file, MovieSimilarities.py:



In [ ]:

    
# %load code/MovieSimilarities.py

'''
 Given a dataset of movies and their ratings by different
 users, how can we compute the similarity between pairs of
 movies?
 This module computes similarities between movies
 by representing each movie as a vector of ratings and
 computing similarity scores over these vectors. 
 Copied from:
 https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob/blob/master/moviesSimilarities.py
'''
__author__ = 'Marcel Caraciolo <caraciol@gmail.com>'

from mrjob.job import MRJob
from metrics import  correlation
from metrics import cosine, regularized_correlation
from math import sqrt

try:
    from itertools import combinations
except ImportError:
    from metrics import combinations


PRIOR_COUNT = 10
PRIOR_CORRELATION = 0


class SemicolonValueProtocol(object):

    # don't need to implement read() since we aren't using it

    def write(self, key, values):
        return ';'.join(str(v) for v in values)


class MoviesSimilarities(MRJob):

    OUTPUT_PROTOCOL = SemicolonValueProtocol

    def steps(self):
        return [
            self.mr(mapper=self.group_by_user_rating,
                    reducer=self.count_ratings_users_freq),
            self.mr(mapper=self.pairwise_items,
                    reducer=self.calculate_similarity),
            self.mr(mapper=self.calculate_ranking,
                    reducer=self.top_similar_items)]

    def group_by_user_rating(self, key, line):
        """
        Emit the user_id and group by their ratings (item and rating)
        17  70,3
        35  21,1
        49  19,2
        49  21,1
        49  70,4
        87  19,1
        87  21,2
        98  19,2
        """
        # keep the first three tab-separated fields (user, item, rating)
        # and ignore anything after them
        user_id, item_id, rating = line.split('\t')[:3]
        yield user_id, (item_id, float(rating))

    def count_ratings_users_freq(self, user_id, values):
        """
        For each user, emit a row containing their "postings"
        (item,rating pairs)
        Also emit user rating sum and count for use later steps.
        17    1,3,(70,3)
        35    1,1,(21,1)
        49    3,7,(19,2 21,1 70,4)
        87    2,3,(19,1 21,2)
        98    1,2,(19,2)
        """
        item_count = 0
        item_sum = 0
        final = []
        for item_id, rating in values:
            item_count += 1
            item_sum += rating
            final.append((item_id, rating))

        yield user_id, (item_count, item_sum, final)

    def pairwise_items(self, user_id, values):
        '''
        The output drops the user from the key entirely, instead it emits
        the pair of items as the key:
        19,21  2,1
        19,70  2,4
        21,70  1,4
        19,21  1,2
        This mapper is the main performance bottleneck.  One improvement
        would be to create a java Combiner to aggregate the
        outputs by key before writing to hdfs, another would be to use
        a vector format and SequenceFiles instead of streaming text
        for the matrix data.
        '''
        item_count, item_sum, ratings = values
        #print item_count, item_sum, [r for r in combinations(ratings, 2)]
        #bottleneck at combinations
        for item1, item2 in combinations(ratings, 2):
            yield (item1[0], item2[0]), \
                    (item1[1], item2[1])

    def calculate_similarity(self, pair_key, lines):
        '''
        Sum components of each corating pair across all users who rated both
        item x and item y, then calculate pairwise pearson similarity and
        corating counts.  The similarities are normalized to the [0,1] scale
        because we do a numerical sort.
        19,21   0.4,2
        21,19   0.4,2
        19,70   0.6,1
        70,19   0.6,1
        21,70   0.1,1
        70,21   0.1,1
        '''
        sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
        item_pair, co_ratings = pair_key, lines
        item_xname, item_yname = item_pair
        for item_x, item_y in lines:
            sum_xx += item_x * item_x
            sum_yy += item_y * item_y
            sum_xy += item_x * item_y
            sum_y += item_y
            sum_x += item_x
            n += 1

        corr_sim = correlation(n, sum_xy, sum_x, \
                 sum_y, sum_xx, sum_yy)

        reg_corr_sim = regularized_correlation(n, sum_xy, sum_x, \
                sum_y, sum_xx, sum_yy, PRIOR_COUNT, PRIOR_CORRELATION)

        cos_sim = cosine(sum_xy, sqrt(sum_xx), sqrt(sum_yy))

        jaccard_sim = 0.0

        yield (item_xname, item_yname), (corr_sim, \
                cos_sim, reg_corr_sim, jaccard_sim, n)

    def calculate_ranking(self, item_keys, values):
        '''
        Emit items with similarity in key for ranking:
        19,0.4    70,1
        19,0.6    21,2
        21,0.6    19,2
        21,0.9    70,1
        70,0.4    19,1
        70,0.9    21,1
        '''
        corr_sim, cos_sim, reg_corr_sim, jaccard_sim, n = values
        item_x, item_y = item_keys
        if int(n) > 0:
            yield (item_x, corr_sim, cos_sim, reg_corr_sim, jaccard_sim), \
                     (item_y, n)

    def top_similar_items(self, key_sim, similar_ns):
        '''
        For each item emit K closest items in comma separated file:
        De La Soul;A Tribe Called Quest;0.6;1
        De La Soul;2Pac;0.4;2
        '''
        item_x, corr_sim, cos_sim, reg_corr_sim, jaccard_sim = key_sim
        for item_y, n in similar_ns:
            yield None, (item_x, item_y, corr_sim, cos_sim, reg_corr_sim,
                         jaccard_sim, n)


if __name__ == '__main__':
    MoviesSimilarities.run()



In [4]:

    
%run code/MovieSimilarities.py data/ml-100k/ml-100k/u.data > data/output.csv









    



INFO:mrjob.conf:no configs found; falling back on auto-configuration
INFO:mrjob.runner:creating tmp directory /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310
WARNING:mrjob.runner:PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols
WARNING:mrjob.job:mr() is deprecated and will be removed in v0.6.0. Use mrjob.step.MRStep directly instead.
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-mapper_part-00000
INFO:mrjob.runner:Counters from step 1:
INFO:mrjob.runner:  (no counters found)
INFO:mrjob.runner:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-mapper-sorted
INFO:mrjob.runner:> sort /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-mapper_part-00000
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-reducer_part-00000
INFO:mrjob.runner:Counters from step 1:
INFO:mrjob.runner:  (no counters found)
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-mapper_part-00000
INFO:mrjob.runner:Counters from step 2:
INFO:mrjob.runner:  (no counters found)
INFO:mrjob.runner:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-mapper-sorted
INFO:mrjob.runner:> sort /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-mapper_part-00000
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-reducer_part-00000






    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/xdj116/git/big-data-python-class/Lectures/Lecture9- Recommenders/code/MovieSimilarities.py in <module>()
    230 
    231 if __name__ == '__main__':
--> 232     MoviesSimilarities.run()

/Library/Python/2.7/site-packages/mrjob/job.pyc in run(cls)
    459         # load options from the command line
    460         mr_job = cls(args=_READ_ARGS_FROM_SYS_ARGV)
--> 461         mr_job.execute()
    462 
    463     def execute(self):

/Library/Python/2.7/site-packages/mrjob/job.pyc in execute(self)
    477 
    478         else:
--> 479             super(MRJob, self).execute()
    480 
    481     def make_runner(self):

/Library/Python/2.7/site-packages/mrjob/launch.pyc in execute(self)
    151     def execute(self):
    152         # Launcher only runs jobs, doesn't do any Hadoop Streaming stuff
--> 153         self.run_job()
    154 
    155     def make_runner(self):

/Library/Python/2.7/site-packages/mrjob/launch.pyc in run_job(self)
    214 
    215         with self.make_runner() as runner:
--> 216             runner.run()
    217 
    218             if not self.options.no_output:

/Library/Python/2.7/site-packages/mrjob/runner.pyc in run(self)
    468             raise AssertionError("Job already ran!")
    469 
--> 470         self._run()
    471         self._ran_job = True
    472 

/Library/Python/2.7/site-packages/mrjob/sim.pyc in _run(self)
    184 
    185                 # run the reducer
--> 186                 self._invoke_step(step_num, 'reducer')
    187 
    188         # move final output to output directory

/Library/Python/2.7/site-packages/mrjob/sim.pyc in _invoke_step(self, step_num, step_type)
    258 
    259             self._run_step(step_num, step_type, input_path, output_path,
--> 260                            working_dir, env)
    261 
    262             self._prev_outfiles.append(output_path)

/Library/Python/2.7/site-packages/mrjob/inline.pyc in _run_step(self, step_num, step_type, input_path, output_path, working_dir, env, child_stdin)
    158                 child_instance = self._mrjob_cls(args=child_args)
    159                 child_instance.sandbox(stdin=child_stdin, stdout=child_stdout)
--> 160                 child_instance.execute()
    161 
    162         if has_combiner:

/Library/Python/2.7/site-packages/mrjob/job.pyc in execute(self)
    474 
    475         elif self.options.run_reducer:
--> 476             self.run_reducer(self.options.step_num)
    477 
    478         else:

/Library/Python/2.7/site-packages/mrjob/job.pyc in run_reducer(self, step_num)
    578                                                key=lambda(k, v): k):
    579             values = (v for k, v in kv_pairs)
--> 580             for out_key, out_value in reducer(key, values) or ():
    581                 write_line(out_key, out_value)
    582 

/Users/xdj116/git/big-data-python-class/Lectures/Lecture9- Recommenders/code/MovieSimilarities.py in calculate_similarity(self, pair_key, lines)
    192 
    193         reg_corr_sim = regularized_correlation(n, sum_xy, sum_x, \
--> 194                 sum_y, sum_xx, sum_yy, PRIOR_COUNT, PRIOR_CORRELATION)
    195 
    196         cos_sim = cosine(sum_xy, sqrt(sum_xx), sqrt(sum_yy))

/Users/xdj116/git/big-data-python-class/Lectures/Lecture9- Recommenders/code/MovieSimilarities.py in regularized_correlation(size, dot_product, rating_sum, rating2sum, rating_norm_squared, rating2_norm_squared, virtual_cont, prior_correlation)
     62     '''
     63     unregularizedCorrelation = correlation(size, dot_product, rating_sum, \
---> 64             rating2sum, rating_norm_squared, rating2_norm_squared)
     65 
     66     w = size / float(size + virtual_cont)

TypeError: correlate() takes at most 4 arguments (6 given)
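
The traceback stops inside `regularized_correlation`, whose visible lines compute an unregularized correlation and then a weight `w = size / float(size + virtual_cont)`. As a hedged sketch of what such helpers usually compute: the blend of the correlation with a prior below is an assumption (the return statement is not visible in the traceback), and both signatures are simplified for illustration rather than copied from the `metrics` module.

```python
from math import sqrt


def regularized_correlation(correlation, n, virtual_count, prior_correlation):
    """Shrink a correlation toward a prior when few users co-rated the pair.

    Assumption: a weighted average, with weight w = n / (n + virtual_count).
    """
    w = n / float(n + virtual_count)
    return w * correlation + (1.0 - w) * prior_correlation


def cosine(dot_product, norm_x, norm_y):
    """Cosine similarity from a dot product and the two vector norms."""
    denominator = norm_x * norm_y
    return dot_product / denominator if denominator else 0.0


# Movies 19 and 21 from the sample: correlation -1.0 from only 2 co-ratings.
print(regularized_correlation(-1.0, 2, virtual_count=10, prior_correlation=0.0))
print(cosine(4.0, sqrt(5.0), sqrt(5.0)))
```

With a prior count of 10 and only two co-ratings, the weight is 2/12, so the extreme correlation of -1.0 is pulled most of the way toward the prior of 0.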


