Recommender

This tutorial is based on the work of Marcel Caraciolo at http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html

Our goal is to calculate how similar pairs of movies are, so that we can recommend movies similar to the ones you liked. Using correlation, we can:

• For every pair of movies A and B, find all the people who rated both A and B.
• Use these ratings to form a Movie A vector and a Movie B vector.
• Calculate the correlation between those two vectors.
• When someone watches a movie, recommend the movies most correlated with it.
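
The four steps above can be sketched in plain Python on a handful of ratings before we bring in MapReduce. This is only an illustration: the toy ratings and the helper names `rating_vectors` and `pearson` are made up here and are not part of the original tutorial code.

```python
from math import sqrt

# Toy ratings as (user, movie, rating); the values are made up for illustration.
ratings = [
    ("u1", "A", 5.0), ("u1", "B", 4.0),
    ("u2", "A", 3.0), ("u2", "B", 2.0),
    ("u3", "A", 1.0), ("u3", "B", 1.0),
    ("u4", "A", 4.0),  # u4 rated only A, so u4 is left out of the vectors
]


def rating_vectors(ratings, movie_a, movie_b):
    """Return aligned rating vectors built from users who rated both movies."""
    by_user = {}
    for user, movie, rating in ratings:
        by_user.setdefault(user, {})[movie] = rating
    common = sorted(u for u, r in by_user.items()
                    if movie_a in r and movie_b in r)
    return ([by_user[u][movie_a] for u in common],
            [by_user[u][movie_b] for u in common])


def pearson(xs, ys):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


vec_a, vec_b = rating_vectors(ratings, "A", "B")
similarity = pearson(vec_a, vec_b)
```

Here `vec_a` and `vec_b` only contain the ratings of the three users who rated both movies, which is exactly the restriction described in the first bullet.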

We are going to work with a data set of movie ratings from http://grouplens.org/datasets/movielens/. For this task we will use the MovieLens dataset of movie ratings, with 100,000 ratings from about 1,000 users on about 1,700 movies (you can download it at http://www.grouplens.org/node/73).

So the first step is to get our movies file, which has three columns: (user, movie, rating).
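
As a small sketch of reading such a file, the helper below splits one line into the three fields. The name `parse_rating` is hypothetical, and the sketch assumes that the first three fields are user, movie and rating and that any extra trailing field (for example a timestamp) can be ignored. Check the separator and column layout of your own copy of the data.

```python
def parse_rating(line, sep="\t"):
    """Split one ratings line into (user_id, item_id, rating).

    Assumption: the first three fields are user, movie and rating;
    any extra trailing fields are ignored.
    """
    fields = line.rstrip("\n").split(sep)
    user_id, item_id, rating = fields[:3]
    return user_id, item_id, float(rating)


# Illustrative tab-separated line with an extra fourth field.
sample = "196\t242\t3\t881250949\n"
parsed = parse_rating(sample)

# The same helper with the pipe separator used in the docstring examples.
parsed_pipe = parse_rating("17|70|3", sep="|")
```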

You want to compute how similar pairs of movies are, so that if someone watches The Matrix, you can recommend movies like Blade Runner. So how should you define the similarity between two movies?

One possibility is to compute their correlation. The basic idea is, for every pair of movies A and B, to find all the people who rated both A and B. Use these ratings to form a Movie A vector and a Movie B vector, then calculate the correlation between these two vectors. Now when someone watches a movie, you can recommend the movies most correlated with it.

So let's divide and conquer. Our first task is, for each user, to emit a row containing their 'postings' (item, rating pairs). The reducer then emits each user's rating count and rating sum for use in later steps.

``````

def group_by_user_rating(self, key, line):
    """
    Emit the user_id and group by their ratings (item and rating)
    17  70,3
    35  21,1
    49  19,2
    49  21,1
    49  70,4
    87  19,1
    87  21,2
    98  19,2
    """
    # Each input line is expected to look like user_id|item_id|rating
    user_id, item_id, rating = line.split('|')
    yield user_id, (item_id, float(rating))


def count_ratings_users_freq(self, user_id, values):
    """
    For each user, emit a row containing their "postings"
    (item,rating pairs)
    Also emit user rating sum and count for use in later steps.
    17    1,3,(70,3)
    35    1,1,(21,1)
    49    3,7,(19,2 21,1 70,4)
    87    2,3,(19,1 21,2)
    98    1,2,(19,2)
    """
    item_count = 0
    item_sum = 0
    final = []
    for item_id, rating in values:
        item_count += 1
        item_sum += rating
        final.append((item_id, rating))

    yield user_id, (item_count, item_sum, final)

``````
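
To see what this first map/reduce step produces without running mrjob, here is a plain Python simulation. It is a sketch only: the functions `map_user_rating` and `reduce_user_ratings` are simplified stand-ins written for this example, the dictionary plays the role of the shuffle phase, and the input is the sample data from the docstrings.

```python
from collections import defaultdict

# Sample input taken from the docstring example, one "user|item|rating" per line.
lines = ["17|70|3", "35|21|1", "49|19|2", "49|21|1",
         "49|70|4", "87|19|1", "87|21|2", "98|19|2"]


def map_user_rating(line):
    """Mapper stand-in: key by user, value is the (item, rating) pair."""
    user_id, item_id, rating = line.split('|')
    return user_id, (item_id, float(rating))


def reduce_user_ratings(user_id, values):
    """Reducer stand-in: count and sum the ratings, keep the postings."""
    values = list(values)
    item_count = len(values)
    item_sum = sum(rating for _, rating in values)
    return user_id, (item_count, item_sum, values)


# Stand-in for the shuffle phase: group mapper output by key.
grouped = defaultdict(list)
for line in lines:
    key, value = map_user_rating(line)
    grouped[key].append(value)

postings = dict(reduce_user_ratings(u, v) for u, v in grouped.items())
```

For user 49 this gives a count of 3, a sum of 7 and the three (item, rating) pairs, matching the `49    3,7,(19,2 21,1 70,4)` row in the docstring.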

Before using these rating pairs to calculate correlation, let's see how we can compute it. Each movie's ratings form a vector, so we can use linear algebra to compute norms and dot products, as well as the length of each vector and the sum over all of its elements. By representing the ratings as matrices, we can perform several operations on those movies.
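
As a quick illustration of those vector operations, the sketch below computes the quantities we will need later for two small rating vectors. The vectors are made-up values, and the helpers `dot` and `norm` are written here for the example rather than taken from the tutorial's code.

```python
from math import sqrt

# Ratings of two movies by the same three users (made-up values).
movie_a = [2.0, 1.0, 4.0]
movie_b = [1.0, 2.0, 5.0]


def dot(xs, ys):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(xs, ys))


def norm(xs):
    """Euclidean norm of a vector."""
    return sqrt(dot(xs, xs))


n = len(movie_a)                        # number of co-ratings
sum_x, sum_y = sum(movie_a), sum(movie_b)
sum_xy = dot(movie_a, movie_b)          # dot product of the two vectors
sum_xx = dot(movie_a, movie_a)          # squared norm of movie_a
sum_yy = dot(movie_b, movie_b)          # squared norm of movie_b
```

All of these are simple running sums, which is what makes them easy to accumulate inside a reducer.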

``````

from itertools import combinations


def pairwise_items(self, user_id, values):
    '''
    The output drops the user from the key entirely, instead it emits
    the pair of items as the key:
    19,21  2,1
    19,70  2,4
    21,70  1,4
    19,21  1,2
    This mapper is the main performance bottleneck.  One improvement
    would be to create a Java Combiner to aggregate the
    outputs by key before writing to HDFS, another would be to use
    a vector format and SequenceFiles instead of streaming text
    for the matrix data.
    '''
    item_count, item_sum, ratings = values
    # combinations() emits every pair of this user's ratings;
    # this is where the bottleneck is
    for item1, item2 in combinations(ratings, 2):
        yield (item1[0], item2[0]), (item1[1], item2[1])


def calculate_similarity(self, pair_key, lines):
    '''
    Sum components of each corating pair across all users who rated both
    item x and item y, then calculate pairwise pearson similarity and
    corating counts.  The similarities are normalized to the [0,1] scale
    because we do a numerical sort.
    19,21   0.4,2
    21,19   0.4,2
    19,70   0.6,1
    70,19   0.6,1
    21,70   0.1,1
    70,21   0.1,1
    '''
    sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
    item_xname, item_yname = pair_key
    for item_x, item_y in lines:
        sum_xx += item_x * item_x
        sum_yy += item_y * item_y
        sum_xy += item_x * item_y
        sum_y += item_y
        sum_x += item_x
        n += 1
    # normalized_correlation is not defined in this cell; it is assumed
    # to be provided by the accompanying metrics module
    similarity = normalized_correlation(n, sum_xy, sum_x, sum_y,
                                        sum_xx, sum_yy)
    yield (item_xname, item_yname), (similarity, n)

``````
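
To make the pairing step concrete, the sketch below applies `itertools.combinations` to the step-1 row of user 49 from the docstring example. It is a standalone illustration, not the mapper itself.

```python
from itertools import combinations

# One row of step-1 output for user 49, taken from the docstring example.
item_count, item_sum, ratings = (3, 7.0,
                                 [("19", 2.0), ("21", 1.0), ("70", 4.0)])

# Every unordered pair of this user's ratings becomes one output record:
# the key is the pair of item ids, the value is the pair of ratings.
pairs = [((a_id, b_id), (a_rating, b_rating))
         for (a_id, a_rating), (b_id, b_rating) in combinations(ratings, 2)]

# A user with k ratings emits k * (k - 1) / 2 records.
n_pairs = item_count * (item_count - 1) // 2
```

The quadratic growth in `n_pairs` is why this mapper dominates the running time for users with many ratings. Note also that `combinations` keeps the input order, so the ratings of every user would need to be in a consistent item order for the pair (19, 21) never to show up as (21, 19).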

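To summarize, each row in calculate_similarity computes the number of people who rated both movie A and movie B (n), the sum over all elements in each ratings vector (sum_x, sum_y), the sum of squares of each vector (sum_xx, sum_yy) and the sum of their products (sum_xy). With these we can now calculate the correlation between the two movies. The correlation can be expressed as:

$$\mathrm{Corr}(X, Y) = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2}\,\sqrt{n \sum y^2 - \left(\sum y\right)^2}}$$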
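
A minimal sketch of the Pearson correlation computed directly from those running sums is shown below. The argument order mirrors the call made in calculate_similarity, but the body is our own assumption, since the notebook does not show the metrics module. Returning 0.0 when a vector has no variance is also a choice made for this example.

```python
from math import sqrt


def correlation(n, sum_xy, sum_x, sum_y, sum_xx, sum_yy):
    """Pearson correlation from running sums (0.0 if it is undefined)."""
    numerator = n * sum_xy - sum_x * sum_y
    # max(..., 0.0) guards against tiny negative values from rounding
    var_x = max(n * sum_xx - sum_x * sum_x, 0.0)
    var_y = max(n * sum_yy - sum_y * sum_y, 0.0)
    denominator = sqrt(var_x) * sqrt(var_y)
    if denominator == 0:
        return 0.0
    return numerator / denominator


# x = [2, 1, 4] and y = [1, 2, 5] give these sums:
r = correlation(3, 24.0, 7.0, 8.0, 21.0, 30.0)

# y = 2 * x is perfectly correlated with x = [1, 2, 3]:
r_perfect = correlation(3, 28.0, 6.0, 12.0, 14.0, 56.0)
```

Because only sums are needed, the reducer never has to hold the full rating vectors in memory.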

So that's it! The last step of the job sorts the top-correlated items for each item and prints them to the output.

``````

def calculate_ranking(self, item_keys, values):
    '''
    Emit items with similarity in key for ranking:
    19,0.4    70,1
    19,0.6    21,2
    21,0.6    19,2
    21,0.9    70,1
    70,0.4    19,1
    70,0.9    21,1
    '''
    similarity, n = values
    item_x, item_y = item_keys
    if int(n) > 0:
        yield (item_x, similarity), (item_y, n)


def top_similar_items(self, key_sim, similar_ns):
    '''
    For each item emit K closest items in a semicolon-separated file:
    De La Soul;A Tribe Called Quest;0.6;1
    De La Soul;2Pac;0.4;2
    '''
    item_x, similarity = key_sim
    for item_y, n in similar_ns:
        # print() with a single argument works on both Python 2 and 3
        print('%s;%s;%f;%d' % (item_x, item_y, similarity, n))

``````
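
The job above relies on the MapReduce sort of the (item, similarity) keys to order the output and does not itself cut the list off at K entries. As a sketch of the ranking idea outside MapReduce, the snippet below keeps the K most similar items per item with `heapq.nlargest`. The function `top_k_similar` is hypothetical, and the similarity values echo the docstring example.

```python
import heapq
from collections import defaultdict

# (item_x, item_y) -> (similarity, n); values echo the docstring example.
similarities = {
    ("19", "21"): (0.6, 2), ("19", "70"): (0.4, 1),
    ("21", "19"): (0.6, 2), ("21", "70"): (0.9, 1),
    ("70", "19"): (0.4, 1), ("70", "21"): (0.9, 1),
}


def top_k_similar(similarities, k):
    """Return, per item, the k rows (similarity, other_item, n) ranked best first."""
    by_item = defaultdict(list)
    for (item_x, item_y), (similarity, n) in similarities.items():
        if n > 0:  # same filter as calculate_ranking
            by_item[item_x].append((similarity, item_y, n))
    return {item: heapq.nlargest(k, rows) for item, rows in by_item.items()}


top = top_k_similar(similarities, k=1)
```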

All of it goes in one file, MovieSimilarities.py:

``````

'''
Given a dataset of movies and their ratings by different
users, how can we compute the similarity between pairs of
movies?
This module computes similarities between movies
by representing each movie as a vector of ratings and
computing similarity scores over these vectors.
Copied from:
https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob/blob/master/moviesSimilarities.py
'''
__author__ = 'Marcel Caraciolo <caraciol@gmail.com>'

from mrjob.job import MRJob
from metrics import correlation
from metrics import cosine, regularized_correlation
from math import sqrt

try:
    from itertools import combinations
except ImportError:
    from metrics import combinations

PRIOR_COUNT = 10
PRIOR_CORRELATION = 0


class SemicolonValueProtocol(object):

    # don't need to implement read() since we aren't using it

    def write(self, key, values):
        return ';'.join(str(v) for v in values)


class MoviesSimilarities(MRJob):

    OUTPUT_PROTOCOL = SemicolonValueProtocol

    def steps(self):
        # mr() is deprecated in newer mrjob releases in favor of
        # mrjob.step.MRStep; it is kept here to match the original code
        return [
            self.mr(mapper=self.group_by_user_rating,
                    reducer=self.count_ratings_users_freq),
            self.mr(mapper=self.pairwise_items,
                    reducer=self.calculate_similarity),
            self.mr(mapper=self.calculate_ranking,
                    reducer=self.top_similar_items)]

    def group_by_user_rating(self, key, line):
        """
        Emit the user_id and group by their ratings (item and rating)
        17  70,3
        35  21,1
        49  19,2
        49  21,1
        49  70,4
        87  19,1
        87  21,2
        98  19,2
        """
        # Each input line is expected to look like user_id<TAB>item_id<TAB>rating
        user_id, item_id, rating = line.split('\t')
        yield user_id, (item_id, float(rating))

    def count_ratings_users_freq(self, user_id, values):
        """
        For each user, emit a row containing their "postings"
        (item,rating pairs)
        Also emit user rating sum and count for use in later steps.
        17    1,3,(70,3)
        35    1,1,(21,1)
        49    3,7,(19,2 21,1 70,4)
        87    2,3,(19,1 21,2)
        98    1,2,(19,2)
        """
        item_count = 0
        item_sum = 0
        final = []
        for item_id, rating in values:
            item_count += 1
            item_sum += rating
            final.append((item_id, rating))

        yield user_id, (item_count, item_sum, final)

    def pairwise_items(self, user_id, values):
        '''
        The output drops the user from the key entirely, instead it emits
        the pair of items as the key:
        19,21  2,1
        19,70  2,4
        21,70  1,4
        19,21  1,2
        This mapper is the main performance bottleneck.  One improvement
        would be to create a Java Combiner to aggregate the
        outputs by key before writing to HDFS, another would be to use
        a vector format and SequenceFiles instead of streaming text
        for the matrix data.
        '''
        item_count, item_sum, ratings = values
        # combinations() emits every pair of this user's ratings;
        # this is where the bottleneck is
        for item1, item2 in combinations(ratings, 2):
            yield (item1[0], item2[0]), (item1[1], item2[1])

    def calculate_similarity(self, pair_key, lines):
        '''
        Sum components of each corating pair across all users who rated both
        item x and item y, then calculate pairwise pearson similarity and
        corating counts.  The similarities are normalized to the [0,1] scale
        because we do a numerical sort.
        19,21   0.4,2
        21,19   0.4,2
        19,70   0.6,1
        70,19   0.6,1
        21,70   0.1,1
        70,21   0.1,1
        '''
        sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0)
        item_xname, item_yname = pair_key
        for item_x, item_y in lines:
            sum_xx += item_x * item_x
            sum_yy += item_y * item_y
            sum_xy += item_x * item_y
            sum_y += item_y
            sum_x += item_x
            n += 1

        corr_sim = correlation(n, sum_xy, sum_x,
                               sum_y, sum_xx, sum_yy)

        reg_corr_sim = regularized_correlation(n, sum_xy, sum_x,
                                               sum_y, sum_xx, sum_yy,
                                               PRIOR_COUNT, PRIOR_CORRELATION)

        cos_sim = cosine(sum_xy, sqrt(sum_xx), sqrt(sum_yy))

        # Jaccard similarity is not computed here; it is emitted as a placeholder
        jaccard_sim = 0.0

        yield (item_xname, item_yname), (corr_sim,
                                         cos_sim, reg_corr_sim, jaccard_sim, n)

    def calculate_ranking(self, item_keys, values):
        '''
        Emit items with similarity in key for ranking:
        19,0.4    70,1
        19,0.6    21,2
        21,0.6    19,2
        21,0.9    70,1
        70,0.4    19,1
        70,0.9    21,1
        '''
        corr_sim, cos_sim, reg_corr_sim, jaccard_sim, n = values
        item_x, item_y = item_keys
        if int(n) > 0:
            yield (item_x, corr_sim, cos_sim, reg_corr_sim, jaccard_sim), \
                (item_y, n)

    def top_similar_items(self, key_sim, similar_ns):
        '''
        For each item emit K closest items in a semicolon-separated file:
        De La Soul;A Tribe Called Quest;0.6;1
        De La Soul;2Pac;0.4;2
        '''
        item_x, corr_sim, cos_sim, reg_corr_sim, jaccard_sim = key_sim
        for item_y, n in similar_ns:
            yield None, (item_x, item_y, corr_sim, cos_sim, reg_corr_sim,
                         jaccard_sim, n)


if __name__ == '__main__':
    MoviesSimilarities.run()

``````
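
The file imports `correlation`, `cosine` and `regularized_correlation` from a `metrics` module that is not shown in this notebook. The sketch below is our guess at what helpers with those call signatures might look like; it is not the original module. The weight `w = size / (size + virtual_count)` follows the usual shrinkage idea of pulling the correlation toward a prior when few users rated both movies, and the final blend with `prior_correlation` is an assumption.

```python
from math import sqrt


def correlation(size, dot_product, rating_sum, rating2sum,
                rating_norm_squared, rating2_norm_squared):
    """Pearson correlation from running sums (0.0 if it is undefined)."""
    numerator = size * dot_product - rating_sum * rating2sum
    var_x = max(size * rating_norm_squared - rating_sum ** 2, 0.0)
    var_y = max(size * rating2_norm_squared - rating2sum ** 2, 0.0)
    denominator = sqrt(var_x) * sqrt(var_y)
    return numerator / denominator if denominator else 0.0


def cosine(dot_product, rating_norm, rating2_norm):
    """Cosine similarity from a dot product and the two vector norms."""
    denominator = rating_norm * rating2_norm
    return dot_product / denominator if denominator else 0.0


def regularized_correlation(size, dot_product, rating_sum, rating2sum,
                            rating_norm_squared, rating2_norm_squared,
                            virtual_count, prior_correlation):
    """Blend the correlation with a prior; few co-ratings mean more prior."""
    unregularized = correlation(size, dot_product, rating_sum, rating2sum,
                                rating_norm_squared, rating2_norm_squared)
    w = size / float(size + virtual_count)
    # Assumed blend: weighted average of the data and the prior
    return w * unregularized + (1.0 - w) * prior_correlation


# x = [2, 1, 4], y = [1, 2, 5], with PRIOR_COUNT = 10 and PRIOR_CORRELATION = 0
corr_sim = correlation(3, 24.0, 7.0, 8.0, 21.0, 30.0)
cos_sim = cosine(24.0, sqrt(21.0), sqrt(30.0))
reg_corr_sim = regularized_correlation(3, 24.0, 7.0, 8.0, 21.0, 30.0, 10, 0)
```

With only three co-ratings and a prior count of 10, the regularized value is pulled strongly toward the prior of 0, which is the point of the regularization.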
``````

In [4]:

%run code/MovieSimilarities.py data/ml-100k/ml-100k/u.data > data/output.csv

``````
``````

INFO:mrjob.conf:no configs found; falling back on auto-configuration
INFO:mrjob.runner:creating tmp directory /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310
WARNING:mrjob.runner:PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols
WARNING:mrjob.job:mr() is deprecated and will be removed in v0.6.0. Use mrjob.step.MRStep directly instead.
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-mapper_part-00000
INFO:mrjob.runner:Counters from step 1:
INFO:mrjob.runner:  (no counters found)
INFO:mrjob.runner:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-mapper-sorted
INFO:mrjob.runner:> sort /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-mapper_part-00000
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-0-reducer_part-00000
INFO:mrjob.runner:Counters from step 1:
INFO:mrjob.runner:  (no counters found)
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-mapper_part-00000
INFO:mrjob.runner:Counters from step 2:
INFO:mrjob.runner:  (no counters found)
INFO:mrjob.runner:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-mapper-sorted
INFO:mrjob.runner:> sort /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-mapper_part-00000
INFO:mrjob.sim:writing to /var/folders/mg/vs4m899j26n6j2ks754qlvrwjzg4kj/T/MovieSimilarities.xdj116.20151118.015147.140310/step-1-reducer_part-00000

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/xdj116/git/big-data-python-class/Lectures/Lecture9- Recommenders/code/MovieSimilarities.py in <module>()
230
231 if __name__ == '__main__':
--> 232     MoviesSimilarities.run()

/Library/Python/2.7/site-packages/mrjob/job.pyc in run(cls)
459         # load options from the command line
--> 461         mr_job.execute()
462
463     def execute(self):

/Library/Python/2.7/site-packages/mrjob/job.pyc in execute(self)
477
478         else:
--> 479             super(MRJob, self).execute()
480
481     def make_runner(self):

/Library/Python/2.7/site-packages/mrjob/launch.pyc in execute(self)
151     def execute(self):
152         # Launcher only runs jobs, doesn't do any Hadoop Streaming stuff
--> 153         self.run_job()
154
155     def make_runner(self):

/Library/Python/2.7/site-packages/mrjob/launch.pyc in run_job(self)
214
215         with self.make_runner() as runner:
--> 216             runner.run()
217
218             if not self.options.no_output:

/Library/Python/2.7/site-packages/mrjob/runner.pyc in run(self)
469
--> 470         self._run()
471         self._ran_job = True
472

/Library/Python/2.7/site-packages/mrjob/sim.pyc in _run(self)
184
185                 # run the reducer
--> 186                 self._invoke_step(step_num, 'reducer')
187
188         # move final output to output directory

/Library/Python/2.7/site-packages/mrjob/sim.pyc in _invoke_step(self, step_num, step_type)
258
259             self._run_step(step_num, step_type, input_path, output_path,
--> 260                            working_dir, env)
261
262             self._prev_outfiles.append(output_path)

/Library/Python/2.7/site-packages/mrjob/inline.pyc in _run_step(self, step_num, step_type, input_path, output_path, working_dir, env, child_stdin)
158                 child_instance = self._mrjob_cls(args=child_args)
159                 child_instance.sandbox(stdin=child_stdin, stdout=child_stdout)
--> 160                 child_instance.execute()
161
162         if has_combiner:

/Library/Python/2.7/site-packages/mrjob/job.pyc in execute(self)
474
475         elif self.options.run_reducer:
--> 476             self.run_reducer(self.options.step_num)
477
478         else:

/Library/Python/2.7/site-packages/mrjob/job.pyc in run_reducer(self, step_num)
578                                                key=lambda(k, v): k):
579             values = (v for k, v in kv_pairs)
--> 580             for out_key, out_value in reducer(key, values) or ():
581                 write_line(out_key, out_value)
582

/Users/xdj116/git/big-data-python-class/Lectures/Lecture9- Recommenders/code/MovieSimilarities.py in calculate_similarity(self, pair_key, lines)
192
193         reg_corr_sim = regularized_correlation(n, sum_xy, sum_x, \
--> 194                 sum_y, sum_xx, sum_yy, PRIOR_COUNT, PRIOR_CORRELATION)
195
196         cos_sim = cosine(sum_xy, sqrt(sum_xx), sqrt(sum_yy))

/Users/xdj116/git/big-data-python-class/Lectures/Lecture9- Recommenders/code/MovieSimilarities.py in regularized_correlation(size, dot_product, rating_sum, rating2sum, rating_norm_squared, rating2_norm_squared, virtual_cont, prior_correlation)
62     '''
63     unregularizedCorrelation = correlation(size, dot_product, rating_sum, \
---> 64             rating2sum, rating_norm_squared, rating2_norm_squared)
65
66     w = size / float(size + virtual_cont)

TypeError: correlate() takes at most 4 arguments (6 given)

``````