Collaborative filtering

Note: this requires Python 2.7 to run (open the notebook via "pyspark" in the Mac Terminal so that the SparkSession "spark" is already defined).

In the following example, we load ratings data from the MovieLens dataset, each row consisting of a user, a movie, a rating and a timestamp. We then train an ALS model which assumes, by default, that the ratings are explicit (implicitPrefs is False). We evaluate the recommendation model by measuring the root-mean-square error of rating prediction.

Refer to the ALS Python docs for more details on the API.


In [1]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

In [2]:
# Each line of the ratings file has the form userId::movieId::rating::timestamp
lines = spark.read.text("data/movielens/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=long(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
# Hold out 20% of the ratings as a test set
(training, test) = ratings.randomSplit([0.8, 0.2])


If the notebook was not launched through pyspark, this cell fails with "NameError: name 'spark' is not defined", because no SparkSession has been created.
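In that case the session can be created by hand before re-running the cell above; a minimal sketch, assuming Spark 2.0 or later (the application name is just a label):

from pyspark.sql import SparkSession

# Create (or return the already-running) SparkSession
spark = SparkSession.builder \
    .appName("ALSExample") \
    .getOrCreate()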

In [12]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(training)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print "Root-mean-square error = " + str(rmse)


Root-mean-square error = 1.62524233773
Find full example code at "examples/src/main/python/ml/als_example.py" in the Spark repo.

If the rating matrix is derived from another source of information (i.e. it is inferred from other signals), you can set implicitPrefs to True to get better results:

als = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
          userCol="userId", itemCol="movieId", ratingCol="rating")
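As a rough illustration (not part of the original example, with variable names chosen here), fitting the implicit-feedback variant looks the same apart from that flag; note that with implicitPrefs=True the prediction column holds preference scores rather than rating estimates, so the RMSE above is less meaningful for it:

# Fit ALS in implicit-feedback mode on the same training split
alsImplicit = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
                  userCol="userId", itemCol="movieId", ratingCol="rating")
implicitModel = alsImplicit.fit(training)

# Predictions are preference/confidence scores, not rating estimates
implicitModel.transform(test).select("userId", "movieId", "rating", "prediction").show(5)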

In [ ]:
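The empty cell above could be used to turn the fitted model into actual recommendations. A minimal sketch, assuming Spark 2.2 or later, where ALSModel.recommendForAllUsers is available:

# Top 10 movie recommendations per user: (userId, array of (movieId, rating) structs)
userRecs = model.recommendForAllUsers(10)
userRecs.show(5, truncate=False)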