Building a Recommender with Ratings Data

Using GraphLab Create we can take an SFrame containing user ratings for movies, and quickly create a recommender. We'll see how to tune parameters for this recommender and how to get a sense of what its performance will be in practice. Finally (mostly for fun) we'll create two stereotypical users, and see what they get recommended.

To start, let's import a few modules.


In [1]:
import graphlab as gl
import matplotlib.pyplot as plt
%matplotlib inline

Prepare the Data

The complete dataset is large. For prototyping on a single machine, let's just use a small sample of all the data. We'll use a sample that contains all ratings for about 12,000 users.


In [2]:
data_url = 'https://static.turi.com/datasets/movie_ratings/sample.small'
sf = gl.SFrame.read_csv(data_url,delimiter='\t',column_type_hints={'rating':int})


[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-24264 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1442615425.log
[INFO] GraphLab Server Version: 1.5.2
PROGRESS: Downloading https://static.turi.com/datasets/movie_ratings/sample.small to /var/tmp/graphlab-toby/24264/000000.small
PROGRESS: Finished parsing file https://static.turi.com/datasets/movie_ratings/sample.small
PROGRESS: Parsing completed. Parsed 100 lines in 0.723517 secs.
PROGRESS: Read 1549015 lines. Lines per second: 2.20795e+06
PROGRESS: Finished parsing file https://static.turi.com/datasets/movie_ratings/sample.small
PROGRESS: Parsing completed. Parsed 4000000 lines in 1.22237 secs.

Evaluating a model on the same data it was trained on is problematic: the score looks optimistic because it cannot reveal overfitting, where the model memorizes the training data instead of learning patterns that generalize. We'll follow the common approach of holding out a randomly selected 20% of our data to use later for evaluation.

The SFrame's random_split helper function gives us our train and test sets.


In [3]:
(train_set, test_set) = sf.random_split(0.8)
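
To make the idea of a random hold-out concrete, here is a minimal pure-Python sketch. It is not GraphLab's implementation; the function name and the per-row coin flip are assumptions made for illustration. Each row lands in the first list with probability `fraction`, so the split sizes are only approximately 80/20.

```python
import random


def random_split(rows, fraction, seed=None):
    """Illustrative hold-out split (hypothetical helper, not GraphLab's code).

    Each row goes to the first list with probability `fraction`,
    otherwise to the second list.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


# Toy usage: split 1000 row ids roughly 80/20.
rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.8, seed=0)
```

Fixing the seed in this sketch makes the split reproducible, which helps when comparing models across runs.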

Now that we have a train and test set, let's come up with a very simple way of predicting ratings. That way when we try more complicated things, we'll have some baseline for comparison.

GraphLab's PopularityRecommender provides this functionality. It simply stores the mean rating per item. When asked to predict the rating for a particular user-item pair, it predicts the mean of all ratings for that item; it pays no attention to user information.

In order to use the PopularityRecommender, all we need to do is pass its create function the data and tell it the pertinent column names.


In [4]:
m = gl.popularity_recommender.create(train_set, 'user', 'movie', 'rating')


PROGRESS: Recsys training: model = popularity
PROGRESS: Preparing data set.
PROGRESS:     Data has 3199340 observations with 12020 users and 17188 items.
PROGRESS:     Data prepared in: 1.82253s
PROGRESS: 3199340 observations to process; with 17188 unique items.
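
The item-mean baseline described above is simple enough to sketch in a few lines of plain Python. This is an illustration of the idea only, under the assumption that ratings arrive as (user, item, rating) tuples; the function names are hypothetical and this is not how GraphLab stores its model.

```python
from collections import defaultdict


def fit_item_means(ratings):
    """Compute the mean rating of every item from (user, item, rating) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}


def predict_item_mean(item_means, user, item):
    """Predict the item's mean rating; the user is deliberately ignored.

    Raises KeyError for items never seen in training; a real system
    would need some fallback for that case.
    """
    return item_means[item]


# Toy data, made up for illustration.
baseline_toy_ratings = [('u1', 'A', 5), ('u2', 'A', 3), ('u1', 'B', 2)]
item_means = fit_item_means(baseline_toy_ratings)
```

Every user gets the same prediction for a given item, which is exactly why this model makes a useful baseline: anything that uses user information should beat it.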

Now that we have a (simple) model, we need a way to measure the accuracy of its predictions, so that we can compare the performance of different models. The Root Mean Squared Error (RMSE) is one of the most common measures: it is the square root of the average squared difference between predicted and actual ratings, so lower is better.


In [5]:
baseline_rmse = gl.evaluation.rmse(test_set['rating'], m.predict(test_set))
print(baseline_rmse)


0.992343220641
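
For reference, the Root Mean Squared Error can be computed by hand as the square root of the mean of the squared differences between actual and predicted ratings. The function below is a standalone sketch of that definition on plain Python lists; it is not GraphLab's code.

```python
import math


def rmse(targets, predictions):
    """Root Mean Squared Error between two equal-length sequences of numbers."""
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.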

Improving Predictive Accuracy

The type of model that turned out to be the best for the famous Netflix Competition is called Matrix Factorization. This is a form of Collaborative Filtering where recommendations are generated using ratings from users that are somehow similar.

Whenever you use a particular type of model there are almost inevitably some parameters you must specify. Matrix Factorization is no exception. Properly tuning these parameters can have a huge effect on how well your model works. The two most important parameters are the number of dimensions (num_factors) and the regularization coefficient (regularization). Here we keep num_factors fixed at 5 and try several regularization values.
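
To give a feel for what these two parameters control, here is a deliberately simplified sketch of matrix factorization trained with stochastic gradient descent. It is an assumption-laden toy, not GraphLab's solver: it predicts a rating as the global mean plus the dot product of a user vector and an item vector, omits the per-user and per-item linear terms, and uses a fixed step size. All names and the toy data are made up for illustration.

```python
import random


def train_mf(ratings, num_factors=2, regularization=0.01,
             step_size=0.05, num_iterations=200, seed=0):
    """Toy SGD matrix factorization on (user, item, rating) tuples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # num_factors is the length of each user and item vector.
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(num_factors)] for i in items}
    mu = sum(r for _, _, r in ratings) / float(len(ratings))
    for _ in range(num_iterations):
        for u, i, r in ratings:
            pred = mu + sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for k in range(num_factors):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error plus an L2 penalty;
                # regularization shrinks the factors toward zero.
                P[u][k] += step_size * (err * qi - regularization * pu)
                Q[i][k] += step_size * (err * pu - regularization * qi)
    return mu, P, Q


def predict(model, user, item):
    """Predicted rating: global mean plus user-item factor dot product."""
    mu, P, Q = model
    return mu + sum(pu * qi for pu, qi in zip(P[user], Q[item]))


# Toy data, made up for illustration: 'a' and 'b' prefer 'x', 'c' prefers 'y'.
mf_toy_ratings = [('a', 'x', 5), ('a', 'y', 1), ('b', 'x', 4),
                  ('b', 'y', 2), ('c', 'x', 1), ('c', 'y', 5)]
mf_model = train_mf(mf_toy_ratings)
```

More factors let the model capture more patterns but make overfitting easier; a larger regularization value pulls the factors toward zero and pushes predictions back toward the mean.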


In [6]:
regularization_vals = [0.001, 0.0001, 0.00001, 0.000001]
models = [gl.factorization_recommender.create(train_set, 'user', 'movie', 'rating',
                                              max_iterations=50, num_factors=5, regularization=r)
          for r in regularization_vals]


PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 3199340 observations with 12020 users and 17188 items.
PROGRESS:     Data prepared in: 1.79158s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 5        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 0.001    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 399917 / 3199340 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 0.0230303         | Not Viable                               |
PROGRESS: | 1       | 0.00575758        | 0.870597                                 |
PROGRESS: | 2       | 0.00287879        | 0.913049                                 |
PROGRESS: | 3       | 0.0014394         | 0.954261                                 |
PROGRESS: | 4       | 0.000719698       | 0.990381                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.00575758        | 0.870597                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 82us         | 1.1461            | 1.07056               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 338.838ms    | DIVERGED          | DIVERGED              | 0.00575758  |
PROGRESS: | RESET   | 417.403ms    | 1.1461            | 1.07056               |             |
PROGRESS: | 1       | 743.737ms    | DIVERGED          | DIVERGED              | 0.00287879  |
PROGRESS: | RESET   | 822.623ms    | 1.14612           | 1.07057               |             |
PROGRESS: | 1       | 1.14s        | DIVERGED          | DIVERGED              | 0.0014394   |
PROGRESS: | RESET   | 1.22s        | 1.1461            | 1.07056               |             |
PROGRESS: | 1       | 1.48s        | 0.958228          | 0.978889              | 0.000719698 |
PROGRESS: | 2       | 1.72s        | 0.88606           | 0.941306              | 0.000217921 |
PROGRESS: | 3       | 1.96s        | 0.875021          | 0.935425              | 0.0001284   |
PROGRESS: | 4       | 2.28s        | 0.869428          | 0.932431              | 9.10126e-05 |
PROGRESS: | 5       | 2.53s        | 0.865827          | 0.930498              | 7.04879e-05 |
PROGRESS: | 6       | 2.78s        | 0.863238          | 0.929106              | 5.7517e-05  |
PROGRESS: | 7       | 3.10s        | 0.861245          | 0.928033              | 4.85779e-05 |
PROGRESS: | 11      | 4.16s        | 0.856214          | 0.925318              | 2.99555e-05 |
PROGRESS: | 12      | 4.43s        | 0.855349          | 0.924851              | 2.73357e-05 |
PROGRESS: | 17      | 5.65s        | 0.852209          | 0.923152              | 1.9019e-05  |
PROGRESS: | 22      | 6.87s        | 0.850136          | 0.922028              | 1.45824e-05 |
PROGRESS: | 27      | 8.09s        | 0.848619          | 0.921205              | 1.18242e-05 |
PROGRESS: | 32      | 9.30s        | 0.847437          | 0.920563              | 9.94342e-06 |
PROGRESS: | 37      | 10.52s       | 0.846477          | 0.920042              | 8.57885e-06 |
PROGRESS: | 42      | 11.75s       | 0.845673          | 0.919605              | 7.54362e-06 |
PROGRESS: | 47      | 12.97s       | 0.844986          | 0.919231              | 6.73133e-06 |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.844538
PROGRESS:        Final training RMSE: 0.918988
PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 3199340 observations with 12020 users and 17188 items.
PROGRESS:     Data prepared in: 1.75709s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 5        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 0.0001   |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 399917 / 3199340 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 0.230303          | Not Viable                               |
PROGRESS: | 1       | 0.0575758         | No Decrease (2.89944 >= 1.1454)          |
PROGRESS: | 2       | 0.014394          | 0.81024                                  |
PROGRESS: | 3       | 0.00719698        | 0.83183                                  |
PROGRESS: | 4       | 0.00359849        | 0.857627                                 |
PROGRESS: | 5       | 0.00179924        | 0.889854                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.014394          | 0.81024                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 96us         | 1.14612           | 1.07057               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 330.484ms    | DIVERGED          | DIVERGED              | 0.014394    |
PROGRESS: | RESET   | 406.578ms    | 1.14612           | 1.07057               |             |
PROGRESS: | 1       | 668.368ms    | 1.10355           | 0.920537              | 0.00719698  |
PROGRESS: | 2       | 927.856ms    | 0.912146          | 0.890133              | 0.00217921  |
PROGRESS: | 3       | 1.19s        | 0.872743          | 0.884389              | 0.001284    |
PROGRESS: | 4       | 1.46s        | 0.860191          | 0.88408               | 0.000910126 |
PROGRESS: | 5       | 1.72s        | 0.853122          | 0.884082              | 0.000704879 |
PROGRESS: | 6       | 1.98s        | 0.848552          | 0.884224              | 0.00057517  |
PROGRESS: | 9       | 2.73s        | 0.841555          | 0.884854              | 0.000370587 |
PROGRESS: | 11      | 3.23s        | 0.838957          | 0.885186              | 0.000299555 |
PROGRESS: | 14      | 3.98s        | 0.836326          | 0.885536              | 0.000232662 |
PROGRESS: | 19      | 5.24s        | 0.833631          | 0.885951              | 0.000169556 |
PROGRESS: | 24      | 6.52s        | 0.831958          | 0.886253              | 0.000133379 |
PROGRESS: | 29      | 7.80s        | 0.830745          | 0.886457              | 0.000109925 |
PROGRESS: | 34      | 9.09s        | 0.829858          | 0.886629              | 9.34862e-05 |
PROGRESS: | 39      | 10.35s       | 0.829149          | 0.886761              | 8.13244e-05 |
PROGRESS: | 44      | 11.62s       | 0.828574          | 0.886872              | 7.19626e-05 |
PROGRESS: | 49      | 12.88s       | 0.828091          | 0.886962              | 6.45337e-05 |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.828566
PROGRESS:        Final training RMSE: 0.887295
PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 3199340 observations with 12020 users and 17188 items.
PROGRESS:     Data prepared in: 1.74667s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 5        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-05    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 399917 / 3199340 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 2.30303           | Not Viable                               |
PROGRESS: | 1       | 0.575758          | Not Viable                               |
PROGRESS: | 2       | 0.14394           | 0.848182                                 |
PROGRESS: | 3       | 0.0719698         | 0.747884                                 |
PROGRESS: | 4       | 0.0359849         | 0.777004                                 |
PROGRESS: | 5       | 0.0179924         | 0.78454                                  |
PROGRESS: | 6       | 0.00899622        | 0.801716                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.0719698         | 0.747884                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 95us         | 1.14611           | 1.07056               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 333.097ms    | DIVERGED          | DIVERGED              | 0.0719698   |
PROGRESS: | RESET   | 409.441ms    | 1.14612           | 1.07057               |             |
PROGRESS: | 1       | 674.458ms    | 0.945166          | 0.933644              | 0.0359849   |
PROGRESS: | 2       | 922.186ms    | 0.833656          | 0.876518              | 0.0167272   |
PROGRESS: | 3       | 1.17s        | 0.794613          | 0.851602              | 0.0108961   |
PROGRESS: | 4       | 1.42s        | 0.78035           | 0.842591              | 0.00807953  |
PROGRESS: | 5       | 1.67s        | 0.773427          | 0.838109              | 0.00642001  |
PROGRESS: | 6       | 1.92s        | 0.769018          | 0.835196              | 0.00532605  |
PROGRESS: | 9       | 2.66s        | 0.76163           | 0.830728              | 0.0035244   |
PROGRESS: | 11      | 3.16s        | 0.758812          | 0.829095              | 0.00287585  |
PROGRESS: | 14      | 3.90s        | 0.756146          | 0.827604              | 0.00225376  |
PROGRESS: | 19      | 5.13s        | 0.753463          | 0.826105              | 0.00165653  |
PROGRESS: | 24      | 6.40s        | 0.751875          | 0.825247              | 0.00130952  |
PROGRESS: | 29      | 7.76s        | 0.750733          | 0.824628              | 0.00108271  |
PROGRESS: | 34      | 9.48s        | 0.749926          | 0.824199              | 0.000922874 |
PROGRESS: | 39      | 11.38s       | 0.749352          | 0.8239                | 0.000804157 |
PROGRESS: | 44      | 13.18s       | 0.748856          | 0.82364               | 0.000712502 |
PROGRESS: | 49      | 14.50s       | 0.748486          | 0.823446              | 0.000639602 |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.74713
PROGRESS:        Final training RMSE: 0.822629
PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 3199340 observations with 12020 users and 17188 items.
PROGRESS:     Data prepared in: 1.7611s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 5        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-06    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 399917 / 3199340 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 23.0303           | Not Viable                               |
PROGRESS: | 1       | 5.75758           | Not Viable                               |
PROGRESS: | 2       | 1.4394            | Not Viable                               |
PROGRESS: | 3       | 0.359849          | Not Viable                               |
PROGRESS: | 4       | 0.0899622         | 0.64469                                  |
PROGRESS: | 5       | 0.0449811         | 0.776938                                 |
PROGRESS: | 6       | 0.0224906         | 0.781721                                 |
PROGRESS: | 7       | 0.0112453         | 0.795259                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.0899622         | 0.64469                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 91us         | 1.14611           | 1.07057               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 271.257ms    | 1.06564           | 1.02546               | 0.0899622   |
PROGRESS: | 2       | 519.436ms    | 0.893836          | 0.9386                | 0.0534919   |
PROGRESS: | 3       | 781.26ms     | 0.821313          | 0.898336              | 0.0394657   |
PROGRESS: | 4       | 1.02s        | 0.787446          | 0.878677              | 0.0318065   |
PROGRESS: | 5       | 1.27s        | 0.767935          | 0.867039              | 0.026905    |
PROGRESS: | 6       | 1.52s        | 0.754864          | 0.859049              | 0.0234664   |
PROGRESS: | 10      | 2.52s        | 0.728348          | 0.842483              | 0.0159978   |
PROGRESS: | 11      | 2.77s        | 0.724747          | 0.840175              | 0.0148942   |
PROGRESS: | 15      | 3.76s        | 0.714201          | 0.833394              | 0.011803    |
PROGRESS: | 20      | 4.97s        | 0.706563          | 0.828444              | 0.00951235  |
PROGRESS: | 25      | 6.20s        | 0.701591          | 0.8252                | 0.00804647  |
PROGRESS: | 30      | 7.43s        | 0.698227          | 0.822989              | 0.0070181   |
PROGRESS: | 35      | 8.67s        | 0.695756          | 0.821358              | 0.00625186  |
PROGRESS: | 40      | 9.89s        | 0.693684          | 0.820002              | 0.00565608  |
PROGRESS: | 45      | 11.18s       | 0.692125          | 0.818967              | 0.00517787  |
PROGRESS: | 50      | 12.68s       | 0.690705          | 0.818038              | 0.00478446  |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.678585
PROGRESS:        Final training RMSE: 0.810596

In [7]:
# Save the train and test RMSE, for each model
(rmse_train, rmse_test) = ([], [])
for m in models:
    rmse_train.append(m['training_rmse'])
    rmse_test.append(gl.evaluation.rmse(test_set['rating'], m.predict(test_set)))

Let's create a plot to show the RMSE for these different regularization values.


In [8]:
(fig, ax) = plt.subplots(figsize=(10, 8))
[p1, p2, p3] = ax.semilogx(regularization_vals, rmse_train, 
                           regularization_vals, rmse_test, 
                           regularization_vals, len(regularization_vals) * [baseline_rmse]
                           )
ax.set_ylim([0.7, 1.1])
ax.set_xlabel('Regularization', fontsize=20)
ax.set_ylabel('RMSE', fontsize=20)
ax.legend([p1, p2, p3], ["Train", "Test", "Baseline"])


Out[8]:
<matplotlib.legend.Legend at 0x7f3b984ea650>

Looks like we get the best Root Mean Squared Error (on the test set) when num_factors = 5 and regularization = 0.00001. (We only varied the regularization; num_factors was fixed at 5 throughout.) Let's use those parameters!
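
Rather than reading the best value off the plot, we could also pick it programmatically. The helper below is a hypothetical sketch, and the numbers in the usage example are made up purely for illustration; they are not results from this notebook.

```python
def best_by_test_score(param_values, test_scores):
    """Return the (parameter, score) pair with the lowest score."""
    if len(param_values) != len(test_scores) or len(param_values) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    best_index = min(range(len(test_scores)), key=lambda idx: test_scores[idx])
    return param_values[best_index], test_scores[best_index]


# Made-up values for illustration only (NOT this notebook's results).
example_vals = [0.1, 0.01, 0.001]
example_rmse = [1.2, 0.9, 1.0]
best_val, best_rmse = best_by_test_score(example_vals, example_rmse)
```

With the variables from the cells above, the call would be best_by_test_score(regularization_vals, rmse_test). Keep in mind that once the test set has been used to choose a parameter, its RMSE is no longer an unbiased estimate of performance on new data; a separate validation set avoids this.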

Load a Larger Dataset

Now let's train a new model (using the parameters we've picked) with a larger sample. This time we'll use a dataset that contains all ratings for about 30,000 users.


In [9]:
data_url = 'https://static.turi.com/datasets/movie_ratings/sample.large'
data = gl.SFrame.read_csv(data_url,delimiter='\t',column_type_hints={'rating':int})


PROGRESS: Downloading https://static.turi.com/datasets/movie_ratings/sample.large to /var/tmp/graphlab-toby/24264/000001.large
PROGRESS: Finished parsing file https://static.turi.com/datasets/movie_ratings/sample.large
PROGRESS: Parsing completed. Parsed 100 lines in 0.74785 secs.
PROGRESS: Read 1549015 lines. Lines per second: 2.21292e+06
PROGRESS: Finished parsing file https://static.turi.com/datasets/movie_ratings/sample.large
PROGRESS: Parsing completed. Parsed 10000000 lines in 2.74104 secs.

The Action Lover and the Romance Lover

Let's see how this model behaves for two simulated/stereotypical users who have only rated a small number of movies. We'll create two test users: one who loves action movies and hates romance movies, and another who loves romance movies and hates action movies.


In [10]:
action_movies = ['GoldenEye', 'Casino Royale', 'Independence Day', 'Con Air', 'The Rock',
                 'The Bourne Identity', 'Ocean\'s Eleven', 'Lethal Weapon 4', 'Gladiator',
                 'Indiana Jones and the Last Crusade', 'The Matrix', 'Kill Bill: Vol. 1',
                 'Air Force One', 'Braveheart', 'The Man with the Golden Gun',
                 'The Bourne Supremacy', 'Saving Private Ryan']

romance_movies = ['Sleepless in Seattle', 'An Affair to Remember', 'Ghost', 'Love Actually',
                  'You\'ve Got Mail', 'Notting Hill', 'Titanic', 'Miss Congeniality',
                  'Some Like It Hot', 'Pretty Woman', 'How to Lose a Guy in 10 Days']

# Boring helper function to create ratings
def ratings(movie_list, user, rating):
    num = len(movie_list)
    records = {'user': [user] * num, 'movie': movie_list, 'rating': [rating] * num}
    return gl.SFrame(records)

# Loves action movies, hates romance movies
action_user = 'Archie the Action Lover'
action_user_ratings = ratings(action_movies, action_user, 5)
action_user_ratings = action_user_ratings.append(ratings(romance_movies, action_user, 1))

# Loves romance movies, hates action movies
romantic_user = 'Rebecca the Romance Lover'
romantic_user_ratings = ratings(action_movies, romantic_user, 1)
romantic_user_ratings = romantic_user_ratings.append(ratings(romance_movies, romantic_user, 5))

data = data.append(action_user_ratings)
data = data.append(romantic_user_ratings)

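Now, let's create a matrix factorization model using the larger data sample. Note that this time we use the ranking_factorization_recommender, a variant aimed at ranking items rather than only predicting ratings, so the parameters tuned above with the factorization_recommender are carried over as a reasonable starting point rather than re-tuned.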


In [11]:
# Create a new model, using the larger dataset, with the tuned parameters
m = gl.ranking_factorization_recommender.create(data, 'user', 'movie', 'rating', 
                                                max_iterations=50, num_factors=5,
                                                regularization=0.00001)


PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 10000056 observations with 30436 users and 17284 items.
PROGRESS:     Data prepared in: 5.91371s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 5        |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-05    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
PROGRESS: | ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 50       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 1250007 / 10000056 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 0.73682           | Not Viable                               |
PROGRESS: | 1       | 0.184205          | Not Viable                               |
PROGRESS: | 2       | 0.0460513         | 1.43195                                  |
PROGRESS: | 3       | 0.0230256         | 1.49754                                  |
PROGRESS: | 4       | 0.0115128         | 1.51573                                  |
PROGRESS: | 5       | 0.00575641        | 1.55046                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.0460513         | 1.43195                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 276us        | 2.21269           | 1.06232               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 3.03s        | DIVERGED          | DIVERGED              | 0.0460513   |
PROGRESS: | RESET   | 3.87s        | 2.21268           | 1.06232               |             |
PROGRESS: | 1       | 6.10s        | 1.76515           | 1.04168               | 0.0230256   |
PROGRESS: | 2       | 8.29s        | 1.496             | 0.988318              | 0.00697202  |
PROGRESS: | 3       | 10.55s       | 1.39915           | 0.943215              | 0.00410794  |
PROGRESS: | 4       | 12.88s       | 1.37948           | 0.937378              | 0.00291179  |
PROGRESS: | 5       | 15.09s       | 1.3731            | 0.934749              | 0.00225514  |
PROGRESS: | 6       | 17.39s       | 1.36922           | 0.933352              | 0.00184015  |
PROGRESS: | 7       | 19.99s       | 1.36645           | 0.932667              | 0.00155416  |
PROGRESS: | 8       | 22.60s       | 1.36499           | 0.932007              | 0.00134511  |
PROGRESS: | 9       | 24.99s       | 1.36376           | 0.93172               | 0.00118563  |
PROGRESS: | 10      | 27.20s       | 1.36259           | 0.931271              | 0.00105996  |
PROGRESS: | 11      | 29.41s       | 1.36159           | 0.931036              | 0.000958373 |
PROGRESS: | 12      | 31.63s       | 1.36065           | 0.930814              | 0.000874557 |
PROGRESS: | 13      | 33.93s       | 1.36026           | 0.930647              | 0.000804223 |
PROGRESS: | 14      | 36.12s       | 1.35963           | 0.930484              | 0.000744359 |
PROGRESS: | 15      | 38.33s       | 1.35922           | 0.93036               | 0.000692791 |
PROGRESS: | 16      | 40.55s       | 1.35868           | 0.930279              | 0.000647904 |
PROGRESS: | 17      | 42.76s       | 1.35827           | 0.930106              | 0.00060848  |
PROGRESS: | 18      | 44.96s       | 1.35792           | 0.930096              | 0.000573579 |
PROGRESS: | 19      | 47.17s       | 1.35757           | 0.929963              | 0.000542464 |
PROGRESS: | 20      | 49.37s       | 1.35714           | 0.929868              | 0.000514551 |
PROGRESS: | 21      | 51.57s       | 1.357             | 0.929816              | 0.000489371 |
PROGRESS: | 22      | 53.79s       | 1.35692           | 0.929796              | 0.000466539 |
PROGRESS: | 23      | 56.00s       | 1.35664           | 0.929771              | 0.000445744 |
PROGRESS: | 24      | 58.17s       | 1.3567            | 0.929697              | 0.000426723 |
PROGRESS: | 25      | 1m 0s        | 1.35615           | 0.929617              | 0.000407887 |
PROGRESS: | 26      | 1m 2s        | 1.35614           | 0.929608              | 0.000390406 |
PROGRESS: | 27      | 1m 4s        | 1.35604           | 0.92957               | 0.000374104 |
PROGRESS: | 28      | 1m 7s        | 1.35573           | 0.929503              | 0.000360613 |
PROGRESS: | 29      | 1m 9s        | 1.35573           | 0.929508              | 0.000348061 |
PROGRESS: | 30      | 1m 11s       | 1.35551           | 0.929469              | 0.000336354 |
PROGRESS: | 31      | 1m 13s       | 1.3554            | 0.929429              | 0.000325409 |
PROGRESS: | 32      | 1m 15s       | 1.3553            | 0.929405              | 0.000315153 |
PROGRESS: | 33      | 1m 18s       | 1.35519           | 0.929338              | 0.000305525 |
PROGRESS: | 34      | 1m 20s       | 1.35522           | 0.92938               | 0.000296467 |
PROGRESS: | 35      | 1m 22s       | 1.35495           | 0.929319              | 0.000286789 |
PROGRESS: | 36      | 1m 24s       | 1.35502           | 0.929336              | 0.000277522 |
PROGRESS: | 37      | 1m 26s       | 1.3547            | 0.929303              | 0.000268611 |
PROGRESS: | 38      | 1m 29s       | 1.35472           | 0.929248              | 0.000260004 |
PROGRESS: | 39      | 1m 31s       | 1.35461           | 0.929242              | 0.000251652 |
PROGRESS: | 40      | 1m 33s       | 1.35469           | 0.929233              | 0.00024351  |
PROGRESS: | 41      | 1m 35s       | 1.35429           | 0.929219              | 0.000235532 |
PROGRESS: | 42      | 1m 38s       | 1.35422           | 0.929119              | 0.000207464 |
PROGRESS: | 43      | 1m 40s       | 1.35385           | 0.928963              | 0.000171404 |
PROGRESS: | 44      | 1m 42s       | 1.35384           | 0.928982              | 0.000168474 |
PROGRESS: | 45      | 1m 44s       | 1.35377           | 0.928969              | 0.000165658 |
PROGRESS: | 46      | 1m 46s       | 1.35392           | 0.928978              | 0.00016295  |
PROGRESS: | 47      | 1m 49s       | 1.35343           | 0.928864              | 0.000134831 |
PROGRESS: | 48      | 1m 51s       | 1.35317           | 0.928789              | 0.000111603 |
PROGRESS: | 49      | 1m 53s       | 1.35328           | 0.928718              | 9.24064e-05 |
PROGRESS: | 50      | 1m 55s       | 1.35305           | 0.928652              | 7.65357e-05 |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 1.36312
PROGRESS:        Final training RMSE: 0.929411
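
The "Training RMSE" figures in the log are root-mean-squared errors between predicted and actual ratings on the training set. The per-iteration values are labeled approximate, and the final value is computed separately once optimization stops, which is why it differs slightly from the last row of the table. As a minimal sketch of what the metric measures (the `rmse` helper and the toy numbers are ours for illustration, not part of GraphLab Create):

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Toy values, made up for illustration only.
example_predictions = [4.0, 3.0, 5.0, 2.0]
example_ratings = [5, 3, 4, 2]
print(rmse(example_predictions, example_ratings))
```

An RMSE near 0.93 therefore means the model's predicted ratings are typically a bit under one star away from the training ratings.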

Let's see what recommendations we get for our action lover.


In [12]:
# Show recommendations for the action lover.
recommendations = m.recommend(gl.SArray([action_user]), k=40)
print recommendations['movie']


['Lord of the Rings: The Return of the King: Extended Edition', 'The Lord of the Rings: The Fellowship of the Ring: Extended Edition', 'Lord of the Rings: The Return of the King', 'Lord of the Rings: The Two Towers: Extended Edition', 'Star Wars: Episode V: The Empire Strikes Back', 'Lord of the Rings: The Two Towers', 'Raiders of the Lost Ark', 'Star Wars: Episode IV: A New Hope', 'The Shawshank Redemption: Special Edition', 'Lord of the Rings: The Fellowship of the Ring', 'Star Wars: Episode VI: Return of the Jedi', 'GoodFellas: Special Edition', 'The Silence of the Lambs', 'Band of Brothers', 'Pulp Fiction', 'The Godfather', 'The Usual Suspects', 'Full Metal Jacket', 'Seven', 'The Terminator', 'The Sopranos: Season 1', 'Scarface: 20th Anniversary Edition', 'The Sopranos: Season 2', 'The Incredibles', "National Lampoon's Animal House", 'Die Hard', 'Batman Begins', 'Terminator 2: Extreme Edition', 'Forrest Gump', 'The Sixth Sense', 'Caddyshack', "Schindler's List", 'The Sopranos: Season 4', 'The Green Mile', 'The Sopranos: Season 3', "Aliens: Collector's Edition", 'Reservoir Dogs', 'Indiana Jones and the Temple of Doom', 'The Shining', "Alien: Collector's Edition"]

Let's see what recommendations we get for our romance lover.


In [13]:
# Show recommendations for the romance lover.
recommendations = m.recommend(gl.SArray([romantic_user]), k=40)
print recommendations['movie']


['Sex and the City: Season 6: Part 1', 'Steel Magnolias', 'Sex and the City: Season 6: Part 2', 'Sex and the City: Season 4', 'Sex and the City: Season 5', 'Sex and the City: Season 3', 'Sex and the City: Season 2', 'The Sound of Music', 'Fried Green Tomatoes', 'Sex and the City: Season 1', 'The Notebook', 'Finding Nemo (Widescreen)', "Gone with the Wind: Collector's Edition", 'Sense and Sensibility', 'Beaches', 'Hotel Rwanda', 'The Shawshank Redemption: Special Edition', 'Pride and Prejudice', 'My Big Fat Greek Wedding', 'When Harry Met Sally', 'Dirty Dancing', "Schindler's List", 'Mary Poppins', 'My Fair Lady: Special Edition', 'Chocolat', 'Bend It Like Beckham', "Bridget Jones's Diary", 'Erin Brockovich', 'To Kill a Mockingbird', "The Wizard of Oz: Collector's Edition", "Something's Gotta Give", 'Philadelphia', 'Shrek (Full-screen)', 'Finding Neverland', 'The Sixth Sense', 'Terms of Endearment', 'Calendar Girls', 'Life Is Beautiful', 'Lord of the Rings: The Return of the King: Extended Edition', 'The Lord of the Rings: The Fellowship of the Ring: Extended Edition']

Looks good to me, especially considering these users haven't rated many movies!
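
One rough way to back up that impression is to check how much the two lists overlap. By our count, five of the 40 titles printed above appear in both lists, so the model is clearly separating the two users rather than just returning globally popular movies. Below is a minimal sketch of that check in plain Python. The `shared_titles` helper is ours, and the two lists are short excerpts copied from the output above rather than the full results. In the notebook you could presumably pass `list(recommendations['movie'])` from each cell instead.

```python
def shared_titles(first, second):
    """Return titles present in both lists, in the order they appear in `first`."""
    second_set = set(second)
    return [title for title in first if title in second_set]

# Short excerpts copied from the two printed lists above (not the full 40 titles).
action_excerpt = [
    'Lord of the Rings: The Return of the King: Extended Edition',
    'Star Wars: Episode V: The Empire Strikes Back',
    'The Shawshank Redemption: Special Edition',
    'Die Hard',
    "Schindler's List",
]
romance_excerpt = [
    'Sex and the City: Season 6: Part 1',
    'Steel Magnolias',
    'The Shawshank Redemption: Special Edition',
    "Schindler's List",
    'Lord of the Rings: The Return of the King: Extended Edition',
]

shared = shared_titles(action_excerpt, romance_excerpt)
overlap_fraction = float(len(shared)) / len(action_excerpt)
print(shared)
print(overlap_fraction)
```

The excerpts were chosen to include some of the shared titles, so the fraction here is higher than for the full lists. A low overlap on the full lists suggests the recommendations are personalized, though it says nothing about whether each list is actually a good fit for its user.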

(Looking for more details about the modules and functions? Check out the API docs.)