import turicreate as tc
# set canvas to show sframes and sgraphs in ipython notebook
# import matplotlib.pyplot as plt
# %matplotlib inline

# download data from: http://files.grouplens.org/datasets/movielens/ml-1m.zip

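# Each line of ratings.dat has the form user_id::movie_id::rating::timestamp.
# Read every line as one string column (X1), split it on '::', and unpack into columns.
data = tc.SFrame.read_csv('/Users/datalab/bigdata/cjc/ml-1m/ratings.dat', delimiter='\n',
                          header=False)['X1'].apply(lambda x: x.split('::')).unpack()
# The unpacked fields are strings; all four ratings columns hold integers.
for col in data.column_names():
    data[col] = data[col].astype(int)
data = data.rename({'X.0': 'user_id', 'X.1': 'movie_id', 'X.2': 'rating', 'X.3': 'timestamp'})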

Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/ratings.dat
Parsing completed. Parsed 100 lines in 0.281192 secs.
Inferred types from first 100 line(s) of file as 
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/ratings.dat
Parsing completed. Parsed 1000209 lines in 0.372092 secs.
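
The split-and-cast steps used to load ratings.dat can be illustrated on a single raw line without Turi Create. This is only a minimal sketch: the helper name parse_rating_line is hypothetical, and the sample line is the first row of the ratings table as shown in this notebook.

```python
def parse_rating_line(line):
    """Split one '::'-delimited line of ratings.dat into named integer fields."""
    user_id, movie_id, rating, timestamp = line.strip().split('::')
    return {
        'user_id': int(user_id),
        'movie_id': int(movie_id),
        'rating': int(rating),
        'timestamp': int(timestamp),
    }


# First rating of user 1, written in the raw file format.
row = parse_rating_line('1::1193::5::978300760')
print(row)
```

The SFrame version does the same work column-wise: apply() performs the split, unpack() spreads the pieces into columns, and astype(int) does the cast.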

# users.dat uses the same '::'-delimited layout; only user_id is cast to int here,
# so gender, age, occupation and zip-code stay as strings.
users = tc.SFrame.read_csv('/Users/datalab/bigdata/cjc/ml-1m/users.dat', delimiter='\n',
                           header=False)['X1'].apply(lambda x: x.split('::')).unpack()
users = users.rename({'X.0': 'user_id', 'X.1': 'gender', 'X.2': 'age', 'X.3': 'occupation', 'X.4': 'zip-code'})
users['user_id'] = users['user_id'].astype(int)

Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/users.dat
Parsing completed. Parsed 100 lines in 0.028041 secs.
Inferred types from first 100 line(s) of file as 
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
Finished parsing file /Users/datalab/bigdata/cjc/ml-1m/users.dat
Parsing completed. Parsed 6040 lines in 0.007235 secs.

#items = tc.SFrame.read_csv('/Users/datalab/bigdata/ml-1m/movies.dat', delimiter='\n', header=False)#['X1'].apply(lambda x: x.split('::')).unpack()
# items = items.rename({'X.0': 'movie_id', 'X.1': 'title', 'X.2': 'genre'})
# items['movie_id'] = items['movie_id'].astype(int)
# items.save('items')

user_id movie_id rating timestamp
1 1193 5 978300760
1 661 3 978302109
1 914 3 978301968
1 3408 4 978300275
1 2355 5 978824291
1 1197 3 978302268
1 1287 5 978302039
1 2804 5 978300719
1 594 4 978302268
1 919 4 978301368
[1000209 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

user_id gender age occupation zip-code
1 F 1 10 48067
2 M 56 16 70072
3 M 25 15 55117
4 M 45 7 02460
5 M 25 20 55455
6 F 50 9 55117
7 M 35 1 06810
8 M 25 12 11413
9 M 25 17 61614
10 F 35 1 95370
[6040 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

#data = data.join(items, on='movie_id')

train_set, test_set = data.random_split(0.95, seed=1)
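
The idea behind a seeded random split can be sketched in plain Python. This sketch assumes each row is assigned to the first set independently with the given probability; Turi Create's internal implementation may differ, and the function below is a hypothetical stand-in, not the library's code. Under that assumption the split is only approximately 95/5, which is consistent with the 949852 training observations (out of 1000209 ratings) reported during model training.

```python
import random


def random_split(rows, fraction, seed=None):
    """Assign each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.95, seed=1)
print(len(train_rows), len(test_rows))
```

Reusing the same seed reproduces the same split, which keeps later model comparisons repeatable.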

m = tc.recommender.create(train_set, 'user_id', 'movie_id', 'rating')

Preparing data set.
    Data has 949852 observations with 6040 users and 3701 items.
    Data prepared in: 0.550091s
Training ranking_factorization_recommender for recommendations.
| Parameter                      | Description                                      | Value    |
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | adagrad  |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
| max_iterations                 | Maximum Number of Iterations                     | 25       |
  Optimizing model using SGD; tuning step size.
  Using 118731 / 949852 points for tuning the step size.
| Attempt | Initial Step Size | Estimated Objective Value                |
| 0       | 16.6667           | Not Viable                               |
| 1       | 4.16667           | Not Viable                               |
| 2       | 1.04167           | Not Viable                               |
| 3       | 0.260417          | Not Viable                               |
| 4       | 0.0651042         | 1.8722                                   |
| 5       | 0.0325521         | 1.94425                                  |
| 6       | 0.016276          | 1.95877                                  |
| 7       | 0.00813802        | 2.0441                                   |
| Final   | 0.0651042         | 1.8722                                   |
Starting Optimization.
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
| Initial | 110us        | 2.44718           | 1.1172                |             |
| 1       | 536.251ms    | 2.09737           | 1.13925               | 0.0651042   |
| 2       | 1.05s        | 1.85594           | 1.06079               | 0.0651042   |
| 3       | 1.55s        | 1.79883           | 1.03161               | 0.0651042   |
| 4       | 2.06s        | 1.77231           | 1.02676               | 0.0651042   |
| 5       | 2.57s        | 1.75455           | 1.02264               | 0.0651042   |
| 10      | 5.81s        | 1.66968           | 0.995516              | 0.0651042   |
| 20      | 12.34s       | 1.58039           | 0.969493              | 0.0651042   |
| 25      | 15.69s       | 1.54869           | 0.961055              | 0.0651042   |
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 1.57752
       Final training RMSE: 0.95536

Class                            : RankingFactorizationRecommender

User ID                          : user_id
Item ID                          : movie_id
Target                           : rating
Additional observation features  : 1
User side features               : []
Item side features               : []

Number of observations           : 949852
Number of users                  : 6040
Number of items                  : 3701

Training summary
Training time                    : 21.9973

Model Parameters
Model class                      : RankingFactorizationRecommender
num_factors                      : 32
binary_target                    : 0
side_data_factorization          : 1
solver                           : auto
nmf                              : 0
max_iterations                   : 25

Regularization Settings
regularization                   : 0.0
regularization_type              : normal
linear_regularization            : 0.0
ranking_regularization           : 0.25
unobserved_rating_value          : -1.7976931348623157e+308
num_sampled_negative_examples    : 4
ials_confidence_scaling_type     : auto
ials_confidence_scaling_factor   : 1

Optimization Settings
init_random_sigma                : 0.01
sgd_convergence_interval         : 4
sgd_convergence_threshold        : 0.0
sgd_max_trial_iterations         : 5
sgd_sampling_block_size          : 131072
sgd_step_adjustment_interval     : 4
sgd_step_size                    : 0.0
sgd_trial_sample_minimum_size    : 10000
sgd_trial_sample_proportion      : 0.125
step_size_decrease_rate          : 0.75
additional_iterations_if_unhealthy : 5
adagrad_momentum_weighting       : 0.9
num_tempering_iterations         : 4
tempering_regularization_start_value : 0.0
track_exact_loss                 : 0
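
For orientation, a factorization model of this kind scores a (user, item) pair from learned biases plus the dot product of two latent factor vectors. The sketch below ignores side features and the extra timestamp feature this model picked up, uses 3 factors instead of the 32 reported above, and all numeric values are made up for illustration; predict_score is a hypothetical helper, not part of Turi Create.

```python
def predict_score(global_bias, user_bias, item_bias, user_factors, item_factors):
    """Bias terms plus the dot product of the user and item latent factors."""
    interaction = sum(u * v for u, v in zip(user_factors, item_factors))
    return global_bias + user_bias + item_bias + interaction


# Illustrative values only; a trained model learns these from the ratings.
score = predict_score(
    global_bias=3.5,
    user_bias=0.25,
    item_bias=-0.5,
    user_factors=[1.0, 0.5, -1.0],
    item_factors=[0.5, 1.0, 0.25],
)
print(score)
```

Training adjusts the biases and factors so that scores like this one match the observed ratings, which is what the decreasing training RMSE in the log reflects.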

m2 = tc.item_similarity_recommender.create(train_set,
                                           'user_id', 'movie_id', 'rating',
                                           similarity_type='pearson')

Warning: Ignoring columns timestamp;
    To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
    Data has 949852 observations with 6040 users and 3701 items.
    Data prepared in: 0.426101s
Training model from provided data.
Gathering per-item and per-user statistics.
| Elapsed Time (Item Statistics) | % Complete |
| 27.234ms                       | 16.5       |
| 42.954ms                       | 100        |
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
| 73.627ms                            | 0                | 2               |
| 2.79s                               | 100              | 3701            |
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 2.82252s

Class                            : ItemSimilarityRecommender

User ID                          : user_id
Item ID                          : movie_id
Target                           : rating
Additional observation features  : 0
User side features               : []
Item side features               : []

Number of observations           : 949852
Number of users                  : 6040
Number of items                  : 3701

Training summary
Training time                    : 2.8226

Model Parameters
Model class                      : ItemSimilarityRecommender
threshold                        : 0.001
similarity_type                  : pearson
training_method                  : auto

Other Settings
max_data_passes                  : 4096
max_item_neighborhood_size       : 64
nearest_neighbors_interaction_proportion_threshold : 0.05
target_memory_usage              : 8589934592
sparse_density_estimation_sample_size : 4096
degree_approximation_threshold   : 4096
seed_item_set_size               : 50
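
The model summary reports similarity_type pearson. A minimal sketch of a Pearson similarity between two items follows, computed over the users who rated both and centred on the mean rating among those users. The library's exact definition (which mean it uses, thresholds, neighbourhood pruning) may differ, and pearson_similarity is a hypothetical helper with toy data.

```python
import math


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation of two items over users who rated both.

    Each argument maps user_id -> rating for one item.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # not enough overlap to measure correlation
    a = [ratings_a[u] for u in common]
    b = [ratings_b[u] for u in common]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant ratings carry no correlation signal
    return cov / math.sqrt(var_a * var_b)


# Toy ratings: items x and y rise and fall together, x and z move oppositely.
item_x = {1: 5, 2: 3, 3: 1}
item_y = {1: 4, 2: 3, 3: 2, 9: 5}
item_z = {1: 1, 2: 3, 3: 5}
print(pearson_similarity(item_x, item_y))
print(pearson_similarity(item_x, item_z))
```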

result = tc.recommender.util.compare_models(test_set,
                                            [m, m2],
                                            user_sample=.5, skip_set=train_set)

compare_models: using 2811 users to estimate model performance
PROGRESS: Evaluate model M0
recommendations finished on 1000/2811 queries. users per second: 10084.7
recommendations finished on 2000/2811 queries. users per second: 10557.4
Precision and recall summary statistics by cutoff
| cutoff |     mean_recall      |    mean_precision    |
|   1    | 0.004372314245294037 | 0.03344005691924596  |
|   2    | 0.008439255238125647 | 0.030771967271433692 |
|   3    | 0.011792773608123091 | 0.029764022293371297 |
|   4    | 0.014103362205887681 | 0.027303450729277888 |
|   5    | 0.017724646480050326 | 0.026894343649946677 |
|   6    | 0.01985047799128097  | 0.02549507885687179  |
|   7    | 0.023037645809147193 | 0.025054632311836182 |
|   8    | 0.02564717744662357  | 0.024101743151903235 |
|   9    | 0.027494985038662042 | 0.023123443614372085 |
|   10   | 0.02954846065621093  | 0.022483102098897183 |
[10 rows x 3 columns]

Overall RMSE: 0.988323739301448

Per User RMSE (best)
| user_id |         rmse         | count |
|   4695  | 0.008856667044261357 |   1   |
[1 rows x 3 columns]

Per User RMSE (worst)
| user_id |        rmse       | count |
|   1102  | 2.957562522855876 |   1   |
[1 rows x 3 columns]

Per Item RMSE (best)
| movie_id |         rmse         | count |
|   3674   | 0.012974611607248221 |   1   |
[1 rows x 3 columns]

Per Item RMSE (worst)
| movie_id |        rmse        | count |
|   3886   | 3.4432479133103597 |   1   |
[1 rows x 3 columns]

PROGRESS: Evaluate model M1
recommendations finished on 1000/2811 queries. users per second: 23065.4
recommendations finished on 2000/2811 queries. users per second: 24766.9
Precision and recall summary statistics by cutoff
| cutoff | mean_recall | mean_precision |
|   1    |     0.0     |      0.0       |
|   2    |     0.0     |      0.0       |
|   3    |     0.0     |      0.0       |
|   4    |     0.0     |      0.0       |
|   5    |     0.0     |      0.0       |
|   6    |     0.0     |      0.0       |
|   7    |     0.0     |      0.0       |
|   8    |     0.0     |      0.0       |
|   9    |     0.0     |      0.0       |
|   10   |     0.0     |      0.0       |
[10 rows x 3 columns]

Overall RMSE: 0.977554609754323

Per User RMSE (best)
| user_id |          rmse         | count |
|   3872  | 4.440892098500626e-16 |   1   |
[1 rows x 3 columns]

Per User RMSE (worst)
| user_id |        rmse        | count |
|   5214  | 3.2845314102161183 |   2   |
[1 rows x 3 columns]

Per Item RMSE (best)
| movie_id | rmse | count |
|   1842   | 0.0  |   1   |
[1 rows x 3 columns]

Per Item RMSE (worst)
| movie_id | rmse | count |
|   572    | 4.0  |   1   |
[1 rows x 3 columns]
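
The metrics in these tables can be sketched for a single user as follows. This assumes the usual definitions: precision at cutoff k is the share of the top-k recommendations that appear in the user's held-out items, recall is the share of held-out items recovered, and RMSE compares predicted with actual ratings. The reported mean_precision and mean_recall presumably average the per-user values over the sampled users. The helper names and the held-out set are hypothetical.

```python
import math


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative data: five recommended movie ids and a made-up held-out set.
recommended = [318, 1198, 50, 593, 858]
relevant = {50, 2571}
print(precision_recall_at_k(recommended, relevant, k=5))
print(rmse([4.0, 2.0], [5, 3]))
```

The two kinds of metric measure different things: RMSE judges predicted rating values, while precision and recall judge the ranked list, so a model can do well on one and poorly on the other.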

Getting similar items

m.get_similar_items([1287])  # movie_id is Ben-Hur

movie_id similar score rank
1287 1262 0.8935538530349731 1
1287 1272 0.8684239983558655 2
1287 2662 0.8668187260627747 3
1287 3366 0.8548122048377991 4
1287 2948 0.8543752431869507 5
1287 3062 0.8494184017181396 6
1287 2947 0.8432653546333313 7
1287 3836 0.8384832739830017 8
1287 1304 0.8308332562446594 9
1287 1250 0.8267531394958496 10
[10 rows x 4 columns]

Help on method get_similar_items in module graphlab.toolkits.recommender.util:

get_similar_items(self, items=None, k=10, verbose=False) method of graphlab.toolkits.recommender.ranking_factorization_recommender.RankingFactorizationRecommender instance
    Get the k most similar items for each item in items.
    Each type of recommender has its own model for the similarity
    between items. For example, the item_similarity_recommender will
    return the most similar items according to the user-chosen
    similarity; the factorization_recommender will return the
    nearest items based on the cosine similarity between latent item factors.
    items : SArray or list; optional
        An :class:`~graphlab.SArray` or list of item ids for which to get
        similar items. If 'None', then return the `k` most similar items for
        all items in the training set.
    k : int, optional
        The number of similar items for each item.
    verbose : bool, optional
        Progress printing is shown.
    out : SFrame
        A SFrame with the top ranked similar items for each item. The
        columns `item`, 'similar', 'score' and 'rank', where
        `item` matches the item column name specified at training time.
        The 'rank' is between 1 and `k` and 'score' gives the similarity
        score of that item. The value of the score depends on the method
        used for computing item similarities.
    >>> sf = graphlab.SFrame({'user_id': ["0", "0", "0", "1", "1", "2", "2", "2"],
                              'item_id': ["a", "b", "c", "a", "b", "b", "c", "d"]})
    >>> m = graphlab.item_similarity_recommender.create(sf)
    >>> nn = m.get_similar_items()

The 'score' column gives the similarity between each listed item and the query item; how it is computed depends on the model that produced it.
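
According to the help text, a factorization recommender ranks similar items by the cosine similarity between their latent item factors. A minimal sketch of that measure follows, using short made-up vectors in place of the 32-dimensional factors the model learns; cosine_similarity is a hypothetical helper, not a Turi Create function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Vectors pointing the same way score 1; orthogonal vectors score 0.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure depends only on direction, two movies can be highly similar even when one has much larger factor values than the other.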

# m.get_similar_items([1287]).join(items, on={'similar': 'movie_id'}).sort('rank')

Making recommendations

recs = m.recommend()

recommendations finished on 1000/6040 queries. users per second: 11685.2
recommendations finished on 2000/6040 queries. users per second: 11654.4
recommendations finished on 3000/6040 queries. users per second: 11658.6
recommendations finished on 4000/6040 queries. users per second: 11321.5
recommendations finished on 5000/6040 queries. users per second: 11502.9
recommendations finished on 6000/6040 queries. users per second: 11105.9

user_id movie_id score rank
1 318 5.045622686663793 1
1 1198 4.862424055854009 2
1 50 4.76625474802606 3
1 593 4.766107517102884 4
1 858 4.747795152286218 5
1 1196 4.689315832028316 6
1 2858 4.678970253834652 7
1 2396 4.5986619915758835 8
1 110 4.588308471063303 9
1 2571 4.573408636072801 10
[60400 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

data[data['user_id'] == 4]

user_id movie_id rating timestamp
4 3468 5 978294008
4 1210 3 978293924
4 2951 4 978294282
4 1214 4 978294260
4 1036 4 978294282
4 260 5 978294199
4 2028 5 978294230
4 480 4 978294008
4 1196 2 978294199
4 1198 5 978294199
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.

# m.recommend(users=[4], k=20).join(items, on='movie_id')

Recommendations for new users

recent_data = tc.SFrame()
recent_data['movie_id'] = [30, 1000, 900, 883, 251, 200, 199, 180, 120, 991, 1212] 
recent_data['user_id'] = 99999
recent_data['rating'] = [2, 1, 3, 4, 0, 0, 1, 1, 1, 2, 3]

movie_id user_id rating
30 99999 2
1000 99999 1
900 99999 3
883 99999 4
251 99999 0
200 99999 0
199 99999 1
180 99999 1
120 99999 1
991 99999 2
[11 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

m2.recommend(users=[99999], new_observation_data=recent_data)#.join(items, on='movie_id').sort('rank')

user_id movie_id score rank
99999 3881 5.0 1
99999 3607 5.0 2
99999 1830 5.0 3
99999 989 5.0 4
99999 3172 5.0 5
99999 3233 5.0 6
99999 787 5.0 7
99999 3382 5.0 8
99999 3656 5.0 9
99999 3280 5.0 10
[10 rows x 4 columns]

Saving and loading models

# Save the trained model to disk, then load it back.
m.save('my_model')
m_again = tc.load_model('my_model')

