In [1]:
import graphlab as gl
gl.canvas.set_target("ipynb")

Curating a data set


In [2]:
# Download and parse
ratings = gl.SFrame.read_csv('ml-1m/ratings.dat', delimiter='::', header=False)
items = gl.SFrame.read_csv('ml-1m/movies.dat', delimiter='::', header=False)

# Rename columns
ratings = ratings.rename({'X1': 'user_id', 'X2': 'item_id', 'X3': 'score', 'X4': 'timestamp'})
items = items.rename({'X1': 'item_id', 'X2': 'title_year', 'X3': 'genres'})


2016-03-28 18:34:07,982 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1459215246.log
Finished parsing file /Users/chris/tutorials/strata-sj-2016/recommendation-systems/ml-1m/ratings.dat
Parsing completed. Parsed 100 lines in 0.598555 secs.
This commercial license of GraphLab Create is assigned to engr@turi.com.
------------------------------------------------------
Finished parsing file /Users/chris/tutorials/strata-sj-2016/recommendation-systems/ml-1m/ratings.dat
Parsing completed. Parsed 1000209 lines in 0.67184 secs.
Finished parsing file /Users/chris/tutorials/strata-sj-2016/recommendation-systems/ml-1m/movies.dat
Parsing completed. Parsed 100 lines in 0.016913 secs.
Inferred types from first line of file as 
column_type_hints=[int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
------------------------------------------------------
Finished parsing file /Users/chris/tutorials/strata-sj-2016/recommendation-systems/ml-1m/movies.dat
Parsing completed. Parsed 3883 lines in 0.013042 secs.
Inferred types from first line of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
ratings


Out[3]:
user_id item_id score timestamp
1 1193 5 978300760
1 661 3 978302109
1 914 3 978301968
1 3408 4 978300275
1 2355 5 978824291
1 1197 3 978302268
1 1287 5 978302039
1 2804 5 978300719
1 594 4 978302268
1 919 4 978301368
[1000209 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [4]:
items


Out[4]:
item_id title_year genres
1 Toy Story (1995) Animation|Children's|Comedy
2 Jumanji (1995) Adventure|Children's|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama
5 Father of the Bride Part II (1995) Comedy
6 Heat (1995) Action|Crime|Thriller
7 Sabrina (1995) Comedy|Romance
8 Tom and Huck (1995) Adventure|Children's
9 Sudden Death (1995) Action
10 GoldenEye (1995) Action|Adventure|Thriller
[3883 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [5]:
items.show()


Data carpentry

Get year, title, and genres for each item


In [6]:
# Strip the trailing ' (YYYY)' from each title, fix the text encoding,
# pull the four-digit year out of the parentheses, and split the
# pipe-delimited genre string into a list.
items['title'] = items['title_year'].apply(lambda x: x[:-7])
items['title'] = items['title'].apply(lambda x: x.decode('iso8859').encode('utf-8'))
items['year'] = items['title_year'].apply(lambda x: x[-5:-1])
items['genres'] = items['genres'].apply(lambda x: x.split('|'))
del items['title_year']
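
To see what those slices are doing, here is a quick illustration on a sample title string (plain Python, separate from the SFrame pipeline above):

s = 'Toy Story (1995)'
s[:-7]    # 'Toy Story'  -- drops the trailing ' (1995)'
s[-5:-1]  # '1995'       -- pulls the year out of the parentheses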

In [7]:
items


Out[7]:
item_id genres title year
1 [Animation, Children's, Comedy] Toy Story 1995
2 [Adventure, Children's, Fantasy] Jumanji 1995
3 [Comedy, Romance] Grumpier Old Men 1995
4 [Comedy, Drama] Waiting to Exhale 1995
5 [Comedy] Father of the Bride Part II 1995
6 [Action, Crime, Thriller] Heat 1995
7 [Comedy, Romance] Sabrina 1995
8 [Adventure, Children's] Tom and Huck 1995
9 [Action] Sudden Death 1995
10 [Action, Adventure, Thriller] GoldenEye 1995
[3883 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

How many unique users do we have?


In [8]:
ratings['user_id'].unique().size()


Out[8]:
6040
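
The same one-liner works for the other column; for example, counting how many distinct movies were actually rated (the 3706 figure also appears in the factorization training log further below):

ratings['item_id'].unique().size()   # 3706 distinct rated movies, vs. 3883 in movies.dat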

In [9]:
items.show()


Create two datasets for training models


In [10]:
explicit = ratings[['user_id', 'item_id', 'score']]
explicit


Out[10]:
user_id item_id score
1 1193 5
1 661 3
1 914 3
1 3408 4
1 2355 5
1 1197 3
1 1287 5
1 2804 5
1 594 4
1 919 4
[1000209 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [12]:
implicit = explicit[explicit['score'] >= 4.0][['user_id', 'item_id']]
implicit


Out[12]:
user_id item_id
1 1193
1 3408
1 2355
1 1287
1 2804
1 594
1 919
1 595
1 938
1 2398
[? rows x 2 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.
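
If you want the exact row count, forcing materialization is as simple as calling len() on the SFrame, as the message above suggests:

len(implicit)   # forces evaluation; 575281 rows, per the training log below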

Building a model for recommendations


In [15]:
m = gl.recommender.create(implicit, 'user_id', 'item_id')


Recsys training: model = item_similarity
Preparing data set.
    Data has 575281 observations with 6038 users and 3533 items.
    Data prepared in: 0.526861s
Computing item similarity statistics:
Computing most similar items for 3533 items:
+-----------------+-----------------+
| Number of items | Elapsed Time    |
+-----------------+-----------------+
| 1000            | 0.848706        |
| 2000            | 0.939294        |
| 3000            | 1.03531         |
+-----------------+-----------------+
Finished training in 1.26489s

The call above trained an item_similarity model. It computes Jaccard similarities between the items in the dataset, then for each item ranks the top 100 most similar items and stores them so they can be used at prediction time. For more information on how this model works, see the API reference.
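
As a rough illustration of what Jaccard similarity means here (a toy sketch, not GraphLab's internal implementation): think of each item as the set of users who interacted with it, and score a pair of items by the size of the overlap of those sets divided by the size of their union.

def jaccard(users_a, users_b):
    # |A intersect B| / |A union B| for two items' user sets
    union = len(users_a | users_b)
    return float(len(users_a & users_b)) / union if union else 0.0

# Made-up user sets, purely to show the arithmetic
jaccard({1, 2, 3, 4}, {3, 4, 5})   # 2 / 5 = 0.4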

Get a summary of the model


In [16]:
m


Out[16]:
Class                           : ItemSimilarityRecommender

Schema
------
User ID                         : user_id
Item ID                         : item_id
Target                          : None
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 575281
Number of users                 : 6038
Number of items                 : 3533

Training summary
----------------
Training time                   : 1.265

Model Parameters
----------------
Model class                     : ItemSimilarityRecommender
only_top_k                      : 100
threshold                       : 0.001
similarity_type                 : jaccard
training_method                 : auto

Getting similar items


In [43]:
items[items['item_id'] == 1287]


Out[43]:
item_id genres title year
1287 [Action, Adventure, Drama] Ben-Hur 1959
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [44]:
m.get_similar_items([1287], k=5)


Getting similar items completed in 0.002333
Out[44]:
item_id similar score rank
1287 1262 0.240425531915 1
1287 2944 0.240318906606 2
1287 1954 0.237877401647 3
1287 2947 0.235412474849 4
1287 1201 0.226069246436 5
[5 rows x 4 columns]

In [45]:
m.get_similar_items([1287]).join(items, on={'similar': 'item_id'}).sort('rank')


Getting similar items completed in 0.002062
Out[45]:
item_id similar score rank genres title year
1287 1262 0.240425531915 1 [Adventure, War] Great Escape, The 1963
1287 2944 0.240318906606 2 [Action, War] Dirty Dozen, The 1967
1287 1954 0.237877401647 3 [Action, Drama] Rocky 1976
1287 2947 0.235412474849 4 [Action] Goldfinger 1964
1287 1201 0.226069246436 5 [Action, Western] Good, The Bad and The Ugly, The 1966
1287 1204 0.225868725869 6 [Adventure, War] Lawrence of Arabia 1962
1287 1953 0.224103585657 7 [Action, Crime, Drama, Thriller] French Connection, The 1971
1287 1250 0.222123893805 8 [Drama, War] Bridge on the River Kwai, The 1957
1287 969 0.217721518987 9 [Action, Adventure, Romance, War] African Queen, The 1951
1287 2949 0.215827338129 10 [Action] Dr. No 1962
[10 rows x 7 columns]

Build a model for predicting scores


In [46]:
m2 = gl.recommender.create(explicit, 'user_id', 'item_id', target='score')


Recsys training: model = ranking_factorization_recommender
Preparing data set.
    Data has 1000209 observations with 6040 users and 3706 items.
    Data prepared in: 0.96017s
Training ranking_factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter                      | Description                                      | Value    |
+--------------------------------+--------------------------------------------------+----------+
| num_factors                    | Factor Dimension                                 | 32       |
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
| solver                         | Solver used for training                         | sgd      |
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
| ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
| max_iterations                 | Maximum Number of Iterations                     | 25       |
+--------------------------------+--------------------------------------------------+----------+
  Optimizing model using SGD; tuning step size.
  Using 125026 / 1000209 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value                |
+---------+-------------------+------------------------------------------+
| 0       | 25                | Not Viable                               |
| 1       | 6.25              | Not Viable                               |
| 2       | 1.5625            | Not Viable                               |
| 3       | 0.390625          | Not Viable                               |
| 4       | 0.0976562         | 1.75064                                  |
| 5       | 0.0488281         | 1.83661                                  |
| 6       | 0.0244141         | 1.83512                                  |
| 7       | 0.012207          | 1.85294                                  |
+---------+-------------------+------------------------------------------+
| Final   | 0.0976562         | 1.75064                                  |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 105us        | 2.44674           | 1.1171                |             |
+---------+--------------+-------------------+-----------------------+-------------+
| 1       | 1.31s        | DIVERGED          | DIVERGED              | 0.0976562   |
| RESET   | 1.78s        | 2.44674           | 1.1171                |             |
| 1       | 2.67s        | 1.71749           | 1.04023               | 0.0488281   |
| 2       | 3.54s        | 1.53163           | 0.991377              | 0.0290334   |
| 3       | 4.41s        | 1.42213           | 0.947482              | 0.0214205   |
| 4       | 5.28s        | 1.34409           | 0.920136              | 0.0172633   |
| 5       | 6.14s        | 1.28649           | 0.896133              | 0.014603    |
| 6       | 6.98s        | 1.24887           | 0.881142              | 0.0127367   |
| 9       | 9.62s        | 1.18429           | 0.854512              | 0.00939698  |
| 11      | 11.44s       | 1.16188           | 0.844663              | 0.00808399  |
| 14      | 14.21s       | 1.13836           | 0.834458              | 0.00674643  |
| 19      | 18.68s       | 1.11396           | 0.823934              | 0.00536543  |
| 24      | 23.10s       | 1.0989            | 0.817241              | 0.0045031   |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
       Final objective value: 1.09867
       Final training RMSE: 0.788257
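
At a high level, a factorization recommender predicts a score as a global mean plus user and item biases plus a dot product of learned latent factors. A toy sketch of that prediction rule (the variable names below are illustrative, not the model's internal API):

import numpy as np

def predict_score(mu, user_bias, item_bias, user_factors, item_factors):
    # global mean + user bias + item bias + <user factors, item factors>
    return mu + user_bias + item_bias + np.dot(user_factors, item_factors)

# Made-up numbers, just to show the shape of the computation
predict_score(3.6, 0.2, -0.1, np.array([0.5, -0.3]), np.array([0.4, 0.1]))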

Making batch recommendations


In [47]:
recs = m.recommend()


recommendations finished on 1000/6038 queries. users per second: 1132.35
recommendations finished on 2000/6038 queries. users per second: 1043.33
recommendations finished on 3000/6038 queries. users per second: 1055.7
recommendations finished on 4000/6038 queries. users per second: 1050.06
recommendations finished on 5000/6038 queries. users per second: 1013.33
recommendations finished on 6000/6038 queries. users per second: 1033.92

In [48]:
recs


Out[48]:
user_id item_id score rank
1 1198 0.154503379509 1
1 1196 0.153149444095 2
1 318 0.152713126805 3
1 1307 0.144081292055 4
1 593 0.136649332825 5
1 1197 0.134134392698 6
1 1265 0.133883400821 7
1 296 0.13312830713 8
1 1291 0.132043287356 9
1 457 0.131461835568 10
[60380 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [49]:
ratings[ratings['user_id'] == 4].join(items, on='item_id')


Out[49]:
user_id item_id score timestamp genres title year
4 260 5 978294199 [Action, Adventure, Fantasy, Sci-Fi] Star Wars: Episode IV - A New Hope 1977
4 480 4 978294008 [Action, Adventure, Sci-Fi] Jurassic Park 1993
4 1036 4 978294282 [Action, Thriller] Die Hard 1988
4 1097 4 978293964 [Children's, Drama, Fantasy, Sci-Fi] E.T. the Extra-Terrestrial 1982
4 1196 2 978294199 [Action, Adventure, Drama, Sci-Fi, War] Star Wars: Episode V - The Empire Strikes Back 1980
4 1198 5 978294199 [Action, Adventure] Raiders of the Lost Ark 1981
4 1201 5 978294230 [Action, Western] Good, The Bad and The Ugly, The 1966
4 1210 3 978293924 [Action, Adventure, Romance, Sci-Fi, War] Star Wars: Episode VI - Return of the Jedi 1983
4 1214 4 978294260 [Action, Horror, Sci-Fi, Thriller] Alien 1979
4 1240 5 978294260 [Action, Sci-Fi, Thriller] Terminator, The 1984
[21 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [50]:
m.recommend(users=[4], k=20).join(items, on='item_id').sort('rank')


Out[50]:
user_id item_id score rank genres title year
4 1196 0.272550089987 1 [Action, Adventure, Drama, Sci-Fi, War] Star Wars: Episode V - The Empire Strikes Back 1980
4 1200 0.262437272494 2 [Action, Sci-Fi, Thriller, War] Aliens 1986
4 1291 0.251093999068 3 [Action, Adventure] Indiana Jones and the Last Crusade 1989
4 589 0.247480019634 4 [Action, Sci-Fi, Thriller] Terminator 2: Judgment Day 1991
4 2571 0.245820147055 5 [Action, Sci-Fi, Thriller] Matrix, The 1999
4 858 0.243937367269 6 [Action, Crime, Drama] Godfather, The 1972
4 457 0.235640202322 7 [Action, Thriller] Fugitive, The 1993
4 1221 0.233340571898 8 [Action, Crime, Drama] Godfather: Part II, The 1974
4 1610 0.22213252277 9 [Action, Thriller] Hunt for Red October, The 1990
4 1210 0.220900300697 10 [Action, Adventure, Romance, Sci-Fi, War] Star Wars: Episode VI - Return of the Jedi 1983
[20 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [51]:
m.recommend?
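
A few of the arguments that tend to be useful in practice (a sketch; the help output above is the authoritative reference):

# Top-5 recommendations for a couple of users, excluding items they have
# already interacted with (exclude_known=True is the default)
m.recommend(users=[4, 10], k=5, exclude_known=True)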

Recommendations for new users


In [53]:
recent_data = gl.SFrame()
recent_data['item_id'] = [1291]   # Indiana Jones and the Last Crusade
recent_data['user_id'] = 99999

In [54]:
m.recommend(users=[99999], new_observation_data=recent_data).join(items, on='item_id').sort('rank')


Out[54]:
user_id item_id score rank genres title year
99999 1198 0.475933609959 1 [Action, Adventure] Raiders of the Lost Ark 1981
99999 1036 0.415724286484 2 [Action, Thriller] Die Hard 1988
99999 1210 0.390739236393 3 [Action, Adventure, Romance, Sci-Fi, War] Star Wars: Episode VI - Return of the Jedi 1983
99999 1196 0.390430971512 4 [Action, Adventure, Drama, Sci-Fi, War] Star Wars: Episode V - The Empire Strikes Back 1980
99999 1240 0.368227731864 5 [Action, Sci-Fi, Thriller] Terminator, The 1984
99999 260 0.362182829336 6 [Action, Adventure, Fantasy, Sci-Fi] Star Wars: Episode IV - A New Hope 1977
99999 592 0.356594110115 7 [Action, Adventure, Crime, Drama] Batman 1989
99999 2115 0.352819807428 8 [Action, Adventure] Indiana Jones and the Temple of Doom 1984
99999 2000 0.345524017467 9 [Action, Comedy, Crime, Drama] Lethal Weapon 1987
99999 1197 0.34544695071 10 [Action, Adventure, Comedy, Romance] Princess Bride, The 1987
[10 rows x 7 columns]
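
The same pattern extends to a brand-new user with several observed items; for example, seeding the session with two items that already appear above (a sketch):

recent_data = gl.SFrame({'user_id': [99999, 99999],
                         'item_id': [1291, 1198]})   # Last Crusade, Raiders of the Lost Ark
m.recommend(users=[99999], new_observation_data=recent_data)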

Saving and loading models and data


In [55]:
m.save('my_model')

In [56]:
m_again = gl.load_model('my_model')

In [57]:
m_again


Out[57]:
Class                           : ItemSimilarityRecommender

Schema
------
User ID                         : user_id
Item ID                         : item_id
Target                          : None
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 575281
Number of users                 : 6038
Number of items                 : 3533

Training summary
----------------
Training time                   : 1.1486

Model Parameters
----------------
Model class                     : ItemSimilarityRecommender
only_top_k                      : 100
threshold                       : 0.001
similarity_type                 : jaccard
training_method                 : auto

In [58]:
items.save('items')
ratings.save('ratings')
explicit.save('explicit')
implicit.save('implicit')
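
In a later session, the saved SFrames can be loaded back much like the model was; something along these lines (a sketch using the paths saved above):

items_again = gl.load_sframe('items')
ratings_again = gl.load_sframe('ratings')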



In [ ]: