In [1]:
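import graphlab as gl

# Load the tab-separated (movieid, userid, rating) triples; the file has no header row.
ratings = gl.SFrame.read_csv("../../data/netflix/netflix_mm.train", delimiter="\t", header=False,
                             column_type_hints=[int, int, int])
# Replace the default column names X1..X3 with descriptive ones.
ratings.rename({'X1':'movieid', 'X2':'userid', 'X3':'rating'});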


[INFO] Start server at: ipc:///tmp/graphlab_server-27360 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1431408759.log
[INFO] GraphLab Server Version: 1.3.0
PROGRESS: Read 4082259 lines. Lines per second: 4.81022e+06
PROGRESS: Read 56182627 lines. Lines per second: 9.46167e+06
PROGRESS: Finished parsing file /code/BIDMach/data/netflix/netflix_mm.train
PROGRESS: Parsing completed. Parsed 99072124 lines in 10.1339 secs.

In [2]:
ratings.head()


Out[2]:
movieid userid rating
1 1 3
8 1 4
17 1 2
30 1 3
44 1 3
58 1 5
76 1 3
80 1 3
81 1 3
83 1 3
[10 rows x 3 columns]

In [3]:
ratings['userid'].max()


Out[3]:
480189

In [4]:
training, testing = ratings.random_split(0.9)
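
A note on the split: `random_split(0.9)` is only approximately 90/10. The training log below reports 89161285 of the 99072124 parsed rows, about 89.996%. Without a seed the split also changes from run to run, and `SFrame.random_split` accepts a `seed` argument if a repeatable split is wanted. The following is a minimal plain-Python sketch of one simple way to get such an approximate split, assuming each row is assigned independently. `bernoulli_split` is a hypothetical helper written for illustration, not a GraphLab function, and it is not claimed to be how GraphLab implements the split.

```python
import random

def bernoulli_split(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Toy (movieid, userid, rating) triples standing in for the real ratings.
rows = [(movie, user, 3) for movie in range(100) for user in range(100)]
train_rows, test_rows = bernoulli_split(rows, 0.9, seed=42)
```

The two parts always sum to the original row count, but the size of each part varies slightly around the requested fraction.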

In [5]:
#training, testing = gl.recommender.random_split_by_user(ratings, 'userid', 'movieid', max_num_users=1000, item_test_proportion=0.3)
#print training.num_rows(), testing.num_rows()

In [6]:
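# sgd_step_size=0.0 asks the solver to tune the initial step size itself (see the tuning table in the log).
m = gl.recommender.factorization_recommender.create(training, 'userid', 'movieid', 'rating',
                                                    num_factors=512, max_iterations=15, solver='sgd',
                                                    regularization=1.0e-6, sgd_step_size=0.0)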


PROGRESS: Recsys training: model = factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 89161285 observations with 479965 users and 17770 items.
PROGRESS:     Data prepared in: 35.6968s
PROGRESS: Training factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 512      |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-06    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 15       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 11145160 / 89161285 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 0.826398          | Not Viable                               |
PROGRESS: | 1       | 0.206599          | Not Viable                               |
PROGRESS: | 2       | 0.0516499         | Not Viable                               |
PROGRESS: | 3       | 0.0129125         | 0.828073                                 |
PROGRESS: | 4       | 0.00645623        | 0.848489                                 |
PROGRESS: | 5       | 0.00322812        | 0.875842                                 |
PROGRESS: | 6       | 0.00161406        | 0.911026                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.0129125         | 0.828073                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | Initial | 107us        | 1.1764            | 1.08462               |             |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: | 1       | 42.04s       | DIVERGED          | DIVERGED              | 0.0129125   |
PROGRESS: | RESET   | 51.97s       | 1.17639           | 1.08462               |             |
PROGRESS: | 1       | 1m 26s       | 0.868174          | 0.900875              | 0.00645623  |
PROGRESS: | 2       | 2m 1s        | 0.791215          | 0.845199              | 0.0038389   |
PROGRESS: | 3       | 2m 35s       | 0.759365          | 0.8179                | 0.00283229  |
PROGRESS: | 4       | 3m 10s       | 0.744856          | 0.804225              | 0.00228262  |
PROGRESS: | 5       | 3m 45s       | 0.735868          | 0.795077              | 0.00193086  |
PROGRESS: | 6       | 4m 19s       | 0.728657          | 0.787512              | 0.00166474  |
PROGRESS: | 7       | 5m 22s       | 0.722316          | 0.780765              | 0.00144958  |
PROGRESS: | 8       | 6m 27s       | 0.716643          | 0.774656              | 0.00128367  |
PROGRESS: | 9       | 7m 30s       | 0.711653          | 0.769031              | 0.00115184  |
PROGRESS: | 10      | 8m 5s        | 0.707019          | 0.763785              | 0.00104456  |
PROGRESS: | 11      | 8m 39s       | 0.702889          | 0.758958              | 0.000955564 |
PROGRESS: | 12      | 9m 13s       | 0.699133          | 0.754473              | 0.000880543 |
PROGRESS: | 13      | 9m 47s       | 0.6957            | 0.750289              | 0.000816443 |
PROGRESS: | 14      | 10m 21s      | 0.692552          | 0.746361              | 0.000761043 |
PROGRESS: | 15      | 10m 55s      | 0.689431          | 0.742615              | 0.000712684 |
PROGRESS: +---------+--------------+-------------------+-----------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training RMSE.
PROGRESS:        Final objective value: 0.683794
PROGRESS:        Final training RMSE: 0.73881
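
A note on the "Step Size" column: after the reset, the first five logged values are consistent with a polynomial decay of the form s_t = s_1 / t^0.75, where 0.75 matches the `step_size_decrease_rate` shown in the model settings. From iteration 6 onward the logged values are somewhat smaller than this formula gives, so there is evidently a further adjustment. The sketch below only checks the logged numbers against that formula. It is an observation about this log, not a description of GraphLab's actual schedule, and `polynomial_decay` is a hypothetical helper.

```python
# Step sizes copied from the log above (iterations 1-6 after the reset).
logged_steps = [0.00645623, 0.0038389, 0.00283229, 0.00228262, 0.00193086, 0.00166474]

def polynomial_decay(initial_step, iteration, rate=0.75):
    """Assumed schedule: initial_step / iteration**rate."""
    return initial_step / float(iteration) ** rate

predicted = [polynomial_decay(logged_steps[0], t) for t in range(1, 7)]
rel_errors = [abs(p - s) / s for p, s in zip(predicted, logged_steps)]

# Largest relative mismatch over iterations 1-5, and the mismatch at iteration 6.
max_rel_error_first_five = max(rel_errors[:5])
rel_error_sixth = rel_errors[5]
```

For iterations 1 to 5 the mismatch is at the level of the rounding in the log, while at iteration 6 it is roughly one percent.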

In [7]:
# Look at model statistics
m


Out[7]:
Class                           : FactorizationRecommender

Schema
------
User ID                         : userid
Item ID                         : movieid
Target                          : rating
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 89161285
Number of users                 : 479965
Number of items                 : 17770

Training summary
----------------
Training time                   : 849.7042

Settings
--------
additional_iterations_if_unhealthy: 5
sgd_step_adjustment_interval    : 4
num_factors                     : 512
init_random_sigma               : 0.01
max_iterations                  : 15
regularization_type             : normal
side_data_factorization         : 1
regularization                  : 0.0
num_tempering_iterations        : 4
sgd_step_size                   : 0.0
sgd_trial_sample_proportion     : 0.125
sgd_sampling_block_size         : 131072
binary_target                   : 0
nmf                             : 0
track_exact_loss                : 0
sgd_trial_sample_minimum_size   : 10000
sgd_convergence_interval        : 4
solver                          : sgd
tempering_regularization_start_value: 0.0
sgd_convergence_threshold       : 0.0
sgd_max_trial_iterations        : 5
step_size_decrease_rate         : 0.75
linear_regularization           : 0.0

In [8]:
# Evaluate RMSE (root-mean-square prediction error) and precision/recall on the held-out test set
%time m.evaluate(testing, verbose=False)


CPU times: user 636 ms, sys: 167 ms, total: 803 ms
Wall time: 3min 50s
Out[8]:
{'precision_recall_by_user': Columns:
	userid	int
	cutoff	int
	precision	float
	recall	float
	count	int

Rows: 1359441

Data:
+--------+--------+-----------+-----------------+-------+
| userid | cutoff | precision |      recall     | count |
+--------+--------+-----------+-----------------+-------+
|   1    |   5    |    1.0    | 0.0223214285714 |  224  |
|   1    |   10   |    0.7    |     0.03125     |  224  |
|   1    |   15   |    0.8    | 0.0535714285714 |  224  |
|   2    |   5    |    0.0    |       0.0       |   9   |
|   2    |   10   |    0.0    |       0.0       |   9   |
|   2    |   15   |    0.0    |       0.0       |   9   |
|   3    |   5    |    0.0    |       0.0       |   36  |
|   3    |   10   |    0.0    |       0.0       |   36  |
|   3    |   15   |    0.0    |       0.0       |   36  |
|   4    |   5    |    0.0    |       0.0       |   9   |
|  ...   |  ...   |    ...    |       ...       |  ...  |
+--------+--------+-----------+-----------------+-------+
[1359441 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'precision_recall_overall': Columns:
	cutoff	int
	precision	float
	recall	float

Rows: 3

Data:
+--------+-----------------+-----------------+
| cutoff |    precision    |      recall     |
+--------+-----------------+-----------------+
|   5    | 0.0789218509667 |  0.018458748315 |
|   10   | 0.0601906224691 | 0.0279574359437 |
|   15   |  0.051090705665 |  0.035643976875 |
+--------+-----------------+-----------------+
[3 rows x 3 columns]
,
 'rmse_by_item': Columns:
	movieid	int
	count	int
	rmse	float

Rows: 17762

Data:
+---------+-------+----------------+
| movieid | count |      rmse      |
+---------+-------+----------------+
|   7899  |  1411 | 0.776659090298 |
|   5288  |   13  | 1.06684819974  |
|   5684  |   19  | 1.23791877002  |
|  14515  |   97  | 1.04050973948  |
|  10273  |  132  | 0.705268251495 |
|  11106  |   23  | 0.696915869417 |
|   5531  |  228  | 0.825257800539 |
|   5921  |   18  | 1.28093222584  |
|  14406  |   27  | 0.946851217404 |
|   2871  |   42  | 0.954883211501 |
|   ...   |  ...  |      ...       |
+---------+-------+----------------+
[17762 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
	userid	int
	count	int
	rmse	float

Rows: 453147

Data:
+--------+-------+----------------+
| userid | count |      rmse      |
+--------+-------+----------------+
| 211023 |   5   | 0.440468462295 |
| 442699 |   13  | 0.319222179826 |
| 79732  |   6   | 0.672370758241 |
| 333842 |   63  | 0.735925846462 |
|  7899  |   34  | 0.746596218723 |
| 25263  |   1   | 0.835913385967 |
| 453182 |   15  | 0.459506083838 |
| 87629  |   29  | 0.490420055215 |
| 459067 |   38  | 0.58144293467  |
| 43116  |   31  | 0.778477874692 |
|  ...   |  ...  |      ...       |
+--------+-------+----------------+
[453147 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_overall': 0.8234641905616967}
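
A note on reading these metrics: the test RMSE of about 0.823 is noticeably higher than the final training RMSE of 0.73881, as expected for held-out data. The per-user rows are consistent with `count` being the number of held-out items for that user. For user 1, a precision of 1.0 at cutoff 5 with 224 items gives a recall of 5/224 ≈ 0.0223. The sketch below shows the standard definitions in plain Python. `rmse` and `precision_recall_at_k` are hypothetical helpers for illustration, not GraphLab functions, and the item IDs are made up.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / float(len(squared)))

def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall of the top-k recommendations against a set of relevant items."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / float(k), hits / float(len(relevant_items))

# Toy case mirroring user 1 at cutoff 5: 224 held-out items, all 5 recommendations are hits.
relevant = set(range(224))
ranked = [0, 1, 2, 3, 4, 1000, 1001]
p5, r5 = precision_recall_at_k(ranked, relevant, 5)
```

Because recall is divided by the number of held-out items, users with many test ratings get low recall even when every recommendation is a hit.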

In [8]: