In [1]:
import numpy as np
import tensorflow as tf

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import json
import pickle
import pandas as pd
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib

%pylab inline

Populating the interactive namespace from numpy and matplotlib

Get data

In [6]:
! mkdir ./tmp

In [7]:
! wget -O ./

In [8]:
! unzip -o ./

In [9]:
! cat ./ml-100k/README


MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
This data set consists of:
	* 100,000 ratings (1-5) from 943 users on 1682 movies. 
	* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set
       (see below for citation information).

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from a faculty member of the GroupLens Research Project at the
       University of Minnesota.

If you have any further questions or comments, please contact GroupLens


To acknowledge use of the dataset in publications, please cite the 
following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.


Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.


Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.


The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its world wide trial of an automated collaborative filtering
system for Usenet news in 1996.  The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:

GroupLens Research currently operates a movie recommender based on
collaborative filtering:


Here are brief descriptions of the data.

ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                gunzip ml-data.tar.gz
                tar xvf ml-data.tar
           -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC

           -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the data set.

u.occupation -- A list of the occupations.

u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test       are 80%/20% splits of the u data into training and test data.
u2.base       Each of u1, ..., u5 have disjoint test sets; this if for
u2.test       5 fold cross validation (where you repeat your experiment
u3.base       with each training and test set and average the results).
u3.test       These data sets can be generated from by

ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test       split the u data into a training set and a test set with
ub.base       exactly 10 ratings per user in the test set.  The sets
ub.test       ua.test and ub.test are disjoint.  These data sets can
              be generated from by

           -- The script that generates training and test sets where
              all but n of a users ratings are in the training data.

           -- A shell script to generate all the u data sets from

In [2]:
#remove broken symbols
! iconv -f utf-8 -t utf-8 -c ml-100k/u.item >  ml-100k/u.item2

user part

In [3]:
! head -3 ./ml-100k/u.user


In [4]:
df_user = pd.read_csv('./ml-100k/u.user', sep='|', names='user id | age | gender | occupation | zip code'.split(' | '))
# keep only the first character of the zip code as a coarse "living area"
df_user['living_area'] = df_user['zip code'].map(lambda x: x[0])
del df_user['zip code']
df_user.head()

   user id  age gender  occupation living_area
0        1   24      M  technician           8
1        2   53      F       other           9
2        3   23      M      writer           3
3        4   24      M  technician           4
4        5   33      F       other           1

In [5]:
# round each age to the nearest ten (ties go to the even multiple: 25 -> 20, 35 -> 40)
res = []
for age in df_user['age'].values:
    res.append(int(round(int(age), -1)))
df_user['age'] = res
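
Python's built-in `round` with `ndigits=-1` rounds an integer to the nearest ten, and exact halves go to the even multiple of ten rather than always up. A small illustration with made-up ages (the list below is illustrative, not drawn from the data set):

```python
# Illustrative ages only; round(x, -1) rounds an int to the nearest ten,
# with ties (values ending in 5) going to the even multiple of ten.
ages = [23, 24, 25, 33, 35, 53]
decades = [int(round(age, -1)) for age in ages]
print(decades)  # [20, 20, 20, 30, 40, 50]
```

If rounding ties upward is preferred, integer arithmetic such as `(age + 5) // 10 * 10` would be one alternative.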

In [6]:
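for f in ['age', 'gender', 'occupation', 'living_area']:
    # the loop body is missing in the source; printing the number of
    # distinct values per feature is an assumption
    print(f, df_user[f].nunique())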


In [7]:
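features_list = ['age', 'gender', 'occupation', 'living_area']
s_users = []
le = LabelEncoder()

users_mat = []
for feature in features_list:
    # encode each categorical column as integers 0..n_classes-1
    col = le.fit_transform(df_user[feature].values)
    users_mat.append(col)
    # remember how many distinct values the feature takes
    s_users.append(len(le.classes_))
users_mat = np.array(users_mat).T
users_mat.shape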

(943, 4)
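
`LabelEncoder.fit_transform` maps each distinct value to an integer code, assigned in sorted order of the classes. The sketch below mimics that idea in plain Python on toy occupation values; `label_encode` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted class order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

# toy input, not the full u.user column
codes, classes = label_encode(['technician', 'other', 'writer', 'technician', 'other'])
print(codes)         # [1, 0, 2, 1, 0]
print(len(classes))  # 3, the "size" that would be recorded for this feature
```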

In [8]:
users = {}
for i, id in enumerate(df_user['user id'].values):
    users[id] = users_mat[i]

item part

In [9]:
! head -3 ./ml-100k/u.item2

1|Toy Story (1995)|01-Jan-1995|||0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995|||0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995|||0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

In [10]:
df_item = pd.read_csv('./ml-100k/u.item2', sep='|',
                      names=(['id', 'title', 'release_date', 'video_release_date', 'url'] + 
                             ['g{}'.format(i) for i in range(19)]))
df_item['year'] = df_item['release_date'].map(lambda x: str(x).split('-')[-1])

In [11]:
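res = []
for year in list(map(str, df_item['year'].values)):
    if year == 'nan':
        # the body of this branch is missing in the source; using 0 as a
        # placeholder decade for a missing release date is an assumption
        res.append(0)
        continue
    res.append(int(round(int(year), -1)))
df_item['decade'] = res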

In [12]:
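for f in ['decade']:
    # the loop body is missing in the source; printing the number of
    # distinct values per feature is an assumption
    print(f, df_item[f].nunique())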


In [19]:
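features_list = ['decade'] + ['g{}'.format(i) for i in range(19)]
s_item = []

items_mat = []
for feature in features_list:
    # reuses the LabelEncoder instance `le` created in the user part
    col = le.fit_transform(df_item[feature].values)
    items_mat.append(col)
    s_item.append(len(le.classes_))
items_mat = np.array(items_mat).T
items_mat.shape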

(1682, 20)

In [20]:
items = {}
for i, id in enumerate(df_item['id'].values):
    items[id] = items_mat[i]

ratings part

In [21]:
! head -3 ./ml-100k/

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116

In [23]:
# note: the data file name after './ml-100k/' is truncated in the source
df_data = pd.read_csv('./ml-100k/', sep='\t',
                      names='user id | item id | rating | timestamp'.split(' | '))

In [24]:
df_data['target'] = df_data['rating'] > 4.5
# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same array
data = df_data[['user id', 'item id']].values
target = df_data['target'].values
print('Mean target: {}'.format(np.mean(target==True)))

Mean target: 0.21201
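
Because ratings are integers from 1 to 5, the `> 4.5` threshold marks only 5-star ratings as positive. A tiny sketch on toy ratings (values chosen for illustration):

```python
ratings = [3, 3, 1, 5, 4]  # toy ratings on the 1-5 scale
target = [r > 4.5 for r in ratings]
# the mean of a boolean target is the share of positive samples
mean_target = sum(target) / len(target)
print(target)       # [False, False, False, True, False]
print(mean_target)  # 0.2
```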

In [25]:
# split to pos/neg samples
positive_idx = np.where(target==True)[0]
negative_idx = np.where(target!=True)[0]

In [28]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
pos_idx_tr, pos_idx_te = train_test_split(positive_idx, random_state=42, test_size=0.5)
neg_idx_tr, neg_idx_te = train_test_split(negative_idx, random_state=42, train_size=len(pos_idx_tr))
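
The split above puts half of the positives in the training set and draws the same number of negatives, so the training set is balanced while the test set keeps all remaining negatives. The sketch below reproduces that idea with the standard library only; `balanced_split` is a hypothetical helper, and its shuffling will not match the exact indices scikit-learn selects.

```python
import random

def balanced_split(pos_idx, neg_idx, seed=42):
    """Put half of the positives and an equal number of negatives in train."""
    rng = random.Random(seed)
    pos = list(pos_idx)
    neg = list(neg_idx)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_train = len(pos) // 2
    return (pos[:n_train], pos[n_train:]), (neg[:n_train], neg[n_train:])

# toy indices: 10 positives, 40 negatives
(pos_tr, pos_te), (neg_tr, neg_te) = balanced_split(range(0, 10), range(10, 50))
print(len(pos_tr), len(neg_tr), len(pos_te), len(neg_te))  # 5 5 5 35
```

A consequence of this design is that metrics on the test set are computed on a class ratio different from the one seen in training.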

In [30]:
def build_matrix(pos_idx, neg_idx):
    rows_user = []
    rows_item = []
    rows_pair = []
    for idx in list(pos_idx) + list(neg_idx):
        u, i = data[idx]
        # values should be 1-based 
        rows_user.append(users[u] + 1)
        rows_item.append(items[i] + 1)
        # u and i already 1-based
        rows_pair.append([u, i])
    # column order: user features, (user id, item id), item features
    X = np.hstack([np.array(rows) for rows in (rows_user, rows_pair, rows_item)])
    Y = np.zeros(len(pos_idx) + len(neg_idx))
    Y[:len(pos_idx)] = 1
    perm = np.random.permutation(X.shape[0])
    return X[perm], Y[perm]
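
Each row built by `build_matrix` is the concatenation of the user's encoded features, the raw `(user id, item id)` pair, and the item's encoded features, with the encoded features shifted to be 1-based. A sketch of one such row, using toy lookup tables with hypothetical values and plain lists instead of NumPy arrays:

```python
# Toy lookup tables (hypothetical values): 0-based encoded features per id.
users = {7: [2, 0]}      # user 7 -> two encoded user features
items = {42: [5, 1, 0]}  # item 42 -> three encoded item features

u, i = 7, 42
# shift encoded features to 1-based; the ids u and i are already 1-based
row = [v + 1 for v in users[u]] + [u, i] + [v + 1 for v in items[i]]
print(row)  # [3, 1, 7, 42, 6, 2, 1]
```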

In [31]:
n_users = 943
n_items = 1682

X_tr, Y_tr = build_matrix(pos_idx_tr, neg_idx_tr)
X_te, Y_te = build_matrix(pos_idx_te, neg_idx_te)

# sizes of categorical features
s_features = s_users + [n_users, n_items] + s_item

In [35]:
print('X_tr shape: ', X_tr.shape)
print('X_te shape: ', X_te.shape)
print('Num of features: ', len(s_features))
print('Size of feature space: ', np.prod(s_features))
print('Sizes of features: ', s_features)

X_tr shape:  (21200, 26)
X_te shape:  (78800, 26)
Num of features:  26
Size of feature space:  2914558821173035008
Sizes of features:  [61, 2, 21, 19, 943, 1682, 72, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
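
The size of the feature space is the product of the per-feature sizes, that is, the number of distinct combinations of categorical values. The sketch below recomputes it from the sizes printed above using only the standard library; the sum is added for comparison as the number of columns a separate one-hot encoding of each feature would need, which is an illustration and not something the notebook computes.

```python
import math

# Sizes as printed above: 4 user features, user id, item id, decade, 19 genre flags.
s_features = [61, 2, 21, 19, 943, 1682, 72] + [2] * 19

print(len(s_features))        # 26
print(math.prod(s_features))  # 2914558821173035008
print(sum(s_features))        # 2838
```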

In [36]:
# dump to disk
joblib.dump((X_tr, Y_tr, s_features), './tmp/train_categotical.jl')
joblib.dump((X_te, Y_te, s_features), './tmp/test_categorical.jl')