In this notebook we demonstrate how to build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback. We use the Recommender API in Analytics Zoo to build the model and the BigDL optimizer to train it.
A system of this kind (see Recommendation systems: Principles, methods and evaluation) normally prompts the user through the system interface to provide ratings for items in order to construct and improve the model. The accuracy of the recommendations depends on the quantity of ratings provided by the user.
NCF (He, 2015) leverages a multi-layer perceptron to learn the user-item interaction function; at the same time, NCF can express and generalize matrix factorization under its framework. The includeMF (Boolean) option lets users build an NCF model with or without the matrix factorization branch. A small illustrative sketch of these two branches follows below.
Data:
References:
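To make the architecture concrete, here is a minimal, illustrative sketch in plain numpy (not the Analytics Zoo implementation) of the two branches NCF combines: an MLP over the concatenated user/item embeddings, and an optional matrix-factorization branch that multiplies the embeddings element-wise. All sizes and weights below are made-up illustration values.
import numpy as np

# toy sizes for illustration only (not the values used later in this notebook)
n_users, n_items, embed_dim, hidden_dim = 100, 200, 8, 16
rng = np.random.RandomState(0)

user_embed = rng.randn(n_users, embed_dim)   # user embedding table (untrained)
item_embed = rng.randn(n_items, embed_dim)   # item embedding table (untrained)
W1 = rng.randn(hidden_dim, 2 * embed_dim)    # first MLP layer weights (untrained)

def ncf_features(user_id, item_id, include_mf=True):
    u, v = user_embed[user_id], item_embed[item_id]
    # MLP branch: learns the user-item interaction from the concatenated embeddings
    mlp = np.maximum(0, W1.dot(np.concatenate([u, v])))   # one ReLU hidden layer
    # MF branch: generalized matrix factorization, an element-wise product
    mf = u * v if include_mf else np.empty(0)
    # the two branches are concatenated and fed to the final prediction layer
    return np.concatenate([mlp, mf])

print(ncf_features(3, 7).shape)   # (hidden_dim + embed_dim,) = (24,)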
Import the necessary libraries.
In [1]:
from zoo.pipeline.api.keras.layers import *
from zoo.models.recommendation import UserItemFeature
from zoo.models.recommendation import NeuralCF
from zoo.common.nncontext import init_nncontext
import matplotlib
from sklearn import metrics
from operator import itemgetter
from bigdl.dataset import movielens
from bigdl.util.common import *
matplotlib.use('agg')
import matplotlib.pyplot as plt
%pylab inline
Initialize the NN context; this creates a SparkContext with an optimized configuration for BigDL performance.
In [2]:
sc = init_nncontext("NCF Example")
Download and read the MovieLens 1M data.
In [3]:
movielens_data = movielens.get_id_ratings("/tmp/movielens/")
Understand the data. Each record is in the format (user_id, movie_id, rating_score). User IDs range from 1 to 6040, and movie IDs range from 1 to 3952. Ratings are made on a 5-star scale (whole-star ratings only). The counts of users and movies are recorded for later use.
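As a quick sanity check, the first few rows can be printed to see the (user_id, movie_id, rating) layout; the exact values depend on the downloaded data.
print(movielens_data[:3])   # e.g. rows like [user_id, movie_id, rating]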
In [4]:
min_user_id = np.min(movielens_data[:,0])
max_user_id = np.max(movielens_data[:,0])
min_movie_id = np.min(movielens_data[:,1])
max_movie_id = np.max(movielens_data[:,1])
rating_labels = np.unique(movielens_data[:,2])
print(movielens_data.shape)
print(min_user_id, max_user_id, min_movie_id, max_movie_id, rating_labels)
Transform the original data into an RDD of Samples.
We use the BigDL optimizer directly to train the model, which requires the data to be provided as an RDD of Samples (RDD[Sample]). A Sample is a BigDL data structure that can be constructed from two numpy arrays, the feature and the label, via the API Sample.from_ndarray(feature, label).
Here, the labels are transformed to be zero-based, since the original labels start from 1.
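For example, assuming the first rating in MovieLens 1M is (user 1, movie 1193, rating 5), the corresponding zero-based Sample would be built roughly like this (values shown for illustration only):
example_sample = Sample.from_ndarray(np.array([1, 1193]), np.array([5 - 1]))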
In [5]:
def build_sample(user_id, item_id, rating):
    sample = Sample.from_ndarray(np.array([user_id, item_id]), np.array([rating]))
    return UserItemFeature(user_id, item_id, sample)

pairFeatureRdds = sc.parallelize(movielens_data)\
    .map(lambda x: build_sample(x[0], x[1], x[2] - 1))
pairFeatureRdds.take(3)
Out[5]:
Randomly split the data into train (80%) and validation (20%)
In [6]:
trainPairFeatureRdds, valPairFeatureRdds = pairFeatureRdds.randomSplit([0.8, 0.2], seed=1)
valPairFeatureRdds.cache()
train_rdd = trainPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd = valPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd.persist()
Out[6]:
In [7]:
print(train_rdd.count())
train_rdd.take(3)
Out[7]:
In Analytics Zoo, it is simple to build an NCF model by calling the NeuralCF API. You need to specify the user count, item count and class number according to your data, then add hidden layers as needed; you can also choose to include matrix factorization in the network. The model can be fed into a BigDL Optimizer or an Analytics Zoo NNClassifier. Please refer to the documentation for more details. In this example, we demonstrate how to use the BigDL optimizer.
In [8]:
ncf = NeuralCF(user_count=max_user_id,
               item_count=max_movie_id,
               class_num=5,
               hidden_layers=[20, 10],
               include_mf=False)
Compile the model with a specific optimizer and loss, as well as metrics for evaluation. The optimizer tries to minimize the loss of the neural network with respect to its weights/biases over the training set. To create an Optimizer in BigDL, you need to specify at least the following arguments: model (a neural network model), criterion (the loss function), training_rdd (the training dataset) and batch size. Please refer to the (ProgrammingGuide) and (Optimizer) documentation for more details on creating efficient optimizers.
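As a rough illustration of those four arguments, a raw BigDL Optimizer could be constructed along the lines of the sketch below (assuming the BigDL 0.x Python API). Note that a raw criterion such as ClassNLLCriterion expects a log-probability input and, by default, one-based labels, so it would have to be matched to the model's output; the Keras-style compile/fit used in the next cells handles this for us.
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, Adam, MaxEpoch

# hedged sketch only -- not needed when using ncf.compile/ncf.fit below
sketch_optimizer = Optimizer(model=ncf,                      # the NeuralCF model built above
                             training_rdd=train_rdd,         # RDD[Sample] built earlier
                             criterion=ClassNLLCriterion(),  # loss function (see note above)
                             optim_method=Adam(),
                             end_trigger=MaxEpoch(10),
                             batch_size=8000)
# sketch_optimizer.optimize()   # would launch distributed training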
In [9]:
ncf.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=['accuracy'])
In [10]:
tmp_log_dir = create_tmp_path()
ncf.set_tensorboard(tmp_log_dir, "training_ncf")
In [11]:
ncf.fit(train_rdd,
        nb_epoch=10,
        batch_size=8000,
        validation_data=val_rdd)
In [12]:
results = ncf.predict(val_rdd)
results.take(5)
results_class = ncf.predict_class(val_rdd)
results_class.take(5)
Out[12]:
In Analytics Zoo, Recommender provides 3 unique APIs to predict user-item pairs and to make recommendations for users or items given candidates.
Predict for user-item pairs.
In [13]:
userItemPairPrediction = ncf.predict_user_item_pair(valPairFeatureRdds)
for result in userItemPairPrediction.take(5): print(result)
Recommend 3 items for each user given candidates in the feature RDDs
In [14]:
userRecs = ncf.recommend_for_user(valPairFeatureRdds, 3)
for result in userRecs.take(5): print(result)
Recommend 3 users for each item given candidates in the feature RDDs
In [15]:
itemRecs = ncf.recommend_for_item(valPairFeatureRdds, 3)
for result in itemRecs.take(5): print(result)
Plot the train and validation loss curves
In [16]:
# retrieve the train and validation summaries and read the loss data into ndarrays;
# each event is a tuple of the form (iteration_count, value, timestamp)
train_loss = np.array(ncf.get_train_summary("Loss"))
val_loss = np.array(ncf.get_validation_summary("Loss"))
# plot the train and validation loss curves against the iteration count
plt.figure(figsize=(12, 6))
plt.plot(train_loss[:, 0], train_loss[:, 1], label='train loss')
plt.plot(val_loss[:, 0], val_loss[:, 1], label='val loss', color='green')
plt.scatter(val_loss[:, 0], val_loss[:, 1], color='green')
plt.legend()
plt.xlim(0, train_loss[-1][0] + 10)  # x-axis covers all recorded iterations
plt.grid(True)
plt.title("loss")
Out[16]:
Plot the top-1 accuracy on the validation set.
In [17]:
plt.figure(figsize=(12, 6))
top1 = np.array(ncf.get_validation_summary("Top1Accuracy"))
plt.plot(top1[:, 0], top1[:, 1], label='top1')
plt.title("top1 accuracy")
plt.grid(True)
plt.legend()