Load the fm_parallel_sgd.py module (sc.addPyFile ships it to the Spark workers) and import it.
In [9]:
sc.addPyFile("./fm/fm_parallel_sgd.py")
In [10]:
import fm_parallel_sgd as fm
from fm_parallel_sgd import *
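The cells above assume a running SparkContext named sc, as provided by the pyspark shell or a notebook kernel. If you need to create one yourself, a minimal sketch (the application name here is arbitrary) would be:

from pyspark import SparkContext, SparkConf

# Only create a context if one is not already provided by the environment.
conf = SparkConf().setAppName("fm_parallel_sgd_demo")
sc = SparkContext(conf=conf)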
Set some matplotlib parameters for plotting figures directly in the notebook:
In [11]:
%matplotlib inline
import matplotlib.pylab as pylab
pylab.rcParams['figure.figsize'] = (16.0, 12.0)
The dataset should be an RDD of LabeledPoints. Labels should be -1 or 1, and features should be either a SparseVector or a DenseVector from the pyspark.mllib.linalg library.
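For reference, here is a minimal sketch of building such an RDD by hand (the indices and values below are made up for illustration):

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Two toy points with labels in {-1, 1} and 123-dimensional sparse features.
toyData = sc.parallelize([
    LabeledPoint(1.0,  SparseVector(123, [3, 10, 52], [1.0, 1.0, 1.0])),
    LabeledPoint(-1.0, SparseVector(123, [7, 11, 99], [1.0, 1.0, 1.0])),
])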
The Adult dataset (a9a) is used to predict who has a salary over $50,000, based on various attributes (Platt, 1998). You can download it here: a9a
123 features, about 11% of which are non-zero (sparse)
In [12]:
from pyspark.mllib.util import MLUtils  # needed if not already pulled in by the star import above

nrPartitions = 5
trainPath = "/path/to/a9a_train_dataset/a9a"    # placeholder path to the a9a training file
trainAll = MLUtils.loadLibSVMFile(sc, trainPath, numFeatures=123).repartition(nrPartitions)
testPath = "/path/to/a9a_test_dataset/a9a.t"    # placeholder path to the a9a test file
test = MLUtils.loadLibSVMFile(sc, testPath, numFeatures=123)
print trainAll.count()
print test.count()
print trainAll.first()
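The a9a files already use -1/+1 labels, but if your own LibSVM file uses 0/1 labels you could remap them with an ordinary map (a generic sketch, not part of the library):

from pyspark.mllib.regression import LabeledPoint

def to_pm1(lp):
    # Map a 0/1 label to -1/+1, leaving the features untouched.
    return LabeledPoint(1.0 if lp.label > 0 else -1.0, lp.features)

# e.g. remapped = trainAll.map(to_pm1)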
Train a Factorization Machine model using parallel stochastic gradient descent.
In [13]:
?trainFM_parallel_sgd
In [14]:
import time  # if not already available via the star import above

temp = time.time()
model = trainFM_parallel_sgd(sc, trainAll, iterations=1, iter_sgd=1, alpha=0.01, regParam=0.01, factorLength=4,
                             verbose=False, savingFilename=None, evalTraining=None)
print 'time:', time.time() - temp
Evaluate your model on a test set.
In [15]:
?evaluate
In [16]:
print evaluate(test, model)
With verbose=True, the RDD is split into a training set and a validation set, and the evaluation is printed after each iteration.
When savingFilename is set, the model is saved to a pickle file after each iteration. The files are written to the current directory and named 'savingFilename_iteration_#'.
In [7]:
temp = time.time()
trainFM_parallel_sgd(sc, trainAll, iterations=5, iter_sgd=3, alpha=0.01, regParam=0.01, factorLength=4,
                     verbose=True, savingFilename='a9a', evalTraining=None)
print 'total time:', time.time() - temp
In [8]:
model = loadModel('a9a_iteration_5')
In [9]:
evaluate(test, model)
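Since the previous run saved a model after each of its 5 iterations, you could also load and evaluate every checkpoint to see how the test score evolves (a sketch that assumes the five 'a9a_iteration_#' pickle files exist in the current directory):

for i in range(1, 6):
    # Load the model saved after iteration i and score it on the test set.
    m = loadModel('a9a_iteration_%d' % i)
    print i, evaluate(test, m)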
To plot the evaluation (training + validation) during training, first create an instance of the evaluation class and pass it as evalTraining.
You can set evalTraining.modulo = 5, for example, to evaluate the model every 5 iterations (the default is 1).
In [6]:
temp = time.time()
evalTraining = evaluation(trainAll)
evalTraining.modulo = 1
trainFM_parallel_sgd(sc, trainAll, iterations=10, iter_sgd=1, verbose=True, evalTraining=evalTraining)
print 'total time:', time.time() - temp
Let's plot the results for different hyperparameter values using a sample of the training set.
In [18]:
trainSample = trainAll.sample(False, 0.1)
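Because the plotting helpers below train several models on the same sample, it can help to fix a seed and cache the sample so the runs are repeatable and the data stays in memory (an optional tweak, not required by the library):

# 10% sample without replacement, with a fixed seed, kept in memory.
trainSample = trainAll.sample(False, 0.1, seed=42).cache()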
Specify the list of learning rates (alpha) you want to plot in alpha_list.
In [18]:
model = plotAlpha(sc, trainSample, iterations=10, iter_sgd=1, alpha_list = [0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1])
Specify the list of factor lengths you want to plot in factorLength_list.
In [19]:
model = plotFactorLength(sc, trainSample, factorLength_list=[1, 5, 10, 15, 20, 30, 40],
                         iterations=5, iter_sgd=1, alpha=0.01, regParam=0.)
Specify the list of regularization parameters you want to plot in regParam_list.
In [24]:
model = plotRegParam(sc, trainSample, regParam_list = [0, 0.0001, 0.001, 0.01], iterations=5, iter_sgd=1, alpha=0.01, factorLength=4)
Specify alpha_list and regParam_list to plot a color map of the best parameters. The brighter the cell, the lower the log loss.
In [19]:
bestModel = plotAlpha_RegParam(sc, trainSample, alpha_list=[0.01, 0.03, 0.06, 0.1],
                               regParam_list=[0, 0.0001, 0.001, 0.01],
                               iterations=5, iter_sgd=1)
In [20]:
evaluate(test, bestModel)
To calculate the probabilities according to the model for a test set, call predictFM(data, model). This returns an RDD of probability scores.
In [21]:
prediction = predictFM(test, bestModel)
prediction.take(5)
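prediction holds one probability score per test example. Assuming predictFM is a row-wise map over its input (so its output lines up element-for-element with test) and that scores above 0.5 indicate the positive class, a rough accuracy sketch could look like this:

def correct(pair):
    # pair is (probability, true label); count a hit when the thresholded
    # prediction matches the -1/+1 label.
    prob, label = pair
    predicted = 1.0 if prob > 0.5 else -1.0
    return 1.0 if predicted == label else 0.0

labels = test.map(lambda lp: lp.label)
accuracy = prediction.zip(labels).map(correct).mean()
print 'approximate accuracy:', accuracy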