For this tutorial, we'll use BIDMach's GLM (Generalized Linear Model) package, which includes linear regression, logistic regression, and support vector machines (SVMs). The imports below include both BIDMat's matrix classes and BIDMach's machine learning classes.
In [1]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GDMat,GMat,GIMat,GSDMat,GSMat,HMat,Image,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,NMF,RandomForest,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem
Out[1]:
The dataset is the widely used Reuters news article dataset RCV1 V2. This dataset and several others can be downloaded by running the script getdata.sh
from the BIDMach/scripts directory. The data include both train and test subsets, along with train and test category labels (cats).
In [ ]:
var dir = "../data/rcv1/" // Assumes bidmach is run from BIDMach/tutorials. Adjust to point to the BIDMach/data/rcv1 directory
tic
val train = loadSMat(dir+"docs.smat.lz4")
val cats = loadFMat(dir+"cats.fmat.lz4")
val test = loadSMat(dir+"testdocs.smat.lz4")
val tcats = loadFMat(dir+"testcats.fmat.lz4")
toc
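Before training, a quick shape check helps confirm the load (a minimal sketch; BIDMach stores instances as columns, so both matrices should have the same number of columns):

train.nrows // vocabulary size (number of features)
train.ncols // number of training documents
cats.nrows  // number of categories
cats.ncols  // should equal train.ncols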
BIDMach's basic classifiers can be invoked like this on data that fits in memory:
In [ ]:
val (mm, opts) = GLM.learner(train, cats, GLM.logistic)
The last argument specifies the type of model: linear, logistic, or SVM. The syntax is a little unusual: two values are returned. The first, mm,
is a "learner" which bundles the model, optimizer, and mixin classes. The second, opts,
is an options object specialized to that combination of learner components. This design facilitates rapid iteration over model parameters from the command line or notebook.
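For instance, the other model types come from the same call with a different constant (a sketch, assuming the GLM.linear and GLM.svm constants that parallel GLM.logistic):

val (mmLin, optsLin) = GLM.learner(train, cats, GLM.linear) // linear regression
val (mmSvm, optsSvm) = GLM.learner(train, cats, GLM.svm)    // support vector machine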
The parameters of the model can be viewed with opts.what and modified by direct assignment:
In [ ]:
opts.what
opts.lrate=0.3f
Most of these will work well with their default values. On the other hand, a few have a strong effect on performance:

lrate: the learning rate
batchSize: the minibatch size
npasses: the number of passes over the dataset

We will talk about tuning these in a moment; a quick sketch of setting them follows.
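Each is a plain assignment on the options object (the values below are illustrative, not tuned for RCV1):

opts.lrate = 0.3f     // SGD learning rate (illustrative value)
opts.batchSize = 1000 // minibatch size (illustrative value)
opts.npasses = 2      // passes over the dataset

For now, let's train the model: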
In [ ]:
opts.npasses=2
mm.train
The output includes important information about the training cycle:
The likelihood is calculated on a set of minibatches that are held out from training on every cycle. So this is a cross-validated likelihood estimate. Cross-validated likelihood will increase initially, but will then flatten and may decrease. There is random variation in the likelihood estimates because we are using SGD. Determining the best point to stop is tricky to do automatically, and is instead left to the analyst.
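A plot makes the stopping point easier to judge. In many BIDMach builds the learner records its per-interval scores in mm.results, with the likelihood in row 0 (the field name and layout are assumptions worth checking in your version):

plot(mm.results(0,?)) // cross-validated likelihood trace over the run (assumes row 0 holds it)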
To evaluate the model, we build a classifier from it:
In [ ]:
val preds = zeros(tcats.nrows, tcats.ncols) // An array to hold the predictions
val (pp, popts) = GLM.predictor(mm.model, test, preds)
And invoke the predict method on the predictor:
In [ ]:
pp.predict
Although ll (log likelihood) values are printed above, they are not meaningful: there is no target to compare the predictions against.
We can now measure the accuracy of the predictions (the preds matrix) against ground truth (the tcats matrix).
In [ ]:
val lls = mean(ln(1e-7f + tcats ∘ preds + (1-tcats) ∘ (1-preds)),2) // per-category mean log likelihood: scores p where the label is 1 and (1-p) where it is 0; the 1e-7f floor avoids ln(0)
mean(lls)
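The likelihood scores calibrated probabilities. For a plain accuracy number, one could threshold the predictions at 0.5 (a sketch; the cutoff is an arbitrary choice):

val errs = abs((preds > 0.5f) - tcats) // 1 wherever the thresholded prediction disagrees with the label
val acc = 1 - mean(mean(errs, 2))      // overall fraction of (category, document) cells classified correctly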
A more thorough measure is ROC area (AUC):
In [ ]:
val rocs = roc2(preds, tcats, 1-tcats, 100) // Compute ROC curves for all categories
In [ ]:
plot(rocs(?,0->7))
In [ ]:
val aucs = mean(rocs) // column means of the ROC curves approximate the AUC for each category
In [ ]:
aucs(6) // AUC for category 6
We could go ahead and start tuning our model by hand, or we can automate the process, as in the next worksheet.
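As a taste of what that tuning loop looks like, here is a minimal manual sweep over learning rates using only the calls from this tutorial (the candidate rates are illustrative):

for (lr <- List(0.1f, 0.3f, 1f)) {                    // illustrative learning rates
  val (mm2, opts2) = GLM.learner(train, cats, GLM.logistic)
  opts2.lrate = lr
  opts2.npasses = 2
  mm2.train
  val preds2 = zeros(tcats.nrows, tcats.ncols)
  val (pp2, popts2) = GLM.predictor(mm2.model, test, preds2)
  pp2.predict
  val aucs2 = mean(roc2(preds2, tcats, 1-tcats, 100))  // per-category AUCs
  println("lrate=" + lr + "  AUC(category 6)=" + aucs2(6))
}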