BIDMach: basic classification

For this tutorial, we'll use BIDMach's GLM (Generalized Linear Model) package. It includes linear regression, logistic regression, and support vector machines (SVMs). The imports below include both BIDMat's matrix classes and BIDMach's machine learning classes.


In [1]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GDMat,GMat,GIMat,GSDMat,GSMat,HMat,Image,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,NMF,RandomForest,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem


1 CUDA device found, CUDA version 6.5
Out[1]:
(0.9881383,12731871232,12884705280)

Dataset: Reuters RCV1 V2

The dataset is the widely used Reuters news article dataset RCV1 V2. This dataset and several others are loaded by running the script getdata.sh from the BIDMach/scripts directory. The data include both train and test subsets, and train and test labels (cats).


In [ ]:
var dir = "../data/rcv1/"  // Assumes bidmach is run from BIDMach/tutorials. Adjust to point to the BIDMach/data/rcv1 directory
tic
val train = loadSMat(dir+"docs.smat.lz4")
val cats = loadFMat(dir+"cats.fmat.lz4")
val test = loadSMat(dir+"testdocs.smat.lz4")
val tcats = loadFMat(dir+"testcats.fmat.lz4")
toc

BIDMach's basic classifiers can be invoked like this on data that fits in memory:


In [ ]:
val (mm, opts) = GLM.learner(train, cats, GLM.logistic)

The last argument specifies the type of model: linear, logistic, or SVM. The syntax is a little unusual: two values are returned. The first, mm, is a "learner", which bundles model, optimizer, and mixin classes. The second, opts, is an options object specialized to that combination of learner components. This design facilitates rapid iteration over model parameters from the command line or a notebook.
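As a sketch, the same learner call selects the other model types by changing that last argument (assuming the GLM link-function constants shipped with BIDMach):

```scala
// Same data, three different GLM models (illustrative only).
val (linMM, linOpts) = GLM.learner(train, cats, GLM.linear)    // linear regression
val (logMM, logOpts) = GLM.learner(train, cats, GLM.logistic)  // logistic regression
val (svmMM, svmOpts) = GLM.learner(train, cats, GLM.svm)       // SVM (hinge loss)
```

Each call returns its own (learner, options) pair, so the three models can be configured and trained independently.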

The parameters of the model can be viewed by calling opts.what, and modified by assignment:


In [ ]:
opts.what
opts.lrate=0.3f

Most of these will work well with their default values. On the other hand, a few have a strong effect on performance. Those include:

  • lrate: the learning rate
  • batchSize: the minibatch size
  • npasses: the number of passes over the dataset
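These three can be set together on the options object before training. A sketch with illustrative values (not tuned recommendations):

```scala
// Illustrative settings only; good values depend on the dataset.
opts.lrate = 0.3f        // SGD learning rate
opts.batchSize = 10000   // examples per minibatch
opts.npasses = 2         // full passes over the dataset
```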
We will talk about tuning those in a moment. For now, let's train the model:


In [ ]:
opts.npasses=2
mm.train

The output includes important information about the training cycle:

  • Percentage of dataset processed
  • Cross-validated log likelihood (or negative loss)
  • Overall throughput in gigaflops
  • Elapsed time in seconds
  • Total Gigabytes processed
  • I/O throughput in MB/s
  • GPU memory remaining (if using a GPU)

The likelihood is calculated on a set of minibatches that are held out from training on every cycle, so it is a cross-validated likelihood estimate. It will increase initially, then flatten and may eventually decrease. Because we are using SGD, there is random variation in the estimates. Determining the best point to stop is tricky to do automatically, and is instead left to the analyst.

To evaluate the model, we build a classifier from it:


In [ ]:
val preds = zeros(tcats.nrows, tcats.ncols)       // An array to hold the predictions
val (pp, popts) = GLM.predictor(mm.model, test, preds)

And invoke the predict method on the predictor:


In [ ]:
pp.predict

Although log-likelihood (ll) values are printed above, they are not meaningful: during prediction there is no target to compare the predictions with.

We can now compare the accuracy of predictions (preds matrix) with ground truth (the tcats matrix).


In [ ]:
val lls = mean(ln(1e-7f + tcats ∘ preds + (1-tcats) ∘ (1-preds)),2)  // mean logistic log likelihood per category
mean(lls)
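The expression above picks out, for each test example, the predicted probability p when the true label is 1 and 1-p when it is 0, then averages the logs (the 1e-7f guards against log of zero). The same computation for a single category can be sketched in plain Scala, with hypothetical array names standing in for one row of tcats and preds:

```scala
// Mean logistic log likelihood for one category, plain Scala sketch.
// `labels` are 0/1 ground-truth values; `probs` are predicted P(label = 1).
def meanLogLikelihood(labels: Array[Float], probs: Array[Float]): Double = {
  val eps = 1e-7
  val logs = labels.zip(probs).map { case (y, p) =>
    math.log(eps + y * p + (1 - y) * (1 - p))  // p if y == 1, 1 - p if y == 0
  }
  logs.sum / logs.length
}

// A perfect predictor scores near 0; a coin-flip scores about ln(0.5) ≈ -0.69.
meanLogLikelihood(Array(1f, 0f, 1f), Array(0.9f, 0.1f, 0.8f))
```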

A more thorough measure is ROC area:


In [ ]:
val rocs = roc2(preds, tcats, 1-tcats, 100)   // Compute ROC curves for all categories

In [ ]:
plot(rocs(?,0->7))

In [ ]:
val aucs = mean(rocs)

In [ ]:
aucs(6)
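Since each column of rocs is an ROC curve sampled at evenly spaced false-positive rates, taking the mean of a column approximates the area under that curve (a Riemann-sum approximation). A plain-Scala sketch of the idea:

```scala
// AUC as the mean of true-positive rates sampled at evenly
// spaced false-positive rates (Riemann-sum approximation).
def aucFromSamples(tpr: Array[Double]): Double = tpr.sum / tpr.length

// A perfect classifier's sampled curve is all 1s, giving AUC 1.0;
// the diagonal (random) curve averages to about 0.5.
aucFromSamples(Array.fill(100)(1.0))
```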

We could go ahead and start tuning our model by hand, or we can automate the process, as in the next worksheet.