BIDMach: basic classification

For this tutorial, we'll use BIDMach's GLM (Generalized Linear Model) package. It includes linear regression, logistic regression, and support vector machines (SVMs). The imports below include both BIDMat's matrix classes and BIDMach's machine learning classes.


In [1]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GDMat,GMat,GIMat,GSDMat,GSMat,HMat,Image,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.JPlotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,NMF,RandomForest,SFA}
import BIDMach.datasources.{MatSource,FileSource,SFileSource}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
Mat.setInline
if (Mat.hasCUDA > 0) GPUmem


1 CUDA device found, CUDA version 7.0
Out[1]:
(0.96885055,11703132160,12079398912)

Dataset: Reuters RCV1 V2

The dataset is the widely used Reuters news article dataset RCV1 V2. This dataset and several others can be loaded by running the script getdata.sh from the BIDMach/scripts directory. The data include both train and test document matrices, along with the corresponding train and test category labels (cats and tcats).


In [2]:
var dir = "../data/rcv1/"  // Assumes bidmach is run from BIDMach/tutorials. Adjust to point to the BIDMach/data/rcv1 directory
tic
val train = loadSMat(dir+"docs.smat.lz4")
val cats = loadFMat(dir+"cats.fmat.lz4")
val test = loadSMat(dir+"testdocs.smat.lz4")
val tcats = loadFMat(dir+"testcats.fmat.lz4")
toc



Out[2]:
2.314

BIDMach's basic classifiers can be invoked like this on data that fits in memory:


In [3]:
val (mm, opts) = GLM.learner(train, cats, GLM.logistic)



Out[3]:
BIDMach.models.GLM$LearnOptions@3ec51f5

The last argument specifies the type of model: linear, logistic, or SVM. The syntax is a little unusual: two values are returned. The first, mm, is a "learner" that bundles model, optimizer, and mixin classes. The second, opts, is an options object specialized to that combination of learner components. This design facilitates rapid iteration over model parameters from the command line or a notebook.
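
For reference, the other model types are constructed the same way. Here is a sketch (not run in this notebook), assuming your BIDMach build defines the GLM.linear and GLM.svm link constants alongside GLM.logistic:

val (linLearner, linOpts) = GLM.learner(train, cats, GLM.linear)  // linear regression
val (svmLearner, svmOpts) = GLM.learner(train, cats, GLM.svm)     // linear SVM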

The parameters of the model can be listed with opts.what and modified by direct assignment:


In [4]:
opts.what
opts.lrate=0.3f


Option Name       Type          Value
===========       ====          =====
addConstFeat      boolean       false
aopts             Opts          null
autoReset         boolean       true
batchSize         int           10000
checkPointFile    String        null
checkPointInterval  float         0.0
cumScore          int           0
debug             int           0
debugMem          boolean       false
dim               int           256
doubleScore       boolean       false
epsilon           float         1.0E-5
evalStep          int           11
featThreshold     Mat           null
featType          int           1
hashBound1        int           1000000
hashBound2        int           1000000
hashFeatures      int           0
initsumsq         float         1.0E-5
iweight           FMat          null
langevin          float         0.0
lim               float         0.0
links             IMat          1,1,1,1,1,1,1,1,1,1,...
lrate             FMat          1
mask              FMat          null
momentum          FMat          null
nesterov          FMat          null
npasses           int           2
nzPerColumn       int           0
pexp              FMat          0.50000
policies          Function2[]   null
pstep             float         0.01
putBack           int           -1
r1nmats           int           1
reg1weight        FMat          1.0000e-07
resFile           String        null
rmask             FMat          null
sample            float         1.0
sizeMargin        float         3.0
startBlock        int           8000
targets           FMat          null
targmap           FMat          null
texp              FMat          0.50000
updateAll         boolean       false
useCache          boolean       true
useDouble         boolean       false
useGPU            boolean       true
vexp              FMat          0.50000
waitsteps         int           3
Out[4]:
0.30000

Most of these will work well with their default values. On the other hand, a few have a strong effect on performance (a configuration sketch follows the list). Those include:

  • lrate: the learning rate
  • batchSize: the minibatch size
  • npasses: the number of passes over the dataset
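
As a sketch, those three options can be set directly on the opts object before training; the values below are illustrative, not tuned:

opts.lrate = 0.3f       // learning rate for the SGD updates
opts.batchSize = 10000  // documents per minibatch
opts.npasses = 2        // full passes over the dataset
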
We will talk about tuning those in a moment. For now let's train the model:


In [5]:
opts.npasses=2
mm.train


corpus perplexity=5582.125391
pass= 0
 2.00%, ll=-0.69315, gf=1.439, secs=0.5, GB=0.02, MB/s=45.57, GPUmem=0.906281
16.00%, ll=-0.06984, gf=6.724, secs=0.8, GB=0.13, MB/s=169.60, GPUmem=0.906281
30.00%, ll=-0.05885, gf=8.968, secs=1.1, GB=0.25, MB/s=222.78, GPUmem=0.906281
44.00%, ll=-0.05025, gf=10.087, secs=1.5, GB=0.36, MB/s=248.91, GPUmem=0.906281
58.00%, ll=-0.04849, gf=10.703, secs=1.8, GB=0.48, MB/s=263.14, GPUmem=0.906281
72.00%, ll=-0.03930, gf=11.182, secs=2.1, GB=0.59, MB/s=274.45, GPUmem=0.906281
87.00%, ll=-0.04346, gf=11.537, secs=2.5, GB=0.70, MB/s=282.83, GPUmem=0.906281
100.00%, ll=-0.03860, gf=11.681, secs=2.8, GB=0.81, MB/s=284.46, GPUmem=0.905934
pass= 1
 2.00%, ll=-0.04507, gf=11.722, secs=2.9, GB=0.83, MB/s=287.13, GPUmem=0.905934
16.00%, ll=-0.03975, gf=11.876, secs=3.2, GB=0.94, MB/s=290.65, GPUmem=0.905934
30.00%, ll=-0.04341, gf=11.886, secs=3.6, GB=1.05, MB/s=290.78, GPUmem=0.905934
44.00%, ll=-0.03979, gf=11.392, secs=4.2, GB=1.17, MB/s=278.54, GPUmem=0.905934
58.00%, ll=-0.04178, gf=11.367, secs=4.6, GB=1.28, MB/s=277.77, GPUmem=0.905934
72.00%, ll=-0.03438, gf=11.455, secs=5.0, GB=1.39, MB/s=279.87, GPUmem=0.905934
87.00%, ll=-0.04052, gf=11.499, secs=5.4, GB=1.51, MB/s=280.90, GPUmem=0.905934
100.00%, ll=-0.03577, gf=11.575, secs=5.7, GB=1.61, MB/s=281.87, GPUmem=0.905934
Time=5.7200 secs, gflops=11.58

The output includes important information about the training cycle:

  • Percentage of dataset processed
  • Cross-validated log likelihood (or negative loss)
  • Overall throughput in gigaflops
  • Elapsed time in seconds
  • Total Gigabytes processed
  • I/O throughput in MB/s
  • GPU memory remaining (if using a GPU)

The likelihood is calculated on a set of minibatches that are held out from training on every cycle, so this is a cross-validated likelihood estimate. Cross-validated likelihood will increase initially, but will then flatten and may decrease. There is random variation in the likelihood estimates because we are using SGD (stochastic gradient descent). Determining the best point to stop is tricky to do automatically, and is instead left to the analyst; a plotting sketch follows.
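
If your BIDMach version exposes the learner's score history through a results matrix (an assumption here; check mm.results in your build), you can plot the held-out likelihood and eyeball where it flattens:

plot(mm.results(0,?))  // row 0 assumed to hold the held-out ll at each reporting step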

To evaluate the model, we build a classifier from it:


In [6]:
val (pp, popts) = GLM.predictor(mm.model, test)



Out[6]:
BIDMach.models.GLM$PredOptions@3aa5218c

And invoke the predict method on the predictor:


In [7]:
pp.predict


corpus perplexity=65579.335560
Predicting
 3.00%, ll=0.00000, gf=0.053, secs=0.3, GB=0.00, MB/s= 1.82, GPUmem=0.95
 6.00%, ll=0.00000, gf=0.103, secs=0.3, GB=0.00, MB/s= 3.56, GPUmem=0.95
10.00%, ll=0.00000, gf=0.147, secs=0.3, GB=0.00, MB/s= 5.02, GPUmem=0.95
13.00%, ll=0.00000, gf=0.198, secs=0.3, GB=0.00, MB/s= 6.80, GPUmem=0.95
16.00%, ll=0.00000, gf=0.245, secs=0.3, GB=0.00, MB/s= 8.41, GPUmem=0.95
20.00%, ll=0.00000, gf=0.290, secs=0.3, GB=0.00, MB/s= 9.91, GPUmem=0.95
23.00%, ll=0.00000, gf=0.330, secs=0.3, GB=0.00, MB/s=11.26, GPUmem=0.95
26.00%, ll=0.00000, gf=0.372, secs=0.3, GB=0.00, MB/s=12.71, GPUmem=0.95
30.00%, ll=0.00000, gf=0.418, secs=0.3, GB=0.00, MB/s=14.29, GPUmem=0.95
33.00%, ll=0.00000, gf=0.464, secs=0.3, GB=0.00, MB/s=15.87, GPUmem=0.95
36.00%, ll=0.00000, gf=0.511, secs=0.3, GB=0.01, MB/s=17.49, GPUmem=0.95
40.00%, ll=0.00000, gf=0.554, secs=0.3, GB=0.01, MB/s=18.93, GPUmem=0.95
43.00%, ll=0.00000, gf=0.595, secs=0.3, GB=0.01, MB/s=20.35, GPUmem=0.95
46.00%, ll=0.00000, gf=0.656, secs=0.3, GB=0.01, MB/s=22.49, GPUmem=0.95
50.00%, ll=0.00000, gf=0.696, secs=0.3, GB=0.01, MB/s=23.89, GPUmem=0.95
53.00%, ll=0.00000, gf=0.738, secs=0.3, GB=0.01, MB/s=25.29, GPUmem=0.95
56.00%, ll=0.00000, gf=0.782, secs=0.3, GB=0.01, MB/s=26.81, GPUmem=0.95
60.00%, ll=0.00000, gf=0.824, secs=0.3, GB=0.01, MB/s=28.27, GPUmem=0.95
63.00%, ll=0.00000, gf=0.864, secs=0.3, GB=0.01, MB/s=29.61, GPUmem=0.95
66.00%, ll=0.00000, gf=0.905, secs=0.3, GB=0.01, MB/s=31.03, GPUmem=0.95
70.00%, ll=0.00000, gf=0.946, secs=0.3, GB=0.01, MB/s=32.43, GPUmem=0.95
73.00%, ll=0.00000, gf=0.986, secs=0.3, GB=0.01, MB/s=33.81, GPUmem=0.95
76.00%, ll=0.00000, gf=1.021, secs=0.3, GB=0.01, MB/s=34.98, GPUmem=0.95
80.00%, ll=0.00000, gf=1.066, secs=0.3, GB=0.01, MB/s=36.56, GPUmem=0.95
83.00%, ll=0.00000, gf=1.106, secs=0.3, GB=0.01, MB/s=37.92, GPUmem=0.95
86.00%, ll=0.00000, gf=1.146, secs=0.3, GB=0.01, MB/s=39.30, GPUmem=0.95
90.00%, ll=0.00000, gf=1.180, secs=0.3, GB=0.01, MB/s=40.43, GPUmem=0.95
93.00%, ll=0.00000, gf=1.224, secs=0.3, GB=0.01, MB/s=41.99, GPUmem=0.95
96.00%, ll=0.00000, gf=1.262, secs=0.3, GB=0.01, MB/s=43.28, GPUmem=0.95
100.00%, ll=0.00000, gf=1.300, secs=0.3, GB=0.01, MB/s=44.61, GPUmem=0.95
Time=0.3150 secs, gflops=1.30

Although ll values are printed above, they are not meaningful (there is no target to compare the prediction with).

We can now compare the predictions (the preds matrix) against ground truth (the tcats matrix) to measure accuracy.


In [8]:
val preds = FMat(pp.preds(0))   // convert the first saved prediction matrix to an FMat



Out[8]:
    0.051716  2.8286e-15  9.3898e-10  9.3898e-10  0.00024929  8.0685e-12...
    0.011135  3.4162e-05  4.7918e-06  4.7918e-06   0.0059853  9.2513e-08...
     0.78789  2.8876e-12  1.7810e-09  1.7810e-09  1.3463e-05  5.0863e-09...
    0.011328  2.3973e-15  1.1961e-11  1.1961e-11  3.4776e-05  8.7643e-06...
     0.99887  1.0619e-06  2.1314e-09  2.1314e-09  2.3170e-05           1...
  2.0127e-07     0.99854  7.5009e-07  7.5009e-07     0.14460  3.9604e-07...
  1.1595e-05           1      1.0000      1.0000     0.90988  1.4346e-05...
  3.8421e-07  7.1216e-06     0.73840     0.73840    0.062895  4.6554e-11...
          ..          ..          ..          ..          ..          ..
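
We can score these probabilities with the actual logistic log likelihood, adding a small constant (1e-7) for numerical stability: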

In [9]:
val lls = mean(ln(1e-7f + tcats ∘ preds + (1-tcats) ∘ (1-preds)),2)  // actual logistic log likelihood (∘ is BIDMat's elementwise product)
mean(lls)



Out[9]:
-0.039590
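
A simple overall accuracy can also be computed by thresholding the probabilities at 0.5. This sketch assumes BIDMat's elementwise comparisons (>, ==) return 0/1 float matrices, as they do for FMat:

val accPerCat = mean((preds > 0.5f) == tcats, 2)  // per-category accuracy over test documents
mean(accPerCat)                                   // overall mean accuracy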

A more thorough measure is ROC area:


In [10]:
val rocs = roc2(preds, tcats, 1-tcats, 100)   // Compute ROC curves for all categories
plot(rocs(?,6))



Out[10]:
[plot: ROC curve for category 6]

In [11]:
plot(rocs(?,0->5))



Out[11]:
[plot: ROC curves for categories 0 through 4]

In [12]:
val aucs = mean(rocs)   // the mean of each sampled ROC curve approximates its area (AUC)



Out[12]:
0.92796,0.95949,0.97780,0.96988,0.98077,0.94405,0.97540,0.97547,0.97647,0.94959,0.97718,0.94474,0.86756,0.91574,0.98457,0.96938,0.97628,0.95293,0.91328,0.95620,0.90527,0.91562,0.95315,0.94977,0.97943,0.95953,0.96210,0.95130,0.97605,0.94854,0.94699,0.96866,0.94554,0.96259,0.91478,0.95784,0.90442,0.96816,0.80705,0.94270,0.94160,0.95267,0.66955,0.92352,0.95091,0.92467,0.90882,0.95015,0.96943,0.96349,0.97251,0.97540,0.91864,0.95583,0.93168,0.96431,0.96347,0.61410,0.91780,0.96541,0.98574,0.91007,0.96804,0.76783,0.94777,0.87311,0.95931,0.98108,0.90297,0.98872,0.92896,0.96976,0.88119,0.88600,0.90381,0.93929,0.92118,0.90963,0.96243,0.95381,0.96487,0.88981,0.91170,0.92376,0.93833,0.87769,0.74387,0.93540,0.95893,0.98602,0.95531,0.93069,0.96342,0.97120,0.95280,0.78630,0.69109,0.66667,0.44554,0.98515,0.50000,NaN,NaN
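
(The trailing NaN values likely correspond to categories with no positive examples in the test set, for which an ROC curve is undefined.)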

In [13]:
aucs(6)   // ROC area for category 6, matching the curve plotted above



Out[13]:
0.9753971348784144

We could go ahead and start tuning our model by hand, or we can automate the process as in the next worksheet.