Machine Learning at Scale, Part II

First, let's initialize BIDMach again.


In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem

Check the GPU memory above. It should be close to 1 again (0.98 or greater). If not, check whether you have any "zombie" java processes by running ps -ef | grep java from the Linux shell. You will normally have two java processes: one for the main IScala shell and one for this notebook.
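
If memory is low but there is no zombie process, you can also try releasing BIDMach's cached GPU storage from inside this notebook. A minimal sketch, assuming resetGPU and Mat.clearCaches (available through the BIDMat imports above) behave as they do in other BIDMach notebooks:


In [ ]:
resetGPU                 // reset the current GPU device and free its allocations
Mat.clearCaches          // clear BIDMat's matrix caches
GPUmem                   // re-check the free-memory fraction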

Sourcing Data from Disk

Training models with data that fits in memory is very limiting. But minibatch learners can easily work with data directly from disk. They access data sequentially (which avoids seeking and maximizes disk throughput) and converge in very few passes. We'll look again at the MNIST data set, this time the full version with 8 million images (about 10 GB, and 120x larger than the small MNIST dataset). The dataset has been partitioned into groups of 100k images (e.g. using the Unix split command, or on a cluster) and saved as lz4-compressed files.
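
For reference, here is a rough sketch of how such a partition could be written out with BIDMat. The matrix names alldata and allcats (the full feature and label matrices) and the save location are illustrative assumptions, not something provided by the dataset, and saveFMat is assumed to pick lz4 compression from the .lz4 extension:


In [ ]:
// Illustrative only: alldata (features x images) and allcats (10 x images) are assumed
// to hold the full dataset, which in practice would itself have to be streamed.
val dir = "../data/MNIST8M/parts/"
val blksize = 100000                                   // 100k images per part
val nblocks = (alldata.ncols + blksize - 1) / blksize
for (i <- 0 until nblocks) {
  val inds = irow(i*blksize until math.min((i+1)*blksize, alldata.ncols))
  saveFMat(dir + ("data%02d.fmat.lz4" format i), alldata(?, inds))
  saveFMat(dir + ("cats%02d.fmat.lz4" format i), allcats(?, inds))
}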

We're also going to build up a learner in stages. First we define an options class that mixes in the option traits for each component we'll use (learner, data source, model and updater).


In [ ]:
class xopts extends Learner.Options with FilesDS.Opts with GLM.Opts with ADAGrad.Opts; 
val mnopts = new xopts

Next come the options for the data source. There are quite a few of these, but they only need to be set once and, apart from the minibatch size, don't need to be tuned.

The following options specify various useful properties of the data source. Many of these are default values and don't actually need to be set, but it's useful to know what they do. Most options are explained below.

The fnames option specifies a list of functions that return a filename given an integer. It's a list so you can specify a single datasource that combines data from different files, e.g. data files and target files for regression. The method FilesDS.simpleEnum() takes a format string argument and returns such a function, which uses the format string and an integer argument to construct the filename to retrieve. So this source builds data blocks from data00.fmat.lz4, data01.fmat.lz4, ... together with cats00.fmat.lz4, cats01.fmat.lz4, ..., stacking the rows of the "cats" (label) matrices on top of the rows of the corresponding "data" matrices.


In [ ]:
val mdir = "../data/MNIST8M/parts/"
mnopts.fnames = List(FilesDS.simpleEnum(mdir+"data%02d.fmat.lz4", 1, 0),  // File name templates, %02d is replaced by a number
                     FilesDS.simpleEnum(mdir+"cats%02d.fmat.lz4", 1, 0));
mnopts.nstart = 0;                 // Starting file number
mnopts.nend = 70;                  // Ending file number
mnopts.order = 0;                  // (0) sample order, 0=linear, 1=random
mnopts.lookahead = 2;              // (2) number of prefetch threads
mnopts.featType = 1;               // (1) feature type, 0=binary, 1=linear
mnopts.addConstFeat = true         // add a constant feature (effectively adds a $\beta_0$ term to $X\beta$)
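
As a quick sanity check, you can call one of these enumerators directly: given the description above, it should map a block number to the corresponding filename (a sketch; the exact numbering depends on the stride and offset arguments):


In [ ]:
val fgen = FilesDS.simpleEnum(mdir+"data%02d.fmat.lz4", 1, 0)   // an Int => String function
fgen(3)                     // expect something like "../data/MNIST8M/parts/data03.fmat.lz4"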

The next options define the type of GLM model (linear, logistic, etc.) through the links value. They also specify, through the targets matrix, how to extract target values from the input matrix. Finally, the mask is a vector that specifies which data values can be used for training; i.e. it should exclude the target values.


In [ ]:
mnopts.links = GLM.maxp*iones(10,1) // one link-function code per target (10 digit classes)
mnopts.autoReset = false            // Don't reset the GPU after the training run, so we can use a GPU model for prediction

Next we'll create a custom datasource. We need a bit of runtime configuration to ensure that the datasource runs reliably. Because it prefetches files with separate threads, we need to make sure that enough threads are available or it will stall. The threadPool(4) call takes care of this.


In [ ]:
val ds = {
  implicit val ec = threadPool(4)   // make sure there are enough threads (more than the lookahead count)
  new FilesDS(mnopts)              // the datasource
}
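
Before wiring the datasource into a learner you can peek at what it delivers. This is a sketch only, assuming the standard DataSource interface (init, next, reset) works here as it does elsewhere in BIDMach; you may want to set mnopts.batchSize (see the tuning section below) before trying it:


In [ ]:
ds.init                              // open the files and start the prefetch threads
val mats = ds.next                   // pull one minibatch: an array of matrix blocks
mats.map(m => (m.nrows, m.ncols))    // inspect the dimensions of each block
ds.reset                             // rewind so training starts from the first file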

Next we define the main learner class, which is built up from various "plug-and-play" learning modules.


In [ ]:
val nn = new Learner(                // make a learner instance
    ds,                              // datasource
    new GLM(mnopts),                 // the model (a GLM model)
    null,                            // list of mixins or regularizers
    new ADAGrad(mnopts),             // the optimization class to use
    mnopts)                          // pass the options to the learner as well
nn

Tuning Options

The following options are the important ones for tuning.


In [ ]:
mnopts.batchSize = 1000
mnopts.npasses = 2
mnopts.lrate = 0.001

You invoke the learner the same way as before. You can change the options above after each run to optimize performance.


In [ ]:
nn.train
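
Since the options can be changed between runs, a simple way to compare settings is to sweep one parameter and retrain. A minimal sketch (the learning-rate values below are just examples; compare the accuracy and likelihood reported in the training output for each run):


In [ ]:
for (lr <- List(0.0001, 0.001, 0.01)) {   // example learning rates to compare
  mnopts.lrate = lr                       // change the option, as described above
  nn.train                                // retrain with the new setting
}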

Note the relatively low gflops throughput and high I/O throughput for this dataset: the problem is clearly I/O bound. This is often true for regression problems, especially those with few targets (there were 10 here vs. 104 for RCV1). To study compute limits, we'll next do a compute-intensive problem: K-means clustering.

Shut down this notebook before continuing

Go back to your main notebook page and click on Shutdown next to this notebook.


In [ ]: