Machine Learning at Scale, Part III



In [ ]:

    
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem

KMeans clustering at scale

Training only on data that fits in memory is very limiting. Minibatch learners, however, can work with data directly from disk. They access data sequentially (which avoids seeking and maximizes disk throughput) and converge in very few passes. We'll look again at the MNIST data set, this time the full version with 8 million images (about 17 GB). The dataset has been partitioned into groups of 100k images (using the Unix split command) and saved in compressed lz4 files.

If you don't yet have the "alls" files, you can build them from the cats and part files with the cell below. Just uncomment the for loop. As you can see, alls is built by stacking a scaled-up copy of the category matrix on top of the data matrix. The exaggerated category feature forces each cluster to be pure, that is, to contain images of a single category. It also gives each cluster center a category feature, which can be used to retrieve its dominant category for labeling.



In [ ]:

    
val mdir = "../data/MNIST8M/parts/"
/* for (i <- 0 to 80) {
    val a = loadFMat(mdir+"part%02d.fmat.lz4" format i)
    val c = loadFMat(mdir+"cats%02d.fmat.lz4" format i)
    saveFMat(mdir+"data%02d.fmat.lz4" format i, a)
    val alls = (c * 10000f) on a
    saveFMat(mdir+"alls%02d.fmat.lz4" format i, alls)
    print(".")
} */
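
To see why a scale factor of 10000 is large enough to keep clusters pure, it helps to compare the squared distance contributed by the category rows with the largest squared distance the pixels alone can contribute. The sketch below is plain Scala rather than BIDMat. It assumes pixel values in the range 0 to 255 and one-hot category rows, neither of which is stated above, so treat the numbers as illustrative.

```scala
// Assumptions (not stated in the notebook): pixels lie in [0, 255]
// and the category block is one-hot before scaling.
val numPixels = 28 * 28          // 784 pixel features per image
val maxPixel  = 255.0            // assumed largest pixel value
val catScale  = 10000.0          // scale factor applied to the category rows

// Largest possible squared distance between two images from pixels alone.
val maxPixelSqDist = numPixels * maxPixel * maxPixel

// Squared distance added when two points have different categories:
// two one-hot entries differ, each by catScale.
val catSqDist = 2 * catScale * catScale

println(f"max pixel squared distance: $maxPixelSqDist%.0f")
println(f"category squared distance:  $catSqDist%.0f")
```

Under these assumptions, two images with different labels are always farther apart than any two images with the same label, which is what pushes each cluster toward a single category.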

First we define a class xopts which contains all the options for the learner we will use. BIDMach has a modular design with pluggable learning pieces. This class holds all the options for those pieces. Learner is the main learning class, FilesDS is the data source, KMeans is the model and Batch is the updater (the code that gets run to update the model) for KMeans.

Then we make an instance of that class called mnopts.



In [ ]:

    
class xopts extends Learner.Options with FilesDS.Opts with KMeans.Opts with Batch.Opts; 
val mnopts = new xopts
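
The options class works through ordinary Scala trait mixins: each pluggable piece contributes its own fields, and a single object carries all of them. Here is a minimal sketch of the pattern in plain Scala. The trait names, field names and default values are hypothetical stand-ins, not BIDMach's actual definitions.

```scala
// Hypothetical option traits, standing in for FilesDS.Opts, KMeans.Opts, etc.
// The defaults below are made up for illustration.
trait SourceOpts  { var batchSize: Int = 10000; var lookahead: Int = 2 }
trait ClusterOpts { var dim: Int = 256 }
trait UpdaterOpts { var npasses: Int = 2 }

// One options class picks up the fields of every trait it mixes in.
class DemoOpts extends SourceOpts with ClusterOpts with UpdaterOpts

val demo = new DemoOpts
demo.dim = 300          // override a default from ClusterOpts
demo.batchSize = 50000  // override a default from SourceOpts
println(s"dim=${demo.dim} batchSize=${demo.batchSize} npasses=${demo.npasses}")
```

Because every piece reads from the same object, the one instance can be handed to the data source, the model, the updater and the learner alike.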

Next come the options for the data source. There are quite a few of these, but they only need to be set once and, apart from the minibatch size, don't need to be tuned.

The following options specify various useful traits of the data source. Many of them are set to their default values here and don't actually need to be set, but it's useful to know what they do.



In [ ]:

    
mnopts.fnames = List(FilesDS.simpleEnum(mdir+"alls%02d.fmat.lz4", 1, 0));  // File name templates, %02d is replaced by a number
mnopts.nstart = 0;                 // Starting file number
mnopts.nend = 80;                  // Ending file number
mnopts.order = 0;                  // (0) sample order, 0=linear, 1=random
mnopts.lookahead = 2;              // (2) number of prefetch threads
mnopts.featType = 1;               // (1) feature type, 0=binary, 1=linear
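
The file name template is a printf-style format string. As a small illustration in plain Scala (the helper name partName is hypothetical, and this is not how FilesDS itself enumerates files), the template expands like this:

```scala
// Expand a printf-style template into concrete file names.
val template = "alls%02d.fmat.lz4"
def partName(i: Int): String = template format i

val firstThree = (0 until 3).map(partName)
println(firstThree.mkString(", "))
```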

Next we define the number of KMeans clusters.

autoReset controls whether the Learner resets GPU memory after training. Setting it to false leaves GPU memory, and with it the trained model, intact after the run.



In [ ]:

    
mnopts.autoReset = false            // Don't reset the GPU after the training run, so we can use a GPU model for prediction
mnopts.dim = 300                    // Number of kmeans clusters

Next we'll create a custom datasource. We need a bit of runtime configuration to ensure that the datasource runs reliably. Because it prefetches files with separate threads, we need to make sure that enough threads are available or it will stall. The threadPool(4) call takes care of this.



In [ ]:

    
val ds = {
  implicit val ec = threadPool(4)   // make sure there are enough threads (more than the lookahead count)
  new FilesDS(mnopts)              // the datasource
}

Next we define the main learner class, which is built up from various "plug-and-play" learning modules.



In [ ]:

    
val nn = new Learner(                // make a learner instance
    ds,                              // datasource
    new KMeans(mnopts),              // the model (a KMeans model)
    null,                            // list of mixins or regularizers
    new Batch(mnopts),               // the optimization class to use
    mnopts)                          // pass the options to the learner as well
nn

Tuning Options

The following options are the important ones for tuning. For KMeans, batchSize has no effect on accuracy, since the algorithm uses all the data instances to perform an update. So you're free to tune it for best speed. Generally larger is better, as long as you don't use too much GPU RAM.

npasses is the number of passes over the dataset. Larger is typically better, but the model may overfit at some point.
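
The claim that batchSize does not affect accuracy follows from how a batch update works: per-cluster sums and counts are accumulated over the whole pass, and the centroids are recomputed only once at the end. The following plain-Scala sketch shows this on 1-D points. It is a simplified illustration with a hypothetical onePass function, not BIDMach's Batch updater.

```scala
// One pass of batch k-means on 1-D points: assign every point to its
// nearest centroid, accumulate per-cluster sums and counts, then
// recompute the centroids once at the end of the pass.
def onePass(points: Seq[Double], centroids: Seq[Double], batchSize: Int): Seq[Double] = {
  val sums   = Array.fill(centroids.length)(0.0)
  val counts = Array.fill(centroids.length)(0)
  for (batch <- points.grouped(batchSize); x <- batch) {
    val nearest = centroids.indices.minBy(k => math.abs(x - centroids(k)))
    sums(nearest) += x
    counts(nearest) += 1
  }
  // Keep the old centroid if a cluster received no points.
  centroids.indices.map(k => if (counts(k) > 0) sums(k) / counts(k) else centroids(k))
}

val pts   = Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)
val init  = Seq(0.0, 9.0)
val small = onePass(pts, init, batchSize = 2)
val large = onePass(pts, init, batchSize = 6)
println(s"batchSize=2: $small, batchSize=6: $large")
```

Since the centroids stay fixed during the pass, splitting the data into more or fewer minibatches changes only how the work is chunked, not the assignments or the resulting means.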



In [ ]:

    
mnopts.batchSize = 50000
mnopts.npasses = 6

You invoke the learner the same way as before. You can change the options above after each run to optimize performance.



In [ ]:

    
nn.train

Now let's extract the model as a floating-point matrix. We included the category features for clustering to make sure that each cluster is a subset of images for one digit.



In [ ]:

    
val model=FMat(nn.modelmat)

Next we build a 30 x 10 array of images to view the first 300 cluster centers as images.



In [ ]:

    
val nx = 30
val ny = 10
val im = zeros(28,28)
val allim = zeros(28*nx,28*ny)
for (i<-0 until nx) {
    for (j<-0 until ny) {
        val slice = model(i+nx*j,10->794)
        im(?) = slice(?)
        allim((28*i)->(28*(i+1)), (28*j)->(28*(j+1))) = im
    }
}
Image.show(allim kron ones(2,2))

We'll predict using the closest cluster (or 1-NN if you like). The classify function below takes a block of data (which includes the labels in rows 0->10), and predicts using the other features. It then stacks the predicted categories (first row) on top of the actual categories (second row) in a 2 x k matrix, where k is the number of samples.



In [ ]:

    
val datamodel = model(?, 10->794)
val catmodel = model(?, 0->10)
val (vcat, icat) = maxi2(catmodel,2)
val mdot = (datamodel ∙→ datamodel)

def classify(a:FMat):IMat = {
    val cdata = a(0->10, ?);
    val (vm, im) = maxi2(cdata);
    val ddata = a(10->794, ?);
    val dists = -2 *(datamodel * ddata) + (ddata ∙ ddata) + mdot;
    val (vdist, idist) = mini2(dists);
    icat(idist) on im
}
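
The distance computation in classify relies on expanding the squared Euclidean distance between a center c and a sample x as |c|^2 - 2 c.x + |x|^2. The sketch below checks that expansion against the direct formula and uses it to pick the nearest center. It is written in plain Scala with arrays instead of BIDMat matrices, and the function names are illustrative.

```scala
// Squared Euclidean distance computed two ways: directly, and through
// the expansion |c|^2 - 2 c.x + |x|^2 used by classify.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

def directSqDist(c: Array[Double], x: Array[Double]): Double =
  c.zip(x).map { case (u, v) => (u - v) * (u - v) }.sum

def expandedSqDist(c: Array[Double], x: Array[Double]): Double =
  dot(c, c) - 2 * dot(c, x) + dot(x, x)

// Index of the nearest centroid; the |x|^2 term is the same for every
// centroid, so it could be dropped without changing the argmin.
def nearest(centroids: Seq[Array[Double]], x: Array[Double]): Int =
  centroids.indices.minBy(k => expandedSqDist(centroids(k), x))

val centers = Seq(Array(0.0, 0.0), Array(4.0, 4.0), Array(0.0, 5.0))
val sample  = Array(3.0, 3.0)
println(s"nearest centroid index: ${nearest(centers, sample)}")
```

The advantage of the expanded form is that the cross term for all centers and all samples is a single matrix product, which is the expensive part and the part that matrix libraries do fastest.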

The cmatch function takes the predicted and actual categories and constructs a 10x10 confusion matrix from them. The confusion matrix element c(i,j) is the count of inputs that were predicted to be in category i, but are actually in category j. It's basically just a call to the accum function.



In [ ]:

    
def cmatch(crows:IMat):DMat = {
    accum(crows.t, 1.0, 10, 10)
}
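
For readers unfamiliar with accum, the accumulation it performs here can be sketched in plain Scala as a loop over (predicted, actual) pairs. The function name confusion and the toy labels are made up for illustration.

```scala
// Build a confusion matrix from (predicted, actual) label pairs.
// counts(i)(j) is the number of samples predicted as i whose actual label is j.
def confusion(pairs: Seq[(Int, Int)], numClasses: Int): Array[Array[Double]] = {
  val counts = Array.fill(numClasses, numClasses)(0.0)
  for ((predicted, actual) <- pairs) counts(predicted)(actual) += 1.0
  counts
}

val pairs = Seq((0, 0), (1, 1), (1, 0), (2, 2), (2, 2))
val conf  = confusion(pairs, 3)
conf.foreach(row => println(row.mkString(" ")))
```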

To evaluate, we'll run the classification on the part of the data source that the training run didn't read. With the settings below that is file 80, since training used nstart=0 and nend=80.



In [ ]:

    
mnopts.nstart=80
mnopts.nend=81
ds.reset

This code draws minibatches from the datasource, computes predictions from them, and then adds the corresponding counts to the confusion matrix acc.



In [ ]:

    
var k = 0
val acc = dzeros(10,10)
while (ds.hasNext) {
    val mats=ds.next
    val f=FMat(mats(0))
    acc ~ acc + cmatch(classify(f))
    k += 1
    print(".")
}

Once we have the confusion counts, we can normalize the matrix of counts to produce a matrix sacc which is the fraction of samples with actual label j that are classified as i.

It's common to show this matrix as a 2D gray-scale or false-color plot with white as 1.0 and black as 0.0.



In [ ]:

    
val sacc = FMat(acc/sum(acc))
Image.show((sacc * 250f) ⊗ ones(64,64))
sacc
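
The normalization divides every column of the count matrix by that column's total. A plain-Scala sketch of the same operation on a toy 2x2 matrix follows. The names are illustrative, and the guard against empty columns is an addition that the BIDMat expression does not have.

```scala
// Divide each column of a count matrix by its column total, so that
// column j holds the distribution of predictions for actual label j.
def normalizeColumns(counts: Array[Array[Double]]): Array[Array[Double]] = {
  val colSums = counts.transpose.map(_.sum)
  counts.map(row => row.indices.map(j =>
    if (colSums(j) > 0) row(j) / colSums(j) else 0.0).toArray)
}

val toyCounts = Array(
  Array(8.0, 1.0),   // predicted 0
  Array(2.0, 3.0))   // predicted 1
val toyRates = normalizeColumns(toyCounts)
toyRates.foreach(row => println(row.mkString(" ")))
```

After normalization each column sums to 1, so the diagonal entry of column j is the fraction of label-j samples that were classified correctly.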

It's useful to isolate the correct classification rate by digit, which is:



In [ ]:

    
val dacc = getdiag(sacc).t

We can take the mean of the diagonal accuracies to get an overall accuracy for this model.



In [ ]:

    
mean(dacc)
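
Note that averaging the diagonal gives every digit equal weight, so it matches the overall fraction of correct predictions only when each digit has the same number of test samples. The toy example below, in plain Scala with made-up counts, shows how the two numbers can differ when class sizes are unequal.

```scala
// Mean per-class accuracy versus overall accuracy on a toy
// confusion matrix with unequal class sizes.
// cm(i)(j): samples predicted as i whose actual label is j.
val cm = Array(
  Array(90.0, 5.0),
  Array(10.0, 5.0))

val classTotals  = cm.transpose.map(_.sum)   // samples per actual label
val perClassAcc  = cm.indices.map(j => cm(j)(j) / classTotals(j))
val meanClassAcc = perClassAcc.sum / perClassAcc.length
val overallAcc   = cm.indices.map(j => cm(j)(j)).sum / classTotals.sum
println(f"mean per-class: $meanClassAcc%.3f, overall: $overallAcc%.3f")
```

For MNIST the digit classes are of roughly similar size, so the two measures should be close, but they are not identical in general.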

Run the experiment again with a larger number of clusters (3000, then 30000). You should reduce the batchSize option to 20000 to avoid memory problems.

Include the training time output by the call to nn.train but not the evaluation time (the evaluation code above is not using the GPU). Rerun and fill out the table below:

KMeans Clusters	Training time	Avg. gflops	Accuracy
300	...	...	...
3000	...	...	...
30000	...	...	...



In [ ]: