Machine Learning at Scale, Part I

For this tutorial, we'll explore machine learning at scale. Much of the recent focus on scaling machine learning has been on cluster computing, with much less attention to single-machine performance. It turns out that there is a lot of mileage in optimizing single-machine performance for machine learning, especially by leveraging inexpensive GPUs. In fact, single-node hardware holds the record for the most common machine learning tasks. See this collection of recent benchmarks. There are several steps to getting high performance on a single machine:

  • "rooflining" or optimizing the performance of each kernel against the theoretical limits imposed by memory or the chip's arithmetic units.
  • Rooflined GPU kernels: there is even more headroom for improvements in machine learning algorithms on graphics processors.
  • Efficient optimization: Fast optimization (SGD with ADAGRAD or fast coordinate ascent) can add another order of magnitude or more of performance.
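
To make the "rooflining" idea concrete: the roofline model caps a kernel's attainable throughput at the smaller of the chip's peak arithmetic rate and its memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). Below is a minimal sketch; both peak numbers are placeholder assumptions, not measurements of any particular machine.


In [ ]:
// Roofline model: throughput is capped either by compute or by memory traffic.
// Both peak figures below are placeholder assumptions.
val peakGflops = 1000.0            // assumed peak arithmetic rate, in gflops
val peakGBs = 150.0                // assumed peak memory bandwidth, in GB/s

def roofline(flopsPerByte: Double): Double =
  math.min(peakGflops, peakGBs * flopsPerByte)

roofline(0.25)                     // sparse multiply-add: memory-bound, ~37 gflops
roofline(32.0)                     // large dense multiply: compute-bound at the peak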

Each of these can contribute an order of magnitude or more of speedup. With the tools you've used so far, and modest hardware of your own or in the cloud, you can tackle problems at the frontier of scale for machine learning. First, let's initialize BIDMach as usual.


In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL                   // load the Intel MKL native CPU libraries
Mat.checkCUDA                  // detect CUDA and load the GPU libraries
if (Mat.hasCUDA > 0) GPUmem    // report GPU memory if a GPU was found

This time you should have a graphics processor and CUDA version 6.5. If not, something strange is afoot: check that you started with "bidmach notebook" and not "ipython notebook". If that's not the problem, it's best to ask for help.

The next line is a report on available GPU memory. The first number is the fraction of GPU memory available, and it should be close to 1. If not, either you have another java process that is holding GPU memory, or someone else on the same machine does. It's fine if other people are using the machine and its GPU, but avoid leaving zombie processes of your own around. Check from the shell with ps -ef | grep java, and kill any of your own processes that you don't need with kill -9 process_id.
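
For reference, the cell below shows one way to read that report programmatically. The triple shape of GPUmem's return value (fraction free, bytes free, bytes total) is an assumption based on this tutorial's output; check it in your own build.


In [ ]:
// Assumed: GPUmem returns (fraction of GPU memory free, bytes free, bytes total).
val (fracFree, bytesFree, bytesTotal) = GPUmem
if (fracFree < 0.9f) println("GPU memory is low: look for stray java processes")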

Scaled-up Logistic Regression

Let's load the full RCV1 dataset. This dataset is only about 0.5 GB, so we'll load it directly into memory. It is not a big dataset, but it has been widely benchmarked and is still challenging for some systems.


In [ ]:
val dir = "../data/"                                // Assumes you started the notebook from <BIDMach>/tutorials
val data = loadSMat(dir+"rcv1/docs.smat.lz4")
val cats = loadFMat(dir+"rcv1/cats.fmat.lz4")
val testdata = loadSMat(dir+"rcv1/testdocs.smat.lz4")
val testcats = loadFMat(dir+"rcv1/testcats.fmat.lz4")

Check the types of data and cats. They are "SMat" and "FMat" respectively: sparse and dense matrices of float values. Dense matrices are simply arrays, with a storage element for each location A(i,j) of a matrix A. Sparse matrices are used for data with many zeros: they store tuples of (row, column, value) for the nonzero elements (sometimes the column index is implicit), rather like a Pandas table. You'll notice they display differently for that reason:


In [ ]:
data
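
To see the difference concretely, you can build a small matrix of each kind yourself. This sketch uses the BIDMat constructors rand, sprand and full (in scope after the imports above); the sizes and density are arbitrary choices.


In [ ]:
val dense = rand(4, 4)                  // dense FMat: a float stored for every position
val sparse = sprand(1000, 1000, 0.01)   // sparse SMat with about 1% nonzeros
sparse.nnz                              // count of stored (row, column, value) entries
val expanded = full(sparse)             // dense copy: a million floats instead of ~10,000 entries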

In all the datasets we use, rows index features (words here) and columns index "instances" or input samples, which are documents for this dataset. So data(?,20) is the $20^{th}$ input sample.
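
For example, with the matrices just loaded (? selects an entire dimension):


In [ ]:
size(data)                 // (nfeatures, ndocs): rows are words, columns are documents
val doc20 = data(?, 20)    // one column: the word counts of a single document
val cats20 = cats(?, 20)   // the category labels of the same document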

Let's construct a learner similar to the one we built in lab, ready for training.


In [ ]:
val preds = zeros(testcats.nrows, testcats.ncols)   // buffer to hold test-set predictions
// returns (training learner, training options, prediction learner, prediction options)
val (rcv, opts, trcv, topts) = GLM.learner(data, cats, testdata, preds, GLM.maxp)

We can tailor the model's options before launching it:


In [ ]:
opts.batchSize=10000    // examples per minibatch update
opts.npasses=2          // passes over the full dataset
opts.autoReset=false    // don't reset the GPU after training, so we can predict afterwards

In [ ]:
rcv.train

Understanding Performance Stats

Let's examine the columns of the trace above in detail. Taking them in turn, they are:

  • Percent progress: a progress meter over the full dataset. If the algorithm makes multiple passes, each pass is prefaced by a "pass=X" title.
  • Cross-validated log likelihood: the log likelihood on the most recent held-out minibatch of data. The same minibatches are held out on each pass over the dataset, so that training and test data are not mixed.
  • gigaflops: how many billions of arithmetic operations per second the algorithm is completing. Algorithms vary in the number of operations they require, but for a fixed algorithm this is an easy measure for comparing performance. There are also "roofline" limits for the dominant operations of each algorithm on a particular kind of hardware. The roofline for sparse-dense matrix multiply (the dominant step in logistic regression) is 20-30 gflops for this GPU model, and around 6 gflops for the CPU. You can compare those numbers to see whether the algorithm could be optimized any further (see the back-of-envelope sketch after this list).
  • secs: the total elapsed seconds for this training run.
  • GB: the total number of gigabytes of input data read or re-read during training.
  • MB/s: the data read rate, which is just the quotient of total input bytes and time. This is an important measure because it shows how well the I/O system is doing, and it will be the limiting factor for tasks that are not too compute-intensive. Standard disks can manage up to 200 MB/s; the SSDs attached to this Amazon instance can manage up to 500 MB/s. Since BIDMach uses very fast file compression (lz4), you will sometimes see values several times higher than this: the uncompressed data rate can be several times the rate of compressed data coming off the disk.
  • GPUmem: the fraction of GPU memory available. It should remain constant because of BIDMach's caching scheme, and it serves as a guide to how much space you have, e.g. to increase the model size.
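
As a rough cross-check on the gigaflops column, the dominant step here is the sparse-dense multiply between the model matrix (one row per category) and the data. The back-of-envelope estimate below is a sketch: the factor of 4 (a multiply-add each for the forward pass and the gradient) and the 10-second figure are assumptions to replace with numbers from your own trace.


In [ ]:
// Estimated total flops: ~4 per nonzero per category per pass
// (2 for the forward multiply-add, 2 more for the gradient update).
val flopsPerPass = 4.0 * data.nnz * cats.nrows
val npasses = 2
val secs = 10.0                   // substitute the secs column from your own run
val estGflops = flopsPerPass * npasses / (secs * 1e9)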

Let's rerun the experiment above (which used the GPU by default) to understand the effect of processor throughput on this calculation. This will take a while. If you would rather continue with the tutorial, reset the kernel for this page, make sure the "train" line below is commented out, and restart the page with "Run All" under the "Cell" menu.


In [ ]:
val (rcv2, opts2) = GLM.learner(data, cats, GLM.maxp)
opts2.useGPU = false   // force the CPU code path for comparison
//rcv2.train           // uncomment to run the (slow) CPU experiment

You can see that for this run, performance was limited by the CPU's throughput (gflops) and not by I/O, since the I/O system is the same for both runs. For the first run, both gflops and I/O are close to their respective limits, so it's less clear which was the limiting factor.

Evaluation

Let's evaluate the predictor on the test data, and score it again using the AUC for category 6:


In [ ]:
trcv.predict

In [ ]:
val itest = 6                          // category to evaluate
val scores = preds(itest,?)            // predicted scores for that category
val good = testcats(itest,?)           // 1 where the true label is positive
val bad = 1-testcats(itest,?)          // 1 where the true label is negative
val rr = roc(scores,good,bad,100)      // ROC curve sampled on a 100-step grid
val xaxis = row(0 to 100)*0.01
plot(xaxis,rr)
val auc = mean(rr)                     // AUC as the mean height of the sampled curve

In [ ]:
val liftx = xaxis(1 to 100)      // drop x=0 to avoid division by zero
val lift = rr(1 to 100)/liftx    // lift relative to a random classifier (the diagonal)
plot(liftx, lift)

Performance Evaluation

In the cell below, make a table of performance for various batch sizes. The columns should be average gflops, overall time, and AUC for category 6. Try batch sizes of 10000, 2000 and 1000. You can also observe the effect of increasing the number of passes over the dataset. You should find that increasing batchSize, while it reduces accuracy a little, can be more than compensated for by increasing the number of passes. Comment out the cell with the CPU training runs to save time. A sketch of how to organize the sweep follows the table.

TODO: Fill this table:

Minibatch size | Number of passes | Avg. gflops | Overall time | AUC
---------------|------------------|-------------|--------------|----
1000           | 2                | ...         | ...          | ...
2000           | 2                | ...         | ...          | ...
10000          | 2                | ...         | ...          | ...
10000          | 20               | ...         | ...          | ...
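
Here is one way to organize the sweep, reusing the learner constructor from above. The loop is only a sketch: read the average gflops and overall time off each printed trace, and the AUC computation just repeats the Evaluation cells for category 6.


In [ ]:
for ((bsize, np) <- List((1000,2), (2000,2), (10000,2), (10000,20))) {
  val p = zeros(testcats.nrows, testcats.ncols)     // fresh prediction buffer
  val (m, o, tm, to) = GLM.learner(data, cats, testdata, p, GLM.maxp)
  o.batchSize = bsize
  o.npasses = np
  m.train                                           // note avg gflops and secs
  tm.predict
  val rr = roc(p(6,?), testcats(6,?), 1-testcats(6,?), 100)
  println("batchSize=" + bsize + " npasses=" + np + " AUC=" + mean(rr))
}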

The total work (measured as gflops × time, i.e. total arithmetic operations) increases as the minibatch size decreases. That's because there are fixed costs associated with each minibatch model update.
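
A simple cost model makes this concrete (the symbols are assumptions for illustration, not BIDMach internals). If one pass over $N$ units of data runs at useful throughput $T$, and each minibatch update carries a fixed overhead $c$, then with minibatch size $b$ a pass takes roughly

$$t \approx \frac{N}{T} + \frac{N}{b}\,c$$

so shrinking $b$ from 10000 to 1000 multiplies the update-overhead term by ten while leaving the compute term unchanged.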

Shut down this page before continuing.

Your job will have used memory on both the CPU and GPU. It's important to free that up before proceeding. Go back to the "Home" tab of your IPython browser and click "Shutdown" on this tutorial. When you're ready, go on to Part II.