In this tutorial, we'll explore machine learning at scale. Much of the recent focus on scaling machine learning has been on cluster computing, with little attention paid to single-machine performance. It turns out that there is a lot of mileage in optimizing single-machine performance for machine learning, especially by leveraging inexpensive GPUs. In fact, single-node hardware holds the record for most common machine learning tasks. See this collection of recent benchmarks. There are several steps to getting high performance on single machines:
Each of these can contribute an order of magnitude or more of speedup. With the tools you've used so far and modest hardware of your own or in the cloud, you can tackle problems at the frontier of scale for machine learning. First, let's initialize BIDMach as usual.
In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA>0) GPUmem
This time, you should have a graphics processor and CUDA version 6.5. If not, something strange is afoot. Check that you started with "bidmach notebook" and not "ipython notebook". If that's not the problem, it's best to ask for help.
The next line is a report on available GPU memory. The first number is the fraction of GPU memory available. It should be close to 1. If not, either you have another Java process that is holding memory or someone else on the same machine does. It's fine if other people are using the machine and GPU, but avoid leaving zombie processes of your own around. Check from the shell with ps -ef | grep java. Kill any of your own processes that you don't need with kill -9 process_id.
Let's load the full RCV1 dataset. This dataset is only about 0.5 GB, so we'll load it directly into memory. This is not a big dataset, but has been widely benchmarked and is still challenging for some systems.
In [ ]:
val dir = "../data/" // Assumes you started the notebook from <BIDMach>/tutorials
val data = loadSMat(dir+"rcv1/docs.smat.lz4")
val cats = loadFMat(dir+"rcv1/cats.fmat.lz4")
val testdata = loadSMat(dir+"rcv1/testdocs.smat.lz4")
val testcats = loadFMat(dir+"rcv1/testcats.fmat.lz4")
Check the types of data and cats. They are "SMat" and "FMat" respectively: sparse and dense matrices of float values. Dense matrices are simply arrays with a storage element for each location A(i,j) in a matrix A. Sparse matrices are used for data with many zeros. They contain tuples of (row, column, value) for the nonzero elements only (sometimes the column index is implicit), rather like a Pandas table. You'll notice they display differently for that reason:
In [ ]:
data
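To make the difference between the two storage schemes concrete, here is a minimal sketch in plain Scala. It does not use BIDMat and is not how SMat is implemented; the names Entry, toDense, toSparse and column are hypothetical helpers introduced only for this illustration.

```scala
// One stored entry of a sparse matrix: (row, column, value).
// Hypothetical type for illustration; not part of BIDMat.
final case class Entry(row: Int, col: Int, value: Float)

// Dense storage: one slot for every A(i, j), zeros included.
def toDense(nrows: Int, ncols: Int, entries: Seq[Entry]): Array[Array[Float]] = {
  val a = Array.ofDim[Float](nrows, ncols)
  entries.foreach(e => a(e.row)(e.col) = e.value)
  a
}

// Sparse storage: keep only the nonzero entries as (row, column, value) tuples.
def toSparse(a: Array[Array[Float]]): Seq[Entry] =
  for {
    i <- a.indices
    j <- a(i).indices
    if a(i)(j) != 0f
  } yield Entry(i, j, a(i)(j))

// All stored entries of column j, i.e. the nonzero features of one instance.
def column(entries: Seq[Entry], j: Int): Seq[Entry] =
  entries.filter(_.col == j)

// A 2 x 3 matrix with two nonzeros needs 6 dense slots but only 2 sparse entries.
val denseExample = Array(Array(0f, 2f, 0f), Array(1f, 0f, 0f))
val sparseExample = toSparse(denseExample)
```

For document data such as RCV1, where each document contains only a small fraction of the vocabulary, the sparse form stores far fewer values than the dense one.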
In all the datasets we use, rows will index features (words here) and columns index "instances" or input samples, which are documents for this dataset. So data(?,20) is the input sample with column index 20 (BIDMat indices are zero-based, so this is the 21st column).
Let's construct a learner similar to the one we built in lab, ready for training.
In [ ]:
val preds = zeros(testcats.nrows, testcats.ncols)
val (rcv, opts, trcv, topts) = GLM.learner(data, cats, testdata, preds, GLM.maxp)
We can tailor the model's options before launching it:
In [ ]:
opts.batchSize=10000
opts.npasses=2
opts.autoReset=false
In [ ]:
rcv.train
Let's examine the columns of the trace above in detail. Taking them in turn, they are:
Let's rerun the experiment above (which used the GPU by default) to understand the effect of processor throughput on this calculation. This will take a while. If you want to continue with the tutorial instead, reset the kernel for this page, comment out the "train" line below, and restart the page with "Run All" under the "Cell" menu.
In [ ]:
val (rcv2, opts2) = GLM.learner(data, cats, GLM.maxp)
opts2.useGPU = false
//rcv2.train
You can see that for this run, performance was limited by the CPU's throughput (gflops) and not by I/O, since the I/O system is the same for both runs. For the first run, both gflops and I/O are close to their respective limits, so it's less clear which was the limiting factor.
Let's evaluate the predictor on the test data, and score it again using the AUC for category 6.
In [ ]:
trcv.predict
In [ ]:
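val itest = 6                       // category to evaluate
val scores = preds(itest,?)         // predicted scores for that category
val good = testcats(itest,?)        // 1 where the document is in the category
val bad = 1-testcats(itest,?)       // 1 where it is not
val rr = roc(scores,good,bad,100)   // ROC curve sampled at 101 points
val xaxis = row(0 to 100)*0.01      // x values 0, 0.01, ..., 1
plot(xaxis,rr)
val auc = mean(rr)                  // mean height of the curve approximates the AUC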
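As a cross-check on what the AUC measures, here is a small sketch in plain Scala. It is not BIDMach's roc routine: it uses the pairwise (Mann-Whitney) formulation instead, where the AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The function name pairwiseAuc and the example values are assumptions made for this illustration.

```scala
// Pairwise (Mann-Whitney) estimate of the area under the ROC curve.
// labels(i) == true marks a positive example. Hypothetical helper, not a BIDMach call.
def pairwiseAuc(scores: Array[Double], labels: Array[Boolean]): Double = {
  require(scores.length == labels.length, "scores and labels must align")
  val pos = scores.zip(labels).collect { case (s, true) => s }
  val neg = scores.zip(labels).collect { case (s, false) => s }
  require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative")
  // Score each (positive, negative) pair: 1 for a correct ordering, 0.5 for a tie.
  val wins = for (p <- pos; n <- neg) yield {
    if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  }
  wins.sum / (pos.length.toLong * neg.length)
}

// Made-up scores: three of the four (positive, negative) pairs are ordered correctly.
val exampleAuc = pairwiseAuc(Array(0.9, 0.8, 0.3, 0.1), Array(true, false, true, false))
```

This version costs time proportional to the number of pairs, so it is only suitable for small checks. Averaging a sampled ROC curve, as in the cell above, scales much better.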
In [ ]:
val liftx = xaxis(1 to 100)         // drop x = 0 to avoid dividing by zero
val lift = rr(1 to 100)/liftx       // lift: ROC height divided by its x value
plot(liftx, lift)
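The lift computation in the cell above can be written out in plain Scala as follows. This is a sketch under the assumption that the ROC curve is sampled at evenly spaced x values 0, 1/n, ..., 1, as it is in the cell above with n = 100; liftCurve and the example curve are hypothetical and not part of BIDMach.

```scala
// Lift from an ROC curve sampled at x = 0, 1/n, ..., 1 (n + 1 values).
// Hypothetical helper for illustration.
def liftCurve(roc: Array[Double]): Array[Double] = {
  require(roc.length >= 2, "need at least two ROC samples")
  val n = roc.length - 1
  // Start at k = 1: at x = 0 the ratio would divide by zero.
  (1 to n).map(k => roc(k) / (k.toDouble / n)).toArray
}

// Made-up ROC samples at x = 0, 0.25, 0.5, 0.75, 1.
val exampleLift = liftCurve(Array(0.0, 0.6, 0.8, 0.9, 1.0))
```

A lift above 1 at small x means the classifier ranks true positives near the top far more often than a random ordering would. At x = 1 the lift is always 1, because the ROC curve ends at height 1.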
In the cell below, make a table of performance for various batch sizes. The columns should be average gflops, overall time, and AUC for category 6. You should try batch sizes of 10000, 2000 and 1000. You can also see the effect of increasing the number of passes over the dataset. You should find that increasing batchSize reduces accuracy a little, but that this loss can be more than compensated for by increasing the number of passes. Comment out the cell with the CPU training runs to save time.
TODO: Fill this table:

| Minibatch size | Number of passes | Avg. gflops | Overall time | AUC |
|---|---|---|---|---|
| 1000 | 2 | ... | ... | ... |
| 2000 | 2 | ... | ... | ... |
| 10000 | 2 | ... | ... | ... |
| 10000 | 20 | ... | ... | ... |
The total work (measured in gflop/s * time) increases with decreasing minibatch size. That's because there are fixed costs associated with the minibatch model updates.
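The effect of the fixed per-update cost can be illustrated with a toy cost model in plain Scala. Everything here is an assumption made for illustration: the function totalWork, the abstract work units, and the values of perSample and perUpdate are not measurements from BIDMach.

```scala
// Hypothetical cost model: each sample costs perSample units of work, and each
// minibatch adds a fixed perUpdate cost for the model update.
def totalWork(nSamples: Long, batchSize: Long, perSample: Double, perUpdate: Double): Double = {
  require(nSamples > 0 && batchSize > 0, "sizes must be positive")
  val nBatches = (nSamples + batchSize - 1) / batchSize // ceiling division
  nSamples * perSample + nBatches * perUpdate
}

// Illustrative values only, not measurements.
val nSamples = 100000L
val work = Seq(1000L, 2000L, 10000L).map { b =>
  b -> totalWork(nSamples, b, perSample = 1.0, perUpdate = 500.0)
}
work.foreach { case (b, w) => println(s"batchSize=$b totalWork=$w") }
```

In this model the per-sample term is the same for every batch size, so the whole difference in total work comes from the number of updates, which grows as the minibatch shrinks.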
Your job will have used memory on both the CPU and GPU. It's important to free that up before proceeding. Go back to the "Home" tab of your IPython browser and click "shutdown" on this tutorial. When you're ready, go on to part II.