For this tutorial, we'll explore machine learning at scale. Much of the recent focus on scaling machine learning has been on cluster computing, with little attention to single-machine performance. It turns out that there is a lot of mileage in optimizing single-machine performance for machine learning, especially by leveraging inexpensive GPUs. In fact, single-node hardware holds the record for most common machine learning tasks. See this collection of recent benchmarks. There are several steps to getting high performance on single machines:

- "Rooflining": optimizing the performance of each kernel against the theoretical limits imposed by memory or the chip's arithmetic units.
- Rooflined GPU kernels: there is even more headroom for improvements in machine learning algorithms on graphics processors.
- Efficient optimization: Fast optimization (SGD with ADAGRAD or fast coordinate ascent) can add another order of magnitude or more of performance.

Each of these can contribute an order of magnitude or more of speedup. With the tools you've used so far and modest hardware of your own or in the cloud, you can tackle problems at the frontier of scale for machine learning. First, let's initialize BIDMach as usual.

```scala
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA>0) GPUmem
```

This time, you should have a graphics processor and CUDA version 6.5. If not, something strange is afoot. Check that you started with `bidmach notebook` and not `ipython notebook`. If that's not the problem, it's best to ask for help.

The next line is a report on available GPU memory. The first number is the fraction of GPU memory available. It should be close to 1. If not, either you or someone else on the same machine has another Java process that is holding memory. It's fine if other people are using the machine and GPU, but avoid leaving zombie processes of your own around. Check from the shell with `ps -ef | grep java`. Kill any of your own processes that you don't need with `kill -9 process_id`.

```scala
val dir = "../data/" // Assumes you started the notebook from <BIDMach>/tutorials
val data = loadSMat(dir+"rcv1/docs.smat.lz4")
val cats = loadFMat(dir+"rcv1/cats.fmat.lz4")
val testdata = loadSMat(dir+"rcv1/testdocs.smat.lz4")
val testcats = loadFMat(dir+"rcv1/testcats.fmat.lz4")
```

```scala
data
```

In all the datasets we use, rows index features (words here) and columns index "instances" or input samples, which are documents for this dataset. So `data(?,20)` is the input sample stored in column 20 (BIDMat indices are zero-based).

Let's construct a learner similar to the one we built in lab, ready for training.

```scala
val preds = zeros(testcats.nrows, testcats.ncols)
val (rcv, opts, trcv, topts) = GLM.learner(data, cats, testdata, preds, GLM.maxp)
```

We can tailor the model's options before launching it:

```scala
opts.batchSize=10000
opts.npasses=2
opts.autoReset=false
```
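To get a feel for what `batchSize` and `npasses` imply, it can help to count minibatches. The sketch below is plain Scala and does not call BIDMach. `minibatchesPerPass` is a hypothetical helper, and the document count is a made-up placeholder rather than the size of RCV1. It also assumes a final partial minibatch is still processed and ignores the minibatches held out for cross-validation, so treat the result as a rough upper bound on the number of model updates.

```scala
// Hypothetical helper (not part of BIDMach): number of minibatches needed
// to cover nInstances columns, counting a final partial minibatch.
def minibatchesPerPass(nInstances: Int, batchSize: Int): Int = {
  require(nInstances >= 0, "instance count must be non-negative")
  require(batchSize > 0, "batch size must be positive")
  ((nInstances.toLong + batchSize - 1) / batchSize).toInt
}

// Placeholder values for illustration only.
val nDocsPlaceholder = 100000
val passesExample = 2

for (batchSize <- Seq(1000, 2000, 10000)) {
  val perPass = minibatchesPerPass(nDocsPlaceholder, batchSize)
  println(s"batchSize=$batchSize: $perPass minibatches per pass, " +
    s"${perPass * passesExample} over $passesExample passes")
}
```

Smaller minibatches mean more updates per pass, while larger ones give each kernel call more work to do, which is the trade-off the timing table at the end of this notebook asks you to measure.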

```scala
rcv.train
```

Let's examine the columns of the trace above in detail. Taking them in turn, they are:

- **Percent progress meter**: the progress over the full dataset. If the algorithm makes multiple passes, each pass is prefaced by a "pass=X" title.
- **Cross-validated log likelihood**: the log likelihood on the most recent held-out minibatch of data. The same minibatches are held out on each pass over the dataset, so that training and test data are not mixed.
- **gigaflops**: how many billion arithmetic operations per second the algorithm is completing. Algorithms vary in the number of operations they require, but for a fixed algorithm this is an easy measure to compare performance. There are also "roofline" limits for the dominant operations in each algorithm and a particular kind of hardware. The roofline for sparse-dense matrix multiply (the dominant step in logistic regression) is 20-30 gflops for this GPU model, and around 6 gflops for the CPU. You can compare those numbers to see if the algorithm could possibly be optimized any more.
- **secs**: the total elapsed seconds for this training run.
- **GB**: the total number of gigabytes of input data that has been read or re-read during training.
- **MB/s**: the data read rate, which is just the quotient of total input bytes by time. This is an important measure because it shows how well the I/O system is doing, and will be the limiting factor for tasks that are not too compute-intensive. Standard disks can manage up to 200 MB/s. SSDs, which are attached to this Amazon instance, can manage up to 500 MB/s. Since BIDMach uses very fast file compression (lz4) you will sometimes see values several times higher than this. That happens because the uncompressed data rate is several times the rate of compressed data coming off the disk.
- **GPUmem**: the fraction of GPU memory available. It should remain constant because of BIDMach's caching scheme and it serves as a guide to how much space you have, e.g. to increase model size.
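The roofline comparison and the MB/s column both reduce to simple arithmetic. The sketch below is plain Scala and independent of BIDMach. The helper names and every hardware number in the example (peak rate, bandwidth, flops per byte, measured gflops) are hypothetical placeholders, not measurements from this machine. It assumes the usual roofline model, in which attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity, and it assumes decimal units (1 GB = 1000 MB).

```scala
// Roofline bound in gflops: a kernel is limited either by the chip's
// arithmetic peak or by how fast memory can feed it.
def rooflineGflops(peakGflops: Double, bandwidthGBs: Double, flopsPerByte: Double): Double =
  math.min(peakGflops, bandwidthGBs * flopsPerByte)

// How close a measured rate is to the bound (1.0 means "at the roofline").
def fractionOfRoofline(measuredGflops: Double, roofline: Double): Double = {
  require(roofline > 0, "roofline must be positive")
  measuredGflops / roofline
}

// The MB/s column: total input volume divided by elapsed time.
def readRateMBs(totalGB: Double, secs: Double): Double = {
  require(secs > 0, "elapsed time must be positive")
  totalGB * 1000.0 / secs
}

// Hypothetical memory-bound kernel: 100 GB/s of bandwidth at 0.25 flops per byte.
val bound = rooflineGflops(peakGflops = 1000.0, bandwidthGBs = 100.0, flopsPerByte = 0.25)
println(s"roofline bound: $bound gflops")
println(s"fraction achieved at 15 gflops: ${fractionOfRoofline(15.0, bound)}")
println(s"read rate for 1.5 GB in 3 s: ${readRateMBs(1.5, 3.0)} MB/s")
```

If the fraction is already close to 1, further kernel tuning cannot help much, and the remaining gains have to come from the optimization algorithm or from I/O.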

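```scala
// Build a GLM learner on the same training data, but run it on the CPU.
val (rcv2, opts2) = GLM.learner(data, cats, GLM.maxp)
opts2.useGPU = false
// Uncomment to train on the CPU and compare its gflops with the GPU run above.
//rcv2.train
```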

Let's evaluate the predictor on the test data, and score it again using the AUC for category 6.

```scala
trcv.predict
```

```scala
val itest = 6
val scores = preds(itest,?)
val good = testcats(itest,?)
val bad = 1-testcats(itest,?)
val rr = roc(scores,good,bad,100)
val xaxis = row(0 to 100)*0.01
plot(xaxis,rr)
val auc = mean(rr)
```
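For intuition about what the cell above computes: averaging an ROC curve sampled on an evenly spaced false-positive-rate grid approximates the area under it. Below is a plain-Scala sketch of that idea. It is not BIDMat's `roc` implementation. `rocCurve` and `approxAuc` are hypothetical helper names, the toy scores and labels are invented, and ties between scores are broken arbitrarily.

```scala
// Sketch: true-positive rate sampled at nx+1 evenly spaced false-positive rates.
def rocCurve(scores: Array[Double], isPositive: Array[Boolean], nx: Int): Array[Double] = {
  require(scores.length == isPositive.length, "one label per score")
  require(nx > 0, "need at least one grid interval")
  val nPos = isPositive.count(identity)
  val nNeg = isPositive.length - nPos
  require(nPos > 0 && nNeg > 0, "need both positive and negative examples")

  // Visit examples from highest to lowest score, as if lowering a threshold.
  val ranked = scores.zip(isPositive).sortBy { case (s, _) => -s }

  val curve = Array.fill(nx + 1)(0.0)
  var tp = 0
  var fp = 0
  for ((_, positive) <- ranked) {
    if (positive) tp += 1 else fp += 1
    // Smallest grid index k with k/nx >= fp/nNeg (integer ceiling).
    val k = ((fp.toLong * nx + nNeg - 1) / nNeg).toInt
    curve(k) = math.max(curve(k), tp.toDouble / nPos)
  }
  // Forward-fill so each grid point holds the best rate reachable at or below it.
  for (k <- 1 to nx) curve(k) = math.max(curve(k), curve(k - 1))
  curve
}

// Mean of the sampled curve as a simple approximation of the area under it.
def approxAuc(curve: Array[Double]): Double = curve.sum / curve.length

// Toy data: both positives outscore both negatives, so the ranking is perfect.
val toyScores = Array(0.9, 0.8, 0.2, 0.1)
val toyLabels = Array(true, true, false, false)
val toyCurve = rocCurve(toyScores, toyLabels, 100)
println(s"approximate AUC on toy data: ${approxAuc(toyCurve)}")
```

With a coarse grid the mean is biased slightly upward, because the end point at a false-positive rate of 1 always contributes a value of 1. A finer grid reduces that effect.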

```scala
val liftx = xaxis(1 to 100)
val lift = rr(1 to 100)/liftx
plot(liftx, lift)
```

TODO: Fill this table:

| Minibatch size | Number of passes | Avg. gflops | Overall time | AUC |
|---|---|---|---|---|
| 1000 | 2 | ... | ... | ... |
| 2000 | 2 | ... | ... | ... |
| 10000 | 2 | ... | ... | ... |
| 10000 | 20 | ... | ... | ... |