```scala
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA}
import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem
```

Training only on data that fits in memory is very limiting, but minibatch learners can easily work with data directly from disk. They access data sequentially (which avoids seeking and maximizes disk throughput) and converge in very few passes. We'll look again at the MNIST data set, this time the full version with 8 million images (about 17 GB). The dataset has been partitioned into groups of 100k images (using the unix split command) and saved in compressed lz4 files.

If you don't yet have the "alls" files, you can build them from the cats and part files with the cell below. Just uncomment the for loop. As you can see, alls is built from the data matrix and a scaled-up copy of the category matrix. The exaggerated category feature forces each cluster to be pure, that is, of one category. It also creates a category feature for the cluster which can be used to retrieve its dominant category for labeling.

```scala
val mdir = "../data/MNIST8M/parts/"
/* for (i <- 0 to 80) {
    val a = loadFMat(mdir+"part%02d.fmat.lz4" format i)
    val c = loadFMat(mdir+"cats%02d.fmat.lz4" format i)
    saveFMat(mdir+"data%02d.fmat.lz4" format i, a)
    val alls = (c * 10000f) on a
    saveFMat(mdir+"alls%02d.fmat.lz4" format i, alls)
    print(".")
} */
```
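
To see why the scaled category rows keep clusters pure, here is a minimal plain-Scala sketch (no BIDMat). It assumes the category matrix is one-hot with 10 rows; `AllsSketch`, `stack`, and `sqDist` are hypothetical names used only for illustration.

```scala
object AllsSketch {
  // Prepend a scaled one-hot label to a feature column, mirroring
  // `(c * 10000f) on a` for a single sample (one-hot labels are an assumption).
  def stack(label: Int, features: Array[Float],
            ncats: Int = 10, scale: Float = 10000f): Array[Float] = {
    val onehot = Array.fill(ncats)(0f)
    onehot(label) = scale
    onehot ++ features
  }

  // Squared Euclidean distance between two stacked columns.
  def sqDist(a: Array[Float], b: Array[Float]): Float =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
}
```

Two samples with different labels differ by `scale` in two of the label rows, so their squared distance is at least `2 * scale * scale` regardless of their pixel values. With a large enough scale, that gap dominates the pixel distances and a cluster center ends up serving a single category.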

First we define a class xopts which contains all the options for the learner we will use. BIDMach has a modular design with pluggable learning pieces. This class holds all the options for those pieces. Learner is the main learning class, FilesDS is the data source, KMeans is the model and Batch is the updater (the code that gets run to update the model) for KMeans.

Then we make an instance of that class called mnopts.

```scala
class xopts extends Learner.Options with FilesDS.Opts with KMeans.Opts with Batch.Opts;
val mnopts = new xopts
```

Next come the options for the data source. There are quite a few of these, but they only need to be set once and, apart from the minibatch size, don't need to be tuned.

The following options specify various useful traits of the data source. Many of these are default values and don't actually need to be set, but it's useful to know what they do.

```scala
mnopts.fnames = List(FilesDS.simpleEnum(mdir+"alls%02d.fmat.lz4", 1, 0));  // File name templates, %02d is replaced by a number
mnopts.nstart = 0;                 // Starting file number
mnopts.nend = 80;                  // Ending file number
mnopts.order = 0;                  // (0) sample order, 0=linear, 1=random
mnopts.lookahead = 2;              // (2) number of prefetch threads
mnopts.featType = 1;               // (1) feature type, 0=binary, 1=linear
```
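
The file name template and the start/end numbers can be pictured with a small plain-Scala sketch. It is not BIDMach's implementation: `FileEnumSketch.expand` is a hypothetical helper, and it assumes the ending file number is exclusive, which the later evaluation cell (`nstart=80`, `nend=81` to read one file) suggests.

```scala
object FileEnumSketch {
  // Expand a printf-style template over the file numbers [nstart, nend).
  // Treating nend as exclusive is an assumption, not a documented guarantee.
  def expand(template: String, nstart: Int, nend: Int): Seq[String] =
    (nstart until nend).map(i => template.format(i))
}
```

Under that assumption, `FileEnumSketch.expand("alls%02d.fmat.lz4", 0, 80)` lists the 80 training files and leaves file 80 untouched for evaluation.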

Next we define the number of kmeans clusters.

`autoReset` controls whether the Learner resets GPU memory after training. We set it to false so that the trained model stays on the GPU and can be used for prediction.

```scala
mnopts.autoReset = false           // Don't reset the GPU after the training run, so we can use a GPU model for prediction
mnopts.dim = 300                   // Number of kmeans clusters
```

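Next we create the datasource. It prefetches files on separate threads, so enough threads must be available (more than the lookahead count). The `threadPool(4)` call takes care of this.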

```scala
val ds = {
    implicit val ec = threadPool(4)   // make sure there are enough threads (more than the lookahead count)
    new FilesDS(mnopts)               // the datasource
}
```
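
The interplay between the lookahead count and the thread pool can be sketched with plain Scala futures. This is an illustration of the general prefetching idea under stated assumptions, not how FilesDS is implemented: `PrefetchSketch`, `load`, and `readAll` are hypothetical names, and `load` stands in for reading one file from disk.

```scala
import java.util.concurrent.Executors
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PrefetchSketch {
  // Hypothetical stand-in for loading one file from disk.
  def load(i: Int): String = s"contents of file $i"

  // Return files [0, n) in order while keeping up to `lookahead` loads in flight.
  def readAll(n: Int, lookahead: Int, nthreads: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(nthreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val pending = mutable.Queue[Future[String]]()
      val out = Seq.newBuilder[String]
      var next = 0
      // Submit one load; copy the index so the future does not see later updates.
      def submit(): Unit = { val i = next; pending.enqueue(Future(load(i))); next += 1 }
      while (next < n && pending.size < lookahead) submit()
      while (pending.nonEmpty) {
        out += Await.result(pending.dequeue(), 10.seconds)  // consume the oldest load
        if (next < n) submit()                               // and start another one
      }
      out.result()
    } finally pool.shutdown()
  }
}
```

With `lookahead = 2` and a pool of 4 threads, both in-flight loads always have a thread to run on, so the consumer only waits when the disk is slower than the computation.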

```scala
val nn = new Learner(                // make a learner instance
    ds,                              // datasource
    new KMeans(mnopts),              // the model (a KMeans model)
    null,                            // list of mixins or regularizers
    new Batch(mnopts),               // the optimization class to use
    mnopts)                          // pass the options to the learner as well
nn
```

The following options are the important ones for tuning. For KMeans, batchSize has no effect on accuracy since the algorithm uses all the data instances to perform an update. So you're free to tune it for best speed. Generally larger is better, as long as you don't use too much GPU RAM.

npasses is the number of passes over the dataset. Larger is typically better, but the model may overfit at some point.

```scala
mnopts.batchSize = 50000
mnopts.npasses = 6
```
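
The claim that batchSize does not affect the result can be checked on a toy example. The following plain-Scala sketch assumes a batch-style update in which per-cluster sums and counts are accumulated over the whole pass and the centers change only at the end; `BatchKMeansSketch` and its methods are hypothetical names, not BIDMach code.

```scala
object BatchKMeansSketch {
  type Vec = Array[Double]

  def sqDist(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the closest center to x.
  def nearest(x: Vec, centers: Seq[Vec]): Int =
    centers.indices.minBy(c => sqDist(x, centers(c)))

  // One full pass: stream the data in minibatches, accumulate per-cluster sums
  // and counts, and update the centers only once the pass is complete.
  def onePass(data: Seq[Vec], centers: Seq[Vec], batchSize: Int): Seq[Vec] = {
    val dim = centers.head.length
    val sums = Array.fill(centers.length, dim)(0.0)
    val counts = Array.fill(centers.length)(0)
    for (batch <- data.grouped(batchSize); x <- batch) {
      val c = nearest(x, centers)          // assignments use the old centers
      counts(c) += 1
      for (d <- 0 until dim) sums(c)(d) += x(d)
    }
    centers.indices.map { c =>
      if (counts(c) == 0) centers(c)       // leave empty clusters unchanged
      else sums(c).map(_ / counts(c))
    }
  }
}
```

Because every assignment in a pass is made against the same fixed centers, splitting the data into batches of 1, 2, or 50000 samples yields the same sums and therefore the same updated centers. Only throughput and memory use change.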

```scala
nn.train
```

```scala
val model=FMat(nn.modelmat)
```

Next we build a 30 x 10 array of images to view the first 300 cluster centers as images.

```scala
val nx = 30
val ny = 10
val im = zeros(28,28)
val allim = zeros(28*nx,28*ny)
for (i<-0 until nx) {
    for (j<-0 until ny) {
        val slice = model(i+nx*j,10->794)
        im(?) = slice(?)
        allim((28*i)->(28*(i+1)), (28*j)->(28*(j+1))) = im
    }
}
Image.show(allim kron ones(2,2))
```

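Next we define a `classify` function that finds the nearest cluster center for each sample. It returns a `2*k` matrix, where k is the number of samples. The first row holds the predicted category of each sample (the dominant category of its nearest cluster) and the second row holds its actual category.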

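```scala
val datamodel = model(?, 10->794)                 // pixel part of each cluster center
val catmodel = model(?, 0->10)                    // category part of each cluster center
val (vcat, icat) = maxi2(catmodel,2)              // icat = dominant category of each cluster
val mdot = (datamodel ∙→ datamodel)               // squared norm of each cluster center
def classify(a:FMat):IMat = {
    val cdata = a(0->10, ?);
    val (vm, im) = maxi2(cdata);                  // im = actual category of each sample
    val ddata = a(10->794, ?);
    // squared distance from every center to every sample
    val dists = -2 *(datamodel * ddata) + (ddata ∙ ddata) + mdot;
    val (vdist, idist) = mini2(dists);            // idist = nearest center for each sample
    icat(idist) on im                             // predicted categories on top of actual ones
}
```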
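
The distance computation relies on the expansion ||x - c||² = ||c||² - 2 c·x + ||x||², which lets the center norms be computed once and reused. A minimal plain-Scala sketch of the same idea for a single sample follows; `NearestCenterSketch` and its methods are hypothetical names, and the per-center labels are passed in rather than derived from the model.

```scala
object NearestCenterSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  // Nearest center via ||c||^2 - 2 c.x + ||x||^2, with ||c||^2 precomputed.
  def nearest(x: Vec, centers: Seq[Vec], centerNorms: Seq[Double]): Int = {
    val xx = dot(x, x)                       // same for every center
    centers.indices.minBy(c => centerNorms(c) - 2 * dot(centers(c), x) + xx)
  }

  // Predicted label = label attached to the nearest center.
  def classify(x: Vec, centers: Seq[Vec], centerLabels: Seq[Int]): Int = {
    val norms = centers.map(c => dot(c, c))  // computed once per model in practice
    centerLabels(nearest(x, centers, norms))
  }
}
```

The `xx` term does not change which center wins, so it could be dropped when only the argmin is needed; it is kept here so the value being minimized is the actual squared distance.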

Given the predicted and actual categories, we can build a **confusion matrix** from them. The confusion matrix element c(i,j) is the count of inputs that were predicted to be in category i, but are actually in category j. It's basically just a call to the accum function.

```scala
def cmatch(crows:IMat):DMat = {
    accum(crows.t, 1.0, 10, 10)
}
```
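
For readers unfamiliar with accumulation by index pairs, here is a minimal plain-Scala sketch of the same counting step. `ConfusionSketch.confusion` is a hypothetical helper, not part of BIDMat.

```scala
object ConfusionSketch {
  // counts(i)(j) = number of samples predicted as i whose actual label is j.
  def confusion(predicted: Seq[Int], actual: Seq[Int],
                ncats: Int = 10): Array[Array[Double]] = {
    require(predicted.length == actual.length, "one prediction per sample")
    val counts = Array.fill(ncats, ncats)(0.0)
    for ((p, a) <- predicted.zip(actual)) counts(p)(a) += 1.0  // add 1 at (predicted, actual)
    counts
  }
}
```

Each (predicted, actual) pair adds 1 to a single cell, so the entries of the matrix sum to the number of samples.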

```scala
mnopts.nstart=80
mnopts.nend=81
ds.reset
```

Next we iterate over the test data, accumulating the confusion counts in `acc`.

```scala
var k = 0
val acc = dzeros(10,10)
while (ds.hasNext) {
    val mats=ds.next
    val f=FMat(mats(0))
    acc ~ acc + cmatch(classify(f))
    k += 1
    print(".")
}
```

Once we have the confusion counts, we can normalize the matrix of counts to produce a matrix sacc which is the fraction of samples with actual label j that are classified as i.

It's common to show this matrix as a 2D gray-scale or false-color plot with white as 1.0 and black as 0.0.

```scala
val sacc = FMat(acc/sum(acc))
Image.show((sacc * 250f) ⊗ ones(64,64))
sacc
```
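
The normalization divides each column of counts by that column's total, so every column of the result sums to 1. A plain-Scala sketch of this step follows; `NormalizeSketch.normalizeColumns` is a hypothetical helper, and the guard for empty columns is an addition of this sketch that the notebook code does not have.

```scala
object NormalizeSketch {
  // Divide every entry by the sum of its column, so that column j holds the
  // distribution of predictions for samples whose actual label is j.
  def normalizeColumns(counts: Array[Array[Double]]): Array[Array[Double]] = {
    val ncols = counts.head.length
    val colSums = (0 until ncols).map(j => counts.map(_(j)).sum)
    counts.map { row =>
      row.indices.map { j =>
        if (colSums(j) == 0.0) 0.0           // sketch-only guard for unseen labels
        else row(j) / colSums(j)
      }.toArray
    }
  }
}
```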

It's useful to isolate the correct classification rate by digit, which is:

```scala
val dacc = getdiag(sacc).t
```

We can take the mean of the diagonal accuracies to get an overall accuracy for this model.

```scala
mean(dacc)
```
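
As a plain-Scala sketch of the last two steps, the per-digit rates are the diagonal of the normalized matrix and the overall figure is their unweighted mean. `AccuracySketch` and its methods are hypothetical names used only for illustration.

```scala
object AccuracySketch {
  // Diagonal of the column-normalized confusion matrix: for each label j,
  // the fraction of samples with that label that were classified correctly.
  def perClass(sacc: Array[Array[Double]]): Seq[Double] =
    sacc.indices.map(i => sacc(i)(i))

  // Unweighted mean of the per-class rates.
  def meanAccuracy(sacc: Array[Array[Double]]): Double = {
    val d = perClass(sacc)
    d.sum / d.length
  }
}
```

This unweighted mean equals the fraction of all test samples classified correctly only when every digit is equally frequent in the test data; otherwise it weights rare digits the same as common ones.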

Run the experiment again with a larger number of clusters (3000, then 30000). You should reduce the batchSize option to 20000 to avoid memory problems.

Include the training time output by the call to `nn.train` but not the evaluation time (the evaluation code above is not using the GPU). Rerun and fill out the table below:

KMeans Clusters | Training time | Avg. gflops | Accuracy |
---|---|---|---|
300 | ... | ... | ... |
3000 | ... | ... | ... |
30000 | ... | ... | ... |
