```
In [ ]:
```import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GMat,GIMat,GSMat,HMat,Image,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem

This tutorial uses the small (70k samples) MNIST dataset. To get it, run `getmnist.sh`

from the BIDMach/scripts directory.

k-Nearest Neighbors is a very simple algorithm, so we'll code it up directly in matrix algebra. That way we can explore variations like inverse-distance weighting.

First lets load the data. The files contain 28x28 images of hand-written digits. dtrain and dtest are the training and test images respectively. ctrain and ctest are the image labels (0..9).

Each image is actually stored as a 28x28 = 784-element column of the data matrix. To convert to an image (the matrix im) we use slicing as below. This cell should display a sample image in a Java window on your display.

```
In [ ]:
```var dir = "../data/MNIST/" // adjust to point to the BIDMach/data/MNIST directory
tic
val dtrain = loadFMat(dir+"train.fmat.lz4")
val ctrain = loadIMat(dir+"ctrain.imat.lz4")
val dtest0 = loadFMat(dir+"test.fmat.lz4")
val ctest0 = loadIMat(dir+"ctest.imat.lz4")
toc

```
In [ ]:
```val m = 1000
val dtest = dtest0(?,0->m)
val ctest = ctest0(0,0->m)

dist is the distance matrix which will have size ntraining x ntest. To compute all distances, we can use the expansion for distance

dist = (u - v)T (u - v) = u ∙ u + v ∙ v - 2 v^T u

To compute *all* pairwise distances betweem m vectors u and n vectors v, we can use the matrix form of this identity, which looks like this:

```
In [ ]:
```flip
val dist = (-2 * dtrain ^* dtest) + (dtest ∙ dtest) + (dtrain ∙ dtrain).t
gflop

If you do run out of memory while trying this, try a smaller m. Then click on the "Kernel" menu in the IPython toolbar above. Select "Restart Kernel". Then click on the "Cell" menu and select "Run All Above" which evaluates all the cells above this one.

The flip/gflop calls above compute both the running time (second output) and the gflops (first output).

Next we need to sort the columns of the distance matrix so that the smallest distances (closest neighbors) are at the top. The sort2 function returns both the sorted columns and also a matrix of the indices of those elements, which will be the numbers of the closest training point.

```
In [ ]:
```tic
val (dss, iss) = sort2(dist)
toc

```
In [ ]:
```val k = 3
val bestvals = dss(0->k, ?)
val bestinds = iss(0->k, ?)
val votes = ctrain(bestinds)

`votes`

matrix contains the class labels for each of the n closest neighbors. Next we sum those votes into a tally matrix for each class.

```
In [ ]:
```val (ii, jj, vv) = find3(votes+1) // get the row, column and values of the votes matrix
val tally = accum((vv-1) \ jj, 1/(1+bestvals(?)), 10, m) // accumulate votes by class number (vv-1) and column (jj)

`ibest`

). The mean function provides a simple way to compute the fraction of `ibest`

(predicted) and `ctest`

(actual) class labels which agree. This is exactly the method's accuracy.

```
In [ ]:
```val (vb, ibest) = maxi2(tally) // now find the class with the most votes
mean(ibest == ctest)

TODO 1: Try different values of k in the cell that defined it. What happens to the accuracy?

Now lets try clustering the digit image data into 10 clusters, using k-Means.

```
In [ ]:
```val (nn,opts) = KMeans.learner(dtrain,10)

BIDMach's KMeans implementation accepts the name of the training dataset and the number of dimensions. Here we try clustering into 10 images to see how close the clusters are to ideal digit prototypes.

The learer function returns two values: nn which is a "Learner" class containing the model itself, optimizer and regularizer classes if any, and opts: which is an options object. You can look at the options using the "what" method:

```
In [ ]:
```opts.what

You dont need to know about most of these. The only significant parameters for k-Means are `dim`

the number of clusters, and `npasses`

the number of passes over the dataset.

Lets first set the number of passes to 20, and then train the model.

```
In [ ]:
```opts.npasses = 20
opts.dim = 10
opts.useGPU = false
nn.train

The output above shows various diagnostic information about the training:

- The first column shows completion milestones for each pass over the dataset.
- The second column is the likelihood, in this case its the mean-squared distance from each point to its center.
- The third column is the overall gigaflops achieved.
- The fourth column is the wall clock time so far in seconds.
- The fifth columns is the total amount of data read or re-read.
- The last column is the rate at which input data was consumed.

You can use this information to tune the training in various ways. Note that BIDMach compute and displays a cross-validation likelihood, not an internal model likelihood. That helps avoid over-fitting. If the model starts to overfit, the displayed likelihood will start increasing. BIDMach does this by holding out every m-th minibatch (the same ones on each pass) to use for testing while the training is in progress.

For k-means on this dataset, the likelihood is rather noisy, so lets try plotting it.

```
In [ ]:
```plot(nn.results(1,?),nn.results(0,?))

The plot function above plots the second row of the results matrix as the X-axis and the first row as the Y-axis. The X-axis is then the number of training samples processed, and the Y-axis is the likelihood (negative mean squared distance to center).

You may want to use the mouse to select a rectangle from the plot to zoom into it. This will give you a better idea of whether the likelihood is decreasing at any time (indicating overfitting).

TODO 3: Do you see any evidence of over-fitting? Why do you think that is?

Next lets examine the model, which is just a matrix.

```
In [ ]:
```val mm=nn.modelmat.t
size(mm)

For efficiency reasons, the model is stored so that the centroids are rows. We transpose it above so that it has 10 columns, one for each centroid.

Next we want to display the images. Unfortunately image display wasnt included in the version of BIDMat we installed at the begining of the semester, but we can inline the code here:

```
In [ ]:
```val im = zeros(28,28)
for (i <- 20 until 35) {
im(?) = dtrain(?,i)
Image.show(im ⊗ ones(4,4))
}

TODO 4: wrap a for loop around the code above to display all 10 images.

The kronecker product above replaces each pixel of the original image with a 5x5 square block. This enlarges the images, making them easier to see, and also creates enough space on the toolbar to be able to move them.

You should also be able to see all the images spread out by simply clicking on the Java icon on the taskbar to the left.

The images will be displayed on top of each other, so to see them you will have to move them by dragging.

TODO 5: Do the centroids correspond to the 10 digits? If not, what are they?