Basic Algorithms in BIDMach

For this tutorial, we'll explore two of the basic algorithms: k-NN and k-Means. k-NN is simple enough that we'll write it directly in matrix code. For k-Means, we'll run BIDMach's built-in implementation. First we need to import the classes used by BIDMach.


In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GMat,GIMat,GSMat,HMat,Image,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem

A k-NN Digit Recognizer

This tutorial uses the small (70k samples) MNIST dataset. To get it, run getmnist.sh from the BIDMach/scripts directory.

k-Nearest Neighbors is a very simple algorithm, so we'll code it up directly in matrix algebra. That way we can explore variations like inverse-distance weighting.

First let's load the data. The files contain 28x28 images of hand-written digits. dtrain and dtest0 are the training and test images respectively; ctrain and ctest0 are the corresponding image labels (0..9).

Each image is stored as a 28x28 = 784-element column of the data matrix. To convert a column back to an image (the matrix im) we use slicing, as in the cell just after the data is loaded. That cell should display a sample image in a Java window on your display.


In [ ]:
var dir = "../data/MNIST/"                         // adjust to point to the BIDMach/data/MNIST directory
tic
val dtrain = loadFMat(dir+"train.fmat.lz4")
val ctrain = loadIMat(dir+"ctrain.imat.lz4")
val dtest0 = loadFMat(dir+"test.fmat.lz4")
val ctest0 = loadIMat(dir+"ctest.imat.lz4")
toc
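
Here is a sketch of that display cell. It uses the Image class imported above and BIDMat's Kronecker product operator (⊗) to enlarge the image; the same pattern is used again at the end of this tutorial.


In [ ]:
val im = zeros(28,28)               // a 28x28 image buffer
im(?) = dtrain(?,0)                 // copy the first 784-element training column into it
Image.show(im ⊗ ones(4,4))          // enlarge each pixel into a 4x4 block and display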

The test data contain 10000 images, and there are 60000 training images. That means the complete matrix of training-to-test distances would be 60000 x 10000, which is very likely too big for your VM. Instead we set m to a smaller test set size. See if 1000 works on your machine; if not, use a smaller value.


In [ ]:
val m = 1000
val dtest = dtest0(?,0->m)
val ctest = ctest0(0,0->m)

dist is the (squared) distance matrix, which will have size ntraining x ntest. To compute all distances at once, we can use the expansion of the squared Euclidean distance:

dist = (u - v)^T (u - v) = u ∙ u + v ∙ v - 2 v^T u

To compute all pairwise distances between the m test vectors and the n training vectors, we can use the matrix form of this identity, which looks like this:


In [ ]:
flip
val dist = (-2 * dtrain ^* dtest) + (dtest ∙ dtest) + (dtrain ∙ dtrain).t   // -2 u^T v + v∙v + u∙u for every training/test pair
gflop

If you do run out of memory while trying this, try a smaller m. Then click on the "Kernel" menu in the IPython toolbar above. Select "Restart Kernel". Then click on the "Cell" menu and select "Run All Above" which evaluates all the cells above this one.

The flip/gflop calls above compute both the running time (second output) and the gflops (first output).

Next we need to sort the columns of the distance matrix so that the smallest distances (closest neighbors) are at the top. The sort2 function returns both the sorted columns and a matrix of the indices of those elements, which are the indices of the closest training points.


In [ ]:
tic
val (dss, iss) = sort2(dist)
toc

Next we slice out the top k rows of the sorted distances and of the index matrix (the numbers of the k closest training points). With the indices of the closest points (bestinds), we look up the class labels of those points using the ctrain matrix.


In [ ]:
val k = 3
val bestvals = dss(0->k, ?)
val bestinds = iss(0->k, ?)
val votes = ctrain(bestinds)

The votes matrix contains the class labels of the k closest neighbors of each test image. Next we accumulate those votes, weighted by inverse distance, into a tally matrix with one row per class.


In [ ]:
val (ii, jj, vv) = find3(votes+1)                           // get the row, column and values of the votes matrix
val tally = accum((vv-1) \ jj, 1/(1+bestvals(?)), 10, m)    // accumulate votes by class number (vv-1) and column (jj)

Here we find the class with the most votes (ibest). The mean function provides a simple way to compute the fraction of ibest (predicted) and ctest (actual) class labels which agree. This is exactly the method's accuracy.


In [ ]:
val (vb, ibest) = maxi2(tally)                              // now find the class with the most votes
mean(ibest == ctest)

TODO 1: Try different values of k in the cell that defined it. What happens to the accuracy?

TODO 2: Modify the dist computation above (it's best to comment out the original) to use cosine distance instead of Euclidean distance. You'll need the sqrt() function, and it's best to add an offset (sqrt(1+x)) to avoid divide-by-zero errors.
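
For reference, here is one possible shape for that change (a sketch, not the only way to do it). It reuses the ∙ and ^* operators from the Euclidean cell, assumes BIDMat's element-wise division (/), and uses the negative of cosine similarity so that smaller values still mean closer neighbors. Running it replaces dist, so re-run the sorting and voting cells afterwards to see the effect.


In [ ]:
val normtrain = sqrt(1 + (dtrain ∙ dtrain)).t                  // ntrain x 1 column of (offset) training norms
val normtest  = sqrt(1 + (dtest ∙ dtest))                      // 1 x m row of (offset) test norms
val dist = -1 * ((dtrain ^* dtest) / (normtrain * normtest))   // negative cosine similarity; the outer product is another ntrain x m matrix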

k-Means Clustering

Now let's try clustering the digit image data into 10 clusters, using k-Means.


In [ ]:
val (nn,opts) = KMeans.learner(dtrain,10)

BIDMach's KMeans implementation takes the training data matrix and the number of clusters (the model dimension). Here we cluster into 10 groups, to see how close the cluster centroids are to ideal digit prototypes.

The learner function returns two values: nn, which is a "Learner" class containing the model itself plus optimizer and regularizer classes if any, and opts, which is an options object. You can look at the options using the "what" method:


In [ ]:
opts.what

You don't need to know about most of these. The only significant parameters for k-Means are dim, the number of clusters, and npasses, the number of passes over the dataset.

Let's first set the number of passes to 20, and then train the model.


In [ ]:
opts.npasses = 20
opts.dim = 10
opts.useGPU = false
nn.train

The output above shows various diagnostic information about the training:

  • The first column shows completion milestones for each pass over the dataset.
  • The second column is the likelihood; for k-Means this is the negative mean-squared distance from each point to its nearest center.
  • The third column is the overall gigaflops achieved.
  • The fourth column is the wall clock time so far in seconds.
  • The fifth column is the total amount of data read or re-read.
  • The last column is the rate at which input data was consumed.

You can use this information to tune the training in various ways. Note that BIDMach computes and displays a cross-validation likelihood, not an internal model likelihood. That helps detect over-fitting: if the model starts to overfit, the displayed likelihood will start to decrease. BIDMach does this by holding out every m-th minibatch (the same ones on each pass) to use for testing while the training is in progress.

For k-Means on this dataset, the likelihood is rather noisy, so let's try plotting it.


In [ ]:
plot(nn.results(1,?),nn.results(0,?))

The plot function above plots the second row of the results matrix as the X-axis and the first row as the Y-axis. The X-axis is then the number of training samples processed, and the Y-axis is the likelihood (negative mean squared distance to center).

You may want to use the mouse to select a rectangle from the plot to zoom into it. This will give you a better idea of whether the likelihood is decreasing at any time (indicating overfitting).

TODO 3: Do you see any evidence of over-fitting? Why do you think that is?

Next let's examine the model, which is just a matrix.


In [ ]:
val mm = FMat(nn.modelmat).t        // pull out the model matrix as an FMat and transpose it
size(mm)

For efficiency reasons, the model is stored so that the centroids are rows. We transpose it above so that it has 10 columns, one for each centroid.
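
As a quick sketch of how the model matrix can be used directly (it assumes mini2, the column-wise minimum counterpart of the maxi2 function used earlier), we can assign each test image to its nearest centroid with the same squared-distance expansion as in the k-NN section:


In [ ]:
val cdist = (-2 * mm ^* dtest) + (dtest ∙ dtest) + (mm ∙ mm).t   // 10 x m squared distances to each centroid
val (dmin, iassign) = mini2(cdist)                               // nearest centroid and its squared distance, per image
mean(dmin)                                                       // average squared distance to the assigned centroid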

Next we want to display the centroid images, using the Image class imported at the top of this notebook.

Each centroid is a 784x1 column of mm, which we first need to reshape into a 28x28 image. We use matrix slicing for that, just as we did for the sample digit earlier.


In [ ]:
val im = zeros(28,28)
im(?) = mm(?,0)                     // the first centroid, reshaped into a 28x28 image
Image.show(im ⊗ ones(4,4))          // the Kronecker product enlarges each pixel into a 4x4 block

TODO 4: wrap a for loop around the code above to display all 10 images.

The Kronecker product above replaces each pixel of the original image with a 4x4 square block. This enlarges the images, making them easier to see and also large enough to grab and move.

You should also be able to see all the images spread out by simply clicking on the Java icon on the taskbar to the left.

The images will be displayed on top of each other, so to see them you will have to move them by dragging.

TODO 5: Do the centroids correspond to the 10 digits? If not, what are they?