Training models only on data that fits in memory is very limiting. Minibatch learners, however, can easily work with data streamed directly from disk.
We'll use the MNIST8M data set, which has 8 million images (about 17 GB). The dataset has been partitioned into groups of 100k images (using the unix split command) and saved as compressed lz4 files. Because it is so large, it is not fetched by default by getdata.sh. You have to fetch it explicitly by running getmnist.sh from the scripts directory. That script automatically splits the data into files that are small enough to be loaded into memory.
Let's load BIDMat/BIDMach:
In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GDMat,GMat,GIMat,GSDMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.JPlotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,Model,NMF,RandomForest,SFA,SVD}
import BIDMach.datasources.{DataSource,MatSource,FileSource,SFileSource}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
Mat.setInline
if (Mat.hasCUDA > 0) GPUmem
And define the root directory for this dataset.
In [ ]:
val mdir = "/code/BIDMach/data/MNIST8M/parts/"
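As an optional sanity check (assuming getmnist.sh has already fetched the parts), we can load one part directly and confirm its shape; each part should have 794 rows (10 label rows plus 784 pixel rows, as explained below) and 100k columns.
In [ ]:
val a = loadFMat(mdir+"alls00.fmat.lz4")   // one 100k-image part
size(a)                                    // expect 794 x 100000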
For this tutorial, we are going to evaluate the quality of clustering by using it for classification. We use a labeled dataset and compute clusters of training samples with k-Means. Then we match each new test sample to its closest cluster. The label assigned to the new sample is the majority label of that cluster.
This method by itself doesn't work well: clusters will often straddle label boundaries, leading to poor labelings. It's better to force each cluster to have a single label. We do that by adding the labels as very strong features before clustering. The label features push samples with different labels very far apart, far enough that k-Means will never assign them to the same cluster. The data we want looks like this:
                 Instance 0     Instance 1     Instance 2    ...
                 has label "2"  has label "7"  has label "0" ...
               /       0              0           10000      ...
               |       0              0               0      ...
               |   10000              0               0      ...
               |       0              0               0      ...
    label      |       0              0               0      ...
    features   |       0              0               0      ...
    (10)       |       0              0               0      ...
               |       0          10000               0      ...
               |       0              0               0      ...
               \       0              0               0      ...
               /     128             19               5      ...
               |      47             28               9      ...
    image      |      42            111              18      ...
    features   |      37            128              17      ...
    (784)      |      18            176              14      ...
               |      ..             ..              ..
We chose the label feature weight (here 10000) so that the squared distance between two differently-labeled samples (at least 2 × 10000^2 = 2 × 10^8, since they differ in two label features) is larger than the largest possible squared distance between two images (less than 1000 × 256^2 ≈ 6.6 × 10^7, for 784 pixels each differing by at most 255). This guarantees that points will not be assigned to a cluster containing a different label (assuming there is initially at least one cluster center with each label).
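To make the inequality concrete, here is the arithmetic from the paragraph above (plain Scala, nothing BIDMach-specific):
In [ ]:
val labelGap = 2.0 * 10000 * 10000   // two differently-labeled samples differ in two label features
val imageMax = 1000.0 * 256 * 256    // loose upper bound on the image-only squared distance
labelGap > imageMax                  // true: the label gap dominates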
Even though these label features are present in the cluster centroids after training, they don't affect matching at test time. Test images don't have the label features, so they match the closest cluster based only on image features. That cluster has a unique label, which we then assign to the test point.
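For reference, here is a toy sketch of how a matrix in this layout could be assembled from separate label and image matrices. The names and random data below are purely illustrative; the alls files on disk already have this combined form.
In [ ]:
// Purely illustrative: build a tiny matrix in the "alls" layout from fake data.
val n = 5                                    // a handful of fake instances
val labels = irow(2, 7, 0, 3, 9)             // one digit label per instance
val cats = zeros(10, n)                      // 10 x n one-hot label block
for (i <- 0 until n) cats(labels(i), i) = 1f
val data = rand(784, n) * 255f               // stand-in for the 784 pixel rows
val alls = (cats * 10000f) on data           // scaled label rows stacked on the image rows
size(alls)                                   // 794 x n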
The files containing data in this form are named "alls00.fmat.lz4", "alls01.fmat.lz4", etc. Since they contain both data and labels, we don't need to load label files separately. We can create a learner using a filename pattern for accessing these files:
In [ ]:
val (mm, opts) = KMeans.learner(mdir+"alls%02d.fmat.lz4")
The string "%02d" is a C/Scala format string that expands into a two-digit ASCII number to help with the enumeration.
There are several new options for tailoring a file datasource, but we'll mostly use the defaults. Beyond the number of clusters, the one thing we will set is the end of the training file range (file 10), so files 00 through 09 are used for training. This leaves the remaining files held out for testing.
In [ ]:
opts.dim = 300     // number of clusters
opts.nend = 10     // train on files 00 through 09
Note that the training data include both image data and labels (0-9). k-Means is an unsupervised algorithm, and if we used the image data alone it would often build clusters containing different digit images. To produce cleaner clusters, and to facilitate classification later on, the alls data includes the labels in the first 10 rows and the image data in the remaining rows. The label features are scaled by a large constant factor, so images of different digits are far apart in feature space. This effectively prevents different digits from occurring in the same cluster.
The following options are the important ones for tuning. For k-Means, batchSize has no effect on accuracy, since the algorithm uses all the data instances to perform each update, so you're free to tune it for best speed. Generally larger is better, as long as you don't use too much GPU RAM.
npasses is the number of passes over the dataset. Larger is typically better, but the model may overfit at some point.
In [ ]:
opts.batchSize = 20000   // minibatch size (speed only; doesn't affect accuracy)
opts.npasses = 6         // number of passes over the dataset
You invoke the learner the same way as before. You can change the options above after each run to optimize performance.
In [ ]:
mm.train
Now let's extract the model as a floating-point matrix. We included the category features in the clustering to make sure that each cluster contains images of only one digit.
In [ ]:
val modelmat = FMat(mm.modelmat)
Next we build a 30 x 10 grid of images to view the first 300 cluster centers.
In [ ]:
val nx = 30
val ny = 10
val im = zeros(28,28)                       // buffer for one 28 x 28 digit image
val allim = zeros(28*nx,28*ny)              // canvas holding the nx x ny grid of centroid images
for (i <- 0 until nx) {
  for (j <- 0 until ny) {
    val slice = modelmat(i+nx*j,10->794)    // skip the 10 label rows, keep the 784 pixel values
    im(?) = slice(?)                        // reshape the 784-vector into the 28 x 28 buffer
    allim((28*i)->(28*(i+1)), (28*j)->(28*(j+1))) = im
  }
}
show(allim kron ones(2,2))                  // upscale by 2x for display
We'll predict using the closest cluster (1-NN against the centroids, if you like). Since we did constrained clustering, our training data included the labels for each instance, but unlabeled test data doesn't have them. So we project the model matrix down to remove its first 10 features. Before doing this, though, we find the strongest label for each cluster so that later on we can map from cluster id to label.
In [ ]:
val igood = find(sum(modelmat,2) > 100) // find non-empty clusters
val mmat = modelmat(igood,?)
In [ ]:
val (dmy, catmap) = maxi2(mmat(?,0->10).t)       // look up the strongest label for each cluster
mm.model.modelmats(0) = mmat(?,10->mmat.ncols)   // remove the label features from the model
mm.model.modelmats(1) = mm.modelmats(1)(igood,0) // restrict the second model matrix to the non-empty clusters
catmap(0->100)                                   // peek at the labels of the first 100 clusters
Next we define a predictor from the just-computed model and the test data, with "preds" files to hold the predictions.
In [ ]:
val (pp, popts) = KMeans.predictor(mm.model, mdir+"data%02d.fmat.lz4", mdir+"preds%02d.imat.lz4")
popts.nstart = 10      // testing starts at file 10, the first held-out file
popts.nend = 20        // and runs through file 19
popts.ofcols = 100000  // samples per output file, matching the samples per test file
popts.batchSize = 20000
Let's run the predictor:
In [ ]:
pp.predict
The preds files now contain the indices of the best-matching cluster centers. We still need to look up the category label for each one and compare it with the reference data. We'll do this one file at a time, so that our evaluation can scale to arbitrary problem sizes.
In [ ]:
val totals = (popts.nstart until popts.nend).map(i => {
  val preds = loadIMat(mdir + "preds%02d.imat.lz4" format i);  // predicted centroids
  val cats = loadIMat(mdir + "cat%02d.imat.lz4" format i);     // reference labels
  val cpreds = catmap(preds);                                  // map centroid to label
  accum(cats.t \ cpreds.t, 1.0, 10, 10)                        // form a confusion matrix
}).reduce(_+_)
totals
From the actual and predicted categories, we can compute a confusion matrix:
In [ ]:
val conf = float(totals / sum(totals))
Now let's create an image by multiplying each confusion matrix cell by a white square:
In [ ]:
show((conf * 250f) ⊗ ones(32,32))
It's useful to isolate the correct classification rate by digit, which is:
In [ ]:
val dacc = getdiag(conf).t
We can take the mean of the diagonal accuracies to get an overall accuracy for this model.
In [ ]:
mean(dacc)
Enter your scores in the table below (esc-return to open up this cell, then shift-return when you're done). You should see the fields to be edited marked with dotted lines (...).
We've pre-filled the table with values for 3000 and 30000 clusters. The final accuracy (0.99) is quite good. We've demonstrated what is effectively a scalable approach to k-nearest neighbors: by condensing the potential neighbor set down to a collection of centroids, we limit the amount of data that needs to be retained. At the same time, we've reduced the variance in the dataset by using centroids instead of raw samples.
KMeans Clusters | Training time | Avg. gflops | Accuracy |
---|---|---|---|
300 | ... | ... | ... |
3000 | 36s (Titan-X) | 1260 | 0.962 |
30000 | 272s (Titan-X) | 1660 | 0.990 |
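As a sanity check on this "condensed k-NN" view, here is a rough, illustrative sketch of the same nearest-centroid prediction done directly on a small in-memory sample from one held-out part, rather than through the file-based predictor. It assumes the reduced model and catmap computed above, and relies on the fact that the nearest centroid under Euclidean distance is the one minimizing ||c||^2 - 2 c.x, since ||x||^2 is the same for every centroid.
In [ ]:
// Illustrative sketch only, not the BIDMach predictor.
val nsamp = 2000                                                      // keep the distance matrix small
val x = loadFMat(mdir + "data%02d.fmat.lz4" format 10)(?, 0->nsamp)   // 784 x nsamp test images
val y = loadIMat(mdir + "cat%02d.imat.lz4" format 10)(?, 0->nsamp)    // 1 x nsamp reference labels
val centroids = FMat(mm.model.modelmats(0))                           // k x 784, label rows already removed
val cnorms = (centroids *@ centroids) * ones(784, 1)                  // k x 1 squared norms of the centroids
val dists = cnorms * ones(1, nsamp) - (centroids * x) * 2f            // k x nsamp: ||c||^2 - 2 c.x
val (dmin, ibest) = mini2(dists)                                      // nearest centroid per test column
val predlabels = catmap(ibest)                                        // map centroid index to digit label
sum(FMat(predlabels == y), 2) * (1f / nsamp)                          // fraction correct on this sample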
In [ ]: