One of the main goals of BIDMat/BIDMach is to make model creation, customization and experimentation much easier.
To that end it has reusable classes that cover the elements of learning: data sources, models, mixins (regularizers), updaters and data sinks.
When creating a new model, it's often only necessary to create a new model class. We recently needed a scalable SVD (Singular Value Decomposition) for some student projects. Let's walk through creating it from scratch.
This model works like the previous example of in-memory SVD for a matrix M. The left singular vectors of M are the eigenvectors of $M M^T$, so we do subspace iteration:
$$P = M M^T Q$$
$$(Q, R) = \mathrm{QR}(P)$$
But now we want to deal with an $M$ which is too big to fit in memory. In the minibatch context, we can write $M$ as a horizontal concatenation of minibatches (this assumes data samples are columns of $M$ and features are rows):
$$M = \left[\, M_1 \;\; M_2 \;\cdots\; M_n \,\right]$$
and then
$$P = \sum_{i=1}^n M_i M_i^T Q$$
so we can compute $P$ by operating only on the minibatches $M_i$. We do need to fit $P$ and $Q$ in memory, but their size is only $k \times |F|$, where $k$ is the SVD dimension and $F$ is the feature set.
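Putting the pieces together, one pass of out-of-core subspace iteration looks schematically like this. This is only a sketch: Mparts is a hypothetical array holding the minibatches, and P and Q are assumed to be allocated already, as in the model code below.

// One pass of subspace iteration over minibatches (schematic)
P.clear                              // zero the accumulator
for (i <- 0 until Mparts.length) {
  val Mi = Mparts(i)                 // minibatch: features x samples
  P ~ P + (Q.t * Mi *^ Mi).t         // P += M_i * M_i^T * Q
}
QRdecompt(P, Q, null)                // Q <- orthonormal factor of QR(P)

The model we build below computes exactly these steps: the accumulation happens one minibatch at a time in dobatch, and the QR step happens once per pass in updatePass.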
We start by defining a new model class which extends BIDMach's Model class. It will always take an Options instance as an argument:
class SVD(opts:SVD.Opts = new SVD.Options) extends Model(opts)
The options are defined in the "Object" associated with the class (its companion object). In Scala an "Object" defines a singleton which holds all of the static methods of the class. It looks like this:
object SVD {
  trait Opts extends Model.Opts {
    var deliciousness = 3
  }
  class Options extends Opts {}
  ...
}
Truthfully, an SVD model doesn't need a "deliciousness" option; in fact it doesn't need any Options at all, or rather what it needs is inherited from its parent. But it's there to show how options are created. The Opts are defined as a trait rather than a class so they can be mixed in with the Options of other learning classes.
There are three variables we need to keep track of:
var Q:Mat = null; // (Left) Singular vectors
var SV:Mat = null; // Singular values
var P:Mat = null; // P (accumulator)
and an initialization routine sets these to appropriate values.
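An initialization routine along these lines would do the job. This is only a sketch, not BIDMach's exact code: it assumes the data source has been bound so that mats(0).nrows gives the feature count, and that setmodelmats registers the model matrices (so that Q and SV appear later as modelmats(0) and modelmats(1)).

override def init() = {
  val nfeats = mats(0).nrows;           // number of features (rows of M)
  Q = normrnd(0, 1, nfeats, opts.dim);  // random starting subspace
  SV = Q.zeros(1, opts.dim);            // one value per singular vector
  P = Q.zeros(nfeats, opts.dim);        // accumulator for M M^T Q
  setmodelmats(Array(Q, SV));           // register the model matrices
}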
Each update should update the stable model; here that's $P$:
def dobatch(mats:Array[Mat], ipass:Int, pos:Long):Unit = {
  val M = mats(0);
  P ~ P + (Q.t * M *^ M).t     // Accumulate P += M * M^t * Q efficiently
}
The score method should return a floating point vector of scores for this minibatch.
def evalbatch(mat:Array[Mat], ipass:Int, pos:Long):FMat = {
  SV ~ P ∙ Q;                                    // Estimate the singular values
  val diff = (P / SV) - Q;                       // Residual
  row(-(math.sqrt(norm(diff) / diff.length)));   // Return the norm of the residual as the score
}
At the end of a pass over the data, we update $Q$. Not all models need this step, and minibatch algorithms typically don't have it.
override def updatePass(ipass:Int) = {
  QRdecompt(P, Q, null);    // Basic subspace iteration
  P.clear;                  // Clear P for the next pass
}
We're done defining the SVD model. We can run it now, but to make that easier we'll define a couple of convenience functions.
class MatOptions extends Learner.Options with SVD.Opts with MatSource.Opts with Batch.Opts
def learner(mat:Mat):(Learner, MatOptions) = {
  val opts = new MatOptions;
  opts.batchSize = math.min(100000, mat.ncols/30 + 1)
  val nn = new Learner(
      new MatSource(Array(mat), opts),
      new SVD(opts),
      null,
      new Batch(opts),
      null,
      opts)
  (nn, opts)
}
class FileOptions extends Learner.Options with SVD.Opts with FileSource.Opts with Batch.Opts
def learner(fnames:String):(Learner, FileOptions) = {
  val opts = new FileOptions;
  opts.batchSize = 10000;
  opts.fnames = List(FileSource.simpleEnum(fnames, 1, 0));
  implicit val threads = threadPool(4);
  val nn = new Learner(
      new FileSource(opts),
      new SVD(opts),
      null,
      new Batch(opts),
      null,
      opts)
  (nn, opts)
}
A predictor is a Learner which runs an existing model over a DataSource and outputs to a DataSink. For SVD, the predictor outputs the right singular vectors, which may be too large to fit in memory. Here's a memory-to-memory predictor:
class PredOptions extends Learner.Options with SVD.Opts with MatSource.Opts with MatSink.Opts;
// This function constructs a predictor from an existing model
def predictor(model:Model, mat1:Mat):(Learner, PredOptions) = {
  val nopts = new PredOptions;
  nopts.batchSize = math.min(10000, mat1.ncols/30 + 1)
  nopts.dim = model.opts.dim;
  val newmod = new SVD(nopts);
  newmod.refresh = false
  model.copyTo(newmod)
  val nn = new Learner(
      new MatSource(Array(mat1), nopts),
      newmod,
      null,
      null,
      new MatSink(nopts),
      nopts)
  (nn, nopts)
}
Now let's try it out! First we initialize BIDMach as before.
In [1]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GDMat,GMat,GIMat,GLMat,GSDMat,GSMat,
HMat,IMat,Image,LMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.JPlotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,Model,NMF,RandomForest,SFA,SVD}
import BIDMach.datasources.{DataSource,MatSource,FileSource,SFileSource}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
Mat.setInline
if (Mat.hasCUDA > 0) GPUmem
Out[1]:
We'll run on the MNIST 8M (8 million images) digit data, which is a large dataset distributed over multiple files.
In [2]:
val dir="../data/MNIST8M/parts/"
val (nn, opts) = SVD.learner(dir+"data%02d.fmat.lz4");
Let's set some options:
In [3]:
opts.nend = 10;
opts.dim = 20;
opts.npasses = 2;
opts.batchSize = 20000;
Out[3]:
and release the beast:
In [4]:
nn.train
The model matrices for this model hold the results. They are generic matrices, so we cast them to FMats:
In [5]:
val svals = FMat(nn.modelmats(1));
val svecs = FMat(nn.modelmats(0));
In [6]:
semilogy(svals)
Out[6]:
To see how well we did, we will compute the SVD directly by forming $M M^T$ and computing its eigendecomposition. Normally we can't do this because $MM^T$ is nfeats x nfeats, but for this dataset nfeats is only 784.
In [7]:
tic
val MMt = zeros(784,784);
for (i <- 0 until opts.nend) {
  val Mi = loadFMat(dir+"data%02d.fmat.lz4" format i);
  MMt ~ MMt + (Mi *^ Mi);    // accumulate M_i * M_i^T over the files
  print(".");
}
println;
toc
Out[7]:
Now we call an eigenvalue routine to compute the eigenvalues and eigenvectors of $MM^T$: the eigenvectors are the left singular vectors of $M$, and the eigenvalues are the squares of its singular values.
In [8]:
val (eval, evecs) = feig(MMt);
val topvecs = evecs(?, 783 to 784-opts.dim by -1);
Eigenvectors have a sign ambiguity, and it's common to see V or -V. So next we compute dot products between the two sets of vectors and flip signs if a dot product is negative:
In [9]:
val dots = svecs ∙ topvecs;
svecs ~ svecs ∘ (2*(dots>0) - 1);
Let's now look at the eigenvectors as small images, decreasing in strength from left to right. First the reference eigenvectors:
In [10]:
val onerow = topvecs.view(28,28*opts.dim);
val nc = onerow.ncols;
val tworows = onerow(?,0->(nc/2)) on onerow(?,(nc/2)->nc)
show((tworows.t*500+128) ⊗ ones(3,3))
Out[10]:
In [11]:
val onerow = svecs.view(28,28*opts.dim);
val nc = onerow.ncols;
val tworows = onerow(?,0->(nc/2)) on onerow(?,(nc/2)->nc)
show((tworows.t*500+128) ⊗ ones(3,3))
Out[11]:
So far, we obtained the singular values and left singular vectors from the model's modelmats array. The right singular vectors grow in size with the dataset, and in general won't fit in memory. But we can still compute them by running a predictor on the dataset. This predictor takes a parametrized input file name for the matrix $M$ and a parametrized output file name to hold the right singular vectors. A key option to set is ofcols, the number of samples per output file:
In [14]:
val (pp, popts) = SVD.predictor(nn.model, dir+"data%02d.fmat.lz4", dir+"rSingVectors%02d.fmat.lz4")
popts.ofcols = 100000 // number of columns per file, here the same as the input files
popts.nend = 10 // Number of input files to process
Out[14]:
In [15]:
pp.predict
BIDMach's actual SVD model is only slightly different from the code presented here. First, it performs the subspace update on individual minibatches for the first few iterations, to get near a good model more rapidly. Second, it alternates subspace iteration with Rayleigh-Ritz iterations for faster convergence.
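For reference, a Rayleigh-Ritz step projects $M M^T$ into the current subspace, diagonalizes the resulting small $k \times k$ matrix, and rotates $Q$ by its eigenvectors. A minimal sketch, not BIDMach's exact code, assuming $P = M M^T Q$ has already been accumulated as above:

// Rayleigh-Ritz refinement (sketch)
val S = FMat(Q.t * P);       // S = Q^t * M * M^t * Q, a small k x k matrix, since P = M * M^t * Q
val (lambda, V) = feig(S);   // eigendecomposition of S; lambda approximates eigenvalues of M * M^t
Q = Q * V;                   // rotate the subspace basis toward the eigenvectors (rebinds Q in this sketch)

Because the projected matrix is only k x k, this step is cheap, and it resolves the leading directions within the current subspace faster than QR iteration alone.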