BIDMach Overview

BIDMach builds on the Scala language and follows a functional + object-oriented programming style, but it adds many functions and operators, and is best thought of as a DSL (Domain-Specific Language).

These functions and operators all live in the (default) global scope and provide an experience similar to Matlab or Julia.

In this cell we are in a standard Scala environment. To start BIDMach, we import its classes and test for native code linkage with these statements:


In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem

Don't worry if it says "Couldn't load CUDA runtime" - if you don't have a GPU or the CUDA toolkit installed on your machine, that is normal.

If you do have a GPU and CUDA, GPUmem will print out the fraction of free memory, the absolute free memory and the total memory for the default GPU.

Basic Matrix Algebra

From this cell onward, we are in the BIDMach environment. Let's define some matrices and do some basic algebra on them. BIDMach has Float, Double, Int and Complex matrix types. We'll start with float matrices. To create an array of ones, do


In [ ]:
val a = ones(4,4)                // ones creates a 4x4 Float matrix (FMat)

In [ ]:
%type a

You can create row and column matrices (FMat) by listing their elements:


In [ ]:
row(2,2,4,4,5,5)

In [ ]:
col(5,4,3,2)

You can also create integer row or column matrices (IMat) with irow and icol. These functions accept lists of values, or ranges:


In [ ]:
val r = irow(0 until 10)   // until gives a range excluding the last element

In [ ]:
%type r

In [ ]:
icol(0 to 2)               // to gives a range including the last element

You can create a matrix of sequential integers like this


In [ ]:
val b = izeros(4,4)             // An integer matrix this time, filled with zeros
b(?) = icol(0 until 16)         // Now treat as a 1d array, fill with a range

The question mark ? is BIDMach's wildcard character. Even though b is two-dimensional, b(?) linearizes its contents into a 16-element column and assigns the RHS to it.

The RHS should be a 16x1 integer matrix (IMat), but when it is given a range (0 until 16), BIDMach automatically casts the range to an IMat. This is called an implicit conversion in Scala.
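
For example, the icol wrapper above can be dropped and the range assigned directly, relying on that conversion (a minimal sketch):

b(?) = 0 until 16           // the Range is implicitly converted to a 16x1 IMat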

From the order of the elements in the array after the assignment, you can see that BIDMach uses column-major order. This is the same as Matlab, Fortran and Julia, but different from C and Python's NumPy (by default), which are row-major.
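
To check the layout concretely, a quick sketch using the b we just filled:

b(1,0)                     // = 1: consecutive values run down the first column
b(0,1)                     // = 4: the second column starts at the fifth element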

Transpose is implemented with a "t" operator:


In [ ]:
val bt = b.t

Basic Math Operators

Math operators have their expected results:


In [ ]:
val c = a + b

In [ ]:
%type c             // We added an integer matrix (IMat) and a float matrix (FMat), so what type is the result?

BIDMach implicitly casts IMats to FMats to perform algebraic operations.


In [ ]:
b - a

In [ ]:
a * b                    // Matrix multiply (not element-wise multiply)

In [ ]:
b / a                    // This is element-wise division, some toolkits instead multiply by the inverse of a.

Advanced Operators

As well as these operators, BIDMach includes several other important operators, written with their standard unicode symbols. Each has an ASCII alias in case unicode input is difficult. Here they are:

Unicode operator    ASCII alias    Operation
================    ===========    =========
       ∘                *@         Element-wise (Hadamard) product
       ∙                dot        Column-wise dot product
       ∙→               dotr       Row-wise dot product
       ⊗                kron       Kronecker (Cartesian) product

In [ ]:
b ∘ a

In [ ]:
b ∙ a

In [ ]:
b ∙→ a

In [ ]:
b ⊗ a

TODO: using the operators above, construct a 5x5 matrix such that every element is one greater than the element to the left, and the element above.


In [ ]:

Transposed Multiplies

Matrix multiply is the most expensive step in many calculations, and often involves transposed matrices. To speed up those calculations, BIDMach exposes two operators that combine the transpose and multiply operations:

^*  - transpose the first argument, so a ^* b is equivalent to a.t * b
*^  - transpose the second argument, so a *^ b is equivalent to a * b.t

These operators are implemented natively, i.e. they do not actually perform a transpose, but compute the result as if they had. This is particularly important for sparse matrices, since a transpose would involve an index sort.


In [ ]:
a ^* b

In [ ]:
a.t * b

In [ ]:
b *^ a

In [ ]:
b * a.t

Scalars

Scalars (primitive numerical values) can be used in most expressions:


In [ ]:
a + 1                      // Add an integer, result is still an FMat

In [ ]:
val aa = 3.5 * a             // Multiply by a double, result is still an FMat

In [ ]:
%type aa

Note: In Scala, floating point constants like 3.5 have double precision by default. Many languages would cast the last result to a double matrix, since double is the smallest container for both Float and Double data. We argue that when someone writes 3.5 * a, they mean to scale the matrix a by that factor and preserve its type, not to cause a type conversion. Single-precision constants in Scala need an "f" suffix, i.e.


In [ ]:
a + 2.7f

If you encounter double matrices in a calculation without meaning to, it may be because of operations with double-precision constants. Use the floating point notation above for scalars to minimize the chances of an unintentional up-conversion to double matrices.

For the next step, we will need a floating point version of the matrix b, which we can construct like this:


In [ ]:
val c = FMat(b)

TODO: Create a float vector of values from -10 to 10 spaced by 0.1. Then apply the logistic function 1/(1+exp(-x)) to it and call the plot() function on the results.


In [ ]:
// val x =
// val y =
// plot(x,y)

Edge Operators

Most operators support scalar arguments as in the last section. There are also many situations where it's helpful to apply an operation with an "edge" argument, that is, a vector whose long dimension matches the matrix. For example, we can define a vector v as:


In [ ]:
val v = col(1,2,3,4)

In [ ]:
v ∘ a

The elementwise multiply by v is applied to every column of a. We could also apply v.t to the rows of a:


In [ ]:
v.t ∘ a

Edge operators arise in several contexts, e.g. normalizing a matrix along rows or columns.

The sum() function computes the sums of the columns of c and returns a row vector:


In [ ]:
sum(c)

We can use this to normalize c along its columns:


In [ ]:
val d = c / sum(c)

In [ ]:
sum(d)

or rows


In [ ]:
val e = c / sum(c,2)

In [ ]:
sum(e,2)

TODO: Using the sqrt() function on matrices, normalize the columns of the matrix c so that their L2 norm (or equivalently the dot product of the column with itself) is 1.


In [ ]:
//val cn =

In [ ]:
// cn ∙ cn

Indexing and Slicing

You access individual array elements using parentheses (unlike [] in Python):


In [ ]:
a(1,1) = 2

In [ ]:
a

In [ ]:
a(0,3)

You can use the wildcard ? to access rows and columns:


In [ ]:
b

In [ ]:
b(?,1)

In [ ]:
b(2,?)

Ranges work as expected:


In [ ]:
b(1->3, 1->3)

And you can use arbitrary integer vectors to access submatrices:


In [ ]:
b(icol(0,1), icol(0,1,3))

Another shorthand constructor for integer matrices is the backslash operator:


In [ ]:
val ii = 0\1\3

In [ ]:
%type ii

And this syntax is handy for indexing expressions


In [ ]:
b(0\1, 0\1\3)
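
The vertical analogue of \ is the on operator, which stacks rows on top of each other; we'll use it below to build the index matrix for accum. A small sketch:

0\1 on 2\3                 // a 2x2 IMat built row by row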

Slices can be used for assignment:


In [ ]:
b(1, 0\1\3) = 0\0\0
b

and you can use scalars on the RHS to simplify bulk assignments


In [ ]:
b(0\1, 0\1\3) = -1
b

Matrices also accept single indices that reference elements in column-major order:


In [ ]:
b(7)

You can also use vectors of indices or ranges to assign arbitrary elements of a matrix, or all of them:


In [ ]:
b(0->16) = (16 to 1 by -1)

TODO: Define a set of indices ii such that for any 4x4 matrix m, m(ii) = m.t.

HINT: you already computed it!


In [ ]:
val m = rand(4,4)

In [ ]:
// val ii =

In [ ]:
// m(ii)

Reducers

BIDMach has several "reducers" that aggregate along rows or columns. We already saw one of these: the sum() function. Two other important ones are maxi() and mini(), which compute the max or min respectively, along columns (the default) or rows, e.g.


In [ ]:
val x = rand(4,5)

In [ ]:
val xm = maxi(x)

In [ ]:
val xm2 = maxi(x,2)

It's often very useful to know not only the max or min value, but also its index. The functions maxi2 and mini2 do this. They have a "2" suffix to indicate that they return 2 values: the first is the max or min value, the second is its index:


In [ ]:
val (vmax, imax) = maxi2(x)
vmax

In [ ]:
imax

A 3 in the first position of imax means that the max element of the first column was in row 3, etc. We can similarly compute the min along rows:


In [ ]:
val (vmin, imin) = mini2(x,2)
vmin

In [ ]:
imin

A last important reducer is accum, which is similar to Matlab's accumarray (NumPy's closest equivalent is np.add.at). It lets you tally values into specific positions in an output array. The format is:

accum(inds, vals, nrows, ncols)

where inds is an nx2 matrix (IMat) of (row, column) indices, vals holds the values to sum there, and nrows and ncols are the dimensions of the output. It's easiest to see with an example:


In [ ]:
val inds = 0\0 on 0\1 on 2\3 on 0\1 on 3\3

In [ ]:
val vals = col(1f, 2f, 3f, 4f, 5f)

In [ ]:
accum(inds, vals, 4, 5)

You can see that each of the vals was saved at the position given by the corresponding row of inds. Most of those positions were distinct, except that the second and fourth rows of inds specified the same location, so those two values (2f and 4f) were summed there.

Find Functions

The find function is similar to Matlab's find and NumPy's nonzero function. It comes in several flavors, depending on how many values are returned:

val ii = find(m)            // find and return the single indices of non-zero elements of m
val (ii, jj) = find2(m)     // find and return the (row, column) indices of non-zeros in m
val (ii, jj, vv) = find3(m) // find and return the (row, column) indices and values of non-zeros of m
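
For instance, a quick sketch on a small integer matrix with some zero entries:

val s = 0\2\0 on 3\0\4     // a 2x3 IMat with three non-zeros
find(s)                    // their column-major indices: 1, 2 and 5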

TODO: For the matrix below, use find3 to deconstruct it into row, column and value matrices. Then use accum to build it up again. You can use the \ operator to horizontally concatenate two matrices.


In [ ]:
val rr = rand(4,5)

In [ ]:
// deconstruct rr
// val rr2 =                              // Rebuild it

Ranges, For and While Loops

We used ranges before. There are two flavors, closed or "to" ranges, and open or "until" ranges.


In [ ]:
0 to 5

In [ ]:
0 until 5

For loops use ranges in a natural way


In [ ]:
for (i <- 0 until 5) {
    println("run number %d" format i)
}

While loops provide no special loop variable management, simply a test


In [ ]:
var i = 6
while (i > 0) {
    println("counting down %d" format i)
    i -= 1
}

Functional Programming in BIDMach

Functional programming in BIDMach (or NumPy or Matlab) avoids explicit iteration over the elements of matrices, and concentrates instead on whole-array operations and (where irregular access is needed) on manipulation of index matrices. It's not unlike the use of global operations on DataFrames.

This approach allows highly-parallelized code to be used to implement these routines. It often makes for more succinct code and (with some practice) greater readability. We'll concentrate on applying those ideas in the next part of the Lab.
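
As a small illustration (a sketch using only functions we've already seen), here is the same sum of squares written with an explicit loop and in whole-array style:

val w = rand(4,5)
// explicit iteration over the elements:
var total = 0f
for (i <- 0 until w.length) total += w(i) * w(i)
// whole-array style: one expression, no explicit loop
sum(sum(w ∘ w))            // a 1x1 matrix holding the same value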

I Feel the Need for Speed !!

Yes, it's not only fun to have a fast toolkit - it really matters. And not just for runtime: most algorithms can trade time for precision, simply by training more thoroughly or by training a richer model.

BIDMat/BIDMach is currently the only system which fully integrates GPU computing with CPU computing. It's also the only system to have fully rooflined sparse matrix primitives. This is very important, since these are the bottleneck for machine learning on the most common types of data (text, web, clickthrough etc.). Let's measure exactly how much difference this makes.

First, we'll define a few matrices, both on the CPU and on a GPU (assuming you have one).


In [ ]:
val n = 8192
val a = rand(n,n)                     // a random dense matrix (CPU)
val b = powrand(n,n,100)              // a random power-law sparse matrix with 100 elements per column (CPU)
val ga = GMat(a)                      // a GPU version of a
val gb = GSMat(b)                     // a GPU version of b

Now we could just go ahead and do our calculations on a, b, ga and gb directly. This is a common scenario. But we would also like to illustrate BIDMat/BIDMach's support for generics, so instead we'll create variables of type "Mat" to hold those matrices, and perform arithmetic on them instead.


In [ ]:
var ma:Mat = a              // create a generic Mat variable and bind it to a
var mb:Mat = b              // create a generic Mat variable and bind it to b
var mc:Mat = null           // we'll use this to hold results

Now we'll benchmark both dense and sparse matrix multiply. Dense first.


In [ ]:
flip
mc = ma * ma
gflop

The "flip" function starts a timer and reset the flop count. "gflop" returns two values: the gigaflop count, and the time since the last "flip".


In [ ]:
flip
mc = ma * mb
gflop

Now let's bind those variables to GPU matrices instead:


In [ ]:
ma = ga
mb = gb

and run exactly the same code:


In [ ]:
flip
mc = ma * ma
gflop

You'll probably see a good order-of-magnitude speedup over the CPU calculation. This shouldn't be surprising: GPUs have a well-earned reputation for dense-matrix performance.

What's less well-known is their sparse matrix performance, which yields roughly order-of-magnitude gains as well.


In [ ]:
flip
mc = ma *^ mb
gflop

You should see performance in the 30-40 gflops range, which is the roofline for sparse operations on the current generation of Nvidia GPUs.

This is very important, because that operation (and two other variants that have similar performance) is the dominant step in most common machine learning algorithms. With careful design of the entire learning pipeline, you can translate that advantage into an end-to-end speedup by the same factor.

Furthermore, by writing generic code (using the Mat class as above) you can hide the details of implementation (CPU vs GPU) and run your algorithm in either environment. You can also support either sparse or dense matrices, and many of BIDMach's learning algorithms will work with either.
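
For example, a small generic helper can be written once against Mat and then applied to CPU or GPU matrices alike (a sketch; sumsq is an illustrative name of our own, not a BIDMach function):

def sumsq(m:Mat):Mat = sum(m ∘ m)  // column-wise sums of squares for any Mat subtype
sumsq(a)                           // runs on the CPU (FMat)
sumsq(ga)                          // the same code runs on the GPU (GMat)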

TODO: try the transpose operator *^ in the cells above. Note the prefix transpose operator ^* is not needed in most machine learning algorithms and doesn't have a fast implementation.