Introduction to BIDMat

BIDMat is a multi-platform matrix library similar to Matlab, Julia or Numpy/Scipy. Its intended primarily for machine learning, but is has a broad set of operations and datatypes and should be suitable for many other applications.

BIDMat is probably unique in its integration of CPU and GPU data types. Other features include:

  • Interactivity. Thanks to the Scala language, BIDMat is interactive and scriptable.
  • Performance, thanks to CPU and GPU native code, and to Scala's speed on memory-bound operations.
  • Parallelism, thanks to Scala's actor framework and parallel collection classes.
  • Rich, open syntax of math operators, +,-,*,/,⊗,∙,∘
  • Runs on JVM, extremely portable, leverages Hadoop, Yarn, Spark etc.

BIDMat is a library that is loaded by a startup script, and a set of imports that include the default classes and functions. We include them explicitly in this notebook.


In [ ]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._

Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem

These calls check that CPU and GPU native libs loaded correctly, and what GPUs are accessible. If you have a GPU and CUDA installed, GPUmem will printout the fraction of free memory, the absolute free memory and the total memory for the default GPU.

Basic Matrix Algebra

From this cell onward, we are in the BIDMat environment. Let define some matrices and basic algebra on them. BIDMat has Float, Double, Int and Complex matrix types. We'll start with integer matrices. To create an array of ones, do


In [ ]:
val a = ones(4,4)   // ones creates a 4x4 Float matrix (FMat). FMat is the default.
                    // you use prefixes to get other types. e.g. iones gives an integer matrix.

In [ ]:
%type a

You can create row and column matrices (FMat) by listing their elements:


In [ ]:
row(2,2,4,4,5,5)

In [ ]:
col(5,4,3,2)

You can also create integer row or column matrices (IMat) with irow and icol. These functions accept lists of values, or ranges:


In [ ]:
val r = irow(0 until 10)   // until gives a range excluding the last element

In [ ]:
%type r

In [ ]:
icol(0 to 2)               // to gives a range including the last element

You can create a matrix of sequential integers like this


In [ ]:
val b = izeros(4,4)             // An integer matrix this time, filled with zeros
b(?) = icol(0 until 16)         // Now treat as a 1d array, fill with a range
b

The questionmark ? is BIDMat's wildcard character. Even though b is two-dimensional, b(?) linearizes its contents into a 16-element column and puts the RHS into it.

The RHS should be another 16x1 integer matrix (IMat), but when supplied with a range (0 until 16), BIDMat automatically casts the range to an IMat. This is called an implicit conversion in Scala.

From the order of elments in the array after the assignment, you can see that BIDMat uses Column-major-order. This is similar to Matlab, Fortran and Julia, but different from C and Python which are row-major.

Transpose is implemented with a "t" operator:


In [ ]:
val bt = b.t

Basic Math Operators

Math operators have their expected results:


In [ ]:
val c = a + b

In [ ]:
%type c             // We added an integer matrix (IMat) and a float matrix (FMat), so what type is the result?

BIDMat implicitly casts IMats to FMats to perform algebraic operations.


In [ ]:
b - a

In [ ]:
a * b                    // Matrix multiply (not element-wise multiply)

In [ ]:
b / a                    // This is element-wise division, some toolkits instead multiply by the inverse of a.

Advanced Math Operators

As well as these operators, BIDMat includes several other important operators with their standard unicode representation. They have an ASCII alias in case unicode input is difficult. Here they are:

Unicode operator    ASCII alias    Operation
================    ===========    =========
       ∘                *@         Element-wise (Hadamard) product
       ∙                dot        Column-wise dot product
       ∙→              dotr        Row-wise dot product
       ⊗               kron        Kronecker (Cartesian) product

In [ ]:
b  a

In [ ]:
b  a

In [ ]:
b ∙→ a

In [ ]:
b  a

TODO: using the operators above, construct a 5x5 matrix such that every element is one greater than the element to the left, and the element above.


In [ ]:

Transposed Multiplies

Matrix multiply is the most expensive step in many calculations, and often involves transposed matrices. To speed up those calcualtions, we expose two operators that combine the transpose and multiply operations:

^&ast  - transpose the first argument, so a ^&ast b is equivalent to a.t &ast b
&ast^  - transpose the second argument, so a &ast^ b is equivalent to a &ast b.t
these operators are implemented natively, i.e. they do not actually perform transposes, but implement the effective calculation. This is particulary important for sparse matrices since transpose would involve an index sort.


In [ ]:
a ^* b

In [ ]:
a.t * b

In [ ]:
b *^ a

In [ ]:
b * a.t

Edge Operators

Most operators support scalar arguments as in the last section. There are also many situations where its helpful to apply an operation with an "edge" argument, that is a vector whose long dimension matches the matrix. For example, we can define a vector v as:


In [ ]:
val v = col(1,2,3,4)

In [ ]:
v  a

The elementwise multiply by v is applied to every column of a. We could also apply v.t to the rows of a:


In [ ]:
v.t  a

Edge operators arise in several context, e.g. normalizing a matrix along rows or columns.

The sum() function computes the sums of columns of b and returns a row vector:


In [ ]:
sum(c)

We can use this to normalize b along its columns:


In [ ]:
val d = c / sum(c)

In [ ]:
sum(d)

or rows


In [ ]:
val e = c / sum(c,2)

In [ ]:
sum(e,2)

TODO: Using the sqrt() function on matrices, normalize the columns of the matrix c so that their L2 norm (or equivalently the dot product of the column with itself) is 1.


In [ ]:
//val cn =

In [ ]:
//cn  cn

Scalars

Scalars (primitive numerical values) can be used in most expressions:


In [ ]:
a + 1                      // Add an integer, result is still an FMat

In [ ]:
val aa = 3.5 * a             // Multiply by a double, result is still an FMat

In [ ]:
%type aa

Note In Scala, floating point numbers like 3.5 have double precision by default. Many languages would cast the last result to a double matrix since double is the smallest container for both Float and Double data. We argue that when someone writes 3.5 x a, they mean to scale the matrix a by that factor and preserve its type, not to cause a type conversion. Single-precision constants in Scala need an "f" suffix, i.e.


In [ ]:
a + 2.7f

In case you encounter double matrices in a calculation without meaning to, it may because of operations with double-precision constants. Use the floating point notation above for scalars to minimize the chances of unintentional up-conversion to double matrices.

For the next step, we will need a floating point version of the matrix b, which we can construct like this:


In [ ]:
val c = FMat(b)

TODO: Create a float vector of values from -10 to 10 spaced by 0.1. Then apply the logistic function 1/(1+exp(-c)) to and call the plot() function on the results


In [ ]:
val x=
val y=
plot(x,y)

Indexing and Slicing

You access invidual array elements using parenthesis (unlike [] in Python)


In [ ]:
a(1,1) = 2

In [ ]:
a

In [ ]:
a(0,3)

You can use the wildcard ? to access rows and columns:


In [ ]:
b

In [ ]:
b(?,1)

In [ ]:
b(2,?)

Ranges work as expected:


In [ ]:
b(1->3, 1->3)

And you can use arbitrary integer vectors to access submatrices:


In [ ]:
b(icol(0,1), icol(0,1,3))

Another shorthand constructor for integer matrices is the backslash operator:


In [ ]:
val ii = 0\1\3

In [ ]:
%type ii

And this syntax is handy for indexing expressions


In [ ]:
b(0\1, 0\1\3)

Slices can be used for assignment:


In [ ]:
b(1, 0\1\3) = 0\0\0
b

and you can use scalars on the RHS to simplify bulk assignments


In [ ]:
b(0\1, 0\1\3) = -1
b

Matrices also accept single indices that reference elements in column-major order:


In [ ]:
b(7)

You can also use vectors of indices or ranges to assign arbitrary elements of a matrix, or all of them:


In [ ]:
b(0->16) = (16 to 1 by -1)

TODO: Define a set of indices ii such that for any 4x4 matrix m, m(ii) = m.t.

HINT: you already computed it!


In [ ]:
val m = rand(4,4)

In [ ]:
// val ii =

In [ ]:
// m(ii)

Reducers

BIDMach has several "reducers" that aggregate along rows or columns. We already saw one of these, which was the sum() function. Two other important ones are maxi() and mini(). These both compute the max or min respectively, along columns (default) or rows. e.g.


In [ ]:
val x = rand(4,5)

In [ ]:
val xm = maxi(x)

In [ ]:
val xm2 = maxi(x,2)

Its often very useful to know not only what element was the max or min, but also its index. The functions maxi2 and mini2 do this. They have a "2" suffix to indicate that they return 2 values. The first is the max or min value, the second is the max or min index:


In [ ]:
val (vmax, imax) = maxi2(x)
vmax

In [ ]:
imax

The first 3 means that the max element of the first column was in row number 3, etc. We can similarly compute the min along rows:


In [ ]:
val (vmin, imin) = mini2(x,2)
vmin

In [ ]:
imin

A last important reducer is accum which is similar to Matlab's accumarray, or numpy's accum. It allows you to tally some values into specific positions in an output array. The format is:

accum(inds, vals, nrows, ncols)

where inds is an nx2 matrix (IMat) of row,column indices, vals are the values to sum there, and nrows and ncols are the matrix dimensions. Its easiest to see with an example:


In [ ]:
val inds = 0\0 on 0\1 on 2\3 on 0\1 on 3\3

In [ ]:
val vals = col(1f, 2f, 3f, 4f, 5f)

In [ ]:
accum(inds, vals, 4, 5)

You can see that each of the vals was saved in the position specified by the corresponding inds. Most of the locations to save were distinct, except for the second and fourth rows, which specified the same location. Those two values (2f and 4f) were summed in that location.

Find Functions

The find function is similar to Matlab's find, and Numpy's nonzero function. It comes in several flavors, depending on how many values are returned:

val ii = find(m)            // find and return the single indices of non-zero elements of m
val (ii, jj) = find2(m)     // find and return the (row, column) indices of non-zeros in m
val (ii, jj, vv) = find3(m) // find an return the (row, column) indices and values of non-zeros of m

TODO: For the matrix below, use find3 to deconstruct it into row, column and value matrices. Then use accum to build it up again. You can use the \ operator to horizontally concatenate two matrices.


In [ ]:
val rr = rand(4,5)

In [ ]:
// deconstruct rr
// val rr2 =                              // Rebuild it

For and While Loops

We used ranges before. There are two flavors, closed or "to" ranges, and open or "until" ranges.


In [ ]:
0 to 5

In [ ]:
0 until 5

For loops use ranges in a natural way


In [ ]:
for (i <- 0 until 5) {
    println("run number %d" format i)
}

While loops provide no special loop variable management, simply a test


In [ ]:
var i = 6
while (i > 0) {
    println("counting down %d" format i)
    i -= 1
}

Warning: Performance Sinkhole!

For loops are much more complex that while loops. They apparently create a local evaluation context for each iteration, and the overhead is several times higher than for while loops. Lets measure this:


In [ ]:
import java.util.Random
val randgen = new Random
tic
var sum = 0f
for (i <- 0 until 1000000) {
    sum += randgen.nextFloat - 0.5f
}
val t = toc
(sum, t)

In [ ]:
tic
var sum = 0f
var i = 0
while (i < 1000000*100) {
    sum += randgen.nextFloat - 0.5f
    i += 1
}
val t = toc
(sum, t)

For loops (and Scala's other functional tools like sequence classes) are extremely powerful for e.g. multi-threading and multi-GPU computing. But they're not suitable for lightweight iteration over elements.

Functional Programming in BIDMat

Functional programming in BIDMat (or Numpy or Matlab) avoids explicit iteration over the elements of matrices and concentrates instead on whole-array operations and (if irregular access is needed) on manipulation of index matrices. Its not unlike the use of global operations on DataFrames.

This approach allows highly-parallelized code to be used to implement these routines. It often makes for more succinct code and (with some practice), greater readability. We'll concentrate on applying those ideas in the next part of the Lab.

I Feel the Need for Speed !!

Yes, its not only fun to have a fast toolkit, but it really matters for performance. Not just runtime, but most algorithms can trade off time for precision by simply training more thoroughly, or training a richer model.

BIDMat/BIDMach is currently the only system which fully integrates GPU computing with CPU computing. Its only the only system to have fully rooflined sparse matrix primitives. This is very important, since these are the bottleneck for machine learning on the most common types of data (text, web, clickthrough etc). Let's measure exactly how much difference this makes.

First, we'll define a few matrices, both on the CPU and on a GPU (assuming you have one).


In [ ]:
val n = 8192
val a = rand(n,n)                     // a random dense matrix (CPU)
val b = powrand(n,n,100)              // a random power-law sparse matrix with 100 elements per column (CPU)
val ga = GMat(a)                      // a GPU version of a
val gb = GSMat(b)                     // a GPU version of b

Now we could just go ahead and do our calculations on a,b,ga,gb directly. This is a common scenario. But we would also like to illustrate BIDMat/BIDMach's support for generics. So instead will create variables of type "Mat" to hold those variables, and perform arithmetic on those instead.


In [ ]:
var ma:Mat = a              // create a generic Mat variable and bind it to a
var mb:Mat = b              // create a generic Mat variable and bind it to b
var mc:Mat = null           // we'll use this to hold results

Now we'll benchmark both dense and sparse matrix multiply. Dense CPU multiply first.


In [ ]:
flip
mc = ma * ma
gflop

The "flip" function starts a timer and reset the flop count. "gflop" returns two values: the gigaflop count, and the time since the last "flip".


In [ ]:
flip
mc = ma * mb
gflop

Now lets bind those variables to GPU matrices instead:


In [ ]:
ma = ga
mb = gb

and run exactly the same code:


In [ ]:
flip
mc = ma * ma
gflop

You'll probably see a good order-of-magnitude speedup over the CPU calculation. This shouldnt be surprising. GPUs have a well-earned reputation for dense-matrix performance.

What's less well-known is their sparse matrix performance, which yields roughly order-of-magnitude gains as well.


In [ ]:
flip
mc = ma * mb
gflop

You should see performance in the 20-50 gflops range, which is near the roofline for sparse operations on the current generation of Nvidia GPUs.

This is very important, because that operation (and two other variants that have similar performance) is the dominant step in most common machine learning algorithms. With careful design of the entire learning pipeline, you can translate that advantage into a end-to-end speedup by the same factor.

Furthermore, by writing generic code (using the Mat class as above) you can hide the details of implementation (CPU vs GPU) and run your algorithm in either environment. You can also support either sparse or dense matrices, and many of BIDMach's learning algorithms will work with either.

TODO: try the transpose operator &ast^ in the cells above.


In [ ]:

Matrix Caching

One of the challenges of working with GPUs is the current lack of memory management (i.e. a garbage collector). The very high streaming speed of GPU memory makes it very difficult to do memory management without significant slowdowns. BIDMach includes a matrix caching scheme which allows re-use of matrix storage. It works particularly well in BIDMach's minibatch algorithms, which process same-sized blocks of data many times.

To understand caching, lets first notice that every matrix has a unique long id or guid:


In [ ]:
val a = rand(4,4)
val b = rand(4,4)
(a.GUID, b.GUID)

Normally, when you do calculations with matrices, new containers are created to hold the results:


In [ ]:
val c = a * b
val d = a + b
(c.GUID, d.GUID)

But with caching enabled, the same expression will yield the same container (different expressions with the same arguments will still yield different containers):


In [ ]:
Mat.useCache = true
val c = a + b
val d = a + b
val e = a * b
(c.GUID, d.GUID, e.GUID)

Although this approach causes aliasing (c and d now point to the same container), in a functional programming language the same expressions should always hold the same value. Arrays are mutable objects, so are not guaranteed to hold the same value. Nevertheless if you program in functional style, the same expressions should hold the same value and caching is a safe operation.

With that caveat, caching is a very helpful performance optimization. The learners in BIDMach automatically turn caching on and off when you run a learning algorithm and in this way are able to eliminate matrix allocation after the first iteration. Its a necessary feature to be able to use a GPU on large datasets, and it often accelerates calculations on the CPU by removing memory allocation and garbage collection overhead.


In [ ]:
Mat.useCache = false