Features of BIDMat and Scala

BIDMat is a multi-platform matrix library similar to R, Matlab, Julia or Numpy/Scipy. It takes full advantage of the very powerful Scala language. It's intended primarily for machine learning, but it has a broad set of operations and datatypes and should be suitable for many other applications. BIDMat has several unique features:

  • Built from the ground up with GPU + CPU backends. BIDMat code is implementation-independent.
  • GPU memory management uses caching, designed to support iterative algorithms.
  • Natural and extensible syntax (thanks to Scala). Math operators include +, -, *, /, ⊗, ∙, ∘.
  • Probably the most complete support for matrix types: dense matrices of float32, float64, int32 and int64, plus sparse matrices with single- or double-precision elements, all available on CPU or GPU.
  • Highest-performance sparse matrix operations on power-law data.

BIDMat has several other state-of-the-art features:

  • Interactivity. Thanks to the Scala language, BIDMat is interactive and scriptable.
  • Access to a massive code base, thanks to Java.
  • Easy-to-use parallelism, thanks to Scala's actor framework and parallel collection classes (see the sketch after this list).
  • Runs on the JVM, so it is extremely portable: Mac, Linux, Windows and Android.
  • Cluster-ready, leverages Hadoop, Yarn, Spark etc.
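
The parallelism item deserves a quick illustration. The sketch below is not part of BIDMat itself and uses only plain Scala (parallel collections are in the standard library for the Scala 2.x versions BIDMat is built against; in Scala 2.13+ they moved to a separate module): calling .par on an ordinary collection runs the subsequent map across the available cores.

    // Minimal sketch: Scala parallel collections, no BIDMat required.
    val partialSums = (1 to 8).par.map(i => (1L to 1000000L * i).sum)  // eight sums computed in parallel
    println(partialSums.sum)                                           // combine the partial results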

BIDMat is a library that is loaded by a startup script, together with a set of imports that bring in the default classes and functions. We include them explicitly in this notebook.


In [1]:
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,FND,GMat,GDMat,GIMat,GLMat,GSMat,GSDMat,
               HMat,IMat,Image,LMat,Mat,ND,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.JPlotting._

Mat.checkMKL
Mat.checkCUDA
Mat.setInline
if (Mat.hasCUDA > 0) GPUmem


1 CUDA device found, CUDA version 7.0
Out[1]:
(0.9833951,11878821888,12079398912)

These calls check that the CPU and GPU native libraries loaded correctly, and report which GPUs are accessible. If you have a GPU and CUDA installed, GPUmem will print out the fraction of free memory, the absolute free memory and the total memory for the default GPU.

CPU and GPU matrices

BIDMat's matrix types are given in the table below. All are children of the "Mat" parent class, which allows code to be written generically. Many of BIDMach's learning algorithms will run with either single or double precision, dense or sparse input data.

             CPU Matrices           GPU Matrices
             Dense      Sparse      Dense      Sparse
  Float32    FMat       SMat        GMat       GSMat
  Float64    DMat       SDMat       GDMat      GSDMat
  Int32      IMat                   GIMat
  Int64      LMat                   GLMat

In [2]:
val n = 4096          // "val" designates a constant (immutable value). n is statically typed (as Int here), but its type is inferred.
val a = rand(n,n)     // Create an nxn matrix (on the CPU)



Out[2]:
   0.62465   0.89703   0.27582   0.13452   0.88863   0.32627   0.16042...
   0.63891   0.44064   0.36484   0.18473   0.75125   0.18824   0.75513...
   0.84862   0.13745   0.54177   0.69702   0.27047  0.040770  0.080366...
   0.26990  0.088154   0.17225   0.95003   0.51944   0.89452   0.75103...
  0.050579   0.15992   0.15957   0.49453   0.54010   0.32854   0.49945...
   0.73763   0.38573   0.80398   0.14497   0.16867   0.89132   0.31194...
   0.39953   0.52224   0.14720   0.11078   0.11138   0.95166  0.019181...
   0.91040  0.076302   0.60944   0.40543   0.79685   0.85392   0.23852...
        ..        ..        ..        ..        ..        ..        ..

In [3]:
%type a             // Most scientific functions in BIDMat return single-precision results by default.


BIDMat.FMat

CPU matrix operations use Intel MKL acceleration for linear algebra, scientific and statistical functions. BIDMat includes "tic" and "toc" for timing, and "flip" and "flop" for measuring floating-point performance.
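
As a small sketch of the timing calls (this cell is not in the original notebook; tic and toc also appear later in the random-walk example): tic resets the wall-clock timer and toc returns the elapsed time in seconds, while flip/gflop additionally report achieved gigaflops, as the next cell shows.

    tic                        // reset the wall-clock timer
    val c = a + a              // some work to time
    val t = toc                // elapsed seconds since tic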


In [4]:
flip; val b = a * a; val gf=gflop
print("The product took %4.2f seconds at %3.0f gflops" format (gf._2, gf._1))
gf


The product took 1.32 seconds at 104 gflops
Out[4]:
(104.35759,1.317)

GPU matrices behave very similarly.


In [5]:
val ga = grand(n,n)            // Another nxn random matrix



Out[5]:
   0.88383  0.027557   0.64643   0.67664   0.29091   0.60243   0.57059...
   0.81384   0.76326   0.57175   0.17944   0.11371   0.22653  0.088297...
   0.39205   0.46169   0.76375   0.85130   0.46558   0.56107   0.79527...
   0.11010   0.51376   0.27728   0.91142  0.023597   0.16610   0.30512...
   0.25042   0.69971   0.83499   0.46664   0.49516   0.57655   0.79105...
   0.48823   0.72891   0.98704   0.11257   0.99232   0.10887   0.19260...
   0.70476   0.50308   0.38125   0.80248   0.52444   0.42028   0.33877...
   0.18731   0.11773   0.84390   0.67353   0.73904   0.19960   0.32721...
        ..        ..        ..        ..        ..        ..        ..

In [6]:
flip; val gb = ga * ga; val gf=gflop
print("The product took %4.2f seconds at %3.0f gflops" format (gf._2, gf._1))
gf


The product took 0.04 seconds at 3124 gflops
Out[6]:
(3123.6125,0.044)

In [7]:
%type ga


BIDMat.GMat

But much of the power of BIDMat is that we don't have to worry about matrix types. Let's explore that with an example.

SVD (Singular Value Decomposition) on a Budget

Now let's try solving a real problem with this infrastructure: an approximate Singular-Value Decomposition (SVD) or PCA of a matrix $M$. We'll do this by computing the leading eigenvalues and eigenvectors of $MM^T$. The method we use is subspace iteration, which generalizes the power method for computing the largest-magnitude eigenvalue: it repeatedly multiplies a block of vectors by $MM^T$ and re-orthonormalizes the block with a QR decomposition. An eigenvector is a vector $v$ such that $$Mv = \lambda v$$ where $\lambda$ is a scalar called the eigenvalue.


In [8]:
def SVD(M:Mat, ndims:Int, niter:Int) = {
    var Q = M.zeros(M.nrows, ndims)                 // A block of ndims column vectors
    normrnd(0, 1, Q)                                // randomly initialize the vectors
    Mat.useCache = true                             // Turn matrix caching on
    
    for (i <- 0 until niter) {                      // Perform subspace iteration
        val P = (Q.t * M *^ M).t                    // Compute P = M * M^t * Q efficiently
        QRdecompt(P, Q, null)                       // QR-decomposition of P, saving Q
    }
    Mat.useCache = false                            // Turn caching off after the iteration
    val P = (Q.t * M *^ M).t                        // Compute P again.
    (Q, P ∙ Q)                                      // Return left singular vectors and singular values
}



Notice that the code above used only the "Mat" matrix type. If you examine the variables Q and P in a Scala IDE (Eclipse has one), you will find that they both also have type "Mat". Let's try it with an FMat (a CPU single-precision, dense matrix).

Movie Data Example

We load some data from the MovieLens project.


In [10]:
val ndims = 32                             // Number of PCA dimensions
val niter = 128                            // Number of iterations to do

val S = loadSMat("../data/movielens/train.smat.lz4")(0->10000,0->4000)
val M = full(S)                            // Convert it to a dense matrix

flip; 
val (svecs, svals) = SVD(M, ndims, niter); // Compute the singular vectors and values
val gf=gflop

print("The calculation took %4.2f seconds at %2.1f gflops" format (gf._2, gf._1))
svals.t


The calculation took 15.66 seconds at 42.3 gflops
Out[10]:
  1.1839e+06
  2.0694e+05
  1.5514e+05
       95136
       91670
       69232
       68042
       56050
          ..

Let's take a peek at the singular values on a plot


In [11]:
S.nnz



Out[11]:
450523

In [12]:
plot(svals)



Out[12]:

The values shrink a little too fast to see much detail. Let's look at them on a log-log plot instead:


In [13]:
loglog(row(1 to svals.length), svals)



Out[13]:

Now let's try it with a GPU, single-precision, dense matrix.


In [14]:
val G = GMat(M)                            // Try a dense GPU matrix

flip; 
val (svecs, svals) = SVD(G, ndims, niter); // Compute the singular vectors and values 
val gf=gflop

print("The calculation took %4.2f seconds at %2.1f gflops" format (gf._2, gf._1))
svals.t


The calculation took 0.86 seconds at 772.0 gflops
Out[14]:
  1.1839e+06
  2.0694e+05
  1.5514e+05
       95135
       91671
       69269
       68005
       56050
          ..

That's not bad: the GPU version was about 18x faster. Now let's try a sparse, CPU single-precision matrix. Note that our matrix was only about 1% dense anyway (about 450k non-zeros out of 40 million entries).

Sparse SVD


In [15]:
flip;                                       // Try a sparse CPU matrix

val (svecs, svals) = SVD(S, ndims, niter);  // Compute the singular vectors and values
val gf=gflop    

print("The calculation took %4.2f seconds at %2.1f gflops" format (gf._2, gf._1))
svals.t


The calculation took 6.57 seconds at 1.5 gflops
Out[15]:
  1.1839e+06
  2.0694e+05
  1.5514e+05
       95112
       91694
       69286
       67988
       56050
          ..

This next one is important. Dense matrix operations are the bread-and-butter of scientific computing, and now of most deep learning. But other machine learning tasks (logistic regression, SVMs, k-Means, topic models etc.) most commonly take sparse input data such as text, URLs, cookies etc., so performance on sparse matrix operations is critical.

GPU performance on sparse data, especially power-law data - which covers most of the cases above (the commercially important ones) - has historically been poor. But in fact GPU hardware supports extremely fast sparse operations when the kernels are carefully designed. Such kernels are currently only available in BIDMat. NVIDIA's sparse matrix kernels, which have been tuned for sparse scientific data, do not work well on power-law data.

In any case, let's try BIDMat's GPU sparse matrix type:


In [16]:
val GS = GSMat(S)                           // Try a sparse GPU matrix

flip;
val (svecs, svals) = SVD(GS, ndims, niter); // Compute the singular vectors and values
val gf=gflop

print("The calculation took %4.2f seconds at %2.1f gflops" format (gf._2, gf._1))
svals.t


The calculation took 0.37 seconds at 27.4 gflops
Out[16]:
  1.1839e+06
  2.0694e+05
  1.5514e+05
       95136
       91670
       69196
       68079
       56050
          ..

That's about an 18x improvement end-to-end, which is similar to the GPU's advantage on dense matrices. This result is certainly not specific to SVD, and is reproduced in most ML algorithms. So GPUs have a key role to play in general machine learning, and it's likely that at some point they will assume the same central role there that they currently enjoy in scientific computing and deep learning.

GPU Double Precision

One last performance issue: GPU hardware normally prioritizes single-precision floating point over double precision, and there is a big gap on dense matrix operations. But calculations on sparse data are memory-limited, and this largely masks the difference in arithmetic speed. Let's try a sparse, double-precision matrix, which will force all the calculations to double precision.


In [17]:
val GSD = GSDMat(GS)                             // Try a sparse, double GPU matrix

flip; 
val (svecs, svals) = SVD(GSD, ndims, niter); // Compute the singular vectors and values
val gf=gflop

print("The calculation took %4.2f seconds at %2.1f gflops" format (gf._2, gf._1))
svals.t


The calculation took 0.59 seconds at 17.0 gflops
Out[17]:
  1.1839e+06
  2.0694e+05
  1.5514e+05
       95135
       91671
       69311
       67964
       56050
          ..

That's noticeably slower (about 1.6x) than the single-precision GPU run, but still more than 10x faster than the CPU version running in single precision.

Using Cusparse

NVIDIA's cusparse library, which is optimized for scientific data, doesn't perform as well on power-law data.


In [18]:
def SVD(M:Mat, ndims:Int, niter:Int) = {
    var Q = M.zeros(M.nrows, ndims)
    normrnd(0, 1, Q)
    Mat.useCache = true     
    for (i <- 0 until niter) {                      // Perform subspace iteration
        val P = M * (M ^* Q)                        // Compute P = M * M^t * Q with cusparse
        QRdecompt(P, Q, null) 
    }
    Mat.useCache = false 
    val P = M * (M ^* Q)                            // Compute P again.
    (Q, getdiag(P ^* Q))                            // Left singular vectors and singular values
}




In [19]:
// Try sparse GPU matrix
flip; 
val (svecs, svals) = SVD(GS, ndims, niter); 
val gf=gflop

print("The calculation took %4.2f seconds at %2.1f gflops" format (gf._2, gf._1))
svals.t


The calculation took 2.43 seconds at 4.2 gflops
Out[19]:
1.1839e+06,2.0694e+05,1.5514e+05,95136,91670,68943,68331,56050,50187,39210,34314,31400,29052,26634,25330,23351,22124,22016,21393,20844,20219,19662,18437,17666,17517,16631,16258,15611,15526,15101,14755,14386

Unicode Math Operators, Functions and Variables

As well as the standard operators +, -, *, /, BIDMat includes several other important operators with their standard Unicode representation. Each has an ASCII alias in case Unicode input is difficult. Here they are:

Unicode operator    ASCII alias    Operation
================    ===========    =========
       ∘                *@         Element-wise (Hadamard) product
       ∙                dot        Column-wise dot product
       ∙→              dotr        Row-wise dot product
       ⊗               kron        Kronecker (Cartesian) product
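
The ASCII forms can be used wherever the Unicode ones appear. Here is a small sketch, assuming the aliases behave exactly as listed in the table above (u and v are new matrices introduced just for this sketch):

    val u = rand(4,4)
    val v = rand(4,4)
    u *@ v                     // element-wise product, same as u ∘ v
    u dot v                    // column-wise dot products, same as u ∙ v
    u dotr v                   // row-wise dot products, same as u ∙→ v
    u kron v                   // Kronecker product, same as u ⊗ v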

In [20]:
val a = ones(4,1) * row(1->5)



Out[20]:
   1   2   3   4
   1   2   3   4
   1   2   3   4
   1   2   3   4

In [21]:
val b = col(1->5) * ones(1,4)



Out[21]:
   1   1   1   1
   2   2   2   2
   3   3   3   3
   4   4   4   4

Hadamard (element-wise) multiply


In [22]:
b ∘ a



Out[22]:
   1   2   3   4
   2   4   6   8
   3   6   9  12
   4   8  12  16

Dot product, by default along columns


In [23]:
b ∙ a



Out[23]:
10,20,30,40

Dot product along rows


In [24]:
b ∙→ a



Out[24]:
  10
  20
  30
  40

Kronecker product


In [25]:
b ⊗ a



Out[25]:
   1   2   3   4   1   2   3   4   1   2   3   4   1   2   3   4
   1   2   3   4   1   2   3   4   1   2   3   4   1   2   3   4
   1   2   3   4   1   2   3   4   1   2   3   4   1   2   3   4
   1   2   3   4   1   2   3   4   1   2   3   4   1   2   3   4
   2   4   6   8   2   4   6   8   2   4   6   8   2   4   6   8
   2   4   6   8   2   4   6   8   2   4   6   8   2   4   6   8
   2   4   6   8   2   4   6   8   2   4   6   8   2   4   6   8
   2   4   6   8   2   4   6   8   2   4   6   8   2   4   6   8
  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..

As well as operators, functions in BIDMat can use Unicode characters, e.g. the Γ (Gamma) function:


In [26]:
val ii = row(1->10)



Out[26]:
1,2,3,4,5,6,7,8,9

In [27]:
ii on Γ(ii)                            // Stack this row on the results of a Gamma function applied to it



Out[27]:
      1      2      3      4      5      6      7      8      9
      1      1      2      6     24    120    720   5040  40320

You can certainly define new unicode operators:


In [28]:
def √(x:Mat) = sqrt(x)
def √(x:Double) = math.sqrt(x)
√(ii)



Out[28]:
1,1.4142,1.7321,2,2.2361,2.4495,2.6458,2.8284,3

and use as much Greek as you want:


In [29]:
val α = row(1->10)
val β = α + 2
val γ = β on Γ(β)



Out[29]:
       3       4       5       6       7       8       9      10...
       2       6      24     120     720    5040   40320  362880...

or English:


In [30]:
class NewMat(nr:Int, nc:Int, data0:Array[Float]) extends FMat(nr,nc,data0) {
  def quick(a:FMat) = this * a;
  def fox(a:FMat) = this + a;
  def over(a:FMat) = this - a;
  def lazzy(a:FMat) = this / a;   // "lazy" is a reserved word in Scala, hence "lazzy"
}

implicit def convNew(a:FMat):NewMat = new NewMat(a.nrows, a.ncols, a.data)

val n = 2;
val the = rand(n,n);
val brown = rand(n,n);
val jumps = rand(n,n);
val dog = rand(n,n);




In [31]:
the quick brown fox jumps over the lazzy dog



Out[31]:
  0.93592  0.50697
   2.7636  0.33282

Transposed Multiplies

Matrix multiply is the most expensive step in many calculations, and often involves transposed matrices. To speed up those calculations, we expose two operators that combine the transpose and multiply operations:

^*  - transposes the first argument, so a ^* b is equivalent to a.t * b
*^  - transposes the second argument, so a *^ b is equivalent to a * b.t

These operators are implemented natively, i.e. they do not actually perform a transpose, but compute the same result directly. This is particularly important for sparse matrices, since an explicit transpose would involve an index sort.


In [32]:
a ^* b



Out[32]:
  10  10  10  10
  20  20  20  20
  30  30  30  30
  40  40  40  40

In [33]:
a.t * b



Out[33]:
  10  10  10  10
  20  20  20  20
  30  30  30  30
  40  40  40  40

In [34]:
a *^ b



Out[34]:
  10  20  30  40
  10  20  30  40
  10  20  30  40
  10  20  30  40

In [35]:
a * b.t



Out[35]:
  10  20  30  40
  10  20  30  40
  10  20  30  40
  10  20  30  40

Highlights of the Scala Language

Scala is a remarkable language. It is an object-oriented language with similar semantics to Java, which it effectively extends. But it also has a particularly clean functional syntax for anonymous functions and closures.
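
For instance (a standalone Scala sketch, not taken from the rest of this notebook), an anonymous function that captures a variable from its enclosing scope forms a closure:

    val scale = 2.5f                                        // captured by the closure below
    val scaleAll = (xs: Array[Float]) => xs.map(_ * scale)  // anonymous function closing over scale
    scaleAll(Array(1f, 2f, 3f))                             // Array(2.5, 5.0, 7.5)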

It has a REPL (Read-Eval-Print-Loop) like Python, and can be used interactively or it can run scripts in or outside an interactive session.

Like Python, types are inferred from assignments rather than written out, but they are static rather than dynamic. So the language has the economy of Python, but the type-safety of a static language.
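
A tiny sketch (plain Scala, not from the original notebook) of what that means in practice:

    val x = 3.5f              // x's type is inferred statically as Float
    // val y: Int = x         // would be a compile-time error: a Float does not conform to Int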

Scala includes a tuple type for multiple-value returns, and on-the-fly data structuring.

Finally it has outstanding support for concurrency with parallel classes and an actor system called Akka.

Performance

First we examine the performance of Scala as a scientific language. Let's implement an example that has been widely used to illustrate the performance of the Julia language. It's a random walk, i.e. a 1D array with random steps from one element to the next.


In [36]:
import java.util.Random
val random = new Random()

def rwalk(m:FMat) = {
    val n = m.length
    m(0) = random.nextFloat
    var i = 1
    while (i < n) {
        m(i) = m(i-1) + random.nextFloat - 0.5f
        i += 1
    }
}




In [37]:
val n = 100000000
val a = zeros(n, 1)
tic; val x = rwalk(a); val t=toc
print("computed %2.1f million steps per second in %2.1f seconds" format (n/t/1e6f, t))


computed 71.5 million steps per second in 1.4 seconds

If we try the same calculation in the Julia language (a new language designed for scientific computing) and in Python we find that:

                  Scala     Julia     Python
  with rand       1.0s      0.43s     147s
  without rand    0.1s      0.26s     100s

Vectorized Operations

But does this matter? A random walk can be computed efficiently with vector operations: vectorized random number generation and a cumulative sum. And in general, most ML algorithms can be implemented efficiently with vector and matrix operations. Let's try it in BIDMat:


In [38]:
tic; rand(a); val b=cumsum(a-0.5f); val t=toc
print("computed %2.1f million steps per second in %2.1f seconds" format (n/t/1e6f, t))


computed 81.6 million steps per second in 1.2 seconds

That's a little better, due to the faster random number generation in the vectorized rand function. More interesting is the GPU running time:


In [39]:
val ga = GMat(a)
tic; rand(ga); val gb=cumsum(ga-0.5f); val t=toc
print("computed %2.1f million steps per second in %2.1f seconds" format (n/t/1e6f, t))


computed 1694.9 million steps per second in 0.1 seconds

If we run similar operators in Julia and Python we find:

                  BIDMach(CPU)    BIDMach(GPU)    Julia     Python
  with rand       0.6s            0.1s            0.44s     1.4s
  without rand    0.3s            0.05s           0.26s     0.5s

Vectorized operations level the playing field, bringing Python much closer to the other systems. On the other hand, GPU hardware maintains a near order-of-magnitude advantage for vector operations.

GPU Performance Summary

GPU-acceleration gives an order-of-magnitude speedup (or more) for the following operations:

  • Dense matrix multiply
  • Sparse matrix multiply
  • Vector operations and reductions
  • Random numbers and transcendental function evaluation
  • Sorting

So it's not just for scientific computing or deep learning, but for a much wider gamut of data processing and ML.

Tapping the Java Universe


In [40]:
<img style="width:4in" alt="NGC 4414 (NASA-med).jpg" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/NGC_4414_%28NASA-med%29.jpg/1200px-NGC_4414_%28NASA-med%29.jpg"/>



Out[40]:

Almost any piece of Java code can be used in Scala, and can therefore be used interactively.

There's very little work to do. You find a package and add it to your dependencies and then import as you would in Java.


In [41]:
import org.apache.commons.math3.stat.inference.TestUtils._



Apache Commons Math includes a statistics package with many useful functions and tests. Let's create two arrays of random data and compare them.


In [42]:
val x = normrnd(0,1,1,40)
val y = normrnd(0,1,1,40) + 0.5



Out[42]:
0.49470,-1.3179,1.3120,0.34983,-0.15448,0.18021,-0.0047791,-0.83169,1.0949,1.3314,3.5964,-0.0050649,-0.40160,-0.47461,-1.1233,0.24085,1.1717,0.37238,1.1787,1.7823,1.1191,-1.2020,1.3341,0.68787,1.4621,0.053709,0.044125,-0.50037,0.99586,-0.74199,-0.45141,0.19420,2.5512,-0.30597,0.56445,1.0749,0.22237,-0.74343,0.47367,-0.34668

BIDMat has enriched matrix types like FMat, SMat etc., while Apache Commons Math expects Java arrays of double-precision floats. To get these, we can convert an FMat to a DMat (double) and extract its data field, which contains the matrix's data.


In [43]:
val dx = DMat(x)
val dy = DMat(y)
tTest(dx.data, dy.data)



Out[43]:
0.2374389453681478

But rather than doing this conversion every time we want to use some BIDMat matrices, we can instruct Scala to do the work for us. We do this with an implicit conversion from FMat to Array[Double]. Simply defining this function will cause a coercion whenever we supply an FMat argument to a function that expects Array[Double].


In [44]:
implicit def fMatToDarray(a:FMat):Array[Double] = DMat(a).data



And magically we can perform t-Tests on BIDMat matrices as though they had known each other all along.


In [45]:
tTest(x, y)



Out[45]:
0.2374389453681478

And it's important to get your daily dose of beta:


In [46]:
import org.apache.commons.math3.distribution._

val betadist = new BetaDistribution(2,5)



Out[46]:
org.apache.commons.math3.distribution.BetaDistribution@2172a353

In [47]:
val n = 100000
val x = new DMat(1, n, (0 until n).map(x => betadist.sample).toArray); null



Out[47]:
null

In [48]:
hist(x, 100)



Out[48]:

Deconstruction


In [49]:
<image src="https://sketchesfromthealbum.files.wordpress.com/2015/01/jacquesderrida.jpg" style="width:4in"/>



Out[49]:

Let's make a raw Java Array of floats (holding integer values).


In [50]:
val i = row(0->10).data



Out[50]:
Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0)

First of all, Scala supports Tuple types for ad-hoc data structuring.


In [51]:
val j = i.map(x => (x, x*x))



Out[51]:
Array((0.0,0.0), (1.0,1.0), (2.0,4.0), (3.0,9.0), (4.0,16.0), (5.0,25.0), (6.0,36.0), (7.0,49.0), (8.0,64.0), (9.0,81.0))

We can also deconstruct tuples using Scala pattern matching:


In [52]:
j.map{case (x,y) => (y,x)}



Out[52]:
Array((0.0,0.0), (1.0,1.0), (4.0,2.0), (9.0,3.0), (16.0,4.0), (25.0,5.0), (36.0,6.0), (49.0,7.0), (64.0,8.0), (81.0,9.0))

And reduce operations can use deconstruction as well:


In [53]:
val k = j.reduce((ab,cd) => {val (a,b) = ab; val (c,d) = cd; (a+c, b+d)})



Out[53]:
(45.0,285.0)