BIDMach builds on the Scala language and follows a functional + object-oriented programming style. It adds many functions and operators, however, and is best considered a DSL (Domain-Specific Language).

These functions and operators are all in the (default) global scope and provide an experience similar to Matlab or Julia.

In this cell we are in a standard Scala environment. To start BIDMach, we import its classes and test for native code linkage with these statements:

In [ ]:
```
import BIDMat.{CMat,CSMat,DMat,Dict,IDict,FMat,GMat,GIMat,GSMat,HMat,IMat,Mat,SMat,SBMat,SDMat}
import BIDMat.MatFunctions._
import BIDMat.SciFunctions._
import BIDMat.Solvers._
import BIDMat.Plotting._
import BIDMach.Learner
import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,NMF,SFA}
import BIDMach.datasources.{MatDS,FilesDS,SFilesDS}
import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}
import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}
import BIDMach.causal.{IPTW}
Mat.checkMKL
Mat.checkCUDA
if (Mat.hasCUDA > 0) GPUmem
```

Don't worry if it says "Couldn't load CUDA runtime": if you don't have a GPU or the CUDA toolkit installed on your machine, that is normal.

If you do have a GPU and CUDA, GPUmem will print out the fraction of free memory, the absolute free memory, and the total memory for the default GPU.

In [ ]:
```
val a = ones(4,4) // ones creates a 4x4 Float matrix (FMat)
```

In [ ]:
```
%type a
```

You can create row and column matrices (FMat) by listing their elements:

In [ ]:
```
row(2,2,4,4,5,5)
```

In [ ]:
```
col(5,4,3,2)
```

*or* ranges:

In [ ]:
```
val r = irow(0 until 10) // until gives a range excluding the last element
```

In [ ]:
```
%type r
```

In [ ]:
```
icol(0 to 2) // to gives a range including the last element
```

You can create a matrix of sequential integers like this:

In [ ]:
```
val b = izeros(4,4) // An integer matrix this time, filled with zeros
b(?) = icol(0 until 16) // Now treat it as a 1D array and fill it with a range
```

The question mark ? is BIDMach's wildcard character. Even though b is two-dimensional, b(?) linearizes its contents into a 16-element column and puts the RHS into it.

The RHS should be another 16x1 integer matrix (IMat), but when supplied with a range (0 until 16), BIDMach automatically casts the range to an IMat. This is called an implicit conversion in Scala.

From the order of elements in the array after the assignment, you can see that BIDMach uses *column-major order*. This is the same as Matlab, Fortran and Julia, but different from C and Python, which are row-major.
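To make column-major order concrete, here is a plain-Scala sketch of the index arithmetic (the helper name `columnMajorIndex` is ours for illustration, not part of BIDMach):

```scala
// In column-major order, element (i, j) of an nrows x ncols matrix is stored
// at linear index i + j * nrows. (Row-major languages use i * ncols + j.)
def columnMajorIndex(i: Int, j: Int, nrows: Int): Int = i + j * nrows

// In the 4x4 matrix b above, b(1,2) is b(9) as a single index: 1 + 2*4 = 9.
```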

Transpose is implemented with a "t" operator:

In [ ]:
```
val bt = b.t
```

Math operators have their expected results:

In [ ]:
```
val c = a + b
```

In [ ]:
```
%type c // We added an integer matrix (IMat) and a float matrix (FMat), so what type is the result?
```

BIDMach implicitly casts IMats to FMats to perform algebraic operations.

In [ ]:
```
b - a
```

In [ ]:
```
a * b // Matrix multiply (not element-wise multiply)
```

In [ ]:
```
b / a // This is element-wise division; some toolkits instead multiply by the inverse of a.
```

As well as these operators, BIDMach includes several other important operators with their standard Unicode representation. Each has an ASCII alias in case Unicode input is difficult. Here they are:

| Unicode operator | ASCII alias | Operation |
| --- | --- | --- |
| ∘ | *@ | Element-wise (Hadamard) product |
| ∙ | dot | Column-wise dot product |
| ∙→ | dotr | Row-wise dot product |
| ⊗ | kron | Kronecker (Cartesian) product |

In [ ]:
```
b ∘ a
```

In [ ]:
```
b ∙ a
```

In [ ]:
```
b ∙→ a
```

In [ ]:
```
b ⊗ a
```


Matrix multiply is the most expensive step in many calculations, and often involves transposed matrices. To speed up those calculations, BIDMach exposes two operators that combine the transpose and multiply operations:

- `^*` - transpose the first argument, so `a ^* b` is equivalent to `a.t * b`
- `*^` - transpose the second argument, so `a *^ b` is equivalent to `a * b.t`

These operators are implemented natively, i.e. they do not actually perform transposes, but implement the effective calculation directly. This is particularly important for sparse matrices, since a transpose would involve an index sort.

In [ ]:
```
a ^* b
```

In [ ]:
```
a.t * b
```

In [ ]:
```
b *^ a
```

In [ ]:
```
b * a.t
```

Scalars (primitive numerical values) can be used in most expressions:

In [ ]:
```
a + 1 // Add an integer, result is still an FMat
```

In [ ]:
```
val aa = 3.5 * a // Multiply by a double, result is still an FMat
```

In [ ]:
```
%type aa
```

**Note** In Scala, floating point numbers like 3.5 have double precision by default. Many languages would cast the last result to a double matrix, since Double is the smallest container for both Float and Double data. We argue that when someone writes 3.5 * a, they mean to scale the matrix a by that factor and preserve its type, not to cause a type conversion. Single-precision constants in Scala need an "f" suffix, i.e.

In [ ]:
```
a + 2.7f
```

If you encounter double matrices in a calculation without meaning to, it may be because of operations with double-precision constants. Use the floating point notation above ("f" suffix) for scalars to minimize the chances of unintentional up-conversion to double matrices.
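The underlying literal types can be checked in plain Scala, independently of BIDMach:

```scala
// Floating-point literals default to Double; the "f" suffix gives a Float.
val d = 3.5   // a Double
val f = 2.7f  // a Float
```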

For the next step, we will need a floating point version of the matrix b, which we can construct like this:

In [ ]:
```
val c = FMat(b)
```

In [ ]:
```
// val x =
// val y =
// plot(x,y)
```

In [ ]:
```
val v = col(1,2,3,4)
```

In [ ]:
```
v ∘ a
```

In [ ]:
```
v.t ∘ a
```

Edge operators arise in several contexts, e.g. normalizing a matrix along rows or columns.

The sum() function computes the sums of the columns of c and returns a row vector:

In [ ]:
```
sum(c)
```

We can use this to normalize c along its columns:

In [ ]:
```
val d = c / sum(c)
```

In [ ]:
```
sum(d)
```

or along its rows:

In [ ]:
```
val e = c / sum(c,2)
```

In [ ]:
```
sum(e,2)
```

In [ ]:
```
//val cn =
```

In [ ]:
```
//cn ∙ cn
```

You access individual array elements using parentheses (unlike the square brackets [] used in Python):

In [ ]:
```
a(1,1) = 2
```

In [ ]:
```
a
```

In [ ]:
```
a(0,3)
```

You can use the wildcard ? to access rows and columns:

In [ ]:
```
b
```

In [ ]:
```
b(?,1)
```

In [ ]:
```
b(2,?)
```

Ranges work as expected:

In [ ]:
```
b(1->3, 1->3)
```

And you can use arbitrary integer vectors to access submatrices:

In [ ]:
```
b(icol(0,1), icol(0,1,3))
```

Another shorthand constructor for integer matrices is the backslash operator:

In [ ]:
```
val ii = 0\1\3
```

In [ ]:
```
%type ii
```

This syntax is handy in indexing expressions:

In [ ]:
```
b(0\1, 0\1\3)
```

Slices can be used for assignment:

In [ ]:
```
b(1, 0\1\3) = 0\0\0
b
```

and you can use scalars on the RHS to simplify bulk assignments

In [ ]:
```
b(0\1, 0\1\3) = -1
b
```

Matrices also accept single indices that reference elements in column-major order:

In [ ]:
```
b(7)
```

In [ ]:
```
b(0->16) = (16 to 1 by -1)
```

TODO: Define a set of indices ii such that for any 4x4 matrix m, m(ii) = m.t.

HINT: you already computed it!

In [ ]:
```
val m = rand(4,4)
```

In [ ]:
```
// val ii =
```

In [ ]:
```
// m(ii)
```

Earlier we used the `sum()` function. Two other important reducers are maxi() and mini(). These compute the max or min respectively, along columns (default) or rows, e.g.

In [ ]:
```
val x = rand(4,5)
```

In [ ]:
```
val xm = maxi(x)
```

In [ ]:
```
val xm2 = maxi(x,2)
```

Often you want not just the max or min value, but its *index*. The functions `maxi2` and `mini2` do this. They have a "2" suffix to indicate that they return 2 values. The first is the max or min *value*, the second is the max or min *index*:

In [ ]:
```
val (vmax, imax) = maxi2(x)
vmax
```

In [ ]:
```
imax
```

In [ ]:
```
val (vmin, imin) = mini2(x,2)
vmin
```

In [ ]:
```
imin
```

A last important reducer is `accum`, which is similar to Matlab's `accumarray` (the closest NumPy analogue is `np.add.at`). It allows you to tally values into specific positions in an output array. The format is:

`accum(inds, vals, nrows, ncols)`

where inds is an nx2 matrix (IMat) of (row, column) indices, vals holds the values to sum there, and nrows and ncols are the output matrix dimensions. It's easiest to see with an example:

In [ ]:
```
val inds = 0\0 on 0\1 on 2\3 on 0\1 on 3\3
```

In [ ]:
```
val vals = col(1f, 2f, 3f, 4f, 5f)
```

In [ ]:
```
accum(inds, vals, 4, 5)
```

Each value in `vals` was saved at the position specified by the corresponding row of `inds`. Most of the target locations were distinct, except for the second and fourth rows, which specified the same location. Those two values (2f and 4f) were summed in that location.

The `find` function is similar to Matlab's find and Numpy's `nonzero` function. It comes in several flavors, depending on how many values are returned:

val ii = find(m) // find and return the single indices of non-zero elements of m
val (ii, jj) = find2(m) // find and return the (row, column) indices of non-zeros in m
val (ii, jj, vv) = find3(m) // find and return the (row, column) indices and values of non-zeros of m
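As a quick sketch (assuming the BIDMach imports at the top of this notebook), applied to a small integer matrix:

```
val s = 0\2 on 1\0            // a 2x2 IMat with two non-zero elements
val ii = find(s)              // single (column-major) indices of the non-zeros
val (jr, jc, vv) = find3(s)   // their rows, columns and values
```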

In [ ]:
```
val rr = rand(4,5)
```

In [ ]:
```
// deconstruct rr
// val rr2 = // Rebuild it
```

We used ranges before. There are two flavors: closed or "to" ranges, and open or "until" ranges.

In [ ]:
```
0 to 5
```

In [ ]:
```
0 until 5
```

For loops use ranges in a natural way:

In [ ]:
```
for (i <- 0 until 5) {
  println("run number %d" format i)
}
```

While loops provide no special loop-variable management, simply a test:

In [ ]:
```
var i = 6
while (i > 0) {
  println("counting down %d" format i)
  i -= 1
}
```

Functional programming in BIDMach (or Numpy or Matlab) avoids explicit iteration over the elements of matrices and concentrates instead on whole-array operations and (if irregular access is needed) on manipulation of index matrices. It's not unlike the use of global operations on DataFrames.

This approach allows highly-parallelized code to be used to implement these routines. It often makes for more succinct code and (with some practice), greater readability. We'll concentrate on applying those ideas in the next part of the Lab.
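For instance, an element-wise threshold that would be a double loop in C can be written entirely with whole-array operations and an index matrix. A sketch in BIDMach style, assuming the imports above and that comparison operators return 0/1 matrices as in Matlab:

```
val a = rand(4,4)
val mask = a > 0.5f   // element-wise comparison, yielding a 0/1 matrix
val ii = find(mask)   // indices of the elements above the threshold
a(ii) = 0f            // bulk assignment through the index matrix
```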

Yes, it's not only fun to have a fast toolkit; it really matters for performance. And not just running time: most algorithms can trade time for precision by simply training more thoroughly, or training a richer model.

BIDMat/BIDMach is currently the only system which fully integrates GPU computing with CPU computing. It's also the only system to have fully rooflined sparse matrix primitives. This is very important, since these are the bottleneck for machine learning on the most common types of data (text, web, clickthrough etc.). Let's measure exactly how much difference this makes.

First, we'll define a few matrices, both on the CPU and on a GPU (assuming you have one).

In [ ]:
```
val n = 8192
val a = rand(n,n) // a random dense matrix (CPU)
val b = powrand(n,n,100) // a random power-law sparse matrix with 100 elements per column (CPU)
val ga = GMat(a) // a GPU version of a
val gb = GSMat(b) // a GPU version of b
```

In [ ]:
```
var ma:Mat = a // create a generic Mat variable and bind it to a
var mb:Mat = b // create a generic Mat variable and bind it to b
var mc:Mat = null // we'll use this to hold results
```

Now we'll benchmark both dense and sparse matrix multiply. Dense first.

In [ ]:
```
flip
mc = ma * ma
gflop
```

In [ ]:
```
flip
mc = ma * mb
gflop
```

Now let's bind those variables to GPU matrices instead:

In [ ]:
```
ma = ga
mb = gb
```

and run exactly the same code:

In [ ]:
```
flip
mc = ma * ma
gflop
```

You'll probably see a good order-of-magnitude speedup over the CPU calculation. This shouldn't be surprising: GPUs have a well-earned reputation for dense-matrix performance.

What's less well-known is their sparse matrix performance, which yields roughly order-of-magnitude gains as well.

In [ ]:
```
flip
mc = ma *^ mb
gflop
```

You should see performance in the 30-40 gflops range, which is the roofline for sparse operations on the current generation of Nvidia GPUs.

This is very important, because that operation (and two other variants with similar performance) is the dominant step in most common machine learning algorithms. With careful design of the entire learning pipeline, you can translate that advantage into an end-to-end speedup by the same factor.

Furthermore, by writing generic code (using the Mat class as above) you can hide the details of implementation (CPU vs GPU) and run your algorithm in either environment. You can also support either sparse or dense matrices, and many of BIDMach's learning algorithms will work with either.
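A minimal sketch of that style, assuming the generic operators and reducers work on `Mat` as they do on the concrete types (the function name `normalizeColumns` is ours for illustration):

```
def normalizeColumns(m:Mat):Mat = m / sum(m)  // column-normalize any Mat

normalizeColumns(ma)  // runs on CPU or GPU depending on what ma is bound to
```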

**Note** the prefix transpose operator `^*` is not needed in most machine learning algorithms and doesn't have a fast implementation.