Scala Overview

BIDMach is based on the Scala language, a very powerful language with a flexible syntax and produces code that runs on the Java virtual machine (JVM). Some quick features of Scala:

It includes many features familiar to Python programmers: anonymous functions, both functional and object-oriented styles, and a rich set of libraries of common data structures.
It has a REPL (Read-Eval-Print Loop), so that code can be typed directly at a command prompt and executed. You can also run scripts from a file.
It has an open syntax of operators, and new operators can be defined by simply defining a class method with the same name. These operators can use unicode characters, so many common math operators can be defined using their conventional symbol, e.g. ∘, ∙, ⊗
It can load and call virtually any existing java class and methods, so there is a very large pre-existing library for new Scala code.
It has very good support for concurrent programming, included an "Actor" framework.
Performance is very good, comparable with C code for many operations. You can also call native C code (through Java) if needed.

For this tutorial, we'll work through some basic language constructs in Scala. A lot will be left out, but our focus is on the kind of programming you might do for data analysis.

Functional Programming in Scala

First, lets try some functional code in Scala.



In [ ]:

    
def fib(n:Int):Int = if (n <= 2) 1 else fib(n-1) + fib(n-2)

As you can see the syntax is similar to many languages. One significant difference is that variable types are written as qualifiers after the variable name, e.g. n:Int</code>, and the function return value as fib(n:Int):Int.

Instead of declaring variables with their type, you preface each declaration with val or var. A val is a value (not a variable) which is constant in its current scope. A var is a variable that can be changed.

Try running this function and assigned the return value to v:



In [ ]:

    
val v = fib(40)

Assigning a new value to this symbol will result in an error:



In [ ]:

    
v = fib(20)

Variable declarations allow such changes:



In [ ]:

    
var u = fib(20)
u = fib(30)

Notice that we didnt specify a type for u or v. But in fact the compiler can figure out the type of both. The %type magic function in IScala lets us check



In [ ]:

    
%type v



In [ ]:

    
%type u

Since the tree of fib calls is a binary tree with leaves with value 1, the total number of calls is exactly 2 * fib(n) - 1, which allows us to evaluate the speed of the function: tic starts a seconds timer, while toc returns its value. (The BIDMat.MatFunctions class contains the timers)



In [ ]:

    
import BIDMat.MatFunctions._
tic
val n=fib(44)
val t=toc
println("time = %f secs, speed = %f million function calls per second" format (t, (2*n-1)/t/1e6))

TODO: Try writing and running a factorial function on your own:



In [ ]:

    
def fact(n:Int):Int

Control Flow

We used if/else syntax in an intuitive way in the fib() function. What was perhaps not completely clear was that this is a functional form of if statement. That it, it returns a value which is the result of either the first (then) statement or the second (else). If the return value is needed, it is illegal to skip the else clause.

Very often, the code in a then or else clause will be more than one statement, in which case we need to enclose them in Braces {}. The value returned by a set of Braces is the value of the last line in the enclosed statements. e.g.



In [ ]:

    
def test(a:Int):Int = {
    if (a < 0) {
        val b = a*a;
        2 * a;
    } else {
        4 * a;
    }
}
val x = (test(-4), test(3))

Notice that we made two function calls enclosed in a kind of list enclosed in parentheses. That's because cell evaluation, just like any code block in Scala, returns only the last statement value, but we wanted to see two values. If we check the type of this object we see:



In [ ]:

    
%type x

This is a tuple of two Int types. Its is more primitive than a list, and includes special compiler support for type checking. And e.g. it allows functions to return multiple values of different types while still checking them, which is very powerful.

Write a function sincos(x) that returns both the sin and cos of its argument. The sin and cos functions are in the math package.



In [ ]:

    
def sincos(x:Double):(Double, Double)

For and While Loops

Scala includes a functional for loop, which you can use for C-style iteration:



In [ ]:

    
def fact(n:Int):Int = {
    var p = 1;
    for (i <- 2 to n) {
        p *= i;
    }
    p
}
fact(5)

While loops have a familiar syntax:



In [ ]:

    
def fact2(n:Int):Int = {
    var p = 1;
    var i = n;
    while (i > 0) {
        p *= i
        i -= 1
    }
    p
}
fact2(5)

Although for and while look very similar they are very different under the hood. While is implemented by the compiler with simple iteration and conditional jumps and is very fast. For creates a lexical context for each iteration, will be much slower if variables outside the context are used. For performance code, you want to use while loops.



In [ ]:

    
import BIDMat.MatFunctions._

val n = 100000000
var i = 0;
tic;
while (i < n) {
    i += 1
}
val t = toc
println("t = %f secs, speed = %f million loops/sec" format (t, n/t/1e6))



In [ ]:

    
val n2 = 1000000;
var m = 0;
tic;
for (i <- 0 until n2) {
    m += 1
}
val t2 = toc
println("t = %f secs, speed = %f million loops/sec" format (t2, n2/t2/1e6))

Lists

You can either define lists up front to use them, or construct them on the fly.



In [ ]:

    
val ll = List(1,2,3)



In [ ]:

    
%type ll



In [ ]:

    
val ll2 = 4 :: 5 :: 6 :: Nil  // :: is a "cons" operator



In [ ]:

    
0 :: ll     // You can also use :: to append to the front of a existing list in constant time



In [ ]:

    
ll :+ 4     // You can use :+ to append to the tail of a list, but this is much slower O(n)

Lists are immutable in Scala, so you cannot actually modify them. You can access an element with () but not modify it:



In [ ]:

    
ll(2)



In [ ]:

    
ll(2) = 4



In [ ]:

    
val ll3 = ll ++ ll2     // ++ is the operator to concatenate two lists

Lists, like other collections, support a variety of map and reduce functions. e.g. try



In [ ]:

    
def square(v:Int):Int = v*v
ll3.map(square)

Small, local utility functions are used a lot in Scala, and it has a simple syntax for creating anonymous functions. To define the square function on the fly, we could do:



In [ ]:

    
ll3.map((x:Int) => x * x)

And for a reducer, we just need to supply a function with two arguments:



In [ ]:

    
ll3.reduce((x:Int, y:Int) => x + y)

The vars x, y are repeated in the same order in the function parameter list and in its body, and that pattern occurs a lot in short utility functions. Scala includes special syntax for it. The underscore character "_" represents both the function parameter, and the value, in the order it appears. In other words

_op_ is equivalent to (x:Type, y:Type) => x op y

So we can write the sum reducer as:



In [ ]:

    
ll3.reduce(_+_)   // the first "_" holds the first function parameter, while the second "_" holds the second.

The fold functions foldLeft and foldRight are like reduce, except they include an explicit start value. To copy a list we can do:



In [ ]:

    
ll3.foldRight(List[Int]())(_::_)

TODO: Write a similar expression below using foldRight that reverses the list. Hint: you can't use _ arguments because you need to switch the order that the arguments appear when you concatenate them.



In [ ]:

Ranges

There are special operators "to" and "until" which construct ranges of consecutive integers for use in functional "loops". "to" gives a sequence up to an including the last value, "until" discards the last value (which is what you want for C-style for loops).



In [ ]:

    
1 to 5



In [ ]:

    
(1 to 5).map(square)



In [ ]:

    
(1 to 5).reduce(_+_)

which gives us a functional way to compute factorial.

TODO: Write a functional factorial (with neither loops nor recursion) in the cell below:



In [ ]:

    
def factf(n:Int):Int

Arrays

Arrays are the basic building block for numeric computation. In Scala they are mutable data structure of fixed size with constaint time indexed access.



In [ ]:

    
val arr = Array(1,2,3,4)

Arrays dont seem to print correctly in IScala. To see an arrays contents, you can convert it to a List or a Vector



In [ ]:

    
arr.toVector

You access elements of an array using (indx) instead of [indx] as in other languages.



In [ ]:

    
arr(2)

Since arrays are mutable, you can change their values:



In [ ]:

    
arr(2) = 7



In [ ]:

    
arr.toVector

Methods like map, reduce, fold are all applicable to arrays. You can convert from most other sequence-like structure to an array with the toArray method.



In [ ]:

    
val ar = (1 to 10).toArray
ar.toVector



In [ ]:

    
ar.reduce(_*_)

You can add the contents of two same-sized arrays by writing a while loop, but a more fucntional approach is to use the "zip" method which makes an array of pairs of values (2-tuples). With this array, you can use the sum function to add the arrays.



In [ ]:

    
val br = Array(4,4,5,6,3,3,4,4,2,2)
val ab = ar.zip(br)
val absum = ab.map(x => x._1 + x._2)  // x is a 2-tuple, you need to extract its parts with _1 and _2 methods.
absum.toVector

Scala's functional syntax is very powerful and succinct, but generally not fast (the vector addition routine above is 100x slower than a while-loop-based version). BIDMach layers a different functional metaphor on top of Scala's arrays which is both fast and simple (similar to standard math syntax). We'll explore that shortly.

Dictionaries

Like Python, Scala has a flexible Dictionary class. Actually there are several which use different internal representations. The HashMap will amost always be what we want.



In [ ]:

    
import scala.collection.mutable.HashMap
val a = HashMap[String,Int]()



In [ ]:

    
a("end") = 20
a("begin") = 40
a("up") = 100
a("down") = 1000
a

You can iterate through the map using a for loop to retrieve all the tuples:



In [ ]:

    
for (i <- a) println("key %s, value %d" format (i._1, i._2))

TODO: Write a function that computes and returns the (key,value) pair with the smallest value:



In [ ]: