The purpose of these notes is to introduce R programmers to the Julia programming language.
Julia functions are sufficiently similar to R functions that most R programmers can read most Julia functions. They are sufficiently different that initial attempts at writing a Julia function can be frustrating.
One way of learning a programming language like Julia is to read programs written by programmers experienced in that language. One of the amazing aspects of Julia is that much of the Base system is written in Julia itself. One of the current (June, 2014) deficiencies of Julia is that the documentation, especially package documentation, is quite sparse and you often need to read the source of a function to understand its purpose.
In these notes I am assuming the use of julia-0.4.2 or later.
In [1]:
versioninfo()
Entering the name of a function in R returns the definition of the function. Entering the name of a function in Julia provides information on the number of methods defined for that generic. Remember that all Julia functions are generic functions.
In [2]:
mean
Out[2]:
The methods function gives the signatures of the methods for the generic.
In [3]:
methods(mean)
Out[3]:
In an IJulia notebook the names of the files are links to the files in the GitHub repository.
Prebuilt binary packages for various operating systems may have a copy of the base directory elsewhere. The Ubuntu binaries put it in /usr/share/julia/base
In addition to functions, Julia provides macros. Writing macros is not a priority when you start, but you should know about some of the macros that can be useful with regard to functions. A call to a macro always begins with the @ character, and parentheses around the argument list are optional.
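For example, the following two calls invoke the @time macro in equivalent ways:
@time sum(rand(1000))  # without parentheses
@time(sum(rand(1000))) # with parentheses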
For generics that have many methods (try methods(*)) it may not be obvious which method applies to a particular set of arguments. The @which macro tells you which method will be used.
In [4]:
@which mean([1:10;])
Out[4]:
In a REPL session the filenames will not be live links. You can use @less mean([1:10;]) to view the source file at the indicated location.
Two very useful macros for tuning the performance of your functions are @time and @profile. Recent versions of @time tell you the elapsed time, the total number of bytes allocated, and the proportion of the time spent in garbage collection.
Note that the number of bytes allocated does not measure the increase in memory used by the process. Often objects are allocated, populated, used, and freed within the course of a function's execution.
Also, bear in mind that very short execution times are quite variable.
In [5]:
mean(randn((25,1_000_000)),1) # 1,000,000 replications of a normal sample of size 25
Out[5]:
In [6]:
@time mn25n = mean(randn(25,1_000_000),1);
The first time a particular function signature is called it will be slower, often much slower, than subsequent times. This is because the first call to a signature can cause a lot of compilation of methods.
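A quick way to see this is to time the same call twice (a sketch; f here is a hypothetical function and the exact timings will vary):
f(x) = sum(abs(x))   # a new function, not yet compiled
@time f(randn(10^6)) # first call includes compilation of the method
@time f(randn(10^6)) # subsequent calls show the actual execution time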
Another useful macro is @profile, especially when paired with the view function in the ProfileView package.
In [7]:
@profile mean(randn(25,1_000_000),1);
In [8]:
#Pkg.add("ProfileView") # similar to install.packages() in R
using ProfileView # similar to library(ProfileView) in R
ProfileView.view()
Out[8]:
In this case the results aren't very interesting but in a REPL session the bars in the image are live links to lines in the source code.
Note that using one package may bring in others.
In [9]:
whos()
An example I like to use for function creation is simulation of the distribution of sample statistics. For creating samples from various distributions we will use the Distributions package. I was going to use the Gadfly package for visualization but it doesn't seem to be happy today.
In [10]:
# Pkg.add("Distributions") # done previously
using Distributions
In the last section, we simulated the means of samples of size 25 from a standard normal distribution. We know that the distribution of this sample statistic should be normal with mean zero and standard deviation 1/√25 = 0.2.
In [11]:
mean(mn25n) # expect a value close to zero
Out[11]:
In [12]:
std(mn25n) # expect a value close to 0.2
Out[12]:
The randn function for sampling from a standard normal distribution is in the Base module. For other distributions we use the Distributions package. Check the documentation for the Distributions package to see all the possibilities.
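For example, constructors for a few of the other distributions in that package look like this (just a sketch; the parameterizations are those used by the Distributions package):
Gamma(2.0, 3.0)   # shape 2, scale 3
Binomial(10, 0.4) # 10 trials, success probability 0.4
Exponential(1.5)  # scale (mean) 1.5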
To write a general function for simulating the distribution of sample statistics we need the distribution from which to sample, the sample size, the statistic to evaluate and the number of replications to perform. A distribution object encompasses both the distribution type and the parameters of the distribution.
In [13]:
typeof(Poisson)
Out[13]:
In [14]:
methods(Poisson)
Out[14]:
In [15]:
d = Poisson(5)
Out[15]:
A function to replicate N times the evaluation of the statistic, s, on a sample of size n from distribution d can be written
In [16]:
function sampstat(N::Integer,s::Function,n::Integer,d::Distribution)
    samp = rand(d,n)             # simulate the first sample
    v1 = s(samp)                 # evaluate the statistic
    res = Array(typeof(v1),(N,)) # create the result array
    res[1] = v1                  # set the first element
    for i in 2:N
        res[i] = s(rand(d,n))
    end
    res
end
Out[16]:
It might be good to step through the evaluation just to check that everything is okay.
In [17]:
N = 100; n = 20; s = median
Out[17]:
In [18]:
samp = rand(d,n);
samp'
Out[18]:
In [19]:
v1 = s(samp)
Out[19]:
In [20]:
res = Array(typeof(v1),N);
res[1] = v1
Out[20]:
In [21]:
for i in 2:N
    res[i] = s(rand(d,n))
end
In [22]:
res'
Out[22]:
Next we generate a large sample and check the timing.
In [23]:
@time med20p = sampstat(1_000_000,median,20,d);
In [24]:
med20p'
Out[24]:
We see that a large amount of storage is allocated and about 25% of the execution time is spent in garbage collection. Because the function is fairly simple we can guess that the allocation is taking place in the call to rand. For each of the 1,000,000 replications a new vector of 20 Int64 values is allocated, populated, passed to the function s, then released to be garbage collected later.
Julia allows for mutating functions that change the value of one or more of their arguments. This may seem heretical to R programmers but it can be very useful. Tuning the performance of long-running functions often comes down to finding out where temporary storage is being allocated and avoiding doing that.
By convention, the names of mutating functions end in !, so rand allocates new storage whereas rand! uses storage passed to it. Fortunately the vector samp has already been allocated and we can reuse it.
In [25]:
samp'
Out[25]:
In [26]:
rand!(d,samp)'
Out[26]:
In [27]:
samp'
Out[27]:
In [28]:
function sampstat1(N::Integer,s::Function,n::Integer,d::Distribution)
    samp = rand(d,n)             # simulate the first sample
    v1 = s(samp)                 # evaluate the statistic
    res = Array(typeof(v1),(N,)) # create the result array
    res[1] = v1                  # set the first element
    for i in 2:N
        res[i] = s(rand!(d,samp))
    end
    res
end
Out[28]:
In [29]:
@time med20p1 = sampstat1(1_000_000,s,n,d)'
Out[29]:
The reuse of the samp vector has helped to reduce the amount of storage allocated and the time spent in garbage collection, but not substantially. A less obvious source of allocation is the median function. The non-mutating version must copy the vector to do the (partial) sort used to evaluate the median. We don't need to preserve the sample because we are going to overwrite it in the next iteration anyway.
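A quick illustration of the difference between the two versions (a sketch; the exact values depend on the random seed):
v = rand(d, 9) # a small sample from the Poisson distribution d
median(v)      # works on a copy, leaving v unchanged
median!(v)     # same value, but partially sorts v in place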
In [30]:
@time med20p1 = sampstat1(1_000_000,median!,n,d)'
Out[30]:
We should, of course, check that the results are consistent by setting the random number generator seed and comparing the results from the two sampstat functions and the mutating and non-mutating median functions.
In [31]:
srand(1234321)
s1 = sampstat(100_000,median,n,d);
srand(1234321)
s2 = sampstat(100_000,median!,n,d);
srand(1234321)
s3 = sampstat1(100_000,median,n,d);
srand(1234321)
s4 = sampstat1(100_000,median!,n,d);
all(s1 .== s2 .== s3 .== s4)
Out[31]:
We have seen that a method is defined by a block of the form
function <functionname>(arg1::Type1,arg2::Type2,...,argn::Typen)
    ...
end
Be careful not to let your R instincts take over and write
functionname = function(arg1::Type1,...,argn::Typen)
    ...
end
That is not syntactically wrong, but it isn't what you want. (The gory details: it creates an anonymous function, which is generally less effective because of the way the compiler and type inference operate, and then assigns this anonymous function to a name.)
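A sketch of the contrast (sqr and sqr2 are hypothetical names; the one-line = form is described below):
sqr(x) = x * x    # defines a method for the generic function sqr -- preferred
sqr2 = x -> x * x # binds an anonymous function to a name -- works, but less idiomatic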
The type annotations are optional. Obviously when you are defining methods that are distinguished by the types of their arguments you use them. Other times you may want to use them to validate the arguments. In sampstat the argument s is applied as a function so it should be a function. The argument d is passed to rand or rand! and it could be something other than a Distribution, but we want a Distribution here. It wouldn't make sense for n and N to be anything other than integers as they represent sizes. The Integer type is an abstract type with several subtypes.
In [32]:
typeof(Integer)
Out[32]:
In [33]:
subtypes(Integer)
Out[33]:
Those coming to Julia with a background in strongly typed languages (C, C++, Java) tend to overspecify the argument types. Generally I use argument type declarations when I want to distinguish methods and to validate the argument types. Frequently in R I will begin a function with
f <- function(a, b, c) {
    stopifnot(is.integer(a), length(a) == 1L, ...)
    ...
}
to validate the types of the actual arguments. In Julia I specify the types that I want in the argument list. If other types are passed, the evaluation fails because no appropriate method is found.
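For example, calling sampstat with a floating-point value for N fails at dispatch (illustrative):
sampstat(1000.0, median, 20, d) # throws a MethodError: Float64 is not an Integer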
The value of a function evaluation is the last object evaluated. There is also a return directive, which can be useful in short-circuiting evaluation.
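A small sketch of using return to short-circuit a loop (firstneg is a hypothetical function):
function firstneg(v)
    for (i, x) in enumerate(v)
        x < 0 && return i # return immediately with the index of the first negative element
    end
    0                     # fall through: no negative elements found
end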
A method that consists of a single expression can be written with an = sign. For example, the insupport generic checks whether a value x is in the support of a distribution d. The method for the Binomial type is
insupport(d::Binomial, x::Real) = isinteger(x) && 0 <= x <= d.size
If we want to check whether all of a vector's elements are in the support we could write
In [34]:
function vsupport(d::Binomial,v::Vector)
    for el in v
        insupport(d, el) || return false
    end
    true
end
Out[34]:
Doing things this way means that if you have a million values to check and the second one fails, you don't bother checking the rest.
A short anonymous function can be written with the "stabby lambda" syntax. To apply the logit transform to a vector of values in (0,1) we could write
In [35]:
μ = rand(10);
η = map(p -> log(p/(1-p)), μ);
using DataFrames
DataFrame(μ = μ, η = η)
Out[35]:
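As a quick check (a sketch using the μ and η just computed) we can invert the transform and compare against μ:
μ2 = map(x -> 1/(1 + exp(-x)), η) # inverse logit
maximum(abs(μ2 - μ))              # expect a value near zero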
Many of the reduction functions (any, all, sum, prod, etc.) can take a function as the first argument. In that form any and all have the short-circuiting behavior, so the vsupport function could be written
In [36]:
vsupport(d::Distribution, v) = all(x -> insupport(d, x), v)
Out[36]: