We will start by adding $n$ processes with the function $\textit{addprocs(n)}$. When parallelizing, the work will be distributed among these processes.
Every process has its own identifier (id). A list of the ids of the available processes can be obtained with function $\textit{procs()}$, and the total number of available processes with function $\textit{nprocs()}$.
In [2]:
addprocs(5)
println(procs())
println(nprocs())
However, too many processes will imply a lot of communication between them, decreasing the performance. Function $\textit{rmprocs(v)}$ can be used to remove the processes whose ids are contained in $v$.
In [3]:
rmprocs(5:6)
Out[3]:
In [4]:
println(procs())
workers()
Out[4]:
The first example is tossing a coin and counting how many heads we get.
In [6]:
nheads = Int64(0);
@time for i=1:4e8
    nheads += Int(rand(Bool))
end
We can do better by putting the loop inside a function.
In [7]:
function count_heads(n::Int64)
    c::Int64 = 0
    for i=1:n
        c += Int(rand(Bool))
    end
    c
end
Out[7]:
In [9]:
@time count_heads(Int64(4e8))
Out[9]:
We can do even better if we parallelize the code. The $\textit{@parallel}$ macro distributes the work among the available processes, and the reduction function $(+)$ combines the partial results into a final value.
In [11]:
@time nheads = @parallel (+) for i=1:4e8
    Int(rand(Bool))
end
Out[11]:
As using functions improves the performance, we can try to combine them with $\textit{@parallel for}$.
Something to keep in mind is that every process has to be able to run the functions we need: every package should be loaded by all processes, and every function and type definition should be evaluated on all processes. For that purpose, we have the macro $\textit{@everywhere}$.
In [12]:
@everywhere function count_heads(n::Int64)
    c::Int64 = 0
    for i=1:n
        c += Int(rand(Bool))
    end
    c
end
In [14]:
@time nheads = @parallel (+) for i=1:4
    count_heads(Int64(1e8))
end
Out[14]:
We can also use function $\textit{pmap(f,v)}$ to apply function $f$ to all elements in $v$, using all processes in the same way as $\textit{@parallel}$.
$\textit{pmap}$ is intended to be used when $f$ is a complex function and $\textit{length(v)}$ is small, whereas $\textit{@parallel for}$ should be used when tasks are simple and the number of tasks is large.
In [15]:
v = rand(10)
a = pmap(sqrt, v)
println(a)
We can compare the performance of $\textit{pmap}$ with that of $\textit{@parallel for}$ combined with functions.
In [17]:
v = [Int64(1e8) for i in 1:4]
@time nheads = pmap(count_heads, v)
Out[17]:
If we wish to modify an array in parallel, every process should be able to access the array. For that reason, we have the type $\textit{SharedArray}$. It works in the same way as a regular $\textit{Array}$, but every process will have access to it.
In [19]:
workspace() # removes all variables
a = zeros(10)
@parallel for i=1:10    # each worker receives its own copy of a, so the local array is not modified
    a[i] = i
end
Out[19]:
In [20]:
fetch(a)
Out[20]:
In [21]:
workspace()
a = SharedArray(Float64,10)
@parallel for i=1:10    # every process writes into the same shared array
    a[i] = i
end
Out[21]:
In [22]:
println(typeof(a))
println(a)
println(a[5])
If we really need to recover the $\textit{Array}$ type, we can use the function $\textit{sdata}$.
In [23]:
b = sdata(a)
println(b)
println(typeof(b))
Sometimes, for better performance, it is good to know how $\textit{@parallel for}$ and $\textit{pmap}$ work. For example, if we want to distribute unpredictable or unbalanced tasks among all processes, we want to assign a new task to a process as soon as it finishes its previous one ($\textit{dynamic scheduling}$).
The function $\textit{remotecall(f, i, ...)}$ immediately schedules the call $f(...)$ on the process with id $i$, without waiting for it to finish. The macro $\textit{@spawnat}$ evaluates the expression given as its second argument on the process whose id is given as the first argument.
The result of a $\textit{remotecall}$ or a $\textit{@spawnat}$ is of type $\textit{Future}$; the full value of a $\textit{Future}$ can be obtained with the function $\textit{fetch}$.
In [24]:
workspace()
r = remotecall(rand, 2, 2, 2) # Parameters are: function to be called, process to be used,
# ... (extra parameters to be passed to function)
Out[24]:
In [25]:
s = @spawnat 3 1 .+ fetch(r) # Parameters are: process to be used, task to be run in the selected worker.
Out[25]:
In [26]:
fetch(s)
Out[26]:
The function $\textit{remotecall_fetch}$ yields the same result as $\textit{fetch(remotecall(...))}$, but it is more efficient.
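As a minimal sketch (assuming the worker with id 2 used above is still available), the earlier $\textit{remotecall}$/$\textit{fetch}$ pair can be combined into a single step:
m = remotecall_fetch(rand, 2, 2, 2) # run rand(2, 2) on process 2 and bring the result back in one call
println(m)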
Also, to make things easier, we can use the macro $\textit{@spawn}$. It works like $\textit{@spawnat}$, but it picks the process on which to evaluate the expression automatically.
We can also call the function $\textit{wait}$ on a returned $\textit{Future}$ to block until the remote call has finished, and then make decisions and continue the computation knowing that the task is done.
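A minimal sketch of $\textit{@spawn}$ and $\textit{wait}$; the spawned expression below is only an illustration:
r = @spawn rand(2, 2) # like @spawnat, but the worker is chosen for us automatically
wait(r)               # block until the remote computation has finished
println(fetch(r))     # the value is now available without further waiting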
The ultimate goal of parallelization is to obtain better performance from our code, so we cannot avoid a brief reference to performance tips needed to write truly optimal code.
Not only the time per run is important: a very large amount of memory allocation is a sign of non-optimal code. Several other tools can be used to explore the code and optimise it ($\textit{Profiling, ProfileView, @code_warntype, Lint, TypeCheck}$ and more).
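For instance, $\textit{@time}$ already reports how much memory an expression allocates, and the macro $\textit{@allocated}$ returns the number of bytes; the two functions below are hypothetical names used only for this sketch:
g_alloc() = sum([i for i in 1:10^6]) # builds a temporary array of 10^6 elements
g_noalloc() = sum(1:10^6)            # sums the range directly, with no temporary array
@time g_alloc()                      # reports a large allocation figure
@time g_noalloc()                    # reports (almost) no allocation
println(@allocated g_alloc())
println(@allocated g_noalloc())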
Julia is very flexible with variable types, as it can change the type of a variable to adapt to what we are doing (e.g. changing from an abstract type to $\textit{Float64}$ when computing). However, this flexibility comes at a runtime cost, so being consistent with our variable types will improve the performance of the code, as the sketch below illustrates.
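A minimal sketch of this idea using $\textit{@code_warntype}$ from the list above; $\textit{unstable}$ and $\textit{stable}$ are hypothetical names used only for this illustration:
function unstable(n)
    s = 0        # s starts as an Int ...
    for i=1:n
        s += i/2 # ... and becomes a Float64 inside the loop
    end
    s
end

function stable(n)
    s = 0.0      # s keeps a single concrete type throughout
    for i=1:n
        s += i/2
    end
    s
end

@code_warntype unstable(10) # the changing type of s is flagged
@code_warntype stable(10)   # every variable has one concrete type
@time unstable(Int64(1e8))
@time stable(Int64(1e8))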
Another thing to be aware of when coding is memory layout: if we plan to access every entry of a multidimensional array (say a matrix), the optimal way to do it is to respect the order in which the language stores the array in memory. Julia stores arrays column by column, so it is better to traverse a matrix by columns; different languages may have different conventions. A sketch comparing the two access orders follows.
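A minimal sketch comparing the two access orders; the matrix size and the function names are chosen only for illustration:
A = rand(5000, 5000)

function sum_by_rows(A)
    s = 0.0
    for i=1:size(A, 1), j=1:size(A, 2) # row by row: strided memory access
        s += A[i, j]
    end
    s
end

function sum_by_cols(A)
    s = 0.0
    for j=1:size(A, 2), i=1:size(A, 1) # column by column: contiguous memory access
        s += A[i, j]
    end
    s
end

@time sum_by_rows(A)
@time sum_by_cols(A)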