We use multiple processors. I am working with an 8-core desktop, so I can add 7 processors ("workers") in addition to the main one. You won't get an error message if you try to add more (unless you add far too many), but doing so will not improve performance further: it does not magically create cores.
If you are using a PC, you can find its number of cores by pressing Ctrl+Alt+Delete, opening the Task Manager, and counting the windows below the label "CPU Usage History" (which also shows the activity of your processors). On a Mac, open the terminal and type sysctl -n hw.ncpu. This count includes so-called "multi/hyper-threaded cores" (on Intel processors), which are multiple logical cores created out of a single physical core.
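You can also query the number of logical cores from within Julia itself. A minimal sketch, assuming a Julia version contemporary with this code (the constant was later renamed Sys.CPU_THREADS):

## Number of logical cores detected by Julia on this machine
println("Number of logical cores: ", Sys.CPU_CORES)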
Whenever you intend to use parallel programming in Julia, you should start the file containing the main code by adding additional processors, called "workers".
In [1]:
addprocs(7)
println("Number of processors:"); println(nprocs())
Now let us look at how parallelism works in Julia.
The function remotecall(function, processor_nb, vararg) executes the function function on processor number processor_nb, where vararg are the arguments of the function. Let us for example create a $(2\times 2)$ matrix of uniformly distributed numbers on processor number 2. By default, processor 1 is the main processor from which messages are sent to launch and coordinate jobs on the other processors.
In [2]:
r = remotecall(rand, 2, 2, 2)
Out[2]:
remotecall creates a remote reference RemoteRef, which points to the matrix we generated on processor 2.
The call returns immediately on processor 1. However, since computations run asynchronously across processors, processor 2 may not yet have completed the job assigned by processor 1. The function isready, which takes a remote reference as input, returns true once the job is done. This is useful for conditional statements in the main code.
The type of the object r we created is a RemoteRef, so using r directly in the main code on processor 1 won't give us the matrix. We fetch the actual object the RemoteRef points to with the command fetch.
In [3]:
println("Did processor 2 finish the computations?"); println(isready(r))
println("What is r's type?"); println(r)
println("What is r?"); println(fetch(r))
The syntax and the back-and-forth between remote calls and remote references become somewhat cumbersome in large and complex codes. This is where macros come in handy. They are placed at the start of the command to be executed.
@spawn runs the code on some worker, whose ID is unknown to us at the time we run the command. @spawnat, followed by a worker's ID, allows us to choose the worker on which the command is run. The syntax of the former is lighter and should generally be preferred.
Here we add 1 to each of the entries of our matrix r. Note that we still need to fetch r, because it is defined on processor 2 and we don't know on which processor @spawn will run the code.
In [6]:
s = @spawn 1 .+ fetch(r)
println("Did the processor finish the computations?"); println(isready(s))
println("What is s's type and on which processor was it computed?"); println(s)
println("What is s?"); println(fetch(s))
s2 = @spawnat 3 1 .+ fetch(r)
println("Did the processor finish the computations?"); println(isready(s2))
println("What is s's type and on which processor was it computed?"); println(s2)
println("What is s?"); println(fetch(s2))
Another extremely useful macro is @everywhere. As we saw, an object created on one processor is not available on the others unless we fetch it. For instance, if we create a random matrix $B$ on processor 3, we won't be able to invert it on processor 5. Defining this matrix @everywhere makes it available to all processors, each of which can read and modify its own copy (the expression is evaluated separately on every processor, so each one gets its own random draw).
In [7]:
B = @spawnat 3 rand(10,10)
i = remotecall(inv,5,B)
fetch(i)
In [8]:
@everywhere B = rand(10,10)
i = remotecall(inv,5,B)
fetch(i)
Out[8]:
This is especially important for large and complex codes, where code files, modules and types loaded and/or defined on the main processor should also be loaded and defined @everywhere, as the growth model example below makes clear.
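As a minimal sketch of the point (sq below is a hypothetical helper, not part of the growth model), a function must be defined @everywhere before the workers can execute it:

## Without @everywhere, worker 2 would not know about sq and the call would fail
@everywhere sq(x) = x^2
println(remotecall_fetch(sq, 2, 3.0))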
In practice, parallel maps and parallel loops are the easiest way to do parallel programming in Julia.
An example of @parallel for loop is given in the growth model below.
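To illustrate the syntax before we get there (assuming the workers added above are still available), @parallel can also take a reduction operator that combines the values returned by each worker, as in this minimal sketch:

## Estimate the probability of heads by spreading coin flips across the workers;
## (+) tells @parallel to sum the value returned by each iteration
nheads = @parallel (+) for i in 1:1000000
    Int(rand(Bool))
end
println(nheads / 1000000)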
Let us give an example for pmap. We create a function that generates two random matrices of given sizes, truncates them to the smaller common size, inverts them, and sums them. We apply it to a collection of matrix sizes and time the execution for map and pmap. You can check that they return the same collection of matrices. For the small matrices the overhead of dispatching work to the workers typically dominates; the gains from pmap show up for the larger problems.
In [15]:
@everywhere function change_matrix(nA::Int64, nB::Int64)
A = rand(nA, nA)
B = rand(nB, nB)
nmin = min(nA,nB)
A = A[1:nmin,1:nmin]
B = B[1:nmin,1:nmin]
return inv(A) .+ inv(B)
end
@time map(change_matrix, 1:2, 2:3);
@time pmap(change_matrix, 1:2, 2:3);
@time map(change_matrix, 100:200, 200:300);
@time pmap(change_matrix, 100:200, 200:300);
Let us look at a stripped-down version of the neoclassical growth model. The environment is deterministic. Capital $k$ is the only state variable, and capital depreciates fully each period. A representative agent solves the following recursive problem:
$$ V(k) = \max_{c \in (0,f(k))} u(c) + \beta V(k') $$
$$ k' = f(k) - c $$
$$ f(k) = k^\alpha $$
where $\alpha<1$ and the instantaneous utility function $u$ satisfies the Inada conditions.
We solve the model by value function iteration with interpolation of the value function. We compare the speed of the serial solution against the parallel one, for various grids.
We use a single processor.
We start by loading the modules on the local processor and defining the model parameters. Then we define three functions to solve the model:
In [46]:
using Optim: optimize
using Interpolations
using PyPlot
## Primitives and grid
alpha = 0.65
beta = 0.95
grid_max = 2
grid_size = 1500
grid_k = linspace(1e-6,grid_max,grid_size)
u(c) = log(c)
function optim_step(k,Vtilde_itp)
objective(c) = -u(c) - beta*Vtilde_itp[k^alpha - c]
res = optimize(objective,1e-6,k^alpha)
c_star = res.minimum
V1 = -objective(c_star)
return V1
end
function vfi_map(grid_k,criterion)
knots = (grid_k,) # knots for gridded linear interpolations
iter = 0
V0 = 5 .* log(grid_k)
distance = 1
while distance > criterion
Vtilde_itp = interpolate(knots, V0, Gridded(Linear()))
V1 = map(k -> optim_step(k,Vtilde_itp), grid_k)
distance = norm(V1-V0, Inf)
V0 = deepcopy(V1)
iter = iter + 1
end
return V0
end
function vfi_loop(grid_k,criterion)
knots = (grid_k,) # knots for gridded linear interpolations
iter = 0
V0 = 5 .* log(grid_k)
V1 = Array(Float64,length(grid_k))
distance = 1
while distance > criterion
Vtilde_itp = interpolate(knots, V0, Gridded(Linear()))
for (i,k) in enumerate(grid_k)
objective(c) = -u(c) - beta*Vtilde_itp[k^alpha - c]
res = optimize(objective,1e-6,k^alpha)
c_star = res.minimum
V1[i] = -objective(c_star)
end
distance = norm(V1-V0, Inf)  ## same sup-norm convergence criterion as in vfi_map
V0 = deepcopy(V1)
iter = iter + 1
end
return V0
end
Out[46]:
We time the execution of the vfi_loop function for capital grids of various lengths. The two functions vfi_loop and vfi_map have similar performance.
In [47]:
grid_size_array = [150 500 1000 1500 10000]
grids = Array(Any, length(grid_size_array))
for (i,g) in enumerate(grid_size_array)
grids[i] = linspace(1e-6,grid_max,g)
end
for g in grids
@time V=vfi_loop(g,1e-6);
end
Start by loading the modules on the main processor. Then load the modules @everywhere, i.e. on every worker. It is often necessary to load the modules on the main processor first, to prevent them from overwriting each other. Also define the model parameters and functions @everywhere.
The function optim_step(k,Vtilde_itp) is identical to the serial case. This is the step of the solution of the model that we choose to parallelize. That is, we delegate the maximization of the value function for the various values of $k$ to several processors.
Then we gather the optimal values computed by the various processors into a SharedArray. This is what the function vfi_ploop(grid_k,criterion) does.
At every iteration of the value function iteration, it delegates the various maximization problems to the various workers using a @parallel for loop (the macro @sync ensures that the program waits for all workers to complete their tasks before it goes on).
The SharedArray makes its contents readable by all workers. This is not strictly necessary here (we could have used a DArray, a distributed array), but it is often more convenient to code up.
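Before turning to the model itself, here is a minimal sketch of this pattern in isolation (the array squares is purely illustrative): each worker fills the entries of a SharedArray assigned to it, and @sync makes processor 1 wait until all of them are done:

## Workers write into the shared array; @sync blocks until every chunk is filled
squares = SharedArray(Float64, 10)
@sync @parallel for i in 1:10
    squares[i] = i^2
end
println(squares)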
In [1]:
## (Just for illustration, modules already loaded above)
using Optim: optimize
using Interpolations
using PyPlot
@everywhere begin
using Optim: optimize
using Interpolations
## Primitives and grid
alpha = 0.65
beta = 0.95
grid_max = 2
grid_size = 1500
grid_k = linspace(1e-6,grid_max,grid_size)
u(c) = log(c)
function optim_step(k,Vtilde_itp)
objective(c) = -u(c) - beta*Vtilde_itp[k^alpha - c]
res = optimize(objective,1e-6,k^alpha)
c_star = res.minimum
V1 = -objective(c_star)
return V1
end
function vfi_ploop(grid_k,criterion)
knots = (grid_k,)
iter = 0
V0 = 5 .* log(grid_k)
distance = 1
while distance > criterion
Vtilde_itp = interpolate(knots, V0, Gridded(Linear()))
V1 = SharedArray(Float64,length(grid_k))
@sync @parallel for i in eachindex(grid_k)
V1[i] = optim_step(grid_k[i],Vtilde_itp)
end
distance = norm(V1-V0, Inf)
V0 = deepcopy(V1)
iter = iter + 1
end
return V0
end
end
In [6]:
@everywhere begin
grid_size_array = [150 500 1000 1500 10000]
grids = Array(Any, length(grid_size_array))
for (i,g) in enumerate(grid_size_array)
grids[i] = linspace(1e-6,grid_max,g)
end
end
V = Array(Any, length(grids))
for (i,g) in enumerate(grids)
@time V[i] = vfi_ploop(g,1e-6);
end
To complete the exercise, let us plot the various value functions obtained for various grid precisions.
In [25]:
fig, ax = subplots()
ax[:set_ylim](-40, -30)
ax[:set_xlim](1e-6, grid_max)
for i in 1:length(grids)
    x = grids[i]
    y = V[i]
    ax[:plot](x, y, "k-", lw=1, alpha=0.5, label = "grid size $(grid_size_array[i])")
end
ax[:legend]()
Out[25]: