CloudArray: Easy big data programming in the cloud

Usage

First, load the CloudArray package:


In [1]:
using CloudArray


WARNING: New definition 
    call(Type{DistributedArrays.DArray}, AbstractString, Any...) at /Users/alage/.julia/v0.4/CloudArray/src/CloudArray.jl:209
is ambiguous with: 
    call(Type{DistributedArrays.DArray}, Any, DistributedArrays.DArray) at /Users/alage/.julia/v0.4/DistributedArrays/src/DistributedArrays.jl:89.
To fix, define 
    call(Type{DistributedArrays.DArray}, AbstractString, DistributedArrays.DArray)
before the new definition.
WARNING: requiring "CloudArray" in module "Main" did not define a corresponding module.

Then configure the cloud host address and password:


In [2]:
set_host("cloudarray01.cloudapp.net","cloudarray@")


Out[2]:
true

Main constructors

The main CloudArray constructors are simple: a CloudArray can be created from an Array or from a file.

Creating a CloudArray from an Array

Pass the Array that should back your CloudArray to the DArray constructor:

DArray(Array(...))

Example

In this example, we first create the array arr with 100 random numbers, then create a CloudArray from the data in arr:


In [8]:
arr = rand(100)
cloudarray_from_array = DArray(arr) # will take less than one minute


Creating container (3)...
SSH configuration (3)... 
Warning: Permanently added '[cloudarray01.cloudapp.net]:3049,[23.99.60.212]:3049' (RSA) to the list of known hosts.
Adding worker (3)...
New worker added
Total: 1
elapsed time: 27.525632162 seconds
WARNING: deserialization checks failed while attempting to load cache from /Users/alage/.julia/lib/v0.4/Compat.ji
Out[8]:
100-element DistributedArrays.DArray{Float64,1,Array{Float64,1}}:
 0.448005 
 0.83255  
 0.462912 
 0.236335 
 0.704798 
 0.712569 
 0.0203078
 0.504881 
 0.928329 
 0.682598 
 0.669035 
 0.968663 
 0.67883  
 ⋮        
 0.527087 
 0.376975 
 0.539336 
 0.494806 
 0.288121 
 0.953859 
 0.985124 
 0.045659 
 0.551096 
 0.678965 
 0.349062 
 0.264423 

We can now access any value as if it were a local array:


In [15]:
cloudarray_from_array[57]


Out[15]:
0.9158186461892979

Creating a CloudArray from a file

If you are dealing with big data, i.e., your data does not fit in RAM, you can create a CloudArray from a file.

DArray(file_path)

file_path is the path to a text file in your local or distributed file system. All lines will be used to fill DArray elements sequentially. This constructor ignores empty lines.
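
The line-by-line behavior described above can be sketched in plain Julia, without any cloud workers. This is only an illustration written for Julia 1.x, not CloudArray's implementation: read_numeric_lines is a hypothetical helper, and it assumes every non-empty line holds exactly one number.

```julia
# Hypothetical helper (not part of CloudArray): read one Float64 per
# non-empty line, in file order, skipping empty lines.
function read_numeric_lines(path::AbstractString)
    numbers = Float64[]
    for line in eachline(path)
        stripped = strip(line)
        isempty(stripped) && continue          # ignore empty lines
        push!(numbers, parse(Float64, stripped))
    end
    return numbers
end

# Usage: write a small file that contains an empty line, then read it back.
path = tempname()
write(path, "1.5\n\n2.25\n3.0")
vals = read_numeric_lines(path)
```

The empty second line is skipped, so vals holds three elements in the same order as the file.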

Example

Let's first create a simple text file with 100 random numbers.


In [16]:
f = open("data.txt","w+")
for i=1:100
    if i==100
        write(f,"$(rand())")      # last value: no trailing newline
    else
        write(f,"$(rand())\n")    # one value per line
    end
end
close(f)

Then we create a CloudArray from the data.txt file.


In [10]:
cloudarray_from_file = DArray("data.txt")


Creating container (4)...
SSH configuration (4)... 
Warning: Permanently added '[cloudarray01.cloudapp.net]:3050,[23.99.60.212]:3050' (RSA) to the list of known hosts.
Adding worker (4)...
New worker added
Total: 2
elapsed time: 20.620173089 seconds
WARNING: deserialization checks failed while attempting to load cache from /Users/alage/.julia/lib/v0.4/Compat.ji
Out[10]:
100-element DistributedArrays.DArray{Float64,1,Array{Float64,1}}:
 0.897144  
 0.550581  
 0.482141  
 0.749212  
 0.943388  
 0.830631  
 0.0458245 
 0.0235713 
 0.537933  
 0.16375   
 0.531444  
 0.269559  
 0.953395  
 ⋮         
 0.963455  
 0.0886685 
 0.13586   
 0.00930461
 0.880849  
 0.553979  
 0.327784  
 0.688187  
 0.952335  
 0.532966  
 0.726188  
 0.883657  

Let's sum the elements of cloudarray_from_file:


In [11]:
sum(cloudarray_from_file)


Out[11]:
52.32310720010805

This sum was performed locally at the Master. You can exploit the full parallelism of DArray with further functions such as parallel maps (pmap) and reductions. For more information, see the documentation on parallel programming in Julia.
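
As a rough illustration of the map-then-reduce pattern mentioned above, the sketch below sums a local array in pieces with pmap. It is written for Julia 1.x, where pmap lives in the Distributed standard library, and it does not use CloudArray at all: the data, the chunk size of 25, and the manual chunking are assumptions made for the example. Without extra workers, pmap simply runs on the Master process.

```julia
using Distributed   # standard library in Julia 1.x

# Split the data into chunks by hand; a DArray keeps such chunks on workers.
data = collect(1.0:100.0)
chunks = [data[i:min(i + 24, end)] for i in 1:25:length(data)]

# Parallel map: each chunk is summed on whichever process is available.
partial_sums = pmap(sum, chunks)

# Reduction: combine the partial sums on the Master.
total = reduce(+, partial_sums)
```

Here total equals sum(data); only the partial sums, not the chunks' results element by element, are combined in the final step.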

Core constructor

If you want to tune your CloudArray, you can directly use the CloudArray core constructor:

carray_from_task(generator::Task=task_from_text("test.txt"), is_numeric::Bool=true, chunk_max_size::Int=1024*1024,debug::Bool=false)

Arguments are:

  • task_from_text: same as file_path above.
  • is_numeric: set to false if you need to load String values instead of Float values.
  • chunk_max_size: sets the maximum size, in bytes, allowed for each DArray chunk.
  • debug: enables debug mode.

Example

Below, we create a CloudArray from the data.txt file. Because the file holds numeric values, the second argument (is_numeric) is set to true. We set the third argument (chunk_max_size) to 500 so that each DArray chunk holds no more than 500 bytes.
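
A quick back-of-the-envelope check shows why this setting needs more than one chunk. The calculation assumes the limit is compared against the raw size of the Float64 values (8 bytes each); how CloudArray actually measures chunk size is not stated here, so treat the result as a lower bound rather than a prediction.

```julia
n_elements     = 100
total_bytes    = n_elements * sizeof(Float64)     # 8 bytes per Float64
chunk_max_size = 500

# Ceiling division: the fewest chunks that can hold all the bytes.
min_chunks = cld(total_bytes, chunk_max_size)
```

With 800 bytes of data and a 500-byte limit, at least two chunks are required. The split reported later in this section (63 and 37 elements) is consistent with that lower bound, though it also suggests the limit is not applied to exactly this byte count.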


In [3]:
custom_cloudarray_from_file = DArray("data.txt", true, 500)


Creating container (1)...
SSH configuration (1)... 
Warning: Permanently added '[cloudarray01.cloudapp.net]:3053,[23.99.60.212]:3053' (RSA) to the list of known hosts.
Adding worker (1)...
New worker added
Total: 1
elapsed time: 23.376325559 seconds
WARNING: deserialization checks failed while attempting to load cache from /Users/alage/.julia/lib/v0.4/Compat.ji
Creating container (2)...
SSH configuration (2)... 
Warning: Permanently added '[cloudarray01.cloudapp.net]:3054,[23.99.60.212]:3054' (RSA) to the list of known hosts.
Adding worker (2)...
New worker added
Total: 2
elapsed time: 22.231016267 seconds
WARNING: deserialization checks failed while attempting to load cache from /Users/alage/.julia/lib/v0.4/Compat.ji
Out[3]:
100-element DistributedArrays.DArray{Float64,1,Array{Float64,1}}:
 0.112248 
 0.894684 
 0.80095  
 0.0974568
 0.102072 
 0.272024 
 0.960695 
 0.406005 
 0.960488 
 0.801466 
 0.927738 
 0.0988186
 0.415196 
 ⋮        
 0.872471 
 0.568069 
 0.151638 
 0.742604 
 0.894044 
 0.90042  
 0.431675 
 0.910004 
 0.72825  
 0.618541 
 0.324643 
 0.799405 

Now let's define and perform a parallel reduction on the newly created CloudArray:


In [4]:
parallel_reduce(f,darray) = reduce(f, map(fetch, Any[ (@spawnat p reduce(f, localpart(darray))) for p in workers() ]))
parallel_reduce(+,custom_cloudarray_from_file)


Out[4]:
53.682358805343725

The result is the sum of all values of custom_cloudarray_from_file. Each worker sums, in parallel, the part of the DArray it holds. The partial results are sent to the Master, which performs the final sum. The map function applies fetch to retrieve each partial result.

You don't really need to know this, but if you are curious about how your data is stored, you can get further information such as:


In [17]:
@show custom_cloudarray_from_file.chunks
@show custom_cloudarray_from_file.cuts
@show custom_cloudarray_from_file.dims
@show custom_cloudarray_from_file.indexes
@show custom_cloudarray_from_file.pids


custom_cloudarray_from_file.chunks = RemoteRef[RemoteRef{Channel{Any}}(6,1,341),RemoteRef{Channel{Any}}(7,1,350)]
custom_cloudarray_from_file.cuts = [[1,64,101]]
custom_cloudarray_from_file.dims = (100,)
custom_cloudarray_from_file.indexes = [(1:63,),(64:100,)]
custom_cloudarray_from_file.pids = [6,7]
Out[17]:
2-element Array{Int64,1}:
 6
 7

To better understand these low-level details, see the DistributedArrays documentation.
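
As a small illustration, the cuts and indexes values printed above appear to describe the same split in two ways: each chunk seems to start at one cut and end just before the next. The sketch below rebuilds the index ranges from the one-dimensional cut points. It is only a reading of the printed values, not a statement of how DistributedArrays computes them, and it runs in plain Julia without any workers.

```julia
# Cut points copied from the @show output above (one dimension).
cuts = [1, 64, 101]

# Chunk k covers cuts[k] up to, but not including, cuts[k+1].
ranges = [cuts[k]:(cuts[k+1] - 1) for k in 1:length(cuts) - 1]

# Number of elements held by each chunk.
chunk_lengths = map(length, ranges)
```

The two ranges are 1:63 and 64:100, matching the printed indexes, and the chunk lengths (63 and 37) add up to the 100 elements reported by dims.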