In [1]:
using CloudArray
Then configure the cloud host address and password:
In [2]:
set_host("cloudarray01.cloudapp.net","cloudarray@")
Out[2]:
The CloudArray constructors are simple: a CloudArray can be created from an Array or from a file.
Array
You just need to tell the DArray constructor which Array should be used to construct your CloudArray:
DArray(Array(...))
In this example, we first create the array arr with 100 random numbers, then we create a CloudArray from the arr data:
In [8]:
arr = rand(100)
cloudarray_from_array = DArray(arr) # will take less than one minute
Out[8]:
We can now access any value as if it were a local array:
In [15]:
cloudarray_from_array[57]
Out[15]:
If you are dealing with big data, i.e., your data does not fit in RAM, you can create a CloudArray from a file.
DArray(file_path)
file_path is the path to a text file in your local or distributed file system. Each line fills one DArray element, in file order; empty lines are ignored.
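The per-line behavior described above can be pictured with a short sketch; this is only an illustration of the parsing rules, not CloudArray's actual implementation (an IOBuffer stands in for the file):

```julia
# Sketch of the line-by-line parsing described above (not CloudArray's code):
# each non-empty line becomes one Float64 element, in file order.
io = IOBuffer("0.25\n\n0.5\n0.75\n")   # stands in for a file handle
values = [parse(Float64, line) for line in eachline(io) if !isempty(strip(line))]
# values == [0.25, 0.5, 0.75] — the empty second line was skipped
```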
Let's first create a simple text file with 100 random numbers.
In [16]:
f = open("data.txt", "w+")
for i = 1:100
    if i == 100
        write(f, "$(rand())")
    else
        write(f, "$(rand())\n")
    end
end
close(f)
Then we create a CloudArray from the data.txt file.
In [10]:
cloudarray_from_file = DArray("data.txt")
Out[10]:
Let's perform a sum over cloudarray_from_file:
In [11]:
sum(cloudarray_from_file)
Out[11]:
This sum was performed locally on the Master. You can fully exploit DArray parallelism with further functions such as parallel maps (pmap) and reductions. See the Julia documentation for more information on parallel programming.
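As a minimal sketch of such a parallel function, here is pmap from Julia's standard Distributed module (assuming local worker processes, not the cloud setup used elsewhere in this tutorial):

```julia
using Distributed
addprocs(2)                      # start two local worker processes
squares = pmap(x -> x^2, 1:4)    # the squares are computed across the workers
# squares == [1, 4, 9, 16]
```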
If you want to tune your CloudArray, you can directly use the CloudArray core constructor:
carray_from_task(generator::Task=task_from_text("test.txt"), is_numeric::Bool=true, chunk_max_size::Int=1024*1024, debug::Bool=false)
Arguments are:
- task_from_text: plays the same role as file_path.
- is_numeric: set to false if you need to load Strings instead of Floats.
- chunk_max_size: the maximum size, in bytes, allowed for each DArray chunk.
- debug: enables debug mode.
Below, we create a CloudArray from the data.txt file. Since it holds numeric values, the second argument is set to true. We set the third argument (chunk_max_size) to 500, so each DArray chunk will hold at most 500 bytes.
In [3]:
custom_cloudarray_from_file = DArray("data.txt", true, 500)
Out[3]:
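To see roughly how chunk_max_size shapes the layout: assuming each line of data.txt is about 19 bytes (a printed Float64 plus a newline; this figure is an assumption, not measured), a 500-byte cap fits about 26 values per chunk, so the 100 values land in roughly 4 chunks:

```julia
bytes_per_line  = 19        # assumed size of one "0.xxxxxxxxxxxxxxxx\n" line
chunk_max_size  = 500       # the cap we passed to the constructor
values_per_chunk = chunk_max_size ÷ bytes_per_line   # integer division: 26
nchunks = cld(100, values_per_chunk)                 # ceiling division: 4 chunks
```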
Now let's define and perform a parallel reduction at the just-created CloudArray:
In [4]:
parallel_reduce(f, darray) = reduce(f, map(fetch, [(@spawnat p reduce(f, localpart(darray))) for p in workers()]))
parallel_reduce(+,custom_cloudarray_from_file)
Out[4]:
The result is the sum of all values of custom_cloudarray_from_file. Each worker computes, in parallel, the sum of the DArray chunk it holds; the partial results are sent to the Master, which performs the final sum. The map over fetch collects the value computed by each @spawnat.
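The same spawn/fetch pattern can be seen with plain arrays, assuming only the standard Distributed module (the hand-made chunks below are hypothetical stand-ins for DArray localparts):

```julia
using Distributed
addprocs(2)                                        # two local worker processes
data   = collect(1:100)
chunks = [data[1:50], data[51:100]]                # stand-ins for DArray chunks
# Each worker sums its chunk in parallel; the Master combines the partial sums.
futures = [(@spawnat w sum(c)) for (w, c) in zip(workers(), chunks)]
total   = reduce(+, fetch.(futures))
# total == 5050, the same as sum(1:100)
```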
You don't really need to know this, but if you are curious about how your data is stored, you can inspect further information such as:
In [17]:
@show custom_cloudarray_from_file.chunks
@show custom_cloudarray_from_file.cuts
@show custom_cloudarray_from_file.dims
@show custom_cloudarray_from_file.indexes
@show custom_cloudarray_from_file.pids
Out[17]:
Please read the DistributedArrays documentation if you want to better understand these low-level details.