To run ndmg on entire datasets, we use AWS Batch.
Our Batch pipeline works as follows.
Steps 4 and 5 above occur in parallel: if you have a dataset with 200 scans, 200 EC2 instances will be running at the same time.
Because scans run in parallel, full runs generally take a little over an hour regardless of dataset size, at roughly $0.05/scan.
This section shows you how to set up a Batch environment so that you can easily run ndmg in parallel. For more in-depth documentation, see the official AWS Batch documentation.
Your compute environment defines the compute resources AWS will use for its instances.
For ndmg, we recommend exclusively using the r5.large instance type.
Here, you define who can access the environment. We recommend using the default IAM roles; note that whichever roles you use must be able to access S3, since ndmg reads its inputs from and writes its outputs to S3.
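If the default roles turn out not to have S3 access, you can attach one of AWS's managed S3 policies to the job role with the AWS CLI. A minimal sketch (the role name is a placeholder; use whichever role your jobs actually run under):

# Grant the job role S3 access so jobs can pull inputs and push outputs.
# "ndmg-batch-job-role" is illustrative, not a role ndmg creates for you.
aws iam attach-role-policy --role-name ndmg-batch-job-role --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess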
This section lets you define what resources your environment will use. You don't need any launch templates here; essentially the only things necessary are to set the instance type to r5.large and to raise the maximum vCPUs (in case you're running large datasets). You can leave the rest of the compute environment options set to their defaults.
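If you prefer the CLI to the console, the same compute environment can be created with aws batch create-compute-environment. A sketch, assuming the default AWSBatchServiceRole and ecsInstanceRole exist in your account; the environment name, subnet, and security group below are placeholders:

# Create a managed EC2 compute environment restricted to r5.large instances.
aws batch create-compute-environment \
    --compute-environment-name ndmg-env \
    --type MANAGED \
    --state ENABLED \
    --service-role AWSBatchServiceRole \
    --compute-resources '{
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["r5.large"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole"
    }'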
Now, attach the compute environment to the job queue.
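From the CLI, creating the queue and attaching the environment looks something like this (the queue name is a placeholder, and ndmg-env is the environment sketched above):

# Create a job queue fed by the compute environment.
aws batch create-job-queue \
    --job-queue-name ndmg-queue \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=ndmg-env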
The job definition is where you specify the container, the job role, the vCPUs, and the memory. Use neurodata/ndmg_dev:latest for the container image and 15200 (MiB) for the memory. You can leave the other parameters empty.
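The equivalent CLI call is aws batch register-job-definition. A sketch; the definition name and vCPU count here are illustrative, since the text above only pins down the image and the memory:

# Register a container job definition pointing at the ndmg image (memory is in MiB).
aws batch register-job-definition \
    --job-definition-name ndmg-job \
    --type container \
    --container-properties '{
        "image": "neurodata/ndmg_dev:latest",
        "vcpus": 2,
        "memory": 15200
    }'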
Once your Batch environment and IAM credentials are set up properly, submitting jobs to Batch is relatively simple:
ndmg_cloud participant --bucket <bucket> --bidsdir <path on bucket> --jobdir <local/jobdir> --modif <name-of-s3-directory-output>
An example using one of our S3 buckets is shown below. Note that your dataset must be BIDS-formatted:
ndmg_cloud participant --bucket ndmg-data --sp native --bidsdir HNU1 --jobdir ~/.ndmg/jobs/HNU1-08-21-native --modif ndmg-08-21-native --dataset HNU1
Note that the above example won't work for people without access to our bucket.
Behind the scenes, what happens here is:
1) ndmg parses through your S3 bucket and gets the locations of all subjects and sessions.
2) A json file is created for each scan, containing the ndmg run parameters for that scan.
3) Each json file is submitted sequentially to AWS Batch, and a job begins that runs that scan.
4) Outputs get pushed to the directory on S3 specified by --modif.
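Conceptually, each of those per-scan submissions boils down to a single Batch submit-job call. A rough sketch, not ndmg's actual code; all names are placeholders, and the real command comes from the generated json:

# Submit one scan as a job, overriding the container command with that scan's parameters.
aws batch submit-job \
    --job-name ndmg-scan-001 \
    --job-queue ndmg-queue \
    --job-definition ndmg-job \
    --container-overrides '{"command": ["<per-scan ndmg command from the generated json>"]}'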
You can see all of your runs in the Batch Dashboard, and you can monitor the outputs of a given run by clicking on it and clicking View logs for the most recent attempt in the CloudWatch console.
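The same logs can be fetched from the CLI; Batch jobs write to the /aws/batch/job log group by default. The stream name below is a placeholder, copy yours from the job's detail page:

# Fetch the log events for one job attempt.
aws logs get-log-events \
    --log-group-name /aws/batch/job \
    --log-stream-name ndmg-job/default/0123456789abcdef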
When all of your runs finish, the outputs will be on S3, and you can use your favorite method to pull your graphs to your local machine!
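For example, one way is to sync the output directory down with the AWS CLI. This sketch assumes the outputs landed under the --modif directory on your bucket, matching the example above; adjust the path to your own run:

# Copy the run's outputs from S3 to a local folder.
aws s3 sync s3://ndmg-data/ndmg-08-21-native ./ndmg-08-21-native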