There is no special reason to study AWS compared to Google Cloud, Azure, DigitalOcean, etc. Amazon's cloud is probably the most popular today, so it offers a nice parallel to Python. AWS is the web-based gateway to Amazon's cloud computing resources.
Of note, AWS will deploy a region in Sweden this year, which will make it interesting for genomics research, especially since it will be made GDPR compliant. Currently, Swedish patient data cannot be processed on premises outside of Sweden, but the cloud is already a player in general non-clinical research.
AWS is an umbrella for a large number of computing resources, from storage to the management of remote computing infrastructure. To keep things practical, our focus is on loading data into a bucket, setting up a cloud instance, and later using Docker to remotely spin up cloud instances. We will also learn how to manage these resources via Python.
Let us start with loading data. This is a common operation when you want to share your research results with someone, but it can also be useful as a way to back up your own data. Clouds use the concept of 'buckets' to hold data. The 'objects' stored in a bucket can have any encoding, from text to film. There used to be severe restrictions on uploading very large objects; today, however, the maximum size for an object on AWS is 5 TB.
We will learn how to do this via the web console, via the command line interface (CLI), and via Python. Note that even though these options seem separate, they all use the same API.
Task:
$ for i in {1..5}; do echo "l$i">f$i.txt && gzip f$i.txt; done && \
zcat f*.txt.gz| gzip > f.gz
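If you prefer to stay in Python, here is a rough equivalent of the shell task above using only the standard library (the same file names are assumed):
In [ ]:
# rough Python equivalent of the shell one-liner above (same file names assumed)
import gzip
import shutil

# create five small gzipped text files: f1.txt.gz .. f5.txt.gz
for i in range(1, 6):
    with gzip.open(f"f{i}.txt.gz", "wt") as fh:
        fh.write(f"l{i}\n")

# concatenate their decompressed contents into a single archive f.gz
with gzip.open("f.gz", "wt") as out:
    for i in range(1, 6):
        with gzip.open(f"f{i}.txt.gz", "rt") as fh:
            shutil.copyfileobj(fh, out)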
Now let's repeat those steps using the command line interface. But first, we must install it.
Links:
$ sudo apt install awscli
$ aws configure
AWS Access Key ID [None]:
AWS Secret Access Key [None]:
(I also used eu-central-1 as the region and json as the output format)
The above command needs SSL certificates. To generate the AWS keys:
$ openssl genrsa 2048 > aws-private.pem
$ openssl req -new -x509 -nodes -sha256 -days 365 -key aws-private.pem -outform PEM -out aws-certificate.pem
# if in dire need of security, use:
$ sudo apt-get install xclip
$ xclip -sel clip < ~/.ssh/aws-private.pem
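The CLI and boto3 read the same configuration files (~/.aws/credentials and ~/.aws/config), so you can quickly verify from Python that the configuration above worked. A minimal sketch:
In [ ]:
# a minimal check that boto3 sees the same credentials and region as the CLI
# (assumes you already ran `aws configure` above)
import boto3

session = boto3.Session()
creds = session.get_credentials()
print("access key:", creds.access_key[:4] + "...")  # never print secrets in full
print("region:", session.region_name)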
Now that you have installed the CLI, here are the main bucket-related operations (a boto3 equivalent is sketched right after this list):
create:
aws s3 mb s3://my-first-backup-bucket
upload:
aws s3 cp "C:\users\my first backup.bak" s3://my-first-backup-bucket/
download:
aws s3 cp s3://my-first-backup-bucket/my-first-backup.bak ./
delete:
aws s3 rm s3://my-first-backup-bucket/my-first-backup.bak
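For reference, the same create/upload/download/delete cycle can be done from Python with boto3 (covered in more detail below). This is only a sketch: the bucket name and file names are placeholders, and eu-central-1 is assumed as the region.
In [ ]:
# boto3 sketch of the same create/upload/download/delete cycle
# (bucket name and file names are placeholders)
import boto3

s3 = boto3.client('s3')

# create (regions other than us-east-1 need a LocationConstraint)
s3.create_bucket(Bucket='my-first-backup-bucket',
                 CreateBucketConfiguration={'LocationConstraint': 'eu-central-1'})
# upload
s3.upload_file('my-first-backup.bak', 'my-first-backup-bucket', 'my-first-backup.bak')
# download
s3.download_file('my-first-backup-bucket', 'my-first-backup.bak', 'my-first-backup.bak')
# delete
s3.delete_object(Bucket='my-first-backup-bucket', Key='my-first-backup.bak')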
Data can also be streamed to a bucket. This is useful to avoid wasting space on the local machine or cloud instance, and it is just as useful for consuming bucket data without storing all of it locally. It can be done via piping or process substitution (a Python sketch using boto3 follows the shell examples below):
$ aws s3 mb s3://siofuysni78
$ zcat f*.txt.gz| gzip | aws s3 cp - s3://siofuysni78/f.gz
$ aws s3 rm s3://siofuysni78/f.gz
$ aws s3 rb s3://siofuysni78 --force
Why did I use such a weird name? Because bucket names are indexed globally across all AWS accounts, so a common name such as "test123" will never fly. Here is how to stream from S3 to your computing resource (a cloud instance, your local machine, or a remote server):
$ aws s3 mb s3://siofuysni78
$ zcat f*.txt.gz| gzip | aws s3 cp - s3://siofuysni78/f.gz
$ aws s3 cp s3://siofuysni78/f.gz - | gunzip | grep 1
l1
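The same no-local-file idea works from Python: boto3's upload_fileobj and download_fileobj accept any file-like object, so you can stream in-memory data to and from the bucket. A minimal sketch, assuming the bucket created above still exists:
In [ ]:
# boto3 sketch of streaming without a local file
# (assumes the bucket created above still exists)
import boto3
import gzip
import io

s3 = boto3.client('s3')
bucket = 'siofuysni78'

# compress some data in memory and stream it into the bucket
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b"l1\nl2\nl3\nl4\nl5\n")
buf.seek(0)
s3.upload_fileobj(buf, bucket, 'f.gz')

# stream it back and decompress, still without touching the disk
obj = io.BytesIO()
s3.download_fileobj(bucket, 'f.gz', obj)
print(gzip.decompress(obj.getvalue()).decode())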
Links:
conda install -c anaconda boto3
pip install boto3
In [1]:
import boto3
# initialize the S3 service
s3 = boto3.client('s3')
# create a test bucket (tip: use a different name!)
s3.create_bucket(Bucket='jo8a7fn8sfn8', CreateBucketConfiguration={'LocationConstraint': 'eu-central-1'})
# Call S3 to list current buckets
response = s3.list_buckets()
# Get a list of all bucket names from the response
buckets = [bucket['Name'] for bucket in response['Buckets']]
# Print out the bucket list
print("Bucket List: %s" % buckets)
In [ ]:
import boto3
# Create an S3 client
s3 = boto3.client('s3')
filename = '/path/to/test/file'
bucket_name = 'jo8a7fn8sfn8'
# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
# or, using the resource API instead of the client:
# boto3.resource('s3').Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
In [ ]:
# https://boto3.readthedocs.io/en/latest/guide/migrations3.html#deleting-a-bucket
import boto3
import botocore
s3 = boto3.resource('s3')
bucket = s3.Bucket('jo8a7fn8sfn8')
for key in bucket.objects.all():
    key.delete()
bucket.delete()
Now I want to test using the bucket without local file storage.
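One possible way to approach this, sketched below, is to move bytes directly between memory and the bucket with put_object and get_object; the bucket name is a placeholder for one of your existing buckets, and the object key is made up for illustration.
In [ ]:
# sketch: write and read an object directly from memory, no local file involved
# (the bucket name is a placeholder for one of your existing buckets)
import boto3

s3 = boto3.client('s3')
bucket = 'jo8a7fn8sfn8'

# put an object straight from a Python bytes buffer
s3.put_object(Bucket=bucket, Key='hello.txt', Body=b'hello from memory\n')

# read it back into memory
body = s3.get_object(Bucket=bucket, Key='hello.txt')['Body'].read()
print(body.decode())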
Amazon names its most popular compute service Elastic Compute Cloud (EC2).
Probably the most basic level of access to the Amazon computing infrastructure is setting up a free tier instance.
Task:
aws ec2 run-instances --image-id ami-xxxxxxxx --count 1 --instance-type t1.micro --key-name MyKeyPair --security-groups my-sg
Task:
Helpful code:
In [ ]:
import boto3
import botocore
import paramiko

ec2 = boto3.resource('ec2')
# launch instances from an AMI (placeholder image id)
ec2.create_instances(ImageId='<ami-image-id>', MinCount=1, MaxCount=5)
# look up a running instance by its id (placeholder) and get its public IP
instance = ec2.Instance('<instance-id>')
instance_ip = instance.public_ip_address

key = paramiko.RSAKey.from_private_key_file('path/to/mykey.pem')
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Connect/ssh to an instance
try:
    # Here 'ubuntu' is the user name and instance_ip is the public IP of the EC2 instance
    client.connect(hostname=instance_ip, username="ubuntu", pkey=key)
    # Execute a command (cmd) after connecting/ssh-ing to the instance
    cmd = 'uname -a'  # example command
    stdin, stdout, stderr = client.exec_command(cmd)
    print(stdout.read())
    # close the client connection once the job is done
    client.close()
except Exception as e:
    print(e)
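Note that right after create_instances the machine is usually still booting, so the SSH connection will fail until the instance is running and has a public IP. A small sketch of how to wait for it (the instance id is a placeholder):
In [ ]:
# sketch: wait until the instance is running before trying to SSH into it
# (the instance id is a placeholder)
import boto3

ec2 = boto3.resource('ec2')
instance = ec2.Instance('<instance-id>')
instance.wait_until_running()
instance.reload()  # refresh attributes such as the public IP address
print(instance.public_ip_address)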
My preferred way is to use docker-machine to manage cloud instances that already come set up with Docker. You can then pull your intended container from Docker Hub and run it on the instance. An alternative is using AWS services to create your instance, which has its own benefits (basically most benefits except time). Another alternative is using Docker Cloud or Kubernetes, which is the way to go for multiple instances.
# install docker machine
$ base=https://github.com/docker/machine/releases/download/v0.14.0 && \
curl -L $base/docker-machine-$(uname -s)-$(uname -m) >/tmp/docker-machine && \
sudo install /tmp/docker-machine /usr/local/bin/docker-machine
# setup a cloud instance
$ export SECRET_KEY="..."
$ docker-machine create --driver amazonec2 --amazonec2-region eu-central-1 \
--amazonec2-access-key AKIAJPBEKSXQ7NJGSL3A \
--amazonec2-secret-key $SECRET_KEY \
aws-test
# ssh and delete
docker-machine ssh aws-test
docker-machine rm aws-test
# for other options: --amazonec2-instance-type "t2.2xlarge"
docker-machine create --driver amazonec2
Further reading
# run inside the EC2 instance
export DOCKER_ID_USER="grokkaine"
docker login
docker pull $DOCKER_ID_USER/awstest
docker run -ti $DOCKER_ID_USER/awstest /bin/bash
# now run your commands inside the container
This only works for short tasks, because once you log out of the container, the container will stop. What you need is the ability to run long jobs. So we must create a detached container, then attach to it with a shell during and after the execution of the program in order to check the logs and save data.
# run inside the EC2 instance
export DOCKER_ID_USER=""
sudo docker login
sudo docker pull $DOCKER_ID_USER/awscrass
sudo docker run -w /home/ -tid $DOCKER_ID_USER/awscrass /bin/bash
# exit container, start it
sudo docker ps
# run a command in detached mode
#sudo docker exec -d containerid bash -c "your command line"
#alternative is to log into the container and run the command there
sudo docker exec -it containerid bash
# start, attach
docker start containerid
docker attach containerid
On AWS you can opt for different types of instances, and you can also upgrade or downgrade your instance to match your resource needs. For example, one can opt for instances with a lot of RAM for memory-intensive computations such as sequence alignment, or for GPU instances for deep learning and other forms of GPU-accelerated computing. More here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
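From boto3, the instance type is just a parameter of create_instances. A minimal sketch; the AMI id is a placeholder and r5.2xlarge is only an illustrative choice of a memory-optimized type:
In [ ]:
# sketch: the instance type is just a parameter of create_instances
# (the AMI id is a placeholder; r5.2xlarge is only an illustrative choice)
import boto3

ec2 = boto3.resource('ec2')
instances = ec2.create_instances(
    ImageId='<ami-image-id>',
    InstanceType='r5.2xlarge',  # memory-optimized, e.g. for sequence alignment
    KeyName='MyKeyPair',        # assumes this key pair exists
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)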
From the purchasing point of view, however, there are several major classes of instances, most notably on-demand instances (pay per hour or second with no commitment), reserved instances (cheaper in exchange for a long-term commitment), and spot instances (spare capacity at a steep discount, but the instance can be interrupted).
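As an illustration, spot capacity can be requested from boto3 by adding a market option to run_instances. A minimal sketch with placeholder values:
In [ ]:
# sketch: request spot capacity by adding a market option to run_instances
# (all values are placeholders)
import boto3

ec2 = boto3.client('ec2')
response = ec2.run_instances(
    ImageId='<ami-image-id>',
    InstanceType='t3.medium',
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={'MarketType': 'spot'},
)
print(response['Instances'][0]['InstanceId'])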
You can run containerized clusters of EC2 instances using another AWS web service called ECS clusters. More information here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_clusters.html
Another popular option is Docker Cloud, which allows you to define and store images, set up continuous integration, and then run test or production clusters on AWS.
Task:
In [ ]: