Introduction

There is no special reason to study AWS over Google Cloud, Azure, DigitalOcean, etc. Amazon's cloud is probably the most popular today, which makes it a nice parallel to Python. AWS is the web-based gateway to Amazon's cloud computing resources.

Of note, AWS will deploy a region in Sweden this year, which will make it interesting for genomics research, especially since it will be made GDPR compliant. Currently, Swedish patient data cannot be processed on premises located outside of Sweden, but the cloud is already a player in general non-clinical research.

AWS is an umbrella for a large number of computing resources, ranging from storage to the management of the remote computing infrastructure. To keep things practical, our focus is on loading data into a bucket, setting up a cloud instance, and later using Docker to remotely spin up cloud instances. We will also learn how to manage these resources via Python.

Loading data into S3 buckets

Let us start with loading data. This is a common operation when you want to share your research results with someone, but it is also useful as a way to back up your own data. Clouds use the concept of 'buckets' to hold data. The 'objects' stored in a bucket can have any encoding, from text to film. There used to be severe restrictions on uploading very large objects; today, however, the maximum size for a single object on AWS is 5 TB.

We will learn how to do this via the web console, via the command line interface, and via Python. Note that even though these options seem separate, they all use the same underlying API.

Web Console

Task:

  • Use the console to load a test file into an S3 bucket
  • Follow this doc link: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/upload-objects.html
  • Use the following shell command to generate some test data, or use your own:
    $ for i in {1..5}; do echo "l$i">f$i.txt && gzip f$i.txt; done && \
    zcat f*.txt.gz| gzip > f.gz
  • Figure out how much your bucket would cost (tip: it is free up to a threshold)!

Amazon CLI

Now let's repeat those steps using the command line interface. But first, we must install it.

Links:

$ sudo apt install awscli
$ aws configure
AWS Access Key ID [None]: 
AWS Secret Access Key [None]:
(I also used eu-central-1 for the region, and json for the output format)

The above command needs SSL certificates. To generate the AWS private key and certificate:

$ openssl genrsa 2048 > aws-private.pem
$ openssl req -new -x509 -nodes -sha256 -days 365 -key aws-private.pem -outform PEM -out aws-certificate.pem

# to copy the private key to the clipboard without printing it on screen:
$ sudo apt-get install xclip
$ xclip -sel clip < ~/.ssh/aws-private.pem

Now that you have installed the CLI, here are the main bucket-related operations:

create:
aws s3 mb s3://my-first-backup-bucket
upload:
aws s3 cp "C:\users\my first backup.bak" s3://my-first-backup-bucket/
download:
aws s3 cp s3://my-first-backup-bucket/my-first-backup.bak ./
delete:
aws s3 rm s3://my-first-backup-bucket/my-first-backup.bak

Data can also be streamed to a bucket. This is useful to avoid wasting space on the local instance or PC, and just as useful when you want to consume bucket data without storing all of it locally. It can be done via piping or process substitution:

$ aws s3 mb s3://siofuysni78
$ zcat f*.txt.gz| gzip | aws s3 cp - s3://siofuysni78/f.gz
$ aws s3 rm s3://siofuysni78/f.gz
$ aws s3 rb s3://siofuysni78 --force

Why did I use such a weird name? Because Amazon indexes all buckets by name, which must be globally unique, so a name such as "test123" will never fly. Here is how to stream from S3 to your computing resource (a cloud instance, your local machine or a remote server):

$ aws s3 mb s3://siofuysni78
$ zcat f*.txt.gz| gzip | aws s3 cp - s3://siofuysni78/f.gz
$ aws s3 cp s3://siofuysni78/f.gz - | gunzip | grep 1
l1

Boto3

Links:

conda install -c anaconda boto3
pip install boto3

In [1]:
import boto3

# initialize the S3 service
s3 = boto3.client('s3')

# create a test bucket (tip: use a different name!)
s3.create_bucket(Bucket='jo8a7fn8sfn8', CreateBucketConfiguration={'LocationConstraint': 'eu-central-1'})

# Call S3 to list current buckets
response = s3.list_buckets()

# Get a list of all bucket names from the response
buckets = [bucket['Name'] for bucket in response['Buckets']]

# Print out the bucket list
print("Bucket List: %s" % buckets)


Bucket List: ['crasstestdummy', 'jo8a7fn8sfn8', 'siofuysni78', 'snlmocombined']

In [ ]:
import boto3

# Create an S3 client
s3 = boto3.client('s3')

filename = '/path/to/test/file'
bucket_name = 'jo8a7fn8sfn8'

# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)

# or, with the resource API:
# boto3.resource('s3').Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))

In [ ]:
# https://boto3.readthedocs.io/en/latest/guide/migrations3.html#deleting-a-bucket
import boto3
import botocore

s3 = boto3.resource('s3')
bucket = s3.Bucket('jo8a7fn8sfn8')

for key in bucket.objects.all():
    key.delete()
bucket.delete()

Now I want to test using the bucket without local file storage.
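For example, here is a minimal sketch with boto3, assuming a bucket named 'jo8a7fn8sfn8' exists (recreate it if you deleted it above): the object is written from memory and read back into memory, so no local file is involved.

In [ ]:
import gzip
import boto3

s3 = boto3.client('s3')
bucket_name = 'jo8a7fn8sfn8'  # assumed to exist, created earlier

# upload an object straight from memory
payload = gzip.compress(b"l1\nl2\nl3\n")
s3.put_object(Bucket=bucket_name, Key='f.gz', Body=payload)

# download it back into memory and decompress it, no local file involved
obj = s3.get_object(Bucket=bucket_name, Key='f.gz')
print(gzip.decompress(obj['Body'].read()).decode())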

Setting up a reserved instance

Amazon's compute instances are part of its most popular service, the Elastic Compute Cloud (EC2).

Probably the most basic level of access to the Amazon computing infrastructure is setting up a free tier reserved instance.

Web Console

Task:

Amazon CLI

aws ec2 run-instances --image-id ami-xxxxxxxx --count 1 --instance-type t1.micro --key-name MyKeyPair --security-groups my-sg

Boto3

Task:

  • A larger task is to create an instance with Boto3, install an SSH client such as Paramiko, and run commands on the remote instance.

Helpful code:


In [ ]:
import boto3
import botocore
import paramiko


ec2 = boto3.resource('ec2')

# launch the instances (the AMI id is a placeholder)
instances = ec2.create_instances(ImageId='<ami-image-id>', MinCount=1, MaxCount=5)

# wait for the first instance to come up and grab its public IP
instance = instances[0]
instance.wait_until_running()
instance.reload()
instance_ip = instance.public_ip_address

# key pair used when creating the instance, and the command to run (placeholders)
key = paramiko.RSAKey.from_private_key_file('path/to/mykey.pem')
cmd = 'uname -a'

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Connect/ssh to an instance
try:
    # Here 'ubuntu' is the user name and instance_ip is the public IP of the EC2 instance
    client.connect(hostname=instance_ip, username="ubuntu", pkey=key)

    # Execute a command (cmd) after connecting/ssh to an instance
    stdin, stdout, stderr = client.exec_command(cmd)
    print(stdout.read())

    # close the client connection once the job is done
    client.close()

except Exception as e:
    print(e)

Spin up containers via Docker Machine

My preferred way is to use Docker Machine to manage cloud instances that come already set up with Docker. You can then pull your intended container from Docker Hub and run it on the instance. An alternative is using AWS services to create your instance, which has its own benefits (basically most benefits except time). Another alternative is using Docker Cloud or Kubernetes, which is the way to go for multiple instances.

# install docker machine
$ base=https://github.com/docker/machine/releases/download/v0.14.0 && \
curl -L $base/docker-machine-$(uname -s)-$(uname -m) >/tmp/docker-machine && \
sudo install /tmp/docker-machine /usr/local/bin/docker-machine

# setup a cloud instance
$ export SECRET_KEY="..."
$ docker-machine create --driver amazonec2 --amazonec2-region eu-central-1 \
--amazonec2-access-key AKIAJPBEKSXQ7NJGSL3A \
--amazonec2-secret-key $SECRET_KEY \
aws-test

# ssh and delete
docker-machine ssh aws-test
docker-machine rm aws-test

# for other options: --amazonec2-instance-type "t2.2xlarge"
docker-machine create --driver amazonec2

Further read

Pull the Docker Hub container on the EC2 instance, open a shell and run a test:

# run inside the EC2 instance
export DOCKER_ID_USER="grokkaine"
docker login
docker pull $DOCKER_ID_USER/awstest
docker run -ti $DOCKER_ID_USER/awstest /bin/bash

# now run your commands inside the container

This only works for short tasks, because once you log out of the container, the container stops. What you need is to be able to run long jobs, so we must create a detached container, then attach a shell to it during and after the execution of the program in order to check the logs and save the data.

# run inside the EC2 instance
export DOCKER_ID_USER=""
sudo docker login
sudo docker pull $DOCKER_ID_USER/awscrass
sudo docker run -w /home/ -tid $DOCKER_ID_USER/awscrass /bin/bash

# exit container, start it
sudo docker ps
# run a command in detached mode
#sudo docker exec -d containerid bash -c "your command line"

#alternative is to log into the container and run the command there
sudo docker exec -it containerid bash

# start, attach
docker start containerid
docker attach containerid

Pricing

  • storage on S3 buckets: 150 GB * $0.022/GB/month = ~$3/month
  • transfer out: 150 GB * $0.09/GB = ~$13
  • compute using m5.4xlarge (64 GiB RAM) on-demand instances: 20 days * 24 h * $0.90/hour = $432
  • compute using t2.2xlarge (32 GiB RAM): 20 days * 24 h * $0.42/hour = ~$202
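For reference, these back-of-the-envelope numbers are easy to recompute in Python; the prices are the example rates used above, not necessarily current list prices.

In [ ]:
# rough cost estimate using the example prices above; check the AWS
# pricing pages for current rates in your region
storage_gb = 150
transfer_gb = 150
days = 20

print("S3 storage:  $%.0f / month" % (storage_gb * 0.022))   # $/GB/month
print("S3 transfer: $%.0f" % (transfer_gb * 0.09))           # $/GB transferred out
print("m5.4xlarge:  $%.0f" % (days * 24 * 0.90))             # on-demand, $/hour
print("t2.2xlarge:  $%.0f" % (days * 24 * 0.42))             # on-demand, $/hour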

Further read:

Task

Your task is to create a Docker container, push it to Docker Hub, start an EC2 instance and remotely run your container, log out from the instance, then log back in and check that the output is preserved.

Instance Types

On AWS you can opt for different types of instances, and you can also upgrade or downgrade an instance to match your resource needs. For example, you can choose memory-optimized instances for RAM-intensive computations such as sequence alignment, or GPU instances for deep learning and other forms of GPU-accelerated computing. More here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
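As a sketch of the upgrade/downgrade workflow with boto3 (the instance id and target type below are placeholders): an EBS-backed instance must be stopped, its type changed, and then started again.

In [ ]:
import boto3

ec2 = boto3.client('ec2')
instance_id = 'i-0123456789abcdef0'   # placeholder

# the instance must be stopped before its type can be changed
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])

# switch to a bigger (or smaller) instance type, then start the instance again
ec2.modify_instance_attribute(InstanceId=instance_id,
                              InstanceType={'Value': 'm5.4xlarge'})
ec2.start_instances(InstanceIds=[instance_id])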

From the purchasing point of view there are however several major classes of instances, most notably:

  • On-demand instances. Such instances are available when you request them and will be held up until you close them.
  • Spot instances. A Spot Instance is an unused EC2 instance that is available for less than the On-Demand price. Your Spot Instance runs whenever capacity is available and the maximum price per hour for your request exceeds the Spot price (see the sketch after this list).
  • Read about the other types here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-purchasing-options.html
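For illustration, requesting a spot instance from boto3 could look like the sketch below; the AMI id, key pair name and maximum price are placeholders.

In [ ]:
import boto3

ec2 = boto3.client('ec2')

# request a single spot instance; all values below are placeholders
response = ec2.request_spot_instances(
    SpotPrice='0.10',                 # maximum price you are willing to pay, in $/hour
    InstanceCount=1,
    LaunchSpecification={
        'ImageId': 'ami-xxxxxxxx',
        'InstanceType': 't2.2xlarge',
        'KeyName': 'MyKeyPair',
    },
)
print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])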

ECS clusters and Docker Cloud

You can run containerized clusters of EC2 instances using another AWS web service called ECS clusters. More information here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_clusters.html
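A minimal, hedged sketch of talking to ECS from boto3 (the cluster name is made up); container instances still have to be registered to the cluster separately.

In [ ]:
import boto3

ecs = boto3.client('ecs')

# create an (empty) ECS cluster; the name is a placeholder
ecs.create_cluster(clusterName='my-test-cluster')

# list the clusters available in the current region
print(ecs.list_clusters()['clusterArns'])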

Another popular option is Docker Cloud, which lets you define and store images, set up continuous integration, and then run test or production clusters on AWS.

Task:

