AWS Tips

speaker: Rob Dalton
date: 12/09/2016

Overview

We've learned a lot about AWS so far in this program. However, there are a few things that were either brushed over or weren't mentioned that can make our lives much easier.

This notebook covers those things. The topics I want to cover and the questions I hope to answer:

EC2

What exactly is an EC2 instance? How do I manage one?
Is there an easy way to create an instance with the packages I need?
How do I back up my instances?

EMR

What exactly is an EMR cluster? How do I manage one?
Is there an easy way to create a cluster with the packages I need?

Instance Sizing *In Progress

What size instance do I need?
What size cluster do I need?

Setup



In [51]:

    
import json
import os
import boto3



In [4]:

    
s3 = boto3.client("s3")

EC2

What is it?

Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. Think of it as a way to create computers in the cloud for your own use.

You manage EC2 with resources. Resources include:

Amazon Machine Images (AMI)
Instances
Volumes
Snapshots

Amazon Machine Images (AMI)

An AMI is a blueprint for your computing capacity. You can use the ones Amazon has created, ones the AWS community has made available, or you can make your own.

AMI selection page

Instances

Instances are essentially virtual private servers (VPSs). They are blocks of computing power within AWS's cloud that you can reserve for yourself. Think of an instance as a virtual computer with its own CPU (or set of CPUs) and memory.

You can create them with the command line or with the AWS console.



In [75]:

    
# create instance
response = !aws ec2 run-instances --image-id ami-a9d276c9 \
    --count 1 \
    --instance-type t2.micro \
    --key-name ec2_rob \
    --security-groups ssh_only \
    --block-device-mappings file://examples/blockDeviceMappings.json \
    --user-data file://examples/bootstrap-ec2.sh

You can specify a script to run on startup with the --user-data option. This lets you do things like automatically install Anaconda and NLTK when the instance is created, or configure a server before you start using it.
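The same launch can also be driven from Python with boto3 rather than the CLI. The following is a minimal sketch, not necessarily how the speaker did it: the helper names build_run_instances_kwargs and launch_instance are hypothetical, and the AMI, key pair, and security group values are simply copied from the command above. The keyword names are the parameters of boto3's EC2 run_instances call, and boto3 base64-encodes UserData for you.

```python
import json


def build_run_instances_kwargs(image_id, instance_type, key_name,
                               security_group, mappings_path=None,
                               user_data_path=None):
    """Build keyword arguments for the boto3 EC2 client's run_instances()."""
    kwargs = {
        "ImageId": image_id,
        "MinCount": 1,  # together with MaxCount, equivalent to --count 1
        "MaxCount": 1,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroups": [security_group],
    }
    if mappings_path is not None:
        # Same JSON file the CLI reads through file://
        with open(mappings_path) as f:
            kwargs["BlockDeviceMappings"] = json.load(f)
    if user_data_path is not None:
        # The startup script is passed as plain text
        with open(user_data_path) as f:
            kwargs["UserData"] = f.read()
    return kwargs


def launch_instance(**kwargs):
    """Start the instance. Needs boto3 and configured AWS credentials."""
    import boto3  # imported here so the builder above works without boto3
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**kwargs)


kwargs = build_run_instances_kwargs(
    "ami-a9d276c9", "t2.micro", "ec2_rob", "ssh_only")
print(json.dumps(kwargs, indent=2, sort_keys=True))
# launch_instance(**kwargs)  # uncomment to actually start (and pay for) an instance
```

Keeping the argument building separate from the API call makes it easy to inspect exactly what will be requested before anything is launched.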

Volumes

Volumes are blocks of disk storage; think of them as virtual hard drives. When you create an instance, that instance is assigned a root volume - the place where its operating system is stored. This root volume is required for your instance to function as a computer.

You can assign more than one volume to an instance. Databases and large datasets should be stored in separate volumes.

Root Volume Size

The default size for this root volume is typically 8 GB, which is too small for most work, so you should increase it. Do this when you create the instance; changing the size afterwards takes extra steps.

You can do this in the console. Or, if you're using the command line, you create a block device mapping file - a JSON file that contains the information for how you want your volumes to be configured.



In [71]:

    
# view block device mappings
with open('examples/blockDeviceMappings.json') as f:
    block_device_mappings = json.load(f)
    
# print the first two mappings, one per line (print() works in Python 2 and 3)
print(json.dumps(block_device_mappings[0])+"\n"+json.dumps(block_device_mappings[1]))

{"DeviceName": "/dev/sda1", "Ebs": {"DeleteOnTermination": true, "VolumeType": "gp2", "VolumeSize": 15}}
{"DeviceName": "/dev/sdf", "Ebs": {"DeleteOnTermination": true, "VolumeSize": 20}}
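A mapping file like the one printed above can also be generated from Python instead of being written by hand. This is only a sketch: make_mapping and write_mappings are hypothetical helper names, and the output filename is an assumption. The keys (DeviceName, Ebs, VolumeSize, VolumeType, DeleteOnTermination) are the ones that appear in the output above.

```python
import json


def make_mapping(device_name, size_gb, volume_type=None,
                 delete_on_termination=True):
    """Return one block device mapping entry for an EBS volume."""
    ebs = {
        "VolumeSize": size_gb,  # size in GB
        "DeleteOnTermination": delete_on_termination,
    }
    if volume_type is not None:
        ebs["VolumeType"] = volume_type
    return {"DeviceName": device_name, "Ebs": ebs}


def write_mappings(path, mappings):
    """Write the list of mappings as JSON so the CLI can read it via file://."""
    with open(path, "w") as f:
        json.dump(mappings, f, indent=2)


# Same layout as the example: a 15 GB gp2 root volume plus a 20 GB data volume.
mappings = [
    make_mapping("/dev/sda1", 15, volume_type="gp2"),
    make_mapping("/dev/sdf", 20),
]
write_mappings("blockDeviceMappings.json", mappings)
```

The resulting file can then be passed to aws ec2 run-instances with --block-device-mappings file://blockDeviceMappings.json.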

Snapshots

A snapshot is a copy of a volume. Snapshots are extremely useful.

You can use them to:

Create backups for your instances
Create AMIs

You can create them from the command line on your instance. However, it's easier to do it from the console. Note that when creating snapshots, you must specify the ID or name of the VOLUME, not the instance.
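Because snapshots are taken per volume, backing up an instance means first finding the volumes attached to it. The sketch below shows one way this could look with boto3; the function names and the placeholder instance ID are hypothetical. It assumes the response layout of the EC2 client's describe_instances call (Reservations, Instances, BlockDeviceMappings, Ebs, VolumeId) and uses create_snapshot with a VolumeId.

```python
def volume_ids_from_response(response):
    """Collect EBS volume IDs from a describe_instances() response dict."""
    ids = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            for mapping in instance.get("BlockDeviceMappings", []):
                ebs = mapping.get("Ebs")
                if ebs and "VolumeId" in ebs:
                    ids.append(ebs["VolumeId"])
    return ids


def snapshot_instance_volumes(ec2, instance_id, description="backup"):
    """Snapshot every EBS volume attached to one instance.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the list of new snapshot IDs.
    """
    response = ec2.describe_instances(InstanceIds=[instance_id])
    snapshot_ids = []
    for volume_id in volume_ids_from_response(response):
        # Snapshots are requested per VOLUME, not per instance
        snap = ec2.create_snapshot(VolumeId=volume_id, Description=description)
        snapshot_ids.append(snap["SnapshotId"])
    return snapshot_ids


# Usage (needs AWS credentials; the instance ID below is a placeholder):
# import boto3
# ec2 = boto3.client("ec2")
# print(snapshot_instance_volumes(ec2, "i-0123456789abcdef0"))
```

Passing the client in as an argument keeps the function easy to try out against a stub before pointing it at a real account.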

EMR

Amazon Elastic MapReduce (EMR) is a service that provides clusters of EC2 instances for intensive compute capacity.

As it's based on EC2, you can use many of the same resources to manage EMR.

Bootstrap Scripts

When you create an EMR cluster with aws emr create-cluster, you use the --bootstrap-actions option to specify any scripts you want to run on startup. This runs each script on EVERY instance in your cluster. This is extremely important - distributed computing frameworks like Spark sometimes require that packages exist on each worker node, and that these packages are found at exactly the same paths.
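If you create clusters from Python instead of the CLI, the bootstrap scripts are described as a list of dictionaries. The sketch below only builds that list; bootstrap_action is a hypothetical helper, and the S3 path, action name, and argument are made-up placeholders. The dictionary layout (Name, ScriptBootstrapAction, Path, Args) follows the BootstrapActions parameter of boto3's EMR run_job_flow call.

```python
def bootstrap_action(name, script_path, args=None):
    """Describe one bootstrap script in the shape EMR expects."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,        # script location, typically in S3
            "Args": list(args or []),   # arguments passed to the script
        },
    }


# Placeholder values: the bucket, script name, and argument are assumptions.
actions = [
    bootstrap_action("install-packages",
                     "s3://my-bucket/bootstrap-emr.sh",
                     ["nltk"]),
]
print(actions)
```

The list would then be passed as BootstrapActions=actions to boto3.client("emr").run_job_flow(...), alongside the other cluster settings that call requires (such as Name and Instances), which are not shown here.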

Additional Reading



In [ ]:

    
# best practices guide
# http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-best-practices.html