AWS Tips


speaker: Rob Dalton
date: 12/09/2016


Overview

We've learned a lot about AWS so far in this program. However, a few things were either glossed over or never mentioned, and knowing them can make our lives much easier.

This notebook covers those things. The topics I want to cover and the questions I hope to answer:

EC2

  • What exactly is an EC2 instance? How do I manage one?
  • Is there an easy way to create an instance with the packages I need?
  • How do I back up my instances?

EMR

  • What exactly is an EMR cluster? How do I manage one?
  • Is there an easy way to create a cluster with the packages I need?

Instance Sizing (in progress)

  • What size instance do I need?
  • What size cluster do I need?

Setup


In [51]:
import json
import os
import boto3

In [4]:
s3 = boto3.client("s3")


EC2

What is it?

Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. Think of it as a way to make computers in the cloud for your use.

You manage EC2 with resources. Resources include:

  • Amazon Machine Images (AMI)
  • Instances
  • Volumes
  • Snapshots

Amazon Machine Images (AMI)

An AMI is a blueprint for your computing capacity. You can use the ones Amazon has created, ones the AWS community has made available, or you can make your own.

Instances

Instances are essentially virtual private servers (VPSs). They are blocks within AWS's cloud of computing power that you can reserve for yourself. Think of an instance as a CPU, or set of CPUs, with some memory attached.

You can create them with the command line or with the AWS console.


In [75]:
# create instance
response = !aws ec2 run-instances --image-id ami-a9d276c9 \
    --count 1 \
    --instance-type t2.micro \
    --key-name ec2_rob \
    --security-groups ssh_only \
    --block-device-mappings file://examples/blockDeviceMappings.json \
    --user-data file://examples/bootstrap-ec2.sh

You can specify a script to run on startup with the --user-data option. This enables you to do things like automatically install Anaconda and NLTK when the instance is created, or to configure a server as it launches.
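
As a rough sketch of what such a startup script might contain, the helper below assembles a small shell script as text. The name build_user_data and the package list are hypothetical, and it assumes pip is already present on the AMI; the real contents of examples/bootstrap-ec2.sh are not shown in this notebook. On most Linux AMIs, user data that begins with #! is run once, as root, on first boot.

```python
def build_user_data(pip_packages):
    """Return a shell script (as text) that installs the given pip packages.

    Hypothetical helper for illustration only; assumes pip exists on the AMI.
    """
    lines = [
        "#!/bin/bash",
        "# Runs as root on first boot, so no sudo is needed.",
        "set -e",
    ]
    if pip_packages:
        # Sort so the generated script is the same on every run.
        lines.append("pip install " + " ".join(sorted(pip_packages)))
    return "\n".join(lines) + "\n"


script = build_user_data(["nltk", "boto3"])
print(script)
```

Saving that text to a file and pointing --user-data at it with file:// would work the same way as the command above.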

Volumes

Volumes are blocks of storage (disk space, not RAM). When you create an instance, that instance is assigned a root volume - the place where its operating system is stored. This root volume is required for your instance to function as a computer.

You can assign more than one volume to an instance. Databases and large datasets should be stored in separate volumes.

Root Volume Size

The default size for this root volume is 8GB - this is too small and you should make it larger. Do this when you create the instance; growing a root volume afterwards takes extra steps.

You can do this in the console. Or, if you're using the command line, you create a block device mapping file - a JSON file that contains the information for how you want your volumes to be configured.


In [71]:
# view block device mappings
with open('examples/blockDeviceMappings.json') as f:
    block_device_mappings = json.load(f)
    
print(json.dumps(block_device_mappings[0]) + "\n" + json.dumps(block_device_mappings[1]))


{"DeviceName": "/dev/sda1", "Ebs": {"DeleteOnTermination": true, "VolumeType": "gp2", "VolumeSize": 15}}
{"DeviceName": "/dev/sdf", "Ebs": {"DeleteOnTermination": true, "VolumeSize": 20}}
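
A file like the one shown above could be generated rather than written by hand. The sketch below is one way to do it; the function name make_block_device_mappings is hypothetical, and the device names are simply copied from the output above (the root device name depends on the AMI).

```python
import json


def make_block_device_mappings(root_gb, data_gb):
    """Build a mapping list: a resized gp2 root volume plus one data volume."""
    return [
        {
            "DeviceName": "/dev/sda1",  # root device name used in the file above
            "Ebs": {
                "DeleteOnTermination": True,
                "VolumeType": "gp2",
                "VolumeSize": root_gb,  # size in GiB
            },
        },
        {
            "DeviceName": "/dev/sdf",
            "Ebs": {"DeleteOnTermination": True, "VolumeSize": data_gb},
        },
    ]


mappings = make_block_device_mappings(root_gb=15, data_gb=20)
mappings_json = json.dumps(mappings, indent=2)
print(mappings_json)
```

Writing mappings_json to a file gives you something you can pass to --block-device-mappings with file://.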

Snapshots

A snapshot is a point-in-time copy of a volume. Snapshots are extremely useful.

You can use them to:

  • Create backups for your instances
  • Create AMIs

You can create them from the command line on your instance. However, it's easier to do it from the console. Note that when creating snapshots, you must specify the ID or name of the VOLUME, not the instance.
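
Because snapshots are taken per volume, backing up an instance means looking up its volumes first. Here is a minimal sketch of that lookup using boto3's EC2 client methods describe_instances and create_snapshot. The function name snapshot_instance_volumes is hypothetical, and the client is passed in as an argument so the function itself makes no assumptions about credentials or region.

```python
def snapshot_instance_volumes(ec2_client, instance_id, description):
    """Snapshot every EBS volume attached to one instance; return snapshot IDs."""
    reply = ec2_client.describe_instances(InstanceIds=[instance_id])
    snapshot_ids = []
    for reservation in reply["Reservations"]:
        for instance in reservation["Instances"]:
            for mapping in instance.get("BlockDeviceMappings", []):
                if "Ebs" not in mapping:
                    continue  # skip anything that is not an EBS volume
                # Snapshots are requested by VOLUME id, not instance id.
                snap = ec2_client.create_snapshot(
                    VolumeId=mapping["Ebs"]["VolumeId"],
                    Description=description,
                )
                snapshot_ids.append(snap["SnapshotId"])
    return snapshot_ids


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# ids = snapshot_instance_volumes(boto3.client("ec2"), instance_id, "manual backup")
```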

Tags

On creation, each resource is assigned a unique ID. You can also tag your resources with custom values. This can help you organize and identify them - for example, the "Name" tag is useful for labeling your individual EC2 instances and snapshots.

Let's add a "Name" tag to the instance we just created.


In [76]:
# convert response to json dict
data = json.loads(''.join(response))

# get instance id
instance_id = data["Instances"][0]["InstanceId"]

print(instance_id)


i-03f418bb99c4ce339

In [77]:
# add name tag
!aws ec2 create-tags --resources {instance_id} --tags Key=Name,Value=awsTestInstance
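
The same tag can be set from Python instead of the shell. This is a small sketch around boto3's EC2 client method create_tags; the wrapper name name_resource is hypothetical, and the client is passed in as an argument. It should work for any taggable EC2 resource ID, so instances, volumes, and snapshots can all be named this way.

```python
def name_resource(ec2_client, resource_id, name):
    """Attach (or overwrite) the "Name" tag on a single EC2 resource."""
    tags = [{"Key": "Name", "Value": name}]
    ec2_client.create_tags(Resources=[resource_id], Tags=tags)
    return tags


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# name_resource(boto3.client("ec2"), instance_id, "awsTestInstance")
```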

EMR

Amazon Elastic MapReduce (EMR) is a service that provides clusters of EC2 instances for compute-intensive work.

As it's based on EC2, you can use many of the same resources to manage EMR.

Bootstrap Scripts

When you create an EMR cluster, you use the --bootstrap-actions option to specify any scripts you want to run on startup. This runs the script on EVERY instance in your cluster. This is extremely important - distributed computing frameworks like Spark sometimes require that packages exist on each worker node, and that these packages are found at exactly the same paths.
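
If you create clusters from Python, a bootstrap script is attached through the BootstrapActions argument of boto3's EMR client method run_job_flow. The sketch below only builds that argument. The helper name bootstrap_action, the bucket my-bucket, and the script name bootstrap-emr.sh are all hypothetical; the script itself has to be uploaded to S3 first.

```python
def bootstrap_action(name, script_s3_path, args=None):
    """Build one entry for the BootstrapActions list passed to run_job_flow."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,    # S3 location of the script
            "Args": list(args or []),  # arguments handed to the script
        },
    }


# Hypothetical bucket and script name, for illustration only.
actions = [
    bootstrap_action("install-packages", "s3://my-bucket/bootstrap-emr.sh", ["nltk"])
]
print(actions)
```

Passing actions as BootstrapActions=actions, alongside the other cluster settings, asks EMR to run the script on each node as the cluster starts.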

Additional Reading


In [ ]:
# best practices guide:
# http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-best-practices.html