We've learned a lot about AWS so far in this program. However, there are a few things that were either brushed over or weren't mentioned that can make our lives much easier.
This notebook covers those things. The topics I want to cover and the questions I hope to answer:
In [51]:
import json
import os
import boto3
In [4]:
s3 = boto3.client("s3")
In [75]:
# create instance
response = !aws ec2 run-instances --image-id ami-a9d276c9 \
--count 1 \
--instance-type t2.micro \
--key-name ec2_rob \
--security-groups ssh_only \
--block-device-mappings file://examples/blockDeviceMappings.json \
--user-data file://examples/bootstrap-ec2.sh
You can specify a script to run on startup with the --user-data option. This enables you to do things like automatically install anaconda and nltk when the instance is created, or to configure a server before you launch it.
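For readers who prefer boto3 to the CLI, the same launch could be sketched roughly as follows. This is an illustration, not the method used in this notebook: the helper `build_run_instances_args` and the inline startup script are hypothetical, the install command is a placeholder, and the remaining values simply mirror the CLI call above.

```python
# Hypothetical startup script. On typical Linux AMIs, a user-data script
# that starts with a shebang is run once, as root, on first boot.
BOOTSTRAP_SCRIPT = """#!/bin/bash
# placeholder command: replace with your own setup steps
pip install nltk
"""


def build_run_instances_args(image_id, key_name, security_group, user_data):
    """Collect keyword arguments for boto3's ec2.run_instances()."""
    return {
        "ImageId": image_id,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceType": "t2.micro",
        "KeyName": key_name,
        "SecurityGroups": [security_group],
        "UserData": user_data,  # boto3 base64-encodes this string for you
    }


launch_args = build_run_instances_args(
    "ami-a9d276c9", "ec2_rob", "ssh_only", BOOTSTRAP_SCRIPT
)

# To actually launch (this creates a billable instance):
# import boto3
# ec2 = boto3.client("ec2")
# response = ec2.run_instances(**launch_args)
# instance_id = response["Instances"][0]["InstanceId"]
```

Keeping the arguments in a plain dict makes it easy to inspect them before anything is launched.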
Volumes are block storage devices, similar to hard drives. When you create an instance, that instance is assigned a root volume - the place where its operating system is stored. This root volume is required for your instance to function as a computer.
You can assign more than one volume to an instance. Databases and large datasets should be stored in separate volumes.
The default size for this root volume is typically 8GB - this is too small for most work and you should make it larger. The simplest time to do this is when you create the instance.
You can do this in the console. Or, if you're using the command line, you create a block device mapping file - a JSON file that contains the information for how you want your volumes to be configured.
In [71]:
# view block device mappings
with open('examples/blockDeviceMappings.json') as f:
    block_device_mappings = json.load(f)
# print each of the two volume mappings on its own line
print(json.dumps(block_device_mappings[0]) + "\n" + json.dumps(block_device_mappings[1]))
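The contents of examples/blockDeviceMappings.json are not reproduced in this notebook, so the following is only a sketch of what such a file could contain. The device names, sizes, and the `make_mapping` helper are assumptions; the keys (`DeviceName`, `Ebs`, `VolumeSize`, `VolumeType`, `DeleteOnTermination`) follow the EC2 block device mapping format.

```python
import json


def make_mapping(device_name, size_gb, delete_on_termination=True):
    """Describe one EBS volume in EC2's block device mapping format."""
    return {
        "DeviceName": device_name,
        "Ebs": {
            "VolumeSize": size_gb,   # size in GiB
            "VolumeType": "gp2",     # general purpose SSD
            "DeleteOnTermination": delete_on_termination,
        },
    }


# Assumed layout: a larger root volume plus a separate data volume that
# is kept when the instance is terminated. The root device name depends
# on the AMI (for example /dev/sda1 or /dev/xvda), so check yours first.
mappings = [
    make_mapping("/dev/sda1", 30),
    make_mapping("/dev/sdf", 100, delete_on_termination=False),
]

mapping_json = json.dumps(mappings, indent=2)
print(mapping_json)
```

Saving `mapping_json` to a file gives you something you can pass to the CLI with --block-device-mappings file://, as in the run-instances call earlier.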
A snapshot is a copy of a volume. Snapshots are extremely useful.
You can use them to:
You can create them from the command line on your instance. However, it's easier to do it from the console. Note that when creating snapshots, you must specify the ID of the VOLUME (the console also lets you pick it by name), not the ID of the instance.
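Because a snapshot is taken of a volume rather than an instance, a script first has to find the volume ID. A minimal boto3 sketch of that lookup is shown here; the function names are hypothetical, and it assumes the instance has an EBS root volume.

```python
def root_volume_id(ec2, instance_id):
    """Look up the ID of the root EBS volume attached to an instance."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    # The root volume is the mapping whose device name matches RootDeviceName.
    for mapping in instance["BlockDeviceMappings"]:
        if mapping["DeviceName"] == instance["RootDeviceName"]:
            return mapping["Ebs"]["VolumeId"]
    raise LookupError("no root volume found for " + instance_id)


def snapshot_root_volume(ec2, instance_id, description):
    """Snapshot the instance's root volume and return the snapshot ID."""
    volume_id = root_volume_id(ec2, instance_id)
    snapshot = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    return snapshot["SnapshotId"]


# Usage (requires AWS credentials; the instance ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2")
# print(snapshot_root_volume(ec2, "i-0123456789abcdef0", "manual backup"))
```

Passing the client in as an argument keeps the functions easy to try out against a stub before pointing them at a real account.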
On creation, each resource is assigned a unique ID. You can also tag your resources with custom values. This can help you organize and identify them - for example, the "Name" tag is useful for labeling your individual EC2 instances and snapshots.
Let's add a "Name" tag to the instance we just created.
In [76]:
# convert response to json dict
data = json.loads(''.join(list(response)))
# get instance id
instance_id = data["Instances"][0]["InstanceId"]
print(instance_id)
In [77]:
# add name tag
!aws ec2 create-tags --resources {instance_id} --tags Key=Name,Value=awsTestInstance
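The same tagging step could also be done through boto3, which expects tags as a list of Key/Value dicts rather than the Key=...,Value=... syntax of the CLI. The sketch below converts a plain dict into that shape; `to_tag_list` and the "Project" tag are hypothetical additions for illustration.

```python
def to_tag_list(tags):
    """Convert a plain dict into the Key/Value list that EC2 expects."""
    return [{"Key": key, "Value": value} for key, value in sorted(tags.items())]


tag_list = to_tag_list({"Name": "awsTestInstance", "Project": "aws-notebook"})
print(tag_list)

# To apply the tags with boto3 (instance_id comes from the cell above):
# import boto3
# boto3.client("ec2").create_tags(Resources=[instance_id], Tags=tag_list)
```

`Resources` accepts several IDs at once, so one call can tag an instance and its volumes together.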
Amazon Elastic MapReduce (EMR) is a service that provides clusters of EC2 instances for compute-intensive workloads.
As it's based on EC2, you can use many of the same resources to manage EMR.
When you create an EMR cluster, you use the --bootstrap-actions option to specify any scripts you want to run on startup. This runs the script on EVERY instance in your cluster. This is extremely important - sometimes distributed computing frameworks like Spark require that packages exist on each worker node, and that these packages are found at exactly the same paths.
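In boto3, bootstrap scripts are passed to `run_job_flow` as a `BootstrapActions` list. A rough sketch of building one entry is below; the helper name, the bucket, and the script path are all hypothetical, and the other arguments that `run_job_flow` requires are left out.

```python
def bootstrap_action(name, script_path, args=()):
    """Build one entry for the BootstrapActions list of emr.run_job_flow()."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_path, "Args": list(args)},
    }


# Hypothetical bucket and script; the script runs on every node in the cluster.
actions = [
    bootstrap_action(
        "install-python-packages",
        "s3://my-bucket/bootstrap-emr.sh",
        ["nltk"],
    )
]
print(actions)

# Passing it on (other required arguments omitted):
# import boto3
# emr = boto3.client("emr")
# emr.run_job_flow(Name="my-cluster", BootstrapActions=actions, ...)
```

Since the script runs identically on each node, it is a natural place to install packages to fixed paths.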
In [ ]:
# best practices guide
# http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-best-practices.html