Introduction
Pre-cloud: figuring out the workflow
Getting set up at home
Getting set up on AWS
Running the task
Before we get started, let's do our usual warm-up procedure:
In [1]:
import matplotlib
matplotlib.use('nbagg')
%matplotlib inline
And imports:
In [2]:
from IPython.display import Image
import sys
sys.path.append("../lib")
import eros
At first blush, it may seem odd that we are contemplating the distributed use of a library which has historically been focused on desktop-type environments. On further reflection, though, the value becomes clear: you have probably noted that with large data sets or complex plots, matplotlib runs more slowly than we might like. What to do, then, when we need to generate a handful of plots for very large data sets, or thousands of plots from diverse sources?
If this sounds far-fetched, keep in mind that there are companies that have massive PDF-generating "farms" of servers for similar activities. This notebook will be exploring a similar use case: you are a researcher working for a small company tracking climatic patterns and possible changes at both poles. Your team is focused on the Arctic, and your assignment is to process satellite imagery for the east coast of Greenland: not only the new images as they come in (every 16 days), but previously captured satellite data as well.
Imagine that you are a software engineer working for a research firm tracking climate data in various arctic regions. You've been tasked with processing data for Scoresby Sund, Greenland -- the largest and longest fjord system in the world. You will be using data from Landsat 8, processing both 1) satellite image data since its launch in 2013, and 2) new imagery as it comes in, every 16 days. The data sets are fairly large, weighing in between 600 and 900 MB. We'll take a look at a sample workflow, and then propose a plan for handling this on a recurring basis.
We have based the code used in our eros module upon the BSD-licensed code that Milos Miljkovic provided for his PyData 2014 talk, Analyzing Satellite Images With Python Scientific Stack; we have, however, made a fair number of changes -- all of which were much easier to make thanks to Milos' work!
For your project, you will be acquiring data from the EROS archives using the USGS EarthExplorer site (downloads require a registered user account, which is free). You will use their map to locate what are called “scenes” – specific geographic areas of satellite imagery which can be downloaded using the EarthExplorer “Bulk Download Application”. Your initial focus will be data from Scoresby Sund, the largest and longest fjord system in the world.
Here is a view of the EarthExplorer site where you will search for your scenes:
In [3]:
Image(filename="EarthExplorer.png")
Out[3]:
Your first set of data will come from scene ID “LC82260102014232LGN00”, from a capture in August 2014. In the "Data Sets" view, expand "Landsat Archive" and then select "L8 OLI/TIRS". After clicking "Results", you will be presented with a list of scenes, each with a preview. You can click on the thumbnail image to see the preview. Once you have located the scene, click the tiny little "package" icon (it will be light brown/tan in color). If you haven't logged in, you will be prompted to. Add it to your cart, and then go to your cart to complete the free order.
Next, you will need to open the BDA application and then download your order from there (when it opens, it will show you pending orders and present you with the option of downloading them).
Before creating a cloud workflow, we need to step through the process manually to identify all the steps and indicate which may be automated. We will be using a data set from a specific point in time, but what we define here should be usable by any Landsat 8 data – and some of it will be usable by older satellite remote sensing data.
We will start by organizing the data. The BDA will save its downloads to a specific location (different for each operating system); let's move all the data you've pulled down with the EarthExplorer BDA to a location we can easily reference in our IPython notebook: /EROSData/L8_OLI_TIRS. Ensure that your scene data is there, in the LC82260102014232LGN00 directory.
In [4]:
path = "/EROSData/L8_OLI_TIRS"
scene_id = "LC82260102014232LGN00"
In [5]:
ls -al "/EROSData/L8_OLI_TIRS"
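Each band is stored as its own GeoTIFF inside the scene directory. Assuming the standard Landsat 8 naming convention -- the scene ID followed by `_Bn.TIF`, which is worth verifying against your own download -- we can construct the expected path for any band:

```python
import os

def band_path(path, scene_id, band):
    """Build the expected path to a single band's GeoTIFF for a scene."""
    filename = "{0}_B{1}.TIF".format(scene_id, band)
    return os.path.join(path, scene_id, filename)

band_path("/EROSData/L8_OLI_TIRS", "LC82260102014232LGN00", 4)
# → '/EROSData/L8_OLI_TIRS/LC82260102014232LGN00/LC82260102014232LGN00_B4.TIF'
```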
In [6]:
rgb_image = eros.extract_rgb(path, scene_id)
If you examine the source for that function, you will see that it invokes a chain of image-processing tasks.
Landsat 8 records multiple bands, only three of which cover the familiar channels (R, G, and B) representing visible light. The Landsat 8 bands are as follows:

- Band 1 -- Coastal aerosol
- Band 2 -- Blue
- Band 3 -- Green
- Band 4 -- Red
- Band 5 -- Near-infrared (NIR)
- Band 6 -- Short-wave infrared (SWIR) 1
- Band 7 -- Short-wave infrared (SWIR) 2
- Band 8 -- Panchromatic
- Band 9 -- Cirrus
- Band 10 -- Thermal infrared (TIRS) 1
- Band 11 -- Thermal infrared (TIRS) 2
In your task, you will be using bands 1 through 4 for water and RGB, 5 for vegetation, and 7 to pick out rocks. Let's take a look at the true-color RGB image for the bands we just extracted:
In [7]:
eros.show_image(rgb_image, "RGB image, data set " + scene_id, figsize=(20, 20))
That image isn't very clear – the colors are all quite muted. We can gain some insight into this by looking at a histogram of the data files for each color channel:
In [8]:
eros.show_color_hist(rgb_image, xlim=[5000,20000], ylim=[0,10000], figsize=(20, 7))
As we can see, the majority of the color information is concentrated in a fairly narrow range of values, yet the image is being rendered across the full range, outliers included. Let's create a new image using just the ranges where the majority of the color data lies. We'll do this by visually estimating good cut-off points: we'll limit the red channel to the range 5900-11000, green to 6200-11000, and blue to 7600-11000:
In [9]:
rgb_image_he = eros.update_image(rgb_image, (5900, 11000), (6200, 11000), (7600, 11000))
eros.show_image(rgb_image_he, "RGB image, histogram equalized", figsize=(20, 20))
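We won't dig into the internals of eros.update_image here, but the essence of this kind of contrast stretch is easy to sketch in NumPy: clip each channel to its chosen range, then rescale the surviving values to span the full display range. This is a simplified sketch of the technique, not the library's actual implementation:

```python
import numpy as np

def stretch_channel(channel, lo, hi):
    """Clip values to [lo, hi], then rescale the result onto [0.0, 1.0]."""
    clipped = np.clip(channel.astype(np.float64), lo, hi)
    return (clipped - lo) / (hi - lo)

# Values below the cut-off collapse to 0.0 and values above to 1.0,
# so the in-range data uses the full display range:
channel = np.array([5000, 5900, 8450, 11000, 16000])
stretch_channel(channel, 5900, 11000)
# → array([0. , 0. , 0.5, 1. , 1. ])
```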
The next image you've been asked to generate is a false-color one, with the following bands taking the place of the standard RGB values: band 7 (SWIR 2) for the red channel, band 5 (NIR) for green, and band 3 (green) for blue -- hence the name swir2nirg.
In this type of image, ice will show up as light blue/cyan.
We will perform the same process of extraction, histogram examination, and range adjustment:
In [10]:
swir2nirg_image = eros.extract_swir2nirg(path, scene_id)
In [11]:
eros.show_color_hist(swir2nirg_image, xlim=[4000,30000], ylim=[0,10000], figsize=(20, 7))
In [12]:
swir2nirg_image_he = eros.update_image(swir2nirg_image, (5000, 14000), (5200, 14000), (8000, 14000))
eros.show_image(swir2nirg_image_he, "", figsize=(20, 20))
On a 2009 iMac (Intel Core i7, 8 GB RAM), processing both of those images took about 7 minutes in total, with the Python process consuming about 6 GB of RAM. The IPython kernel needs to be restarted after 2 to 3 runs due to memory consumption. This is a task well suited to utility computing: large numbers of these jobs can be executed quickly elsewhere, freeing up your workstation.
You've now identified the steps needed to create the images for your project with the eros library. In the process of clearly defining these steps, you've also shown that resource utilization is high and that processing takes a while on your workstation. As such, you know that jobs of any significant number or duration will need to be distributed across multiple machines -- a prime use case for cloud computing.
You've selected the AWS EC2 and S3 infrastructure for your cloud environment, but you next need to identify what you will use to configure and run your matplotlib plot-generation code. For configuration management, you've looked at the following:
After some reading and prototyping with the potential solutions above, you got the furthest with Docker in the shortest time and, more importantly, were able to confirm that Docker suited your needs exactly. Furthermore, your configuration management selection is also a prime virtualization choice for running ephemeral, compute-intensive jobs inside an EC2 instance.
So you've come up with a plan:
The next steps are now clear:
Let's continue on!
You will be creating your Docker images locally, so you will need the following installed on your workstation:
If you're running Linux, you can skip the rest of this section and just install Docker (the docker.io
package on Ubuntu; just docker
on RedHat-based systems).
If you haven't run boot2docker
before, you'll need to do this first:
$ boot2docker init
If you have, you can just do this:
$ boot2docker up
At which point you will see output something like the following:
Waiting for VM and Docker daemon to start...
...............ooo
Started.
Writing /Users/oubiwann/.boot2docker/certs/boot2docker-vm/ca.pem
Writing /Users/oubiwann/.boot2docker/certs/boot2docker-vm/cert.pem
Writing /Users/oubiwann/.boot2docker/certs/boot2docker-vm/key.pem
To connect the Docker client to the Docker daemon, please set:
export DOCKER_CERT_PATH=/Users/oubiwann/.boot2docker/certs/boot2docker-vm
export DOCKER_TLS_VERIFY=1
export DOCKER_HOST=tcp://192.168.59.103:2376
You can either manually export those environment variables, or do this:
$ eval "$(boot2docker shellinit)"
which will set the appropriate variables in your shell environment. At this point, Docker is ready to use!
The heart of your configuration management work is the Dockerfile. This will be used to generate the Docker image you need, which in turn is used to run Docker containers -- where your matplotlib tasks will actually happen. If you are unfamiliar with Docker, here's a quick summary of how to think about these components:

- Dockerfile -- the specification for building extendible images (more info)
- Image -- a read-only template, built from a Dockerfile, from which containers are created
- Container -- a runnable instance of an image, where your code actually executes

But we will start at the beginning: Dockerfiles. Here is our first Dockerfile:
In [13]:
cat ../docker/python/Dockerfile
Note that this is not a Dockerfile created from scratch; rather, it is based upon the official Ubuntu Vivid Vervet (15.04) image. An image has already been generated from this Dockerfile and pushed up to Docker Hub.
The next Dockerfile
is for NumPy and SciPy dependencies:
In [14]:
cat ../docker/scipy/Dockerfile
As you can see, this Dockerfile extends our previous one, masteringmatplotlib/python, by adding the libraries we most commonly need when using Python for scientific computing. This one has also been pushed up to Docker Hub.
In the next section, we will discuss how to create a new Dockerfile based upon this one.
Since the base Docker images we want to use (the chain ubuntu:vivid -> masteringmatplotlib/python -> masteringmatplotlib/scipy) have already been built and pushed to Docker Hub, there's no need for us to build them from scratch: we just need to download them. The top-most layer of our image stack is all we need to ask for, and we'll get the rest automatically:
$ docker pull masteringmatplotlib/scipy
This may take a while to run, depending upon your internet connection: you will be downloading pre-built Docker images totaling several hundred MB.
Once that has completed, we can make sure things are working by starting a container and running the image's default command (python3 in this case):
$ docker run -t -i masteringmatplotlib/scipy
You should be greeted with the familiar prompt; try an import as a sanity check:
>>> import matplotlib
>>>
You can exit the Docker container by quitting the Python interpreter:
>>> ^d
We've now got our SciPy-stack Docker image downloaded and usable. How do we customize it?
There are two ways we could do this: 1) run a container from the image, make our changes interactively, and then commit a new image; or 2) write a new Dockerfile that extends the image. Let's go with option #2. To make it even easier, we've provided this file for you as part of this notebook:
In [15]:
cat ../docker/simple/Dockerfile
Points to note:

- This Dockerfile extends the masteringmatplotlib/scipy Docker image.
- The library is made available via PYTHONPATH; in most situations, you'll want to create a setup.py for your Python library, install it with pip in the Dockerfile build steps, and then not have to mess with PYTHONPATH.

Let's build our new image!
$ docker build -t yourname/eros ./docker/simple/
After which you should see output like this:
Sending build context to Docker daemon 2.56 kB
Sending build context to Docker daemon
Step 0 : FROM ipython/scipystack
---> 113395173d25
Step 1 : MAINTAINER Py Hacker <you@py.hacker>
---> Using cache
---> fd520c92b33b
[snip]
Removing intermediate container 90983e9fdd54
Step 6 : CMD PYTHONPATH=./cloud-deploy/lib:$PYTHONPATH python3
---> Running in b7a022f2ac29
---> abde2bb0eeaa
Removing intermediate container b7a022f2ac29
Successfully built abde2bb0eeaa
Let's make sure our library is present in our new image:
$ docker run -t -i yourname/eros python3
>>> import eros
>>> ^d
$
We've still got some more work to do, though.
We need to make a couple of changes from our simple case so that:

- our code runs headless, with no X server required (i.e., no dependence upon DISPLAY), and
- our image-generation task runs when the container starts.

These can both be accomplished simply by changing our Docker CMD directive:
In [16]:
cat ../docker/eros/Dockerfile
The s3_generate_image function is the "dispatcher": depending upon the environment variables which are set when running Docker, it will take different actions. More on that later.
The files we're working with are sizable, ranging from about 150MB to 600MB each. As such, we want to be selective in what we'll be pushing to S3. For your project, the following Landsat bands are needed:
All the files for a particular scene weigh in at over 2GB, so we'll just want to push the files for the bands we need, per the Landsat bands noted above:
"$SCENE_PATH"/$SCENE/*_B[1-5,7].TIF
Before running the commands below, you need to make sure that the user associated with the access and secret keys has appropriate S3 permissions (e.g., the ability to upload files). Let's start by setting some shell variables:
$ SCENE_PATH="/EROSData/L8_OLI_TIRS"
$ SCENE=LC82260102014232LGN00
$ export AWS_ACCESS_KEY_ID=YOURACCESSKEY
$ export AWS_SECRET_ACCESS_KEY=YOURSECRETKEY
Now we can create the S3 bucket:
$ aws s3 mb s3://scoresbysund
And with the bucket now existing, we can upload our particular Landsat scene files:
$ for FILE in "$SCENE_PATH"/$SCENE/*_B[2-5,7].TIF; do
aws s3 cp "$FILE" s3://scoresbysund
done
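If you'd rather script the upload than loop in the shell, the same band selection and upload can be done from Python. Here's a sketch using the boto3 library -- an assumption on our part, as the chapter itself uses the AWS CLI and boto3 is not required for anything that follows:

```python
import os

def band_files(scene_path, scene, bands=(2, 3, 4, 5, 7)):
    """Paths for just the band files we need, mirroring *_B[2-5,7].TIF."""
    return [os.path.join(scene_path, scene, "{0}_B{1}.TIF".format(scene, band))
            for band in bands]

def upload_scene(scene_path, scene, bucket="scoresbysund"):
    """Upload each needed band file to S3; credentials come from AWS_* env vars."""
    import boto3  # assumed installed: pip install boto3
    s3 = boto3.client("s3")
    for path in band_files(scene_path, scene):
        s3.upload_file(path, bucket, os.path.basename(path))
```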
Through trial and error, you have discovered that an m3.xlarge EC2 instance will be necessary, due to the intensive memory requirements of the image processing (the next instance size down offers 7.5 GB of RAM, which proved insufficient).
Once the instance is up and running, get the IP address from the AWS Console where you launched it, and use that to SSH into it:
$ ssh -i /path/to/your-ec2-key-pair.pem ubuntu@instance-ip-address
Then, on the running instance, prep the instance by installing Docker and saving your AWS credentials (for access to S3) on the filesystem:
ubuntu@ip-address:~$ sudo apt-get install -y docker.io
ubuntu@ip-address:~$ sudo mkdir /etc/aws
ubuntu@ip-address:~$ sudo vi /etc/aws/access
ubuntu@ip-address:~$ sudo vi /etc/aws/secret
ubuntu@ip-address:~$ sudo chmod 600 /etc/aws/*
ubuntu@ip-address:~$ sudo chown ubuntu /etc/aws/*
Now you need to pull down the Docker image we've created for this task, and then run a container using this image in interactive mode using Python as the shell. You can use the Docker image you created, or the one that we did (masteringmatplotlib/eros
):
ubuntu@ip-address:~$ sudo docker run -i -t masteringmatplotlib/eros python3
Python 3.4.3 (default, Feb 27 2015, 02:35:54)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
A quick test to make sure everything is in place:
>>> import eros
>>>
If you get no errors, you're all set for the next steps!
In order to read your scene data from the files that you uploaded to S3, you'll need to set a bucket policy: in the S3 section of the AWS Console, go to your bucket, highlight it, click "Properties", then "Edit bucket policy", and in the form that appears, paste the following, substituting your EC2 instance's IP address:
{
"Version": "2012-10-17",
"Id": "S3ScoresbySundGetPolicy",
"Statement": [
{
"Sid": "IPAllow",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:*",
"Resource": "arn:aws:s3:::scoresbysund/*",
"Condition" : {
"IpAddress" : {
"aws:SourceIp": "YOUREC2IPADDRESS/32"
}
}
}
]
}
In the section above -- "Preparing for deployment" -- we mentioned the s3_generate_image function in our Dockerfile, saying that it was the "dispatcher" and that, depending upon the environment variables set when running Docker, it would take different actions. We're going to cover that now.
When our EC2 instance starts up a Docker container tasked with building our Landsat 8 image files, the Docker container will need to know a few things:
We can pass this information to the Docker container using the -e flag, which will set environment variables in the container. Let's try this out and confirm that this feature does what we need:
ubuntu@ip-address:~$ sudo docker run -i \
-e "PYTHONPATH=/root/cloud-deploy" \
-e "EROS_SCENE_ID=LC82260102014232LGN00" \
-e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
-e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
-t masteringmatplotlib/eros \
python3
This will drop us into a Python prompt in the container where we can check on the environment variables:
>>> import os
>>> os.getenv("EROS_SCENE_ID")
'LC82260102014232LGN00'
>>> os.getenv("AWS_ACCESS_KEY_ID")
'YOURACCESSKEY'
>>> os.getenv("AWS_SECRET_ACCESS_KEY")
'YOURSECRETACCESSKEY'
Now that you've confirmed that you can get the data you need into your Docker containers, you can update your code to check for particular ones that you will set when running a container to generate satellite images. For instance, towards the beginning of the lib/ec2s3eros.py
module, we have the following:
bucket_name = os.environ.get("S3_BUCKET_NAME")
scene_id = os.environ.get("EROS_SCENE_ID")
s3_path = os.environ.get("S3_PATH")
s3_title = os.environ.get("S3_IMAGE_TITLE")
s3_filename = os.environ.get("S3_IMAGE_FILENAME")
s3_image_type = os.environ.get("S3_IMAGE_TYPE", "").lower()
access_key = os.environ.get("AWS_ACCESS_KEY_ID")
secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
These are what our code will use inside our various S3-ized image generation functions. These parameters will allow us to create the appropriate image and save it to the appropriate place with the expected name.
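To make the dispatching concrete, here is a minimal sketch of how an s3_generate_image-style dispatcher can select a handler based on S3_IMAGE_TYPE. The handler names here are hypothetical stand-ins, not the actual eros functions:

```python
import os

def generate_rgb():
    """Stand-in for the real RGB image-generation task."""
    return "generated rgb"

def generate_swir2nirg():
    """Stand-in for the real false-color image-generation task."""
    return "generated swir2nirg"

def s3_generate_image():
    """Dispatch to a handler based on the S3_IMAGE_TYPE environment variable."""
    image_type = os.environ.get("S3_IMAGE_TYPE", "").lower()
    handlers = {"rgb": generate_rgb,
                "swir2nirg": generate_swir2nirg}
    if image_type not in handlers:
        raise ValueError("unsupported S3_IMAGE_TYPE: {0!r}".format(image_type))
    return handlers[image_type]()

os.environ["S3_IMAGE_TYPE"] = "rgb"
s3_generate_image()
# → 'generated rgb'
```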
Let's now create a final (and updated) version of our Docker image:
$ docker build -t masteringmatplotlib/eros docker/eros/
Next, we'll need to publish it to Docker Hub so that you can pull it down on your EC2 instance:
$ docker push <your-project-or-user-name>/eros
On your EC2 instance, pull down the latest version of <your-project-or-user-name>/eros, which you just published:
ubuntu@ip-address:~$ sudo docker pull <your-project-or-user-name>/eros
Now you can finally run a Docker container from your image, thus generating a file for the RGB satellite image data:
ubuntu@ip-address:~$ export BUCKET=scoresbysund
ubuntu@ip-address:~$ export SCENE=LC82260102014232LGN00
ubuntu@ip-address:~$ export IMGTYPE=rgb
ubuntu@ip-address:~$ sudo docker run \
-e "S3_IMAGE_TITLE=RGB Image: Scene $SCENE" \
-e "S3_IMAGE_TYPE=$IMGTYPE" \
-e "S3_IMAGE_FILENAME=$SCENE-$IMGTYPE-`date "+%Y%m%d%H%M%S"`.png" \
-e "S3_BUCKET_NAME=$BUCKET" \
-e "S3_PATH=https://s3-us-west-2.amazonaws.com/$BUCKET" \
-e "EROS_SCENE_ID=$SCENE" \
-e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
-e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
-t masteringmatplotlib/eros
As that task runs, you will see output like the following:
Generating scene image ...
Saving image to S3 ...
0.0/100
27.499622350156933/100
54.99924470031387/100
82.4988670504708/100
100.0/100
On a relatively modern iMac, this job took about 7 to 8 minutes; on EC2 it only took about 15 seconds.
For the false-color short-wave and IR image, run a similar command:
ubuntu@ip-address:~$ export IMGTYPE=swir2nirg
ubuntu@ip-address:~$ sudo docker run \
-e "S3_IMAGE_TITLE=False-Color Image: Scene $SCENE" \
-e "S3_IMAGE_TYPE=$IMGTYPE" \
-e "S3_IMAGE_FILENAME=$SCENE-$IMGTYPE-`date "+%Y%m%d%H%M%S"`.png" \
-e "S3_BUCKET_NAME=$BUCKET" \
-e "S3_PATH=https://s3-us-west-2.amazonaws.com/$BUCKET" \
-e "EROS_SCENE_ID=$SCENE" \
-e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
-e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
-t masteringmatplotlib/eros
Generating scene image ...
Saving image to S3 ...
0.0/100
26.05542715379233/100
52.11085430758466/100
78.16628146137698/100
100.0/100
You can confirm that the image has been saved to your bucket by refreshing the S3 screen in your AWS Console.
Though it may seem awkward to parameterize the Docker container with so many environment variables, this allows you to easily change the data you pass without needing to re-generate the Docker image. Your Docker image produces containers that are generally useful for the task at hand, allowing you to process any scene.