Data Science

AWS, StarCluster, IPython.parallel, Pandas

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  • Big Data
  • MapReduce Programming Model
  • Implementation Details
  • Word Count Example

Questions?

Agenda:

  1. Readiness Assessment
  2. Amazon Web Services
  3. StarCluster
  4. IPython.parallel

Readiness Assessment

Two Parts:

  1. Individual
  2. Team

Teams:

AA | CC | EE
BA | DC | FE
BB | DD | FF

Readiness Assessment

  1. T/F:
    Part of the innovation of MapReduce is that instead of moving code to data, you move data to code.

  2. Which of the following can be used in place of HDFS:

    1. Elastic Beanstalk
    2. IAM
    3. Redshift
    4. S3
    5. None of the Above
  3. Write a mapper and reducer (in Python or pseudocode) that take a list of page visits and yield a list of URLs and the number of unique visitors to those URLs, given input like the following:

     timestamp       url                     user
     201301010000    example.com/page01.html userA
     201301010100    example.com/page02.html userA
     201302010330    example.com/page01.html userB
     201303010400    example.com/page03.html userC
     201303010401    example.com/page02.html userA
    
     def mapper(k1, v1):
         '''
         k1: timestamp
         v1: url user (separated by a space)
         '''
         ...
         yield (k2, v2)
    
     def reducer(k2, k2_vals):
         ...
         yield (url, number_of_unique_users)

The output should look something like:

    example.com/page01.html    2
    example.com/page02.html    1
    example.com/page03.html    1

Hint: set() and len(), while not necessary, may be helpful.
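One possible solution (a sketch, assuming the framework groups the mapper's output by key before calling the reducer, as Hadoop Streaming does):

    def mapper(k1, v1):
        '''
        k1: timestamp
        v1: url user (separated by a space)
        '''
        url, user = v1.split()
        # Re-key by URL so all visits to the same page
        # reach the same reducer.
        yield (url, user)

    def reducer(k2, k2_vals):
        # k2: a url; k2_vals: every user who visited it.
        # set() drops duplicate visitors; len() counts what remains.
        yield (k2, len(set(k2_vals)))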

Amazon Web Services

  • IAM: Identity and Access Management
  • S3: Simple Storage Service
  • EC2: Elastic Compute Cloud
  • EMR: Elastic MapReduce

Regions:

  • us-east-1 (North Virginia)
  • us-west-1 (Northern California)
  • us-west-2 (Oregon)
  • also regions in Asia, Europe, and South America

Availability Zones (AZ):

  • Each region has several availability zones.

AZs will rarely matter in this course, but it's important to know what they are (and not to confuse them with regions). Regions, on the other hand, often will matter.

Types of authentication:

(Warning: the word *key* is used a lot to mean different things.)

  1. Key & Secret
    • Key = 20-character token (all upper-case)
    • Secret = 40-character string
    • Used for all AWS services
    • Created and managed in IAM
    • Not region specific
  2. Keypair
    • An RSA key pair
    • Used specifically for SSHing into EC2 instances (including EMR)
    • Region specific
    • Created and managed in EC2
  3. Other forms of authentication
    • Alternative to Key & Secret
    • Managed in IAM
    • Useful for granting others partial access to your AWS resources
      (e.g. giving a friend write access to a specific S3 bucket)
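
StarCluster (introduced later in this lesson) is one consumer of both kinds of credentials: the Key & Secret and a keypair both go into its config file. A minimal sketch (the key ID, secret, user ID, and paths are all placeholders):

    [aws info]
    # The 20-character key and 40-character secret from IAM
    AWS_ACCESS_KEY_ID = AKIAXXXXXXXXXXXXXXXX
    AWS_SECRET_ACCESS_KEY = your-40-character-secret-goes-here
    AWS_USER_ID = 123456789012

    [key mykey]
    # The RSA private key for the EC2 keypair named "mykey"
    KEY_LOCATION = ~/.ssh/mykey.rsa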

S3 (Simple Storage Service)

  • Sort of a file system
    • Buckets ~ Folders
    • Keys ~ Files (Note: a different meaning for the word key in this context)
    • Subdirectories aren't really folders, but rather just prefixes to the key name
  • Region Specific
    • but it usually doesn't matter (except for latency)
  • Relatively Cheap
  • Complex Permission Structure (about as complex as a real file system)
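
A quick illustration of buckets and keys using the boto library (a sketch; the bucket name and prefix are made up):

    import boto

    # Picks up the Key & Secret from ~/.boto or environment variables
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-example-bucket')

    # "Subdirectories" are just key-name prefixes:
    for key in bucket.list(prefix='logs/2013/'):
        print key.name, key.size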

EC2: Elastic Compute Cloud: Instance Types

  • Optimization
    • Compute Optimized
    • Memory Optimized
    • Storage Optimized
    • General Purpose
  • Pricing Examples:
    • t1.micro
      • 1 CPU
      • 615 MB RAM
      • \$0.02 per Hour
    • i2.8xlarge
      • 32 CPUs
      • 244 GB RAM
      • \$6.82 per Hour
    • m1.large
      • 2 CPUs
      • 7.5 GB RAM
      • \$0.24 per Hour
  • Not all instance types available in all regions (e.g. High Storage Instances not available in us-west-1)
  • Prices may vary (e.g. instances in us-west-1 are often more expensive)

EC2: Elastic Compute Cloud: Amazon Machine Images (AMIs)

  • A machine image is a copyable snapshot of an instance's contents and configuration
  • If you've ever run a virtual machine (e.g. VirtualBox, VMware, Parallels), you may be familiar
  • EC2 instances start as AMIs that are then instantiated
  • StarCluster uses AMIs extensively to preconfigure EC2 instances to create a cluster
  • A well-configured AMI can save a lot of time
  • AMIs are region specific; a copy made to another region receives a new ID

A well-configured AMI can save a lot of time because you don't have to install software on every instance after it boots; the software is effectively pre-installed. AMIs can be copied, but only by their creator, and the copy receives a new ID.
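
For example, copying an AMI to another region with the AWS CLI hands back a brand-new ID (a sketch; both AMI IDs are made up):

    $ aws ec2 copy-image \
        --source-region us-east-1 \
        --source-image-id ami-12345678 \
        --region us-west-2 \
        --name "my-ami-copy"
    {
        "ImageId": "ami-87654321"
    }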

EMR: Elastic MapReduce

Hadoop in the Cloud

  • Uses EC2 instances
  • Web interface for provisioning cluster
  • Installs Hadoop, etc. automatically
  • Installs Hive, Pig, HBase automatically if desired
  • Provides a web interface for running Hadoop Streaming jobs
  • Charges a small premium on top of EC2 prices

One thing EMR does that StarCluster doesn't: it automatically configures Hadoop to read your S3 buckets as if they were part of HDFS. This has to be done manually with StarCluster (or any other tool that runs Hadoop on EC2, such as Whirr).
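
For instance, a streaming job on EMR can read input from and write output to S3 directly (a sketch; the bucket name is made up, and the streaming jar's location varies with the Hadoop version):

    $ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
        -input s3n://my-example-bucket/input/ \
        -output s3n://my-example-bucket/output/ \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py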

We probably won't use EMR anymore in this course. StarCluster works just as well, has more options, and is cheaper. That said, EMR is probably the way most people use Hadoop these days (or at least the way most people use Hadoop on AWS) which is why we covered it last week.

Real Big Data

StarCluster

StarCluster comes out of the STAR program at MIT. STAR stands for "Software Tools for Academics and Researchers". It is used to quickly provision a cluster of EC2 instances. Like EMR, it automatically configures them to be used as a cluster (rather than as independent machines) with one controller and many workers. Unlike EMR, it does not have a GUI. Also, Hadoop is just one of many plugins for StarCluster. The most important plugin, however, is IPython Cluster.
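
Enabling that plugin takes only a few lines in the StarCluster config (a sketch; smallcluster is the template name from the default config):

    [plugin ipcluster]
    SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster

    [cluster smallcluster]
    # ...existing settings...
    PLUGINS = ipcluster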

http://tinyurl.com/EC2calc


For $13.12/hour (8 × $1.64/hour) we can get 8 m2.4xlarge machines with 68.4 GiB of memory apiece.

File: .starcluster/config


In [ ]:
[cluster mediumcluster]
# Declares that this cluster uses smallcluster's settings as defaults
EXTENDS = smallcluster
# This section is the same as smallcluster except for the following:
NODE_INSTANCE_TYPE = m2.4xlarge
CLUSTER_SIZE = 8

In [ ]:
$ starcluster start -c mediumcluster wikipedia

In [ ]:
$ starcluster put wikipedia --user sgeadmin ~/Downloads/credentials.csv /home/sgeadmin/
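
Once the cluster is up, you can drive its engines from a local IPython session using IPython.parallel (a sketch; the connector-file path and key location are placeholders — the IPCluster plugin prints the real ones when the cluster starts):

    from IPython.parallel import Client

    # Placeholders: use the connector file and key the plugin reports
    rc = Client('/path/to/ipcontroller-client.json',
                sshkey='/path/to/mykey.rsa')

    view = rc[:]  # a DirectView across every engine in the cluster
    print view.map_sync(lambda x: x ** 2, range(16))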

Parse Wikipedia Data

Next Time:

  1. APIs & JSON
  2. NoSQL Databases

Homework:

  1. Create a Twitter account
  2. Install MongoDB
  3. Install Python packages (e.g. conda install ... or pip install ...)