Data Science

AWS, StarCluster, IPython.parallel, Pandas

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  • Big Data
  • MapReduce Programming Model
  • Implementation Details
  • Word Count Example

Questions?

Agenda:

  1. Readiness Assessment
  2. Amazon Web Services
  3. StarCluster
  4. IPython.parallel

Readiness Assessment

Two Parts:

  1. Individual
  2. Team

Teams:

AA | CC | EE
BA | DC | FE
BB | DD | FF

Readiness Assessment

  1. T/F:
    Part of the innovation of MapReduce is that instead of moving code to data, you move data to code.

  2. Which of the following can be used in place of HDFS:

    1. Elastic Beanstalk
    2. IAM
    3. Redshift
    4. S3
    5. None of the Above
  3. Write a mapper and reducer (in Python or pseudocode) that take a list of page visits and yield a list of URLs and the number of unique visitors to those URLs, given input like the following:

     timestamp       url                     user
     201301010000    example.com/page01.html userA
     201301010100    example.com/page02.html userA
     201302010330    example.com/page01.html userB
     201303010400    example.com/page03.html userC
     201303010401    example.com/page02.html userA
    
     def mapper(k1, v1):
         '''
         k1: timestamp
         v1: url user (separated by a space)
         '''
         ...
         yield (k2, v2)
    
     def reducer(k2, k2_vals):
         ...
         yield (url, number_of_unique_users)

The output should look something like:

    example.com/page01.html    2
    example.com/page02.html    1
    example.com/page03.html    1

Hint: set() and len(), while not necessary, may be helpful.
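One possible solution (a sketch, assuming the framework groups the mapper's output by key before calling the reducer, as Hadoop Streaming does):

    def mapper(k1, v1):
        '''
        k1: timestamp
        v1: url user (separated by a space)
        '''
        url, user = v1.split()
        # Re-key by URL so all visits to the same page
        # reach the same reducer.
        yield (url, user)

    def reducer(k2, k2_vals):
        # k2: a url; k2_vals: every user who visited it.
        # set() drops duplicate visitors; len() counts what remains.
        yield (k2, len(set(k2_vals)))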

Amazon Web Services

  • IAM: Identity and Access Management
  • S3: Simple Storage Service
  • EC2: Elastic Compute Cloud
  • EMR: Elastic MapReduce

Regions:

  • us-east-1 (North Virginia)
  • us-west-1 (Northern California)
  • us-west-2 (Oregon)
  • also regions in Asia, Europe, and South America

Availability Zones (AZ):

  • Each region has several availability zones.

AZs will rarely matter in this course, but it's important to know what they are (and not to confuse them with regions). Regions, on the other hand, often will matter.

Types of authentication:

(Warning: the word *key* is used a lot to mean different things.)

  1. Key & Secret
    • Key = 20-character token (all upper-case)
    • Secret = 40-character string
    • Used for all AWS services
    • Created and managed in IAM
    • Not region specific
  2. Keypair
    • An RSA key pair
    • Used specifically for SSHing into EC2 instances (including EMR)
    • Region specific
    • Created and managed in EC2
  3. Other forms of authentication
    • Alternative to Key & Secret
    • Managed in IAM
    • Useful for granting others partial access to your AWS resources
      (e.g. giving a friend write access to a specific S3 bucket)
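
StarCluster (introduced later in this lesson) is one consumer of both kinds of credentials: the Key & Secret and a keypair both go into its config file. A minimal sketch (the key ID, secret, user ID, and paths are all placeholders):

    [aws info]
    # The 20-character key and 40-character secret from IAM
    AWS_ACCESS_KEY_ID = AKIAXXXXXXXXXXXXXXXX
    AWS_SECRET_ACCESS_KEY = your-40-character-secret-goes-here
    AWS_USER_ID = 123456789012

    [key mykey]
    # The RSA private key for the EC2 keypair named "mykey"
    KEY_LOCATION = ~/.ssh/mykey.rsa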

S3 (Simple Storage Service)

  • Sort of a file system
    • Buckets ~ Folders
    • Keys ~ Files (Note: a different meaning for the word key in this context)
    • Subdirectories aren't really folders, but rather just prefixes to the key name
  • Region Specific
    • but it usually doesn't matter (except for latency)
  • Relatively Cheap
  • Complex Permission Structure (about as complex as a real file system)
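
A quick illustration of buckets and keys using the boto library (a sketch; the bucket name and prefix are made up):

    import boto

    # Picks up the Key & Secret from ~/.boto or environment variables
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-example-bucket')

    # "Subdirectories" are just key-name prefixes:
    for key in bucket.list(prefix='logs/2013/'):
        print key.name, key.size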

EC2: Elastic Compute Cloud: Instance Types

  • Optimization
    • Compute Optimized
    • Memory Optimized
    • Storage Optimized
    • General Purpose
  • Pricing Examples:
    • t1.micro
      • 1 CPU
      • 615 MB RAM
      • \$0.02 per Hour
    • i2.8xlarge
      • 32 CPUs
      • 244 GB RAM
      • \$6.82 per Hour
    • m1.large
      • 2 CPUs
      • 7.5 GB RAM
      • \$0.24 per Hour
  • Not all instance types available in all regions (e.g. High Storage Instances not available in us-west-1)
  • Prices may vary (e.g. instances in us-west-1 are often more expensive)

EC2: Elastic Compute Cloud: Amazon Machine Images (AMIs)

  • A machine image is a copyable snapshot of an instance's contents and configuration
  • If you've ever run a virtual machine (e.g. VirtualBox, VMware, Parallels), you may be familiar
  • EC2 instances start as AMIs that are then instantiated
  • StarCluster uses AMIs extensively to preconfigure EC2 instances to create a cluster
  • A well-configured AMI can save a lot of time
  • AMIs are region specific; a copy made to another region receives a new ID

A well-configured AMI can save a lot of time because you don't have to install software on every instance after it boots; the software is effectively pre-installed. AMIs can be copied, but only by their creator, and the copy receives a new ID.
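
For example, copying an AMI to another region with the AWS CLI hands back a brand-new ID (a sketch; both AMI IDs are made up):

    $ aws ec2 copy-image \
        --source-region us-east-1 \
        --source-image-id ami-12345678 \
        --region us-west-2 \
        --name "my-ami-copy"
    {
        "ImageId": "ami-87654321"
    }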

EMR: Elastic MapReduce

Hadoop in the Cloud

  • Uses EC2 instances
  • Web interface for provisioning cluster
  • Installs Hadoop, etc. automatically
  • Installs Hive, Pig, HBase automatically if desired
  • Provides a web interface for running Hadoop Streaming jobs
  • Charges a small premium on top of EC2 prices

One thing EMR does that StarCluster doesn't: it automatically configures Hadoop to read your S3 buckets as if they were part of HDFS. This has to be done manually with StarCluster (or any other tool that runs Hadoop on EC2, such as Whirr).
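
For instance, a streaming job on EMR can read input from and write output to S3 directly (a sketch; the bucket name is made up, and the streaming jar's location varies with the Hadoop version):

    $ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
        -input s3n://my-example-bucket/input/ \
        -output s3n://my-example-bucket/output/ \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py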

We probably won't use EMR anymore in this course. StarCluster works just as well, has more options, and is cheaper. That said, EMR is probably the way most people use Hadoop these days (or at least the way most people use Hadoop on AWS) which is why we covered it last week.

Real Big Data

StarCluster

StarCluster comes out of the STAR program at MIT. STAR stands for "Software Tools for Academics and Researchers". It is used to quickly provision a cluster of EC2 instances. Like EMR, it automatically configures them to be used as a cluster (rather than as independent machines) with one controller and many workers. Unlike EMR, it does not have a GUI. Also, Hadoop is just one of many plugins for StarCluster. The most important plugin, however, is IPython Cluster.
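
Enabling that plugin takes only a few lines in the StarCluster config (a sketch; smallcluster is the template name from the default config):

    [plugin ipcluster]
    SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster

    [cluster smallcluster]
    # ...existing settings...
    PLUGINS = ipcluster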

http://tinyurl.com/EC2calc


For $13.12/hour (8 × $1.64/hour) we can get 8 m2.4xlarge machines with 68.4 GiB of memory apiece.

File: .starcluster/config


In [ ]:
[cluster mediumcluster]
# Declares that this cluster uses smallcluster's settings as defaults
EXTENDS = smallcluster
# This section is the same as smallcluster except for the following:
NODE_INSTANCE_TYPE = m2.4xlarge
CLUSTER_SIZE = 8

In [ ]:
$ starcluster start -c mediumcluster wikipedia

In [ ]:
$ starcluster put wikipedia --user sgeadmin ~/Downloads/credentials.csv /home/sgeadmin/
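
Once the cluster is up, you can drive its engines from a local IPython session using IPython.parallel (a sketch; the connector-file path and key location are placeholders — the IPCluster plugin prints the real ones when the cluster starts):

    from IPython.parallel import Client

    # Placeholders: use the connector file and key the plugin reports
    rc = Client('/path/to/ipcontroller-client.json',
                sshkey='/path/to/mykey.rsa')

    view = rc[:]  # a DirectView across every engine in the cluster
    print view.map_sync(lambda x: x ** 2, range(16))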

Parse Wikipedia Data

Next Time:

  1. APIs & JSON
  2. NoSQL Databases

Homework:

  1. Create a Twitter account
  2. Install MongoDB
  3. Install Python packages (e.g. conda install ... or pip install ...)