Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com
Write a mapper and reducer (in Python or pseudocode) that take a list of page visits and yield each URL along with the number of unique visitors to that URL, given input like the following:
timestamp url user
201301010000 example.com/page01.html userA
201301010100 example.com/page02.html userA
201302010330 example.com/page01.html userB
201303010400 example.com/page03.html userC
201303010401 example.com/page02.html userA
def mapper(k1, v1):
    '''
    k1: timestamp
    v1: url user (separated by a space)
    '''
    ...
    yield (k2, v2)

def reducer(k2, k2_vals):
    ...
    yield (url, number_of_unique_users)
The output should look something like:
example.com/page01.html 2
example.com/page02.html 1
example.com/page03.html 1
Hint: set() and len(), while not necessary, may be helpful.
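One way to fill in the skeleton (a sketch, assuming each call to reducer receives every user recorded for a single URL, and that the set of users for one URL fits in memory):

def mapper(k1, v1):
    '''
    k1: timestamp (unused: unique-visitor counts don't depend on time)
    v1: url user (separated by a space)
    '''
    url, user = v1.split()
    # Re-key on the URL so the shuffle phase groups all visits to a page.
    yield (url, user)

def reducer(k2, k2_vals):
    '''
    k2: url
    k2_vals: every user who visited that url, repeats included
    '''
    # set() discards repeat visits; len() counts the distinct users.
    yield (k2, len(set(k2_vals)))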
Availability Zones (AZs) should rarely matter, but it is important to know what they are (and not to confuse AZs with regions). The AZ will rarely matter; the region often will.
(Warning: the word *key* is used a lot to mean different things.)
A well-configured AMI can save a lot of time because you don't have to install software on every instance after it boots (the software is effectively pre-installed). AMIs can be copied, but only by their creator, and the copy gets a new ID.
One thing EMR does that StarCluster doesn't: it automatically configures Hadoop to read your S3 buckets as if they were part of HDFS. This has to be done manually in StarCluster (or in any other software that runs Hadoop on EC2, e.g. Apache Whirr).
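Once Hadoop has your AWS credentials (supplied manually, in StarCluster's case), an S3 path can stand in anywhere an HDFS path would go. A sketch, where the bucket name is just a placeholder:

In [ ]:
$ hadoop fs -ls s3n://my-bucket/input/
$ hadoop distcp s3n://my-bucket/input/ hdfs:///input/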
We probably won't use EMR anymore in this course. StarCluster works just as well, has more options, and is cheaper. That said, EMR is probably the way most people use Hadoop these days (or at least the way most people use Hadoop on AWS), which is why we covered it last week.
StarCluster comes out of the STAR program at MIT. STAR stands for "Software Tools for Academics and Researchers". It is used to quickly provision a cluster of EC2 instances. Like EMR, it automatically configures them to be used as a cluster (rather than as independent machines) with one controller and many workers. Unlike EMR, it does not have a GUI. Also, Hadoop is just one of many plugins for StarCluster. The most important plugin, however, is IPython Cluster.
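Enabling the IPython Cluster plugin looks something like the following in .starcluster/config (a sketch: the plugin name "ipcluster" is arbitrary, and the cluster template must reference it in its PLUGINS setting):

In [ ]:
[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster

[cluster smallcluster]
# ... other settings ...
PLUGINS = ipcluster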
For $13.12/hour (8 × $1.64/hour), we can get 8 m2.4xlarge machines with 68.4 GiB of memory apiece.
File: .starcluster/config
In [ ]:
[cluster mediumcluster]
# Declares that this cluster uses smallcluster as defaults
EXTENDS = smallcluster
# This section is the same as smallcluster except for the following settings:
NODE_INSTANCE_TYPE = m2.4xlarge
CLUSTER_SIZE = 8
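For context, smallcluster is the template section that ships in StarCluster's default config file. A minimal sketch of what mediumcluster inherits (the key name and AMI ID here are placeholders):

In [ ]:
[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID = ami-xxxxxxxx
NODE_INSTANCE_TYPE = m1.small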
In [ ]:
$ starcluster start -c mediumcluster wikipedia
In [ ]:
$ starcluster put wikipedia --user sgeadmin ~/Downloads/credentials.csv /home/sgeadmin/
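From there, you can log in to the master node and, when finished, shut the whole cluster down (terminating it stops the hourly billing):

In [ ]:
$ starcluster sshmaster wikipedia
In [ ]:
$ starcluster terminate wikipedia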