Introduction to Hadoop MapReduce

Reality of working with Big Data

  • Hundreds or thousands of machines are needed to support big data
    • Distribute data for storage (HDFS)
    • Parallelize data computation (Hadoop MapReduce)
    • Handle failure (HDFS and Hadoop MapReduce)

MapReduce

What is “map”? A function/procedure that is applied to every individual element of a collection/list/array/…

int square(int x) { return x * x; }
map square [1,2,3,4] -> [1,4,9,16]

What is “reduce”? A function/procedure that performs an operation on a list, “folding/reducing” the list into a single value (or a smaller subset)

reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
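These two building blocks can be tried directly in Python, whose built-in `map` and `functools.reduce` mirror the notation above:

```python
from functools import reduce

# map applies a function to every individual element of a collection
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)   # [1, 4, 9, 16]

# reduce folds a collection into a single value
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)     # 10

product = reduce(lambda a, b: a * b, [1, 2, 3, 4])
print(product)   # 24
```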

Implementation of MapReduce Programming Paradigm in Hadoop MapReduce

Programmers implement:

  • Map function: takes the input data and emits intermediate key/value pairs
  • Reduce function: receives the key/value pairs emitted by the mappers, grouped by key, and produces the final output as a reduction over each group
  • Optional functions:
    • Partition function: determines how the mappers’ key/value pairs are distributed to the reducers
    • Combine function: performs an initial reduction on the mapper side to reduce network traffic

The MapReduce Framework handles everything else
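As a rough illustration (not how Hadoop itself is implemented), the four programmer-supplied functions can be simulated in a single Python process; the function names and the toy input are made up for this sketch:

```python
from collections import defaultdict

def map_fn(line):
    # map: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def combine_fn(pairs):
    # combine: local pre-aggregation on the mapper side, cutting network traffic
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def partition_fn(key, num_reducers):
    # partition: choose which reducer receives this key
    # (Hadoop's default partitioner also hashes the key)
    return hash(key) % num_reducers

def reduce_fn(key, values):
    # reduce: fold all values for one key into a single result
    return key, sum(values)

def run_job(lines, num_reducers=2):
    # "map side": map each input line, combine locally, then partition
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in combine_fn(map_fn(line)):
            partitions[partition_fn(key, num_reducers)][key].append(value)
    # "reduce side": each reducer folds the values it received per key
    counts = {}
    for partition in partitions:
        for key, values in partition.items():
            k, v = reduce_fn(key, values)
            counts[k] = v
    return counts

counts = run_job(["a b a", "b c"])
print(counts)   # each word mapped to its total count
```

In a real job the framework, not the driver loop above, moves the partitioned pairs across the network and invokes the reducers.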

WordCount: The Hello, World of Big Data

  • Count how many times each unique word appears in a file/multiple files
  • Standard parallel programming approach:
    • Count the number of files
    • Set the number of processes
    • Possibly set up dynamic workload assignment
    • A lot of data transfer
    • Significant coding effort

MapReduce WordCount Example
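A minimal WordCount in the style of Hadoop Streaming (where mappers and reducers read and write tab-separated lines of text) might look like the sketch below; the in-memory `sorted` call stands in for Hadoop's shuffle/sort phase, and the function names are illustrative:

```python
def mapper(lines):
    # map: emit "word<TAB>1" for every word seen
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # reduce: the framework delivers pairs sorted by key,
    # so all occurrences of the same word arrive adjacent
    current, count = None, 0
    for line in lines:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

lines = ["hello world hello"]
shuffled = sorted(mapper(lines))   # stands in for Hadoop's shuffle/sort
result = list(reducer(shuffled))
print(result)
```

In an actual Hadoop Streaming job the two functions would live in separate mapper and reducer scripts passed to the streaming jar via its `-mapper` and `-reducer` options.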

MapReduce PageRank Example 1

MapReduce PageRank Example 2
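One PageRank iteration can likewise be phrased as a map and a reduce. The sketch below uses a made-up three-page graph and the standard 0.85 damping factor; passing the link structure through the reducer is one common formulation, not the only one:

```python
from collections import defaultdict

DAMPING = 0.85  # standard PageRank damping factor

def pagerank_map(page, rank, out_links):
    # share this page's rank equally among its out-links, and re-emit
    # the link structure so the next iteration still has the graph
    pairs = [(dest, rank / len(out_links)) for dest in out_links]
    pairs.append((page, out_links))
    return pairs

def pagerank_reduce(page, values, num_pages):
    out_links, rank_sum = [], 0.0
    for v in values:
        if isinstance(v, list):
            out_links = v        # the pass-through link list
        else:
            rank_sum += v        # a rank contribution from an in-link
    new_rank = (1 - DAMPING) / num_pages + DAMPING * rank_sum
    return new_rank, out_links

def pagerank_iteration(ranks, graph):
    # group the mappers' outputs by key, as the shuffle phase would
    grouped = defaultdict(list)
    for page, out_links in graph.items():
        for key, value in pagerank_map(page, ranks[page], out_links):
            grouped[key].append(value)
    return {page: pagerank_reduce(page, values, len(graph))[0]
            for page, values in grouped.items()}

# toy three-page web: A links to B and C, B to C, C back to A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / len(graph) for page in graph}
new_ranks = pagerank_iteration(ranks, graph)
print(new_ranks)
```

Running several such iterations back to back (each one a full MapReduce job) converges the ranks; the lack of iteration support inside a single job is one reason later frameworks improved on this pattern.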

What is "everything else"?

  • Scheduling
  • Data distribution
  • Synchronization
  • Error and Fault Handling

The cost of "everything else"?

  • All algorithms must be expressed as a combination of mapping, reducing, combining, and partitioning functions
  • No control over execution placement of mappers and reducers
  • No control over life cycle of individual mappers and reducers
  • Very limited information about which mapper handles which data block
  • Very limited information about which reducer handles which intermediate key

Additional challenge

Large-scale debugging of big data programs is difficult

  • Functional errors are difficult to follow at large scale
  • Data-dependent errors are even more difficult to catch and fix

Applications of MapReduce

  • Text tokenization, indexing, and search
    • Web access log stats
    • Inverted index construction
    • Term-vector per host
    • Distributed grep/sort
  • Graph creation
    • Web link-graph reversal (Google’s PageRank)
  • Data Mining and machine learning
    • Document clustering
    • Machine learning
    • Statistical machine translation

Working with Hadoop MapReduce on CloudLab

  • From a terminal, SSH to the namenode in the above experiment.
  • If you do not have your SSH key set up, use a shell to log in to the above experiment, then set a password for your account using
$ sudo passwd <your_user_name>
  • From a terminal, SSH to the namenode using your login name and password
  • This is necessary for the ease of copying code
  • Run the following:
$ hdfs dfs -ls .
$ hdfs dfs -mkdir intro-to-hadoop
$ hdfs dfs -ls .