MapReduce Design Patterns

Based on "MapReduce Design Patterns" by Donald Miner and Adam Shook

Linh B. Ngo

Motivation

  • MapReduce is designed as a framework
    • The solution has to fit into the framework
    • There are clear boundaries on what can and cannot be done
    • Creating a solution that fits within these boundaries can be a challenge
  • MapReduce Design Pattern
    • A template for solving common and general data manipulation problems with MapReduce
    • Design patterns are not specific to a domain, but describe a general approach to solving a problem

Basic Patterns

  • Summarization
  • Filtering
  • Data Organization
  • Join
  • Metapatterns

Summarization

  • Grouping similar data together and performing an operation such as calculating a statistic, building an index, or simply counting
  • Examples:
    • Numerical summarizations
    • Inverted Index
    • Counting with Counters

Numerical Summarizations

  • Calculate aggregate statistical values over a large data set
  • Benefit from a properly designed combiner (How?)
  • Benefit from a properly designed partitioner (How?)
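
A minimal sketch of this pattern as a word count (class names and whitespace tokenization are illustrative). Because addition is associative and commutative, the reducer can double as the combiner via job.setCombinerClass(SumReducer.class), which is one answer to the combiner question above; a custom Partitioner can additionally spread skewed keys more evenly across reducers.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSummarization {

      // Map: emit (token, 1) for every token in the line
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: sum the partial counts; the same class works as the combiner
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }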

Inverted Index

  • Generating an index from a data set to allow for faster searches or data enrichment capabilities
  • Benefit from the combiner (How?)
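
A sketch, assuming each input line is "<docId><TAB><document text>" (an invented layout): the mapper emits (term, docId) pairs, an optional combiner drops duplicate pairs produced by the same map task before the shuffle (one answer to the question above), and the reducer assembles the posting list.

    import java.io.IOException;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {

      // Map: emit (term, docId) for every term in the document
      public static class IndexMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private final Text term = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\t", 2);
          if (parts.length < 2) {
            return;
          }
          docId.set(parts[0]);
          for (String token : parts[1].toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
              term.set(token);
              context.write(term, docId);
            }
          }
        }
      }

      // Combine: drop duplicate (term, docId) pairs within one map task
      public static class DedupCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          Set<String> seen = new LinkedHashSet<>();
          for (Text v : values) {
            seen.add(v.toString());
          }
          for (String d : seen) {
            context.write(key, new Text(d));
          }
        }
      }

      // Reduce: collect the unique document ids for each term
      public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          Set<String> postings = new LinkedHashSet<>();
          for (Text v : values) {
            postings.add(v.toString());
          }
          context.write(key, new Text(String.join(",", postings)));
        }
      }
    }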

Counting with Counters

  • Utilizing MapReduce’s internal counters to calculate global sums over the entire data set
  • Done entirely in the mapping phase (no reducers needed)
  • Useful only if the number of distinct keys is small
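
A sketch, assuming the first tab-separated field of each record is a country code (an invented layout): the mapper only increments counters and writes no output, the job runs with zero reducers, and after job.waitForCompletion(true) the driver reads the totals back from job.getCounters().

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: the driver sets job.setNumReduceTasks(0)
    public class CountryCounterMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 0 && !fields[0].isEmpty()) {
          // One counter per distinct country code; this only scales
          // because the number of distinct keys is small
          context.getCounter("RecordsPerCountry", fields[0]).increment(1);
        }
      }
    }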

Filtering

  • Does not change the actual records, but finds a subset of the entire data set
  • Examples
    • Filtering
    • Top-ten
    • Distinct

Filtering

  • Evaluates each record and decides, based on some condition, whether it should stay or go
  • Does not require the reduce phase
  • Can a combiner help?
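
A grep-style sketch (the configuration key grep.pattern is an invented name): the mapper passes through only the records that match the condition. With zero reducers there is no shuffle at all, which is the thing to consider for the combiner question above.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only filter: matching records pass through unchanged
    public class GrepMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        // The filter condition arrives through the job configuration
        pattern = Pattern.compile(
            context.getConfiguration().get("grep.pattern", "ERROR"));
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (pattern.matcher(value.toString()).find()) {
          context.write(NullWritable.get(), value);
        }
      }
    }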

Top-Ten

  • Retrieve a relatively small number of top K records from the data set according to a ranking scheme
  • Important: unique entries (records that tie on the ranking value need special handling)
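
A sketch, assuming a tab-separated input whose second field is a numeric score (an invented layout): each mapper keeps only its local top K records in a TreeMap and emits them in cleanup(), and a single reducer merges the small candidate lists. Keying the TreeMap by the score is also why unique entries matter here: records that tie on the ranking value overwrite each other.

    import java.io.IOException;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopTen {
      private static final int K = 10;

      // Each map task emits only its local top K records
      public static class TopTenMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final TreeMap<Long, Text> topK = new TreeMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
          long score = Long.parseLong(value.toString().split("\t")[1]);
          topK.put(score, new Text(value));    // assumes unique scores
          if (topK.size() > K) {
            topK.remove(topK.firstKey());      // evict the smallest
          }
        }

        @Override
        protected void cleanup(Context context)
            throws IOException, InterruptedException {
          for (Text record : topK.values()) {
            context.write(NullWritable.get(), record);
          }
        }
      }

      // A single reducer (job.setNumReduceTasks(1)) sees at most K records
      // per map task and keeps the global top K
      public static class TopTenReducer
          extends Reducer<NullWritable, Text, NullWritable, Text> {
        private final TreeMap<Long, Text> topK = new TreeMap<>();

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
          for (Text value : values) {
            topK.put(Long.parseLong(value.toString().split("\t")[1]),
                new Text(value));
            if (topK.size() > K) {
              topK.remove(topK.firstKey());
            }
          }
          for (Text record : topK.descendingMap().values()) {
            context.write(NullWritable.get(), record);
          }
        }
      }
    }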

Distinct

  • Find a unique set of values
  • Key process: deduplication
  • Benefit from deduplication processes implemented inside mappers and combiners
  • Benefit from a large number of reducers
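
A sketch: the record itself becomes the map output key, the framework’s grouping performs the deduplication, and the same reducer class can be installed as the combiner (job.setCombinerClass) to deduplicate inside each map task before the shuffle. Because the per-key reduce work is trivial, adding reducers mainly buys parallelism.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Distinct {

      // Map: emit the record itself as the key, with no value
      public static class DistinctMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(value, NullWritable.get());
        }
      }

      // Reduce (and combine): each distinct key is written exactly once
      public static class DistinctReducer
          extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
          context.write(key, NullWritable.get());
        }
      }
    }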

Data Organization

  • Reorganizing data to help improve its value, performance, and usability
  • Examples:
    • Structured to Hierarchical
    • Partitioning and Binning

Structured to Hierarchical

  • Create new records with a different structure from the input data (e.g., transforming row-based data into a hierarchical format such as XML or JSON)
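
A sketch of the reduce side of this pattern for a made-up posts-and-comments example: two mappers (omitted, e.g. wired up with MultipleInputs) emit (postId, "P|<post text>") and (postId, "C|<comment text>"), and the reducer nests every comment under its post as a JSON-like record. The tags and field layout are illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PostCommentHierarchyReducer
        extends Reducer<Text, Text, NullWritable, Text> {

      @Override
      protected void reduce(Text postId, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        String post = null;
        List<String> comments = new ArrayList<>();
        for (Text value : values) {
          String v = value.toString();
          if (v.startsWith("P|")) {            // the post record
            post = v.substring(2);
          } else if (v.startsWith("C|")) {     // a comment record
            comments.add("\"" + v.substring(2) + "\"");
          }
        }
        if (post != null) {
          context.write(NullWritable.get(), new Text(
              "{\"post\": \"" + post + "\", \"comments\": ["
              + String.join(", ", comments) + "]}"));
        }
      }
    }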

Partitioning and Binning

  • Relies on Hadoop’s partitioning and multiple-output facilities to categorize data based on key ranges
  • Binning: done in the mapping phase; each map task writes records directly into their bins, so no reducers are needed
  • Partitioning: done after the mapping phase; a custom Partitioner routes each record to a specific reducer, and hence to a specific output file (see the sketch below)
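
A sketch of the partitioning half, assuming the map output key is a year and the decade boundaries are invented: the custom Partitioner routes each key range to its own reducer, so each reducer writes one range’s output file. Binning, by contrast, would happen inside the mapper, typically by writing each record to a named output with MultipleOutputs.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Registered in the driver with job.setPartitionerClass(DecadePartitioner.class)
    public class DecadePartitioner extends Partitioner<IntWritable, Text> {
      @Override
      public int getPartition(IntWritable year, Text value, int numPartitions) {
        int partition = (year.get() - 1990) / 10;   // 1990s -> 0, 2000s -> 1, ...
        return Math.max(0, Math.min(partition, numPartitions - 1));
      }
    }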

Join

  • Merging similar data (with or without aggregation) from multiple data sets
  • Examples:
    • Reduce-side join
    • Map-side join

Reduce-side join

  • Join multiple large data sets together by a foreign key
  • Simplest and most straightforward join to implement (see the sketch below)
  • Should be the last option considered: all of the data goes through the shuffle, making it usually the least efficient join
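
A sketch of an inner reduce-side join: the two (omitted) mappers emit the foreign key with a tagged value, e.g. (userId, "A|<user record>") and (userId, "B|<order record>"), and the reducer buffers both sides for a key and emits their cross product. The tags, names, and layouts are illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoinReducer extends Reducer<Text, Text, Text, Text> {

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Text value : values) {
          String v = value.toString();
          if (v.startsWith("A|")) {
            left.add(v.substring(2));
          } else if (v.startsWith("B|")) {
            right.add(v.substring(2));
          }
        }
        // Inner join: emit every pairing of the two sides for this key
        for (String l : left) {
          for (String r : right) {
            context.write(key, new Text(l + "\t" + r));
          }
        }
      }
    }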

Map-side join

  • Joining happens in the map phase
  • Replicated Join
  • Composite Join

Map-side join: replicated join
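
The smaller data set is replicated to every map task (for example with job.addCacheFile(...) in the driver), loaded into memory in setup(), and the large data set is streamed past it record by record, so no shuffle or reduce phase is needed; the small side must fit in memory. A sketch, with an invented tab-separated "key\tvalue" layout for both inputs:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReplicatedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> smallTable = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // Load the replicated (small) data set into memory
        URI cached = context.getCacheFiles()[0];
        FileSystem fs = FileSystem.get(cached, context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path(cached))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
              smallTable.put(parts[0], parts[1]);
            }
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2);
        String match = smallTable.get(fields[0]);
        if (match != null) {                   // inner join, done in the map phase
          context.write(new Text(fields[0]),
              new Text((fields.length > 1 ? fields[1] : "") + "\t" + match));
        }
      }
    }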

Map-side join: composite join
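
Both inputs must already be sorted by the join key and partitioned into the same number of identically keyed parts; matching partitions can then be merge-joined inside the map phase, with no shuffle and no requirement that either side fit in memory. Hadoop ships a CompositeInputFormat for this case, and the preparation requirement usually means an earlier job has to sort and partition the inputs first.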

Metapatterns

  • Patterns that deal with patterns
  • Examples:
    • Job chaining
    • Job merging

Job chaining

  • Many problems can’t be solved with a single MapReduce job
  • The default MapReduce framework requires a lot of manual coding to handle multistage jobs
    • Clean up intermediate outputs
    • Handle failures
  • Jobs can be chained through the main Java program or via external scripting (see the driver sketch below)
  • Possible support tools: Oozie, Cascading, Tez …
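
A minimal hand-rolled two-stage chain in the driver (paths and job setup are placeholders): the second job reads the first job’s output, a failed stage stops the chain, and the intermediate directory is deleted at the end.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStageChainDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        Job first = Job.getInstance(conf, "stage-1");
        // first.setJarByClass / setMapperClass / setReducerClass(...) omitted
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
          System.exit(1);                    // handle failure: stop the chain
        }

        Job second = Job.getInstance(conf, "stage-2");
        // second job's mapper/reducer setup omitted
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        boolean ok = second.waitForCompletion(true);

        FileSystem.get(conf).delete(intermediate, true);   // clean up intermediate output
        System.exit(ok ? 0 : 1);
      }
    }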

Job chaining: chain folding

  • An optimization applied to chains of MapReduce jobs (see the ChainMapper/ChainReducer sketch below)
  • Reduces the amount of data movement in the MapReduce pipeline
  • If multiple map phases are adjacent, merge them
  • If a job ends with a map phase, push that phase into the previous reducer
  • Split up map operations that decrease the amount of data from those that increase the amount of data
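
One way to fold a chain into a single job is Hadoop’s ChainMapper/ChainReducer, which run several map steps inside the same map or reduce task, so adjacent map phases never touch the disk or network between them. A sketch with invented word-count-style steps (input/output paths and the rest of the job setup are omitted):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

    public class ChainFoldingExample {

      // First map step: tokenize lines into (word, 1)
      public static class TokenizeMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String t : value.toString().toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
              context.write(new Text(t), new IntWritable(1));
            }
          }
        }
      }

      // Second map step, folded into the same map task: drop short words
      public static class FilterMapper
          extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
          if (key.getLength() >= 4) {
            context.write(key, value);
          }
        }
      }

      // Reduce step: sum the counts
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      // Trailing map step, pushed into the reduce task: reformat the record
      public static class FormatMapper
          extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text("count\t" + key), value);
        }
      }

      // Wiring: two folded map steps, the reducer, then one more map step
      public static void configureChain(Job job) throws IOException {
        ChainMapper.addMapper(job, TokenizeMapper.class, LongWritable.class,
            Text.class, Text.class, IntWritable.class, new Configuration(false));
        ChainMapper.addMapper(job, FilterMapper.class, Text.class,
            IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
        ChainReducer.setReducer(job, SumReducer.class, Text.class,
            IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
        ChainReducer.addMapper(job, FormatMapper.class, Text.class,
            IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
      }
    }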

Job merging

  • Allows unrelated jobs that load the same data to share one MapReduce pipeline, so the input is read only once (see the sketch below)
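
A sketch of one common way to merge jobs, assuming a map-only setting and invented filter conditions: two unrelated analyses run inside the same mapper, so the input is read only once, and each writes to its own named output via MultipleOutputs. The driver would register the "errors" and "metrics" outputs with MultipleOutputs.addNamedOutput(...) and set zero reducers.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class MergedJobsMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      private MultipleOutputs<Text, NullWritable> outputs;

      @Override
      protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        if (line.contains("ERROR")) {              // "job" 1: error filter
          outputs.write("errors", value, NullWritable.get(), "errors/part");
        }
        if (line.contains("latency_ms=")) {        // "job" 2: metrics filter
          outputs.write("metrics", value, NullWritable.get(), "metrics/part");
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        outputs.close();
      }
    }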