MapReduce Design Patterns

Based on "MapReduce Design Patterns" by Donald Miner and Adam Shook

Linh B. Ngo

Motivation

  • MapReduce is designed as a framework
    • The solution has to fit into the framework
    • There are clear boundaries on what can and cannot be done
    • Creating a solution that fits within these boundaries can be a challenge
  • MapReduce Design Pattern
    • A template for solving common and general data manipulation problems with MapReduce
    • Design patterns are not specific to a domain, but describe a general approach to solving a problem

Basic Patterns

  • Summarization
  • Filtering
  • Data Organization
  • Join
  • Metapatterns

Summarization

  • Grouping similar data together and performing an operation such as calculating a statistic, building an index, or simply counting
  • Examples:
    • Numerical summarizations
    • Inverted Index
    • Counting with Counters

Numerical Summarizations

  • Calculate aggregate statistical values over a large data set
  • Benefit from a properly designed combiner (How?)
  • Benefit from a properly designed partitioner (How?)
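
A minimal sketch of this pattern as a word count (class names and whitespace tokenization are illustrative). Because addition is associative and commutative, the reducer can double as the combiner via job.setCombinerClass(SumReducer.class), which is one answer to the combiner question above; a custom Partitioner can additionally spread skewed keys more evenly across reducers.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSummarization {

      // Map: emit (token, 1) for every token in the line
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: sum the partial counts; the same class works as the combiner
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }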

Inverted Index

  • Generating an index from a data set to allow for faster searches or data enrichment capabilities
  • Benefit from the combiner (How?)
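
A sketch, assuming each input line is "<docId><TAB><document text>" (an invented layout): the mapper emits (term, docId) pairs, an optional combiner drops duplicate pairs produced by the same map task before the shuffle (one answer to the question above), and the reducer assembles the posting list.

    import java.io.IOException;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {

      // Map: emit (term, docId) for every term in the document
      public static class IndexMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private final Text term = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\t", 2);
          if (parts.length < 2) {
            return;
          }
          docId.set(parts[0]);
          for (String token : parts[1].toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
              term.set(token);
              context.write(term, docId);
            }
          }
        }
      }

      // Combine: drop duplicate (term, docId) pairs within one map task
      public static class DedupCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          Set<String> seen = new LinkedHashSet<>();
          for (Text v : values) {
            seen.add(v.toString());
          }
          for (String d : seen) {
            context.write(key, new Text(d));
          }
        }
      }

      // Reduce: collect the unique document ids for each term
      public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          Set<String> postings = new LinkedHashSet<>();
          for (Text v : values) {
            postings.add(v.toString());
          }
          context.write(key, new Text(String.join(",", postings)));
        }
      }
    }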

Counting with Counters

  • Utilizing MapReduce’s internal counters to calculate global sums over the entire data set
  • Done entirely in the mapping phase (no reducers needed)
  • Useful only if the number of distinct keys is small
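
A sketch, assuming the first tab-separated field of each record is a country code (an invented layout): the mapper only increments counters and writes no output, the job runs with zero reducers, and after job.waitForCompletion(true) the driver reads the totals back from job.getCounters().

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: the driver sets job.setNumReduceTasks(0)
    public class CountryCounterMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 0 && !fields[0].isEmpty()) {
          // One counter per distinct country code; this only scales
          // because the number of distinct keys is small
          context.getCounter("RecordsPerCountry", fields[0]).increment(1);
        }
      }
    }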

Filtering

  • Does not change the actual records, but finds a subset of the entire data set
  • Examples
    • Filtering
    • Top-ten
    • Distinct

Filtering

  • Evaluates each record and decides, based on some condition, whether it should stay or go
  • Does not require the reduce phase
  • Can a combiner help?
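
A grep-style sketch (the configuration key grep.pattern is an invented name): the mapper passes through only the records that match the condition. With zero reducers there is no shuffle at all, which is the thing to consider for the combiner question above.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only filter: matching records pass through unchanged
    public class GrepMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        // The filter condition arrives through the job configuration
        pattern = Pattern.compile(
            context.getConfiguration().get("grep.pattern", "ERROR"));
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (pattern.matcher(value.toString()).find()) {
          context.write(NullWritable.get(), value);
        }
      }
    }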

Top-Ten

  • Retrieve a relatively small number of top K records from the data set according to a ranking scheme
  • Important: unique entries (records that tie on the ranking value need special handling)
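
A sketch, assuming a tab-separated input whose second field is a numeric score (an invented layout): each mapper keeps only its local top K records in a TreeMap and emits them in cleanup(), and a single reducer merges the small candidate lists. Keying the TreeMap by the score is also why unique entries matter here: records that tie on the ranking value overwrite each other.

    import java.io.IOException;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopTen {
      private static final int K = 10;

      // Each map task emits only its local top K records
      public static class TopTenMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final TreeMap<Long, Text> topK = new TreeMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
          long score = Long.parseLong(value.toString().split("\t")[1]);
          topK.put(score, new Text(value));    // assumes unique scores
          if (topK.size() > K) {
            topK.remove(topK.firstKey());      // evict the smallest
          }
        }

        @Override
        protected void cleanup(Context context)
            throws IOException, InterruptedException {
          for (Text record : topK.values()) {
            context.write(NullWritable.get(), record);
          }
        }
      }

      // A single reducer (job.setNumReduceTasks(1)) sees at most K records
      // per map task and keeps the global top K
      public static class TopTenReducer
          extends Reducer<NullWritable, Text, NullWritable, Text> {
        private final TreeMap<Long, Text> topK = new TreeMap<>();

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
          for (Text value : values) {
            topK.put(Long.parseLong(value.toString().split("\t")[1]),
                new Text(value));
            if (topK.size() > K) {
              topK.remove(topK.firstKey());
            }
          }
          for (Text record : topK.descendingMap().values()) {
            context.write(NullWritable.get(), record);
          }
        }
      }
    }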

Distinct

  • Find a unique set of values
  • Key process: deduplication
  • Benefit from deduplication processes implemented inside mappers and combiners
  • Benefit from a large number of reducers
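
A sketch: the record itself becomes the map output key, the framework’s grouping performs the deduplication, and the same reducer class can be installed as the combiner (job.setCombinerClass) to deduplicate inside each map task before the shuffle. Because the per-key reduce work is trivial, adding reducers mainly buys parallelism.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Distinct {

      // Map: emit the record itself as the key, with no value
      public static class DistinctMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(value, NullWritable.get());
        }
      }

      // Reduce (and combine): each distinct key is written exactly once
      public static class DistinctReducer
          extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
          context.write(key, NullWritable.get());
        }
      }
    }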

Data Organization

  • Reorganizing data to help improve its value, performance, and usability
  • Examples:
    • Structured to Hierarchical
    • Partitioning and Binning

Structured to Hierarchical

  • Create new records with a different structure from the input data (e.g., transforming row-based data into a hierarchical format such as XML or JSON)
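
A sketch of the reduce side of this pattern for a made-up posts-and-comments example: two mappers (omitted, e.g. wired up with MultipleInputs) emit (postId, "P|<post text>") and (postId, "C|<comment text>"), and the reducer nests every comment under its post as a JSON-like record. The tags and field layout are illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PostCommentHierarchyReducer
        extends Reducer<Text, Text, NullWritable, Text> {

      @Override
      protected void reduce(Text postId, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        String post = null;
        List<String> comments = new ArrayList<>();
        for (Text value : values) {
          String v = value.toString();
          if (v.startsWith("P|")) {            // the post record
            post = v.substring(2);
          } else if (v.startsWith("C|")) {     // a comment record
            comments.add("\"" + v.substring(2) + "\"");
          }
        }
        if (post != null) {
          context.write(NullWritable.get(), new Text(
              "{\"post\": \"" + post + "\", \"comments\": ["
              + String.join(", ", comments) + "]}"));
        }
      }
    }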

Partitioning and Binning

  • Relies on Hadoop’s partitioning and multiple-output facilities to categorize data based on key ranges
  • Binning: done in the mapping phase; each map task writes records directly into their bins, so no reducers are needed
  • Partitioning: done after the mapping phase; a custom Partitioner routes each record to a specific reducer, and hence to a specific output file (see the sketch below)
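
A sketch of the partitioning half, assuming the map output key is a year and the decade boundaries are invented: the custom Partitioner routes each key range to its own reducer, so each reducer writes one range’s output file. Binning, by contrast, would happen inside the mapper, typically by writing each record to a named output with MultipleOutputs.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Registered in the driver with job.setPartitionerClass(DecadePartitioner.class)
    public class DecadePartitioner extends Partitioner<IntWritable, Text> {
      @Override
      public int getPartition(IntWritable year, Text value, int numPartitions) {
        int partition = (year.get() - 1990) / 10;   // 1990s -> 0, 2000s -> 1, ...
        return Math.max(0, Math.min(partition, numPartitions - 1));
      }
    }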

Join

  • Merging similar data (with or without aggregation) from multiple data sets
  • Examples:
    • Reduce-side join
    • Map-side join

Reduce-side join

  • Join multiple large data sets together by a foreign key
  • Simplest and most straightforward join to implement (see the sketch below)
  • Should be the last option considered: all of the data goes through the shuffle, making it usually the least efficient join
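
A sketch of an inner reduce-side join: the two (omitted) mappers emit the foreign key with a tagged value, e.g. (userId, "A|<user record>") and (userId, "B|<order record>"), and the reducer buffers both sides for a key and emits their cross product. The tags, names, and layouts are illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReduceSideJoinReducer extends Reducer<Text, Text, Text, Text> {

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Text value : values) {
          String v = value.toString();
          if (v.startsWith("A|")) {
            left.add(v.substring(2));
          } else if (v.startsWith("B|")) {
            right.add(v.substring(2));
          }
        }
        // Inner join: emit every pairing of the two sides for this key
        for (String l : left) {
          for (String r : right) {
            context.write(key, new Text(l + "\t" + r));
          }
        }
      }
    }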

Map-side join

  • Joining happens in the map phase
  • Replicated Join
  • Composite Join

Map-side join: replicated join
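
The smaller data set is replicated to every map task (for example with job.addCacheFile(...) in the driver), loaded into memory in setup(), and the large data set is streamed past it record by record, so no shuffle or reduce phase is needed; the small side must fit in memory. A sketch, with an invented tab-separated "key\tvalue" layout for both inputs:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReplicatedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> smallTable = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // Load the replicated (small) data set into memory
        URI cached = context.getCacheFiles()[0];
        FileSystem fs = FileSystem.get(cached, context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path(cached))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
              smallTable.put(parts[0], parts[1]);
            }
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2);
        String match = smallTable.get(fields[0]);
        if (match != null) {                   // inner join, done in the map phase
          context.write(new Text(fields[0]),
              new Text((fields.length > 1 ? fields[1] : "") + "\t" + match));
        }
      }
    }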

Map-side join: composite join
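
Both inputs must already be sorted by the join key and partitioned into the same number of identically keyed parts; matching partitions can then be merge-joined inside the map phase, with no shuffle and no requirement that either side fit in memory. Hadoop ships a CompositeInputFormat for this case, and the preparation requirement usually means an earlier job has to sort and partition the inputs first.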

Metapatterns

  • Patterns that deal with patterns
  • Examples:
    • Job chaining
    • Job merging

Job chaining

  • Many problems can’t be solved with a single MapReduce job
  • The default MapReduce framework requires a lot of manual coding to handle multistage jobs
    • Clean up intermediate outputs
    • Handle failures
  • Jobs can be chained through the main Java program or via external scripting (see the driver sketch below)
  • Possible support tools: Oozie, Cascading, Tez …
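
A minimal hand-rolled two-stage chain in the driver (paths and job setup are placeholders): the second job reads the first job’s output, a failed stage stops the chain, and the intermediate directory is deleted at the end.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStageChainDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        Job first = Job.getInstance(conf, "stage-1");
        // first.setJarByClass / setMapperClass / setReducerClass(...) omitted
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
          System.exit(1);                    // handle failure: stop the chain
        }

        Job second = Job.getInstance(conf, "stage-2");
        // second job's mapper/reducer setup omitted
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        boolean ok = second.waitForCompletion(true);

        FileSystem.get(conf).delete(intermediate, true);   // clean up intermediate output
        System.exit(ok ? 0 : 1);
      }
    }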

Job chaining: chain folding

  • An optimization applied to chains of MapReduce jobs (see the ChainMapper/ChainReducer sketch below)
  • Reduces the amount of data movement in the MapReduce pipeline
  • If multiple map phases are adjacent, merge them
  • If a job ends with a map phase, push that phase into the previous reducer
  • Split up map operations that decrease the amount of data from those that increase the amount of data
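
One way to fold a chain into a single job is Hadoop’s ChainMapper/ChainReducer, which run several map steps inside the same map or reduce task, so adjacent map phases never touch the disk or network between them. A sketch with invented word-count-style steps (input/output paths and the rest of the job setup are omitted):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

    public class ChainFoldingExample {

      // First map step: tokenize lines into (word, 1)
      public static class TokenizeMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String t : value.toString().toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
              context.write(new Text(t), new IntWritable(1));
            }
          }
        }
      }

      // Second map step, folded into the same map task: drop short words
      public static class FilterMapper
          extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
          if (key.getLength() >= 4) {
            context.write(key, value);
          }
        }
      }

      // Reduce step: sum the counts
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      // Trailing map step, pushed into the reduce task: reformat the record
      public static class FormatMapper
          extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text("count\t" + key), value);
        }
      }

      // Wiring: two folded map steps, the reducer, then one more map step
      public static void configureChain(Job job) throws IOException {
        ChainMapper.addMapper(job, TokenizeMapper.class, LongWritable.class,
            Text.class, Text.class, IntWritable.class, new Configuration(false));
        ChainMapper.addMapper(job, FilterMapper.class, Text.class,
            IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
        ChainReducer.setReducer(job, SumReducer.class, Text.class,
            IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
        ChainReducer.addMapper(job, FormatMapper.class, Text.class,
            IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
      }
    }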

Job merging

  • Allows unrelated jobs that load the same data to share one MapReduce pipeline, so the input is read only once (see the sketch below)
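
A sketch of one common way to merge jobs, assuming a map-only setting and invented filter conditions: two unrelated analyses run inside the same mapper, so the input is read only once, and each writes to its own named output via MultipleOutputs. The driver would register the "errors" and "metrics" outputs with MultipleOutputs.addNamedOutput(...) and set zero reducers.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class MergedJobsMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

      private MultipleOutputs<Text, NullWritable> outputs;

      @Override
      protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        if (line.contains("ERROR")) {              // "job" 1: error filter
          outputs.write("errors", value, NullWritable.get(), "errors/part");
        }
        if (line.contains("latency_ms=")) {        // "job" 2: metrics filter
          outputs.write("metrics", value, NullWritable.get(), "metrics/part");
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        outputs.close();
      }
    }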