MapReduce Framework

MapReduce is a programming model designed for processing massive datasets. It was originally developed at Google. The design of the MapReduce framework has an inherent capability for distributed computing, which is necessary for big data applications. Google's MapReduce software is proprietary, so we cannot use it for free; however, Hadoop is an open-source implementation of MapReduce, part of the Apache project, and can be obtained for free.

Dealing with big data requires a large number of computing resources, and such resources should be fault tolerant (reliable): if one machine fails, it should affect neither the stored data nor the computation. The idea behind the MapReduce model, which solves this problem while keeping hardware costs reasonable, is to use a large number of commodity PCs and distribute both the data and the computation among them. For storage, the data files are divided into smaller chunks, and each chunk is replicated across multiple nodes (PCs). As a result, if one node containing chunk B fails, other replicas of the same chunk still exist. Furthermore, suppose we are running a fairly long analysis and one of the nodes crashes in the middle of it. Hadoop has an internal protocol to deal with such scenarios, so our analysis is not affected at all.
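
As a rough illustration of this storage idea, here is a minimal Python sketch of splitting a file into chunks and replicating each chunk across several nodes. This is a toy simulation, not Hadoop's actual HDFS code; the chunk size, node names, and replication factor of 3 are assumptions made for the example.

# Toy sketch of chunked, replicated storage (illustrative only;
# real HDFS uses large blocks and rack-aware placement).
CHUNK_SIZE = 64          # bytes per chunk; assumed small for the demo
REPLICATION_FACTOR = 3   # each chunk is stored on 3 different nodes
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_chunks(data, size=CHUNK_SIZE):
    """Divide a file's bytes into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_chunks(chunks, nodes=NODES, r=REPLICATION_FACTOR):
    """Assign each chunk to r distinct nodes, round-robin style."""
    placement = {}
    for idx in range(len(chunks)):
        placement[idx] = [nodes[(idx + k) % len(nodes)] for k in range(r)]
    return placement

data = b"some very large file contents " * 100
chunks = split_into_chunks(data)
print(place_chunks(chunks))  # e.g. {0: ['node1', 'node2', 'node3'], ...}
# If node2 fails, every chunk it held still exists on two other nodes.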

So, as you can imagine, there is a lot happening behind the scenes. However, when working with Hadoop, we don't see or bother with such internal details, because Hadoop takes care of them for us transparently. Therefore, we can focus on our fascinating data analysis.

On the other hand, writing a Hadoop program for a specific data analysis task is very different from regular programming. So, here we want to understand the program flow of MapReduce and learn how to design data analyses in the MapReduce paradigm.

There are two fundamental steps:

1. Map step

2. Reduce step
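
Before diving into Hadoop itself, it helps to see the paradigm in miniature. The sketch below is our own single-machine Python simulation, not Hadoop's API: a map function turns each input record into key:value pairs, a shuffle groups the values by key, and a reduce function combines each key's values.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine simulation of the MapReduce flow."""
    groups = defaultdict(list)
    # Map step: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # The "shuffle": collect all values under their key.
            groups[key].append(value)
    # Reduce step: combine the list of values for each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# e.g. wordcount, which we walk through in detail below:
# run_mapreduce(docs,
#               lambda d: [(w, 1) for w in d.lower().split()],
#               lambda k, vs: sum(vs))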


Question: What is the difference between MapReduce and Hadoop?

Answer: MapReduce is a programming model for processing large data files in a distributed fashion, and Hadoop is an implementation of that model. Google's MapReduce software is proprietary, but Hadoop is an open-source implementation developed as part of the Apache project.


Examples

Wordcount

Let's walk through a MapReduce example.

Problem definition: Given a set of large documents, count the frequency of each word that appears across all the documents.

  • Mapper Input: each mapper takes in a portion of these documents.

  • Mapper Function: apply some text processing, such as converting everything to lower-case, and tokenize the text by splitting on the space character.

  • Mapper Output: after splitting the text, the mapper produces key:value pairs, where the key is each word and the value is just 1.

    • E.g. Assume that we have 3 mappers, and the input to these mapper nodes is as follows

      Mapper node 1: “Master Kenobi, you disappoint me.”

      Mapper node 2: “Yoda holds you in such high esteem.”

      Mapper node 3: “Surely you can do better!”

Mapper node 1 output (key : value):
  master:1
  kenobi:1
  you:1
  disappoint:1
  me:1

Mapper node 2 output (key : value):
  yoda:1
  holds:1
  you:1
  in:1
  such:1
  high:1
  esteem:1

Mapper node 3 output (key : value):
  surely:1
  you:1
  can:1
  do:1
  better:1
  • Shuffling: Hadoop shuffles the keys and partitions the results to the reducers. Partitioning the keys is done using a hashing scheme that tries to balance the number of input keys across reducers. This hashing scheme guarantees that all items with the same key go to the same partition and are therefore processed on the same reducer (a runnable sketch of this partitioning appears after the walkthrough).

  • Reducer Input: the input to each reducer is a set of keys, each paired with the list of all its corresponding values.

  • E.g. continuing the previous example, the input to the reducers could look like this

Reducer node 1 input (key : values):
  master:[1]
  can:[1]
  esteem:[1]
  me:[1]
  holds:[1]
  in:[1]
  surely:[1]
  such:[1]

Reducer node 2 input (key : values):
  high:[1]
  you:[1,1,1]
  kenobi:[1]
  disappoint:[1]
  yoda:[1]
  do:[1]
  better:[1]
  • Reducer Function and Output: add up all the values for each key and output a new key:value pair, where the new value is the computed sum of the values for that key.

    For this wordcount example, part of the output will be as follows:

    holds:1
    such:1
    you:3
    me:1
    ...
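
Putting the whole walkthrough together, here is a self-contained Python simulation of the wordcount job, including a deterministic hash partitioner of the kind described in the Shuffling step. This is a toy sketch under our own assumptions (two reducers, crc32 as the hash, simple punctuation stripping), not Hadoop's actual implementation; in a real job the mapper and reducer would run as separate programs, e.g. submitted with Hadoop Streaming's -mapper and -reducer options.

import zlib
from collections import defaultdict

NUM_REDUCERS = 2   # assumed for this example

def mapper(document):
    """Lower-case the text, split on spaces, and emit (word, 1) pairs."""
    for word in document.lower().split():
        yield word.strip('.,!?"'), 1

def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic hash partitioner: the same key always lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer(key, values):
    """Sum the 1s emitted for each occurrence of a word."""
    return key, sum(values)

documents = [
    "Master Kenobi, you disappoint me.",     # mapper node 1
    "Yoda holds you in such high esteem.",   # mapper node 2
    "Surely you can do better!",             # mapper node 3
]

# Shuffle: route each mapper's (key, value) pairs to a reducer partition.
partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for doc in documents:
    for key, value in mapper(doc):
        partitions[partition(key)][key].append(value)

# Reduce: each reducer sums the values for the keys it received.
for i, part in enumerate(partitions):
    print("Reducer %d:" % i)
    for key, values in part.items():
        word, count = reducer(key, values)
        print("  %s:%d" % (word, count))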



Other examples of data we could analyze with MapReduce:

  • Movies

  • Amazon purchases

  • Senators' votes

  • Website visitor logs

