Exercise 3: The Map-Reduce Pattern of Parallelism


This exercise aims to illustrate the characteristics of the Map-Reduce parallel pattern by means of the Spark programming model.

Prelude: Introduction to Spark

Spark is a framework for parallel distributed data processing on a set of commodity machines. It frees the user from duties like task scheduling, data transfer and fault tolerance, so that they can focus on programming.

The Spark programming model uses the Map-Reduce model as a basis, simplifying its usage and extending it with more functionality. The following are two basic concepts in Spark:

  • Resilient Distributed Datasets (RDDs): an RDD is a distributed collection of items. The collection is divided into partitions, and these partitions are the units of parallelism.
  • Transformations and actions: the user can apply either transformations or actions on RDDs. Transformations are lazy: unlike actions, they are not applied right away. In other words, one can apply a chain of transformations to an RDD, but only when an action is appended to that chain will the whole computation graph start executing. map is an example of a transformation, while reduce is an action. However, Spark is not limited to just map and reduce: it offers a rich variety of functional-style operators. Although we will not use all of those operators in this exercise, you can find a complete list here.

As for the language, Spark programs can be developed in Scala, Java and Python; Python is the one we will use in this exercise.

1. Map-Reduce on a collection of elements

Map-Reduce is a combination of two patterns of parallelism:

  • Map: given a collection of elements of type $X$, apply to each element a function $F_{map}$ that returns a single element of type $Y$.
  • Reduce: given a collection of elements of type $Y$, combine its elements into an element of type $Y$ using a function $F_{reduce}$.

These two patterns can be chained to obtain a single result from a collection of elements:

$(X_{1},~X_{2},~...,~X_{n})~\xrightarrow{F_{map}}~(Y_{1},~Y_{2},~...,~Y_{n})~\xrightarrow{F_{reduce}}~Y_{result}$

Spark implements these patterns via two operators, map and reduce. Both receive a function as a parameter: in the case of map, the function must receive one parameter of type $X$ and return a value of type $Y$; in the case of reduce, the function must receive two parameters of type $Y$ and return a value of type $Y$.
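
Before jumping into Spark, it may help to see the same pattern expressed with plain Python built-ins. The following cell is just an illustrative sketch of the shapes of $F_{map}$ and $F_{reduce}$ (the numbers are arbitrary), not Spark code:


In [ ]:
from functools import reduce

xs = [1, 2, 4, 8]                           # collection of elements of type X
ys = list(map(lambda x: x + 1, xs))         # F_map: receives one X, returns one Y
yResult = reduce(lambda a, b: a + b, ys)    # F_reduce: receives two Ys, returns one Y
print(ys, yResult)                          # prints: [2, 3, 5, 9] 19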

First example

Let's start with a simple example of how to run a Map-Reduce chain with Spark.

First we will perform some initialisation of the Spark library. In particular, we will create a SparkContext object that we will use later.


In [ ]:
from pyspark import SparkContext
sc = SparkContext()

Next we will create our first RDD, i.e. a collection of elements in Spark. In this case, the collection is a list of numbers.


In [ ]:
rdd = sc.parallelize([1, 2, 4, 8])
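
As an optional check, you can ask Spark how many partitions the list was split into; partitions are the units of parallelism mentioned in the prelude, and their exact number depends on your Spark configuration:


In [ ]:
rdd.getNumPartitions()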

Let's now map a function to our collection to increment each of its elements. We will first define such a function:


In [ ]:
def increment(x):
    return x + 1

We can now apply the increment function to every element in our collection via the map operator.


In [ ]:
incrementedRDD = rdd.map(increment)

You can actually check the incremented values with Spark's collect function. Note that the statement above only scheduled a calculation; the Spark runtime has not executed anything yet, because map is a transformation. The actual work is triggered by an action, such as collect or reduce.


In [ ]:
incrementedRDD.collect()
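
To make the difference between transformations and actions more tangible, note that map returned another RDD, whereas collect returned an ordinary Python list. The following optional check simply prints both types (the exact RDD subclass name may vary between Spark versions):


In [ ]:
print(type(incrementedRDD))            # an RDD subclass, e.g. PipelinedRDD
print(type(incrementedRDD.collect()))  # a plain Python list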

Let's suppose we would like to obtain the sum of all the numbers in the collection. For that purpose, we can use reduce and a function that returns the sum of two numbers.


In [ ]:
def add(x, y):
    return x + y

Spark's reduce functionality will apply add to our collection, combining its elements two by two until the total sum is obtained. Since the reduction runs in parallel across partitions, the function passed to reduce should be commutative and associative, as addition is.


In [ ]:
totalSum = incrementedRDD.reduce(add)
totalSum

Note that you can also do the same calculation in a single Map-Reduce chain:


In [ ]:
rdd.map(increment).reduce(add)

Exercise to complete

Now that you have become familiar with the Map-Reduce pattern and its implementation in Spark, we propose an exercise so that you can apply what you have just learned to a different problem.

In this case, our collection will be a list of town names:


In [ ]:
names = sc.parallelize(['Madrid', 'Geneva', 'Barcelona', 'Milano', 'Mamungkukumpurangkuntjunya'])

Suppose we would like to know the length of the longest name in our list.

First, we need a function that receives a string and returns its length (hint: you can use the len Python function).


In [ ]:
def length(s):
    # Your code goes here

Now define another function that, given two numbers, returns the maximum.


In [ ]:
def maximum(x, y):
    # Your code goes here

Finally, combine the two functions you just defined in a Map-Reduce chain to obtain the desired result.


In [ ]:
# Your code goes here

2. Map-Reduce on text files: Word Count

A typical real-life example where the Map-Reduce pattern is applied is the processing of text files, for instance logs generated by a web server.

In this second part of the exercise, you will process a text file and count the occurrences of each word in it.

Prepare input data

First we will download the input file for this exercise, which corresponds to the Complete Works of William Shakespeare, offered by Project Gutenberg.


In [ ]:
!curl  -o 100.zip https://swanserver.web.cern.ch/swanserver/csc/100.zip
!unzip -o 100.zip

Next we will use Spark to create a collection out of this text file. In the collection, every line of the file will be an element.


In [ ]:
textFile = sc.textFile('100.txt')

textFile is a collection where each element is a string corresponding to a line in the text file.
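
As an optional sanity check, you can count how many lines (i.e. elements) the collection contains; count is another action, so it triggers an actual computation:


In [ ]:
textFile.count()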

Introducing new operations

The Map-Reduce chain to count the occurrences of each word in the file is a bit more complicated than what we have seen so far. The chain will contain two map operations and one reduce. In addition, you will have to use some new operators, which are slight variations of the map and reduce you already know (a small illustration of both follows the list):

  • flatMap: if the user function $F_{map}$ returns a collection of values instead of a single value, flatMap flattens those collections, so that the final result contains the individual elements.


  • reduceByKey: like in the case of reduce, the function $F_{reduce}$ passed as parameter of reduceByKey receives two elements of type $Y$ and returns an element of the same type. However, reduceByKey is applied on collections of key-value pairs, that is, $( ~(key_{1},val_{x}),~(key_{2},val_{y}),~(key_{1},val_{z}),~...~)$, and the output is the reduction of the values for each key $( ~(key_{1}, red_{1}),~(key_{2},red_{2}),~...~)$.
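
If it helps to see these two new operators in isolation before tackling the word count, here is a small sketch with made-up data (the lists and keys are purely illustrative, and the order of the keys in the output may vary):


In [ ]:
# flatMap: the function returns a list per element; those lists are
# flattened into a single collection of individual elements.
print(sc.parallelize([[1, 2], [3], [4, 5]]).flatMap(lambda xs: xs).collect())   # [1, 2, 3, 4, 5]

# reduceByKey: values that share the same key are combined two by two.
pairs = sc.parallelize([('a', 4), ('b', 2), ('a', 6)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())                          # [('a', 10), ('b', 2)]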

Functions

Let's now define the functions that will be applied in the Map-Reduce chain. First, you will need a function that splits a line of the text file into a list of words (hint: you can use Python's split function; it is fine if some of the returned words still contain non-alphanumeric characters).


In [ ]:
def splitWords(line):
    # Your code goes here

The second function you will need receives a word and returns a key-value pair, where the key is the word and the value is 1 (its partial count).


In [ ]:
def count(word):
    # Your code goes here

Finally, you will need a function that receives two numbers and returns their sum.


In [ ]:
def add(x, y):
    # Your code goes here

Map-Reduce chain

To finish this exercise, combine the three functions you just defined in a chain to get the counts per word in the file.


In [ ]:
wordCounts = textFile# Your code goes here

Let's see if you got it right...


In [ ]:
wordCounts.collect()
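
If the full list is too long to inspect comfortably, you can look at a sample of it instead. For example, assuming your wordCounts elements are (word, count) pairs as suggested above, takeOrdered can return the 10 most frequent words:


In [ ]:
wordCounts.takeOrdered(10, key=lambda pair: -pair[1])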