Lecture 7: Spark Programming

In what follows, you can find pyspark code for the examples we saw in class. Many of them follow examples from Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, which you can also find at Aalto's library.

Setup

These instructions should work for Mac and Linux. We'll assume you are using python3.

To run the following on your computer, make sure that pyspark is on your PYTHONPATH. You can do that by downloading a zipped Spark distribution, extracting it into its own folder (e.g., spark-1.6.0-bin-hadoop2.6/), and then executing the following commands in bash.

export PYSPARK_PYTHON=python3
export SPARK_HOME=/path/to/spark-1.6.0-bin-hadoop2.6/
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
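
Depending on your setup, import pyspark may also complain that py4j is missing. Spark bundles py4j as a zip under $SPARK_HOME/python/lib/, and you can add it to PYTHONPATH as well - check that folder for the exact filename, which depends on your Spark version; something along these lines:

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH # adjust the py4j filename to match your Spark download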

In [1]:
import pyspark
import numpy as np # we'll be using numpy for some numeric operations
sc = pyspark.SparkContext(master="local", appName="tour")

First examples


In [2]:
text = sc.textFile("myfile.txt") # load data
text.count() # count lines


Out[2]:
95

In [3]:
text = sc.textFile("myfile.txt") # load data

# count only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines


Out[3]:
17

Lambda functions

Lambda expressions are an easy way to write short functions in Python.


In [4]:
f = lambda line: 'Spark' in line
f("we are learning Spark")


Out[4]:
True

In [5]:
def f(line):
    return 'Spark' in line
f("we are learning Spark")


Out[5]:
True

Creating RDDs

We saw that we can create RDDs by loading files from disk. We can also create RDDs from Python collections or by transforming other RDDs.


In [6]:
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9]) # create RDD from Python collection

In [7]:
data_squared = data.map(lambda num: num ** 2) # transformation

RDD operations

There are two types of RDD operations in Spark: transformations and actions. Transformations create new RDDs from other RDDs. Actions extract information from RDDs and return it to the driver program.


In [8]:
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9]) # creation of RDD
data_squared = data.map(lambda num: num ** 2) # transformation
data_squared.collect() # action


Out[8]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Lazy evaluation

RDDs are evaluated lazily. This means that Spark does not materialize an RDD until an action has to be performed on it. In the example below, primesRDD is not evaluated until the action collect() is called on it.


In [9]:
def is_prime(num):
    """ return True if num is prime, False otherwise
        (note: for simplicity, this check also treats 1 as prime) """
    if num < 1 or num % 1 != 0: # accept only positive integers
        raise Exception("invalid argument")
    for d in range(2, int(np.sqrt(num) + 1)): # try all candidate divisors up to sqrt(num)
        if num % d == 0:
            return False
    return True

In [10]:
numbersRDD = sc.parallelize(range(1, 1000000)) # creation of RDD
primesRDD = numbersRDD.filter(is_prime) # transformation

# primesRDD has not been materialized until this point

primes = primesRDD.collect() # action
print(primes[:15]) # this code does not involve Spark computation


[1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]

Persistence

RDDs are ephemeral by default, i.e., there is no guarantee they will remain in memory after they are materialized. If we want them to persist in memory, for example to query them repeatedly or use them in multiple operations, we can ask Spark to do so by calling persist() on them.


In [11]:
primesRDD.persist() # we're asking Spark to keep this RDD in memory

print("Found", primesRDD.count(), "prime numbers") # first action -- causes primesRDD to be materialized
print("Here are some of them:")
print(primesRDD.take(20)) # second action - RDD is already in memory


Found 78499 prime numbers
Here are some of them:
[1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67]

If we do not need primesRDD in memory anymore, we can tell Spark to discard it.


In [12]:
primesRDD.unpersist()


Out[12]:
PythonRDD[10] at collect at <ipython-input-10-d586bf285b26>:6

How long does it take to collect primesRDD? Let's time the operation.


In [13]:
%%timeit
primes = primesRDD.collect()


1 loops, best of 3: 10.2 s per loop

When I ran the above on my laptop, it took a little over 10s. That's because Spark had to evaluate primesRDD before performing collect() on it.

How long would it take if primesRDD was already in memory?


In [14]:
primesRDD.persist()


Out[14]:
PythonRDD[10] at collect at <ipython-input-10-d586bf285b26>:6

In [15]:
%%timeit
primes = primesRDD.collect()


The slowest run took 273.88 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 37.3 ms per loop

When I ran the above on my laptop, it took about 40ms to collect primesRDD - that's almost $300$ times faster than when the RDD had to be recomputed from scratch.


Passing functions

When we pass a function as a parameter to an RDD operation, the function can be specified either as a lambda function or as a reference to a function defined elsewhere.


In [16]:
data = sc.parallelize(range(10))
squares = data.map(lambda x: x**2)
squares.collect()


Out[16]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [17]:
def f(x):
    """ return the square of a number"""
    return x**2

data = sc.parallelize(range(10))
squares = data.map(f)
squares.collect()


Out[17]:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Be careful, though: if the function that you pass as an argument to an RDD operation

  • is an object method, or
  • references an object field,

then Spark will ship the entire object to the cluster nodes along with the function.

This is demonstrated in the piece of code below.


In [18]:
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
        
    def is_match(self, s):
        return self.query in s
    
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match) # the function is an object method
    
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x) # the function references an object field

The following is a better way to implement the two methods above (get_matches_in_rdd_v1 and get_matches_in_rdd_v2), if we want to avoid shipping a whole SearchFunctions object to the cluster: copy the field we need into a local variable and reference only that inside the lambda.


In [19]:
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
        
    def is_match(self, s):
        return self.query in s
    
    def get_matches_in_rdd(self, rdd):
        query = self.query # copy the field into a local variable
        return rdd.filter(lambda x: query in x) # the lambda captures only the local variable, not the object
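
A quick usage sketch (assuming the text RDD loaded in the first examples is still defined):

searcher = SearchFunctions("Spark")
matches = searcher.get_matches_in_rdd(text) # ships only the query string to the cluster, not the searcher object
matches.count()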

map and flatMap

Both operations apply a function to every element of an RDD: map produces exactly one output element per input element, while flatMap expects the function to return a sequence and flattens all the returned sequences into a single RDD.


In [20]:
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])

words = phrases.flatMap(lambda phrase: phrase.split(" "))

words.collect()


Out[20]:
['hello', 'world', 'how', 'are', 'you', 'how', 'do', 'you', 'do']

In [21]:
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])

words = phrases.map(lambda phrase: phrase.split(" "))

words.collect()


Out[21]:
[['hello', 'world'], ['how', 'are', 'you'], ['how', 'do', 'you', 'do']]

(Pseudo) set operations

These operations treat RDDs like sets, even though RDDs can contain duplicate elements: union() keeps duplicates, while distinct() and intersection() remove them.


In [22]:
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()


Out[22]:
ParallelCollectionRDD[22] at parallelize at PythonRDD.scala:423

In [23]:
oneRDD.union(otherRDD).collect()


Out[23]:
[1, 1, 1, 2, 3, 3, 4, 4, 1, 4, 4, 7]

In [24]:
oneRDD.subtract(otherRDD).collect()


Out[24]:
[2, 3, 3]

In [25]:
oneRDD.distinct().collect()


Out[25]:
[1, 2, 3, 4]

In [26]:
oneRDD.intersection(otherRDD).collect() # removes duplicates


Out[26]:
[4, 1]

In [27]:
oneRDD.cartesian(otherRDD).collect()[:5]


Out[27]:
[(1, 1), (1, 4), (1, 4), (1, 7), (1, 1)]

reduce

reduce aggregates the elements of an RDD with a function that takes two elements and returns one; the function should be commutative and associative, because Spark applies it repeatedly to pairs of elements and intermediate results.


In [28]:
data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x + y)


Out[28]:
181

In [29]:
data.reduce(lambda x, y: x * y)


Out[29]:
3188536

In [30]:
data.reduce(lambda x, y: x**2 + y**2) # this does NOT compute the sum of squares: intermediate results are fed back in and get squared again


Out[30]:
137823683725010149883130929

In [31]:
data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2 # intermediate results stay as sqrt of partial sums of squares, so squaring at the end gives the sum of squares


Out[31]:
8927.0
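
The cell above works around the problem with a square-root trick that keeps the intermediate results "un-squared". A more direct sketch, reusing the same data RDD, is to square each element with map and then sum with reduce:

data.map(lambda x: x**2).reduce(lambda x, y: x + y) # sum of squares: 1 + 43**2 + 62**2 + 23**2 + 52**2 = 8927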

aggregate

aggregate generalizes reduce: it takes an initial value (zeroValue), a function that merges one element into a partial result within a partition (seqOp), and a function that merges partial results from different partitions (combOp). Here we use it to compute a (sum, count) pair and, from that, the average.


In [32]:
data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue = (0,0), # initial (sum, count) pair
                      seqOp = (lambda x, y: (x[0] + y, x[1] + 1)), # add one element to a partition's (sum, count)
                      combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))) # merge (sum, count) pairs across partitions
aggr[0] / aggr[1] # average value of RDD elements


Out[32]:
36.2
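
For comparison, a sketch of the same average computed with two separate actions; the difference is that aggregate obtains both the sum and the count in a single pass over the data:

data.reduce(lambda x, y: x + y) / data.count() # same result (36.2), but two passes over the RDD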

reduceByKey


In [33]:
pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])

pairRDD.reduceByKey(lambda x, y: x + y).collect() # sum of values per key


Out[33]:
[('$APPL', 201.16), ('$AMZN', 1104.64), ('$GOOG', 706.2)]

combineByKey

combineByKey is the per-key counterpart of aggregate: createCombiner turns the first value seen for a key into a combiner, mergeValue merges a further value into an existing combiner, and mergeCombiners merges combiners built in different partitions.


In [35]:
pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])

aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1), # first value seen for a key -> (value, 1)
                           mergeValue = lambda x, y: (x[0] + y, x[1] + 1), # fold another value into the (sum, count)
                           mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1])) # merge (sum, count) pairs across partitions

aggr.map(lambda x: (x[0], x[1][0]/x[1][1])).collect() # average value per key


Out[35]:
[('$APPL', 100.58), ('$AMZN', 552.32), ('$GOOG', 706.2)]
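
For comparison, here is a sketch of the same per-key average using mapValues and reduceByKey; mapValues (not used elsewhere in these notes) applies a function to the value of each key-value pair:

pairRDD.mapValues(lambda v: (v, 1)) \
       .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
       .mapValues(lambda t: t[0] / t[1]) \
       .collect() # per-key averages, same result as above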

(inner) join


In [36]:
course_a = sc.parallelize([ ("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
course_b = sc.parallelize([ ("Leena", 10), ("Tuukka", 10)])

result = course_a.join(course_b)
result.collect()


Out[36]:
[('Tuukka', (10, 10)), ('Leena', (9, 10))]

Accumulators

This example demonstrates how to use accumulators. The map operation creates an RDD that contains the length of each line in the text file - and while the RDD is materialized, an accumulator keeps track of how many lines are long (longer than $30$ characters).


In [37]:
text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0) # create accumulator

def line_len(line):
    global long_lines # to reference an accumulator inside the function, declare it as a global variable
    length = len(line)
    if length > 30:
        long_lines += 1 # update the accumulator
    return length

llengthRDD = text.map(line_len)
llengthRDD.count()


Out[37]:
95

In [38]:
long_lines.value # this is how we obtain the value of the accumulator in the driver program


Out[38]:
45

Warning

In the example above, we update the value of an accumulator within a transformation (map). This is not recommended, except for debugging purposes! The reason is that, if there are failures during the materialization of llengthRDD, some of its partitions will be re-computed, possibly causing the accumulator to double-count some of the long lines.

It is advisable to use accumulators within actions - and particularly with the foreach action, as demonstrated below.


In [39]:
text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)

def line_len(line):
    global long_lines
    length = len(line)
    if length > 30:
        long_lines += 1

text.foreach(line_len)
long_lines.value


Out[39]:
45

Broadcast variables

We use broadcast variables when many operations depend on the same large static object - e.g., a large lookup table that does not change but provides information for other operations. In such cases, we can make a broadcast variable out of the object and thus make sure that the object is shipped to the cluster only once - and not once for each operation that uses it.

The example below demonstrates the usage of broadcast variables. In this case, we make a broadcast variable out of a dictionary that represents an address table. The table is shipped to the cluster nodes only once across multiple operations.


In [40]:
def load_address_table():
    return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
            "Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}

address_table = sc.broadcast(load_address_table())

def find_address(name):
    res = None
    if name in address_table.value:
        res = address_table.value[name]
    return res

people = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = people.map(lambda name: (name, find_address(name))) # first operation that uses the address table
print(pairRDD.collectAsMap())

other_people = sc.parallelize(["Karmen", "Michael", "Anu"])
pairRDD = other_people.map(lambda name: (name, find_address(name))) # second operation that uses the address table
print(pairRDD.collectAsMap())


{'Anwar': 'T, B103', 'Darshan': 'T, A325', 'Orestis': 'T, A341', 'Michael': 'OIH, B253.2'}
{'Karmen': 'VTT, 74', 'Michael': 'OIH, B253.2', 'Anu': 'Chem. A143'}


Stopping

Call stop() on the SparkContext object to shut it down.


In [41]:
sc.stop()

In [ ]: