In what follows, you can find pyspark code for the examples we saw in class. Many of them follow examples from Learning Spark: Lightning-Fast Big Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, which you can also find in Aalto's library.
These instructions should work for Mac and Linux. We'll assume you'll be using python3.
To run the following on your computer, make sure that pyspark is in your PYTHONPATH variable.
You can do that by downloading a zipped file with Spark, extracting it into its own folder (e.g., spark-1.6.0-bin-hadoop2.6/) and then executing the following commands in bash.
export PYSPARK_PYTHON=python3
export SPARK_HOME=/path/to/spark-1.6.0-bin-hadoop2.6/
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
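To verify that the setup works, you can start python3 in the same shell and try the import (a quick sanity check, not part of the course material):
import pyspark  # if this succeeds without an ImportError, the paths above are set correctly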
In [1]:
import pyspark
import numpy as np # we'll be using numpy for some numeric operations
sc = pyspark.SparkContext(master="local", appName="tour")
In [2]:
text = sc.textFile("myfile.txt") # load data
text.count() # count lines
Out[2]:
In [3]:
text = sc.textFile("myfile.txt") # load data
# count only lines that mention "Spark"
spark_lines = text.filter(lambda line: 'Spark' in line)
spark_lines.count() # count lines
Out[3]:
Lambda expressions are an easy way to write short functions in Python.
In [4]:
f = lambda line: 'Spark' in line
f("we are learning Spark")
Out[4]:
In [5]:
def f(line):
return 'Spark' in line
f("we are learning Spark")
Out[5]:
In [6]:
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9]) # create RDD from Python collection
In [7]:
data_squared = data.map(lambda num: num ** 2) # transformation
There are two types of RDD operations in Spark: transformations and actions. Transformations create new RDDs from other RDDs. Actions extract information from RDDs and return it to the driver program.
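Transformations are also lazy: Spark only records them and evaluates them when an action needs the result. A minimal sketch of this (assuming the sc and data defined above):
faulty = data.map(lambda num: num / 0)  # no error yet -- the transformation is only recorded
# faulty.collect()                      # running this action would raise the division-by-zero error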
In [8]:
data = sc.parallelize([0,1,2,3,4,5,6,7,8,9]) # creation of RDD
data_squared = data.map(lambda num: num ** 2) # transformation
data_squared.collect() # action
Out[8]:
In [9]:
def is_prime(num):
    """ return True if num is prime, False otherwise """
    if num < 1 or num % 1 != 0:
        raise Exception("invalid argument")
    if num == 1:
        return False  # 1 is not a prime number
    for d in range(2, int(np.sqrt(num) + 1)):
        if num % d == 0:
            return False
    return True
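Before using is_prime on an RDD, we can sanity-check it locally with plain Python:
[n for n in range(1, 20) if is_prime(n)]  # should list exactly the primes below 20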
In [10]:
numbersRDD = sc.parallelize(range(1, 1000000)) # creation of RDD
primesRDD = numbersRDD.filter(is_prime) # transformation
# primesRDD has not been materialized until this point
primes = primesRDD.collect() # action
print(primes[:15]) # this code does not involve Spark computation
In [11]:
primesRDD.persist() # we're asking Spark to keep this RDD in memory
print("Found", primesRDD.count(), "prime numbers") # first action -- causes primesRDD to be materialized
print("Here are some of them:")
print(primesRDD.take(20)) # second action - RDD is already in memory
If we do not need primesRDD in memory anymore, we can tell Spark to discard it.
In [12]:
primesRDD.unpersist()
Out[12]:
How long does it take to collect primesRDD? Let's time the operation.
In [13]:
%%timeit
primes = primesRDD.collect()
When I ran the above on my laptop, it took a little more than 10s. That's because Spark had to evaluate primesRDD before performing collect on it.
How long would it take if primesRDD was already in memory?
In [14]:
primesRDD.persist()
Out[14]:
In [15]:
%%timeit
primes = primesRDD.collect()
When I ran the above on my laptop, it took about 40ms to collect primesRDD - that's almost $300$ times faster than when the RDD had to be recomputed from scratch.
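As an aside (not part of the lecture code), you can check whether an RDD is currently marked for caching with getStorageLevel:
print(primesRDD.getStorageLevel())  # reports whether the RDD is set to be kept in memory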
In [16]:
data = sc.parallelize(range(10))
squares = data.map(lambda x: x**2)
squares.collect()
Out[16]:
In [17]:
def f(x):
""" return the square of a number"""
return x**2
data = sc.parallelize(range(10))
squares = data.map(f)
squares.collect()
Out[17]:
Be careful, though: if the function that you pass as an argument to an RDD operation is a method of an object, or references a field of an object,
then Spark will ship the entire object to the cluster nodes along with the function.
This is demonstrated in the piece of code below.
In [18]:
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd_v1(self, rdd):
        return rdd.filter(self.is_match)  # the function is an object method
    def get_matches_in_rdd_v2(self, rdd):
        return rdd.filter(lambda x: self.query in x)  # the function references an object field
The following is a better way to implement the two methods above (get_matches_in_rdd_v1 and get_matches_in_rdd_v2) if we want to avoid shipping an entire SearchFunctions object to the cluster.
In [19]:
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query
    def is_match(self, s):
        return self.query in s
    def get_matches_in_rdd(self, rdd):
        query = self.query  # copy the field into a local variable...
        return rdd.filter(lambda x: query in x)  # ...so the lambda captures only the string, not the object
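A possible way to use the class (a hypothetical usage sketch; searcher is not a name from the lecture, and we reuse the text RDD loaded earlier):
searcher = SearchFunctions("Spark")
matching_lines = searcher.get_matches_in_rdd(text)  # only the query string is shipped with the lambda
matching_lines.count()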
In [20]:
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.flatMap(lambda phrase: phrase.split(" "))
words.collect()
Out[20]:
In [21]:
phrases = sc.parallelize(["hello world", "how are you", "how do you do"])
words = phrases.map(lambda phrase: phrase.split(" "))  # unlike flatMap, map keeps one list of words per phrase
words.collect()
Out[21]:
In [22]:
oneRDD = sc.parallelize([1, 1, 1, 2, 3, 3, 4, 4])
oneRDD.persist()
otherRDD = sc.parallelize([1, 4, 4, 7])
otherRDD.persist()
Out[22]:
In [23]:
oneRDD.union(otherRDD).collect()
Out[23]:
In [24]:
oneRDD.subtract(otherRDD).collect()
Out[24]:
In [25]:
oneRDD.distinct().collect()
Out[25]:
In [26]:
oneRDD.intersection(otherRDD).collect() # removes duplicates
Out[26]:
In [27]:
oneRDD.cartesian(otherRDD).collect()[:5]
Out[27]:
In [28]:
data = sc.parallelize([1,43,62,23,52])
data.reduce(lambda x, y: x + y)
Out[28]:
In [29]:
data.reduce(lambda x, y: x * y)
Out[29]:
In [30]:
data.reduce(lambda x, y: x**2 + y**2) # this does NOT compute the sum of squares of RDD elements
Out[30]:
In [31]:
data.reduce(lambda x, y: np.sqrt(x**2 + y**2)) ** 2
Out[31]:
In [32]:
data = sc.parallelize([1,43,62,23,52])
aggr = data.aggregate(zeroValue = (0,0),
seqOp = (lambda x, y: (x[0] + y, x[1] + 1)),
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])))
aggr[0] / aggr[1] # average value of RDD elements
Out[32]:
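For comparison, the same average can also be computed with a map followed by a reduce (a sketch over the same data RDD):
sum_count = data.map(lambda x: (x, 1)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # (sum of elements, number of elements)
sum_count[0] / sum_count[1]  # same average as with aggregate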
In [33]:
pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
pairRDD.reduceByKey(lambda x, y: x + y).collect() # sum of values per key
Out[33]:
In [35]:
pairRDD = sc.parallelize([ ('$APPL', 100.64), ('$GOOG', 706.2), ('$AMZN', 552.32), ('$APPL', 100.52), ('$AMZN', 552.32) ])
aggr = pairRDD.combineByKey(createCombiner = lambda x: (x, 1),
mergeValue = lambda x, y: (x[0] + y, x[1] + 1),
mergeCombiners = lambda x, y: (x[0] + y[0], x[1] + y[1]))
aggr.map(lambda x: (x[0], x[1][0]/x[1][1])).collect() # average value per key
Out[35]:
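The same per-key average can also be expressed with mapValues and reduceByKey instead of combineByKey (a sketch over the same pairRDD):
sums = pairRDD.mapValues(lambda v: (v, 1)).reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # (sum, count) per key
sums.mapValues(lambda s: s[0] / s[1]).collect()  # average value per key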
In [36]:
course_a = sc.parallelize([ ("Antti", 8), ("Tuukka", 10), ("Leena", 9)])
course_b = sc.parallelize([ ("Leena", 10), ("Tuukka", 10)])
result = course_a.join(course_b)
result.collect()
Out[36]:
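Note that join keeps only the keys that appear in both RDDs, so Antti, who has a grade only in course_a, is dropped. To keep such entries as well, a left outer join would do it (a small aside):
course_a.leftOuterJoin(course_b).collect()  # Antti appears paired with None for course_b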
In [37]:
text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0) # create accumulator
def line_len(line):
    global long_lines  # to update the accumulator inside the function, declare it as a global variable
    length = len(line)
    if length > 30:
        long_lines += 1  # update the accumulator
    return length
llengthRDD = text.map(line_len)
llengthRDD.count()
Out[37]:
In [38]:
long_lines.value # this is how we obtain the value of the accumulator in the driver program
Out[38]:
In the example above, we update the value of an accumulator within a transformation (map). This is not recommended, except for debugging purposes! The reason is that, if there are failures during the materialization of llengthRDD, some of its partitions will be re-computed, possibly causing the accumulator to double-count some of the long lines.
It is advisable to use accumulators within actions - and particularly with the foreach action, as demonstrated below.
In [39]:
text = sc.textFile("myfile.txt")
long_lines = sc.accumulator(0)
def line_len(line):
global long_lines
length = len(line)
if length > 30:
long_lines += 1
text.foreach(line_len)
long_lines.value
Out[39]:
We use broadcast variables when many operations depend on the same large static object - e.g., a large lookup table that does not change but provides information for other operations. In such cases, we can make a broadcast variable out of the object and thus make sure that the object will be shipped to the cluster only once - and not for each of the operations we'll be using it for.
The example below demonstrates the usage of broadcast variables. In this case, we make a broadcast variable out of a dictionary that represents an address table. The table is shipped to cluster nodes only once across multiple operations.
In [40]:
def load_address_table():
return {"Anu": "Chem. A143", "Karmen": "VTT, 74", "Michael": "OIH, B253.2",
"Anwar": "T, B103", "Orestis": "T, A341", "Darshan": "T, A325"}
address_table = sc.broadcast(load_address_table())
def find_address(name):
res = None
if name in address_table.value:
res = address_table.value[name]
return res
people = sc.parallelize(["Anwar", "Michael", "Orestis", "Darshan"])
pairRDD = people.map(lambda name: (name, find_address(name))) # first operation that uses the address table
print(pairRDD.collectAsMap())
other_people = sc.parallelize(["Karmen", "Michael", "Anu"])
pairRDD = other_people.map(lambda name: (name, find_address(name))) # second operation that uses the address table
print(pairRDD.collectAsMap())
In [41]:
sc.stop()
In [ ]: