In [4]:
import pyspark
sc = pyspark.SparkContext('local[*]')
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
Out[4]:
In [5]:
rdd = sc.parallelize(range(100000))
In [6]:
rdd
Out[6]:
You can convert it back to a Python list using the collect method:
In [7]:
rdd.collect()
Out[7]:
Main operations on RDD are transformations:
map - apply a function to every element of the collection;
filter - filter the collection using a predicate;
flatMap - apply a function that changes each element into a collection and flatten the results;
... and actions (a.k.a. aggregations):
collect - converts the RDD to a list;
count - counts the number of elements in the RDD;
take(n) - takes the first n elements of the RDD and returns a list;
takeSample(withReplacement, n) - takes a sample of n elements from the RDD;
reduce(function) - reduces the collection using function;
aggregate - aggregates the elements of an RDD.
See http://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD for more information and other useful functions.
Remember: Transformations are lazy and actions are eager.
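To see that distinction in action, here is a small sketch (using the rdd defined above): the map call returns immediately without touching the data, while count triggers an actual Spark job.
In [ ]:
# map is a transformation: it only records the computation, nothing runs yet.
doubled = rdd.map(lambda x: x * 2)
# count is an action: it forces Spark to execute the pipeline.
doubled.count()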
Task 1 Fill in the ... below to avoid an AssertionError:
In [13]:
assert rdd.map(lambda x: x * 13 % 33).take(34)[-1] == ...
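Hint (a plain-Python check, not part of the task itself): take(34) returns the results for x = 0..33, so the [-1] element corresponds to x = 33.
In [ ]:
# Expected value of the last of the first 34 elements, computed locally:
(33 * 13) % 33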
Task 2 Write count using reduce:
In [17]:
def fun(x, y):
return "something"
assert rdd.count() == rdd.reduce(fun)
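If you get stuck, one common pattern (a sketch, not necessarily the intended fun above) is to map every element to 1 first and then reduce with addition:
In [ ]:
# Sketch: count as a map + reduce, using the rdd from above.
assert rdd.map(lambda x: 1).reduce(lambda x, y: x + y) == rdd.count()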
Task 3 Write sum using aggregate, described here: http://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD.aggregate
In [ ]:
assert rdd.sum() == rdd.aggregate(0, lambda x, y: ..., lambda x, y: ...)
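To illustrate aggregate's two functions without giving the sum away, here is the classic mean example (a sketch using the same rdd): seqOp folds one element into a partition's accumulator, and combOp merges accumulators from different partitions.
In [ ]:
# Compute (sum, count) in a single pass, then the mean.
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # fold an element in
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two accumulators
s, c = rdd.aggregate((0, 0), seq_op, comb_op)
s / c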
In [20]:
rdd2 = rdd.flatMap(lambda x: [x**2 % 8609, x**3 % 8609])
In [23]:
rdd2.take(10)
Out[23]:
Task 4 Get the biggest value from rdd2 (use reduce):
In [ ]:
assert rdd2.max() == rdd2.reduce(lambda x, y: ...)
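One possible fill (an assumption about the intended answer): a pairwise maximum as the reduce function.
In [ ]:
# Pairwise max: reduce keeps the larger of each pair it sees.
rdd2.reduce(lambda x, y: x if x > y else y)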
Task 5 Get the second biggest value from rdd2, using reduce or aggregate.
In [ ]:
rdd2.aggregate(..., ..., ...)
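A sketch of one aggregate-based approach (the names and zero value are my own choices): carry the two largest values seen so far as a (top, second) pair; seq_op folds in a new element and comb_op merges pairs from two partitions.
In [ ]:
def seq_op(acc, x):
    # Fold a new element into the (top, second) accumulator.
    top, second = acc
    if x > top:
        return (x, top)
    if x > second:
        return (top, x)
    return acc

def comb_op(a, b):
    # Merge two accumulators: keep the two largest of the four values.
    top, second = sorted([a[0], a[1], b[0], b[1]], reverse=True)[:2]
    return (top, second)

rdd2.aggregate((float('-inf'), float('-inf')), seq_op, comb_op)[1]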
In [34]:
remainders = rdd2.groupBy(lambda x: x)
remainders.take(10)
Out[34]:
We only want the lengths of the ResultIterables, so we can map the values of the key-value pairs:
In [35]:
remainders_counts = remainders.mapValues(lambda x: len(x))
remainders_counts.take(10)
Out[35]:
Let's sort it by count:
In [37]:
remainders_counts.sortBy(lambda x: x[1], ascending=False).take(15)
Out[37]:
Task 6 Compute the counts of repetitions of remainders: how many remainders occurred 36 times, how many occurred 35 times, and so on...
In [ ]:
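One way to approach it (a sketch, assuming remainders_counts from above): swap each pair to (repetition_count, 1) and sum with reduceByKey.
In [ ]:
repetition_counts = (remainders_counts
                     .map(lambda kv: (kv[1], 1))   # key by the repetition count
                     .reduceByKey(lambda x, y: x + y)
                     .sortByKey(ascending=False))
repetition_counts.take(10)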
You can move on to a real dataset!