Simple example with Spark


This notebook illustrates the use of Spark in SWAN.

The current setup allows you to execute PySpark operations on a local standalone Spark instance, which is useful for testing with small datasets.

In the future, SWAN users will be able to attach external Spark clusters to their notebooks in order to process bigger datasets. Moreover, a Scala Jupyter kernel will be added so that Spark can also be used from Scala.

Import the necessary modules

The pyspark module is available, so we can import SparkContext directly.


In [1]:
from pyspark import SparkContext

Create a SparkContext

A SparkContext needs to be created before running any Spark operation. This context is linked to the local Spark instance.


In [2]:
sc = SparkContext()
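
If you need to customize the local instance, the context can also be built from a SparkConf. This is a sketch, not part of the original notebook; the application name "swan-example" is just an illustrative value, and since only one SparkContext can be active at a time it is an alternative to the plain call above rather than an addition.

from pyspark import SparkConf, SparkContext

# Illustrative configuration; adjust the values to your setup.
conf = SparkConf().setAppName("swan-example").setMaster("local[*]")
sc = SparkContext(conf=conf)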

Run Spark actions and transformations

Let's use our SparkContext to parallelize a list.


In [13]:
rdd = sc.parallelize([1, 2, 4, 8])
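
As a side note, parallelize also accepts the number of partitions to split the data into, which becomes relevant for larger datasets. A minimal sketch, assuming the same SparkContext as above:

# Split the list explicitly into 2 partitions instead of the default.
rdd = sc.parallelize([1, 2, 4, 8], numSlices=2)
rdd.getNumPartitions()  # returns 2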

We can count the number of elements in the list.


In [14]:
rdd.count()


Out[14]:
4
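
Other basic actions work in the same way; for example, first and take retrieve elements without returning the whole RDD to the driver. A quick sketch for comparison:

rdd.first()  # returns 1
rdd.take(2)  # returns [1, 2]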

Let's now map a function to our RDD to increment all its elements.


In [15]:
rdd.map(lambda x: x + 1).collect()


Out[15]:
[2, 3, 5, 9]
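
It is worth noting that transformations such as map are lazy: they are only executed when an action like collect is called. A small sketch chaining a map with a filter (the variable names are just illustrative):

incremented = rdd.map(lambda x: x + 1)               # no computation happens yet
odd_only = incremented.filter(lambda x: x % 2 == 1)  # still lazy
odd_only.collect()                                   # triggers the computation, returns [3, 5, 9]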

We can also calculate the sum of all the elements with reduce.


In [16]:
rdd.reduce(lambda x, y: x + y)


Out[16]:
15
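
For this particular aggregation, the built-in sum action gives the same result; it is shown here only as a comparison:

rdd.sum()  # returns 15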