This notebook illustrates the use of Spark in SWAN.
The current setup allows PySpark operations to be executed on a local standalone Spark instance. This can be used for testing with small datasets.
In the future, SWAN users will be able to attach external Spark clusters to their notebooks so that they can target bigger datasets. In addition, a Scala Jupyter kernel will be added so Spark can also be used from Scala.
The pyspark module is available to perform the necessary imports.
In [1]:
from pyspark import SparkContext
A SparkContext needs to be created before running any Spark operation. This context is linked to the local Spark instance.
In [2]:
sc = SparkContext()
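Only one SparkContext can be active at a time, but if more control is needed it can be created from an explicit configuration instead of the default constructor above. A minimal sketch, in which the local master URL and the application name are illustrative choices:

from pyspark import SparkConf, SparkContext

# Run locally using all available cores; the application name is arbitrary
conf = SparkConf().setMaster("local[*]").setAppName("swan-example")
sc = SparkContext(conf=conf)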
Let's use our SparkContext to parallelize a list.
In [13]:
rdd = sc.parallelize([1, 2, 4, 8])
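parallelize also accepts an optional number of partitions for the resulting RDD. As a sketch, with an arbitrary partition count:

# Distribute the list over 2 partitions instead of the default
rdd = sc.parallelize([1, 2, 4, 8], 2)
rdd.getNumPartitions()  # returns 2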
We can count the number of elements in the list.
In [14]:
rdd.count()
Out[14]: 4
Let's now map a function over our RDD to increment all its elements.
In [15]:
rdd.map(lambda x: x + 1).collect()
Out[15]: [2, 3, 5, 9]
We can also calculate the sum of all the elements with reduce.
In [16]:
rdd.reduce(lambda x, y: x + y)
Out[16]: 15
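The same transformations and actions can be chained in a single expression, and the context can be stopped once it is no longer needed. A minimal sketch:

# Increment every element and sum the results in one chained expression
total = sc.parallelize([1, 2, 4, 8]).map(lambda x: x + 1).reduce(lambda x, y: x + y)
print(total)  # 19

# Release the resources held by the local Spark instance
sc.stop()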