Lab 1 - Hello Spark

This lab will show you how to work with Apache Spark using Python.



Step 1 - Working with Spark Context

Check what version of Apache Spark is set up within this lab notebook.

Apache Spark has a driver application that launches various parallel operations on a cluster. The driver application uses a SparkContext, which provides the programming interface for interacting with it. Spark contexts are available in multiple programming languages, including Python, Scala and Java.

In Step 1, invoke the Spark context and extract the version of the Spark driver application that is running.

Note: In the cloud-based notebook environment used for this lab, the Spark context is predefined as sc.
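In a standalone script outside such an environment, you would create the context yourself. A minimal sketch, assuming a local run (the master URL "local[*]" and the app name are illustrative, not part of this lab):

# Minimal sketch of creating a SparkContext in a standalone PySpark script.
# "local[*]" (run locally on all cores) and the app name are assumptions.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("Lab1-HelloSpark")
sc = SparkContext(conf=conf)
print sc.version  # same attribute used in the cells below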


In [3]:
#Step 1 - sc is the Spark Context. Evaluate it to see if it's active in the cluster

#Note: Notice the programming language used
sc


Out[3]:
<pyspark.context.SparkContext at 0x7fd500879b10>

In [4]:
#Step 1 - The Spark context has a .version attribute that returns the version of the Spark driver application

#Note: Different Spark versions support additional functionality such as DataFrames, Streaming and Machine Learning.

#Invoke the spark context version command
sc.version


Out[4]:
u'1.6.1'

Step 2 - Working with Resilient Distributed Datasets

Create multiple RDDs and return results.

Apache Spark uses an abstraction for working with data called RDDs - Resilient Distributed Datasets. An RDD is simply an immutable distributed collection of objects. In Apache Spark, all work is expressed by either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute results. When working with RDDs, the Spark driver application automatically distributes the work across your cluster.

In Step 2 - Create an RDD with the numbers 1 to 10, extract the first element, extract the first 5 elements, create an RDD with the string "Hello Spark!", and extract its first element.

Invoke an action to return data from your RDDs.

Note: RDD commands are either transformations or actions. Transformations are lazy: they define a new RDD but do not initiate any computation on the Spark cluster. Actions trigger the parallel computation and return a result.
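To see the difference, note that a transformation such as map returns immediately without computing anything, while an action such as collect forces the computation. A small sketch (the variable names are illustrative and assume the predefined sc):

# Transformations are lazy; actions trigger the work.
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda n: n * 2)   # transformation: returns instantly, nothing runs yet
print doubled.collect()               # action: the cluster computes now -> [2, 4, 6, 8]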


In [19]:
#Step 2 - Create RDD Numbers

#Create the numbers 1 to 10 in a variable
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

#Place the numbers into an rdd called x_nbr_rdd
#Note: the Spark context has a parallelize command to create an RDD from data
#Hint... sc.parallelize(pass in your data variable)
x_nbr_rdd = sc.parallelize(x)

In [15]:
#Step 2 - Extract first line

#Notice: after running the previous cell, you did not receive a result.
#This is because you haven't yet invoked an action command.

#Note: RDDs support the first() command, which returns the first object; it is also an action command

#Invoke first on your RDD - Action
x_nbr_rdd.first()


Out[15]:
1

In [5]:
#Step 2 - Extract first 5 values

#Note: RDDs support the take(n) command, which returns the first n objects from the RDD
#Invoke take() and extract the first 5 values
x_nbr_rdd.take(5)


Out[5]:
[1, 2, 3, 4, 5]

In [16]:
#Step 2 - Create RDD String, Extract first line

#Create a string "Hello Spark!"
y = ["Hello Spark!"]

#Place the string value into an rdd called y_str_rdd
y_str_rdd = sc.parallelize(y)

#Return the first value in your RDD - Action
y_str_rdd.first()


Out[16]:
'Hello Spark!'

Step 3 - Working with Strings

As you can see, you created a single string "Hello Spark!", and the value of the one object returned is "Hello Spark!". What if you wanted to work with a corpus of words and run analysis on each of the words? Then you would need to map each word to its own object (or line) in an RDD.

In Step 3 - Create a larger string of words that includes "Hello" and "Spark", map the string into an RDD as a collection of words, and extract the count of the words "Hello" and "Spark" found in your RDD.


In [7]:
#Step 3 - Create RDD String, Extract first line

#Create a string with many words including "Hello" and "Spark"
z = ["Hello World!, Hello Universe!, I love Spark"]

#Place the string value into an rdd called z_str_rdd
z_str_rdd = sc.parallelize(z)

#Extract first line
z_str_rdd.first()


Out[7]:
'Hello World!, Hello Universe!, I love Spark'

In [8]:
#Step 3 - Create RDD with object for each word, Extract first 7 words

#Note: To analyze your string, you have to break it down into multiple objects.
#One way to do that is to map each word to its own element/line in your RDD

#An RDD transformation called map applies a function to each element of an RDD, e.g. a split command to split an element by a delimiter
#An RDD transformation called flatMap does the same as map but flattens the results, so each returned element (0 or more per input) becomes its own object in the new RDD
#Hint... flatMap(lambda line: line.split(" "))

#Create a new RDD z_str2_rdd using this transformation
z_str2_rdd = z_str_rdd.flatMap(lambda line: line.split(" "))

#Extract first 7 words - Action
z_str2_rdd.take(7)


Out[8]:
['Hello', 'World!,', 'Hello', 'Universe!,', 'I', 'love', 'Spark']
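Had you used map instead of flatMap here, the split results would stay nested: the RDD would contain one list per input line rather than one word per element. A quick sketch of the difference, reusing z_str_rdd from above:

# map keeps each split result as a single nested list...
z_str_rdd.map(lambda line: line.split(" ")).first()
# -> ['Hello', 'World!,', 'Hello', 'Universe!,', 'I', 'love', 'Spark'] (one list object)

# ...while flatMap flattens it so each word is its own object
z_str_rdd.flatMap(lambda line: line.split(" ")).first()
# -> 'Hello'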

In [9]:
#Step 3 - Count of "Hello" words

#Note: You can use the filter command to create a new RDD from another RDD based on filter criteria
#Hint... filter syntax is .filter(lambda line: "Filter Criteria Value" in line)

#Create a new RDD z_str3_rdd for all "Hello" words in the corpus of words
z_str3_rdd = z_str2_rdd.filter(lambda line: "Hello" in line)

#Note: Use a simple Python print command to add a string to your Spark results
#Hint... repr() will represent a value as a string
#Hint... Syntax: print "Text:" + repr(Spark commands)

#Extract the count of values in the new RDD, which represents the number of "Hello" words in the corpus
print "The count of words 'Hello' in: " + repr(z_str_rdd.first())
print "Is: " + repr(z_str3_rdd.count())


The count of words 'Hello' in: 'Hello World!, Hello Universe!, I love Spark'
Is: 2

In [11]:
#Step 3 - Count of "Spark" words
#Create a new RDD z_str4_rdd for all "Spark" words in the corpus of words
z_str4_rdd = z_str2_rdd.filter(lambda line: "Spark" in line) 

#Extract count of values in the new RDD which represents number of "Spark" words in corpus
print "The count of words 'Spark' in: " + repr(z_str_rdd.first())
print "Is: " + repr(z_str4_rdd.count())


The count of words 'Spark' in: 'Hello World!, Hello Universe!, I love Spark'
Is: 1
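As an aside, instead of filtering for each word separately, you can count every distinct word in one pass with the classic map/reduceByKey pattern. A sketch, reusing z_str2_rdd from above (the ordering of the output pairs is not guaranteed):

# Pair each word with a count of 1, then sum the counts per word.
word_counts = z_str2_rdd.map(lambda word: (word, 1)) \
                        .reduceByKey(lambda a, b: a + b)
print word_counts.collect()
# e.g. [('Hello', 2), ('Spark', 1), ('love', 1), ...]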
