Lab 1 - Hello Spark

This lab shows you how to work with Apache Spark using Python.

Step 1 - Working with the Spark Context

Check which version of Apache Spark is set up within this lab notebook.

In Step 1, invoke the Spark context and extract the version of the Spark driver application that is running.

Type sc.version


In [1]:
#Step 1 - Check the Spark version
#Type:
#sc.version
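
This lab assumes sc (the SparkContext) is predefined by the notebook. If it is not, here is a minimal sketch to create one yourself, assuming a local PySpark installation (the version in the comment is only an example; yours depends on the lab image):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the notebook's context if one already exists
print(sc.version)                # e.g. 2.4.8, depending on your environment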

Step 2 - Working with Resilient Distributed Datasets

Create multiple RDDs and return results.

In Step 2, create an RDD with the numbers 1 to 10, extract the first element, extract the first 5 elements, then create an RDD with the string "Hello Spark!" and extract its first element.


In [2]:
#Step 2 - Create an RDD of the numbers 1-10

#Type: 
#x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
#x_nbr_rdd = sc.parallelize(x)

In [3]:
#Step 2 - Extract the first element

#Type:
#x_nbr_rdd.first()

In [4]:
#Step 2 - Extract the first 5 elements

#Type:
#x_nbr_rdd.take(5)
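
Note that take(n) returns only the first n elements to the driver, while collect() returns every element. A quick sketch of the difference, assuming the x_nbr_rdd defined above:

x_nbr_rdd.take(5)     # [1, 2, 3, 4, 5]
x_nbr_rdd.collect()   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] -- use with care on large RDDs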

In [5]:
#Step 2 - Create a string RDD, extract the first element

#Type:
#y = ["Hello Spark!"]
#y_str_rdd = sc.parallelize(y)
#y_str_rdd.first()
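
Putting Step 2 together, here is a runnable sketch, assuming sc is available in the notebook, with the expected results shown as comments:

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x_nbr_rdd = sc.parallelize(x)
print(x_nbr_rdd.first())   # 1
print(x_nbr_rdd.take(5))   # [1, 2, 3, 4, 5]

y = ["Hello Spark!"]
y_str_rdd = sc.parallelize(y)
print(y_str_rdd.first())   # Hello Spark!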

Step 3 - Working with Strings

In Step 3, create a longer string of words that includes "Hello" and "Spark", map the string into an RDD as a collection of words, and extract the count of the words "Hello" and "Spark" found in your RDD.


In [6]:
#Step 3 - Create a string RDD, extract the first element

#Type:
#z = ["Hello World!, Hello Universe!, I love Spark"]
#z_str_rdd = sc.parallelize(z)
#z_str_rdd.first()

In [7]:
#Step 3 - Create an RDD with one element per word, extract the first 7 words

#Type:
#z_str2_rdd = z_str_rdd.flatMap(lambda line: line.split(" "))
#z_str2_rdd.take(7)
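
The choice of flatMap rather than map matters here: map would keep one element per input string (a single list of all 7 words), while flatMap flattens those lists into an RDD with one element per word. A quick sketch of the difference, assuming the z_str_rdd defined above:

z_str_rdd.map(lambda line: line.split(" ")).first()
# ['Hello', 'World!,', 'Hello', 'Universe!,', 'I', 'love', 'Spark']  -- one list element

z_str_rdd.flatMap(lambda line: line.split(" ")).take(3)
# ['Hello', 'World!,', 'Hello']  -- individual word elements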

In [8]:
#Step 3 - Count of "Hello" words

#type:
#z_str3_rdd = z_str2_rdd.filter(lambda line: "Hello" in line) 
#print "The count of words 'Hello' in: " + repr(z_str_rdd.first())
#print "Is: " + repr(z_str3_rdd.count())

In [9]:
#Step 3 - Count of "Spark" words
#type
#z_str4_rdd = z_str2_rdd.filter(lambda line: "Spark" in line) 
#print "The count of words 'Spark' in: " + repr(z_str_rdd.first())
#print "Is: " + repr(z_str4_rdd.count())