PySpark Practicum

Note: Load this notebook with the spark profile (with the environment variable set and a Spark context available via pyspark). For more information, see Getting Started with Spark (in Python).

If you're following along in a pyspark interpreter, there is no need to create the Spark context: the shell makes it available automatically as sc. If you're writing a Spark application, however, you must create the context yourself.


In [ ]:
from pyspark import SparkContext

# 'local' runs Spark on this machine; 'pyspark' is the application name
sc = SparkContext('local', 'pyspark')

In [ ]:
# Show the master URL this context is connected to
print(sc.master)

The isprime function takes a number n and determines whether it is prime. After handling values below 2, the value 2 itself, and even numbers, it checks that no odd number from 3 up to the square root of n divides n.


In [ ]:
def isprime(n):
    """
    check if integer n is a prime
    """
    # coerce n to a non-negative integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # the range starts at 3, visits only odd numbers, and only needs to
    # go up to the square root of n
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True
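
To make the loop bounds concrete, the sketch below lists the odd candidates that the loop would test for a given n. The helpers trial_divisors and first_odd_divisor are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
def trial_divisors(n):
    """Hypothetical helper: the odd candidates isprime's loop tests for n."""
    return list(range(3, int(n ** 0.5) + 1, 2))


def first_odd_divisor(n):
    """Return the first candidate that divides n, or None if none does."""
    for x in trial_divisors(n):
        if n % x == 0:
            return x
    return None


print(trial_divisors(97))     # [3, 5, 7, 9]
print(first_odd_divisor(97))  # None, so 97 is prime
print(first_odd_divisor(91))  # 7, because 91 = 7 * 13
```

Stopping at the square root is enough because any divisor larger than the square root of n must pair with a cofactor smaller than it, and that cofactor would already have been tested.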

To create an RDD, parallelize a collection (in this case the numbers 0 to 999,999) out to the cluster. You can then apply transformations to the RDD (filter) and use actions (count) to return a result to the driver program. Transformations are lazy: no work happens until an action is called.


In [ ]:
# Create an RDD of the numbers 0 to 999,999
nums = sc.parallelize(range(1000000))

# Compute the number of primes in the RDD
print(nums.filter(isprime).count())
# Local (non-Spark) equivalent:
# sum(1 for n in range(1000000) if isprime(n))
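
As a sanity check on the count the Spark job reports, the same number can be computed locally with a different method. The sketch below swaps in a sieve of Eratosthenes rather than the notebook's trial division; count_primes_below is a hypothetical helper name, and the code does not use Spark at all.

```python
def count_primes_below(limit):
    """Count the primes p with 0 <= p < limit using a sieve of Eratosthenes."""
    if limit < 3:
        return 0
    # sieve[i] == 1 means "i is still presumed prime"
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            # Cross out every multiple of p, starting at p * p
            sieve[p * p::p] = bytearray(len(range(p * p, limit, p)))
    return sum(sieve)


print(count_primes_below(1000000))  # 78498
```

If the filter and count above are working as intended, the Spark job should print the same value.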

Word Frequency

Compute the frequency of every word in War and Peace by Leo Tolstoy.


In [ ]:
from operator import add

def tokenize(text):
    return text.split()

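# Create an RDD with one element per line of the file
text = sc.textFile("fixtures/tolstoy.txt")

# Transformations: split each line into words, pair every word with 1,
# then sum the 1s for each distinct word
wc = text.flatMap(tokenize)
wc = wc.map(lambda x: (x, 1)).reduceByKey(add)

# Action: bring 10 (word, count) pairs back to the driver
print(wc.take(10))

# Action: write every (word, count) pair to disk
# wc.saveAsTextFile("results/counts")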
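
To see what each step of the pipeline produces, here is a plain-Python sketch that mirrors flatMap, map, and reduceByKey on a tiny in-memory sample. It does not use Spark, and the sample lines are made up for illustration; they are not taken from fixtures/tolstoy.txt.

```python
from itertools import chain
from operator import add


def tokenize(text):
    return text.split()


# Hypothetical stand-in for the lines of the input file
lines = ["well prince so genoa", "so well"]

# flatMap: tokenize every line and flatten the results into one sequence
words = list(chain.from_iterable(tokenize(line) for line in lines))

# map: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values that share a key using add
counts = {}
for word, n in pairs:
    counts[word] = add(counts[word], n) if word in counts else n

print(sorted(counts.items()))
# [('genoa', 1), ('prince', 1), ('so', 2), ('well', 2)]
```

Note that text.split() only splits on whitespace, so in the real text tokens such as "Prince," and "prince" would be counted as different words unless the tokenizer also normalizes case and punctuation.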
