Note: Load this notebook with the spark profile (i.e., with the environment variable set and the Spark context available via pyspark). For more information, see Getting Started with Spark (in Python).
If you're following along in the pyspark interpreter, there is no need to load the Spark context; it is automatically made available to you as sc in the terminal. If you're writing a Spark application, however, you must create the context yourself.
In [ ]:
# Create a Spark context that runs locally inside this application
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
In [ ]:
# The master URL tells you which cluster this context is connected to
print(sc.master)
The isprime function takes a number n and decides whether it is prime: after handling the special cases (numbers less than 2, and even numbers), it checks that no odd number from 3 up to the square root of n divides n.
In [ ]:
def isprime(n):
    """
    Check if integer n is a prime.
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up to the square root of n,
    # stepping through odd numbers only
    for x in range(3, int(n**0.5) + 1, 2):
        if n % x == 0:
            return False
    return True
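Before distributing isprime across an RDD, it can help to sanity-check it locally. This quick test is an illustrative extra, not part of the original notebook:
In [ ]:
# Quick local check: primes below 20
print([n for n in range(20) if isprime(n)])  # expect [2, 3, 5, 7, 11, 13, 17, 19]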
To create an RDD, parallelize a collection (in this case the numbers 0 up to 1,000,000) out to the cluster. You can then apply transformations to the RDD (filter) and use actions (count) to return a result back to the driver program.
In [ ]:
# Create an RDD of the numbers 0 through 999,999
nums = sc.parallelize(range(1000000))
# Compute the number of primes in the RDD
print(nums.filter(isprime).count())
# Equivalent single-machine check: len([n for n in range(1000000) if isprime(n)])
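The count action only reports how many primes there are. If you also want to inspect a few of the primes themselves, the take action pulls a small sample back to the driver; this snippet is an illustrative addition, not part of the original notebook:
In [ ]:
# Peek at the first ten primes in the RDD
print(nums.filter(isprime).take(10))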
In [ ]:
from operator import add

def tokenize(text):
    return text.split()

# Create an RDD from a text file on disk
text = sc.textFile("fixtures/tolstoy.txt")
# Transform: split each line into tokens, then count occurrences of each token
wc = text.flatMap(tokenize)
wc = wc.map(lambda x: (x, 1)).reduceByKey(add)
# Action: pull ten (word, count) pairs back to the driver
print(wc.take(10))
# wc.saveAsTextFile("results/counts")  # Action: write the full result set to disk
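The take(10) call above returns ten arbitrary (word, count) pairs. As an illustrative extension (assuming the same wc RDD), the takeOrdered action can bring back the most frequent tokens instead:
In [ ]:
# Ten most frequent tokens, ordered by descending count
for word, count in wc.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)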
In [ ]: