In [ ]:
from pyspark import SparkContext, SparkConf

In [ ]:
!echo $SPARK_HOME

In [ ]:
%%bash
/spark/bin/run-example SparkPi 100

In [ ]:
%%bash
PYSPARK_PYTHON=python /spark/bin/spark-submit /spark/examples/src/main/python/pi.py 10

Apache Spark Quick Start


In [ ]:
conf = (SparkConf()
        .setMaster("mesos://zk://10.132.126.37:2181/mesos")
        .setAppName("RY from jupyter")
        .set("spark.executor.uri", "http://apache.petsads.us/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz")
        .set("spark.mesos.coarse", "true")
        .set("spark.mesos.executor.home", "/spark-1.2.0-bin-hadoop2.4")
        .set("spark.executor.extraLibraryPath", "/usr/lib/hadoop/lib/native")
        .set("spark.executor.extraJavaOptions", "-XX:-UseConcMarkSweepGC")
        .set("spark.mesos.native.library", "/usr/local/lib/libmesos.so")
        .set("spark.local.ip", "10.9.8.6")
        .set("spark.driver.host", "10.9.8.6"))

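The assembled configuration can be inspected before use; to actually run against the Mesos cluster it would be passed to the SparkContext constructor (sc = SparkContext(conf=conf)) instead of the plain local context created below. A minimal sketch:


In [ ]:
# a sketch: list the settings collected in conf above
# (to use them, pass conf to the constructor: sc = SparkContext(conf=conf))
print(conf.toDebugString())
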
In [ ]:
# for spark local
#  https://spark.apache.org/docs/latest/quick-start.html
sc = SparkContext()
sc.version

In [ ]:
# load the README.md

textFile = sc.textFile("/spark/README.md")

In [ ]:
# action

textFile.count()

In [ ]:
# explore the methods available on the RDD (via tab completion on textFile. or dir())
dir(textFile)

In [ ]:
# confirm the answer using the standard library
len(open("/spark/README.md").read().strip().split("\n"))

In [ ]:
# first line:  action
textFile.first()

In [ ]:
# transformation

textFile.filter(lambda x: "Spark" in x)

In [ ]:
# how many lines contain "Spark"

textFile.filter(lambda x: "Spark" in x).count()

In [ ]:
# number of words in the line with the most words
textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

In [ ]:
textFile.map(lambda line: len(line.split())).reduce(max)

In [ ]:
# compute the number of words in the line with the most words another way, using the standard library
max(map(lambda line: len(line.split()), open("/spark/README.md").read().strip().split("\n")))

To understand flatMap, I needed an action to convert the RDD to a list; collect does that. We can use take(n) instead to grab only the first n items.


In [ ]:
# flatMap
textFile.flatMap(lambda line: line.split()).take(5)

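For contrast, the same split with map (rather than flatMap) keeps one list of words per input line, so the result is a list of lists rather than a flat stream of words. A minimal sketch:


In [ ]:
# map: one element (a list of words) per line; flatMap flattens these into individual words
textFile.map(lambda line: line.split()).take(5)
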
In [ ]:
# a map/reduce workflow

wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.takeSample(False, 5)

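To see the most frequent words rather than a random sample, the pairs can be taken in sorted order. A minimal sketch using takeOrdered with a descending count as the key:


In [ ]:
# top 5 (word, count) pairs; negating the count sorts in descending order
wordCounts.takeOrdered(5, key=lambda pair: -pair[1])
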
Parallelized Collections

Creating RDDs:

The simplest way to create RDDs is to take an existing in-memory collection and pass it to SparkContext’s parallelize method. This approach is very useful when learning Spark, since you can quickly create your own RDDs in the shell and perform operations on them. Keep in mind however, that outside of prototyping and testing, this is not widely used since it requires you have your entire dataset in memory on one machine [emphasis mine].


In [ ]:
nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
    print("%i " % num)

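parallelize also accepts an explicit number of partitions. A minimal sketch (the partition count of 4 here is arbitrary); glom collects the elements of each partition into a list so the split is visible:


In [ ]:
# distribute the collection across 4 partitions and confirm how it was split
nums4 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)
print(nums4.getNumPartitions())
print(nums4.glom().collect())
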
Learning Spark

This code is drawn from Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark. O'Reilly Media, Inc., 2015. http://my.safaribooksonline.com/book/databases/business-intelligence/9781449359034.

Key concepts from Chapter 3:

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java or Scala objects, including user-defined classes. Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g. a list or set) in their driver program.

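Both creation routes from the excerpt above, side by side; a minimal sketch (the parallelized strings are arbitrary examples):


In [ ]:
# (1) load an external dataset
lines_rdd = sc.textFile("/spark/README.md")
# (2) distribute an in-memory collection from the driver program
words_rdd = sc.parallelize(["pandas", "i like pandas"])
lines_rdd.count(), words_rdd.count()
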
S3

Note: this won't work because the rdhyee/ipython-spark image needs to have Hadoop installed, which it doesn't.


In [ ]:
s3file = sc.textFile("s3n://AKIAI3ZHCGO3UMYFXWFA:w0ALUVQ3p6bqmMYytMn1w93fL5JlSLNK5IDKjHRv@aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segment.paths.gz")
s3file.count()
