Prepare the PySpark environment.
In [ ]:
import findspark
import os
# point findspark at the local Spark installation so pyspark can be imported
findspark.init('/home/ubuntu/shortcourse/spark-1.5.1-bin-hadoop2.6')
from pyspark import SparkContext, SparkConf
# run Spark locally with 2 worker threads
conf = SparkConf().setAppName("test").setMaster("local[2]")
sc = SparkContext(conf=conf)
Make sure your HDFS is still on and the input files (the three books) are still in the input folder.
Create the input RDD from the files on the HDFS (hdfs://localhost:54310/user/ubuntu/input). (hint: use the textFile API from SparkContext)
In [ ]:
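A possible solution for this cell (a sketch; the RDD name lines is an assumed choice):
lines = sc.textFile("hdfs://localhost:54310/user/ubuntu/input")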
Count how many lines are in the input RDD.
In [ ]:
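A sketch, assuming the RDD from the previous cell is named lines:
lines.count()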
Perform the word count using flatMap, map, and reduceByKey.
In [ ]:
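A sketch of the counting step, assuming lines from above; splitting on whitespace is an assumption (the exercise may also want lower-casing or punctuation stripping):
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)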
Take the top 20 most frequently used words.
In [ ]:
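One way to do this, assuming the counts RDD from the previous cell; ordering by the negated count returns the most frequent words first:
counts.takeOrdered(20, key=lambda pair: -pair[1])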
Read the pattern file into a set. (file: /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt)
In [ ]:
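A sketch that reads the file on the driver with plain Python; treating every whitespace-separated token as one pattern word is an assumption about the file format:
with open('/home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt') as f:
    patterns = set(f.read().split())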
Perform the counting again, this time using flatMap, filter, map, and reduceByKey.
In [ ]:
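A sketch building on the earlier names (lines, patterns); the filter keeps only words that appear in the pattern set:
filtered_counts = lines.flatMap(lambda line: line.split()) \
                       .filter(lambda word: word in patterns) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)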
Collect and show the results.
In [ ]:
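A sketch, assuming filtered_counts from the previous cell; collect() brings the results back to the driver so they can be printed:
for word, count in filtered_counts.collect():
    print(word, count)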
In [ ]:
# stop the spark context
sc.stop()