Prepare the PySpark environment.
In [ ]:
import findspark
import os
# point findspark at the local Spark installation so pyspark can be imported
findspark.init('/home/ubuntu/shortcourse/spark-1.5.1-bin-hadoop2.6')
from pyspark import SparkContext, SparkConf
# run Spark locally with 2 worker threads
conf = SparkConf().setAppName("test").setMaster("local[2]")
sc = SparkContext(conf=conf)
Make sure your HDFS is still on and the input files (the three books) are still in the input folder.
Create the input RDD from the files on the HDFS (hdfs://localhost:54310/user/ubuntu/input). (hint: use the textFile API from SparkContext)
In [ ]:
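A possible solution for this cell (a sketch; the RDD name lines is an assumed choice):
lines = sc.textFile("hdfs://localhost:54310/user/ubuntu/input")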
Count how many lines are in the input RDD.
In [ ]:
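A sketch, assuming the RDD from the previous cell is named lines:
lines.count()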
Perform the word count using flatMap, map, and reduceByKey.
In [ ]:
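A sketch of the counting step, assuming lines from above; splitting on whitespace is an assumption (the exercise may also want lower-casing or punctuation stripping):
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)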
Take the top 20 most frequently used words.
In [ ]:
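One way to do this, assuming the counts RDD from the previous cell; ordering by the negated count returns the most frequent words first:
counts.takeOrdered(20, key=lambda pair: -pair[1])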
Read the pattern file into a set. (file: /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt)
In [ ]:
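A sketch that reads the file on the driver with plain Python; treating every whitespace-separated token as one pattern word is an assumption about the file format:
with open('/home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt') as f:
    patterns = set(f.read().split())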
Perform the counting again, this time using flatMap, filter, map, and reduceByKey.
In [ ]:
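A sketch building on the earlier names (lines, patterns); the filter keeps only words that appear in the pattern set:
filtered_counts = lines.flatMap(lambda line: line.split()) \
                       .filter(lambda word: word in patterns) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)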
Collect and show the results.
In [ ]:
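A sketch, assuming filtered_counts from the previous cell; collect() brings the results back to the driver so they can be printed:
for word, count in filtered_counts.collect():
    print(word, count)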
In [ ]:
# stop the spark context
sc.stop()