Prepare the PySpark environment.
In [1]:
import findspark
import os
findspark.init('/home/ubuntu/shortcourse/spark-1.5.1-bin-hadoop2.6')
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("test").setMaster("local[2]")
sc = SparkContext(conf=conf)
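Optionally, confirm the context came up as configured (a quick sanity check using the standard sc.version and sc.master attributes):
In [ ]:
# Optional sanity check: the Spark version and the master URL set above.
print(sc.version)   # expect 1.5.1
print(sc.master)    # expect local[2]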
Make sure HDFS is still running and the input files (the three books) are still in the input folder.
Create the input RDD from the files on the HDFS (hdfs://localhost:54310/user/ubuntu/input).
In [2]:
lines = sc.textFile('hdfs://localhost:54310/user/ubuntu/input')
lines.count()
Out[2]:
Perform the counting with flatMap, map, and reduceByKey.
In [3]:
from operator import add
counts = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(add)
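To see what each stage does, here is the same pipeline run on a tiny in-memory RDD (a sketch for illustration only; it reuses the sc and add defined above):
In [ ]:
# Illustration: split lines into words, pair each word with 1, sum per word.
tiny = sc.parallelize(["to be or not to be", "to do"])
tiny.flatMap(lambda x: x.split()) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .collect()
# counts: 'to' -> 3, 'be' -> 2, 'or' -> 1, 'not' -> 1, 'do' -> 1 (order may vary)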
Take the 10 most frequently used words.
In [5]:
counts.takeOrdered(10, lambda x: -x[1])
Out[5]:
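takeOrdered with a negated count returns the words in descending order of frequency; an equivalent formulation (not part of the original notebook, shown only as an alternative) uses sortBy:
In [ ]:
# Alternative: sort the whole RDD by count descending, then take 10.
counts.sortBy(lambda x: x[1], ascending=False).take(10)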
Read the pattern file into a set. (file: /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt)
In [6]:
pattern = set()
# Read the whitespace-separated pattern words into a set for fast lookup.
with open('/home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt') as f:
    for line in f:
        for word in line.split():
            pattern.add(word)
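For a small set like this, capturing pattern in the lambda's closure (as in the next cell) is fine; broadcasting it is an optional variant for larger lookup sets, sketched below:
In [ ]:
# Optional: ship the lookup set to the executors once as a broadcast variable.
pattern_bc = sc.broadcast(pattern)
# The filter step could then use: lambda x: x in pattern_bc.value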
Perform the counting with flatMap, filter, map, and reduceByKey.
In [7]:
result = lines.flatMap(lambda x: x.split()).filter(lambda x: x in pattern).map(lambda x: (x, 1)).reduceByKey(add)
Collect and show the results.
In [8]:
result.collect()
Out[8]:
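Instead of collecting to the driver, the result could also be written back to HDFS with saveAsTextFile (the output path below is only an example, and the directory must not already exist):
In [ ]:
# Optional: persist the filtered counts to HDFS instead of collecting them.
result.saveAsTextFile('hdfs://localhost:54310/user/ubuntu/wc2-output')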
In [9]:
# stop the spark context
sc.stop()