In [1]:
# create entry points to spark
try:
    sc.stop()
except:
    pass
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sparkContext=sc)
The class pyspark.SparkContext creates a client which connects to a Spark cluster. This client can be used to create RDD objects. There are two methods in this class for creating RDD objects directly:
parallelize()
textFile()
In [2]:
# from a list
rdd = sc.parallelize([1,2,3])
rdd.collect()
Out[2]:
In [3]:
# from a tuple
rdd = sc.parallelize(('cat', 'dog', 'fish'))
rdd.collect()
Out[3]:
In [4]:
# from a list of tuples
list_t = [('cat', 'dog', 'fish'), ('orange', 'apple')]
rdd = sc.parallelize(list_t)
rdd.collect()
Out[4]:
In [5]:
# from a set (the set itself removes the duplicate elements)
s = {'cat', 'dog', 'fish', 'cat', 'dog', 'dog'}
rdd = sc.parallelize(s)
rdd.collect()
Out[5]:
When a dict is passed to parallelize(), only the keys are used to form the RDD.
In [6]:
# from a dict
d = {
'a': 100,
'b': 200,
'c': 300
}
rdd = sc.parallelize(d)
rdd.collect()
Out[6]:
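The values of the dict are dropped. If you also need the values, one option (a sketch, not part of the original example) is to parallelize the dict's key-value pairs, so that each element of the RDD becomes a (key, value) tuple:
In [ ]:
# sketch: keep both keys and values by parallelizing the dict's items
d = {'a': 100, 'b': 200, 'c': 300}
rdd = sc.parallelize(list(d.items()))
rdd.collect()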
textFile()
The textFile() function reads a text file and returns it as an RDD of strings. Usually you will need to apply map functions to transform each element of the RDD into a data structure/type that is suitable for data analysis; a sketch of such a transformation follows the CSV example below.
When using textFile(), each line of the text file becomes an element in the resulting RDD.
Examples:
In [7]:
# read a csv file
rdd = sc.textFile('../../data/mtcars.csv')
rdd.take(5)
Out[7]:
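As noted above, the raw lines are plain strings. A minimal sketch of the usual follow-up map step (assuming the file is comma-separated, as its .csv extension suggests) splits each line into a list of field values:
In [ ]:
# sketch: split each comma-separated line into a list of fields
rdd_fields = rdd.map(lambda line: line.split(','))
rdd_fields.take(2)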
In [8]:
# read a txt file
rdd = sc.textFile('../../data/twitter.txt')
rdd.take(5)
Out[8]: