In [1]:
# create entry points to spark
try:
    sc.stop()
except:
    pass
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc=SparkContext()
spark = SparkSession(sparkContext=sc)

RDD object

The class pyspark.SparkContext creates a client which connects to a Spark cluster. This client can be used to create an RDD object. There are two methods from this class for directly creating RDD objects:

  • parallelize()
  • textFile()

parallelize()

parallelize() distribute a local python collection to form an RDD. Common built-in python collections include dist, list, tuple or set.

Examples:


In [2]:
# from a list
rdd = sc.parallelize([1,2,3])
rdd.collect()


Out[2]:
[1, 2, 3]

In [3]:
# from a tuple
rdd = sc.parallelize(('cat', 'dog', 'fish'))
rdd.collect()


Out[3]:
['cat', 'dog', 'fish']

In [4]:
# from a list of tuple
list_t = [('cat', 'dog', 'fish'), ('orange', 'apple')]
rdd = sc.parallelize(list_t)
rdd.collect()


Out[4]:
[('cat', 'dog', 'fish'), ('orange', 'apple')]

In [5]:
# from a set
s = {'cat', 'dog', 'fish', 'cat', 'dog', 'dog'}
rdd = sc.parallelize(s)
rdd.collect()


Out[5]:
['fish', 'dog', 'cat']

When it is a dict, only the keys are used to form the RDD.


In [6]:
# from a dict
d = {
    'a': 100,
    'b': 200,
    'c': 300
}
rdd = sc.parallelize(d)
rdd.collect()


Out[6]:
['a', 'b', 'c']

textFile()

The textFile() function reads a text file and returns it as an RDD of strings. Usually, you will need to apply some map functions to transform each elements of the RDD to some data structure/type that is suitable for data analysis.

When using textFile(), each line of the text file becomes an element in the resulting RDD.

Examples:


In [7]:
# read a csv file
rdd = sc.textFile('../../data/mtcars.csv')
rdd.take(5)


Out[7]:
[',mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb',
 'Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4',
 'Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4',
 'Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1',
 'Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1']

In [8]:
# read a txt file
rdd = sc.textFile('../../data/twitter.txt')
rdd.take(5)


Out[8]:
['Fresh install of XP on new computer. Sweet relief! fuck vista\t1018769417\t1.0',
 'Well. Now I know where to go when I want my knives. #ChiChevySXSW http://post.ly/RvDl\t10284216536\t1.0',
 '"Literally six weeks before I can take off ""SSC Chair"" off my email. Its like the torturous 4th mile before everything stops hurting."\t10298589026\t1.0',
 'Mitsubishi i MiEV - Wikipedia, the free encyclopedia - http://goo.gl/xipe Cutest car ever!\t109017669432377344\t1.0',
 "'Cheap Eats in SLP' - http://t.co/4w8gRp7\t109642968603963392\t1.0"]

In [ ]: