DataFrame object


Create SparkContext and SparkSession


In [2]:
from pyspark import SparkContext
sc = SparkContext(master='local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
          .appName("Python Spark SQL basic example") \
          .config("spark.some.config.option", "some-value") \
          .getOrCreate()

Create a DataFrame object

Create DataFrame by reading a file


In [3]:
mtcars = spark.read.csv(path='data/mtcars.csv',
                        sep=',',
                        encoding='UTF-8',
                        comment=None,
                        header=True, 
                        inferSchema=True)
mtcars.show(n=5, truncate=False)


+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|_c0              |mpg |cyl|disp |hp |drat|wt   |qsec |vs |am |gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|Mazda RX4        |21.0|6  |160.0|110|3.9 |2.62 |16.46|0  |1  |4   |4   |
|Mazda RX4 Wag    |21.0|6  |160.0|110|3.9 |2.875|17.02|0  |1  |4   |4   |
|Datsun 710       |22.8|4  |108.0|93 |3.85|2.32 |18.61|1  |1  |4   |1   |
|Hornet 4 Drive   |21.4|6  |258.0|110|3.08|3.215|19.44|1  |0  |3   |1   |
|Hornet Sportabout|18.7|8  |360.0|175|3.15|3.44 |17.02|0  |0  |3   |2   |
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows
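
Since inferSchema=True was passed, Spark inferred a type for each column. One quick way to check the result is printSchema:

# inspect the schema Spark inferred from the CSV file
mtcars.printSchema()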

Create DataFrame with createDataFrame function

From an RDD

Elements in the RDD have to be Row objects.


In [4]:
from pyspark.sql import Row
rdd = sc.parallelize([
    Row(x=[1,2,3], y=['a','b','c']),
    Row(x=[4,5,6], y=['e','f','g'])
])
rdd.collect()


Out[4]:
[Row(x=[1, 2, 3], y=['a', 'b', 'c']), Row(x=[4, 5, 6], y=['e', 'f', 'g'])]

In [5]:
df = spark.createDataFrame(rdd)
df.show()


+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[e, f, g]|
+---------+---------+

From pandas DataFrame


In [8]:
import pandas as pd
pdf = pd.DataFrame({
    'x': [[1,2,3], [4,5,6]],
    'y': [['a','b','c'], ['e','f','g']]
})
pdf


Out[8]:
           x          y
0  [1, 2, 3]  [a, b, c]
1  [4, 5, 6]  [e, f, g]

In [9]:
df = spark.createDataFrame(pdf)
df.show()


+---------+---------+
|        x|        y|
+---------+---------+
|[1, 2, 3]|[a, b, c]|
|[4, 5, 6]|[e, f, g]|
+---------+---------+

From a list

Each element in the list becomes a Row in the DataFrame.


In [16]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['letter', 'number'])
df.show()


+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
+------+------+


In [17]:
df.dtypes


Out[17]:
[('letter', 'string'), ('number', 'bigint')]

In [18]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['my_column'])
df.show()


+---------+---+
|my_column| _2|
+---------+---+
|        a|  1|
|        b|  2|
+---------+---+


In [19]:
df.dtypes


Out[19]:
[('my_column', 'string'), ('_2', 'bigint')]
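
Only one column name was supplied for a two-column DataFrame, so Spark kept the given name for the first column and assigned the default name _2 to the second.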

The following code generates a DataFrame consisting of two columns, each of which is an array column.

Why are array columns generated in this case? The list my_list has only one element, a tuple, so the DataFrame has only one row. The tuple has two elements, so the DataFrame has two columns. Each element of the tuple is a list, so each resulting column is an array column.


In [29]:
my_list = [(['a', 1], ['b', 2])]
df = spark.createDataFrame(my_list, ['x', 'y'])
df.show()


+------+------+
|     x|     y|
+------+------+
|[a, 1]|[b, 2]|
+------+------+


Column instance

Column instances can be created in two ways:

  1. directly select a column out of a DataFrame: df.colName
  2. create from a column expression: df.colName + 1

Technically, there is only one way to create a column instance: column expressions always start from an existing column instance.

Remember how to create column instances, because this is usually the starting point when we want to operate on DataFrame columns.

The Column class comes with methods that operate on a column instance. In addition, almost all functions in the pyspark.sql.functions module take one or more column instances as arguments. These functions are important data manipulation tools.
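
As a minimal sketch (reusing the mtcars DataFrame loaded above), here are both ways of creating a Column instance, plus one pyspark.sql.functions call that consumes them:

from pyspark.sql import functions as F

# a Column instance selected directly from the DataFrame
mpg_col = mtcars.mpg            # equivalent: mtcars['mpg']

# a Column instance created from a column expression
mpg_plus_one = mtcars.mpg + 1

# functions from pyspark.sql.functions take Column instances as arguments
mtcars.select(mpg_col, mpg_plus_one, F.round(mtcars.wt, 1)).show(n=3)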

DataFrame column methods

Methods that take column names as arguments (a usage sketch follows this list):

  • corr(col1, col2): two column names.
  • cov(col1, col2): two column names.
  • crosstab(col1, col2): two column names.
  • describe(*cols): `cols` refers to column names (strings) only.
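
A quick sketch of these column-name-only methods, assuming the mtcars DataFrame loaded above:

# corr and cov take exactly two column names and return a float
mtcars.corr('mpg', 'wt')
mtcars.cov('mpg', 'wt')

# crosstab builds a contingency-table DataFrame from two column names
mtcars.crosstab('cyl', 'gear').show()

# describe accepts column names (strings) only
mtcars.describe('mpg', 'wt').show()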

Methods that take column names, column expressions, or both as arguments (see the sketch after this list):

  • cube(*cols): column names (string) or column expressions or both.
  • drop(*cols): a list of column names OR a single column expression.
  • groupBy(*cols): column name (string) or column expression or both.
  • rollup(*cols): column name (string) or column expression or both.
  • select(*cols): column name (string) or column expression or both.
  • sort(*cols, **kwargs): column name (string) or column expression or both.
  • sortWithinPartitions(*cols, **kwargs): column name (string) or column expression or both.
  • orderBy(*cols, **kwargs): column name (string) or column expression or both.
  • sampleBy(col, fractions, seed=None): a column name.
  • toDF(*cols): a list of column names (string).
  • withColumn(colName, col): colName refers to column name; col refers to a column expression.
  • withColumnRenamed(existing, new): takes column names as arguments.
  • filter(condition): condition refers to a column expression that returns BooleanType values.
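
And a sketch of a few of these methods, mixing column names and column expressions, again assuming the mtcars DataFrame from above:

# select accepts a mix of column names (strings) and column expressions
mtcars.select('mpg', mtcars.wt * 1000).show(n=3)

# filter takes a boolean column expression
mtcars.filter(mtcars.cyl == 6).show(n=3)

# withColumn takes a new column name plus a column expression
mtcars.withColumn('wt_lb', mtcars.wt * 1000).show(n=3)

# orderBy accepts column names or column expressions
mtcars.orderBy(mtcars.mpg.desc()).show(n=3)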
