In [1]:
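# Create the entry points to Spark
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Stop any SparkContext left over from a previous run of this cell
try:
    sc.stop()
except Exception:
    # Nothing to stop (e.g. `sc` is not defined on the first run)
    pass

sc = SparkContext()
spark = SparkSession(sparkContext=sc)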

Column expression

A Spark column instance is NOT a column of values from the DataFrame: when you create a column instance, it does not give you the actual values of that column in the DataFrame. I find it makes more sense to think of a column instance as a column of expressions. These expressions are evaluated by other methods (e.g., select(), groupBy(), and orderBy() from pyspark.sql.DataFrame).
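To make the "column of expressions" idea concrete, here is a toy model in plain Python. It is only a sketch of the concept, not how Spark implements pyspark.sql.Column: the names Expr, col, and select below are hypothetical stand-ins defined here for illustration.

```python
# Toy model only: this is NOT Spark's implementation of Column.
# Expr, col, and select are hypothetical names made up for this sketch.
class Expr:
    """A deferred expression: it stores a description and a function, not values."""

    def __init__(self, text, fn):
        self.text = text  # human-readable form, e.g. "(mpg + 1)"
        self.fn = fn      # maps one row (a dict) to a value

    def __add__(self, other):
        # Build a new expression; nothing is computed here.
        return Expr(f"({self.text} + {other})", lambda row: self.fn(row) + other)

    def __mul__(self, other):
        return Expr(f"({self.text} * {other})", lambda row: self.fn(row) * other)

    def __repr__(self):
        return f"Expr<{self.text}>"


def col(name):
    """Build an expression that looks up `name` in a row when it is evaluated."""
    return Expr(name, lambda row: row[name])


def select(rows, expr):
    """Evaluate the expression against every row; only now are values produced."""
    return [expr.fn(row) for row in rows]


rows = [{"mpg": 21.0}, {"mpg": 22.8}, {"mpg": 18.7}]
expr = col("mpg") + 1      # builds an expression; no row is touched yet
print(expr)                # Expr<(mpg + 1)>
print(select(rows, expr))  # values appear only when select() evaluates the expression
```

Printing expr shows a description of the computation, much like the Column<...> outputs in the cells below, while the values only appear once select() walks over the rows.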

Example data


In [3]:
mtcars = spark.read.csv('../../../data/mtcars.csv', inferSchema=True, header=True)
mtcars = mtcars.withColumnRenamed('_c0', 'model')
mtcars.show(5)


+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|            model| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|       Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|
|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows

Use dot (.) to select a column from a DataFrame


In [7]:
mpg_col = mtcars.mpg
mpg_col


Out[7]:
Column<b'mpg'>

Modify a column to generate a new column


In [8]:
mpg_col + 1


Out[8]:
Column<b'(mpg + 1)'>

In [11]:
mtcars.select(mpg_col * 100).show(5)


+-----------+
|(mpg * 100)|
+-----------+
|     2100.0|
|     2100.0|
|     2280.0|
|     2140.0|
|     1870.0|
+-----------+
only showing top 5 rows

The pyspark.sql.Column class has many methods that act on a column and return a new column instance.


In [12]:
mtcars.select(mtcars.gear.isin([2,3])).show(5)


+----------------+
|(gear IN (2, 3))|
+----------------+
|           false|
|           false|
|           false|
|            true|
|            true|
+----------------+
only showing top 5 rows


In [17]:
mtcars.mpg.asc()


Out[17]:
Column<b'mpg ASC NULLS FIRST'>
