In [1]:

    
# Create the entry points to Spark
from pyspark import SparkContext
from pyspark.sql import SparkSession

try:
    sc.stop()  # stop a context left over from an earlier run of this cell
except NameError:
    pass  # no context has been created yet
sc = SparkContext()
spark = SparkSession(sparkContext=sc)

Column expression

A Spark column instance is NOT a column of values from the DataFrame: when you create a column instance, it does not give you the actual values of that column in the DataFrame. I find it makes more sense to think of a column instance as a column of expressions. These expressions are evaluated by other methods (e.g., select(), groupBy(), and orderBy() from pyspark.sql.DataFrame).
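To make the "column of expressions" idea concrete, here is a toy analogy in plain Python. It is only a sketch: ToyColumn, toy_col, and toy_select are hypothetical names invented for this illustration, and this is not how pyspark.sql.Column is implemented. The point is that arithmetic on the object only records an expression, and values appear only when a separate function evaluates it against rows.

```python
# Toy analogy only: NOT the real pyspark.sql.Column implementation.
# ToyColumn, toy_col and toy_select are hypothetical names for this sketch.
class ToyColumn:
    def __init__(self, name, func):
        self.name = name  # text form of the expression, e.g. "(mpg * 100)"
        self.func = func  # computes the expression for one row (a dict)

    def __add__(self, other):
        # Build a new expression; nothing is computed yet.
        return ToyColumn(f"({self.name} + {other})",
                         lambda row: self.func(row) + other)

    def __mul__(self, other):
        return ToyColumn(f"({self.name} * {other})",
                         lambda row: self.func(row) * other)

    def __repr__(self):
        return f"ToyColumn<{self.name}>"


def toy_col(name):
    """Refer to a field by name without reading any data."""
    return ToyColumn(name, lambda row: row[name])


def toy_select(rows, column):
    """Evaluate the stored expression against every row."""
    return [column.func(row) for row in rows]


rows = [{"mpg": 21.0}, {"mpg": 22.8}, {"mpg": 18.7}]
expr = toy_col("mpg") * 100

print(repr(expr))              # only a description of the expression
print(toy_select(rows, expr))  # values exist only after evaluation
```

Printing expr shows a description of the expression, much like the Column<...> outputs in the cells below, while toy_select plays the role that select() plays for a real DataFrame.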

Example data



In [3]:

    
mtcars = spark.read.csv('../../../data/mtcars.csv', inferSchema=True, header=True)
mtcars = mtcars.withColumnRenamed('_c0', 'model')
mtcars.show(5)









    



+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|            model| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|       Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|
|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows

Use dot (.) notation to select a column from a DataFrame



In [7]:

    
mpg_col = mtcars.mpg
mpg_col









    Out[7]:





Column<b'mpg'>

Modify a column to generate a new column



In [8]:

    
mpg_col + 1









    Out[8]:





Column<b'(mpg + 1)'>



In [11]:

    
mtcars.select(mpg_col * 100).show(5)









    



+-----------+
|(mpg * 100)|
+-----------+
|     2100.0|
|     2100.0|
|     2280.0|
|     2140.0|
|     1870.0|
+-----------+
only showing top 5 rows

The pyspark.sql.Column class has many methods that act on a column and return a new column instance.



In [12]:

    
mtcars.select(mtcars.gear.isin([2,3])).show(5)









    



+----------------+
|(gear IN (2, 3))|
+----------------+
|           false|
|           false|
|           false|
|            true|
|            true|
+----------------+
only showing top 5 rows



In [17]:

    
mtcars.mpg.asc()









    Out[17]:





Column<b'mpg ASC NULLS FIRST'>


