In [1]:
# create entry points to Spark
try:
    sc.stop()
except Exception:
    pass
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sparkContext=sc)
A Spark column instance is NOT a column of values from the DataFrame: when you create a column instance, it does not give you the actual values of that column. It makes more sense to think of a column instance as a column of expressions. These expressions are evaluated by other methods (e.g., select(), groupBy(), and orderBy() from pyspark.sql.DataFrame).
In [3]:
mtcars = spark.read.csv('../../../data/mtcars.csv', inferSchema=True, header=True)
mtcars = mtcars.withColumnRenamed('_c0', 'model')
mtcars.show(5)
In [7]:
mpg_col = mtcars.mpg
mpg_col
Out[7]:
In [8]:
mpg_col + 1
Out[8]:
In [11]:
mtcars.select(mpg_col * 100).show(5)
The pyspark.sql.Column class has many methods that act on a column and return a new column instance.
In [12]:
mtcars.select(mtcars.gear.isin([2,3])).show(5)
In [17]:
mtcars.mpg.asc()
Out[17]:
In [ ]: