In [2]:
from pyspark import SparkContext
sc = SparkContext(master='local')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
In [3]:
mtcars = spark.read.csv(path='data/mtcars.csv',
                        sep=',',
                        encoding='UTF-8',
                        comment=None,
                        header=True,
                        inferSchema=True)
mtcars.show(n=5, truncate=False)
In [4]:
from pyspark.sql import Row
rdd = sc.parallelize([
    Row(x=[1, 2, 3], y=['a', 'b', 'c']),
    Row(x=[4, 5, 6], y=['e', 'f', 'g'])
])
rdd.collect()
Out[4]:
In [5]:
df = spark.createDataFrame(rdd)
df.show()
In [8]:
import pandas as pd
pdf = pd.DataFrame({
    'x': [[1, 2, 3], [4, 5, 6]],
    'y': [['a', 'b', 'c'], ['e', 'f', 'g']]
})
pdf
Out[8]:
In [9]:
df = spark.createDataFrame(pdf)
df.show()
In [16]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['letter', 'number'])
df.show()
In [17]:
df.dtypes
Out[17]:
In [18]:
my_list = [['a', 1], ['b', 2]]
df = spark.createDataFrame(my_list, ['my_column'])
df.show()
In [19]:
df.dtypes
Out[19]:
The following code generates a DataFrame consisting of two columns, each of which is an array (vector) column.
Why are array columns generated in this case? Here the list my_list has only one element, a tuple, so the DataFrame has only one row. That tuple has two elements, so it yields a two-column DataFrame. Each element in the tuple is a list, so the resulting columns are array columns.
In [29]:
my_list = [(['a', 1], ['b', 2])]
df = spark.createDataFrame(my_list, ['x', 'y'])
df.show()
In [ ]:
Column instances can be created in two ways:
df.colName
df.colName + 1
Technically, there is only one way to create a column instance: df.colName. Column expressions, such as df.colName + 1, start from a column instance.
Remember how to create column instances, because this is usually the starting point when we want to operate on DataFrame columns.
The Column class comes with some methods that operate on a column instance. In addition, almost all functions from the pyspark.sql.functions
module take one or more column instances as argument(s). These functions are important tools for data manipulation.
corr(col1, col2)
: two column names.
cov(col1, col2)
: two column names.
crosstab(col1, col2)
: two column names.
describe(*cols)
: `cols` refers to column names (strings) only.
cube(*cols)
: column names (string) or column expressions or both.
drop(*cols)
: a list of column names OR a single column expression.
groupBy(*cols)
: column name (string) or column expression or both.
rollup(*cols)
: column name (string) or column expression or both.
select(*cols)
: column name (string) or column expression or both.
sort(*cols, **kwargs)
: column name (string) or column expression or both.
sortWithinPartitions(*cols, **kwargs)
: column name (string) or column expression or both.
orderBy(*cols, **kwargs)
: column name (string) or column expression or both.
sampleBy(col, fractions, seed=None)
: a column name.
toDF(*cols)
: a list of column names (string).
withColumn(colName, col)
: colName refers to a column name; col refers to a column expression.
withColumnRenamed(existing, new)
: takes column names as arguments.
filter(condition)
: condition refers to a column expression that returns types.BooleanType values.