In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [2]:
sc = SparkContext(conf=SparkConf())
spark = SparkSession(sparkContext=sc)

In [3]:
from pyspark.ml.linalg import Vector, DenseVector, SparseVector

Dense vector and sparse vector

A vector can be represented in either dense or sparse format. A dense vector is a regular vector that stores every element explicitly, including zeros. A sparse vector uses three components to represent the same vector, which takes less memory when most elements are zero.


In [22]:
dv = DenseVector([1.0,0.,0.,0.,4.5,0])
dv


Out[22]:
DenseVector([1.0, 0.0, 0.0, 0.0, 4.5, 0.0])

Three components of a sparse vector

  • vector size
  • indices of active elements
  • values of active elements

In the above dense vector:

  • vector size = 6
  • indices of active elements = [0, 4]
  • values of active elements = [1.0, 4.5]

We can use the SparseVector() constructor to create a sparse vector. The first argument is the vector size; the second argument is a dictionary whose keys are the indices of the active elements and whose values are the values of those elements.


In [23]:
sv = SparseVector(6, {0:1.0, 4:4.5})
sv


Out[23]:
SparseVector(6, {0: 1.0, 4: 4.5})

Convert sparse vector to dense vector


In [30]:
DenseVector(sv.toArray())


Out[30]:
DenseVector([1.0, 0.0, 0.0, 0.0, 4.5, 0.0])

Convert dense vector to sparse vector


In [33]:
active_elements_dict = {index: value for index, value in enumerate(dv) if value != 0}
active_elements_dict


Out[33]:
{0: 1.0, 4: 4.5}

In [34]:
SparseVector(len(dv), active_elements_dict)


Out[34]:
SparseVector(6, {0: 1.0, 4: 4.5})
