This sample notebook is written in Scala and expects the Scala 2.10 runtime. Make sure the kernel is started and connected before executing this notebook.
The data source for this example can be found at: http://examples.cloudant.com/crimes/
Replicate the database into your own Cloudant account before you execute this script.
In [1]:
sc.version
Out[1]:
In [2]:
val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
A DataFrame object can be created directly from a Cloudant database. To configure the database as a source, pass these options:
1 - the package name that provides the classes (like CloudantDataSource) implemented in the connector to extend BaseRelation. For the Cloudant Spark connector this is com.cloudant.spark
2 - cloudant.host parameter to pass the Cloudant account host name
3 - cloudant.username parameter to pass the Cloudant user name
4 - cloudant.password parameter to pass the Cloudant account password
In [3]:
val df = sqlCtx.read.format("com.cloudant.spark").
  option("cloudant.host", "examples.cloudant.com").
  option("cloudant.username", "examples").
  option("cloudant.password", "xxxx").
  load("crimes")
At this point all transformations and functions should behave as specified in the Spark SQL documentation (http://spark.apache.org/sql/).
There are, however, a number of features the Cloudant Spark connector does not yet support. For that reason we call this connector a BETA release and are gradually improving it towards GA.
Please direct any change requests to support@cloudant.com
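For example, the DataFrame can be registered as a temporary table and queried with standard SQL. A minimal sketch, assuming the df created above; the table name "crimes" is an arbitrary choice for this illustration:

```scala
// Register the DataFrame as a temporary table so it can be queried with SQL.
df.registerTempTable("crimes")

// Run a standard Spark SQL query against the registered table.
val disturbances = sqlCtx.sql(
  "SELECT properties.naturecode FROM crimes WHERE properties.naturecode LIKE 'DISTRB%'")
disturbances.show()
```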
In [4]:
df.printSchema()
In [5]:
df.count()
Out[5]:
In [6]:
df.select("properties.naturecode").show()
In [8]:
df.filter(df.col("properties.naturecode").startsWith("DISTRB")).show()
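Beyond filtering, the usual DataFrame aggregations should also apply. A hedged sketch, assuming the df loaded above, that counts incidents per nature code:

```scala
// Group by the nested naturecode field and count occurrences per code,
// showing the most frequent codes first.
df.groupBy("properties.naturecode")
  .count()
  .orderBy(org.apache.spark.sql.functions.desc("count"))
  .show()
```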