Spark Streaming: http://spark.apache.org/streaming/
Documentation URL: http://spark.apache.org/docs/latest/streaming-programming-guide.html
Python reference: http://spark.apache.org/docs/latest/api/python/index.html
Scala reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
Scala Example:
In [ ]:
import org.apache.spark._
import org.apache.spark.streaming._

// Create a local StreamingContext with as many worker threads as there are cores,
// and a batch interval of 1 second
val conf = new SparkConf().setMaster("local[*]").setAppName("Example")
val ssc = new StreamingContext(conf, Seconds(1))
Python Example:
In [1]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[*]", "Example")  # Create a SparkContext; it is passed to the StreamingContext
ssc = StreamingContext(sc, batchDuration=1)  # batchDuration is specified in seconds
What you need next is a DStream object, which represents a continuous sequence of RDDs.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
Input Sources: The following examples use a TCP socket as the input source. Input sources can be grouped as:
Basic sources: sockets, file systems http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
Advanced sources: Kafka, Flume, etc. http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
Scala Example:
In [ ]:
// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
Python Example:
In [ ]:
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
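Transformations on the new DStream are declared before the context is started. As a minimal sketch (not part of the original notebook), a per-batch word count over the lines stream above could look like this, assuming a netcat server (nc -lk 9999) is feeding the socket:
In [ ]:
# Split each line into words, pair each word with 1, and sum the counts within each batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the first ten elements of every batch's RDD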
Note:
A Receiver (see the Scala and Java documentation) is responsible for receiving data from an input source and storing it for Spark Streaming to process. This example uses a single data source. When you use multiple data sources, the number of cores allocated to the workers, n, should be greater than the number of input sources (receivers), so that cores remain available for processing the received data.
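As a hedged sketch of using more than one receiver, the second socket on port 9998 below is an assumption added for illustration; each socketTextStream call creates its own receiver and occupies one core.
In [ ]:
# Hypothetical second source on port 9998; each receiver occupies one core,
# so the master must offer more cores than there are receivers (e.g. local[4])
stream1 = ssc.socketTextStream("localhost", 9999)
stream2 = ssc.socketTextStream("localhost", 9998)
merged = stream1.union(stream2)  # combine both sources into a single DStream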
Start the stream: As with RDD "actions", the real computation starts only after the start command is issued. Only one StreamingContext can be active in a JVM at a time, but a single context can consume multiple input streams; remember to allocate enough workers to process them.
ssc.start()
Note: Once the streaming context is started, no new streaming computations can be set up or added to it. Stopping the context is similar to stopping a SparkContext; you can use
ssc.stop()
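A minimal sketch of the full lifecycle, assuming the context and DStream transformations above are already defined; the 60-second timeout is an arbitrary value chosen for illustration.
In [ ]:
ssc.start()                        # start receiving and processing data
ssc.awaitTermination(timeout=60)   # block for up to 60 seconds, or until the context is stopped
ssc.stop(stopSparkContext=False)   # stop streaming but keep the underlying SparkContext alive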
All operations on a DStream are applied to each batch in parallel, so to accumulate a result across batches, or update it periodically, use the following:
updateStateByKey
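A minimal sketch of a stateful word count with updateStateByKey, assuming the lines DStream created earlier; the checkpoint directory name is an arbitrary choice, and checkpointing must be enabled for updateStateByKey to work.
In [ ]:
ssc.checkpoint("checkpoint")  # updateStateByKey requires a checkpoint directory

def update_count(new_values, running_count):
    # new_values: counts that arrived in this batch; running_count: previous total (or None)
    return sum(new_values) + (running_count or 0)

running_counts = (lines.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .updateStateByKey(update_count))
running_counts.pprint()  # the running totals are updated and printed every batch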
RDD-to-RDD transformations can still be applied to a DStream through the transform operation.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
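As a hedged sketch of an RDD-to-RDD transformation, the transform operation below applies subtractByKey against a blacklist RDD to drop matching words from each batch; the blacklist itself is a made-up example, not something from the original text.
In [ ]:
# Hypothetical blacklist used only for illustration
blacklist = sc.parallelize([("spam", True), ("junk", True)])

filtered = (lines.flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .transform(lambda rdd: rdd.subtractByKey(blacklist)))  # per-batch RDD-to-RDD operation
filtered.pprint()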
http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
Windowed operations are performed over the discrete batches contained in a DStream. A window groups a number of consecutive batches together: without windowing, each batch is processed on its own; with a window operation, the data is grouped into windows that may overlap one another, yet each window still produces a single discrete result.
Two properties to specify:
window length: the duration of the window
sliding interval: the interval at which the window operation is performed
Both properties are required arguments for any window operation and must be multiples of the batch interval.
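A minimal sketch of a windowed word count, assuming the 1-second batch interval from the context above, a window length of 30 seconds, and a sliding interval of 10 seconds (both multiples of the batch interval):
In [ ]:
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
# Count words over the last 30 seconds of data, recomputed every 10 seconds
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                             None,    # no inverse reduce function
                                             30, 10)  # windowDuration, slideDuration (seconds)
windowed_counts.pprint()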
A Databricks example,
Steps to deploy a Spark Streaming application:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications
Window and Sliding window
https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/