Spark Streaming: http://spark.apache.org/streaming/
Documentation URL: http://spark.apache.org/docs/latest/streaming-programming-guide.html
Python reference: http://spark.apache.org/docs/latest/api/python/index.html
Scala reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
Scala Example:
In [ ]:
import org.apache.spark._
import org.apache.spark.streaming._

// Create a local StreamingContext with as many worker threads as there are cores,
// and a batch interval of 1 second
val conf = new SparkConf().setMaster("local[*]").setAppName("Example")
val ssc = new StreamingContext(conf, Seconds(1))
Python Example:
In [1]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[*]", "Example")  # Create a SparkContext; it is passed to the StreamingContext
ssc = StreamingContext(sc, batchDuration=1)  # batchDuration is specified in seconds
What you need next is a DStream object, which represents a continuous sequence of RDDs.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
Input Sources: The following examples use a TCP socket as the input source. Input sources can be grouped as:
Basic sources: sockets, file systems http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
Advanced sources: Kafka, Flume, etc. http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
Scala Example:
In [ ]:
// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
Python Example:
In [ ]:
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
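Transformations on the new DStream are declared before the context is started. As a minimal sketch (not part of the original notebook), a per-batch word count over the lines stream above could look like this, assuming a netcat server (nc -lk 9999) is feeding the socket:
In [ ]:
# Split each line into words, pair each word with 1, and sum the counts within each batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print the first ten elements of every batch's RDD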
Note:
A Receiver (see the Scala and Java documentation) is responsible for receiving data from an input source and storing it for Spark Streaming to process. This example uses a single data source. When you use multiple data sources, the number of cores allocated to the workers, n, should be greater than the number of input sources (receivers), so that cores remain available for processing the received data.
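As a hedged sketch of using more than one receiver, the second socket on port 9998 below is an assumption added for illustration; each socketTextStream call creates its own receiver and occupies one core.
In [ ]:
# Hypothetical second source on port 9998; each receiver occupies one core,
# so the master must offer more cores than there are receivers (e.g. local[4])
stream1 = ssc.socketTextStream("localhost", 9999)
stream2 = ssc.socketTextStream("localhost", 9998)
merged = stream1.union(stream2)  # combine both sources into a single DStream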
Start the stream: As with RDD "actions", the real computation starts only after the start command is issued. Only one StreamingContext can be active in a JVM at a time, but a single context can consume multiple input streams; remember to allocate enough workers to process them.
ssc.start()
Note: Once the streaming context is started, no new streaming computations can be set up or added to it. Stopping the context is similar to stopping a SparkContext; you can use
ssc.stop()
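A minimal sketch of the full lifecycle, assuming the context and DStream transformations above are already defined; the 60-second timeout is an arbitrary value chosen for illustration.
In [ ]:
ssc.start()                        # start receiving and processing data
ssc.awaitTermination(timeout=60)   # block for up to 60 seconds, or until the context is stopped
ssc.stop(stopSparkContext=False)   # stop streaming but keep the underlying SparkContext alive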
All operations on a DStream are applied to each batch in parallel, so to accumulate a result across batches, or update it periodically, use the following:
updateStateByKey
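A minimal sketch of a stateful word count with updateStateByKey, assuming the lines DStream created earlier; the checkpoint directory name is an arbitrary choice, and checkpointing must be enabled for updateStateByKey to work.
In [ ]:
ssc.checkpoint("checkpoint")  # updateStateByKey requires a checkpoint directory

def update_count(new_values, running_count):
    # new_values: counts that arrived in this batch; running_count: previous total (or None)
    return sum(new_values) + (running_count or 0)

running_counts = (lines.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .updateStateByKey(update_count))
running_counts.pprint()  # the running totals are updated and printed every batch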
RDD-to-RDD transformations can still be applied to a DStream through the transform operation.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
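As a hedged sketch of an RDD-to-RDD transformation, the transform operation below applies subtractByKey against a blacklist RDD to drop matching words from each batch; the blacklist itself is a made-up example, not something from the original text.
In [ ]:
# Hypothetical blacklist used only for illustration
blacklist = sc.parallelize([("spam", True), ("junk", True)])

filtered = (lines.flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .transform(lambda rdd: rdd.subtractByKey(blacklist)))  # per-batch RDD-to-RDD operation
filtered.pprint()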
http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
Windowed operations are performed over the discrete batches contained in a DStream. A window groups a number of consecutive batches together: without windowing, each batch is processed on its own; with a window operation, the data is grouped into windows that may overlap one another, yet each window still produces a single discrete result.
Two properties to specify:
window length: the duration of the window
sliding interval: the interval at which the window operation is performed
Both properties are required arguments for any window operation and must be multiples of the batch interval.
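A minimal sketch of a windowed word count, assuming the 1-second batch interval from the context above, a window length of 30 seconds, and a sliding interval of 10 seconds (both multiples of the batch interval):
In [ ]:
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
# Count words over the last 30 seconds of data, recomputed every 10 seconds
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                             None,    # no inverse reduce function
                                             30, 10)  # windowDuration, slideDuration (seconds)
windowed_counts.pprint()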
A Databricks example,
Steps to deploy a Spark Streaming application:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications
Window and Sliding window
https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/