2. Agenda
● What is Structured Streaming?
● How is it different from the previous streaming engine?
● Structured Streaming Programming Model
● Basic Operations: Selection, Projection, and Aggregation
● Window Operations on Event Time
● Example Demo
4. What is Structured Streaming
● A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
● It was introduced as part of the Spark 2.0 release.
● A unified API that lets stream computation be combined with batch processing.
● Computations can be expressed as SQL-like queries over Datasets/DataFrames, and the same queries work on streaming DataFrames (a sketch follows below).
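A minimal sketch of that last point (the view name "updates" and the DataFrame name are illustrative assumptions): a streaming DataFrame can be registered as a temporary view and queried with ordinary SQL, which yields another streaming DataFrame.
// Illustrative sketch: query a streaming DataFrame with plain SQL
streamingDf.createOrReplaceTempView("updates")
val counts = spark.sql("SELECT count(*) FROM updates") // also a streaming DataFrame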
6. What is new in Structured Streaming
● The entry point of a streaming app is the SparkSession, instead of the earlier StreamingContext (see the sketch after this list).
● Unlike DStreams, a stream is modeled as an unbounded DataFrame.
● It is interoperable with DStreams.
● It can harness the power of the Catalyst optimizer to improve query performance without changing the query's semantics.
● A unified API makes the developer's task easier: no one has to reason about how a streaming computation differs from a normal batch (map-reduce style) computation.
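As a sketch of the first point (the app name is an illustrative assumption), the SparkSession replaces the separate StreamingContext that the DStream API required:
import org.apache.spark.sql.SparkSession

// Structured Streaming: one entry point for both batch and streaming
val spark = SparkSession.builder
  .appName("StructuredStreamingApp") // assumed name
  .getOrCreate()

// The DStream API (for comparison) needed its own context:
// import org.apache.spark.streaming.{Seconds, StreamingContext}
// val ssc = new StreamingContext(conf, Seconds(1))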
7. Structured Streaming Programming Model
● It treats a live data stream as an unbounded table (see the sketch after this list).
● Any streaming computation can be expressed as a batch-like query, as if it ran on a static table.
● Spark runs this computation internally as an incremental query.
● The result of the computation depends on the output mode specified in the streaming query.
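A small sketch of the "unbounded table" idea (the socket source settings here are illustrative): a streaming DataFrame supports the same DataFrame API as a static one, and isStreaming tells the two apart at runtime.
// Illustrative sketch: a streaming DataFrame looks like any other DataFrame
val streamingDf = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9000)
  .load()

streamingDf.isStreaming // true: queries on it run as incremental queries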
10. Structured Streaming Programming Model: Output Modes
There are three output modes that decide what result output goes into the sink (a code sketch follows the mode descriptions below):
● Complete Mode
● Update Mode
● Append Mode
11. Structured Streaming Programming Model: Output Modes
● Complete Mode: The entire updated Result Table is written to the sink. It is up to the storage connector to decide how to handle writing the entire table. It is specified with outputMode("complete") when starting the streaming query.
12. Structured Streaming Programming Model: Output Modes
● Append Mode (default): Only the new rows appended to the Result Table since the last trigger are written to the external storage.
● This is applicable only to queries where existing rows in the Result Table are not expected to change.
13. Structured Streaming Programming Model: Output Modes
● Update Mode: Only the rows that were updated in the Result Table since the last trigger are written to the external storage. Note that this differs from Complete Mode in that unchanged rows are not output.
Note: This mode was not yet implemented as of Spark 2.0.
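A minimal sketch of how the mode is chosen (wordCounts and lines refer to the word-count example on the next slides; the console sink is an assumption for illustration):
// Complete mode: re-emit the whole Result Table on every trigger.
// Suits running aggregations such as a word count.
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append mode: emit only rows that will never change again.
// Suits non-aggregated queries such as a plain selection.
lines.writeStream
  .outputMode("append")
  .format("console")
  .start()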
14. Example: Running Word Count
import spark.implicits._ // required for the .as[String] conversion below

// Create DataFrame representing the stream of input lines from connection to localhost:9000
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9000)
.load()
// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate running word count
val wordCounts = words.groupBy("value").count()
15. Example: Running Word Count
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
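To try this end to end, the socket source can be fed with a tool such as netcat: run nc -lk 9000 in one terminal, start the query, and type words into the netcat session; each trigger then prints the updated running counts to the console.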
17. Basic Operations - Selection, Projection, Aggregation
case class DeviceData(device: String, deviceType: String, signal: Double, time: String)

val df: DataFrame = ... // streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

// Select the devices which have a signal greater than 10
df.select("device").where("signal > 10") // using untyped APIs
ds.filter(_.signal > 10).map(_.device)   // using typed APIs
18. Basic Operations - Selection, Projection, Aggregation
// Running count of the number of updates for each device type
df.groupBy("deviceType").count() // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.deviceType).agg(typed.avg(_.signal)) // using typed API
20. Window Operations on Event Time
import spark.implicits._
import org.apache.spark.sql.functions.window // required for the window() grouping expression

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word"
).count()
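A minimal sketch of running this windowed query (the console sink, complete mode, and truncate option are illustrative assumptions): each trigger prints the running count per 10-minute window, sliding every 5 minutes, for every word seen so far.
// Illustrative sketch: print the per-window word counts to the console
val query = windowedCounts.writeStream
  .outputMode("complete") // windowed counts are a running aggregation
  .format("console")
  .option("truncate", "false") // show the full window intervals
  .start()

query.awaitTermination()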