2. Agenda
● What is Structured Streaming?
● How is it different from the previous streaming engine?
● Structured Streaming Programming Model
● Basic Operations: Selection, Projection, and Aggregation
● Window Operations on Event Time
● Example Demo
4. What is Structured Streaming
● A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
● It was introduced as part of the Spark 2.0 release.
● A unified API that lets stream computation be combined with batch processing.
● Computations can be expressed as SQL-like queries over Datasets/DataFrames, and the same queries work on streaming DataFrames (a sketch follows below).
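A minimal sketch of that last point (the view name "updates" and the DataFrame name are illustrative assumptions): a streaming DataFrame can be registered as a temporary view and queried with ordinary SQL, which yields another streaming DataFrame.
// Illustrative sketch: query a streaming DataFrame with plain SQL
streamingDf.createOrReplaceTempView("updates")
val counts = spark.sql("SELECT count(*) FROM updates") // also a streaming DataFrame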
6. What is new in Structured Streaming
● The entry point of a streaming app is the SparkSession, instead of the earlier StreamingContext (see the sketch after this list).
● Unlike DStreams, a stream is modeled as an unbounded DataFrame.
● It is interoperable with DStreams.
● It can harness the power of the Catalyst optimizer to improve query performance without changing the query's semantics.
● A unified API makes the developer's task easier: no one has to reason about how a streaming computation differs from a normal batch (map-reduce style) computation.
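As a sketch of the first point (the app name is an illustrative assumption), the SparkSession replaces the separate StreamingContext that the DStream API required:
import org.apache.spark.sql.SparkSession

// Structured Streaming: one entry point for both batch and streaming
val spark = SparkSession.builder
  .appName("StructuredStreamingApp") // assumed name
  .getOrCreate()

// The DStream API (for comparison) needed its own context:
// import org.apache.spark.streaming.{Seconds, StreamingContext}
// val ssc = new StreamingContext(conf, Seconds(1))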
7. Structured Streaming Programming Model
● It treats a live data stream as an unbounded table (see the sketch after this list).
● Any streaming computation can be expressed as a batch-like query, as if it ran on a static table.
● Spark runs this computation internally as an incremental query.
● The result of the computation depends on the output mode specified in the streaming query.
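A small sketch of the "unbounded table" idea (the socket source settings here are illustrative): a streaming DataFrame supports the same DataFrame API as a static one, and isStreaming tells the two apart at runtime.
// Illustrative sketch: a streaming DataFrame looks like any other DataFrame
val streamingDf = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9000)
  .load()

streamingDf.isStreaming // true: queries on it run as incremental queries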
10. Structured Streaming Programming Model: Output Modes
There are three output modes that decide what result output goes into the sink (a code sketch follows the mode descriptions below):
● Complete Mode
● Update Mode
● Append Mode
11. Structured Streaming Programming Model: Output Modes
● Complete Mode: The entire updated Result Table is written to the sink. It is up to the storage connector to decide how to handle writing the entire table. It is specified with outputMode("complete") when starting the streaming query.
12. Structured Streaming Programming Model: Output Modes
● Append Mode (default): Only the new rows appended to the Result Table since the last trigger are written to the external storage.
● This is applicable only to queries where existing rows in the Result Table are not expected to change.
13. Structured Streaming Programming Model: Output Modes
● Update Mode: Only the rows that were updated in the Result Table since the last trigger are written to the external storage. Note that this differs from Complete Mode in that unchanged rows are not output.
Note: This mode was not yet implemented as of Spark 2.0.
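A minimal sketch of how the mode is chosen (wordCounts and lines refer to the word-count example on the next slides; the console sink is an assumption for illustration):
// Complete mode: re-emit the whole Result Table on every trigger.
// Suits running aggregations such as a word count.
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append mode: emit only rows that will never change again.
// Suits non-aggregated queries such as a plain selection.
lines.writeStream
  .outputMode("append")
  .format("console")
  .start()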
14. Example: Running Word Count
import spark.implicits._ // required for the .as[String] conversion below

// Create DataFrame representing the stream of input lines from connection to localhost:9000
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9000)
.load()
// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate running word count
val wordCounts = words.groupBy("value").count()
15. Example: Running Word Count
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
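To try this end to end, the socket source can be fed with a tool such as netcat: run nc -lk 9000 in one terminal, start the query, and type words into the netcat session; each trigger then prints the updated running counts to the console.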
17. Basic Operations - Selection, Projection, Aggregation
case class DeviceData(device: String, deviceType: String, signal: Double, time: String)

val df: DataFrame = ... // streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

// Select the devices which have a signal greater than 10
df.select("device").where("signal > 10") // using untyped APIs
ds.filter(_.signal > 10).map(_.device)   // using typed APIs
18. Basic Operations - Selection, Projection, Aggregation
// Running count of the number of updates for each device type
df.groupBy("deviceType").count() // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.deviceType).agg(typed.avg(_.signal)) // using typed API
20. Window Operations on Event Time
import spark.implicits._
import org.apache.spark.sql.functions.window // required for the window() grouping expression

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word"
).count()
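A minimal sketch of running this windowed query (the console sink, complete mode, and truncate option are illustrative assumptions): each trigger prints the running count per 10-minute window, sliding every 5 minutes, for every word seen so far.
// Illustrative sketch: print the per-window word counts to the console
val query = windowedCounts.writeStream
  .outputMode("complete") // windowed counts are a running aggregation
  .format("console")
  .option("truncate", "false") // show the full window intervals
  .start()

query.awaitTermination()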