Introduction to Structured Streaming

© 2015 IBM Corporation
! Agenda
- Spark Streaming 1.X
•  Features
•  Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming
•  Addressing the Improvement Areas
•  API
•  Fault Tolerance
•  Event Time
•  Managing Streaming queries
- Structured Streaming Examples
https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts

Spark Streaming 1.X
! Features of Spark Streaming
-  High Level API (stateful, joins, aggregates, windows etc.)
•  Overlap with RDD API (batch)
-  Fault-Tolerant (exactly-once semantics achievable)
-  Back Pressure
-  Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)
Apache Hadoop Day 2015

Spark Streaming 1.X – Areas of improvement
! Fault-tolerance
For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink.
Can that be handled in a very simple way for the end user?

Fault-Tolerant Semantics
Stages: Source, Receiver, Transforming, Outputting, Sink
-  Exactly once, if outputs are idempotent or transactional
-  Exactly once, as long as received data is not lost
-  Exactly once needs re-playable sources (e.g. Kafka Direct)

Spark Streaming 1.X – Areas of improvement
! Fault-tolerance
-  For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink
! API
-  Request for a more seamless API between Batch & Stream
-  Reduce complexities of streaming app *
! No Event Time support
-  Hard to support when processing time / batch time is exposed externally
! Streaming Query Management
! Micro-batch

Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits
- Extend the primary Batch API even to Streaming
- Gain an Optimizer and all other enhancements done in SparkSQL.
! Challenge
- Remove streaming complexities, or keep them to a minimum

Let's Dive In

SQL Batch vs SQL Streaming- Conceptually

Batch vs Streaming - Programmatically

Output Modes - Sink
! Defined as what gets written from the Result table to external storage (Sink)
! Output modes
-  Complete – The entire updated Result table is written to external storage.
-  Append – Only the new rows added to the Result table since the last incremental query execution are written to external storage.
-  Update – Only the rows updated in the Result table since the last incremental query execution are written to external storage.
It is up to the storage connector's implementation to decide how to write.
* Aggregate queries support only Complete mode, and non-aggregate queries only Append mode
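
To make the three output modes concrete, here is a minimal plain-Python sketch (not Spark code) that models the Result table as a dict and shows what each mode would hand to the Sink after one incremental execution. The function name `rows_for_sink`, the dict-based table, and the choice to treat newly added rows as "updated" in Update mode are all assumptions of this sketch.

```python
def rows_for_sink(previous, current, mode):
    """Return the rows a sink would receive for one incremental execution.

    previous / current: the Result table before and after the execution,
    modelled as {key: value} dicts (an assumption of this sketch).
    """
    if mode == "complete":
        # The entire updated Result table.
        return dict(current)
    if mode == "append":
        # Only rows whose key did not exist before this execution.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # Rows that are new or whose value changed (assumption: new rows
        # count as updated).
        return {k: v for k, v in current.items()
                if k not in previous or previous[k] != v}
    raise ValueError(f"unknown output mode: {mode}")


previous = {"spark": 3, "kafka": 1}
current = {"spark": 5, "kafka": 1, "beam": 2}

complete_rows = rows_for_sink(previous, current, "complete")
append_rows = rows_for_sink(previous, current, "append")
update_rows = rows_for_sink(previous, current, "update")
```

With these inputs, Complete hands over all three rows, Append only the new `beam` row, and Update the changed `spark` row plus the new `beam` row.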

Supported Sinks & Modes in 2.0
[Table of supported sinks and output modes; footnote: * debug only]

Windowing in Structured Streaming

Window operations
!  Continuous time-based aggregations are the most common in streaming applications.
-  Sliding window & Tumbling window
E.g. top x hashtags on Twitter in the last half hour, every 5 minutes
! New function that treats windowing as a regular aggregation
!  Used in a Group By clause
Can be used in Batch as well
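
The idea of windowing as a regular group-by aggregation can be sketched in plain Python (this is a simulation, not the Spark API): each event is assigned to every window that contains its timestamp, and the window then acts as an ordinary grouping key. The helper names `windows_for` and `count_by_window`, the integer minute timestamps, and the sample events are assumptions made for illustration.

```python
from collections import Counter


def windows_for(ts, length, slide):
    """Return (start, end) for every window that contains ts.

    Windows start at multiples of `slide` and cover [start, start + length).
    A tumbling window is the special case slide == length.
    """
    last_start = (ts // slide) * slide
    # Walk backwards over earlier starts while the window still covers ts.
    starts = range(last_start, ts - length, -slide)
    return [(s, s + length) for s in sorted(starts)]


def count_by_window(events, length, slide):
    """Group by (window, tag) and count, like a regular aggregation."""
    counts = Counter()
    for ts, tag in events:
        for window in windows_for(ts, length, slide):
            counts[(window, tag)] += 1
    return counts


# (minute, hashtag) pairs; hypothetical sample data.
events = [(1, "#spark"), (4, "#spark"), (7, "#kafka"), (12, "#spark")]

tumbling = count_by_window(events, length=10, slide=10)
sliding = count_by_window(events, length=10, slide=5)
```

For the half-hour-every-5-minutes example above, `windows_for(32, 30, 5)` places an event at minute 32 into six overlapping windows, from (5, 35) to (30, 60).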

Event Time Windows
! Event-Time is the time embedded within the data itself
It is not the time Spark received the data
! What about processing-time windows, if you want them?

Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, late data is put in its correct window group
! Use a normal filter in the SQL?
! Watermarks
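
A rough plain-Python sketch of the two ideas above: a late event still updates the count of the window it belongs to, and a watermark bounds how late an event may be. The watermark rule used here (maximum event time seen so far minus an allowed lateness, checked per event) is a simplifying assumption of this sketch, not a description of Spark's implementation; the function name and sample data are hypothetical.

```python
def aggregate_with_watermark(arrivals, window, allowed_lateness):
    """Count events per tumbling event-time window, dropping very late ones.

    arrivals: event times listed in arrival order (not event-time order).
    """
    counts = {}            # Result table: window start -> count
    dropped = []
    max_event_time = None
    for event_time in arrivals:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        start = (event_time // window) * window
        if start + window <= watermark:
            # The whole window lies behind the watermark: too late, drop it.
            dropped.append(event_time)
            continue
        # Late or not, the event lands in its own event-time window.
        counts[start] = counts.get(start, 0) + 1
    return counts, dropped


arrivals = [1, 3, 12, 4, 25, 8]
counts, dropped = aggregate_with_watermark(arrivals, window=10, allowed_lateness=10)
```

Here the event at time 4 arrives after the event at time 12 but is still counted in window 0. The event at time 8 arrives after time 25 has been seen, so its window is behind the watermark and it is dropped.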

Fault Tolerance
! Why Care?
! Different guarantees for Data Loss
! At-Least Once
! Exactly Once
! What all can fail?
! Driver
! Executor

Spark 1.x Best Fault tolerance - Kafka Direct API
Benefits of this approach
•  Simplified Parallelism
•  Less Storage Need
•  Exactly Once Semantics (source & processing)

Fault Tolerance in Structured Streaming
Active Driver
Checkpoint to HDFS

! Structured Streaming Checkpointing
The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing is started for that trigger
The Nth record in the log indicates data that is currently being processed
The (N-1)th entry in the log indicates offsets already written idempotently to the Sink
Log entries are numbered with monotonically increasing integers
! On Recovery
Restart processing of the Nth entry in the write-ahead log (WAL)
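
The log-then-process protocol described above can be sketched in plain Python. This is an in-memory simulation under stated assumptions, not Spark's checkpoint code: `OffsetLog`, `run_trigger`, and `recover` are hypothetical names, the source is a list indexed by offset, and the sink is a dict keyed by log entry number.

```python
class OffsetLog:
    """In-memory stand-in for the offset log in the checkpoint directory."""

    def __init__(self):
        self.entries = []              # index = entry number, value = (start, end)

    def add(self, offsets):
        self.entries.append(offsets)
        return len(self.entries) - 1   # monotonically increasing entry number

    def latest(self):
        if not self.entries:
            return None
        return len(self.entries) - 1, self.entries[-1]


def run_trigger(log, source, sink, batch_size, crash_before_write=False):
    latest = log.latest()
    start = 0 if latest is None else latest[1][1]
    end = min(start + batch_size, len(source))
    batch_id = log.add((start, end))   # logged *before* any processing
    if crash_before_write:
        raise RuntimeError("simulated driver failure")
    sink[batch_id] = source[start:end]
    return batch_id


def recover(log, source, sink):
    """Re-run the latest logged entry; earlier entries are assumed written."""
    latest = log.latest()
    if latest is None:
        return None
    batch_id, (start, end) = latest
    sink[batch_id] = source[start:end]  # same entry number, same offsets
    return batch_id


source = list("abcdefg")
log, sink = OffsetLog(), {}

run_trigger(log, source, sink, batch_size=3)
try:
    run_trigger(log, source, sink, batch_size=3, crash_before_write=True)
except RuntimeError:
    pass
recovered = recover(log, source, sink)
run_trigger(log, source, sink, batch_size=3)
```

Because the offsets for entry 1 were logged before the simulated crash, recovery replays exactly the same range, and the next trigger continues from where that entry ended.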

Fault Tolerance in Structured Streaming
! End-to-End Exactly Once guarantees with
-  Idempotent Sinks (built in for commonly used sinks, e.g. Files / JDBC)
-  Built-in Sources will *mostly* be only the ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842
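
Why idempotent sinks matter for end-to-end exactly once can be shown with a small plain-Python comparison: when a batch is replayed after a failure, an append-style sink duplicates its rows, while a sink keyed by batch id simply overwrites them. Both writer functions and the batch id are hypothetical, and keying by batch id is only one possible way to get idempotence.

```python
def write_append(sink, batch_id, rows):
    """Non-idempotent: a replayed batch is written a second time."""
    sink.extend(rows)


def write_idempotent(sink, batch_id, rows):
    """Idempotent: output is keyed by batch id, so a replay overwrites."""
    sink[batch_id] = list(rows)


batch_id, rows = 7, ["a", "b"]

append_sink = []
keyed_sink = {}
for _ in range(2):                  # the second pass simulates a replay
    write_append(append_sink, batch_id, rows)
    write_idempotent(keyed_sink, batch_id, rows)
```

After the replay, `append_sink` holds the rows twice, while `keyed_sink` still holds exactly one copy of batch 7.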

Managing Streaming Queries
!  Streaming in 1.x was definitely lacking in
-  Starting / Stopping individual Streaming Queries
-  Changing the computation done in a Query.
-  Handling an abnormally terminated Streaming Query more gracefully than an app crash.

Summary
!  Overall has a good set of features
-  Easier code sharing between Batch and Streaming (No different type hierarchies)
-  Window not tied to Batch interval
-  No Streaming context
-  Optimizer now available for your queries.
!  Getting started
-  Combining 3 things (Output Mode & Sink Type & Query type) takes some time to wrap your head around *
And there is not much control over those.
-  You only get runtime exceptions when you get the combination wrong
!  How does it compare to Apache Beam?

For Each Sink

Thank You
