Más contenido relacionado
La actualidad más candente (20)
Similar a Introduction to Structured Streaming (20)
Introduction to Structured Streaming
- 1. © 2015 IBM Corporation1
! Agenda
- Spark Streaming 1.X
• Features
• Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming
• Addressing the Improvement Areas
• API
• Fault Tolerance
• Event Time
• Managing Streaming queries
- Structured Streaming Examples
https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts
- 2. © 2015 IBM Corporation2
Spark Streaming 1. X
! Features of Spark Streaming
- High Level API (stateful, joins, aggregates, windows etc.)
• Overlap with RDD API (batch)
- Fault – Tolerant (exactly once semantics achievable)
- Back Pressure
- Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)
!
Apache
Hadoop
Day
2015
- 3. © 2015 IBM Corporation3
Spark Streaming 1. X – Areas of improvement
! Fault-tolerance
For end-2-end exactly once guarantees, user needs to do all the heavy lifting in
the Sink
Can that be handled in a very simple way for the end-user ?
Apache
Hadoop
Day
2015
- 4. © 2015 IBM Corporation4
Fault-Tolerant Semantics
Exactly
Once,
If
Outputs
are
Idempotent
or
transac6onal
Exactly
Once,
as
long
as
received
data
is
not
lost
Exactly
Once
needs
re-‐playable
sources
(e.g.
Ka?a
Direct)
Source
Receiver
Transforming
Outputting
Sink
- 5. © 2015 IBM Corporation5
Spark Streaming 1. X – Areas of improvement
! Fault-tolerance
- For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink
! API
- Request for more seamless API between Batch & Stream
- Reduce complexities of streaming app *
! No Event Time support
- Hard to support when processing time/batch time exposed in externals
! Streaming Query Management
! Micro-batch
!
Apache
Hadoop
Day
2015
- 6. © 2015 IBM Corporation6
Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits
- Extend the primary Batch API even to Streaming
- Gain an Optimizer and all other enhancements done in SparkSQL.
! Challenge
- Remove/Keep streaming complexities to minimum
!
- 8. © 2015 IBM Corporation8
SQL Batch vs SQL Streaming- Conceptually
- 9. © 2015 IBM Corporation9
Batch vs Streaming - Programmatically
- 10. © 2015 IBM Corporation10
Output Modes - Sink
! Defined as what gets written from the Result table to external storage (Sink)
! Output modes
- Complete – Entire updated Result table is written to external storage.
- Append – Only new rows added in the Result table since last incremental query execution is
written to external storage.
- Update - Only the rows updated in the Result table since last incremental query execution is
written to external storage.
Upto implementation of Storage connector to decide how to write.
* Aggregate queries only support complete mode and non-aggregate queries append mode
- 11. © 2015 IBM Corporation11
Supported Sinks & Modes in 2.0
*DEBUG
ONLY
*DEBUG
ONLY
- 12. © 2015 IBM Corporation12
Windowing in Structured Streaming
- 13. © 2015 IBM Corporation13
Window operations
! Continuous time based aggregations are most common in Streaming applications.
- Sliding window & Tumbling window
E.g. Top x hashtags on Twitter in last half hour, every 5 minutes
! New function that treats windowing as a regular aggregation
! Used in a Group By clause
Can be used in Batch as well
- 14. © 2015 IBM Corporation14
Event Time Windows
! Event-Time is time embedded within the data itself
It is not the time Spark received the data
! What about processing time windows if you want them
- 15. © 2015 IBM Corporation15
Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, the late data is put in its correct
window group
! Use a normal filter in the SQL ?
! Watermarks
- 16. © 2015 IBM Corporation16
Fault Tolerance
! Why Care?
! Different guarantees for Data Loss
! Atleast Once
! Exactly Once
! What all can fail?
! Driver
! Executor
- 17. © 2015 IBM Corporation17
Spark 1.x Best Fault tolerance - Kafka Direct API
• Simplified Parallelism
• Less Storage Need
• Exactly Once Semantics.
source & processing
Benefits
of
this
approach
- 18. © 2015 IBM Corporation18
Fault Tolerance in Structured Streaming
Active
Driver
Checkpoint
to
HDFS
! Structured Streaming Checkpointing
Decided Offsets ranges for a trigger interval is logged to checkpoint Directory *before* any
processing is started for that trigger
Nth record in log indicates data that is currently being processed
N-1 entry in log indicates offsets idempotent written to Sink
Log entries are monotonically increasing integers
! On Recovery
Restart processing of nth entry in WAL
- 19. © 2015 IBM Corporation19
Fault Tolerance in Structured Streaming
! End-to-End Exactly Once guarantees with
- idempotent Sinks (built-in for commonly used sinks e.g. Files / JDBC)
- Built-in Sources will *mostly* be only ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842
- 20. © 2015 IBM Corporation20
Managing Streaming Queries
! Streaming in 1.x was definetly lacking in
- Starting / Stopping individual Streaming Queries
- Changing the computation done in a Query.
- When a Streaming Query abnormally terminates handle more gracefully than app crash.
- 21. © 2015 IBM Corporation21
Managing Streaming Queries
- 22. © 2015 IBM Corporation22
Managing Streaming Queries
- 23. © 2015 IBM Corporation23
Summary
! Overall has a good set of features
- Easier code share between Batch and Streaming (No different type hierarchies)
- Window not tied to Batch interval
- No Streaming context
- Optimizer now available for your queries.
! Getting started
- Combining of 3 things (Output Mode & Sink Type & Query type) needs some time to wrap your head around *
And not much control over those.
- Only get Runtime exceptions when you mess with above
! How does it compare to Apache Beam ?