1. Understanding Time in
Structured Streaming
Time and Window API
https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/com/madhukaraphatak/examples/sparktwo/streaming
2. ● Madhukara Phatak
● Team Lead at Tellius and
Part time consultant at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
3. Agenda
● Evolution of Time in Stream Processing
● Introduction to Structured Streaming
● Different Time Abstractions
● Window API
● Emulating Process Time
● Working With Ingestion Time
● Event Time Abstraction
● Watermarks
● Beyond Time Windows
5. Time is King
● Time plays a major role in stream processing
● Latency dictates the kind of operations users want to do
● Window time dictates the state users want to maintain in the stream processor
● Batch time dictates the rate at which users want to process
● Most of the business questions asked in stream processing are also time based
6. View of Time in Stream Processing
● Most early-generation stream processing systems were optimized for latency
● Latency is what differentiated stream processing from batch processing
● Latency informed the window time and batch time
● So many early-generation stream processing systems had only one concept of time
● That is not good enough for new-generation systems
7. Need for different time abstractions
● In a streaming system, there is a
○ Source - the system where events are generated, like sensors
○ Ingestion System - temporary storage like Kafka
○ Processing System - Structured Streaming
● Each of these systems has its own notion of time
● Typically users want to use a different system's time for analysis, rather than depending on the processing system's time
9. Process Time
● Time is tracked using a clock run by the processing engine
● The default abstraction in most stream processing engines, like the DStream API
● "Last 10 seconds" means the records that arrived for processing in the last 10 seconds
● Easy to implement in the framework, but hard for application developers to reason about
10. Event Time
● Event time is the birth time of an event at the source
● It is the time embedded in the data coming into the system
● "Last 10 seconds" means all the records generated in those 10 seconds at the source
● This time is independent of the clock kept by the processing engine
● Hard to implement in the framework, but easy for application developers to reason about
11. Ingestion Time
● Ingestion time is the time when events are ingested into the system
● It falls between event time and processing time
● With processing time, each machine in the cluster assigns its own timestamp to track events
● With ingestion time, the timestamp is assigned at ingestion, so all machines in the cluster have exactly the same view
● Source dependent
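The three abstractions above can be contrasted with a toy example. This is a plain-Python sketch, not Spark code; the record fields and the 10-second window width are illustrative assumptions. The same record falls into different windows depending on which timestamp the query uses:

```python
# Toy illustration (not Spark): one record carries different timestamps
# depending on which system stamped it.
from dataclasses import dataclass

@dataclass
class Record:
    value: str
    event_time: float    # stamped at the source (birth time of the event)
    ingest_time: float   # stamped once, by the ingestion system (e.g. Kafka)

def window_of(ts: float, width: float = 10.0) -> int:
    """Bucket a timestamp into a fixed window of `width` seconds."""
    return int(ts // width)

# An event born at t=4s, buffered until t=12s, finally processed at t=25s.
r = Record("sensor-reading", event_time=4.0, ingest_time=12.0)
processing_time = 25.0

# The same record lands in three different 10-second windows,
# depending on the chosen time abstraction.
assert window_of(r.event_time) == 0
assert window_of(r.ingest_time) == 1
assert window_of(processing_time) == 2
```

Note how ingestion time sits between the other two: it is stamped once, so every machine agrees on it, but it is later than the event's actual birth time.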
13. Structured Streaming
● Structured Streaming is a new streaming API introduced in Spark 2.0
● In Structured Streaming, a stream is modeled as an infinite table, aka an infinite Dataset
● As we are using the structured abstraction, it's called the Structured Streaming API
● All input sources, stream transformations and output sinks are modeled as Datasets
● Stream transformations are represented using SQL and the Dataset DSL
14. Advantage of Stream as infinite table
● Structured data analysis is first class, not layered over an unstructured runtime
● Easy to combine with batch data, as both use the same Dataset abstraction
● Can use the full power of the SQL language to express stateful stream operations
● Benefits from SQL optimisations learnt over decades
● Easy to learn and maintain
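The "stream as infinite table" idea can be sketched in a few lines of plain Python (not Spark): micro-batches append rows to an ever-growing table, and the query result is simply an aggregation over everything seen so far.

```python
# Toy sketch (not Spark): a stream viewed as an infinite, append-only table.
from collections import Counter

table = []      # the "infinite table": rows are only ever appended

def result():   # the standing query: a word count over the whole table
    return Counter(table)

# micro-batch 1 arrives
table.extend(["spark", "stream"])
assert result() == Counter({"spark": 1, "stream": 1})

# micro-batch 2 arrives; the same query now reflects the grown table
table.extend(["spark"])
assert result()["spark"] == 2
```

Spark does not actually recompute over the full table; it updates the result incrementally. But logically the answer is always "the query applied to the table so far", which is what lets batch and streaming share one abstraction.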
16. Window API from Spark SQL
● Supporting multiple time abstractions in a single API is tricky
● The Flink API makes the application's default time abstraction an environment-level setting:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
● Spark takes the route of an explicit time column, inspired by the Spark SQL API
● So the Spark API is optimised for event time by default
17. Window API
● The window API in Structured Streaming is part of the groupBy operation on Dataset/DataFrame
● val windowedCount = wordsDs
    .groupBy(
      window($"processingTime", "15 seconds")
    )
● It takes three parameters
○ Time Column - the name of the time column
○ Window Time - how long the window is
○ Slide Time - an optional parameter to specify the sliding time
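The grouping above assigns each row to one or more time buckets based on its time column. A minimal sketch of that bucketing logic follows, in plain Python rather than Spark's implementation, assuming window starts align at time zero:

```python
def windows_for(ts, duration, slide=None):
    """Return the [start, end) windows of length `duration` seconds that
    timestamp `ts` falls into, with a new window every `slide` seconds.
    When slide == duration (the default) this is a tumbling window,
    so every timestamp belongs to exactly one window."""
    slide = slide or duration
    last_start = ts - (ts % slide)   # latest window start containing ts
    starts = []
    start = last_start
    while start > ts - duration:     # walk back while ts is still inside
        starts.append(start)
        start -= slide
    return [(s, s + duration) for s in sorted(starts)]

# Tumbling 15s window: one bucket per record.
assert windows_for(17, 15) == [(15, 30)]
# Sliding window (15s long, every 5s): a record lands in several buckets.
assert windows_for(17, 15, 5) == [(5, 20), (10, 25), (15, 30)]
```

This is why a sliding window produces overlapping groups: the `.groupBy(window(...))` call effectively duplicates each row into every window it belongs to before aggregating.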
18. Window on Processing Time
● Useful for porting existing DStream code to the Structured Streaming API
● By default, the window API doesn't support processing time
● But we can emulate processing time by adding a time column derived from processing time
● We will use Spark SQL's current_timestamp() function to generate the time column
● Ex : ProcessingTimeWindow
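The emulation can be sketched in plain Python (not Spark): stamp each record with the engine's clock as it is processed, then window on that stamped column. Here `fake_clock` is a deterministic stand-in for what `current_timestamp()` would return during a micro-batch.

```python
# Toy sketch of emulating processing time (not Spark).
from collections import defaultdict

def process(batch, clock):
    """Attach a processing-time column to each record in the batch."""
    return [(record, clock()) for record in batch]

# Deterministic clock for the example; in Spark this role is played by
# current_timestamp() evaluated while the micro-batch runs.
ticks = iter([100.0, 100.0, 116.0])
fake_clock = lambda: next(ticks)

stamped = process(["a", "b"], fake_clock) + process(["c"], fake_clock)

# Group the stamped records into 15-second processing-time buckets.
buckets = defaultdict(list)
for record, ts in stamped:
    buckets[int(ts // 15)].append(record)

assert buckets[int(100.0 // 15)] == ["a", "b"]   # first micro-batch
assert buckets[int(116.0 // 15)] == ["c"]        # later micro-batch
```

The key point: the timestamp says nothing about when the event happened at the source; it only records when the engine saw it, which is exactly the processing-time semantics of the DStream API.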
19. Window on Ingestion Time
● The ingestion time abstraction is useful when each batch of data is captured in real time but takes considerable time to process
● Ingestion time gives better results than processing time, without having to worry about out-of-order events as in event time
● Ingestion time support depends on the source
● In our example, we will use the socket stream, which supports it
● IngestionTimeWindow
21. Importance of Event Time
● Event time is the source of truth for all events
● As more and more stream processing is sensitive to the time of capture, event time plays a big role
● Event time helps developers correlate events from various sources easily
● Correlating events within and across sources helps developers build interesting streaming applications
● For this reason, event time is the default abstraction supported in Structured Streaming
22. Challenges of Event Time
● Event time is cool, but it complicates the design of stream processing frameworks
● Time passed at the source may differ from time passed in the processing engine
● How do you handle out-of-order events, and how long do you wait?
● How do you correlate events from sources which run at their own speeds?
● How do you reconcile event time with processing time?
23. Window on Event Time
● Event time is a column embedded in the data itself
● The default window API is built for exactly this use case
● Windowing on event time makes sure that, even when there is network latency, we process based on the actual time at the source rather than the speed of the processing engine
● In our example, we analyse Apple stock data which embeds the tick time
● EventTimeExample
24. Late events
● Whenever we use event time, the challenge is how to handle late events
● By default, the event time window in Spark keeps windows forever, which means we can handle late events indefinitely
● From the application's point of view this is great, as we never miss any event at all
● Ex : EventTimeExample
25. Need of Watermarks
● Keeping windows around forever is great for logic, but problematic from a resource point of view
● As each window creates state in Spark, the state keeps expanding as time passes
● This kind of state uses more and more memory and makes recovery more difficult
● So we need a mechanism to restrict how long windows are kept around
● This mechanism is known as watermarks
26. Watermarks
● A watermark is a threshold which defines how long we wait for late events
● Using watermarks with event time makes sure Spark drops a window's state once this threshold has passed at the source
● Spark will maintain state and allow late data to update the state for a window ending at time T until (max event time seen by the engine - late threshold > T)
● WaterMarkExample
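The drop condition above can be written out as a one-line predicate. This is a plain-Python sketch of the rule as stated on the slide, not Spark's internal implementation; the 10-second threshold and the [0, 15) window are example values.

```python
# Toy sketch of the watermark rule: a window ending at time T is dropped
# once (max event time seen by the engine - late threshold) > T.
def should_drop(window_end, max_event_time_seen, late_threshold):
    return max_event_time_seen - late_threshold > window_end

LATE = 10.0  # the watermark threshold, e.g. "10 seconds"

# The window [0, 15) stays alive while the watermark (20 - 10 = 10)
# has not yet passed its end time of 15...
assert not should_drop(15.0, max_event_time_seen=20.0, late_threshold=LATE)
# ...and is dropped once the watermark (26 - 10 = 16) moves past 15.
assert should_drop(15.0, max_event_time_seen=26.0, late_threshold=LATE)
```

Because the watermark advances with the maximum event time seen, a slow source simply keeps its windows alive longer; the threshold bounds lateness relative to the source's own clock, not the engine's.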
28. Need of timeless windows
● Most streaming applications use time as the criterion for most of their analysis
● But there are use cases in streaming where the state is not bounded by time
● In those scenarios, we need a mechanism to define a window using the non-time parts of the data
● In the DStream API this was tricky, but with Structured Streaming we can define it easily
29. Sessionization
● A session is a period of time that captures a user's different interactions with an application
● In an online portal, a session normally starts when the user logs into the application, and is torn down when the user logs out, or expires when there is no activity for some time
● A session is not a purely time-based interaction, as different sessions can run for different lengths of time
30. Session Window
● A session window is a window which allows us to group different records from the stream for a specific session
● The window starts when the session starts and is evaluated when the session ends
● The window also supports tracking multiple sessions at the same time
● Session windows are often used to analyze user behavior across multiple interactions bounded by a session
32. Custom State Management
● There is no direct API to define non-time-based windows in Structured Streaming
● As windows are internally represented using state, we need custom state management to implement non-time windows
● In Structured Streaming, the mapGroupsWithState API allows developers to do custom state management
● This API behaves similarly to mapWithState from the DStream API
33. Modeling User Session
● case class Session(sessionId: String, value: Double, endSignal: Option[String])
● sessionId uniquely identifies the given session
● value is the data captured for the given session
● endSignal is the explicit signal from the application that the session has ended
● This end signal can be a log-out event, the completion of a transaction, etc.
● A timeout is not part of the record
34. State Management Models
● Whenever we do custom state management, we need to define two different models
● One keeps around SessionInfo, which tracks the overall session state
case class SessionInfo(totalSum: Double)
● The SessionUpdate model communicates updates for each batch
case class SessionUpdate(id: String, totalSum: Double, expired: Boolean)
35. State Management
● We group records by sessionId
● We use the mapGroupsWithState API to go through each record in the batch belonging to a specific session id
● For each group, we check whether it has expired based on the data
● If expired, we call state.remove to drop the state
● If not expired, we call state.update to update the state with new data
● SessionisationExample
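The steps above can be sketched in plain Python. This is a toy model of the described logic, not the mapGroupsWithState API itself: a dict plays the role of the per-group state store, the running total stands in for SessionInfo, and the returned tuple stands in for SessionUpdate.

```python
# Toy sketch of sessionization (plain Python, not Spark): group records
# by session id, accumulate a sum, and drop state on an end signal.
state = {}  # sessionId -> running total (the role of SessionInfo)

def update(session_id, value, end_signal=False):
    """Process one record; return (id, totalSum, expired),
    mirroring the shape of the SessionUpdate model."""
    total = state.get(session_id, 0.0) + value
    if end_signal:
        state.pop(session_id, None)   # the role of state.remove in Spark
        return (session_id, total, True)
    state[session_id] = total         # the role of state.update in Spark
    return (session_id, total, False)

assert update("s1", 10.0) == ("s1", 10.0, False)
assert update("s1", 5.0) == ("s1", 15.0, False)
# An explicit end signal (e.g. a log-out event) expires the session...
assert update("s1", 0.0, end_signal=True) == ("s1", 15.0, True)
# ...and its state is gone, so a new "s1" session would start from zero.
assert "s1" not in state
```

Because the window here is bounded by the end signal in the data rather than by a time column, no watermark is involved; the state lives exactly as long as the session does.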