This document provides a tutorial on the role of event-time order in data streaming analysis. The agenda covers motivations and examples of data streaming and stream processing engines, causes of out-of-order data and solutions to enforce total ordering, pros and cons of total ordering, and relaxation of total ordering using watermarks. Enforcing total ordering through techniques like sorting tuples is computationally expensive but provides benefits like determinism and synchronization. However, it may be an overkill for some applications and increase latency.
Role Of Transgenic Animal In Target Validation-1.pptx
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
1. Tutorial:
The Role of Event-Time Order
in Data Streaming Analysis
VincenzoGulisano
Chalmers University ofTechnology
Gothenburg, Sweden
vincenzo.gulisano@chalmers.se
Dimitris Palyvos-Giannas
Chalmers University ofTechnology
Gothenburg, Sweden
palyvos@chalmers.se
Bastian Havers
Chalmers University ofTechnology &Volvo Cars
Gothenburg, Sweden
havers@chalmers.se
Marina Papatriantafilou
Chalmers University ofTechnology
Gothenburg, Sweden
ptrianta@chalmers.se
3. Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
3
https://github.com/ vincenzo- gulisano/debs2020_tutorial_event_time
4. Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
4
6. Where does Big Data Originate?
1 trillion sensors
by 2030
in the Internet of Things (IoT)
uploaded toYouTube
every minute
2.32 billion
Facebook users
219 billion photos
uploaded to Facebook
1 online interaction
every 18 seconds
by 2025
6
500 hours of video
7. Main Memory
Database Management Systems (DBMSs) vs.
Stream Processing Engines (SPEs)
7
Disk
1 Data
Query Processing
3 Query
results
2 Query
Main Memory
Query Processing
Continuous
QueryData
Query
results
9. Flink-related images / code snippets in the following are taken from: https://flink.apache.org/ 9
10. Data Stream:
unbounded sequence of tuples
sharing the same schema
Example: vehicles’ speed reports
10
time
Field Field
vehicle id text
time (secs) text
speed (Km/h) double
X coordinate double
Y coordinate double
A 8:00 55.5 X1 Y1
Let’s assume each source (e.g.,
vehicle)
produces and delivers
a timestamp-sorted stream
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
11. Continuous query (or simply query):
Directed Acyclic Graph (DAG) of streams and
operators
OP
OP
OP
OP OP
OP OP
source op
(1+ out streams)
sink op
(1+ in streams)
stream
op
(1+ in, 1+ out streams)
11
12. Data Streaming Operators
Two main types:
• Stateless operators
• do not maintain any state
• one-by-one processing
• if they maintain some state, such state does not evolve depending on the tuples being
processed
• Stateful operators
• maintain a state that evolves depending on the tuples being processed
• produce output tuples that depend on multiple input tuples
12
OP
OP
13. Stateless Operators
Filter / route tuples based on one (or more) conditions
13
Filter
...Map Transform each tuple
15. Stateful Operators
Aggregate information from multiple
tuples (e.g., max, min, sum, ...)
15
Join tuples coming from 2 streams given a
certain predicate
Aggregate
Join
16. Windows and Stateful Analysis
Stateful operations are done over windows:
• Time-based (e.g., tuples in the last 10 minutes)
• Tuple-based (e.g., given the last 50 tuples)
16
time
[8:00,9:00)
[8:20,9:20)
[8:40,9:40)
Example of time-based window of size 1 hour and advance 20 minutes
How many tuple in a window?
Which time period does a window span?
20. Basic operators and user-defined operators
20
Besides a set of basic operators, SPEs usually
allow the user to define ad-hoc operators
(e.g., when existing aggregation are not enough)
21. SPEs and operators' variants
• Each SPE might define its own variants
of certain streaming operators:
21
t1
t2
t3
t4
t1
t2
t3
t4
R S
Sliding
window Window
sizeWS
WSWR
Predicate P
22. Sample Query
"every five minutes, of all vehicles that braked significantly, find the
one that braked the hardest"
22
time
A 8:00 55.5 X1 Y1 ... B 8:07 34.3 X3 Y3 ...
B 8:03 70.3 X2 Y2 ...
23. Sample Query
Remove
unused fields
Map
Field
vehicle id
time (secs)
speed (Km/h)
X coordinate
Y coordinate
...
Field
vehicle id
time (secs)
speed (Km/h)
Every minute,
compute average
speed of each
vehicle during the
last 2.5 minutes
Aggregate
Field
vehicle id
time (secs)
avg speed (Km/h)
Join
High average
speed and slow
current speed?
Filter
Field
vehicle id
time (secs)
braking factor
Join on
vehicle id in
last minute
Field
vehicle id
time (secs)
avg speed (Km/h)
speed (Km/h)
Aggregate
Every 5 minutes,
produce vehicle that
braked the hardest
during last 5
minutes
23
24. Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
24
25. From an abstract query
… to streaming application run by an SPE
25
OP1 OP2 OP4 OP6OP3 OP5Source Sink
OP1
OP2
OP4 OP6OP3
OP5Source
Sink
Source OP2
OP3
OP3
OP5
31. Data sources that produce out-of-order data
• Discussed in many related works (e.g., Babu, Shivnath, Utkarsh Srivastava, and JenniferWidom.
"Exploiting k-constraints to reduce memory overhead in continuous queries over data streams." ACM
Transactions on Database Systems (TODS) 29.3 (2004): 545-580.)
• Battery-operated devices, unreliable wireless networks, …
1 trillion sensors
by 2030
in the Internet of Things (IoT)
31
32. Causes of out-of-order data
32
Sources
themselves
Asynchronous
Distributed
Parallel
executions
33. The 3-step procedure
(sequential stream join)
33
For each incoming tuple t:
1. compare t with all tuples in opposite window given predicate P
2. add t to its window
3. remove stale tuples from t’s window
Add tuplesto S
Add tuples to R
Prod
R
Prod
S
Consume resultsConsPU
We assume each producer
delivers tuples in timestamp
order
34. The 3-step procedure, is it enough?
34
t1
t2
t1
t2
R S
WSWR
t3
t1
t2
t1
t2
R S
WSWR
t4
t3
35. Causes of out-of-order data
35
Asynchronous
Distributed
Parallel
executions
Any operator fed data from multiple logical /
physical stream can potentially observe out-
of-order data
41. Parallel execution
• Stateful operators: Semantic awareness
• Aggregate: count within last hour, group-by vehicle id
41
Previous Subcluster
R
…
R
…
M Agg1
M Agg2
M Agg3
…
…
…
Vehicle A
42. Parallel execution
• Depending on the stateful operator semantic:
• Partition input stream into keys
• Each key is processed by 1 thread
• # keys >> # threads/nodes
42
43. Parallel execution
• Depending on the stateful operator semantic:
• Partition input stream into keys
• Each key is processed by 1 thread
• # keys >> # threads/nodes
43
Keys
domain
Agg1 Agg2 Agg3
A
D
E
B
C F
44. Parallel execution
• Depending on the stateful operator semantics:
• Partition input stream into keys
• Each key is processed by 1 thread
• # keys >> # threads/nodes
44
Keys
domain
Agg1 Agg2 Agg3
A
D
E
B
C F
46. A 8:00 55.5 X1 Y1
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
Parallel execution
• Stateful operators: Semantic awareness
• Aggregate: count within last hour, group-by vehicle id
46
… R
… R
M Agg1
M Agg2
M Agg3
…
…
…
Map
Map
Vehicle A
Round-robin
(stateless)
47. Parallel execution
• Stateful operators: Semantic awareness
• Aggregate: count within last hour, group-by vehicle id
47
R…
R…
M Agg1
M Agg2
M Agg3
…
…
…
Vehicle A
Map
Map
Round-robin
(stateless)
A 8:00 55.5 X1 Y1
A 8:07 34.3 X3 Y3
A 8:03 70.3 X2 Y2
48. Inherent disorder
48
Map Aggregate Join Filter Aggregate
Print-to-file operator
P
Map0
Map1
Map2
Map3
P Aggregate Filter AggregateJoin
Disorder from parallelism
49. how to merge several timestamp-sorted streams...
49
M
...
...into one timestamp-sorted stream?
50. Gulisano, Vincenzo, et al. "Streamcloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and
Distributed Systems 23.12 (2012): 2351-2365. 50
51. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." Proceedings of the
2005 ACM SIGMOD international conference on Management of data. 2005. 51
52. Gulisano, Vincenzo, et al. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join." IEEE Transactions on Big
Data (2016).
52
53. Which one to choose?
• Different options have different costs
• might require different types of data structures
• might require “special tuples” (more on this in the following part of
the tutorial)
• When the price of merge-sorting is paid by who provides the
tuples rather than who receives them, the system can scale better
53
54. Tutorial:
The Role of Event-Time Order
in Data Streaming Analysis
VincenzoGulisano
Chalmers University ofTechnology
Gothenburg, Sweden
vincenzo.gulisano@chalmers.se
Dimitris Palyvos-Giannas
Chalmers University ofTechnology
Gothenburg, Sweden
palyvos@chalmers.se
Bastian Havers
Chalmers University ofTechnology &Volvo Cars
Gothenburg, Sweden
havers@chalmers.se
Marina Papatriantafilou
Chalmers University ofTechnology
Gothenburg, Sweden
ptrianta@chalmers.se
55. Tutorial:
The Role of Event-Time Order
in Data Streaming Analysis
VincenzoGulisano
Chalmers University ofTechnology
Gothenburg, Sweden
vincenzo.gulisano@chalmers.se
Dimitris Palyvos-Giannas
Chalmers University ofTechnology
Gothenburg, Sweden
palyvos@chalmers.se
Bastian Havers
Chalmers University ofTechnology &Volvo Cars
Gothenburg, Sweden
havers@chalmers.se
Marina Papatriantafilou
Chalmers University ofTechnology
Gothenburg, Sweden
ptrianta@chalmers.se
56. 56
Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
57. 57
Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
58. Pros/Cons of total-ordering
• Cons:
• Expensive (computation- and latency-wise)
• An “overkill” for certain applications (more of this in the following
slides)
• Pros:
• Determinism
• Synchronization
• Eager purging of stale state
58
59. Pros/Cons of total-ordering
• Cons:
• Expensive (computation- and latency-wise)
• An “overkill” for certain applications (more of this in the following
slides)
• Pros:
• Determinism
• Synchronization
• Eager purging of stale state
59
60. Cost
• We need to temporary maintain tuples
• Linear in the number of tuples we receive,
which depends on the streams’ rate
• We need to sort tuples… O(n log(n))
• (n is number of sources or tuples, depending on the case)
• We need data from all sources, the processing latency depends on the slowest source.
• The latency overhead can be estimated based on the sources’ rates1
1Gulisano, Vincenzo, et al. "Performance modeling of stream joins." Proceedings of the 11th ACM International Conference on
Distributed and Event-based Systems. 2017. 60
61. Estimating the latency overhead
1Gulisano, Vincenzo, et al. "Performance modeling of stream joins." Proceedings of the 11th ACM International
Conference on Distributed and Event-based Systems. 2017. 61
62. Pros/Cons of total-ordering
• Cons:
• Expensive (computation- and latency-wise)
• An “overkill” for certain applications (more of this in the following
slides)
• Pros:
• Determinism
• Synchronization
• Eager purging of stale state
62
63. Determinism
63
OP1 OP2 OP4 OP6OP3 OP5Source Sink
OP1
OP2
OP4 OP6OP3
OP5Source
Sink
Source OP2
OP3
OP3
OP5
Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." Proceedings of the 2005 ACM SIGMOD international
conference on Management of data. 2005.
Gulisano, Vincenzo. StreamCloud: an elastic parallel-distributed stream processing engine. Diss. 2012.
64. Determinism
64
t1 t2
S1
S2
t3
t4
t5 t6 t7 t8
t9
t1 t2 t3 t4 t5 t6 t7 t8 t9
…
Tuple-based window, size: 4 / advance: 1
Hwang, Jeong-Hyon, Ugur Cetintemel, and Stan Zdonik. "Fast and reliable stream processing over wide area networks." 2007 IEEE 23rd International
Conference on Data Engineering Workshop. IEEE, 2007.
Gulisano, Vincenzo, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient
stream join." IEEE Transactions on Big Data (2016).
65. Pros/Cons of total-ordering
• Cons:
• Expensive (computation- and latency-wise)
• An “overkill” for certain applications (more of this in the following
slides)
• Pros:
• Determinism
• Synchronization
• Eager purging of stale state
65
67. Synchronization
t1
t2
R S
WR
t3
t4
R S
t4
R S
t4
R S
t4
t1
WR
t2
WR WR
t3
Gulisano, Vincenzo, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join."
IEEE Transactions on Big Data (2016).
67
68. Synchronization
68
t1 t2R
S
t3
t4
t5 t6 t7 t8
t9
• Act as a barrier
• Carry the reconfiguration to be applied
• Change the parallelism degree
• Change other configuration
t3 t4 t5 t6… …
: Wait for , make sure
are ready and stop
: Wait for , make sure
are ready and stop
Najdataei, Hannaneh, et al. "Stretch: Scalable and elastic deterministic streaming analysis with virtual
shared-nothing parallelism." Proceedings of the 13th ACM International Conference on Distributed and Event-
based Systems. 2019.
69. 69
Agenda
• Motivation, preliminaries and examples about
data streaming and Stream Processing Engines
• Causes of out-of-order data and solutions enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the watermarks
73. Relaxing Correctness
73
Correctness
Performance
Flexibility
✓ In-order input streams
✓ Order in each window
✓ Correct window assignment
✗ In-order input streams
✓ Order in each window
✓ Correct window assignment
✗ In-order input streams
✗ Order in each window
✓ Correct window assignment
74. Disorder + Correct Window Assignment
1. When to create each window?
2. When to close each window (and produce result)?
74
0 10 20
ts: 7ts: 16 ts: 18ts: 3ts: 21
76. Closing Windows
76
0 10 20
ts: 7ts: 18 ts: 9ts: 21ts: 17 CLOSED
Operator needs some guarantee
it will not receive tuples with ts < W
Safely close all windows where
right_boundary ≤ W
77. Watermarks
77
Assume we can compute a monotonic function
𝐹: 𝑂 → 𝐸
that returns W ∈ E the earliest event time of any tuple
that can arrive at operator O in the future
0 10 20
ts: 7ts: 18 ts: 9ts: 21ts: 17
We call the value of this monotonic function F* the (low) watermark of operator O!
Monotonicity ➔ no tuples with ts < W will arrive in the future.
Solves problem of safely closing windows!
CLOSED
* The watermark of operator O is function of O and all its upstream peers, but we omit the latter for brevity.
W W
78. Watermarks in Practice
Watermarks are generated at the sources.
They (conceptually) flow through the pipeline.
They propagate regardless of data filtering ➔ all operators have up-to-date view of time
78
S
O1
02 O3
WS
WO2
WO1
WO3
WO1, OUT
WO2, OUT
WO3, OUT
Input Watermark: Earliest ts that O can receive.
Output Watermark: Earliest ts that O can emit.
79. Input & Output Timestamps
79
Stateless
tsIN tsOUT = tsIN
Stateful
tsIN tsOUT ≠ tsIN
tsIN
2 tsOUT
tsOUT tsOUTtsOUT
2. How is the timestamp set for window results?
State Event Time
tsIN
1
1.Which windows are complete?
81. 81
Flink Example
Map Aggregate Filter Aggregate
Join0
Join1
Join2
Join3
P
1. Result Correctness from correct window assignment
0 10 20
t
s
:
7
t
s
:
1
8
t
s
:
9
CLOSED
W W
2. Watermark propagation
S
O
1
0
2
O
3
WS
WO2
WO1
WO3
WO1, OUT
WO2, OUT
WO3, OUT
output watermark
P
82. Generating Watermarks
• Perfect Watermarks
• Sorted data or very predictable data sources.
• Determinism without sorting for order-independent window functions.
• Disorder inside windows is still possible!
• Sorting possible if needed, but not imposed.
• Heuristic Watermarks
• When impossible to perfectly predict data (e.g., distributed sources).
• Best-effort prediction of event-time progress.
• Possibility for late data.
• More knowledge about internals of sources → less mispredictions (late data).
82
83. Fast and Slow Watermarks
83
Window Complete
Perfect Watermark
Event Time
Window Complete
Slow Watermark
✗ Performance (Latency) Window Complete
Fast Watermark
✗ Correctness
Window Lifetime
84. Triggering
84
0h 12h 36h24h
Latency of 24 hours!
Can we do better?
Emit (partial) result every 1h
Completeness Trigger
Emit result when window is complete
Repeated Update Trigger
Periodically emit (partial) result
(e.g., for every tuple, every hour, etc)
Correctness
Performance
Flexibility
86. Putting It All Together
Window Complete
Perfect Watermark
Event Time
Window Complete
Slow Watermark
✗ Performance (Latency) Window Complete
Fast Watermark
✗ Correctness
Window Lifetime
Repeated Trigger
Early results
✓ Performance (Latency) Repeated Trigger
Late results
✓ Correctness
Watermark
Trigger
On-Time result
86
87. … the light at the end of the tunnel …
87
• Motivation, preliminaries and examples
about data streaming and Stream
Processing Engines
• Causes of out-of-order data and solutions
enforcing total-ordering
• Pros/Cons of total-ordering
• Relaxation of total-ordering and the
watermarks
88. To summarize:
Event time advances based
on:
88
Tuples themselves
Watermarks
Cons Pros
• Determinism, for order
sensitive / insensitive
functions
• Costly merge-sorting
• Coupled processing / output
of tuples
• Decoupled processing /
output of tuples
• Propagation of time
passing even in the absence
of tuples
• Would require special support
for order-sensitive functions
• Latency depends on
frequency of watermarks
89. Tutorial:
The Role of Event-Time Order
in Data Streaming Analysis
VincenzoGulisano
Chalmers University ofTechnology
Gothenburg, Sweden
vincenzo.gulisano@chalmers.se
Dimitris Palyvos-Giannas
Chalmers University ofTechnology
Gothenburg, Sweden
palyvos@chalmers.se
Bastian Havers
Chalmers University ofTechnology &Volvo Cars
Gothenburg, Sweden
havers@chalmers.se
Marina Papatriantafilou
Chalmers University ofTechnology
Gothenburg, Sweden
ptrianta@chalmers.se