Apache Flink Windows
DBIS, Institut für Informatik
Humboldt-Universität zu Berlin
Feeding a Squirrel in Time—Windows in Flink
Apache Flink Meetup Munich
Matthias J. Sax
mjsax@{informatik.hu-berlin.de|apache.org}
@MatthiasJSax
Humboldt-Universität zu Berlin
Department of Computer Science
November 11th, 2015
Matthias J. Sax - Windows in Apache Flink
About Me
Ph.D. student in CS, DBIS Group, HU Berlin
involved in the Stratosphere research project
working on data stream processing and optimization
Aeolus: built on top of Apache Storm
(https://github.com/mjsax/aeolus)
Committer at Apache Flink
Stream Processing
Processing data in motion:
external sources create data constantly
data is pushed to the system
need to keep up with incoming data rate
usage of ingestion buffers (e.g., Apache Kafka)
handle data peaks
back pressure, dynamic scaling (or even load-shedding)
low processing latency (milliseconds)
no micro-batching
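The difference between back pressure and load-shedding can be sketched with a bounded in-memory buffer. This is a minimal illustration in plain Java, not how Flink or Kafka implement it; the class `IngestionBufferSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: a bounded buffer between a data source and the processor. */
class IngestionBufferSketch {
    private final BlockingQueue<String> buffer;
    private int shed = 0; // number of records dropped by load-shedding

    IngestionBufferSketch(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Load-shedding: if the buffer is full, the record is dropped and counted. */
    boolean offerOrShed(String record) {
        boolean accepted = buffer.offer(record);
        if (!accepted) {
            shed++;
        }
        return accepted;
    }

    /** Back pressure: the producer blocks until the consumer has made room. */
    void putWithBackPressure(String record) throws InterruptedException {
        buffer.put(record);
    }

    /** Consumer side: takes the next record, or returns null if the buffer is empty. */
    String poll() {
        return buffer.poll();
    }

    int shedCount() {
        return shed;
    }
}
```

With back pressure no data is lost but the source is slowed down; with load-shedding the source keeps its rate and some records are lost.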
Other Systems
Apache Storm
widely used in industry
different processing guarantees
no guarantee
at-least-once
exactly-once (not for external writes)
no ordering guarantees
no type system
dynamic scaling (to some extent)
some high-level abstractions using Trident
windows, state, exactly-once-processing
Streaming Tradeoffs
Processing Time
no late data / no skew
windows are simple to build
low latency
inherently non-deterministic
Event Time (external)
late data / skew
out-of-order data (windowing more difficult)
simpler to reason about semantics (deterministic)
increased latency
Event Time (ingestion)
no late data / no skew
no out-of-order
simplified watermarking
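The tradeoff can be made concrete with a small sketch in plain Java (no Flink dependency). It assigns the same four records to 5-second tumbling windows, once by event timestamp and once by arrival time. All names (`TimeSemanticsSketch`, `windowStart`, `countPerWindow`) and the sample timestamps are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; none of these names are part of the Flink API. */
class TimeSemanticsSketch {

    /** Start of the tumbling window of the given size that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Counts records per tumbling window, keyed by window start. */
    static Map<Long, Integer> countPerWindow(long[] timestampsMs, long sizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts, sizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Event times attached at the source; the third record is out of order.
        long[] eventTimes = {1_000, 4_000, 3_000, 6_000};
        // Arrival times at the operator for one particular run (made-up values).
        long[] arrivalTimes = {2_000, 5_500, 6_000, 7_000};

        System.out.println("event time:      " + countPerWindow(eventTimes, 5_000));
        System.out.println("processing time: " + countPerWindow(arrivalTimes, 5_000));
    }
}
```

The event-time counts depend only on the data, so every run yields the same windows. The processing-time counts depend on when records happen to arrive, so a rerun with different delays can give a different result.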
Time Based Windows
Timestamp Example
StreamExecutionEnvironment env = ...
// alternatives: ProcessingTime / IngestionTime
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Tuple> input = ...
input.assignTimestamps(new TimestampExtractor<Tuple>() {
    // event timestamp of the given element
    public long extractTimestamp(Tuple element, long currentTimestamp) {
        return /* extract from element */;
    }
    // watermark derived from the given element
    public long extractWatermark(Tuple element, long currentTimestamp) {
        return /* extract from element */;
    }
    // watermark that does not depend on an element
    public long getCurrentWatermark() {
        return Long.MIN_VALUE;
    }
});
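One common way to derive a watermark is to track the highest timestamp seen so far and subtract a fixed bound on out-of-orderness. The slide does not prescribe this strategy; the following is a minimal sketch in plain Java under that assumption, and `BoundedLatenessWatermark` with its methods is a hypothetical name, not Flink API.

```java
/** Hypothetical sketch of a bounded-out-of-orderness watermark; not Flink API. */
class BoundedLatenessWatermark {
    private final long maxOutOfOrdernessMs;
    private long maxSeenTimestampMs = Long.MIN_VALUE;

    BoundedLatenessWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Records the event timestamp of an incoming element. */
    void observe(long timestampMs) {
        maxSeenTimestampMs = Math.max(maxSeenTimestampMs, timestampMs);
    }

    /** Highest timestamp seen minus the bound; Long.MIN_VALUE before any element. */
    long currentWatermark() {
        return maxSeenTimestampMs == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxSeenTimestampMs - maxOutOfOrdernessMs;
    }

    /** In this sketch an element counts as late if it is not newer than the watermark. */
    boolean isLate(long timestampMs) {
        return timestampMs <= currentWatermark();
    }
}
```

A larger bound tolerates more skew but delays window results, which is the latency cost of event time.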
Time Based Windows (cont.)
Sliding Time Window Example
DataStream<...> input = ...
input.keyBy(...)
    // size = 5s; slide = 1s
    .timeWindow(Time.of(5, TimeUnit.SECONDS),
                Time.of(1, TimeUnit.SECONDS))
    .reduce(...);
General Window Example
DataStream<...> input = ...
input.keyBy(...)
    .window(...)
    .apply(new WindowFunction<...>() {
        // ...
    });
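With size = 5s and slide = 1s, every element belongs to five overlapping windows. The sketch below computes the start times of those windows in plain Java. It assumes windows are aligned to timestamp 0 and cover the half-open interval [start, start + size); `SlidingWindowSketch` and `windowStarts` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of sliding-window assignment; not Flink API. */
class SlidingWindowSketch {

    /** Start timestamps of all windows [start, start + size) that contain the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk back one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Element at t = 7.5s with size = 5s and slide = 1s.
        System.out.println(windowStarts(7_500, 5_000, 1_000));
    }
}
```

Each element is therefore processed size / slide times, which is why a small slide relative to the window size increases the work per element.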
Advanced Windowing Concepts
global windows (non-parallelized)
Triggers:
close a window (i.e., fire)
processing time
watermark
count
delta
... (with different discarding strategies)
Evictors:
remove tuples from a window before the function is applied
time, count, delta
mix different windows/triggers/evictors
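How a trigger and an evictor interact can be sketched in plain Java. The example combines a count trigger (fire after every `fireEvery` new elements) with a count evictor (keep at most the `keepLast` newest elements) on a single global window and applies a sum as the window function. All names are hypothetical, and the buffer is not purged on firing, which is only one of the possible discarding strategies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a count trigger plus count evictor; not Flink API. */
class CountTriggerEvictorSketch {
    private final int fireEvery; // trigger: fire after this many new elements
    private final int keepLast;  // evictor: keep at most this many elements
    private final List<Integer> buffer = new ArrayList<>();
    private int sinceLastFire = 0;

    CountTriggerEvictorSketch(int fireEvery, int keepLast) {
        this.fireEvery = fireEvery;
        this.keepLast = keepLast;
    }

    /** Adds an element; returns the window sum if the trigger fires, otherwise null. */
    Integer add(int value) {
        buffer.add(value);
        sinceLastFire++;
        if (sinceLastFire < fireEvery) {
            return null; // trigger does not fire yet
        }
        sinceLastFire = 0;
        // Evictor runs before the window function: drop the oldest elements.
        while (buffer.size() > keepLast) {
            buffer.remove(0);
        }
        // Window function: sum of the remaining elements.
        int sum = 0;
        for (int v : buffer) {
            sum += v;
        }
        return sum;
    }
}
```

With `fireEvery = 2` and `keepLast = 3`, this behaves like a count-based sliding window of size 3 and slide 2.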
Stateful Stream Processing
Flink can handle arbitrary user state:
state is stored reliably
distributed snapshots algorithm
Example
public class CounterSum
        extends RichReduceFunction<Long> {
    // number of reduce() calls, kept in Flink-managed state
    private OperatorState<Long> counter;

    public void open(Configuration config) {
        counter = getRuntimeContext()
            .getOperatorState("myCnt", Long.class, 0L);
    }

    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }
}
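Why snapshotting state gives exactly-once results after a failure can be sketched for a single operator in plain Java. This is a strong simplification and not the distributed snapshot algorithm itself: it assumes a replayable input and stores the state together with the input position. `CheckpointedCounterSketch` and its methods are hypothetical names.

```java
/** Hypothetical single-operator sketch of snapshot and restore; not Flink API. */
class CheckpointedCounterSketch {
    private long sum = 0;     // user state
    private int position = 0; // index of the next input record to process

    /** Processes input records from the current position up to (excluding) upTo. */
    void process(long[] input, int upTo) {
        while (position < upTo) {
            sum += input[position];
            position++;
        }
    }

    /** Snapshot: state and input position are captured together. */
    long[] snapshot() {
        return new long[] {sum, position};
    }

    /** Restore: state and input position are rolled back together. */
    void restore(long[] snapshot) {
        sum = snapshot[0];
        position = (int) snapshot[1];
    }

    long sum() {
        return sum;
    }
}
```

Because state and position are rolled back together, records processed after the snapshot are replayed exactly once into the restored state, so no record is counted twice or lost.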
Summary (cont.)
Flink provides a rich API (Java/Scala) to express different
semantics
state handling for arbitrary UDF code
fault-tolerance with exactly-once guarantees
exactly-once sink available
What else?
Python API is coming (right now DataSet only)
Google Dataflow on Flink
Storm on Flink
Apache SAMOA on Flink