Apache Flink is an open source platform for distributed stream and batch data processing. It provides APIs called DataStream for unbounded streaming data and DataSet for bounded batch data. Flink runs streaming topologies that allow for windowing, aggregation and other stream processing functions. It supports exactly-once processing semantics through distributed snapshots and checkpoints. The system is optimized for low latency and high throughput streaming applications.
2. Recent History
April '14: Project Incubation — v0.5, v0.6, v0.7
December '14: Top Level Project — v0.8
April '15: v0.9
Currently moving towards the 0.10 and 1.0 releases.
3. What is Flink?
Deployment: Local (Single JVM) · Cluster (Standalone, YARN)
APIs: DataStream API (Unbounded Data) · DataSet API (Bounded Data)
Runtime: Distributed Streaming Data Flow
Libraries: Machine Learning · Graph Processing · SQL-like API
4. What is Flink?
Streaming topologies over streams, with time and count windows, at low latency.
Long batch pipelines with high resource utilization.
Machine learning with iterative algorithms, e.g. factorizing a rating matrix into a user matrix and an item matrix.
Graph analysis over weighted graphs, with mutable state.
7. Cornerstones of Flink
Low Latency for fast results.
High Throughput to handle many events per second.
Exactly-once guarantees for correct results.
Expressive APIs for productivity.
13. DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
  .getExecutionEnvironment();
DataStream<String> data = env.fromElements(
  "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataStream Windowed WordCount
DataStream<Tuple2<String, Integer>> counts = data
  .flatMap(new SplitByWhitespace()) // (word, 1)
  .keyBy(0) // [word, [1, 1, …]] for 10 seconds
  .timeWindow(Time.of(10, TimeUnit.SECONDS))
  .sum(1); // sum per word per 10 second window
counts.print();
env.execute();
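As a sketch of what the tumbling window above produces, the following plain-Java simulation (no Flink dependency; the timestamps and words are made-up examples) groups (timestamp, word) events into 10-second windows and counts per word:

```java
import java.util.*;

public class WindowedCount {
  public static void main(String[] args) {
    // (timestampMillis, word) events; values are made-up examples
    Object[][] events = {
      {1000L, "romeo"}, {2000L, "romeo"}, {11000L, "romeo"}
    };
    // 10-second tumbling windows: window index = timestamp / 10000
    Map<Long, Map<String, Integer>> windows = new TreeMap<>();
    for (Object[] e : events) {
      long window = (Long) e[0] / 10000;
      windows.computeIfAbsent(window, w -> new HashMap<>())
             .merge((String) e[1], 1, Integer::sum);
    }
    System.out.println(windows); // {0={romeo=2}, 1={romeo=1}}
  }
}
```

The first two events fall into window 0 and are summed together; the third lands in the next window, just as `timeWindow(Time.of(10, TimeUnit.SECONDS))` would separate them.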
21. DataStream API
public static class SplitByWhitespace
  implements FlatMapFunction<String, Tuple2<String, Integer>> {
  @Override
  public void flatMap(
      String value, Collector<Tuple2<String, Integer>> out) {
    // split on runs of non-word characters (drops punctuation)
    String[] tokens = value.toLowerCase().split("\\W+");
    for (String token : tokens) {
      if (token.length() > 0) {
        out.collect(new Tuple2<>(token, 1));
      }
    }
  }
}
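For reference, `split("\\W+")` tokenizes on runs of non-word characters, so punctuation disappears along with the whitespace. A quick standalone check, outside Flink:

```java
public class TokenizeCheck {
  public static void main(String[] args) {
    // same tokenization as SplitByWhitespace, applied to a plain string
    String[] tokens = "O Romeo, Romeo!".toLowerCase().split("\\W+");
    System.out.println(String.join(",", tokens)); // o,romeo,romeo
  }
}
```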
27. Pipelining
DataStream<String> data = env.fromElements(
  "O Romeo, Romeo! wherefore art thou Romeo?", …);
// DataStream WordCount
DataStream<Tuple2<String, Integer>> counts = data
  .flatMap(new SplitByWhitespace()) // (word, 1)
  .keyBy(0) // split stream by word
  .sum(1); // sum per word as they arrive
Pipeline: Source → Map → Reduce
37. Streaming Fault Tolerance
At Most Once
• No guarantees at all
At Least Once
• Ensure that all operators see all events.
Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.
Flink supports all three guarantees.
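To make the difference concrete, here is a toy simulation of a failure and replay. It shows only the observable effect of the two guarantees, not Flink's actual mechanism (which uses snapshots and replay rather than per-event deduplication); the offsets and failure point are made up:

```java
import java.util.*;

public class Guarantees {
  public static void main(String[] args) {
    // offsets processed before a failure, then replayed from the last checkpoint
    List<Integer> firstRun = List.of(0, 1, 2);
    List<Integer> replay   = List.of(1, 2, 3, 4);

    int atLeastOnce = 0;            // counts every delivery, duplicates included
    int exactlyOnce = 0;            // counts each offset's effect once
    Set<Integer> seen = new HashSet<>();

    for (int off : firstRun) { atLeastOnce++; if (seen.add(off)) exactlyOnce++; }
    for (int off : replay)   { atLeastOnce++; if (seen.add(off)) exactlyOnce++; }

    System.out.println(atLeastOnce + " " + exactlyOnce); // 7 5
  }
}
```

Under at-least-once, offsets 1 and 2 are counted twice after the replay (7 updates for 5 events); exactly-once state reflects each event exactly once.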
39. Distributed Snapshots
Flink guarantees exactly-once processing via distributed snapshots.
A JobManager (master) coordinates checkpoints. A state backend stores the checkpoint data: the position of each source, the state of each operator, and the acknowledgement of each sink.
40. Distributed Snapshots
The JobManager starts a checkpoint by sending a "start checkpoint" message to the sources.
41. Distributed Snapshots
The sources emit barriers into their streams and acknowledge the checkpoint with their current positions (here: offsets 6791, 7252, 5589, 6843).
42. Distributed Snapshots
A stateful operator waits until it has received the barrier at each of its inputs.
43. Distributed Snapshots
Once the barrier has arrived at each input, the operator writes a snapshot of its state to the state backend.
44. Distributed Snapshots
The operator acknowledges the checkpoint with a pointer to its stored state (State 1: PTR1, State 2: PTR2) and forwards the barrier downstream.
45. Distributed Snapshots
Once the sinks have received the barrier at each input, they acknowledge the checkpoint (Sink 1: ACK, Sink 2: ACK).
46. Distributed Snapshots
The checkpoint is now complete: source offsets, state pointers, and sink acknowledgements are all recorded in the state backend.
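The barrier-alignment step can be sketched as a toy single-threaded simulation (this is an illustration of the idea, not Flink's actual implementation; the channel contents and the counting operator are made up). An operator with two inputs stops reading a channel once that channel's barrier has arrived, and snapshots its state only after barriers have arrived on all inputs:

```java
import java.util.*;

public class BarrierAlignment {
  public static void main(String[] args) {
    // two input channels; "B" marks the checkpoint barrier
    Deque<String> in1 = new ArrayDeque<>(List.of("a", "b", "B", "c"));
    Deque<String> in2 = new ArrayDeque<>(List.of("x", "B", "y"));

    int count = 0;                  // operator state: events counted so far
    boolean b1 = false, b2 = false; // barrier seen on each input?

    while (!(b1 && b2)) {
      // alignment: only read from channels that have not yet delivered a barrier
      if (!b1 && !in1.isEmpty()) {
        String r = in1.poll();
        if (r.equals("B")) b1 = true; else count++;
      }
      if (!b2 && !in2.isEmpty()) {
        String r = in2.poll();
        if (r.equals("B")) b2 = true; else count++;
      }
    }

    // snapshot taken only after the barrier arrived on every input;
    // remaining records ("c", "y") belong to the next checkpoint interval
    int snapshot = count;
    System.out.println(snapshot); // 3 (a, b, x)
  }
}
```

The snapshot cleanly separates pre-barrier records (a, b, x) from post-barrier ones, which is what makes the recorded state consistent with the recorded source offsets.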
47. Operator State
Stateless operators:
ds.filter(_ != 0)
System state:
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
User-defined state:
public class CounterSum implements RichReduceFunction<Long> {
  private OperatorState<Long> counter;
  @Override public Long reduce(Long v1, Long v2) throws Exception {
    counter.update(counter.value() + 1);
    return v1 + v2;
  }
  @Override public void open(Configuration config) {
    counter = getRuntimeContext().getOperatorState("counter", 0L, false);
  }
}
48. Batch on Streaming
APIs: DataStream API (Unbounded Data) · DataSet API (Bounded Data)
Runtime: Distributed Streaming Data Flow
Libraries: Machine Learning · Graph Processing · SQL-like API
49. Batch on Streaming
Run a bounded stream (data set) on a stream processor: a bounded data set is a finite slice of an unbounded data stream.
50. Batch on Streaming
Run a bounded stream (data set) on a stream processor.
Infinite streams: stream windows, pipelined data exchange.
Finite streams: global view, pipelined or blocking data exchange.
52. DataSet API
ExecutionEnvironment env = ExecutionEnvironment
  .getExecutionEnvironment();
DataSet<String> data = env.fromElements(
  "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataSet WordCount
DataSet<Tuple2<String, Integer>> counts = data
  .flatMap(new SplitByWhitespace()) // (word, 1)
  .groupBy(0) // [word, [1, 1, …]]
  .sum(1); // sum per word for all occurrences
counts.print();
59. Batch-specific optimizations
Cost-based optimizer
• Program adapts to changing data size
Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user types
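The managed-memory idea can be sketched in a few lines: records live in serialized form inside a preallocated buffer, and the sort reads keys out of the buffer in place rather than materializing objects. This toy uses a plain `ByteBuffer` and fixed 8-byte records; the layout and sizes are made up and do not reflect Flink's actual memory segments:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ManagedSort {
  static final int REC = 8; // fixed-size record: one long key

  public static void main(String[] args) {
    long[] keys = {42, 7, 19, 3};
    // preallocated "memory segment" holding records in serialized form
    ByteBuffer page = ByteBuffer.allocate(keys.length * REC);
    for (long k : keys) page.putLong(k);

    // sort record indices by comparing serialized keys in place,
    // without deserializing records into heap objects
    Integer[] idx = {0, 1, 2, 3};
    Arrays.sort(idx, (a, b) ->
        Long.compare(page.getLong(a * REC), page.getLong(b * REC)));

    StringBuilder sb = new StringBuilder();
    for (int i : idx) sb.append(page.getLong(i * REC)).append(' ');
    System.out.println(sb.toString().trim()); // 3 7 19 42
  }
}
```

Keeping records serialized in fixed pages is what lets such operators spill pages to disk for out-of-core processing and keep garbage-collection pressure low.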