Graduating Flink Streaming - Chicago meetup

Flink 0.10
Graduating streaming
Márton Balassi
mbalassi@apache.org / @MartonBalassi
Hungarian Academy of Sciences

Streaming in Flink 0.10
• Operational readiness
High Availability
Monitoring
Integration with other systems
• First-class support for event-time
• Hardened statefulness support
• Redefined API

Streaming in Flink 0.10
• Some breaking changes
GroupBy -> KeyBy
Windowing API completely changed
Naming of DataStream and related classes
Internal rewrite
The goal is to harden for 1.0

Windowing
• Why put your data into windows?
• That is why:

Streaming data never stops
Window (5 min)
Count #Hashtags
Just saw #Trump on #CNN,
super cool. :D
Trump: 2394
Cheese: 12984
Money: 42

What I didn’t mention
• tweets have a timestamp,
their event time
• tweets from across the globe
arrive with delay
=> tweets with different
timestamps arrive out-of-order

Window (5 min)
Count #Hashtags
12:34 (13.10.2015):
Just saw #Trump on #CNN,
super cool. :D
Trump: 2394
Cheese: 12984
Money: 42
These arrive with 3
minutes slack
Form windows based on
processing time of the
machine.
Processing Time != Event Time
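To make the difference concrete, here is a minimal sketch in plain Java (no Flink dependency). The class `TumblingWindows` and its methods are hypothetical names made up for illustration, and the timestamps are invented. The sketch assigns the same three tweets to 5-minute tumbling windows, once by their event time and once by their arrival (processing) time.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Flink: assigns timestamps to tumbling windows. */
final class TumblingWindows {
    static final long SIZE_MS = 5 * 60 * 1000L; // 5-minute windows, as on the slide

    /** Start of the tumbling window that contains the timestamp (assumes timestamps >= 0). */
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % SIZE_MS);
    }

    /** Counts elements per window, keyed by window start. */
    static Map<Long, Integer> countPerWindow(long[] timestampsMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Three tweets, all created during minute 4, i.e. inside the first window.
        long[] eventTimes = {4 * minute, 4 * minute + 10_000, 4 * minute + 20_000};
        // The first two arrive after 1 second, the third arrives 3 minutes late.
        long[] arrivalTimes = {4 * minute + 1_000, 4 * minute + 11_000, 7 * minute + 20_000};
        System.out.println("event time:      " + countPerWindow(eventTimes));
        System.out.println("processing time: " + countPerWindow(arrivalTimes));
    }
}
```

Grouped by event time, all three tweets fall into the first window. Grouped by arrival time, the delayed tweet is counted in the second window, although it was written during the first.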

Why do people use this?
• easy to implement
• low latency
• this is what systems give you
(Spark Streaming, Apex,
Samza, Storm)*
*not Google Cloud Dataflow

Let's look at a more
complex example.

Window (5 min)
Correlate Tweets
and News
something...
These still have 3 min slack.
These have 8 min slack.
12:33 (13.10.2015):
Donald Trump speaks at
Cheese conference.

=> Mismatch in the
timespace continuum

Use cases
• out-of-order elements
• sources with delay
• recovery/fault-tolerance
• “catching up” with a stream
Who does it?
• Google Cloud Dataflow
• Apache Flink

We need a
Global Clock
that runs on
event time
instead of
processing time.

[Figure: a source sends elements and watermarks to a window operator. Labels: "This is a source", "This is our window operator", "This is the current event time", "This is a watermark".]
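How a watermark drives the event-time clock of a window operator can be sketched in plain Java. This is a simplified illustration, not Flink's implementation: the class `WatermarkWindowCounter` and its methods are hypothetical, and it assumes that a watermark w promises that no element with a timestamp below w will arrive any more. Late elements are simply dropped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified window operator; not Flink's implementation. */
final class WatermarkWindowCounter {
    private final long windowSize;
    // window start -> number of elements buffered so far
    private final TreeMap<Long, Integer> open = new TreeMap<>();
    // the operator's current event time, advanced only by watermarks
    private long currentWatermark = Long.MIN_VALUE;

    WatermarkWindowCounter(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Buffers an element by its event timestamp (assumed >= 0); arrival order does not matter. */
    void onElement(long timestamp) {
        if (timestamp < currentWatermark) {
            return; // late element: dropped in this sketch (an assumption, for simplicity)
        }
        long start = timestamp - (timestamp % windowSize);
        open.merge(start, 1, Integer::sum);
    }

    /** Advances event time and returns "start=count" for every window that is now complete. */
    List<String> onWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> fired = new ArrayList<>();
        while (!open.isEmpty() && open.firstKey() + windowSize <= currentWatermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            fired.add(e.getKey() + "=" + e.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkWindowCounter op = new WatermarkWindowCounter(10);
        op.onElement(3);
        op.onElement(12);
        op.onElement(7); // arrives out of order, still counted in window [0, 10)
        System.out.println(op.onWatermark(9));  // window [0, 10) is not complete yet
        System.out.println(op.onWatermark(10)); // window [0, 10) fires with both elements
    }
}
```

The window fires when the watermark passes its end, not when the wall clock does, so a slow or replayed source delays the result but does not change it.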

Processing Time
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Windows are formed by the wall-clock time of the processing machine
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<Tweet> text = env.addSource(new TwitterSrc());
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    .timeWindow(Time.of(5, TimeUnit.MINUTES))
    .apply(new HashtagCounter());

Event Time
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Windows are formed by the timestamps carried in the tweets
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Tweet> text = env.addSource(new TwitterSrc());
// Tell Flink how to read the event timestamp of each tweet
text = text.assignTimestamps(new MyTimestampExtractor());
DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    .timeWindow(Time.of(5, TimeUnit.MINUTES))
    .apply(new HashtagCounter());

Fault tolerance in streaming
Fault-tolerance in streaming systems is inherently harder than in batch
• Can’t just restart computation
• State is a problem
• Fast recovery is crucial
• Streaming topologies run 24/7 for a long period
Fault-tolerance is a complex issue
• No single point of failure is allowed
• Guaranteeing input processing
• Consistent operator state
• Fast recovery
• At-least-once vs Exactly-once semantics

Consistency - Flink distributed snapshots
Based on consistent global snapshots
Algorithm designed for stateful dataflows (minimal runtime
overhead)
Exactly-once semantics
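Why a consistent snapshot yields exactly-once state updates can be shown with a single operator in plain Java. This is only a sketch under strong assumptions: one operator, a replayable input array, and hypothetical names (`CheckpointedCounter`, `Snapshot`). Flink's distributed snapshot algorithm coordinates many parallel operators and is considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-operator illustration; not Flink's distributed algorithm. */
final class CheckpointedCounter {
    private Map<String, Integer> counts = new HashMap<>();
    private int offset = 0; // position in the replayable input

    /** A consistent snapshot: the state together with the input position it reflects. */
    static final class Snapshot {
        final Map<String, Integer> counts;
        final int offset;

        Snapshot(Map<String, Integer> counts, int offset) {
            this.counts = new HashMap<>(counts); // defensive copy
            this.offset = offset;
        }
    }

    /** Processes input elements from the current offset up to (excluding) end. */
    void processUpTo(String[] input, int end) {
        while (offset < end) {
            counts.merge(input[offset], 1, Integer::sum);
            offset++;
        }
    }

    Snapshot snapshot() {
        return new Snapshot(counts, offset);
    }

    /** Rolls back both the state and the input position to the snapshot. */
    void restore(Snapshot s) {
        counts = new HashMap<>(s.counts);
        offset = s.offset;
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"a", "b", "a", "c", "a"};
        CheckpointedCounter op = new CheckpointedCounter();
        op.processUpTo(input, 2);
        Snapshot checkpoint = op.snapshot();
        op.processUpTo(input, 4);            // work done after the checkpoint...
        op.restore(checkpoint);              // ...is discarded when a failure occurs
        op.processUpTo(input, input.length); // replay from the checkpointed offset
        System.out.println(op.counts());
    }
}
```

Because state and input position are rolled back together, every element affects the final counts exactly once, even though some elements are processed twice.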

Stateful streaming applications
ETL style operations
Filter incoming data,
Log analysis
High throughput, connectors, at-least-once processing
Window aggregations
Trending tweets,
User sessions, Stream joins
Window abstractions
[Figure: several inputs feeding a Process/Enrich step]

Stateful streaming applications
Machine learning
Fitting trends to the evolving
stream, Stream clustering
Model state, cyclic flows
Pattern recognition
Fraud detection, Triggering signals
based on activity
Exactly-once processing

Statefulness in 0.9.1
Stateful dataflow operators (conceptually similar to Samza)
Two state access patterns
Local (Task) state
Partitioned (Key) state
Proper API integration
Java: OperatorState interface
Scala: mapWithState, flatMapWithState…
Exactly-once semantics by checkpointing

Stateful API
words.keyBy(x => x).mapWithState {
(word, count: Option[Int]) =>
{
val newCount = count.getOrElse(0) + 1
val output = (word, newCount)
(output, Some(newCount))
}
}
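The same per-key counting logic can be sketched in plain Java to show what partitioned (key) state amounts to conceptually. The class `KeyedWordCount` is a hypothetical stand-in: a `HashMap` plays the role of the state that Flink would keep and checkpoint per key.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for partitioned (key) state: one counter per key. */
final class KeyedWordCount {
    // In Flink this state would be managed and checkpointed by the system.
    private final Map<String, Integer> stateByKey = new HashMap<>();

    /** Reads the state for the key, computes the output, and writes back the new state. */
    String map(String word) {
        int newCount = stateByKey.getOrDefault(word, 0) + 1;
        stateByKey.put(word, newCount);
        return word + "," + newCount;
    }

    public static void main(String[] args) {
        KeyedWordCount op = new KeyedWordCount();
        for (String w : new String[] {"flink", "storm", "flink"}) {
            System.out.println(op.map(w));
        }
    }
}
```

Each key sees only its own counter, which is what allows the state to be split across parallel instances by key.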

Local state example (Java)
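public class MySource extends RichParallelSourceFunction<Long> {
    // Omitted details (for example, obtaining the state handle)
    private OperatorState<Long> offset;
    private volatile boolean isRunning;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        Object checkpointLock = ctx.getCheckpointLock();
        isRunning = true;
        while (isRunning) {
            // Hold the checkpoint lock so that the state update and the
            // emitted element are never split by a checkpoint
            synchronized (checkpointLock) {
                offset.update(offset.value() + 1);
                //ctx.collect(next);
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}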

Statefulness in 0.10
Internal operators are checkpointed
Aggregations
Window operators
…
KeyValue state
Easing common access patterns
Flexible state backend interface
Removes non-partitioned operator state

Summary - Streaming in Flink 0.10
• Operational readiness
High Availability
Monitoring
Integration with other systems
• First-class support for event-time
• Hardened statefulness support
• Redefined API

Thanks for the slides
• Material borrowed from:
flink.apache.org
Stephan Ewen
Aljoscha Krettek
Gyula Fóra
