Distributed stream processing is one of the hot topics in big data analytics today. A growing number of applications are shifting from traditional static data sources to processing incoming data in real time. Large-scale stream processing and analysis require specialized tools and techniques, which have become publicly available only in the last couple of years.
This talk gives a deep technical overview of the top-level Apache stream processing landscape. We compare several frameworks, including Spark, Storm, Samza and Flink. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner, to help select the best tool for a specific application. We touch on API expressivity, runtime architecture, performance, fault tolerance and strong use cases for the individual frameworks.
2. This talk
§ Stream processing by example
§ Open source stream processors
§ Runtime architecture and programming model
§ Counting words…
§ Fault tolerance and stateful processing
§ Closing
Apache: Big Data Europe, 2015-09-28
8. Apache Storm
§ Started in 2010, development driven by BackType, then Twitter
§ Pioneer in large-scale stream processing
§ Distributed dataflow abstraction (spouts & bolts)
9. Apache Flink
§ Started in 2008 as a research project (Stratosphere) at European universities
§ Unique combination of low-latency streaming and high-throughput batch analysis
§ Flexible operator states and windowing
[Figure: Flink with stream data sources (Kafka, RabbitMQ, ...) and batch data sources (HDFS, JDBC, ...)]
10. Apache Spark
§ Started in 2009 at UC Berkeley, Apache since 2013
§ Very strong community, wide adoption
§ Unified batch and stream processing over a batch runtime
§ Good integration with batch programs
11. Apache Samza
§ Developed at LinkedIn, open sourced in 2013
§ Builds heavily on Kafka’s log-based philosophy
§ Pluggable messaging system and execution backend
12. System comparison
                    Storm               Spark               Samza               Flink
Streaming model     Native              Micro-batching      Native              Native
API                 Compositional       Declarative         Compositional       Declarative
Fault tolerance     Record ACKs         RDD-based           Log-based           Checkpoints
Guarantee           At-least-once       Exactly-once        At-least-once       Exactly-once
State               Only in Trident     State as DStream    Stateful operators  Stateful operators
Windowing           Not built-in        Time based          Not built-in        Policy based
Latency             Very low            Medium              Low                 Low
Throughput          Medium              High                High                High
15. Distributed dataflow runtime
§ Storm, Samza and Flink
§ General properties
• Long-standing operators
• Pipelined execution
• Usually possible to create cyclic flows
§ Pros
• Full expressivity
• Low-latency execution
• Stateful operators
§ Cons
• Fault tolerance is hard
• Throughput may suffer
• Load balancing is an issue
16. Distributed dataflow runtime
§ Storm
• Dynamic typing + Kryo
• Dynamic topology rebalancing
§ Samza
• Almost every component pluggable
• Full task isolation, no backpressure (buffering handled by the messaging layer)
§ Flink
• Strongly typed streams + custom serializers
• Flow control mechanism
• Memory management
18. Micro-batch runtime
§ Implemented by Apache Spark
§ General properties
• Computation broken down to time intervals
• Load-aware scheduling
• Easy interaction with batch
§ Pros
• Easy to reason about
• High throughput
• Fault tolerance comes for “free”
• Dynamic load balancing
§ Cons
• Latency depends on batch size
• Limited expressivity
• Stateless by nature
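The link between batch interval and latency can be illustrated with a framework-free sketch in plain Java. It does not use Spark; the class `MicroBatchDemo` and both method names are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration of micro-batching; not Spark code.
class MicroBatchDemo {
    // Groups record timestamps (ms) into fixed intervals, keyed by interval start.
    static SortedMap<Long, List<Long>> discretize(List<Long> timestamps, long intervalMs) {
        SortedMap<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            long start = (ts / intervalMs) * intervalMs;
            batches.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    // A batch can only be processed once its interval has closed,
    // so every record waits at least this long before it is computed on.
    static long waitUntilBatchCloses(long ts, long intervalMs) {
        return (ts / intervalMs + 1) * intervalMs - ts;
    }

    public static void main(String[] args) {
        List<Long> arrivals = Arrays.asList(100L, 450L, 900L, 1200L);
        System.out.println(discretize(arrivals, 1000L));      // {0=[100, 450, 900], 1000=[1200]}
        System.out.println(waitUntilBatchCloses(100L, 1000L)); // 900
        System.out.println(waitUntilBatchCloses(100L, 200L));  // 100
    }
}
```

Shrinking the interval lowers the minimum wait, at the price of scheduling more, smaller batches.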
19. Programming model
Declarative
§ Expose a high-level API
§ Operators are higher-order functions on abstract data stream types
§ Advanced behavior such as windowing is supported
§ Query optimization
Compositional
§ Offer basic building blocks for composing custom operators and topologies
§ Advanced behavior such as windowing is often missing
§ Topology needs to be hand-optimized
20. Programming model
Declarative: DStream, DataStream
§ Transformations abstract operator details
§ Suitable for engineers and data analysts
Compositional: Spout, Consumer, Bolt, Task, Topology
§ Direct access to the execution graph / topology
§ Suitable for engineers
22. WordCount
Input sentences:
storm budapest flink
apache storm spark
streaming samza storm
flink apache flink
bigdata storm
flink streaming
Resulting counts:
(storm, 4)
(budapest, 1)
(flink, 4)
(apache, 2)
(spark, 1)
(streaming, 2)
(samza, 1)
(bigdata, 1)
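Before looking at the framework versions, the same rolling word count can be sketched in plain Java with no streaming framework at all. The class `RollingWordCount` and its methods are hypothetical names used only for this illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Framework-free sketch of a rolling word count (hypothetical helper class).
class RollingWordCount {
    // Running counts, kept in first-seen order for readable output.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Consumes one sentence and returns the updated (word, count) pair for each word,
    // mirroring how a streaming job emits a new result per incoming record.
    List<Map.Entry<String, Integer>> process(String sentence) {
        List<Map.Entry<String, Integer>> updates = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int count = counts.merge(word, 1, Integer::sum);
            updates.add(new AbstractMap.SimpleImmutableEntry<>(word, count));
        }
        return updates;
    }

    // Copy of the current counts.
    Map<String, Integer> snapshot() {
        return new LinkedHashMap<>(counts);
    }

    public static void main(String[] args) {
        RollingWordCount wc = new RollingWordCount();
        String[] input = {
            "storm budapest flink", "apache storm spark", "streaming samza storm",
            "flink apache flink", "bigdata storm", "flink streaming"
        };
        for (String line : input) {
            wc.process(line);
        }
        // {storm=4, budapest=1, flink=4, apache=2, spark=1, streaming=2, samza=1, bigdata=1}
        System.out.println(wc.snapshot());
    }
}
```

The distributed versions that follow have to solve what this sketch ignores: partitioning words across tasks and keeping `counts` safe across failures.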
23. Storm
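Assembling the topology

    // Storm classes live in org.apache.storm.* (backtype.storm.* before Storm 1.0).
    // Wire a sentence source to a splitter and a counter (parallelism hints: 5, 8, 12).
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new SentenceSpout(), 5);
    // Sentences are distributed randomly across the splitter tasks.
    builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
    // Grouping by "word" routes every occurrence of a word to the same counter task.
    builder.setBolt("count", new Counter(), 12)
           .fieldsGrouping("split", new Fields("word"));

Rolling word count bolt

    public class Counter extends BaseBasicBolt {
      // Per-task, in-memory counts; they are lost if the task fails.
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      @Override
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
      }

      // Required by BaseBasicBolt; names the two emitted fields.
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
      }
    }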
24. Samza
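Rolling word count task

    public class WordCountTask implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      // Fetch the job's key-value store (Samza 0.x-style init; "wordcount" is an assumed store name).
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        store = (KeyValueStore<String, Integer>) context.getStore("wordcount");
      }

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String word = (String) envelope.getMessage();
        Integer count = store.get(word);
        if (count == null) { count = 0; }
        count = count + 1;
        store.put(word, count);
        // Tuple2 is the pair type used on the slide; it is not part of Samza.
        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("kafka", "wc"), Tuple2.of(word, count)));
      }
    }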
25. Flink
    case class Word(word: String, frequency: Int)

Rolling word count

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

Window word count

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      // 5-second windows, evaluated every second
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
      .groupBy("word").sum("frequency")
      .print()
28. Fault tolerance intro
§ Fault-tolerance in streaming systems is inherently harder than in batch
• Can’t just restart computation
• State is a problem
• Fast recovery is crucial
• Streaming topologies run 24/7 for a long period
§ Fault-tolerance is a complex issue
• No single point of failure is allowed
• Guaranteeing input processing
• Consistent operator state
• Fast recovery
• At-least-once vs Exactly-once semantics
29. Storm record acknowledgements
§ Track the lineage of tuples as they are processed (anchors and acks)
§ Special “acker” bolts track each lineage DAG (efficient XOR-based algorithm)
§ Replay the root of failed (or timed out) tuples
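The XOR idea behind the acker can be sketched in a few lines of plain Java. This is an illustration of the principle only, not Storm's implementation: the class `XorAckTracker` is hypothetical, and timeouts, replay and message routing are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of XOR-based ack tracking (hypothetical class, not Storm's API).
class XorAckTracker {
    // Root tuple id -> XOR of the ids of all tuples in its tree that are still pending.
    private final Map<Long, Long> pending = new HashMap<>();

    // XORs a value into the entry for rootId. Emitting a tuple XORs its id in once;
    // acking it XORs the same id in again, which cancels it out.
    // Returns true when the entry reaches zero, i.e. every emitted tuple has been acked.
    boolean update(long rootId, long xorValue) {
        long value = pending.merge(rootId, xorValue, (a, b) -> a ^ b);
        if (value == 0L) {
            pending.remove(rootId);
            return true;
        }
        return false;
    }

    boolean isPending(long rootId) {
        return pending.containsKey(rootId);
    }

    public static void main(String[] args) {
        XorAckTracker acker = new XorAckTracker();
        // Fixed ids for readability; real ids would be random 64-bit values.
        long root = 0x1L, child1 = 0x2L, child2 = 0x4L;
        acker.update(root, root);                   // spout emits the root tuple
        acker.update(root, root ^ child1 ^ child2); // splitter acks the root and emits two children
        acker.update(root, child1);                 // counter acks child 1
        boolean done = acker.update(root, child2);  // counter acks child 2
        System.out.println("tree fully processed: " + done); // tree fully processed: true
    }
}
```

The tracker keeps a single 64-bit value per root tuple, no matter how large the tuple tree grows, which is what makes the approach cheap.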
30. Samza offset tracking
§ Exploits the properties of a durable, offset-based messaging layer
§ Each task maintains its current offset, which moves forward as it processes elements
§ The offset is checkpointed and restored on failure (some messages might be repeated)
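Why checkpointed offsets give at-least-once rather than exactly-once delivery can be seen in a small simulation. The sketch below is plain Java and does not use Samza or Kafka; `OffsetTrackingTask`, its fields and the simulated crash are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of offset checkpointing; not Samza code.
class OffsetTrackingTask {
    private final List<String> log;       // stands in for a durable, offset-based partition
    private final int checkpointInterval; // checkpoint after this many messages
    private int checkpointedOffset = 0;   // the only position that survives a failure
    final List<String> processed = new ArrayList<>(); // effects visible downstream

    OffsetTrackingTask(List<String> log, int checkpointInterval) {
        this.log = log;
        this.checkpointInterval = checkpointInterval;
    }

    // Resumes from the last checkpoint. Simulates a crash after crashAfter messages
    // (a negative value means no crash). Returns true if the end of the log was reached.
    boolean runFromCheckpoint(int crashAfter) {
        int offset = checkpointedOffset;
        int handled = 0;
        while (offset < log.size()) {
            if (handled == crashAfter) {
                return false; // the in-memory offset is lost here
            }
            processed.add(log.get(offset));
            offset++;
            handled++;
            if (offset % checkpointInterval == 0) {
                checkpointedOffset = offset;
            }
        }
        checkpointedOffset = offset;
        return true;
    }

    public static void main(String[] args) {
        OffsetTrackingTask task = new OffsetTrackingTask(Arrays.asList("a", "b", "c", "d", "e"), 2);
        task.runFromCheckpoint(3);  // crashes after a, b, c; last checkpoint is at offset 2
        task.runFromCheckpoint(-1); // restarts at offset 2 and replays c
        System.out.println(task.processed); // [a, b, c, c, d, e]
    }
}
```

Message `c` is handled twice because it was processed after the last checkpoint but before the crash.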
31. Flink checkpointing
§ Based on consistent global snapshots
§ Algorithm designed for stateful dataflows (minimal runtime overhead)
§ Exactly-once semantics
32. Spark RDD recomputation
§ Immutable data model with repeatable computation
§ Failed RDDs are recomputed using their lineage
§ Checkpoint RDDs to reduce lineage length
§ Parallel recovery of failed RDDs
§ Exactly-once semantics
33. State in streaming programs
§ Almost all non-trivial streaming programs are stateful
§ Stateful operators (in essence): $f : (\mathit{in}, \mathit{state}) \longrightarrow (\mathit{out}, \mathit{state}')$
§ State hangs around and can be read and modified as the stream evolves
§ Goal: get as close as possible to this model while maintaining scalability and fault tolerance
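The stateful operator signature can be written down directly as a small Java interface. This is a single-threaded sketch with hypothetical names (`StatefulOperatorDemo`, `StatefulFunction`, `Result`); it ignores partitioning and fault tolerance, which are exactly the hard parts the frameworks address.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a stateful operator; not tied to any framework.
class StatefulOperatorDemo {
    // The pair (out, state') returned for each input.
    static final class Result<O, S> {
        final O out;
        final S state;

        Result(O out, S state) {
            this.out = out;
            this.state = state;
        }
    }

    // f : (in, state) -> (out, state')
    interface StatefulFunction<I, S, O> {
        Result<O, S> apply(I in, S state);
    }

    // Feeds the stream through f, threading the state from one record to the next.
    static <I, S, O> List<O> run(List<I> stream, S initial, StatefulFunction<I, S, O> f) {
        List<O> outputs = new ArrayList<>();
        S state = initial;
        for (I in : stream) {
            Result<O, S> result = f.apply(in, state);
            outputs.add(result.out);
            state = result.state;
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Running sum: the output and the new state are both the sum so far.
        StatefulFunction<Integer, Integer, Integer> runningSum =
            (in, sum) -> new Result<>(sum + in, sum + in);
        System.out.println(run(Arrays.asList(3, 1, 4), 0, runningSum)); // [3, 4, 8]
    }
}
```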
34. Storm
§ States available only in Trident API
§ Dedicated operators for state updates and queries
§ State access methods
• stateQuery(…)
• partitionPersist(…)
• persistentAggregate(…)
§ It’s very difficult to implement transactional states
§ Exactly-once guarantee
35. Spark
§ Stateless runtime by design
• No continuous operators
• UDFs are assumed to be stateless
§ State can be generated as a separate stream of RDDs: updateStateByKey(…)
$f : (\mathrm{Seq}[\mathit{in}_k], \mathit{state}_k) \longrightarrow \mathit{state}'_k$
§ $f$ is scoped to a specific key
§ Exactly-once semantics
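The per-key update pattern can be imitated in plain Java to show its shape. The sketch below is not Spark code: `UpdateStateByKeyDemo` and its methods are hypothetical, and it simplifies the real operator (for example, it offers no way to drop a key). Each call takes one batch plus the previous state and returns a new state map, in the spirit of producing state as a new RDD per interval.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical plain-Java imitation of a per-key state update; not Spark's API.
class UpdateStateByKeyDemo {
    // Applies f : (Seq[in_k], state_k) -> state'_k for every key that has new values
    // or existing state, and returns the new state without mutating the old one.
    static <K, V, S> Map<K, S> updateStateByKey(
            List<Map.Entry<K, V>> batch,
            Map<K, S> previous,
            BiFunction<List<V>, Optional<S>, S> update) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> record : batch) {
            grouped.computeIfAbsent(record.getKey(), k -> new ArrayList<>()).add(record.getValue());
        }
        Set<K> keys = new LinkedHashSet<>(previous.keySet());
        keys.addAll(grouped.keySet());
        Map<K, S> next = new LinkedHashMap<>();
        for (K key : keys) {
            List<V> values = grouped.getOrDefault(key, Collections.emptyList());
            next.put(key, update.apply(values, Optional.ofNullable(previous.get(key))));
        }
        return next;
    }

    // Turns words into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> ones(String... words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(new AbstractMap.SimpleImmutableEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        BiFunction<List<Integer>, Optional<Integer>, Integer> addUp = (values, state) -> {
            int sum = state.orElse(0);
            for (int v : values) {
                sum += v;
            }
            return sum;
        };
        Map<String, Integer> empty = new LinkedHashMap<>();
        Map<String, Integer> s1 = updateStateByKey(ones("storm", "flink", "storm"), empty, addUp);
        Map<String, Integer> s2 = updateStateByKey(ones("flink"), s1, addUp);
        System.out.println(s1); // {storm=2, flink=1}
        System.out.println(s2); // {storm=2, flink=2}
    }
}
```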
36. Samza
§ Stateful dataflow operators (any task can hold state)
§ State changes are stored as a log by Kafka
§ Custom storage engines can be plugged in to the log
§ $f$ is scoped to a specific task
§ At-least-once processing semantics
37. Flink
§ Stateful dataflow operators (conceptually similar to Samza)
§ Two state access patterns
• Local (Task) state
• Partitioned (Key) state
§ Proper API integration
• Java: OperatorState interface
• Scala: mapWithState, flatMapWithState…
§ Exactly-once semantics by checkpointing
38. Performance
§ Throughput/Latency
• The cost of a network hop is 25+ msecs
• 1 million records/sec/core is nice
§ Size of Network Buffers/Batching
§ Buffer Timeout
§ Cost of Fault Tolerance
§ Operator chaining/Stages
§ Serialization/Types
40. Comparison revisited
                    Storm               Spark               Samza               Flink
Streaming model     Native              Micro-batching      Native              Native
API                 Compositional       Declarative         Compositional       Declarative
Fault tolerance     Record ACKs         RDD-based           Log-based           Checkpoints
Guarantee           At-least-once       Exactly-once        At-least-once       Exactly-once
State               Only in Trident     State as DStream    Stateful operators  Stateful operators
Windowing           Not built-in        Time based          Not built-in        Policy based
Latency             Very low            Medium              Low                 Low
Throughput          Medium              High                High                High
41. Summary
§ Streaming applications and stream processors are very diverse
§ 2 main runtime designs
• Dataflow based (Storm, Samza, Flink)
• Micro-batch based (Spark)
§ The best framework varies based on application-specific needs
§ But high-level APIs are nice :)