Large-Scale Stream Processing
in the Hadoop Ecosystem
Gyula Fóra
gyfora@apache.org
Márton Balassi
mbalassi@apache.org
This talk
§ Stream processing by example
§ Open source stream processors
§ Runtime architecture and programming model
§ Counting words…
§ Fault tolerance and stateful processing
§ Closing
Apache: Big Data Europe, 2015-09-28
Stream processing
by example
Streaming applications
ETL style operations
• Filter incoming data, log analysis
• High throughput, connectors, at-least-once processing
Window aggregations
• Trending tweets, user sessions, stream joins
• Window abstractions
[Figure: several Input streams feeding a Process/Enrich stage]
Streaming applications
Machine learning
• Fitting trends to the evolving stream, stream clustering
• Model state, cyclic flows
Pattern recognition
• Fraud detection, triggering signals based on activity
• Exactly-once processing
Open source stream
processors
Apache Streaming landscape
Apache Storm
§ Started in 2010, development driven by
BackType, then Twitter
§ Pioneer in large scale stream processing
§ Distributed dataflow abstraction (spouts &
bolts)
Apache Flink
§ Started in 2008 as a research project
(Stratosphere) at European universities
§ Unique combination of low latency streaming
and high throughput batch analysis
§ Flexible operator states and windowing
[Figure: stream data sources (Kafka, RabbitMQ, ...) and batch data sources (HDFS, JDBC, ...)]
Apache Spark
§ Started in 2009 at UC Berkeley, Apache since 2013
§ Very strong community, wide adoption
§ Unified batch and stream processing over a
batch runtime
§ Good integration with batch programs
Apache Samza
§ Developed at LinkedIn, open sourced in 2013
§ Builds heavily on Kafka’s log based philosophy
§ Pluggable messaging system and execution
backend
System comparison
                 Storm               Spark               Samza               Flink
Streaming model  Native              Micro-batching      Native              Native
API              Compositional       Declarative         Compositional       Declarative
Fault tolerance  Record ACKs         RDD-based           Log-based           Checkpoints
Guarantee        At-least-once       Exactly-once        At-least-once       Exactly-once
State            Only in Trident     State as DStream    Stateful operators  Stateful operators
Windowing        Not built-in        Time based          Not built-in        Policy based
Latency          Very low            Medium              Low                 Low
Throughput       Medium              High                High                High
Runtime and
programming model
Native Streaming
Distributed dataflow runtime
§ Storm, Samza and Flink
§ General properties
• Long standing operators
• Pipelined execution
• Usually possible to create
cyclic flows
Pros
• Full expressivity
• Low-latency execution
• Stateful operators
Cons
• Fault-tolerance is hard
• Throughput may suffer
• Load balancing is an
issue
Distributed dataflow runtime
§ Storm
• Dynamic typing + Kryo
• Dynamic topology rebalancing
§ Samza
• Almost every component pluggable
• Full task isolation, no backpressure (buffering
handled by the messaging layer)
§ Flink
• Strongly typed streams + custom serializers
• Flow control mechanism
• Memory management
Micro-batching
Micro-batch runtime
§ Implemented by Apache Spark
§ General properties
• Computation broken down
to time intervals
• Load aware scheduling
• Easy interaction with batch
Pros
• Easy to reason about
• High-throughput
• FT comes for “free”
• Dynamic load balancing
Cons
• Latency depends on
batch size
• Limited expressivity
• Stateless by nature
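The idea of breaking a computation down into time intervals, and the resulting dependence of latency on batch size, can be illustrated in plain Java. This is only a sketch: `MicroBatcher` and its methods are hypothetical names, and it assumes non-negative millisecond timestamps. It is not how Spark schedules batches internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of discretizing a stream into micro-batches (all names are illustrative).
class MicroBatcher {
    // Groups non-negative event timestamps (ms) into consecutive intervals.
    // Key: interval start; value: timestamps in [start, start + intervalMs).
    static Map<Long, List<Long>> discretize(List<Long> timestampsMs, long intervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestampsMs) {
            long start = (ts / intervalMs) * intervalMs;
            batches.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    // A record is processed only after its interval closes, so it waits
    // at least this long before any computation on it can start.
    static long delayUntilBatchCloses(long ts, long intervalMs) {
        return (ts / intervalMs + 1) * intervalMs - ts;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(100L, 950L, 1000L, 2400L);
        System.out.println(discretize(events, 1000L));
        System.out.println(delayUntilBatchCloses(100L, 1000L));
    }
}
```

Each batch can then be handed to an ordinary batch job, which is why interaction with batch programs is easy, while a larger interval directly increases the waiting time per record.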
Programming model
Declarative
§ Expose a high-level API
§ Operators are higher order
functions on abstract data
stream types
§ Advanced behavior such as
windowing is supported
§ Query optimization
Compositional
§ Offer basic building blocks
for composing custom
operators and topologies
§ Advanced behavior such as
windowing is often missing
§ Topology needs to be hand-optimized
Programming model
DStream, DataStream
§ Transformations abstract operator details
§ Suitable for engineers and data analysts
Spout, Consumer, Bolt, Task, Topology
§ Direct access to the execution graph / topology
§ Suitable for engineers
Counting words…
WordCount
storm budapest flink
apache storm spark
streaming samza storm
flink apache flink
bigdata storm
flink streaming
(storm, 4)
(budapest, 1)
(flink, 4)
(apache, 2)
(spark, 1)
(streaming, 2)
(samza, 1)
(bigdata, 1)
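The counts above can be reproduced without any streaming framework. The following is a minimal plain-Java sketch of a rolling word count; `RollingWordCount` and its methods are hypothetical names chosen for illustration and do not belong to any of the systems compared in this talk.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Framework-free sketch of a rolling word count (illustrative names only).
class RollingWordCount {
    // Running counts; LinkedHashMap keeps first-seen order for readable output.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Consumes one line and returns the updated (word, count) pair for every
    // word in it, mirroring what a rolling count emits downstream.
    List<Map.Entry<String, Integer>> process(String line) {
        List<Map.Entry<String, Integer>> updates = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            int count = counts.merge(word, 1, Integer::sum);
            updates.add(new AbstractMap.SimpleImmutableEntry<>(word, count));
        }
        return updates;
    }

    // Copy of the current totals.
    Map<String, Integer> snapshot() {
        return new LinkedHashMap<>(counts);
    }

    public static void main(String[] args) {
        RollingWordCount wc = new RollingWordCount();
        String[] lines = {
            "storm budapest flink", "apache storm spark", "streaming samza storm",
            "flink apache flink", "bigdata storm", "flink streaming"
        };
        for (String line : lines) wc.process(line);
        System.out.println(wc.snapshot());
    }
}
```

A rolling count emits an update for every incoming word, so downstream consumers see intermediate values such as (storm, 1), (storm, 2) before the totals listed above.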
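Storm
Assembling the topology
    TopologyBuilder builder = new TopologyBuilder();
    // 5 spout tasks emit sentences
    builder.setSpout("spout", new SentenceSpout(), 5);
    // 8 splitter tasks; shuffle grouping distributes sentences randomly
    builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
    // 12 counter tasks; fields grouping sends each word to the same task
    builder.setBolt("count", new Counter(), 12)
        .fieldsGrouping("split", new Fields("word"));
Rolling word count bolt
    // Needs java.util.HashMap, java.util.Map and the Storm classes used below
    // (their package prefix depends on the Storm version).
    public class Counter extends BaseBasicBolt {
        // In-memory counts, local to this task
        Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        // Names the fields of the emitted tuples
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }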
Samza
Rolling word count task
    public class WordCountTask implements StreamTask {
        // Assumed to be obtained from the task context during initialization (not shown on the slide)
        private KeyValueStore<String, Integer> store;

        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            // getMessage() returns Object, so the word must be cast
            String word = (String) envelope.getMessage();
            Integer count = store.get(word);
            if (count == null) { count = 0; }
            count = count + 1;
            store.put(word, count);
            // Send the updated count, keyed by the word, to the "wc" stream of the "kafka" system
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "wc"), word, count));
        }
    }
Flink
Rolling word count
    case class Word(word: String, frequency: Int)

    val lines: DataStream[String] = env.fromSocketStream(...)
    // Split each line into words, then keep a running sum per word
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()
Window word count
    case class Word(word: String, frequency: Int)

    val lines: DataStream[String] = env.fromSocketStream(...)
    // Count over the last 5 seconds, evaluated every second
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
      .groupBy("word").sum("frequency")
      .print()
Spark
Window word count
Rolling word count (kind of)
Fault tolerance and
stateful processing
Fault tolerance intro
§ Fault-tolerance in streaming systems is
inherently harder than in batch
• Can’t just restart computation
• State is a problem
• Fast recovery is crucial
• Streaming topologies run 24/7 for a long period
§ Fault-tolerance is a complex issue
• No single point of failure is allowed
• Guaranteeing input processing
• Consistent operator state
• Fast recovery
• At-least-once vs Exactly-once semantics
Storm record acknowledgements
§ Track the lineage of tuples as they are
processed (anchors and acks)
§ Special “acker” bolts track each lineage
DAG (efficient xor based algorithm)
§ Replay the root of failed (or timed out)
tuples
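The XOR-based tracking can be sketched in a few lines of Java. This is a simplified illustration, not Storm's implementation: `XorAcker` is a hypothetical class, tuple ids are assumed to be unique 64-bit values, and the sketch ignores timeouts and replay. The key property is that every tuple id is XORed into the tracked value exactly twice (once when anchored, once when acked), so the value returns to zero only when the whole tuple tree has been processed.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of XOR-based lineage tracking (illustrative, not Storm's actual classes).
class XorAcker {
    // One 64-bit "ack value" per root tuple, regardless of the size of the tuple tree.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // Called when a tuple is created (anchored) in the tree rooted at rootId.
    void anchor(long rootId, long tupleId) {
        xor(rootId, tupleId);
    }

    // Called when that tuple has been fully processed.
    void ack(long rootId, long tupleId) {
        xor(rootId, tupleId);
    }

    private void xor(long rootId, long tupleId) {
        ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // The tree is complete once every anchored id has also been acked: x ^ x == 0.
    boolean isComplete(long rootId) {
        Long value = ackVals.get(rootId);
        return value != null && value == 0L;
    }

    public static void main(String[] args) {
        XorAcker acker = new XorAcker();
        acker.anchor(1L, 0xAL);
        acker.anchor(1L, 0xBL);
        acker.ack(1L, 0xAL);
        System.out.println(acker.isComplete(1L)); // one tuple is still pending
        acker.ack(1L, 0xBL);
        System.out.println(acker.isComplete(1L));
    }
}
```

The memory cost per root tuple is constant, which is what makes the approach efficient for large tuple trees. If the value does not reach zero before a timeout, the root tuple would be replayed.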
Samza offset tracking
§ Exploits the properties of a durable, offset
based messaging layer
§ Each task maintains its current offset, which
moves forward as it processes elements
§ The offset is checkpointed and restored on
failure (some messages might be repeated)
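Why offset checkpointing yields at-least-once rather than exactly-once behavior can be seen in a small simulation. The sketch below is plain Java with hypothetical names (`OffsetTrackingTask`), and it models the durable messaging layer as an in-memory list; it does not use the Samza API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of offset checkpointing over a durable, offset-addressable log (illustrative only).
class OffsetTrackingTask {
    private final List<String> log;        // stands in for the durable messaging layer
    private int offset = 0;                // next message to process
    private int checkpointedOffset = 0;    // last offset that was durably saved
    final List<String> processed = new ArrayList<>(); // effects visible downstream

    OffsetTrackingTask(List<String> log) {
        this.log = log;
    }

    boolean hasNext() {
        return offset < log.size();
    }

    void processNext() {
        processed.add(log.get(offset));
        offset++;
    }

    void checkpoint() {
        checkpointedOffset = offset;
    }

    // After a failure the task resumes from the last checkpoint, so messages
    // processed after that checkpoint are seen again.
    void failAndRestore() {
        offset = checkpointedOffset;
    }

    public static void main(String[] args) {
        OffsetTrackingTask task = new OffsetTrackingTask(Arrays.asList("a", "b", "c", "d"));
        task.processNext();
        task.processNext();
        task.checkpoint();
        task.processNext();      // "c" is processed but not yet checkpointed
        task.failAndRestore();
        while (task.hasNext()) task.processNext();
        System.out.println(task.processed); // "c" appears twice
    }
}
```

No message is lost, because the log is durable and the offset never moves past unprocessed input, but anything processed between the last checkpoint and the failure is repeated.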
Flink checkpointing
§ Based on consistent global snapshots
§ Algorithm designed for stateful dataflows
(minimal runtime overhead)
§ Exactly-once semantics
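The essence of snapshot-based recovery is that operator state and input position are captured together and rolled back together. The sketch below shows this for a single source and a single counting operator; it is a deliberately simplified assumption with hypothetical names (`CheckpointedCounter`) and does not show how Flink coordinates a consistent snapshot across a distributed dataflow.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a snapshot pairs operator state with the input position (illustrative only).
class CheckpointedCounter {
    private final List<String> input;      // replayable input
    private int position = 0;
    private Map<String, Integer> counts = new HashMap<>();

    // Last completed snapshot: position and state taken at the same point in the stream.
    private int snapshotPosition = 0;
    private Map<String, Integer> snapshotCounts = new HashMap<>();

    CheckpointedCounter(List<String> input) {
        this.input = input;
    }

    boolean hasNext() {
        return position < input.size();
    }

    void processNext() {
        counts.merge(input.get(position), 1, Integer::sum);
        position++;
    }

    void snapshot() {
        snapshotPosition = position;
        snapshotCounts = new HashMap<>(counts);
    }

    // Both the state and the input position are rolled back, so replayed
    // records are not counted twice.
    void failAndRestore() {
        position = snapshotPosition;
        counts = new HashMap<>(snapshotCounts);
    }

    Map<String, Integer> counts() {
        return new HashMap<>(counts);
    }

    public static void main(String[] args) {
        CheckpointedCounter op = new CheckpointedCounter(Arrays.asList("a", "b", "a", "c"));
        op.processNext();
        op.processNext();
        op.snapshot();
        op.processNext();       // second "a" is counted, then lost with the failure
        op.failAndRestore();
        while (op.hasNext()) op.processNext();
        System.out.println(op.counts());
    }
}
```

After recovery the state reflects every record exactly once, even though the second "a" was physically processed twice.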
Spark RDD recomputation
§ Immutable data model with
repeatable computation
§ Failed RDDs are recomputed
using their lineage
§ Checkpoint RDDs to reduce
lineage length
§ Parallel recovery of failed
RDDs
§ Exactly-once semantics
State in streaming programs
§ Almost all non-trivial streaming programs are
stateful
§ Stateful operators (in essence):
$f : (\mathit{in}, \mathit{state}) \longrightarrow (\mathit{out}, \mathit{state}')$
§ State hangs around and can be read and
modified as the stream evolves
§ Goal: get as close as possible to this model while maintaining scalability and fault-tolerance
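The stateful operator signature can be written down directly as a Java interface. The sketch below is an illustration under assumed names (`Step`, `StatefulOperator`, `StatefulOperatorDemo`); it runs on a single thread and says nothing about how a real system distributes or persists the state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Result of one invocation: the emitted output and the successor state.
final class Step<O, S> {
    final O out;
    final S state;

    Step(O out, S state) {
        this.out = out;
        this.state = state;
    }
}

// f: (in, state) -> (out, state')
interface StatefulOperator<I, O, S> {
    Step<O, S> apply(I in, S state);
}

class StatefulOperatorDemo {
    // Threads the state through the stream: each call sees the state left by the previous one.
    static <I, O, S> List<O> run(StatefulOperator<I, O, S> f, S initial, List<I> stream) {
        List<O> outputs = new ArrayList<>();
        S state = initial;
        for (I in : stream) {
            Step<O, S> step = f.apply(in, state);
            outputs.add(step.out);
            state = step.state;
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Running sum: the state is the sum so far, the output is the updated sum.
        StatefulOperator<Integer, Integer, Integer> runningSum =
            (in, sum) -> new Step<>(sum + in, sum + in);
        System.out.println(run(runningSum, 0, Arrays.asList(3, 1, 4)));
    }
}
```

Keeping the operator a pure function of input and state is what makes it possible, in principle, to snapshot, partition, or restore the state independently of the operator logic.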
Storm
§ States available only in Trident API
§ Dedicated operators for state updates and
queries
§ State access methods
• stateQuery(…)
• partitionPersist(…)
• persistentAggregate(…)
§ It’s very difficult to
implement transactional
states
Exactly-once guarantee
Spark
§ Stateless runtime by design
• No continuous operators
• UDFs are assumed to be stateless
§ State can be generated as a separate
stream of RDDs: updateStateByKey(…)
$f : (\mathrm{Seq}[\mathit{in}_k], \mathit{state}_k) \longrightarrow \mathit{state}'_k$
§ 𝒇 is scoped to a specific key
§ Exactly-once semantics
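The per-key update function can be emulated in plain Java to show what one micro-batch step does. This is a sketch with hypothetical names (`UpdateStateByKeySketch`, `pair`), not the Spark API; it also omits the ability to drop a key's state.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;

// Sketch of per-key state updates across micro-batches (illustrative only).
class UpdateStateByKeySketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Produces the next state map from the previous one and one batch of records.
    // f receives all values of a key in this batch plus the key's previous state, if any.
    static <K, V, S> Map<K, S> updateStateByKey(
            Map<K, S> previous,
            List<Map.Entry<K, V>> batch,
            BiFunction<List<V>, Optional<S>, S> f) {
        Map<K, List<V>> grouped = new HashMap<>();
        for (Map.Entry<K, V> r : batch) {
            grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        // Keys that already have state are updated even if the batch holds no input for them.
        Set<K> keys = new HashSet<>(previous.keySet());
        keys.addAll(grouped.keySet());
        Map<K, S> next = new HashMap<>();
        for (K key : keys) {
            List<V> values = grouped.getOrDefault(key, Collections.emptyList());
            next.put(key, f.apply(values, Optional.ofNullable(previous.get(key))));
        }
        return next; // a new state map per batch; the previous one is left untouched
    }

    public static void main(String[] args) {
        BiFunction<List<Integer>, Optional<Integer>, Integer> sum = (values, state) -> {
            int total = state.orElse(0);
            for (int v : values) total += v;
            return total;
        };
        List<Map.Entry<String, Integer>> batch1 =
            Arrays.asList(pair("flink", 1), pair("storm", 1), pair("flink", 1));
        List<Map.Entry<String, Integer>> batch2 = Arrays.asList(pair("storm", 1));
        Map<String, Integer> s1 = updateStateByKey(new HashMap<String, Integer>(), batch1, sum);
        Map<String, Integer> s2 = updateStateByKey(s1, batch2, sum);
        System.out.println(s2);
    }
}
```

The state is never modified in place: each batch yields a fresh state map derived from the previous one, which matches the idea of state as a separate stream of immutable datasets.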
Samza
§ Stateful dataflow operators
(Any task can hold state)
§ State changes are stored
as a log by Kafka
§ Custom storage engines can
be plugged in to the log
§ 𝒇 is scoped to a specific task
§ At-least-once processing
semantics
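How a task-local store can be rebuilt from a log of its own changes is sketched below in plain Java. The names are hypothetical (`ChangelogBackedStore`), the durable log is modeled as an in-memory list, and a `HashMap` stands in for whichever storage engine is plugged in; none of this is the Samza API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a task-local store backed by a changelog (illustrative only).
class ChangelogBackedStore {
    private final List<Map.Entry<String, Integer>> changelog; // stands in for a durable log
    private final Map<String, Integer> local = new HashMap<>(); // pluggable local storage

    // On start, the local store is rebuilt by replaying every logged change in order.
    ChangelogBackedStore(List<Map.Entry<String, Integer>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Integer> change : changelog) {
            local.put(change.getKey(), change.getValue());
        }
    }

    // Every write goes to the local store and is appended to the changelog.
    void put(String key, Integer value) {
        local.put(key, value);
        changelog.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
    }

    Integer get(String key) {
        return local.get(key);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> log = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(log);
        store.put("flink", 1);
        store.put("flink", 2);
        store.put("storm", 1);
        // Simulated restart: a fresh store replays the same log.
        ChangelogBackedStore restored = new ChangelogBackedStore(log);
        System.out.println(restored.get("flink") + " " + restored.get("storm"));
    }
}
```

Reads stay local and fast, while durability comes from the log. Since later entries for a key overwrite earlier ones during replay, only the latest value per key matters for recovery.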
Flink
§ Stateful dataflow operators (conceptually
similar to Samza)
§ Two state access patterns
• Local (Task) state
• Partitioned (Key) state
§ Proper API integration
• Java: OperatorState interface
• Scala: mapWithState, flatMapWithState…
§ Exactly-once semantics by checkpointing
Performance
§ Throughput/Latency
• The cost of a network hop is 25+ ms
• 1 million records/sec/core is nice
§ Size of Network Buffers/Batching
§ Buffer Timeout
§ Cost of Fault Tolerance
§ Operator chaining/Stages
§ Serialization/Types
Closing
Comparison revisited
                 Storm               Spark               Samza               Flink
Streaming model  Native              Micro-batching      Native              Native
API              Compositional       Declarative         Compositional       Declarative
Fault tolerance  Record ACKs         RDD-based           Log-based           Checkpoints
Guarantee        At-least-once       Exactly-once        At-least-once       Exactly-once
State            Only in Trident     State as DStream    Stateful operators  Stateful operators
Windowing        Not built-in        Time based          Not built-in        Policy based
Latency          Very low            Medium              Low                 Low
Throughput       Medium              High                High                High
Summary
§ Streaming applications and stream
processors are very diverse
§ 2 main runtime designs
• Dataflow based (Storm, Samza, Flink)
• Micro-batch based (Spark)
§ The best framework varies based on
application specific needs
§ But high-level APIs are nice :)
Thank you!
List of Figures (in order of usage)
§ https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/CPT-FSM-abcd.svg/326px-CPT-FSM-abcd.svg.png
§ https://storm.apache.org/images/topology.png
§ https://databricks.com/wp-content/uploads/2015/07/image11-1024x655.png
§ https://databricks.com/wp-content/uploads/2015/07/image21-1024x734.png
§ https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf, page 2.
§ http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014, pages 69-71.
§ http://samza.apache.org/img/0.9/learn/documentation/container/checkpointing.svg
§ https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png
§ https://storm.apache.org/documentation/images/spout-vs-state.png
§ http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job.png

 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

Large-Scale Stream Processing in the Hadoop Ecosystem

  • 1. Large-Scale Stream Processing in the Hadoop Ecosystem Gyula Fóra gyfora@apache.org Márton Balassi mbalassi@apache.org
  • 2. This talk
      § Stream processing by example
      § Open source stream processors
      § Runtime architecture and programming model
      § Counting words…
      § Fault tolerance and stateful processing
      § Closing
      Apache: Big Data Europe, 2015-09-28
  • 4. Streaming applications
      ETL style operations
        • Filter incoming data, Log analysis
        • High throughput, connectors, at-least-once processing
      Window aggregations
        • Trending tweets, User sessions, Stream joins
        • Window abstractions
      [Figure: several inputs feeding a Process/Enrich step]
  • 5. Streaming applications
      Machine learning
        • Fitting trends to the evolving stream, Stream clustering
        • Model state, cyclic flows
      Pattern recognition
        • Fraud detection, Triggering signals based on activity
        • Exactly-once processing
  • 8. Apache Storm
      § Started in 2010, development driven by BackType, then Twitter
      § Pioneer in large scale stream processing
      § Distributed dataflow abstraction (spouts & bolts)
  • 9. Apache Flink
      § Started in 2008 as a research project (Stratosphere) at European universities
      § Unique combination of low latency streaming and high throughput batch analysis
      § Flexible operator states and windowing
      [Figure: stream data sources (Kafka, RabbitMQ, ...) and batch data sources (HDFS, JDBC, ...)]
  • 10. Apache Spark
      § Started in 2009 at UC Berkeley, Apache since 2013
      § Very strong community, wide adoption
      § Unified batch and stream processing over a batch runtime
      § Good integration with batch programs
  • 11. Apache Samza
      § Developed at LinkedIn, open sourced in 2013
      § Builds heavily on Kafka's log based philosophy
      § Pluggable messaging system and execution backend
  • 12. System comparison
      System          | Storm           | Spark            | Samza              | Flink
      Streaming model | Native          | Micro-batching   | Native             | Native
      API             | Compositional   | Declarative      | Compositional      | Declarative
      Fault tolerance | Record ACKs     | RDD-based        | Log-based          | Checkpoints
      Guarantee       | At-least-once   | Exactly-once     | At-least-once      | Exactly-once
      State           | Only in Trident | State as DStream | Stateful operators | Stateful operators
      Windowing       | Not built-in    | Time based       | Not built-in       | Policy based
      Latency         | Very-Low        | Medium           | Low                | Low
      Throughput      | Medium          | High             | High               | High
  • 15. Distributed dataflow runtime
      § Storm, Samza and Flink
      § General properties
        • Long standing operators
        • Pipelined execution
        • Usually possible to create cyclic flows
      Pros
        • Full expressivity
        • Low-latency execution
        • Stateful operators
      Cons
        • Fault-tolerance is hard
        • Throughput may suffer
        • Load balancing is an issue
  • 16. Distributed dataflow runtime
      § Storm
        • Dynamic typing + Kryo
        • Dynamic topology rebalancing
      § Samza
        • Almost every component pluggable
        • Full task isolation, no backpressure (buffering handled by the messaging layer)
      § Flink
        • Strongly typed streams + custom serializers
        • Flow control mechanism
        • Memory management
  • 18. Micro-batch runtime
      § Implemented by Apache Spark
      § General properties
        • Computation broken down to time intervals
        • Load aware scheduling
        • Easy interaction with batch
      Pros
        • Easy to reason about
        • High-throughput
        • FT comes for "free"
        • Dynamic load balancing
      Cons
        • Latency depends on batch size
        • Limited expressivity
        • Stateless by nature
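The "computation broken down to time intervals" idea can be sketched in plain Java. This is a hypothetical illustration only: `MicroBatchSketch` and `toBatches` are made-up names, not part of any Spark API, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not Spark code.
class MicroBatchSketch {
    /** Groups event timestamps (ms, non-negative) into consecutive batches of intervalMs. */
    static Map<Long, List<Long>> toBatches(List<Long> timestamps, long intervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            // Start of the interval that contains this timestamp
            long batchStart = (ts / intervalMs) * intervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Four events split into one-second batches
        System.out.println(toBatches(List.of(10L, 990L, 1000L, 2500L), 1000L));
    }
}
```

In this model, an event that arrives just after a batch boundary is only processed once its interval closes, which is one way to read "latency depends on batch size".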
  • 19. Programming model
      Declarative
        § Expose a high-level API
        § Operators are higher order functions on abstract data stream types
        § Advanced behavior such as windowing is supported
        § Query optimization
      Compositional
        § Offer basic building blocks for composing custom operators and topologies
        § Advanced behavior such as windowing is often missing
        § Topology needs to be hand-optimized
  • 20. Programming model
      DStream, DataStream
        § Transformations abstract operator details
        § Suitable for engineers and data analysts
      Spout, Consumer, Bolt, Task, Topology
        § Direct access to the execution graph / topology
        § Suitable for engineers
  • 22. WordCount
      Input: storm budapest flink apache storm spark streaming samza storm flink apache flink bigdata storm flink streaming
      Output: (storm, 4) (budapest, 1) (flink, 4) (apache, 2) (spark, 1) (streaming, 2) (samza, 1) (bigdata, 1)
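As a framework-free baseline for the examples that follow, the word count above can be reproduced with a minimal plain-Java sketch. The class name `WordCountSketch` is hypothetical. A `LinkedHashMap` is used so that words appear in first-seen order, as on the slide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not tied to any stream processor.
class WordCountSketch {
    /** Counts word occurrences across all lines, keeping first-seen order. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // add 1, or start at 1
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of(
            "storm budapest flink apache storm spark streaming samza",
            "storm flink apache flink bigdata storm flink streaming")));
    }
}
```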
  • 23. Storm
      Assembling the topology:
        TopologyBuilder builder = new TopologyBuilder();
        // The third argument is the parallelism hint of each component
        builder.setSpout("spout", new SentenceSpout(), 5);
        builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
        // Route equal words to the same Counter task
        builder.setBolt("count", new Counter(), 12)
               .fieldsGrouping("split", new Fields("word"));

      Rolling word count bolt:
        public class Counter extends BaseBasicBolt {
          // Per-task, in-memory counts (not fault tolerant)
          Map<String, Integer> counts = new HashMap<String, Integer>();

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
          }

          // Required by the bolt interface; omitted on the slide
          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
          }
        }
  • 24. Samza
      Rolling word count task:
        public class WordCountTask implements StreamTask {
          // Assumed to be initialized elsewhere; the slide omits that step
          private KeyValueStore<String, Integer> store;

          public void process(IncomingMessageEnvelope envelope,
                              MessageCollector collector,
                              TaskCoordinator coordinator) {
            String word = (String) envelope.getMessage(); // getMessage() returns Object
            Integer count = store.get(word);
            if (count == null) { count = 0; }
            count = count + 1;
            store.put(word, count);
            // Emit the updated count to the "wc" stream of the "kafka" system
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "wc"), Tuple2.of(word, count)));
          }
        }
  • 25. Flink
        case class Word(word: String, frequency: Int)

      Rolling word count:
        val lines: DataStream[String] = env.fromSocketStream(...)
        lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
          .groupBy("word").sum("frequency")
          .print()

      Window word count:
        val lines: DataStream[String] = env.fromSocketStream(...)
        lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
          .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
          .groupBy("word").sum("frequency")
          .print()
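One plausible reading of `window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))` is "every second, emit the counts over the last five seconds". The plain-Java sketch below illustrates that sliding-window semantics under this assumption. It is not Flink code, and `SlidingWindowSketch`, `Event` and `countInWindow` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of sliding-window counting; not the Flink API.
class SlidingWindowSketch {
    /** A word observed at a given time (in seconds). */
    static final class Event {
        final long timeSec;
        final String word;
        Event(long timeSec, String word) { this.timeSec = timeSec; this.word = word; }
    }

    /** Counts words with timestamps in [windowEnd - size, windowEnd). */
    static Map<String, Integer> countInWindow(List<Event> events, long windowEnd, long size) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            if (e.timeSec >= windowEnd - size && e.timeSec < windowEnd) {
                counts.merge(e.word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(0, "storm"), new Event(3, "flink"),
            new Event(4, "storm"), new Event(6, "flink"));
        // Slide by one second, window size five seconds
        for (long end = 5; end <= 7; end++) {
            System.out.println("window ending at " + end + "s: " + countInWindow(events, end, 5));
        }
    }
}
```

Because consecutive windows overlap, the same event contributes to several emitted results, unlike the rolling count, which only ever grows.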
  • 26. Spark
      Window word count
      Rolling word count (kind of)
  • 27. Fault tolerance and stateful processing
  • 28. Fault tolerance intro
      § Fault-tolerance in streaming systems is inherently harder than in batch
        • Can't just restart computation
        • State is a problem
        • Fast recovery is crucial
        • Streaming topologies run 24/7 for a long period
      § Fault-tolerance is a complex issue
        • No single point of failure is allowed
        • Guaranteeing input processing
        • Consistent operator state
        • Fast recovery
        • At-least-once vs Exactly-once semantics
  • 29. Storm record acknowledgements
      § Track the lineage of tuples as they are processed (anchors and acks)
      § Special "acker" bolts track each lineage DAG (efficient xor based algorithm)
      § Replay the root of failed (or timed out) tuples
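The xor-based tracking mentioned above can be illustrated with a simplified plain-Java sketch. It assumes that every tuple carries a random 64-bit id and that the acker keeps a single value per root tuple. `XorAckerSketch` and its methods are hypothetical names, not Storm classes, and real Storm batches these updates differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sketch of xor-based lineage tracking; not Storm code.
class XorAckerSketch {
    // rootId -> xor of the ids of all tuples that were emitted but not yet acked
    private final Map<Long, Long> pending = new HashMap<>();

    /** Called when a tuple anchored to rootId is emitted. */
    void emitted(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /** Called when a tuple anchored to rootId is acked; the same xor cancels its id out. */
    void acked(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    /** The tuple tree is fully processed once every id has been xor-ed in twice. */
    boolean isComplete(long rootId) {
        return pending.containsKey(rootId) && pending.get(rootId) == 0L;
    }

    public static void main(String[] args) {
        XorAckerSketch acker = new XorAckerSketch();
        acker.emitted(1L, 0xAL);
        acker.emitted(1L, 0xBL);
        acker.acked(1L, 0xAL);
        System.out.println(acker.isComplete(1L)); // one tuple is still outstanding
        acker.acked(1L, 0xBL);
        System.out.println(acker.isComplete(1L)); // all ids cancelled out
    }
}
```

The appeal of this design is constant memory per root tuple regardless of the size of the tuple tree. With random 64-bit ids, a partially acked tree could in principle xor to zero by accident, but the chance is very small.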
  • 30. Samza offset tracking
      § Exploits the properties of a durable, offset based messaging layer
      § Each task maintains its current offset, which moves forward as it processes elements
      § The offset is checkpointed and restored on failure (some messages might be repeated)
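Why "some messages might be repeated" follows from offset checkpointing can be shown with a small plain-Java simulation. It is a sketch under simplifying assumptions (a single task, a periodic checkpoint, one simulated crash) and `OffsetReplaySketch` is a hypothetical name, not a Samza class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of checkpointed offsets; not Samza code.
class OffsetReplaySketch {
    /**
     * Processes the log in order, checkpointing the offset every checkpointEvery
     * messages. Crashes once just before processing index crashAt (use -1 for no
     * crash) and resumes from the last checkpoint. Returns what was processed.
     */
    static List<String> run(List<String> log, int checkpointEvery, int crashAt) {
        List<String> processed = new ArrayList<>();
        int checkpoint = 0;      // last durably stored offset
        int offset = 0;          // current position in the log
        boolean crashed = false;
        while (offset < log.size()) {
            if (!crashed && offset == crashAt) {
                crashed = true;
                offset = checkpoint; // restart: restore the checkpointed offset
                continue;
            }
            processed.add(log.get(offset));
            offset++;
            if (offset % checkpointEvery == 0) {
                checkpoint = offset;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // Checkpoint every 2 messages, crash before the message at index 3
        System.out.println(run(List.of("a", "b", "c", "d", "e"), 2, 3));
    }
}
```

In the run above, message "c" was processed after the last checkpoint and before the crash, so it is processed a second time after recovery. This is the at-least-once behavior listed for Samza in the comparison table.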
  • 31. Flink checkpointing
      § Based on consistent global snapshots
      § Algorithm designed for stateful dataflows (minimal runtime overhead)
      § Exactly-once semantics
  • 32. Spark RDD recomputation
      § Immutable data model with repeatable computation
      § Failed RDDs are recomputed using their lineage
      § Checkpoint RDDs to reduce lineage length
      § Parallel recovery of failed RDDs
      § Exactly-once semantics
  • 33. State in streaming programs
      § Almost all non-trivial streaming programs are stateful
      § Stateful operators (in essence): $f: (\mathit{in}, \mathit{state}) \to (\mathit{out}, \mathit{state}')$
      § State hangs around and can be read and modified as the stream evolves
      § Goal: Get as close as possible to this model while maintaining scalability and fault-tolerance
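The stateful operator signature above can be made concrete with a minimal plain-Java sketch in which the state is threaded explicitly through each call. The running sum is only an example choice of f, and `StatefulOperatorSketch`, `Step` and `runOver` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of f: (in, state) -> (out, state'); not tied to any framework.
class StatefulOperatorSketch {
    /** The output and the next state produced by one application of f. */
    static final class Step {
        final long out;
        final long nextState;
        Step(long out, long nextState) { this.out = out; this.nextState = nextState; }
    }

    /** Example f: a running sum that emits the updated total. */
    static Step f(long in, long state) {
        long next = state + in;
        return new Step(next, next);
    }

    /** Applies f to each input in order, carrying the state forward. */
    static List<Long> runOver(List<Long> inputs, long initialState) {
        List<Long> outputs = new ArrayList<>();
        long state = initialState;
        for (long in : inputs) {
            Step step = f(in, state);
            outputs.add(step.out);
            state = step.nextState;
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(runOver(List.of(1L, 2L, 3L), 0L));
    }
}
```

Because f is a pure function of its input and state, restoring a saved state and replaying the inputs that followed it reproduces the same outputs, which is what the fault tolerance schemes in this section rely on.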
  • 34. [Storm]
      § States available only in Trident API
      § Dedicated operators for state updates and queries
      § State access methods
        • stateQuery(…)
        • partitionPersist(…)
        • persistentAggregate(…)
      § It's very difficult to implement transactional states
      § Exactly-once guarantee
  • 35. [Spark]
      § Stateless runtime by design
        • No continuous operators
        • UDFs are assumed to be stateless
      § State can be generated as a separate stream of RDDs: updateStateByKey(…)
        $f: (\mathrm{Seq}[\mathit{in}_k], \mathit{state}_k) \to \mathit{state}'_k$
      § $f$ is scoped to a specific key
      § Exactly-once semantics
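The per-key update function above can be sketched in plain Java as a stand-in for what updateStateByKey(…) expresses. This is not the Spark API: `KeyedStateUpdateSketch` and `updateState` are hypothetical names, each batch is a list of words, and the state is a word count per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java stand-in for a per-key state update; not Spark code.
class KeyedStateUpdateSketch {
    /** f: (Seq[in_k], state_k) -> state'_k, here a sum of the new values onto the old count. */
    static int f(List<Integer> newValues, Integer oldState) {
        int sum = (oldState == null) ? 0 : oldState;
        for (int v : newValues) {
            sum += v;
        }
        return sum;
    }

    /** Builds the next state from the previous state and one batch, without mutating the input. */
    static Map<String, Integer> updateState(Map<String, Integer> state, List<String> batch) {
        // Group the batch by key: each word contributes a 1
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String word : batch) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> next = new TreeMap<>(state);
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            next.put(e.getKey(), f(e.getValue(), state.get(e.getKey())));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> s1 = updateState(new TreeMap<>(), List.of("storm", "flink", "storm"));
        Map<String, Integer> s2 = updateState(s1, List.of("flink"));
        System.out.println(s1 + " then " + s2);
    }
}
```

Each call returns a new map and leaves the previous one untouched, mirroring the idea that state is a sequence of immutable per-batch snapshots rather than a value mutated in place.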
  • 36. [Samza]
      § Stateful dataflow operators (any task can hold state)
      § State changes are stored as a log by Kafka
      § Custom storage engines can be plugged in to the log
      § $f$ is scoped to a specific task
      § At-least-once processing semantics
  • 37. [Flink]
      § Stateful dataflow operators (conceptually similar to Samza)
      § Two state access patterns
        • Local (Task) state
        • Partitioned (Key) state
      § Proper API integration
        • Java: OperatorState interface
        • Scala: mapWithState, flatMapWithState…
      § Exactly-once semantics by checkpointing
  • 38. Performance
      § Throughput/Latency
        • The cost of a network hop is 25+ ms
        • 1 million records/sec/core is nice
      § Size of Network Buffers/Batching
      § Buffer Timeout
      § Cost of Fault Tolerance
      § Operator chaining/Stages
      § Serialization/Types
  • 40. Comparison revisited
      System          | Storm           | Spark            | Samza              | Flink
      Streaming model | Native          | Micro-batching   | Native             | Native
      API             | Compositional   | Declarative      | Compositional      | Declarative
      Fault tolerance | Record ACKs     | RDD-based        | Log-based          | Checkpoints
      Guarantee       | At-least-once   | Exactly-once     | At-least-once      | Exactly-once
      State           | Only in Trident | State as DStream | Stateful operators | Stateful operators
      Windowing       | Not built-in    | Time based       | Not built-in       | Policy based
      Latency         | Very-Low        | Medium           | Low                | Low
      Throughput      | Medium          | High             | High               | High
  • 41. Summary
      § Streaming applications and stream processors are very diverse
      § 2 main runtime designs
        • Dataflow based (Storm, Samza, Flink)
        • Micro-batch based (Spark)
      § The best framework varies based on application specific needs
      § But high-level APIs are nice :)
  • 43. List of Figures (in order of usage)
      § https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/CPT-FSM-abcd.svg/326px-CPT-FSM-abcd.svg.png
      § https://storm.apache.org/images/topology.png
      § https://databricks.com/wp-content/uploads/2015/07/image11-1024x655.png
      § https://databricks.com/wp-content/uploads/2015/07/image21-1024x734.png
      § https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf, page 2.
      § http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014, page 69-71.
      § http://samza.apache.org/img/0.9/learn/documentation/container/checkpointing.svg
      § https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png
      § https://storm.apache.org/documentation/images/spout-vs-state.png
      § http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job.png