Apache Storm vs. Spark Streaming – 
Two Stream Processing Platforms compared 
DBTA Workshop on Stream Processing 
Berne, 3.12.2014 
Guido Schmutz 
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA 
2014 © Trivadis 
Guido Schmutz 
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of several books
• Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of the Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
Our company 
Trivadis is a market leader in IT consulting, system integration, 
solution engineering and the provision of IT services focusing 
on and technologies in Switzerland, 
Germany and Austria. 
We offer our services in the following strategic business fields: 
Trivadis Services takes over the interacting operation of your IT systems. 
[Figure: strategic business fields; only the label "OPERATION" survived extraction]
Agenda 
1. Introduction 
2. Apache Storm 
3. Apache Spark (Streaming) 
4. Unified Log 
5. Stream Processing Architectures 
What is Stream Processing? 
Infrastructure for continuous data processing 
Computational model can be as general as MapReduce but with the ability 
to produce low-latency results 
Data collected continuously is naturally processed continuously 
Also known as Event Processing / Complex Event Processing (CEP)
Einheitlicher Umgang mit Ereignisströmen (Uniform Handling of Event Streams) - Unified Log Processing Architecture
August 2014
Why Stream Processing? 
[Figure: response-latency spectrum. RPC: synchronous. Stream Processing: milliseconds to minutes. A third category whose label did not survive extraction: later, possibly much later.]
How to design a Stream Processing System? 
[Figure: three design variants. (1) Event Stream → Collecting → Queue (Persist) → Processing → result. (2) Event Stream → Collecting → Processing → result. (3) Event Stream → combined Collecting/Processing → result.]
How to scale a Stream Processing System? 
[Figure: scaling with threads. Event Stream → Collecting threads 1 … n → Queue (Persist) → Processing threads 1 … n → result.]
How to scale a Stream Processing System? 
[Figure: scaling with processes. Event Stream → several Collecting processes/threads → Queues 1 … n (Persist) → several Processing processes → result.]
How to scale a Stream Processing System? 
[Figure: scaling a multi-step pipeline. Event Stream → Collecting processes 1 and 2 → Processing A → Processing B, with each processing step running in several processes and threads (Q1 Thread 1, Q2 Thread 2, … Qn Thread n).]
How to make a (stateful) Stream Processing System reliable?
Faults and stragglers are inevitable in large clusters running big data applications
Streaming applications must recover from them quickly
[Figure: the scaled-out pipeline (Event Stream → Collecting → Processing A → Processing B, spread over several processes and threads).]
How to make a (stateful) Stream Processing System reliable?
Solution 1: using an active/passive system (hot replication)
• Both systems process the full load
• In case of a failure, automatically switch and use the “passive” system
• Stragglers slow down both the active and the passive system
[Figure: two identical pipelines, labelled Active and Passive, each with Collecting, Processing A and Processing B plus its own State. State = state in-memory and/or on-disk.]
How to make a (stateful) Stream Processing System reliable?
Solution 2: Upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures
[Figure: pipeline (Event Stream → Collecting → Processing A → Processing B) in which nodes keep a buffer and Processing B keeps State. Buffer = buffer for replay, in-memory and/or on-disk. State = state in-memory and/or on-disk.]
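The upstream-backup idea can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular engine implements it: the class names, the `fail_after` switch used to simulate a crash, and the acknowledgement protocol are all assumptions made for the example, and recovery of the failed node's state is left out.

```python
class UpstreamNode:
    """Keeps every sent message until the downstream node acknowledges it."""

    def __init__(self):
        self.buffer = {}      # msg_id -> message, kept for replay
        self.next_id = 0

    def send(self, message, downstream):
        msg_id = self.next_id
        self.next_id += 1
        self.buffer[msg_id] = message
        downstream.receive(msg_id, message, self)

    def ack(self, msg_id):
        self.buffer.pop(msg_id, None)   # acknowledged: safe to forget

    def replay(self, new_downstream):
        # sorted() copies the items, so acks during replay do not disturb the loop
        for msg_id, message in sorted(self.buffer.items()):
            new_downstream.receive(msg_id, message, self)


class DownstreamNode:
    def __init__(self, fail_after=None):
        self.processed = []
        self.fail_after = fail_after    # hypothetical switch to simulate a crash

    def receive(self, msg_id, message, upstream):
        if self.fail_after is not None and len(self.processed) >= self.fail_after:
            return                      # crashed: no processing, no ack
        self.processed.append(message)
        upstream.ack(msg_id)


upstream = UpstreamNode()
failing = DownstreamNode(fail_after=2)
for m in ["e1", "e2", "e3", "e4"]:
    upstream.send(m, failing)

# The unacknowledged messages are still buffered and can be replayed to a new node.
replacement = DownstreamNode()
upstream.replay(replacement)
```

In this sketch a straggler would look the same as a failure: messages that are not acknowledged stay in the buffer and are replayed elsewhere.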
Processing Models 
Batch Processing
• Familiar concept of processing data en masse
• Generally incurs high latency
(Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently
Micro-Batching
• A special case of batch processing with very small batch sizes
• A nice mix between batching and streaming
• Comes at the cost of latency
• Gives stateful computation, making windowing an easy task
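The difference between one-at-a-time processing and micro-batching can be shown with a small Python sketch. It is an illustration under stated assumptions, not how Storm or Spark implement it: the `micro_batches` helper, the sample events and the 1-second interval are all made up for the example, and events are assumed to arrive ordered by timestamp.

```python
from itertools import groupby


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds.

    Assumes the events arrive ordered by timestamp (a simplification).
    """
    for _, batch in groupby(events, key=lambda e: int(e[0] // interval)):
        yield [value for _, value in batch]


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]

# One-at-a-time: each datum is handled as it arrives.
one_at_a_time = [value.upper() for _, value in events]

# Micro-batching: the same work is done once per small batch.
batched = [[v.upper() for v in batch] for batch in micro_batches(events, interval=1.0)]
```

The batched variant cannot produce a result for "a" before the first interval has closed, which is where the extra latency comes from.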
Message Delivery Semantics 
At most once [0,1] 
• Messages may be lost
• Messages never redelivered 
At least once [1 .. n] 
• Messages will never be lost 
• but messages may be redelivered (might be ok if consumer can handle it) 
Exactly once [1] 
• Messages are never lost 
• Messages are never redelivered 
• Perfect message delivery 
• Incurs higher latency for transactional semantics 
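The note that redelivery "might be ok if the consumer can handle it" can be made concrete with a short Python sketch. It simulates at-least-once delivery and a consumer that deduplicates by message ID. All names (`deliver_at_least_once`, `IdempotentCounter`, the sample messages) are hypothetical, and real systems track processed IDs far more economically than an unbounded set.

```python
def deliver_at_least_once(messages, redeliver_ids):
    """Simulate at-least-once delivery: every message arrives, some arrive twice."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in redeliver_ids:   # e.g. the acknowledgement was lost
            yield msg_id, payload


class IdempotentCounter:
    """Consumer that ignores message IDs it has already processed."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return False              # duplicate: skip
        self.seen.add(msg_id)
        self.total += payload
        return True


messages = [(1, 10), (2, 20), (3, 30)]
naive_total = 0
consumer = IdempotentCounter()
for msg_id, payload in deliver_at_least_once(messages, redeliver_ids={2}):
    naive_total += payload            # counts the duplicate
    consumer.handle(msg_id, payload)  # does not
```

The naive sum is inflated by the redelivered message, while the deduplicating consumer arrives at the same total it would see under exactly-once delivery.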
Requirements dictate the choice 
Latency 
• Is the performance of the streaming application paramount?
Development Cost
• Is it desirable to have similar code bases for batch and stream processing (=> lambda architecture)?
Message Delivery Guarantees
• Is it important to process every single record, or is some amount of data loss acceptable?
Process Fault Tolerance
• Is high availability of primary concern?
Agenda 
1. Introduction 
2. Apache Storm 
3. Apache Spark (Streaming) 
4. Unified Log 
5. Stream Processing Architectures 
Apache Storm 
A platform for doing analysis on streams of data as they come in, so you 
can react to data as it happens. 
• A highly distributed real-time computation system 
• Provides general primitives to do real-time computation 
• To simplify working with queues & workers 
• scalable and fault-tolerant 
• complementary to Hadoop 
• Written in Clojure, supports Java, Clojure 
• Originated at Backtype, acquired by Twitter in 2011 
• Open Sourced late 2011 
• Part of Apache Incubator since September 2013 
Apache Storm – Core concepts 
Tuple 
• Core data structure in Storm
• Immutable set of key/value pairs
• You can think of Storm tuples as events
• Values must be serializable
Stream
• Key abstraction of Storm
• An unbounded sequence of tuples that can be processed in parallel by Storm
• Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID
• Each stream also has an associated schema for the tuples that will flow through it
Apache Storm – Core concepts 
Topology 
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MapReduce job in Hadoop
Spout
• Source of data streams (tuples)
• Can be run in “reliable” and “unreliable” mode
Bolt
• Consumes 1+ streams and potentially produces new streams
• Complex operations often require multiple steps and thus multiple bolts
• Calculate, filter, aggregate, join, talk to a database
[Figure: example topology with two spouts (one labelled "Source of Stream B") and four bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing.]
Storm – How does it work?
[Figure, built up over three slides: word-count topology. A Twitter Spout emits tweets to two Split Sentence bolts, which emit single words to two Word Count bolts. Each Word Count bolt issues an INCR per word, and a final Report step collects the totals.]
Tweets shown:
• "NFL: Peyton Manning and Denver’s elite offense fall flat in #Superbowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl"
• "… #Superbowl"
Counts shown in the report: Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1
CAS Big Data - FH Bern | Stream- and Event-Processing | Processing Event Streams - Apache Storm
August 2014
Storm - Topology 
Each spout or bolt runs N instances (tasks) in parallel
[Figure: Twitter Spout → (shuffle grouping) → two Split Sentence bolts → (fields grouping) → two Word Count bolts → (global grouping) → Report.]
Groupings:
• Shuffle grouping: random grouping
• Fields grouping: grouped by value, such that equal values go to the same task
• All grouping: replicates to all tasks
• Global grouping: makes all tuples go to one task
• None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
• Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
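Why the word-count step needs a fields grouping while the split step can use a shuffle grouping is easiest to see in code. The Python sketch below only simulates the routing decision; it does not use the Storm API, and the function names, the CRC32 hash and the sample sentences are assumptions chosen for the illustration.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(_tuple, num_tasks, rng):
    """Pick a random target task."""
    return rng.randrange(num_tasks)


def fields_grouping(value, num_tasks):
    """Equal field values always map to the same task (stable hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_tasks


NUM_COUNT_TASKS = 2
rng = random.Random(42)
sentences = ["superbowl nfl superbowl", "manning peyton superbowl"]

# "Split Sentence" step: any task can split any sentence (shuffle grouping).
split_targets = [shuffle_grouping(s, 2, rng) for s in sentences]
words = [w for s in sentences for w in s.split()]

# "Word Count" step: one counter per task, routed by the word field.
counters = [Counter() for _ in range(NUM_COUNT_TASKS)]
for word in words:
    counters[fields_grouping(word, NUM_COUNT_TASKS)][word] += 1

# A global step (the "Report") can simply merge the per-task counts.
totals = sum(counters, Counter())
```

Because every occurrence of a word reaches the same counting task, each task holds the complete count for its words. With a shuffle grouping at that step, the count for a single word would be scattered across tasks.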
Storm - Creating Topology 
Using a NoSQL database for storing results (keeping state with counter type columns)
[Figure: Twitter Stream → Twitter Spout → two Hashtag Splitter bolts → two Hashtag Counter bolts. Each counter issues an INCR per hashtag against the database. Counter values shown include superbowl = 1, seahawks = 1, broncos = 1.]
Tweets shown:
• "NFL: Peyton Manning and Denver’s elite offense fall flat in #SuperBowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl"
• "… #Superbowl"
Storm Trident 
High-Level abstraction on top of storm 
Simplifies building topologies 
Core data model is the stream 
• Processed as a series of batches (micro-batches) 
• Stream is partitioned among nodes in cluster 
5 kinds of operations in Trident 
• Operations that apply locally to each partition and cause no network transfer 
• Repartitioning operations that don't change the contents
• Aggregation operations that do network transfer 
• Operations on grouped streams 
• Merges and Joins 
Storm Trident - Creating Topology 
[Figure: Trident topology. Twitter Stream → Twitter Spout → (tweet) → Hashtag Splitter → (hashtag) → Hashtag Normalizer → (hashtag) → local groupBy → Persistent Aggregate. The diagram marks two bolts.]
Trident Concepts - Function 
• takes in a set of input fields and emits zero or more tuples as output 
• fields of the output tuple are appended to the original input tuple in the 
stream 
• If a function emits no tuples, the original input tuple is filtered out 
• Otherwise the input tuple is duplicated for each output tuple 
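The function semantics described above can be mimicked with plain Python generators. This is a sketch of the behaviour only, not the Trident API: tuples are modelled as dicts, and `each`, `extract_hashtags` and the sample tweets are hypothetical names introduced for the example.

```python
def each(stream, input_field, function, output_field):
    """Apply `function` to one field of every tuple (tuples modelled as dicts).

    For every value the function emits, the input tuple is copied and the
    new field is appended; if nothing is emitted, the input tuple is dropped.
    """
    for tup in stream:
        for emitted in function(tup[input_field]):
            yield {**tup, output_field: emitted}


def extract_hashtags(text):
    """Emit zero or more hashtags found in a tweet text."""
    return [word[1:].lower() for word in text.split() if word.startswith("#")]


tweets = [
    {"id": 1, "text": "flat in #Superbowl #seahawks"},
    {"id": 2, "text": "no tags here"},
]
result = list(each(tweets, "text", extract_hashtags, "hashtag"))
```

Tweet 1 yields two output tuples, each carrying the original fields plus one `hashtag` value, while tweet 2 is filtered out because the function emitted nothing for it.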
Storm Core vs. Storm Trident 
Criterion | Core Storm | Storm Trident
Community | > 100 contributors | > 100 contributors
Adoption | *** | *
Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
Processing Models | Event-Streaming | Micro-Batching
Processing DSL | No | Yes
Stateful Ops | No | Yes
Distributed RPC | Yes | Yes
Delivery Guarantees | At most once / At least once | Exactly Once
Latency | sub-second | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN
Agenda 
1. Introduction 
2. Apache Storm 
3. Apache Spark (Streaming) 
4. Unified Log 
5. Stream Processing Architectures 
Apache Spark 
Apache Spark is a fast and general engine for large-scale data processing 
• The hot trend in Big Data! 
• Based on 2007 Microsoft Dryad paper 
• Written in Scala, supports Java, Python, SQL and R 
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 
10x faster on disk 
• Runs everywhere – runs on Hadoop, Mesos, standalone or in the cloud 
• One of the largest OSS communities in big data with over 200 contributors in 
50+ organizations 
• Originally developed in 2009 at UC Berkeley's AMPLab
• Open sourced in 2010; part of the Apache Software Foundation since 2014
Apache Spark 
Spark Core 
• General execution engine for the Spark platform 
• In-memory computing capabilities deliver speed 
• General execution model supports wide variety of use cases 
• DAG-based 
• Ease of development – native APIs in Java, Scala and Python 
Spark Streaming 
• Run a streaming computation as a series of very small, deterministic batch jobs 
• Batch size as low as ½ sec, latency of about 1 sec 
• Exactly-once semantics 
• Potential for combining batch and streaming processing in same system 
• Started in 2012, first alpha release in 2013 
Apache Spark - Generality 
[Figure: the Spark stack, in layers]
Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib / SparkR (Machine Learning), GraphX (Graph Processing)
Core Runtime: Spark Core API and Execution Model
Cluster Resource Managers: Spark Standalone, Mesos, YARN
Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly: http://slidesha.re/11PP7FV
Apache Spark – Core concepts 
Resilient Distributed Dataset (RDD) 
• Core Spark abstraction 
• Collections of objects (partitions) spread across cluster 
• Partitions can be stored in-memory or on-disk (local) 
• Enables parallel processing on data sets 
• Built through parallel transformations
• Immutable, recomputable, fault tolerant 
• Contains transformation history (“lineage”) for whole data set 
Operations 
• Stateless Transformations (map, filter, groupBy) 
• Actions (count, collect, save) 
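The split between lazy transformations and actions, and the role of the lineage, can be illustrated with a toy class in Python. `MiniRDD` is a hypothetical, single-machine stand-in written for this example; it is not the Spark API and has no partitions or cluster behind it.

```python
class MiniRDD:
    """Immutable dataset description: source data plus a lineage of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source
        self.lineage = lineage          # tuple of (name, function) pairs

    # Transformations: only record what should happen and return a new object.
    def map(self, fn):
        return MiniRDD(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self.lineage + (("filter", fn),))

    def _compute(self):
        data = iter(self._source)
        for name, fn in self.lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return data

    # Actions: these run the recorded transformations.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)       # records when work actually happens
    return x * x


numbers = MiniRDD(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
result = evens_squared.collect()   # the action triggers the computation
```

Because the object holds only the source and the lineage, any result can be recomputed from scratch, which is the property that makes lost partitions recoverable in the real system.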
RDD Lineage Example 
[Figure: RDD lineage]
Input 1: HDFS file → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD
Input 2: HDFS file → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD
Both MappedRDDs → join() → ShuffledRDD → saveAsHadoopFile() → HDFS file output
Transformations are lazy; the action (the save) executes the transformations.
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
RDD Execution Example 
[Figure: RDD execution. Each RDD (FileRDD, MappedRDD, ShuffledRDD) is split into partitions 1 … 5. filter() and map() are applied partition by partition, while groupByKey() and join() produce ShuffledRDDs.]
Apache Spark Streaming – Core concepts 
Discretized Stream (DStream) 
• Core Spark Streaming abstraction 
• Micro-batches of RDDs
• Operations similar to RDD 
Input DStreams 
• Represents the stream of raw data received from streaming sources 
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, 
ZeroMQ, TCP Socket, Akka actors, etc. 
• Custom Sources can be easily written for custom data sources 
Operations 
• Same as Spark Core 
• Additional Stateful transformations (window, reduceByWindow) 
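A windowed reduction over micro-batches can be sketched as follows. This is plain Python rather than the Spark Streaming API: `reduce_by_window` is a hypothetical helper, each inner list stands for the RDD of one batch interval, and the window always slides by exactly one batch (Spark lets you set window length and slide interval separately).

```python
from collections import deque
from functools import reduce


def reduce_by_window(batches, reduce_fn, window_length):
    """For each new micro-batch, reduce over the last `window_length` batches."""
    window = deque(maxlen=window_length)   # the oldest batch falls out automatically
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        if values:
            yield reduce(reduce_fn, values)


# Hypothetical stream: one list per batch interval
batches = [[1, 2], [3], [4, 5], [6]]
windowed_sums = list(reduce_by_window(batches, lambda a, b: a + b, window_length=2))
```

The state that has to be kept between batches (here, the `deque`) is what makes this a stateful transformation, in contrast to `map` or `filter`, which look at one batch at a time.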
Discretized Stream (DStream) 
[Figure: DStream. The input stream is cut by time (time 1 … time n) into a DStream of RDDs, one per interval (RDD @time 1 … RDD @time n), each holding message 1 … message n. map() yields a MappedDStream whose RDDs hold f(message 1) … f(message n); saveAsHadoopFiles() writes result 1 … result n. Time increases along the stream. The chain of DStream transformations forms a lineage, and actions trigger Spark jobs.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
Spark Streaming Example 
Storm Core vs. Storm Trident vs. Spark Streaming 
Criterion | Core Storm | Storm Trident | Spark Streaming
Community | > 100 contributors | > 100 contributors | > 280 contributors
Adoption | *** | * | *
Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
Processing DSL | No | Yes | Yes
Stateful Ops | No | Yes | Yes
Distributed RPC | Yes | Yes | No
Delivery Guarantees | At most once / At least once | Exactly Once | Exactly Once
Latency | sub-second | seconds | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE
Agenda 
1. Introduction 
2. Apache Storm 
3. Apache Spark (Streaming) 
4. Unified Log 
5. Stream Processing Architectures 
Unified Log 
That’s what most people think about logs 
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 200 111 
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 200 13593 
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114 
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747 
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 - 
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160 
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 - 
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 - 
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809 
But this is what we mean here by Log 
• A structured log (records are numbered beginning with 0, in the order they are written)
• Also known as a commit log or journal
[Figure: a log with offsets 0 to 11. Offset 0 is the 1st record; the next record is written after offset 11.]
Central Unified Log for (real-time) subscription 
Take all the organization’s data and put it into a central log for subscription 
Properties of the Unified Log: 
• Unified: “Enterprise”, single deployment 
• Append-Only: events are appended, no update in place => immutable 
• Ordered: each event has an offset, which is unique within a shard 
• Fast: should be able to handle thousands of messages / sec 
• Distributed: lives on a cluster of machines 
[Figure: a Collector writes to the log (offsets 0 to 11). Consumer System A reads at position 6 (time = 6); Consumer System B reads at position 10 (time = 10).]
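The properties listed above (append-only, ordered by offset, consumers reading at their own pace) fit in a few lines of Python. This is a single-process toy written for illustration; `UnifiedLog`, `Consumer` and the event names are assumptions, and the "distributed" and "fast" properties are out of scope.

```python
class UnifiedLog:
    """Append-only, ordered log; records are addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1     # offset of the new record

    def read(self, offset, max_records=1):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position, independent of the others."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_records=1):
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = UnifiedLog()
for i in range(12):
    log.append(f"event-{i}")

system_a = Consumer(log)
system_b = Consumer(log)
system_a.poll(6)     # System A has read offsets 0..5 and now sits at 6
system_b.poll(10)    # System B has read offsets 0..9 and now sits at 10
```

Since reading never modifies the log, a slow consumer does not hold back a fast one, and a new consumer can start from offset 0 at any time.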
Apache Kafka - Overview 
• A distributed publish-subscribe messaging system 
• Designed for processing of real time activity stream data (logs, metrics 
collections, social media streams, …) 
• Initially developed at LinkedIn, now part of Apache 
• Does not follow JMS Standards and does not use JMS API 
• Kafka maintains feeds of messages in topics 
[Figure: Producers → Kafka Cluster → Consumers.]
[Figure: anatomy of a topic. Partition 0 holds offsets 0 to 12, Partition 1 offsets 0 to 9, Partition 2 offsets 0 to 12. Writes are appended at the "new" end of each partition (old → new).]
Apache Kafka - Motivation 
LinkedIn’s motivation for Kafka was: 
• “A unified platform for handling all the real-time data feeds a large company might have.”
Must-haves
• High throughput to support high-volume event feeds
• Support real-time processing of these feeds to create new, derived feeds
• Support large data backlogs to handle periodic ingestion from offline systems
• Support low-latency delivery to handle more traditional messaging use cases
• Guarantee fault tolerance in the presence of machine failures
Apache Kafka - Performance 
Kafka at LinkedIn
• 10+ billion writes per day
• 172k messages per second (average)
• 55+ billion messages per day to real-time consumers
Benchmark: up to 2 million writes/sec on 3 cheap machines
• Using 3 producers on 3 different machines
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Apache Kafka - Partition offsets 
Offset: messages in the partitions are each assigned a unique (per 
partition) and sequential id called the offset 
• Consumers track their pointers via (offset, partition, topic) tuples 
[Figure: partitions of a topic being read by consumer group C1]
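Per-partition offsets and the way a consumer tracks its position can be sketched like this. The code is a simplified Python model, not the Kafka client API: the `Topic` class, the CRC32-based partition choice and the topic name `meter-readings` are assumptions made for the example.

```python
import zlib


class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append to the partition chosen by key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


topic = Topic("meter-readings", num_partitions=3)
positions = [topic.publish(f"meter-{i % 4}", i) for i in range(10)]

# A consumer's position: (topic, partition) -> next offset to read
consumer_offsets = {(topic.name, p): 0 for p in range(3)}


def poll(topic, partition):
    """Return all unread records of one partition and advance the stored offset."""
    key = (topic.name, partition)
    log = topic.partitions[partition]
    records = log[consumer_offsets[key]:]
    consumer_offsets[key] = len(log)
    return records


consumed = [poll(topic, p) for p in range(3)]
```

Offsets are only meaningful within one partition, so ordering is guaranteed per partition, not across the whole topic.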
Apache Kafka – two Options for Log Cleanup 
Retaining a window of data 
• Ideal for event data 
• Window can be defined in time (days) or space (GBs) – defaults to 1 week 
Retain a complete log (log compaction) 
• Ideal for keyed data 
• Keeps a space-efficient, complete log of changes
• Log compaction runs in the background
• Ensures that at least the last known value for each message key within the log is retained
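The two cleanup options behave quite differently on the same data, as the following Python sketch shows. It models a log as a list of `(timestamp, key, value)` records; the function names and the sample meter records are hypothetical, and real compaction works on log segments in the background rather than on the whole log at once.

```python
def retain_window(log, now, max_age):
    """Time-based retention: drop records older than `max_age`."""
    return [rec for rec in log if now - rec[0] <= max_age]


def compact(log):
    """Log compaction: keep only the latest record for each key, in log order."""
    last_index = {key: i for i, (_, key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if last_index[rec[1]] == i]


# Hypothetical records: (timestamp, key, value)
log = [
    (1, "meter-1", 10),
    (2, "meter-2", 7),
    (3, "meter-1", 12),
    (8, "meter-3", 5),
    (9, "meter-2", 9),
]
windowed = retain_window(log, now=10, max_age=2)
compacted = compact(log)
```

The retention window forgets `meter-1` entirely because its last update is too old, whereas the compacted log still holds the latest value for every key. That is why a window suits event data and compaction suits keyed data.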
Data Flow Graphs using Unified Log 
• Stream processing allows for computing feeds off of other feeds
• Derived feeds are no different from the original feeds they are computed from
• Single deployment of the “Unified Log”, but logically different feeds
[Figure: meter readings data flow. Meter Readings → Collector → "Raw Meter Readings" feed. From there: Persist (raw meter readings); Aggregate by Minute → "Meter by Minute" feed → Persist; Enrich / Transform → "Meter with Customer" feed → Customer Aggregate by Minute → "Meter by Customer by Minute" feed.]
Agenda 
1. Introduction 
2. Apache Storm 
3. Apache Spark (Streaming) 
4. Unified Log 
5. Stream Processing Architectures 
6. Summary 
Architectural Pattern: Standalone Event Stream Processing
[Figure: components shown: event sources (Social Media, Cloud, Internet of Things, DB) delivering Event Streams; Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store; Business Rule Management System (rules); Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications.]
Architectural Pattern: Event Stream Processing as part of Lambda Architecture
[Figure: components shown: event sources (Social Media, Cloud, Internet of Things, DB) delivering Event Streams; Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and a Result Store; Hadoop Big Data Infrastructure (HDFS, Map/Reduce) with its own Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications.]
Architectural Pattern: Event Stream Processing as part of Kappa Architecture
[Figure: components shown: event sources (Social Media, Cloud, Internet of Things, DB) delivering Event Streams; Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and a Result Store; Hadoop Big Data Infrastructure (HDFS) used for Replay; Enterprise Service Bus; Analytical Applications.]
Questions and answers ... 
Guido Schmutz 
Technology Manager 

Más contenido relacionado

La actualidad más candente

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 

Último (20)

What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 

Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared

  • 1. Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared DBTA Workshop on Stream Processing Berne, 3.12.2014 Guido Schmutz BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA 2014 © Trivadis Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared 3rd December 2014 1
  • 2. Guido Schmutz § Working for Trivadis for more than 18 years § Oracle ACE Director for Fusion Middleware and SOA § Co-author of several books § Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data § Member of Trivadis Architecture Board § Technology Manager @ Trivadis § More than 25 years of software development experience § Contact: guido.schmutz@trivadis.com § Blog: http://guidoschmutz.wordpress.com § Twitter: gschmutz
  • 3. Our company Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria. We offer our services in the following strategic business fields: OPERATION (Trivadis Services takes over the interacting operation of your IT systems).
  • 4. Agenda 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log 5. Stream Processing Architectures
  • 5. What is Stream Processing? • Infrastructure for continuous data processing • Computational model can be as general as MapReduce, but with the ability to produce low-latency results • Data collected continuously is naturally processed continuously • Also known as Event Processing / Complex Event Processing (CEP) 2014 © Trivadis Einheitlicher Umgang mit Ereignisströmen - Unified Log Processing Architecture August 2014 5
  • 6. Why Stream Processing? [Figure: response latency spectrum with the labels "RPC: Synchronous", "Stream Processing: Milliseconds to minutes" and "Later. Possibly much later."]
  • 7. How to design a Stream Processing System? [Figure: three variants. (1) Event Stream → Collecting/Processing → result. (2) Event Stream → Collecting → Processing → result. (3) Event Stream → Collecting → Queue (Persist) → Processing → result.]
  • 8. How to scale a Stream Processing System? [Figure: Event Stream → Collecting Threads 1 … n → Queue (Persist) → Processing Threads 1 … n → result.]
  • 9. How to scale a Stream Processing System? [Figure: Event Stream → several collecting processes → Queues 1 … n (Persist) → several processing processes → result.]
  • 10. How to scale a Stream Processing System? [Figure: Event Stream → Collecting Processes 1 and 2 → queues Q1 … Qn → Processing A and Processing B stages, each spread over several threads and processes.]
  • 11. How to make a (stateful) Stream Processing System reliable? • Faults and stragglers are inevitable in large clusters running big data applications • Streaming applications must recover from them quickly [Figure: Event Stream flowing through a collecting process and the Processing A and Processing B stages.]
  • 12. How to make a (stateful) Stream Processing System reliable? Solution 1: using an active/passive system (hot replication) • Both systems process the full load • In case of a failure, automatically switch and use the “passive” system • Stragglers slow down both the active and the passive system [Figure: the same pipeline (collecting process, Processing A, Processing B) deployed twice, as an active and a passive system, each with its own state. State = state in-memory and/or on-disk.]
  • 13. How to make a (stateful) Stream Processing System reliable? Solution 2: Upstream backup • Nodes buffer sent messages and replay them to a new node in case of failure • Stragglers are treated as failures [Figure: pipeline of a collecting process, Processing A and Processing B, with replay buffers and state. buffer = buffer for replay in-memory and/or on-disk; State = state in-memory and/or on-disk.]
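The upstream-backup idea can be illustrated with a small, self-contained Python sketch. All class and method names here are hypothetical, and the acknowledgement/checkpoint step is an assumption added to make the example complete; the slide only states that nodes buffer sent messages and replay them to a new node.

```python
from collections import deque


class UpstreamNode:
    """Sends messages downstream and keeps a copy until they are acknowledged."""

    def __init__(self):
        self.buffer = deque()  # (sequence number, message) pairs awaiting an ack
        self.next_seq = 0

    def send(self, message, downstream):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer.append((seq, message))
        downstream.receive(seq, message)

    def ack(self, seq):
        # Assumption: an ack means the downstream state up to `seq` is safe,
        # so those messages no longer need to be kept for replay.
        while self.buffer and self.buffer[0][0] <= seq:
            self.buffer.popleft()

    def replay(self, downstream):
        # Resend every unacknowledged message to a replacement node.
        for seq, message in self.buffer:
            downstream.receive(seq, message)


class CountingNode:
    """Downstream node whose state is a running count of received messages."""

    def __init__(self, count=0):
        self.count = count

    def receive(self, seq, message):
        self.count += 1


upstream = UpstreamNode()
node = CountingNode()
for msg in ["e1", "e2", "e3"]:
    upstream.send(msg, node)
upstream.ack(0)  # only the state covering e1 has been saved

# The downstream node fails; a new node starts from the saved state (count = 1)
# and receives the buffered messages e2 and e3 again.
replacement = CountingNode(count=1)
upstream.replay(replacement)
print(replacement.count)  # 3
```

The buffer only shrinks on acknowledgement, which is why it may need to spill to disk, as the slide's "in-memory and/or on-disk" note suggests.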
  • 14. Processing Models Batch Processing • Familiar concept of processing data en masse • Generally incurs high latency (Event-) Stream Processing • A one-at-a-time processing model • A datum is processed as it arrives • Sub-second latency • Difficult to process state data efficiently Micro-Batching • A special case of batch processing with very small (tiny) batch sizes • A nice mix between batching and streaming • At the cost of latency • Gives stateful computation, making windowing an easy task
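As a rough illustration of the difference between one-at-a-time processing and micro-batching, the following Python sketch groups timestamped events into fixed-length batches. The function name, the event format and the 0.5-second interval are assumptions chosen for the example.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[b] for b in sorted(batches)]


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]

# One-at-a-time: every event is handed to the processing step on its own.
one_at_a_time = [[value] for _, value in events]

# Micro-batching: events wait until their interval closes, which adds latency
# but lets the processing step work on a small batch at once.
batched = micro_batches(events, batch_interval=0.5)

print(one_at_a_time)  # [['a'], ['b'], ['c'], ['d']]
print(batched)        # [['a', 'b'], ['c'], ['d']]
```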
  • 15. Message Delivery Semantics At most once [0,1] • Messages may be lost • Messages are never redelivered At least once [1 .. n] • Messages are never lost • But messages may be redelivered (might be OK if the consumer can handle it) Exactly once [1] • Messages are never lost • Messages are never redelivered • Perfect message delivery • Incurs higher latency for transactional semantics
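The three guarantees can be compared with a small simulation. This is a sketch under stated assumptions: the failure model (a message lost on its first transmission, an acknowledgement lost on its first attempt) and the function `run` are invented for illustration, and exactly-once is approximated here by retrying plus consumer-side de-duplication, which is only one possible way to obtain it.

```python
def run(messages, lost_messages, lost_acks, retry, deduplicate):
    """Return the list of message ids the consumer ends up processing."""
    processed = []
    seen = set()
    for msg_id in messages:
        attempts = 0
        while True:
            attempts += 1
            # A message in `lost_messages` never arrives on its first attempt.
            arrived = not (attempts == 1 and msg_id in lost_messages)
            if arrived and not (deduplicate and msg_id in seen):
                processed.append(msg_id)
                seen.add(msg_id)
            # An ack in `lost_acks` is lost on the first attempt.
            acked = arrived and not (attempts == 1 and msg_id in lost_acks)
            if acked or not retry:
                break
    return processed


messages = [1, 2, 3]
lost_messages, lost_acks = {2}, {3}

at_most_once = run(messages, lost_messages, lost_acks, retry=False, deduplicate=False)
at_least_once = run(messages, lost_messages, lost_acks, retry=True, deduplicate=False)
exactly_once = run(messages, lost_messages, lost_acks, retry=True, deduplicate=True)

print(at_most_once)   # [1, 3]        message 2 is lost
print(at_least_once)  # [1, 2, 3, 3]  message 3 is redelivered
print(exactly_once)   # [1, 2, 3]
```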
  • 16. Requirements dictate the choice Latency • Is the performance of the streaming application paramount? Development Cost • Is it desired to have similar code bases for batch and stream processing (=> lambda architecture)? Message Delivery Guarantees • Is it highly important to process every single record, or is some normal amount of data loss acceptable? Process Fault Tolerance • Is high availability of primary concern?
  • 17. Agenda 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log 5. Stream Processing Architectures
  • 18. Apache Storm A platform for doing analysis on streams of data as they come in, so you can react to data as it happens. • A highly distributed real-time computation system • Provides general primitives to do real-time computation • Simplifies working with queues & workers • Scalable and fault-tolerant • Complementary to Hadoop • Written in Clojure, supports Java, Clojure • Originated at BackType, acquired by Twitter in 2011 • Open sourced late 2011 • Part of Apache Incubator since September 2013
  • 19. Apache Storm – Core concepts Tuple • Core data structure in Storm • Immutable set of key/value pairs • You can think of Storm tuples as events • Values must be serializable Stream • Key abstraction of Storm • An unbounded sequence of tuples that can be processed in parallel by Storm • Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID • Each stream also has an associated schema of the tuples that will flow through it
  • 20. Apache Storm – Core concepts Topology • Wires data and functions via a DAG (directed acyclic graph) • Executes on many machines, similar to an MR job in Hadoop Spout • Source of data streams (tuples) • Can be run in “reliable” and “unreliable” mode Bolt • Consumes 1+ streams and potentially produces new streams • Complex operations often require multiple steps and thus multiple bolts • Calculate, Filter, Aggregate, Join, Talk to database [Figure: example topology with two spouts (one labeled "Source of Stream B") and four bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing.]
  • 21. Storm – How does it work? [Figure: a Twitter Spout emits the tweet "NFL: Peyton Manning and Denver’s elite offense fall flat in #Superbowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl"; Split Sentence bolts break it into words (NFL, Peyton, Manning, …, Superbowl, Superbowl) and send them to Word Count bolts.] 2014 © Trivadis CAS Big Data - FH Bern | Stream- and Event-Processing | Processing Event Streams - Apache Storm August 2014 21
  • 22. Storm – How does it work? [Figure: the Word Count bolts increment one counter per word (INCR NFL, INCR Peyton, INCR Manning, INCR Superbowl twice), giving NFL = 1, Peyton = 1, Manning = 1, Superbowl = 2.]
  • 23. Storm – How does it work? [Figure: as on the previous slide, with an additional Report step that outputs Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1.]
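The word-count flow on these slides can be sketched in plain Python. This is not Storm code: the function and class names are hypothetical stand-ins for the Split Sentence, Word Count and Report steps, and the tokenization rule (strip "#" and simple punctuation) is an assumption.

```python
from collections import Counter

TWEET = ("NFL: Peyton Manning and Denver's elite offense fall flat in "
         "#Superbowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl")


def split_sentence(tweet):
    """Split Sentence step: emit one word per whitespace-separated token."""
    return [token.strip("#:.,") for token in tweet.split()]


class WordCount:
    """Word Count step: keeps one counter per word (the INCR operations)."""

    def __init__(self):
        self.counts = Counter()

    def process(self, word):
        self.counts[word] += 1
        return word, self.counts[word]


counter = WordCount()
for word in split_sentence(TWEET):
    counter.process(word)

# Report step: print the counters shown on the slide.
report = {w: counter.counts[w] for w in ("Peyton", "Superbowl", "NFL", "Manning")}
print(report)  # {'Peyton': 1, 'Superbowl': 2, 'NFL': 1, 'Manning': 1}
```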
  • 24. Storm - Topology Each spout or bolt runs N instances in parallel. [Figure: Twitter Spout, Split Sentence, Word Count and Report steps, with the connections labeled Shuffle, Fields and Global.] • Shuffle grouping: random grouping • Fields grouping: grouped by value, such that equal values go to the same task • All grouping: replicates to all tasks • Global grouping: all tuples go to one task • None grouping: the bolt runs in the same thread as the bolt/spout it subscribes to • Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple • Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
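To make the routing rules concrete, here is a minimal Python sketch of four of the groupings as functions that pick target task indices. The function names and the use of a CRC32 hash are assumptions for illustration; Storm's own implementation differs.

```python
import random
import zlib


def shuffle_grouping(value, num_tasks, rng):
    """Random task, so load is spread evenly over time."""
    return [rng.randrange(num_tasks)]


def fields_grouping(value, num_tasks):
    """Stable hash of the field value: equal values always reach the same task."""
    return [zlib.crc32(value.encode("utf-8")) % num_tasks]


def all_grouping(value, num_tasks):
    """Every task receives a copy of the tuple."""
    return list(range(num_tasks))


def global_grouping(value, num_tasks):
    """All tuples go to a single task (task 0 in this sketch)."""
    return [0]


rng = random.Random(42)
num_tasks = 4
words = ["superbowl", "nfl", "superbowl", "manning"]

routes = [fields_grouping(w, num_tasks) for w in words]
print(routes[0] == routes[2])                 # True: both "superbowl" tuples share a task
print(all_grouping("nfl", num_tasks))         # [0, 1, 2, 3]
print(global_grouping("nfl", num_tasks))      # [0]
print(shuffle_grouping("nfl", num_tasks, rng)[0] in range(num_tasks))  # True
```

Fields grouping is what makes the word count correct: because every occurrence of a word lands on the same Word Count task, each task can keep its counters locally.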
  • 25. Storm - Creating Topology
  • 26. Using a NoSQL database for storing results (keeping state with counter type columns) [Figure: Twitter Stream → Twitter Spout → Hashtag Splitter → Hashtag Counter. For the tweet "NFL: Peyton Manning and Denver’s elite offense fall flat in #SuperBowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl" the counters are incremented (INCR superbowl twice, INCR seahawks, INCR broncos), giving superbowl = 2, seahawks = 1, broncos = 1.]
  • 27. Storm Trident High-level abstraction on top of Storm Simplifies building topologies Core data model is the stream • Processed as a series of batches (micro-batches) • Stream is partitioned among nodes in the cluster 5 kinds of operations in Trident • Operations that apply locally to each partition and cause no network transfer • Repartitioning operations that don’t change the contents • Aggregation operations that do network transfer • Operations on grouped streams • Merges and joins
  • 28. Storm Trident - Creating Topology [Figure: Trident topology with the components Twitter Stream, Twitter Spout, Hashtag Splitter, Hashtag Normalizer, local groupBy and Persistent Aggregate, passing tweet and hashtag tuples and mapped onto two bolts.]
  • 29. Trident Concepts - Function • Takes in a set of input fields and emits zero or more tuples as output • Fields of the output tuple are appended to the original input tuple in the stream • If a function emits no tuples, the original input tuple is filtered out • Otherwise the input tuple is duplicated for each output tuple
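The function semantics described above can be mimicked in a few lines of Python. This is a sketch, not Trident's Java API: the helper `each`, the dictionary-based tuples and the `extract_hashtags` function are hypothetical names used only to show the append, filter and duplicate behavior.

```python
def each(stream, input_fields, function, output_fields):
    """Apply `function` to the selected input fields of every tuple.

    Emitted values are appended to a copy of the input tuple. A tuple for
    which the function emits nothing disappears from the stream; a tuple for
    which it emits several values appears once per emitted value.
    """
    result = []
    for tup in stream:
        args = [tup[field] for field in input_fields]
        for emitted in function(*args):
            new_tuple = dict(tup)
            new_tuple.update(zip(output_fields, emitted))
            result.append(new_tuple)
    return result


def extract_hashtags(text):
    """Emit one single-field tuple per hashtag found in the text."""
    return [(token[1:].lower(),) for token in text.split() if token.startswith("#")]


stream = [{"tweet": "#seahawks beat #broncos"}, {"tweet": "no tags here"}]
out = each(stream, ["tweet"], extract_hashtags, ["hashtag"])
for tup in out:
    print(tup)
# {'tweet': '#seahawks beat #broncos', 'hashtag': 'seahawks'}
# {'tweet': '#seahawks beat #broncos', 'hashtag': 'broncos'}
```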
  • 30. Storm Core vs. Storm Trident
      Criterion | Core Storm | Storm Trident
      Community | > 100 contributors | > 100 contributors
      Adoption | *** | *
      Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
      Processing Models | Event-Streaming | Micro-Batching
      Processing DSL | No | Yes
      Stateful Ops | No | Yes
      Distributed RPC | Yes | Yes
      Delivery Guarantees | At most once / At least once | Exactly Once
      Latency | sub-second | seconds
      Platform | Storm Cluster, YARN | Storm Cluster, YARN
  • 31. Agenda 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log 5. Stream Processing Architectures
  • 32. Apache Spark Apache Spark is a fast and general engine for large-scale data processing • The hot trend in Big Data! • Based on 2007 Microsoft Dryad paper • Written in Scala, supports Java, Python, SQL and R • Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • Runs everywhere: on Hadoop, Mesos, standalone or in the cloud • One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations • Originally developed in 2009 in UC Berkeley’s AMPLab • Open sourced in 2010; since 2014 part of the Apache Software Foundation
  • 33. Apache Spark Spark Core • General execution engine for the Spark platform • In-memory computing capabilities deliver speed • General execution model supports a wide variety of use cases • DAG-based • Ease of development: native APIs in Java, Scala and Python Spark Streaming • Runs a streaming computation as a series of very small, deterministic batch jobs • Batch size as low as ½ sec, latency of about 1 sec • Exactly-once semantics • Potential for combining batch and streaming processing in the same system • Started in 2012, first alpha release in 2013
  • 34. Apache Spark - Generality [Figure: the Spark stack in four layers. Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing). Core Runtime: Spark Core API and Execution Model. Cluster Resource Managers: Spark Standalone, Mesos, YARN. Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB.] Adapted from C. Fregly: http://slidesha.re/11PP7FV
  • 35. Apache Spark – Core concepts Resilient Distributed Dataset (RDD) • Core Spark abstraction • Collections of objects (partitions) spread across the cluster • Partitions can be stored in-memory or on-disk (local) • Enables parallel processing on data sets • Built through parallel transformations • Immutable, recomputable, fault tolerant • Contains transformation history (“lineage”) for the whole data set Operations • Stateless transformations (map, filter, groupBy) • Actions (count, collect, save)
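The interplay of immutability, lineage and lazy evaluation can be shown with a toy class. This is a single-machine Python sketch with hypothetical names (`MiniRDD`, `load`), not Spark's API: transformations only record what should be done, and each action recomputes the result from the source by replaying the recorded lineage.

```python
class MiniRDD:
    """Immutable dataset described by its source and recorded transformations."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that (re)produces the base data
        self._lineage = tuple(lineage)   # recorded transformations, not results

    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the lineage from the source; this is also how lost data
        # could be rebuilt after a failure.
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):   # action
        return self._compute()

    def count(self):     # action
        return len(self._compute())


calls = []


def load():
    calls.append("load")
    return range(1, 7)


base = MiniRDD(load)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(calls)                    # []  (transformations are lazy)
print(evens_squared.collect())  # [4, 16, 36]
print(evens_squared.count())    # 3
print(len(calls))               # 2  (each action re-read the source)
```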
  • 36. RDD Lineage Example [Figure: HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD; HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD; both branches → join() → ShuffledRDD → SparkContext.saveAsHadoopFile() → HDFS File Output. Transformations are lazy; the action executes the transformations.] Adapted from Chris Fregly: http://slidesha.re/11PP7FV
  • 37. RDD Execution Example [Figure: partitions (Partition 1, Partition 2, … Partition 5) of the FileRDDs, the MappedRDD and the ShuffledRDDs as data passes through filter(), map(), groupByKey() and join().]
  • 38. Apache Spark Streaming – Core concepts Discretized Stream (DStream) • Core Spark Streaming abstraction • Micro-batches of RDDs • Operations similar to RDD Input DStreams • Represent the stream of raw data received from streaming sources • Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc. • Custom sources can easily be written for custom data sources Operations • Same as Spark Core • Additional stateful transformations (window, reduceByWindow)
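A DStream can be pictured as a list of small batches to which the same operation is applied batch by batch. The Python sketch below uses hypothetical names (`MiniDStream`, `reduce_by_window`) and a window that slides by one batch; it illustrates the idea only and is not the Spark Streaming API.

```python
import functools
import operator


class MiniDStream:
    """A stream represented as a list of micro-batches, one per batch interval."""

    def __init__(self, batches):
        self.batches = [list(b) for b in batches]

    def map(self, fn):
        # Stateless: each batch is transformed independently.
        return MiniDStream([[fn(x) for x in b] for b in self.batches])

    def reduce_by_window(self, fn, window_length):
        """Reduce over the last `window_length` batches, sliding by one batch."""
        out = []
        for i in range(len(self.batches)):
            start = max(0, i - window_length + 1)
            window = [x for b in self.batches[start:i + 1] for x in b]
            out.append([functools.reduce(fn, window)] if window else [])
        return MiniDStream(out)


stream = MiniDStream([[1, 2], [3], [4, 5]])
scaled = stream.map(lambda x: x * 10)
windowed = scaled.reduce_by_window(operator.add, window_length=2)

print(scaled.batches)    # [[10, 20], [30], [40, 50]]
print(windowed.batches)  # [[30], [60], [120]]
```

The windowed result depends on more than one batch, which is why the slide lists window operations as stateful.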
  • 39. Discretized Stream (DStream) [Figure: an input stream is cut into one RDD per batch interval (RDD @time 1, @time 2, @time 3, … @time n), each holding message 1 … message n. map() yields a MappedDStream whose RDDs hold f(message 1) … f(message n), and saveAsHadoopFiles() writes result 1 … result n. Time increases along the stream; actions trigger Spark jobs along the DStream transformation lineage.] Adapted from Chris Fregly: http://slidesha.re/11PP7FV
  • 40. Spark Streaming Example
  • 41. Storm Core vs. Storm Trident vs. Spark Streaming
      Criterion | Core Storm | Storm Trident | Spark Streaming
      Community | > 100 contributors | > 100 contributors | > 280 contributors
      Adoption | *** | * | *
      Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
      Processing Models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
      Processing DSL | No | Yes | Yes
      Stateful Ops | No | Yes | Yes
      Distributed RPC | Yes | Yes | No
      Delivery Guarantees | At most once / At least once | Exactly Once | Exactly Once
      Latency | sub-second | seconds | seconds
      Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE
  • 42. Agenda 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log 5. Stream Processing Architectures
  • 43. Unified Log That’s what most people think about logs:
      137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 200 111
      137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 200 13593
      137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
      137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
      137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
      137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
      137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
      137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
      137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
    But this is what we mean here by Log • A structured log (records are numbered beginning with 0, based on the order they are written) • Also known as commit log or journal [Figure: a sequence of records at offsets 0 to 11, from the 1st record to the next record written.]
  • 44. Central Unified Log for (real-time) subscription Take all the organization’s data and put it into a central log for subscription Properties of the Unified Log: • Unified: “Enterprise”, single deployment • Append-only: events are appended, no update in place => immutable • Ordered: each event has an offset, which is unique within a shard • Fast: should be able to handle thousands of messages / sec • Distributed: lives on a cluster of machines [Figure: a Collector writes to a log with offsets 0 to 11; Consumer System A reads at time = 6 and Consumer System B reads at time = 10.]
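The append-only and ordered properties, and the way independent consumers read the same log at their own pace, can be sketched as follows. The classes `UnifiedLog` and `Consumer` are hypothetical, in-memory, single-shard stand-ins; distribution and throughput are outside the scope of this example.

```python
class UnifiedLog:
    """Append-only log: records get consecutive offsets and are never updated."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset, max_records=None):
        end = None if max_records is None else offset + max_records
        return self._records[offset:end]


class Consumer:
    """Each consumer keeps its own position; reading does not change the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=None):
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = UnifiedLog()
for i in range(12):                      # offsets 0 .. 11, as in the figure
    log.append(f"event-{i}")

system_a = Consumer(log)
system_b = Consumer(log)
system_a.poll(6)
system_b.poll(10)

print(system_a.offset, system_b.offset)  # 6 10
print(system_a.poll(2))                  # ['event-6', 'event-7']
```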
  • 45. Apache Kafka - Overview
    • A distributed publish-subscribe messaging system
    • Designed for processing real-time activity stream data (logs, metrics collections, social media streams, …)
    • Initially developed at LinkedIn, now an Apache project
    • Does not follow the JMS standard and does not use the JMS API
    • Kafka maintains feeds of messages in topics
    [Figure: Producers write to the Kafka Cluster, Consumers read from it]
    [Figure: Anatomy of a topic: Partition 0 (offsets 0 to 12), Partition 1 (offsets 0 to 9), Partition 2 (offsets 0 to 12); writes are appended at the "new" end, ordered from old to new]
  • 46. Apache Kafka - Motivation
    LinkedIn's motivation for Kafka was:
    § "A unified platform for handling all the real-time data feeds a large company might have."
    Must-haves
    § High throughput to support high-volume event feeds.
    § Support real-time processing of these feeds to create new, derived feeds.
    § Support large data backlogs to handle periodic ingestion from offline systems.
    § Support low-latency delivery to handle more traditional messaging use cases.
    § Guarantee fault tolerance in the presence of machine failures.
  • 47. Apache Kafka - Performance
    Kafka at LinkedIn
    § 10+ billion writes per day
    § 172k messages per second (average)
    § 55+ billion messages per day to real-time consumers
    Up to 2 million writes/sec on 3 cheap machines
    § Using 3 producers on 3 different machines
    http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  • 48. Apache Kafka - Partition offsets
    • Offset: each message in a partition is assigned a sequential id, unique within that partition, called the offset
    • Consumers track their pointers via (offset, partition, topic) tuples
    [Figure: Consumer group C1]
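The offset bookkeeping can be sketched in a few lines. This is a toy in-memory model, not the Kafka client API: `Topic`, `ConsumerGroup`, `poll` and `positions` are hypothetical names chosen for this example, and the topic name `meter-readings` is made up.

```python
from collections import defaultdict


class Topic:
    """A topic split into partitions; each partition is its own ordered log."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append to one partition and return the message's offset there."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # sequential, unique only within this partition


class ConsumerGroup:
    """Tracks the next offset to read for every (topic, partition) pair."""

    def __init__(self, group_id):
        self.group_id = group_id
        self._offsets = defaultdict(int)  # (topic name, partition) -> next offset

    def poll(self, topic, partition):
        key = (topic.name, partition)
        start = self._offsets[key]
        messages = topic.partitions[partition][start:]
        self._offsets[key] = start + len(messages)
        return messages

    def positions(self):
        """Return the tracked pointers as (offset, partition, topic) tuples."""
        return sorted(
            (offset, partition, name)
            for (name, partition), offset in self._offsets.items()
        )


topic = Topic("meter-readings", num_partitions=3)
offsets = [topic.append(0, "a"), topic.append(0, "b"), topic.append(1, "c")]
print(offsets)  # [0, 1, 0] -> offsets restart at 0 in every partition

c1 = ConsumerGroup("C1")
print(c1.poll(topic, 0))  # ['a', 'b']
topic.append(0, "d")
print(c1.poll(topic, 0))  # ['d'] -> only messages after the stored offset
print(c1.positions())     # [(3, 0, 'meter-readings')]
```

Keeping the pointer on the consumer side means the log itself stays a plain append-only structure, and a group can rewind simply by resetting its stored offset.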
  • 49. Apache Kafka - Two Options for Log Cleanup
    Retain a window of data
    • Ideal for event data
    • Window can be defined in time (days) or space (GBs); defaults to 1 week
    Retain a complete log (log compaction)
    • Ideal for keyed data
    • Keeps a space-efficient, complete log of changes
    • Log compaction runs in the background
    • Ensures that at least the last known value for each message key in the log is always retained
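The difference between the two cleanup options can be shown on a small list of records. This is a simplified sketch, not Kafka's implementation: records are assumed to be `(timestamp_seconds, key, value)` tuples, the function names are made up, and compaction here keeps exactly the latest record per key, whereas the slide only promises "at least" the last known value.

```python
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # the default retention window named on the slide


def retain_window(log, now, max_age_seconds=ONE_WEEK_SECONDS):
    """Time-based retention: drop every record older than the window."""
    return [record for record in log if now - record[0] <= max_age_seconds]


def compact(log):
    """Log compaction: keep only the most recent record for each key."""
    last_index = {}
    for index, (_, key, _) in enumerate(log):
        last_index[key] = index  # later records overwrite earlier ones
    return [record for index, record in enumerate(log)
            if last_index[record[1]] == index]


log = [
    (0, "meter-1", 10),
    (60, "meter-2", 7),
    (120, "meter-1", 12),
    (180, "meter-3", 5),
    (240, "meter-2", 9),
]

# Window of 150 seconds at time 300: only records with timestamp >= 150 survive.
print(retain_window(log, now=300, max_age_seconds=150))
# [(180, 'meter-3', 5), (240, 'meter-2', 9)]

# Compaction: one record per key, original order preserved.
print(compact(log))
# [(120, 'meter-1', 12), (180, 'meter-3', 5), (240, 'meter-2', 9)]
```

Note that the window drops `meter-1` entirely, while compaction still knows its last value. That is why the window suits event data and compaction suits keyed data.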
  • 50. Data Flow Graphs using Unified Log
    • Stream processing allows feeds to be computed from other feeds
    • Derived feeds are no different from the original feeds they are computed from
    • Single deployment of the "Unified Log", but logically different feeds
    [Figure: data flow graph for meter readings. Processing steps: Meter Readings Collector, Enrich / Transform, Aggregate by Minute, Customer Aggregate, Persist Raw Meter Readings, Persist Meter by Minute. Feeds: Raw Meter Readings, Meter with Customer, Meter by Minute, Meter by Customer by Minute]
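A minimal sketch of such a data flow graph follows, using the feed names from the figure. Everything else is assumed for illustration: the event fields (`meter`, `ts`, `wh`), the meter-to-customer mapping, and the batch-style processing, where a real stream processor would consume the feeds continuously.

```python
from collections import defaultdict

# One "Unified Log" deployment holding logically separate feeds.
feeds = defaultdict(list)

# Hypothetical reference data used by the enrichment step.
CUSTOMER_BY_METER = {"m1": "alice", "m2": "bob"}


def publish(feed, event):
    feeds[feed].append(event)


def enrich(source, target):
    """Derive a new feed by adding the customer to every reading."""
    for event in feeds[source]:
        publish(target, {**event, "customer": CUSTOMER_BY_METER[event["meter"]]})


def aggregate_by_minute(source, target, key_field):
    """Derive a new feed with the per-minute sum of 'wh' for each key."""
    totals = defaultdict(int)
    for event in feeds[source]:
        totals[(event[key_field], event["ts"] // 60)] += event["wh"]
    for (key, minute), total in sorted(totals.items()):
        publish(target, {key_field: key, "minute": minute, "wh": total})


for meter, ts, wh in [("m1", 5, 10), ("m1", 42, 15), ("m2", 50, 7), ("m1", 65, 20)]:
    publish("raw_meter_readings", {"meter": meter, "ts": ts, "wh": wh})

# Each derived feed is written back to the log and can itself be a source.
aggregate_by_minute("raw_meter_readings", "meter_by_minute", "meter")
enrich("raw_meter_readings", "meter_with_customer")
aggregate_by_minute("meter_with_customer", "meter_by_customer_by_minute", "customer")

print(feeds["meter_by_minute"][0])              # {'meter': 'm1', 'minute': 0, 'wh': 25}
print(feeds["meter_by_customer_by_minute"][0])  # {'customer': 'alice', 'minute': 0, 'wh': 25}
```

The second aggregation reads a derived feed exactly as the first reads the raw one, which is the point of the slide: consumers cannot tell original and derived feeds apart.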
  • 51. Agenda
    1. Introduction
    2. Apache Storm
    3. Apache Spark (Streaming)
    4. Unified Log
    5. Stream Processing Architectures
    6. Summary
  • 52. Architectural Pattern: Standalone Event Stream Processing
    [Figure: architecture diagram. Labels: Social Media, Internet of Things, Event Cloud, Streams, Enterprise Event Bus (Ingress), Event Processing (ESP / CEP), State Store / Event Store, Business Rule Management System, Rules, Event Processing, Result Store, Enterprise Event Bus, Enterprise Service Bus, Analytical Applications, DB]
  • 53. Architectural Pattern: Event Stream Processing as part of Lambda Architecture
    [Figure: architecture diagram. Labels: Social Media, Internet of Things, Event Cloud, Streams, Enterprise Event Bus (Ingress), Event Processing (ESP / CEP), State Store / Event Store, Event Processing, Result Store, Hadoop Big Data Infrastructure (HDFS, Map/Reduce, Result Store), Enterprise Event Bus, Enterprise Service Bus, Analytical Applications, DB]
  • 54. Architectural Pattern: Event Stream Processing as part of Kappa Architecture
    [Figure: architecture diagram. Labels: Social Media, Internet of Things, Event Cloud, Streams, Enterprise Event Bus (Ingress), Event Processing (ESP / CEP), State Store / Event Store, Event Processing, Result Store, Hadoop Big Data Infrastructure (HDFS), Replay, Enterprise Service Bus, Analytical Applications, DB]
  • 55. Questions and answers ...
    Guido Schmutz, Technology Manager