SlideShare a Scribd company logo
1 of 115
Download to read offline
CLOUD-BASED
DATA STREAM PROCESSING
Thomas Heinze, Leonardo Aniello,
Leonardo Querzoni, Zbigniew Jerzak
DEBS 2014
Mumbai 26/5/2014
TUTORIAL GOALS
Data stream processing (DSP) was in the past
considered a solution for very specific problems.
 Financial trading
 Logistics tracking
 Factory monitoring
Today the potentialities of DSPs start to be used in
more general settings.
 DSP : online processing = MR : batch processing
DSPs will possibly be offered as-a-service from
cloud-based providers ?
225/05/14 CLoud-Based Data Stream Processing
TUTORIAL GOALS
Here we present an overview about current
research trends within data streaming systems
 How they consider the requirements imposed by
recent use-cases
 How are they moving toward cloud platforms
Two focus areas:
 Scalability
 Fault Tolerance
325/05/14 CLoud-Based Data Stream Processing
THE AUTHORS
Thomas Heinze
Phd Student @ SAP
Phd Topic: Elastic Data Stream Processing
4
Leonardo Aniello
Researcher fellow @ Sapienza Univ. of Rome
Leonardo Querzoni
Assistant professor @ Sapienza Univ. of Rome
Zbigniew Jerzak
Senior Researcher @ SAP
Co-Organizer of DEBS Grand Challenge
25/05/14 CLoud-Based Data Stream Processing
OUTLINE
 Historical overview of data stream processing
 Use cases for cloud-based data stream
processing
 How DSPs work: Storm as example
 Scalability methods
 Fault tolerance integrated in modern DSPs
 Future research directions and conclusions
525/05/14 CLoud-Based Data Stream Processing
CLOUD-BASED
DATA STREAM PROCESSING
Historical Overview 6
DATA STREAM PROCESSING ENGINE
 It is a piece of software that
 continuously calculates results for long-standing
queries
 over potentially infinite data streams
 using operators
 algebraic (filters, join, aggregation)
 user defined
 That can be stateless or stateful
25/05/14 CLoud-Based Data Stream Processing 7
Source Aggr Sink
REQUIREMENTS[1]
Rule 1 - Keep the data moving
Rule 2 - Query using SQL on stream (StreamSQL)
Rule 3 - Handle stream imperfections (delayed, missing and out-of-
order data)
Rule 4 - Generate predictable outcomes
Rule 5 - Integrate stored and streaming data
Rule 6 - Guarantee data safety and availability
Rule 7 - Partition and scale applications automatically
Rule 8 - Process and respond instantaneously
8
[1] M. Stonebraker, U. Cetintemel and S. Zdonik. "The 8 Requirements of real-time Stream Processing", ACM
SIGMOD Record, 2005.
25/05/14 CLoud-Based Data Stream Processing
THREE GENERATIONS
 First Generation
 Extensions to existing database engines or simplistic engines
 Dedicated to specific use cases
 Second Generation
 Enhanced methods regarding language expressiveness, load
balancing, fault tolerance
 Third Generation
 Dedicated towards trend of cloud computing; designed
towards massive parallelization
925/05/14 CLoud-Based Data Stream Processing
GENEALOGY
10
FIRST GENERATION THIRD GENERATIONSECOND GENERATION
Aurora (2002)
Medusa (2004)
Borealis (2005)
TelegraphCQ (2003)
NiagaraCQ
(2001)
STREAM (2003)
StreamMapReduce(2011)
StreamMine3G (2013)
Yahoo! S4(2010)
Apache S4(2012)
Twitter Storm (2011) Trident (2013)
StreamCloud(2010)
Elastic StreamCloud (2012)
SPADE(2008)
Elastic Operator(2009)
InfoSphere Streams(2010)
FLUX (2002)
GigaScope (2002)
Spark Streaming (2010) D-Streams(2013)
Cayuga (2007)
SASE (2008)
ZStream (2009)
CAPE (2004)
D-CAPE (2005)
SGuard (2008)
PIPES (2004) HybMig (2006)
StreamInsight (2010)
TimeStream (2013)
CEDR (2005)
SEEP (2013)
Esper (2006)
Truviso (2006)
StreamBase (2003)
Coral8 (2005)
Aleri (2006)
Sybase CEP (2010)
Oracle CQL (2006) Oracle CEP (2009)
SAP ESP (2012)
DejaVu (2009)
System S(2005)
LAAR(2014)
StreamIt (2002)
Flextream(2007)
Elastic System S(2013)
Cisco Prime (2012)
RTM Analyzer (2008)
webMethods Business Events(2011)
TIBCO StreamBase (2013)
TIBCO BusinessEvents(2008)
NILE (2004)
NILE-PDT (2005)
RLD (2013)
MillWheel(2013)
Johka (2007)
25/05/14 CLoud-Based Data Stream Processing
FIRST GENERATION - TELEGRAPH CQ[2]
 Data stream processing engine built on top of
Postgres DB
11
[2] S. Chandrasekaran et al. "TelegraphCQ: Continuous Dataflow Processing", SIGMOD, 2003.
25/05/14 CLoud-Based Data Stream Processing
Source: S. Chandrasekaran et al. "TelegraphCQ: Continuous Dataflow Processing", SIGMOD, 2003.
SECOND GENERATION - BOREALIS[3]
 Joint research project of Brandeis University,
Brown University and MIT
 Duration: ~3 years
 Allowed the experimentation of several
techniques
12
[3] D. J. Abadi et al. "The Design of the Borealis Stream Processing Engine", CIDR 2005.
25/05/14 CLoud-Based Data Stream Processing
BOREALIS MAIN NOVELTIES
 Load-aware Distribution
 Fine-grained High-availability
 Load Shedding mechanisms
1325/05/14 CLoud-Based Data Stream Processing
Source: D. J. Abadi et al. "The Design of the Borealis Stream Processing Engine", CIDR 2005.
THIRD GENERATION
 Novel use cases are driving toward
 Uprecedeted levels of parallelism
 Efficient fault tolerance
 Dynamic scalability
 Etc.
1425/05/14 CLoud-Based Data Stream Processing
CLOUD-BASED
DATA STREAM PROCESSING
Use Cases for Cloud-Based Data
Stream Processing
SCENARIOS FOR THIRD GENERATION
DSPS
 Tracking of query trend evolution in Google
 Analysis of popular queries submitted to Twitter
 User profiling at Yahoo! based on submitted queries
 Bus routing monitoring and management in Dublin
 Sentiment analysis on multiple tweet streams
 Fraud monitoring in cellular telephony
1625/05/14 Cloud-Based Data Stream Processing
QUERY MONITORING AT GOOGLE
 Analyze queries submitted to Google search engine to
create a query historical model
 Run on Zeitgeist on top of MillWheel[4]
 Incoming searches are organized in 1-second buckets
 Buckets are compared to historical data
 Useful to promptly detect anomalies (spikes/dips)
17
[4] Akidau, Balikov, Bekiroğlu, Chernyak, Haberman, Lax, McVeety, Mills, Nordstrom, Whittle. MillWheel: Fault-tolerant
Stream Processing at Internet Scale.VLDB, 2013.
25/05/14 Cloud-Based Data Stream Processing
POPULAR QUERY ANALYSIS AT TWITTER
 When a relevant event happens, an increase occurs in the
number of queries submitted to Twitter[5]
 These queries have to be correlated in real-time with
tweets (2.1 billion tweets per day)
 Runs on Storm
 These spikes are likely to fade away in a limited amount of
time
 Useful to improve the accuracy of popular queries
18
[5] Improving Twitter Search with Real-time Human Computation, 2013 http://blog.echen.me/2013/01/08/improving-
twitter-search-with-real-time-human-computation
25/05/14 Cloud-Based Data Stream Processing
POPULAR QUERY ANALYSIS AT TWITTER
 Problem
 Sudden peak of queries about a new (never seen before) event
 How to properly assign the right semantic (categorization) to the queries?
 Example
 Solution
 Employ Human Evaluation (Amazon’s Mechanical Turk Service) to produce
categorizations to queries unseen so far
 incorporate such categorizations into backend models
19
how can we understand that
#bindersfullofwomen refers
to politics?
25/05/14 Cloud-Based Data Stream Processing
POPULAR QUERY ANALYSIS AT TWITTER
20
Search
Engine
Kafka
Queue
query Log
Spout
Bolt
(query, ts)
popular
queries
Human
Evaluation
categorizations of queries
engine tuning
Storm
25/05/14 Cloud-Based Data Stream Processing
USER PROFILING AT YAHOO!
 Queries submitted by users are evaluated (thousands
queries per second by millions of users)
 Run on S4[6]
 Useful to generate highly personalized advertising
21
[6] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform.ICDMW, 2010.
25/05/14 Cloud-Based Data Stream Processing
BUS ROUTING MANAGEMENT IN DUBLIN
 Tracking of bus locations (1000 buses) to improve
public transportation for 1.2 million citizens[7]
 Run on System S
 Position tracking by GPS signals to provide real-time
traffic information monitoring
 Useful to predict arrival times and suggest better routes
22
[7] Dublin City Council - Traffic Flow Improved by Big Data Analytics Used to Predict Bus Arrival and Transit Times.
IBM, 2013. http://www-03.ibm.com/software/businesscasestudies/en/us/corp?docid=RNAE-9C9PN5
25/05/14 Cloud-Based Data Stream Processing
SENTIMENT ANALYSIS ON TWEET
STREAMS
 Tweet streams flowing at high rates (10K tweet/s)
 Sentiment analysis in real-time
 Limited computation time (latency up to 2 seconds)
 Run on Timestream[8]
 Useful to continuously estimate the mood about specific
topics
23
[8] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, Z. Zhang. Timestream: Reliable Stream
Computation in the Cloud. Eurosys, 2013.
25/05/14 Cloud-Based Data Stream Processing
FRAUD MONITORING FOR MOBILE CALLS
 Fraud detection by real-time processing of Call Detail
Records (10k-50k CDR/s)
 Requires self-joins over large time windows (queries on
millions of CDRs)
 Run on StreamCloud[9]
 Useful to reactively spot dishonest behaviors
24
[9] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, C. Soriente, and P. Valduriez. Streamcloud: An Elastic and Scalable
Data Streaming System. IEEE TPDS, 2012.
25/05/14 Cloud-Based Data Stream Processing
REQUIREMENTS
 Real-time and continuous complex analysis of heavy
data streams
 More than 10k event/s
 Limited computation latency
 Up to few seconds
 Correlation of new and historical data
 Computations have a state to be kept
 Input load varies considerably over time
 Computations have to adapt dynamically (Scalability)
 Hardware and Software failures can occur
 Computations have to transparently tolerate faults with limited
performance degradation (Fault Tolerance)
2525/05/14 Cloud-Based Data Stream Processing
CLOUD-BASED
DATA STREAM PROCESSING
How DSPs work: Storm as example
STORM
 Storm is an open source distributed realtime
computation system
 Provides abstractions for implementing event-based
computations over a cluster of physical nodes
 Manages high throughput data streams
 Performs parallel computations on them
 It can be effectively used to design complex
event-driven applications on intense streams of
data
25/05/14 Cloud-Based Data Stream Processing 27
STORM
 Originally developed by Nathan Marz and team
at BackType, then acquired by Twitter, now an
Apache Incubator project.
 Currently used by Twitter, Groupon, The
Weather Channel, Taobao, etc.
 Design goals:
 Guaranteed data processing
 Horizontal scalability
 Fault Tolerance
25/05/14 Cloud-Based Data Stream Processing 28
STORM
An application is represented by a topology:
Spout
Spout
Spout
Spout
Bolt Bolt
Bolt
BoltBolt
Bolt
Bolt
Bolt
25/05/14 Cloud-Based Data Stream Processing 29
OPERATOR EXPRESSIVENESS
 Storm is designed for custom operator definition
 Bolts can be designed as POJOs adhering to a specific
interface
 Implementations in other languages are feasible
 Trident[10] offers a more high level programming
interface
[10] N. Marz. Trident tutorial. https://github.com/nathanmarz/storm/wiki/Trident-tutorial, 2013.
25/05/14 Cloud-Based Data Stream Processing 30
OPERATOR EXPRESSIVENESS
 Operators are connected through grouping:
 Shuffle grouping
 Fields grouping
 All grouping
 Global grouping
 None grouping
 Direct grouping
 Local or shuffle grouping
25/05/14 Cloud-Based Data Stream Processing 31
PARALLELIZATION
 Storm asks the developer to provide “parallelism
hints” in the topology
Spout
Spout
Spout
Spout
Bolt Bolt
Bolt
BoltBolt
Bolt
Bolt
Bolt
x3
x8
x5
25/05/14 Cloud-Based Data Stream Processing 32
STORM INTERNALS
 A storm cluster is constituted by a Nimbus node
and n Worker nodes
25/05/14 Cloud-Based Data Stream Processing 33
TOPOLOGY EXECUTION
 A topology is run by submitting it to Nimbus
 Nimbus allocates the execution of components (spouts
and bolts) to the worker nodes using a scheduler
 Each component has multiple instances (parallelism)
 Each instance is mapped to an executor
 A worker is instantiated whenever the hosting node must
run executors for the submitted topology
 Each worker node locally manages incoming/outgoing
streams and local computation
 The local surpervisor takes care that everything runs as
prescribed
 Nimbus monitors worker nodes during the execution to
manage potential failures and the current resource usage
25/05/14 Cloud-Based Data Stream Processing 34
TOPOLOGY EXECUTION
 Storm’s default scheduler (EvenScheduler)
applies a simple round robin strategy
25/05/14 Cloud-Based Data Stream Processing 35
A PRACTICAL EXAMPLE
 Word count: the HelloWorld for DSPs
 Input: stream of text (e.g. from documents)
 Output: number of appearance for each word
Source: http://wpcertification.blogspot.it/2014/02/helloworld-apache-storm-word-counter.html
H elloStorm
LineReader
Spout
W ordSplitter
Bolt
W ordCounter
Bolt
text lines words
TXT
docs
25/05/14 Cloud-Based Data Stream Processing 36
A PRACTICAL EXAMPLE
 LineReaderSpout: reads docs and creates tuples
public class LineReaderSpout implements IRichSpout {
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
this.context = context;
this.fileReader = new FileReader(conf.get("inputFile").toString());
this.collector = collector;
}
public void nextTuple() {
if (completed) {
Thread.sleep(1000);
}
String str; BufferedReader reader = new BufferedReader(fileReader);
while ((str = reader.readLine()) != null) {
this.collector.emit(new Values(str), str);
completed = true;
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("line"));
}
public void close() {
fileReader.close();
}
}
25/05/14 Cloud-Based Data Stream Processing 37
A PRACTICAL EXAMPLE
 WordSplitterBolt: cuts lines in words
public class WordSplitterBolt implements IRichBolt {
public void prepare(Map stormConf, TopologyContext context,OutputCollector c) {
this.collector = c;
}
public void execute(Tuple input) {
String sentence = input.getString(0);
String[] words = sentence.split(" ");
for(String word: words){
word = word.trim();
if(!word.isEmpty()){
word = word.toLowerCase();
collector.emit(new Values(word));
}
}
collector.ack(input);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
25/05/14 Cloud-Based Data Stream Processing 38
A PRACTICAL EXAMPLE
 WordCounterBolt: counts word occurrences
public class WordCounterBolt implements IRichBolt {
public void prepare(Map stormConf, TopologyContext context, OutputCollector c) {
this.counters = new HashMap<String, Integer>();
this.collector = c;
}
public void execute(Tuple input) {
String str = input.getString(0);
if(!counters.containsKey(str)){
counters.put(str, 1);
} else {
Integer c = counters.get(str) +1;
counters.put(str, c);
}
collector.ack(input);
}
public void cleanup() {
for (Map.Entry<String, Integer> entry : counters.entrySet()){
System.out.println (entry.getKey()+" : " + entry.getValue());
}
}
}
25/05/14 Cloud-Based Data Stream Processing 39
A PRACTICAL EXAMPLE
 HelloStorm: contains the topology definition
public class HelloStorm {
public static void main (String[] args) throws Exception{
Config config = new Config();
config.put("inputFile", args[0]);
config.setDebug(true);
config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("line-reader-spout", new LineReaderSpout());
builder.setBolt("word-spitter", new WordSplitterBolt().shuffleGrouping(
"line-reader-spout");
builder.setBolt("word-counter", new WordCounterBolt()).shuffleGrouping(
"word-spitter");
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("HelloStorm", config, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();
}
}
25/05/14 Cloud-Based Data Stream Processing 40
CLOUD-BASED
DATA STREAM PROCESSING
Scalability
PARTITIONING SCHEMES
 Data Parallelism:
 How to parallelize the execution of an operator?
 How to detect the optimal level of parallelization?
 Operator Distribution:
 How to distribute the load across available hosts?
 How to achieve a load balance between these
machines?
28/05/2014 Cloud-based Data Stream Processing
42
RQUIREMENTS FOR DATA
PARALLELISM
 Transparent to the user
 correct results in correct order (identical to
sequential execution)
28/05/2014 Cloud-based Data Stream Processing
Filter
Filter
Round-
Robin Union
Aggr
Aggr
Hash
(groupID)
Merge&
Sort
43
DATA PARALLELISM
 First presented by FLUX[11] and Borealis[12]
 Explicitely done using partitioning operators
 User needs to decide:
 partitioning scheme
 merging scheme
 level of parallelism
28/05/2014 Cloud-based Data Stream Processing
44
[11] M. Shah, et al. "Flux: An adaptive partitioning operator for continuous query systems." In ICDE, 2003.
[12] D. Abadi, et al. "The Design of the Borealis Stream Processing Engine." In CIDR, 2005..
PARALLELISM FOR CLOUD-BASED DSP
 Massive parallelization (>100 partitions)
 Support custom operators
 Adapt parallelization level to the workload
without user interaction
28/05/2014 Cloud-based Data Stream Processing
45
DATA PARALLELISM IN STORM
 User defines number of parallel task
 Storm support different partitioning schemes
(aka grouping):
 Shuffle grouping, Fields grouping, All grouping, Global
grouping, None grouping, Direct grouping, Local or
shuffle grouping, Custom
28/05/2014 Cloud-based Data Stream Processing
46
MAPREDUCE FOR STREAMING [13,14]
 Extend MapReduce model for streaming data:
 Break strict phases
 Introduce Stateful Reducer
28/05/2014 Cloud-based Data Stream Processing
Map
Map
Map
Reduce
Reduce
OutputInput
[13] T. Condie, et al. "MapReduce Online." In NSDI,2010.
[14] A. Brito, et al. "Scalable and low-latency data processing with StreamMapReduce." CloudCom, 2011.
47
STATEFUL REDUCER[14]
reduceInit(k1) {
// custom user class
S = new State();S.sum = 0; S.count = 0;
// object S is now associated with key k1
return S;
}
reduce(Key k, <List new, List expired, List window, UserState S>) {
For v in expired do:
// Remove contribution of expired events
S.sum -= v; S.count--;
For v in new do:
// Add contribution of new events
S.sum += v; S.count++;
send(k1, S.sum/S.count);
}
[14] A. Brito, et al. "Scalable and low-latency data processing with StreamMapReduce." CloudCom, 2011.
28/05/2014 Cloud-based Data Stream Processing 48
AUTO PARALLELIZATION[15,16]
 Goal:
 detect parallelizable regions in operator graph with
custom operators for user-defined operators
 Runtime support for enforcing safety conditions
 Safety Conditions:
 For an operator: no state or partitioned state,
selectivity < 1, at most one pre/successor
 For an parallel region: compatible keys, forwarded
keys, region-local fusion dependencies
28/05/2014 Cloud-based Data Stream Processing
[15] S Schneider, et al. "Auto-parallelizing stateful distributed streaming applications." In PACT, 2012.
[16] Wu et al. “Parallelizing Stateful Operators in a Distributed Stream Processing System: How, Should You and How
Much?” In DEBS, 2012.
49
AUTO PARALLELIZATION
 Compiler-based Approach:
 Characterize each operator based on a set of
criterias (state type, selectivity, forwarding)
 Merge different operators together into parallel
regions (based on left-first approach)
 Best parallelization strategy is applied
automatically
 Level of parallelism is decided on job admission
28/05/2014 Cloud-based Data Stream Processing
50
ELASTICITY[17]
 Goal: React to unpredicted load peaks & reduce
the amount of ideling resources.
28/05/2014 Cloud-based Data Stream Processing
Workload
Load
Static Provisioning
Elastic Provisioning
Underprovisioning
Overprovisioning
[17] M. Armbrust, et al. "A view of cloud computing." Communications of the ACM, 2010.
51
DIFFERENT TYPES OF ELASTICITY
 Vertical Elasticity (Scale up)
 Adapt the number of threads per operator based
on the workload.
 Horizontal Elasticity (Scale out)
 Adapt the number of hosts based on the workload.
28/05/2014 Cloud-based Data Stream Processing
52
ELASTIC OPERATOR EXECUTION[18]
 Dynamically adapt number of threads for a
stateless operator based on the workload
 Limitations: works only on thread-level and for
stateless operators
28/05/2014 Cloud-based Data Stream Processing
[18] S. Schneider, et al. "Elastic scaling of data parallel operators in stream processing." In IPDPS, 2009.
Operator
Thread Pool
53
ELASTIC AUTO-PARALLELIZATION[19]
 Combines ideas of Elasticity and Auto-
Parallelization
 Adapt parallelization level using controller-based
approach
 Dynamic adaption of parallelization level based
on operator migration protocol
28/05/2014 Cloud-based Data Stream Processing
54
[19] B. Gedik, et al. “Elastic Scaling for Data Stream Processing." In TPDS, 2013.
Horizontal barrierVertical
barrier
ELASTIC AUTO-PARALLELIZATION
 Dynamic adaption of parallelization: lend &
borrow approach
28/05/2014 Cloud-based Data Stream Processing
55
Op1
Op2
Splitter Merger
Op2
Op2
1. Lend Phase2. Borrow Phase
OPERATOR DISTRIBUTION
 Well-studied problem since first DSP prototypes
 Typically referred to as „Operator Placement“
 Examples: Borealis, SODA, Pietzuch et al.
 Design decisions[20]
 Optimization goal: What is optimized ?
 Execution mode: Centralized or Decentralized?
 Algorithm runtime: Offline, online or both?
28/05/2014 Cloud-based Data Stream Processing
56
[20] G.T. Lakshmanan, et al. "Placement strategies for internet-scale data stream systems." In IEEE Internet
Computing, 2008.
OPTIMIZATION GOAL
 Different objectives:
 Balance CPU load (handle load variations)
 Minimize Network latency (minimize latency for
sensor networks)
 Latency optimization (predictable quality of service)
28/05/2014 Cloud-based Data Stream Processing
57
EXECUTION MODE
 Centralized Execution:
 All decision done by centralized management
component
 Major disadvantage: manager becomes scalability
bottleneck
 Decentralized Execution:
 Different hosts try to agree on an operator
placement
 Two examples: operator routing and commutative
execution
28/05/2014 Cloud-based Data Stream Processing
58
DECENTRALIZER EXECUTION
 Based on pairwise agreement[21]
28/05/2014 Cloud-based Data Stream Processing
Host 1 Host 2 Host 3 Host 4
Aggr4 Aggr2
Filter 2
Filter 3
Filter 5
Filter 6
Filter 4
Filter 1
Aggr 3Aggr 4
Aggr 3
Aggr 5Aggr 6
Source 1
Source 2
59
[21] M. Balazinska, et al. "Contract-Based Load Management in Federated Distributed Systems." NSDI. Vol. 4.
2004.
DECENTRALIZER EXECUTION
 Based on decentralized routing[22]
Host 2
Host 3
Host 4
Host 1
Src1
Src2
J
D
F F
28/05/2014 Cloud-based Data Stream Processing 60
[22] Y.Ahmad, et al. "Network-aware query processing for stream-based applications. In VLDB, 2004.
ALGORITHM RUNTIME
 Offline
 Based on estimation of input rates, selectivities, etc.
 Online
 Mostly simplistic initial estimation (e.g. round-robin or
random placement)
 Continously measurements and adaption during
runtime
28/05/2014 Cloud-based Data Stream Processing
61
MOVEMENT PROTOCOLS
 How to ensure loss-free movement of
operators?
 No input event shall be lost
 State need to be restored on new host
 Movement strategies:
 Pause & Resume vs. Parallel track
 large overlap with research on adaptive query
processing[23]
28/05/2014 Cloud-based Data Stream Processing
62
[23] Y. Zhu, et al. "Dynamic plan migration for continuous queries over data streams." In SIGMOD, 2004.
PAUSE & RESUME
 Approach:
1. Stop execution
2. Move state to new host
3. Restart execution on new hosts
 Properties:
 Simple & generic
 Latency Peak can be observed due to pausing
28/05/2014 Cloud-based Data Stream Processing
63
PARALLEL TRACK
 Approach:
1. Start new instance
2. Move or create up to date state
3. Stop old instance as soon as instances are in sync
 Properties:
 No latency peak
 Requires duplicate detection, detection of „sync“
status
 Events processed twice
28/05/2014 Cloud-based Data Stream Processing
64
OPERATOR PLACEMENT ITHIN CLOUD-
BASED DSP
 Different setup (highly location-wise distributed
vs. single cluster)
 Larger scale (100 …. 1000 hosts)
 New objectives: Elasticity, Monetary Cost, Energy
Efficiency
Mostly centralized, adaptive solutions optimizing
CPU utilization or monetary cost with Pause &
Resume Operator Movement.
28/05/2014 Cloud-based Data Stream Processing
65
RECENT OPERATOR PLACEMENT
APPROACHES
 SQPR[24]
 Query planner for data centers with heterogenes
resources
 MACE[25]
 Present san approach for precise latency estimation
28/05/2014 Cloud-based Data Stream Processing
[24] V. Kalyvianaki et al. "SQPR: Stream query planning with reuse“. In ICDE, 2011.
[25] B. Chandramouli et al. "Accurate latency estimation in a distributed event processing system“. In ICDE, 2011.
66
STORM LOAD MODEL
 Worker: physical JVM and executes subset of all
the tasks of the topology
 Task: Parallel instance of an operator
 Executor: Thread of an worker, executes one or
more tasks of the same operator
28/05/2014 Cloud-based Data Stream Processing
67
EXAMPLE FOR STORM LOAD MODEL
Config conf = new Config();
conf.setNumWorkers(3); // use two worker
processes topologyBuilder.setSpout(„src", new Source(), 3);
topologyBuilder.setBolt(„aggr", new Aggregation(),
6).shuffleGrouping(„src");
topologyBuilder.setBolt(„sink", new Sink(),
6).shuffleGrouping(„aggr");
StormSubmitter.submitTopology( "mytopology", conf,
topologyBuilder.createTopology() );
http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html.
28/05/2014 Cloud-based Data Stream Processing 68
STORM: DISTRIBUTION MODEL[26]
28/05/2014 Cloud-based Data Stream Processing 69
Source
(spout)
Aggr
(bolt)
Sink
(bolt)
Parallelism
hint =3
Parallelism
hint =6
Parallelism
hint =3
Worker 1
Source Task
Sink Task
Aggr Task
Aggr Task
Worker 2
Source Task
Sink Task
Aggr Task
Aggr Task
Worker 3
Source Task
Sink Task
Aggr Task
Aggr Task
[26] Storm. http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-
topology.html.
ADAPTIVE PLACEMENT IN STORM
 Manual reconfiguration (pause & resume
complete topology)
 Enhanced Load scheduler for Twitter Storm[27]
 Load balancing adapted to Storm architecture
28/05/2014 Cloud-based Data Stream Processing
$ storm rebalance mytopo -n 4 -e src=8
70
[27] L. Aniello, et al. "Adaptive online scheduling in storm.“ In DEBS, 2013.
DIFFERENT LEVELS OF ELASTICITY
28/05/2014 Cloud-based Data Stream Processing
Elastic Action Taken System
Virtual Machine Move virtual machines transparent to the user. [28]
Engine New engine instance is started and new queries are
employed on this engine
[29]
Query New query instance is started and the input data is
split
[30]
Operator New operator instance is created and the input data
is split
[31]
[28] T. Knauth, et al. "Scaling non-elastic applications using virtual machines." In Cloud, 2011.
[29] W. Kleiminger, et al."Balancing load in stream processing with the cloud." In ICDEW, 2011.
[30] M. Migliavacca, et al. "SEEP: scalable and elastic event processing.“ In Middleware, 2010.
[31] R. Fernandez, et al. „Integrating scale out and fault tolerance in stream processing using operator state
management”, SIGMOD 2013.
71
ELASTICS DSP SYSTEM ARCHITECTURE
28/05/2014 Cloud-based Data Stream Processing
Elasticity
Manager
Resource
Manager
Data Stream
Processing Engine
Data Stream
Processing Engine
Data Stream
Processing Engine
Virtual
Machine
Virtual
Machine
Virtual
Machine
...
...
72
STREAMCLOUD[9]
 Realized different level of elasticity on a highly
scalable DSP
 Studied different major design decisions:
 Optimal Level of Elasticity
 Migration strategy
 Load balancing based on user-defined thresholds
28/05/2014 Cloud-based Data Stream Processing 73
[9] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, C. Soriente, and P. Valduriez. Streamcloud: An Elastic and
Scalable Data Streaming System. IEEE TPDS, 2012.
SEEP[31]
 Present the state of the art mechanims for state
management
 combine fault tolerance & scale out using same
mechanism
 Highly scalable and fast recovery
28/05/2014 Cloud-based Data Stream Processing 74
[31] R. Fernandez, et al. "Integrating scale out and fault tolerance in stream processing using operator state
management", SIGMOD 2013.
MILLWHEEL[4]
 Similar mechanisms like SEEP or StreamCloud
 Support massive scale out
 Centralized manager
 Optimization of CPU Load
28/05/2014 Cloud-based Data Stream Processing 75
[4] T. Akidau, et al. „MillWheel: Fault-Tolerant Stream Processing at Internet Scale”, VLDB 2013.
CLOUD-BASED
DATA STREAM PROCESSING
Fault tolerance
FAULT TOLERANCE IN DSPS
 Small scale stream processing
 Faults are an exception
 Optimize for the lucky case
 Catch faults at execution time and start recovery
procedures (possibly expensive)
 Large scale DSPs (e.g. cloud based)
 Faults are likely to affect every execution
 Consider them in the design process
28/05/2014 Cloud-Based Data Stream Processing 77
FAULT TOLERANCE IN DSPS
 Two main fault causes
 Message losses
 Computational element failures
 Event tracking
 Makes sure each injected event is correctly processed
with well-defined semantics
 State management
 Makes sure that the failure of a computational
element will not affect the system‘s correct execution
28/05/2014 Cloud-Based Data Stream Processing 78
EVENT TRACKING
 When a fault occurs, events may need to be
replayed
 Typical approach:
 Request acks from downstream operators
 Detect losses by setting timeouts on acks
 Replay lost events from upstream operators
 Tracking and managing information on processed
events may be needed to guarantee specific
processing semantics
28/05/2014 Cloud-Based Data Stream Processing 79
STORM – ET
 Event tracking in storm is guaranteed by two
different techniques:
 Acker processes  at-least-once message processing
 Transactional topologies  exactly-once message
processing
28/05/2014 Cloud-Based Data Stream Processing 80
STORM – ET
 A tuple injected in the system can cause the
production of multiple tuples in the topology
 This production partially maps the underlying
application topology
 Storm keeps track of the processing DAG
stemming from each input tuple
28/05/2014 Cloud-Based Data Stream Processing 81
STORM – ET
 Example: word-count topology
Spout
Splitter
bolt
Counter
bolt
“stay hungry, stay foolish”
text lines words
[“stay”,“hungry”,“stay”,“foolish”] [“stay”, 2]
[“hungry”, 1]
[“foolish”, 1]
“stay hungry, stay foolish”
“stay” [“stay”, 1]
“hungry”
“stay”
“foolish”
[“hungry”, 1]
[“foolish”, 1]
[“stay”, 2]
28/05/2014 Cloud-Based Data Stream Processing 82
STORM – ET
 Acker tasks are responsible for monitoring the
flow of tuples in the DAG
 Each bolt “acks” the correct processing of a tuple
 The processing of a tuple can be committed when it
has fully traversed the DAG
 In this case the acker notifies the original spout
 The spout must implement an ack(Object msgId)
method
 It can be used to garbage collect tuple state
28/05/2014 Cloud-Based Data Stream Processing 83
STORM – ET
 If a tuple does not reach the end of the DAG
 The acker timeouts and invoke fail(Object msgId) on
the spout
 The spout method implementation is in charge of
replaying failed tuples
 Notice: the original data source must be able to
reliably reply events (e.g. a reliable MQ)
28/05/2014 Cloud-Based Data Stream Processing 84
STORM – ET
 Developer perspective:
 Correctly connect bolts with tuple sources
(anchoring)
 Anchoring is how you specify the tuple tree
public void execute(Tuple tuple) {
String sentence = tuple.getString(0);
for(String word: sentence.split(" ")) {
_collector.emit(tuple, new Values(word));
}
_collector.ack(tuple);
}
Anchor
Source: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
28/05/2014 Cloud-Based Data Stream Processing 85
STORM – ET
 Developer perspective:
 Explicitly ack processed tuples
 Or extend BaseBasicBolt
public void execute(Tuple tuple) {
String sentence = tuple.getString(0);
for(String word: sentence.split(" ")) {
_collector.emit(tuple, new Values(word));
}
_collector.ack(tuple);
}
Explicit ack
Source: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
28/05/2014 Cloud-Based Data Stream Processing 86
STORM – ET
 Implement ack() and fail() on the spout(s)
public void nextTuple() {
if(!toSend.isEmpty()){
for(Map.Entry<Integer, String> transactionEntry: toSend.entrySet()){
Integer transactionId = transactionEntry.getKey();
String transactionMessage = transactionEntry.getValue();
collector.emit(new Values(transactionMessage),transactionId);
}
toSend.clear();
}
}
public void ack(Object msgId) {
messages.remove(msgId);
}
public void fail(Object msgId) {
Integer transactionId = (Integer) msgId;
Integer failures = transactionFailureCount.get(transactionId) + 1;
if(failures >= MAX_FAILS){
throw new RuntimeException(”Too many failures on Tx [”+transactionId+"]”);
}
transactionFailureCount.put(transactionId, failures);
toSend.put(transactionId,messages.get(transactionId));
}
Source: J. Leibiusky, G. Eisbruch and D. Simonassi. Getting Started with Storm. O’Reilly, 2012
re-schedule
Failed topology
Completed processing
28/05/2014 Cloud-Based Data Stream Processing 87
STORM – ET
 How does the acker task work?
 It is a standard bolt
 The acker tracks for each tuple emitted by a spout
the corresponding DAG
 It acks the spout whenever the DAG is complete
 You can instantiate parallel ackers to improve
performance
 Tuples are randomly assigned to ackers to improve
load balancing (uses mod hashing)
28/05/2014 Cloud-Based Data Stream Processing 88
STORM – ET
 How can ackers keep track of DAGs?
 Each tuple is identified by a unique 64bit ID
 The acker stores in a map for each tuple IP
 The ID of the emitter task
 An ack val
 An ack val is a bit-vector that encodes
 IDs of tuples stemmed from the initial one
 IDs of acked tuples
28/05/2014 Cloud-Based Data Stream Processing 89
STORM – ET
 Ack val management
 When a tuple is emitted by a spout
 Initializes the vector and encodes in it the tuple ID
 When a tuple is acked
 XOR its ID in the ack val
 When an anchored tuple is emitted
 XOR its ID in the ack val
 When the ack val is empty, all tuples in the DAG
have been acked (with high probability)
28/05/2014 Cloud-Based Data Stream Processing 90
STORM – ET
 If exactly-once semantics is required use a
transactional topology
 Transaction = processing + committing
 Processing is heavily parallelized
 Committing is strictly sequential
 Storm takes care of
 State management (through Zookeeper)
 Transaction coordination
 Fault detection
 Provides a batch processing API
 Note: requires a source able to reply data batches
28/05/2014 Cloud-Based Data Stream Processing 91
EVENT TRACKING
 In other systems the event tracking functionality
is strongly coupled with state management
 events cannot be garbage collected when they are
acknowledged
 Need to wait for the checkpointing of a state
updated with such events
 Timestream does not store all the events and re-
compute those to be replayed by tracking their
dependencies with input events, similarly to
Storm
28/05/2014 Cloud-Based Data Stream Processing 92
EVENT TRACKING
 SEEP stores non-checkpointed events on the
upstream operators
 Millwheel persists all intermediate results to an
underlying Distributed File System.
 It also provides exactly-once semantics
 D-Streams stores data to be processed in
immutable partitioned datasets
 These are implemented as resilient distributed
datasets (RDD)
28/05/2014 Cloud-Based Data Stream Processing 93
STATE MANAGEMENT
 Stateful operators require their state to be
persisted in case of failures.
 Two classic approaches
 Active replication
 Passive replication
28/05/2014 Cloud-Based Data Stream Processing 94
STATE MANAGEMENT
28/05/2014 Cloud-Based Data Stream Processing 95
Operator
Operator
Operator
Previousstage
Nextstage
State
State
State
Data
State implicitly synchronized by
ordered evaluation of same data
Active replication
Operator
Operator
(dormant)
Operator
(dormant)
Previousstage
Nextstage
State
State
State
Data
State periodically persisted on stable
storage and recovered on demand
Passive replication
Checkpoints
STORM - SM
 What happen when tasks fail ?
 If a worker dies its supervisor restarts it
 If it fails on startup Nimbus will reassign it on a different
machine
 If a machine fails its assigned tasks will timeout and
Nimbus will reassign them
 If Nimbus/Supervisors die they are simply restarted
 Behave like fail-fast processes
 They’re stateless in practice
 Their state is safely maintained in in a Zookeeper cluster
28/05/2014 Cloud-Based Data Stream Processing 96
STORM - SM
 There is no explicit state management for
operators in Storm
 Trident builds automatic SM on top of it
 Batch of tuples have unique Tx id
 If a batch is retried it will have the exact same Tx id
 State updates are ordered among batches
 A new Tx is not committed if an old one is still pending
 Transactional state guarantees transparently
exactly-once tuple processing
28/05/2014 Cloud-Based Data Stream Processing 97
STORM - SM
 Transactional state is possible only if supported
by the data source
 As an alternative
 Opaque transactional state
 Each tuple is guaranteed to be executed exactly in one
transaction
 But the set of transactions for a given Tx id can change in
case of failures
 Non transactional state
28/05/2014 Cloud-Based Data Stream Processing 98
STORM - SM
 Few combinations guarantee exactly-once sem.
28/05/2014 Cloud-Based Data Stream Processing 99
State
Non
transactional
Transactional
Opaque
transactional
Spout
Non
transactional
No No No
Transactional No Yes Yes
Opaque
transactional
No No Yes
APACHE S4 - SM
 Lightweight approach
 Assumes that lossy failovers are acceptable.
 PEs hosted on failed PNs are automatically moved to a
standby server
 Running PEs periodically perform uncoordinated and
asynchronous checkpointing of their internal state
 Distinct PE instances can checkpoint their state autonomously
without synchronization
 No global consistency
 Executed asynchronously by first serializing the operator state
and then saving it to stable storage through a pluggable adapter
 Can be overridden by a user implementation
28/05/2014 Cloud-Based Data Stream Processing 100
MILLWHEEL - SM
 Guarantees strongly consistent processing
 Checkpoints on persistent storage every single state
change incurred after a computation
 Can be executed
 before emitting results downstream
 operator implementations are automatically rendered idempotent
with respect to the execution
 after emitting results downstream
 it’s up to the developer to implement idempotent operators if
needed
 Produced results are checkpointed with state (strong
productions)
28/05/2014 Cloud-Based Data Stream Processing 101
TIMESTREAM- SM
 Takes a similar approach
 State information checkpointed to stable storage
 For each operator the state includes
 state dependency: the list of input events that made the
operator reach a specific state
 output dependency: the list of input events that made an
operator produce a specific output starting from a specific state.
 Allow to correctly recover from a fault without having to
store all the intermediate events produced by the
operators and their states.
 Is it possibile to periodically checkpoint a full operator
state in order to avoid re-emitting the whole history of
input events in order to recompute it.
28/05/2014 Cloud-Based Data Stream Processing 102
SEEP - SM
 Takes a different route allowing state to be
stored on upstream operators
 allows SEEP to treat operator recovery as a special
case of a standard operator scale-out procedure
 state in SEEP is characterized by three elements
 internal state
 output buffers
 routing state
 treated differently to reduce the state management
impact on system performance.
28/05/2014 Cloud-Based Data Stream Processing 103
D-STREAMS- SM
 SM depends strictly on the computation model
 A computation is structured as a sequence of
deterministic batch computations on small time intervals
 The input and output of each batch, as well as the state
of each computation, are stored as reliable distributed
datasets (RDDs)
 For each RDD, the graph of operations used to compute
(its lineage) it is tracked and reliably stored
 The recovery of an RDD can be performed in parallel on
separate nodes in order to speed up recovery operation
 Operator state can be optionally checkpointed on stable
storage to limit the number of operations required to
restore it.
28/05/2014 Cloud-Based Data Stream Processing 104
CLOUD-BASED
DATA STREAM PROCESSING
Open research directions
CONCLUSIONS
 A third generation of DPSs is coming out that
promise
 Unprecedented computational power through
horizontal scalability
 On-demand dynamic load adaptation
 Simplified programming models through powerful
event management semantics
 Graceful performance degradation in case of faults
 A few issues remain to be solved to make DSP
ready for the cloud-era
28/05/2014 Cloud-Based Data Stream Processing 106
INFRASTRUCTURE AWARENESS
 Most existing DSP systems are infrastructure
oblivious
 deployment strategies do not take into account the
peculiar characteristics of the available hardware
 The physical connection and relationship among
infrastructural element is known to be a key factor
to both improve system performance and fault
tolerance.
 E.g. Hadoop's “rack awareness”
 We think infrastructure awareness is an open
research field for data stream processing systems
that could possibly bring important improvements.
28/05/2014 Cloud-Based Data Stream Processing 107
COST-EFFICIENCY
 Users in these days are not only interested in the
performance of the system, but also the monetary cost
 Cloud-based DSP systems should consider running cost as
a variable for performance optimization
 efficient scaling behavior maximizing the system utilization
 efficient fault tolerance mechanisms
 Bellavista et al.[32] proposed a first prototype, which allows
the user to trade-off monetary cost and fault tolerance.
 Their prototype only selects a subset of the operators for
replication based on a user-defined value for the expected
information completeness.
28/05/2014 Cloud-Based Data Stream Processing 108
[32] P. Bellavista, A. Corradi, S. Kotoulas, and A. Reale. Adaptive fault-tolerance for dynamic resource provisioning in
distributed stream processing systems. In EDBT, 2014.
ENERGY-EFFICIENCY
 Most large-scale datacenters are striving to "go
green" by making energy consumption more
effective.
 DSP systems should consider energy
consumption a yet-another-variable in their
performance optimization process
 Note: this aspect is possibly strictly linked to the
infrastructure awareness theme
28/05/2014 Cloud-Based Data Stream Processing 109
ADVANCED ELASTICITY
 Most of the existing elastic scaling solutions for
DSPs apply simplistic schemes for load balancing.
 Operator placement algorithms only optimize system
utilization
 Other metrics (end to end latency, network
bandwidth, etc.) are only partially considered
 However these metric are often used to sign
contract-binding SLAs
 Would it be possible to design DSP systems able
to probabilistically guarantee performance ?
28/05/2014 Cloud-Based Data Stream Processing 110
MULTI-DSP INTEGRATION
 Most DSPs are particularly well suited for specific
use cases
 No one-size-fits-all solution
 Components automatically selecting the best engine
for each give use case would significantly improve
the applicability of these systems.
 No need to know in advance which is the best solution
for given use case
 No need to deploy and maintain different solutions
 First promising results[33,34]
28/05/2014 Cloud-Based Data Stream Processing 111
[33] M. Duller, J. S. Rellermeyer, G. Alonso, and N. Tatbul. Virtualizing stream processing. Middleware, 2011.
[34] H. Lim and S. Babu. Execution and optimization of continuous queries with cyclops. SIGMOD, 2013.
Storm
Performance
MapReduce
Performance
28/05/2014 Cloud-Based Data Stream Processing 112
MULTI-DSP INTEGRATION
ESPER
Performance
Cyclops
New Query
With Window Size w
Choose DSP
1
2
[34] H. Lim and S. Babu. Execution and optimization of continuous queries with cyclops. SIGMOD, 2013.
REFERENCES
[1] M. Stonebraker, U. Cetintemel and S. Zdonik "The 8 Requirements of real-time Stream Processing", ACM
SIGMOD Record, 2005.
[2] S. Chandrasekaran et al. "TelegraphCQ: Continuous Dataflow Processing", SIGMOD, 2003.
[3] D. J. Abadi et al. "The Design of the Borealis Stream Processing Engine", CIDR 2005.
[4] T. Akidau, A. Balikov, K. Bekirog ̆lu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom,
and S. Whittle. "MillWheel: Fault-tolerant Stream Processing at Internet Scale". VLDB, 2013.
[5] Improving Twitter Search with Real-time Human Computation, 2013 http://blog.echen.me/2013/01/08/improving-
twitter-search-with-real-time-human-computation
[6] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform.ICDMW,
2010.
[7] "Dublin City Council - Traffic Flow Improved by Big Data Analytics Used to Predict Bus Arrival and
Transit Times". IBM, 2013. http://www-03.ibm.com/software/businesscasestudies/en/us/corp?docid=RNAE-9C9PN5
[8] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, Z. Zhang. "Timestream: Reliable Stream
Computation in the Cloud". Eurosys, 2013.
[9] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, C. Soriente, and P. Valduriez. "Streamcloud: An Elastic
and Scalable Data Streaming System". IEEE TPDS, 2012.
[10] N. Marz. "Trident tutorial". https://github.com/nathanmarz/storm/wiki/Trident-tutorial.
[11] M. Shah, et al. "Flux: An adaptive partitioning operator for continuous query systems." In ICDE, 2003.
[12] D. Abadi, et al. "The Design of the Borealis Stream Processing Engine." In CIDR, 2005..
[13] T. Condie, et al. "MapReduce Online." In NSDI,2010.
[14] A. Brito, et al. "Scalable and low-latency data processing with stream mapreduce." CloudCom, 2011.
[15] S. Schneider, et al. "Auto-parallelizing stateful distributed streaming applications." In PACT, 2012.
REFERENCES
[16] Wu et al. “Parallelizing Stateful Operators in a Distributed Stream Processing System: How, Should You
and How Much?” In DEBS, 2012.
[17] M. Armbrust, et al. "A view of cloud computing." Communications of the ACM, 2010.
[18] S. Schneider, et al. "Elastic scaling of data parallel operators in stream processing." In IPDPS, 2009.
[19] B. Gedik, et al. “Elastic Scaling for Data Stream Processing." In TPDS, 2013.
[20] G.T. Lakshmanan, et al. "Placement strategies for internet-scale data stream systems." In IEEE Internet
Computing, 2008.
[21] M. Balazinska, et al. "Contract-Based Load Management in Federated Distributed Systems." NSDI. Vol. 4.
2004.
[22] Y.Ahmad, et al. "Network-aware query processing for stream-based applications. In VLDB, 2004.
[23] Y. Zhu, et al. "Dynamic plan migration for continuous queries over data streams." In SIGMOD, 2004.
[24] V. Kalyvianaki et al. "SQPR: Stream query planning with reuse“. In ICDE, 2011.
[25] B. Chandramouli et al. "Accurate latency estimation in a distributed event processing system“. In ICDE,
2011.
[26] Storm. http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-
topology.html.
[27] L. Aniello, et al. "Adaptive online scheduling in storm.“ In DEBS, 2013.
[28] T. Knauth, et al. "Scaling non-elastic applications using virtual machines." In Cloud, 2011.
[29] W. Kleiminger, et al."Balancing load in stream processing with the cloud." In ICDEW, 2011.
[30] M. Migliavacca, et al. "SEEP: scalable and elastic event processing.“ In Middleware, 2010.
REFERENCES
[31] R. Fernandez, et al. "Integrating scale out and fault tolerance in stream processing using operator state
management", SIGMOD 2013.
[32] P. Bellavista, A. Corradi, S. Kotoulas, and A. Reale. "Adaptive fault-tolerance for dynamic resource
provisioning in distributed stream processing systems". In EDBT, 2014.
[33] M. Duller, J. S. Rellermeyer, G. Alonso, and N. Tatbul. "Virtualizing stream processing". Middleware, 2011.
[34] H. Lim and S. Babu. "Execution and optimization of continuous queries with cyclops". SIGMOD, 2013.

More Related Content

What's hot

NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...SayantanRoy14
 
Cloud computing
Cloud computingCloud computing
Cloud computingSyam Lal
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Windows Azure Platform
Windows Azure PlatformWindows Azure Platform
Windows Azure PlatformDavid Chou
 
Cloud service management
Cloud service managementCloud service management
Cloud service managementgaurav jain
 
NIST Cloud Computing Reference Architecture
NIST Cloud Computing Reference ArchitectureNIST Cloud Computing Reference Architecture
NIST Cloud Computing Reference ArchitectureThanakrit Lersmethasakul
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Cloud computing and artificial intelligence
Cloud computing and artificial intelligenceCloud computing and artificial intelligence
Cloud computing and artificial intelligenceFurqan Haider
 
Issues in cloud computing
Issues in cloud computingIssues in cloud computing
Issues in cloud computingronak patel
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud ComputingImane SBAI
 
New Trends in Cloud Computing
New Trends in Cloud ComputingNew Trends in Cloud Computing
New Trends in Cloud ComputingAhmed Banafa
 
Cloud computing, Basic Concepts, SOA,
Cloud computing, Basic Concepts, SOA, Cloud computing, Basic Concepts, SOA,
Cloud computing, Basic Concepts, SOA, Dr Neelesh Jain
 

What's hot (20)

NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
 
Cloud sim
Cloud simCloud sim
Cloud sim
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Windows Azure Platform
Windows Azure PlatformWindows Azure Platform
Windows Azure Platform
 
Cloud service management
Cloud service managementCloud service management
Cloud service management
 
NIST Cloud Computing Reference Architecture
NIST Cloud Computing Reference ArchitectureNIST Cloud Computing Reference Architecture
NIST Cloud Computing Reference Architecture
 
IBM Cloud Computing
IBM Cloud ComputingIBM Cloud Computing
IBM Cloud Computing
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Cloud computing and artificial intelligence
Cloud computing and artificial intelligenceCloud computing and artificial intelligence
Cloud computing and artificial intelligence
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
 
On demand provisioning
On demand provisioningOn demand provisioning
On demand provisioning
 
Issues in cloud computing
Issues in cloud computingIssues in cloud computing
Issues in cloud computing
 
Cloud computing ppt
Cloud computing pptCloud computing ppt
Cloud computing ppt
 
Cloud Computing - Introduction
Cloud Computing - IntroductionCloud Computing - Introduction
Cloud Computing - Introduction
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
New Trends in Cloud Computing
New Trends in Cloud ComputingNew Trends in Cloud Computing
New Trends in Cloud Computing
 
Cloud computing, Basic Concepts, SOA,
Cloud computing, Basic Concepts, SOA, Cloud computing, Basic Concepts, SOA,
Cloud computing, Basic Concepts, SOA,
 

Viewers also liked

Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingZbigniew Jerzak
 
Streaming from the cloud
Streaming from the cloudStreaming from the cloud
Streaming from the clouddmulford
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
 
Hive Poster
Hive PosterHive Poster
Hive Posterragho
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
Stream dataprocessing101
Stream dataprocessing101Stream dataprocessing101
Stream dataprocessing101Sotaro Kimura
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Till Rohrmann
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsAlbert Bifet
 
Batch processing and Stream processing by SQL
Batch processing and Stream processing by SQLBatch processing and Stream processing by SQL
Batch processing and Stream processing by SQLSATOSHI TAGOMORI
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsVincenzo Gulisano
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?MapR Technologies
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology confluent
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming AnalyticsGuido Schmutz
 
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...confluent
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan confluent
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedGuido Schmutz
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsKonrad Malawski
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Big Data Spain
 

Viewers also liked (20)

Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream Processing
 
Streaming from the cloud
Streaming from the cloudStreaming from the cloud
Streaming from the cloud
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Hive Poster
Hive PosterHive Poster
Hive Poster
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
Stream dataprocessing101
Stream dataprocessing101Stream dataprocessing101
Stream dataprocessing101
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Batch processing and Stream processing by SQL
Batch processing and Stream processing by SQLBatch processing and Stream processing by SQL
Batch processing and Stream processing by SQL
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operations
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka Streams
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafka
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...
 

Similar to Cloud-based Data Stream Processing

TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...
TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...
TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...Hong-Linh Truong
 
Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016Seattle DAML meetup
 
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...Hong-Linh Truong
 
Cs6703 grid and cloud computing book
Cs6703 grid and cloud computing bookCs6703 grid and cloud computing book
Cs6703 grid and cloud computing bookkaleeswaranme
 
Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World IRJET Journal
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014Raja Chiky
 
Using a Cloud to Replenish Parched Groundwater Modeling Efforts
Using a Cloud to Replenish Parched Groundwater Modeling EffortsUsing a Cloud to Replenish Parched Groundwater Modeling Efforts
Using a Cloud to Replenish Parched Groundwater Modeling EffortsJoseph Luchette
 
Streaming HYpothesis REasoning
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoningWilliam Smith
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用lantianlcdx
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Stavros Kontopoulos
 
Review and Classification of Cloud Computing Research
Review and Classification of Cloud Computing ResearchReview and Classification of Cloud Computing Research
Review and Classification of Cloud Computing Researchiosrjce
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyRohit Dubey
 
kambatla2014.pdf
kambatla2014.pdfkambatla2014.pdf
kambatla2014.pdfAkuhuruf
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big DataMrinal Kumar
 
050317 Ws Telecon Husar
050317 Ws Telecon Husar050317 Ws Telecon Husar
050317 Ws Telecon HusarRudolf Husar
 
Case Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile DatabasesCase Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile DatabasesG. Habib Uddin Khan
 

Similar to Cloud-based Data Stream Processing (20)

TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...
TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...
TUW-ASE-Summer 2014: Emerging Dynamic Distributed Systems and Challenges for ...
 
Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016Streaming Hypothesis Reasoning - William Smith, Jan 2016
Streaming Hypothesis Reasoning - William Smith, Jan 2016
 
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...
 
DGterzo
DGterzoDGterzo
DGterzo
 
Cs6703 grid and cloud computing book
Cs6703 grid and cloud computing bookCs6703 grid and cloud computing book
Cs6703 grid and cloud computing book
 
Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World Cloud Computing: A Perspective on Next Basic Utility in IT World
Cloud Computing: A Perspective on Next Basic Utility in IT World
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014
 
Using a Cloud to Replenish Parched Groundwater Modeling Efforts
Using a Cloud to Replenish Parched Groundwater Modeling EffortsUsing a Cloud to Replenish Parched Groundwater Modeling Efforts
Using a Cloud to Replenish Parched Groundwater Modeling Efforts
 
Streaming HYpothesis REasoning
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoning
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Review and Classification of Cloud Computing Research
Review and Classification of Cloud Computing ResearchReview and Classification of Cloud Computing Research
Review and Classification of Cloud Computing Research
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
kambatla2014.pdf
kambatla2014.pdfkambatla2014.pdf
kambatla2014.pdf
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big Data
 
050317 Ws Telecon Husar
050317 Ws Telecon Husar050317 Ws Telecon Husar
050317 Ws Telecon Husar
 
Case Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile DatabasesCase Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile Databases
 

More from Zbigniew Jerzak

Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Zbigniew Jerzak
 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data AggregationZbigniew Jerzak
 
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing SystemsLatency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing SystemsZbigniew Jerzak
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingZbigniew Jerzak
 
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe EngineElastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe EngineZbigniew Jerzak
 
ThesisXSiena: The Content-Based Publish/Subscribe System
ThesisXSiena: The Content-Based Publish/Subscribe SystemThesisXSiena: The Content-Based Publish/Subscribe System
ThesisXSiena: The Content-Based Publish/Subscribe SystemZbigniew Jerzak
 
Clock Synchronization in Distributed Systems
Clock Synchronization in Distributed SystemsClock Synchronization in Distributed Systems
Clock Synchronization in Distributed SystemsZbigniew Jerzak
 
XSiena: The Content-Based Publish/Subscribe System
XSiena: The Content-Based Publish/Subscribe SystemXSiena: The Content-Based Publish/Subscribe System
XSiena: The Content-Based Publish/Subscribe SystemZbigniew Jerzak
 
Soft State in Publish/Subscribe
Soft State in Publish/SubscribeSoft State in Publish/Subscribe
Soft State in Publish/SubscribeZbigniew Jerzak
 
Highly Available Publish/Subscribe
Highly Available Publish/SubscribeHighly Available Publish/Subscribe
Highly Available Publish/SubscribeZbigniew Jerzak
 
Prefix Forwarding for Publish/Subscribe
Prefix Forwarding for Publish/SubscribePrefix Forwarding for Publish/Subscribe
Prefix Forwarding for Publish/SubscribeZbigniew Jerzak
 
Fail-Aware Publish/Subscribe
Fail-Aware Publish/SubscribeFail-Aware Publish/Subscribe
Fail-Aware Publish/SubscribeZbigniew Jerzak
 
Bloom Filter Based Routing for Content-Based Publish/Subscribe
Bloom Filter Based Routing for Content-Based Publish/SubscribeBloom Filter Based Routing for Content-Based Publish/Subscribe
Bloom Filter Based Routing for Content-Based Publish/SubscribeZbigniew Jerzak
 
Adaptive Internal Clock Synchronization
Adaptive Internal Clock SynchronizationAdaptive Internal Clock Synchronization
Adaptive Internal Clock SynchronizationZbigniew Jerzak
 

More from Zbigniew Jerzak (14)

Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...
 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data Aggregation
 
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing SystemsLatency-aware Elastic Scaling for Distributed Data Stream Processing Systems
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream Processing
 
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe EngineElastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine
 
ThesisXSiena: The Content-Based Publish/Subscribe System
ThesisXSiena: The Content-Based Publish/Subscribe SystemThesisXSiena: The Content-Based Publish/Subscribe System
ThesisXSiena: The Content-Based Publish/Subscribe System
 
Clock Synchronization in Distributed Systems
Clock Synchronization in Distributed SystemsClock Synchronization in Distributed Systems
Clock Synchronization in Distributed Systems
 
XSiena: The Content-Based Publish/Subscribe System
XSiena: The Content-Based Publish/Subscribe SystemXSiena: The Content-Based Publish/Subscribe System
XSiena: The Content-Based Publish/Subscribe System
 
Soft State in Publish/Subscribe
Soft State in Publish/SubscribeSoft State in Publish/Subscribe
Soft State in Publish/Subscribe
 
Highly Available Publish/Subscribe
Highly Available Publish/SubscribeHighly Available Publish/Subscribe
Highly Available Publish/Subscribe
 
Prefix Forwarding for Publish/Subscribe
Prefix Forwarding for Publish/SubscribePrefix Forwarding for Publish/Subscribe
Prefix Forwarding for Publish/Subscribe
 
Fail-Aware Publish/Subscribe
Fail-Aware Publish/SubscribeFail-Aware Publish/Subscribe
Fail-Aware Publish/Subscribe
 
Bloom Filter Based Routing for Content-Based Publish/Subscribe
Bloom Filter Based Routing for Content-Based Publish/SubscribeBloom Filter Based Routing for Content-Based Publish/Subscribe
Bloom Filter Based Routing for Content-Based Publish/Subscribe
 
Adaptive Internal Clock Synchronization
Adaptive Internal Clock SynchronizationAdaptive Internal Clock Synchronization
Adaptive Internal Clock Synchronization
 

Recently uploaded

Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 

Recently uploaded (20)

The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 

Cloud-based Data Stream Processing

  • 1. CLOUD-BASED DATA STREAM PROCESSING Thomas Heinze, Leonardo Aniello, Leonardo Querzoni, Zbigniew Jerzak DEBS 2014 Mumbai 26/5/2014
  • 2. TUTORIAL GOALS Data stream processing (DSP) was in the past considered a solution for very specific problems.  Financial trading  Logistics tracking  Factory monitoring Today the potentialities of DSPs start to be used in more general settings.  DSP : online processing = MR : batch processing DSPs will possibly be offered as-a-service from cloud-based providers ? 225/05/14 CLoud-Based Data Stream Processing
  • 3. TUTORIAL GOALS Here we present an overview about current research trends within data streaming systems  How they consider the requirements imposed by recent use-cases  How are they moving toward cloud platforms Two focus areas:  Scalability  Fault Tolerance 325/05/14 CLoud-Based Data Stream Processing
  • 4. THE AUTHORS Thomas Heinze Phd Student @ SAP Phd Topic: Elastic Data Stream Processing 4 Leonardo Aniello Researcher fellow @ Sapienza Univ. of Rome Leonardo Querzoni Assistant professor @ Sapienza Univ. of Rome Zbigniew Jerzak Senior Researcher @ SAP Co-Organizer of DEBS Grand Challenge 25/05/14 CLoud-Based Data Stream Processing
  • 5. OUTLINE  Historical overview of data stream processing  Use cases for cloud-based data stream processing  How DSPs work: Storm as example  Scalability methods  Fault tolerance integrated in modern DSPs  Future research directions and conclusions 525/05/14 CLoud-Based Data Stream Processing
  • 7. DATA STREAM PROCESSING ENGINE  It is a piece of software that  continuously calculates results for long-standing queries  over potentially infinite data streams  using operators  algebraic (filters, join, aggregation)  user defined  That can be stateless or stateful 25/05/14 CLoud-Based Data Stream Processing 7 Source Aggr Sink
  • 8. REQUIREMENTS[1] Rule 1 - Keep the data moving Rule 2 - Query using SQL on stream (StreamSQL) Rule 3 - Handle stream imperfections (delayed, missing and out-of- order data) Rule 4 - Generate predictable outcomes Rule 5 - Integrate stored and streaming data Rule 6 - Guarantee data safety and availability Rule 7 - Partition and scale applications automatically Rule 8 - Process and respond instantaneously 8 [1] M. Stonebraker, U. Cetintemel and S. Zdonik. "The 8 Requirements of real-time Stream Processing", ACM SIGMOD Record, 2005. 25/05/14 CLoud-Based Data Stream Processing
  • 9. THREE GENERATIONS  First Generation  Extensions to existing database engines or simplistic engines  Dedicated to specific use cases  Second Generation  Enhanced methods regarding language expressiveness, load balancing, fault tolerance  Third Generation  Dedicated towards trend of cloud computing; designed towards massive parallelization 925/05/14 CLoud-Based Data Stream Processing
  • 10. GENEALOGY 10 FIRST GENERATION THIRD GENERATIONSECOND GENERATION Aurora (2002) Medusa (2004) Borealis (2005) TelegraphCQ (2003) NiagaraCQ (2001) STREAM (2003) StreamMapReduce(2011) StreamMine3G (2013) Yahoo! S4(2010) Apache S4(2012) Twitter Storm (2011) Trident (2013) StreamCloud(2010) Elastic StreamCloud (2012) SPADE(2008) Elastic Operator(2009) InfoSphere Streams(2010) FLUX (2002) GigaScope (2002) Spark Streaming (2010) D-Streams(2013) Cayuga (2007) SASE (2008) ZStream (2009) CAPE (2004) D-CAPE (2005) SGuard (2008) PIPES (2004) HybMig (2006) StreamInsight (2010) TimeStream (2013) CEDR (2005) SEEP (2013) Esper (2006) Truviso (2006) StreamBase (2003) Coral8 (2005) Aleri (2006) Sybase CEP (2010) Oracle CQL (2006) Oracle CEP (2009) SAP ESP (2012) DejaVu (2009) System S(2005) LAAR(2014) StreamIt (2002) Flextream(2007) Elastic System S(2013) Cisco Prime (2012) RTM Analyzer (2008) webMethods Business Events(2011) TIBCO StreamBase (2013) TIBCO BusinessEvents(2008) NILE (2004) NILE-PDT (2005) RLD (2013) MillWheel(2013) Johka (2007) 25/05/14 CLoud-Based Data Stream Processing
  • 11. FIRST GENERATION - TELEGRAPH CQ[2]  Data stream processing engine built on top of Postgres DB 11 [2] S. Chandrasekaran et al. "TelegraphCQ: Continuous Dataflow Processing", SIGMOD, 2003. 25/05/14 CLoud-Based Data Stream Processing Source: S. Chandrasekaran et al. "TelegraphCQ: Continuous Dataflow Processing", SIGMOD, 2003.
  • 12. SECOND GENERATION - BOREALIS[3]  Joint research project of Brandeis University, Brown University and MIT  Duration: ~3 years  Allowed the experimentation of several techniques 12 [3] D. J. Abadi et al. "The Design of the Borealis Stream Processing Engine", CIDR 2005. 25/05/14 CLoud-Based Data Stream Processing
  • 13. BOREALIS MAIN NOVELTIES  Load-aware Distribution  Fine-grained High-availability  Load Shedding mechanisms 1325/05/14 CLoud-Based Data Stream Processing Source: D. J. Abadi et al. "The Design of the Borealis Stream Processing Engine", CIDR 2005.
  • 14. THIRD GENERATION  Novel use cases are driving toward  Uprecedeted levels of parallelism  Efficient fault tolerance  Dynamic scalability  Etc. 1425/05/14 CLoud-Based Data Stream Processing
  • 15. CLOUD-BASED DATA STREAM PROCESSING Use Cases for Cloud-Based Data Stream Processing
  • 16. SCENARIOS FOR THIRD GENERATION DSPS  Tracking of query trend evolution in Google  Analysis of popular queries submitted to Twitter  User profiling at Yahoo! based on submitted queries  Bus routing monitoring and management in Dublin  Sentiment analysis on multiple tweet streams  Fraud monitoring in cellular telephony 1625/05/14 Cloud-Based Data Stream Processing
  • 17. QUERY MONITORING AT GOOGLE  Analyze queries submitted to Google search engine to create a query historical model  Run on Zeitgeist on top of MillWheel[4]  Incoming searches are organized in 1-second buckets  Buckets are compared to historical data  Useful to promptly detect anomalies (spikes/dips) 17 [4] Akidau, Balikov, Bekiroğlu, Chernyak, Haberman, Lax, McVeety, Mills, Nordstrom, Whittle. MillWheel: Fault-tolerant Stream Processing at Internet Scale.VLDB, 2013. 25/05/14 Cloud-Based Data Stream Processing
  • 18. POPULAR QUERY ANALYSIS AT TWITTER  When a relevant event happens, an increase occurs in the number of queries submitted to Twitter[5]  These queries have to be correlated in real-time with tweets (2.1 billion tweets per day)  Runs on Storm  These spikes are likely to fade away in a limited amount of time  Useful to improve the accuracy of popular queries 18 [5] Improving Twitter Search with Real-time Human Computation, 2013 http://blog.echen.me/2013/01/08/improving- twitter-search-with-real-time-human-computation 25/05/14 Cloud-Based Data Stream Processing
  • 19. POPULAR QUERY ANALYSIS AT TWITTER  Problem  Sudden peak of queries about a new (never seen before) event  How to properly assign the right semantic (categorization) to the queries?  Example  Solution  Employ Human Evaluation (Amazon’s Mechanical Turk Service) to produce categorizations to queries unseen so far  incorporate such categorizations into backend models 19 how can we understand that #bindersfullofwomen refers to politics? 25/05/14 Cloud-Based Data Stream Processing
  • 20. POPULAR QUERY ANALYSIS AT TWITTER 20 Search Engine Kafka Queue query Log Spout Bolt (query, ts) popular queries Human Evaluation categorizations of queries engine tuning Storm 25/05/14 Cloud-Based Data Stream Processing
  • 21. USER PROFILING AT YAHOO!  Queries submitted by users are evaluated (thousands queries per second by millions of users)  Run on S4[6]  Useful to generate highly personalized advertising 21 [6] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform.ICDMW, 2010. 25/05/14 Cloud-Based Data Stream Processing
  • 22. BUS ROUTING MANAGEMENT IN DUBLIN  Tracking of bus locations (1000 buses) to improve public transportation for 1.2 million citizens[7]  Run on System S  Position tracking by GPS signals to provide real-time traffic information monitoring  Useful to predict arrival times and suggest better routes 22 [7] Dublin City Council - Traffic Flow Improved by Big Data Analytics Used to Predict Bus Arrival and Transit Times. IBM, 2013. http://www-03.ibm.com/software/businesscasestudies/en/us/corp?docid=RNAE-9C9PN5 25/05/14 Cloud-Based Data Stream Processing
  • 23. SENTIMENT ANALYSIS ON TWEET STREAMS  Tweet streams flowing at high rates (10K tweet/s)  Sentiment analysis in real-time  Limited computation time (latency up to 2 seconds)  Run on Timestream[8]  Useful to continuously estimate the mood about specific topics 23 [8] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, Z. Zhang. Timestream: Reliable Stream Computation in the Cloud. Eurosys, 2013. 25/05/14 Cloud-Based Data Stream Processing
  • 24. FRAUD MONITORING FOR MOBILE CALLS  Fraud detection by real-time processing of Call Detail Records (10k-50k CDR/s)  Requires self-joins over large time windows (queries on millions of CDRs)  Run on StreamCloud[9]  Useful to reactively spot dishonest behaviors 24 [9] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, C. Soriente, and P. Valduriez. Streamcloud: An Elastic and Scalable Data Streaming System. IEEE TPDS, 2012. 25/05/14 Cloud-Based Data Stream Processing
  • 25. REQUIREMENTS  Real-time and continuous complex analysis of heavy data streams  More than 10k event/s  Limited computation latency  Up to few seconds  Correlation of new and historical data  Computations have a state to be kept  Input load varies considerably over time  Computations have to adapt dynamically (Scalability)  Hardware and Software failures can occur  Computations have to transparently tolerate faults with limited performance degradation (Fault Tolerance) 2525/05/14 Cloud-Based Data Stream Processing
  • 26. CLOUD-BASED DATA STREAM PROCESSING How DSPs work: Storm as example
  • 27. STORM  Storm is an open source distributed realtime computation system  Provides abstractions for implementing event-based computations over a cluster of physical nodes  Manages high throughput data streams  Performs parallel computations on them  It can be effectively used to design complex event-driven applications on intense streams of data 25/05/14 Cloud-Based Data Stream Processing 27
  • 28. STORM  Originally developed by Nathan Marz and team at BackType, then acquired by Twitter, now an Apache Incubator project.  Currently used by Twitter, Groupon, The Weather Channel, Taobao, etc.  Design goals:  Guaranteed data processing  Horizontal scalability  Fault Tolerance 25/05/14 Cloud-Based Data Stream Processing 28
  • 29. STORM An application is represented by a topology: Spout Spout Spout Spout Bolt Bolt Bolt BoltBolt Bolt Bolt Bolt 25/05/14 Cloud-Based Data Stream Processing 29
  • 30. OPERATOR EXPRESSIVENESS  Storm is designed for custom operator definition  Bolts can be designed as POJOs adhering to a specific interface  Implementations in other languages are feasible  Trident[10] offers a more high level programming interface [10] N. Marz. Trident tutorial. https://github.com/nathanmarz/storm/wiki/Trident-tutorial, 2013. 25/05/14 Cloud-Based Data Stream Processing 30
  • 31. OPERATOR EXPRESSIVENESS  Operators are connected through grouping:  Shuffle grouping  Fields grouping  All grouping  Global grouping  None grouping  Direct grouping  Local or shuffle grouping 25/05/14 Cloud-Based Data Stream Processing 31
  • 32. PARALLELIZATION  Storm asks the developer to provide “parallelism hints” in the topology Spout Spout Spout Spout Bolt Bolt Bolt BoltBolt Bolt Bolt Bolt x3 x8 x5 25/05/14 Cloud-Based Data Stream Processing 32
  • 33. STORM INTERNALS  A storm cluster is constituted by a Nimbus node and n Worker nodes 25/05/14 Cloud-Based Data Stream Processing 33
  • 34. TOPOLOGY EXECUTION  A topology is run by submitting it to Nimbus  Nimbus allocates the execution of components (spouts and bolts) to the worker nodes using a scheduler  Each component has multiple instances (parallelism)  Each instance is mapped to an executor  A worker is instantiated whenever the hosting node must run executors for the submitted topology  Each worker node locally manages incoming/outgoing streams and local computation  The local surpervisor takes care that everything runs as prescribed  Nimbus monitors worker nodes during the execution to manage potential failures and the current resource usage 25/05/14 Cloud-Based Data Stream Processing 34
  • 35. TOPOLOGY EXECUTION  Storm’s default scheduler (EvenScheduler) applies a simple round robin strategy 25/05/14 Cloud-Based Data Stream Processing 35
  • 36. A PRACTICAL EXAMPLE  Word count: the HelloWorld for DSPs  Input: stream of text (e.g. from documents)  Output: number of appearance for each word Source: http://wpcertification.blogspot.it/2014/02/helloworld-apache-storm-word-counter.html H elloStorm LineReader Spout W ordSplitter Bolt W ordCounter Bolt text lines words TXT docs 25/05/14 Cloud-Based Data Stream Processing 36
  • 37. A PRACTICAL EXAMPLE  LineReaderSpout: reads docs and creates tuples public class LineReaderSpout implements IRichSpout { public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { this.context = context; this.fileReader = new FileReader(conf.get("inputFile").toString()); this.collector = collector; } public void nextTuple() { if (completed) { Thread.sleep(1000); } String str; BufferedReader reader = new BufferedReader(fileReader); while ((str = reader.readLine()) != null) { this.collector.emit(new Values(str), str); completed = true; } public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("line")); } public void close() { fileReader.close(); } } 25/05/14 Cloud-Based Data Stream Processing 37
  • 38. A PRACTICAL EXAMPLE  WordSplitterBolt: cuts lines in words public class WordSplitterBolt implements IRichBolt { public void prepare(Map stormConf, TopologyContext context,OutputCollector c) { this.collector = c; } public void execute(Tuple input) { String sentence = input.getString(0); String[] words = sentence.split(" "); for(String word: words){ word = word.trim(); if(!word.isEmpty()){ word = word.toLowerCase(); collector.emit(new Values(word)); } } collector.ack(input); } public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } } 25/05/14 Cloud-Based Data Stream Processing 38
  • 39. A PRACTICAL EXAMPLE  WordCounterBolt: counts word occurrences public class WordCounterBolt implements IRichBolt { public void prepare(Map stormConf, TopologyContext context, OutputCollector c) { this.counters = new HashMap<String, Integer>(); this.collector = c; } public void execute(Tuple input) { String str = input.getString(0); if(!counters.containsKey(str)){ counters.put(str, 1); } else { Integer c = counters.get(str) +1; counters.put(str, c); } collector.ack(input); } public void cleanup() { for (Map.Entry<String, Integer> entry : counters.entrySet()){ System.out.println (entry.getKey()+" : " + entry.getValue()); } } } 25/05/14 Cloud-Based Data Stream Processing 39
  • 40. A PRACTICAL EXAMPLE  HelloStorm: contains the topology definition public class HelloStorm { public static void main (String[] args) throws Exception{ Config config = new Config(); config.put("inputFile", args[0]); config.setDebug(true); config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1); TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("line-reader-spout", new LineReaderSpout()); builder.setBolt("word-spitter", new WordSplitterBolt().shuffleGrouping( "line-reader-spout"); builder.setBolt("word-counter", new WordCounterBolt()).shuffleGrouping( "word-spitter"); LocalCluster cluster = new LocalCluster(); cluster.submitTopology("HelloStorm", config, builder.createTopology()); Thread.sleep(10000); cluster.shutdown(); } } 25/05/14 Cloud-Based Data Stream Processing 40
  • 42. PARTITIONING SCHEMES  Data Parallelism:  How to parallelize the execution of an operator?  How to detect the optimal level of parallelization?  Operator Distribution:  How to distribute the load across available hosts?  How to achieve a load balance between these machines? 28/05/2014 Cloud-based Data Stream Processing 42
  • 43. RQUIREMENTS FOR DATA PARALLELISM  Transparent to the user  correct results in correct order (identical to sequential execution) 28/05/2014 Cloud-based Data Stream Processing Filter Filter Round- Robin Union Aggr Aggr Hash (groupID) Merge& Sort 43
  • 44. DATA PARALLELISM  First presented by FLUX[11] and Borealis[12]  Explicitely done using partitioning operators  User needs to decide:  partitioning scheme  merging scheme  level of parallelism 28/05/2014 Cloud-based Data Stream Processing 44 [11] M. Shah, et al. "Flux: An adaptive partitioning operator for continuous query systems." In ICDE, 2003. [12] D. Abadi, et al. "The Design of the Borealis Stream Processing Engine." In CIDR, 2005..
  • 45. PARALLELISM FOR CLOUD-BASED DSP  Massive parallelization (>100 partitions)  Support custom operators  Adapt parallelization level to the workload without user interaction 28/05/2014 Cloud-based Data Stream Processing 45
  • 46. DATA PARALLELISM IN STORM  User defines number of parallel task  Storm support different partitioning schemes (aka grouping):  Shuffle grouping, Fields grouping, All grouping, Global grouping, None grouping, Direct grouping, Local or shuffle grouping, Custom 28/05/2014 Cloud-based Data Stream Processing 46
  • 47. MAPREDUCE FOR STREAMING [13,14]  Extend MapReduce model for streaming data:  Break strict phases  Introduce Stateful Reducer 28/05/2014 Cloud-based Data Stream Processing Map Map Map Reduce Reduce OutputInput [13] T. Condie, et al. "MapReduce Online." In NSDI,2010. [14] A. Brito, et al. "Scalable and low-latency data processing with StreamMapReduce." CloudCom, 2011. 47
  • 48. STATEFUL REDUCER[14] reduceInit(k1) { // custom user class S = new State();S.sum = 0; S.count = 0; // object S is now associated with key k1 return S; } reduce(Key k, <List new, List expired, List window, UserState S>) { For v in expired do: // Remove contribution of expired events S.sum -= v; S.count--; For v in new do: // Add contribution of new events S.sum += v; S.count++; send(k1, S.sum/S.count); } [14] A. Brito, et al. "Scalable and low-latency data processing with StreamMapReduce." CloudCom, 2011. 28/05/2014 Cloud-based Data Stream Processing 48
  • 49. AUTO PARALLELIZATION[15,16]  Goal:  detect parallelizable regions in operator graph with custom operators for user-defined operators  Runtime support for enforcing safety conditions  Safety Conditions:  For an operator: no state or partitioned state, selectivity < 1, at most one pre/successor  For an parallel region: compatible keys, forwarded keys, region-local fusion dependencies 28/05/2014 Cloud-based Data Stream Processing [15] S Schneider, et al. "Auto-parallelizing stateful distributed streaming applications." In PACT, 2012. [16] Wu et al. “Parallelizing Stateful Operators in a Distributed Stream Processing System: How, Should You and How Much?” In DEBS, 2012. 49
  • 50. AUTO PARALLELIZATION  Compiler-based Approach:  Characterize each operator based on a set of criterias (state type, selectivity, forwarding)  Merge different operators together into parallel regions (based on left-first approach)  Best parallelization strategy is applied automatically  Level of parallelism is decided on job admission 28/05/2014 Cloud-based Data Stream Processing 50
  • 51. ELASTICITY[17]  Goal: React to unpredicted load peaks & reduce the amount of ideling resources. 28/05/2014 Cloud-based Data Stream Processing Workload Load Static Provisioning Elastic Provisioning Underprovisioning Overprovisioning [17] M. Armbrust, et al. "A view of cloud computing." Communications of the ACM, 2010. 51
  • 52. DIFFERENT TYPES OF ELASTICITY  Vertical Elasticity (Scale up)  Adapt the number of threads per operator based on the workload.  Horizontal Elasticity (Scale out)  Adapt the number of hosts based on the workload. 28/05/2014 Cloud-based Data Stream Processing 52
  • 53. ELASTIC OPERATOR EXECUTION[18]  Dynamically adapt number of threads for a stateless operator based on the workload  Limitations: works only on thread-level and for stateless operators 28/05/2014 Cloud-based Data Stream Processing [18] S. Schneider, et al. "Elastic scaling of data parallel operators in stream processing." In IPDPS, 2009. Operator Thread Pool 53
  • 54. ELASTIC AUTO-PARALLELIZATION[19]  Combines ideas of Elasticity and Auto- Parallelization  Adapt parallelization level using controller-based approach  Dynamic adaption of parallelization level based on operator migration protocol 28/05/2014 Cloud-based Data Stream Processing 54 [19] B. Gedik, et al. “Elastic Scaling for Data Stream Processing." In TPDS, 2013.
  • 55. Horizontal barrierVertical barrier ELASTIC AUTO-PARALLELIZATION  Dynamic adaption of parallelization: lend & borrow approach 28/05/2014 Cloud-based Data Stream Processing 55 Op1 Op2 Splitter Merger Op2 Op2 1. Lend Phase2. Borrow Phase
  • 56. OPERATOR DISTRIBUTION  Well-studied problem since first DSP prototypes  Typically referred to as „Operator Placement“  Examples: Borealis, SODA, Pietzuch et al.  Design decisions[20]  Optimization goal: What is optimized ?  Execution mode: Centralized or Decentralized?  Algorithm runtime: Offline, online or both? 28/05/2014 Cloud-based Data Stream Processing 56 [20] G.T. Lakshmanan, et al. "Placement strategies for internet-scale data stream systems." In IEEE Internet Computing, 2008.
  • 57. OPTIMIZATION GOAL  Different objectives:  Balance CPU load (handle load variations)  Minimize Network latency (minimize latency for sensor networks)  Latency optimization (predictable quality of service) 28/05/2014 Cloud-based Data Stream Processing 57
  • 58. EXECUTION MODE  Centralized Execution:  All decision done by centralized management component  Major disadvantage: manager becomes scalability bottleneck  Decentralized Execution:  Different hosts try to agree on an operator placement  Two examples: operator routing and commutative execution 28/05/2014 Cloud-based Data Stream Processing 58
  • 59. DECENTRALIZER EXECUTION  Based on pairwise agreement[21] 28/05/2014 Cloud-based Data Stream Processing Host 1 Host 2 Host 3 Host 4 Aggr4 Aggr2 Filter 2 Filter 3 Filter 5 Filter 6 Filter 4 Filter 1 Aggr 3Aggr 4 Aggr 3 Aggr 5Aggr 6 Source 1 Source 2 59 [21] M. Balazinska, et al. "Contract-Based Load Management in Federated Distributed Systems." NSDI. Vol. 4. 2004.
  • 60. DECENTRALIZER EXECUTION  Based on decentralized routing[22] Host 2 Host 3 Host 4 Host 1 Src1 Src2 J D F F 28/05/2014 Cloud-based Data Stream Processing 60 [22] Y.Ahmad, et al. "Network-aware query processing for stream-based applications. In VLDB, 2004.
  • 61. ALGORITHM RUNTIME  Offline  Based on estimation of input rates, selectivities, etc.  Online  Mostly simplistic initial estimation (e.g. round-robin or random placement)  Continously measurements and adaption during runtime 28/05/2014 Cloud-based Data Stream Processing 61
  • 62. MOVEMENT PROTOCOLS  How to ensure loss-free movement of operators?  No input event shall be lost  State need to be restored on new host  Movement strategies:  Pause & Resume vs. Parallel track  large overlap with research on adaptive query processing[23] 28/05/2014 Cloud-based Data Stream Processing 62 [23] Y. Zhu, et al. "Dynamic plan migration for continuous queries over data streams." In SIGMOD, 2004.
  • 63. PAUSE & RESUME  Approach: 1. Stop execution 2. Move state to new host 3. Restart execution on new hosts  Properties:  Simple & generic  Latency Peak can be observed due to pausing 28/05/2014 Cloud-based Data Stream Processing 63
  • 64. PARALLEL TRACK  Approach: 1. Start new instance 2. Move or create up to date state 3. Stop old instance as soon as instances are in sync  Properties:  No latency peak  Requires duplicate detection, detection of „sync“ status  Events processed twice 28/05/2014 Cloud-based Data Stream Processing 64
  • 65. OPERATOR PLACEMENT ITHIN CLOUD- BASED DSP  Different setup (highly location-wise distributed vs. single cluster)  Larger scale (100 …. 1000 hosts)  New objectives: Elasticity, Monetary Cost, Energy Efficiency Mostly centralized, adaptive solutions optimizing CPU utilization or monetary cost with Pause & Resume Operator Movement. 28/05/2014 Cloud-based Data Stream Processing 65
  • 66. RECENT OPERATOR PLACEMENT APPROACHES  SQPR[24]  Query planner for data centers with heterogenes resources  MACE[25]  Present san approach for precise latency estimation 28/05/2014 Cloud-based Data Stream Processing [24] V. Kalyvianaki et al. "SQPR: Stream query planning with reuse“. In ICDE, 2011. [25] B. Chandramouli et al. "Accurate latency estimation in a distributed event processing system“. In ICDE, 2011. 66
  • 67. STORM LOAD MODEL  Worker: physical JVM and executes subset of all the tasks of the topology  Task: Parallel instance of an operator  Executor: Thread of an worker, executes one or more tasks of the same operator 28/05/2014 Cloud-based Data Stream Processing 67
  • 68. EXAMPLE FOR STORM LOAD MODEL Config conf = new Config(); conf.setNumWorkers(3); // use two worker processes topologyBuilder.setSpout(„src", new Source(), 3); topologyBuilder.setBolt(„aggr", new Aggregation(), 6).shuffleGrouping(„src"); topologyBuilder.setBolt(„sink", new Sink(), 6).shuffleGrouping(„aggr"); StormSubmitter.submitTopology( "mytopology", conf, topologyBuilder.createTopology() ); http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. 28/05/2014 Cloud-based Data Stream Processing 68
  • 69. STORM: DISTRIBUTION MODEL[26] 28/05/2014 Cloud-based Data Stream Processing 69 Source (spout) Aggr (bolt) Sink (bolt) Parallelism hint =3 Parallelism hint =6 Parallelism hint =3 Worker 1 Source Task Sink Task Aggr Task Aggr Task Worker 2 Source Task Sink Task Aggr Task Aggr Task Worker 3 Source Task Sink Task Aggr Task Aggr Task [26] Storm. http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm- topology.html.
  • 70. ADAPTIVE PLACEMENT IN STORM  Manual reconfiguration (pause & resume complete topology)  Enhanced Load scheduler for Twitter Storm[27]  Load balancing adapted to Storm architecture 28/05/2014 Cloud-based Data Stream Processing $ storm rebalance mytopo -n 4 -e src=8 70 [27] L. Aniello, et al. "Adaptive online scheduling in storm.“ In DEBS, 2013.
  • 71. DIFFERENT LEVELS OF ELASTICITY 28/05/2014 Cloud-based Data Stream Processing Elastic Action Taken System Virtual Machine Move virtual machines transparent to the user. [28] Engine New engine instance is started and new queries are employed on this engine [29] Query New query instance is started and the input data is split [30] Operator New operator instance is created and the input data is split [31] [28] T. Knauth, et al. "Scaling non-elastic applications using virtual machines." In Cloud, 2011. [29] W. Kleiminger, et al."Balancing load in stream processing with the cloud." In ICDEW, 2011. [30] M. Migliavacca, et al. "SEEP: scalable and elastic event processing.“ In Middleware, 2010. [31] R. Fernandez, et al. „Integrating scale out and fault tolerance in stream processing using operator state management”, SIGMOD 2013. 71
  • 72. ELASTICS DSP SYSTEM ARCHITECTURE 28/05/2014 Cloud-based Data Stream Processing Elasticity Manager Resource Manager Data Stream Processing Engine Data Stream Processing Engine Data Stream Processing Engine Virtual Machine Virtual Machine Virtual Machine ... ... 72
  • 73. STREAMCLOUD[9]  Realized different level of elasticity on a highly scalable DSP  Studied different major design decisions:  Optimal Level of Elasticity  Migration strategy  Load balancing based on user-defined thresholds 28/05/2014 Cloud-based Data Stream Processing 73 [9] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, C. Soriente, and P. Valduriez. Streamcloud: An Elastic and Scalable Data Streaming System. IEEE TPDS, 2012.
  • 74. SEEP[31]  Present the state of the art mechanims for state management  combine fault tolerance & scale out using same mechanism  Highly scalable and fast recovery 28/05/2014 Cloud-based Data Stream Processing 74 [31] R. Fernandez, et al. "Integrating scale out and fault tolerance in stream processing using operator state management", SIGMOD 2013.
  • 75. MILLWHEEL[4]  Similar mechanisms like SEEP or StreamCloud  Support massive scale out  Centralized manager  Optimization of CPU Load 28/05/2014 Cloud-based Data Stream Processing 75 [4] T. Akidau, et al. „MillWheel: Fault-Tolerant Stream Processing at Internet Scale”, VLDB 2013.
  • 77. FAULT TOLERANCE IN DSPS  Small scale stream processing  Faults are an exception  Optimize for the lucky case  Catch faults at execution time and start recovery procedures (possibly expensive)  Large scale DSPs (e.g. cloud based)  Faults are likely to affect every execution  Consider them in the design process 28/05/2014 Cloud-Based Data Stream Processing 77
  • 78. FAULT TOLERANCE IN DSPS  Two main fault causes  Message losses  Computational element failures  Event tracking  Makes sure each injected event is correctly processed with well-defined semantics  State management  Makes sure that the failure of a computational element will not affect the system‘s correct execution 28/05/2014 Cloud-Based Data Stream Processing 78
  • 79. EVENT TRACKING  When a fault occurs, events may need to be replayed  Typical approach:  Request acks from downstream operators  Detect losses by setting timeouts on acks  Replay lost events from upstream operators  Tracking and managing information on processed events may be needed to guarantee specific processing semantics 28/05/2014 Cloud-Based Data Stream Processing 79
  • 80. STORM – ET  Event tracking in storm is guaranteed by two different techniques:  Acker processes  at-least-once message processing  Transactional topologies  exactly-once message processing 28/05/2014 Cloud-Based Data Stream Processing 80
  • 81. STORM – ET  A tuple injected in the system can cause the production of multiple tuples in the topology  This production partially maps the underlying application topology  Storm keeps track of the processing DAG stemming from each input tuple 28/05/2014 Cloud-Based Data Stream Processing 81
  • 82. STORM – ET  Example: word-count topology Spout Splitter bolt Counter bolt “stay hungry, stay foolish” text lines words [“stay”,“hungry”,“stay”,“foolish”] [“stay”, 2] [“hungry”, 1] [“foolish”, 1] “stay hungry, stay foolish” “stay” [“stay”, 1] “hungry” “stay” “foolish” [“hungry”, 1] [“foolish”, 1] [“stay”, 2] 28/05/2014 Cloud-Based Data Stream Processing 82
  • 83. STORM – ET  Acker tasks are responsible for monitoring the flow of tuples in the DAG  Each bolt “acks” the correct processing of a tuple  The processing of a tuple can be committed when it has fully traversed the DAG  In this case the acker notifies the original spout  The spout must implement an ack(Object msgId) method  It can be used to garbage collect tuple state 28/05/2014 Cloud-Based Data Stream Processing 83
  • 84. STORM – ET  If a tuple does not reach the end of the DAG  The acker timeouts and invoke fail(Object msgId) on the spout  The spout method implementation is in charge of replaying failed tuples  Notice: the original data source must be able to reliably reply events (e.g. a reliable MQ) 28/05/2014 Cloud-Based Data Stream Processing 84
  • 85. STORM – ET  Developer perspective:  Correctly connect bolts with tuple sources (anchoring)  Anchoring is how you specify the tuple tree public void execute(Tuple tuple) { String sentence = tuple.getString(0); for(String word: sentence.split(" ")) { _collector.emit(tuple, new Values(word)); } _collector.ack(tuple); } Anchor Source: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing 28/05/2014 Cloud-Based Data Stream Processing 85
  • 86. STORM – ET  Developer perspective:  Explicitly ack processed tuples  Or extend BaseBasicBolt public void execute(Tuple tuple) { String sentence = tuple.getString(0); for(String word: sentence.split(" ")) { _collector.emit(tuple, new Values(word)); } _collector.ack(tuple); } Explicit ack Source: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing 28/05/2014 Cloud-Based Data Stream Processing 86
  • 87. STORM – ET  Implement ack() and fail() on the spout(s) public void nextTuple() { if(!toSend.isEmpty()){ for(Map.Entry<Integer, String> transactionEntry: toSend.entrySet()){ Integer transactionId = transactionEntry.getKey(); String transactionMessage = transactionEntry.getValue(); collector.emit(new Values(transactionMessage),transactionId); } toSend.clear(); } } public void ack(Object msgId) { messages.remove(msgId); } public void fail(Object msgId) { Integer transactionId = (Integer) msgId; Integer failures = transactionFailureCount.get(transactionId) + 1; if(failures >= MAX_FAILS){ throw new RuntimeException(”Too many failures on Tx [”+transactionId+"]”); } transactionFailureCount.put(transactionId, failures); toSend.put(transactionId,messages.get(transactionId)); } Source: J. Leibiusky, G. Eisbruch and D. Simonassi. Getting Started with Storm. O’Reilly, 2012 re-schedule Failed topology Completed processing 28/05/2014 Cloud-Based Data Stream Processing 87
  • 88. STORM – ET  How does the acker task work?  It is a standard bolt  The acker tracks for each tuple emitted by a spout the corresponding DAG  It acks the spout whenever the DAG is complete  You can instantiate parallel ackers to improve performance  Tuples are randomly assigned to ackers to improve load balancing (uses mod hashing) 28/05/2014 Cloud-Based Data Stream Processing 88
  • 89. STORM – ET  How can ackers keep track of DAGs?  Each tuple is identified by a unique 64bit ID  The acker stores in a map for each tuple IP  The ID of the emitter task  An ack val  An ack val is a bit-vector that encodes  IDs of tuples stemmed from the initial one  IDs of acked tuples 28/05/2014 Cloud-Based Data Stream Processing 89
  • 90. STORM – ET  Ack val management  When a tuple is emitted by a spout  Initializes the vector and encodes in it the tuple ID  When a tuple is acked  XOR its ID in the ack val  When an anchored tuple is emitted  XOR its ID in the ack val  When the ack val is empty, all tuples in the DAG have been acked (with high probability) 28/05/2014 Cloud-Based Data Stream Processing 90
  • 91. STORM – ET  If exactly-once semantics is required use a transactional topology  Transaction = processing + committing  Processing is heavily parallelized  Committing is strictly sequential  Storm takes care of  State management (through Zookeeper)  Transaction coordination  Fault detection  Provides a batch processing API  Note: requires a source able to reply data batches 28/05/2014 Cloud-Based Data Stream Processing 91
  • 92. EVENT TRACKING  In other systems the event tracking functionality is strongly coupled with state management  events cannot be garbage collected when they are acknowledged  Need to wait for the checkpointing of a state updated with such events  Timestream does not store all the events and re- compute those to be replayed by tracking their dependencies with input events, similarly to Storm 28/05/2014 Cloud-Based Data Stream Processing 92
  • 93. EVENT TRACKING  SEEP stores non-checkpointed events on the upstream operators  Millwheel persists all intermediate results to an underlying Distributed File System.  It also provides exactly-once semantics  D-Streams stores data to be processed in immutable partitioned datasets  These are implemented as resilient distributed datasets (RDD) 28/05/2014 Cloud-Based Data Stream Processing 93
  • 94. STATE MANAGEMENT  Stateful operators require their state to be persisted in case of failures.  Two classic approaches  Active replication  Passive replication 28/05/2014 Cloud-Based Data Stream Processing 94
  • 95. STATE MANAGEMENT 28/05/2014 Cloud-Based Data Stream Processing 95 Operator Operator Operator Previousstage Nextstage State State State Data State implicitly synchronized by ordered evaluation of same data Active replication Operator Operator (dormant) Operator (dormant) Previousstage Nextstage State State State Data State periodically persisted on stable storage and recovered on demand Passive replication Checkpoints
  • 96. STORM - SM  What happen when tasks fail ?  If a worker dies its supervisor restarts it  If it fails on startup Nimbus will reassign it on a different machine  If a machine fails its assigned tasks will timeout and Nimbus will reassign them  If Nimbus/Supervisors die they are simply restarted  Behave like fail-fast processes  They’re stateless in practice  Their state is safely maintained in in a Zookeeper cluster 28/05/2014 Cloud-Based Data Stream Processing 96
  • 97. STORM - SM  There is no explicit state management for operators in Storm  Trident builds automatic SM on top of it  Batch of tuples have unique Tx id  If a batch is retried it will have the exact same Tx id  State updates are ordered among batches  A new Tx is not committed if an old one is still pending  Transactional state guarantees transparently exactly-once tuple processing 28/05/2014 Cloud-Based Data Stream Processing 97
  • 98. STORM - SM  Transactional state is possible only if supported by the data source  As an alternative  Opaque transactional state  Each tuple is guaranteed to be executed exactly in one transaction  But the set of transactions for a given Tx id can change in case of failures  Non transactional state 28/05/2014 Cloud-Based Data Stream Processing 98
  • 99. STORM - SM  Few combinations guarantee exactly-once sem. 28/05/2014 Cloud-Based Data Stream Processing 99 State Non transactional Transactional Opaque transactional Spout Non transactional No No No Transactional No Yes Yes Opaque transactional No No Yes
  • 100. APACHE S4 - SM  Lightweight approach  Assumes that lossy failovers are acceptable.  PEs hosted on failed PNs are automatically moved to a standby server  Running PEs periodically perform uncoordinated and asynchronous checkpointing of their internal state  Distinct PE instances can checkpoint their state autonomously without synchronization  No global consistency  Executed asynchronously by first serializing the operator state and then saving it to stable storage through a pluggable adapter  Can be overridden by a user implementation 28/05/2014 Cloud-Based Data Stream Processing 100
  • 101. MILLWHEEL - SM  Guarantees strongly consistent processing  Checkpoints on persistent storage every single state change incurred after a computation  Can be executed  before emitting results downstream  operator implementations are automatically rendered idempotent with respect to the execution  after emitting results downstream  it’s up to the developer to implement idempotent operators if needed  Produced results are checkpointed with state (strong productions) 28/05/2014 Cloud-Based Data Stream Processing 101
  • 102. TIMESTREAM- SM  Takes a similar approach  State information checkpointed to stable storage  For each operator the state includes  state dependency: the list of input events that made the operator reach a specific state  output dependency: the list of input events that made an operator produce a specific output starting from a specific state.  Allow to correctly recover from a fault without having to store all the intermediate events produced by the operators and their states.  Is it possibile to periodically checkpoint a full operator state in order to avoid re-emitting the whole history of input events in order to recompute it. 28/05/2014 Cloud-Based Data Stream Processing 102
  • 103. SEEP - SM  Takes a different route allowing state to be stored on upstream operators  allows SEEP to treat operator recovery as a special case of a standard operator scale-out procedure  state in SEEP is characterized by three elements  internal state  output buffers  routing state  treated differently to reduce the state management impact on system performance. 28/05/2014 Cloud-Based Data Stream Processing 103
  • 104. D-STREAMS- SM  SM depends strictly on the computation model  A computation is structured as a sequence of deterministic batch computations on small time intervals  The input and output of each batch, as well as the state of each computation, are stored as reliable distributed datasets (RDDs)  For each RDD, the graph of operations used to compute (its lineage) it is tracked and reliably stored  The recovery of an RDD can be performed in parallel on separate nodes in order to speed up recovery operation  Operator state can be optionally checkpointed on stable storage to limit the number of operations required to restore it. 28/05/2014 Cloud-Based Data Stream Processing 104
  • 106. CONCLUSIONS  A third generation of DPSs is coming out that promise  Unprecedented computational power through horizontal scalability  On-demand dynamic load adaptation  Simplified programming models through powerful event management semantics  Graceful performance degradation in case of faults  A few issues remain to be solved to make DSP ready for the cloud-era 28/05/2014 Cloud-Based Data Stream Processing 106
  • 107. INFRASTRUCTURE AWARENESS  Most existing DSP systems are infrastructure oblivious  deployment strategies do not take into account the peculiar characteristics of the available hardware  The physical connection and relationship among infrastructural element is known to be a key factor to both improve system performance and fault tolerance.  E.g. Hadoop's “rack awareness”  We think infrastructure awareness is an open research field for data stream processing systems that could possibly bring important improvements. 28/05/2014 Cloud-Based Data Stream Processing 107
  • 108. COST-EFFICIENCY  Users in these days are not only interested in the performance of the system, but also the monetary cost  Cloud-based DSP systems should consider running cost as a variable for performance optimization  efficient scaling behavior maximizing the system utilization  efficient fault tolerance mechanisms  Bellavista et al.[32] proposed a first prototype, which allows the user to trade-off monetary cost and fault tolerance.  Their prototype only selects a subset of the operators for replication based on a user-defined value for the expected information completeness. 28/05/2014 Cloud-Based Data Stream Processing 108 [32] P. Bellavista, A. Corradi, S. Kotoulas, and A. Reale. Adaptive fault-tolerance for dynamic resource provisioning in distributed stream processing systems. In EDBT, 2014.
  • 109. ENERGY-EFFICIENCY  Most large-scale datacenters are striving to "go green" by making energy consumption more effective.  DSP systems should consider energy consumption a yet-another-variable in their performance optimization process  Note: this aspect is possibly strictly linked to the infrastructure awareness theme 28/05/2014 Cloud-Based Data Stream Processing 109
  • 110. ADVANCED ELASTICITY  Most of the existing elastic scaling solutions for DSPs apply simplistic schemes for load balancing.  Operator placement algorithms only optimize system utilization  Other metrics (end to end latency, network bandwidth, etc.) are only partially considered  However these metric are often used to sign contract-binding SLAs  Would it be possible to design DSP systems able to probabilistically guarantee performance ? 28/05/2014 Cloud-Based Data Stream Processing 110
  • 111. MULTI-DSP INTEGRATION  Most DSPs are particularly well suited for specific use cases  No one-size-fits-all solution  Components automatically selecting the best engine for each give use case would significantly improve the applicability of these systems.  No need to know in advance which is the best solution for given use case  No need to deploy and maintain different solutions  First promising results[33,34] 28/05/2014 Cloud-Based Data Stream Processing 111 [33] M. Duller, J. S. Rellermeyer, G. Alonso, and N. Tatbul. Virtualizing stream processing. Middleware, 2011. [34] H. Lim and S. Babu. Execution and optimization of continuous queries with cyclops. SIGMOD, 2013.
  • 112. Storm Performance MapReduce Performance 28/05/2014 Cloud-Based Data Stream Processing 112 MULTI-DSP INTEGRATION ESPER Performance Cyclops New Query With Window Size w Choose DSP 1 2 [34] H. Lim and S. Babu. Execution and optimization of continuous queries with cyclops. SIGMOD, 2013.
  • 113. REFERENCES [1] M. Stonebraker, U. Cetintemel and S. Zdonik "The 8 Requirements of real-time Stream Processing", ACM SIGMOD Record, 2005. [2] S. Chandrasekaran et al. "TelegraphCQ: Continuous Dataflow Processing", SIGMOD, 2003. [3] D. J. Abadi et al. "The Design of the Borealis Stream Processing Engine", CIDR 2005. [4] T. Akidau, A. Balikov, K. Bekirog ̆lu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. "MillWheel: Fault-tolerant Stream Processing at Internet Scale". VLDB, 2013. [5] Improving Twitter Search with Real-time Human Computation, 2013 http://blog.echen.me/2013/01/08/improving- twitter-search-with-real-time-human-computation [6] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform.ICDMW, 2010. [7] "Dublin City Council - Traffic Flow Improved by Big Data Analytics Used to Predict Bus Arrival and Transit Times". IBM, 2013. http://www-03.ibm.com/software/businesscasestudies/en/us/corp?docid=RNAE-9C9PN5 [8] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, Z. Zhang. "Timestream: Reliable Stream Computation in the Cloud". Eurosys, 2013. [9] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, C. Soriente, and P. Valduriez. "Streamcloud: An Elastic and Scalable Data Streaming System". IEEE TPDS, 2012. [10] N. Marz. "Trident tutorial". https://github.com/nathanmarz/storm/wiki/Trident-tutorial. [11] M. Shah, et al. "Flux: An adaptive partitioning operator for continuous query systems." In ICDE, 2003. [12] D. Abadi, et al. "The Design of the Borealis Stream Processing Engine." In CIDR, 2005.. [13] T. Condie, et al. "MapReduce Online." In NSDI,2010. [14] A. Brito, et al. "Scalable and low-latency data processing with stream mapreduce." CloudCom, 2011. [15] S. Schneider, et al. "Auto-parallelizing stateful distributed streaming applications." In PACT, 2012.
  • 114. REFERENCES [16] Wu et al. “Parallelizing Stateful Operators in a Distributed Stream Processing System: How, Should You and How Much?” In DEBS, 2012. [17] M. Armbrust, et al. "A view of cloud computing." Communications of the ACM, 2010. [18] S. Schneider, et al. "Elastic scaling of data parallel operators in stream processing." In IPDPS, 2009. [19] B. Gedik, et al. “Elastic Scaling for Data Stream Processing." In TPDS, 2013. [20] G.T. Lakshmanan, et al. "Placement strategies for internet-scale data stream systems." In IEEE Internet Computing, 2008. [21] M. Balazinska, et al. "Contract-Based Load Management in Federated Distributed Systems." NSDI. Vol. 4. 2004. [22] Y.Ahmad, et al. "Network-aware query processing for stream-based applications. In VLDB, 2004. [23] Y. Zhu, et al. "Dynamic plan migration for continuous queries over data streams." In SIGMOD, 2004. [24] V. Kalyvianaki et al. "SQPR: Stream query planning with reuse“. In ICDE, 2011. [25] B. Chandramouli et al. "Accurate latency estimation in a distributed event processing system“. In ICDE, 2011. [26] Storm. http://storm.incubator.apache.org/documentation/Understanding-the-parallelism-of-a-Storm- topology.html. [27] L. Aniello, et al. "Adaptive online scheduling in storm.“ In DEBS, 2013. [28] T. Knauth, et al. "Scaling non-elastic applications using virtual machines." In Cloud, 2011. [29] W. Kleiminger, et al."Balancing load in stream processing with the cloud." In ICDEW, 2011. [30] M. Migliavacca, et al. "SEEP: scalable and elastic event processing.“ In Middleware, 2010.
  • 115. REFERENCES [31] R. Fernandez, et al. "Integrating scale out and fault tolerance in stream processing using operator state management", SIGMOD 2013. [32] P. Bellavista, A. Corradi, S. Kotoulas, and A. Reale. "Adaptive fault-tolerance for dynamic resource provisioning in distributed stream processing systems". In EDBT, 2014. [33] M. Duller, J. S. Rellermeyer, G. Alonso, and N. Tatbul. "Virtualizing stream processing". Middleware, 2011. [34] H. Lim and S. Babu. "Execution and optimization of continuous queries with cyclops". SIGMOD, 2013.