8. • Architecture for stream and CEP processing
• Input from buses and SCATS sensors
• Use of crowdsourcing to resolve data source unreliability
• Dataset of 13 GB from Dublin city
16. • 1.4 million consumers
• Demand Response Optimization
1. Peak demand forecasting
2. Effective response selection
• Data source: AMIs (Advanced Metering Infrastructure)
• 3 TB of data per day
19. • Detection of events: earthquakes, typhoons, etc.
• Twitter users as sensors
• Location estimation: Kalman and particle filtering
• Detects 96% of earthquakes reported by the Japan Meteorological Agency
63. Parsing/Filtering/ETL
Aggregation: collection and summarization of tuples
Merging: combining of streams with different schemas
Splitting: partitioning of a stream into multiple ones for data/task parallelism or some logical reason
Data mining/Machine Learning/NLP: spam filtering, fraud detection, recommendation systems, data stream clustering, sentiment analysis
… Others: relational algebra, artificial intelligence and other custom operations
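The aggregation, merging, and splitting operators listed above can be sketched on plain in-memory collections. This is only an illustration, not any particular engine's API: the `BusReading`, `SensorReading`, and `Event` schemas and all function names are hypothetical.

```scala
// Illustrative only: tuples are case classes and streams are in-memory Seqs.
object OperatorSketch {
  // Hypothetical schemas for two input streams.
  final case class BusReading(busId: String, delaySec: Int)
  final case class SensorReading(sensorId: String, vehicles: Int)
  // Hypothetical common schema produced by the merge step.
  final case class Event(source: String, key: String, value: Int)

  // Aggregation: collect tuples by key and summarize each group (here: a sum).
  def aggregate(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.key).map { case (k, es) => k -> es.map(_.value).sum }

  // Merging: map both schemas onto a common one, then combine the streams.
  def merge(buses: Seq[BusReading], sensors: Seq[SensorReading]): Seq[Event] =
    buses.map(b => Event("bus", b.busId, b.delaySec)) ++
      sensors.map(s => Event("sensor", s.sensorId, s.vehicles))

  // Splitting: hash-partition a stream into n sub-streams for data parallelism.
  def split(events: Seq[Event], n: Int): Map[Int, Seq[Event]] =
    events.groupBy(e => java.lang.Math.floorMod(e.key.hashCode, n))
}
```

Hash partitioning on the key keeps all tuples with the same key in the same sub-stream, so a per-key aggregation can run on each partition independently.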
65. Property          Traditional   Data Stream
    Distributed       No            Yes
    Type of Result    Accurate      Approximate
    Memory Usage      Unlimited     Restricted
    Processing Time   Unlimited     Restricted
    No. of Passes     Multiple      Single
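To make the "single pass, restricted memory, approximate result" column concrete, here is a minimal sketch of one well-known streaming algorithm, the Misra-Gries frequent-items summary. It is chosen here purely as an illustration and is not taken from the slides; the object and method names are hypothetical.

```scala
object HeavyHitters {
  // Misra-Gries summary: one pass over the stream, at most k - 1 counters in memory.
  // Each returned count underestimates the true frequency by at most n / k,
  // where n is the number of items seen.
  def summarize[A](stream: Iterator[A], k: Int): Map[A, Int] = {
    require(k >= 2, "k must be at least 2")
    val counters = scala.collection.mutable.Map.empty[A, Int]
    stream.foreach { item =>
      if (counters.contains(item)) counters(item) += 1
      else if (counters.size < k - 1) counters(item) = 1
      else {
        // No free counter: decrement all, dropping those that reach zero.
        counters.keys.toList.foreach { key =>
          if (counters(key) == 1) counters.remove(key) else counters(key) -= 1
        }
      }
    }
    counters.toMap
  }
}
```

The stream is consumed as an `Iterator`, so each item is seen exactly once, and memory stays bounded by `k` no matter how long the stream runs.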
121. Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
Finally, the processed results of the RDD operations are returned in batches
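The chop-then-process idea can be sketched without Spark, using plain collections in place of RDDs. The `Record` type and both function names are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```scala
object MicroBatchSketch {
  // A timestamped record; timestamps are assumed to be non-negative milliseconds.
  final case class Record(timestampMs: Long, value: String)

  // Chop a stream into batches of `batchMs` milliseconds: records are grouped
  // by the index of the interval their timestamp falls into.
  def discretize(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1)

  // Run the same deterministic batch job on every batch; results come back in batches.
  def processBatches[R](batches: Seq[(Long, Seq[Record])])(job: Seq[Record] => R): Seq[(Long, R)] =
    batches.map { case (index, batch) => index -> job(batch) }
}
```

For example, `processBatches(discretize(records, 1000L))(_.size)` would count the records in each one-second batch.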
122. Discretized Stream Processing
Batch sizes as low as ½ second, latency ~ 1 second
Potential for combining batch processing and streaming processing in the same system
123. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDDs representing a stream of data
[Figure: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, t+1, t+2) is stored in memory as an RDD (immutable, distributed)]
124. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Figure: flatMap is applied to each batch (@ t, t+1, t+2) of the tweets DStream; new RDDs are created for every batch of the new hashTags DStream, e.g. [#cat, #dog, … ]]
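The slides never define `getTags`. A minimal sketch of what such a helper might look like follows, assuming a status is represented by its plain text (an assumption made for this example). The per-batch `flatMap` is shown on plain collections rather than on a real DStream, and the `hashTags` function here is likewise hypothetical.

```scala
object HashTagSketch {
  // Hypothetical stand-in for getTags: a status is assumed to be just its text.
  private val TagPattern = "#\\w+".r

  def getTags(status: String): Seq[String] =
    TagPattern.findAllIn(status).toList

  // One inner Seq per batch; flatMap is applied to each batch independently,
  // mirroring how a DStream transformation yields a new RDD for every batch.
  def hashTags(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(batch => batch.flatMap(getTags))
}
```

`flatMap` rather than `map` is the natural choice because one status can yield zero, one, or many hashtags.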
125. Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
[Figure: tweets DStream → flatMap → hashTags DStream → save, for each batch (@ t, t+1, t+2); every batch is saved to HDFS]
126. Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });
hashTags.saveAsHadoopFiles("hdfs://...");
Function object to define the transformation
127. Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data
Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant
Data lost due to worker failure can be recomputed from the input data
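The recompute-from-lineage idea can be sketched as a toy data structure that stores only its input and the operations applied to it. This is an illustration of the principle, not how Spark implements RDDs; the `Lineage` class and its methods are hypothetical.

```scala
object LineageSketch {
  // A dataset that stores no results, only its lineage: the input it was
  // derived from and the sequence of operations that created it.
  final case class Lineage[A](input: () => Seq[A], ops: List[Seq[A] => Seq[A]]) {
    def map(f: A => A): Lineage[A] =
      copy(ops = ops :+ ((xs: Seq[A]) => xs.map(f)))

    def filter(p: A => Boolean): Lineage[A] =
      copy(ops = ops :+ ((xs: Seq[A]) => xs.filter(p)))

    // Recompute from the fault-tolerant input by replaying the operations in order.
    def compute(): Seq[A] = ops.foldLeft(input())((data, op) => op(data))
  }
}
```

If a worker holding a computed result fails, calling `compute()` again replays the same deterministic operations on the replicated input and yields the same result.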
128. Key concepts
DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduce, join, …
- Stateful operations – window, countByValueAndWindow, …
Output Operations – send data to external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results
129. Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.countByValue()
130. Example 3 – Count the hashtags over last 10 mins
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
sliding window operation: window length = Minutes(10), sliding interval = Seconds(1)
131. Example 3 – Counting the hashtags over last 10 mins
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
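What the windowed count computes can be sketched on plain collections, with window length and sliding interval expressed in numbers of batches rather than in time. The `windowedCounts` function is hypothetical and is not Spark's implementation.

```scala
object WindowSketch {
  // Each inner Seq is one batch of hashtags. For every slide position,
  // count the values seen in the last `windowLen` batches.
  def windowedCounts(
      batches: Seq[Seq[String]],
      windowLen: Int,
      slide: Int): Seq[Map[String, Int]] = {
    require(windowLen > 0 && slide > 0, "window length and slide must be positive")
    (1 to batches.length by slide).map { end =>
      // The window covers batches [end - windowLen, end), clipped at the start.
      val window = batches.slice(math.max(0, end - windowLen), end).flatten
      window.groupBy(identity).map { case (tag, occurrences) => tag -> occurrences.size }
    }
  }
}
```

Assuming a batch interval of one second, the slide's `window(Minutes(10), Seconds(1))` would correspond to `windowLen = 600` and `slide = 1` in this sketch.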
138. A Storm cluster is composed of one Nimbus and a set of supervisors
[Figure: Storm cluster (Nimbus and Scheduler on the master node; Supervisor, Slots, Workers, Executors on worker nodes 1 … n)]
139. The Nimbus assigns work to supervisors, manages failures, and monitors resource usage.
[Figure: Storm cluster (Nimbus and Scheduler on the master node; Supervisor, Slots, Workers, Executors on worker nodes 1 … n)]
140. The number of slots of a supervisor is the maximum number of workers it can execute
[Figure: Storm cluster (Nimbus and Scheduler on the master node; Supervisor, Slots, Workers, Executors on worker nodes 1 … n)]
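The relation between slots and workers can be sketched as a small assignment function. The round-robin policy below is an assumption made for illustration and is not Storm's actual scheduler; `Supervisor` and `assign` are hypothetical names.

```scala
object SlotSketch {
  // A supervisor offers a fixed number of slots; each slot can run one worker.
  final case class Supervisor(name: String, slots: Int)

  // Assign workers to free slots round-robin across supervisors.
  // Returns worker -> supervisor name; workers beyond total capacity stay unassigned.
  def assign(workers: Seq[String], supervisors: Seq[Supervisor]): Map[String, String] = {
    // Expand each supervisor into its slots, then interleave by slot index.
    val slots = supervisors
      .flatMap(s => (0 until s.slots).map(i => (i, s.name)))
      .sortBy(_._1) // stable sort keeps supervisor order within a slot index
      .map(_._2)
    workers.zip(slots).toMap
  }
}
```

With supervisors offering 2 and 1 slots, at most three workers can run at once; a fourth worker stays unassigned until a slot frees up.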
180. Heinze, Thomas, et al. "Tutorial: Cloud-based Data Stream Processing." (2014).
Artikis, Alexander, Matthias Weidlich, Francois Schnitzler, Ioannis Boutsis, Thomas Liebig, Nico Piatkowski, Christian Bockermann et al. "Heterogeneous Stream Processing and Crowdsourcing for Urban Traffic Management." In EDBT, pp. 712-723. 2014.
Bouillet, Eric, et al. "Processing 6 billion CDRs/day: from research to production (experience report)." Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems. ACM, 2012.
Lakshmanan, G. T., Y. Li, and R. Strom. "Placement strategies for internet-scale data stream systems." IEEE Internet Computing 12.6 (2008): 50-60.
Simmhan, Yogesh, et al. "An informatics approach to demand response optimization in smart grids." NATURAL GAS 31 (2011): 60.
Sakaki, Takeshi, Makoto Okazaki, and Yutaka Matsuo. "Earthquake shakes Twitter users: real-time event detection by social sensors." Proceedings of the 19th international conference on World wide web. ACM, 2010.