8. Storm
“Storm is a distributed realtime computation system.
Storm provides a set of general primitives for doing
realtime computation. Storm is simple, can be used with
any programming language, is used by many companies,
and is a lot of fun to use!”
http://storm.incubator.apache.org/
17. Parallelism hints of the six topology components: 4, 1, 2, 2, 3, 4
2 Supervisors with 4 Slots each
Worker processes = 8
18. Parallelism hints: 4, 1, 2, 2, 3, 4
Worker processes = 8, spread over 2 Supervisors
Combined parallelism = 4 + 1 + 2 + 2 + 3 + 4 = 16
Tasks per worker = 16 / 8 = 2
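The arithmetic on this slide can be checked with a small sketch. It assumes one task per unit of parallelism and places tasks round-robin, which is a simplification and not Storm's actual scheduler. The component names A to F are hypothetical, since the slide only gives the hints.

```python
from itertools import cycle


def spread_over_workers(parallelism_hints, num_workers):
    """Place one task per unit of parallelism on the workers, round-robin."""
    workers = [[] for _ in range(num_workers)]
    next_worker = cycle(range(num_workers))
    for component, hint in parallelism_hints.items():
        for index in range(hint):
            workers[next(next_worker)].append(f"{component}#{index}")
    return workers


# Hypothetical component names; the hints are the ones from the slide.
hints = {"A": 4, "B": 1, "C": 2, "D": 2, "E": 3, "F": 4}
workers = spread_over_workers(hints, num_workers=8)

combined_parallelism = sum(hints.values())          # 4 + 1 + 2 + 2 + 3 + 4
tasks_per_worker = combined_parallelism / len(workers)
```

With 16 tasks and 8 workers, every worker ends up with exactly two tasks, although which components share a worker depends on the placement strategy.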
19. Example: Word Count
File → FileSpout → (lines) → SplitterBolt → (words) → CounterBolt
FileSpout: parallelism hint = 2
SplitterBolt: parallelism hint = 3
CounterBolt: parallelism hint = 2
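As a rough illustration of the data flow only, the three components can be modelled as plain Python functions chained together. This is a single-process sketch, not Storm code: it ignores parallelism and groupings. The function names simply mirror the component names on the slide, and the input is two lines of the Storm description quoted earlier.

```python
import re
from collections import Counter


def file_spout(text_lines):
    """Emit the input one line at a time."""
    for line in text_lines:
        yield line


def splitter_bolt(lines):
    """Split each line into words, dropping punctuation."""
    for line in lines:
        for word in re.findall(r"[A-Za-z]+", line):
            yield word


def counter_bolt(words):
    """Keep a running count per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts


lines = [
    "Storm is a distributed",
    "realtime computation system. Storm provides a",
]
counts = counter_bolt(splitter_bolt(file_spout(lines)))
```

In a real topology each stage runs as several parallel tasks, so the way words are routed to the counting tasks decides whether the counts come out right.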
20. Example: Word Count
FileSpout → SplitterBolt → CounterBolt
Input file: “Storm is a distributed realtime computation system. Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!”
21. Example: Word Count
FileSpout emits the line “Storm is a distributed”
22. Example: Word Count
FileSpout emits the line “realtime computation system. Storm provides a”
23. Example: Word Count
FileSpout → SplitterBolt: shuffle grouping
24. Example: Word Count
SplitterBolt emits the words: Storm, is, a, distributed, realtime, computation, system, provides, Storm, a
25. Example: Word Count
CounterBolt: Storm x1, a x1, is x1, distributed x1, realtime x1, computation x1, system x1, provides x1, Storm x1, a x1
26. Example: Word Count
FileSpout → SplitterBolt: shuffle grouping
SplitterBolt → CounterBolt: fields grouping
CounterBolt: Storm x2, a x2, is x1, distributed x1, realtime x1, computation x1, system x1, provides x1
27. Groupings
● Shuffle grouping
● Fields grouping
● All grouping
● Global grouping
● Direct grouping
● Local or shuffle grouping
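To make the list more concrete, here is a minimal sketch of how four of these groupings could choose target tasks. It is not Storm's implementation. The CRC32 hash and the plain random choice are assumptions made for illustration, and the real shuffle grouping also balances load across tasks. Direct grouping (the producer picks the task) and local or shuffle grouping (prefer tasks in the same worker process) are left out.

```python
import random
import zlib


def shuffle_grouping(tuple_, tasks, rng=random):
    """Send the tuple to one randomly chosen task."""
    return [rng.choice(tasks)]


def fields_grouping(tuple_, tasks, field):
    """Send tuples with the same field value to the same task."""
    key = str(tuple_[field]).encode("utf-8")
    return [tasks[zlib.crc32(key) % len(tasks)]]


def all_grouping(tuple_, tasks):
    """Replicate the tuple to every task."""
    return list(tasks)


def global_grouping(tuple_, tasks):
    """Send every tuple to a single task: the one with the lowest id."""
    return [min(tasks)]


counter_tasks = [0, 1]
first = fields_grouping({"word": "Storm"}, counter_tasks, "word")
second = fields_grouping({"word": "Storm"}, counter_tasks, "word")
# first == second: both occurrences of "Storm" reach the same counter task
```

This is why the word count example uses fields grouping in front of CounterBolt: with shuffle grouping the two occurrences of a word could land on different tasks and be counted separately.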
29. Fault-tolerance
● Worker dies
○ Supervisor will restart it
● Worker dies too many times
○ Nimbus will reassign it to another node
● Node dies
○ Nimbus will reassign its tasks to another node
● Nimbus is not a SPOF
● Nimbus & Supervisors are fail-fast
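The recovery rules on this slide can be read as a small decision procedure. The toy model below is only a sketch of that reading. The `Cluster` class, its method names and the `MAX_LOCAL_RESTARTS` threshold are hypothetical and do not correspond to Storm classes or settings.

```python
from dataclasses import dataclass, field

MAX_LOCAL_RESTARTS = 3  # hypothetical threshold, not a Storm setting


@dataclass
class Cluster:
    nodes: list
    assignment: dict = field(default_factory=dict)  # worker -> node
    restarts: dict = field(default_factory=dict)    # worker -> restarts on its node

    def worker_died(self, worker):
        """Restart locally first; reassign after too many local restarts."""
        count = self.restarts.get(worker, 0) + 1
        if count <= MAX_LOCAL_RESTARTS:
            self.restarts[worker] = count
            return "supervisor restarted worker"
        return self._reassign(worker)

    def node_died(self, node):
        """Move every worker of a dead node to a surviving node."""
        self.nodes.remove(node)
        moved = [w for w, n in self.assignment.items() if n == node]
        for worker in moved:
            self._reassign(worker)
        return moved

    def _reassign(self, worker):
        current = self.assignment.get(worker)
        candidates = [n for n in self.nodes if n != current]
        if not candidates:
            raise RuntimeError("no other node available")
        self.assignment[worker] = candidates[0]
        self.restarts[worker] = 0
        return "nimbus reassigned worker"


cluster = Cluster(nodes=["node-1", "node-2"], assignment={"worker-1": "node-1"})
outcomes = [cluster.worker_died("worker-1") for _ in range(4)]
```

In this model the first three failures are handled locally and the fourth moves the worker to the other node.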
31. Guaranteeing message processing
● When is a message “fully processed”?
● Solutions
○ Transactional Topologies
○ Trident framework
Diagram: the line “Storm is a distributed” is split into the words Storm, is, a and distributed; three of them are processed Ok and one is marked Fail
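One way to answer the question on this slide: a message is fully processed once every tuple derived from it has been acknowledged, and it has failed as soon as any of them fails. The sketch below tracks that with a single XOR checksum per message, an idea borrowed from the acker described in Storm's documentation but heavily simplified here. The `AckTracker` class and its method names are hypothetical.

```python
import random


class AckTracker:
    """Tracks one tuple tree per spout message with a single XOR checksum."""

    def __init__(self):
        self.pending = {}    # message id -> XOR of emitted but unacked tuple ids
        self.failed = set()  # message ids with at least one failed tuple

    def emit(self, message_id, tuple_id):
        self.pending[message_id] = self.pending.get(message_id, 0) ^ tuple_id

    def ack(self, message_id, tuple_id):
        # XOR-ing the same id a second time cancels it out.
        self.pending[message_id] ^= tuple_id

    def fail(self, message_id):
        self.failed.add(message_id)

    def fully_processed(self, message_id):
        return (message_id not in self.failed
                and self.pending.get(message_id) == 0)


tracker = AckTracker()
names = ["line", "Storm", "is", "a", "distributed"]
ids = {name: random.getrandbits(64) for name in names}

for tuple_id in ids.values():
    tracker.emit("msg-1", tuple_id)
for tuple_id in ids.values():
    tracker.ack("msg-1", tuple_id)

done = tracker.fully_processed("msg-1")  # every emitted id was acked
```

The appeal of the checksum is constant memory per message regardless of how large the tuple tree grows. Random 64-bit ids make it very unlikely that the checksum reaches zero before the tree is really complete.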
32. Yet another example
Components: TwitterSpout, SplitterBolt, CounterBolt, CommitBolt, DB
Streams: tweet, word, signal
Groupings: shuffle grouping, fields grouping, all grouping
https://github.com/ferrangali/betabeers-storm
33. Batch + Real time
● Lambda architecture
○ New data → Batch layer → Serving
● Batch layer
○ High latency
○ Reprocesses all data
34. Batch + Real time
● Lambda architecture
○ New data → Speed layer and Batch layer → Serving
● Speed layer
○ Low latency
○ Fast & incremental algorithms
○ Eventually overridden by batch layer
● Batch layer
○ High latency
○ Reprocesses all data
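The interplay of the two layers can be sketched with word counts as the view being served. This is a single-process toy, assuming the batch run sees all data at once. The `LambdaViews` class and its method names are hypothetical.

```python
from collections import Counter


class LambdaViews:
    def __init__(self):
        self.master = []              # all data ever received, append-only
        self.batch_view = Counter()   # rebuilt from scratch by the batch layer
        self.speed_view = Counter()   # updated incrementally by the speed layer

    def new_data(self, word):
        self.master.append(word)
        self.speed_view[word] += 1    # low latency, incremental

    def run_batch(self):
        self.batch_view = Counter(self.master)  # high latency, reprocesses all data
        self.speed_view.clear()                 # speed results are overridden

    def query(self, word):
        """Serving: combine the batch view with the recent speed view."""
        return self.batch_view[word] + self.speed_view[word]


views = LambdaViews()
views.new_data("storm")
views.new_data("storm")
before_batch = views.query("storm")   # answered by the speed view only
views.run_batch()
views.new_data("storm")
after_batch = views.query("storm")    # batch view plus one recent update
```

In a real deployment the batch run takes a long time, so the speed view would only drop the data that the finished batch run already covers.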
37. Trovit
● Batch layer:
○ MapReduce pipeline over HDFS
○ Inputs: Kafka, XML
○ Stages: Filter → Enrich → Dedup → Index
38. Trovit
● Speed layer
○ Storm topology
○ Feeds Spout (XML) and Kafka Spout (Kafka) emit ads
○ Processor Bolt turns each ad into a rich ad
○ Indexer Bolt: group by index, commit in batch every 5 minutes
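The "group by index, commit in batch every 5 minutes" behaviour could look roughly like the buffer below. It is a sketch under assumptions, not Trovit's code: `BatchingIndexer`, the `commit` callback and the injected `clock` are hypothetical, and the index name in the usage lines is made up.

```python
from collections import defaultdict

COMMIT_INTERVAL_SECONDS = 5 * 60


class BatchingIndexer:
    def __init__(self, commit, clock):
        self.commit = commit              # callable(index_name, list_of_rich_ads)
        self.clock = clock                # callable returning the time in seconds
        self.buffers = defaultdict(list)  # index name -> buffered rich ads
        self.last_commit = clock()

    def process(self, index_name, rich_ad):
        """Buffer the ad under its index, then commit if the interval elapsed."""
        self.buffers[index_name].append(rich_ad)
        self.maybe_commit()

    def maybe_commit(self):
        if self.clock() - self.last_commit < COMMIT_INTERVAL_SECONDS:
            return False
        for index_name, ads in self.buffers.items():
            self.commit(index_name, ads)
        self.buffers.clear()
        self.last_commit = self.clock()
        return True


# Usage with a fake clock so the example does not have to wait 5 minutes.
now = [0]
committed = []
indexer = BatchingIndexer(
    commit=lambda index_name, ads: committed.append((index_name, list(ads))),
    clock=lambda: now[0],
)
indexer.process("index-a", "rich ad 1")   # buffered, nothing committed yet
now[0] = COMMIT_INTERVAL_SECONDS
indexer.process("index-a", "rich ad 2")   # interval elapsed: both ads committed
```

Injecting the clock keeps the batching logic testable. A commit is only triggered when an ad arrives, so a quiet stream would also need a periodic call to `maybe_commit`.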