20. Stream grouping
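A bolt subscribes to a stream using one of these groupings:
Shuffle: tuples are distributed randomly and evenly across the bolt's tasks
Fields: hash of a subset of the tuple's fields, modulo the number of tasks
All: broadcast to every task
Local or shuffle: prefers tasks in the same worker process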
and more...
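The fields grouping above can be sketched in plain Java without any Storm dependency. This is a minimal illustration, not Storm's actual implementation: the class name, the `chooseTask` helper and its parameters are hypothetical, and a tuple is modelled as a simple list of values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldsGroupingSketch {
    // Hypothetical helper: picks a task index from the hash of the selected field values.
    static int chooseTask(List<Object> tuple, int[] groupingFieldIndexes, int numTasks) {
        List<Object> selected = new ArrayList<>();
        for (int i : groupingFieldIndexes) {
            selected.add(tuple.get(i));
        }
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(selected.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int[] byWord = {0}; // group on the first field only
        List<Object> a = Arrays.<Object>asList("storm", 1);
        List<Object> b = Arrays.<Object>asList("storm", 42);
        // Tuples with the same grouping field value land on the same task.
        System.out.println(chooseTask(a, byWord, 4) == chooseTask(b, byWord, 4)); // prints true
    }
}
```

The point of the design is that all tuples sharing the grouping fields reach the same task, which is what makes per-key state such as counters possible.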
30. Acker bolt
Diagram labels: bu; b; letters: 0; logger
Ack value trace:
Start: 0000000
emit b 1001110 → 1001110
Acker value: 1001110
31. Acker bolt
Diagram labels: bu; b; b; letters: 0; logger
Ack value trace:
Start: 0000000
emit b 1001110 → 1001110
emit b 0101101 → 1100011
Acker value: 1100011
32. Acker bolt
Diagram labels: bu; b; b; letters: 1; logger
Ack value trace:
Start: 0000000
emit b 1001110 → 1001110
emit b 0101101 → 1100011
process b 1001110 → 1100011
Acker value: 1100011
33. Acker bolt
Diagram labels: bu; b, u; b, u; letters: 1; logger
Ack value trace:
Start: 0000000
emit b 1001110 → 1001110
emit b 0101101 → 1100011
process b 1001110 → 1100011
... → 1000101
Acker value: 1000101
34. Acker bolt
Diagram labels: bu; b, u; b, u; letters: 2; logger: b, u
Ack value trace:
Start: 0000000
emit b 1001110 → 1001110
emit b 0101101 → 1100011
process b 1001110 → 1100011
... → 1000101
process u 1000101 → 0000000
Acker value: 0000000
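The XOR bookkeeping behind these traces can be sketched as follows. This is a simplified illustration under stated assumptions: `AckTrackerSketch` and its methods are hypothetical names, only the first two tuple ids are taken from the slides, and the order of the acks is assumed (the intermediate values on the slides are not reproduced). Storm itself uses random 64-bit ids rather than 7-bit ones.

```java
class AckTrackerSketch {
    // Running XOR of all tuple ids seen so far for one spout tuple.
    private long ackVal = 0L;

    // Called once when a tuple is emitted and once more when it is acked.
    void update(long tupleId) {
        ackVal ^= tupleId;
    }

    // Every id has been XORed in twice, so the value has returned to zero.
    boolean isFullyProcessed() {
        return ackVal == 0L;
    }

    long value() {
        return ackVal;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        tracker.update(0b1001110L); // emit first tuple
        tracker.update(0b0101101L); // emit second tuple
        System.out.println(Long.toBinaryString(tracker.value())); // prints 1100011
        tracker.update(0b1001110L); // ack first tuple
        tracker.update(0b0101101L); // ack second tuple
        System.out.println(tracker.isFullyProcessed()); // prints true
    }
}
```

Because x XOR x = 0, the acker needs only a fixed amount of memory per spout tuple, however large the tuple tree grows.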
38. DRPC
LinearDRPCTopologyBuilder
● Is initialized with the name of the DRPC function
● Sets up the DRPC spout
● Returns the results to the DRPC server
Deployment:
● Launch DRPC server(s)
● Configure the locations of the DRPC servers in storm.yaml
● Submit DRPC topologies to Storm cluster
40. Trident
● Introduced with Storm 0.8.0
● High-level abstraction on top of Storm
● Stateful, incremental processing on top of a persistent store
● Exactly-once semantics
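One way to picture exactly-once state updates is to store the id of the last applied batch next to the value and skip a replayed batch. The sketch below is plain Java, not the Trident API: `TransactionalCountSketch`, `applyBatch` and the in-memory map are hypothetical stand-ins for a persistent store, and it assumes that a replayed batch carries the same transaction id and the same contents, and that batches are applied in order.

```java
import java.util.HashMap;
import java.util.Map;

class TransactionalCountSketch {
    // Value stored together with the id of the batch that last updated it.
    private static final class Entry {
        final long txid;
        final long count;

        Entry(long txid, long count) {
            this.txid = txid;
            this.count = count;
        }
    }

    // Stand-in for a persistent key-value store (assumption for illustration).
    private final Map<String, Entry> store = new HashMap<>();

    void applyBatch(long txid, String key, long delta) {
        Entry current = store.get(key);
        if (current != null && current.txid == txid) {
            return; // this batch was already applied; a replay must not count twice
        }
        long base = (current == null) ? 0L : current.count;
        store.put(key, new Entry(txid, base + delta));
    }

    long count(String key) {
        Entry current = store.get(key);
        return (current == null) ? 0L : current.count;
    }

    public static void main(String[] args) {
        TransactionalCountSketch state = new TransactionalCountSketch();
        state.applyBatch(1, "storm", 2);
        state.applyBatch(1, "storm", 2); // replay of batch 1 is ignored
        state.applyBatch(2, "storm", 3);
        System.out.println(state.count("storm")); // prints 5
    }
}
```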
41. High-level abstraction
// Split each sentence into words, group by word, and count per group.
Stream wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .aggregate(new Fields("word"), new Count(), new Fields("count"));
53. Cluster
● Storm (0.9.0)
○ Single production cluster
○ 1 master, 2 slaves
● ZooKeeper
● Apache Kafka
54. Apache Kafka
● Publish-subscribe messaging system
● Partitioned commit log
● Messages organised in topics
● Retains messages for some period of time
● Offset is controlled by the consumer
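The last two points can be illustrated with a toy, in-memory model of a single partition. This is not the Kafka client API: `PartitionLogSketch`, `append` and `read` are hypothetical names, and retention is left out. The log only appends and serves reads; the position lives entirely on the consumer side, so a consumer can rewind and re-read.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionLogSketch {
    // Append-only list standing in for one partition of a topic.
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns the offset it was stored at.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns up to max messages starting at the offset supplied by the consumer.
    List<String> read(long offset, int max) {
        int from = (int) Math.max(0, Math.min(offset, messages.size()));
        int to = Math.min(from + max, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("a");
        log.append("b");
        log.append("c");

        long offset = 0; // tracked by the consumer, not by the log
        List<String> batch = log.read(offset, 2);
        System.out.println(batch); // prints [a, b]
        offset += batch.size();
        System.out.println(log.read(offset, 2)); // prints [c]

        offset = 1; // rewind: the log still holds the messages, so they can be re-read
        System.out.println(log.read(offset, 2)); // prints [b, c]
    }
}
```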
61. Resource separation
● Topologies can starve each other
● storm-yarn (Yahoo!)
● storm-mesos (used in production at Twitter)
● Isolation scheduler (0.8.2)
62. Rebalance
● Scale by adding nodes and increasing the parallelism hint
● Rebalance: no redeploy needed
● The configuration still has to be changed for the next deployment
63. Spouts
● Names have to be unique across the cluster (because of ZooKeeper)
● Topology name prefix