Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
3. About us (Sankt Augustin, 24-25.08.2013)
is a bunch of…
Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus, Enterprise Java Specialists and Performance Geeks.
Join us!
8. Batch vs. Real-Time processing
• Batch processing
– Gathering of data and processing it as a group at one time.
• Real-time processing
– Processing of data that takes place as the information is being entered.
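The contrast can be sketched with a toy aggregation in plain Java (hypothetical page-view counts, not Storm code): a batch job gathers all data first and computes one result for the whole group, while real-time processing keeps an up-to-date result after every single event.

```java
import java.util.Arrays;
import java.util.List;

public class BatchVsRealTime {
    public static void main(String[] args) {
        // Hypothetical event stream of page-view counts
        List<Integer> events = Arrays.asList(3, 1, 4, 1, 5);

        // Batch processing: gather everything, then process as one group
        int batchTotal = 0;
        for (int e : events) {
            batchTotal += e;
        }
        System.out.println("batch total = " + batchTotal); // prints 14

        // Real-time processing: update the result as each event arrives,
        // so a current answer is available after every event
        int runningTotal = 0;
        for (int e : events) {
            runningTotal += e;
            System.out.println("running total = " + runningTotal);
        }
    }
}
```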
10. Bridging the gap…
• A batch workflow is too slow
• Views are out of date
[Timeline: everything up to a few hours ago has been absorbed into batch views; the last few hours of data are not absorbed yet]
11. Storm vs. Hadoop
Storm:
• Real-time processing
• Topologies run forever
• No SPOF
• Stateless nodes
Hadoop:
• Batch processing
• Jobs run to completion
• NameNode is SPOF
• Stateful nodes
Both:
• Scalable
• Guarantee no data loss
• Open source
12. Stream Processing
Stream processing is a technical paradigm to process big volumes of unbounded sequences of tuples in real time.
Use cases:
• Algorithmic trading
• Sensor data monitoring
• Continuous analytics
15. Welcome, Twitter Storm!
• Created by Nathan Marz @ BackType
– Analyze tweets, links, users on Twitter
• Open sourced on 19th September, 2011
– Eclipse Public License 1.0
– Storm v0.5.2
• Latest Updates
– Current stable release v0.8.2, released on 11th January, 2013
– Major core improvements planned for v0.9.0
– Storm will be an Apache Project [soon..]
16. Storm under the hood
• Java & Clojure
• Apache Thrift
– Cross-language bridge, RPC, framework to build services
• ZeroMQ
– Asynchronous message transport layer
• Kryo
– Serialization framework
• Jetty
– Embedded web server
17. Conceptual view
• Spout: source of streams
• Bolt: consumer of streams, processes tuples, possibly emits new tuples
• Tuple: list of name-value pairs
• Stream: unbounded sequence of tuples
• Topology: network of spouts & bolts as the nodes and streams as the edges
18. Physical view
• Nimbus: master daemon process, responsible for
– distributing code
– assigning tasks
– monitoring failures
• ZooKeeper: stores the operational cluster state
• Supervisor: worker daemon process listening for work assigned to its node
• Worker process: Java process executing a subset of a topology
• Executor: Java thread spawned by a worker, runs one or more tasks of the same component
• Task: component (spout/bolt) instance, performs the actual data processing
19. A simple example: WordCount
shakespeare.txt -> FileReaderSpout -(line)-> WordSplitBolt -(word)-> WordCountBolt -> sorted list:
of: 18126
to: 18763
i: 19540
and: 26099
the: 27730
21. FileReaderSpout II
/**
* Declare the output field "line"
*/
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("line"));
}
/**
* We will read the file and get the collector object
*/
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
try {
this.fileReader = new FileReader(conf.get("wordsFile").toString());
} catch (FileNotFoundException e) {
throw new RuntimeException("Error reading file ["
+ conf.get("wordsFile") + "]");
}
this.collector = collector;
}
public void close() {
}
22. FileReaderSpout III
/**
* The only thing that this method will do is emit each file line
*/
public void nextTuple() {
/**
* nextTuple() is called forever, so if we have already read the file we
* will wait and then return
*/
if (completed) {
Utils.sleep(100);
return;
}
String str;
// Open the reader
BufferedReader reader = new BufferedReader(fileReader);
try {
// Read all lines
while ((str = reader.readLine()) != null) {
/**
* Emit each line as a value
*/
this.collector.emit(new Values(str), str);
}
} catch (Exception e) {
throw new RuntimeException("Error reading tuple", e);
} finally {
completed = true;
}
}
}
23. WordSplitBolt I
package de.codecentric.storm.wordcount.bolts;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class WordSplitBolt extends BaseBasicBolt {
public void cleanup() {}
/**
* The bolt will only emit the field "word"
*/
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
24. WordSplitBolt II
/**
* The bolt will receive the line from the
* words file and process it to split it into words
*/
public void execute(Tuple input, BasicOutputCollector collector) {
String sentence = input.getString(0);
String[] words = sentence.split(" ");
for(String word : words){
word = word.trim();
if(!word.isEmpty()){
word = word.toLowerCase();
collector.emit(new Values(word));
}
}
}
25. WordCountBolt I
package de.codecentric.storm.wordcount.bolts;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
public class WordCountBolt extends BaseBasicBolt {
private static final long serialVersionUID = 1L;
Integer id;
String name;
Map<String, Integer> counters;
26. WordCountBolt II
/**
* On create
*/
@Override
public void prepare(Map stormConf, TopologyContext context) {
this.counters = new HashMap<String, Integer>();
this.name = context.getThisComponentId();
this.id = context.getThisTaskId();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
String str = input.getString(0);
/**
* If the word doesn't exist in the map we will create an entry for it,
* otherwise we will increment its counter by 1
*/
if (!counters.containsKey(str)) {
counters.put(str, 1);
} else {
Integer c = counters.get(str) + 1;
counters.put(str, c);
}
}
27. WordCountBolt III
/**
* At the end of the bolt's lifecycle (when the cluster is shut down) we
* will show the word counters
*/
@Override
public void cleanup() {
// Sort map
SortedSet<Map.Entry<String, Integer>> sortedCounts = entriesSortedByValues(counters);
System.out.println("-- Word Counter [" + name + "-" + id + "] --");
for (Map.Entry<String, Integer> entry : sortedCounts) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
…
}
28. WordCountTopology
public class WordCountTopology {
public static void main(String[] args) throws InterruptedException {
// Topology definition
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-reader",new FileReaderSpout());
builder.setBolt("word-normalizer", new WordSplitBolt())
.shuffleGrouping("word-reader");
builder.setBolt("word-counter", new WordCountBolt(),1)
.fieldsGrouping("word-normalizer", new Fields("word"));
// Configuration
Config conf = new Config();
conf.put("wordsFile", args[0]);
conf.setDebug(false);
// Run Topology
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count-topology", conf, builder.createTopology());
// You don't do this on a regular topology
Utils.sleep(10000);
cluster.killTopology("word-count-topology");
cluster.shutdown();
}
}
29. Stream Grouping
• Each spout or bolt might be running n instances in parallel
• Groupings are used to decide to which task in the subscribing bolt (group) a tuple is sent
• Possible groupings:
– Shuffle: random grouping
– Fields: grouped by value, such that equal values go to the same task
– All: replicates to all tasks
– Global: makes all tuples go to one task
– None: makes the bolt run in the same thread as the bolt/spout it subscribes to
– Direct: the producer (task that emits) controls which consumer will receive
– Local: if the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks
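The idea behind fields grouping can be sketched in a few lines of plain Java (a toy partitioner, not Storm's actual implementation): hashing the grouping field modulo the task count sends equal values to the same task, which is exactly what keeps per-word counters correct under parallelism.

```java
public class FieldsGroupingSketch {
    // Toy stand-in for a fields grouping: equal field values always
    // map to the same task index, different values spread out
    static int chooseTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String word : new String[] {"the", "storm", "the", "froscon"}) {
            System.out.println(word + " -> task " + chooseTask(word, numTasks));
        }
        // Both occurrences of "the" land on the same task,
        // so one WordCountBolt instance sees all counts for "the"
    }
}
```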
30. Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to setup & operate
• Free & Open Source
33. Parallelism
Number of worker nodes = 2
Number of worker slots per node = 4
Number of topology workers = 4

FileReaderSpout: parallelism_hint = 2; number of tasks not specified = same as parallelism hint = 2
WordSplitBolt: parallelism_hint = 4; number of tasks = 8
WordCountBolt: parallelism_hint = 6; number of tasks not specified = 6

Number of component instances = 2 + 8 + 6 = 16
Number of executor threads = 2 + 4 + 6 = 12
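The arithmetic above can be replayed in plain Java (a model of the rule, not the Storm API): each component gets one executor thread per parallelism hint, and its task count defaults to the hint when not set explicitly.

```java
public class ParallelismMath {
    public static void main(String[] args) {
        // {parallelism hint (= executors), explicit task count (-1 = not set)}
        // mirroring FileReaderSpout, WordSplitBolt, WordCountBolt above
        int[][] components = { {2, -1}, {4, 8}, {6, -1} };

        int executors = 0, tasks = 0;
        for (int[] c : components) {
            executors += c[0];                  // one executor per hint
            tasks += (c[1] >= 0) ? c[1] : c[0]; // tasks default to the hint
        }
        System.out.println("executor threads    = " + executors); // prints 12
        System.out.println("component instances = " + tasks);     // prints 16
    }
}
```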
34. Message passing
[Diagram: each worker has a receive thread that dispatches tuples arriving from other workers to per-executor receive queues; executors put results on an internal transfer queue that a transfer thread drains towards other workers]
• Interprocess communication is mediated by ZeroMQ
• Outside transfer is done with Kryo serialization
• Local communication is mediated by the LMAX Disruptor
• Inside transfer is done with no serialization
36. Fault tolerance
Cluster works normally:
• Nimbus monitors the cluster state via ZooKeeper
• The Supervisor synchronizes assignments via ZooKeeper and sends its heartbeat
• The Supervisor reads worker heartbeats from the local file system
• The worker sends executor heartbeats to ZooKeeper
37. Fault tolerance
Nimbus goes down:
Processing will still continue, but topology lifecycle operations and the reassignment facility are lost.
38. Fault tolerance
Worker node goes down:
Nimbus will reassign the tasks to other machines and the processing will continue.
39. Fault tolerance
Supervisor goes down:
Processing will still continue, but the assignment is never synchronized.
40. Fault tolerance
Worker process goes down:
The Supervisor will restart the worker process and the processing will continue.
42. Reliability API
public class FileReaderSpout extends BaseRichSpout {
public void nextTuple() {
…;
// Emitting the tuple with a message id
UUID msgId = getMsgID();
collector.emit(new Values(line), msgId);
}
public void ack(Object msgId) {
// Do something with the acked message id
}
public void fail(Object msgId) {
// Do something with the failed message id
}
}
public class WordSplitBolt extends BaseRichBolt {
private OutputCollector collector;
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
public void execute(Tuple input) {
for (String s : input.getString(0).split(" ")) {
// Anchoring the incoming tuple to each outgoing tuple
collector.emit(input, new Values(s));
}
// Sending the ack
collector.ack(input);
}
}
[Tuple tree: the line tuple "This is a line" fans out into one anchored tuple per word]
43. ACKing Framework
[Diagram: the spout sends "ACKer init" when it emits a tuple; bolts send "ACKer ack" or "ACKer fail" to the implicit ACKer bolt as tuples A, B and C flow from FileReaderSpout through WordSplitBolt to WordCountBolt]
• Emitted tuple A: XOR tuple A id with ack val
• Emitted tuple B: XOR tuple B id with ack val
• Emitted tuple C: XOR tuple C id with ack val
• Acked tuple A: XOR tuple A id with ack val
• Acked tuple B: XOR tuple B id with ack val
• Acked tuple C: XOR tuple C id with ack val
The ACKer tracks, per spout tuple: spout tuple id, spout task id, ACK val (64 bit).
When the ACK val has become 0, the ACKer implicit bolt knows the tuple tree has been completed.
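The XOR trick can be verified in isolation (hypothetical tuple ids; Storm applies the same property to the ACKer's 64-bit ack val): since x ^ x = 0 and XOR is commutative, XORing in every emitted id and every acked id, in any order, brings the ack val back to zero once each emitted tuple has been acked.

```java
public class AckValDemo {
    public static void main(String[] args) {
        // Hypothetical 64-bit tuple ids for tuples A, B and C
        long idA = 0x1234_5678_9ABC_DEF0L;
        long idB = 0x0F1E_2D3C_4B5A_6978L;
        long idC = 0xDEAD_BEEF_CAFE_BABEL;

        long ackVal = 0L;

        // Each "emitted" message XORs the tuple id into the ack val...
        ackVal ^= idA; ackVal ^= idB; ackVal ^= idC;

        // ...and each "acked" message XORs the same id in again,
        // here deliberately in a different order
        ackVal ^= idC; ackVal ^= idA; ackVal ^= idB;

        // x ^ x == 0 and XOR commutes, so the tree is complete
        System.out.println("tree completed: " + (ackVal == 0L)); // prints true
    }
}
```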
45. Cluster Setup
• Set up a ZooKeeper cluster
• Install dependencies on Nimbus and worker machines
– ZeroMQ 2.1.7 and JZMQ
– Java 6 and Python 2.6.6
– unzip
• Download and extract a Storm release to Nimbus and worker machines
• Fill in the mandatory configuration in storm.yaml
• Launch daemons under supervision using the storm scripts
• Start a topology:
– storm jar <path_topology_jar> <main_class> <arg1>…<argN>
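The mandatory storm.yaml entries might look like this (a sketch based on the Storm 0.8.x setup documentation; all hostnames, paths and ports are placeholders to adjust for your cluster):

```yaml
# ZooKeeper ensemble the cluster coordinates through (placeholder hosts)
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
# Host running the Nimbus daemon (placeholder)
nimbus.host: "nimbus.example.com"
# Local scratch directory for jars and configs (must be writable)
storm.local.dir: "/var/storm"
# One port per worker slot offered by this supervisor (4 slots here)
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```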
50. Basic resources
• Storm is available at
– http://storm-project.net/
– https://github.com/nathanmarz/storm
under Eclipse Public License 1.0
• Get help on
– http://groups.google.com/group/storm-user
– the #storm-user room on freenode
• Follow
@stormprocessor and @nathanmarz
51. Many contributions
• Community repository for modules to use Storm at
– https://github.com/nathanmarz/storm-contrib
– including integration with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS, …
• Good articles for understanding Storm internals
– http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
– http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Good slides for understanding real-life examples
– http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
– http://www.slideshare.net/KrishnaGade2/storm-at-twitter
52. Coming next…
• Current release: 0.8.2
• Work in progress (newest): 0.9.0-wip21
– SLF4J and Logback
– Pluggable tuple serialization and Blowfish encryption
– Pluggable interprocess messaging and a Netty implementation
– Some bug fixes
– And more
• Storm on YARN
• Storm on YARN
54. One example: Webshop
• Webtracking component
• No defined page impression
• Identifying page impressions using Varnish logs of the click-stream data
• A page consists of different fragments
– Body
– Article description
– Recommendation box, …
• Session data is also of interest
55. One example: Webshop
• Custom solution using J2EE and MongoDB
• Export into comScore DAx and the enterprise DWH
• The solution is currently working but not scalable
• What about performance?