Slide 2
Big Data
“Big Data is the capability to manage a
huge volume of disparate data, at the right
speed, and within the right time frame to
allow real-time analysis and reaction”
Slide 4
Enablers of Big Data
Map/Reduce frameworks – Hadoop
Scalable storage – HDFS, NoSQL
databases
Cheap computing power – Cloud computing
Slide 5
Why Real Time?
Better end-user experience
- Ex: View an ad, see the counter move.
Operational intelligence
- Low latency analysis
- Real time Dashboards
Event response
- Rule Engine, Personalization, Predictions
- Scalable analysis
Example: trend analysis to recommend 'hot'
articles.
Slide 6
Requirements
Scalable real-time processing requires a framework that is:
Fast
Scalable through process parallelization and distribution
Fault-tolerant
Able to guarantee data processing
Easy to learn, code, and operate
Robust
Slide 7
Storm
• Storm – an open-source distributed real-time
computation system.
• Developed by Nathan Marz at BackType,
which was acquired by Twitter.
Slide 8
Storm
Fast
Scalable by process parallelization and
distribution
Fault-tolerant
Guarantees data processing
Runs on JVM
Easy to learn, code and operate
Supports development in multiple
languages
Slide 9
Hadoop vs. Storm
Storm is for real-time processing:
Storm is to real-time computation what Hadoop is to batch computation.
Slide 11
Storm Use Cases
“Storm powers a wide variety of Twitter
systems, ranging in applications from
discovery, real-time analytics,
personalization, search, revenue
optimization, and many more.”
“Storm empowers stream/micro-batch
processing of user events, content feeds,
and application logs” - Yahoo
“ETL – move data from MongoDB to BI”
Slide 39
Stream groupings
Shuffle grouping: Tuples are randomly distributed across the
bolt's tasks
Fields grouping: The stream is partitioned by the fields specified
in the grouping
Custom grouping
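The routing idea behind these groupings can be sketched outside of Storm. The function names below are illustrative, not Storm's API: shuffle grouping spreads tuples randomly and roughly evenly across a bolt's tasks, while fields grouping hashes the chosen field so equal values always reach the same task.

```python
import random
from collections import defaultdict

def shuffle_grouping(tuples, num_tasks, seed=0):
    """Distribute tuples randomly (and roughly evenly) across tasks."""
    rng = random.Random(seed)
    assignment = defaultdict(list)
    for t in tuples:
        assignment[rng.randrange(num_tasks)].append(t)
    return assignment

def fields_grouping(tuples, num_tasks, field):
    """Route each tuple by hashing the grouping field, so tuples with
    equal field values always land on the same task."""
    assignment = defaultdict(list)
    for t in tuples:
        assignment[hash(t[field]) % num_tasks].append(t)
    return assignment

events = [{"user": u, "n": i} for i, u in enumerate(["ann", "bob", "ann", "cy"])]
by_user = fields_grouping(events, num_tasks=4, field="user")
# Both of "ann"'s tuples share one task; shuffle_grouping gives no such guarantee.
```

This per-value affinity is why fields grouping is the natural choice for stateful bolts such as per-user counters.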
Slide 44
Storm deployment
Out-of-the-box configuration is suitable for
production
One-click deploy to EC2 with the
storm-deploy project
Once deployed, easy to operate –
designed to be robust
The Storm daemons, Nimbus and the
Supervisors, are stateless and fail-fast
Useful UI
Slide 79
Conclusion
Storm allows us to solve a wide range of
business problems in real time
Thriving open-source community
Slide 80
Resources
Storm Project wiki
Storm starter project
Storm contributions project
Running a Multi-Node Storm cluster tutorial
Implementing real-time trending topic
A Hadoop Alternative: Building a real-time
data pipeline with Storm
Storm Use cases
Slide 81
Resources (cont’d)
Understanding the Parallelism of a Storm
Topology
Trident – high level Storm abstraction
A practical guide to Storm's Trident API
Storm online forum
Project source code
New York City Storm Meetup
Image credits: US NASA
Average enterprises can now process and make sense of big data.
Variety – the various types of data
Velocity – how fast the data is processed
Volume – how much data
Keeps running if a component dies, and self-heals.
Stream – read tuples, do some processing, update a database, and drop the tuples. Move data from an operational DB into BI, or process log files (ETL processing).
You ask Storm for a really expensive computation query online – for example, how many events have I received since last week?
Trending topics or most popular articles.
A graph of spouts and bolts connected by streams.
Number of worker processes per cluster.
Finally, you can change the number of workers and/or the number of executors for components using the "storm rebalance" command. The following command changes the number of workers for the "demo" topology to 3, the number of executors for the "myspout" component to 5, and the number of executors for the "mybolt" component to 1: storm rebalance demo -n 3 -e myspout=5 -e mybolt=1
The number of executor threads can be changed after the topology has been started (see the storm rebalance command). The number of tasks of a topology is static.
So one reason for having 2+ tasks per executor thread is to give you the flexibility to expand/scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated, you can then storm rebalance the topology to make full use of all 25 boxes without any downtime.
Another reason to run 2+ tasks per executor is (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.
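The workers/executors/tasks relationship described above can be sketched with a simple round-robin assignment. This is illustrative only – Storm's scheduler does the real placement internally – but it shows why a static task set still lets a rebalance change parallelism:

```python
def assign_tasks(num_tasks, num_executors):
    """Round-robin a fixed set of task ids over the current executors.

    The task set is static for the topology's lifetime; only the number
    of executors it is spread over changes on rebalance.
    """
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors

# 30 tasks on 15 executors: 2 tasks per executor thread.
before = assign_tasks(30, 15)
# After scaling up (e.g. via "storm rebalance"): the same 30 tasks
# spread over 25 executors, no topology restart required.
after = assign_tasks(30, 25)
```

Starting with more tasks than executors is exactly what leaves headroom for the rebalance scenario in the note above.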
Question
Submitter – uploads the topology JAR, with dependencies, to the Nimbus inbox. Nimbus – makes assignments and starts the topology.
Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
For example, MongoDB's _id.
There are two things you have to do as a user to benefit from Storm's reliability capabilities. First, you need to tell Storm whenever you're creating a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both of these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm's API provides a concise way of doing both of these tasks.
Specifying a link in the tuple tree is called anchoring.
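Storm's acker tracks whole tuple trees with an XOR trick: every tuple in the tree carries a random 64-bit id, each id is XORed into a running value when the link is created (anchoring) and XORed again when the tuple is acked, so the value returns to zero exactly when the tree is complete. The class below is a minimal sketch of that idea, not Storm's actual classes:

```python
import os

class AckerSketch:
    """Illustrative XOR-based ack tracking for one spout tuple's tree."""

    def __init__(self):
        self.ack_val = 0

    def anchor(self, tuple_id):
        # A new link in the tuple tree: XOR the id in.
        self.ack_val ^= tuple_id

    def ack(self, tuple_id):
        # The tuple finished processing: XOR the same id out.
        self.ack_val ^= tuple_id

    def fully_processed(self):
        return self.ack_val == 0

acker = AckerSketch()
spout_id = int.from_bytes(os.urandom(8), "big")
child_id = int.from_bytes(os.urandom(8), "big")
acker.anchor(spout_id)              # spout emits the root tuple
acker.anchor(child_id)              # a bolt emits an anchored child
acker.ack(spout_id)                 # root acked, child still outstanding
assert not acker.fully_processed()
acker.ack(child_id)                 # tree complete -> spout tuple acked
assert acker.fully_processed()
```

This scheme needs only constant memory per spout tuple regardless of tree size; it accepts a vanishingly small chance that random ids XOR to zero early.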