Mhug apache storm
- 1. Page 1 © Hortonworks Inc. 2014
Apache Storm
Joseph Niemiec
- 2. Page 2 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Quick Bio
• Hadoop Dev for 3+ years
• Co-Author for Apache Hadoop YARN
• Originally used Hadoop for location based services
– Destination Prediction
– Traffic Analysis
– Effects of weather at client locations on call center call types
• Pending Patent in Automotive/Telematics domain
• Defensive Paper on M2M Validation
• Started on analytics to be better at an MMORPG
• HWX SME for YARN, Tez & MapReduce
- 3. Page 3 © Hortonworks Inc. 2014
Stream Processing vs. Batch
Factor | Stream Processing | Batch Processing
Data: Freshness | Real-time (usually < 15 min) | Historical (usually more than 15 min old)
Data: Location | Primarily in memory (moved to disk after processing) | Primarily on disk (moved to memory for processing)
Processing: Speed | Sub-second to a few seconds | A few seconds to hours
Processing: Frequency | Always running | Sporadic to periodic
Clients: Who? | Automated systems only | Human & automated systems
Clients: Type | Primarily operational systems | Primarily analytical applications
- 4. Page 4 © Hortonworks Inc. 2014
Apache Storm – Stream Processing in Hadoop
• What is it?
– “Doing for real-time processing what MapReduce did for batch”
– Distributed Real-time computation engine.
• Driven by new types of data
– Sensor/Machine data, Logs,
– Geo-location, clickstream
• Enables new business use cases
– Low-latency dashboards
– Quality, Security, Safety, Operations Alerts
– Improved operations
– Real-time data integration
- 5. Page 5 © Hortonworks Inc. 2014
Key Capabilities of Storm
Data Ingest
• Extremely high ingest rates – millions of events/second
Processing
• Ability to easily plug in different processing frameworks
• Guaranteed processing – at-least-once processing semantics
Persistence
• Ability to persist data to multiple relational and non-relational data stores
Operations
• Security, HA, fault tolerance & management support
- 6. Page 6 © Hortonworks Inc. 2014
Key Capabilities of Storm
Highly scalable
• Horizontally scalable like Hadoop
• E.g.: a 10-node cluster can process 1M tuples per second per node
Fault-tolerant
• Automatically reassigns tasks on failed nodes
Guarantees processing
• Supports at-least-once & exactly-once processing semantics
Language agnostic
• Processing logic can be defined in any language
Apache project
• Brand, governance & a large active community
- 7. Page 7 © Hortonworks Inc. 2014
Understanding Storm via a Real-World Use Case
• A large truck fleet company wants to capture, in real time, events from
drivers in its trucks on the road across the US.
• Sensor devices on the trucks capture all kinds of events, ranging from
vehicle diagnostics to driver infractions.
– E.g.: excessive braking/acceleration, speeding, start/stop, etc.
• Initial Business Requirement:
– Stream these events in, filter on violations and do real-time alerting on “lots” of erratic
behavior over a short period of time.
- 8. Page 8 © Hortonworks Inc. 2014
High Level Architecture
[Figure: high-level architecture diagram. Components shown:]
• Truck Streaming Data: trucks T(1), T(2), ... T(N)
• Messaging Grid (WMQ, ActiveMQ, Kafka): truck events topic
• Stream Processing with Storm: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt
• High Speed Ingestion into Distributed Storage (HDFS): write to HDFS
• Real-time Serving with HBase: driver dangerous events, driver dangerous events count; write to HBase
• Alerts: create alerts; email alerts; ActiveMQ alert topic
• Spring WebApp with SockJS WebSockets: real-time streaming driver monitoring app; query driver events in real time; consume alerts in real time
• Interactive Query (Tez): perform ad hoc queries on driver/truck events and other related data sources
• Batch Analytics (MR2): do batch analysis/models & update HBase with the right thresholds for alerts (update alert thresholds)
- 9. Page 9 © Hortonworks Inc. 2014
Key Constructs in Apache Storm
• Tuples
• Streams
• Spouts
• Bolts
• Topology
- 10. Page 10 © Hortonworks Inc. 2014
Tuples and Streams
• What is a Tuple?
– The fundamental data structure in Storm: a named list of values, each of which can be of any data
type.
• What is a Stream?
– An unbounded sequence of tuples.
– The core abstraction in Storm; streams are what you “process” in Storm
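To make the two definitions concrete, here is a minimal sketch in plain Python (not Storm's API). The field names and event values are assumptions for illustration; the slides do not define a schema. A tuple is modelled as values paired with field names, and a stream as a generator that never ends on its own.

```python
from itertools import count, islice
from typing import Any, Dict, Iterator

# Hypothetical field names for a truck event; the slides do not define a schema.
FIELDS = ("driver_id", "truck_id", "event_type")


def make_tuple(*values: Any) -> Dict[str, Any]:
    """Pair each value with its field name, giving a 'named list of values'."""
    if len(values) != len(FIELDS):
        raise ValueError("expected %d values" % len(FIELDS))
    return dict(zip(FIELDS, values))


def truck_event_stream() -> Iterator[Dict[str, Any]]:
    """An unbounded stream: this generator never stops on its own."""
    event_types = ("normal", "speeding", "excessive braking")
    for n in count():
        yield make_tuple(n % 3, 100 + n % 2, event_types[n % 3])


# A consumer can only ever look at a finite prefix of an unbounded stream.
first_four = list(islice(truck_event_stream(), 4))
```

Because the stream has no end, anything that consumes it has to work incrementally, one tuple at a time, which is the role bolts play.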
- 11. Page 11 © Hortonworks Inc. 2014
Spouts
• What is a Spout?
– Generates streams, i.e. acts as a source of streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Multiple instances of a Spout can be spun up and their number adjusted dynamically as needed
- 12. Page 12 © Hortonworks Inc. 2014
Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores,
and alerting logic
– Multiple instances of a Bolt can be spun up and their number adjusted dynamically as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in HBase
2. HDFSBolt: persisting into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and a message to
ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
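The filtering, counting and alerting logic described for the bolts can be sketched in plain Python as below. This is only an illustration under stated assumptions: the event names, the threshold value and the function names are hypothetical, the counts live in memory rather than in HBase, and nothing here uses Storm's bolt API.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical values: the slides give neither event names nor a threshold.
VIOLATIONS = {"speeding", "excessive braking", "excessive acceleration"}
THRESHOLD = 2


def filter_violations(events: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Filtering step: pass on only the events that are violations."""
    return (e for e in events if e["event_type"] in VIOLATIONS)


def monitor(violations: Iterable[Dict[str, str]], threshold: int = THRESHOLD) -> Iterator[str]:
    """Count violations per driver and alert once when the count exceeds the threshold."""
    counts: Counter = Counter()
    for event in violations:
        driver = event["driver_id"]
        counts[driver] += 1
        if counts[driver] == threshold + 1:  # first moment the count exceeds the threshold
            yield "ALERT driver=%s violations=%d" % (driver, counts[driver])


events = [
    {"driver_id": "d1", "event_type": "speeding"},
    {"driver_id": "d2", "event_type": "normal"},
    {"driver_id": "d1", "event_type": "excessive braking"},
    {"driver_id": "d1", "event_type": "speeding"},
    {"driver_id": "d2", "event_type": "speeding"},
]
alerts = list(monitor(filter_violations(events)))
```

Chaining the two generators mirrors how one bolt's output stream becomes another bolt's input stream.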
- 13. Page 13 © Hortonworks Inc. 2014
Topology
• What is a Topology?
– A network of spouts and bolts wired together into a workflow
[Figure: Truck-Event-Processor Topology. Components shown: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt, connected by four streams.]
- 14. Page 14 © Hortonworks Inc. 2014
Fields Grouping
• Question: If multiple instances of a single bolt can be created, how do you control
which instance a tuple goes to when it is emitted?
• Answer: Stream groupings, of which fields grouping is one
• What is a stream grouping?
– Provides a way to control how tuples are routed to bolt instances.
– Many groupings exist, including shuffle, fields, global, etc.
• Example of fields grouping in the Use Case
– Events for a given driver/truck must be sent to the same monitoring bolt instance.
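One common way to get "same key, same instance" routing is to hash the grouping fields and take the result modulo the number of instances. The sketch below shows that idea in plain Python; it is an assumption about the general technique, not a description of Storm's internal hashing, and the function name and parallelism value are hypothetical.

```python
import zlib
from typing import Dict, Sequence


def fields_grouping(event: Dict[str, object], fields: Sequence[str], num_tasks: int) -> int:
    """Pick a target task index from the values of the grouping fields."""
    key = "|".join(str(event[name]) for name in fields)
    # crc32 is used because it is stable across runs; Python's built-in hash()
    # for strings is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_tasks


NUM_MONITORING_TASKS = 4  # hypothetical parallelism for the monitoring bolt

events = [
    {"driver_id": "d1", "event_type": "speeding"},
    {"driver_id": "d2", "event_type": "speeding"},
    {"driver_id": "d1", "event_type": "excessive braking"},
]
targets = [fields_grouping(e, ["driver_id"], NUM_MONITORING_TASKS) for e in events]
```

Both events for driver `d1` map to the same index, so a per-driver count kept inside one instance stays complete. Two different drivers may still share an instance.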
- 15. Page 15 © Hortonworks Inc. 2014
Different Stream Groupings
Grouping type | What it does | When to use
Shuffle grouping | Sends each tuple to a bolt instance in random round-robin sequence | Atomic operations, e.g. math operations
Fields grouping | Sends tuples to a bolt instance based on one or more fields in the tuple | Segmentation of the incoming stream; counting tuples of a certain type
All grouping | Sends a single copy of each tuple to all instances of a receiving bolt | Sending a signal to all bolts, e.g. clear cache or refresh state; sending a ticker tuple to signal bolts to save state
Custom grouping | Implement your own grouping so tuples are routed based on custom logic | Maximum flexibility to change processing sequence, logic, etc. based on factors like data types, load, seasonality
Direct grouping | Source decides which bolt instance will receive the tuple | Depends
Global grouping | Sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID) | Global counts
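The routing rules in the table can be written as small functions that map a list of target task IDs to the tasks that should receive a tuple. This is a plain-Python sketch of the behaviour described above, not Storm's grouping API; the function names are hypothetical, and the shuffle variant uses a simple random choice as an approximation.

```python
import random
from typing import List, Sequence


def shuffle_grouping(task_ids: Sequence[int], rng: random.Random) -> List[int]:
    """One randomly chosen task receives the tuple."""
    return [rng.choice(task_ids)]


def all_grouping(task_ids: Sequence[int]) -> List[int]:
    """Every task receives its own copy of the tuple."""
    return list(task_ids)


def global_grouping(task_ids: Sequence[int]) -> List[int]:
    """Everything goes to a single task: the one with the lowest ID."""
    return [min(task_ids)]


def direct_grouping(task_ids: Sequence[int], chosen: int) -> List[int]:
    """The emitting side names the receiving task explicitly."""
    if chosen not in task_ids:
        raise ValueError("unknown task id: %r" % chosen)
    return [chosen]


tasks = [4, 2, 7]
rng = random.Random(0)  # seeded so repeated runs behave the same
shuffled = shuffle_grouping(tasks, rng)
```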
- 16. Page 16 © Hortonworks Inc. 2014
Putting it all Together
1. Create a TopologyBuilder
2. Create a Kafka/JMS Spout and add it to the Topology
3. Create the Bolts and add them to the Topology
4. Create and submit the Topology to Storm
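The four steps can be walked through with a toy, in-process stand-in written in plain Python. `MiniTopologyBuilder` and its method names are hypothetical and are not Storm's `TopologyBuilder` API; "submitting" here just means running everything in one process, with lists and dicts standing in for Kafka, HDFS and HBase.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class MiniTopologyBuilder:
    """Toy stand-in for a topology builder; not Storm's API."""

    def __init__(self) -> None:
        self.spouts: Dict[str, Callable[[], Iterable[dict]]] = {}
        self.bolts: Dict[str, Tuple[Callable[[dict], None], str]] = {}

    def set_spout(self, name: str, source: Callable[[], Iterable[dict]]) -> None:
        self.spouts[name] = source

    def set_bolt(self, name: str, process: Callable[[dict], None], subscribes_to: str) -> None:
        self.bolts[name] = (process, subscribes_to)

    def run_local(self) -> None:
        """'Submit' by pushing every spout tuple to each bolt subscribed to that spout."""
        for spout_name, source in self.spouts.items():
            for event in source():
                for process, upstream in self.bolts.values():
                    if upstream == spout_name:
                        process(event)


hdfs_log: List[dict] = []          # stands in for files written to HDFS
hbase_counts: Dict[str, int] = {}  # stands in for counters kept in HBase


def hdfs_bolt(event: dict) -> None:
    hdfs_log.append(event)


def hbase_bolt(event: dict) -> None:
    driver = event["driver_id"]
    hbase_counts[driver] = hbase_counts.get(driver, 0) + 1


builder = MiniTopologyBuilder()                                    # step 1
builder.set_spout("kafka-spout", lambda: [{"driver_id": "d1"},
                                          {"driver_id": "d2"},
                                          {"driver_id": "d1"}])    # step 2
builder.set_bolt("hdfs-bolt", hdfs_bolt, "kafka-spout")            # step 3
builder.set_bolt("hbase-bolt", hbase_bolt, "kafka-spout")
builder.run_local()                                                # step 4
```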
- 17. Page 17 © Hortonworks Inc. 2014
Storm Components: Topology Submission
[Figure: topology submission. The truck-event-processor topology is submitted to Nimbus (YARN App Master node). Also shown: three ZooKeeper nodes and five Supervisors (slave nodes) running five Kafka Spout, three HDFS Bolt, two HBase Bolt and two Monitor Bolt instances.]
Nimbus (Management server)
• Similar to job tracker
• Distributes code around cluster
• Assigns tasks
• Handles failures
Supervisor (Worker nodes)
• Similar to task tracker
• Run bolts and spouts as ‘tasks’
Zookeeper:
• Cluster co-ordination
• Nimbus HA
• Stores cluster metrics
- 18. Page 18 © Hortonworks Inc. 2014
Parallelism in a Storm Topology
• Supervisor
– The supervisor listens for work assigned to its machine and starts and stops worker processes as
necessary based on what Nimbus has assigned to it.
• Worker (JVM Process)
– Is a JVM process that executes a subset of a topology
– Belongs to a specific topology and may run one or more executors (threads) for one or more
components (spouts or bolts) of this topology.
• Executors (Thread in JVM Process)
– Is a thread that is spawned by a worker process and runs within the worker’s JVM
– Runs one or more tasks for the same component (spout or bolt)
• Tasks
– Perform the actual data processing; each task runs within its parent executor’s thread of execution
- 19. Page 19 © Hortonworks Inc. 2014
Parallelism Example
• Example Configuration in Use Case:
– Supervisors = 4
– Workers = 8
– spout.count = 5
– bolt.count = 16
• How many workers (JVM processes) per supervisor?
– 8/4 = 2
• How many total executors (threads)?
– 16 + 5 = 21
• How many executors (threads) per worker?
– 21/8 = 2 to 3 (21 does not divide evenly by 8, so some workers run 2 and others 3)
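The arithmetic in this example can be checked with a few lines of Python. The only assumption added here is that executors are spread as evenly as possible across workers, which is what the "2 to 3" answer implies.

```python
supervisors = 4
workers = 8
spout_count = 5
bolt_count = 16

# Workers (JVM processes) per supervisor.
workers_per_supervisor = workers // supervisors

# Total executors (threads) across the topology.
total_executors = spout_count + bolt_count

# Spread executors as evenly as possible: 'extra' workers get one more than 'base'.
base, extra = divmod(total_executors, workers)
executors_per_worker = [base + 1] * extra + [base] * (workers - extra)

print(workers_per_supervisor, total_executors, executors_per_worker)
```

With these numbers, five workers run three executors each and three workers run two each, which accounts for all 21.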
- 20. Page 20 © Hortonworks Inc. 2014
Processing guarantees in Storm
Processing guarantee | How is it achieved? | Applicable use cases
At least once | Replay tuples on failure | Processing does not need to be ordered; need extremely low-latency processing
Exactly once | Transactional topologies (now implemented using Trident) | Need ordered processing; global counts; context-aware processing; causality-based; latency not important
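Replay on failure is why at-least-once processing can apply the same tuple twice. The plain-Python sketch below shows this, together with one common mitigation (remembering message IDs that were already applied). It illustrates the general idea only and does not describe how Trident implements exactly-once semantics; all names and values are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple


def deliver_at_least_once(messages: Iterable[Tuple[int, int]],
                          handler: Callable[[int, int], None],
                          max_attempts: int = 3) -> None:
    """Replay each message until the handler succeeds or attempts run out."""
    for msg_id, payload in messages:
        for _ in range(max_attempts):
            try:
                handler(msg_id, payload)
                break      # success counts as an ack; move on to the next message
            except RuntimeError:
                continue   # failure: replay the same message


class FlakyCounter:
    """Sums payloads, but fails once right after applying message 2 (a lost ack)."""

    def __init__(self, dedupe: bool) -> None:
        self.total = 0
        self.seen: Set[int] = set()
        self.dedupe = dedupe
        self.failed_once = False

    def __call__(self, msg_id: int, payload: int) -> None:
        if self.dedupe and msg_id in self.seen:
            return  # already applied: ignore the replayed copy
        self.seen.add(msg_id)
        self.total += payload
        if msg_id == 2 and not self.failed_once:
            self.failed_once = True
            raise RuntimeError("ack lost after the update was applied")


messages = [(1, 10), (2, 20), (3, 30)]
naive = FlakyCounter(dedupe=False)
deduped = FlakyCounter(dedupe=True)
deliver_at_least_once(messages, naive)
deliver_at_least_once(messages, deduped)
```

The naive counter ends up counting message 2 twice, while the deduplicating one does not. This is the kind of difference that matters for the global counts listed in the table.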
- 21. Page 21 © Hortonworks Inc. 2014
Storm Use Cases
- 22. Page 22 © Hortonworks Inc. 2014
Who is using Storm today?
Source: Storm-project.net
- 23. Page 23 © Hortonworks Inc. 2014
Storm Use Cases – Business View
Monitor real-time data to:
Industry | Prevent | Optimize
Finance | Securities fraud; compliance violations | Order routing; pricing
Telco | Security breaches; network outages | Bandwidth allocation; customer service
Retail | Shrinkage; stock outs | Offers; pricing
Manufacturing | Device failures | Supply chain
Transportation | Driver & fleet issues | Routes; pricing
Web | Application failures; operational issues | Site content
Data types: Sentiment, Clickstream, Machine/Sensor, Logs, Geo-location
- 24. Page 24 © Hortonworks Inc. 2014
Storm Use Cases – IT operations view
Continuous processing
• Continuously ingest high-rate messages, process them and update data
stores
High speed data aggregation
• Aggregate multiple data streams that emit data at extremely high rates
into one central data store
Data filtering
• Filter out unwanted data on the fly before it is persisted to a data store
Distributed RPC response time reduction
• Extremely resource-intensive (CPU, memory or I/O) processing that would
take a long time on a single machine can be parallelized with
Storm to reduce response times to seconds
- 25. Page 25 © Hortonworks Inc. 2014
Q&A – Thank You!