SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
Page 1 © Hortonworks Inc. 2014
Apache Storm
Joseph Niemiec
Page 2 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Quick Bio
• Hadoop Dev for 3+ years
• Co-Author for Apache Hadoop YARN
• Originally used Hadoop for location based services
– Destination Prediction
– Traffic Analysis
– Effects of weather at client locations on call center call types
• Pending Patent in Automotive/Telematics domain
• Defensive Paper on M2M Validation
• Started on analytics to be better at an MMORPG
• HWX SME for YARN, Tez & MapReduce
Page 3 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Stream Processing vs. Batch
Page 3
Factors Stream Processing Batch Processing
Data
Freshness Real-time ( usually < 15 min) Historical – usually more than 15
min old
Location Primarily memory ( moved to
disk after processing)
Primarily in disk moved to memory
for processing
Processing
Speed Sub second to few seconds Few seconds to hours
Frequency Always running Sporadic to periodic
Clients
Who? Automated systems only Human & automated systems
Type Primarily operational
systems
Primarily analytical applications
Page 4 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Apache Storm – Stream Processing in Hadoop
• What is it?
– “Doing for real-time processing what MapReduce did for batch”
– Distributed Real-time computation engine.
• Driven by new types of data
– Sensor/Machine data, Logs,
– Geo-location, clickstream
• Enables new business use cases
– Low-latency dashboards
– Quality, Security, Safety, Operations Alerts
– Improved operations
– Real-time data integration
Page 4
Page 5 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Key Capabilities of Storm
Page 5
•  Extremely high ingest rates – millions of events/secondData Ingest
•  Ability to easily plug different processing frameworks
•  Guaranteed processing – atleast once processing semanticsProcessing
•  Ability to persist data to multiple relational and non relational data storesPersistence
•  Security, HA, fault tolerance & management supportOperations
Page 6 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Page 6
•  Horizontally scalable like Hadoop
•  Eg: 10 node cluster can process 1M tuples per second per nodeHighly scalable
•  Automatically reassigns tasks on failed nodesFault-tolerant
•  Supports at least once & exactly once processing semantics
Guarantees
processing
•  Processing logic can be defined in any language
Language
agnostic
•  Brand, governance & a large active communityApache project
Key Capabilities of Storm
Page 7 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Understanding Storm via a Real-World Use Case
• A large truck fleet company wants to in real-time capture events of
drivers in their trucks on the road across the US.
• Sensor devices on trucks captures all kinds of events varying from
vehicle diagnostics to driver infractions.
– E.g.: Excessive breaking/acceleration, speeding, start/stop, etc..
• Initial Business Requirement:
– Stream these events in, filter on violations and do real-time alerting on “lots” of erratic
behavior over a short period of time..
Page 7
Page 8 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
High Level Architecture
Truck Streaming Data
T(N)T(2)T(1)
Interactive Query
TEZ
Perform Ad Hoc
Queries on driver/truck
events and other
related data sources
Messaging Grid
(WMQ, ActiveMQ, Kafka)
truck
events
TOPIC
Stream Processing with Storm
Kafka Spout
HBase
Bolt
Monitoring
Bolt
HDFS
Bolt
High Speed Ingestion
Distributed Storage
HDFS
Write to
HDFS
Email
Alerts
ActiveMQ
Alert
Topic
Create Alerts
Real-time Serviing with
HBase
driver
dangerous
events
driver
dangerou
s events
count
Write to
HBase
Update Alert
Thresholds
Spring WebApp with SockJS
WebSockets
Real-Time
Streaming Driver
Monitoring App
Query driver
events in real-time
Consume alerts
in real-time
Batch Analytics
MR2
Do batch analysis/
models & update
HBase with right
thresholds for alerts
Page 9 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Key Constructs in Apache Storm
• Tuples
• Streams
• Spouts
• Bolts
• Topology
Page 9
Page 10 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Tuples and Streams
• What is a Tuple?
– Fundamental data structure in Storm. Is a named list of values that can be of any data
type.
Page 10
• What is a Stream?
– An unbounded sequences of tuples.
– Core abstraction in Storm and are what you “process” in Storm
Page 11 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Spouts
• What is a Spout?
– Generates or a source of Streams
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed
Page 11
Page 12 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts are functions, aggregations, joins, read/write to data stores,
alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1.  HBaseBolt: persisting and counting in Hbase
2.  HDFSBolt: persisting into HFDS as Avro Files using Flume
3.  MonitoringBolt: Read from Hbase and create alerts via email and a message to
ActiveMQ if the number of illegal driver incidents exceed a given threshhold.
Page 12
Page 13 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Topology
• What is a Topology?
– A network of spouts and bolts wired together into a workflow
Page 13
Truck-Event-Processor Topology
Kafka Spout
HBase
Bolt
Monitoring
Bolt
HDFS
Bolt
WebSocket
Bolt
Stream Stream
Stream
Stream
Page 14 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Fields Grouping
• Question: If multiple instances of single bolt can be created, when a
tuple is emitted, how do you control what instance the tuple goes to?
• Answer: Fields Grouping
• What is a Field grouping?
– Provides various ways to control tuple routing to bolts.
– Many field grouping exist include shuffle, fields, global..
• Example of Field grouping in Use Case
– Events for a given driver/truck must be sent to the same monitoring bolt instance.
Page 14
Page 15 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Different Fields Grouping
Page 15
Grouping type What it does When to use
Shuffle Grouping Sends tuple to a bolt in random round robin
sequence
- Doing atomic operations eg.math operations
Fields Grouping Sends tuples to a bolt based on or more field's in
the tuple
-  Segmentation of the incoming stream
-  Counting tuples of a certain type
All grouping Sends a single copy of each bolt to all instances
of a receiving bolt
-  Send some signal to all bolts like clear cache or
refresh state etc.
-  Send ticker tuple to signal bolts to save state etc.
Custom grouping Implement your own field grouping so tuples are
routed based on custom logic
-  Used to get max flexibility to change processing
sequence, logic etc. based on different factors like
data types, load, seasonality etc.
Direct grouping Source decides which bolt will receive tuple -  Depends
Global grouping Global Grouping sends tuples generated by all
instances of the source to a single target
instance (specifically, the task with lowest ID)
-  Global counts..
Page 16 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Putting it all Together
1. Create a TopologyBuilder
Page 16
2. Create a Kafka/JMS Spout and add it to Topology
3. Create the Bolts and add it to the Topology
4. Create and Submit the Topology to Storm
Page 17 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Storm Components: Topology Submission
Page 17
Submit
truck-event-processor
topology
Nimbus
(Yarn App Master Node)
Zookeeper ZookeeperZookeeper
Supervisor
(Slave Node)
Supervisor
(Slave Node)
Supervisor
(Slave Node)
Supervisor
(Slave Node)
Supervisor
(Slave Node)
Kafka
Spout
Kafka
Spout
Kafka
Spout
Kafka
Spout
Kafka
Spout
HDFS
Bolt
HDFS
Bolt
HDFS
Bolt
HBase
Bolt
HBase
Bolt
Monitor
Bolt
Monitor
Bolt
Nimbus (Management server)
•  Similar to job tracker
•  Distributes code around cluster
•  Assigns tasks
•  Handles failures
Supervisor (Worker nodes)
•  Similar to task tracker
•  Run bolts and spouts as ‘tasks’
Zookeeper:
•  Cluster co-ordination
•  Nimbus HA
•  Stores cluster metrics
Page 18 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Parallelism in a Storm Topology
• Supervisor
– The supervisor listens for work assigned to its machine and starts and stops worker processes as
necessary based on what Nimbus has assigned to it.
• Worker (JVM Process)
– Is a JVM process that executes a subset of a topology
– Belongs to a specific topology and may run one or more executors (threads) for one or more
components (spouts or bolts) of this topology.
• Executors (Thread in JVM Process)
– Is a thread that is spawned by a worker process and runs within the worker’s JVM
– Runs one more tasks for the same component (spout or bolt)
• Tasks
– Performs the actual data processing and is run within its parent executor’s thread of execution
Page 18
Page 19 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Parallelism Example
• Example Configuration in Use Case:
– Supervisors = 4
– Workers = 8
– spout.count = 5
– bolt.count = 16
• How many worker threads per supervisor?
– 8/4 = 2
• How many total executors(threads)?
– 16 + 5 = 21
• How many executors(threads) per worker?
– 21/8 = 2 to 3
Page 19
Page 20 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Processing guarantees in Storm
Page 20
Processing
guarantee
How is it achieved? Applicable use cases
Atleast once Replay tuples on failure -  Processing does not need to be ordered
-  Need extremely low latency processing
Exactly once Transactional topologies ( now
implemented using Trident)
-  Need ordered processing
-  Global counts
-  Context aware processing
-  Causality based
-  Latency not important
Page 21 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Storm Use Cases
Page 21
Page 22 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Who is using Storm today?
Page 22
Source: Storm-project.net
Page 23 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Storm Use Cases – Business View
23
Monitor real-time data
to..
Prevent Optimize
Finance
- Securities Fraud
- Compliance violations
- Order routing
- Pricing
Telco
- Security breaches
- Network Outages
- Bandwidth allocation
- Customer service
Retail
- Offers
- Pricing
Manufacturing
- Device failures - Supply chain
Transportation
- Driver & fleet issues - Routes
- Pricing
Web
- Application failures
- Operational issues
- Site content
Sentiment Clickstream Machine/Sensor Logs Geo-location
- Shrinkage
- Stock outs
Page 24 © Hortonworks Inc. 2014
© Hortonworks Inc. 2013
Storm Use Cases – IT operations view
Page 24
•  Continuously ingest high rate messages, process them and update data
stores
Continuous
processing
•  Aggregate multiple data streams that emit data at extremely high rates
into one central data store
High speed data
aggregation
•  Filter out unwanted data on the fly before it is persisted to a data storeData filtering
•  Extremely resource( CPU, mem or I/O) intensive processing that would
take long time to process on a single machine can be parallelized with
Storm to reduce response times to seconds
Distributed RPC
response time
reduction
Page 25 © Hortonworks Inc. 2014
Q&A – Thanks You!
Page 25

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...
 
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
AWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon KinesisAWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon Kinesis
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
 
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203...
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 
AWS Real-Time Event Processing
AWS Real-Time Event ProcessingAWS Real-Time Event Processing
AWS Real-Time Event Processing
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce
 
AWS re:Invent 2016: How Citus Enables Scalable PostgreSQL on AWS (DAT207)
AWS re:Invent 2016: How Citus Enables Scalable PostgreSQL on AWS (DAT207)AWS re:Invent 2016: How Citus Enables Scalable PostgreSQL on AWS (DAT207)
AWS re:Invent 2016: How Citus Enables Scalable PostgreSQL on AWS (DAT207)
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
 
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012
 

Similar a Mhug apache storm

Similar a Mhug apache storm (20)

Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Low Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in HadoopLow Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in Hadoop
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECTFlow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
Flow Tuning: Mule 3 vs. Mule 4 - MuleSoft Chicago CONNECT
 
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptx
 
From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...
 

Último

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 

Último (20)

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

Mhug apache storm

  • 1. Page 1 © Hortonworks Inc. 2014 Apache Storm Joseph Niemiec
  • 2. Page 2 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Quick Bio • Hadoop Dev for 3+ years • Co-Author for Apache Hadoop YARN • Originally used Hadoop for location based services – Destination Prediction – Traffic Analysis – Effects of weather at client locations on call center call types • Pending Patent in Automotive/Telematics domain • Defensive Paper on M2M Validation • Started on analytics to be better at an MMORPG • HWX SME for YARN, Tez & MapReduce
  • 3. Page 3 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Stream Processing vs. Batch Page 3 Factors Stream Processing Batch Processing Data Freshness Real-time ( usually < 15 min) Historical – usually more than 15 min old Location Primarily memory ( moved to disk after processing) Primarily in disk moved to memory for processing Processing Speed Sub second to few seconds Few seconds to hours Frequency Always running Sporadic to periodic Clients Who? Automated systems only Human & automated systems Type Primarily operational systems Primarily analytical applications
  • 4. Page 4 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Apache Storm – Stream Processing in Hadoop • What is it? – “Doing for real-time processing what MapReduce did for batch” – Distributed Real-time computation engine. • Driven by new types of data – Sensor/Machine data, Logs, – Geo-location, clickstream • Enables new business use cases – Low-latency dashboards – Quality, Security, Safety, Operations Alerts – Improved operations – Real-time data integration Page 4
  • 5. Page 5 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Key Capabilities of Storm Page 5 •  Extremely high ingest rates – millions of events/secondData Ingest •  Ability to easily plug different processing frameworks •  Guaranteed processing – atleast once processing semanticsProcessing •  Ability to persist data to multiple relational and non relational data storesPersistence •  Security, HA, fault tolerance & management supportOperations
  • 6. Page 6 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Page 6 •  Horizontally scalable like Hadoop •  Eg: 10 node cluster can process 1M tuples per second per nodeHighly scalable •  Automatically reassigns tasks on failed nodesFault-tolerant •  Supports at least once & exactly once processing semantics Guarantees processing •  Processing logic can be defined in any language Language agnostic •  Brand, governance & a large active communityApache project Key Capabilities of Storm
  • 7. Page 7 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Understanding Storm via a Real-World Use Case • A large truck fleet company wants to in real-time capture events of drivers in their trucks on the road across the US. • Sensor devices on trucks captures all kinds of events varying from vehicle diagnostics to driver infractions. – E.g.: Excessive breaking/acceleration, speeding, start/stop, etc.. • Initial Business Requirement: – Stream these events in, filter on violations and do real-time alerting on “lots” of erratic behavior over a short period of time.. Page 7
  • 8. Page 8 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 High Level Architecture Truck Streaming Data T(N)T(2)T(1) Interactive Query TEZ Perform Ad Hoc Queries on driver/truck events and other related data sources Messaging Grid (WMQ, ActiveMQ, Kafka) truck events TOPIC Stream Processing with Storm Kafka Spout HBase Bolt Monitoring Bolt HDFS Bolt High Speed Ingestion Distributed Storage HDFS Write to HDFS Email Alerts ActiveMQ Alert Topic Create Alerts Real-time Serviing with HBase driver dangerous events driver dangerou s events count Write to HBase Update Alert Thresholds Spring WebApp with SockJS WebSockets Real-Time Streaming Driver Monitoring App Query driver events in real-time Consume alerts in real-time Batch Analytics MR2 Do batch analysis/ models & update HBase with right thresholds for alerts
  • 9. Page 9 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Key Constructs in Apache Storm • Tuples • Streams • Spouts • Bolts • Topology Page 9
  • 10. Page 10 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Tuples and Streams • What is a Tuple? – Fundamental data structure in Storm. Is a named list of values that can be of any data type. Page 10 • What is a Stream? – An unbounded sequences of tuples. – Core abstraction in Storm and are what you “process” in Storm
  • 11. Page 11 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Spouts • What is a Spout? – Generates or a source of Streams – E.g.: JMS, Twitter, Log, Kafka Spout – Can spin up multiple instances of a Spout and dynamically adjust as needed Page 11
  • 12. Page 12 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Bolts • What is a Bolt? – Processes any number of input streams and produces output streams – Common processing in bolts are functions, aggregations, joins, read/write to data stores, alerting logic – Can spin up multiple instances of a Bolt and dynamically adjust as needed • Bolts used in the Use Case: 1.  HBaseBolt: persisting and counting in Hbase 2.  HDFSBolt: persisting into HFDS as Avro Files using Flume 3.  MonitoringBolt: Read from Hbase and create alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceed a given threshhold. Page 12
  • 13. Page 13 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Topology • What is a Topology? – A network of spouts and bolts wired together into a workflow Page 13 Truck-Event-Processor Topology Kafka Spout HBase Bolt Monitoring Bolt HDFS Bolt WebSocket Bolt Stream Stream Stream Stream
  • 14. Page 14 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Fields Grouping • Question: If multiple instances of single bolt can be created, when a tuple is emitted, how do you control what instance the tuple goes to? • Answer: Fields Grouping • What is a Field grouping? – Provides various ways to control tuple routing to bolts. – Many field grouping exist include shuffle, fields, global.. • Example of Field grouping in Use Case – Events for a given driver/truck must be sent to the same monitoring bolt instance. Page 14
  • 15. Page 15 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Different Fields Grouping Page 15 Grouping type What it does When to use Shuffle Grouping Sends tuple to a bolt in random round robin sequence - Doing atomic operations eg.math operations Fields Grouping Sends tuples to a bolt based on or more field's in the tuple -  Segmentation of the incoming stream -  Counting tuples of a certain type All grouping Sends a single copy of each bolt to all instances of a receiving bolt -  Send some signal to all bolts like clear cache or refresh state etc. -  Send ticker tuple to signal bolts to save state etc. Custom grouping Implement your own field grouping so tuples are routed based on custom logic -  Used to get max flexibility to change processing sequence, logic etc. based on different factors like data types, load, seasonality etc. Direct grouping Source decides which bolt will receive tuple -  Depends Global grouping Global Grouping sends tuples generated by all instances of the source to a single target instance (specifically, the task with lowest ID) -  Global counts..
  • 16. Page 16 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Putting it all Together 1. Create a TopologyBuilder Page 16 2. Create a Kafka/JMS Spout and add it to Topology 3. Create the Bolts and add it to the Topology 4. Create and Submit the Topology to Storm
  • 17. Page 17 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Storm Components: Topology Submission Page 17 Submit truck-event-processor topology Nimbus (Yarn App Master Node) Zookeeper ZookeeperZookeeper Supervisor (Slave Node) Supervisor (Slave Node) Supervisor (Slave Node) Supervisor (Slave Node) Supervisor (Slave Node) Kafka Spout Kafka Spout Kafka Spout Kafka Spout Kafka Spout HDFS Bolt HDFS Bolt HDFS Bolt HBase Bolt HBase Bolt Monitor Bolt Monitor Bolt Nimbus (Management server) •  Similar to job tracker •  Distributes code around cluster •  Assigns tasks •  Handles failures Supervisor (Worker nodes) •  Similar to task tracker •  Run bolts and spouts as ‘tasks’ Zookeeper: •  Cluster co-ordination •  Nimbus HA •  Stores cluster metrics
  • 18. Page 18 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Parallelism in a Storm Topology • Supervisor – The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. • Worker (JVM Process) – Is a JVM process that executes a subset of a topology – Belongs to a specific topology and may run one or more executors (threads) for one or more components (spouts or bolts) of this topology. • Executors (Thread in JVM Process) – Is a thread that is spawned by a worker process and runs within the worker’s JVM – Runs one more tasks for the same component (spout or bolt) • Tasks – Performs the actual data processing and is run within its parent executor’s thread of execution Page 18
  • 19. Page 19 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Parallelism Example • Example Configuration in Use Case: – Supervisors = 4 – Workers = 8 – spout.count = 5 – bolt.count = 16 • How many worker threads per supervisor? – 8/4 = 2 • How many total executors(threads)? – 16 + 5 = 21 • How many executors(threads) per worker? – 21/8 = 2 to 3 Page 19
  • 20. Page 20 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Processing guarantees in Storm Page 20 Processing guarantee How is it achieved? Applicable use cases Atleast once Replay tuples on failure -  Processing does not need to be ordered -  Need extremely low latency processing Exactly once Transactional topologies ( now implemented using Trident) -  Need ordered processing -  Global counts -  Context aware processing -  Causality based -  Latency not important
  • 21. Page 21 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Storm Use Cases Page 21
  • 22. Page 22 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Who is using Storm today? Page 22 Source: Storm-project.net
  • 23. Page 23 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Storm Use Cases – Business View 23 Monitor real-time data to.. Prevent Optimize Finance - Securities Fraud - Compliance violations - Order routing - Pricing Telco - Security breaches - Network Outages - Bandwidth allocation - Customer service Retail - Offers - Pricing Manufacturing - Device failures - Supply chain Transportation - Driver & fleet issues - Routes - Pricing Web - Application failures - Operational issues - Site content Sentiment Clickstream Machine/Sensor Logs Geo-location - Shrinkage - Stock outs
  • 24. Page 24 © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Storm Use Cases – IT operations view Page 24 •  Continuously ingest high rate messages, process them and update data stores Continuous processing •  Aggregate multiple data streams that emit data at extremely high rates into one central data store High speed data aggregation •  Filter out unwanted data on the fly before it is persisted to a data storeData filtering •  Extremely resource( CPU, mem or I/O) intensive processing that would take long time to process on a single machine can be parallelized with Storm to reduce response times to seconds Distributed RPC response time reduction
  • 25. Page 25 © Hortonworks Inc. 2014 Q&A – Thanks You! Page 25