Storm - pipes and filters on steroids

Andre Sprenger

BigData Roundtable
Hamburg, 30 Nov 2011
My background
•   info@andresprenger.de

•   Studied Computer Science and Economics

•   Background: banking, ecommerce, online advertising

•   Freelancer

•   Java, Scala, Ruby, Rails

•   Hadoop, Pig, Hive, Cassandra
“Next click” problem
Raymie Stata (CTO, Yahoo):

“With the paths that go through Hadoop [at Yahoo!], the
latency is about fifteen minutes. … [I]t will never be true
real-time. It will never be what we call ‘next click,’ where
I click and by the time the page loads, the semantic
implication of my decision is reflected in the page.”
“Next click” problem
[Diagram: timeline of two consecutive HTTP request / response pairs handled by the web server, each with a max latency of 80 ms. Underneath, the real time layer collects data and then processes data over time. Labels: realtime response, near realtime response, (next) HTTP request.]
Example problems
•   Realtime statistics - counting, trends, moving average

•   Read Twitter stream and output images that are
    trending in the last 10 minutes

•   CTR calculation - read ad clicks/ad impressions and
    calculate new click through rate

•   ETL - transform format, filter duplicates / bot traffic,
    enrich from static data, persist

•   Search advertising
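
The CTR problem above can be sketched without any framework. This is a minimal illustration in plain Python, not Storm code: the class name `SlidingCtr`, the event kinds and the 10 minute window are assumptions made for the example.

```python
from collections import deque


class SlidingCtr:
    """Click-through rate over the most recent `window_seconds` of events."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, kind), oldest first
        self.counts = {"impression": 0, "click": 0}

    def add(self, timestamp, kind):
        """Record one event; `kind` is "impression" or "click"."""
        self.events.append((timestamp, kind))
        self.counts[kind] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that fell out of the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, kind = self.events.popleft()
            self.counts[kind] -= 1

    def ctr(self):
        """Clicks divided by impressions inside the window (0.0 if no impressions)."""
        impressions = self.counts["impression"]
        return self.counts["click"] / impressions if impressions else 0.0


ctr = SlidingCtr(window_seconds=600)  # assumed window: 10 minutes
for ts, kind in [(0, "impression"), (1, "impression"), (2, "click")]:
    ctr.add(ts, kind)
print(ctr.ctr())
```

Timestamps are passed in explicitly, so the same logic works for replayed data. Events are assumed to arrive roughly in time order.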
Pick your framework...
•   S4 - Yahoo, “real time map reduce”, actor model

•   Storm - Twitter

•   MapReduce Online - Yahoo

•   Cloud Map Reduce - Accenture

•   HStreaming - Startup, based on Hadoop

•   Brisk - DataStax, Cassandra
System requirements
•   Fault tolerance - system keeps running when a node
    fails

•   Horizontal scalability - should be easy, just add a
    node

•   Low latency

•   Reliable - does not lose data

•   High availability - well, if it’s down for an hour it’s not
    realtime
Storm in a nutshell
•   Written by BackType (acquired by Twitter)

•   Open Source, GitHub

•   Runs on JVM

•   Clojure, Python, ZooKeeper, ZeroMQ

•   Currently used by Twitter for real time statistics
Programming model
•   Tuple - name/value list

•   Stream - unbounded sequence of Tuples

•   Spout - source of Streams

•   Bolt - consumer / producer of Streams

•   Topology - network of Streams, Spouts and Bolts
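
To make the five terms concrete, here is a small in-memory sketch in plain Python. It does not use the Storm API. The function names (`number_spout`, `even_filter_bolt`, `square_bolt`) are hypothetical, a Tuple is modelled as a dict of name/value pairs, and a Stream as a generator.

```python
from itertools import count, islice


def number_spout():
    """Spout: source of an unbounded stream of tuples."""
    for n in count(1):
        yield {"number": n}


def even_filter_bolt(stream):
    """Bolt: consumes a stream and emits only tuples with an even number."""
    for tup in stream:
        if tup["number"] % 2 == 0:
            yield tup


def square_bolt(stream):
    """Bolt: consumes a stream and emits a new stream with an added field."""
    for tup in stream:
        yield {"number": tup["number"], "square": tup["number"] ** 2}


def topology():
    """Topology: spouts and bolts wired together by streams."""
    return square_bolt(even_filter_bolt(number_spout()))


# The stream is unbounded, so only take the first three tuples.
first_three = list(islice(topology(), 3))
print(first_three)
# [{'number': 2, 'square': 4}, {'number': 4, 'square': 16}, {'number': 6, 'square': 36}]
```

Generators keep the sketch lazy: nothing is computed until a consumer pulls tuples. This is a single-process analogy only, with no distribution or parallelism.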
Spout
[Diagram: two Spouts, each emitting a stream of tuples.]
Bolt
Processes streams and generates new streams.

[Diagram: a Bolt consuming two incoming streams of tuples and emitting one new stream of tuples.]
Bolt
•   filtering

•   transformation

•   split / aggregate streams

•   counting, statistics

•   read from / write to database
Topology
Network of Streams, Spouts and Bolts

[Diagram: two Spouts feeding a network of five Bolts.]
Task
Parallel processor inside Spouts and Bolts.
Each Spout / Bolt has a fixed number of Tasks.


[Diagram: a Spout containing three Tasks and a Bolt containing two Tasks.]
Stream grouping
Which Task does a Tuple go to?

•   shuffle grouping - distribute randomly

•   fields grouping - partition by field value

•   all grouping - send to all Tasks

•   custom grouping - implement your own logic
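
The groupings can be pictured as functions that map a tuple to the indices of the target Tasks. The sketch below is plain Python, not Storm's implementation. The function names and the use of CRC32 as the partitioning hash are assumptions for illustration.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field):
    """Fields grouping: the same field value always maps to the same task."""
    key = str(tup[field]).encode("utf-8")
    # CRC32 is stable across processes, unlike Python's salted str hash.
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping: every task receives a copy of the tuple."""
    return list(range(num_tasks))


def first_letter_grouping(tup, num_tasks):
    """Custom grouping (hypothetical): route by the first letter of the word."""
    return [ord(tup["word"][0]) % num_tasks]


tup = {"word": "storm"}
print(shuffle_grouping(tup, 4))
print(fields_grouping(tup, 4, "word"))
print(all_grouping(tup, 4))
print(first_letter_grouping(tup, 4))
```

Partitioning by field value is what makes counting per key possible: every tuple with the same word reaches the same Task, so that Task can keep the count in local memory.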
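Word count example

[Diagram: Spout -> Sentence Splitter Bolt -> Word Count Bolt. The Spout emits (“a b c a b d”). The Sentence Splitter Bolt emits (“a”), (“b”), (“c”), (“a”), (“b”), (“d”). The Word Count Bolt emits (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1).]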
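
The word count flow can be traced in a few lines of plain Python. This is a single-process sketch, not a Storm topology. The names `sentence_spout`, `split_sentence` and `WordCount` are hypothetical stand-ins for the Spout and the two Bolts.

```python
from collections import Counter


def sentence_spout():
    """Spout: emits sentence tuples (a single one here)."""
    yield ("a b c a b d",)


def split_sentence(tup):
    """Sentence Splitter Bolt: one sentence tuple in, one tuple per word out."""
    return [(word,) for word in tup[0].split()]


class WordCount:
    """Word Count Bolt: keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        word = tup[0]
        self.counts[word] += 1
        return (word, self.counts[word])


def run():
    counter = WordCount()
    emitted = []
    for sentence in sentence_spout():
        for word_tuple in split_sentence(sentence):
            emitted.append(counter.execute(word_tuple))
    return emitted, dict(counter.counts)


emitted, totals = run()
print(totals)  # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
```

With several Word Count Tasks, the stream between the two Bolts would need to be partitioned by word so that each word is always counted by the same Task.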
Guaranteed processing
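[Diagram: tuple tree. The Spout tuple (“a b c a b d”) branches into the word tuples (“a”), (“b”), (“c”), (“a”), (“b”), (“d”), which lead to the count tuples (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1).]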

Topology has a timeout for processing of the tuple tree
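
One way to picture the timeout on a tuple tree is a tracker that remembers, per Spout tuple, which descendants are still unprocessed. The sketch below is a simplified stand-in in plain Python and is not how Storm implements its bookkeeping. The class `TupleTreeTracker`, its method names and the 30 second timeout are assumptions.

```python
class TupleTreeTracker:
    """Tracks tuple trees; a tree is done when every tuple in it is acked."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}  # root_id -> (deadline, set of unacked tuple ids)

    def emit_root(self, root_id, now):
        """A Spout emitted a new tuple: start a tree with a deadline."""
        self.pending[root_id] = (now + self.timeout_seconds, {root_id})

    def emit_child(self, root_id, child_id):
        """A Bolt emitted a tuple derived from this tree."""
        self.pending[root_id][1].add(child_id)

    def ack(self, root_id, tuple_id):
        """Mark one tuple as processed; return True when the tree is complete."""
        _, unacked = self.pending[root_id]
        unacked.discard(tuple_id)
        if not unacked:
            del self.pending[root_id]
            return True
        return False

    def expired(self, now):
        """Return and forget the roots that missed their deadline (to be replayed)."""
        late = [r for r, (deadline, _) in self.pending.items() if now >= deadline]
        for r in late:
            del self.pending[r]
        return late


tracker = TupleTreeTracker(timeout_seconds=30)
tracker.emit_root("sentence-1", now=0)
words = ["w1", "w2", "w3", "w4", "w5", "w6"]
for w in words:
    tracker.emit_child("sentence-1", w)  # register children before acking the parent
tracker.ack("sentence-1", "sentence-1")
done = [tracker.ack("sentence-1", w) for w in words]
print(done[-1])  # True: the whole tree was processed in time
```

Children must be registered before their parent is acked, otherwise the tree would look complete too early. A root returned by `expired` counts as failed, and the source can emit it again.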
Runtime view
[Diagram not preserved in this transcript.]
Reliability
•   Nimbus / Supervisor are SPOF

•   both are stateless, easy to restart without data loss

•   Failure of master node (?)

•   Running Topologies should not be affected!

•   Failed Workers are restarted

•   Guaranteed message processing
Administration

•   Nimbus / Supervisor / ZooKeeper need monitoring
    and a process supervisor (e.g. Monit)

•   Cluster nodes can be added at runtime

•   But: existing Topologies are not rebalanced (there is a
    ticket)

•   Administration web GUI
Community
•   Source is on GitHub - https://github.com/nathanmarz/storm.git

•   Wiki - https://github.com/nathanmarz/storm/wiki

•   Nice documentation

•   Google Group

•   People are starting to build add-ons: JRuby integration,
    adapters for JMS, AMQP
Storm summary
•   Nice programming model

•   Easy to deploy new topologies

•   Horizontal scalability

•   Low latency

•   Fault tolerance

•   Easy to set up on EC2
Questions?
