STORM
COMPARISON - INTRODUCTION - CONCEPTS

Slide updated for STORM 0.8.2

PRESENTATION BY KASPER MADSEN
NOVEMBER - 2012
HADOOP VS STORM

     HADOOP                        STORM
     Batch processing              Real-time processing
     Jobs run to completion        Topologies run forever
     JobTracker is a SPOF*         No single point of failure
     Stateful nodes                Stateless nodes
     Scalable                      Scalable
     Guarantees no data loss       Guarantees no data loss
     Open source                   Open source

* Hadoop 0.21 added some checkpointing
 SPOF: Single Point Of Failure
COMPONENTS
     Nimbus daemon is the master. It is comparable to the Hadoop JobTracker.
     Supervisor daemon spawns workers. It is comparable to the Hadoop TaskTracker.
     Worker is spawned by a supervisor, one per port defined in the storm.yaml configuration.
     Executor is spawned by a worker and runs as a thread.
     Task is spawned by an executor and runs within that executor's thread.
     Zookeeper* is a distributed system, used to store metadata. The Nimbus and
     Supervisor daemons are fail-fast and stateless. All state is kept in Zookeeper.

     Notice that all communication between Nimbus and the Supervisors is done through Zookeeper.

     On a cluster with 2k+1 Zookeeper nodes, the system can recover when at most k nodes fail.

* Zookeeper is an Apache top-level project
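
The 2k+1 rule follows from majority voting: the ensemble stays available as long as more than half of its nodes are alive. A minimal sketch of that arithmetic (the class and method names are made up for illustration and are not part of Zookeeper or STORM):

```java
/** Illustrative only: majority-quorum arithmetic for a Zookeeper-style ensemble. */
final class QuorumMath {
    /** Smallest number of live nodes that still forms a strict majority. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Largest number of nodes that may fail while a majority survives. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 7; n++) {
            System.out.println(n + " nodes: quorum " + quorumSize(n)
                    + ", tolerated failures " + toleratedFailures(n));
        }
    }
}
```

With 2k+1 nodes the quorum is k+1, so k failures are tolerated. An even-sized ensemble tolerates no more failures than the next smaller odd-sized one, which is why odd sizes are the usual choice.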
EXECUTORS
Executor is a new abstraction
    •   Decouples the number of tasks of a component from the number of threads
    •   Allows the number of executors to be changed dynamically, without changing the number of tasks
    •   Makes elasticity much simpler, as semantics are kept valid (e.g. for a grouping)
    •   Enables elasticity in a multi-core environment
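
To illustrate why the number of executors can change while the number of tasks stays fixed, here is a small sketch that spreads a fixed set of task ids over a varying number of executor threads. The even, contiguous split is an assumption made for illustration and is not necessarily how STORM assigns tasks. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: spreads a fixed set of task ids over a number of executors. */
final class ExecutorAssignment {
    /** Splits task ids 0..numTasks-1 into numExecutors near-equal contiguous groups. */
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numExecutors > numTasks) {
            throw new IllegalArgumentException("need 1 <= numExecutors <= numTasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // every executor gets at least this many tasks
        int extra = numTasks % numExecutors;  // the first 'extra' executors get one more
        int next = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Same 4 tasks, first on 2 threads, then on 4 threads.
        System.out.println(assign(4, 2));
        System.out.println(assign(4, 4));
    }
}
```

Because groupings address tasks rather than threads, a grouping that maps a value to task 3 keeps doing so whether task 3 shares a thread with task 2 or runs on its own.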
STREAMS
Stream is an unbounded sequence of tuples.
Topology is a graph where each node is a spout or bolt, and the edges indicate
which bolts are subscribing to which streams.
•   A spout is a source of a stream
•   A bolt consumes a stream (and possibly emits a new one)
•   An edge represents a grouping

[Figure: example topology. Two spouts are the sources of streams A and B. One bolt subscribes to A and emits C, one bolt subscribes to A and emits D, one bolt subscribes to A & B, and one bolt subscribes to C & D.]
GROUPINGS
Each spout or bolt runs X instances in parallel (called tasks).
Groupings decide which task in the subscribing bolt a tuple is sent to.
Shuffle grouping     sends each tuple to a randomly chosen task
Fields grouping      groups by field value, such that equal values go to the same task
All grouping         replicates each tuple to all tasks
Global grouping      makes all tuples go to one task
None grouping        means the grouping does not matter; currently equivalent to shuffle grouping (STORM may later run such a bolt in the same thread as the bolt/spout it subscribes to)
Direct grouping      the producer (the task that emits) controls which consumer task receives the tuple

[Figure: example topology whose components run 4, 3, 2 and 2 tasks.]
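
The grouping rules above can be sketched as functions that pick a target task index. This is a simplified model, not STORM's implementation: the class name, the hash-modulo rule for fields grouping and the purely random choice for shuffle grouping are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative only: how a grouping could choose the receiving task(s). */
final class GroupingSketch {
    /** Shuffle grouping: any task, chosen at random. */
    static int shuffle(Random rnd, int numTasks) {
        return rnd.nextInt(numTasks);
    }

    /** Fields grouping: equal key values always map to the same task. */
    static int fields(List<?> keyValues, int numTasks) {
        return Math.floorMod(keyValues.hashCode(), numTasks);
    }

    /** All grouping: every task receives a copy of the tuple. */
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            targets.add(t);
        }
        return targets;
    }

    /** Global grouping: a single task (here the one with the lowest index) receives everything. */
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 3;
        System.out.println("shuffle -> task " + shuffle(new Random(), numTasks));
        System.out.println("fields(dog) -> task " + fields(Arrays.asList("dog"), numTasks));
        System.out.println("all -> tasks " + all(numTasks));
        System.out.println("global -> task " + global());
    }
}
```

Math.floorMod is used instead of % so that a negative hash code still yields a valid task index.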
EXAMPLE
TestWordSpout -> ExclamationBolt -> ExclamationBolt

     TopologyBuilder builder = new TopologyBuilder();

     // Create stream called "words", run 10 tasks
     builder.setSpout("words", new TestWordSpout(), 10);

     // Create stream called "exclaim1", run 3 tasks,
     // subscribe to stream "words" using shuffle grouping
     builder.setBolt("exclaim1", new ExclamationBolt(), 3)
            .shuffleGrouping("words");

     // Create stream called "exclaim2", run 2 tasks,
     // subscribe to stream "exclaim1" using shuffle grouping
     builder.setBolt("exclaim2", new ExclamationBolt(), 2)
            .shuffleGrouping("exclaim1");

A bolt can subscribe to an unlimited number of streams, by chaining groupings.

The source code for this example is part of the storm-starter project on GitHub.
EXAMPLE – 1
TestWordSpout
public void nextTuple() {
     Utils.sleep(100);
     final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
     final Random rand = new Random();
     final String word = words[rand.nextInt(words.length)];
     _collector.emit(new Values(word));
}



The TestWordSpout emits a random string from the array words every 100 milliseconds.
EXAMPLE – 2
ExclamationBolt

OutputCollector _collector;

// prepare is called when the bolt is created
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
     _collector = collector;
}

// execute is called for each tuple
public void execute(Tuple tuple) {
     _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
     _collector.ack(tuple);
}

// declareOutputFields is called when the bolt is created
public void declareOutputFields(OutputFieldsDeclarer declarer) {
     declarer.declare(new Fields("word"));
}

declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and to specify the stream to use when outputting tuples in the emit function call.
TRIDENT TOPOLOGY
  Trident topology is a new abstraction built on top of STORM primitives
  •   Supports
       • Joins
       • Aggregations
       • Grouping
       • Functions
       • Filters
  •   Easy to use, read the wiki
  •   Guarantees exactly-once processing if using an (opaque) transactional spout
        • Some basic ideas are the same as in the deprecated transactional topology*
        • Tuples are processed as small batches
        • Each batch gets a transaction id (txid); if a batch is replayed, the same txid is given
        • State updates are strongly ordered among batches
        • State updates atomically store meta-data with the data
  •   Transactional topology is superseded by the Trident topology from 0.8.0

* See my first slides (March 2012) on STORM for detailed information: www.slideshare.com/KasperMadsen
EXACTLY-ONCE-PROCESSING - 1
Transactional spouts guarantee that the same data is replayed for every batch.
Guaranteeing exactly-once-processing for transactional spouts
    • The txid is stored with the data, such that the last txid that updated the data is known
    • This information is used to know what to update in case of a replay
Example
     1. Currently processing txid: 2, with data ["man", "dog", "dog"]
     2. Current state is:
            "man" => [count=3, txid=1]
            "dog" => [count=2, txid=2]
     3. The batch with txid 2 fails and gets replayed.
     4. Resulting state is
            "man" => [count=4, txid=2]
            "dog" => [count=2, txid=2]
     5. Because the txid is stored with the data, it is known that the count for "dog" should
        not be increased again.
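
The example above can be replayed with a small in-memory sketch. The class TransactionalCounts and its methods are hypothetical and only model the idea of storing the txid next to the count. This is not the Trident state API.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: counts that remember the txid of the batch that last updated them. */
final class TransactionalCounts {
    private static final class Entry {
        long count;
        long txid;
        Entry(long count, long txid) {
            this.count = count;
            this.txid = txid;
        }
    }

    private final Map<String, Entry> state = new HashMap<>();

    /** Seeds the state, e.g. to reproduce a situation from the slides. */
    void put(String key, long count, long txid) {
        state.put(key, new Entry(count, txid));
    }

    /** Adds the batch's partial count for a key unless this txid was already applied. */
    void apply(String key, long delta, long txid) {
        Entry e = state.get(key);
        if (e == null) {
            state.put(key, new Entry(delta, txid));
        } else if (e.txid != txid) {
            e.count += delta;
            e.txid = txid;
        }
        // Otherwise the stored txid equals the batch txid: the update was
        // already applied before the failure, so the replay is skipped.
    }

    long count(String key) { return state.get(key).count; }

    long txid(String key) { return state.get(key).txid; }

    public static void main(String[] args) {
        TransactionalCounts counts = new TransactionalCounts();
        counts.put("man", 3, 1);   // not yet updated by txid 2
        counts.put("dog", 2, 2);   // already updated by txid 2 before the failure
        // Replay of batch txid 2 with data ["man", "dog", "dog"]
        counts.apply("man", 1, 2);
        counts.apply("dog", 2, 2);
        System.out.println("man => [count=" + counts.count("man") + ", txid=" + counts.txid("man") + "]");
        System.out.println("dog => [count=" + counts.count("dog") + ", txid=" + counts.txid("dog") + "]");
    }
}
```

The skip is only safe because a transactional spout replays exactly the same batch, so the partial count for "dog" in the replay equals the one that was already applied.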
EXACTLY-ONCE-PROCESSING - 2
An opaque transactional spout is not guaranteed to replay the same data for a failed
batch as originally existed in the batch.
    • Guarantees every tuple is successfully processed in exactly one batch
    • Useful for having exactly-once-processing while allowing some inputs to fail
Guaranteeing exactly-once-processing for opaque transactional spouts
    • The same trick doesn't work, as the replayed batch might have changed, meaning
      some state might now have stored incorrect data. Consider the previous example!
    • The problem is solved by storing more meta-data with the data (the previous value)
Example
Step   Data            Count (dog, cat)   prevValue (dog, cat)   Txid (dog, cat)   Note
1      2 dog, 1 cat    2, 1               0, 0                   1, 1
2      1 dog, 2 cat    3, 1               2, 1                   2, 1              Updates dog count, then fails
2.1    2 dog, 2 cat    4, 3               2, 1                   2, 2              Batch contains new data, but the update is ok as previous values are used

Consider the potential problems if the new data for 2.1 doesn't contain any cat.
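
The previous-value trick can be sketched as follows. The class OpaqueCounts is hypothetical and is not the Trident API. It assumes txids start at 1, so that 0 can mean "never updated".

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: counts that keep the previous value next to the current one. */
final class OpaqueCounts {
    static final class Entry {
        long value;      // current count
        long prevValue;  // count before the batch identified by txid was applied
        long txid;       // txid of the batch that last updated this entry (0 = never)
    }

    private final Map<String, Entry> state = new HashMap<>();

    /** Applies a batch's partial count for one key. */
    void apply(String key, long delta, long txid) {
        Entry e = state.computeIfAbsent(key, k -> new Entry());
        if (e.txid == txid) {
            // Replay of the same txid, possibly with different data:
            // discard the earlier attempt and rebuild from the previous value.
            e.value = e.prevValue + delta;
        } else {
            // First time this txid touches the entry: remember the old value.
            e.prevValue = e.value;
            e.value = e.value + delta;
            e.txid = txid;
        }
    }

    Entry get(String key) {
        return state.get(key);
    }

    public static void main(String[] args) {
        OpaqueCounts counts = new OpaqueCounts();
        // Step 1: txid 1 with 2 dog, 1 cat
        counts.apply("dog", 2, 1);
        counts.apply("cat", 1, 1);
        // Step 2: txid 2 with 1 dog, 2 cat; fails after the dog update
        counts.apply("dog", 1, 2);
        // Step 2.1: txid 2 is replayed with different data: 2 dog, 2 cat
        counts.apply("dog", 2, 2);
        counts.apply("cat", 2, 2);
        for (String key : new String[] {"dog", "cat"}) {
            Entry e = counts.get(key);
            System.out.println(key + " => [count=" + e.value
                    + ", prevValue=" + e.prevValue + ", txid=" + e.txid + "]");
        }
    }
}
```

Note that a key missing from the replayed batch is never touched by apply, so a partial update from the failed attempt would remain in place. This is the problem the last line of the slide asks the reader to consider.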
ELASTICITY
• Rebalancing workers and executors (not tasks)
   • Pause spouts
   • Wait for message timeout
   • Set new assignment
   • All moved tasks will be killed and restarted in new location
• Swapping (STORM 0.8.2)
    •   Submit inactive new topology
    •   Pause spouts of old topology
    •   Wait for message timeout of old topology
    •   Activate new topology
    •   Deactivate old topology
    •   Kill old topology

What about state on tasks which are killed and restarted?
It is up to the user to solve!
LEARN MORE
Website (http://storm-project.net/)
Wiki (https://github.com/nathanmarz/storm/wiki)
Storm-starter (https://github.com/nathanmarz/storm-starter)
Mailing list (http://groups.google.com/group/storm-user)
#storm-user room on freenode
UTSL: https://github.com/nathanmarz/storm
More slides: www.slideshare.net/KasperMadsen




Image from: http://www.cupofjoe.tv/2010/11/learn-lesson.html

STORM COMPARISON - INTRODUCTION TO CONCEPTS AND COMPONENTS

  • 1. Slide updated for STORM 0.8.2 STORM COMPARISON – INTRODUCTION - CONCEPTS PRESENTATION BY KASPER MADSEN NOVEMBER - 2012
  • 2. HADOOP VS STORM Batch processing Real-time processing Jobs runs to completion Topologies run forever JobTracker is SPOF* No single point of failure Stateful nodes Stateless nodes Scalable Scalable Guarantees no data loss Guarantees no data loss Open source Open source * Hadoop 0.21 added some checkpointing SPOF: Single Point Of Failure
  • 3. COMPONENTS Nimbus daemon is comparable to Hadoop JobTracker. It is the master Supervisor daemon spawns workers, it is comparable to Hadoop TaskTracker Worker is spawned by supervisor, one per port defined in storm.yaml configuration Executor is spawned by worker, run as a thread Task is spawned by executors, run as a thread Zookeeper* is a distributed system, used to store metadata. Nimbus and Supervisor daemons are fail-fast and stateless. All state is kept in Zookeeper. Notice all communication between Nimbus and Supervisors are done through Zookeeper On a cluster with 2k+1 zookeeper nodes, the system can recover when maximally k nodes fails. * Zookeeper is an Apache top-level project
  • 4. EXECUTORS Executor is a new abstraction • Disassociate tasks of a component to #threads • Allows dynamically changing #executors, without changing #tasks • Makes elasticity much simpler, as semantics are kept valid (e.g. for a grouping) • Enables elasticity in a multi-core environment
  • 5. STREAMS
      Stream is an unbounded sequence of tuples. Topology is a graph where each node is a spout or bolt, and the edges indicate which bolts are subscribing to which streams.
      • A spout is a source of a stream
      • A bolt consumes a stream (and possibly emits a new one)
      • An edge represents a grouping
      [Diagram labels: Source of stream A; Source of stream B; Subscribes: A, Emits: C; Subscribes: A, Emits: D; Subscribes: A & B; Subscribes: C & D]
  • 6. GROUPINGS
      Each spout or bolt runs X instances in parallel (called tasks). Groupings decide which task of the subscribing bolt a tuple is sent to.
      • Shuffle grouping: tuples are distributed randomly across the tasks
      • Fields grouping: tuples are grouped by field value, so equal values always go to the same task
      • All grouping: every tuple is replicated to all tasks
      • Global grouping: all tuples go to one task
      • None grouping: the bolt does not care how the stream is grouped; in current releases this behaves like shuffle grouping, with the aim of eventually running the bolt in the same thread as the bolt/spout it subscribes to
      • Direct grouping: the producer (the task that emits) controls which consumer task receives the tuple
      [Diagram: a topology whose components run 4, 3, 2 and 2 tasks]
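The task-selection rules above can be sketched in plain Java. This is a simplified model, not Storm's implementation: tasks are modelled as indices 0..n-1, the hash-modulo rule for fields grouping and the choice of index 0 for global grouping are assumptions for illustration, and `GroupingSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified model of how a grouping picks target tasks; not Storm's actual code.
final class GroupingSketch {

    /** Shuffle grouping: any task, chosen at random. */
    static int shuffle(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    /** Fields grouping: equal field values always map to the same task. */
    static int fields(Object fieldValue, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    /** All grouping: the tuple is replicated to every task. */
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    /** Global grouping: every tuple goes to one fixed task (index 0 in this model). */
    static int global() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println("fields(dog)  -> task " + fields("dog", numTasks));
        System.out.println("fields(dog)  -> task " + fields("dog", numTasks)); // same task again
        System.out.println("shuffle      -> task " + shuffle(new Random(), numTasks));
        System.out.println("all          -> tasks " + all(numTasks));
        System.out.println("global       -> task " + global());
    }
}
```

Direct grouping is not modelled here because the emitting task simply names the receiving task itself.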
  • 7. EXAMPLE (TestWordSpout -> ExclamationBolt -> ExclamationBolt)
        TopologyBuilder builder = new TopologyBuilder();
        // Create stream called "words"; run 10 tasks
        builder.setSpout("words", new TestWordSpout(), 10);
        // Create stream called "exclaim1"; run 3 tasks;
        // subscribe to stream "words", using shuffle grouping
        builder.setBolt("exclaim1", new ExclamationBolt(), 3)
               .shuffleGrouping("words");
        // Create stream called "exclaim2"; run 2 tasks;
        // subscribe to stream "exclaim1", using shuffle grouping
        builder.setBolt("exclaim2", new ExclamationBolt(), 2)
               .shuffleGrouping("exclaim1");
      A bolt can subscribe to an unlimited number of streams by chaining groupings.
      The source code for this example is part of the storm-starter project on GitHub.
  • 8. EXAMPLE 1: TestWordSpout
        public void nextTuple() {
            Utils.sleep(100);
            final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
            final Random rand = new Random();
            final String word = words[rand.nextInt(words.length)];
            _collector.emit(new Values(word));
        }
      TestWordSpout emits a random string from the array words every 100 milliseconds.
  • 9. EXAMPLE 2: ExclamationBolt
        OutputCollector _collector;
        // prepare is called when the bolt is created
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }
        // execute is called for each tuple
        public void execute(Tuple tuple) {
            _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
            _collector.ack(tuple);
        }
        // declareOutputFields is called when the bolt is created
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
      declareOutputFields is used to declare streams and their schemas. It is possible to declare several streams and to specify which stream to use when outputting tuples in the emit function call.
  • 10. TRIDENT TOPOLOGY
      Trident topology is a new abstraction built on top of STORM primitives.
      • Supports
          • Joins
          • Aggregations
          • Grouping
          • Functions
          • Filters
      • Easy to use, read the wiki
      • Guarantees exactly-once processing, if using an (opaque) transactional spout
      • Some basic ideas are the same as in the deprecated transactional topology*
          • Tuples are processed as small batches
          • Each batch gets a transaction id (txid); if a batch is replayed, the same txid is given
          • State updates are strongly ordered among batches
          • State updates atomically store metadata with the data
      • Transactional topology is superseded by the Trident topology from 0.8.0
      * See my first slide deck (March 2012) on STORM for detailed information: www.slideshare.com/KasperMadsen
  • 11. EXACTLY-ONCE PROCESSING - 1
      Transactional spouts guarantee that the same data is replayed for every batch.
      Guaranteeing exactly-once processing for transactional spouts:
      • The txid is stored with the data, so the last txid that updated the data is known
      • This information is used to decide what to update in case of a replay
      Example:
      1. Currently processing txid 2, with data ["man", "dog", "dog"]
      2. Current state is: "man" => [count=3, txid=1], "dog" => [count=2, txid=2]
      3. The batch with txid 2 fails and gets replayed.
      4. Resulting state is: "man" => [count=4, txid=2], "dog" => [count=2, txid=2]
      5. Because the txid is stored with the data, it is known that the count for "dog" should not be increased again.
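The replay rule from this slide can be sketched as a small in-memory word count. This is a hedged illustration, not Trident's state API: `TransactionalCounts` and its methods are hypothetical names, and the sketch assumes batches are applied in strict txid order with identical contents on replay.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory state; illustrates the txid check only.
final class TransactionalCounts {

    private static final class Stamped {
        long count;
        long txid; // last txid that updated this count
        Stamped(long count, long txid) { this.count = count; this.txid = txid; }
    }

    private final Map<String, Stamped> state = new HashMap<>();

    void put(String word, long count, long txid) { state.put(word, new Stamped(count, txid)); }
    long count(String word) { return state.get(word).count; }
    long txid(String word) { return state.get(word).txid; }

    /** Applies a batch; words already stamped with this txid are left untouched. */
    void applyBatch(long txid, List<String> words) {
        Map<String, Long> batchCounts = new HashMap<>();
        for (String word : words) {
            batchCounts.merge(word, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> e : batchCounts.entrySet()) {
            Stamped current = state.get(e.getKey());
            if (current == null) {
                state.put(e.getKey(), new Stamped(e.getValue(), txid));
            } else if (current.txid != txid) {
                current.count += e.getValue();
                current.txid = txid;
            }
            // else: this txid already updated the word, so the replay is skipped
        }
    }

    public static void main(String[] args) {
        TransactionalCounts counts = new TransactionalCounts();
        counts.put("man", 3, 1); // state from step 2 of the slide
        counts.put("dog", 2, 2);
        counts.applyBatch(2, Arrays.asList("man", "dog", "dog")); // replay of txid 2
        System.out.println("man=" + counts.count("man") + ", dog=" + counts.count("dog"));
    }
}
```

Run against the state from step 2, the replay raises "man" to 4 and leaves "dog" at 2, matching step 4 of the slide.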
  • 12. EXACTLY-ONCE PROCESSING - 2
      An opaque transactional spout is not guaranteed to replay the same data for a failed batch as originally existed in the batch.
      • Guarantees every tuple is successfully processed in exactly one batch
      • Useful for having exactly-once processing while allowing some inputs to fail
      Guaranteeing exactly-once processing for opaque transactional spouts:
      • The same trick doesn't work, as the replayed batch might have changed, meaning some state might now hold incorrect data. Consider the previous example!
      • The problem is solved by storing more metadata with the data (the previous value)
      Example (Count, prevValue and Txid are given as dog,cat pairs):
        Step   Data           Count   prevValue   Txid
        1      2 dog, 1 cat   2,1     0,0         1,1
        2      1 dog, 2 cat   3,1     2,1         2,1    (updates dog count, then fails)
        2.1    2 dog, 2 cat   4,3     2,1         2,2    (batch contains new data, but updates are ok as previous values are used)
      Consider the potential problems if the new data for 2.1 doesn't contain any cat.
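The previous-value trick can be sketched the same way. This is an illustration under stated assumptions rather than Trident's opaque state implementation: `OpaqueCounts` is a hypothetical class, txids are assumed to be non-negative and applied in order, and a batch is passed in as pre-aggregated per-word counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory state; illustrates the prevValue rule only.
final class OpaqueCounts {

    private static final class Value {
        long count;     // current value
        long prev;      // value before the batch identified by txid was applied
        long txid = -1; // -1 means "never updated" (assumes real txids are >= 0)
    }

    private final Map<String, Value> state = new HashMap<>();

    long count(String word) { return state.get(word).count; }
    long prev(String word) { return state.get(word).prev; }
    long txid(String word) { return state.get(word).txid; }

    /** Applies per-word increments for one batch. */
    void applyBatch(long txid, Map<String, Long> batchCounts) {
        for (Map.Entry<String, Long> e : batchCounts.entrySet()) {
            Value v = state.computeIfAbsent(e.getKey(), k -> new Value());
            if (v.txid == txid) {
                // Replay of the same txid: discard the earlier attempt and
                // recompute from the value that preceded this batch.
                v.count = v.prev + e.getValue();
            } else {
                // First attempt for this txid: remember the old value, then update.
                v.prev = v.count;
                v.count = v.count + e.getValue();
                v.txid = txid;
            }
        }
    }

    public static void main(String[] args) {
        OpaqueCounts counts = new OpaqueCounts();
        Map<String, Long> step1 = new HashMap<>();
        step1.put("dog", 2L);
        step1.put("cat", 1L);
        counts.applyBatch(1, step1);

        Map<String, Long> partial = new HashMap<>(); // step 2 fails after updating dog
        partial.put("dog", 1L);
        counts.applyBatch(2, partial);

        Map<String, Long> replay = new HashMap<>();  // step 2.1 carries different data
        replay.put("dog", 2L);
        replay.put("cat", 2L);
        counts.applyBatch(2, replay);

        System.out.println("dog=" + counts.count("dog") + ", cat=" + counts.count("cat"));
    }
}
```

After the replay the counts are dog=4 and cat=3 with previous values 2 and 1, which matches row 2.1 of the table. Note that this sketch does not touch a word that is missing from the replayed batch, which is the kind of case the slide's closing question asks the reader to consider.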
  • 13. ELASTICITY
      • Rebalancing workers and executors (not tasks)
          • Pause spouts
          • Wait for message timeout
          • Set new assignment
          • All moved tasks will be killed and restarted in their new location
      • Swapping (STORM 0.8.2)
          • Submit the new topology in inactive state
          • Pause spouts of the old topology
          • Wait for message timeout of the old topology
          • Activate the new topology
          • Deactivate the old topology
          • Kill the old topology
      What about state on tasks that are killed and restarted? It is up to the user to solve!
  • 14. LEARN MORE
      • Website: http://storm-project.net/
      • Wiki: https://github.com/nathanmarz/storm/wiki
      • Storm-starter: https://github.com/nathanmarz/storm-starter
      • Mailing list: http://groups.google.com/group/storm-user
      • #storm-user room on freenode
      • UTSL: https://github.com/nathanmarz/storm
      • More slides: www.slideshare.net/KasperMadsen
      from: http://www.cupofjoe.tv/2010/11/learn-lesson.html