    Large Scale Data Ingest Using Apache Flume
    Hari Shreedharan
    Software Engineer, Cloudera
    Apache Flume PMC member / committer
    February 2013




1
Why event streaming with Flume is awesome
    •   Couldn’t I just do this with a shell script?
          •   What year is this, 2001? There is a better way!
    • Scalable collection and aggregation of event data (e.g. logs)
    • Dynamic, contextual event routing
    • Low latency, high throughput
    • Declarative configuration
    • Productive out of the box, yet powerfully extensible
    • Open source software

2
Lessons learned from Flume OG
    • Hard to get predictable performance without decoupling tier
      impedance
    • Hard to scale-out without multiple threads at the sink level
    • A lot of functionality doesn’t work well as a decorator
    • People need a system that keeps the data flowing when there is
      a network partition (or downed host in the critical path)




3
Inside a Flume NG agent




4
Topology: Connecting agents together




             [Client]+  →  Agent  →  [Agent]*  →  Destination


5
Basic Concepts

    • Client
       • Log4j Appender
       • Client SDK
       • Clientless Operation
    • Agent
       • Source
       • Channel
       • Sink
    • Valid Configuration
       • Must have at least one Channel
       • Must have at least one Source or Sink
       • Any number of Sources
       • Any number of Channels
       • Any number of Sinks
6
Concepts in Action




    • Source: Puts events into the Channel
    • Sink: Drains events from the Channel
    • Channel: Stores events until they are drained

7
Flow Reliability

       [Diagram: events flow Agent → Agent, with each hop acknowledging
       success back upstream]

       Reliability based on:
            •   Transactional exchange between Agents
            •   Persistence characteristics of Channels in the flow

       Also available:
            •   Built-in load balancing support
            •   Built-in failover support
8
Reliability
    • Transactional guarantees from the channel
    • External clients need to handle retries
          • Built-in avro-client to read streams
          • Avro source for multi-hop flows
    •   Use the Flume Client SDK for customization




9
Configuration Tree




10
Hierarchical Namespace
     agent1.properties:

            # Active components
            agent1.sources = src1
            agent1.channels = ch1
            agent1.sinks = sink1

            # Define and configure src1
            agent1.sources.src1.type = netcat
            agent1.sources.src1.channels = ch1
            agent1.sources.src1.bind = 127.0.0.1
            agent1.sources.src1.port = 10112

            # Define and configure sink1
            agent1.sinks.sink1.type = logger
            agent1.sinks.sink1.channel = ch1

            # Define and configure ch1
            agent1.channels.ch1.type = memory


11
Basic Configuration Rules
       # Active components
       agent1.sources = src1
       agent1.channels = ch1
       agent1.sinks = sink1

       # Define and configure src1
       agent1.sources.src1.type = netcat
       agent1.sources.src1.channels = ch1
       agent1.sources.src1.bind = 127.0.0.1
       agent1.sources.src1.port = 10112

       # Define and configure sink1
       agent1.sinks.sink1.type = logger
       agent1.sinks.sink1.channel = ch1

       # Define and configure ch1
       agent1.channels.ch1.type = memory

       # Some other agent’s configuration
       agent2.sources = src1 src2

       • Only the named agent’s configuration is loaded
       • Only active components’ configuration is loaded within the agent’s
         configuration
       • Every Agent must have at least one channel
       • Every Source must have at least one channel
       • Every Sink must have exactly one channel
       • Every component must have a type




12
Deployment



                  Steady state: inflow == outflow

                  e.g. 4 Tier-1 agents at 100 events/sec each (batch size) feed
                  1 Tier-2 agent draining at 400 events/sec




13
Source
     • Event Driven
     • Supports Batch Processing
     • Source Types:
         •   AVRO – RPC source – other Flume agents can send data to this source port
         •   THRIFT – RPC source (available in next Flume release)
         •   SPOOLDIR – pick up rotated log files
         •   HTTP – post to a REST service (extensible)
         •   JMS – ingest from Java Message Service
         •   SYSLOGTCP, SYSLOGUDP
         •   NETCAT
         •   EXEC

14
How Does a Source Work?
     • Reads data from external clients or from the sinks of other agents
     • Stores events in the configured channel(s)
     • Asynchronous to the other end of the channel
     • Transactional semantics for storing data




15
[Diagram: the Source begins a transaction on the Channel, puts a batch of
Events, and commits the transaction; the batch becomes visible in the
Channel atomically on commit]

16
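The put-side transaction in the diagram above can be sketched in plain Python (illustrative only; this is not Flume's actual Channel API, and the class and method names here are invented for the sketch):

```python
# Illustrative sketch of transactional puts: begin a transaction,
# stage a batch of events, then commit -- the whole batch becomes
# visible atomically; a rollback discards the whole batch.
class Channel:
    def __init__(self):
        self.events = []          # committed events, visible to sinks
        self._pending = None      # events staged in the open transaction

    def begin(self):
        self._pending = []

    def put(self, event):
        self._pending.append(event)   # staged, not yet visible

    def commit(self):
        self.events.extend(self._pending)   # batch appears atomically
        self._pending = None

    def rollback(self):
        self._pending = None          # whole batch is discarded

ch = Channel()
ch.begin()
for e in ["e1", "e2", "e3"]:
    ch.put(e)
ch.commit()                # all three events become visible together
print(len(ch.events))      # -> 3

ch.begin()
ch.put("e4")
ch.rollback()              # e4 never becomes visible
print(len(ch.events))      # -> 3
```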
Source Features
     • Event driven or Pollable
     • Supports Batching
     • Fanout of flow
     • Interceptors




17
Fanout
     [Diagram: the Source hands events to the Channel Processor, which runs
     them through the Interceptor chain and then asks the Channel Selector
     where to route them; fanout processing puts each event on Channel1
     (Flow 1) and/or Channel2 (Flow 2), with transaction handling per channel]
18
Channel Selector
     •   Replicating selector
           •   Replicate events to all channels
     •   Multiplexing selector
           •   Contextual routing
     agent1.sources.sr1.selector.type = multiplexing
     agent1.sources.sr1.selector.mapping.foo = channel1
     agent1.sources.sr1.selector.mapping.bar = channel2
     agent1.sources.sr1.selector.default = channel1
     agent1.sources.sr1.selector.header = yourHeader

19
Built-in Sources in Flume
     •   Asynchronous sources
           • Clients don't handle failures
           • Exec, Syslog
     •   Synchronous sources
           • Clients handle failures
           • Avro, Scribe, HTTP, JMS
     •   Flume 0.9x Source
           •   AvroLegacy, ThriftLegacy

20
RPC Sources – Avro and Thrift
     • Reads events from external clients
     • TCP only
     • Connects two agents in a distributed flow
     • IPC-based, so the client is notified of delivery failures
     • Configuration

     agent_foo.sources.rpcsource-1.type = avro/thrift
     agent_foo.sources.rpcsource-1.bind = <host>
     agent_foo.sources.rpcsource-1.port = <port>

21
Spooling Directory Source
     • Parses rotated log files out of a “spool” directory
     • Watches for new files, renames or deletes them when done
     • The files must be immutable before being placed into the
       watched directory

     agent.sources.spool.type = spooldir
     agent.sources.spool.spoolDir = /var/log/spooled-files
     agent.sources.spool.deletePolicy = never OR immediate



22
HTTP Source
     • Runs a web server that handles HTTP requests
     • The handler is pluggable (can roll your own)
     • Out of the box, an HTTP client posts a JSON array of events to
       the server. Server parses the events and puts them on the
       channel.

     agent.sources.http.type = http
     agent.sources.http.port = 8081



23
HTTP Source, cont’d.
     •   Default handler supports events that look like this:
     [{
       "headers" : {
              "timestamp" : "434324343",
              "host" : "host1.example.com"
              },
       "body" : "arbitrary data in body string"
       },
       {
       "headers" : {
              "namenode" : "nn01.example.com",
              "datanode" : "dn102.example.com"
              },
       "body" : "some other arbitrary data in body string"
       }]
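A minimal client for the default JSON handler might look like this. The host name is a placeholder; the port matches the config on the previous slide, and the POST itself is left commented out since it needs a running agent:

```python
# Build the event array accepted by the HTTP source's default JSON
# handler and prepare an HTTP POST to the source's web server.
import json
import urllib.request

events = [
    {"headers": {"timestamp": "434324343", "host": "host1.example.com"},
     "body": "arbitrary data in body string"},
]
payload = json.dumps(events).encode("utf-8")

req = urllib.request.Request(
    "http://flume-agent.example.com:8081",   # placeholder agent host
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)   # uncomment when an agent is listening

# Round-trip check: the payload parses back to the same structure.
print(json.loads(payload)[0]["headers"]["host"])   # -> host1.example.com
```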



24
Exec Source
     •   Reads data from the output of a command
           •   Can be used for 'tail -F ...'
     •   Doesn't handle failures

     Configuration:
     agent_foo.sources.execSource.type = exec
     agent_foo.sources.execSource.command = tail -F /var/log/weblog.out




25
JMS Source
     •   Reads messages from a JMS queue or topic, converts them to Flume events,
         and puts those events onto the channel.
     •   Pluggable converter that by default converts Bytes, Text, and Object
         messages into Flume Events.
     •   So far, tested with ActiveMQ. We’d like to hear about experiences with any
         other JMS implementations.
     agent.sources.jms.type = jms
     agent.sources.jms.initialContextFactory =
        org.apache.activemq.jndi.ActiveMQInitialContextFactory
     agent.sources.jms.providerURL = tcp://mqserver:61616
     agent.sources.jms.destinationName = BUSINESS_DATA
     agent.sources.jms.destinationType = QUEUE



26
Interceptor
     • Applied to Source configuration element
     • One source can have many interceptors
     • Chain-of-responsibility
     • Can be used for tagging, filtering, routing*
     • Built-in interceptors:
         •   TIMESTAMP
         •   HOST
         •   STATIC
         •   REGEX EXTRACTOR

27
Writing a custom interceptor
     •   Configuration:

     # Declare interceptors
     agent1.sources.src1.interceptors = int1 int2 …
     # Define each interceptor

     agent1.sources.src1.interceptors.int1.type = <type>
     agent1.sources.src1.interceptors.int1.foo = bar

     •   Custom Interceptors:
     org.apache.flume.interceptor.Interceptor:
      void close()
      void initialize()
      Event intercept(Event)
      List<Event> intercept(List<Event> events)

     org.apache.flume.interceptor.Interceptor.Builder
      Interceptor build()
      void configure(Context)



28
Channel Selector
     •   Applied to a Source; at most one per source
     •   Not a named component
     •   Built-in Channel Selectors:
           • REPLICATING (Default)
           • MULTIPLEXING
     •   Multiplexing Channel Selector:
           •   Contextual Routing
           •   Must have a default set of channels
               agent1.sources.src1.selector.type = MULTIPLEXING
               agent1.sources.src1.selector.mapping.foo = ch1
               agent1.sources.src1.selector.mapping.bar = ch2
               agent1.sources.src1.selector.mapping.baz = ch1 ch2
               agent1.sources.src1.selector.default = ch5 ch6


29
Custom Channel Selector
     •   Configuration:
           agent1.sources.src1.selector.type = <type>
           agent1.sources.src1.selector.prop1 = value1
           agent1.sources.src1.selector.prop2 = value2


     •   Interface:
           org.apache.flume.ChannelSelector
            void setChannels(List<Channel>)
            List<Channel> getRequiredChannels(Event)
            List<Channel> getOptionalChannels(Event)
            List<Channel> getAllChannels()
            void configure(Context)


30
Channel
     • Passive Component
     • Determines the reliability of a flow
     • “Stock” channels that ship with Flume
         • FILE – provides durability; most people use this
         • MEMORY – lower latency for small writes, but not durable
         • JDBC – provides full ACID support, but has performance issues




31
File Channel
     • Write Ahead Log implementation
     • Configuration:
         agent1.channels.ch1.type = FILE
         agent1.channels.ch1.checkpointDir = <dir>
         agent1.channels.ch1.dataDirs = <dir1> <dir2>…
         agent1.channels.ch1.capacity = N (100k)
         agent1.channels.ch1.transactionCapacity = n
         agent1.channels.ch1.checkpointInterval = n (30000)
         agent1.channels.ch1.maxFileSize = N (1.52G)
         agent1.channels.ch1.write-timeout = n (10s)
         agent1.channels.ch1.checkpoint-timeout = n (600s)

32
File Channel
                    Flume Event Queue
                    • In memory representation of the
                       channel
                    • Maintains queue of pointers to
                       the data on disk in various log
                       files. Reference counts log files.
                     • Is memory-mapped to a checkpoint
                        file

                    Log Files
                    • On disk representation of actions
                       (Puts/Takes/Commits/Rollbacks)
                    • Maintains actual data
                    • Log files with 0 refs get deleted
33
Sink
     • Polling Semantics
     • Supports Batch Processing
     • Specialized Sinks
         •   HDFS                (Write to HDFS – highly configurable)
          •   HBASE, ASYNCHBASE   (Write to HBase)
         •   AVRO                (IPC Sink – Avro Source as IPC source at next hop)
         •   THRIFT              (IPC Sink – Thrift Source as IPC source at next hop)
         •   FILE_ROLL           (Local disk, roll files based on size, # of events etc)
         •   NULL, LOGGER        (For Testing Purposes)
         •   ElasticSearch
         •   IRC

34
HDFS Sink
     •   Writes events to HDFS (what!)
     •   Configuring (taken from Flume User Guide):
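The configuration screenshot did not survive extraction; a minimal sketch, using property names from the Flume User Guide (agent/sink names and values are illustrative):

```properties
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.batchSize = 100
agent1.sinks.hdfs1.hdfs.rollInterval = 30
```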




35
HDFS Sink
     •   Supports dynamic directory naming using tags
          • Use event headers: %{header}
              • E.g. hdfs://namenode/flume/%{header}
          • Use the timestamp from the event header
              • Several escape sequences are available
                   • E.g. hdfs://namenode/flume/%{header}/%Y-%m-%d/
              • Use roundValue and roundUnit to round down the timestamp and
                 write to separate directories per interval
          • Within a directory, files are rolled based on:
              • rollInterval – time elapsed since the file was opened
              • rollSize – max size of the file
              • rollCount – max # of events per file
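Putting the escape sequences, rounding, and roll settings together, a sketch (agent/sink names are illustrative; the `hdfs.round*` and `hdfs.roll*` keys are from the Flume User Guide):

```properties
# Bucket events into 10-minute directories, rolling files by time and size
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/%{header}/%Y-%m-%d
agent1.sinks.hdfs1.hdfs.round = true
agent1.sinks.hdfs1.hdfs.roundValue = 10
agent1.sinks.hdfs1.hdfs.roundUnit = minute
agent1.sinks.hdfs1.hdfs.rollInterval = 600
agent1.sinks.hdfs1.hdfs.rollSize = 134217728
agent1.sinks.hdfs1.hdfs.rollCount = 0
```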

36
AsyncHBase Sink

     • Inserts events and increments into HBase
     • Writes events asynchronously at a very high rate
     • Easy to configure:
         •   table
         •   columnFamily
         •   batchSize - # events per txn.
         •   timeout - how long to wait for success callback
         •   serializer/serializer.* - Custom serializer can decide how and where the events
             are written out.
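The options above in a configuration sketch (agent, sink, table, and column-family names are illustrative; the `asynchbase` type and property keys follow the Flume User Guide):

```properties
agent1.sinks.hbase1.type = asynchbase
agent1.sinks.hbase1.channel = ch1
agent1.sinks.hbase1.table = flume_events
agent1.sinks.hbase1.columnFamily = cf1
agent1.sinks.hbase1.batchSize = 100
agent1.sinks.hbase1.timeout = 60000
```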

37
IPC Sinks (Avro/Thrift)

     • Sends events to the next hop’s IPC Source
     • Configuring:
         •   hostname
         •   port
         •   batch-size - # events per txn/batch sent to next hop
         •   request-timeout – how long to wait for success of batch
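As a configuration sketch (agent/sink names and the collector host are illustrative; `batch-size` and `request-timeout` are the keys listed above):

```properties
agent1.sinks.avro1.type = avro
agent1.sinks.avro1.channel = ch1
agent1.sinks.avro1.hostname = collector1.example.com
agent1.sinks.avro1.port = 5564
agent1.sinks.avro1.batch-size = 100
agent1.sinks.avro1.request-timeout = 20000
```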



38
Serializers
     • Supported by the HDFS, HBase, and File_Roll sinks
     • Converts each event into a format of the user’s choice
     • In the case of HBase, converts an event into Puts and Increments




39
Sink Group
     • Top-level element, needed to declare sink processors
     • A sink can be in at most one group at any time
     • By default, each sink is in its own single-sink group
     • The default sink group is a pass-through
     • Deactivating a sink group does not deactivate its sinks!




40
Sink Processor
     • Acts as a Sink Proxy
     • Can work with multiple Sinks
     • Built-in Sink Processors:
           • DEFAULT
           • FAILOVER
           • LOAD_BALANCE

     •   Applied via Groups!
           •   A Top-Level Component
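A sink processor is applied by declaring a sink group. A sketch of a failover group (group and sink names are illustrative; the `sinkgroups` and `processor.*` keys follow the Flume User Guide, where a higher priority means the preferred sink):

```properties
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.sink1 = 10
agent1.sinkgroups.g1.processor.priority.sink2 = 5
```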


41
Application integration: Client SDK
     •   Factory:
            org.apache.flume.api.RpcClientFactory:
                 RpcClient getInstance(Properties)
            org.apache.flume.api.RpcClient:
                 void append(Event)
                 void appendBatch(List<Event>)
                 boolean isActive()
     •   Supports:
            •   Failover client
            •   Load balancing client with ROUND_ROBIN, RANDOM, and custom selectors.
     •   Avro
     •   Thrift

42
Clients: Embedded agent
     • More advanced RPC client. Integrates a channel.
     • Minimal example (a sketch; the properties map and event construction
       are filled in here for completeness, using Flume’s EventBuilder):
           Map<String, String> properties = new HashMap<String, String>();
           properties.put("channel.type", "memory");
           properties.put("channel.capacity", "200");
           properties.put("sinks", "sink1");
           properties.put("sink1.type", "avro");
           properties.put("sink1.hostname", "collector1.example.com");
           properties.put("sink1.port", "5564");
           EmbeddedAgent agent = new EmbeddedAgent("myagent");
           agent.configure(properties);
           agent.start();
           List<Event> events = new ArrayList<Event>();
           events.add(EventBuilder.withBody("hello".getBytes()));
           agent.putAll(events);
           agent.stop();


     •   See Flume Developer Guide for more details and examples.
43
General Caveats
     • Reliability = function of channel type, capacity, and system
       redundancy
     • Carefully size the channels for the needed capacity
     • Set batch sizes based on projected drain requirements
     • Provision roughly one core for every two sources and sinks in an
       agent
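Channel sizing can be estimated with back-of-the-envelope arithmetic: the channel must buffer everything that arrives while the downstream is unavailable, so capacity should cover inflow rate times the longest outage you want to ride out. A sketch (the function and the 1.5× headroom factor are illustrative, not from Flume):

```python
# Rough channel-capacity estimate: events held = inflow rate *
# longest tolerated downstream outage, plus a safety margin.
def required_capacity(inflow_eps, max_outage_seconds, headroom=1.5):
    """Events the channel must hold to ride out an outage."""
    return int(inflow_eps * max_outage_seconds * headroom)

# 400 events/sec (the Tier-2 agent from the Deployment slide)
# riding out a 10-minute downstream outage:
print(required_capacity(400, 600))   # -> 360000
```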




44
A common topology
  [Diagram: App tier → Flume Agent Tier 1 → Flume Agent Tier 2 → Storage tier.
  App-1, App-2, and App-3 each use the Flume SDK to send events to Tier-1
  agents (agent11, agent12, agent13, ...), each built as avro source → file
  channel → two avro sinks. The avro sinks load-balance, with failover,
  across the Tier-2 agents (agent21, agent22, ...), each built as avro
  source → file channel → hdfs sink writing to HDFS]
Summary
     •   Clients send Events to Agents
     •   Each agent hosts Flume components: Source, Interceptors, Channel
         Selectors, Channels, Sink Processors & Sinks
     •   Sources & Sinks are active components, Channels are passive
     •   Source accepts Events, passes them through Interceptor(s), and if not
         filtered, puts them on channel(s) selected by the configured Channel
         Selector
     •   The Sink Processor identifies a sink to invoke, which takes Events from a
         Channel and sends them to the next-hop destination
     •   Channel operations are transactional to guarantee one-hop delivery
         semantics
     •   Channel persistence provides end-to-end reliability

46
Reference docs (1.3.1 release)
     User Guide:
     flume.apache.org/FlumeUserGuide.html

     Dev Guide:
     flume.apache.org/FlumeDeveloperGuide.html




47
Blog posts
     •   Flume performance tuning
           https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
     •   Flume and Hbase
          https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
     •   File Channel Innards
          https://blogs.apache.org/flume/entry/apache_flume_filechannel
     •   Architecture of Flume NG
          https://blogs.apache.org/flume/entry/flume_ng_architecture



48
Contributing: How to get involved!
     •   Join the mailing lists:
           •   user-subscribe@flume.apache.org
           •   dev-subscribe@flume.apache.org
     •   Look at the code
           •   github.com/apache/flume – Mirror of the Apache Flume git repo
     •   File or fix a JIRA
           •   issues.apache.org/jira/browse/FLUME
     •   More on how to contribute:
           •   cwiki.apache.org/confluence/display/FLUME/How+to+Contribute

49
Questions?




50
     Thank you
     Reach out on the mailing lists!
     Follow me on Twitter: @harisr1234




51

YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014
 
Web Services Hadoop Summit 2012
Web Services Hadoop Summit 2012Web Services Hadoop Summit 2012
Web Services Hadoop Summit 2012
 
Avvo fkafka
Avvo fkafkaAvvo fkafka
Avvo fkafka
 
Azure Service Bus
Azure Service BusAzure Service Bus
Azure Service Bus
 
Introduction to Windows Azure Service Bus Relay Service
Introduction to Windows Azure Service Bus Relay ServiceIntroduction to Windows Azure Service Bus Relay Service
Introduction to Windows Azure Service Bus Relay Service
 

Similar a Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume

Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01joahp
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Data persistency (draco, cygnus, sth comet, quantum leap)
Data persistency (draco, cygnus, sth comet, quantum leap)Data persistency (draco, cygnus, sth comet, quantum leap)
Data persistency (draco, cygnus, sth comet, quantum leap)Fernando Lopez Aguilar
 
Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016Yuta Iwama
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacApache Apex
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Experiences with Microservices at Tuenti
Experiences with Microservices at TuentiExperiences with Microservices at Tuenti
Experiences with Microservices at TuentiAndrés Viedma Peláez
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetesrajdeep
 
Actors or Not: Async Event Architectures
Actors or Not: Async Event ArchitecturesActors or Not: Async Event Architectures
Actors or Not: Async Event ArchitecturesYaroslav Tkachenko
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkFabian Hueske
 
Superb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with SensuSuperb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with SensuPaul O'Connor
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Meetup on Apache Zookeeper
Meetup on Apache ZookeeperMeetup on Apache Zookeeper
Meetup on Apache ZookeeperAnshul Patel
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkKarthik Deivasigamani
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)KafkaZone
 
AdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic
 

Similar a Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume (20)

Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Apache Flume (NG)
Apache Flume (NG)Apache Flume (NG)
Apache Flume (NG)
 
Data persistency (draco, cygnus, sth comet, quantum leap)
Data persistency (draco, cygnus, sth comet, quantum leap)Data persistency (draco, cygnus, sth comet, quantum leap)
Data persistency (draco, cygnus, sth comet, quantum leap)
 
Flume basic
Flume basicFlume basic
Flume basic
 
Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016Treasure Data Summer Internship 2016
Treasure Data Summer Internship 2016
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Experiences with Microservices at Tuenti
Experiences with Microservices at TuentiExperiences with Microservices at Tuenti
Experiences with Microservices at Tuenti
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetes
 
Actors or Not: Async Event Architectures
Actors or Not: Async Event ArchitecturesActors or Not: Async Event Architectures
Actors or Not: Async Event Architectures
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Superb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with SensuSuperb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with Sensu
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Meetup on Apache Zookeeper
Meetup on Apache ZookeeperMeetup on Apache Zookeeper
Meetup on Apache Zookeeper
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache Flink
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
AdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic UltraESB
AdroitLogic UltraESB
 

Más de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Más de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Último

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 

Último (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 

Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume

  • 1. Large Scale Data Ingest Using Apache Flume. Hari Shreedharan, Software Engineer, Cloudera. Apache Flume PMC member / committer. February 2013
  • 2. Why event streaming with Flume is awesome • Couldn’t I just do this with a shell script? • What year is this, 2001? There is a better way! • Scalable collection and aggregation of event data (i.e., logs) • Dynamic, contextual event routing • Low latency, high throughput • Declarative configuration • Productive out of the box, yet powerfully extensible • Open source software
  • 3. Lessons learned from Flume OG • Hard to get predictable performance without decoupling tier impedance • Hard to scale out without multiple threads at the sink level • A lot of functionality doesn’t work well as a decorator • People need a system that keeps the data flowing when there is a network partition (or a downed host in the critical path)
  • 4. Inside a Flume NG agent (architecture diagram)
  • 5. Topology: Connecting agents together: [Client]+ → Agent [→ Agent]* → Destination
  • 6. Basic Concepts • Client: Log4j Appender, Client SDK, Clientless Operation • Agent: Source, Channel, Sink • Valid Configuration: must have at least one Channel; must have at least one Source or Sink; any number of Sources, Channels, and Sinks
  • 7. Concepts in Action • Source: Puts events into the Channel • Sink: Drains events from the Channel • Channel: Stores the events until drained 7
  • 8. Flow Reliability success Reliability based on: • Transactional Exchange between Agents • Persistence Characteristics of Channels in the Flow Also Available: • Built-in Load balancing Support • Built-in Failover Support 8
  • 9. Reliability • Transactional guarantees from the channel • External clients need to handle retries • Built-in avro-client to read streams • Avro source for multi-hop flows • Use the Flume Client SDK for customization 9
  • 11. Hierarchical Namespace agent1.properties: # Active components agent1.sources = src1 agent1.channels = ch1 agent1.sinks = sink1 # Define and configure src1 agent1.sources.src1.type = netcat agent1.sources.src1.channels = ch1 agent1.sources.src1.bind = 127.0.0.1 agent1.sources.src1.port = 10112 # Define and configure sink1 agent1.sinks.sink1.type = logger agent1.sinks.sink1.channel = ch1 # Define and configure ch1 agent1.channels.ch1.type = memory 11
  • 12. Basic Configuration Rules • Only the named agents’ configuration is loaded • Only active components’ configuration is loaded within the agents’ configuration • Every Agent must have at least one channel • Every Source must have at least one channel • Every Sink must have exactly one channel • Every component must have a type. Example: # Active components agent1.sources = src1 agent1.channels = ch1 agent1.sinks = sink1 # Define and configure src1 agent1.sources.src1.type = netcat agent1.sources.src1.channels = ch1 agent1.sources.src1.bind = 127.0.0.1 agent1.sources.src1.port = 10112 # Define and configure sink1 agent1.sinks.sink1.type = logger agent1.sinks.sink1.channel = ch1 # Define and configure ch1 agent1.channels.ch1.type = memory # Some other Agents’ configuration agent2.sources = src1 src2 12
  • 13. Deployment Steady state inflow == outflow 4 Tier 1 agents at 100 events/sec (batch-size) → 1 Tier 2 agent at 400 eps 13
  • 14. Source • Event Driven • Supports Batch Processing • Source Types: • AVRO – RPC source – other Flume agents can send data to this source port • THRIFT – RPC source (available in next Flume release) • SPOOLDIR – pick up rotated log files • HTTP – post to a REST service (extensible) • JMS – ingest from Java Message Service • SYSLOGTCP, SYSLOGUDP • NETCAT • EXEC 14
  • 15. How Does a Source Work? • Reads data from external clients/other sinks • Stores events in the configured channel(s) • Asynchronous to the other end of the channel • Transactional semantics for storing data 15
  • 16. [Diagram: a source transaction: Begin Txn, put a batch of Events into the Channel, Commit Txn]
  • 17. Source Features • Event driven or Pollable • Supports Batching • Fanout of flow • Interceptors 17
  • 18. Fanout [Diagram: Source → Channel Processor (interceptor and transaction handling) → Channel Selector → Channel1 (Flow 1) and Channel2 (Flow 2)] 18
  • 19. Channel Selector • Replicating selector • Replicate events to all channels • Multiplexing selector • Contextual routing agent1.sources.sr1.selector.type = multiplexing agent1.sources.sr1.selector.mapping.foo = channel1 agent1.sources.sr1.selector.mapping.bar = channel2 agent1.sources.sr1.selector.default = channel1 agent1.sources.sr1.selector.header = yourHeader 19
  • 20. Built-in Sources in Flume • Asynchronous sources • Client doesn’t handle failures • Exec, Syslog • Synchronous sources • Client handles failures • Avro, Scribe, HTTP, JMS • Flume 0.9x Sources • AvroLegacy, ThriftLegacy 20
  • 21. RPC Sources – Avro and Thrift • Read events from an external client • TCP only • Connect two agents in a distributed flow • Based on IPC, so failure notification is possible • Configuration agent_foo.sources.rpcsource-1.type = avro/thrift agent_foo.sources.rpcsource-1.bind = <host> agent_foo.sources.rpcsource-1.port = <port> 21
  • 22. Spooling Directory Source • Parses rotated log files out of a “spool” directory • Watches for new files, renames or deletes them when done • The files must be immutable before being placed into the watched directory agent.sources.spool.type = spooldir agent.sources.spool.spoolDir = /var/log/spooled-files agent.sources.spool.deletePolicy = never OR immediate 22
  • 23. HTTP Source • Runs a web server that handles HTTP requests • The handler is pluggable (can roll your own) • Out of the box, an HTTP client posts a JSON array of events to the server. Server parses the events and puts them on the channel. agent.sources.http.type = http agent.sources.http.port = 8081 23
  • 24. HTTP Source, cont’d. • Default handler supports events that look like this: [{ "headers" : { "timestamp" : "434324343", "host" : "host1.example.com" }, "body" : "arbitrary data in body string" }, { "headers" : { "namenode" : "nn01.example.com", "datanode" : "dn102.example.com" }, "body" : "some other arbitrary data in body string" }] 24
  • 25. Exec Source • Reads data from the output of a command • Can be used for ‘tail -F ..’ • Doesn’t handle failures. Configuration: agent_foo.sources.execSource.type = exec agent_foo.sources.execSource.command = 'tail -F /var/log/weblog.out' 25
  • 26. JMS Source • Reads messages from a JMS queue or topic, converts them to Flume events, and puts those events onto the channel. • Pluggable converter that by default converts Bytes, Text, and Object messages into Flume events. • So far, tested with ActiveMQ. We’d like to hear about experiences with any other JMS implementations. agent.sources.jms.type = jms agent.sources.jms.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory agent.sources.jms.providerURL = tcp://mqserver:61616 agent.sources.jms.destinationName = BUSINESS_DATA agent.sources.jms.destinationType = QUEUE 26
  • 27. Interceptor • Applied to Source configuration element • One source can have many interceptors • Chain-of-responsibility • Can be used for tagging, filtering, routing* • Built-in interceptors: • TIMESTAMP • HOST • STATIC • REGEX EXTRACTOR 27
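As a sketch, chaining several of the built-in interceptors on one source could look like the following; the interceptor names `ts`, `host`, and `extract` are arbitrary labels, and the regex and header name are illustrative:

```properties
# Events pass through the interceptors in the order they are declared
agent1.sources.src1.interceptors = ts host extract
# Stamp each event with the current time in the "timestamp" header
agent1.sources.src1.interceptors.ts.type = timestamp
# Record the agent's host in the "host" header
agent1.sources.src1.interceptors.host.type = host
# Pull a leading numeric id out of the body into a header named "id"
agent1.sources.src1.interceptors.extract.type = regex_extractor
agent1.sources.src1.interceptors.extract.regex = ^(\\d+)
agent1.sources.src1.interceptors.extract.serializers = s1
agent1.sources.src1.interceptors.extract.serializers.s1.name = id
```

Headers added this way can then drive contextual routing via a multiplexing channel selector or HDFS path escapes.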
  • 28. Writing a custom interceptor • Configuration: # Declare interceptors agent1.sources.src1.interceptors = int1 int2 … # Define each interceptor agent1.sources.src1.interceptors.int1.type = <type> agent1.sources.src1.interceptors.int1.foo = bar • Custom Interceptors: org.apache.flume.interceptor.Interceptor: void close() void initialize() Event intercept(Event) List<Event> intercept(List<Event> events) org.apache.flume.interceptor.Interceptor.Builder Interceptor build() void configure(Context) 28
  • 29. Channel Selector • Applied to Source, at most one. • Not a Named Component • Built-in Channel Selectors: • REPLICATING (Default) • MULTIPLEXING • Multiplexing Channel Selector: • Contextual Routing • Must have a default set of channels agent1.sources.src1.selector.type = MULTIPLEXING agent1.sources.src1.selector.mapping.foo = ch1 agent1.sources.src1.selector.mapping.bar = ch2 agent1.sources.src1.selector.mapping.baz = ch1 ch2 agent1.sources.src1.selector.default = ch5 ch6 29
  • 30. Custom Channel Selector • Configuration: agent1.sources.src1.selector.type = <type> agent1.sources.src1.selector.prop1 = value1 agent1.sources.src1.selector.prop2 = value2 • Interface: org.apache.flume.ChannelSelector void setChannels(List<Channel>) List<Channel> getRequiredChannels(Event) List<Channel> getOptionalChannels(Event) List<Channel> getAllChannels() void configure(Context) 30
  • 31. Channel • Passive Component • Determines the reliability of a flow • “Stock” channels that ship with Flume • FILE – provides durability; most people use this • MEMORY – lower latency for small writes, but not durable • JDBC – provides full ACID support, but has performance issues 31
  • 32. File Channel • Write Ahead Log implementation • Configuration: agent1.channels.ch1.type = FILE agent1.channels.ch1.checkpointDir = <dir> agent1.channels.ch1.dataDirs = <dir1> <dir2>… agent1.channels.ch1.capacity = N (100k) agent1.channels.ch1.transactionCapacity = n agent1.channels.ch1.checkpointInterval = n (30000) agent1.channels.ch1.maxFileSize = N (1.52G) agent1.channels.ch1.write-timeout = n (10s) agent1.channels.ch1.checkpoint-timeout = n (600s) 32
  • 33. File Channel Flume Event Queue • In memory representation of the channel • Maintains queue of pointers to the data on disk in various log files. Reference counts log files. • Is memory mapped to a check point file Log Files • On disk representation of actions (Puts/Takes/Commits/Rollbacks) • Maintains actual data • Log files with 0 refs get deleted 33
  • 34. Sink • Polling Semantics • Supports Batch Processing • Specialized Sinks • HDFS (Write to HDFS – highly configurable) • HBASE, ASYNCHBASE (Write to HBase) • AVRO (IPC Sink – Avro Source as IPC source at next hop) • THRIFT (IPC Sink – Thrift Source as IPC source at next hop) • FILE_ROLL (Local disk, roll files based on size, # of events etc) • NULL, LOGGER (For Testing Purposes) • ElasticSearch • IRC 34
  • 35. HDFS Sink • Writes events to HDFS (what!) • Configuring: [configuration table from the Flume User Guide not reproduced] 35
  • 36. HDFS Sink • Supports dynamic directory naming using tags • Use event headers : %{header} • Eg: hdfs://namenode/flume/%{header} • Use timestamp from the event header • Use various options to use this. • Eg: hdfs://namenode/flume/%{header}/%Y-%m-%D/ • Use roundValue and roundUnit to round down the timestamp to use separate directories. • Within a directory – files rolled based on: • rollInterval – time since last event was written • rollSize – max size of the file • rollCount – max # of events per file 36
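Putting the header/timestamp escapes and roll settings together, a hedged example of an HDFS sink configuration might look like this; the path, rounding, and roll thresholds are illustrative values, not recommendations:

```properties
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = ch1
# Directory name built from the "host" event header plus the event timestamp
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%{host}/%Y-%m-%d
# Round the timestamp down to 10-minute buckets for directory naming
agent1.sinks.hdfsSink.hdfs.round = true
agent1.sinks.hdfsSink.hdfs.roundValue = 10
agent1.sinks.hdfsSink.hdfs.roundUnit = minute
# Roll files every 5 minutes or at 128 MB, whichever comes first
agent1.sinks.hdfsSink.hdfs.rollInterval = 300
agent1.sinks.hdfsSink.hdfs.rollSize = 134217728
# 0 disables rolling by event count
agent1.sinks.hdfsSink.hdfs.rollCount = 0
```

Note that events must carry the referenced headers (here `host` and a `timestamp`), e.g. via the host and timestamp interceptors, or the path escapes cannot be resolved.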
  • 37. AsyncHBase Sink • Inserts events and increments into HBase • Writes events asynchronously at a very high rate. • Easy to configure: • table • columnFamily • batchSize - # events per txn. • timeout - how long to wait for the success callback • serializer/serializer.* - a custom serializer can decide how and where the events are written out. 37
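A sketch of an AsyncHBase sink configuration along these lines; the table name, column family, and batch size are illustrative, and the serializer shown is the simple one shipped with Flume:

```properties
agent1.sinks.hbaseSink.type = asynchbase
agent1.sinks.hbaseSink.channel = ch1
# Target table and column family must already exist in HBase
agent1.sinks.hbaseSink.table = events
agent1.sinks.hbaseSink.columnFamily = cf1
# Number of events written per transaction
agent1.sinks.hbaseSink.batchSize = 100
agent1.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
```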
  • 38. IPC Sinks (Avro/Thrift) • Sends events to the next hop’s IPC Source  • Configuring: • hostname • port • batch-size - # events per txn/batch sent to next hop • request-timeout – how long to wait for success of batch 38
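For example, an Avro IPC sink pointing at the next hop's Avro source might be configured like this; the hostname, port, batch size, and timeout are illustrative:

```properties
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.channel = ch1
# Must match the bind/port of the Avro source on the next hop
agent1.sinks.avroSink.hostname = next-hop.example.com
agent1.sinks.avroSink.port = 4141
# Events per transaction/batch sent downstream
agent1.sinks.avroSink.batch-size = 100
# How long to wait (ms) for the batch to be acknowledged
agent1.sinks.avroSink.request-timeout = 20000
```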
  • 39. Serializers • Supported by the HDFS, HBase and File_Roll sinks • Convert the event into a format of the user’s choice. • In the case of HBase, convert an event into Puts and Increments. 39
  • 40. Sink Group • Top-level element, needed to declare sink processors • A sink can be in at most one group at any time • By default all sinks are in their own default sink group • The default sink group is a pass-through • Deactivating a sink group does not deactivate the sink! 40
  • 41. Sink Processor • Acts as a Sink Proxy • Can work with multiple Sinks • Built-in Sink Processors: • DEFAULT • FAILOVER • LOAD_BALANCE • Applied via Groups! • A Top-Level Component 41
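A sketch of a sink group with a load-balancing processor, following the configuration style used elsewhere in the deck; the sink names and settings are illustrative:

```properties
# Sink groups are a top-level element, alongside sources/channels/sinks
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
# LOAD_BALANCE spreads the load; FAILOVER would prefer one sink and fail over
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
# Temporarily blacklist a sink that fails, instead of retrying it immediately
agent1.sinkgroups.g1.processor.backoff = true
```

Both sinks in the group still need their own full definitions (type, channel, hostname, and so on); the group only proxies which one is invoked.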
  • 42. Application integration: Client SDK • Factory: org.apache.flume.api.RpcClientFactory: RpcClient getInstance(Properties) org.apache.flume.api.RpcClient: void append(Event) void appendBatch(List<Event>) boolean isActive() • Supports: • Failover client • Load balancing client with ROUND_ROBIN, RANDOM, and custom selectors. • Avro • Thrift 42
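The Properties passed to RpcClientFactory might look like the following sketch for a load-balancing client; the hostnames and ports are illustrative:

```properties
# Load-balancing RPC client spreading appends over two collectors
client.type = default_loadbalance
hosts = h1 h2
hosts.h1 = collector1.example.com:4141
hosts.h2 = collector2.example.com:4141
# ROUND_ROBIN or RANDOM; a custom selector class can also be named here
host-selector = round_robin
```

Loading these into a java.util.Properties object and calling RpcClientFactory.getInstance(properties) yields a client whose append()/appendBatch() calls rotate across the listed hosts.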
  • 43. Clients: Embedded agent • More advanced RPC client. Integrates a channel. • Minimal example: properties.put("channel.type", "memory"); properties.put("channel.capacity", "200"); properties.put("sinks", "sink1"); properties.put("sink1.type", "avro"); properties.put("sink1.hostname", "collector1.example.com"); properties.put("sink1.port", "5564"); EmbeddedAgent agent = new EmbeddedAgent("myagent"); agent.configure(properties); agent.start(); List<Event> events = new ArrayList<Event>(); events.add(event); agent.putAll(events); agent.stop(); • See Flume Developer Guide for more details and examples. 43
  • 44. General Caveats • Reliability = function of channel type, capacity, and system redundancy • Carefully size the channels for needed capacity • Set batch sizes based on projected drain requirements • Number of cores should be ½ total # of sources & sinks combined in an agent 44
  • 45. A common topology [Diagram: App Tier (App-1/2/3 with the Flume SDK) → Flume Agent Tier 1 (agent11–13: avro src, file ch, avro sink) → Flume Agent Tier 2 (agent21–22: avro src, file ch, hdfs sink) → HDFS Storage Tier, with load balancing + failover between tiers]
  • 46. Summary • Clients send Events to Agents • Each agent hosts Flume components: Source, Interceptors, Channel Selectors, Channels, Sink Processors & Sinks • Sources & Sinks are active components, Channels are passive • A Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on the channel(s) selected by the configured Channel Selector • The Sink Processor identifies a sink to invoke, which can take Events from a Channel and send them to the next-hop destination • Channel operations are transactional to guarantee one-hop delivery semantics • Channel persistence provides end-to-end reliability 46
  • 47. Reference docs (1.3.1 release) User Guide: flume.apache.org/FlumeUserGuide.html Dev Guide: flume.apache.org/FlumeDeveloperGuide.html 47
  • 48. Blog posts • Flume performance tuning https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 • Flume and Hbase https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase • File Channel Innards https://blogs.apache.org/flume/entry/apache_flume_filechannel • Architecture of Flume NG https://blogs.apache.org/flume/entry/flume_ng_architecture 48
  • 49. Contributing: How to get involved! • Join the mailing lists: • user-subscribe@flume.apache.org • dev-subscribe@flume.apache.org • Look at the code • github.com/apache/flume – Mirror of the Apache Flume git repo • File or fix a JIRA • issues.apache.org/jira/browse/FLUME • More on how to contribute: • cwiki.apache.org/confluence/display/FLUME/How+to+Contribute 49
  • 51. Thank you. Reach out on the mailing lists! Follow me on Twitter: @harisr1234 51

Editor's notes

  1. If you have a server farm that emits log data in GB/min, you could hack together a very simple aggregator, but chances are it won't provide reliability, manageability, or scalability. This is why many use Flume: an out-of-the-box, open-source, high-performing, reliable, and scalable aggregator for streaming data. You don't want to risk outages, or failing scripts causing an overload on spindles. Flume is declarative in that you don't have to write code. Flume is extensible in that you can write your own components on top of Flume, which lets you modify its behavior and feature set. Flume has one-hop delivery; if you want end-to-end reliability, use the file channel, which we'll talk about later. There are no acknowledgements from the terminal destination to the client, because then the client would be forced to hold all events until the ack is received. You want these systems to occupy a small disk footprint. Set up redundant flows if you're concerned about hardware failures; Flume doesn't support splicing or RAID out of the box.
  2. With Flume NG, there is built-in buffering capacity at every hop, so data and events will be preserved. As regards single-hop reliability, the degree of reliability is based on the channel: the memory channel and recoverable memory channel are best-effort, whereas the file channel and JDBC channel are reliable because you write to disk. OG: a garden hose connected from faucet to sprinkler; contiguous flow, except when you pinch the hose in the middle. NG: the hose connects multiple water tanks (i.e. channels/passive buffers) from faucet to sprinkler; if you pinch the hose, the flow doesn't stop. 1. Decouples impedance between producers and consumers. 2. Dynamic routing capabilities (can shut down one tank to re-route traffic). 3. Unrestricted capacity (the consumer's input is no longer restricted by the producer's output, as one tank can feed into multiple downstream tanks).
  3. Flume flow. The simplest individual component is the agent; agents can talk to each other and to HDFS, HBase, etc. Clients talk to agents.
  4. Clientless operation: the agent loads data using specialized sources. An agent is a collection of sources, channels, and sinks. A source captures events from external systems; only the exec source can generate events on its own. A channel is a buffer between source and sink. A sink has the responsibility of draining the channel out to another agent or to a terminal point like HDFS. You can't have a source with no place to write events.
  5. In the upper diagram, the 3 agents' flow is healthy. In the lower diagram, the sink fails to communicate with the downstream source, so the reservoir fills up, and that filling cascades upstream, buffering against downstream hardware failures. But no events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active.
  6. What makes it active? Src2 is inactive because it's not in the active set. Define multiple sources for the same agent with space-separated lists. Fan out: a source writes to two channels. Multiple sinks can drain the same channel for increased throughput. A source can write to multiple channels. A channel is implemented as a queue: the source appends data to the end of the queue and the sink drains from the head. The config file is read at startup and checked for changes every 30 seconds; you don't have to restart agents if the config file changes. What use-case would need multiple sinks draining the same channel? Sources are multi-threaded and greedily implemented (for improved throughput); sinks are single-threaded and have a fixed capacity on what they can drain. There is an impedance mismatch between sources and sinks: sources will expand to accommodate load and bursty traffic so downstream won't be affected, while sinks will drain steadily. Add another sink to the same channel to meet the steady-state requirement.
  7. Four tier-1 agents drain into one tier-2 agent, which then distributes its load over two tier-3 agents. You can have a single config file for the 3 agents, pass it around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
  8. Avro is standard. Channels support transactions. Flume sources: avro, exec, syslog, spooling directory, http, embedded agent, JMS.
  9. Transactional semantics for storing data: if a sink takes data out, it will commit only if the source on the next hop has committed its data.
  10. Use-cases: you want the same data to go into HDFS and into HBase; priority-based routing; any contextual routing.
  11. JMS – client talks to broker, which handles failures
  12. On Avro, once the source commits the events on its channel via a put transaction, the source sends a success message to the previous hop, and the sink on the previous hop deletes these events once it commits the take transaction.
  13. Takes a command as a config parameter and executes it; whatever the command writes to stdout is written out to the channel as events. If the channel is full, data is dropped and lost. During file rotation, if an event fails, data is lost.
  14. An interceptor is a transparent component that gets applied to the flow and can do filtering and minor modification of events, but an interceptor can't multiply events; e.g. it can't do decompression of an event, because batching and compression are framework-level concerns that Flume should address. The overall number of events emitted by the interceptor cannot be more than the number of events that came in; you can drop but can't add events (which would go over the transaction capacity).
  15. An interceptor never returns null, because its output is passed to the next interceptor or the channel.
  16. The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and has better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bound by memory. JDBC is not recommended due to slow performance (don't mention deadlock).
  17. Recommended to use three disks: one disk for checkpointing and two disks for data. Keep-alive: wait 3 seconds for the blocks to free up; usually only used in high-stress environments.
  18. Three files: the checkpoint file (memory-mapped by the flume event queue), log1, and log2. Checkpoint file = FEQ. If you lose the FEQ, you don't lose data, since it's in the log files, but it takes a long time to remap the data into memory. The channel's main operations are done on top of the flume event queue, which is a queue of pointers to different locations in different log files. The FEQ is the queue of active data that exists within the file channel and contains the reference counts of the files. Each log file contains its own metadata; it is a write-ahead log, not a direct serialization of data. The FEQ doesn't store data, so the size of your events doesn't impact the FEQ.
  19. Polling semantics: the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (synchronous HBase API) for better performance. The null sink will drop events on the floor.
  20. Polling semantics: the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (synchronous HBase API) for better performance. The null sink will drop events on the floor.
  21. Groups active sinks together and then adds a processor. Load_balance ships with round-robin and random distribution and backoff, but you can write your own selection algorithm and plug it into the sink processor. Failover supports round-robin, random, and backoff (it won't try a failed sink until the backoff time period is over).
  22. Interface that exposes it. isActive can be used for testing. This is a way of getting data into Flume; a client can talk to Flume's avro/thrift source.