SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Apache Flume 
Data Aggregation At Scale 
Arvind Prabhakar 
© 2014 StreamSets Inc., All rights reserved 
© 2014 StreamSets, Inc.
Who am I? 
© 2014 StreamSets, Inc. 
❏ Founder/CTO 
Apache Software Foundation 
❏ Flume - PMC Chair 
❏ Sqoop - PMC Chair 
❏ Storm - PMC, Committer 
❏ MetaModel - Mentor 
❏ Sentry - Mentor 
❏ ASF Member 
Previously... 
❏ Cloudera 
❏ Informatica
What is Flume? 
© 2014 StreamSets, Inc. 
Logs 
Files 
Click 
Streams 
Sensors 
Devices 
Database 
Logs 
Social Data 
Streams 
Feeds 
Other 
Raw Storage 
(HDFS, S3) 
EDW, NoSQL 
(Hive, Impala, 
HBase, 
Cassandra) 
Search 
(Solr, 
ElasticSearch) 
Enterprise Data 
Infrastructure 
Apache Flume is a 
continuous data 
ingestion system that 
is... 
● open-source, 
● reliable, 
● scalable, 
● manageable, 
● customizable, 
...and designed for Big 
Data ecosystem.
...for Big Data ecosystem? 
“Big data is an all-encompassing term for any collection 
of data sets so large and complex that it becomes difficult 
to process using traditional data processing applications.” 
Big Data from a Data Ingestion Perspective 
● Logical Data Sources are physically distributed 
● Data production is continuous / never ending 
● Data structure and semantics change without notice 
© 2014 StreamSets, Inc.
Physically Distributed Data Sources 
© 2014 StreamSets, Inc. 
● Many physical sources 
that produce data 
● Number of physical 
sources changes 
constantly 
● Sources may exist in 
different governance 
zones, data centers, 
continents...
Continuous Data Production 
“Every two days now we create as much information 
as we did from the dawn of civilization up until 2003” 
© 2014 StreamSets, Inc. 
- Eric Schmidt, 2010 
● Weather 
● Traffic 
● Automobiles 
● Trains 
● Airplanes 
● Geological/Seismic 
● Oceanographic 
● Smart Phones 
● Health Accessories 
● Medical Devices 
● Home Automation 
● Digital Cameras 
● Social Media 
● Geolocation 
● Shop Floor Sensors 
● Network Activity 
● Industry Appliances 
● Security/Surveillance 
● Server Workloads 
● Digital Telephony 
● Bio-simulations...
Ever Changing Structure of Data 
● One of your data centers 
upgrade to IPv6 192.168.0.4 
© 2014 StreamSets, Inc. 
fe80::21b:21ff:fe83:90fa 
M0137: User {jonsmith} granted access to {accounts} 
M0137: [jonsmith] granted access to [sys.accounts] 
{ 
“first”:”jon”, 
“last”:”smith”, 
“add1”:”123 Main St.”, 
“add2”:”Ste - 4”, 
“city”:”Little Town”, 
“state”:”AZ”, 
“zip”: “12121” 
} 
{ 
“first”:”jon”, 
“last”:”smith”, 
“add1”:”123 Main St.”, 
“add2”:”Ste - 4”, 
“city”:”Little Town”, 
“state”:”AZ”, 
“zip”: “12121”, 
“phone”: “(408) 555-1212” 
} 
● Application developer 
changes logs (again) 
● JSON data may contain 
more attributes than 
expected
So, from Data Ingestion Perspective: 
Massive collection of ever changing physical sources... 
Never ending data production... 
Data structure and semantics evolve continuously... 
© 2014 StreamSets, Inc.
© 2014 StreamSets, Inc. 
Flume to the Rescue!
Apache Flume 
● Originally designed to be a log 
aggregation system by 
Cloudera Engineers 
● Evolved to handle any type of 
streaming event data 
● Low-cost of installation, 
operation and maintenance 
● Highly customizable and 
extendable 
© 2014 StreamSets, Inc.
A Closer Look at Flume 
Input Agent Agent Agent Agent Destination 
● Distributed Pipeline Architecture 
● Optimized for commonly used data sources and destinations 
● Built in support for contextual routing 
● Fully customizable and extendable 
© 2014 StreamSets, Inc.
Anatomy of a Flume Agent 
© 2014 StreamSets, Inc. 
Flume Agent 
Source 
Sink 
Channel 
Incoming 
Data 
Outgoing 
Data 
Source 
● Accepts incoming 
Data 
● Scales as required 
● Writes data to 
Channel 
Sink 
● Removes data from 
Channel 
● Sends data to 
downstream Agent or 
Destination 
Channel 
● Stores data in the 
order received
Transactional Data Exchange 
Upstream Sink TX 
© 2014 StreamSets, Inc. 
Flume Agent 
Source 
Sink 
Channel 
Incoming 
Data 
Outgoing 
Data 
Source TX 
Sink TX 
● Source uses transactions to write to the channel 
● Sink uses transactions to remove data from the channel 
● Sink transaction commits only after successful transfer of data 
● This ensures no data loss in Flume pipeline
Routing and Replicating 
© 2014 StreamSets, Inc. 
Flume Agent 
Source Sink 1 
Channel 1 Incoming 
Data 
Outgoing Data 
Channel 2 
Sink 2 Outgoing Data 
● Source can replicate or multiplex data across many channels 
● Metadata headers can be used to do contextual selection of 
channels 
● Channels can be drained by different sinks to different 
destinations or pipelines
Why Channels? 
● Buffers data and insulates downstream from load spikes 
● Provides persistent store for data in case the process restarts 
● Provides flow ordering* and transactional guarantees 
© 2014 StreamSets, Inc.
© 2014 StreamSets, Inc. 
Use-Case: Log Aggregation
Starting Out Simple 
● You would like to move your web-server 
© 2014 StreamSets, Inc. 
logs to HDFS 
● Let’s assume there are only 3 web 
servers at the time of launch 
● Ad-hoc solution will likely suffice! 
Challenges 
● How do you manage your output paths on HDFS? 
● How do you maintain your client code in face of changing 
environment as well as requirements?
Adding a Single Flume Agent 
Advantages 
● Insulation from HDFS downtime 
● Quick offload of logs from Web 
Server machines 
● Better Network utilization 
Challenges 
● What if the Flume node goes down? 
● Can one Flume node accommodate all load from Web Servers? 
© 2014 StreamSets, Inc.
Adding Two Flume Agents 
Advantages 
● Redundancy and Availability 
● Better handling of downstream 
failures 
● Automatic load balancing and 
failover 
Challenges 
● What happens when new Web Servers are added? 
● Can two Flume Agents keep up with all the load from more Web 
Servers? 
© 2014 StreamSets, Inc.
Handling a Server Farm 
© 2014 StreamSets, Inc. 
A Converging Flow 
● Traffic is aggregated by Tier-2 and 
Tier-3 before being put into 
destination system 
● Closer a tier is to the destination, 
larger the batch size it delivers 
downstream 
● Optimized handling of destination 
systems
Data Volume Per Agent 
© 2014 StreamSets, Inc. 
Batch Size Variation per Agent 
● Event volume is least in the 
outermost tier 
● Event volume increases as the 
flow converges 
● Event volume is highest in the 
innermost tier
Data Volume Per Tier 
© 2014 StreamSets, Inc. 
Batch Size Variation per Tier 
● In steady state, all tiers carry 
same event volume 
● Transient variations in flow are 
absorbed and ironed out by 
channels 
● Load spikes are handled smoothly 
without overwhelming the 
infrastructure
Planning and Sizing Flume Topology 
for Log-Aggregation Use-Case 
© 2014 StreamSets, Inc.
Planning and Sizing Your Topology 
What we need to know: 
● Number of Web Servers 
● Log volume per Web 
Server per unit time 
● Destination System and 
layout (Routing 
Requirements) 
● Worst case downtime for 
destination system 
© 2014 StreamSets, Inc. 
What we will calculate: 
● Number of tiers 
● Exit Batch Sizes 
● Channel capacity
Calculating Number of Tiers 
Rule of Thumb 
One Aggregating Agent (A) can be used with 
anywhere from 4 to 16 client Agents 
Considerations 
● Must handle projected ingest volume 
● Resulting number of tiers should provide for 
routing, load-balancing and failover 
requirements 
Gotchas 
Load test to ensure that steady state and peak 
load are addressed with adequate failover capacity 
© 2014 StreamSets, Inc.
Calculating Exit Batch Size 
Rule of Thumb 
Exit batch size is same as total exit data volume 
divided by number of Agents in a tier 
Considerations 
● Having some extra room is good 
● Keep contextual routing in mind 
● Consider duplication impact when batch 
sizes are large 
Gotchas 
Load test fail-over scenario to ensure near 
steady-state drain 
© 2014 StreamSets, Inc.
Calculating Channel Capacity 
Gotchas 
© 2014 StreamSets, Inc. 
Source 
Sink X 
Source 
Sink X 
X 
Rule of Thumb 
Equal to worst case data ingest rate sustained 
over the worst case downstream outage 
interval 
Considerations 
● Multiple disks will yield better performance 
● Channel size impacts the back-pressure 
buildup in the pipeline 
You may need more disk space than the 
physical footprint of the data size
To Recap 
Number of Tiers 
Calculated with upstream to downstream Agent ration ranging from 4:1 to 
16:1. Factor in routing, failover, load-balancing requirements... 
Exit Batch Size 
Calculated for steady state data volume exiting the tier, divided by 
number of Agents in that tier. Factor in contextual routing and duplication 
due to transient failure impact... 
Channel Capacity 
Calculated as worst case ingest rate sustained over the worst case 
downstream downtime. Factor in number of disks used etc... 
© 2014 StreamSets, Inc. 
...and that’s all there is to it!
Some Highlights of Flume 
● Flume is suitable for large volume data collection, especially when 
data is being produced in multiple locations 
● Once planned and sized appropriately, Flume will practically run 
itself without any operational intervention 
● Flume provides weak ordering guarantee, i.e., in the absence of 
failures the data will arrive in the order it was received in the Flume 
pipeline 
● Transactional exchange ensures that Flume never loses any data in 
transit between Agents. Sinks use transactions to ensure data is not 
lost at point of ingest or terminal destinations. 
● Flume has rich out-of-the box features such as contextual routing, 
and support for popular data sources and destination systems 
© 2014 StreamSets, Inc.
Things that could be better... 
● Handling of poison events 
● Ability to tail files 
● Ability to handle preset data formats such as JSON, CSV, XML 
● Centralized configuration 
● Once-only delivery semantics 
● ...and more 
Remember: patches are welcome! 
© 2014 StreamSets, Inc.
Thank You! 
Contact: 
● Email: arvind at streamsets dot com 
● Twitter: @aprabhakar 
More on Flume: 
● http://flume.apache.org/ 
● User Mailing List: user-subscribe@flume.apache.org 
● Developer Mailing List: dev-subscribe@flume.apache.org 
● JIRA: https://issues.apache.org/jira/browse/FLUME 
© 2014 StreamSets, Inc.

Más contenido relacionado

La actualidad más candente

Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
DataWorks Summit
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
DataWorks Summit
 

La actualidad más candente (20)

Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
 

Destacado

Destacado (10)

大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 

Similar a Data Aggregation At Scale Using Apache Flume

Similar a Data Aggregation At Scale Using Apache Flume (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Empowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingEmpowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data Streaming
 
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingFull Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
 
Network characteristics of the cloud
Network characteristics of the cloudNetwork characteristics of the cloud
Network characteristics of the cloud
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
 
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupApache Apex - Hadoop Users Group
Apache Apex - Hadoop Users Group
 
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
 
Global Trading Infrastructure Services
Global Trading Infrastructure ServicesGlobal Trading Infrastructure Services
Global Trading Infrastructure Services
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Anatomy behind Fast Data Applications.pptx
Anatomy behind Fast Data Applications.pptxAnatomy behind Fast Data Applications.pptx
Anatomy behind Fast Data Applications.pptx
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 

Último

%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Último (20)

%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 

Data Aggregation At Scale Using Apache Flume

  • 1. Apache Flume Data Aggregation At Scale Arvind Prabhakar © 2014 StreamSets Inc., All rights reserved © 2014 StreamSets, Inc.
  • 2. Who am I? © 2014 StreamSets, Inc. ❏ Founder/CTO Apache Software Foundation ❏ Flume - PMC Chair ❏ Sqoop - PMC Chair ❏ Storm - PMC, Committer ❏ MetaModel - Mentor ❏ Sentry - Mentor ❏ ASF Member Previously... ❏ Cloudera ❏ Informatica
  • 3. What is Flume? © 2014 StreamSets, Inc. Logs Files Click Streams Sensors Devices Database Logs Social Data Streams Feeds Other Raw Storage (HDFS, S3) EDW, NoSQL (Hive, Impala, HBase, Cassandra) Search (Solr, ElasticSearch) Enterprise Data Infrastructure Apache Flume is a continuous data ingestion system that is... ● open-source, ● reliable, ● scalable, ● manageable, ● customizable, ...and designed for Big Data ecosystem.
  • 4. ...for Big Data ecosystem? “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” Big Data from a Data Ingestion Perspective ● Logical Data Sources are physically distributed ● Data production is continuous / never ending ● Data structure and semantics change without notice © 2014 StreamSets, Inc.
  • 5. Physically Distributed Data Sources © 2014 StreamSets, Inc. ● Many physical sources that produce data ● Number of physical sources changes constantly ● Sources may exist in different governance zones, data centers, continents...
  • 6. Continuous Data Production “Every two days now we create as much information as we did from the dawn of civilization up until 2003” © 2014 StreamSets, Inc. - Eric Schmidt, 2010 ● Weather ● Traffic ● Automobiles ● Trains ● Airplanes ● Geological/Seismic ● Oceanographic ● Smart Phones ● Health Accessories ● Medical Devices ● Home Automation ● Digital Cameras ● Social Media ● Geolocation ● Shop Floor Sensors ● Network Activity ● Industry Appliances ● Security/Surveillance ● Server Workloads ● Digital Telephony ● Bio-simulations...
  • 7. Ever Changing Structure of Data ● One of your data centers upgrade to IPv6 192.168.0.4 © 2014 StreamSets, Inc. fe80::21b:21ff:fe83:90fa M0137: User {jonsmith} granted access to {accounts} M0137: [jonsmith] granted access to [sys.accounts] { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121” } { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121”, “phone”: “(408) 555-1212” } ● Application developer changes logs (again) ● JSON data may contain more attributes than expected
  • 8. So, from Data Ingestion Perspective: Massive collection of ever changing physical sources... Never ending data production... Data structure and semantics evolve continuously... © 2014 StreamSets, Inc.
  • 9. © 2014 StreamSets, Inc. Flume to the Rescue!
  • 10. Apache Flume ● Originally designed to be a log aggregation system by Cloudera Engineers ● Evolved to handle any type of streaming event data ● Low-cost of installation, operation and maintenance ● Highly customizable and extendable © 2014 StreamSets, Inc.
  • 11. A Closer Look at Flume Input Agent Agent Agent Agent Destination ● Distributed Pipeline Architecture ● Optimized for commonly used data sources and destinations ● Built in support for contextual routing ● Fully customizable and extendable © 2014 StreamSets, Inc.
  • 12. Anatomy of a Flume Agent © 2014 StreamSets, Inc. Flume Agent Source Sink Channel Incoming Data Outgoing Data Source ● Accepts incoming Data ● Scales as required ● Writes data to Channel Sink ● Removes data from Channel ● Sends data to downstream Agent or Destination Channel ● Stores data in the order received
  • 13. Transactional Data Exchange Upstream Sink TX © 2014 StreamSets, Inc. Flume Agent Source Sink Channel Incoming Data Outgoing Data Source TX Sink TX ● Source uses transactions to write to the channel ● Sink uses transactions to remove data from the channel ● Sink transaction commits only after successful transfer of data ● This ensures no data loss in Flume pipeline
  • 14. Routing and Replicating © 2014 StreamSets, Inc. Flume Agent Source Sink 1 Channel 1 Incoming Data Outgoing Data Channel 2 Sink 2 Outgoing Data ● Source can replicate or multiplex data across many channels ● Metadata headers can be used to do contextual selection of channels ● Channels can be drained by different sinks to different destinations or pipelines
  • 15. Why Channels? ● Buffers data and insulates downstream from load spikes ● Provides persistent store for data in case the process restarts ● Provides flow ordering* and transactional guarantees © 2014 StreamSets, Inc.
  • 16. © 2014 StreamSets, Inc. Use-Case: Log Aggregation
  • 17. Starting Out Simple ● You would like to move your web-server © 2014 StreamSets, Inc. logs to HDFS ● Let’s assume there are only 3 web servers at the time of launch ● Ad-hoc solution will likely suffice! Challenges ● How do you manage your output paths on HDFS? ● How do you maintain your client code in face of changing environment as well as requirements?
  • 18. Adding a Single Flume Agent Advantages ● Insulation from HDFS downtime ● Quick offload of logs from Web Server machines ● Better Network utilization Challenges ● What if the Flume node goes down? ● Can one Flume node accommodate all load from Web Servers? © 2014 StreamSets, Inc.
  • 19. Adding Two Flume Agents Advantages ● Redundancy and Availability ● Better handling of downstream failures ● Automatic load balancing and failover Challenges ● What happens when new Web Servers are added? ● Can two Flume Agents keep up with all the load from more Web Servers? © 2014 StreamSets, Inc.
  • 20. Handling a Server Farm © 2014 StreamSets, Inc. A Converging Flow ● Traffic is aggregated by Tier-2 and Tier-3 before being put into destination system ● Closer a tier is to the destination, larger the batch size it delivers downstream ● Optimized handling of destination systems
  • 21. Data Volume Per Agent © 2014 StreamSets, Inc. Batch Size Variation per Agent ● Event volume is least in the outermost tier ● Event volume increases as the flow converges ● Event volume is highest in the innermost tier
  • 22. Data Volume Per Tier © 2014 StreamSets, Inc. Batch Size Variation per Tier ● In steady state, all tiers carry same event volume ● Transient variations in flow are absorbed and ironed out by channels ● Load spikes are handled smoothly without overwhelming the infrastructure
  • 23. Planning and Sizing Flume Topology for Log-Aggregation Use-Case © 2014 StreamSets, Inc.
  • 24. Planning and Sizing Your Topology What we need to know: ● Number of Web Servers ● Log volume per Web Server per unit time ● Destination System and layout (Routing Requirements) ● Worst case downtime for destination system © 2014 StreamSets, Inc. What we will calculate: ● Number of tiers ● Exit Batch Sizes ● Channel capacity
  • 25. Calculating Number of Tiers Rule of Thumb One Aggregating Agent (A) can be used with anywhere from 4 to 16 client Agents Considerations ● Must handle projected ingest volume ● Resulting number of tiers should provide for routing, load-balancing and failover requirements Gotchas Load test to ensure that steady state and peak load are addressed with adequate failover capacity © 2014 StreamSets, Inc.
  • 26. Calculating Exit Batch Size Rule of Thumb Exit batch size is same as total exit data volume divided by number of Agents in a tier Considerations ● Having some extra room is good ● Keep contextual routing in mind ● Consider duplication impact when batch sizes are large Gotchas Load test fail-over scenario to ensure near steady-state drain © 2014 StreamSets, Inc.
  • 27. Calculating Channel Capacity Gotchas © 2014 StreamSets, Inc. Source Sink X Source Sink X X Rule of Thumb Equal to worst case data ingest rate sustained over the worst case downstream outage interval Considerations ● Multiple disks will yield better performance ● Channel size impacts the back-pressure buildup in the pipeline You may need more disk space than the physical footprint of the data size
  • 28. To Recap Number of Tiers Calculated with upstream to downstream Agent ration ranging from 4:1 to 16:1. Factor in routing, failover, load-balancing requirements... Exit Batch Size Calculated for steady state data volume exiting the tier, divided by number of Agents in that tier. Factor in contextual routing and duplication due to transient failure impact... Channel Capacity Calculated as worst case ingest rate sustained over the worst case downstream downtime. Factor in number of disks used etc... © 2014 StreamSets, Inc. ...and that’s all there is to it!
  • 29. Some Highlights of Flume ● Flume is suitable for large volume data collection, especially when data is being produced in multiple locations ● Once planned and sized appropriately, Flume will practically run itself without any operational intervention ● Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline ● Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations. ● Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems © 2014 StreamSets, Inc.
  • 30. Things that could be better... ● Handling of poison events ● Ability to tail files ● Ability to handle preset data formats such as JSON, CSV, XML ● Centralized configuration ● Once-only delivery semantics ● ...and more Remember: patches are welcome! © 2014 StreamSets, Inc.
  • 31. Thank You! Contact: ● Email: arvind at streamsets dot com ● Twitter: @aprabhakar More on Flume: ● http://flume.apache.org/ ● User Mailing List: user-subscribe@flume.apache.org ● Developer Mailing List: dev-subscribe@flume.apache.org ● JIRA: https://issues.apache.org/jira/browse/FLUME © 2014 StreamSets, Inc.