Venturing into Large Hadoop Clusters
- Varun Saxena & Naganarasimha G R
Who we are?
Naganarasimha G R
 System Architect @ Huawei
 Apache Hadoop Committer
 Currently working in the Hadoop Platform Dev team
 Overall 12.5 years of experience
Varun Saxena
 Senior Technical Lead @ Huawei
 Apache Hadoop Committer
 Currently working in the Hadoop Platform Dev team
 Overall 8.5 years of experience
Challenges in a large YARN Cluster
 As the YARN RM is a single instance, the scalability of YARN
depends on the number of nodes and applications running in
the cluster.
 Mean time to recovery (MTTR) is high, as it takes the RM more
time to load applications from the state store.
 As Hadoop clusters grow, so does the metadata generated by
them. A single instance of the YARN Application Timeline Server
with local LevelDB storage hence becomes a bottleneck.
 Difficult to debug workflows run by multiple tenants.
[Diagram: a YARN cluster with the Resource Manager (RM), a Zookeeper state store, Node Managers (NMs) running containers, the Application Master, and the Application Timeline Server (ATS) backed by a LevelDB store. Arrows show: publishing metadata (events, metrics, etc.) to ATS; NM-RM communication (NM registration and node status via heartbeat); AM-RM communication (asking the RM for resources); AM-NM communication (launching containers based on allocated resources); RM-Zookeeper communication (storing application state for recovery).]
Challenges in a large HDFS Cluster
 While storage is scalable thanks to the ability to add more
datanodes, metadata is not, and file system operations are
limited by the single instance of the NN. As clusters grow, the
number of files stored on HDFS increases as well, which can
make the single NN instance a performance bottleneck.
 In a large cluster, storage requirements increase proportionally
as well. HDFS uses replication to achieve data reliability, but
this can be expensive in a large cluster.
[Diagram: the HDFS Namenode (NN) stores namespace info (file/directory names) and does block management; Datanodes (DNs) on nodes across /rack0 and /rack1 hold the blocks, with block replication between them.]
YARN Federation
(to scale to tens of thousands of nodes)
[YARN-2915]
Why YARN Federation?
 Scalability of YARN depends on a single instance of the RM.
 How far YARN can scale depends on the number of nodes, the number of running applications and
the frequency of NM-RM and AM-RM heartbeats.
 We can scale by reducing the heartbeat frequency, but that can hurt utilization and, with heartbeat-based
scheduling, delay container allocation.
YARN Federation Architecture
[Diagram: a YARN client submits an application to the Router Service, which is backed by the Federation Services (Policy and State store). A per-node AM RM Proxy Service sits between Application Masters and the sub-clusters. Each sub-cluster (#1, #2, #3) runs its own YARN Resource Manager; AMs start containers, and tasks can span sub-clusters. Each application has a home sub-cluster and may spill over into secondary sub-clusters.]
 A large YARN cluster is broken up into multiple small sub-clusters
with a few thousand nodes each. Sub-clusters can be added or removed.
 Router Service
• Exposes ApplicationClientProtocol. Transparently hides the existence of
multiple RMs in sub-clusters.
• Applications are submitted to the Router.
• Stateless, scalable service.
 AM-RM Proxy Service
• Implements ApplicationMasterProtocol. Acts as a proxy to the YARN RM.
• Allows an application to span multiple sub-clusters.
• Runs in the NodeManager.
 Policy and State store
• Zookeeper/DB.
• The Federation State is the additional state that needs to be maintained
to loosely couple multiple individual sub-clusters into a single large
federated cluster.
• The Policy store contains information about the capacity allocations made
by users, their mapping to sub-clusters, and the policies that each of
the components (Router, AMRMProxy, RMs) should enforce.
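Because the Router exposes the standard ApplicationClientProtocol, existing client code keeps working; in a federated setup it is only the client-side configuration that points at the Router instead of an individual RM. A minimal sketch of an unchanged submission path using the public YarnClient API (the application name and queue below are illustrative, not from the slides):

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitViaRouter {
  public static void main(String[] args) throws Exception {
    // Standard client configuration; in a federated cluster the client-side
    // yarn-site.xml points at the Router rather than a specific RM.
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();

    // The Router transparently picks a home sub-cluster for the application.
    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");   // illustrative name
    ctx.setQueue("default");              // illustrative queue

    // ... set the AM ContainerLaunchContext and Resource here ...

    client.submitApplication(ctx);
    client.stop();
  }
}
```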
AM RM Proxy Internals
• Hosted in the NM
• Extensible design
• DDoS prevention
• Unmanaged AMs are used for container negotiation in secondary
sub-clusters; they are created on demand based on policy (a conceptual
sketch of the interceptor chain follows the diagram below).
[Diagram: inside the Node Manager, the AM RM Proxy Service runs a per-application pipeline (interceptor chain) between the Application Master and the RMs: a Federation Interceptor, a Security/Throttling Interceptor, …, and the Home RM Proxy. The Home RM Proxy talks to SC #1's RM, while Unmanaged AMs for SC #2 and SC #3 talk to those sub-clusters' RMs, guided by policy.]
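The per-application pipeline is a chain-of-responsibility: each interceptor can inspect or rewrite the AM's traffic before handing it to the next link, with the federation logic fanning out to sub-cluster RMs at the end. A purely conceptual Java sketch of that pattern (the types and method names here are invented for illustration and are not the actual YARN interceptor API):

```java
// Conceptual chain-of-responsibility sketch; NOT the real YARN interceptor API.
interface AllocateInterceptor {
  String allocate(String askFromAm);           // simplified request/response types
  void setNext(AllocateInterceptor next);
}

abstract class BaseInterceptor implements AllocateInterceptor {
  protected AllocateInterceptor next;
  public void setNext(AllocateInterceptor next) { this.next = next; }
}

class ThrottlingInterceptor extends BaseInterceptor {
  public String allocate(String ask) {
    // e.g. reject or delay abusive AMs here (DDoS prevention)
    return next.allocate(ask);
  }
}

class FederationInterceptor extends BaseInterceptor {
  public String allocate(String ask) {
    // Split the ask: the home RM handles most of it, unmanaged AMs in
    // secondary sub-clusters negotiate the rest, then responses are merged.
    return "merged response for: " + ask;
  }
}

public class InterceptorChainDemo {
  public static void main(String[] args) {
    AllocateInterceptor chain = new ThrottlingInterceptor();
    chain.setNext(new FederationInterceptor()); // last link talks to the RMs
    System.out.println(chain.allocate("need 10 containers"));
  }
}
```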
YARN Application Timeline
Service Next Gen (ATSv2)
(to overcome metadata scalability issues)
[YARN-2928/YARN-5355]
Overview of ATSv1
 ATSv1 introduced the notion of a Timeline Entity, which is published by
clients to the Timeline Server.
- An abstract representation of anything: an application, an
application attempt, a container or any user-defined object.
- Can define relationships between entities.
- Contains primary filters which are used to index the entities in
the Timeline Store.
- Uniquely identified by an EntityId and EntityType.
- Encapsulates events.
 Separate, single process.
 Pluggable store – defaults to LevelDB (a lightweight key-value store).
 REST interfaces.
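As a concrete illustration, a client publishes a Timeline Entity to ATSv1 roughly as below. This is a minimal sketch against the ATSv1 client API (TimelineClient/TimelineEntity); the entity type, id, filter and event names are made-up examples.

```java
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PublishToAtsV1 {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();

    // A user-defined entity, uniquely identified by (entityType, entityId).
    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("MY_WORKFLOW_STAGE");   // illustrative type
    entity.setEntityId("stage_0001");            // illustrative id
    entity.setStartTime(System.currentTimeMillis());
    // Primary filters are used to index the entity in the timeline store.
    entity.addPrimaryFilter("user", "varun");

    // Entities encapsulate events.
    TimelineEvent event = new TimelineEvent();
    event.setEventType("STAGE_STARTED");
    event.setTimestamp(System.currentTimeMillis());
    entity.addEvent(event);

    client.putEntities(entity);                  // REST call to the Timeline Server
    client.stop();
  }
}
```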
Why ATSv2 for a large cluster?
 Scalability was a concern for ATSv1
• Single global instance of the writer/reader.
• ATSv1 uses local-disk-based LevelDB storage.
 Usability
• Handle flows (a group of YARN applications) as a first-class concept and model metrics aggregation.
• Elevate configuration and metrics to first-class members and allow filtering based on them.
 Reliability
• Data in ATSv1 is stored only on a local disk.
• A single TimelineServer daemon, so a single point of failure.
ATSv2 Key Design Points
 Distributed writers, a.k.a. collectors (per app and per node), to
achieve scalability.
• Per-app collector/writer launched as part of the RM.
• Per-node collector/writer launched as an auxiliary service
in the NM.
• Plan to support standalone writers.
 Scalable and reliable backend storage (HBase as the default).
 A new object model API with flows built into it.
 Separate reader instance(s).
 Aggregation, i.e. rolling up metric values to the parent.
• Online aggregation for apps and flow runs.
• Offline aggregation for users, flows and queues.
[Diagram: the Application Master and Node Manager publish app and container events/metrics to the node's Timeline Collector, and the Resource Manager publishes YARN application events to its own Timeline Collector (write flow). Collectors write to the backend storage; a pool of Timeline Readers serves user queries (read flow).]
Collector Discovery
[Diagram: the RM on NODE #X keeps the list of app collectors ({app_1_collector_info (includes the NM collector address), app_2_collector_info, …}); Node Manager 1 on NODE #1 hosts the App Master plus the app collectors in its aux service; Node Manager 2 (NODE #2) and Node Manager X each run a Timeline Client; entities are written to HBase. An app collector is created when the RM asks an NM to launch the AM container.]
1. The NM reports the collector address for each app collector in the node heartbeat request.
2. The RM returns the collector address to the AM in the allocate response.
3. NM-RM heartbeat responses carry the collector addresses for apps to the other NMs.
4. Timeline clients in the AM and NMs publish entities to the app collector notified in the heartbeat by the RM.
5. The app collector writes the entities to HBase.
Flow
 A flow is a group of YARN applications which are launched as
part of a logical application.
 Examples: Oozie, Pig, Scalding, Hive queries, etc.
• Flow name: “sales_jan_deptA”
• Flow run id: 3
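In ATSv2 the flow context is attached at submission time through ordinary YARN application tags. As far as I recall from the ATSv2 documentation the tag prefixes below are the ones it recognizes, but treat the exact names as an assumption to verify; the flow name and run id values are the examples from the slide.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public class FlowTags {
  // Attach flow name and flow run id to an application via YARN tags.
  static void tagWithFlowContext(ApplicationSubmissionContext ctx) {
    Set<String> tags = new HashSet<>();
    tags.add("TIMELINE_FLOW_NAME_TAG:sales_jan_deptA"); // flow name (assumed tag prefix)
    tags.add("TIMELINE_FLOW_RUN_ID_TAG:3");             // flow run id (assumed tag prefix)
    ctx.setApplicationTags(tags);
  }
}
```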
Aggregation
 Aggregation basically means rolling up metrics from child entities to parent entities. We can perform different operations such as
SUM, AVG, etc. while rolling them up and store the result in the parent.
 App-level aggregation is done by the app collector as and when it receives metrics.
 Online or real-time aggregation for apps is a SUM of the metrics of child entities. Additional metrics are also stored which
indicate AVG, MAX, AREA (time integral), etc.
 By promoting metrics from the container level up to the flow, users get an overall view of, say, CPU or memory
utilization at the workflow level (a minimal code sketch follows the example below).
Example rollup:
Container A1 (CPUCoresMillis = 400) + Container A2 (CPUCoresMillis = 300) → App A (CPUCoresMillis = 700)
Container B1 (CPUCoresMillis = 200) → App B (CPUCoresMillis = 200)
App A (700) + App B (200) → Flow (CPUCoresMillis = 900)
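The rollup above is just a SUM over child metrics at each level. A minimal, storage-agnostic sketch of that aggregation (plain Java, not the ATSv2 aggregation code):

```java
import java.util.Arrays;
import java.util.List;

public class MetricRollup {
  // SUM aggregation: roll child metric values up to the parent.
  static long sum(List<Long> childValues) {
    return childValues.stream().mapToLong(Long::longValue).sum();
  }

  public static void main(String[] args) {
    long appA = sum(Arrays.asList(400L, 300L)); // containers A1 + A2
    long appB = sum(Arrays.asList(200L));       // container B1
    long flow = sum(Arrays.asList(appA, appB)); // apps rolled up to the flow
    System.out.println("App A=" + appA + " App B=" + appB + " Flow=" + flow); // 700, 200, 900
  }
}
```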
Possible use cases
 Cluster utilization and inputs for capacity planning. The cluster can learn from a flow's/application's historical
data.
 Mapper/reducer optimizations.
 Application performance over time.
 Identifying job bottlenecks.
 Ad-hoc troubleshooting and identification of problems in the cluster.
 Complex queries are possible at the flow, user and queue level. For instance, queries like the % of applications which
ran more than 10000 containers.
 The full DAG from flow to flow run to application to container level can be seen.
YARN Zookeeper State Store
improvements
(for better MTTR and to reduce load on ZK based store)
[various JIRAs]
Asynchronous Loading during RM recovery
 When an RM instance becomes active, it loads both running/incomplete and completed applications.
But this is not necessary, as completed apps do not need to be processed further after restart.
Since completed apps are only required for querying, they can be loaded asynchronously on RM restart,
thereby allowing the YARN service to be up and running earlier.
 An RMIncompleteApps node was introduced in the Zookeeper state store to keep running applications as
child nodes under its hierarchy. These app nodes have no data associated with them
and no child application attempt nodes.
 When an RM becomes active, it loads all the nodes under RMIncompleteApps to get the list of running
apps. It then reads the app and attempt data for these incomplete apps from the
corresponding app nodes under the RMAppRoot hierarchy.
 The RM is then made active and is thereby ready to serve.
 The rest of the apps, i.e. completed apps, are then loaded asynchronously in separate thread(s)
(a minimal sketch follows the node layout below).
 For 5000 running and 20000 completed apps, there was a 2x-3x improvement in MTTR.
Node layout before and after:
Before: ZKRMStateRoot → RMAppRoot → application nodes (app1…app50) → attempt nodes
After: ZKRMStateRoot → RMIncompleteApps → incomplete app nodes (app40…app50, no data); ZKRMStateRoot → RMAppRoot → application nodes (app1…app50) → attempt nodes
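A minimal sketch of the recovery ordering described above, with hypothetical listIncompleteAppIds/listCompletedAppIds/loadAppAndAttempts helpers standing in for the actual state-store reads:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncRecoverySketch {
  // Hypothetical helpers; the real RM reads these from the ZK state store.
  static List<String> listIncompleteAppIds() { return List.of("app40", "app41"); }
  static List<String> listCompletedAppIds()  { return List.of("app1", "app2"); }
  static void loadAppAndAttempts(String appId) { /* read RMAppRoot/<appId> and its attempts */ }

  public static void main(String[] args) {
    // 1. Load only the running/incomplete apps before going active.
    for (String appId : listIncompleteAppIds()) {
      loadAppAndAttempts(appId);
    }
    // 2. The RM transitions to active here and starts serving.
    // 3. Completed apps (needed only for queries) are loaded in the background.
    ExecutorService background = Executors.newSingleThreadExecutor();
    background.submit(() -> listCompletedAppIds().forEach(AsyncRecoverySketch::loadAppAndAttempts));
    background.shutdown();
  }
}
```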
Changes in node structure (YARN-2962)
 Zookeeper has a size restriction on the amount of data it can return in a single response (1 MB). In a large cluster (or otherwise), the number of app nodes
can be several thousand, and getting the child nodes under the RMAppRoot hierarchy (i.e. the list of app node names) can fail due to the 1 MB restriction.
Application node names are application IDs.
 The solution was to store application nodes hierarchically by splitting the application ID into two parts based on a configurable split
index of 1 to 4, thereby reducing the number of app nodes retrieved in a single call (a sketch of the split follows the node layout below).
 To reduce the amount of data stored in Zookeeper, improvements were also made to not store application data which is not required for completed apps.
Node layout before and after (split index 2):
Before: ZKRMStateRoot → RMAppRoot → application nodes (application_1234_0000 … application_1234_10299) → attempt nodes
After: ZKRMStateRoot → RMAppRoot → HIERARCHIES → 1 | 2 | 3 | 4; under 2, parent nodes application_1234_00 … application_1234_102, each with child nodes 00 to 99 → attempt nodes
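A sketch of how such a split might look for split index 2 (the exact znode naming in YARN-2962 may differ; this is only to illustrate trading one wide child listing for a two-level hierarchy):

```java
public class AppIdSplitSketch {
  /**
   * Split an application id into (parent znode, leaf znode) by taking the last
   * `splitIndex` digits as the leaf. With splitIndex = 2,
   * "application_1234_10299" -> parent "application_1234_102", leaf "99".
   */
  static String[] splitAppId(String appId, int splitIndex) {
    String parent = appId.substring(0, appId.length() - splitIndex);
    String leaf = appId.substring(appId.length() - splitIndex);
    return new String[] { parent, leaf };
  }

  public static void main(String[] args) {
    String[] parts = splitAppId("application_1234_10299", 2);
    // ZK path becomes .../RMAppRoot/HIERARCHIES/2/application_1234_102/99
    System.out.println(parts[0] + " / " + parts[1]);
  }
}
```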
Multithreaded loading from store
 The YARN RM used to load all applications from the state store in a single thread.
 However, we found that we can leverage the existence of multiple Zookeeper servers and split the loading of applications across
multiple threads.
 In the RM, we first get the list of applications to be read from the state store and then divide the work of reading the data associated with each
app, along with its attempts, across multiple threads (a minimal sketch follows the diagram below).
[Diagram: the YARN Resource Manager runs multiple ZK clients, each talking to a different Zookeeper server (Server1, Server2, Server3).]
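A sketch of that parallel read, using a thread pool and a hypothetical loadAppWithAttempts helper in place of the actual ZK reads:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelStateStoreLoad {
  // Hypothetical helper: read one app node plus its attempt nodes from ZK.
  static void loadAppWithAttempts(String appId) { /* ZK reads go here */ }

  public static void main(String[] args) throws InterruptedException {
    List<String> appIds = List.of("app1", "app2", "app3", "app4"); // from a single getChildren call
    ExecutorService pool = Executors.newFixedThreadPool(3);        // e.g. one worker per ZK server
    for (String appId : appIds) {
      pool.submit(() -> loadAppWithAttempts(appId));               // per-app reads run in parallel
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}
```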
HDFS Federation
(to scale to tens of thousands of nodes)
Why HDFS Federation?
 Storage scales but the namespace doesn't, which means a limited number of files, directories and blocks.
 The Namenode is a memory-intensive process and there is a limit to the heap memory which can be configured
for the Namenode process.
 Throughput of filesystem operations is limited by the single Namenode.
 The namespace has to be shared across multiple users and applications.
 Namespace and block management are tightly coupled.
HDFS Federation Architecture
 HDFS Federation uses multiple independent namespaces.
 The cluster can scale by adding more namespaces.
 Storage is common across the namespaces, i.e. the same set of datanodes
is used.
 Block pools are created for each namespace to avoid conflicting block IDs.
 Datanodes register with all the namenodes.
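Federation is wired up through the list of nameservices plus per-nameservice RPC addresses; the property names below are the standard ones as I recall them and the hostnames are made up, so treat this as a hedged sketch rather than a reference configuration.

```java
import org.apache.hadoop.conf.Configuration;

public class FederatedHdfsConfSketch {
  static Configuration federatedConf() {
    Configuration conf = new Configuration();
    // Two independent namespaces sharing the same set of datanodes.
    conf.set("dfs.nameservices", "ns1,ns2");
    conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020"); // illustrative hosts
    conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");
    // Datanodes pick up the same list, register with every namenode,
    // and maintain a separate block pool per namespace.
    return conf;
  }
}
```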
HDFS Erasure Coding
(to reduce storage and network overhead)
[HDFS-7285]
Why HDFS Erasure Coding?
 3x replication leads to 200% storage space overhead.
 Replicating data 3x also consumes network bandwidth while writing.
 EC uses almost half the storage space while providing a similar level of fault tolerance compared to 3x
replication.
 Plan to move older data to EC.
The overall objective is to achieve data durability with storage efficiency.
3-way replication can tolerate 2 failures per block and has a storage efficiency of 33%.
Erasure Coding saves storage
 XOR Coding: storing 2 bits (1, 0)
Replication: 1 1 0 0 → 2 extra bits
XOR coding: 1 0 plus parity 1 ⊕ 0 = 1 → 1 extra bit
The example above has the same data durability with half the storage overhead. But it is not very useful for HDFS on its own, as XOR generates at
most one parity cell and hence can tolerate only one failure (a small sketch follows this list).
 Reed-Solomon (RS) Coding
• Uses more sophisticated linear algebra operations to generate multiple parity cells, and thus can tolerate multiple failures per group.
• Configurable with two parameters, k and m. RS(k,m) works by multiplying a vector of k data cells with a Generator Matrix (GT) to generate an extended
codeword vector with k data cells and m parity cells. Storage failures can be recovered by multiplying the surviving cells in the codeword with the inverse
of GT, as long as k out of (k + m) cells are available. (Rows in GT corresponding to the failed cells should be deleted before taking its inverse.)
• HDFS erasure coding uses RS(6,3) by default, which means it generates 3 parity cells for every 6 data cells and can tolerate up to 3 failures (the storage arithmetic is sketched below).
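First, a tiny sketch of the XOR case above: compute one parity cell and recover a single lost cell from it (plain Java on ints rather than HDFS cells):

```java
public class XorParitySketch {
  public static void main(String[] args) {
    int d0 = 1, d1 = 0;          // the two data "cells"
    int parity = d0 ^ d1;        // one extra cell instead of one full copy per cell

    // Lose d0: recover it from the surviving cell and the parity.
    int recoveredD0 = parity ^ d1;
    System.out.println("parity=" + parity + " recovered d0=" + recoveredD0);
    // A second simultaneous failure (losing d1 as well) is unrecoverable with XOR.
  }
}
```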
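And the storage arithmetic for replication versus RS(k,m): a replication factor r stores r copies, i.e. (r-1)×100% overhead tolerating r-1 failures, while RS(k,m) stores m parity cells per k data cells, i.e. (m/k)×100% overhead tolerating m failures. A small sketch:

```java
public class EcOverheadSketch {
  static double replicationOverheadPct(int replicas) { return (replicas - 1) * 100.0; }
  static double rsOverheadPct(int k, int m)          { return m * 100.0 / k; }

  public static void main(String[] args) {
    // 3x replication: 200% overhead, tolerates 2 lost copies per block.
    System.out.printf("3x replication: %.0f%% overhead%n", replicationOverheadPct(3));
    // RS(6,3): 50% overhead, tolerates any 3 lost cells per group.
    System.out.printf("RS(6,3): %.0f%% overhead%n", rsOverheadPct(6, 3));
  }
}
```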
HDFS Erasure Coding
 Striping Layout
• Has a cell size of 64 KB by default.
• Has no data locality, as blocks are spread across datanodes, but is better for small files.
• Already available on trunk.
 Contiguous Layout
• Cell size of 128 MB, i.e. equivalent to the HDFS block size.
• Preserves data locality but does not work well for small files. For instance, with RS(10,4) a stripe with only a single 128 MB data block would
still end up writing four 128 MB parity blocks, for a storage overhead of 400% (worse than 3-way replication).
• Ongoing work in HDFS-8030.
Links and References
1. https://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
2. https://www.slideshare.net/HadoopSummit/yarn-federation
3. https://www.slideshare.net/hortonworks/federationhadoop-worldfinal
4. https://issues.apache.org/jira/browse/YARN-2915
5. https://issues.apache.org/jira/browse/HDFS-7285
6. http://conferences.oreilly.com/strata/big-dataconference-ny-2015/public/schedule/detail/42957
Questions?