2. Who we are?
Naganarasimha G R
System Architect@ Huawei
Apache Hadoop Committer.
Currently working in Hadoop Platform Dev team
Overall 12.5 years of experience.
Varun Saxena
Senior Technical Lead @ Huawei
Apache Hadoop Committer.
Currently working in Hadoop Platform Dev team
Overall 8.5 years of experience.
3. Challenges in a large YARN Cluster
As the YARN ResourceManager (RM) is a single instance, YARN's scalability
depends on the number of nodes and applications
running in the cluster.
Mean time to recovery (MTTR) is high, as it takes
the RM more time to load applications from the state
store.
As Hadoop clusters grow, so does the metadata
they generate. A single instance of the YARN
Application Timeline Server with local LevelDB
storage hence becomes a bottleneck.
It is difficult to debug workflows run by multiple
tenants.
[Diagram: YARN Resource Manager (RM) backed by a Zookeeper state store; YARN Node Managers (NM) on each node running containers and an Application Master; YARN Application Timeline Server (ATS) backed by a LevelDB store]
Publishing metadata (events, metrics, etc.) to ATS
NM-RM communication (NM registration and node status via heartbeat)
AM-RM communication (ask for resources from RM)
AM-NM communication (launch containers based on resources allocated)
RM-Zookeeper communication (to store application state for recovery)
4. Challenges in a large HDFS Cluster
While storage is scalable thanks to the ability to add
more datanodes, metadata is not, and file system
operations are limited by the single instance of the NameNode (NN). As
clusters grow bigger, the number of files stored on HDFS
increases as well, which can make the single NN instance
a performance bottleneck.
In a large cluster, storage requirements increase
proportionally as well. HDFS uses replication to
achieve data reliability, but this can be expensive in a
large cluster.
[Diagram: HDFS Namenode (NN), which stores namespace info (file/directory names) and does block management, with HDFS Datanodes (DN) on /rack0 and /rack1 replicating blocks]
6. Why YARN Federation?
Scalability of YARN depends on the single instance of the RM.
How far YARN can scale is proportional to the number of nodes, the number of running applications, and the
frequency of NM-RM and AM-RM heartbeats.
We can scale further by reducing the frequency of heartbeats, but that can hurt utilization and, in heartbeat-based
scheduling, delay container allocation.
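The heartbeat trade-off above can be made concrete with a rough back-of-envelope sketch (the function name and all numbers are illustrative, not taken from YARN):

```python
def rm_heartbeats_per_second(num_nodes, num_running_apps,
                             nm_interval_s=1.0, am_interval_s=1.0):
    """Rough RM load estimate: every NM and every running AM heartbeats
    the single RM independently, so load grows linearly with both counts."""
    return num_nodes / nm_interval_s + num_running_apps / am_interval_s

# 10,000 nodes and 1,000 running apps at 1s intervals:
print(rm_heartbeats_per_second(10_000, 1_000))                     # 11000.0
# Doubling the NM interval halves the NM share of the load, but delays
# heartbeat-driven scheduling decisions:
print(rm_heartbeats_per_second(10_000, 1_000, nm_interval_s=2.0))  # 6000.0
```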
7. YARN Federation Architecture
[Diagram: a YARN client submits an app to the Router Service; the federation services (Router Service, Policy/State store) front three sub-clusters, each with its own YARN Resource Manager; a per-node AM RM Proxy Service lets AMs start containers (tasks) across sub-clusters]
A large YARN cluster is broken up into multiple small sub-clusters with a few thousand nodes each. Sub-clusters can be
added or removed.
Router Service
• Exposes ApplicationClientProtocol. Transparently hides the existence of
multiple RMs in sub-clusters.
• Application is submitted to Router.
• Stateless, scalable service.
AM-RM Proxy Service
• Implements ApplicationMasterProtocol. Acts as a proxy to YARN RM.
• Allows application to span across multiple sub-clusters.
• Runs in NodeManager.
Policy and State store
• Zookeeper/DB.
• The Federation State defines the additional state that needs to be
maintained to loosely couple multiple individual sub-clusters into a
single large federated cluster.
• The Policy store contains information about the capacity allocations made
by users, their mapping to sub-clusters, and the policies that each of
the components (Router, AMRMProxy, RMs) should enforce.
Each application has a home sub-cluster (where its AM runs) and may obtain
resources from secondary sub-clusters.
8. AM RM Proxy Internals
• Hosted in NM
• Extensible Design
• DDoS Prevention
• Unmanaged AMs are used for container
negotiation. They are created on demand
based on policy.
[Diagram: inside the Node Manager, the AM RM Proxy Service runs a per-application pipeline (interceptor chain) of a Federation Interceptor, a Security/Throttling Interceptor, etc., ending in a Home RM Proxy; policy-driven Unmanaged AMs for SC #2 and SC #3 talk to the SC #1, SC #2 and SC #3 RMs]
10. Overview of ATSv1
ATSv1 introduced the notion of a Timeline Entity, which is published by
clients to the Timeline Server.
- It is an abstract concept that can represent anything: an application, an
application attempt, a container, or any user-defined object.
- Can define relationships between entities.
- Contains primary filters, which are used to index the entities in
the Timeline Store.
- Uniquely identified by an EntityId and EntityType.
- Encapsulates events.
Runs as a separate, single process.
Pluggable store, defaulting to LevelDB (a lightweight key-value store).
REST interfaces.
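The entity model above can be sketched as a simplified stand-in (this is an illustration of the concept, not the actual YARN TimelineEntity class):

```python
from dataclasses import dataclass, field

@dataclass
class TimelineEntity:
    """Simplified sketch of the ATSv1 entity model (not the real YARN API)."""
    entity_id: str        # together with entity_type, uniquely identifies it
    entity_type: str      # e.g. "YARN_APPLICATION", or any user-defined type
    primary_filters: dict = field(default_factory=dict)   # indexed in the store
    related_entities: dict = field(default_factory=dict)  # type -> ids, models relationships
    events: list = field(default_factory=list)            # encapsulated events

app = TimelineEntity("application_1234_0001", "YARN_APPLICATION",
                     primary_filters={"user": "alice"})
app.events.append({"event_type": "APP_SUBMITTED", "timestamp": 1500000000000})
app.related_entities["YARN_APPLICATION_ATTEMPT"] = ["appattempt_1234_0001_000001"]
```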
11. Why ATSv2 for a large cluster?
Scalability was a concern for ATSv1
• Single global instance of writer/reader
• ATSv1 uses local disk based LevelDB storage
Usability
• Handle flows (a group of YARN applications) as a first-class concept and model metrics aggregation.
• Elevate configuration and metrics to first-class members and allow filtering based on them.
Reliability
• Data in ATSv1 is stored only on a local disk.
• A single TimelineServer daemon means a single point of failure.
12. ATSv2 Key Design Points
Distributed writers, aka collectors (per app and per node), to
achieve scalability.
• Per App Collector/Writer launched as part of RM.
• Per Node Collector/Writer launched as an auxiliary service
in NM.
• Plan to support standalone writers.
Scalable and reliable backend storage (HBase as default)
A new object model API with flows built into it.
Separate reader instance(s).
Aggregation, i.e. rolling up the metric values to the parent.
• Online aggregation for apps and flow runs.
• Offline aggregation for users, flows and queues.
[Diagram: write flow — the Application Master and Node Managers publish app and container events/metrics to Timeline Collectors (another Timeline Collector in the Resource Manager handles YARN application events), which write to the storage backend; read flow — user queries are served by a Timeline Reader pool]
13. Collector Discovery
[Diagram: the Resource Manager (RM) with its list of app collectors, Node Managers 1, 2 and X on separate nodes, and an HBase backend; an App Collector runs as an auxiliary service in the NM and is created when the RM asks an NM to launch an AM container]
1. Each NM reports the collector address for every app collector it hosts in its node heartbeat request ({ app_1_collector_info (includes NM collector address), app_2_collector_info, … }).
2. The RM sends the collector address to the App Master in the allocate response.
3. The NM-RM heartbeat response carries the collector addresses for apps, so the Timeline Clients on other NMs learn them.
4. Timeline Clients (in the AM and NMs) publish entities to the app collector they were notified of in the heartbeat by the RM.
5. The app collector writes the entities to HBase.
14. Flow
A flow is a group of YARN applications which are launched as
part of a logical app, e.g. Oozie, Pig, Scalding, or Hive queries.
• Flow name : “sales_jan_deptA”
• Flow run id: 3
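Grouping applications by their flow context can be sketched like this (the record layout is hypothetical; ATSv2 stores a flow name and run id alongside each application):

```python
from collections import defaultdict

# Hypothetical records: each YARN application tagged with its flow context.
apps = [
    {"app_id": "application_1234_0001", "flow": "sales_jan_deptA", "run_id": 3},
    {"app_id": "application_1234_0002", "flow": "sales_jan_deptA", "run_id": 3},
    {"app_id": "application_1234_0003", "flow": "sales_jan_deptA", "run_id": 2},
]

# A flow run groups all applications launched as part of one logical app run.
flow_runs = defaultdict(list)
for a in apps:
    flow_runs[(a["flow"], a["run_id"])].append(a["app_id"])

print(flow_runs[("sales_jan_deptA", 3)])
# ['application_1234_0001', 'application_1234_0002']
```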
15. Aggregation
Aggregation basically means rolling up metrics from child entities to parent entities. We can perform different operations such as
SUM, AVG, etc. while rolling them up, and store the results in the parent.
App-level aggregation is done by the app collector as and when it receives different metrics.
Online (real-time) aggregation for apps is a SUM of the metrics of child entities. Additional metrics are also stored which
indicate AVG, MAX, AREA (time integral), etc.
By promoting metrics from the container level up to the flow, users can get an overall view of, say, CPU or memory
utilization at the workflow level.
Flow (CPUCoresMillis = 900)
  App A (CPUCoresMillis = 700): Container A1 (400) + Container A2 (300)
  App B (CPUCoresMillis = 200): Container B1 (200)
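The rollup in the example above amounts to summing each level's children (a toy sketch using the numbers from the tree):

```python
# Metric values from the tree above (CPUCoresMillis per container).
containers = {
    "App A": {"Container A1": 400, "Container A2": 300},
    "App B": {"Container B1": 200},
}

# App-level aggregation: SUM the metrics of each app's child containers.
apps = {app: sum(c.values()) for app, c in containers.items()}
# Flow-level aggregation: SUM the metrics of the child apps.
flow_total = sum(apps.values())

print(apps)        # {'App A': 700, 'App B': 200}
print(flow_total)  # 900
```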
16. Possible use cases
Cluster utilization and inputs for capacity planning; the cluster can learn from a flow's/application's historical
data.
Mappers / reducers optimizations.
Application performance over time.
Identifying job bottlenecks.
Ad-hoc troubleshooting and identification of problems in cluster.
Complex queries are possible at the flow, user and queue level; for instance, the percentage of applications which
ran more than 10000 containers.
The full DAG from flow to flow run to application to container level can be seen.
17. YARN Zookeeper State Store
improvements
(for better MTTR and to reduce load on the ZK-based store)
[various JIRAs]
18. Asynchronous Loading during RM recovery
When an RM instance becomes active, it loads both running/incomplete and completed applications.
But this is not necessary, as completed apps do not need any further processing upon restart.
As completed apps are only required for querying, they can be loaded asynchronously on RM restart,
thereby allowing the YARN service to be up and running earlier.
An RMIncompleteApps node was introduced in the Zookeeper state store to hold the running applications as
child nodes under its hierarchy. These app nodes neither have any data associated with them
nor have child application attempt nodes.
When an RM becomes active, it loads all the nodes under RMIncompleteApps to get the list of running apps.
It then reads the app and attempt data for these incomplete apps from
the corresponding app nodes under the RMAppRoot hierarchy.
The RM is then made active and is thereby ready to serve.
The rest of the apps, i.e. completed apps, are then loaded asynchronously in separate thread(s).
For 5000 running and 20000 completed apps, there was a 2x-3x improvement in MTTR.
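The two-phase recovery described above can be sketched as follows (function and parameter names are illustrative, not the actual RM code):

```python
import threading

def recover(incomplete_ids, completed_ids, load_app, mark_active):
    """Two-phase recovery sketch (names are illustrative, not RM code).

    Incomplete apps must be loaded before serving because the scheduler
    needs them; completed apps are only needed for queries, so they are
    loaded in a background thread after the RM has become active."""
    for app_id in incomplete_ids:            # phase 1: synchronous load
        load_app(app_id)
    mark_active()                            # RM starts serving here

    def load_completed():                    # phase 2: asynchronous load
        for app_id in completed_ids:
            load_app(app_id)

    t = threading.Thread(target=load_completed, daemon=True)
    t.start()
    return t
```

Joining the returned thread is only needed in tests; in a real system the completed apps would simply become queryable as they load.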
[Diagram: before — ZKRMStateRoot holds only the RMAppRoot hierarchy, whose application nodes (app1…app50) carry the data and attempt child nodes; after — ZKRMStateRoot additionally holds an RMIncompleteApps node whose empty children (app40…app50) name the incomplete apps, while the full data stays under RMAppRoot]
19. Changes in node structure (YARN-2962)
Zookeeper restricts the amount of data it can return in a single message (1 MB). In a large cluster (or otherwise), the number of app nodes
can reach several thousand, and getting the child nodes under the RMAppRoot hierarchy (i.e. the list of app node names) can fail due to this 1MB restriction.
Application nodes are equivalent to application IDs.
The solution to this problem was to store application nodes hierarchically by splitting the application ID into two parts based on a configurable split
index of 1 to 4, thereby reducing the number of app nodes retrieved in a single call.
To reduce the amount of data stored in Zookeeper, improvements were also made to not store application data that is not required for completed apps.
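The split can be sketched in a few lines (the helper name is hypothetical; the split index is assumed, as in the node layout shown below, to count digits taken from the end of the application ID):

```python
def split_app_node_path(app_id: str, split_index: int) -> str:
    """Sketch of the YARN-2962 hierarchical split (illustrative helper).

    The last `split_index` digits of the application ID become a child
    node under a parent node holding the remaining prefix, so no single
    getChildren() call has to return every application node."""
    if not 1 <= split_index <= 4:
        raise ValueError("split index must be between 1 and 4")
    return f"{app_id[:-split_index]}/{app_id[-split_index:]}"

# With split index 2, up to 100 leaves (00..99) live under each parent:
print(split_app_node_path("application_1234_10299", 2))
# application_1234_102/99
```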
[Diagram: before — RMAppRoot directly holds all application nodes (application_1234_0000 … application_1234_10299), each with attempt child nodes; after — RMAppRoot holds a HIERARCHIES node with split indices 1 to 4, under which parent nodes such as application_1234_00 and application_1234_102 each hold child nodes 00 to 99]
20. Multithreaded loading from store
The YARN RM used to load all applications from the state store in a single thread.
However, we found that we can leverage the existence of multiple Zookeeper servers and split the loading of applications across
multiple threads.
In the RM, we first get the list of applications to be read from the state store, and then divide the work of reading the data associated with each
app, along with its attempts, among multiple threads.
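A minimal sketch of this fan-out (illustrative; the real RM uses its own thread pool and ZK client, not these names):

```python
from concurrent.futures import ThreadPoolExecutor

def load_apps_parallel(app_ids, read_app_data, num_threads=4):
    """Fan the state-store reads out over a thread pool (illustrative names).

    Each call to read_app_data would fetch one app's data plus its
    attempts; with several ZooKeeper servers, reads proceed concurrently."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves input order, so results line up with app_ids.
        return list(pool.map(read_app_data, app_ids))

# Toy stand-in for a state-store read:
print(load_apps_parallel(["app1", "app2", "app3"], lambda a: a.upper()))
# ['APP1', 'APP2', 'APP3']
```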
[Diagram: three clients inside the YARN Resource Manager reading in parallel from ZK Server1, ZK Server2 and ZK Server3]
22. Why HDFS Federation?
Storage scales but the namespace doesn't, which means a limited number of files, directories and blocks.
The namenode is a memory-intensive process, and there is a limit to the heap memory which can be configured
for the namenode process.
Throughput of filesystem operations is limited due to a single namenode.
Namespace has to be shared across multiple users and applications.
Namespace and block management are tightly coupled.
23. HDFS Federation Architecture
HDFS Federation uses multiple independent namespaces.
Cluster can scale by adding more namespaces.
Storage is common across the multiple namespaces, i.e. the same set of datanodes
is used.
Block pools are created for each namespace to avoid conflicting block IDs.
Datanodes register to all the namenodes.
25. Why HDFS Erasure Coding?
3x replication leads to a 200% storage space overhead.
Replicating data 3 times also uses network bandwidth while writing.
EC uses almost half the storage space while providing a similar level of fault tolerance compared to 3x
replication.
The plan is to move older data to EC.
The overall objective is to achieve data durability with storage efficiency.
3-way replication can tolerate 2 failures per block and has a storage efficiency of 33%.
26. Erasure Coding saves storage
XOR Coding: storing 2 bits
Replication: 2 extra bits (a full copy of each)
XOR coding: 1 ⊕ 0 = 1, i.e. 1 extra parity bit
The example above has the same data durability with half the storage overhead. But XOR is not
very useful for HDFS, as it can generate at most one parity cell and hence can tolerate only one failure.
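The XOR example can be checked in a couple of lines (bits 1 and 0 are used for illustration):

```python
# Two data bits plus one XOR parity bit (instead of one full copy of each).
d1, d2 = 1, 0
parity = d1 ^ d2   # 1 XOR 0 = 1

# Any single lost cell is recoverable by XOR-ing the two survivors:
assert d1 == parity ^ d2
assert d2 == parity ^ d1
# But with one parity cell, a second simultaneous failure is unrecoverable,
# which is why plain XOR is of limited use for HDFS.
print(parity)  # 1
```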
Reed-Solomon (RS) Coding
• Uses sophisticated linear algebra operations to generate multiple parity cells, and thus can tolerate multiple failures per group.
• Configurable with two parameters, k and m. RS(k,m) works by multiplying a vector of k data cells with a Generator Matrix (GT) to generate an extended
codeword vector with k data cells and m parity cells. Storage failures can be recovered by multiplying the surviving cells in the codeword with the inverse
of GT, as long as k out of (k + m) cells are available. (Rows in GT corresponding to failed units should be deleted before taking its inverse.)
• HDFS Erasure Coding uses RS(6,3) by default, which means it generates 3 parity cells for 6 cells of data and can tolerate up to 3 failures.
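The storage-efficiency comparison above reduces to a one-line formula (a sketch; "units" are blocks for replication and cells for EC):

```python
def storage_overhead(data_units, parity_units):
    """Extra storage written, as a fraction of the data stored."""
    return parity_units / data_units

# 3x replication: 2 extra copies per block -> 200% overhead.
print(storage_overhead(1, 2))  # 2.0
# RS(6,3): 3 parity cells per 6 data cells -> 50% overhead,
# while still tolerating up to 3 failures per group.
print(storage_overhead(6, 3))  # 0.5
```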
27. HDFS Erasure Coding
Striping Layout
• Has a cell size of 64KB by default.
• Has no data locality as blocks are spread across datanodes, but is better for small files.
• Already available on trunk.
Contiguous Layout
• Cell size of 128 MB, i.e. equivalent to the HDFS block size.
• Provides data locality but does not work well for small files. For instance, with RS(10,4) a stripe with only a single 128MB data block would
still end up writing four 128MB parity blocks, for a storage overhead of 400% (worse than 3-way replication).
• Ongoing work in HDFS-8030.
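The small-file penalty of the contiguous layout can be computed directly (a sketch; the defaults mirror the RS(10,4), 128MB-block example above):

```python
import math

def contiguous_parity_overhead(file_mb, k=10, m=4, block_mb=128):
    """Parity overhead of a file under the contiguous EC layout.

    A sketch under RS(k, m): every stripe writes m full parity blocks,
    no matter how few of its k data blocks the file actually fills."""
    data_blocks = max(1, math.ceil(file_mb / block_mb))
    stripes = math.ceil(data_blocks / k)
    parity_mb = stripes * m * block_mb
    return parity_mb / file_mb

# A single 128MB data block with RS(10,4) still writes four 128MB parity
# blocks: 400% overhead, worse than 3-way replication.
print(contiguous_parity_overhead(128))   # 4.0
# A full stripe (10 blocks, 1280MB) amortizes the parity down to 40%:
print(contiguous_parity_overhead(1280))  # 0.4
```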