HDFS Federation allows HDFS to scale beyond the limitations of a single namenode by federating the namespace and block management across multiple independent namenodes. This simplifies the design and implementation compared to a distributed namenode approach. Existing single-namenode deployments are not impacted and can continue running as is. Federation preserves the robustness of individual namenodes while scaling to more of them. It generalizes the block storage layer to allow multiple namenodes to share the same block storage. This improves isolation and availability while providing a simpler way to scale HDFS in the near term.
2. Single Namenode Limitations
- Namespace: the NN process stores the entire metadata in memory, so the number of objects (files + blocks) is limited by the heap size. A 50GB heap holds about 200 million objects, supporting 4,000 DNs and 12 PB of storage at a 40 MB average file size.
- Storage growth: DN storage is growing from 4TB to 36TB and cluster sizes toward 8,000 DNs, pushing total storage from 12PB to more than 100PB.
- Performance: file system operations are limited by the throughput of a single NN, a bottleneck for the next generation of MapReduce.
- Isolation: experimental apps can affect production apps.
- Cluster availability: failure of the single namenode brings down the entire cluster.
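A rough back-of-the-envelope check of those numbers (the 3x replication factor and the assumption that a 40 MB file occupies a single block are inferences, not stated on the slide): at 40 MB per file each file is roughly one block, so 200 million objects is about 100 million files plus 100 million blocks.

\[
10^{8}\ \text{files} \times 40\,\mathrm{MB} \approx 4\,\mathrm{PB}, \qquad 4\,\mathrm{PB} \times 3\ \text{(replication)} = 12\,\mathrm{PB}
\]
\[
\frac{12\,\mathrm{PB}}{4000\ \mathrm{DNs}} = 3\,\mathrm{TB/DN}, \qquad \frac{50\,\mathrm{GB\ heap}}{2\times 10^{8}\ \text{objects}} \approx 250\ \text{bytes/object}
\]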
3. Scaling the Name Service: Separate Block Management from NN
[Chart, not to scale: # names (100M to 20B) on one axis vs. # clients (1x to 100x) on the other, comparing five approaches: all NS in memory; multiple namespace volumes; partial NS in memory with namespace volumes; partial NS (cache) in memory with archives; and a distributed namenode. Callouts: block reports for billions of blocks require rethinking the block layer; the distributed namenode has good isolation properties.]
4. Why Vertical Scaling Is Not Sufficient
- Why not use NNs with 512GB of memory?
- Startup time is huge: currently 30 minutes to 2 hours for a 50GB NN heap.
- Stop-the-world GC failures can bring down the cluster: all DNs could be declared dead.
- Debugging problems with a large JVM heap is harder.
- Optimizing NN memory usage is expensive: changes in trunk reduce used memory, but at the cost of development time and code complexity, with diminishing returns.
5. Why Federation? Simplicity
- Simpler, robust design: multiple independent namenodes. Core development took 3.5 months, with changes mostly in the Datanode, configuration, and tools, and very little change in the Namenode.
- Simpler implementation than a Distributed Namenode: less scalable, but it serves the immediate needs.
- Federation is an optional feature: the existing single-NN configuration is supported as is.
6. HDFS Background
HDFS has 2 main layers:
- Namespace management: manages the namespace consisting of directories, files, and blocks; supports file system operations such as create/modify/list files and directories.
- Block storage, which itself has two parts:
  - Block management: manages DN membership; supports add/delete/modify/get block location; manages replication and replica placement.
  - Physical storage: supports read/write access to blocks.
(Diagram: a Namenode holding the namespace and block management above a row of Datanodes providing the physical storage.)
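A minimal sketch of this layering as hypothetical Java interfaces; the real Hadoop code is organized differently, so treat this purely as a conceptual illustration:

```java
// Hypothetical interfaces illustrating HDFS's two-layer split.
// None of these names exist in Hadoop; they only mirror the slide's layering.

/** Namespace layer: directories, files, and the blocks that make up files. */
interface NamespaceManager {
  void create(String path);
  void rename(String src, String dst);
  String[] list(String dirPath);
}

/** Block management half of the block storage layer (lives in the NN). */
interface BlockManager {
  void registerDatanode(String datanodeId);   // DN membership
  long addBlock(String path);                 // allocate a new block
  String[] getBlockLocations(long blockId);   // DNs holding replicas
  void ensureReplication(long blockId);       // replication & replica placement
}

/** Physical storage half, implemented by each datanode. */
interface PhysicalStorage {
  byte[] readBlock(long blockId, long offset, int length);
  void writeBlock(long blockId, byte[] data);
}
```

Federation's key move is generalizing the block storage layer so that several independent NamespaceManager/BlockManager pairs can sit on top of the same pool of datanodes.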
12. Datanode Changes
- A thread per NN: the DN registers with all the NNs, sends a periodic heartbeat with a utilization summary to each NN, and sends a block report to each NN for its block pool. NNs can be added/removed/upgraded on the fly.
- Block pools: automatically created when a DN talks to an NN. A block is identified by ExtendedBlockID = BlockPoolID + BlockID; the Block Pool ID is unique across clusters, which enables merging clusters.
- DN data structures are "indexed" by BPID: the BlockMap, storage, etc. (see the sketch below).
- Upgrade/rollback happens per block pool / per NN.
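A simplified sketch of block-pool-aware identification and per-BPID indexing; HDFS's actual ExtendedBlock class and DN data structures are more involved, so the class names and fields here are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified: a block is globally identified by (block pool ID, block ID). */
final class ExtendedBlockId {
  final String blockPoolId;  // unique across clusters, enabling cluster merges
  final long blockId;

  ExtendedBlockId(String blockPoolId, long blockId) {
    this.blockPoolId = blockPoolId;
    this.blockId = blockId;
  }

  @Override public boolean equals(Object o) {
    if (!(o instanceof ExtendedBlockId)) return false;
    ExtendedBlockId b = (ExtendedBlockId) o;
    return blockId == b.blockId && blockPoolId.equals(b.blockPoolId);
  }

  @Override public int hashCode() {
    return blockPoolId.hashCode() * 31 + Long.hashCode(blockId);
  }
}

/** Simplified DN-side view: everything is indexed by block pool ID first. */
class DatanodeBlockIndex {
  // BPID -> (blockId -> on-disk replica path). Keeping one map per block
  // pool is what lets each pool be upgraded or rolled back independently.
  private final Map<String, Map<Long, String>> blockMapByPool =
      new ConcurrentHashMap<>();

  void addBlockPool(String bpid) {  // created when the DN first talks to an NN
    blockMapByPool.putIfAbsent(bpid, new ConcurrentHashMap<>());
  }

  void addReplica(ExtendedBlockId id, String replicaPath) {
    blockMapByPool.get(id.blockPoolId).put(id.blockId, replicaPath);
  }
}
```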
13. Other Changes
- Decommissioning: tools to initiate and monitor decommissioning at all the NNs.
- Balancer: allows balancing at the datanode or block pool level.
- Datanode daemons: the disk scanner and directory scanner are adapted to federation.
- NN Web UI: additionally shows the NN's block pool storage utilization.
14. New Cluster Manager Web UI
- Cluster summary: shows overall cluster storage utilization.
- List of namenodes: for each NN, its BPID, storage utilization, number of missing blocks, number of live and dead DNs, and a link to that NN's Web UI.
- Decommissioning status of DNs.
15. Managing Namespaces: Client-side Mount Table
- Federation has multiple namespaces. Don't you need a single global namespace? The key is to share the data and the names used to access the shared data. A global namespace is one way to do that, but even there we talk of several large "global" namespaces.
- A client-side mount table is another way to share: a shared mount table gives a "global" shared view, while a personalized mount table gives a per-application view.
- Share the data that matters by mounting it, e.g. /tmp, /home, /project, /data (see the sketch below).
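Hadoop's ViewFs implements such a client-side mount table. A minimal sketch of wiring one up programmatically, assuming two federated namenodes; the nn1/nn2 host names and the "clusterX" mount-table name are made-up examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Route clients through ViewFs; "clusterX" names the mount table.
    conf.set("fs.defaultFS", "viewfs://clusterX/");

    // Each link maps a path in the shared view to a namespace volume
    // owned by one of the federated namenodes (host names hypothetical).
    conf.set("fs.viewfs.mounttable.clusterX.link./tmp",
             "hdfs://nn1.example.com:8020/tmp");
    conf.set("fs.viewfs.mounttable.clusterX.link./home",
             "hdfs://nn1.example.com:8020/home");
    conf.set("fs.viewfs.mounttable.clusterX.link./project",
             "hdfs://nn2.example.com:8020/project");
    conf.set("fs.viewfs.mounttable.clusterX.link./data",
             "hdfs://nn2.example.com:8020/data");

    // Clients see one namespace; ViewFs resolves each path to the right NN.
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus st : fs.listStatus(new Path("/project"))) {
      System.out.println(st.getPath());
    }
  }
}
```

A personalized, per-application view is simply a different mount table name (or different link values) in that application's configuration.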
16. Impact On Existing Deployments
- Very little impact on clusters with a single NN: the old configuration runs as is.
- Only two commands change: NN format and the first upgrade take a new ClusterID option.
- During design and implementation, a lot of effort went into ensuring that single-NN deployments work as is, backed by a lot of testing effort to validate it.