6. HDFS Architecture:
Computation close to the data

[Diagram: a Hadoop cluster stores a large dataset split into Blocks 1–3, replicated across nodes. A MAP task runs on each block where it is stored, and a Reduce task combines the map outputs into the results.]
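The idea of moving computation to the data can be sketched as a toy map/reduce: each map task processes one locally stored block, and a reduce merges the per-block results. This is a minimal illustration of the idea, not the Hadoop MapReduce API.

```python
# Toy sketch of "computation close to the data": each MAP task
# processes the block stored on its node; a Reduce merges the results.

def map_task(block):
    # Count words in one locally stored block.
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_task(partial_counts):
    # Merge the per-block counts into one result.
    total = {}
    for counts in partial_counts:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

blocks = ["data data data", "data map", "reduce"]   # file split into 3 blocks
result = reduce_task(map_task(b) for b in blocks)
print(result["data"])   # 4
```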
10. Scaling Hadoop
Early Gains
• Simple design allowed rapid improvements
• Namespace is all in RAM, simpler locking
• Improved memory usage in 0.16, JVM Heap configuration (Suresh Srinivas)
Growth in the number of files and in storage is limited by the RAM added to the namenode:
• 50G heap = 200M “fs objects” = 100M names + 100M blocks
• 14PB of storage (50MB block size)
• 4K nodes
• Job Tracker carries out both job lifecycle management and scheduling
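The heap and storage figures above can be checked with back-of-the-envelope arithmetic. The 14PB figure works out if one assumes roughly 3x replication; that replication factor is an assumption here, not stated on the slide.

```python
# Back-of-the-envelope check of the namenode heap figures above.
heap_gb = 50
fs_objects = 200_000_000          # 100M names + 100M blocks
bytes_per_object = heap_gb * 1024**3 / fs_objects
print(round(bytes_per_object))    # ~268 bytes of heap per fs object

blocks = 100_000_000
block_size_mb = 50
replication = 3                   # assumed; the slide does not state it
raw_pb = blocks * block_size_mb * replication / 1024**3
print(round(raw_pb, 1))           # ~14 PB of raw storage
```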
Yahoo’s Response:
• HDFS Federation: horizontal scaling of namespace (0.22)
• Next Generation of Map-Reduce - Complete overhaul of job tracker/task tracker
Goal:
• Clusters of 6000 nodes, 100,000 cores & 10k concurrent jobs, 100 PB raw storage per cluster
6 May 2010
11. Scaling the Name Service: Options

[Chart (not to scale): # clients (1x, 4x, 20x, 50x, 100x) vs. # names (100M, 200M, 1B, 2B, 10B, 20B). Options shown: all NS in memory; separate block maps from the NN; partial NS in memory with archives (NS cache); multiple namespace volumes with all NS in memory; partial NS in memory with namespace volumes; distributed NNs, which have good isolation properties. Note: block reports for billions of blocks require rethinking the block layer.]
12. Opportunity:
Vertical & Horizontal scaling

Vertical scaling
• More RAM, efficiency in memory usage
• First-class archives (tar/zip like)
• Partial namespace in main memory

Horizontal scaling: Namenode Federation
• Scale
• Isolation, Stability, Availability
• Flexibility
• Other Namenode implementations or non-HDFS namespaces
13. Block (Object) Storage Subsystem
• Shared storage provided as pools of blocks
• Namespaces (HDFS, others) use one or more block pools
• Note: HDFS has 2 layers today – we are generalizing/extending it.

[Diagram: a namespace layer with NS1 … NSk … foreign NSn, each using block pools 1 … k … n; below it, a block storage layer where the block pools are spread across Datanode 1, Datanode 2, … Datanode m, with a Balancer operating on the block storage layer.]
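The two-layer model can be sketched as a simple data structure: a shared block-storage layer holds named block pools, and each namespace references one or more of them. All class and identifier names here are illustrative, not from the HDFS code base.

```python
# Sketch of the generalized two-layer model: namespaces on top of a
# shared block-storage layer made of block pools. Names are illustrative.

class BlockStorage:
    def __init__(self):
        self.pools = {}                       # pool id -> set of block ids

    def create_pool(self, pool_id):
        self.pools[pool_id] = set()

    def add_block(self, pool_id, block_id):
        self.pools[pool_id].add(block_id)

class Namespace:
    def __init__(self, name, storage, pool_ids):
        self.name = name
        self.storage = storage
        self.pool_ids = list(pool_ids)        # a namespace may use several pools

    def blocks(self):
        return set().union(*(self.storage.pools[p] for p in self.pool_ids))

storage = BlockStorage()                      # shared across all namespaces
storage.create_pool("pool-1")
storage.create_pool("pool-2")
storage.add_block("pool-1", "blk_1001")
storage.add_block("pool-2", "blk_2001")

ns1 = Namespace("NS1", storage, ["pool-1"])
foreign = Namespace("NSn", storage, ["pool-2"])   # e.g. a non-HDFS namespace
print(ns1.blocks())                               # {'blk_1001'}
```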
14. 1st Phase:
B-Pool management inside Namenode

[Diagram: Namenodes NN-1 … NN-k … NN-n, each managing its own namespace (NS1 … NSk … foreign NSn) and its block pools (pools 1 … k … n); the block pools are stored across Datanode 1, Datanode 2, … Datanode m, with a Balancer. Future: move block-pool management into separate nodes.]
15. Future:
Move block management out

[Diagram: the namespaces (NS1 … NSk, foreign NSn) are served by a name server, while a separate Block Manager handles the block pools. The client 1. opens the file at the name server, 2. calls getBlockLocations on the Block Manager, and 3. reads the block from a Datanode (Datanode 1 … Datanode m). The block layer is easier to scale horizontally than the name server.]
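The three-step read above (open, getBlockLocations, readBlock) can be sketched as a client flow against the three separated components. The tables and names below are hypothetical stand-ins for the name server, block manager, and datanodes.

```python
# Sketch of the 3-step read once block management is moved out of the
# namenode: 1) open at the name server, 2) getBlockLocations at the
# block manager, 3) read the block from a datanode. Names are illustrative.

NAME_SERVER = {"/user/a/file": ["blk_1", "blk_2"]}        # path -> block ids
BLOCK_MANAGER = {"blk_1": "datanode-1", "blk_2": "datanode-2"}
DATANODES = {"datanode-1": {"blk_1": b"hello "},
             "datanode-2": {"blk_2": b"world"}}

def read_file(path):
    block_ids = NAME_SERVER[path]                         # 1. open
    data = b""
    for blk in block_ids:
        dn = BLOCK_MANAGER[blk]                           # 2. getBlockLocations
        data += DATANODES[dn][blk]                        # 3. readBlock
    return data

print(read_file("/user/a/file"))    # b'hello world'
```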
16. What is an HDFS Cluster

Current
• HDFS Cluster
  – 1 Namespace
  – A set of blocks
• Implemented as
  – 1 Namenode
  – Set of DNs

New
• HDFS Cluster
  – N Namespaces
  – Set of block pools
    • Each block pool is a set of blocks
    • Phase 1: 1 BP per NS, which implies N block pools
• Implemented as
  – N Namenodes
  – Set of DNs
    • Each DN stores the blocks for each block pool
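The new model above, where every datanode stores blocks for every block pool, can be sketched as follows; the pool and node names are illustrative only.

```python
# Sketch of the new cluster model: N namespaces, one block pool each
# (Phase 1), and every datanode storing blocks for every block pool.

datanodes = {f"dn-{i}": {} for i in range(1, 4)}      # dn -> pool -> blocks

def store_block(dn, pool, block):
    datanodes[dn].setdefault(pool, []).append(block)

# Phase 1: one block pool per namespace (N namespaces => N pools).
for pool in ["BP-ns1", "BP-ns2"]:
    for dn in datanodes:
        store_block(dn, pool, f"blk_{pool}_{dn}")

print(sorted(datanodes["dn-1"]))    # ['BP-ns1', 'BP-ns2']
```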
17. Managing Namespaces
• Federation has multiple namespaces – don’t you need a single global namespace?
  – The key is to share the data and the names used to access the shared data.
• A global namespace is one way to do that – but even there we talk of several large “global” namespaces
• A client-side mount table is another way to share
  – Shared mount table => “global” shared view
  – Personalized mount table => per-application view
• Share the data that matter by mounting it

[Diagram: a client-side mount table rooted at / with entries project, home, tmp, and data, each mounted onto a namespace.]
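A client-side mount table can be sketched as a longest-prefix match from paths to the namespace that serves them. The entries and namespace URIs below are hypothetical, chosen to mirror the / project / home / tmp / data layout in the diagram.

```python
# Sketch of a client-side mount table: resolve a path to the namespace
# that serves it by longest-prefix match. Entries are illustrative.

MOUNT_TABLE = {
    "/project": "hdfs://nn1",
    "/home":    "hdfs://nn2",
    "/tmp":     "hdfs://nn2",
    "/data":    "hdfs://nn3",     # shared data mounted into every view
}

def resolve(path):
    matches = [m for m in MOUNT_TABLE
               if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no mount point for {path}")
    best = max(matches, key=len)                  # longest prefix wins
    return MOUNT_TABLE[best], path[len(best):] or "/"

print(resolve("/data/logs/2010"))   # ('hdfs://nn3', '/logs/2010')
```

A shared copy of `MOUNT_TABLE` gives every client the same “global” view, while a per-application copy gives a personalized view, as the bullets above describe.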
18. HDFS Federation Across Clusters

[Diagram: application mount tables in Cluster 1 and Cluster 2, each rooted at / with entries home, tmp, data, and project; the mount tables map paths across the two clusters.]
19. Nameserver as container for namespaces
• Nameserver as a container for namespaces
• Each namespace with its own separate state
• Persistent state in shared storage (e.g. BookKeeper)
• Each nameserver serves a set of namespaces
  • Selected based on isolation and capacity
• A namespace can be moved between nameservers

[Diagram: multiple nameservers on top of shared persistent storage for namespace metadata (e.g. BookKeeper).]
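Selecting a nameserver for a namespace “based on isolation and capacity” could be sketched as a simple capacity-aware placement policy. This is a toy policy with made-up numbers, not anything specified in the deck.

```python
# Toy sketch of placing namespaces on nameservers by capacity: each
# namespace goes to the nameserver with the most free capacity that
# can hold it. The policy and numbers are illustrative.

servers = {"ns-server-1": 100, "ns-server-2": 100}    # free capacity units
placement = {}

def place(namespace, size):
    fits = [s for s, free in servers.items() if free >= size]
    if not fits:
        raise RuntimeError(f"no nameserver can hold {namespace}")
    server = max(fits, key=lambda s: servers[s])      # most free capacity
    servers[server] -= size
    placement[namespace] = server
    return server

for ns, size in {"NS1": 60, "NS2": 30, "NS3": 50}.items():
    place(ns, size)
print(placement)    # NS1 on ns-server-1; NS2 and NS3 on ns-server-2
```

Because each namespace’s persistent state lives in shared storage, re-running such a policy and re-homing a namespace on a different nameserver is possible, which is what makes namespaces movable.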
20. Summary
Federated HDFS (Jira HDFS-1052)
• Scale by adding independent Namenodes
• Preserves the robustness of the Namenode
• Not much code change to the Namenode
• Generalizes the block storage layer
  • Analogous to SANs & LUNs
• Can add other implementations of the Namenode
  • Even other name services (HBase?)
• Could move block management out of the Namenode in the future
  • But to truly scale to 10s or 100s of billions of blocks we need to rethink the block map and block reports
• Benefits
  • Scale the number of file names and blocks
  • Improved isolation and hence availability