6. HDFS Architecture:
Computation close to the data

[Diagram: a Hadoop cluster stores a large dataset split into Blocks 1–3, replicated across nodes. A MAP task runs on each block where it is stored, and a Reduce task combines the map outputs into the results.]
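The idea of moving computation to the data can be sketched as a toy map/reduce: each map task processes one locally stored block, and a reduce merges the per-block results. This is a minimal illustration of the idea, not the Hadoop MapReduce API.

```python
# Toy sketch of "computation close to the data": each MAP task
# processes the block stored on its node; a Reduce merges the results.

def map_task(block):
    # Count words in one locally stored block.
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_task(partial_counts):
    # Merge the per-block counts into one result.
    total = {}
    for counts in partial_counts:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

blocks = ["data data data", "data map", "reduce"]   # file split into 3 blocks
result = reduce_task(map_task(b) for b in blocks)
print(result["data"])   # 4
```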
10. Scaling Hadoop
Early Gains
• Simple design allowed rapid improvements
• Namespace is all in RAM, simpler locking
• Improved memory usage in 0.16, JVM Heap configuration (Suresh Srinivas)
Growth in the number of files and in storage is limited by the RAM added to the namenode:
• 50G heap = 200M “fs objects” = 100M names + 100M blocks
• 14PB of storage (50MB block size)
• 4K nodes
• Job Tracker carries out both job lifecycle management and scheduling
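The heap and storage figures above can be checked with back-of-the-envelope arithmetic. The 14PB figure works out if one assumes roughly 3x replication; that replication factor is an assumption here, not stated on the slide.

```python
# Back-of-the-envelope check of the namenode heap figures above.
heap_gb = 50
fs_objects = 200_000_000          # 100M names + 100M blocks
bytes_per_object = heap_gb * 1024**3 / fs_objects
print(round(bytes_per_object))    # ~268 bytes of heap per fs object

blocks = 100_000_000
block_size_mb = 50
replication = 3                   # assumed; the slide does not state it
raw_pb = blocks * block_size_mb * replication / 1024**3
print(round(raw_pb, 1))           # ~14 PB of raw storage
```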
Yahoo’s Response:
• HDFS Federation: horizontal scaling of namespace (0.22)
• Next Generation of Map-Reduce - Complete overhaul of job tracker/task tracker
Goal:
• Clusters of 6000 nodes, 100,000 cores & 10k concurrent jobs, 100 PB raw storage per cluster
6 May 2010
11. Scaling the Name Service: Options

[Chart (not to scale): # clients (1x, 4x, 20x, 50x, 100x) vs. # names (100M, 200M, 1B, 2B, 10B, 20B). Options shown: all NS in memory; separate block maps from the NN; partial NS in memory with archives (NS cache); multiple namespace volumes with all NS in memory; partial NS in memory with namespace volumes; distributed NNs, which have good isolation properties. Note: block reports for billions of blocks require rethinking the block layer.]
12. Opportunity:
Vertical & Horizontal scaling

Vertical scaling
• More RAM, efficiency in memory usage
• First-class archives (tar/zip like)
• Partial namespace in main memory

Horizontal scaling: Namenode Federation
• Scale
• Isolation, Stability, Availability
• Flexibility
• Other Namenode implementations or non-HDFS namespaces
13. Block (Object) Storage Subsystem
• Shared storage provided as pools of blocks
• Namespaces (HDFS, others) use one or more block pools
• Note: HDFS has 2 layers today – we are generalizing/extending it.

[Diagram: a namespace layer with NS1 … NSk … foreign NSn, each using block pools 1 … k … n; below it, a block storage layer where the block pools are spread across Datanode 1, Datanode 2, … Datanode m, with a Balancer operating on the block storage layer.]
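The two-layer model can be sketched as a simple data structure: a shared block-storage layer holds named block pools, and each namespace references one or more of them. All class and identifier names here are illustrative, not from the HDFS code base.

```python
# Sketch of the generalized two-layer model: namespaces on top of a
# shared block-storage layer made of block pools. Names are illustrative.

class BlockStorage:
    def __init__(self):
        self.pools = {}                       # pool id -> set of block ids

    def create_pool(self, pool_id):
        self.pools[pool_id] = set()

    def add_block(self, pool_id, block_id):
        self.pools[pool_id].add(block_id)

class Namespace:
    def __init__(self, name, storage, pool_ids):
        self.name = name
        self.storage = storage
        self.pool_ids = list(pool_ids)        # a namespace may use several pools

    def blocks(self):
        return set().union(*(self.storage.pools[p] for p in self.pool_ids))

storage = BlockStorage()                      # shared across all namespaces
storage.create_pool("pool-1")
storage.create_pool("pool-2")
storage.add_block("pool-1", "blk_1001")
storage.add_block("pool-2", "blk_2001")

ns1 = Namespace("NS1", storage, ["pool-1"])
foreign = Namespace("NSn", storage, ["pool-2"])   # e.g. a non-HDFS namespace
print(ns1.blocks())                               # {'blk_1001'}
```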
14. 1st Phase:
B-Pool management inside Namenode

[Diagram: Namenodes NN-1 … NN-k … NN-n, each managing its own namespace (NS1 … NSk … foreign NSn) and its block pools (pools 1 … k … n); the block pools are stored across Datanode 1, Datanode 2, … Datanode m, with a Balancer. Future: move block-pool management into separate nodes.]
15. Future:
Move block management out

[Diagram: the namespaces (NS1 … NSk, foreign NSn) are served by a name server, while a separate Block Manager handles the block pools. The client 1. opens the file at the name server, 2. calls getBlockLocations on the Block Manager, and 3. reads the block from a Datanode (Datanode 1 … Datanode m). The block layer is easier to scale horizontally than the name server.]
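The three-step read above (open, getBlockLocations, readBlock) can be sketched as a client flow against the three separated components. The tables and names below are hypothetical stand-ins for the name server, block manager, and datanodes.

```python
# Sketch of the 3-step read once block management is moved out of the
# namenode: 1) open at the name server, 2) getBlockLocations at the
# block manager, 3) read the block from a datanode. Names are illustrative.

NAME_SERVER = {"/user/a/file": ["blk_1", "blk_2"]}        # path -> block ids
BLOCK_MANAGER = {"blk_1": "datanode-1", "blk_2": "datanode-2"}
DATANODES = {"datanode-1": {"blk_1": b"hello "},
             "datanode-2": {"blk_2": b"world"}}

def read_file(path):
    block_ids = NAME_SERVER[path]                         # 1. open
    data = b""
    for blk in block_ids:
        dn = BLOCK_MANAGER[blk]                           # 2. getBlockLocations
        data += DATANODES[dn][blk]                        # 3. readBlock
    return data

print(read_file("/user/a/file"))    # b'hello world'
```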
16. What is an HDFS Cluster

Current
• HDFS Cluster
  – 1 Namespace
  – A set of blocks
• Implemented as
  – 1 Namenode
  – Set of DNs

New
• HDFS Cluster
  – N Namespaces
  – Set of block pools
    • Each block pool is a set of blocks
    • Phase 1: 1 BP per NS, which implies N block pools
• Implemented as
  – N Namenodes
  – Set of DNs
    • Each DN stores the blocks for each block pool
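The new model above, where every datanode stores blocks for every block pool, can be sketched as follows; the pool and node names are illustrative only.

```python
# Sketch of the new cluster model: N namespaces, one block pool each
# (Phase 1), and every datanode storing blocks for every block pool.

datanodes = {f"dn-{i}": {} for i in range(1, 4)}      # dn -> pool -> blocks

def store_block(dn, pool, block):
    datanodes[dn].setdefault(pool, []).append(block)

# Phase 1: one block pool per namespace (N namespaces => N pools).
for pool in ["BP-ns1", "BP-ns2"]:
    for dn in datanodes:
        store_block(dn, pool, f"blk_{pool}_{dn}")

print(sorted(datanodes["dn-1"]))    # ['BP-ns1', 'BP-ns2']
```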
17. Managing Namespaces
• Federation has multiple namespaces – don’t you need a single global namespace?
  – The key is to share the data and the names used to access the shared data.
• A global namespace is one way to do that – but even there we talk of several large “global” namespaces
• A client-side mount table is another way to share
  – Shared mount table => “global” shared view
  – Personalized mount table => per-application view
• Share the data that matter by mounting it

[Diagram: a client-side mount table rooted at / with entries project, home, tmp, and data, each mounted onto a namespace.]
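A client-side mount table can be sketched as a longest-prefix match from paths to the namespace that serves them. The entries and namespace URIs below are hypothetical, chosen to mirror the / project / home / tmp / data layout in the diagram.

```python
# Sketch of a client-side mount table: resolve a path to the namespace
# that serves it by longest-prefix match. Entries are illustrative.

MOUNT_TABLE = {
    "/project": "hdfs://nn1",
    "/home":    "hdfs://nn2",
    "/tmp":     "hdfs://nn2",
    "/data":    "hdfs://nn3",     # shared data mounted into every view
}

def resolve(path):
    matches = [m for m in MOUNT_TABLE
               if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no mount point for {path}")
    best = max(matches, key=len)                  # longest prefix wins
    return MOUNT_TABLE[best], path[len(best):] or "/"

print(resolve("/data/logs/2010"))   # ('hdfs://nn3', '/logs/2010')
```

A shared copy of `MOUNT_TABLE` gives every client the same “global” view, while a per-application copy gives a personalized view, as the bullets above describe.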
18. HDFS Federation Across Clusters

[Diagram: application mount tables in Cluster 1 and Cluster 2, each rooted at / with entries home, tmp, data, and project; the mount tables map paths across the two clusters.]
19. Nameserver as container for namespaces
• Nameserver as a container for namespaces
• Each namespace with its own separate state
• Persistent state in shared storage (e.g. BookKeeper)
• Each nameserver serves a set of namespaces
  • Selected based on isolation and capacity
• A namespace can be moved between nameservers

[Diagram: multiple nameservers on top of shared persistent storage for namespace metadata (e.g. BookKeeper).]
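Selecting a nameserver for a namespace “based on isolation and capacity” could be sketched as a simple capacity-aware placement policy. This is a toy policy with made-up numbers, not anything specified in the deck.

```python
# Toy sketch of placing namespaces on nameservers by capacity: each
# namespace goes to the nameserver with the most free capacity that
# can hold it. The policy and numbers are illustrative.

servers = {"ns-server-1": 100, "ns-server-2": 100}    # free capacity units
placement = {}

def place(namespace, size):
    fits = [s for s, free in servers.items() if free >= size]
    if not fits:
        raise RuntimeError(f"no nameserver can hold {namespace}")
    server = max(fits, key=lambda s: servers[s])      # most free capacity
    servers[server] -= size
    placement[namespace] = server
    return server

for ns, size in {"NS1": 60, "NS2": 30, "NS3": 50}.items():
    place(ns, size)
print(placement)    # NS1 on ns-server-1; NS2 and NS3 on ns-server-2
```

Because each namespace’s persistent state lives in shared storage, re-running such a policy and re-homing a namespace on a different nameserver is possible, which is what makes namespaces movable.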
20. Summary
Federated HDFS (Jira HDFS-1052)
• Scale by adding independent Namenodes
• Preserves the robustness of the Namenode
• Not much code change to the Namenode
• Generalizes the block storage layer
  • Analogous to SANs & LUNs
• Can add other implementations of the Namenode
  • Even other name services (HBase?)
• Could move block management out of the Namenode in the future
  • But to truly scale to 10s or 100s of billions of blocks we need to rethink the block map and block reports
• Benefits
  • Scale the number of file names and blocks
  • Improved isolation and hence availability