Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2

1. High Availability for the HDFS NameNode: Phase 2
Aaron T. Myers and Todd Lipcon | Cloudera HDFS Team
October 2012
2. Introductions / who we are
• Software engineers on Cloudera’s HDFS engineering team
• Committers/PMC Members for Apache Hadoop at ASF
• Main developers on HDFS HA
• Responsible for ~80% of the code for all phases of HA
development
• Have helped numerous customers set up and troubleshoot HA
HDFS clusters this year
3. Outline
• HDFS HA Phase 1
• How did it work? What could it do?
• What problems remained?
• HDFS HA Phase 2: Automatic failover
• HDFS HA Phase 2: Quorum Journal
5. HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623)
• Introduced the StandbyNode, a hot backup for the HDFS
NameNode.
• Relied on shared storage to synchronize namespace state
• (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime
for hardware and software upgrades, planned machine/OS
maintenance, configuration changes, etc.
6. HDFS HA Architecture Phase 1
• Parallel block reports sent to Active and Standby NameNodes
• NameNode state shared by locating edit log on NAS over NFS
• Fencing of shared resources/data
• Critical that only a single NameNode is Active at any point in time
• Client failover done via client configuration
• Each client configured with the address of both NNs: try both to
find active
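To illustrate that client-side configuration, here is a minimal Java sketch (the nameservice ID "mycluster" and the hostnames are made up for this example) using the standard ConfiguredFailoverProxyProvider, which tries both NameNodes and sticks with whichever responds as Active; normally these settings live in hdfs-site.xml rather than code:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // The client addresses the nameservice, not a specific host; the proxy
    // provider tries both NameNodes and uses whichever is currently Active.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.exists(new Path("/")));
  }
}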
8. Fencing and NFS
• Must avoid split-brain syndrome
• Both nodes think they are active and try to write to the same edit log. The
metadata becomes corrupted and requires manual intervention before the NameNode can be restarted
• Configure a fencing script
• Script must ensure that prior active has stopped writing
• STONITH: shoot-the-other-node-in-the-head
• Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• The fencing script must succeed for the failover to be considered successful
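To make "configure a fencing script" concrete, here is a hedged sketch of the relevant settings (the script path and key file are hypothetical; sshfence and shell(...) are the two built-in fencing method types):

import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Methods are tried in order until one reports success; if none succeeds,
    // the failover is aborted.
    conf.set("dfs.ha.fencing.methods",
        "sshfence\n" +                            // SSH to the old Active and kill the NN process
        "shell(/usr/local/bin/fence_filer.sh)");  // hypothetical script, e.g. revoke NFS access on the filer
    conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
  }
}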
9. Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime
• Manual failover only: requires an operator to step in quickly after
a crash
• Various studies indicated that unplanned outages account for a minority of
downtime, but they are still important to address
• Requirement of a NAS device made deployment
complex, expensive, and error-prone
(we always knew this was just the first phase!)
10. HDFS HA Development Phase 2
• Multiple new features for high availability
• Automatic failover, based on Apache ZooKeeper
• Remove dependency on NAS (network-attached storage)
• Address new HA use cases
• Avoid unplanned downtime due to software or hardware faults
• Deploy in filer-less environments
• Completely stand-alone HA with no external hardware or software
dependencies
• no Linux-HA, filers, etc
12. Automatic Failover Goals
• Automatically detect failure of the Active NameNode
• Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
• Once failure is detected, process completes automatically
• Support manually initiated failover as first-class
• Operators can still trigger failover without having to stop Active
• Do not introduce a new SPOF
• All parts of auto-failover deployment must themselves be HA
13. Automatic Failover Architecture
• Automatic failover requires ZooKeeper
• Not required for manual failover
• ZK makes it easy to:
• Detect failure of Active NameNode
• Determine which NameNode should become the Active NN
14. Automatic Failover Architecture
• Introduce new daemon in HDFS: ZooKeeper Failover Controller
• In an auto failover deployment, run two ZKFCs
• One per NameNode, on that NameNode machine
• ZooKeeper Failover Controller (ZKFC) is responsible for:
• Monitoring health of associated NameNode
• Participating in leader election of NameNodes
• Fencing the other NameNode if it wins election
16. ZooKeeper Failover Controller Details
• When a ZKFC is started, it:
• Begins checking the health of its associated NN via RPC
• As long as the associated NN is healthy, attempts to create
an ephemeral znode in ZK
• One of the two ZKFCs will succeed in creating the znode
and transition its associated NN to the Active state
• The other ZKFC transitions its associated NN to the Standby
state and begins monitoring the ephemeral znode
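The core primitive behind this election is ZooKeeper's ephemeral-node creation. A minimal Java sketch of that step follows (the znode path and nameservice name are illustrative; the real ActiveStandbyElector adds watches, retries, and a persistent "breadcrumb" znode used for fencing):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionSketch {
  // Whoever creates the ephemeral lock znode first wins the election.
  static boolean tryBecomeActive(ZooKeeper zk, String myNnId) throws Exception {
    try {
      zk.create("/hadoop-ha/mycluster/ActiveStandbyElectorLock",
          myNnId.getBytes(StandardCharsets.UTF_8),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL);        // vanishes if this ZKFC's ZK session dies
      return true;                      // we won: transition our NN to Active
    } catch (KeeperException.NodeExistsException e) {
      return false;                     // we lost: go Standby and watch the znode
    }
  }
}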
17. What happens when…
• … a NameNode process crashes?
• The associated ZKFC notices the NN's health failure and withdraws from
the active/standby election by removing its ephemeral znode
• … a whole NameNode machine crashes?
• ZKFC process crashes with it and the ephemeral znode is
deleted from ZK
18. What happens when…
• … the two NameNodes are partitioned from each other?
• Nothing happens: Only one will still have the znode
• … ZooKeeper itself crashes (or is taken down for an upgrade)?
• Nothing happens: the active NN stays active
19. Fencing Still Required with ZKFC
• Tempting to think ZooKeeper means no need for fencing
• Consider the following scenario:
• Two NameNodes: A and B, each with associated ZKFC
• ZKFC A process crashes, ephemeral znode removed
• NameNode A process is still running
• ZKFC B notices znode removed
• ZKFC B wants to transition NN B to Active, but without
fencing NN A, both NNs would be active simultaneously
20. Auto-failover recap
• New daemon ZooKeeperFailoverController monitors the
NameNodes
• Automatically triggers fail-overs
• No need for operator intervention
Fencing and the dependency on NFS storage are still a pain
22. Shared Storage in HDFS HA
• The Standby NameNode synchronizes the namespace by
following the Active NameNode’s transaction log
• Each operation (eg mkdir(/foo)) is written to the log by the Active
• The StandbyNode periodically reads all new edits and applies
them to its own metadata structures
• Reliable shared storage is required for correct operation
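A simplified sketch of the tailing loop described above follows; all the types here (SharedEditLog, Op, Namespace) are invented for illustration and are not the actual Hadoop classes, but they show the shape of what the Standby does: periodically pull every edit after the last applied transaction ID and replay it.

import java.util.List;

// Illustrative only: these types stand in for the real edit-log machinery.
interface Namespace { }
interface Op { long txId(); void applyTo(Namespace ns); }
interface SharedEditLog { List<Op> readOpsSince(long lastAppliedTxId); }

class StandbyTailerSketch implements Runnable {
  private final SharedEditLog sharedLog;   // e.g. the NFS directory in Phase 1
  private final Namespace namespace;       // the Standby's in-memory metadata
  private long lastAppliedTxId;

  StandbyTailerSketch(SharedEditLog log, Namespace ns) {
    this.sharedLog = log;
    this.namespace = ns;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      // Read and replay every new edit, e.g. mkdir(/foo), in order.
      for (Op op : sharedLog.readOpsSince(lastAppliedTxId)) {
        op.applyTo(namespace);
        lastAppliedTxId = op.txId();
      }
      try {
        Thread.sleep(2000);                // tailing period
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}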
23. Shared Storage in “Phase 1”
• Operator configures a traditional shared storage device (eg SAN
or NAS)
• Mount the shared storage via NFS on both Active and Standby
NNs
• Active NN writes to a directory on NFS, while Standby reads it
24. Shortcomings of NFS-based approach
• Custom hardware
• Lots of our customers don’t have SAN/NAS available in their datacenter
• Costs money, time and expertise
• Extra “stuff” to monitor outside HDFS
• We just moved the SPOF, didn’t eliminate it!
• Complicated
• Storage fencing, NFS mount options, multipath networking, etc
• Organizationally complicated: dependencies on storage ops team
• NFS issues
• Buggy client implementations, little control over timeout behavior, etc
25. Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS)
• No custom fencing configuration
• Too complicated == too easy to misconfigure
• No SPOFs
• punting to filers isn’t a good option
• need something inherently distributed
26. Secondary Requirements
• Configurable failure tolerance
• Configure N nodes to tolerate (N-1)/2 failures (e.g. 3 nodes tolerate 1, 5 tolerate 2)
• Making N bigger (within reasonable bounds) shouldn’t hurt
performance. Implies:
• Writes done in parallel, not pipelined
• Writes should not wait on slowest replica
• Locate replicas on existing hardware investment (eg share with
JobTracker, NN, SBN)
27. Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
• Same metrics system (“hadoop metrics”)
• Same configuration system (xml)
• Same logging infrastructure (log4j)
• Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature
• (eg Cloudera Manager, Ganglia, etc)
28. Our solution: QuorumJournalManager
• QuorumJournalManager (client)
• Plugs into JournalManager abstraction in NN (instead of existing
FileJournalManager)
• Provides edit log storage abstraction
• JournalNode (server)
• Standalone daemon running on an odd number of nodes
• Provides actual storage of edit logs on local disks
• Could run inside other daemons in the future
29. Architecture
30. Commit protocol
• NameNode accumulates edits locally as they are logged
• On logSync(), sends accumulated batch to all JNs via Hadoop
RPC
• Waits for success ACK from a majority of nodes
• Majority commit means that a single lagging or crashed replica
does not impact NN latency
• Latency @ NN = median(Latency @ JNs)
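A sketch of the majority-ACK idea (not the actual QuorumJournalManager code; JournalNodeClient is an invented interface standing in for the Hadoop RPC proxy): the batch is sent to every JournalNode in parallel, and logSync() returns as soon as a majority has acknowledged, so one slow or dead JN does not block the NameNode.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class QuorumCommitSketch {
  interface JournalNodeClient {
    CompletableFuture<Void> sendEdits(long firstTxId, byte[] batch); // async RPC
  }

  static void logSync(List<JournalNodeClient> jns, long firstTxId, byte[] batch)
      throws Exception {
    int majority = jns.size() / 2 + 1;
    CompletableFuture<Void> done = new CompletableFuture<>();
    AtomicInteger acks = new AtomicInteger();
    for (JournalNodeClient jn : jns) {
      jn.sendEdits(firstTxId, batch).thenRun(() -> {
        if (acks.incrementAndGet() >= majority) {
          done.complete(null);               // quorum reached: edits are durable
        }
      });
    }
    done.get(20, TimeUnit.SECONDS);          // fail the sync if no quorum in time
  }
}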
31. JN Fencing
• How do we prevent split-brain?
• Each instance of QJM is assigned a unique epoch number
• provides a strong ordering between client NNs
• Each IPC contains the client’s epoch
• JN remembers on disk the highest epoch it has seen
• Any request from an earlier epoch is rejected. Any from a newer
one is recorded on disk
• Distributed Systems folks may recognize this technique from
Paxos and other literature
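A sketch of how the epoch check looks on the JournalNode side (illustrative, not the actual Hadoop code): requests from a stale writer are rejected, and newer epochs are persisted before being honored.

class JournalNodeEpochSketch {
  private long promisedEpoch;   // loaded from local disk at startup

  synchronized void checkRequest(long requestEpoch) {
    if (requestEpoch < promisedEpoch) {
      // A NameNode that lost the election is trying to write: fence it out.
      throw new IllegalStateException(
          "Epoch " + requestEpoch + " is older than promised epoch " + promisedEpoch);
    }
    if (requestEpoch > promisedEpoch) {
      persistPromisedEpoch(requestEpoch);   // record the new epoch durably first
      promisedEpoch = requestEpoch;
    }
    // requestEpoch == promisedEpoch: this is the current writer, proceed normally.
  }

  private void persistPromisedEpoch(long epoch) {
    // write to a file on the JN's local disk and fsync (omitted in this sketch)
  }
}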
32. Fencing with epochs
• Fencing is now implicit
• The act of becoming active causes any earlier active NN to be
fenced out
• Since a quorum of nodes has accepted the new active, any other
IPC by an earlier epoch number can’t get quorum
• Eliminates confusing and error-prone custom fencing
configuration
33. Segment recovery
• In normal operation, a minority of JNs may be out of sync
• After a crash, all JNs may have different numbers of txns (last batch
may or may not have arrived at each)
• eg JN1 was down, JN2 crashed right before NN wrote txnid 150:
• JN1: has no edits
• JN2: has edits 101-149
• JN3: has edits 101-150
• Before becoming active, we need to come to consensus on this last
batch: was it committed or not?
• Use the well-known Paxos algorithm to solve consensus
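A greatly simplified outline of that recovery flow (the real implementation is a Paxos-style prepare/accept exchange with durable promises on the JNs; the names and types below are illustrative):

import java.util.Comparator;
import java.util.List;

class SegmentRecoverySketch {
  record SegmentState(long lastWriterEpoch, long firstTxId, long lastTxId) {}
  interface JournalNodeClient {
    SegmentState prepareRecovery(long segmentTxId);           // report local segment state
    void acceptRecovery(SegmentState chosen, String fromUrl); // sync to the chosen state
  }

  static void recover(List<JournalNodeClient> quorum, long segmentTxId, String sourceUrl) {
    // 1. Ask a quorum of JNs what they have for this segment.
    List<SegmentState> states =
        quorum.stream().map(jn -> jn.prepareRecovery(segmentTxId)).toList();
    // 2. Choose the "best" replica: highest writer epoch, then most transactions.
    SegmentState chosen = states.stream()
        .max(Comparator.comparingLong(SegmentState::lastWriterEpoch)
                       .thenComparingLong(SegmentState::lastTxId))
        .orElseThrow();
    // 3. Have every JN in the quorum adopt that segment before the new
    //    Active NameNode starts writing at txid lastTxId + 1.
    for (JournalNodeClient jn : quorum) {
      jn.acceptRecovery(chosen, sourceUrl);
    }
  }
}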
34. Other implementation features
• Hadoop Metrics
• lag, percentile latencies, etc from perspective of JN, NN
• metrics for queued txns, % of time each JN fell behind, etc, to
help suss out a slow JN before it causes problems
• Security
• full Kerberos and SSL support: edits can be optionally encrypted
in-flight, and all access is mutually authenticated
36. Testing
• Randomized fault test
• Runs all communications in a single thread with deterministic
order and fault injections based on a seed
• Caught a number of really subtle bugs along the way
• Run as an MR job: 5000 fault tests in parallel
• Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc
• Commit latency in practice: within same range as local disks
(better than one of two local disks, worse than the other one)
37. Deployment and Configuration
• Most customers running 3 JNs (tolerate 1 failure)
• 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
• Optionally run 2 more (eg on bastion/gateway nodes) to tolerate 2
failures
• Configuration:
• dfs.namenode.shared.edits.dir:
qjournal://nn1.company.com:8485;nn2.company.com:8485;jt.company.com:8485/my-journal
• dfs.journalnode.edits.dir: /data/1/hadoop/journalnode/
• dfs.ha.fencing.methods: shell(/bin/true) (fencing not required!)
38. Status
• Merged into Hadoop development trunk in early October
• Available in CDH4.1
• Deployed at several customer/community sites with good
success so far
• Planned rollout to 20+ production HBase clusters within the
month
40. HA Phase 2 Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache
ZooKeeper
• Stores shared metadata on QuorumJournalManager: a fully
distributed, redundant, low latency journaling system.
• All improvements available now in HDFS trunk and CDH4.1
43. Why not BookKeeper?
• Pipelined commit instead of quorum commit
• Unpredictable latency
• Research project
• Not “Hadoopy”
• Their own IPC system, no security, different configuration, no
metrics
• External
• Feels like “two systems” to ops/deployment instead of just one
• Nevertheless: it’s pluggable and BK is an additional option.
44. Epoch number assignment
• On startup:
• NN -> JN: getEpochInfo()
• JN: respond with current promised epoch
• NN: set epoch = max(promisedEpoch) + 1
• NN -> JN: newEpoch(epoch)
• JN: if it is still higher than promisedEpoch, remember it and
ACK, otherwise NACK
• If NN receives ACK from a quorum of nodes, then it has uniquely
claimed that epoch
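A sketch of that handshake from the NameNode's side (JournalNodeClient and the method names are illustrative, not the exact Hadoop RPC signatures):

import java.util.List;

class EpochNegotiationSketch {
  interface JournalNodeClient {
    long getPromisedEpoch();            // "getEpochInfo()" in the slide
    boolean newEpoch(long epoch);       // true = ACK, false = NACK
  }

  static long claimEpoch(List<JournalNodeClient> jns) {
    // 1. Learn the highest epoch any JN has promised so far.
    long maxPromised =
        jns.stream().mapToLong(JournalNodeClient::getPromisedEpoch).max().orElse(0);
    long myEpoch = maxPromised + 1;
    // 2. Ask every JN to promise this new, higher epoch.
    long acks = jns.stream().filter(jn -> jn.newEpoch(myEpoch)).count();
    // 3. A majority of ACKs means no other writer can ever claim this epoch.
    if (acks < jns.size() / 2 + 1) {
      throw new IllegalStateException("Could not claim epoch " + myEpoch + " from a quorum");
    }
    return myEpoch;
  }
}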