Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2

1. High Availability for the HDFS NameNode: Phase 2
Aaron T. Myers and Todd Lipcon | Cloudera HDFS Team
October 2012
2. Introductions / who we are
• Software engineers on Cloudera’s HDFS engineering team
• Committers/PMC Members for Apache Hadoop at ASF
• Main developers on HDFS HA
• Responsible for ~80% of the code for all phases of HA
development
• Have helped numerous customers set up and troubleshoot HA
HDFS clusters this year
3. Outline
• HDFS HA Phase 1
• How did it work? What could it do?
• What problems remained?
• HDFS HA Phase 2: Automatic failover
• HDFS HA Phase 2: Quorum Journal
5. HDFS HA Development Phase 1
• Completed March 2012 (HDFS-1623)
• Introduced the StandbyNode, a hot backup for the HDFS
NameNode.
• Relied on shared storage to synchronize namespace state
• (e.g. a NAS filer appliance)
• Allowed operators to manually trigger failover to the Standby
• Sufficient for many HA use cases: avoided planned downtime
for hardware and software upgrades, planned machine/OS
maintenance, configuration changes, etc.
6. HDFS HA Architecture Phase 1
• Parallel block reports sent to Active and Standby NameNodes
• NameNode state shared by locating edit log on NAS over NFS
• Fencing of shared resources/data
• Critical that only a single NameNode is Active at any point in time
• Client failover done via client configuration
• Each client configured with the address of both NNs: try both to
find active
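To illustrate that client-side configuration, here is a minimal Java sketch (the nameservice ID "mycluster" and the hostnames are made up for this example) using the standard ConfiguredFailoverProxyProvider, which tries both NameNodes and sticks with whichever responds as Active; normally these settings live in hdfs-site.xml rather than code:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // The client addresses the nameservice, not a specific host; the proxy
    // provider tries both NameNodes and uses whichever is currently Active.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.exists(new Path("/")));
  }
}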
8. Fencing and NFS
• Must avoid split-brain syndrome
• Both nodes think they are active and try to write to the same edit log. The
metadata becomes corrupted and requires manual intervention before the NameNode can be restarted
• Configure a fencing script
• Script must ensure that prior active has stopped writing
• STONITH: shoot-the-other-node-in-the-head
• Storage fencing: e.g. using the NetApp ONTAP API to restrict filer access
• The fencing script must succeed for the failover to be considered successful
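To make "configure a fencing script" concrete, here is a hedged sketch of the relevant settings (the script path and key file are hypothetical; sshfence and shell(...) are the two built-in fencing method types):

import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Methods are tried in order until one reports success; if none succeeds,
    // the failover is aborted.
    conf.set("dfs.ha.fencing.methods",
        "sshfence\n" +                            // SSH to the old Active and kill the NN process
        "shell(/usr/local/bin/fence_filer.sh)");  // hypothetical script, e.g. revoke NFS access on the filer
    conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
  }
}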
9. Shortcomings of Phase 1
• Insufficient to protect against unplanned downtime
• Manual failover only: requires an operator to step in quickly after
a crash
• Various studies indicated that unplanned outages account for a minority of
downtime, but they are still important to address
• Requirement of a NAS device made deployment
complex, expensive, and error-prone
(we always knew this was just the first phase!)
10. HDFS HA Development Phase 2
• Multiple new features for high availability
• Automatic failover, based on Apache ZooKeeper
• Remove dependency on NAS (network-attached storage)
• Address new HA use cases
• Avoid unplanned downtime due to software or hardware faults
• Deploy in filer-less environments
• Completely stand-alone HA with no external hardware or software
dependencies
• no Linux-HA, filers, etc
12. Automatic Failover Goals
• Automatically detect failure of the Active NameNode
• Hardware, software, network, etc.
• Do not require operator intervention to initiate failover
• Once failure is detected, process completes automatically
• Support manually initiated failover as first-class
• Operators can still trigger failover without having to stop Active
• Do not introduce a new SPOF
• All parts of auto-failover deployment must themselves be HA
13. Automatic Failover Architecture
• Automatic failover requires ZooKeeper
• Not required for manual failover
• ZK makes it easy to:
• Detect failure of Active NameNode
• Determine which NameNode should become the Active NN
14. Automatic Failover Architecture
• Introduce new daemon in HDFS: ZooKeeper Failover Controller
• In an auto failover deployment, run two ZKFCs
• One per NameNode, on that NameNode machine
• ZooKeeper Failover Controller (ZKFC) is responsible for:
• Monitoring health of associated NameNode
• Participating in leader election of NameNodes
• Fencing the other NameNode if it wins election
16. ZooKeeper Failover Controller Details
• When a ZKFC is started, it:
• Begins checking the health of its associated NN via RPC
• As long as the associated NN is healthy, attempts to create
an ephemeral znode in ZK
• One of the two ZKFCs will succeed in creating the znode
and transition its associated NN to the Active state
• The other ZKFC transitions its associated NN to the Standby
state and begins monitoring the ephemeral znode
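The core primitive behind this election is ZooKeeper's ephemeral-node creation. A minimal Java sketch of that step follows (the znode path and nameservice name are illustrative; the real ActiveStandbyElector adds watches, retries, and a persistent "breadcrumb" znode used for fencing):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionSketch {
  // Whoever creates the ephemeral lock znode first wins the election.
  static boolean tryBecomeActive(ZooKeeper zk, String myNnId) throws Exception {
    try {
      zk.create("/hadoop-ha/mycluster/ActiveStandbyElectorLock",
          myNnId.getBytes(StandardCharsets.UTF_8),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL);        // vanishes if this ZKFC's ZK session dies
      return true;                      // we won: transition our NN to Active
    } catch (KeeperException.NodeExistsException e) {
      return false;                     // we lost: go Standby and watch the znode
    }
  }
}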
17. What happens when…
• … a NameNode process crashes?
• The associated ZKFC notices the NN's health failure and withdraws from
the active/standby election by removing its ephemeral znode
• … a whole NameNode machine crashes?
• ZKFC process crashes with it and the ephemeral znode is
deleted from ZK
18. What happens when…
• … the two NameNodes are partitioned from each other?
• Nothing happens: Only one will still have the znode
• … ZooKeeper itself crashes (or is taken down for an upgrade)?
• Nothing happens: the active NN stays active
19. Fencing Still Required with ZKFC
• Tempting to think ZooKeeper means no need for fencing
• Consider the following scenario:
• Two NameNodes: A and B, each with associated ZKFC
• ZKFC A process crashes, ephemeral znode removed
• NameNode A process is still running
• ZKFC B notices znode removed
• ZKFC B wants to transition NN B to Active, but without
fencing NN A, both NNs would be active simultaneously
20. Auto-failover recap
• New daemon ZooKeeperFailoverController monitors the
NameNodes
• Automatically triggers fail-overs
• No need for operator intervention
Fencing and the dependency on NFS storage are still a pain
22. Shared Storage in HDFS HA
• The Standby NameNode synchronizes the namespace by
following the Active NameNode’s transaction log
• Each operation (eg mkdir(/foo)) is written to the log by the Active
• The StandbyNode periodically reads all new edits and applies
them to its own metadata structures
• Reliable shared storage is required for correct operation
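A simplified sketch of the tailing loop described above follows; all the types here (SharedEditLog, Op, Namespace) are invented for illustration and are not the actual Hadoop classes, but they show the shape of what the Standby does: periodically pull every edit after the last applied transaction ID and replay it.

import java.util.List;

// Illustrative only: these types stand in for the real edit-log machinery.
interface Namespace { }
interface Op { long txId(); void applyTo(Namespace ns); }
interface SharedEditLog { List<Op> readOpsSince(long lastAppliedTxId); }

class StandbyTailerSketch implements Runnable {
  private final SharedEditLog sharedLog;   // e.g. the NFS directory in Phase 1
  private final Namespace namespace;       // the Standby's in-memory metadata
  private long lastAppliedTxId;

  StandbyTailerSketch(SharedEditLog log, Namespace ns) {
    this.sharedLog = log;
    this.namespace = ns;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      // Read and replay every new edit, e.g. mkdir(/foo), in order.
      for (Op op : sharedLog.readOpsSince(lastAppliedTxId)) {
        op.applyTo(namespace);
        lastAppliedTxId = op.txId();
      }
      try {
        Thread.sleep(2000);                // tailing period
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}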
23. Shared Storage in “Phase 1”
• Operator configures a traditional shared storage device (eg SAN
or NAS)
• Mount the shared storage via NFS on both Active and Standby
NNs
• Active NN writes to a directory on NFS, while Standby reads it
24. Shortcomings of NFS-based approach
• Custom hardware
• Lots of our customers don’t have SAN/NAS available in their datacenter
• Costs money, time and expertise
• Extra “stuff” to monitor outside HDFS
• We just moved the SPOF, didn’t eliminate it!
• Complicated
• Storage fencing, NFS mount options, multipath networking, etc
• Organizationally complicated: dependencies on storage ops team
• NFS issues
• Buggy client implementations, little control over timeout behavior, etc
25. Primary Requirements for Improved Storage
• No special hardware (PDUs, NAS)
• No custom fencing configuration
• Too complicated == too easy to misconfigure
• No SPOFs
• punting to filers isn’t a good option
• need something inherently distributed
26. Secondary Requirements
• Configurable failure tolerance
• Configure N nodes to tolerate (N-1)/2 failures (e.g. 3 nodes tolerate 1, 5 tolerate 2)
• Making N bigger (within reasonable bounds) shouldn’t hurt
performance. Implies:
• Writes done in parallel, not pipelined
• Writes should not wait on slowest replica
• Locate replicas on existing hardware investment (eg share with
JobTracker, NN, SBN)
27. Operational Requirements
• Should be operable by existing Hadoop admins. Implies:
• Same metrics system (“hadoop metrics”)
• Same configuration system (xml)
• Same logging infrastructure (log4j)
• Same security system (Kerberos-based)
• Allow existing ops to easily deploy and manage the new feature
• Allow existing Hadoop tools to monitor the feature
• (eg Cloudera Manager, Ganglia, etc)
28. Our solution: QuorumJournalManager
• QuorumJournalManager (client)
• Plugs into JournalManager abstraction in NN (instead of existing
FileJournalManager)
• Provides edit log storage abstraction
• JournalNode (server)
• Standalone daemon running on an odd number of nodes
• Provides actual storage of edit logs on local disks
• Could run inside other daemons in the future
29. Architecture
30. Commit protocol
• NameNode accumulates edits locally as they are logged
• On logSync(), sends accumulated batch to all JNs via Hadoop
RPC
• Waits for success ACK from a majority of nodes
• Majority commit means that a single lagging or crashed replica
does not impact NN latency
• Latency @ NN = median(Latency @ JNs)
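A sketch of the majority-ACK idea (not the actual QuorumJournalManager code; JournalNodeClient is an invented interface standing in for the Hadoop RPC proxy): the batch is sent to every JournalNode in parallel, and logSync() returns as soon as a majority has acknowledged, so one slow or dead JN does not block the NameNode.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class QuorumCommitSketch {
  interface JournalNodeClient {
    CompletableFuture<Void> sendEdits(long firstTxId, byte[] batch); // async RPC
  }

  static void logSync(List<JournalNodeClient> jns, long firstTxId, byte[] batch)
      throws Exception {
    int majority = jns.size() / 2 + 1;
    CompletableFuture<Void> done = new CompletableFuture<>();
    AtomicInteger acks = new AtomicInteger();
    for (JournalNodeClient jn : jns) {
      jn.sendEdits(firstTxId, batch).thenRun(() -> {
        if (acks.incrementAndGet() >= majority) {
          done.complete(null);               // quorum reached: edits are durable
        }
      });
    }
    done.get(20, TimeUnit.SECONDS);          // fail the sync if no quorum in time
  }
}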
31. JN Fencing
• How do we prevent split-brain?
• Each instance of QJM is assigned a unique epoch number
• provides a strong ordering between client NNs
• Each IPC contains the client’s epoch
• JN remembers on disk the highest epoch it has seen
• Any request from an earlier epoch is rejected. Any from a newer
one is recorded on disk
• Distributed Systems folks may recognize this technique from
Paxos and other literature
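A sketch of how the epoch check looks on the JournalNode side (illustrative, not the actual Hadoop code): requests from a stale writer are rejected, and newer epochs are persisted before being honored.

class JournalNodeEpochSketch {
  private long promisedEpoch;   // loaded from local disk at startup

  synchronized void checkRequest(long requestEpoch) {
    if (requestEpoch < promisedEpoch) {
      // A NameNode that lost the election is trying to write: fence it out.
      throw new IllegalStateException(
          "Epoch " + requestEpoch + " is older than promised epoch " + promisedEpoch);
    }
    if (requestEpoch > promisedEpoch) {
      persistPromisedEpoch(requestEpoch);   // record the new epoch durably first
      promisedEpoch = requestEpoch;
    }
    // requestEpoch == promisedEpoch: this is the current writer, proceed normally.
  }

  private void persistPromisedEpoch(long epoch) {
    // write to a file on the JN's local disk and fsync (omitted in this sketch)
  }
}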
32. Fencing with epochs
• Fencing is now implicit
• The act of becoming active causes any earlier active NN to be
fenced out
• Since a quorum of nodes has accepted the new active, any other
IPC by an earlier epoch number can’t get quorum
• Eliminates confusing and error-prone custom fencing
configuration
33. Segment recovery
• In normal operation, a minority of JNs may be out of sync
• After a crash, all JNs may have different numbers of txns (last batch
may or may not have arrived at each)
• eg JN1 was down, JN2 crashed right before NN wrote txnid 150:
• JN1: has no edits
• JN2: has edits 101-149
• JN3: has edits 101-150
• Before becoming active, we need to come to consensus on this last
batch: was it committed or not?
• Use the well-known Paxos algorithm to solve consensus
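A greatly simplified outline of that recovery flow (the real implementation is a Paxos-style prepare/accept exchange with durable promises on the JNs; the names and types below are illustrative):

import java.util.Comparator;
import java.util.List;

class SegmentRecoverySketch {
  record SegmentState(long lastWriterEpoch, long firstTxId, long lastTxId) {}
  interface JournalNodeClient {
    SegmentState prepareRecovery(long segmentTxId);           // report local segment state
    void acceptRecovery(SegmentState chosen, String fromUrl); // sync to the chosen state
  }

  static void recover(List<JournalNodeClient> quorum, long segmentTxId, String sourceUrl) {
    // 1. Ask a quorum of JNs what they have for this segment.
    List<SegmentState> states =
        quorum.stream().map(jn -> jn.prepareRecovery(segmentTxId)).toList();
    // 2. Choose the "best" replica: highest writer epoch, then most transactions.
    SegmentState chosen = states.stream()
        .max(Comparator.comparingLong(SegmentState::lastWriterEpoch)
                       .thenComparingLong(SegmentState::lastTxId))
        .orElseThrow();
    // 3. Have every JN in the quorum adopt that segment before the new
    //    Active NameNode starts writing at txid lastTxId + 1.
    for (JournalNodeClient jn : quorum) {
      jn.acceptRecovery(chosen, sourceUrl);
    }
  }
}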
34. Other implementation features
• Hadoop Metrics
• lag, percentile latencies, etc from perspective of JN, NN
• metrics for queued txns, % of time each JN fell behind, etc, to
help suss out a slow JN before it causes problems
• Security
• full Kerberos and SSL support: edits can be optionally encrypted
in-flight, and all access is mutually authenticated
36. Testing
• Randomized fault test
• Runs all communications in a single thread with deterministic
order and fault injections based on a seed
• Caught a number of really subtle bugs along the way
• Run as an MR job: 5000 fault tests in parallel
• Multiple CPU-years of stress testing: found 2 bugs in Jetty!
• Cluster testing: 100-node, MR, HBase, Hive, etc
• Commit latency in practice: within same range as local disks
(better than one of two local disks, worse than the other one)
37. Deployment and Configuration
• Most customers running 3 JNs (tolerate 1 failure)
• 1 on NN, 1 on SBN, 1 on JobTracker/ResourceManager
• Optionally run 2 more (eg on bastion/gateway nodes) to tolerate 2
failures
• Configuration:
• dfs.namenode.shared.edits.dir:
qjournal://nn1.company.com:8485;nn2.company.com:8485;jt.company.com:8485/my-journal
• dfs.journalnode.edits.dir: /data/1/hadoop/journalnode/
• dfs.ha.fencing.methods: shell(/bin/true) (fencing not required!)
38. Status
• Merged into Hadoop development trunk in early October
• Available in CDH4.1
• Deployed at several customer/community sites with good
success so far
• Planned rollout to 20+ production HBase clusters within the
month
40. HA Phase 2 Improvements
• Run an active NameNode and a hot Standby NameNode
• Automatically triggers seamless failover using Apache
ZooKeeper
• Stores shared metadata on QuorumJournalManager: a fully
distributed, redundant, low latency journaling system.
• All improvements available now in HDFS trunk and CDH4.1
43. Why not BookKeeper?
• Pipelined commit instead of quorum commit
• Unpredictable latency
• Research project
• Not “Hadoopy”
• Their own IPC system, no security, different configuration, no
metrics
• External
• Feels like “two systems” to ops/deployment instead of just one
• Nevertheless: it’s pluggable and BK is an additional option.
44. Epoch number assignment
• On startup:
• NN -> JN: getEpochInfo()
• JN: respond with current promised epoch
• NN: set epoch = max(promisedEpoch) + 1
• NN -> JN: newEpoch(epoch)
• JN: if it is still higher than promisedEpoch, remember it and
ACK, otherwise NACK
• If NN receives ACK from a quorum of nodes, then it has uniquely
claimed that epoch
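A sketch of that handshake from the NameNode's side (JournalNodeClient and the method names are illustrative, not the exact Hadoop RPC signatures):

import java.util.List;

class EpochNegotiationSketch {
  interface JournalNodeClient {
    long getPromisedEpoch();            // "getEpochInfo()" in the slide
    boolean newEpoch(long epoch);       // true = ACK, false = NACK
  }

  static long claimEpoch(List<JournalNodeClient> jns) {
    // 1. Learn the highest epoch any JN has promised so far.
    long maxPromised =
        jns.stream().mapToLong(JournalNodeClient::getPromisedEpoch).max().orElse(0);
    long myEpoch = maxPromised + 1;
    // 2. Ask every JN to promise this new, higher epoch.
    long acks = jns.stream().filter(jn -> jn.newEpoch(myEpoch)).count();
    // 3. A majority of ACKs means no other writer can ever claim this epoch.
    if (acks < jns.size() / 2 + 1) {
      throw new IllegalStateException("Could not claim epoch " + myEpoch + " from a quorum");
    }
    return myEpoch;
  }
}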