Apache HBase is a rapidly evolving, random-access, distributed data store built on top of Apache Hadoop's HDFS and Apache ZooKeeper. Drawing from real-world support experience, this talk gives administrators insight into improving HBase's availability and recovering when HBase is not available. We share tips on the common root causes of unavailability, explain how to diagnose them, and prescribe measures for ensuring maximum availability of an HBase cluster. We discuss new features that improve recovery time, such as distributed log splitting, as well as supportability improvements. We also describe utilities we have developed and contributed, including new failure recovery tools that can diagnose and repair rare corruption problems on live HBase systems.
Hadoop Summit 2012 | Improving HBase Availability and Repair
1. Improving HBase Availability and Repair
Jeff Bean, Jonathan Hsieh
{jwfbean,jon}@cloudera.com
6/13/12
2. Who Are We?
• Jeff Bean
• Designated Support Engineer, Cloudera
• Education Program Lead, Cloudera
• Jonathan Hsieh
• Software Engineer, Cloudera
• Apache HBase Committer and PMC member
3. What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
4. Fault Tolerance vs. High Availability
• Fault tolerant: the ability to recover service if a component fails, without losing data.
• Highly available: the ability to quickly recover service if a component fails, without losing data.
• Goal: minimize downtime!
5. HBase Architecture
• HBase is designed to be fault tolerant and highly available
• It depends on other systems to be as well
• Replication for fault tolerance:
  • Serve regions from any RegionServer
  • Failover HMasters
  • ZK quorums
  • HDFS block replication on DataNodes
• But replication doesn't guarantee high availability
  • There can still be software or human faults
[Diagram: applications and MapReduce on top of HBase, which depends on ZooKeeper and HDFS]
6. Causes of HBase Downtime
• Unplanned maintenance:
  • Hardware failures
  • Software errors
  • Human error
• Planned maintenance:
  • Upgrades
  • Migrations
[Chart: HBase downtime distribution, planned vs. unplanned]
7. Causes of Unexpected Maintenance Incidents
• Misconfiguration
• Metadata corruptions
• Network / HW problems
• SW problems
• Long recovery time (automated and manual)
[Chart: unplanned maintenance, root cause from Cloudera Support]
  • Misconfiguration: 44%
  • Repair needed (HBase, ZK, MR, HDFS): 28%
  • Fix HW/NW: 16%
  • Patch required: 12%
Source: Cloudera's production HBase support tickets (CDH3's HBase 0.90.x, Hadoop 0.20.x/1.0.x)
8. Outline
• Where we were
• HBase 0.90.x + Hadoop 0.20.x/1.0.x
• Case Studies
• Where we are today
• HBase 0.92.x/0.94.x + Hadoop 2.0.x
• Feature Summary
• Where we are going
• HBase 0.96.x + Hadoop 2.x
• Feature Preview
9. [T]here are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we know
there are some things we do not know.
But there are also unknown unknowns – there are things we do not
know we don't know.
—United States Secretary of Defense Donald Rumsfeld
WHERE WE WERE:
CASE STUDIES
10. Best Practices to avoid hazards
Best practices can prevent HBase misconfigurations, the largest slice of unplanned maintenance (44% of tickets).
[Chart: unplanned maintenance root causes from Cloudera Support, as on slide 7: misconfiguration 44%, repair needed 28%, fix HW/NW 16%, patch required 12%]
11. Case #1: Memory Over-subscription Hazard
Misconfig:
• Too many MR slots
• MR slots too large
• "Arbitrary" processes
Node A swaps and comes under load; processes pause or become unresponsive; node B can't connect to node A.
Bad outcome:
• MapReduce tasks fail
• HDFS datanode operations time out
• HBase client operations fail
Masters take action:
• JobTracker blacklists the TaskTracker
• NameNode re-replicates blocks from node A
• Jobs fail or run slow
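A back-of-envelope RAM budget makes the hazard concrete. This is a minimal sketch with illustrative numbers, not figures from the talk; the Hadoop 0.20/1.0 property names are real, the values are assumptions:

```
# Hypothetical worker node with 48 GB RAM:
#   RegionServer heap          12 GB
#   DataNode + TaskTracker      2 GB
#   OS and page cache           4 GB
#   => roughly 30 GB left for MapReduce child tasks
# With mapred.child.java.opts = -Xmx2g, cap slots near 14 total, e.g. in
# mapred-site.xml:
#   mapred.tasktracker.map.tasks.maximum    = 10
#   mapred.tasktracker.reduce.tasks.maximum = 4
# Oversubscribing (say 24 slots x 2 GB on the same node) pushes the node
# into swap, which is exactly the Case #1 failure chain above.
```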
12. Case #2, #3: Hazards of Abusing HDFS and ZK
Millions of HDFS files (bad practice):
• 500,000 blocks per datanode
• SW bug: heartbeat thread blocks on IO
• RS cannot access HDFS
• Bad outcome: HBase goes down
Millions of ZK znodes (misconfiguration):
• Millions of ZK znodes, 400MB snapshot
• SW bug: ZK fails to create new snapshots, fails
• Bad outcome: HBase goes down
• Worse outcome: HBase fails to restart
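Both hazards can be watched for before they take HBase down. A hedged sketch of the kinds of guardrails involved; the property names are real, but verify the defaults for your versions:

```
# ZooKeeper: znode payloads are bounded by the jute.maxbuffer system
# property (about 1 MB by default); ZK 3.4+ can purge old snapshots itself:
#   autopurge.snapRetainCount=3
#   autopurge.purgeInterval=1        # hours
# HDFS: keep an eye on total block counts before they swamp DataNode
# heartbeats, for example:
hadoop fsck / | grep 'Total blocks'
```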
13. Case #4: Splitting Corruption from HW failure
• HW failure / network failure (takes out NN), plus a SW bug
• Region split attempts left incomplete
• HBase has region inconsistencies (overlaps / holes)
• Recovery: manual, slow, and requires an expert; multiple 6-hour manual repair sessions
14. Case #5: Slow recovery from HW failure
• Network/HW failure, plus human error and a SW error
• RS loses HDFS, WAL
• On restart, -ROOT- and .META. assignment fails
• Manual repairs
• 9-hour hlog splitting recovery
• Correct, but slow!
15. Initial Lessons
• Use best practices to avoid problems
  • Be conservative first
  • Avoid unstable features
• What can we do?
  • Fix the bugs
  • Recover from problems faster
  • Make people smarter, to avoid hazards and misconfigurations
  • Make software smarter, to prevent hazards and misconfigurations
16. In war, then, let your great object be
victory, not lengthy campaigns.
-- Sun Tzu
WHERE WE ARE TODAY
HBASE 0.92.X + HADOOP 2.0.X
17. Goal: Reduce unexpected downtime by recovering faster
• Removing the SPOFs
  • HA HDFS
• Faster recovery
  • Improved hbck
  • Distributed log splitting
18. Problem: HDFS NN goes down under HBase
• HBase depends on HDFS
• If HDFS is down, HBase goes down
• Ramifications:
  • Forces the recovery mechanism
  • Caused some data corruptions
• Ideally we avoid having to do recovery at all
[Diagram: App and MR on top of HBase, which depends on ZK and HDFS]
19. HBase-HDFS HA Nodes
[Diagram: a NameNode (active, metadata server) paired with a NameNode (standby) for active-standby hot failover; an HMaster (region metadata) paired with a hot-standby HMaster; a ZooKeeper quorum; HDFS DataNodes and HBase RegionServers]
20. HBase-HDFS HA Nodes: Transparent to HBase
[Same diagram: NameNode failover happens beneath HBase; the HMasters, ZooKeeper quorum, DataNodes, and RegionServers are unaffected]
21. HBase-HDFS HA Nodes: No more SPOF
[Same diagram: an active NameNode and an active HMaster, each with failover, so neither layer is a single point of failure]
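Failover is transparent to HBase because clients address a logical nameservice rather than a NameNode host. A minimal sketch of the Hadoop 2.0.x client-side wiring, assuming a nameservice named mycluster (the property names are real; the hosts and values are illustrative):

```
# hdfs-site.xml
#   dfs.nameservices                        = mycluster
#   dfs.ha.namenodes.mycluster              = nn1,nn2
#   dfs.namenode.rpc-address.mycluster.nn1  = nn1-host:8020
#   dfs.namenode.rpc-address.mycluster.nn2  = nn2-host:8020
# hbase-site.xml: point HBase at the nameservice, not a host
#   hbase.rootdir = hdfs://mycluster/hbase
```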
22. Recovery operations
• If a network switch fails or there is a power outage, HBase, ZK, and HA HDFS will all fail
• We will always rely on recovery mechanisms
• Need to be able to recover quickly:
  • Metadata invariants, to fix metadata corruptions
  • Data consistency, to restore ACID guarantees
23. HBase Metadata Corruptions
• Internal HBase metadata corruptions:
  • Prevent HBase from starting
  • Cause some regions to be unavailable
• Repairs are intricate and can cause extended periods of downtime
[Chart: unplanned maintenance root causes, as on slide 7; repairs account for 28% of tickets]
24. HBase Metadata Invariants
Table integrity: every key shall get assigned to a single region.
[Diagram: the keyspace tiled by regions [' ',A), [A,B), [B,C), [C,D), [D,E), [E,F), [F,G), [G,' ')]
Region consistency: metadata about regions should agree in HDFS, in .META., and in region server assignments.
[Diagram: a good region has a regioninfo entry in .META., is assigned to an RS, and has a .regioninfo file in HDFS]
25. Detecting and Repairing corruption with hbck
• HBase 0.90 hbck:
  • Checks an HBase instance's internal invariants
• HBase hbck today:
  • Checks, and can fix, problems in an HBase instance's internal invariants
  • In 0.90.7, 0.92.2, 0.94.0
  • In CDH3u4, CDH4
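As a sketch of how the tool is driven: the check-only form is safe to run any time, while the repair flags below ship with the newer hbck and should be checked against your version's usage text:

```
# Read-only: verify table integrity and region consistency
hbase hbck

# HBase 0.90's hbck could only attempt generic fixes:
hbase hbck -fix

# The reworked hbck (0.90.7 / 0.92.2 / 0.94.0) adds targeted repairs:
hbase hbck -fixAssignments    # repair unassigned or doubly-assigned regions
hbase hbck -fixMeta           # reconcile .META. with what is in HDFS
hbase hbck -repair            # shorthand that enables all repair options
```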
26. Case #4 redux: Splitting Corruption
• HW failure / network failure (takes out NN), plus a SW bug
• Region split attempts left incomplete
• HBase has region inconsistencies (overlaps / holes)
• Recovery: manual, slow, and requires an expert; multiple 6-hour manual repair sessions
27. Case #4 redux: Splitting Corruption
• Same failure chain, but recovery is now an automated repair tool (minutes)
• Fixes are quicker, and an operator can run them
28. Case #4 redux: Splitting Corruption
• With the SW bug fixed, only minor HBase inconsistencies remain (bad assignments)
• Recovery: automated repair tool (seconds)
29. Data Consistency
• When a region server goes down, it tries to flush data in
memory to HDFS.
• If it cannot write to HDFS, it relies on the WAL/HLog.
• Recovery via the HLog is vital to prevent data loss
• Understand the write path.
• Recovery: HLog splitting.
• Faster Recovery: Distributed HLog splitting.
30. Write Path (Put / Delete / Increment)
[Diagram: an HBase client sends a Put to a RegionServer; the server appends the edit to its HLog, then applies it to the target HRegion's MemStore, which is backed by HStores]
31. Write Path (Put / Delete / Increment)
[Same diagram, with Puts landing in two different HRegions]
Note: both regions write to the same HLog.
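From the client's side the write path is invisible: a put returns once the edit is in the HLog and the MemStore. A minimal sketch in the HBase shell (the table, row, and values are hypothetical):

```
hbase> put 'usertable', 'row1', 'cf:q1', 'value1'
```

Because durability comes from the HLog append rather than from an immediate HDFS flush of the MemStore, recovering a dead server's HLog is what prevents data loss.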
32. Log Splitting
[Diagram: HMaster above three RegionServers; each RegionServer writes one HLog (HLog1, HLog2, HLog3) shared by its HRegions and their MemStores]
33. Log Splitting
[Same diagram, repeated as the animation advances]
39. Log Splitting
HMaster: "Whew. I did a lot of splitting work. That took 9 hours!"
[Diagram: the HMaster alone has split the HLogs into per-region edits for the HRegions]
40. Log Splitting
HMaster: "RegionServers, here are your region assignments."
[Diagram: the HMaster assigns the recovered HRegions to RegionServer4, RegionServer5, and RegionServer6]
41. Log Splitting
HMaster: "Victory!"
[Diagram: RegionServer4-6 serving the HRegions again, MemStores repopulated from the split logs]
42. Can we recover more quickly?
• In the case study, this was all done serially by the master
  • The master took 9 hours to recover
  • The 100 region server nodes were idle
• Let's use the idle machines to do the splitting in parallel!
• Distributed log splitting (HBASE-1364)
  • Introduced in 0.92.0 by Prakash Khemani (Facebook)
  • Included in CDH4 (0.92.1)
  • Backported to CDH3u3 (off by default)
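In the backport it has to be switched on explicitly. A minimal sketch, assuming the 0.92-era property name (verify against your release's documentation):

```
# hbase-site.xml on the master (on by default in 0.92+,
# off by default in the CDH3u3 backport):
#   hbase.master.distributed.log.splitting = true
```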
43. Distributed Log Splitting
HMaster: "I'm the boss."
[Diagram: HMaster above three RegionServers, each with its HLog and HRegions]
44. Distributed Log Splitting
HMaster: "There is a lot of splitting work here, let's split it up."
[Diagram: the RegionServers are gone; HLog1-HLog3 remain to be recovered]
45. Distributed Log Splitting
HMaster: "You guys do the work for me."
[Diagram: HLog1-HLog3 are handed out to RegionServer4, RegionServer5, and RegionServer6]
46. Distributed Log Splitting
HMaster: "You guys do the work for me."
[Same diagram; the RegionServers split the HLogs in parallel]
47. Distributed Log Splitting
HMaster: "Great, that took 5.4 minutes."
[Diagram: splitting is complete; the HLogs are gone]
48. Distributed Log Splitting
HMaster: "Good job, here are your region assignments."
[Diagram: the recovered HRegions are assigned across RegionServer4-6]
49. Distributed Log Splitting
HMaster: "Like a boss."
[Diagram: RegionServer4-6 serving the HRegions, MemStores repopulated]
50. Case #5 redux: Network failure and slow recovery
• Network/HW failure, plus human error
• RS loses HDFS, WAL
• On restart, -ROOT- and .META. assignment fails
• Manual repair
• 9-hour hlog splitting recovery
• Correct, but slow!
51. Case #5 redux: Network failure and slow recovery
• Network/HW failure, plus human error (the SW error is fixed)
• RS loses HDFS, WAL
• On restart, -ROOT- and .META. assignment fails
• Automatic repairs
• 5.4-minute hlog splitting recovery
• Correct, and faster!
52. WHERE WE ARE GOING
HBASE 0.96 + HADOOP 2.X
53. Themes
• Minimizing planned downtime:
  • Changing configurations
  • Online schema change (experimental in 0.92, 0.94)
  • Rolling restarts
  • Wire compatibility
[Chart: HBase downtime distribution, planned vs. unplanned]
54. Table unavailable when changing schema
• Changing a table's schema requires disabling the table:
  • disable table, alter table schema, enable table
  • Schema includes compression, column families, caching, TTL, versions
• Goal: quickly change table and column configuration settings without having to disable HBase tables (see the sketch below)
• Feature: Online Schema Change (HBASE-1730)
  • Included in HBase 0.92/0.94, but considered experimental
  • Contributed by Facebook
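A minimal sketch of the two workflows in the HBase shell. The table and column family names are hypothetical, and the online path assumes the 0.92-era hbase.online.schema.update.enable setting is turned on:

```
# Offline (the table is unavailable between disable and enable):
hbase> disable 'usertable'
hbase> alter 'usertable', {NAME => 'cf', COMPRESSION => 'SNAPPY', VERSIONS => 3}
hbase> enable 'usertable'

# Online schema change (experimental in 0.92/0.94): with
# hbase.online.schema.update.enable = true in hbase-site.xml,
# the same alter can run while the table stays enabled.
```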
55. Changing Server Configs and Software updates
• A rolling restart upgrades an HBase cluster to a compatible version while keeping HBase available and serving data
• Handles server config changes
• Handles code changes, like hotfixes or compatible upgrades
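The frames that follow step through this one node at a time. In practice, each step can be driven with the stock scripts that ship with HBase; a minimal sketch (the hostname is hypothetical):

```
# Gracefully unload a RegionServer's regions, restart it, then reload them:
bin/graceful_stop.sh --restart --reload --debug rs-host.example.com

# bin/rolling-restart.sh wraps this pattern across the whole cluster.
```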
56. Rolling Restart
[Diagram: a client shell issues admin operations through HM1; user operations flow to RegionServers RS1-RS4; internal operations coordinate through ZK and the standby HM2]
57-69. Rolling Restart
[Animation: each RegionServer (RS1, RS2, RS3, RS4) is stopped and restarted in turn, then each HMaster (HM1, HM2); admin, user, and internal operations keep flowing throughout]
70. Rolling restart limitations
• There are limitations on rolling restarts:
  • All servers and clients must be wire compatible
  • All must be able to read old data in the FS and ZK
• Ramifications:
  • Only minor version upgrades are possible
  • New features that change RPCs require custom compatibility shims
  • Data format changes are not possible across minor versions
[Chart: unplanned maintenance root causes from Cloudera Support, as on slide 7]
71. HBase Compatibility and Extensibility
• Coming in HBase 0.96
• HBASE-5305 and friends
• Goals:
  • Allow API changes and persistent data structure changes while guaranteeing compatibility between different minor versions (0.96.0 -> 0.96.1)
  • HBase client-server compatibility between major versions (0.96.x -> 0.98.x)
72. HDFS Wire Compatibility
• Here in HDFS 2.0.x
• HADOOP-7347 and friends
• Goals:
  • Allow API changes while guaranteeing wire compatibility between different minor versions
  • HDFS client-server compatibility between major versions
[Diagram: App and MR on top of HBase, which depends on ZK and HDFS]
74. CONCLUSIONS
75. Improving how we handle causes of downtime
[Charts: the HBase downtime distribution and the unplanned-maintenance root causes from Cloudera Support (misconfig 44%, repair needed 28%, fix HW/NW 16%, patch required 12%), annotated with the remedies discussed: best practices and wire compatibility against planned downtime and misconfiguration; hbck and distributed log splitting, plus wire compatibility, against the unplanned causes]
Source: Cloudera's production HBase support tickets (CDH3's HBase 0.90.x, Hadoop 0.20.x/1.0.x)
76. jon@cloudera.com
Twitter: @jmhsieh
We’re hiring!
QUESTIONS?
Editor's notes
This pie chart is the product of analyzing critical production HBase tickets over the past 6 months: misconfig 44%, patch 12%, HW/NW 16%, repair 28%. "Misconfig" means that correcting a misconfiguration was all it took to bring HBase back up again. As you can see, misconfigurations and bugs break the most HBase clusters. Fixing bugs is up to the community; fixing misconfigurations is up to you, and is the focus of the next segment. Because it's hard to diagnose, misconfiguration is not what you want to spend your time on. If your cluster is broken, it's probably a misconfiguration. This is a hard problem because the error messages are not tightly tied to the root cause.
CDH3 goes GA 4/12/12
HDFS-2379
CDH4 GA 6/5/12
Tested under HBase
Transparent to clients
Coupled with HBase Master failover means no SPOF
Cause: Disconnect a region server for a while, or kill -9 a region server. Why? All writes at a region server go to a single HLog, which can contain edits from multiple regions, and those regions may get reassigned to multiple other region servers. So the hlog needs to be split up.
region_mover.rb: Move regions off, recording which regions were present; then restore regions based on the recorded list. Still mostly a manual process.