10/25/12. My talk on the features and updates added in the past year to Apache HBase that are important for enterprises. This includes overviews of mechanisms for faster recovery, better recovery detection, replication and data backup strategies.
Apache HBase for the Enterprise (Strata+Hadoop World 2012)
1. Apache HBase Features for the Enterprise
Jonathan Hsieh | @jmhsieh
Software Engineer at Cloudera / HBase PMC Member
October 2012
2. Who Am I?
• Cloudera:
• Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• Apache Sqoop committer / PMC
• U of Washington:
• Research in Distributed Systems
2 10/25/12 Strata Hadoop World 2012
3. What is Apache HBase?
Apache HBase is an open source, distributed, scalable, consistent, low latency, random access non-relational database built on Apache Hadoop.
[Diagram labels: App, MR, ZK, HDFS]
4. HBase provides Low-latency Random Access
• Writes:
• 1-3ms, 1k-10k writes/sec per node
• Reads:
• 0-3ms cached, 10-30ms disk
• 10-40k reads / second / node from cache
• Cell size:
• 0-3MB preferred
• Read, write and insert data anywhere in the table
• No sequential write limitations
5. HBase On a Cluster
[Diagram: cluster layout across Rack 1 and Rack 2, with columns for HDFS NameNodes, HBase Masters, ZooKeeper Quorum, and Slave Boxes (DN + RS)]
6. Production Apache HBase Applications
• Inbox
• Storage
• Web
• Search
• Analytics
• Monitoring
More Case Studies at http://www.hbasecon.com/agenda/
7. Production Systems Need to Avoid Risk
• Unfortunately, all things can fail.
• Enterprises need to minimize risk.
• Understand potential data loss scenarios
• Understand potential unavailability scenarios
• Must have a disaster recovery story
• Downtime, data loss == risk
• Let’s talk about how HBase deals with:
• Risks from within the cluster
• Risks from outside the cluster
• Risks posed by Users
• Goal: Remove or reduce negative impact of potential risks
10. Causes of HBase Downtime within the cluster
• Unplanned Maintenance
• Hardware failures
• Software errors
• Human error
• Planned Maintenance
• Upgrades
• Migrations
Goal: Reduce downtime from hours to minutes to seconds.
11. Unplanned Downtime
[Timeline: Failure Event, Detection Time (service still thinks we are ok), service realizes there is a problem and starts fixing, Recovery Time, service is restored]
• Two sources of unavailability
• Detection time
• Recovery time
12. Reduce downtime by speeding up recovery time
[Timeline: Failure Event, Detection Time (service still thinks we are ok), service realizes there is a problem and starts fixing, service is restored]
• Distributed log splitting (0.92)
• Automated metadata repairs with hbck (0.92)
• Enable writes while recovering from failure (0.96)
13. Reduce downtime by speeding up detection
[Timeline: Failure Event (service still thinks we are ok), service realizes there is a problem and starts fixing, service is restored]
• Proactively notify on process failure so recovery starts quickly
0.92/0.94: All Master Failure Detection 180s; Some Region Server Failure detection 180s
0.96: Master process failure detection 0-1s; Region Server Failure detection 0-1s
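To make the effect of faster detection concrete, here is a small arithmetic sketch. It models unplanned downtime as detection time plus recovery time, following the two sources of unavailability named earlier. The detection figures (180 s and the upper end of 0-1 s) come from the slide; the 60 s recovery time is a made-up assumption used only for illustration.

```python
# Illustrative arithmetic only: total unplanned downtime is modeled as
# detection time plus recovery time.

def downtime_seconds(detection_s: float, recovery_s: float) -> float:
    """Return total unavailability for one failure event."""
    return detection_s + recovery_s

# ASSUMPTION: 60 s of recovery work is a hypothetical figure, not a
# measurement from the talk.
HYPOTHETICAL_RECOVERY_S = 60.0

# Detection times from the slide: 180 s (0.92/0.94), up to 1 s (0.96).
old = downtime_seconds(180.0, HYPOTHETICAL_RECOVERY_S)
new = downtime_seconds(1.0, HYPOTHETICAL_RECOVERY_S)

print(f"0.92/0.94: {old:.0f} s of downtime, 0.96: {new:.0f} s of downtime")
```

Under this assumption, detection dominates the total in the older releases, which is why shrinking it matters even when recovery time is unchanged.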
14. Manual Problem detection: Metrics
• Goal: Pinpoint root causes of problems faster
• Take a baseline of your system in steady-state
• Anomalies like spikes or dips from baseline can indicate problems
• Ex: Slow Query Logging
• Integrates with existing infrastructure via JMX or use with Ganglia, Cloudera Manager
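The baseline idea can be sketched in a few lines. This is a minimal illustration, not how HBase or any monitoring tool implements it: the function name `find_anomalies`, the three-standard-deviation threshold, and the latency numbers are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def find_anomalies(baseline, samples, threshold=3.0):
    """Return (index, value) pairs that sit more than `threshold`
    standard deviations above or below the steady-state baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return [(i, v) for i, v in enumerate(samples) if v != mu]
    return [(i, v) for i, v in enumerate(samples)
            if abs(v - mu) > threshold * sigma]

# Hypothetical get-latency readings in milliseconds (made-up numbers).
baseline_ms = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
recent_ms = [2.0, 2.1, 9.5, 1.9, 0.2]

# Flags both the spike (9.5 ms) and the dip (0.2 ms).
anomalies = find_anomalies(baseline_ms, recent_ms)
print(anomalies)
```

A dip is flagged as well as a spike because an unusually low latency or operation count can also indicate trouble, for example clients failing before they reach the server.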
15. Metrics from all levels of the system
• HBase Region Servers
• Operations / sec
• Get / put latencies (0.92)
• Per CF metrics (0.94)
• Per Region metrics (0.94)
• HBase Master
• RIT metrics
• Replication Metrics
• System/JVM
• GC, RPC metrics
[Diagram labels: Host, JVM, Master, Region server, Region, CF, HDFS DataNode, CPUs, Memory, Disks, Network]
17. Eliminate HBase downtime
[Timeline: Maintenance Begins, Maintenance Time (service remains online, slightly degraded), full service is restored]
• Highly Available stack: HDFS (2.0) / ZK / HBase
• Client Cross-version wire compatibility (0.96)
• Rolling Restarts
• Online Schema Change (experimental)
18. High Availability: HBase + HDFS + ZK
[Diagram: cluster layout across Rack 1 and Rack 2, with columns for HDFS NameNodes, HBase Masters, ZooKeeper Quorum, and Slave Boxes (DN + RS)]
19. Wire Compatibility
• Reduces downtime due to planned maintenance
• Wire compatibility + Extensible Data formats
• Allow for forwards and backwards compatibility
• Older clients can talk to newer servers and vice versa
• Rolling upgrade
• Upgrade a single node at a time while system runs
• Allows API changes while guaranteeing wire compatibility between different minor versions
• HDFS client-server compatibility between Major Versions
[Diagram labels: App, MR, ZK, HDFS]
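The sketch below shows why an extensible data format allows old and new nodes to talk to each other during a rolling upgrade. It uses a toy tag-length-value encoding invented for this example; it is not HBase's actual RPC format, and the field names (`row`, `value`, `ttl`) and tag numbers are hypothetical.

```python
def encode(fields):
    """Encode {tag: bytes} as repeated (tag, length, value) records."""
    out = bytearray()
    for tag, value in fields.items():
        out += bytes([tag, len(value)]) + value
    return bytes(out)

def decode(data, known_tags, defaults):
    """Decode records, skipping unknown tags and defaulting missing ones."""
    result = dict(defaults)
    pos = 0
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        value = data[pos + 2 : pos + 2 + length]
        pos += 2 + length
        if tag in known_tags:
            result[known_tags[tag]] = value
        # Unknown tags are skipped, which is what lets an older reader
        # accept messages from a newer writer.
    return result

# Hypothetical schemas: version 2 adds tag 3 ("ttl") to version 1.
V1_TAGS = {1: "row", 2: "value"}
V2_TAGS = {1: "row", 2: "value", 3: "ttl"}
V1_DEFAULTS = {}
V2_DEFAULTS = {"ttl": b"0"}

new_msg = encode({1: b"r1", 2: b"v1", 3: b"60"})  # written by a v2 node
old_msg = encode({1: b"r1", 2: b"v1"})            # written by a v1 node

old_reader_view = decode(new_msg, V1_TAGS, V1_DEFAULTS)  # ignores "ttl"
new_reader_view = decode(old_msg, V2_TAGS, V2_DEFAULTS)  # defaults "ttl"
print(old_reader_view, new_reader_view)
```

Because the length prefix lets a reader step over fields it does not understand, a node can be upgraded on its own while the rest of the system keeps running on the previous version.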
28. Oops… User Error
[Timeline: User Error: drop ‘table’, service is down!, Recovery Time, service is restored with major data loss]
• How do we prevent user error?
• How do we recover from user error?
29. Prevent user mistakes: User-level Security
[Timeline: User Error: drop ‘table’, operation rejected, insufficient permissions]
• Authentication:
• Ensure the identity of the services or users that are communicating
• Access Control:
• Ensure user has permission to execute table data operations
30. HBase User-level Security
• Based on Kerberos for HBase, HDFS and ZooKeeper
• Grant privileges to users
• Revoke privileges from users
• Column Family and Table granularity
• Confidentiality:
• Ensure information is only seen by intended users.
• Audit Trails:
• Track which users performed particular operations
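As a rough illustration of grant, revoke, table versus column family granularity, and an audit trail, consider the toy model below. It is a sketch under stated assumptions, not HBase's security implementation: the class `ToyAccessController`, the action names, and the users are hypothetical, and authentication is assumed to have already established who the user is.

```python
class ToyAccessController:
    """Toy permission table keyed by (user, action, table, column family)."""

    def __init__(self):
        self._grants = set()
        self.audit_log = []  # one entry per permission check

    def grant(self, user, action, table, family=None):
        # family=None means the grant applies to the whole table.
        self._grants.add((user, action, table, family))

    def revoke(self, user, action, table, family=None):
        self._grants.discard((user, action, table, family))

    def is_allowed(self, user, action, table, family=None):
        # A table-wide grant covers every column family in that table.
        allowed = ((user, action, table, None) in self._grants
                   or (user, action, table, family) in self._grants)
        self.audit_log.append((user, action, table, family, allowed))
        return allowed

acl = ToyAccessController()
acl.grant("alice", "ADMIN", "orders")                # table granularity
acl.grant("bob", "READ", "orders", family="public")  # column family granularity

bob_can_drop = acl.is_allowed("bob", "ADMIN", "orders")
bob_reads_public = acl.is_allowed("bob", "READ", "orders", "public")
bob_reads_private = acl.is_allowed("bob", "READ", "orders", "private")

acl.revoke("alice", "ADMIN", "orders")
alice_can_drop = acl.is_allowed("alice", "ADMIN", "orders")
print(bob_can_drop, bob_reads_public, bob_reads_private, alice_can_drop)
```

In this model the accidental drop ‘table’ from the earlier slide is rejected because the user holds no administrative grant on the table, and the rejected attempt still leaves a record in the audit log.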
31. Recovering from User Mistakes: Table Snapshots
[Timeline: periodic snapshot, User Error: drop ‘table’, service is down!, restore, service is restored with minor data loss]
• Snapshot the state of a table at a certain moment in time
• Restore it or clone it later, creating a new read-write table
• Export it to another cluster with minimal impact on HBase
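The snapshot, restore and clone operations can be pictured with a toy model. This is a simplified sketch, not the HBase implementation: `ToyCluster`, the table names and the file names are hypothetical, and the model assumes a table is just a list of immutable store files, so a snapshot only records which files existed at that moment.

```python
class ToyCluster:
    """Toy model: a table is a list of immutable store file names."""

    def __init__(self):
        self.tables = {}      # table name -> list of store file names
        self.snapshots = {}   # snapshot name -> tuple of store file names

    def snapshot(self, table, name):
        # Record the table's files at this moment; no data is copied.
        self.snapshots[name] = tuple(self.tables[table])

    def restore(self, name, table):
        # Put the table back to the state captured by the snapshot.
        self.tables[table] = list(self.snapshots[name])

    def clone(self, name, new_table):
        # Create a new, independently writable table from the snapshot.
        self.tables[new_table] = list(self.snapshots[name])

    def drop(self, table):
        del self.tables[table]

cluster = ToyCluster()
cluster.tables["orders"] = ["file-1", "file-2"]
cluster.snapshot("orders", "orders-snap")      # periodic snapshot
cluster.tables["orders"].append("file-3")      # written after the snapshot
cluster.drop("orders")                         # user error
cluster.restore("orders-snap", "orders")       # recovery

cluster.clone("orders-snap", "orders-experiment")
cluster.tables["orders-experiment"].append("file-x")
print(cluster.tables)
```

After the restore, only the data written after the last snapshot (`file-3` here) is missing, which matches the "minor data loss" outcome on the timeline, and writes to the clone leave the restored table untouched.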
32. Table Snapshots (0.96+)
• Under development, slated for HBase 0.96
• Multiple snapshot flavors planned
• Offline snapshots
• Online Snapshots
• Snapshot uses
• Recover from application or user error.
• Application experimentation (no need to spin up another cluster for replication)
• Use MR directly on snapshot files