Apache HBase is a distributed data store that is in production today at many enterprises and sites serving large volumes of near-real-time random accesses. By supporting a wide range of production Apache HBase clusters with diverse use cases and sizes over the past year, we've noticed several new trends, learned lessons, and taken action to improve the HBase experience. We'll present aggregated root-cause statistics on resolved support tickets from the past year. The comparison between this and the previous year's shows an interesting shift away from problems internal to HBase (splitting, repairs, recovery time) and toward user-inflicted problems, such as poor application architecture, that can be mitigated by tuning (bulk loads, read/write latencies, and compaction policies). The talk will discuss several tuning tips used for a variety of production workloads running on top of HBase 0.92.x/0.94.x clusters with 10s to 100s of nodes. This will include settings and their justification for sizing clusters, tuning bulk loads, region counts, and memory settings. We'll also discuss recently added HBase features that alleviate these problems, including an improved mean time to recovery, improved predictability, and improved performance.
Trends in Supporting Production Apache HBase Clusters
1. Trends in Supporting Production Apache HBase Clusters
Jonathan Hsieh | @jmhsieh | Software Engineer at Cloudera / HBase PMC Member
Kevin O'Dell | kevin.odell@cloudera | Systems Engineer at Cloudera
June 26, 2013
2. Who are we?
Jonathan Hsieh
• Cloudera:
• Software Engineer
• Apache HBase committer /
PMC
• Apache Flume founder
Kevin O’Dell
• Cloudera:
• Systems Engineer
• Apache HBase contributor
• Cloudera HBase Support Lead
6/26/13 Hadoop Summit / O'Dell, Hsieh
3. What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Diagram: App and MR clients on HBase, with ZK and HDFS underneath]
4. HBase Architecture
[Diagram: App and MR clients on HBase, with ZK and HDFS underneath]
• HBase is designed to be fault tolerant and highly available; it depends on the systems beneath it to be as well.
• Replication for fault tolerance
• Serve regions from any RegionServer
• Failover HMasters
• ZK quorums
• HDFS block replication on DataNodes
5. From the trenches at Cloudera Customer Operations
Trends in Supporting HBase
6. Customers in 2011-12 vs in 2012-13
0.90.x / CDH3 era
• Red Hat 5.x
• Java JVM 1.6.13
• 4-8 disk machines
• 24-48 GB RAM
• Dual 4-core HT
• CDH3
• Apache HBase 0.90
• Apache Hadoop 0.20.x
0.92.x/0.94.x / CDH4 era
• Red Hat 6.x
• Java JVM 1.6.31
• 12-15 disk machines
• 48-96 GB RAM
• Dual 6-core HT
• CDH4
• Apache HBase 0.92/0.94
• Apache Hadoop 2.0
7. Support Incidents 6/2011-6/2012
• Patched Bug
• Patch delivered, or
• Fixed in next version
• Operational Workaround
• Misconfiguration
• Schema design / tuning
• hbck used to fix
• Network/HW/OS
• Problems with underlying systems.
Chart: 6/11-6/12 CDH3 / 0.90.x HBase support tickets — Patched 12%, Workaround (hbck) 28%, Workaround (config) 44%, Net/HW/OS 16%
8. Comparing 6/11-6/12 to 6/12-6/13
Chart: 6/11-6/12 CDH3 / 0.90.x HBase support tickets — Patched 12%, Workaround (hbck) 28%, Workaround (config) 44%, Net/HW/OS 16%
Chart: 6/12-6/13 CDH3+CDH4 HBase support tickets — Patched 14%, Workaround (config/hbck, merged categories) 36% (much smaller!), Net/HW/OS 42% (this is bigger!), Documentation (new category) 8%
9. Comparing 2011-12 to 2012-13
• Majority of customers upgraded to CDH4.
• More customers, but a similar volume of support incidents
• Shrunk CDH3's largest trouble spots significantly.
• Larger number of issues due to underlying systems.
• This is actually a good thing!
Chart: 6/12-6/13 CDH3+CDH4 HBase support tickets — Patched 14%, Workaround (config/hbck) 36%, Net/HW/OS 42%, Documentation 8%
15. Upgrade Assistance
• Parcels
• simplified distribution
• flexibility of install location
• side by side installs for rolling upgrades
• Rolling upgrades via CM
• hot fixes
• minor version upgrades
• Automated tests for upgrades and compatibility
16. Configuration / Feature
• Continuous Bulk Load
• Avoid and Use Puts
• Region tuning
• Updated defaults + CM
• GC tuning
• Updated defaults + CM
• Balancer
• Manual / custom tools
• Bad Schema
• Trial and Error
Chart: 6/12-6/13 CDH3+CDH4 HBase support tickets — Bug 14%, Workaround (config/hbck) 36%, Net/HW/OS 42%, Documentation 8%
17. CM helps
• Sanity checks on configurations
• Wizard based installation and setup
• Wizard based rolling upgrades (minor versions)
• Wizard based backup and disaster recovery strategies
19. Support improvement wishlist
• Improved “Ergonomics”
• Better default configuration and guard rails
• “I’m sorry Dave, I can’t let you do that”
• Improved error messaging
• Suggest likely root causes in logs
• Improve log signal-to-noise ratio
• More and improved ops tooling and frameworks for app development
20. Good news
• All bug fixes go into the Apache versions before CDH
• HBase is maturing
• Higher percentage of incidents by underlying OS/HW/NW
• More performance and tuning oriented questions
• Similar percentage of incidents caused by bugs
• We’re getting better
• Lower percentage of incidents managed with workarounds
• More tools in place to help operational support
• hbck, CM, defaults
• We can still do better!
25. Usability Concerns
• Administering HBase has been too hard.
• Difficult to see what is happening in HBase
• Easy to make bad design decisions early without realizing it
• New Developments
• Metrics Revamp
• HTrace
• Frameworks for Schema design
27. HTrace
• Problem:
• Where is time being spent inside HBase?
• Solution: HTrace Framework
• Inspired by Google Dapper
• Threaded through HBase and HDFS
• Tracks time spent in calls in a distributed system by tracking spans* on different machines.
*Some assembly still required.
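To make the span idea concrete, here is a minimal sketch in plain Java. This is not the HTrace API — the `Span` class, field names, and ids here are invented for illustration — but it shows the core mechanism HTrace relies on: each timed operation records a start/stop time and a parent id, so spans collected from different machines can be stitched back into one call tree.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanSketch {
    // Hypothetical span: a named operation with timing and a parent link.
    public static class Span {
        public final String name;
        public final long id;
        public final long parentId; // 0 = root span
        public long startMs, stopMs;

        public Span(String name, long id, long parentId) {
            this.name = name;
            this.id = id;
            this.parentId = parentId;
        }

        public long durationMs() {
            return stopMs - startMs;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Span> collected = new ArrayList<>();

        Span get = new Span("client get", 1, 0);        // root span
        get.startMs = System.currentTimeMillis();

        Span read = new Span("filesystem read", 2, get.id); // child span
        read.startMs = System.currentTimeMillis();
        Thread.sleep(10);                                // simulated work
        read.stopMs = System.currentTimeMillis();

        get.stopMs = System.currentTimeMillis();
        collected.add(get);
        collected.add(read);

        // A collector would group spans by parentId to rebuild the tree.
        for (Span s : collected) {
            System.out.println(s.name + " took " + s.durationMs()
                    + "ms (parent=" + s.parentId + ")");
        }
    }
}
```

In the real framework the spans are created and reported by instrumentation threaded through HBase and HDFS; the "assembly still required" note above refers to wiring up collection and visualization yourself.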
28. HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale
• How can we make this easier?
• How can we reduce the expertise required to do this?
• Today:
• Lots of tuning knobs
• Developers need to understand column families, rowkey design, data encoding, …
• Some are expensive to change after the fact
29. Row key design techniques
• Numeric Keys and lexicographic sort
• Store numbers big-endian.
• Pad ASCII numbers with 0’s.
• Use reversal to have most significant traits first.
• Reverse URL.
• Reverse timestamp to get most recent first.
• (MAX_LONG - ts) so “time” gets monotonically smaller.
• Use composite keys to make keys distribute nicely and work well with sub-scans
• Ex: User-ReverseTimeStamp
• Do not use current timestamp as first part of row key!
Unpadded vs. padded sort order: Row100, Row3, Row 31 vs. Row003, Row031, Row100
Plain vs. reversed URLs: blog.cloudera.com, hbase.apache.org, strataconf.com vs. com.cloudera.blog, com.strataconf, org.apache.hbase
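The techniques above can be sketched in plain Java, with no HBase client needed, since HBase simply sorts rows lexicographically by key bytes. The `padKey` prefix, pad width, and method names below are illustrative choices, not HBase APIs:

```java
import java.util.Arrays;

public class RowKeyDesign {
    // Zero-pad ASCII numbers so lexicographic order matches numeric order.
    public static String padKey(long n, int width) {
        return String.format("Row%0" + width + "d", n);
    }

    // Reverse a timestamp (MAX_LONG - ts) so the most recent row sorts first.
    public static long reverseTimestamp(long ts) {
        return Long.MAX_VALUE - ts;
    }

    // Reverse a dotted hostname so related domains sort next to each other.
    public static String reverseDomain(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] keys = { padKey(100, 3), padKey(3, 3), padKey(31, 3) };
        Arrays.sort(keys); // lexicographic, like HBase's on-disk order
        System.out.println(Arrays.toString(keys));              // [Row003, Row031, Row100]
        System.out.println(reverseDomain("blog.cloudera.com")); // com.cloudera.blog
        // Newer events get smaller keys, so a scan returns them first:
        System.out.println(reverseTimestamp(5) < reverseTimestamp(4)); // true
    }
}
```

A composite key such as user + reversed timestamp combines these ideas: rows distribute across regions by user, and a prefix scan on one user returns that user's most recent rows first.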
32. Reliable / Highly Available
• Reliable: ability to recover service if a component fails, without losing data.
• Highly Available: ability to quickly recover service if a component fails, without losing data.
• Goal: Minimize downtime!
33. Mean Time To Recovery (MTTR)
• Average time taken to automatically recover from a failure, composed of:
• Detection time
• Repair time
• Notification time
• Measure: HTrace (Dapper) Infrastructure (0.96+)
[Timeline: Detect → Repair → Notify]
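As a toy illustration of the definition above (this is not HBase code — the class and phase ordering are just a model), recovery time for one failure is the sum of its detection, repair, and notification phases, and MTTR is the mean over failures:

```java
public class MttrSketch {
    // Each failure is {detectSeconds, repairSeconds, notifySeconds}.
    public static double meanTimeToRecover(long[][] failures) {
        long total = 0;
        for (long[] f : failures) {
            total += f[0] + f[1] + f[2]; // detect + repair + notify
        }
        return (double) total / failures.length;
    }

    public static void main(String[] args) {
        long[][] failures = {
            { 30, 120, 5 }, // slow detection, long repair
            { 10, 60, 5 },  // faster detection, shorter repair
        };
        System.out.println(meanTimeToRecover(failures)); // 115.0
    }
}
```

The point of the breakdown is that each phase can be attacked separately: the following slides reduce detection time, while log-splitting and assignment improvements reduce repair time.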
34. Reduce Detection Time
• Proactive notification of HMaster failure (0.95)
• Proactive notification of RS failure (0.95)
• Fast server failover (Hardware)
[Timeline: Detect → Repair → Notify, with the Detect phase highlighted]
42. Reliable / Highly Available / Latency Tolerant
• Reliable: ability to recover service if a component fails, without losing data.
• Highly Available: ability to quickly recover service if a component fails, without losing data.
• Latency Tolerant: ability to perform and recover in a predictable amount of time, without losing data
• New Goal: Predictable performance
43. Common causes of performance variability
• Compaction
• Garbage Collection
• Locality Loss
44. Compaction
• Compactions optimize read layout by rewriting files
• Reduce the seeks required to read a row
• Improve random read performance
• Age off expired or deleted data
• Assumes uniformly distributed write workload
• But we have new workloads:
• Continuous Bulk load write pattern
• Time-series write pattern
45. Compactions: Put workload
• Minor compactions
• Optimize a subset of adjacent files
• Major compactions
• Optimize all files
• Choosing:
• Assume: older files should be larger than newer files.
• If "new" files are "larger" than "older" files → major compaction
• Else, look at newer files and select files for a minor compaction
[Diagram: newly flushed HFiles selected for minor and major compactions]
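The age/size heuristic above can be sketched in a few lines. This is a deliberately simplified model, not HBase's actual selection code (the real policy also weighs file ratios and count limits); it only captures the assumption that files should shrink as they get newer, and that a violation of that assumption triggers a major compaction:

```java
public class CompactionChooser {
    // Store files ordered oldest-first; sizes in bytes.
    // Simplified sketch of the heuristic, not HBase's real policy.
    public static boolean needsMajor(long[] sizesOldestFirst) {
        // If a newer file is larger than the file before it, the
        // age/size assumption is broken: rewrite everything.
        for (int i = 1; i < sizesOldestFirst.length; i++) {
            if (sizesOldestFirst[i] > sizesOldestFirst[i - 1]) {
                return true;
            }
        }
        // Otherwise a minor compaction over the newer, smaller files suffices.
        return false;
    }

    public static void main(String[] args) {
        // Put workload: sizes decay with age, as assumed.
        System.out.println(needsMajor(new long[] { 500, 100, 20 }));  // false
        // Bulk load dropped a huge new file: assumption broken.
        System.out.println(needsMajor(new long[] { 500, 100, 900 })); // true
    }
}
```

The second case is exactly the problem the next slide describes: repeated bulk loads keep breaking the assumption, so the heuristic keeps choosing expensive major compactions.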
46. Compactions: Bulkload workload
• Functionality for loading data en masse
• Intended for bootstrapping HBase tables
• New write workload: frequently ingest data only via bulk load
• Problem:
• Breaks the age/size assumption!
• Major compaction storms!
• Compactions unnecessarily rewrite large files.
[Diagram: newly bulk loaded and newly flushed HFiles repeatedly chosen for major compactions]
47. Bulkload: Exploring Compactor
• Explore all compaction possibilities
• Choose minor compactions that reduce the number of files while incurring the least IO.
• "The best bang for the buck"
• Compaction workload is more manageable
[Diagram: the explorer skips the large bulk-loaded HFiles and minor-compacts the small flushed ones]
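The exploring idea can be sketched as a search over candidate file sets. This is a simplified, hypothetical model (the real exploring policy considers more constraints, such as file-size ratios and min/max file counts): among all contiguous windows of at least a minimum number of store files, pick the one that rewrites the fewest bytes, so a huge bulk-loaded file is naturally skipped:

```java
import java.util.Arrays;

public class ExploringCompactorSketch {
    // Returns {startIndex, endIndex} of the cheapest contiguous window
    // containing at least minFiles files, or {-1, -1} if none exists.
    public static int[] pickWindow(long[] sizes, int minFiles) {
        int bestStart = -1, bestEnd = -1;
        long bestIo = Long.MAX_VALUE;
        for (int start = 0; start < sizes.length; start++) {
            long io = 0;
            for (int end = start; end < sizes.length; end++) {
                io += sizes[end]; // bytes this window would rewrite
                int count = end - start + 1;
                if (count >= minFiles && io < bestIo) {
                    bestIo = io;
                    bestStart = start;
                    bestEnd = end;
                }
            }
        }
        return new int[] { bestStart, bestEnd };
    }

    public static void main(String[] args) {
        // A huge bulk-loaded file (index 0) next to small flushed files:
        long[] sizes = { 10_000, 40, 30, 20 };
        // The explorer compacts the three small files and leaves
        // the 10,000-byte file alone.
        System.out.println(Arrays.toString(pickWindow(sizes, 3))); // [1, 3]
    }
}
```

Compare this with the age/size heuristic, which would have flagged the same layout for a major compaction and rewritten the large file unnecessarily.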
HBase is a project that solves this problem. In a sentence, HBase is an open source, distributed, sorted map modeled after Google's BigTable. Open source: Apache HBase is an open source project with an Apache 2.0 license. Distributed: HBase is designed to use multiple machines to store and serve data. Sorted map: HBase stores data as a map, and guarantees that adjacent keys will be stored next to each other on disk. HBase is modeled after BigTable, a system that is used for hundreds of applications at Google.
This pie chart is the product of analyzing critical production HBase tickets over the past six months: misconfiguration 44%, patch 12%, HW/NW 16%, repair 28%. "Misconfiguration" means that correcting a misconfiguration was all it took to bring HBase back up again. As you can see, misconfigurations and bugs break the most HBase clusters. Fixing bugs is up to the community. Fixing misconfigurations is up to you, and is the focus of the next segment. Because they're hard to diagnose, misconfigurations are not what you want to spend your time on. If your cluster is broken, it's probably a misconfiguration. This is a hard problem because the error messages are not tightly tied to the root cause.
Hannibal helped a lot with identifying balance issues.