This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.
4. Apache HBase
A Friendly Open Source Project
Disclaimer: These are the personal opinions of Jonathan Gray and do not necessarily reflect the opinions of
Facebook Inc., Apache HBase, the Apache HBase community, or any other person or organization. I also apologize
in advance to any individuals or companies that were left out of slides or discussion. This was not done
purposefully and I love you all.
5. Apache HBase
▪ A dynamic and pragmatic community
▪ HBase committers scattered around many companies
▪ A culture of acceptance (contributions please!)
▪ Perhaps, occasionally, to a fault
▪ Many HBase committers have moved companies
▪ “Road Map” driven by sponsoring companies
▪ They decide which bugs get fixed and which features get developed
▪ HBase has no Enterprise Software Company behind it
6. The Ghost of HBase Past
Early days through . and .
7. HBase History
▪ Started in as Bigtable clone for Hadoop
▪ First code released in as part of Hadoop .
▪ Six major releases (three versioning schemes)
▪ . . in March
▪ . . in August
▪ . . in September
▪ . . in January
▪ . . in September
▪ . . in January
9. HBase History
▪ Early users focused on offline, crawl data storage
▪ Powerset was primary user
▪ Others like WorldLingo, OpenPlaces
▪ Augmenting Offline MapReduce
▪ Needed random writes for web crawl storage
▪ Also used random writes to store links and images
▪ The road map was easy... Bigtable
11. Online HBase
▪ Next generation of HBasers wanted OLTP
▪ Streamy.com (my previous startup)
▪ StumbleUpon and others
▪ HBase Goes Realtime
▪ Gave this talk at Hadoop Summit w/ JD Cryans
▪ “HBase . ... First ever Performance Release”
“As a random-access store, we are well suited for the storing and serving of
Web applications, but high latency and variability (100s of ms to seconds)
has reduced the usefulness of HBase and required the use of external
caching in the past”
12. HBase 0.20
▪ Performance Release (aka the Unjavafy release)
▪ Rewrite of entire read and write paths
▪ Introduction of KeyValue and zero-copy reads
▪ New block-based HFile format and LRU block cache
▪ New client APIs: Put, Get, Scan, Delete, Result
▪ ZooKeeper Integration
▪ Remove all dependencies on master for reads/writes
▪ Leader election, fault detection, remove SPOF
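The LRU block cache bullet above can be illustrated with a small sketch. This is not HBase's actual block cache implementation; the class name `LruBlockCacheSketch`, its methods, and the string block ids are all hypothetical, and the cache is bounded by block count rather than by bytes to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified LRU cache keyed by block id (not HBase's real class).
class LruBlockCacheSketch {
    private final int maxBlocks;
    private final LinkedHashMap<String, byte[]> blocks;

    LruBlockCacheSketch(int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // accessOrder=true moves an entry to the tail on every get/put,
        // so the head of the map is always the least recently used block.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict the least recently used block once the cache is over capacity.
                return size() > LruBlockCacheSketch.this.maxBlocks;
            }
        };
    }

    void cacheBlock(String blockId, byte[] data) {
        blocks.put(blockId, data);
    }

    // Returns the cached block, or null on a miss; a hit refreshes its recency.
    byte[] getBlock(String blockId) {
        return blocks.get(blockId);
    }

    // Membership check that does not change the eviction order.
    boolean contains(String blockId) {
        return blocks.containsKey(blockId);
    }

    int size() {
        return blocks.size();
    }
}
```

With a capacity of two, caching blocks "a" and "b", reading "a", and then caching "c" evicts "b", because "a" was touched more recently.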
14. HBase 0.90
▪ Durability, Stability, Availability Release
▪ “Production Ready HBase”
▪ Zero data loss
▪ Rewrite of Master and ZooKeeper interactions
▪ Testing, debugging, monitoring improvements
▪ Random read and large row improvements
▪ Lots of awesome new features
15. HBase 0.90: Production Ready
▪ Zero data loss
▪ HDFS Appends, HLog fixes, gremlin testing
▪ Master rewrite
▪ Remove from read/write path + failover, no SPOF
▪ Operational improvements
▪ HBCK (fsck for HBase), HFile/HLog command-line tools
▪ Rolling restarts for minor upgrades
▪ New testing framework and k new lines of tests
16. HBase 0.90: New Features
▪ Cluster-to-cluster replication
▪ Read performance
▪ Bloom filters rewrite
▪ Efficient intra-row seeking for large row support
▪ Other stuff
▪ Mavenized
▪ Stargate REST server and Avro server
▪ Shell improvements and EC scripts
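To make the bloom filter bullet above concrete, here is a minimal sketch of the general technique: a filter that can say "definitely not present" for a row key, so a read could skip a file without touching it. It is not the HBase implementation; the class name `RowBloomSketch`, the hash functions, and the sizing parameters are assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical row-key bloom filter; false positives are possible, false negatives are not.
class RowBloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowBloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a hash, used as the first of two base hashes.
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(byte[] key, int i) {
        int h1 = fnv1a(key);
        int h2 = Arrays.hashCode(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means the key was never added; true means it may have been added.
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A reader would consult `mightContain` before opening a file and skip the file on `false`. The trade-off is memory versus false-positive rate, controlled here by `numBits` and `numHashes`.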
19. HBase 0.92
▪ Stability and feature release
▪ Lots of usability and stability improvements
▪ Coprocessors and security
▪ Multi-Master cluster replication
▪ . . RC sometime in November
▪ blockers and criticals as of this morning
▪ FB already deploying a -based branch in dev
20. HBase 0.92: Big new features
▪ Coprocessors
▪ Triggers and Stored Procedures
▪ Pre/Post hooks to all client calls and server operations
▪ Dynamically add new RPC calls
▪ ACL security atop Coprocessors
▪ HFile v
▪ Support for very large regions / files
▪ Multi-level block index and inline blooms
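The pre/post hook idea behind Coprocessors, and the way an ACL check can sit on top of it, can be sketched roughly as follows. Every name here (`PutObserverSketch`, `RegionSketch`, `PrefixAclObserverSketch`) is hypothetical and far simpler than the real coprocessor interfaces; the sketch only shows the control flow of a pre hook that can veto a write and a post hook that runs after it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical observer interface: one pre hook and one post hook around a put.
interface PutObserverSketch {
    // Return false to veto the write before it is applied.
    boolean prePut(String row, String value);

    // Called only after the write has been applied.
    void postPut(String row, String value);
}

// Hypothetical ACL-style observer: rejects writes to rows under a protected prefix.
class PrefixAclObserverSketch implements PutObserverSketch {
    private final String protectedPrefix;

    PrefixAclObserverSketch(String protectedPrefix) {
        this.protectedPrefix = protectedPrefix;
    }

    @Override
    public boolean prePut(String row, String value) {
        return !row.startsWith(protectedPrefix);
    }

    @Override
    public void postPut(String row, String value) {
        // Nothing to do after the write in this example.
    }
}

// Hypothetical stand-in for a region that runs registered hooks around each put.
class RegionSketch {
    private final Map<String, String> store = new HashMap<>();
    private final List<PutObserverSketch> observers = new ArrayList<>();

    void register(PutObserverSketch observer) {
        observers.add(observer);
    }

    // Returns true if the write was applied, false if any pre hook vetoed it.
    boolean put(String row, String value) {
        for (PutObserverSketch o : observers) {
            if (!o.prePut(row, value)) {
                return false;
            }
        }
        store.put(row, value);
        for (PutObserverSketch o : observers) {
            o.postPut(row, value);
        }
        return true;
    }

    String get(String row) {
        return store.get(row);
    }
}
```

Registering `new PrefixAclObserverSketch("locked:")` makes `put("locked:1", ...)` return false and leave the store untouched, while other rows are written normally. A trigger would use the post hook in the same way.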
21. HBase 0.92: Performance
▪ Performance improvements
▪ More seeking and early-out hints
▪ Distributed log splitting
▪ CacheOnWrite, EvictOnClose
▪ Compaction improvements
▪ Multi-threaded compactions
▪ Vastly improved file selection algorithm
▪ Lots of metrics; highly configurable
22. HBase 0.92: Improvements
▪ Operational improvements
▪ HBCK improvements, Web UI improvements
▪ Slow query log, running tasks and thread status
▪ Online schema modifications
▪ Usability and API improvements
▪ Increment client API
▪ String-based Filter language
▪ Multi-family bulk load
▪ The HBase Books!
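As a rough illustration of what an increment-style client call offers, the sketch below bumps several counter columns of one row as a single atomic step and returns the new values. It is an in-memory model under stated assumptions, not the HBase client API; `CounterRowSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of one row holding several counter columns.
class CounterRowSketch {
    private final Map<String, Long> columns = new HashMap<>();

    // Applies all deltas under one lock so readers never observe a partial update,
    // and returns the post-increment value of every touched column.
    synchronized Map<String, Long> increment(Map<String, Long> deltas) {
        Map<String, Long> updatedValues = new HashMap<>();
        for (Map.Entry<String, Long> e : deltas.entrySet()) {
            long updated = columns.merge(e.getKey(), e.getValue(), Long::sum);
            updatedValues.put(e.getKey(), updated);
        }
        return updatedValues;
    }

    // Columns that were never incremented read as zero.
    synchronized long get(String column) {
        return columns.getOrDefault(column, 0L);
    }
}
```

The point of such a call is that the caller never does a read-modify-write round trip: it sends deltas and gets the new totals back.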
23. HBase 0.92: Documentation!
▪ The (O’Reilly) HBase Book
▪ HBase: The Definitive Guide released in September
▪ Massive effort by committer Lars George
▪ Lots of input and feedback from the community
▪ The (Apache) HBase Book
▪ Apache HBase now has a DocBook-format book
▪ Every HBase release will ship with a versioned book
▪ From installation to schema design and architecture
▪ Latest version @ http://hbase.apache.org/book.html
25. ?
You!
A usable, large scale production database system
26. HBase 0.94
▪ Stability and usability are the core focus
▪ Increase stability by decreasing complexity
▪ More work on UI, tools, monitoring, operability
▪ Table/family-level metrics
▪ But features will always continue...
▪ Fast backups w/ point-in-time recovery
▪ Multi-Slave Replication
▪ Constraints and other Coprocessor-based contribs
27. HBase 0.94: New Stuff
▪ Thrift .
▪ New Thrift API to more closely match Java API
▪ Embedded Thrift w/ RS short-circuit
▪ Other Goodies
▪ TTL + minVersions
▪ Point-in-time snapshot scanners
▪ Atomic Append operation
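The "TTL + minVersions" bullet combines two retention rules: expire versions older than a time-to-live, but always keep at least a minimum number of the newest versions. A minimal sketch of that rule, assuming versions are identified only by timestamp and arrive newest first, might look like this. `RetentionSketch` and `retain` are hypothetical names, and the real server-side logic handles more cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical retention rule for the versions of a single cell.
class RetentionSketch {
    // newestFirst must be sorted by timestamp, newest first.
    // Returns the timestamps that survive, still newest first.
    static List<Long> retain(List<Long> newestFirst, long now, long ttlMillis,
                             int minVersions, int maxVersions) {
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            if (kept.size() >= maxVersions) {
                break; // never keep more than maxVersions
            }
            boolean expired = now - ts > ttlMillis;
            // Keep unexpired versions, plus expired ones until minVersions is reached.
            if (!expired || kept.size() < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

For example, with `now = 1000`, a TTL of 100, and versions at 990, 950, 800, and 700, setting `minVersions = 3` keeps 990, 950, and 800, while `minVersions = 0` keeps only 990 and 950. With every version expired and `minVersions = 1`, the newest version still survives.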
28. HBase 0.94: Performance
▪ Scaling for throughput vs. latency
▪ Early-lock-release to decrease row contention
▪ Early-thread-release to increase throughput
▪ Remove all global wait()/notify() on HLog
▪ Improved seeking and file selection
▪ “Lazy-seek” in-order file processing
▪ DeleteFamily bloom filter
29. HBase 0.94: Project Management
▪ Renewed focus on fast release cycle
▪ HBase . branch cut immediately after . release
▪ Already close to . feature freeze, . dev release?
▪ blockers and criticals left
▪ Apache HBase: A slightly less accepting project
▪ Stability is really code stability
▪ Push towards iterative feature dev and branch dev
▪ Coprocessors and Service Interfaces go a long way
31. Beyond HBase 0.94
▪ Stability and usability are still the core focus
▪ More tests, testing frameworks, integration tests
▪ But features will always continue...
▪ RPC redux
▪ Dynamic configuration
▪ Request, IO, and locality based load balancing
▪ Multi-Tenancy (QoS, ACL)
▪ Tighter coordination with rest of stack (HDFS, Linux)
32. Conclusion
▪ Apache HBase has come a long way
▪ Use case driven development
▪ HBase . coming very soon
▪ Most stable release to date
▪ Contributors and committers drive development
▪ Consumers can’t dictate the road map
▪ Individuals and organizations solve their problems
(They have their own users... and jobs to keep)
33. Check out the HBase at Facebook Page:
facebook.com/UsingHbase
Thanks! Questions?