The latest Apache HBase releases, 0.92 and 0.94, contain many correctness and performance improvements over prior releases. We discuss a couple of these improvements from a development and operations perspective. For correctness, we discuss the ACID guarantees of HBase, give a case study of problems with earlier releases, and give an overview of the implementation internals that were improved to fix the issues. For performance, we discuss recent improvements in 0.94 and how to monitor the performance of a cluster with new metrics.
We are going to talk about recent improvements in HBase for ACID consistency and performance. We are going to discuss customer cases, and also look at the internals of HBase to give you a taste of these issues.
This is what the data format looks like, how do we write it?
At Cloudera Support we have seen a few issues where HBase consistency can be a problem.
In some workflows it is desirable to upload data directly into HBase (ETL) instead of invoking Put() to add new records. Depending on the use case, it might also have some performance advantages.
It was fixed in 0.92 (HBASE-4552) and backported to HBase 0.90.5 (for convenience, it is also available in CDH3u3 and later).
Each read can return partial content for the same row: either empty data or an old version of the data.
It is also possible to monitor the logs and metrics before exposing the new data to users.
With HBASE-4552, a read-write lock was implemented to keep the new data hidden from readers until the bulk upload is complete. The old method was also deprecated and a new one was implemented.
Once this lock is released, the data is available to the readers (Scan).
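To give a taste of the idea, here is a minimal sketch of how a read-write lock can make a multi-family load look atomic to readers. This is a toy model, not HBase's implementation: the class `ToyRegion`, its methods, and the in-memory maps are all hypothetical names invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model of a region with several column families (hypothetical, not HBase code). */
class ToyRegion {
    // family name -> (row key -> value)
    private final Map<String, Map<String, String>> families = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    ToyRegion(String... familyNames) {
        for (String family : familyNames) {
            families.put(family, new HashMap<>());
        }
    }

    /** Loads data into several families while holding the write lock. */
    void bulkLoad(Map<String, Map<String, String>> dataByFamily) {
        lock.writeLock().lock(); // blocks until no reader holds the read lock
        try {
            // Validate first so a bad request does not leave a half-applied load.
            for (String family : dataByFamily.keySet()) {
                if (!families.containsKey(family)) {
                    throw new IllegalArgumentException("Unknown family: " + family);
                }
            }
            for (Map.Entry<String, Map<String, String>> e : dataByFamily.entrySet()) {
                families.get(e.getKey()).putAll(e.getValue());
            }
        } finally {
            lock.writeLock().unlock(); // only now can readers see the new data
        }
    }

    /** Reads one row across all families while holding the read lock. */
    Map<String, String> getRow(String rowKey) {
        lock.readLock().lock(); // many readers may hold this at the same time
        try {
            Map<String, String> row = new HashMap<>();
            for (Map.Entry<String, Map<String, String>> e : families.entrySet()) {
                String value = e.getValue().get(rowKey);
                if (value != null) {
                    row.put(e.getKey(), value);
                }
            }
            return row;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In this sketch a reader either runs before the load and sees none of the new data, or runs after it and sees every family. It can never observe one family loaded and another still missing, which is the partial-row symptom described in these notes.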
In our example, the system is an HBase-based email store that holds millions of emails, and an MR task runs concurrently to classify emails as spam.
MR users will find the counters familiar. In this example we are running a filter that scans for a dataset of only 500K records in a table with 50M rows.
Remember, the filter should always return 500K records.
This behavior is not limited to empty rows: depending on the number of versions, you can get old data too!
This is production, so you can’t stop the service just to try a workaround.