HBase Coprocessors allow user code to be deployed directly on HBase clusters. Coprocessors run within each region of a table and define an interface for client calls. Examples of coprocessors include distributed query processing and regular expression search. Coprocessors are loaded via configuration or table schema and provide hooks into various HBase operations like get, put, and scan calls as well as lifecycle events.
HBase Coprocessors: Deploy Shared Functionality Directly on the Cluster
1. HBase Coprocessors
Deploy shared functionality
directly on the cluster
O’Reilly Webcast
November 4th, 2011
2. About Me
• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Author of
HBase – The Definitive Guide
• Working with HBase since end
of 2007
• Organizer of the Munich OpenHUG
• Speaker at Conferences (Fosdem, Hadoop World)
3. Overview
• Coprocessors were added to Bigtable
– Mentioned during LADIS 2009 talk
• Runs user code within each region of a table
– Code splits and moves with the region
• Defines high level call interface for clients
• Calls addressed to rows or ranges of rows
• Implicit automatic scaling, load balancing, and
request routing
4. Example Use Cases
• Bigtable uses Coprocessors
– Scalable metadata management
– Distributed language model for machine translation
– Distributed query processing for full-text index
– Regular expression search in code repository
• MapReduce jobs over HBase are often map-only
jobs
– Row keys are already sorted and distinct
➜ Could be replaced by Coprocessors
5. HBase Coprocessors
• Inspired by Google’s Coprocessors
– Not much information available, but general idea is
understood
• Define various types of server-side code extensions
– Associated with table using a table property
– Attribute is a path to JAR file
– JAR is loaded when region is opened
– Blends new functionality with existing
• Can be chained with Priorities and Load Order
➜ Allows for dynamic RPC extensions
6. Coprocessor Classes and Interfaces
• The Coprocessor Interface
– All user code must implement this interface
• The CoprocessorEnvironment Interface
– Retains state across invocations
– Predefined classes
• The CoprocessorHost Class
– Ties state and user code together
– Predefined classes
7. Coprocessor Priority
• System or User
/** Highest installation priority */
static final int PRIORITY_HIGHEST = 0;
/** High (system) installation priority */
static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
/** Default installation prio for user coprocessors */
static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
/** Lowest installation priority */
static final int PRIORITY_LOWEST = Integer.MAX_VALUE;
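How priority and load order combine into an execution order can be sketched in plain Java. This is a toy model rather than HBase's own host code: the `Entry` class, the `loadSequence` field, and the observer names are hypothetical, and it assumes that smaller priority values run first with load order breaking ties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of coprocessor ordering; not HBase's own implementation. */
final class PriorityOrderSketch {
    // Values copied from the constants listed above.
    static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;

    /** Hypothetical record of one loaded coprocessor. */
    static final class Entry {
        final String name;
        final int priority;
        final int loadSequence; // position in which it was loaded

        Entry(String name, int priority, int loadSequence) {
            this.name = name;
            this.priority = priority;
            this.loadSequence = loadSequence;
        }
    }

    /** Assumption: smaller priority values run first; ties fall back to load order. */
    static List<String> executionOrder(List<Entry> loaded) {
        List<Entry> sorted = new ArrayList<>(loaded);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.priority)
                .thenComparingInt(e -> e.loadSequence));
        List<String> names = new ArrayList<>();
        for (Entry e : sorted) {
            names.add(e.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Entry> loaded = Arrays.asList(
                new Entry("IndexObserver", PRIORITY_USER, 0),
                new Entry("AccessObserver", PRIORITY_SYSTEM, 1),
                new Entry("AuditObserver", PRIORITY_USER, 2));
        // Prints [AccessObserver, IndexObserver, AuditObserver]
        System.out.println(executionOrder(loaded));
    }
}
```

The system-priority entry moves to the front even though it was loaded second, which is what lets a mediating coprocessor run before user code.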
11. Coprocessor Interface
• Base for all other types of Coprocessors
• start() and stop() methods for lifecycle
management
• State as defined in the interface:
12. Observer Classes
• Comparable to database triggers
– Callback functions/hooks for every explicit API
method, but also all important internal calls
• Concrete Implementations
– MasterObserver
• Hooks into HMaster API
– RegionObserver
• Hooks into Region related operations
– WALObserver
• Hooks into write-ahead log operations
13. Region Observers
• Can mediate (veto) actions
– Used by the security policy extensions
– Priority allows mediators to run first
• Hooks into all CRUD+S API calls and more
– get(), put(), delete(), scan(), increment(),…
– checkAndPut(), checkAndDelete(),…
– flush(), compact(), split(),…
• Pre/Post Hooks for every call
• Can be used to build secondary indexes, filters
14. Endpoint Classes
• Define a dynamic RPC protocol, used between
client and region server
• Executes arbitrary code, loaded in region server
– Future development will add code weaving/inspection
to deny any malicious code
• Steps to add your own methods
– Define and implement your own protocol
– Implement endpoint coprocessor
– Call HTable’s coprocessorExec() or coprocessorProxy()
15. Coprocessor Loading
• There are two ways: dynamic or static
– Static: use configuration files and table schema
– Dynamic: not available (yet)
• For static loading from configuration:
– Order is important (defines the execution order)
– Special property key for each host type
– Region related classes are loaded for all regions and
tables
– Priority is always System
– JAR must be on class path
17. Coprocessor Loading (cont.)
• For static loading from table schema:
– Definition per table
– For all regions of the table
– Only region related classes, not WAL or Master
– Added to HTableDescriptor, when table is created
or altered
– Allows setting the priority and JAR path
COPROCESSOR$<num>
<path-to-jar>|<classname>|<priority>
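The value format above is easy to get wrong by hand, so a small parser may help when generating or checking table attributes. The sketch below is plain Java and does not call any HBase API; the class name `CoprocessorSpecSketch`, the JAR path, and the observer class name are hypothetical, and the sketch assumes all three fields are present.

```java
import java.util.regex.Pattern;

/** Parses the table-schema value format shown above; a sketch, not HBase code. */
final class CoprocessorSpecSketch {
    final String jarPath;
    final String className;
    final int priority;

    CoprocessorSpecSketch(String jarPath, String className, int priority) {
        this.jarPath = jarPath;
        this.className = className;
        this.priority = priority;
    }

    /** Splits "<path-to-jar>|<classname>|<priority>" into its three fields. */
    static CoprocessorSpecSketch parse(String value) {
        // "|" is a regex metacharacter, so quote it; -1 keeps empty fields.
        String[] parts = value.split(Pattern.quote("|"), -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected <path-to-jar>|<classname>|<priority>: " + value);
        }
        String className = parts[1].trim();
        if (className.isEmpty()) {
            throw new IllegalArgumentException("missing class name: " + value);
        }
        return new CoprocessorSpecSketch(
                parts[0].trim(), className, Integer.parseInt(parts[2].trim()));
    }

    /** Renders the spec back into the attribute value format. */
    String format() {
        return jarPath + "|" + className + "|" + priority;
    }

    public static void main(String[] args) {
        // Hypothetical JAR path and class name; priority equals PRIORITY_USER.
        CoprocessorSpecSketch spec = parse(
                "/hypothetical/path/observers.jar|com.example.IndexObserver|"
                        + (Integer.MAX_VALUE / 2));
        System.out.println(spec.className + " from " + spec.jarPath
                + " at priority " + spec.priority);
    }
}
```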
21. Region Observers
• Handles all region related events
• Hooks for two classes of operations:
– Lifecycle changes
– Client API Calls
• All client API calls have a pre/post hook
– Can be used to grant access on preGet()
– Can be used to update secondary indexes on
postPut()
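The two uses named above, granting access in a pre-hook and maintaining a secondary index in a post-hook, can be modeled without HBase at all. Everything in this sketch is hypothetical (the `Observer` interface, `ToyRegion`, the user names); it only illustrates where the hooks sit relative to the actual operation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of pre/post hooks; none of these types are HBase classes. */
final class HookSketch {
    /** Hypothetical observer with one pre-hook and one post-hook. */
    interface Observer {
        /** Throwing here vetoes the read before it touches the store. */
        default void preGet(String user, String row) {}

        /** Runs after the write has been applied. */
        default void postPut(String row, String value) {}
    }

    /** Stand-in for a region: a map plus the registered observers. */
    static final class ToyRegion {
        private final Map<String, String> store = new HashMap<>();
        private final List<Observer> observers = new ArrayList<>();

        void register(Observer observer) {
            observers.add(observer);
        }

        void put(String row, String value) {
            store.put(row, value);
            for (Observer o : observers) {
                o.postPut(row, value);
            }
        }

        String get(String user, String row) {
            for (Observer o : observers) {
                o.preGet(user, row);
            }
            return store.get(row);
        }
    }

    /** Keeps a value-to-row index up to date on every put. */
    static final class IndexObserver implements Observer {
        final Map<String, String> valueToRow = new HashMap<>();

        @Override
        public void postPut(String row, String value) {
            valueToRow.put(value, row);
        }
    }

    /** Only lets the hypothetical user "admin" read. */
    static final class AccessObserver implements Observer {
        @Override
        public void preGet(String user, String row) {
            if (!"admin".equals(user)) {
                throw new SecurityException(user + " may not read " + row);
            }
        }
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        IndexObserver index = new IndexObserver();
        region.register(new AccessObserver());
        region.register(index);
        region.put("row1", "blue");
        System.out.println(index.valueToRow.get("blue")); // row1
        System.out.println(region.get("admin", "row1"));  // blue
    }
}
```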
22. Handling Region Lifecycle Events
• Hook into pending open, open, and pending close
state changes
• Called implicitly by the framework
– preOpen(), postOpen(),…
• Used to piggyback or fail the process, e.g.
– Cache warm up after a region opens
– Suppress region splitting, compactions, flushes
24. Special Hook Parameter
public interface RegionObserver extends Coprocessor {
  /**
   * Called before the region is reported as open to the master.
   * @param c the environment provided by the region server
   */
  void preOpen(final ObserverContext<RegionCoprocessorEnvironment> c);

  /**
   * Called after the region is reported as open to the master.
   * @param c the environment provided by the region server
   */
  void postOpen(final ObserverContext<RegionCoprocessorEnvironment> c);
  // ...
}
26. Chain of Command
• The complete() and bypass() methods in particular let you change the processing chain
– complete() ends the chain at the current coprocessor
– bypass() completes the pre/post chain but uses the last value returned by the coprocessors, possibly not calling the actual API method (for pre-hooks)
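The effect of the two calls on a pre-hook chain can be sketched as a simplified model. The `Context` and `PreGetHook` types below are hypothetical stand-ins, not the HBase ObserverContext, and the flags are kept on one shared context for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Simplified model of complete() and bypass() for a chain of pre-hooks. */
final class ChainSketch {
    /** Hypothetical context passed to every hook in the chain. */
    static final class Context {
        private boolean bypass;
        private boolean complete;

        void bypass() { bypass = true; }
        void complete() { complete = true; }
    }

    interface PreGetHook {
        /** @param previous value returned by the hook before this one, or null */
        String preGet(Context ctx, String row, String previous);
    }

    static String get(List<PreGetHook> chain, String row,
                      Function<String, String> coreGet) {
        Context ctx = new Context();
        String fromHooks = null;
        for (PreGetHook hook : chain) {
            fromHooks = hook.preGet(ctx, row, fromHooks);
            if (ctx.complete) {
                break; // complete(): hooks further down the chain are skipped
            }
        }
        // bypass(): keep the last hook value and skip the core operation
        return ctx.bypass ? fromHooks : coreGet.apply(row);
    }

    public static void main(String[] args) {
        // Hypothetical hook that answers reads of the row "hot" itself.
        PreGetHook cache = (ctx, row, previous) -> {
            if (row.equals("hot")) {
                ctx.bypass();
                return "cached:" + row;
            }
            return previous;
        };
        List<PreGetHook> chain = Arrays.asList(cache);
        Function<String, String> coreGet = row -> "stored:" + row;
        System.out.println(get(chain, "hot", coreGet));  // cached:hot
        System.out.println(get(chain, "cold", coreGet)); // stored:cold
    }
}
```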
28. Master Observer
• Handles all HMaster related events
– DDL type calls, e.g. create table, add column
– Region management calls, e.g. move, assign
• Pre/post hooks with Context
• Specialized environment provided
30. Master Services (cont.)
• Very powerful features
– Access the AssignmentManager to modify plans
– Access the MasterFileSystem to create or access
resources on HDFS
– Access the ServerManager to get the list of known
servers
– Use the ExecutorService to run system-wide
background processes
• Be careful (for now)!
31. Example: Master Post Hook
public class MasterObserverExample
    extends BaseMasterObserver {
  @Override
  public void postCreateTable(
      ObserverContext<MasterCoprocessorEnvironment> env,
      HRegionInfo[] regions, boolean sync)
      throws IOException {
    // Derive the table name from the first region of the new table.
    String tableName =
        regions[0].getTableDesc().getNameAsString();
    // Reach the master's file system through the master services.
    MasterServices services =
        env.getEnvironment().getMasterServices();
    MasterFileSystem masterFileSystem =
        services.getMasterFileSystem();
    FileSystem fileSystem = masterFileSystem.getFileSystem();
    // Relative path, so the directory is created under the user's HDFS home.
    Path blobPath = new Path(tableName + "-blobs");
    fileSystem.mkdirs(blobPath);
  }
}
32. Example Output
hbase(main):001:0> create 'testtable', 'colfam1'
0 row(s) in 0.4300 seconds
$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - larsgeorge supergroup 0 ...
/user/larsgeorge/testtable-blobs
33. Endpoints
• Dynamic RPC extends server-side functionality
– Useful for MapReduce like implementations
– Handles the Map part server-side, Reduce needs
to be done client side
• Based on CoprocessorProtocol interface
• Routing to regions is based on either single
row keys, or row key ranges
– Call is sent, no matter if row exists or not since
region start and end keys are coarse grained
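The routing and the split between server-side "map" and client-side "reduce" can be sketched with a sorted map of region start keys. This is a plain-Java model, not the HBase client: `countPerRegion`, `total`, and `sampleTable` are hypothetical names, and treating the end row as inclusive is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of endpoint routing by row key range; not the HBase client. */
final class EndpointRoutingSketch {
    /**
     * Selects every region whose key range overlaps [startRow, endRow] and lets
     * each one count its own rows. A null endRow means "to the end of the table".
     */
    static Map<String, Long> countPerRegion(
            NavigableMap<String, SortedSet<String>> regions,
            String startRow, String endRow) {
        if (endRow != null && endRow.compareTo(startRow) < 0) {
            throw new IllegalArgumentException("endRow sorts before startRow");
        }
        // The region holding startRow has the greatest start key <= startRow.
        String first = regions.floorKey(startRow);
        if (first == null) {
            throw new IllegalArgumentException("no region covers " + startRow);
        }
        NavigableMap<String, SortedSet<String>> targets = (endRow == null)
                ? regions.tailMap(first, true)
                : regions.subMap(first, true, endRow, true);
        Map<String, Long> partial = new LinkedHashMap<>();
        for (Map.Entry<String, SortedSet<String>> region : targets.entrySet()) {
            // "Map" step: runs where the data lives and returns a single number.
            partial.put(region.getKey(), (long) region.getValue().size());
        }
        return partial;
    }

    /** "Reduce" step: performed by the client over the per-region results. */
    static long total(Map<String, Long> partial) {
        long sum = 0;
        for (long count : partial.values()) {
            sum += count;
        }
        return sum;
    }

    /** Two regions keyed by start key; "" marks the first region of the table. */
    static NavigableMap<String, SortedSet<String>> sampleTable() {
        NavigableMap<String, SortedSet<String>> regions = new TreeMap<>();
        regions.put("", new TreeSet<>(Arrays.asList("row1", "row2")));
        regions.put("row3", new TreeSet<>(Arrays.asList("row3", "row4", "row5")));
        return regions;
    }

    public static void main(String[] args) {
        Map<String, Long> all = countPerRegion(sampleTable(), "", null);
        System.out.println(all + " total=" + total(all));
        // "row9" does not exist, yet the call is still routed to region "row3".
        System.out.println(countPerRegion(sampleTable(), "row9", null));
    }
}
```

Because the range only selects regions, a call for a narrow range still runs over the whole of each selected region in this model, which mirrors the coarse-grained routing described above.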
34. Custom Endpoint Implementation
• Involves two steps:
– Extend the CoprocessorProtocol interface
• Defines the actual protocol
– Extend the BaseEndpointCoprocessor
• Provides the server-side code and the dynamic RPC
method
35. Example: Row Count Protocol
public interface RowCountProtocol
extends CoprocessorProtocol {
long getRowCount()
throws IOException;
long getRowCount(Filter filter)
throws IOException;
long getKeyValueCount()
throws IOException;
}
36. Example: Endpoint for Row Count
public class RowCountEndpoint
    extends BaseEndpointCoprocessor
    implements RowCountProtocol {
  private long getCount(Filter filter,
      boolean countKeyValues) throws IOException {
    Scan scan = new Scan();
    scan.setMaxVersions(1);
    if (filter != null) {
      scan.setFilter(filter);
    }
37. Example: Endpoint for Row Count
    RegionCoprocessorEnvironment environment =
        (RegionCoprocessorEnvironment) getEnvironment();
    // Use an internal scanner to perform the scan.
    InternalScanner scanner =
        environment.getRegion().getScanner(scan);
    long result = 0;
38. Example: Endpoint for Row Count
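    try {
      List<KeyValue> curVals =
          new ArrayList<KeyValue>();
      boolean hasMore = false;
      do {
        curVals.clear();
        // next() returns true while more rows remain after this one.
        hasMore = scanner.next(curVals);
        // Skip the empty batch that an exhausted scanner returns.
        if (!curVals.isEmpty()) {
          result += countKeyValues ? curVals.size() : 1;
        }
      } while (hasMore);
    } finally {
      scanner.close();
    }
    return result;
  }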
39. Example: Endpoint for Row Count
  @Override
  public long getRowCount() throws IOException {
    return getRowCount(new FirstKeyOnlyFilter());
  }

  @Override
  public long getRowCount(Filter filter) throws IOException {
    return getCount(filter, false);
  }

  @Override
  public long getKeyValueCount() throws IOException {
    return getCount(null, true);
  }
}
40. Endpoint Invocation
• There are two ways to invoke the call
– By Proxy, using HTable.coprocessorProxy()
• Uses a delayed model, i.e. the call is sent when the proxied
method is invoked
– By Exec, using HTable.coprocessorExec()
• The call is sent in parallel to all regions and the results are
collected immediately
• The Batch.Call class is used by coprocessorExec()
to wrap the calls per region
• The optional Batch.Callback can be used to react
upon completion of the remote call
42. Example: Invocation by Exec
public static void main(String[] args) throws IOException {
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "testtable");
  try {
    Map<byte[], Long> results =
        table.coprocessorExec(RowCountProtocol.class, null, null,
            new Batch.Call<RowCountProtocol, Long>() {
              @Override
              public Long call(RowCountProtocol counter)
                  throws IOException {
                return counter.getRowCount();
              }
            });
43. Example: Invocation by Exec
    long total = 0;
    for (Map.Entry<byte[], Long> entry :
        results.entrySet()) {
      total += entry.getValue().longValue();
      System.out.println("Region: " +
          Bytes.toString(entry.getKey()) +
          ", Count: " + entry.getValue());
    }
    System.out.println("Total Count: " + total);
  } catch (Throwable throwable) {
    throwable.printStackTrace();
  } finally {
    // Release the client-side resources held by the table.
    table.close();
  }
}
44. Example Output
Region: testtable,,1303417572005.51f9e2251c...cbcb0c66858f., Count: 2
Region: testtable,row3,1303417572005.7f3df4dcba...dbc99fce5d87., Count: 3
Total Count: 5
45. Batch Convenience
• Batch.forMethod() helps to quickly map a
protocol function into a Batch.Call
• Useful for single method calls to the servers
• Uses the Java reflection API to retrieve the
named method
• Saves you from implementing the anonymous
inline class
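The reflection idea behind such a helper can be sketched in plain Java. This is not the HBase source: the `Call` and `RowCounter` interfaces, `forMethod`, and `fixedCounter` are hypothetical stand-ins written for this example, and the lookup assumes argument types match the declared parameter types exactly.

```java
import java.lang.reflect.Method;

/** Plain-Java sketch of turning a method name into a reusable call object. */
final class ForMethodSketch {
    /** Hypothetical counterpart of a per-region call wrapper. */
    interface Call<T, R> {
        R call(T instance) throws Exception;
    }

    /** Hypothetical protocol, loosely modeled on the row count example. */
    interface RowCounter {
        long getRowCount();
        long getKeyValueCount();
    }

    /**
     * Looks the method up once by name and returns a Call that invokes it.
     * Limitation: arguments are matched by their runtime class, so methods
     * declared with primitive parameters will not be found.
     */
    static <T, R> Call<T, R> forMethod(Class<T> protocol, String methodName,
                                       Object... args) throws NoSuchMethodException {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        final Method method = protocol.getMethod(methodName, types);
        return instance -> {
            @SuppressWarnings("unchecked")
            R result = (R) method.invoke(instance, args);
            return result;
        };
    }

    /** Stand-in for the server-side implementation in one region. */
    static RowCounter fixedCounter(final long rows, final long keyValues) {
        return new RowCounter() {
            @Override
            public long getRowCount() { return rows; }

            @Override
            public long getKeyValueCount() { return keyValues; }
        };
    }

    public static void main(String[] args) throws Exception {
        Call<RowCounter, Long> rowCount = forMethod(RowCounter.class, "getRowCount");
        // The same Call object can be applied to each region's implementation.
        System.out.println(rowCount.call(fixedCounter(2, 4))); // 2
        System.out.println(rowCount.call(fixedCounter(3, 7))); // 3
    }
}
```

The call site shrinks to one line per method, at the cost of losing compile-time checking of the method name.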
47. Call Multiple Endpoints
• Sometimes you need to call more than one
endpoint in a single roundtrip call to the
servers
• This requires an anonymous inline class, since
Batch.forMethod cannot handle this
50. Questions?
• Contact:
Email: lars@cloudera.com
Twitter: @larsgeorge
• Talk at Hadoop World, November 8th & 9th
51. Special Offer for Webcast Attendees
Visit http://oreilly.com to purchase your copy of HBase: The Definitive Guide and enter code 4CAST to save 40% off the print book and 50% off the ebook.