In order to effectively predict and prevent online fraud in real time, Sift Science stores hundreds of terabytes of data in HBase—and needs it to be always available. This talk will cover how we used circuit-breaking, cluster failover, monitoring, and automated recovery procedures to improve our HBase uptime from 99.7% to 99.99% on top of unreliable cloud hardware and networks.
2. What is Sift Science
Sift Science protects online businesses
from fraud using real-time machine
learning.
We work with hundreds of customers
across a range of verticals, countries,
and fraud types.
3. What is Sift Science?
[Diagram: a customer's backend streams user activity events to Sift ("bob added credit card", "bob opened app", "bob loaded cart page"); Sift returns a fraud score.]
4. HBase at Sift
We use HBase to store all user-level
data—hundreds of terabytes.
We make hundreds of thousands of
requests per second to our online
HBase clusters.
Producing a risk score for a user may
require dozens of HBase queries.
600TB ● 48K regions ● 250 servers
5. Why HBase
• Scalable to millions of requests per second and
petabytes of data
• Strictly consistent writes and reads
• Supports write-heavy workloads
• Highly available …in theory
12. Symptom:
When a single region server became unavailable
or slow, our application would stop doing work.
13. Replicating the issue
with Chaos Engineering
• Killing processes
• Killing servers
• Partitioning the network
• Throttling network on HBase port
14. Replicating the issue
with Chaos Engineering
$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: \
    protocol ip prio 50 u32 \
    match ip protocol 6 0xff \
    match ip dport 60020 0xffff \
    police rate 50kbit burst 10k drop flowid :1
Throttles inbound traffic on the HBase region server port (60020) to 50 kbit/s
(don't try this on your production cluster)
15. What’s going on?
Profiling showed that all threads are
stuck waiting on HBase.
Even though just one HBase server is
down, our request volume is so high
that all handler threads eventually hit
that server and get stuck.
[Chart: application thread states over time — runnable, blocked, waiting]
16. Circuit Breaking
A pattern in distributed systems where
clients monitor the health of the servers
they communicate with.
If too many requests fail, the circuit
breaker trips and requests fail
immediately.
A small fraction of requests are let
through to gauge when the circuit
becomes healthy again.
[State diagram: Closed → Open when the breaker trips; while Open, requests fail fast; Open → Half-Open lets a trial request through; a successful request closes the circuit, a failed request trips the breaker again.]
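Hystrix, which we use for the client integration shown on the following slides, exposes these trip-and-recover thresholds as command properties. A minimal sketch, with illustrative values rather than our production settings:

// Illustrative Hystrix circuit-breaker tuning; the numbers are assumptions, not our settings.
HystrixCommandProperties.Setter breakerProperties = HystrixCommandProperties.Setter()
    // only evaluate the breaker once at least 20 requests are seen in the rolling window
    .withCircuitBreakerRequestVolumeThreshold(20)
    // trip the breaker when 50% of requests in the window fail
    .withCircuitBreakerErrorThresholdPercentage(50)
    // after 5 seconds open, let a single trial request through (half-open) to probe recovery
    .withCircuitBreakerSleepWindowInMilliseconds(5000);

These defaults can be attached, via andCommandPropertiesDefaults(), to the per-server HystrixCommand.Setter built in prepare() on the next slides.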
17. How well does this work?
Very effective when one region server is unhealthy.
[Chart: circuit breaker vs. control during the fault-injection experiment]
18. Circuit Breaking in hbase-client
Subclass RpcRetryingCaller / DelegatingRetryingCallable
private static class HystrixRegionServerCallable<R> extends
    DelegatingRetryingCallable<R, RegionServerCallable<R>> {

  // Fields implied by the usage below; the constructor (not shown on this slide)
  // takes the delegate callable plus concurrency and timeout limits.
  private ServerName server;
  private HystrixCommand.Setter setter;

  @Override
  public void prepare(boolean reload) throws IOException {
    delegate.prepare(reload);
    // Work out which region server this call is headed to, so each server
    // gets its own circuit breaker (keyed by host:port).
    if (delegate instanceof MultiServerCallable) {
      server = ((MultiServerCallable) delegate).getServerName();
    } else {
      HRegionLocation location = delegate.getLocation();
      server = location.getServerName();
    }
    setter = HystrixCommand.Setter
        .withGroupKey(HystrixCommandGroupKey.Factory.asKey(REGIONSERVER_KEY))
        .andCommandKey(HystrixCommandKey.Factory.asKey(server.getHostAndPort()));
  }
}
19. Circuit Breaking in hbase-client
Subclass RpcRetryingCaller / DelegatingRetryingCallable
private static class HystrixRegionServerCallable<R> extends
    DelegatingRetryingCallable<R, RegionServerCallable<R>> {

  @Override
  public R call(final int timeout) throws Exception {
    if (setter != null) {
      try {
        // Run the underlying HBase RPC inside a HystrixCommand so failures
        // count against this region server's circuit breaker.
        return new HystrixCommand<R>(setter) {
          @Override
          public R run() throws Exception {
            return delegate.call(timeout);
          }
        }.execute();
      } catch (HystrixRuntimeException e) {
        log.debug("Failed", e);
        if (e.getFailureType() == HystrixRuntimeException.FailureType.SHORTCIRCUIT) {
          // Breaker is open: fail fast and tell HBase not to retry this server.
          throw new DoNotRetryRegionException(e.getMessage());
        } else if (e.getCause() instanceof Exception) {
          // Unwrap the underlying exception so normal HBase retry logic applies.
          throw (Exception) e.getCause();
        }
        throw e;
      }
    } else {
      // No server resolved yet (prepare() not called); bypass the breaker.
      return delegate.call(timeout);
    }
  }
}
20. Circuit Breaking in hbase-client
Subclass RpcRetryingCaller
public static class HystrixRpcCaller<T> extends RpcRetryingCaller<T> {

  // maxConcurrentReqs and timeout are fields set in the constructor (not shown here)

  @Override
  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException, RuntimeException {
    return super.callWithRetries(wrap(callable), callTimeout);
  }

  @Override
  public T callWithoutRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException {
    return super.callWithoutRetries(wrap(callable), callTimeout);
  }

  // Wrap region server calls with the circuit-breaking callable; pass others through untouched.
  private RetryingCallable<T> wrap(RetryingCallable<T> callable) {
    if (callable instanceof RegionServerCallable) {
      return new HystrixRegionServerCallable<>(
          (RegionServerCallable<T>) callable, maxConcurrentReqs, timeout);
    }
    return callable;
  }
}
21. Circuit Breaking in hbase-client
Subclass RpcRetryingCallerFactory
public class HystrixRpcCallerFactory extends RpcRetryingCallerFactory {

  public HystrixRpcCallerFactory(Configuration conf) {
    super(conf);
  }

  @Override
  public <T> RpcRetryingCaller<T> newCaller() {
    return new HystrixRpcCaller<>(conf);
  }
}

// override the caller factory in HBase config
conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
    HystrixRpcCallerFactory.class.getCanonicalName());
23. Replication
Circuit breaking helps us avoid
downtime when a small number of
region servers are unhealthy.
Replication allows us to recover quickly
when the entire cluster is unhealthy.
This most often occurs due to HDFS
issues or HBase metadata issues.
[Diagram: the application holds a primary connection to cluster 1 and a fallback connection to cluster 2; the clusters replicate to each other, and zookeeper records that cluster 1 is primary.]
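One way to set up the cluster-to-cluster copy is HBase's built-in replication. A minimal sketch using the HBase 1.x ReplicationAdmin API (the peer id and ZooKeeper quorum are hypothetical, and this is not necessarily how our clusters are configured):

// Sketch: register cluster 2 as a replication peer of cluster 1.
void addClusterTwoAsPeer(Configuration conf) throws Exception {
  try (ReplicationAdmin replicationAdmin = new ReplicationAdmin(conf)) {
    ReplicationPeerConfig peer = new ReplicationPeerConfig()
        .setClusterKey("zk1.cluster2,zk2.cluster2,zk3.cluster2:2181:/hbase");
    // A null table-CF map replicates every table whose column families were
    // created with REPLICATION_SCOPE => 1.
    replicationAdmin.addPeer("cluster2", peer, null);
  }
}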
24. Replication
We keep active connections to all
clusters to enable fast switching. A
zookeeper-backed connection provider
is responsible for handing out
connections to the current cluster.
If we see a high error rate from a
cluster, we can quickly switch to the
other while we investigate and fix.
This also allows us to take a full
cluster offline without downtime,
speeding up our ability to roll out new
configurations and HBase code.
[Diagram: the same two-cluster setup after failover — zookeeper now records cluster 2 as primary, so the application's primary connection points at cluster 2 and cluster 1 becomes the fallback.]
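A rough sketch of what such a zookeeper-backed connection provider might look like (class, method, and znode names are made up for illustration; this is not our actual implementation). One HBase Connection stays open per cluster, and the name of the active cluster is cached locally and refreshed by a ZooKeeper watch:

public class ActiveClusterConnectionProvider implements Watcher {
  private static final String ACTIVE_CLUSTER_ZNODE = "/hbase-failover/active"; // hypothetical path

  private final ZooKeeper zk;
  private final Map<String, Connection> connectionsByCluster; // e.g. "cluster1" -> Connection
  private volatile String activeCluster;

  public ActiveClusterConnectionProvider(ZooKeeper zk,
      Map<String, Connection> connectionsByCluster) throws Exception {
    this.zk = zk;
    this.connectionsByCluster = connectionsByCluster;
    refreshActiveCluster();
  }

  /** Returns an already-open Connection to whichever cluster is currently primary. */
  public Connection activeConnection() {
    return connectionsByCluster.get(activeCluster);
  }

  private void refreshActiveCluster() throws Exception {
    // Read the primary cluster's name and re-register this provider as the watcher.
    byte[] data = zk.getData(ACTIVE_CLUSTER_ZNODE, this, null);
    activeCluster = new String(data, StandardCharsets.UTF_8);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDataChanged) {
      try {
        refreshActiveCluster(); // failover: start handing out the other cluster's connection
      } catch (Exception e) {
        log.warn("Failed to refresh active cluster", e); // real code would retry
      }
    }
  }
}

Because every cluster's connection is already open, switching primaries just changes which cached connection is handed out.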
25. Replication
Fail over between clusters takes less
than a second across our entire
application fleet.
Connection configuration is also stored
in zookeeper, so we can add and
remove clusters without code changes
or restarts.
[Chart: requests per region server during the switch]
26. Replication
To verify inter-cluster consistency we
rely on MapReduce jobs and online
client-side verification.
We automatically send a small
percentage of non-mutating requests to
the non-active clusters using a custom
subclass of HTable, comparing the
responses to those from the primary
cluster.
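A rough sketch of the online verification idea, written as a simple decorator over the Table interface rather than the HTable subclass we actually use (class and field names are illustrative):

public class VerifyingTable {
  private final Table primary;     // table on the active cluster
  private final Table secondary;   // same table on the fallback cluster
  private final double sampleRate; // e.g. 0.01 to verify 1% of reads

  public VerifyingTable(Table primary, Table secondary, double sampleRate) {
    this.primary = primary;
    this.secondary = secondary;
    this.sampleRate = sampleRate;
  }

  public Result get(Get get) throws IOException {
    Result primaryResult = primary.get(get);
    if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
      try {
        Result secondaryResult = secondary.get(get);
        // Throws with a description of the difference if the clusters disagree.
        Result.compareResults(primaryResult, secondaryResult);
      } catch (Exception e) {
        // Surface the mismatch without ever failing the caller's request.
        log.warn("Cross-cluster verification mismatch for " + get, e);
      }
    }
    return primaryResult;
  }
}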
28. Monitoring
We collect detailed metrics from HBase
region servers and masters using
scollector. Metrics are sent to
OpenTSDB and a separate HBase
cluster.
We also use scollector to run hbck and
parse the output into metrics.
Metrics are queried by Bosun for
alerting and Grafana for visualization.
[Diagram: scollector agents on region servers and masters send metrics through TSD relays to write TSDs backed by a dedicated metrics HBase cluster; Bosun and Grafana query the read TSDs.]
30. Monitoring
99p latencies (from region server
metrics) can show region servers that
are unhealthy due to GC, imbalance, or
underlying hardware issues.
31. Monitoring
We closely track percent_files_local
(from region server metrics) because
performance and stability are affected
by poor locality.
32. Monitoring
Inconsistent tables (reported by hbck)
can show underlying HBase metadata
issues. Here a region server failed,
causing many tables to become
inconsistent. Most recovered, but one
did not until manual action was taken.
Some consistency issues can be fixed
by restarting masters, others require
running hbck fix commands.
33. Next steps
• Cross-datacenter replication and failover
• Automating recovery procedures (killing failing
nodes, restarting masters, running hbck commands)
• Automating provisioning of capacity