Fault Tolerance in a High Volume, Distributed System
1. Fault Tolerance in a High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen
2. Dozens of dependencies.
One going down takes everything down.
99.99%^30 ≈ 99.7% uptime
0.3% of 1 billion = 3,000,000 failures
2+ hours downtime/month
even if all dependencies have excellent uptime.
Reality is generally worse.
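The arithmetic on this slide can be reproduced with a short sketch. It assumes that every request touches all 30 dependencies, that dependencies fail independently, and that a month has 30 days; the class and method names are hypothetical and only for illustration.

```java
final class CompoundAvailability {
    /** Availability of a request that needs every one of n independent dependencies. */
    static double combined(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    /** Expected number of failed requests out of a given total. */
    static long expectedFailures(double uptime, long requests) {
        return Math.round((1.0 - uptime) * requests);
    }

    /** Expected downtime in hours, assuming a 30-day month (an assumption of this sketch). */
    static double monthlyDowntimeHours(double uptime) {
        return (1.0 - uptime) * 30 * 24;
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);
        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failures per 1 billion requests: %d%n",
                expectedFailures(uptime, 1_000_000_000L));
        System.out.printf("downtime per month: %.2f hours%n", monthlyDowntimeHours(uptime));
    }
}
```

The results land close to the rounded figures on the slide: roughly 99.7% uptime, about 3 million failures per billion requests, and a little over 2 hours of downtime per month.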
14. Separate Threads: Limited Concurrency
try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // We are at the property-defined max, so throw RejectedExecutionException
        // to simulate reaching the real max and follow the same code path and behavior.
        throw new RejectedExecutionException(
                "Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    // ... define Callable that performs executeCommand() ...
    // submit the work to the thread pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected, so return the fallback
    return getFallback();
}
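The fragment above depends on classes that the slide does not show. As a self-contained illustration of the same idea, the following sketch uses only `java.util.concurrent`: a fixed-size pool with no queue rejects work once all threads are busy, and the rejection is turned into a fallback. The class name, pool size, and the `rejections` counter (standing in for `markThreadPoolRejection()`) are assumptions of this sketch, not the implementation from the talk.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedPoolSketch {
    // Hypothetical sizing: 2 threads and no queue, so a third concurrent call is rejected.
    // The default handler (AbortPolicy) throws RejectedExecutionException.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());

    // Stand-in for circuitBreaker.markThreadPoolRejection() in the slide.
    final AtomicInteger rejections = new AtomicInteger();

    /** Submits the command, or returns an already-completed fallback if the pool is full. */
    Future<String> queue(Callable<String> command, String fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            rejections.incrementAndGet();
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Because the pool is bounded, a slow dependency can occupy at most two threads here; further callers get the fallback immediately instead of waiting.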
17. Separate Threads: Timeout
Override of Future.get()
public K get()
        throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
        // retrieve the fallback
        return getFallback();
    }
}
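The same "wait a bounded time, then fall back" pattern can be sketched against a plain `Future` from the standard library. The helper name and parameters are hypothetical, and the `cancel(true)` call is an addition of this sketch (the slide does not show whether the worker is interrupted on timeout).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {
    /**
     * Waits at most timeoutMillis for the result. On timeout, gives up on the
     * call and returns the fallback instead of blocking the caller further.
     */
    static <T> T getWithTimeout(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption of this sketch: interrupt the worker so its thread is freed.
            future.cancel(true);
            return fallback;
        }
    }
}
```

A caller would pass the `Future` returned by an executor, for example `TimeoutSketch.getWithTimeout(executor.submit(task), 250, defaultValue)`, where the 250 ms budget is only an illustrative value.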
26. Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”
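A tryable semaphore, the first technique listed on this slide, can be sketched with `java.util.concurrent.Semaphore`. The point is that `tryAcquire()` never blocks: when the concurrency limit is reached, the caller is sent to the fallback right away. The class name and the `Supplier`-based API are assumptions made for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class TryableSemaphoreSketch {
    private final Semaphore permits;

    TryableSemaphoreSketch(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call on the caller's thread if a permit is free; otherwise uses the fallback. */
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            // Limit reached: shed load immediately rather than queue or block.
            return fallback.get();
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```

Unlike a separate thread pool, this runs the call on the calling thread and cannot time it out, which is why the slide reserves it for “trusted” clients and fallbacks.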
30. Benefits of Separate Threads
Protection from client libraries
Lower risk to accept new/updated clients
Quick recovery from failure
Client misconfiguration
Client service performance characteristic changes
Built-in concurrency
31. Drawbacks of Separate Threads
Some computational overhead
Load on machine can be pushed too far
...
Benefits outweigh drawbacks when clients are “untrusted”
33. Visualizing Circuits in Realtime
(generally sub-second latency)
Video available at https://vimeo.com/33576628
34. Rolling 10 second counter – 1 second granularity
Latency shown as: Median, Mean, 90th, 99th, 99.5th
Counters shown: Latent, Error, Timeout, Rejected
Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
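A rolling 10-second counter with 1-second buckets and the error-percentage formula from this slide can be sketched as follows. This is a minimal illustration, not the implementation from the talk: the class name, the event enum, the coarse `synchronized` locking, and passing the clock in as a parameter are all assumptions, and the clock is assumed not to run backwards.

```java
import java.util.Arrays;

final class RollingErrorSketch {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10; // 10 buckets of 1 second each
    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // which second each bucket holds

    RollingErrorSketch() {
        Arrays.fill(bucketSecond, -1L); // -1 marks a bucket that was never used
    }

    synchronized void record(Event event, long nowMillis) {
        long second = nowMillis / 1000;
        int i = (int) (second % BUCKETS);
        if (bucketSecond[i] != second) {
            // The bucket still holds counts from an older second: reset it.
            Arrays.fill(counts[i], 0L);
            bucketSecond[i] = second;
        }
        counts[i][event.ordinal()]++;
    }

    /** (error + timeout + rejected) / all events in the last 10 seconds, as a percentage. */
    synchronized double errorPercentage(long nowMillis) {
        long second = nowMillis / 1000;
        long failed = 0;
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketSecond[i] < 0 || second - bucketSecond[i] >= BUCKETS) {
                continue; // unused bucket, or older than the 10-second window
            }
            for (Event e : Event.values()) {
                long n = counts[i][e.ordinal()];
                total += n;
                if (e == Event.ERROR || e == Event.TIMEOUT || e == Event.REJECTED) {
                    failed += n;
                }
            }
        }
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }
}
```

Note that latent successes count toward the denominator only, matching the formula on the slide.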
49. Questions & More Information
Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen