At N26, we want to make sure we have resilience and fault tolerance built into our backend service-to-service calls. Our services used a combination of Hystrix, Retrofit, Retryer, and other tools to achieve this goal. However, Netflix recently announced that Hystrix is no longer under active development, so we needed to come up with a replacement that maintains the same level of functionality. Since Hystrix provided a big portion of our HTTP client resilience (including circuit breaking, connection thread pool thresholds, easy-to-add fallbacks, response caching, etc.), we used this announcement as a good opportunity to revisit our entire HTTP client resilience stack. We wanted to find a solution that consolidated our fragmented tooling into an easy-to-use and consistent approach.
This talk shares the approach we are currently implementing and the tools we analyzed while making the decision. Its aim is to give backend developers (primarily working in JVM languages) and SREs a comprehensive view of the state of the art in service-to-service call tooling (Resilience4j, Envoy, gRPC, Retrofit, etc.), the mechanisms that improve service-to-service call resiliency (timeouts, circuit breaking, adaptive concurrency limits, outlier detection, rate limiting, etc.), and a discussion of where these mechanisms should be implemented (client side, client-side side-car proxy, server-side side-car proxy, or server side).
6. Triggering Conditions - Change
New Rollouts
Planned Changes
Traffic Drains
Turndowns
public String getCountry(String userId) {
    try {
        // Try to get latest country to avoid stale info
        UserInfo userInfo = userInfoService.update(userId);
        updateCache(userInfo);
        ...
        return getCountryFromCache(userId);
    } catch (Exception e) {
        // Default to cache if service is down
        return getCountryFromCache(userId);
    }
}
The anatomy of a cascading failure
12. Resource Starvation - Dependencies Between Resources
Poorly tuned garbage collection → Slow requests → Increased CPU due to GC → More in-progress requests → More RAM due to queuing → Less RAM for caching → Lower cache hit rate → More requests to the backend → 🔥🔥🔥
The anatomy of a cascading failure
19. Architecture - Orchestration vs Choreography
Orchestration
Choreography
Strategies for Improving Resilience
[Diagram, orchestration: the Signup service calls the User service (create user), the Account service (create account), and the Card service (ship card) directly.]
[Diagram, choreography: the Signup service publishes a "user signup" event; the User, Account, and Card services subscribe to it.]
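To make the two diagrams concrete, here is a deliberately simplified Java sketch (the interfaces and the topic name are made up for illustration, not the actual N26 services):

// Hypothetical collaborator interfaces, for illustration only.
interface UserService { void createUser(String userId); }
interface AccountService { void createAccount(String userId); }
interface CardService { void shipCard(String userId); }
interface EventBus { void publish(String topic, String payload); }

class SignupService {

    private final UserService users;
    private final AccountService accounts;
    private final CardService cards;
    private final EventBus bus;

    SignupService(UserService users, AccountService accounts, CardService cards, EventBus bus) {
        this.users = users;
        this.accounts = accounts;
        this.cards = cards;
        this.bus = bus;
    }

    // Orchestration: the signup service drives every step itself, so it is coupled
    // to all downstream services and to their availability.
    void signupOrchestrated(String userId) {
        users.createUser(userId);
        accounts.createAccount(userId);
        cards.shipCard(userId);
    }

    // Choreography: the signup service only publishes an event; the user, account,
    // and card services subscribe and react on their own schedule.
    void signupChoreographed(String userId) {
        bus.publish("user-signup", userId);
    }
}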
20. Capacity Planning - Do I need it in the age of the cloud?
Helpful, but not sufficient to protect against cascading failures
Accuracy is overrated and expensive (especially for new services)
It’s (usually) ok (and cheaper) to overprovision at first
Strategies for Improving Resilience
21. Capacity Planning - More important things
Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹)
Auto-scaling and auto-healing
Robust architecture in the face of growing traffic (pub/sub helps)
Agree on SLIs and SLOs and monitor them closely
Strategies for Improving Resilience
22. Capacity Planning - If I do need it, then what do I do?
Business requirements
Critical services, and YOLO the rest
⚠ Seasonality 🎄🥚🦃
Use hardware resources to measure capacity instead of Requests Per Second:
● cost of request = CPU time it has consumed
● (on GC platforms) higher memory => higher CPU
Strategies for Improving Resilience
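As an illustration of the "cost of request = CPU time" idea above, here is a minimal sketch using the JDK's ThreadMXBean (the CpuCost class and the metric output are made up; it also assumes the request is handled on a single thread):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.function.Supplier;

public final class CpuCost {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    private CpuCost() {
    }

    // Wraps a request handler and reports how much CPU time it consumed,
    // so capacity can be tracked in CPU seconds instead of requests per second.
    public static <T> T measure(String endpoint, Supplier<T> handler) {
        long before = THREADS.getCurrentThreadCpuTime(); // nanoseconds of CPU time for this thread
        try {
            return handler.get();
        } finally {
            long cpuNanos = THREADS.getCurrentThreadCpuTime() - before;
            // In a real service this would go to your metrics backend, tagged by caller/endpoint.
            System.out.printf("endpoint=%s cpuMillis=%.3f%n", endpoint, cpuNanos / 1_000_000.0);
        }
    }
}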
23. “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Principles of Chaos Engineering
Chaos Testing
Strategies for Improving Resilience
24. Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency
● 🚫 GET with side-effects
● ✅ stateless if you can
Should you retry timeouts?
● Stay tuned to the next slides
Strategies for Improving Resilience
26. Retrying - Retry Budgets
Per-request retry budget
● Each request retried at most 3x
Per-client retry budget
● Retry requests = at most 10% total requests to upstream
● If > 10% of requests are failing => upstream is likely unhealthy
Strategies for Improving Resilience
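A rough sketch of the per-client budget described above (the 10% ratio comes from the slide; the class is made up, and a real implementation would decay the counters over a time window):

import java.util.concurrent.atomic.AtomicLong;

public final class RetryBudget {

    private final double maxRetryRatio; // e.g. 0.10: retries may be at most 10% of all requests
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    public RetryBudget(double maxRetryRatio) {
        this.maxRetryRatio = maxRetryRatio;
    }

    // Count every request sent to the upstream (first attempts and retries alike).
    public void onRequest() {
        requests.incrementAndGet();
    }

    // Called before each retry attempt. Returning false means the budget is exhausted:
    // retries already make up maxRetryRatio of the traffic, so the upstream is likely
    // unhealthy and retrying would only add load.
    public boolean tryAcquireRetry() {
        long totalRequests = requests.get();
        if (totalRequests == 0 || retries.get() + 1 > totalRequests * maxRetryRatio) {
            return false;
        }
        retries.incrementAndGet();
        return true;
    }
}

The per-request budget (at most 3 attempts per request) would sit on top of this check.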
27. Throttling - Timeouts
Nesting is 🔥👿🔥
Retries make ☝ worse
Timing out => the upstream service might still be processing the request
Maintain discipline when setting timeouts / propagate timeouts
Strategies for Improving Resilience
[Diagram: nested calls across Service A, Service B, Service C, and Service D with mismatched timeouts (3s, 5s, 2s), illustrating why timeouts must be propagated down the call chain.]
⚠ Avoid circular dependencies at all cost ⚠
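One way to "propagate timeouts" is to pass the remaining deadline to each downstream hop instead of letting every service pick its own number; a minimal sketch (the header name and the Deadline class are made up, not a particular library's API):

import java.time.Duration;
import java.time.Instant;

// Hypothetical deadline holder: the absolute point in time by which the original
// caller needs an answer. Each hop derives its own timeout from what is left.
public final class Deadline {

    public static final String HEADER = "x-request-deadline-millis"; // made-up header name

    private final Instant deadline;

    private Deadline(Instant deadline) {
        this.deadline = deadline;
    }

    // The entry point (e.g. Service A) decides the overall budget once.
    public static Deadline withTimeout(Duration budget) {
        return new Deadline(Instant.now().plus(budget));
    }

    // Downstream services (B, C, D) reconstruct the deadline from the incoming header
    // instead of picking their own, possibly larger, timeout.
    public static Deadline fromHeader(String epochMillis) {
        return new Deadline(Instant.ofEpochMilli(Long.parseLong(epochMillis)));
    }

    public String toHeader() {
        return Long.toString(deadline.toEpochMilli());
    }

    // Remaining time to use as the client timeout for the next hop.
    public Duration remaining() {
        Duration left = Duration.between(Instant.now(), deadline);
        if (left.isNegative() || left.isZero()) {
            throw new IllegalStateException("Deadline exceeded, do not call downstream");
        }
        return left;
    }
}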
28. Throttling - Rate Limiting
Avoid overload by clients and set per-client limits:
● requests from one calling service can use up to x CPU seconds per time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of a calling service and upstream
If this is too complicated => limit based on RPS/customer/endpoint
Strategies for Improving Resilience
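The simpler RPS-style fallback mentioned above can be sketched as a per-caller token bucket (illustrative only; a real setup would share the state across instances, e.g. in Redis, rather than keeping it in memory):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-caller token bucket: each calling service gets ratePerSecond
// requests per second with a small burst allowance; anything above is throttled.
public final class PerClientRateLimiter {

    private static final class Bucket {
        double tokens;
        long lastRefillNanos;
    }

    private final double ratePerSecond;
    private final double burst;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public PerClientRateLimiter(double ratePerSecond, double burst) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
    }

    public boolean tryAcquire(String callingService) {
        Bucket bucket = buckets.computeIfAbsent(callingService, key -> {
            Bucket b = new Bucket();
            b.tokens = burst;
            b.lastRefillNanos = System.nanoTime();
            return b;
        });
        synchronized (bucket) {
            long now = System.nanoTime();
            double elapsedSeconds = (now - bucket.lastRefillNanos) / 1_000_000_000.0;
            bucket.tokens = Math.min(burst, bucket.tokens + elapsedSeconds * ratePerSecond);
            bucket.lastRefillNanos = now;
            if (bucket.tokens < 1.0) {
                return false; // throttle: e.g. respond with 429, depending on policy
            }
            bucket.tokens -= 1.0;
            return true;
        }
    }
}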
29. Throttling - Circuit Breaking
Strategies for Improving Resilience
[State diagram: Closed → Open when failures reach the threshold (failures under the threshold keep it Closed); Open → Half Open after the reset timeout; Half Open → Closed on success, back to Open on failure; while Open, calls fail fast and raise a "circuit open" error.]
[Sequence diagram: Service A calls Service B through a circuit breaker; repeated timeouts trip the circuit, after which further calls are rejected immediately because the circuit is open.]
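To make the state machine concrete, here is a deliberately minimal (and not thread-safe) sketch of the three states and their transitions; production implementations such as Hystrix, Resilience4j, or Envoy track failure rates over sliding windows rather than consecutive failures:

import java.time.Duration;
import java.time.Instant;

public final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration resetTimeout;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration resetTimeout) {
        this.failureThreshold = failureThreshold;
        this.resetTimeout = resetTimeout;
    }

    public boolean allowRequest() {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(resetTimeout) >= 0) {
                state = State.HALF_OPEN; // reset timeout elapsed: let one probe through
                return true;
            }
            return false;                // fail fast: "circuit open"
        }
        return true;                     // CLOSED or HALF_OPEN
    }

    public void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;            // a successful probe closes the circuit again
    }

    public void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;          // threshold reached (or probe failed): trip the circuit
            openedAt = Instant.now();
        }
    }
}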
31. Fallbacks and Rejection
Cache
Dead letter queues for writes
Return hard-coded value
Empty Response (“Fail Silent”)
User experience
⚠ Make sure to discuss these with your product owners ⚠
Strategies for Improving Resilience
38. Hystrix - Testing
Scope:
● Unit tests easy for circuit opening/closing, fallbacks
● Integration tests reasonably easy for caching
Caveats:
● If you are using response caching, DO NOT FORGET to test HystrixRequestContext
● Depending on the errors thrown by the call, you might need to test circuit tripping (HystrixRuntimeException vs HystrixBadRequestException)
● If you’re not careful, you might accidentally set the same HystrixCommandGroupKey and HystrixCommandKey for different commands
Tools for Improving Resilience
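A sketch of the HystrixRequestContext caveat above, assuming JUnit 5 on the test classpath; GetCountryCommand is a made-up example command, not an actual N26 one:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

class GetCountryCommand extends HystrixCommand<String> {

    private final String userId;

    GetCountryCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserInfo"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        return "DE"; // would normally call the user info service
    }

    @Override
    protected String getCacheKey() {
        return userId; // enables request-scoped response caching
    }
}

class GetCountryCommandCacheTest {

    @Test
    void responseIsCachedWithinOneRequestContext() {
        // Response caching only works inside an initialized HystrixRequestContext;
        // without it Hystrix complains that the request context is missing.
        HystrixRequestContext context = HystrixRequestContext.initializeContext();
        try {
            new GetCountryCommand("user-1").execute();
            GetCountryCommand second = new GetCountryCommand("user-1");
            second.execute();
            Assertions.assertTrue(second.isResponseFromCache());
        } finally {
            context.shutdown(); // forgetting this leaks request state between tests
        }
    }
}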
39. Hystrix - Adoption Considerations
✅: Observability; Mostly easy to test
🚫: No longer supported; Not language agnostic
⚠: Forces you towards building thick clients; Tricky to enforce on calling services; Cumbersome to configure HystrixRequestContext
Tools for Improving Resilience
41. Resilience4j - Topology
Tools for Improving Resilience
[Diagram: a client and its calls to dependencies A–F.]
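For context, wrapping a single dependency call with Resilience4j looks roughly like this (a minimal sketch assuming the resilience4j-all module and its Vavr dependency; the name dependencyA and the fallback value are placeholders):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.vavr.control.Try;

import java.util.function.Supplier;

public class DependencyAClient {

    private final CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("dependencyA");
    private final Retry retry = Retry.ofDefaults("dependencyA");

    public String getValue() {
        // Compose resilience decorators around the plain remote call.
        Supplier<String> decorated = Decorators.ofSupplier(this::callRemote)
                .withCircuitBreaker(circuitBreaker)
                .withRetry(retry)
                .decorate();

        // Recover with a fallback if the call still fails (or the circuit is open).
        return Try.ofSupplier(decorated)
                .recover(throwable -> "fallback-value")
                .get();
    }

    private String callRemote() {
        // Placeholder for the actual HTTP call to dependency A.
        return "remote-value";
    }
}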
44. Resilience4j - Observability
You can subscribe to various events for most of the decorators:
Built in support for:
● Dropwizard (resilience4j-metrics)
● Prometheus (resilience4j-prometheus)
● Micrometer (resilience4j-micrometer)
● Spring-boot actuator health information (resilience4j-spring-boot2)
circuitBreaker.getEventPublisher()
    .onSuccess(event -> logger.info(...))
    .onError(event -> logger.info(...))
    .onIgnoredError(event -> logger.info(...))
    .onReset(event -> logger.info(...))
    .onStateTransition(event -> logger.info(...));
Tools for Improving Resilience
45. Resilience4j - Testing
Scope:
● Unit tests easy for composed layers and different scenarios
● Integration tests reasonably easy for caching
Caveats:
● Cache scope is tricky here as well
● Basically similar problems to Hystrix testing
Tools for Improving Resilience
46. Resilience4j - Adoption Considerations
✅: Observability; Feature rich; Easier to configure than Hystrix; Modularization; Fewer transitive dependencies than Hystrix
🚫: Not language agnostic
⚠: Forces you towards building thick clients; Tricky to enforce on calling services
Tools for Improving Resilience
51. Envoy - Configuration Deployment
Static config:
● You will benefit from some scripting/tools to generate this config
● Deploy the generated YAML as a Docker container side-car using the official Docker image
Dynamic config:
● gRPC APIs for dynamically updating these settings
○ Endpoint Discovery Service (EDS)
○ Cluster Discovery Service (CDS)
○ Route Discovery Service (RDS)
○ Listener Discovery Service (LDS)
○ Secret Discovery Service (SDS)
● Control planes like Istio make this manageable
Tools for Improving Resilience
52. Envoy - Observability
Data sinks:
● envoy.statsd - built-in statsd sink (does not support tagged metrics)
● envoy.dog_statsd - emits stats with DogStatsD compatible tags
● envoy.stat_sinks.hystrix - emits stats as a text/event-stream for use by the Hystrix dashboard
● build your own
(Small) subset of stats:
● downstream_rq_total, downstream_rq_5xx, downstream_rq_timeout, downstream_rq_time, etc.
Detecting open circuits/throttling:
● x-envoy-overloaded header will be injected in the downstream response
● Detailed metrics: cx_open (connection circuit breaker), rq_open (request circuit breaker), remaining_rq (remaining requests until the circuit will open), etc.
Tools for Improving Resilience
53. Envoy - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Setup can be tricky (boot the side-car in a Docker container, put a mock server behind it, and start simulating requests and different types of failures)
● Will probably need to test this per route (or whatever your config granularity is)
Tools for Improving Resilience
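A rough sketch of that setup with Testcontainers (assuming a static envoy-test.yaml on the test classpath that listens on port 10000 and routes to a mock upstream; the image tag, file name, and ports are illustrative):

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;
import org.testcontainers.utility.MountableFile;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class EnvoySidecarTest {

    @Test
    void routesRequestsThroughTheSidecar() throws Exception {
        // Boot the Envoy side-car with a static test config copied from the classpath.
        try (GenericContainer<?> envoy = new GenericContainer<>(DockerImageName.parse("envoyproxy/envoy:v1.28.0"))
                .withCopyFileToContainer(
                        MountableFile.forClasspathResource("envoy-test.yaml"),
                        "/etc/envoy/envoy.yaml")
                .withExposedPorts(10000)) { // listener port defined in the test config
            envoy.start();

            String url = "http://" + envoy.getHost() + ":" + envoy.getMappedPort(10000) + "/users/42";
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());

            // Assertions would then check routing, retries, injected headers such as
            // x-envoy-overloaded, or circuit-breaking behaviour exposed via Envoy stats.
            System.out.println(response.statusCode());
        }
    }
}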
54. Envoy - Adoption Considerations
✅: Application language agnostic; Enforcement; Caller/callee resilience; Observability
🚫: Fallbacks; Cache
⚠: Testability; Ownership (SRE vs Dev teams); Change rollout; Configuration Complexity; Operational Complexity
Tools for Improving Resilience
61. Netflix/concurrency-limits - Adoption Considerations
✅: Caller/callee resilience; Does not require manual, per-endpoint config; Easier to enforce on calling services
🚫: Not language agnostic; Less mature than others; Documentation is quite scarce
⚠: Harder to predict throttling; Observability
Tools for Improving Resilience
65. And the winner is… Envoy 🥇🥇🥇, but why?
Reasons:
● We already have it
● Observability is super strong
● Easy to enforce across all our infrastructure
● Allows us to have thin clients
● Language agnostic
And the runner up is… Resilience4j (kinda) 🥈🥈🥈
● Allowed for retries, caching and fallbacks, but it’s up to the teams
● We discourage using request caching for the most part