Resilient service-to-service calls in a post-Hystrix world
Rareș Mușină, Tech Lead @N26
@r3sm4n
R.I.P. Hystrix (2012-2018)
Integration Patterns
Sync vs. Eventual Consistency
The anatomy of a cascading failure
Triggering Conditions - What happened?
Triggering Conditions - Change
● New Rollouts
● Planned Changes
● Traffic Drains
● Turndowns
public String getCountry(String userId) {
    try {
        // Try to get latest country to avoid stale info
        UserInfo userInfo = userInfoService.update(userId);
        updateCache(userInfo);
        ...
        return getCountryFromCache(userId);
    } catch (Exception e) {
        // Default to cache if service is down
        return getCountryFromCache(userId);
    }
}
Triggering Conditions - Throttling
Triggering Conditions - Entropy
● Burstiness (e.g. scheduled tasks)
● DDoSes
● Instance Death (gee, thanks Spotinst)
● Organic Growth
● Request profile changes
Resource Starvation - Common Resources
● CPU
● Memory
● Network
● Disk space
● Threads
● File descriptors
● ...
Resource Starvation - Dependencies Between Resources
Poorly tuned Garbage Collection
Slow requests
Increased CPU due to GC
More in-progress requests
More RAM due to queuing
Less RAM for caching
Lower cache hit rate
More requests to backend
🔥🔥🔥
Server Overload/Meltdown/Crash/Unavailability
:(
CPU/Memory maxed out
Health checks returning 5xx
Endpoints returning 5xx
Timeouts
Increased load on other instances
Cascading Failures - Load Redistribution
[Diagram: two ELBs in front of instances A and B, with per-path loads of 500, 350, 100 and 250; after B drops out, the remaining instance A is shown with 600 and 600]
Cascading Failures - Retry Amplification
Cascading Failures - Latency Propagation
Cascading Failures - Resource Contention During Recovery
Strategies for Improving Resilience
Architecture - Orchestration vs Choreography
[Diagram, Orchestration: the Signup service calls the User service (Create user), the Account service (Create Account) and the Card service (Ship card)]
[Diagram, Choreography: the Signup service publishes a User signup event; the User service, Account service and Card service subscribe to it]
Capacity Planning - Do I need it in the age of the cloud?
Helpful, but not sufficient to protect against cascading failures
Accuracy is overrated and expensive (especially for new services)
It’s (usually) ok (and cheaper) to overprovision at first
Capacity Planning - More important things
Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹)
Auto-scaling and auto-healing
Robust architecture in the face of growing traffic (pub/sub helps)
Agree on SLIs and SLOs and monitor them closely
Capacity Planning - If I do need it, then what do I do?
Business requirements
Critical services, and YOLO the rest
⚠ Seasonality 🎄🥚🦃
Use hardware resources to measure capacity instead of Requests Per Second:
● cost of request = CPU time it has consumed
● (on GC platforms) higher memory => higher CPU
Chaos Testing
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Principles of Chaos Engineering
Retrying - What should I retry?
What makes a request retriable?
● ⚠ idempotency
● 🚫 GET with side-effects
● ✅ stateless if you can
Should you retry timeouts?
● Stay tuned to the next slides
Retrying - Backing Off With Jitter
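The slide's figure did not survive extraction, so here is a minimal sketch of one common variant, exponential backoff with "full jitter": each retry waits a random time between zero and an exponentially growing ceiling. The class name `JitteredBackoff`, the base/cap parameters and the choice of full jitter are assumptions made for illustration, not something the talk prescribes.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: delay = random value in [0, min(cap, base * 2^attempt)]. */
final class JitteredBackoff {
    private final long baseMillis;
    private final long capMillis;

    JitteredBackoff(long baseMillis, long capMillis) {
        if (baseMillis <= 0 || capMillis < baseMillis || capMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 < baseMillis <= capMillis < Long.MAX_VALUE");
        }
        this.baseMillis = baseMillis;
        this.capMillis = capMillis;
    }

    /** Upper bound of the delay for a 0-based attempt number, doubling until the cap is hit. */
    long ceilingMillis(int attempt) {
        long ceiling = baseMillis;
        for (int i = 0; i < attempt && ceiling < capMillis; i++) {
            // Guard against overflow: jump straight to the cap once doubling would exceed it.
            ceiling = (ceiling > capMillis / 2) ? capMillis : ceiling * 2;
        }
        return ceiling;
    }

    /** Random delay in [0, ceiling]; the randomness spreads out clients that failed together. */
    long nextDelayMillis(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMillis(attempt) + 1);
    }
}
```

With a 100 ms base and a 1000 ms cap, the ceilings grow as 100, 200, 400, 800 and then stay at 1000, while the actual sleep is drawn uniformly below each ceiling so that retries from many callers do not arrive in lockstep.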
Retrying - Retry Budgets
Per-request retry budget
● Each request retried at most 3x
Per-client retry budget
● Retry requests = at most 10% total requests to upstream
● If > 10% of requests are failing => upstream is likely unhealthy
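The two budgets above can be sketched together as follows. This is a simplified illustration under stated assumptions: `RetryBudget` is a hypothetical class, the counters are cumulative (a production version would more likely use a sliding time window), and the percentage is kept as an integer to avoid floating-point rounding.

```java
/** Hypothetical combined per-request and per-client retry budget. */
final class RetryBudget {
    private final int maxRetriesPerRequest;
    private final int maxRetryPercent;
    private long requests;
    private long retries;

    RetryBudget(int maxRetriesPerRequest, int maxRetryPercent) {
        this.maxRetriesPerRequest = maxRetriesPerRequest;
        this.maxRetryPercent = maxRetryPercent;
    }

    /** Call once for every original (non-retry) request sent upstream. */
    synchronized void recordRequest() {
        requests++;
    }

    /**
     * Returns true, and consumes budget, if another retry is allowed.
     * retriesAlreadyMade is the number of retries this one request has used so far.
     */
    synchronized boolean tryAcquireRetry(int retriesAlreadyMade) {
        if (retriesAlreadyMade >= maxRetriesPerRequest) {
            return false; // per-request budget exhausted
        }
        long allowed = requests * maxRetryPercent / 100;
        if (retries >= allowed) {
            return false; // per-client budget exhausted: upstream is likely unhealthy
        }
        retries++;
        return true;
    }
}
```

With `new RetryBudget(3, 10)` and 100 recorded requests, the first 10 retries are granted and further ones are refused until more original requests have gone through.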
Throttling - Timeouts
Nesting is 🔥👿🔥
Retries make ☝ worse
Timing out => upstream service might still be processing request
Maintain discipline when setting timeouts/Propagate timeouts
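One way to propagate timeouts is to carry a single deadline through the call chain and derive each hop's timeout from whatever time is left. The sketch below assumes a hypothetical `Deadline` class with an injectable nanosecond clock (so it can be tested without sleeping); how the remaining budget travels between services, for example in a request header, is left out.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical request deadline shared by all nested calls made on behalf of one request. */
final class Deadline {
    private final LongSupplier nanoClock;
    private final long expiresAtNanos;

    Deadline(LongSupplier nanoClock, Duration budget) {
        this.nanoClock = nanoClock;
        this.expiresAtNanos = nanoClock.getAsLong() + budget.toNanos();
    }

    /** Convenience factory using the JVM's monotonic clock. */
    static Deadline startingNow(Duration budget) {
        return new Deadline(System::nanoTime, budget);
    }

    /** Time left before the caller gives up; never negative. */
    Duration remaining() {
        long left = expiresAtNanos - nanoClock.getAsLong();
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    /** Timeout for the next downstream call: the hop's own limit, capped by the time left. */
    Duration timeoutFor(Duration hopLimit) {
        Duration left = remaining();
        return left.compareTo(hopLimit) < 0 ? left : hopLimit;
    }

    /** True once the overall budget is spent, so no further downstream work is worth starting. */
    boolean isExpired() {
        return remaining().isZero();
    }
}
```

A service that received a 3 s budget and has already spent 2 s would then call its own dependency with at most 1 s, even if that dependency's configured timeout is longer, and would skip the call entirely once `isExpired()` is true.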
[Diagram: nested calls between Service A, Service B, Service C and Service D, with timeouts of 3s, 2s and 5s on different hops]
⚠ Avoid circular dependencies at all cost ⚠
Throttling - Rate Limiting
Avoid overload by clients and set per-client limits:
● requests from one calling service can use up to x CPU seconds/time interval on the upstream
● anything above that will be throttled
● these metrics are aggregated across all instances of a calling service and upstream
If this is too complicated
=> limit based on RPS/customer/endpoint
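For the simpler RPS-based option, a per-client token bucket is one possible shape. This is a single-instance sketch under stated assumptions: `PerClientRateLimiter` is a hypothetical class, the client id stands for the calling service, and the current time is passed in explicitly (callers would normally pass `System.nanoTime()`). Aggregating across instances, as the slide suggests, would need shared state that is not shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical token-bucket limiter keeping one bucket per calling client. */
final class PerClientRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long lastRefillNanos) {
            this.tokens = tokens;
            this.lastRefillNanos = lastRefillNanos;
        }
    }

    private final double permitsPerSecond;
    private final double burst;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    PerClientRateLimiter(double permitsPerSecond, double burst) {
        this.permitsPerSecond = permitsPerSecond;
        this.burst = burst;
    }

    /** Returns true if the client may proceed; false means the request should be throttled. */
    boolean tryAcquire(String clientId, long nowNanos) {
        Bucket bucket = buckets.computeIfAbsent(clientId, id -> new Bucket(burst, nowNanos));
        synchronized (bucket) {
            // Refill in proportion to elapsed time, never above the burst size.
            double elapsedSeconds = (nowNanos - bucket.lastRefillNanos) / 1e9;
            bucket.tokens = Math.min(burst, bucket.tokens + elapsedSeconds * permitsPerSecond);
            bucket.lastRefillNanos = nowNanos;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```

Because each client has its own bucket, one noisy caller runs out of tokens without affecting the others.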
Throttling - Circuit Breaking
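[Diagram: circuit breaker state machine with states Closed, Open and Half Open. Closed: success, or fail (under threshold), stays Closed; fail (threshold reached) moves to Open. Open: call / raise circuit open; reset timeout moves to Half Open. Half Open: success moves to Closed; fail moves back to Open.]
[Diagram: Service A calls Service B through a Circuit Breaker; four calls in a row time out, the circuit trips, and the next call is rejected with "circuit open"]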
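The Closed / Open / Half Open cycle can be sketched as a small state machine. This is an illustrative simplification, not how Hystrix, Resilience4j or Envoy implement it: `SimpleCircuitBreaker` is a hypothetical class, it counts consecutive failures rather than a failure rate, it lets every call through while half open, and time is passed in explicitly for testability.

```java
/** Hypothetical consecutive-failure circuit breaker. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutNanos;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtNanos;

    SimpleCircuitBreaker(int failureThreshold, long resetTimeoutNanos) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutNanos = resetTimeoutNanos;
    }

    /** Ask before each call; false means fail fast without touching the upstream. */
    synchronized boolean allowRequest(long nowNanos) {
        if (state == State.OPEN && nowNanos - openedAtNanos >= resetTimeoutNanos) {
            state = State.HALF_OPEN; // reset timeout elapsed: allow a trial call
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        if (state == State.HALF_OPEN) {
            state = State.CLOSED; // trial call succeeded
        }
    }

    synchronized void recordFailure(long nowNanos) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip, or re-trip after a failed trial call
            openedAtNanos = nowNanos;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

With a threshold of 4, four timeouts in a row trip the breaker, matching the sequence in the diagram; callers then get an immediate rejection until the reset timeout has passed.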
Throttling - Adaptive Concurrency Limits
gradient = RTTnoload / RTTactual
newLimit = currentLimit × gradient + queueSize
[Figure: axes labelled Queue and Concurrency]
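The gradient formula can be worked through in a few lines. The sketch below applies it literally; the class name `GradientLimit`, the clamping of the gradient to the range 0.5 to 1.0, and the floor of 1 on the result are assumptions added here to keep the example well behaved, not values stated on the slide.

```java
/** Hypothetical one-step calculation of the gradient-based concurrency limit. */
final class GradientLimit {
    private GradientLimit() {
    }

    /** RTTnoload / RTTactual, clamped so a single sample can neither grow nor halve the limit too far. */
    static double gradient(double rttNoLoadMillis, double rttActualMillis) {
        double raw = rttNoLoadMillis / rttActualMillis;
        return Math.max(0.5, Math.min(1.0, raw));
    }

    /** newLimit = currentLimit x gradient + queueSize, never below 1. */
    static int newLimit(int currentLimit, double rttNoLoadMillis, double rttActualMillis, int queueSize) {
        double next = currentLimit * gradient(rttNoLoadMillis, rttActualMillis) + queueSize;
        return Math.max(1, (int) Math.floor(next));
    }
}
```

For a current limit of 100 and a queue size of 4: when latency equals the no-load RTT (10 ms vs 10 ms) the gradient is 1 and the limit creeps up to 104; when latency doubles (10 ms vs 20 ms) the gradient is 0.5 and the limit drops to 54.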
Fallbacks and Rejection
Cache
Dead letter queues for writes
Return hard-coded value
Empty Response (“Fail Silent”)
User experience
⚠ Make sure to discuss these with your product owners ⚠
Tools for Improving Resilience
Hystrix
Resilience4j
Envoy
Netflix/concurrency-limits
Hystrix - Topology
[Diagram: a Client and its Dependencies A, B, C, D, E and F]
Hystrix - Resilience Features
Category | Defense Mechanism | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying | Retrying | 🚫 | | |
Throttling | Timeouts | ✅ | | |
Throttling | Rate Limiting | 🚫 | | |
Throttling | Circuit Breaking | ✅ | | |
Throttling | Adaptive Concurrency Limits | 🚫 | | |
Rejection | Fallbacks | ✅ | | |
Rejection | Response Caching | ✅ | | |
Hystrix - Configuration management
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;

public class GetUserInfoCommand extends HystrixCommand<UserInfo> {
    private final String userId;
    private final UserInfoApi userInfoApi;

    // The API client is injected so the command can call the upstream service.
    public GetUserInfoCommand(String userId, UserInfoApi userInfoApi) {
        super(HystrixCommand.Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserInfo"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("getUserInfo")));
        this.userId = userId;
        this.userInfoApi = userInfoApi;
    }

    @Override
    protected UserInfo run() {
        // Simplified to fit on a slide - you'd have some exception handling
        return userInfoApi.getUserInfo(userId);
    }

    @Override
    protected String getCacheKey() { // Dragons reside here
        return userId;
    }

    // Returned when run() fails, times out, or the circuit is open.
    @Override
    protected UserInfo getFallback() {
        return UserInfo.empty();
    }
}
Hystrix - Observability
Hystrix - Testing
Scope:
● Unit tests easy for circuit opening/closing, fallbacks
● Integration tests reasonably easy for caching
Caveats:
● If you are using response caching, DO NOT FORGET to test HystrixRequestContext
● Depending on the errors thrown by the call, you might need to test circuit tripping (HystrixRuntimeException vs HystrixBadRequestException)
● If you’re not careful, you might set the same HystrixCommandGroupKey and HystrixCommandKey
Hystrix - Adoption Considerations
✅ | 🚫 | ⚠
Observability | No longer supported | Forces you towards building thick clients
Mostly easy to test | Not language agnostic | Tricky to enforce on calling services
 | Cumbersome to configure |
 | HystrixRequestContext |
Resilience4j - Topology
[Diagram: a Client and its Dependencies A, B, C, D, E and F]
Resilience4j - Resilience Features
Category | Defense Mechanism | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying | Retrying | 🚫 | ✅ | |
Throttling | Timeouts | ✅ | ✅ | |
Throttling | Rate Limiting | 🚫 | ✅ | |
Throttling | Circuit Breaking | ✅ | ✅ | |
Throttling | Adaptive Concurrency Limits | 🚫 | 👷 | |
Rejection | Fallbacks | ✅ | ✅ | |
Rejection | Response Caching | ✅ | ✅ | |
Resilience4j - Configuration management
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofMillis(1000))
    .recordExceptions(IOException.class, TimeoutException.class)
    .build();
CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker circuitBreaker =
    circuitBreakerRegistry.circuitBreaker("getUserInfo", circuitBreakerConfig);

RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
Retry retry = Retry.of("getUserInfo", retryConfig);

Supplier<UserInfo> decorateSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
    Retry.decorateSupplier(retry,
        () -> userInfoApi.getUserInfo(userId)));
UserInfo result = Try.ofSupplier(decorateSupplier).getOrElse(UserInfo.empty());
Resilience4j - Observability
You can subscribe to various events for most of the decorators:
circuitBreaker.getEventPublisher()
    .onSuccess(event -> logger.info(...))
    .onError(event -> logger.info(...))
    .onIgnoredError(event -> logger.info(...))
    .onReset(event -> logger.info(...))
    .onStateTransition(event -> logger.info(...));
Built-in support for:
● Dropwizard (resilience4j-metrics)
● Prometheus (resilience4j-prometheus)
● Micrometer (resilience4j-micrometer)
● Spring Boot Actuator health information (resilience4j-spring-boot2)
Resilience4j - Testing
Scope:
● Unit tests easy for composed layers and different scenarios
● Integration tests reasonably easy for caching
Caveats:
● Cache scope is tricky here as well
● Basically similar problems to Hystrix testing
Resilience4j - Adoption Considerations
✅ | 🚫 | ⚠
Observability | Not language agnostic | Forces you towards building thick clients
Feature rich | | Tricky to enforce on calling services
Easier to configure than Hystrix | |
Modularization | |
Less transitive dependencies than Hystrix | |
Envoy - Topology
[Diagram: two Service Clusters, each containing a Service, plus External Services and Discovery]
Envoy - Resilience Features
Category | Defense Mechanism | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying | Retrying | 🚫 | ✅ | ✅ |
Throttling | Timeouts | ✅ | ✅ | ✅ |
Throttling | Rate Limiting | 🚫 | ✅ | ✅ |
Throttling | Circuit Breaking | ✅ | ✅ | ✅ |
Throttling | Adaptive Concurrency Limits | 🚫 | 👷 | 👷 |
Rejection | Fallbacks | ✅ | ✅ | 🚫 |
Rejection | Response Caching | ✅ | ✅ | 🚫 |
Envoy - Configuration management
clusters:
- name: get-cluster
  connect_timeout: 10s
  type: STRICT_DNS
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 10
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 3
      max_pending_requests: 3
      max_requests: 3
      max_retries: 3
    - priority: HIGH
      max_connections: 10
      max_pending_requests: 10
      max_requests: 10
      max_retries: 10
  hosts:
  - socket_address:
      address: httpbin-get
      port_value: 8080

static_resources:
  listeners:
  - address:
      socket_address:
        ...
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          ...
          route_config:
            ...
            virtual_hosts:
            - name: backend
              ...
              routes:
              - match:
                  prefix: "/"
                  headers:
                  - exact_match: "GET"
                    name: ":method"
                route:
                  cluster: get-cluster
                  ...
                  retry_policy:
                    retry_on: 5xx
                    num_retries: 2
                  priority: HIGH
Envoy - Configuration Deployment
Static config:
● You will benefit from some scripting/tools to generate this config
● Deploy the generated YAML as a Docker side-car container, using the official Docker image
Dynamic config:
● gRPC APIs for dynamically updating these settings
○ Endpoint Discovery Service (EDS)
○ Cluster Discovery Service (CDS)
○ Route Discovery Service (RDS)
○ Listener Discovery Service (LDS)
○ Secret Discovery Service (SDS)
● Control planes like Istio make this manageable
Envoy - Observability
Data sinks:
● envoy.statsd - built-in envoy.statsd sink (does not support tagged metrics)
● envoy.dog_statsd - emits stats with DogStatsD compatible tags
● envoy.stat_sinks.hystrix - emits stats in text/event-stream formatted stream for use by Hystrix dashboard
● build your own
(Small) subset of stats:
● downstream_rq_total, downstream_rq_5xx, downstream_rq_timeout, downstream_rq_time, etc.
Detecting open circuits/throttling:
● x-envoy-overloaded header will be injected in the downstream response
● Detailed metrics: cx_open (connection circuit breaker), rq_open (request circuit breaker), remaining_rq (remaining requests until circuit will open), etc.
Envoy - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Setup can be tricky (boot the side-car in a Docker container, put a mock server behind it and start simulating requests and different types of failures)
● You will probably need to test this per route, or at whatever granularity your config uses
Envoy - Adoption Considerations
✅ | 🚫 | ⚠
Application language agnostic | Fallbacks | Testability
Enforcement | Cache | Ownership (SRE vs Dev teams)
Change rollout | | Configuration Complexity
Caller/callee resilience | | Operational Complexity
Observability | |
Netflix/concurrency-limits - Topology
[Diagram: a Client and its Dependencies A, B, C, D, E and F]
Netflix/concurrency-limits - Resilience Features
Category | Defense Mechanism | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
Retrying | Retrying | 🚫 | ✅ | ✅ | 🚫
Throttling | Timeouts | ✅ | ✅ | ✅ | 🚫
Throttling | Rate Limiting | 🚫 | ✅ | ✅ | 🚫
Throttling | Circuit Breaking | ✅ | ✅ | ✅ | 🚫
Throttling | Adaptive Concurrency Limits | 🚫 | 👷 | 👷 | ✅
Rejection | Fallbacks | ✅ | ✅ | 🚫 | 🚫
Rejection | Response Caching | ✅ | ✅ | 🚫 | 🚫
Netflix/concurrency-limits - Configuration management
ConcurrencyLimitServletFilter(
    ServletLimiterBuilder()
        .limit(VegasLimit.newBuilder().build())
        .metricRegistry(concurrencyLimitMetricRegistry)
        .build())
Netflix/concurrency-limits - Observability
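class ConcurrencyLimitMetricRegistry(private val meterRegistry: MeterRegistry) : MetricRegistry {

    // Distribution samples are dropped here; only gauges are forwarded to Micrometer.
    override fun registerDistribution(id: String?, vararg tagNameValuePairs: String?):
            MetricRegistry.SampleListener {
        return MetricRegistry.SampleListener { }
    }

    override fun registerGauge(id: String?, supplier: Supplier<Number>?, vararg tagNameValuePairs: String?) {
        id?.let {
            supplier?.let {
                // Pair entries up as (name, value); chunked(2) yields non-overlapping pairs,
                // whereas zipWithNext would also pair each value with the following name.
                val tags = tagNameValuePairs.filterNotNull().chunked(2)
                    .filter { pair -> pair.size == 2 }
                    .map { (name, value) -> Tag.of(name, value) }
                // Register the supplier itself so the gauge is re-read on every scrape.
                // Micrometer only keeps a weak reference to it.
                meterRegistry.gauge(id, tags, supplier) { s -> s.get().toDouble() }
            }
        }
    }
}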
Netflix/concurrency-limits - Testing
Scope:
● E2E-ish 🙂
Caveats:
● Haha good luck with that
Netflix/concurrency-limits - Adoption Considerations
✅ | 🚫 | ⚠
Caller/callee resilience | Not language agnostic | Harder to predict throttling
Does not require manual, per-endpoint config | Less mature than others | Observability
 | Documentation is quite scarce | Easier to enforce on calling services
What did we go for in the end?
Resilience libraries showdown
Category | Defense Mechanism | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
Retrying | Retrying | 🚫 | ✅ | ✅ | 🚫 | 👷 | 🚫
Throttling | Timeouts | ✅ | ✅ | ✅ | 🚫 | ✅ | 🚫
Throttling | Rate Limiting | 🚫 | ✅ | ✅ | 🚫 | 👷 | 🚫
Throttling | Circuit Breaking | ✅ | ✅ | ✅ | 🚫 | 👷 | ✅
Throttling | Adaptive Concurrency Limits | 🚫 | 👷 | 👷 | ✅ | 👷 | 🚫
Rejection | Fallbacks | ✅ | ✅ | 🚫 | 🚫 | 👷 | ✅
Rejection | Response Caching | ✅ | ✅ | 🚫 | 🚫 | 👷 | 🚫
And the winner is… Envoy 🥇🥇🥇
And the winner is… Envoy 🥇🥇🥇, but why?
Reasons:
● We already have it
● Observability is super strong
● Easy to enforce across all our infrastructure
● Allows us to have thin clients
● Language agnostic
And the runner up is… Resilience4j (kinda) 🥈🥈🥈
● Allowed for retries, caching and fallbacks, but it’s up to the teams
● We discourage using request caching for the most part
Ask Away
Thank you!
🙇 🙇 🙇

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 

Último (20)

Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Resilient service-to-service calls in a post-Hystrix world

  • 1. Resilient service-to-service calls in a post-Hystrix world Rareș Mușină, Tech Lead @N26 @r3sm4n
  • 3. Integration Patterns Sync vs. Eventual Consistency
  • 4. The anatomy of a cascading failure
  • 5. Triggering Conditions - What happened?
  • 6. Triggering Conditions - Change
      ● New Rollouts
      ● Planned Changes
      ● Traffic Drains
      ● Turndowns

        public String getCountry(String userId) {
            try {
                // Try to get latest country to avoid stale info
                UserInfo userInfo = userInfoService.update(userId);
                updateCache(userInfo);
                ...
                return getCountryFromCache(userId);
            } catch (Exception e) {
                // Default to cache if service is down
                return getCountryFromCache(userId);
            }
        }
  • 7. Triggering Conditions - What happened?
  • 8. Triggering Conditions - Throttling
  • 9. Triggering Conditions - What happened?
  • 10. Triggering Conditions - Entropy
      ● Burstiness (e.g. scheduled tasks)
      ● DDoSes
      ● Instance Death (gee, thanks Spotinst)
      ● Organic Growth
      ● Request profile changes
  • 11. Resource Starvation - Common Resources
      ● CPU
      ● Memory
      ● Network
      ● Disk space
      ● Threads
      ● File descriptors
      ● ...
  • 12. Resource Starvation - Dependencies Between Resources
      ● Poorly tuned Garbage Collection
      ● Slow requests
      ● Increased CPU due to GC
      ● More in-progress requests
      ● More RAM due to queuing
      ● Less RAM for caching
      ● Lower cache hit rate
      ● More requests to backend 🔥🔥🔥
  • 13. Server Overload/Meltdown/Crash/Unavailability :(
      ● CPU/Memory maxed out
      ● Health checks returning 5xx
      ● Endpoints returning 5xx
      ● Timeouts
      ● Increased load on other instances
  • 14. Cascading Failures - Load Redistribution
      [Diagram: two ELBs spread 500, 350, 100 and 250 requests across instances A and B; once B is gone, A receives 600 from each ELB]
  • 15. Cascading Failures - Retry Amplification
  • 16. Cascading Failures - Latency Propagation
  • 17. Cascading Failures - Resource Contention During Recovery
  • 19. Architecture - Orchestration vs Choreography Strategies for Improving Resilience
      ● Orchestration: Signup service calls User service (Create user), Account service (Create Account) and Card service (Ship card)
      ● Choreography: Signup service publishes a User signup event; User service, Account service and Card service subscribe to it
  • 20. Capacity Planning - Do I need it in the age of the cloud?
      ● Helpful, but not sufficient to protect against cascading failures
      ● Accuracy is overrated and expensive (especially for new services)
      ● It’s (usually) ok (and cheaper) to overprovision at first
  • 21. Capacity Planning - More important things
      ● Automate provisioning and deployment (🐄🐄🐄 not 🐕🐈🐹)
      ● Auto-scaling and auto-healing
      ● Robust architecture in the face of growing traffic (pub/sub helps)
      ● Agree on SLIs and SLOs and monitor them closely
  • 22. Capacity Planning - If I do need it, then what do I do?
      ● Business requirements
      ● Critical services, and YOLO the rest ⚠
      ● Seasonality 🎄🥚🦃
      ● Use hardware resources to measure capacity instead of Requests Per Second:
          ○ cost of request = CPU time it has consumed
          ○ (on GC platforms) higher memory => higher CPU
  • 23. Chaos Testing
      “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” (Principles of Chaos Engineering)
  • 24. Retrying - What should I retry?
      What makes a request retriable?
      ● ⚠ idempotency
      ● 🚫 GET with side-effects
      ● ✅ stateless if you can
      Should you retry timeouts?
      ● Stay tuned for the next slides
  • 25. Retrying - Backing Off With Jitter
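The slide above only names the technique. As a rough sketch (not taken from the talk), capped exponential backoff with "full jitter" picks a random delay between zero and an exponentially growing ceiling, so that clients which failed at the same moment do not all retry at the same moment. The class name `JitteredBackoff` and its parameters are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: capped exponential backoff with full jitter. */
final class JitteredBackoff {
    private final long baseMillis;
    private final long capMillis;

    JitteredBackoff(long baseMillis, long capMillis) {
        if (baseMillis <= 0 || capMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= capMillis");
        }
        this.baseMillis = baseMillis;
        this.capMillis = capMillis;
    }

    /** Upper bound for a 0-based attempt: min(cap, base * 2^attempt). */
    long ceilingMillis(int attempt) {
        long ceiling = baseMillis;
        for (int i = 0; i < attempt && ceiling < capMillis; i++) {
            // Double without overflowing; jump straight to the cap when close to it.
            ceiling = (ceiling > capMillis / 2) ? capMillis : ceiling * 2;
        }
        return ceiling;
    }

    /** Random delay in [0, ceiling) so that retries are spread out over time. */
    long delayMillis(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMillis(attempt));
    }
}
```

With an assumed base of 100 ms and cap of 1 s, the ceilings for attempts 0 to 4 are 100, 200, 400, 800 and 1000 ms, and each actual delay is drawn uniformly below the ceiling.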
  • 26. Retrying - Retry Budgets
      Per-request retry budget
      ● Each request retried at most 3x
      Per-client retry budget
      ● Retry requests = at most 10% total requests to upstream
      ● If > 10% of requests are failing => upstream is likely unhealthy
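A minimal sketch of how the two budgets could be combined in one client-side object. It assumes the 10% is measured as retries relative to original requests and uses cumulative counters for brevity, whereas a production version would likely use a sliding time window. `RetryBudget` and its method names are hypothetical.

```java
/**
 * Hypothetical retry budget: a per-request cap plus a per-client
 * ratio of retries to original requests.
 */
final class RetryBudget {
    private final int maxRetriesPerRequest; // e.g. 3
    private final int maxRetryPercent;      // e.g. 10 (% of original requests)
    private long requests;
    private long retries;

    RetryBudget(int maxRetriesPerRequest, int maxRetryPercent) {
        if (maxRetriesPerRequest < 0 || maxRetryPercent < 0) {
            throw new IllegalArgumentException("limits must not be negative");
        }
        this.maxRetriesPerRequest = maxRetriesPerRequest;
        this.maxRetryPercent = maxRetryPercent;
    }

    /** Call once for every original (non-retry) request sent upstream. */
    synchronized void recordRequest() {
        requests++;
    }

    /** Returns true, and consumes budget, if one more retry is allowed. */
    synchronized boolean tryAcquireRetry(int retriesSoFarForThisRequest) {
        if (retriesSoFarForThisRequest >= maxRetriesPerRequest) {
            return false; // per-request budget exhausted
        }
        // Integer arithmetic avoids floating-point rounding at the boundary.
        if ((retries + 1) * 100 > requests * maxRetryPercent) {
            return false; // per-client budget exhausted: upstream is likely unhealthy
        }
        retries++;
        return true;
    }
}
```

With 100 recorded requests and a 10% budget, the first ten retries are granted and later ones are refused until more original requests are recorded.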
  • 27. Throttling - Timeouts
      ● Nesting is 🔥👿🔥
      ● Retries make ☝ worse
      ● Timing out => upstream service might still be processing request
      ● Maintain discipline when setting timeouts/Propagate timeouts
      ● ⚠ Avoid circular dependencies at all costs ⚠
      [Diagram: Service A, Service B, Service C and Service D calling each other with 3s, 2s and 5s timeouts]
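One way to read "propagate timeouts" is as a deadline that travels with the request, so that no hop waits longer than its caller is still willing to wait. The sketch below is an assumption about how that could look, not code from the talk; `Deadline` is a hypothetical class and the clock is injected only to keep the example testable.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical request deadline passed from caller to callee. */
final class Deadline {
    private final long expiresAtNanos;
    private final LongSupplier nanoClock;

    private Deadline(long expiresAtNanos, LongSupplier nanoClock) {
        this.expiresAtNanos = expiresAtNanos;
        this.nanoClock = nanoClock;
    }

    /** Starts a deadline at the edge, e.g. Deadline.after(Duration.ofSeconds(3), System::nanoTime). */
    static Deadline after(Duration budget, LongSupplier nanoClock) {
        return new Deadline(nanoClock.getAsLong() + budget.toNanos(), nanoClock);
    }

    /** Time left before the original caller gives up; never negative. */
    Duration remaining() {
        long left = expiresAtNanos - nanoClock.getAsLong();
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    boolean isExpired() {
        return remaining().isZero();
    }

    /** Timeout for a downstream call: the smaller of the local setting and the time left. */
    Duration timeoutFor(Duration localTimeout) {
        Duration left = remaining();
        return left.compareTo(localTimeout) < 0 ? left : localTimeout;
    }
}
```

A service configured with a 5s local timeout that receives a request with 3s left would call downstream with 3s, and would skip the call entirely once `isExpired()` returns true. How the remaining time is carried between services (for example in a request header) is left open here.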
  • 28. Throttling - Rate Limiting
      Avoid overload by clients and set per-client limits:
      ● requests from one calling service can use up to x CPU seconds/time interval on the upstream
      ● anything above that will be throttled
      ● these metrics are aggregated across all instances of a calling service and upstream
      If this is too complicated => limit based on RPS/customer/endpoint
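The simpler fallback mentioned above (a requests-per-second limit per calling client) can be sketched with one token bucket per client. This is an illustrative, single-instance version: it does not aggregate across instances as the slide describes, and `PerClientRateLimiter`, its parameters and the injected clock are all assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-client token bucket limiter (one process only). */
final class PerClientRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;
    }

    private final double permitsPerSecond;
    private final double burst;
    private final LongSupplier nanoClock;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    PerClientRateLimiter(double permitsPerSecond, double burst, LongSupplier nanoClock) {
        if (permitsPerSecond <= 0 || burst < 1) {
            throw new IllegalArgumentException("need permitsPerSecond > 0 and burst >= 1");
        }
        this.permitsPerSecond = permitsPerSecond;
        this.burst = burst;
        this.nanoClock = nanoClock;
    }

    /** Returns true if the client may proceed, false if the request should be throttled. */
    boolean tryAcquire(String clientId) {
        Bucket bucket = buckets.computeIfAbsent(clientId, id -> {
            Bucket fresh = new Bucket();
            fresh.tokens = burst; // new clients start with a full bucket
            fresh.lastRefillNanos = nanoClock.getAsLong();
            return fresh;
        });
        synchronized (bucket) {
            long now = nanoClock.getAsLong();
            double refill = (now - bucket.lastRefillNanos) / 1e9 * permitsPerSecond;
            bucket.tokens = Math.min(burst, bucket.tokens + refill);
            bucket.lastRefillNanos = now;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```

Because each client has its own bucket, one noisy caller is throttled without affecting the others.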
  • 29. Throttling - Circuit Breaking
      ● Closed: success or fail (under threshold) stays Closed; fail (threshold reached) moves to Open
      ● Open: call/raise circuit open; reset timeout moves to Half Open
      ● Half Open: success moves to Closed; fail moves back to Open
      [Diagram: Service A calls Service B through a Circuit Breaker; four calls time out (⚠), the circuit trips, and the next call is rejected with "circuit open" (🚫)]
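The three states can be expressed as a small state machine. The sketch below is a deliberately simplified illustration and not how Hystrix or Resilience4j implement it: it counts consecutive failures rather than a failure rate, and it does not limit the number of trial calls in the half-open state. `SimpleCircuitBreaker` is a hypothetical name.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker based on consecutive failures. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutNanos;
    private final LongSupplier nanoClock;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtNanos;

    SimpleCircuitBreaker(int failureThreshold, long resetTimeoutNanos, LongSupplier nanoClock) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutNanos = resetTimeoutNanos;
        this.nanoClock = nanoClock;
    }

    /** Current state; an open circuit becomes half open once the reset timeout has passed. */
    synchronized State state() {
        if (state == State.OPEN && nanoClock.getAsLong() - openedAtNanos >= resetTimeoutNanos) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the action unless the circuit is open, and records the outcome. */
    <T> T call(Supplier<T> action) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit open"); // fail fast, do not call upstream
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call, or too many failures in a row, (re)opens the circuit.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtNanos = nanoClock.getAsLong();
        }
    }
}
```

While the circuit is open the wrapped action is never invoked, which is what gives the struggling upstream room to recover.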
  • 30. Throttling - Adaptive Concurrency Limits
      gradient = RTTnoload / RTTactual
      newLimit = currentLimit × gradient + queueSize
      [Diagram: Queue, Concurrency]
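To make the two formulas concrete, here is a small worked example in Java. It implements only the arithmetic shown on the slide; real limiters typically also bound and smooth the result, which is not shown. The class name and the millisecond units are assumptions.

```java
/** Hypothetical worked example of the gradient formulas above. */
final class GradientLimitExample {
    private GradientLimitExample() {
    }

    /** gradient = RTTnoload / RTTactual; below 1.0 when requests are slower than the no-load baseline. */
    static double gradient(double rttNoLoadMillis, double rttActualMillis) {
        if (rttNoLoadMillis <= 0 || rttActualMillis <= 0) {
            throw new IllegalArgumentException("RTTs must be positive");
        }
        return rttNoLoadMillis / rttActualMillis;
    }

    /** newLimit = currentLimit * gradient + queueSize, rounded to a whole number of requests. */
    static int newLimit(int currentLimit, double rttNoLoadMillis, double rttActualMillis, int queueSize) {
        double g = gradient(rttNoLoadMillis, rttActualMillis);
        return (int) Math.round(currentLimit * g + queueSize);
    }
}
```

With currentLimit = 100, RTTnoload = 50 ms, RTTactual = 100 ms and queueSize = 10, the gradient is 0.5 and the new limit is 60. When latency is back at the no-load value the gradient is 1.0 and the limit grows by queueSize to 110.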
  • 31. Fallbacks and Rejection
      ● Cache
      ● Dead letter queues for writes
      ● Return hard-coded value
      ● Empty Response (“Fail Silent”)
      ● User experience
      ⚠ Make sure to discuss these with your product owners ⚠
  • 34. Hystrix - Topology Tools for Improving Resilience
      [Diagram: Client and Dependencies A, B, C, D, E and F]
  • 35. Hystrix - Resilience Features
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      |              |       |
      Throttling | Timeouts                    | ✅      |              |       |
                 | Rate Limiting               | 🚫      |              |       |
                 | Circuit Breaking            | ✅      |              |       |
                 | Adaptive Concurrency Limits | 🚫      |              |       |
      Rejection  | Fallbacks                   | ✅      |              |       |
                 | Response Caching            | ✅      |              |       |
  • 36. Hystrix - Configuration management

        import com.netflix.hystrix.HystrixCommand;
        import com.netflix.hystrix.HystrixCommandGroupKey;
        import com.netflix.hystrix.HystrixCommandKey;

        public class GetUserInfoCommand extends HystrixCommand<UserInfo> {
            private final String userId;
            private final UserInfoApi userInfoApi;

            // The API client is supplied by the caller
            public GetUserInfoCommand(String userId, UserInfoApi userInfoApi) {
                super(HystrixCommand.Setter
                        .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserInfo"))
                        .andCommandKey(HystrixCommandKey.Factory.asKey("getUserInfo")));
                this.userId = userId;
                this.userInfoApi = userInfoApi;
            }

            @Override
            protected UserInfo run() {
                // Simplified to fit on a slide - you'd have some exception handling
                return userInfoApi.getUserInfo(userId);
            }

            @Override
            protected String getCacheKey() {
                // Dragons reside here
                return userId;
            }

            @Override
            protected UserInfo getFallback() {
                return UserInfo.empty();
            }
        }
  • 37. Hystrix - Observability
  • 38. Hystrix - Testing
      Scope:
      ● Unit tests easy for circuit opening/closing, fallbacks
      ● Integration tests reasonably easy for caching
      Caveats:
      ● If you are using response caching, DO NOT FORGET to test HystrixRequestContext
      ● Depending on the errors thrown by the call, you might need to test circuit tripping (HystrixRuntimeException vs HystrixBadRequestException)
      ● If you’re not careful, you might set the same HystrixCommandGroupKey and HystrixCommandKey
  • 39. Hystrix - Adoption Considerations
      ✅
      ● Observability
      ● Mostly easy to test
      🚫
      ● No longer supported
      ● Not language agnostic
      ● Cumbersome to configure
      ⚠
      ● Forces you towards building thick clients
      ● Tricky to enforce on calling services
      ● HystrixRequestContext
  • 41. Resilience4j - Topology
      [Diagram: Client and Dependencies A, B, C, D, E and F]
  • 42. Resilience4j - Resilience Features
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      | ✅           |       |
      Throttling | Timeouts                    | ✅      | ✅           |       |
                 | Rate Limiting               | 🚫      | ✅           |       |
                 | Circuit Breaking            | ✅      | ✅           |       |
                 | Adaptive Concurrency Limits | 🚫      | 👷           |       |
      Rejection  | Fallbacks                   | ✅      | ✅           |       |
                 | Response Caching            | ✅      | ✅           |       |
  • 43. Resilience4j - Configuration management

        CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .waitDurationInOpenState(Duration.ofMillis(1000))
            .recordExceptions(IOException.class, TimeoutException.class)
            .build();
        CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
        CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("getUserInfo", circuitBreakerConfig);

        RetryConfig retryConfig = RetryConfig.custom().maxAttempts(3).build();
        Retry retry = Retry.of("getUserInfo", retryConfig);

        // The retry wraps the call, and the circuit breaker wraps the retry
        Supplier<UserInfo> decorateSupplier = CircuitBreaker.decorateSupplier(circuitBreaker,
            Retry.decorateSupplier(retry, () -> userInfoApi.getUserInfo(userId)));

        // Fall back to an empty UserInfo if the decorated call fails
        UserInfo result = Try.ofSupplier(decorateSupplier).getOrElse(UserInfo.empty());
  • 44. Resilience4j - Observability
      You can subscribe to various events for most of the decorators:

        circuitBreaker.getEventPublisher()
            .onSuccess(event -> logger.info(...))
            .onError(event -> logger.info(...))
            .onIgnoredError(event -> logger.info(...))
            .onReset(event -> logger.info(...))
            .onStateTransition(event -> logger.info(...));

      Built-in support for:
      ● Dropwizard (resilience4j-metrics)
      ● Prometheus (resilience4j-prometheus)
      ● Micrometer (resilience4j-micrometer)
      ● Spring-boot actuator health information (resilience4j-spring-boot2)
  • 45. Resilience4j - Testing
      Scope:
      ● Unit tests easy for composed layers and different scenarios
      ● Integration tests reasonably easy for caching
      Caveats:
      ● Cache scope is tricky here as well
      ● Basically similar problems to Hystrix testing
  • 46. Resilience4j - Adoption Considerations
      ✅
      ● Observability
      ● Feature rich
      ● Easier to configure than Hystrix
      ● Modularization
      ● Fewer transitive dependencies than Hystrix
      🚫
      ● Not language agnostic
      ⚠
      ● Forces you towards building thick clients
      ● Tricky to enforce on calling services
  • 48. Envoy - Topology
      [Diagram: two Service Clusters of Services, External Services, Discovery]
  • 49. Envoy - Resilience Features
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    |
      Throttling | Timeouts                    | ✅      | ✅           | ✅    |
                 | Rate Limiting               | 🚫      | ✅           | ✅    |
                 | Circuit Breaking            | ✅      | ✅           | ✅    |
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    |
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    |
                 | Response Caching            | ✅      | ✅           | 🚫    |
  • 50. Envoy - Configuration management

        clusters:
        - name: get-cluster
          connect_timeout: 10s
          type: STRICT_DNS
          outlier_detection:
            consecutive_5xx: 5
            interval: 10s
            base_ejection_time: 30s
            max_ejection_percent: 10
          circuit_breakers:
            thresholds:
            - priority: DEFAULT
              max_connections: 3
              max_pending_requests: 3
              max_requests: 3
              max_retries: 3
            - priority: HIGH
              max_connections: 10
              max_pending_requests: 10
              max_requests: 10
              max_retries: 10
          hosts:
          - socket_address:
              address: httpbin-get
              port_value: 8080

        static_resources:
          listeners:
          - address:
              socket_address: ...
            filter_chains:
            - filters:
              - name: envoy.http_connection_manager
                config:
                  ...
                  route_config:
                    ...
                    virtual_hosts:
                    - name: backend
                      ...
                      routes:
                      - match:
                          prefix: "/"
                          headers:
                          - exact_match: "GET"
                            name: ":method"
                        route:
                          cluster: get-cluster
                          ...
                          retry_policy:
                            retry_on: 5xx
                            num_retries: 2
                          priority: HIGH
  • 51. Envoy - Configuration Deployment
      Static config:
      ● You will benefit from some scripting/tools to generate this config
      ● Deploy the generated YAML as a Docker container sidecar using the official Docker image
      Dynamic config:
      ● gRPC APIs for dynamically updating these settings
          ○ Endpoint Discovery Service (EDS)
          ○ Cluster Discovery Service (CDS)
          ○ Route Discovery Service (RDS)
          ○ Listener Discovery Service (LDS)
          ○ Secret Discovery Service (SDS)
      ● Control planes like Istio make this manageable
  • 52. Envoy - Observability
      Data sinks:
      ● envoy.statsd - built-in envoy.statsd sink (does not support tagged metrics)
      ● envoy.dog_statsd - emits stats with DogStatsD compatible tags
      ● envoy.stat_sinks.hystrix - emits stats in text/event-stream formatted stream for use by Hystrix dashboard
      ● build your own
      (Small) subset of stats:
      ● downstream_rq_total, downstream_rq_5xx, downstream_rq_timeout, downstream_rq_time, etc.
      Detecting open circuits/throttling:
      ● x-envoy-overloaded header will be injected in the downstream response
      ● Detailed metrics: cx_open (connection circuit breaker), rq_open (request circuit breaker), remaining_rq (remaining requests until circuit will open), etc.
  • 53. Envoy - Testing
      Scope:
      ● E2E-ish 🙂
      Caveats:
      ● Setup can be tricky (boot the sidecar in a Docker container, put a mock server behind it and start simulating requests and different types of failures)
      ● Will probably need to test this per route, or whatever your config granularity is
  • 54. Envoy - Adoption Considerations
      ✅
      ● Application language agnostic
      ● Enforcement
      ● Change rollout
      ● Caller/callee resilience
      ● Observability
      🚫
      ● Fallbacks
      ● Cache
      ⚠
      ● Testability
      ● Ownership (SRE vs Dev teams)
      ● Configuration Complexity
      ● Operational Complexity
  • 56. Netflix/concurrency-limits - Topology
      [Diagram: Client and Dependencies A, B, C, D, E and F]
  • 57. Netflix/concurrency-limits - Resilience Features
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫
      Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫
                 | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫
                 | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫
                 | Response Caching            | ✅      | ✅           | 🚫    | 🚫
  • 58. Netflix/concurrency-limits - Configuration management

        ConcurrencyLimitServletFilter(
            ServletLimiterBuilder()
                .limit(VegasLimit.newBuilder().build())
                .metricRegistry(concurrencyLimitMetricRegistry)
                .build())
  • 59. Netflix/concurrency-limits - Observability

        class ConcurrencyLimitMetricRegistry(private val meterRegistry: MeterRegistry) : MetricRegistry {

            override fun registerDistribution(id: String?, vararg tagNameValuePairs: String?): MetricRegistry.SampleListener {
                // Distribution samples are ignored
                return MetricRegistry.SampleListener { }
            }

            override fun registerGauge(id: String?, supplier: Supplier<Number>?, vararg tagNameValuePairs: String?) {
                id?.let {
                    supplier?.let {
                        // Tags arrive as a flat name/value list, so pair them up two at a time
                        val tags = tagNameValuePairs.toList().chunked(2).map { Tag.of(it[0], it[1]) }
                        // Register the supplier itself so the gauge is re-read on every sample
                        meterRegistry.gauge(id, tags, supplier) { s -> s.get().toDouble() }
                    }
                }
            }
        }
  • 60. Netflix/concurrency-limits - Testing
      Scope:
      ● E2E-ish 🙂
      Caveats:
      ● Haha good luck with that
  • 61. Netflix/concurrency-limits - Adoption Considerations
      ✅
      ● Caller/callee resilience
      ● Does not require manual, per-endpoint config
      ● Easier to enforce on calling services
      🚫
      ● Not language agnostic
      ● Less mature than others
      ● Documentation is quite scarce
      ⚠
      ● Harder to predict throttling
      ● Observability
  • 62. What did we go for in the end?
  • 63. Resilience libraries showdown
      Category   | Defense Mechanism           | Hystrix | Resilience4j | Envoy | Netflix/concurrency-limits | gRPC | Sentinel
      Retrying   | Retrying                    | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
      Throttling | Timeouts                    | ✅      | ✅           | ✅    | 🚫                         | ✅   | 🚫
                 | Rate Limiting               | 🚫      | ✅           | ✅    | 🚫                         | 👷   | 🚫
                 | Circuit Breaking            | ✅      | ✅           | ✅    | 🚫                         | 👷   | ✅
                 | Adaptive Concurrency Limits | 🚫      | 👷           | 👷    | ✅                         | 👷   | 🚫
      Rejection  | Fallbacks                   | ✅      | ✅           | 🚫    | 🚫                         | 👷   | ✅
                 | Response Caching            | ✅      | ✅           | 🚫    | 🚫                         | 👷   | 🚫
  • 64. And the winner is… Envoy 🥇🥇🥇 (same comparison table as slide 63)
  • 65. And the winner is… Envoy 🥇🥇🥇, but why?
      Reasons:
      ● We already have it
      ● Observability is super strong
      ● Easy to enforce across all our infrastructure
      ● Allows us to have thin clients
      ● Language agnostic
      And the runner up is… Resilience4j (kinda) 🥈🥈🥈
      ● Allowed for retries, caching and fallbacks, but it’s up to the teams
      ● We discourage using request caching for the most part