Resilient design 101 (BuildStuff LT 2017)
Queueing theory is perhaps one of the most important mathematical theories in systems design and analysis, yet few engineers ever learn it. This talk teaches the basics of queueing theory and explores the ramifications of queue behavior for system performance and resiliency. It aims to give you practical skills you can apply to build and tune your systems better. The talk covers:
- Queueing delays
- Queueing capacity
- Little's Law and how to apply it
- Proper sizing of thread and connection pools

1. Resilient design 101
Avishai Ish-Shalom
github.com/avishai-ish-shalom · @nukemberg · avishai.is@wix.com
2. Wix in numbers
~ 600 engineers, ~ 2,000 employees, ~ 100M users, ~ 500 microservices
Wix Engineering locations: Israel (Tel-Aviv, Be’er Sheva), Lithuania (Vilnius), Ukraine (Kyiv, Dnipro)
3. Queues (01)
4. Queues are everywhere!
▪ Futures/Executors
▪ Sockets
▪ Locks (DB connection pools)
▪ Callbacks in node.js/Netty
Anything async?!
5. Queues
▪ Incoming load (arrival rate)
▪ Service from the queue (service rate)
▪ Service discipline (FIFO/LIFO/Priority)
▪ Latency = Wait time + Service time
▪ Service time independent of queue
6. It varies
▪ Arrival rate fluctuates
▪ Service times fluctuate
▪ Delays accumulate
▪ Idle time is wasted
Queues are almost always full or near-empty!
7. Capacity & Latency
▪ Latency (and queue size) rises to infinity as utilization approaches 1
▪ For QoS, keep ρ << 0.75
▪ Decent latency -> over-capacity
ρ = arrival rate / service rate (utilization)
8. Implications
Infinite queues:
▪ Memory pressure / OOM
▪ High latency
▪ Stale work
Always limit queue size!
Work item TTL*
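Both recommendations fit in a few lines of Java. A minimal sketch (the class name, the 1,000-item cap and the 500ms TTL are illustrative choices, not from the talk):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedWorkQueue {
    // Hard cap on queue size: prevents OOM and unbounded latency.
    private final BlockingQueue<WorkItem> queue = new ArrayBlockingQueue<>(1000);
    private final long ttlMillis = 500; // drop work older than this

    record WorkItem(Runnable task, long enqueuedAt) {}

    /** Returns false instead of growing without bound; the caller can shed load. */
    public boolean submit(Runnable task) {
        return queue.offer(new WorkItem(task, System.currentTimeMillis()));
    }

    /** Skips stale items so workers never waste time on expired work. */
    public Runnable take() throws InterruptedException {
        while (true) {
            WorkItem item = queue.take();
            if (System.currentTimeMillis() - item.enqueuedAt() <= ttlMillis) {
                return item.task();
            }
            // Item exceeded its TTL: discard it and keep looking.
        }
    }
}
```

The `offer()` returning false is exactly the hook the backpressure and load-shedding slides below build on.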
9. Latency & Service time
[Figure: latency as a function of utilization]
λ = wait time, σ = service time, ρ = utilization
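The curve this legend accompanied is presumably the classic M/M/1 result; in the slide's own symbols:

```latex
\lambda_{\text{wait}} = \sigma \cdot \frac{\rho}{1-\rho},
\qquad
\text{total latency} = \lambda_{\text{wait}} + \sigma = \frac{\sigma}{1-\rho}
```

The 1/(1 - ρ) blow-up is what the next slide quantifies: moving ρ from 0.5 to 0.55 multiplies total latency by about 1.11x, while moving from 0.9 to 0.99 multiplies it by 10x. (Note the symbol clash: this slide uses λ for wait time, while slide 28 uses λ for throughput.)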
10. Utilization fluctuates!
▪ 10% fluctuation at ρ = 0.5 hardly affects latency (~ 1.1x)
▪ 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)
▪ Be careful when overloading resources
▪ During peak load we must be extra careful
▪ Highly varied load must be capped
11. Practical advice
▪ Use chokepoints (throttling/load shedding)
▪ Plan for low utilization of slow resources
Example:
Resource             Latency   Planned utilization
RPC thread pool      1ms       0.75
DB connection pool   10ms      0.5
12. Backpressure
▪ Internal queues fill up and cause latency
▪ Front layer will continue sending traffic
▪ We need to inform the client that we’re out of capacity
▪ E.g.: blocking client, HTTP 503, finite queues for thread pools
13. Backpressure
▪ Blocking code has backpressure by default
▪ Executors, remote calls and async code need explicit backpressure
▪ E.g. producer/consumer through Kafka
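One way to get explicit backpressure from a plain JDK executor, sketched under illustrative sizing (8 workers, 100-task queue); this is a common pattern, not necessarily the configuration used at Wix:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BackpressureExecutor {
    // Fixed pool of 8 workers; at most 100 tasks may wait in the queue.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            // Queue full -> the task runs on the submitting thread instead.
            // While it runs there, the producer cannot submit more work,
            // which propagates backpressure upstream.
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void submit(Runnable task) {
        executor.execute(task);
    }
}
```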
14. Load shedding
▪ A tradeoff between latency and error rate
▪ Cap the queue size / throttle arrival rate
▪ Reject excess work or send to fallback service
Example: Facebook uses a LIFO queue and rejects stale work
http://queue.acm.org/detail.cfm?id=2839461
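Load shedding is the same bounded queue with the opposite rejection choice: fail fast instead of slowing the caller. A sketch with illustrative names and sizes; mapping the rejection to, say, HTTP 503 is left to the caller:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoadSheddingPool {
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),          // cap the queue size
            new ThreadPoolExecutor.AbortPolicy());  // throw when saturated

    /** Returns false when saturated, so the caller can answer with an error
     *  or route the request to a fallback service. */
    public boolean trySubmit(Runnable work) {
        try {
            executor.execute(work);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // shed: fail fast rather than queue stale work
        }
    }
}
```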
15. Thread Pools (02)
16. Jetty architecture
[Diagram: socket -> acceptor thread -> thread pool (QTP)]
17. Too many threads
▪ O/S also has a queue
▪ Threads take memory, FDs, etc.
▪ What about shared resources?
Bad QoS, GC storms, ungraceful degradation
Not enough threads
▪ Work will queue up
▪ Not enough RUNNING threads
High latency, low resource utilization
18. Capacity/Latency tradeoffs
When optimizing for latency:
For low latency, resources must be available when needed
Keep the queue empty
▪ Block or apply backpressure
▪ Keep the queue small
▪ Overprovision
19. Capacity/Latency tradeoffs
When optimizing for capacity:
For max capacity, resources must always have work waiting
Keep the queue full
▪ We use a large queue to buffer work
▪ Queueing increases latency
▪ Queue size >> concurrency
20. How many threads?
▪ Assuming CPU is the limiting resource
▪ Compute by maximal load (opt. latency)
▪ With a grid: how many cores???
Java Concurrency in Practice (http://jcip.net/)
21. How many threads?
How to compute?
▪ Transaction time = W + C
▪ C ~ Total CPU time / throughput
▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC, and the 0.75 utilization target)
▪ Memory and other resource limits
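The formula these bullets gesture at is the sizing rule from the Java Concurrency in Practice book cited on the previous slide, where W is wait time and C is compute time per transaction:

```latex
N_{\text{threads}} = N_{\text{cpu}} \cdot U_{\text{cpu}} \cdot \left(1 + \frac{W}{C}\right)
```

A worked instance with illustrative numbers: 8 cores, a target U of 0.6, and transactions that wait four times as long as they compute (W/C = 4) give 8 × 0.6 × (1 + 4) = 24 threads.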
22. What about async servers?
23. Async servers architecture
[Diagram: sockets feed an event loop via epoll; the event loop runs callbacks and issues O/S syscalls]
24. Async systems
▪ Event loop callback/handler queue
▪ The callback queue is unbounded (!!!)
▪ Event loop can block (ouch)
▪ No inherent concurrency limit
▪ No backpressure (*)
25. Async systems - overload
▪ No preemption -> no QoS
▪ No backpressure -> overload
▪ Hard to tune
▪ Hard to limit concurrency/queue size
▪ Hard to debug
26. So what’s the point?
▪ High concurrency
▪ More control (timeouts)
▪ I/O heavy servers
Still evolving… let’s revisit in a few years?
27. Little’s Law (03)
28. Little’s law
L = λ⋅W
L = avg clients in the system, λ = avg throughput, W = avg latency
▪ Holds for all distributions
▪ For “stable” systems
▪ Holds for systems and their subsystems
▪ “Throughput” is either arrival rate or service rate depending on the context. Be careful!
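A quick worked instance (numbers are illustrative, not from the talk):

```latex
L = \lambda \cdot W = 1000~\text{req/s} \times 0.05~\text{s} = 50~\text{requests in flight}
```

If monitoring shows far more than 50 concurrent requests at that throughput and latency, the system is not "stable" in the sense above: work is accumulating somewhere.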
29. Using Little’s law
▪ How many requests are queued inside the system?
▪ Verifying load tests / benchmarks
▪ Calculating latency when no direct measurement is possible
Go watch Gil Tene’s “How NOT to Measure Latency”
Read “Benchmarking Blunders and Things That Go Bump in the Night”
30. Using Little’s law
[Diagram: a least-connections LB in front of two services: service 1 with W1 = 0.1 at λ1 = 100, service 2 with W2 = 0.001 at λ2 = 10,000]
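Reading the diagram through Little's law is presumably the point: both services hold the same average number of requests, so a least-connections balancer cannot tell the fast backend from the slow one:

```latex
L_1 = \lambda_1 W_1 = 100 \times 0.1 = 10,
\qquad
L_2 = \lambda_2 W_2 = 10{,}000 \times 0.001 = 10
```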
31. Timeouts (04)
32. How not to timeout
People use arbitrary timeout values:
▪ DB timeout > overall transaction timeout
▪ Cache timeout > DB latency
▪ Huge unrealistic timeouts
▪ Refusing to return errors
P.S.: connection timeout, read timeout & transaction timeout are not the same thing
33. Deciding on timeouts
Use the distribution, Luke!
▪ Resources/errors tradeoff
▪ Cumulative distribution chart
▪ Watch out for multiple modes
▪ Context, context, context
34. Timeouts should be derived from real-world constraints!
35. UX numbers every developer needs to know
▪ Smooth motion perception threshold: ~ 20ms
▪ Immediate reaction threshold: ~ 100ms
▪ Delay perception threshold: ~ 300ms
▪ Focus threshold: ~ 1sec
▪ Frustration threshold: ~ 10sec
See Google’s RAIL model and “UX powers of 10”
36. Hardware latency numbers every developer needs to know
▪ SSD disk seek: 0.15ms
▪ Magnetic disk seek: ~ 10ms
▪ Round trip within the same datacenter: ~ 0.5ms
▪ Packet roundtrip US -> EU -> US: ~ 150ms
▪ Send 1MB over a typical user WAN: ~ 1sec
See “Latency numbers every developer needs to know” (updated)
37. Timeout Budgets
▪ Decide on global timeouts
▪ Pass a context object
▪ Each stage decrements the budget
▪ Local timeouts according to budget
▪ If the budget is too low, terminate preemptively
Think microservices
Example (global timeout: 500ms):
Stage             Used     Budget   Timeout
Authorization     6ms      494ms    100ms
Data fetch (DB)   123ms    371ms    200ms
Processing        47ms     324ms    371ms
Rendering         89ms     235ms    324ms
Audit             2ms      -        -
Filter            10ms     223ms    233ms
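A minimal sketch of such a context object in Java (class and method names are illustrative; gRPC's Deadline plays the same role):

```java
public final class TimeoutBudget {
    private final long deadlineNanos; // absolute global deadline

    private TimeoutBudget(long deadlineNanos) { this.deadlineNanos = deadlineNanos; }

    /** Start a budget from a global timeout, e.g. 500ms for the whole request. */
    public static TimeoutBudget startingAt(long globalTimeoutMillis) {
        return new TimeoutBudget(System.nanoTime() + globalTimeoutMillis * 1_000_000L);
    }

    /** Remaining budget; each stage "decrements" it simply by time passing. */
    public long remainingMillis() {
        return Math.max(0, (deadlineNanos - System.nanoTime()) / 1_000_000L);
    }

    /** Local timeout for a stage: its own cap, but never more than the budget. */
    public long stageTimeoutMillis(long localCapMillis) {
        return Math.min(localCapMillis, remainingMillis());
    }

    /** Terminate preemptively when the budget is too low to be useful. */
    public boolean exhausted(long minUsefulMillis) {
        return remainingMillis() < minUsefulMillis;
    }
}
```

Passing one instance down the call chain and giving the DB stage stageTimeoutMillis(200) reproduces the "Data fetch" row in the table above.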
38. The debt buyer
▪ Transactions may return eventually, after the timeout
▪ Does the client really have to wait?
▪ Timeout and return an error/default response to the client (50ms)
▪ Keep waiting asynchronously (1 sec)
Can’t be used when the client is expecting data back
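A sketch of the pattern with CompletableFuture (Java 9+). The 50ms and 1s figures come from the slide; the class and the slowTransaction/record helpers are illustrative stand-ins:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class DebtBuyer {
    public String serve() {
        CompletableFuture<String> work =
                CompletableFuture.supplyAsync(this::slowTransaction);

        // Client-facing view: copy() gives an independent future, so the
        // timeout default cannot clobber the still-running original.
        String answer = work.copy()
                .completeOnTimeout("default-response", 50, TimeUnit.MILLISECONDS)
                .join(); // the client waits at most ~50ms

        // Keep waiting asynchronously (up to 1s) for the real result,
        // e.g. to finish the transaction, warm a cache, or audit it.
        work.orTimeout(1, TimeUnit.SECONDS)
            .whenComplete((result, err) -> { if (err == null) record(result); });

        return answer;
    }

    private String slowTransaction() { /* possibly-slow backend call */ return "real-response"; }
    private void record(String result) { /* illustrative: cache/audit the late result */ }
}
```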
39. Questions?
github.com/avishai-ish-shalom · @nukemberg · avishai.is@wix.com
40. Thank You
github.com/avishai-ish-shalom · @nukemberg · avishai.is@wix.com
