Everyone dreams of being ‘Web Scale’, but we start out small. We — most of us — don’t launch a service and expect it to serve millions of requests from Day 1. This means that we don’t think about the ways in which our stack will blow up when the number of requests does start climbing. This talk lists simple patterns and checks that Development and Operations teams should implement from Day 1 in order to ensure a robust distributed system.
6. Monitoring 101: Logs!
• Request URL
• UUID
• Resource being queried
• Time taken (ms)
• Size of response (bytes)
• Human-readable identifiers for the data store, type of operation
7. Monitoring 101: Logs!
• Top 5 slowest DB calls
• $ sort -k6 -r -n <logs> | cut -f3- -d ' '
• Top 5 popular URLs
• $ sort -k4,4 -u <logs> | sort -k3 | cut -f3 -d ' ' | uniq -c | sort -k1 -n -r
• Top 5 routes making the maximum number of DB calls
• $ sort -k4 <logs> | cut -f2-4 -d ' ' | uniq -f1 -c | sort -k1 -n -r
20. Integration Points and Domino Effects
[Diagram: a typical stack (Web Server, App Server, DB, a Queue feeding an Analytics Consumer, plus outbound Network Calls), highlighting the integration points where cascading failures begin.]
21. Integration Points and Domino Effects
[The same stack diagram, with failing components marked X: a failure at one integration point knocks over its neighbours like dominoes.]
22. Timeouts and Circuit Breakers
[Circuit-breaker state diagram:
• Closed (everything is operational) moves to Open on failure / resource timeout
• Open (resource has failed): wait for some time; in the meanwhile, fail fast and gracefully degrade
• Open moves to Half-Open (has it recovered?) on attempt reset
• Half-Open moves back to Closed on success, or back to Open on failure]
25. Revisiting our Stack
[The stack diagram again, with each integration point annotated with the patterns that guard it (combinations of CB, T, GD and BP). Caption: "You Shall Not Pass"]
• CB: Circuit Breaker
• T: Timeouts
• BP: Back Pressure
• GD: Graceful Degradation
Hey guys, how's everyone today? I'm looking forward to some kick-ass sessions today and tomorrow! My name's Vedang.
Helpshift is a 3-year-old startup in the Mobile CRM space. As a startup, your greatest weapon is your agility: shipping faster is how you compete with established companies. Any structural and architectural process needs to be balanced against the need to keep shipping.
The aim of this talk is to discuss scalability patterns that balance these constraints: they are relatively simple to implement, and they bake resilience into the system.
So this is the agenda for my talk today.
Monitoring your system is important; I hope we're all agreed on that. Without monitoring, you are basically flying blind.
Having said that, monitoring systems can be complex to set up.
Logs are very effective for monitoring system behaviour!
For example, you can add logging around every network call you make, and record stats like this.
In a runtime like Clojure, you can even do this on the fly, so you can collect a reasonable sample in production and then turn the logging off to avoid the performance penalty.
With logs like this and simple UNIX tools like sort, cut, uniq and grep, you can gain deep insight into what your system is doing. These are the low-hanging fruit that you can fix quickly. If you know awk and sed as well, then you're a wizard and you can do what you want.
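To make the field positions concrete, here is one hypothetical log line carrying those stats (timestamp, level, URL, UUID, resource, time in ms, size in bytes, data store, operation); the exact layout is up to you, as long as the -k and -f field numbers in the commands above match it:

2015-06-10T10:22:31 INFO /api/issues 3f2ac81d mongo 12.4 532 issues-db find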
This is a simple macro to do something like this in Clojure.
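The macro itself isn't reproduced here; a minimal sketch of the idea, assuming clojure.tools.logging is on the classpath, might look like this:

(ns example.logging
  (:require [clojure.tools.logging :as log]))

;; Wrap any network/DB call, log the operation name and elapsed ms,
;; and return the call's result unchanged.
(defmacro with-call-logging
  [op-name & body]
  `(let [start# (System/nanoTime)
         result# (do ~@body)
         elapsed-ms# (/ (- (System/nanoTime) start#) 1e6)]
     (log/info ~op-name "took" elapsed-ms# "ms")
     result#))

;; Usage (find-user is hypothetical):
;; (with-call-logging "db.users/find" (find-user db user-id))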
Unbounded calls are calls where you haven't put a bound on the size of the response or on the number of requests you make. No matter how hard you try, your Dev and QA environments are never going to match production. Real-world usage is hard to predict, and we rarely think about the effects of data piling up over time.
Build default batch sizes into your DB request abstractions.
Real World: I have seen programmers explicitly override this and ask the system to "return everything". Catch this in Code Review!
Build abstractions and contracts for chunked requests: limit-skip, total count.
=scan-and-scroll=
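A sketch of what such an abstraction could look like in Clojure; default-batch-size, query-fn, and the map keys are assumptions for illustration, not a real client API:

;; A default batch size is always applied; "return everything" is not
;; something the caller can express by accident.
(def default-batch-size 100)

;; Lazily walk a result set in fixed-size chunks (limit/skip style),
;; in the spirit of scan-and-scroll.
(defn scan-all [query-fn]
  (letfn [(step [offset]
            (lazy-seq
             (let [chunk (query-fn {:offset offset
                                    :limit default-batch-size})]
               (when (seq chunk)
                 (concat chunk (step (+ offset (count chunk))))))))]
    (step 0)))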
I don’t know if it’s a functional paradigm thing, but a _lot_ of code gets written without any thought about the side-effects.
Catch this in Code Review and fix your functions.
Facebook recently open-sourced a library called Haxl, which provides developers with safe abstractions for accessing remote data.
When you are building your data structures, think about how your data will flow through the system today, as well as in the future you are planning for. Every message queue and cache it passes through imposes a serialization/deserialization penalty. The slide shows an example of a data structure which stores dates as objects versus one which stores them as longs; the second is more than twice as fast to serialize and deserialize.
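A rough illustration of the idea (not the slide's exact code):

;; Same event, two representations of the date.
(def event-as-object {:id 42 :created-at (java.util.Date.)})
(def event-as-long   {:id 42 :created-at (System/currentTimeMillis)})

;; The long round-trips through any wire format (JSON, edn, msgpack)
;; with no custom handlers; the Date object typically needs a
;; format-specific encoder and decoder at every hop.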
When you’re young and write things like “became Core Java expert in 6 months” on your resume, you also believe that the Network “just works”.
Once you start working with Distributed Systems though…
We’re going to have full sessions dedicated to network flakiness, I trust that everyone here will definitely attend them. For the purposes of this talk, I’d just like to say…
Just avoid them. Network calls are slow, and they are the number 1 reason for cascading failures in your system. If your data size is small and it doesn’t change too often, cache it in memory. If your data size is large, cache it on local disk.
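For the small, rarely-changing case, even an atom is enough. A minimal sketch, where fetch-fn stands in for your (assumed) expensive network call:

(defonce cache (atom nil))

;; Fetch once, then serve from memory; refresh by resetting the atom.
;; (A race on first use may fetch twice, which is harmless here.)
(defn cached-fetch [fetch-fn]
  (or @cache
      (reset! cache (fetch-fn))))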
Integration points are the #1 cause of cascading failure in the system.
(Explain the diagram.)
Two powerful patterns help combat cascading failures: timeouts and circuit breakers.
Every call to a resource should be configured to time out.
Make sure that the default timeouts are sane (e.g., Monger).
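When a client library doesn't expose a timeout option, a generic wrapper is a crude fallback. A sketch:

;; Run f on another thread; give up after timeout-ms.
;; Prefer the library's native connect/socket timeouts when they exist.
(defn call-with-timeout [timeout-ms f]
  (let [fut    (future (f))
        result (deref fut timeout-ms ::timed-out)]
    (if (= result ::timed-out)
      (do (future-cancel fut)
          (throw (ex-info "resource call timed out"
                          {:timeout-ms timeout-ms})))
      result)))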
Circuit breakers track the health of your resource, and avoid badgering it when it is unresponsive.
You can now fall back to a secondary source, or just fail fast.
If you know you will fail eventually, you might as well fail immediately.
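To make the state machine from slide 22 concrete, here is a toy sketch (an atom holding the state), just to illustrate the transitions:

(defn make-breaker
  "Opens after max-failures consecutive failures; allows a trial
  (half-open) call once reset-ms have elapsed."
  [max-failures reset-ms]
  (atom {:state :closed :failures 0 :opened-at 0
         :max-failures max-failures :reset-ms reset-ms}))

(defn call! [breaker f]
  (let [{:keys [state opened-at max-failures reset-ms]} @breaker
        now (System/currentTimeMillis)]
    (if (and (= state :open) (< (- now opened-at) reset-ms))
      (throw (ex-info "circuit open: failing fast" {}))    ;; Open: fail fast
      (try
        (let [result (f)]
          (swap! breaker assoc :state :closed :failures 0) ;; success closes it
          result)
        (catch Exception e
          (swap! breaker
                 (fn [{:keys [failures] :as b}]
                   (let [n (inc failures)]
                     (if (>= n max-failures)               ;; trip to Open
                       (assoc b :state :open :failures n :opened-at now)
                       (assoc b :failures n)))))
          (throw e))))))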
Figure out which circuit breakers you need.
Hystrix gives you a nice implementation of circuit breakers, along with a whole host of other scalability patterns.
A Health Check is a way to tell if your production service is responsive or not, and is essential to support features like auto-scaling. The idea is to wait until the machine passes the health check before sending production traffic to it.
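A minimal Ring-style sketch; ping-db! is an assumed function that cheaply checks the service's critical dependency:

;; Returns 200 only when the dependency answers; the load balancer
;; holds back production traffic until it does.
(defn health-handler [ping-db!]
  (fn [_request]
    (if (try (ping-db!) (catch Exception _ false))
      {:status 200 :headers {} :body "ok"}
      {:status 503 :headers {} :body "unhealthy"})))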
Circuit breakers also act as a poor man’s health check.
With our patterns in place, we can contain failures and stop them from infecting the rest of the system.