Paul Cichonski's presentation from SF CloudOps Meetup on building and monitoring fault tolerant systems. (http://www.meetup.com/CloudOps/events/159397622/)
17. You Need Metrics
• Reduce “map/territory” confusion
• We use Yammer Metrics
– Timers
– Meters
– Histograms
• We use them a lot
– Every class has at least one metric, most have multiple (see the sketch below)
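A minimal sketch of how a single class might use all three metric types, written against the Metrics 3.x (com.codahale.metrics) API; the SearchService class and the metric names are hypothetical:

    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class SearchService {
        private final Timer queryTimer;
        private final Meter errorMeter;
        private final Histogram resultSizes;

        // one MetricRegistry is typically shared across the whole app
        public SearchService(MetricRegistry metrics) {
            queryTimer  = metrics.timer(MetricRegistry.name(SearchService.class, "query-latency"));
            errorMeter  = metrics.meter(MetricRegistry.name(SearchService.class, "query-errors"));
            resultSizes = metrics.histogram(MetricRegistry.name(SearchService.class, "result-sizes"));
        }

        public int query(String q) {
            final Timer.Context ctx = queryTimer.time(); // times the whole call
            try {
                int resultCount = doQuery(q);    // hypothetical downstream call
                resultSizes.update(resultCount); // distribution of result sizes
                return resultCount;
            } catch (RuntimeException e) {
                errorMeter.mark();               // failure rate
                throw e;
            } finally {
                ctx.stop();
            }
        }

        private int doQuery(String q) {
            return q.length(); // placeholder for real work
        }
    }

The Timer yields latency percentiles, the Meter an error rate, and the Histogram the distribution of result sizes; all three come for free once registered.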
“These are the technologies we use at Lithium. You can see that we use a few different technologies for storing data, mostly for different use cases (i.e., batch vs. realtime vs. transactional). On top of all these data-storage technologies we are also building up services as we move towards a service-oriented architecture and horizontally scalable, highly available services. As we move towards SOA, and more importantly towards cloud, the design space changes and we must deal with failure more realistically... transition: failure is constant.”
- As the number of dependencies needed to fulfill a request goes up, so does the probability of failure. That is fine; we just can’t have cascading failure. For example, a request that touches five dependencies, each 99.9% available, succeeds only about 99.5% of the time (0.999^5 ≈ 0.995).
- Everyone has now heard of the Netflix Simian Army for simulating failure, but we don’t always talk about the coding practices needed to withstand its wrath.
This is not really a problem of “cloud”; it is a problem of building distributed, horizontally scalable applications. The only time you don’t have to worry about these things is if your chosen method of scaling is “up” and not “out”, but even then, does it connect to users (i.e., the wider system context is distributed)? In the past generation of “scale-up”, failure meant everything was dead; now it just means functionality gets degraded.
- Especially on network calls, this is about protecting the client
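The slide text is not reproduced here, but one basic way to protect the client on a network call is an explicit timeout, so a stalled dependency cannot pin the calling thread indefinitely. A minimal sketch using java.net.HttpURLConnection; the timeout values are illustrative:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public final class TimeoutClient {
        public static InputStream fetch(String url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(1_000); // give up if we cannot connect within 1s
            conn.setReadTimeout(2_000);    // give up if the server stalls mid-response
            return conn.getInputStream();  // throws SocketTimeoutException on breach
        }
    }

Without both timeouts, a single slow dependency can exhaust the caller’s thread pool.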
- Also about protecting the client. See Hystrix from Netflix: it protects the client while allowing the downstream service to heal.
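A minimal sketch of a Hystrix command wrapping a downstream call; GetUserCommand and the placeholder fetch are hypothetical, but the run()/getFallback() structure is the library’s standard pattern:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class GetUserCommand extends HystrixCommand<String> {
        private final String userId;

        public GetUserCommand(String userId) {
            super(HystrixCommandGroupKey.Factory.asKey("UserService"));
            this.userId = userId;
        }

        @Override
        protected String run() throws Exception {
            // stand-in for the real network call to the downstream service
            return fetchUserOverNetwork(userId);
        }

        @Override
        protected String getFallback() {
            // used on failure, timeout, or when the circuit is open;
            // this is what gives the downstream service room to heal
            return "anonymous";
        }

        private String fetchUserOverNetwork(String id) throws Exception {
            return "user:" + id; // hypothetical placeholder
        }
    }

new GetUserCommand("42").execute() runs the command synchronously; Hystrix also offers queue() and observe() for async execution, which is where the caution below about reasoning over async code applies.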
- Beware that it is harder to reason about anything async.
- This is about signaling to upstream traffic that something is wrong downstream and they may want to take evasive action.
Or at least fail over. Easy in cloud, harder in a datacenter.
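One common way to signal health upstream is a health check endpoint that a load balancer can poll and use to fail over. A minimal sketch using the health-check module that ships alongside Yammer/Dropwizard Metrics; the Database interface is a hypothetical stand-in for a real dependency:

    import com.codahale.metrics.health.HealthCheck;
    import com.codahale.metrics.health.HealthCheckRegistry;

    public class DatabaseHealthCheck extends HealthCheck {
        interface Database { boolean ping(); } // hypothetical dependency wrapper

        private final Database db;

        public DatabaseHealthCheck(Database db) {
            this.db = db;
        }

        @Override
        protected Result check() {
            // an unhealthy result can drive a load balancer to route around this node
            return db.ping()
                    ? Result.healthy()
                    : Result.unhealthy("cannot reach database");
        }

        public static void register(HealthCheckRegistry registry, Database db) {
            registry.register("database", new DatabaseHealthCheck(db));
        }
    }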
- They should be explicit: how is this app going to deal with failure in each of these dependencies (always from the client-side perspective)?
- Now that your apps have all of the previous concepts built in, bad stuff will still happen. How do you manage the service in production to know when things are going wrong?
Find the most critical calls in your app (i.e., network calls, client calls). Figure out a way to visualize them to gain instant awareness of what is going wrong in prod. A service should be small enough for a single engineer to gain full insight (assuming he has the baseline).
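To get those critical calls onto a dashboard, Metrics can ship every registered metric to a time-series backend on a schedule. A sketch using the metrics-graphite reporter; the host, port, and prefix are assumptions:

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    public final class Reporting {
        public static void start(MetricRegistry registry) {
            // hypothetical Graphite host; 2003 is the plaintext-protocol default port
            Graphite graphite = new Graphite(new InetSocketAddress("graphite.internal", 2003));

            GraphiteReporter.forRegistry(registry)
                    .prefixedWith("my-service")          // hypothetical service name
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build(graphite)
                    .start(1, TimeUnit.MINUTES);         // report once a minute
        }
    }

ConsoleReporter and JmxReporter follow the same forRegistry(...).build().start(...) pattern if Graphite is not available.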
- Create alerts around specific log levels (ERROR) or system usage outside of a well-known baseline.
- An alert should mean that something needs immediate attention (i.e., keep noise to a minimum).
- Alerts should be a last resort (because sometimes you need to sleep).
- Alerts should not be a substitute for continuous monitoring of the service through dashboards.