Monitoring services is easy, right? Set up a notification that goes out when a certain number increases past a certain threshold to let you know that there’s a problem. But if that’s the case, why are so many teams drowning in alerts and dreading their time on call? The reason is that we tend to monitor the wrong things: reactive alerts, metrics whose impact on our service we don’t completely understand, and capacity alerts. We look at our own view of the service and fail to consider that our customers have a different view.
Come learn to let go of what does not help, and explore how to monitor for what truly matters: what the customer sees. This starts with defining our agreements with our customers, continues through building applications intelligently and instrumenting all the things, and finishes with picking the right signals out of that instrumentation to generate alerts that are actionable, not ones that introduce confusion and noise. We will also touch on capacity planning, and how it should never wake you up. You’ll find it’s possible to assure that you meet your service level objectives while still maximizing your sleep level objectives.
5. Network Operations Center
• Central monitoring and alerting
• Gatekeeping monitored alerts with no deep knowledge
• Information overload for a moderate-sized system
• Glorified telephone operators
6. Kafka Under-Replicated Partitions
• Unclear meaning
• Sometimes it’s not a problem at all
• Does the customer care as long as requests are getting served?
• Frequently gets ignored in the middle of the night
7. CPU Load
• Relative measure of how busy the processors are
• Who cares? Processors are supposed to be busy
• What’s causing it?
• Might be capacity. Maybe
10. Let’s Be Smart About This
• Specific
• Measurable
• Agreed
• Realistic
• Time-limited, Testable
11. Common SLOs
• Availability: Is the service able to handle requests?
• Latency: Are requests being handled promptly?
• Correctness: Are the responses returned correct?
13. Monitor: Observe and check the progress or quality of (something) over a period of time; keep under systematic review.
14. So WTF is Observability?
• Comes from control theory
• A measure of how well internal states of a system can be inferred from knowledge of its external outputs
• It’s a noun – you have this (to some extent). You can’t “do” it.
16. What Can We Work With?
Metrics: single numbers
• Counters
• Gauges
• Histograms (and Summaries)
Events: structured data
• Log messages
• Tracing (collection of events)
17. Where Can We Get It?
Subjective
• Rich data on internal state
• Necessary for high observability
• Tons of data possible, but the utility is often questionable
• Beware! Here be dragons!
Objective
• Customer view of your system
• Think of “Down For Everyone Or Just Me?”
• Critical for SLO monitoring
• More difficult to do, but it’s the authority on whether or not something is broken
19. Build For Failure
• Intelligence: rich instrumentation on every aspect
• Availability: tolerate single component failures (not just N+1)
• Capacity: limit resource creation and utilization
20. Using the SLO: It’s the only thing that matters
• Always measure the SLIs
• Objective monitoring is best
• Don’t beat the SLO
• Only alert on the SLO
21. ONLY???
• SLO alerts find unknown-unknowns
• Known-unknowns and unknown-knowns must only exist transiently
• A known-known should not require a human. Automate responses to known issues
• For all else, if you have a 100% signal it can be an alert. But if it doesn’t impact the SLO, does it need to wake you up?
22. What About Capacity?
• Use Quotas: assure no single user can quickly overrun capacity
• Report & Review: frequently enough to respond to trend changes
• Act Promptly: never ignore or put off expansion work
24. What Should I Do Next?
Define Your SLOs
• Talk to your customers and agree on what they can expect
• Add objective monitoring for these expectations
Clean Up Alerts
• Inventory alerts and eliminate any that are not a clear signal
• Add alerts for the SLOs that you have agreed on
• Implement quotas, if needed, to assure capacity isn’t suddenly overrun
Add Instrumentation
• Switch to structured logging
• For distributed systems, consider adding request tracing
• But make sure you don’t hold this extra information for longer than it’s needed for debugging
25. More Resources
• Finding Me
  • linkedin.com/in/toddpalino
  • slideshare.net/ToddPalino
  • @bonkoif
• Code Yellow – How we help overburdened teams
  • devops.com/code-yellow-when-operations-isnt-perfect
  • Usenix LISA (10/29 – Nashville) “Code Yellow: Helping Top-Heavy Teams the Smart Way”
• SRE – What does the culture look like at LinkedIn
  • “Building SRE” usenix.org/conference/srecon18asia/presentation/palino
  • Every Day is Monday in Operations – everydayismondayinoperations.com
• Kafka – Deep dive on monitoring for Apache Kafka
  • confluent.io/kafka-summit-london18/urp-excuse-you-the-three-metrics-you-have-to-know
Before we get started, I want to give everyone a chance to snap this. After this talk is over, by the end of the day, I will post the slide deck up on my SlideShare along with the powerpoint original (with slide notes). So anything you want to be available to reference will be there for you.
Now that we’ve got that out of the way, what are we going to talk about? I’m going to start with examining a few anti-patterns in alerting that I’ve seen at LinkedIn, and how they’ve caused us pain. We’ll then talk a little about establishing what our application goals should look like. I’ll talk about the current state of the monitoring world we have to work with (along with a buzzword or two), and exactly what we are trying to get out of good monitoring. Once we have that established, we can talk about how we should design our stacks so that we get the information we need without killing ourselves. Lastly, we’ll quickly review what we should all be doing right now to make our lives easier.
When it comes to alerting, we make a lot of mistakes. Still. We set up far too many alerts, on information that we only partially understand, and we don’t even know how to act on it. This leads to us ignoring the things that we set up to ping us, and that muddies the signal even more. We often can’t see the size of the problem until we’re working through multiple sleepless nights, and at that point all we have the energy for is digging ourselves out of the current crisis. So what are some of the worst ones that I have seen at LinkedIn?
Probably the best example of signals going to the wrong place is the Network Operations Center. Our Site Operations Center is made up of some really sharp engineers who are tasked not only with coordinating large incidents, but also with helping to monitor the growth and overall site health. Like every other engineer, they build tools and processes around this.
But back when we still called them the NOC, the role was vastly different. At the time, we didn’t have a good way to get alerts to individual engineers. So the NOC would have alerting dashboards they would monitor with hundreds upon hundreds of alerts. These would have to be onboarded to the NOC, making them the gatekeeper for what was and wasn’t important enough to have someone monitoring 24x7. To make it worse, most of the runbooks were “Call the SRE”, so they couldn’t even help to resolve most problems. This left us with a bunch of engineers who had to restrict what information was sent to them, without deep knowledge of the systems being monitored. And while they also had responsibilities around monitoring the growth metrics, they were often working as little more than a switchboard operator – waking engineers up at night, taking orders for who to escalate to.
Thankfully, this is much better now – while the SOC still helps with escalation, it’s a minor part of their role. Alerts are now managed by the individual SRE teams, and the notifications go to those teams, not a central team that has to dispatch them.
Another problem, one that I am still constantly fighting, is that of unclear monitoring signals. In Apache Kafka, the “under-replicated partitions” metric is often used as the gold standard of what to monitor and alert on. I know, because I helped to write the definitive guide for Kafka and I told everyone that it was the gold standard of what to monitor. Now I’m stuck with the unenviable task of having to tell everyone how wrong I was and why.
The problem is that this metric doesn’t provide a clear signal as to what is wrong. This graph usually means that one of the Kafka brokers is offline, but not always – it could also mean that it’s up but replication is broken in a variety of ways. The next graph is usually a problem with a broker talking to a single other broker for replication. But it could also be a single hot topic or partition that’s unbalanced. This last graph, nobody has any idea what it means. And sometimes it’s not a problem at all – when you need to move partitions around to balance a cluster, they are naturally under-replicated while data is being moved.
So what happens? The Kafka team has an alert that goes off if this metric is more than 10. Why 10? Because normally we don’t move more than 10 partitions at a time, so that will mask a known false alert. But the alert often gets ignored when it goes off. There’s a separate signal for a broker that is down, and it’s often a sign of a capacity problem that can’t be immediately resolved (more on that later). In addition, the customers don’t really care that the cluster is under-replicated (although sometimes they do). So it’s not a crisis. Which raises the question: why is it an alert in the first place?
It’s much like this other lovely metric, CPU load. I would guess that we have all seen this, or some variant like CPU utilization, as an “important metric” in our alerts, despite the fact that most engineers I have had the pleasure of working with over the years don’t understand what it measures without asking Google. And even when you understand the measurement (hint: it’s a relative measure of how busy the processors on a system are, averaged over some period of time), you then need to understand what is a good number and what is a bad number. For example, is a CPU load of 20 bad? If you have 16 CPUs in your system the answer is very different than if you have 24.
Even then, it’s hard to say if it’s bad. There’s a question I heard someone ask at one point: if you run an email server, and the CPU utilization is at 99%, what do you do? The answer is that if the emails are getting delivered properly, you go get a drink. The CPU is supposed to be busy – it’s there to do work for you. If the work is happening properly, why do you care how busy it is at any point in time? The one caveat is that you might have a capacity concern building, but this is a lousy way to measure it. And even if you do, what are you going to do about it in the middle of the night?
Complicating this, it’s a measure of the overall system. Any given system has hundreds of processes running. Which one is causing the CPU to be high? Maybe it’s an opportunistic process that will back off if the CPU is demanded by something else. Yet another reason why this shouldn’t wake you up.
If we shouldn’t be alerting on things that aren’t clear, or don’t matter to our users. Or if we want to make sure that the signals we provide don’t swamp us (or a NOC) with irrelevant information, what should we do? The answer is that we have to start with defining what our goals are for the application. The only goals that matter are what our customers expect – they don’t care if the CPU is a little hot. They also don’t care, unless we have made promises around internal replication, whether or not a given thing is under-replicated. They care about the agreements that we have made with them, so we have to define them before we know what we should measure, track, and alert on.
Most everyone here will be familiar with the term “service level agreement”. However, I would also guess that most of us (or our management) use it somewhat incorrectly, especially when discussing parts of an internal distributed architecture (and not external customers). There are actually several terms that you should know, and encourage the use of properly so that we have clear communication on what our goals are. All of these start with “service level”, which indicates that we’re talking about items that define how we deliver the services that we’re responsible for – what the level of performance is.
The first is “service level indicator”, abbreviated SLI. The SLI is the measurement that will form the basis for any goal. For example, if we want to track a service’s error rate as a goal, then the metric that tells us what the error rate is will be an SLI. If we want to track latency, we may have several SLIs – we might want to use both the 50th and 99th percentiles, for example (more on that later).
Next we have “service level objective”, or SLO. This is the term that more people know, however ITIL has decided to deprecate the use of this term in their documentation and use “service level target” (SLT) instead. They have the same meaning, however. The SLO is the specific measurements of your SLI that are considered good and bad. So if your goal is to have the error rate be less than 1%, and you’re going to track that over the course of a day, your SLO could be “service error rate < 1% averaged over 24 hours”, a combination of an SLI and a specific target for it, including the timeframe that it is measured over.
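A minimal sketch of evaluating such an SLO, assuming hypothetical per-window request and error counters as the SLIs (the names here are illustrative, not any particular system’s API):

```python
from dataclasses import dataclass

@dataclass
class WindowSample:
    requests: int   # total requests observed in this window
    errors: int     # failed requests in this window

def error_rate_slo_met(samples, threshold=0.01):
    """Aggregate the counter SLIs over the measurement window and
    compare the resulting error rate against the SLO target."""
    total = sum(s.requests for s in samples)
    bad = sum(s.errors for s in samples)
    if total == 0:
        return True  # no traffic: nothing violated the objective
    return (bad / total) < threshold

# 24 hourly samples; one bad hour does not by itself breach an
# SLO averaged over the whole day
day = [WindowSample(10_000, 20)] * 23 + [WindowSample(10_000, 900)]
print(error_rate_slo_met(day))  # True: 1360/240000 is under 1%
```

Note that the averaging window is doing real work here: the same bad hour evaluated on its own would breach the objective.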
On top of all this is the “service level agreement”, or SLA. The SLA comprises the entire agreement with the customer, which includes not only the SLO, but also the consequences for missing the SLO and how customer support is provided. It is a contract, and if you’re running a system with internal customers you probably don’t have one of these (even if you have SLOs already).
With terminology defined, how do we define what the SLOs should be for a given service?
We’re going to talk about another thing that we’ve probably all heard about, SMART goals. There’s a number of variants on what the letters S, M, A, R, and T are supposed to stand for here, so we’re going to use a set that will apply well to talking about SLOs. The end result should be an understanding of how to pick SLOs that won’t bite us later.
First, they need to be specific. It’s not sufficient to say we’re going to measure latency. We need to specify exactly what metric we are going to use and how it is measured. For something like latency, it’s important to know if we’re using a histogram so we can specify the right components (like 99th percentile). We also need exact target measurements and the time periods that they are measured over.
Next, and this may seem like it should go without saying, they need to be measurable. We can’t have a goal of making the customer happy – we don’t have a way to measure that. Even more reasonable goals, like error rates and latencies, fail this test if we don’t have a way to measure them properly. It’s also worth noting here that where we measure these things from matters. What our server says the latency and error rate are will probably differ from what our clients think. So it’s important to say where the measurement is made from.
The SLOs also have to be agreed on between us and our customers. This is pretty simple – I can’t specify alone what the SLOs for my service are, because I need to know what’s required and acceptable to my customers. Neither can they say what the SLOs are without my involvement.
This is because the goals also need to be realistic. If you have a service that performs a complex calculation which takes a minimum of 10ms of CPU time (in terms of the wall clock), then it’s not reasonable to have a latency SLO at 5ms. If you can’t agree on the goals, specifically if they’re not realistic, it’s a sign you need to have a deep discussion with your customers about what they’re trying to accomplish.
Lastly, we’re going to have two “T”s – time-limited and testable. Testable is very similar to measurable, but we can also use it to say that we should be able to test for compliance with the SLOs before we release new code. That might be through pre-release performance testing, canaries, or some other mechanism for minimizing the impact of bad code. Time-limited means that we need to define the time period over which we are measuring the SLO. If we’re talking about availability of the service, then four nines measured over a day is very different than when it is measured over a week. The first one means that we can only have less than 9 seconds of downtime every day, whereas the latter means we can have a single event where we are down for a minute in a week.
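The downtime arithmetic behind that comparison is easy to sketch; the function and constants below are purely illustrative:

```python
# Sketch: the same availability target allows very different outage
# patterns depending on the measurement window ("time-limited").

def downtime_budget_seconds(availability, window_seconds):
    """Maximum allowed downtime for an availability target over a window."""
    return (1 - availability) * window_seconds

DAY = 24 * 60 * 60
WEEK = 7 * DAY

print(downtime_budget_seconds(0.9999, DAY))   # ~8.6 seconds per day
print(downtime_budget_seconds(0.9999, WEEK))  # ~60.5 seconds per week
```

Four nines over a day means every single day must be nearly flawless; four nines over a week tolerates one minute-long event.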
So what are the SLOs you should be looking at? It’s hard for me to tell you what your customers want you to agree to for any given service, but they’re usually going to be one of these three things: availability, latency, or correctness.
Availability is simply whether or not your service is up and running and able to handle requests. It doesn’t (usually) mean that every component is up, just that a customer can send a request and get a response. Usually this is expressed in terms of how many nines you have.
Latency is how long it takes to service a request. This may be overall for your service, or if you have endpoints that behave differently, it might be per endpoint. You also may have SLOs that cover different percentiles for the same value. For example, you might have a 100ms response time at the 50th percentile, and a 1 second response time at the 99th percentile.
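A toy illustration of why one latency SLO often needs several percentile SLIs: the median can look fine while the tail is badly out of bounds. The nearest-rank percentile function and sample values are illustrative, not any particular library’s API.

```python
def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies = [80] * 98 + [1200, 1500]  # mostly fast, with a slow tail
print(percentile(latencies, 50))   # 80: within a 100 ms p50 target
print(percentile(latencies, 99))   # 1200: over a 1 s p99 target
```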
Correctness is the term I use instead of error rate. This is because an error is just one type of bad response. Giving a wrong answer is usually worse than returning an error code.
All together, these are the things customers usually care about: can I make requests, do I get responses in a reasonable time period, and are the responses correct?
Now we can talk about what we have to work with to make sure that we hit those goals. The entire discipline is called monitoring, but what is it really?
The Oxford definition of “to monitor” is: Observe and check the progress or quality of (something) over a period of time; keep under systematic review. This matches up with what we know – monitoring is about watching and measuring the status of our services.
OK, enough about observability. What are we looking for with our monitoring? Where are we trying to get to?
We’re going to call this the Rumsfeld Quadrant. On the side, we have detection of problems – there are known issues that we are able to detect, and unknown issues that we are not able to detect. Along the top, we have our response to those problems – there are issues where we know how to respond to them, how to fix them, and we have issues that we don’t know how to fix.
For issues that we can detect and that we know how to respond to, these are the known-knowns. This is the realm of good monitoring – crisp signals, and runbooks to address them. These are things that should be automated as well – we know how to fix them, so why spend engineer hours on doing this?
If we have problems that we can detect, but we don’t know how to respond to, these are known-unknowns: our active incidents that we’re working on. For problems we can’t detect, but we know how to fix, these are unknown-knowns. These are monitoring gaps that need to be resolved.
The last category is the unknown-unknowns – problems that we don’t know about, and we don’t know how to fix. These are tweets about your service being broken – you’re hearing about it from customers, and not from your own system. And you’re being spun up to quickly figure out what the heck broke. By necessity, these need to rapidly migrate to either known-unknowns or unknown-knowns, depending on whether you can fix it or detect it faster.
Eventually, all of these issues need to migrate to known-knowns. If they don’t, you’re stuck in reactive work all the time.
So we know what we’re looking for. How can we get there? What data do we have to work with? Monitoring generally breaks down to two types of things: metrics, and events.
Metrics are single numbers, whereas events are structured data. By the way, when you hear people talk about “observability”, they’re talking about events.
Metrics are what we make our graphs out of. Counters are pretty simple – constantly increasing integers. Total number of requests, total number of errors. Gauges are numbers (either integers or floats) that fluctuate up and down, like a speedometer. This could be requests per second, or network utilization, or CPU utilization. A third type of data is histograms. This is bucketed data, typically represented as percentiles. The 50th percentile is the median, the 99th percentile means that 99% of values are less than the metric value. We often use histograms for metrics like latencies, so that we can not only see the average but also what the worst offenders are.
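These three metric types can be sketched in a few lines. A real service would use a metrics library; these minimal classes are only illustrative.

```python
class Counter:
    """Monotonically increasing integer, e.g. total requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        self.value += n

class Gauge:
    """A value that moves up and down, e.g. requests per second."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

class Histogram:
    """Raw observations queried as percentiles (bucketing omitted)."""
    def __init__(self):
        self.samples = []

    def observe(self, v):
        self.samples.append(v)

    def pct(self, p):
        ordered = sorted(self.samples)
        rank = max(0, int(round(p / 100 * len(ordered))) - 1)
        return ordered[rank]

requests = Counter()
requests.inc()

latency = Histogram()
for ms in (5, 7, 9, 200):
    latency.observe(ms)

print(requests.value)    # 1
print(latency.pct(50))   # 7: the average-ish view hides the outlier
print(latency.pct(99))   # 200: the tail exposes the worst offender
```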
For events, the one we probably all have already is log messages. Of course, they’re probably not as structured as we’d like them to be – most of us are using plain strings, even if they’re in a well-defined format like Apache HTTPD logs. Everyone, if they’re not already, should be moving towards true structured logging in a format like JSON (there are other options, but JSON is the most common).
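As one sketch of what that looks like, Python’s standard logging module can be pointed at JSON output with a custom formatter; the field names here are illustrative, not a standard schema.

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single machine-parseable JSON line."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S",
                                time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myservice")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request handled")  # emits one JSON object per line
```

Each line is now trivially filterable and aggregatable by downstream tooling, instead of requiring a regex per log format.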
The other type of event data, specifically relevant for distributed systems, is request tracing. This is actually a collection of events, where each event is a discrete call made as part of the initial request. For example, a user makes a request at service A. Service A then calls services B and C to get results, and assembles a response to the user. You would have 3 events in a trace, at minimum – one each for the initial user call to A, the call from A to B, and the call from A to C. These events will have rich data about the requests: the caller, the endpoint called, the status of the call, the time the call took, and possibly much more information.
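The trace events for that fan-out might look something like this sketch, where the field names mirror common tracing systems but are purely illustrative:

```python
import json
import uuid

def span(trace_id, caller, callee, status, duration_ms):
    """One discrete call made as part of servicing a single request."""
    return {
        "trace_id": trace_id,        # ties all events of one request together
        "span_id": uuid.uuid4().hex, # unique per call
        "caller": caller,
        "callee": callee,
        "status": status,
        "duration_ms": duration_ms,
    }

trace_id = uuid.uuid4().hex
events = [
    span(trace_id, "user", "A", 200, 42.0),  # initial user call
    span(trace_id, "A", "B", 200, 18.5),     # A fans out to B
    span(trace_id, "A", "C", 200, 12.1),     # ...and to C
]
for e in events:
    print(json.dumps(e))
```

The shared `trace_id` is what lets a debugging tool reassemble the whole request path after the fact.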
This list isn’t exhaustive, by any means. But it is the most common types of monitoring data we have to work with. Metrics are usually where your alerts come from – they let you know that there is a problem. Events are usually where you get your detail for debugging from – they help you figure out why you have a problem. This is not an ”either/or” – use both.
We also have a choice on where we can get a lot of our data from. Specifically, I will talk about subjective measurements, and objective measurements. Subjective measurements are things that we measure about ourselves. “I am a very handsome presenter” is a subjective statement. Objective measurements are things that other people measure about us. “This guy has no idea what he’s talking about” is an objective statement.
Subjective monitoring has the ability to give us very rich data on the internal state of a system, because we are instrumenting the system directly. These types of data are absolutely necessary for high observability. However, we need to be careful here because while there is lots of data that we can make available, not all of it is going to be useful. Do you really care about the byte size of the representation of a specific array in your code? You might, but it’s probably not going to be a very useful piece of information. Additionally, it might not be the best way to measure something. If you measure the error rate for requests at the server, it won’t include all the requests that don’t make it to the server at all.
On the other side, we have objective monitoring. This provides a view of your system that is more like what your customers see. You can think of a service like “Down for everyone or just me?” – it makes a measurement against your system from the outside, and takes into account something other than what the system thinks about itself. This type of data is critically important for monitoring our SLOs, since when it comes to an agreement with our customers, they don’t really care if the service is down because the service itself is broken or if the network getting to the service is broken. It’s definitely harder to do, but it’s also much more of an authority on whether or not your service is working.
Again, most of the time you need both of these. You need objective monitoring to know whether or not there is a problem, and you need subjective monitoring to really dig into the detail of why.
With a shared understanding of what we can do, we can now move into how we handle our services to make sure that we have enough data to understand them, but avoid killing ourselves with noisy alerts and information overload. In large part, this means we need to design our services from the ground up to be successful.
This means building them knowing that they are going to fail. Anyone who says they have a service that will never fail is a liar – everything fails, even if it means the failure is outside of your direct control. If we know we’re going to fail, what do we need to do in advance to make it as easy as possible for us to manage that?
First, the code needs to provide us with appropriate intelligence about what is going on – we need high observability. We want detailed instrumentation on the operations of our entire system, including both metrics and events, so that we can debug problems easily. We also need to make sure that any SLIs that we have identified are included in there as well.
Next, we need to build availability into the architecture. This means that we will strive to tolerate the failure of any single component without affecting the overall availability of the system. And this doesn’t mean N+1, which is what most people think of when it comes to availability. With the Kafka infrastructure at LinkedIn, we have previously used a replication factor of 2 – this means that we can lose a single broker, and the cluster will continue to operate. However, we can only lose one broker, and this means that whenever we had a hardware failure, we needed to spin up the on call engineer immediately to get that system back up and running, because a second failure would mean that we were down. This isn’t really being able to tolerate a single failure - it just meant that we had a very small window in which it wasn’t impacting customers.
The third part is that we need to manage the capacity of our services so that we don’t get surprised by a sudden surge. For storage systems like Kafka, this means limiting the creation of resources, and how much storage and processing those resources can use by default. For non-storage things, this might mean quotas on request rates from upstream callers, or dynamic allocation of servers to accommodate surges in traffic. Either way, capacity problems cannot surprise us because we don’t have the ability to magically make new resources appear. And if you do, that should be automatic.
When we’re setting up the alerting on all that rich instrumentation, the SLO is the thing to look at. We’ve hit some of these points already, but it’s worth a recap:
We always have to measure the SLIs that have been defined. If we don’t, you may as well not bother having SLOs at all, because you don’t know if you’re hitting them or not. What’s worse is if you rely on your customers to monitor the SLOs. As a service owner, I never want to hear about any problem from my customers – I should always be the first to know about an issue, and I should be informing them. Not the other way around.
When it comes to the SLIs, the best monitoring is objective monitoring. Since the SLO is an agreement with the customer, it is almost always measured from the customer’s point of view, not ours. Our monitoring should match that. We can rely on subjective monitoring to provide detailed data when we’re debugging a problem, but generally not for the SLOs themselves.
Once we’ve got our SLOs defined, we should not try to beat them. At least not by very much, since we probably always want a little bit of buffer. But if you agree in your SLOs on 90% uptime, but you’re actually delivering 99% uptime all the time, guess what? Your customers probably have become accustomed to that, and they’re going to start making noise if the uptime is 98% even though that’s well within the SLO on paper. This is a good time to have maintenance windows where the service is offline even if it doesn’t need to be. Or artificial latency that can be dynamically adjusted if you have a downstream problem. It may seem counterintuitive to not be the best we can be here, but you’re going to sleep a lot better if you deliver what you promised.
Lastly, only set up alerts on the SLO.
Wait, you say. I must have misheard you. I thought you said that I shouldn’t alert on anything except the SLOs, and not these dozens of other metrics that indicate problems!
Yes, that’s exactly what I said. And here is why. Think back to the Rumsfeld Quadrant. Alerts on the SLOs will find the unknown-unknowns – the problems that we can’t currently detect and our customers are going to tell us about.
The known-unknowns, which indicate an active incident that we’re debugging how to fix, and the unknown-knowns, where we have a monitoring gap, by definition must only exist in our systems transiently. They have to transition to known-knowns if we’re going to avoid spending our entire day in reactive work.
The known-knowns have a known detection and a known response – that should not require a human being to handle them. Those responses should be automated.
But those are the only things we have. Monitoring signals used for alerting either tell us about a problem clearly, or they don’t – there is no useful grey area. And if they don’t tell us about a problem clearly, we’re setting ourselves up for failure if we use them. So if you have another signal that is 100% clear, that can potentially be used as an alert. But if you have a problem that doesn’t impact the SLO, why does that need to wake you up? That problem is either a known-known, in which case it can be automated away, or it’s a known-unknown, which someone had better be working on turning into a known-known.
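The alerting policy described here can be sketched as a simple decision function (the signal fields and response strings are hypothetical):

```python
# Sketch: page a human only on an SLO breach; route clear known
# issues to automation; everything else waits for business hours.

def handle_signal(signal):
    if signal["breaches_slo"]:
        return "page on-call"            # possible unknown-unknown
    if signal["known_issue"]:
        return "run automated response"  # known-known: no human needed
    return "log for review"              # noise: never wakes anyone up

print(handle_signal({"breaches_slo": True, "known_issue": False}))
print(handle_signal({"breaches_slo": False, "known_issue": True}))
print(handle_signal({"breaches_slo": False, "known_issue": False}))
```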
Now, I will admit that what I have described is the ideal world. But it is of the utmost importance that we have the ideal state in mind, and make sure that everything we do moves us towards that ideal state, not away from it.
The one gotcha in all this is the capacity of our service. That’s probably not an SLO, but it does need to be monitored and managed correctly. Now, I’m specifically talking about systems that don’t have dynamic capacity changes available to them. If you’re running in a public cloud, and your service is well designed for scalability, you can probably react to capacity changes by spinning up new instances. Automate that. For the rest of us, and those of us who are running storage systems that require special handling, there are a few things we can do.
First, as I mentioned earlier, use quotas to make sure that by default, your users are not able to overrun your available capacity. In Kafka, for example, this means using message retention by bytes on disk, as well as restricting the inbound and outbound bytes per second rates. Your customers should have resource limits in place so that if they want to significantly change their call patterns, they need to interact with you somehow to make sure you know about it in advance, and can plan for it.
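As one hedged illustration of the idea (not how Kafka’s built-in quotas are implemented), a per-client request quota can be sketched as a token bucket:

```python
import time

class TokenBucket:
    """Allow short bursts up to `burst`, and a sustained rate of
    `rate_per_sec` thereafter. Single-threaded sketch."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # client has overrun its quota; throttle or reject

bucket = TokenBucket(rate_per_sec=100, burst=10)
# 50 back-to-back requests: roughly only the burst gets through at once
print(sum(bucket.allow() for _ in range(50)))
```

The point is that a misbehaving client hits its own limit instead of consuming the whole cluster’s headroom.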
Once you have limits in place, you need to report on the capacity of your system and make sure you’re reviewing those reports frequently. Maybe you can automate this entire process, and have the reporting system put in hardware orders automatically. Maybe not. But either way, this is a process that is out-of-band from your normal alerting because hardware doesn’t magically appear when you snap your fingers.
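One simple way to turn those periodic reports into a review signal is a linear projection of usage growth; the function and numbers below are illustrative only.

```python
def days_until_full(usage_by_day, capacity):
    """Fit a line through (day, usage) points with least squares and
    extrapolate to when usage reaches capacity."""
    n = len(usage_by_day)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_by_day) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, usage_by_day))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # usage flat or shrinking; no deadline
    return (capacity - usage_by_day[-1]) / slope

# 2 TB/day growth against a 100 TB cluster with 70 TB used
print(days_until_full([60, 62, 64, 66, 68, 70], 100))  # 15.0 days
```

Fifteen days is a report-and-review number, not an alert: the right response is a hardware order this week, not a page tonight.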
And don’t ever ignore those reports, or put off the expansion work that’s required. That’s really only asking to have problems that you could have taken care of proactively.
If we can manage to do these things - design our systems well, monitor for the SLOs, and manage capacity proactively – we set ourselves up for success. The SLOs will let us know about the unknown-unknowns, and the detailed metrics and events will provide what we need to fix those new problems. As someone who wants to move towards the ideal state of automating their troubles away and cleaning up the noise of bad alerts, where do you start?
The first thing you want to do is define your SLOs. Have a conversation with your customers, whether internal or external, and come to an agreement on what your service can and will provide for them. Then add some objective monitoring so that you can stick to those agreements.
Next, you need to work on cleaning up your alerts. Take a look at all of them, and eliminate anything that doesn’t have a clear signal to start with. At least put them in quarantine and see if you can stop waking yourself up with them. Add new alerts for the SLOs that you now have. And make sure you have quotas in place to avoid surprises.
Lastly, beef up the instrumentation for your services, and manage the monitoring data appropriately. If you’re not there already, for the love of all that is good please switch to structured logging. The larger your systems get, the more you realize that all of your data (like logs) should be targeted at automated processing, not humans being able to read it. Also think about adding request tracing if it seems appropriate. This is really important for distributed systems, but it’s also useful for tracing a request within a single service. It can be as simple as logging detailed request data, or more complex like emitting messages to a system like Kafka for stream processing.
Ultimately, the more instrumentation you add, the more you need to make sure you’re only holding onto what you need. Make sure that monitoring data, including tracing, is only retained for as long as it’s important for debugging. This will help keep you in the good graces of the teams that are responsible for your monitoring storage, as well.
I’ve also got a few resources for you. Due to monitoring overload and capacity problems, among other things, our Kafka team spent the first half of the year in a state we call Code Yellow. This is a way for us to put the brakes on and fix critical problems with a service or a team. If you’d like to learn more about that, I’ve written a blog post on what Code Yellow is. Michael Kehoe and I will be at LISA in a couple weeks to talk about that and other Code Yellows at LinkedIn.
If you’re looking for more information about SRE at LinkedIn, I’ve spoken on that several times, including at SREcon Asia this past year. There’s also an excellent series written by one of my colleagues, Ben Purgason, along with our former head of engineering David Henke, called “Every Day is Monday in Operations”.
And if you’re interested in learning more about Apache Kafka, there are many resources available and previous talks I have done. When it comes to monitoring, there’s a talk that I’ve given that echoes a lot of what you’ve heard here, applied to Kafka. You can view the talk from Kafka Summit London earlier this year. I’ll also be giving this talk next week at Kafka Summit SF, if you happen to be going.
And as always, if you have any questions you can feel free to connect with me and ask. I’m on LinkedIn, of course, and you can find me on Twitter as well.