From Operations to Site Reliability in Five Easy Steps

Editor's notes

Welcome. I’m here today to talk about how teams move from a traditional operations mindset to site reliability engineering, or SRE. At least, how it ended up happening at LinkedIn. Before we get into that, however, let me introduce myself.
So who am I? I’m a member of LinkedIn’s Site Reliability team, which right now is somewhere in excess of 400 people. I’ve been with the team for five and a half years now, and I currently hold the title of “senior staff”. Every company has its own title progression, so it’s hard to say what that means from the outside. What I can say is that LinkedIn SRE currently has five engineers who are senior staff or higher. We tend to focus on high-level processes that cut a swath across all of SRE with high impact on all of LinkedIn. My team is Capacity Engineering, which is tasked with developing tools and processes for evaluating and managing capacity for all of the hundreds of applications that run LinkedIn. Up until last year, though, I was on the team that was responsible for Apache Kafka. Many of you may know that Kafka was developed at LinkedIn, so I worked with both the original team and our ever-expanding Kafka development and SRE team. I also co-authored Kafka: The Definitive Guide for O’Reilly with Gwen Shapira at Confluent. I’m also a registered locksmith.
Now that you know who I am, what is this “SRE” thing that we’re talking about? Probably most of you have heard of it, and many of you may even have SREs at your company. But part of the problem with SRE is that outside of California, even within the United States, it’s not yet common. And if you ask five people what SRE is, you’ll probably get at least six different answers. The reason for this is that everyone does SRE a little bit differently, depending on where they learned it and what they liked about it. The term originated with Google, and has been explained by Ben Treynor there as what you get when you apply software engineering principles to the problem of operations. What does that mean? Well, historically, traditional operations has systems administrators doing a lot of work, often herculean efforts, to keep a site up and running properly. SREs strive to be lazy, whether they admit it or not. We’d much rather have the software do the toil, the repetitive work, and apply ourselves to design and automation.
While every company implements SRE a little differently, we all share a common language. At the end of the day, our biggest concern is “Site Up”, as we call it at LinkedIn. Is everything running correctly and giving our members the experience that they expect? There’s a lot that goes into that, and in the next, oh, 25 minutes or so, I want to tell you how you can speak the same language. How you can build an SRE team that not only improves the experience of your customers, whether they are internal or external, but also scales sub-linearly. That is, if you double your business, you shouldn’t have to double your operations team.
I promised you five steps, and I will do my best to describe the work in five steps. Even though we all know that was a click-bait talk title. What I can’t promise is that they will actually be “easy”. Creating an SRE team will often require significant shifts in culture within your company. These are almost never easy or quick, and they require support from the highest levels of management. But they always start in the same place…
The first thing you have to do is admit that you have a problem. Many of you will recognize the Costa Concordia here, and if you didn’t, you probably immediately knew once I said the ship’s name. For the remainder, the Concordia suffered a catastrophic failure in 2012 off the western coast of Italy when it struck a rock that tore a gash in the side of the ship. What is interesting to us about this is that it took over an hour for the captain to admit the problem, issue a mayday, and abandon ship. By the time it happened, the ship was listing so badly that they could not effectively launch the lifeboats. Thirty-two people died. If you’re going to make a big change in your organization, it’s going to be because either there’s a problem right now, or you see one on the horizon. The sooner you admit this, the easier it will be to make the change. But it’s never too late – even if it had taken another hour for the Concordia’s captain to admit the problem, most of the passengers would still have been fine. So what kind of problems are we talking about that make us want to move to SRE? Maybe your development and operations teams are not working together effectively – ops has become “binary management”, and dev is just tossing the applications over the wall to them and moving on to the next thing. Maybe it’s that you have really bad monitoring and you can’t quantify what the problem is or how bad it is. Or maybe you can see that with the growth of your company, the Operations team will not scale to support it. Regardless, the first step is to define what the problem is that you’re trying to solve.
Once you’ve defined the problem, you can then start to attack it. But here comes one of the key principles of SRE – blameless culture. You want to attack the problem, not the person. This is something that our former VP who was responsible for bootstrapping SRE at LinkedIn, David Henke, would say a lot. There is nothing constructive about blaming people for the problems that you have. All it does is alienate people, and make them less likely to take intelligent risks and surface problems when they find them. What’s more, your customers don’t care who is to blame. Have you ever been to a restaurant where the service was really bad, and the wait staff started telling you about how the kitchen is short-handed or they didn’t get a delivery they expected? Sure, you might be sympathetic once or twice, but if they’re telling you that every time, you’re not going to be as charitable. All you know is that you’re not getting food, and maybe you should have gone elsewhere. The last thing to remember here is that what gets measured gets fixed. This is another favorite phrase of David’s. What it means is that the only way to fix a problem, whether it is technical or cultural, is to quantify it: how bad it is, and how it improves. So whatever your problem is, you need to find a way to measure it. This might be availability for an application. Or it may be something like a net promoter score for cultural issues.
I said that David Henke was responsible for our transition to SRE. How did it happen? David was hired back in late 2009, and this is what he was faced with. We had a monolithic codebase – one big piece of code that ran all of LinkedIn and needed to be deployed all at the same time. Everyone was doing branch development, which meant that merging was a huge effort. Monitoring was pretty poor. Of course there was some monitoring, but the coverage was bad and there were a lot of different systems and levels of quality in place. This was less than helpful during… The marathon deployment sessions that were necessary to get changes out. We’re talking everyone getting into a room at five in the evening, stocking up on pizza and caffeine, and not finishing until the next morning. And it was frequently unsuccessful because it was impossible to know in advance if everything was going to work together correctly. The worst part is that the entire team considered this to be the normal state of affairs. At the same time, everyone agreed that it was unacceptable. But still they were doing the same thing every time. OK, we know there’s a problem. We can quantify it in terms of site availability, number of successful deployments, and how long those deployments take. Among other metrics. What are we going to do about it?
Our next step is to take control of the situation. We have to stop the bleeding and fix the problem that we have identified, but this is not nearly as easy as we would like it to be. Our company has momentum. Product teams want new features. Sales is selling our services to everyone they can find. Operations is swamped, and we can’t snap our fingers and double the number of sysadmins we have. It doesn’t matter. You have to stop and fix the problem if things are going to get better. And this is the first major cultural shift that is necessary within the company. The entire company needs to give the SRE team enough authority to stop everything.
Within LinkedIn, we have a framework that we’ve developed called “Code Yellow”. I’ve written a blog post on this for DevOps.com that’s linked on the resource slide, and I’ve given a talk with my colleague Michael Kehoe on how it works. When a team, any team, not just SRE, is suffering, often due to excessive toil or repeated failures, they can declare a Code Yellow to put the brakes on and fix the problem they have identified. I won’t get into the details, as that’s a talk all to itself, but it allows the rest of the organization to understand what is going on and support them. When you stop to fix the problem that is leading you to implement SRE, you’re going to do the same thing. You’re going to give SRE enough authority to state what the problems are and stop any work that is outside the scope of fixing those problems. This is not easy, and as a result you need support from the highest management down. In this phase, SRE is going to act as gatekeepers. They’re going to say no. This is definitely an anti-pattern in SRE, but this is the backswing on the pendulum – you’ve moved too far in the direction of your problems, and reaching equilibrium will require an over-compensation first. Your goal is to reduce the heroics that are necessary from your operations teams. You don’t want marathon deployments. You don’t want on-call engineers being up all night. In order to do this, you’re going to reduce the scope of their work and allow them to put systems in place that allow the software to solve the problems for them. Through all of this, you have to stay in communication with the rest of the company. Product and Sales are only going to support you if they can see that progress is being made. Remember, you’re making operations better for your engineers. At the same time, you have to impress on everyone else why that matters to them.
When David Henke started at LinkedIn, the apocryphal story (as it was before my time) is that he went to Jeff Weiner and declared that operations was a complete disaster. He didn’t use the word “disaster”, and I shouldn’t say exactly what he said on a stage in front of this many people. But he impressed on the CEO the severity of the problem. This started the work of building the SRE team and implementing a monitoring stack that could measure the problems better. But the bigger effort was called Project InVersion, and it involved both SRE and Engineering. We love any name for a project that has “in” in it somewhere. The goal of InVersion was to solve the problems we had around deploying LinkedIn, and it was championed in large part by Kevin Scott, our then VP of Engineering. The first thing we did was stop everything. For two months, there were no new features allowed – everything was focused on InVersion. As you can guess, it was not popular with product teams. But it was supported by Engineering and Jeff. The next thing was that the monolithic code base was broken up. We moved to a trunk development model, to avoid the problems of branching. And each application was built and deployed separately. We built a CI/CD pipeline to support this. Most importantly, halfway through the project we burned all the bridges. The old system was completely dismantled, so that the only way to move forwards was to complete InVersion. What happened is that the project completed, and the ability of our teams to develop and roll out features increased exponentially. There were no more marathon deployment nights. LinkedIn was able to support a hyper-growth phase that added hundreds of millions of members on that platform in the coming years. Alright, so we solved our immediate problem. What comes after that for SRE?
We need to continue to automate everything. I said at the start that SREs are inherently lazy - we want to automate everything about the work we need to do. And we recognize that this won’t put us out of work – there are always new problems, and new applications. SRE’s role is to scale what they can handle, and not by hiring people alone. That means we have to take the knowledge that we have about operations and how our applications work and lay it down in software.
But SRE can make a lot of mistakes in doing this. None of us are perfect, and especially because SRE is an evolution that starts with excessive toil, we have to be careful to spend our time effectively. One of the biggest problems I have seen is that engineers will tend to build new solutions from scratch. This isn’t always a bad thing – I don’t think anyone here would argue that LinkedIn building Apache Kafka was misspent time. It’s one of the largest data platforms out there, and continues to grow. Does that mean that you should spend your engineering time building a publish-subscribe messaging system? Of course not – you can just adopt Kafka. The same applies for SRE tools, such as monitoring, alerting, incident management, and CI/CD. SRE should build tools that don’t exist, or are a core differentiator for the business, and adopt or buy everything else. This is difficult – we’re engineers at heart, and we love to build cool things. But contributing to an existing open source project is a noble effort. And if you buy a monitoring platform, you can spend that time developing much cooler stuff that nobody else has thought of. This leads into the next mistake that I see, which is rewarding heroic efforts. Traditional operations is frequently about “Alice is an amazing engineer. I don’t know how we would run the site without her.” SRE should be more about “The site is running fine. Why do we have this SRE team anyways?” Again, this is not easy. We like “rock stars” and “gurus”, and it makes it simple to recognize those efforts and promote based on it. But those types of engineers don’t scale – you can’t hire another Alice. She only scales by putting her expertise in a durable form – software and documentation. Building instead of buying is a form of heroics too. The third mistake is overloading ourselves with alerts and monitoring. Once you start to love data, you can move too far along to worshipping it. You discover a file descriptor leak in your application? 
Of course you need to add an alert for the number of file descriptors in use. And that alert persists even once the problem is solved, until eventually it goes off and nobody understands why the alert threshold is what it is. Monitoring and alerting is a hot topic lately because so many of us have built systems where there are far too many alerts. In fact, I have a whole talk around this, but I’m going to pull out just a small piece of it here.
In my talk, I have something I call the Rumsfeld Quadrant. Donald Rumsfeld was the US secretary of defense in 2002 when, asked in a news briefing about evidence of weapons of mass destruction, he talked about known knowns and unknown unknowns. Of course, it sounds odd and it was ridiculed in some circles at the time. But it’s quite respectable – in everything, there are things we know and things we don’t know. My quadrant has detection of problems along one side, and the response to those problems along the other. I use it to describe the types of alerts you should have. If you know how to detect a problem, and you know what the response to that problem is, you have something that you should automate away. If you have a problem that you can detect, but you don’t know how to fix, that’s an active incident, so you’re working on it right now. If, on the other hand, you know how to resolve the problem but you can’t currently detect it, then you have a monitoring gap. Again, you should be actively working to resolve this. If you can’t detect the problem and you don’t know how to fix it, well, this is when your customers are tweeting about how your application is broken. Alerting should only ever tell you what you don’t know about. If you know about it, it’s either automated away or you’re actively working on the problem. For automation, you don’t need to know about it – it’s automatic. For active problems, you already know about it – you don’t need to be told again. So the only thing you really need to alert on is the unknown unknowns. How do we do that? You focus on service level objectives and your customer experience. Does your customer care that a Kafka partition is not fully replicated? Nope. Do they care if the site is returning a 500 error, or it’s really slow? Yes. That’s what you should focus alerting on.
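The quadrant can be written down as a small decision function. The following is only a sketch in Python: the names `Action`, `classify`, and `slo_breached` are made up for illustration and are not LinkedIn tooling, and the success-ratio check is one simple way to express a service level objective, assumed here rather than taken from the talk.

```python
from enum import Enum


class Action(Enum):
    """What to do with a problem, by quadrant (illustrative labels)."""
    AUTOMATE = "automate the response; no alert needed"
    ACTIVE_INCIDENT = "already being worked on; no repeat alert needed"
    MONITORING_GAP = "close the detection gap"
    ALERT_ON_SLO = "alert on customer-facing SLO symptoms"


def classify(can_detect: bool, know_response: bool) -> Action:
    """Map the two axes of the quadrant to an action."""
    if can_detect and know_response:
        return Action.AUTOMATE
    if can_detect:
        return Action.ACTIVE_INCIDENT
    if know_response:
        return Action.MONITORING_GAP
    return Action.ALERT_ON_SLO


def slo_breached(total_requests: int, failed_requests: int,
                 slo_target: float) -> bool:
    """True when the observed success ratio falls below the SLO target.

    Counts only what the customer sees (for example HTTP 500s), not
    internal states such as an under-replicated partition.
    """
    if total_requests == 0:
        return False  # no traffic, nothing to judge
    success_ratio = (total_requests - failed_requests) / total_requests
    return success_ratio < slo_target
```

With these assumptions, a known leak with a known fix lands in `Action.AUTOMATE`, and the only page anyone receives comes from `slo_breached` firing on customer-visible failures.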
What’s more, you should be really careful now what you beat up your engineers about. In fact, you should now be learning how to love them. The ultimate goal of SRE is to enable the company to move forward as quickly as possible while keeping the site up. We can’t accomplish this if we’re acting as gatekeepers, so the automation we build should enable engineers to move quickly without requiring SRE to approve everything.
We’ve now reached an advanced point in the evolution of site reliability, because we understand that reliability and availability are not the same thing. There are so many more things that go into it. Availability merely asks whether or not the service is up and running. But we can have an available service that does nothing other than return a 200 HTTP response code and a blank page. We also need to ask if the service is correct. Does it return an accurate response for whatever the application is? For the sanity of our SREs, is it maintainable? Can we easily deploy, debug, and scale the application? Our friends on the security team will of course want to ensure that only the proper people can use this application, and only the right engineers can run it. Now more than ever our customers, and our governments, are asking if we have addressed privacy. Are we handling the customer’s information properly? And let’s not leave out performance. Does it respond quickly? These are just six areas to look at – as SREs we need to identify what aspects of reliability are important for any application. And we have to be careful not to over-index on them. Does your webcomic really need five nines of availability? Do your cat pictures need to return in ten milliseconds? Don’t drive yourself crazy with unrealistic expectations.
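To make "five nines" concrete, it helps to turn an availability target into the downtime it permits. This is a small Python sketch of that arithmetic; the function name and the 365-day default are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability: float,
                             period_days: float = 365.0) -> float:
    """Minutes of downtime permitted by an availability target.

    availability is a fraction, e.g. 0.99999 for "five nines".
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


if __name__ == "__main__":
    for label, target in [("three nines", 0.999), ("five nines", 0.99999)]:
        minutes = allowed_downtime_minutes(target)
        print(f"{label}: about {minutes:.2f} minutes per year")
```

Five nines leaves a little over five minutes of downtime in a year, while three nines leaves almost nine hours, which is why the target should match what the application actually needs.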
What are the things we can do to help our engineers do the right thing, own their applications, and move quickly? A great way to get started is to provide self-service monitoring. This was one of the first innovations at LinkedIn that really changed the game. So many monitoring systems require metrics to be onboarded and vetted before they can be gathered. We created a system that allowed any engineer to just annotate a metric in their application and have it collected automatically. They didn’t have to justify its existence (though of course excessive use will be addressed). Additionally, because the metrics are annotated in code, it’s the responsibility of the developer to do that, not the SRE to collect them. This moves ownership towards the developer. Another important step is the CI/CD pipeline. Developers should be able to create new applications and get them running without friction, but how can SRE help to ensure that they do so safely? LinkedIn does this with canaries – an application is deployed on a single node in production and monitored automatically for a period of time to check that it performs in a similar manner to the previous version. If there are significant changes, the canary is rejected. If not, deployment proceeds automatically. Once you have good CI/CD and canaries, then you can implement error budgets. Just like a financial budget, an error budget is an agreed-on amount of errors that you’re willing to tolerate in production for an application. The “agreed-on” part is important – this is not SRE enforcing a limit on development. As long as the application is within its budget, developers can deploy a new version. If the budget is exceeded, deployments are locked out until the number of errors is reduced. This allows the developers to look at the budget and decide whether they should prioritize a new feature, or reducing the number of errors.
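The canary check and the error budget gate can both be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Metrics` fields, the thresholds, and the function names are hypothetical, and this is not how LinkedIn's pipeline is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float      # fraction of requests that failed, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def canary_passes(baseline: Metrics, canary: Metrics,
                  max_error_increase: float = 0.001,
                  max_latency_ratio: float = 1.2) -> bool:
    """Accept the canary only if it behaves like the previous version.

    The two thresholds are placeholder values, not recommendations.
    """
    error_ok = canary.error_rate - baseline.error_rate <= max_error_increase
    latency_ok = (canary.p99_latency_ms
                  <= baseline.p99_latency_ms * max_latency_ratio)
    return error_ok and latency_ok


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Errors still allowed in this window; negative means overspent."""
    allowed_errors = (1.0 - slo_target) * total_requests
    return allowed_errors - failed_requests


def deploy_allowed(slo_target: float, total_requests: int,
                   failed_requests: int) -> bool:
    """Deployments stay open while the agreed error budget is not exceeded."""
    return budget_remaining(slo_target, total_requests, failed_requests) >= 0
```

For example, with an agreed target of 0.999 over one million requests, the budget is about 1,000 errors: 500 failures leaves deployments open, while 1,500 locks them until the error count comes back down.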
Throughout all of this work, moving from operations to SRE, we have been building a new culture. We’re empowering the people who are responsible for operations – giving them footing with product and engineering to say what is important and taking the time to fix the problems that we have identified. This is not something that will be finished, where you can go back to the way things were. The only way SRE thrives is if the company continues to support it, so that’s how culture becomes my final step in growing SRE.
This is my team – these are the people who make SRE at LinkedIn what it is. The culture we built was not intentional – at the time we were solving the problems that existed, we weren’t thinking about how to make a healthy culture. What we had was the mission and vision of LinkedIn, statements like “Members First”, “Relationships Matter”, and “Be open, honest, and constructive.” But these weren’t just words – everyone, from Jeff Weiner on down, lived them. So the culture that grew reflected this.
Now that we’re more mature, we think about the culture directly and how to make sure that we continue to have a healthy environment. How do we do that? We stress blameless culture in everything we do. We have an internal SRE conference every year called SREinCon. Everyone gets together offsite for two days, talks are proposed, reviewed, and delivered just like any external conference. One of my coworkers, Katie, gave a talk two years ago that exemplifies blameless culture. She stood up on stage in front of her peers and described how during an incident, after being at LinkedIn for a year, she accidentally made it worse and took down the site. There was no shame – this was a learning experience for everyone. She was not called out by name in either the site status meeting or the post-mortem. And she’s since been promoted because she’s an amazing engineer. If someone had blamed her for the outage, I have no doubt she would have left and we’d all be worse off for it. In 2013, we turned our love of data and process towards hiring. We developed an SRE hiring process that was consistent, documented, and repeatable. In addition, we don’t hire SREs for a specific team. Instead, we hire people who will be good SREs. Then we find the right fit for them, between the teams that have the highest need for another person and the skills and wants of the engineer. We also enable free movement within the company – whether someone wants a new challenge in SRE or elsewhere in the company. Or even outside the company – we endeavor to always support our engineers, even if that means helping them move on to a new challenge elsewhere. Another thing we have going for us is very strong technical leadership. This means that we enable individual contributors to progress and grow without requiring them to move into people management. I used to manage a team at my previous job for several years. 
It turned out that while I appeared to be pretty good at it, I was happier when I was working on code and operations. At LinkedIn, I participate in our site reliability technology leadership group that reviews large projects and offers feedback. I co-chair a promotion committee, which is made up of non-managers, as well as our internal SRE conference. I lead a cross-organizational team to formalize our incident response processes. And I do all this without people reporting to me. We also work very hard at supporting diversity and inclusion in the company. Erica is one of the people who is responsible for our efforts around supporting women in tech, both internally and externally. But LinkedIn has tens of internal inclusion groups where our colleagues get together to support each other. This comes back to LinkedIn’s values, and that they’re not just words plastered on a wall somewhere.
Coming back around to the beginning of this talk, now that we've talked about what the progression looks like in moving into the SRE world, how can you walk away and get started?
The very first step is for you to clearly state what the problem is. This means measuring it, which might take a little work to get whatever you need in place to do that. But remember, what gets measured gets fixed. If you can't clearly measure the problem and whether it is getting better or worse, you're going to have a hard time with the next step... Which is putting together your proposal for a solution and talking about it. It's difficult to have a conversation like this, especially when you're probably going to be telling some senior people things they don't want to hear. Like that they're going to need to hold off releasing new features for a while in order to fix the underlying problems. But you need buy-in from everyone for SRE to be a success. You're going to get support. You're going to get negative feedback, which you have to address and incorporate. Remember that the only feedback that is bad is none at all - that indicates that people aren't engaged. Once you work on these two things, you can start yourself down the path to SRE.
I’ve gathered some resources that I’ve either mentioned in my talk or found useful. In particular, the blog posts and talks detail how SRE at LinkedIn has grown up. I’ve also included Jamie Wilkinson’s talk on the Theory and Practice of SLOs, which I have found to be one of the best talks on how to utilize SLOs in monitoring and alerting. And of course, the SRE books that are currently available from O’Reilly, though you should be cautious because they do skew very heavily towards Google’s view of SRE. Just like my talk skews towards LinkedIn’s view.
I would like to thank you very much for your time today as I’ve talked about how an SRE team gets built. If you have questions, whether about SRE, LinkedIn, or really any other topic, I will be over on the Arena Stage at 11:30.