This document summarizes key lessons from a presentation by Gene Kim on building a world-class engineering culture. Some of the main surprises discussed include: (1) the business value of DevOps is even higher than previously thought, (2) DevOps benefits operations and security as much as development, (3) measuring code deployment lead time is more important than deployments per day, and (4) Conway's Law has implications for organizational structure and architecture. The presentation also discusses how DevOps enables organizations to become dynamic learning organizations.
2. @RealGeneKim
My Definition of DevOps
The architecture, technical practices, and cultural norms
that enable us to…
increase our ability to deliver applications and services...
quickly and safely, which enables rapid experimentation
and innovation, and the fastest delivery of value to our
customers…
while ensuring world-class security, reliability, and stability...
…so that we can win in the marketplace.
(vs. traditional software development and infrastructure practices)
16. @RealGeneKim
Elite vs. Low Performers
Metric                   Elite                                Low                  Difference
Deployment Frequency     On-demand (multiple times per day)   Weekly to monthly    46x
Deployment Lead Time     < 1 hour                             1 day to 1 week      2,555x
Change Failure Rate      0-15%                                46-60%               7x
Mean Time to Restore     < 1 hour                             1 week to 1 month    2,604x
Source: Google/DORA: 2018 State Of DevOps Report: https://cloudplatformonline.com/2018-state-of-devops.html
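As a rough illustration of how three of these measures could be derived from a team's own deployment history, here is a minimal sketch. The `Deployment` record, its field names, and the sample data are hypothetical; this is not how the DORA report computes its figures, which come from survey responses. The share of deployments that fail is what the report calls change failure rate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # only set when the deployment failed


def deployments_per_day(deploys: List[Deployment], period_days: int) -> float:
    """Deployment frequency over an observation window."""
    return len(deploys) / period_days


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_restore(deploys: List[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to restored service."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deploys
        if d.failed and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Made-up sample: four deployments in one day, two of which failed.
day = datetime(2018, 9, 3)
sample = [
    Deployment(day.replace(hour=9)),
    Deployment(day.replace(hour=11), failed=True,
               restored_at=day.replace(hour=11, minute=30)),
    Deployment(day.replace(hour=14)),
    Deployment(day.replace(hour=16), failed=True,
               restored_at=day.replace(hour=16, minute=10)),
]

print(deployments_per_day(sample, period_days=1))  # 4.0
print(change_failure_rate(sample))                 # 0.5
print(mean_time_to_restore(sample))                # 0:20:00
```

Keeping the three calculations as separate functions over the same records makes it easy to track throughput and stability side by side.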
20. @RealGeneKim
High Performers Are More Secure And
Controlled
2x less time spent remediating security issues
29% more time spent on new work
Source: Google/DORA: 2018 State Of DevOps Report: https://cloudplatformonline.com/2018-state-of-devops.html
21. @RealGeneKim
High Performers Win In The Marketplace
2x more likely to exceed profitability, market share & productivity goals
2x more likely to achieve organizational and mission goals, customer satisfaction, quantity & quality goals
Source: Google/DORA: 2018 State Of DevOps Report: https://cloudplatformonline.com/2018-state-of-devops.html
22. @RealGeneKim
High Performers Win In The Marketplace
2.2x higher employee Net Promoter Score
50% higher market capitalization growth over 3 years*
Source: Google/DORA: 2018 State Of DevOps Report: https://cloudplatformonline.com/2018-state-of-devops.html
28. @RealGeneKim
“As a lifelong Ops practitioner, I know we
need DevOps to make our work humane.
In the past, I’ve worked every holiday, on
my birthday, my spouse’s birthday, and
even on the day my son was born.”
Nathan Shimek
Engineering Manager, New Context
@nathan_shimek
29. @RealGeneKim
CSG: COBOL App + 20 tech stacks
Source: Scott Prugh, Chief Architect, CSG, Inc.
And the customer got
the feature in half the
time!
Apps supporting bill printing and customer care for 50MM customers, 6B transactions per month
20 technology platforms, including mainframe VSAM and DB2, Java, desktop client
Moved from 2 to 4 releases per year
Shared Operations Team performed daily deployments to UAT
30. @RealGeneKim
Developers Carry Pagers
“We found that when we woke up developers at
2am, defects got fixed faster than ever”
– Patrick Lightbody
“You build it, you run it.”
– Werner Vogels
31. @RealGeneKim
“As a developer, the most satisfying
points in my career?
“It’s when I wrote the code, pushed the
button to deploy it, watched the metrics
to see if it actually worked in production,
and fixed it if it broke.”
Tim Tischler
Director of Operations Engineering
Nike, Inc.
36. @RealGeneKim
“What is your lead time
for changes?”
“How long does it take to go from
code committed to code successfully
running in production?”
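One way to answer this question from data is to pair each change's commit time with the time it was confirmed running in production. The sketch below assumes such timestamp pairs are available; the function names and sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Tuple


def lead_time(committed_at: datetime, running_in_prod_at: datetime) -> timedelta:
    """Time from code committed to code successfully running in production."""
    if running_in_prod_at < committed_at:
        raise ValueError("a change cannot reach production before it is committed")
    return running_in_prod_at - committed_at


def median_lead_time(changes: List[Tuple[datetime, datetime]]) -> timedelta:
    """Median lead time over (committed_at, running_in_prod_at) pairs."""
    return median([lead_time(c, p) for c, p in changes])


# Made-up sample: three changes shipped on the same day.
changes = [
    (datetime(2019, 1, 7, 9, 0), datetime(2019, 1, 7, 9, 40)),   # 40 minutes
    (datetime(2019, 1, 7, 10, 0), datetime(2019, 1, 7, 10, 20)), # 20 minutes
    (datetime(2019, 1, 7, 13, 0), datetime(2019, 1, 7, 15, 0)),  # 2 hours
]

print(median_lead_time(changes))  # 0:40:00
```

The median is used here so that a single slow change does not dominate the figure; a mean or a percentile would be an equally reasonable choice.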
37. @RealGeneKim
Product Design and Development
Create new products and services that solve customer problems using hypothesis-driven delivery, modern UX, design thinking
Feature design and implementation may require work that has never been done before
Estimates are highly uncertain
Outcomes are highly variable
Change Committed Into Version Control
Product Delivery (Build, Test, Deploy)
Enable fast flow from development to production and reliable releases by standardizing work, reducing variability and batch sizes
Integration, test and deployment must be performed continuously, as quickly as possible
Cycle times should be well-known and predictable
Outcomes should have low variability
Source: The DevOps Handbook
43. @RealGeneKim
Conway’s Law
Eric S. Raymond: “If you have four groups
working on a compiler, you’ll get a four pass
compiler”
(summarizing results of Dr. Melvin Conway’s
experiment in 1968)
44. @RealGeneKim
The Birth And Death Of Etsy Sprouter
A story about teams of engineers implementing
changes
2008: Devs and DBAs
2009: Devs and DBAs and Sprouter team
2010: Devs
47. @RealGeneKim
Architecture Enables Teams To…
…make large scale changes to the design of its system without the
permission of someone outside the team, or depending on other
teams
...complete its work without fine-grained communication and
coordination with people outside the team
...deploy and release its product or service on demand, independently
of other services the product or service depends upon
...do most of its testing on demand, without requiring an integrated
test environment
...perform deployments during normal business hours with negligible
downtime
Source: Puppet/DORA: 2017 State Of DevOps Report: https://puppet.com/resources/whitepaper/state-of-devops-report
48. @RealGeneKim
The Value Of Platforms
Enable developer productivity
Self-service
On-demand
Immediacy and fast feedback
Focus and flow
Joy
Monitoring, deployment, environment creation,
security scans, orchestration…
53. @RealGeneKim
Dr. Steven Spear
“While designing
perfectly safe systems is
likely beyond our
abilities, safe systems
are close to achievable
when the four following
conditions are met…”
Source: Dr. Steven Spear
54. @RealGeneKim
Dr. Steven Spear’s Four Capabilities
1. See problems as they occur
2. Swarm and solve problems to create new
knowledge
3. Spread new knowledge throughout the
organization
4. Leaders create new leaders
Source: Dr. Steven Spear
55. @RealGeneKim
Capability 1
See problems as they occur:
Complex work is managed so that problems in
design are revealed
They see problems as they occur, through
relentless testing of assumptions
Automated testing in the deployment pipeline,
proactive monitoring of the production environment, …
Source: Dr. Steven Spear
56. @RealGeneKim
Pervasive Production Telemetry
Etsy engineering culture: anything in production
requires telemetry:
Ian Malpass: “If it moves, we graph it. Even if it
doesn’t move, we graph it, just in case it makes a
run for it.”
2011: 200,000 production metrics
2015: 800,000 production metrics
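To give a feel for how cheap it can be to emit a metric from application code, here is a minimal sketch that formats metrics in the plain-text line format used by StatsD, the metrics daemon Etsy open-sourced. The metric names are hypothetical, and the host and port in `send` are assumptions (8125 is the conventional StatsD UDP port).

```python
import socket


def counter(name: str, value: int = 1) -> str:
    """Format a counter increment, e.g. 'checkout.completed:1|c'."""
    return f"{name}:{value}|c"


def timer(name: str, millis: int) -> str:
    """Format a timing measurement in milliseconds."""
    return f"{name}:{millis}|ms"


def gauge(name: str, value: int) -> str:
    """Format a gauge, a point-in-time value such as a queue depth."""
    return f"{name}:{value}|g"


def send(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one metric line over UDP (assumed local daemon)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


print(counter("checkout.completed"))  # checkout.completed:1|c
print(timer("search.latency", 320))   # search.latency:320|ms
print(gauge("queue.depth", 42))       # queue.depth:42|g
```

Because UDP is fire-and-forget, instrumented code does not block or fail when the metrics collector is down, which is part of what makes graphing everything practical.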
57. @RealGeneKim
Capability 2
Swarming and solving problems as they are seen
to build new knowledge
Problems that are seen are solved so that new
knowledge is built quickly
Improvement of daily work is prioritized above
daily work
Source: Dr. Steven Spear
58. @RealGeneKim
Absence Of Capability 2
“In manufacturing, the absence of effective feedback often
contributes to major quality and safety problems. In one well-documented
case at the General Motors Fremont manufacturing
plant, there were no effective procedures in place to detect
problems during the assembly process, nor were there explicit
procedures on what to do when problems were found.
“As a result, there were instances of engines being put in
backward, cars missing steering wheels or tires, and cars even
having to be towed off the assembly line because they wouldn’t
start.”
Source: DevOps Handbook
59. @RealGeneKim
Create as much feedback in our system, from as
many areas in our system, sooner, faster, and
cheaper, with as much clarity between cause and
effect.
Why? Because the more assumptions we can
invalidate, the more we learn, improving our ability
to fix problems and innovate.
Source: DevOps Handbook
61. @RealGeneKim
How many times is the andon cord pulled in a
typical day at a Toyota manufacturing plant?
3,500 times per day
Source: http://www.gembapantarei.com/2008/04/how_many_times_do_you_pull_the_andon_cord_each_day.html
63. @RealGeneKim
"Automated tests transform fear into boredom."
-- Eran Messeri, Google
Google Dev And Ops (2013)
15,000 engineers, working on 4,000+ projects
All code is checked into one source tree
(billions of files!)
5,500 code commits/day
75 million test cases are run daily
65. @RealGeneKim
Capability 3
Spreading new knowledge throughout the
organization
The new discovery of local knowledge and
improvements are turned into global
improvements, shared throughout the
organization
Learning is fed back into the system to prevent
future failures
Source: Dr. Steven Spear
70. @RealGeneKim
The 2014 AWS Reboot
“When we got the news about the emergency EC2
reboots, our jaws dropped. When we got the list of
how many Cassandra nodes would be affected, I
felt ill.
“Then I remembered all the Chaos Monkey
exercises we’ve gone through. My reaction
was, ‘Bring it on!’”
– Christos Kalantzis
Netflix Cloud DB Engineering
Source: http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
71. @RealGeneKim
The 2014 AWS Reboot
“Out of our 2700+ production Cassandra nodes,
218 were rebooted. 22 Cassandra nodes did not
reboot successfully.
“Netflix customers experienced no downtime that
weekend.”
– Bruce Wong
Netflix Chaos Engineering
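To put the reboot figures quoted above in proportion, the short calculation below works out the shares involved. The fleet size is given only as "2700+", so 2700 is treated as a lower bound and the fleet-wide percentages are upper bounds.

```python
TOTAL_NODES = 2700      # quoted as "2700+", so a lower bound on fleet size
REBOOTED = 218
FAILED_TO_REBOOT = 22


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_rebooted = percent(REBOOTED, TOTAL_NODES)                 # 8.1
share_failed_of_rebooted = percent(FAILED_TO_REBOOT, REBOOTED)  # 10.1
share_failed_of_fleet = percent(FAILED_TO_REBOOT, TOTAL_NODES)  # 0.8

print(f"At most {share_rebooted}% of the fleet was rebooted")
print(f"{share_failed_of_rebooted}% of rebooted nodes did not come back cleanly")
print(f"At most {share_failed_of_fleet}% of the fleet failed to reboot")
```

Roughly one in ten rebooted nodes failed to come back, yet customers saw no downtime.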
72. @RealGeneKim
DevOps Practices In Capability 3
Learning days and internal technology
conferences
DevOps Dojos and other training
Embracing open source
Internal architecture to propagate best known
patterns (“buoys, not boundaries”)
Functional organizations
75. @RealGeneKim
DevOps Enterprise: Lessons Learned
In 2018, we’ll hold the fifth year of the DevOps Enterprise Summit, a
conference for horses, by horses
Over the years, we’ve had over 200 leaders from:
Capital One, KeyBank, Barclays, GE Capital, ING Bank, Fidelity, PNC, ADP, BofA,
Western Union, BBVA
Nationwide Insurance, Zurich Insurance, Hiscox, Aviva, LV=
Walmart, Nordstrom, Target, Macy’s, Marks and Spencer
Nike, Adidas, Sherwin Williams
Verizon, Telstra, T-Mobile, Orange, CSG
Raytheon, Lockheed Martin, Northrop Grumman, CSRA, Jaguar Land Rover
Disney, Ticketmaster, NBC/Universal
Kaiser Permanente
US Citizenship & Immigration Services, UK HM Revenue Collection, DISA Forge.mil, NZ
Ministry of Social Development, UK Welfare and Pensions, US Joint Warfare Analysis
Center
Amazon PrimeNow, CA, Compuware, Google Search, IBM, MicroFocus, Microsoft, SAP
76. @RealGeneKim
Observations
They were using the same technical practices and getting the same sort of
metrics as the unicorns
Target: 100+ deploys per week, < 10 incidents per month, enabled 53 business
initiatives
Capital One: 100s of deploys per day, lead time of minutes
Macy’s: 1,500 manual tests every 10 days, now 100Ks automated tests run daily
Disney: Has embedded nearly 100 Ops engineers into LOB teams across the
enterprise
Nationwide Insurance: Retirement Plans app (COBOL on mainframe)
Raytheon: testing and certification from months to a day
KeyBank: rebuilt consumer online banking in containers and Kubernetes in 1 year
Nordstrom: 20% lead time reduction into executive bonuses
80. @RealGeneKim
Leadership Matters
Teams with the least reported transformational
leadership behaviors (the bottom-third) were one-
half as likely to be high IT performers
Leaders cannot do it alone! Teams with the top
10% of reported transformational leadership
behaviors performed no better than the median
Source: Puppet/DORA: 2017 State Of DevOps Report: https://puppet.com/resources/whitepaper/state-of-devops-report
81. @RealGeneKim
Leaders Affect Outcomes Through…
Source: Puppet/DORA: 2017 State Of DevOps Report: https://puppet.com/resources/whitepaper/state-of-devops-report
87. @RealGeneKim
Fast Push To Market — Continued
Features
Defects
Defect fixing dominates work
Site reliability tanks
Slower and slower velocity
Customers leave
Morale plunges
Devs leave because everything is hard
Quality
Debts & Risks
89. @RealGeneKim
Near Death Experiences
● eBay (1999)
● Microsoft (2002): Bill Gates memo
● Amazon (2004): Jeff Bezos memo
● Twitter (2008)
● LinkedIn (2009)
● Etsy (2009)
● Knight Capital (2012)
● Healthcare.gov (2013)
● British Airways (2015) *
● Equifax (2018) *
* Not actually a “near death experience” — but I think they’re egregiously bad…
92. @RealGeneKim
Quote from Marty Cagan from his book
Inspired
The deal [between product owners and] engineering goes like this: Product
management takes 20% of the team’s capacity right off the top and gives this to
engineering to spend as they see fit. They might use it to rewrite, re-architect, or
re-factor problematic parts of the code base…whatever they believe is necessary
to avoid ever having to come to the team and say, ‘we need to stop and rewrite [all
our code].’ If you’re in really bad shape today, you might need to make this 30% or
even more of the resources. However, I get nervous when I find teams that think
they can get away with much less than 20%.
Cagan notes that when organizations do not pay their “20% tax,” technical debt
will increase to the point where an organization inevitably spends all of its cycles
paying down technical debt. At some point, the services become so fragile that
feature delivery grinds to a halt because all the engineers are working on reliability
issues or working around problems.
95. @RealGeneKim
Advice To Business Leaders
Managing technical debt doesn’t have to always
happen during extinction event crises
20% of R&D time as a part of daily work is one of my
favorite patterns
Hack Days, Improvement Blitzes, Improvement
Days saved Google, Facebook, Etsy, Amazon
Surface and solve problems as a part of daily work
Work with your teams to determine whether the Debts
dial needs to be turned up higher! (0% is never
sustainable)
Encourage “blameless post-mortems”: make it safe to
talk about problems
96. @RealGeneKim
From Dr. Mik Kersten, CEO, Tasktop
Top annual priority: new features
Result: debt to pay down
97. @RealGeneKim
From Dr. Mik Kersten, CEO, Tasktop
Last year: Unhappiest team
Now: Happiest team
Highest feature flow
99. @RealGeneKim
As Your Ambassador From Dev
For decades, I self-identified as an Ops person…
2 years ago, I started to self-identify as Dev
Clojure / ClojureScript
LISP, functional programming, immutability
3000 lines of Objective C -> 1500 lines of
TypeScript/React -> 500 lines of ClojureScript
Development is so fun, and these days, you can do
miraculous things with so little effort
100. @RealGeneKim
Why Functional Programming
The famous French philosopher Claude Lévi-Strauss
would say of certain tools, ‘is it good to think with?’
Core FP concepts
Immutability
Pure functions
Composability
Pioneered by Haskell and OCaml. Popularized by
Clojure, Erlang, Elm, Elixir
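The three core concepts listed above can be illustrated in a few lines. The sketch below uses Python rather than a functional-first language, and the `Order` type, the pricing rules, and the `compose` helper are all hypothetical, chosen only to show the ideas.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable


@dataclass(frozen=True)  # immutability: fields cannot be reassigned
class Order:
    subtotal_cents: int
    discount_cents: int = 0
    tax_cents: int = 0


def apply_discount(order: Order) -> Order:
    """Pure function: returns a new Order with a 10% discount, no side effects."""
    return replace(order, discount_cents=order.subtotal_cents // 10)


def apply_tax(order: Order) -> Order:
    """Pure function: 8% tax on the discounted amount."""
    taxable = order.subtotal_cents - order.discount_cents
    return replace(order, tax_cents=taxable * 8 // 100)


def compose(*steps: Callable[[Order], Order]) -> Callable[[Order], Order]:
    """Composability: chain small functions left to right into one."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


price = compose(apply_discount, apply_tax)

original = Order(subtotal_cents=10_000)
priced = price(original)

print(priced)    # Order(subtotal_cents=10000, discount_cents=1000, tax_cents=720)
print(original)  # Order(subtotal_cents=10000, discount_cents=0, tax_cents=0)
```

The original value is untouched after pricing, so each step can be tested in isolation and the steps can be reordered or reused without hidden state.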
102. @RealGeneKim
Never Have I Valued Infrastructure More
Things I detest now
Everything outside of my application
Connecting anything to anything
Secrets management
Bash
YAML
Patching
Building Kubernetes deployment files (mostly by
Googling)
Why my cloud costs are so high
104. @RealGeneKim
The Rebellion Needs You
The DevOps Enterprise journey is a rebellion
against an ancient and powerful order
Digital transformation is changing everything —
and that’s where the excitement is
The Rebellion needs you — the next generation
of leaders
106. @RealGeneKim
Help I’m Looking For
For 1.5 years, I’ve been working on “The Unicorn
Project”
“The Phoenix Project” retold, but from the perspective of a
developer/architect
It’s a book about people rebelling against an ancient and
powerful order
Combination of Redshirts from Star Trek, Hogan’s Heroes, The
A-Team, and the movie Brazil
I’m looking for women in technology to review an
early draft of a book
107. @RealGeneKim
Want To Learn More?
To receive the following:
A copy of this presentation
Eight excerpts from Beyond The Phoenix Project audio series w/John Willis
The 140 page excerpt of The DevOps Handbook
The 140 page excerpt of The Phoenix Project
Videos and slides from DevOps Enterprise 2014-2017
Whitepaper from DevOps Research and Assessment
The DevOps Enterprise Forum Guidance Papers
Link to the DevOps Audit Defense Toolkit
One hour excerpt of The Phoenix Project audiobook
Just pick up your phone, and send an email:
To: realgenekim@SendYourSlides.com
Subject: devops
Editor's notes
[ picture of messy data center ] Ten minutes into Bill’s first day on the job, he has to deal with a payroll run failure. Tomorrow is payday, and finance just found out that while all the salaried employees are going to get paid, none of the hourly factory employees will. All their records from the factory timekeeping systems were zeroed out. Was it a SAN failure? A database failure? An application failure? Interface failure? Cabling error?
Who are they auditing? IT operations.
I love IT operations. Why? Because when the developers screw up, the only people who can save the day are the IT operations people.
Memory leak? No problem, we’ll do hourly reboots until you figure that out.
Who here is from IT operations?
Bad day:
Not as prepared for the audit as they thought
Spending 30% of their time scrambling, generating presentation for auditors
Or an outage, and the developer is adamant that they didn’t make the change – they’re saying, “it must be the security guys – they’re always causing outages”
Or, there’s 50 systems behind the load balancer, and six systems are acting funny: what’s different, and who made them different?
Or every server is like a snowflake, each having their own personality
We as Tripwire practitioners can help them make sure changes are made visible, authorized, deployed completely and accurately, find differences
Create and enforce a culture of change management and causality
Source: Flickr: birdsandanchors
Source: RyanJLane
High performing IT organizations are still deploying more frequently with fewer failures.
- High performers deploy 46x more frequently than low performers (compared to 200x in 2016). This is the difference between being able to deploy on demand multiple times a day and only being able to deploy once or twice a month.
- High performers also have 440x faster lead times (compared to 2,555x faster in 2016). They can push a change to production in less than an hour versus once every couple of months.
- They also recover from failures 24x faster. This is the difference between an hour and a day, which is significant when you factor in the average cost of downtime per hour.
- High performing organizations also have 1/5 the change failure rate (or 5x lower change failure rate).
Takeaway: High performers are maximizing throughput while maintaining the highest levels of stability. This means they are able to get new features and bug fixes to market faster, get customer feedback, and iterate more rapidly.
Low performers have increased their throughput, compared to 2016 results, and are deploying faster and more frequently, but they’re still doing poorly in terms of stability. We speculate that this is because low-performing teams are optimizing for speed, but not investing enough in building quality into the process, which takes time.
The result is larger failures that take more time to restore service. High performers understand that they don’t have to trade speed for stability or vice versa, because by building quality in, they get both.
We used the most powerful analytical tool to generate this graph: not SPSS, R, Tableau, PLA Sim. We used pivot tables in Excel.
April 22, 2011