This presentation describes my interpretation of the Why and How of DevOps, along with the key findings from my 15-year study of high-performing IT organizations and how they simultaneously deliver stellar service levels and rapid implementation of new features into the production environment.
Organizations employing DevOps practices such as Google, Amazon, Facebook, Etsy and Twitter are routinely deploying code into production hundreds, or even thousands, of times per day, while providing world-class availability, reliability and security. In contrast, most organizations struggle to do releases more than once every nine months.
I will present how these high-performing organizations achieve this fast flow of work through Product Management and Development, through QA and Infosec, and into IT Operations, so that other organizations can replicate their extraordinary culture and outcomes and win in the marketplace.
14. @RealGeneKim
10 deploys per day
Dev & ops cooperation at Flickr
John Allspaw & Paul Hammond
Velocity 2009
Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
16. Little bit weird
Sits closer to the boss
Thinks too hard
Pulls levers & turns knobs
Easily excited
Yells a lot in emergencies
Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
17.
18. Ops who think like devs
Devs who think like ops
Source: John Allspaw (@allspaw) and Paul Hammond (@ph)
24. @RealGeneKim
Making Changes When It Matters Most
“By installing a rampant innovation culture,
we performed 165 experiments in the peak three
months of tax season.”
“Our business result? Conversion rate of the
website is up 50 percent. Employee result?
Everyone loves it, because now their ideas can
make it to market.”
–Scott Cook, Intuit Founder
25. @RealGeneKim
Who Is Doing DevOps?
Google, Amazon, Netflix, Etsy, Spotify, Twitter, Facebook …
Dynatrace, CSC, IBM, CA, SAP, HP, Microsoft, Red Hat, …
GE Capital, Nationwide, BNP Paribas, BNY Mellon,
World Bank, Paychex, Intuit …
The Gap, Nordstrom, Macy’s, Williams-Sonoma, Target …
General Motors, Raytheon, LEGO, Bosch …
UK Government, US Department of Homeland Security …
Kansas State University…
Who else?
26. High Performers Are More Agile
30x more frequent deployments
8,000x faster lead times than their peers
Source: Puppet Labs 2013 State Of DevOps: http://puppetlabs.com/2013-state-of-devops-infographic
27. @RealGeneKim
High Performers Are More Reliable
2x the change success rate
12x faster mean time to recover (MTTR)
Source: Puppet Labs 2013 State Of DevOps: http://puppetlabs.com/2013-state-of-devops-infographic
28. High Performers Win In The Marketplace
2x more likely to exceed profitability, market share & productivity goals
50% higher market capitalization growth over 3 years*
Source: Puppet Labs 2014 State Of DevOps
30. “This book will have a profound effect on IT,
just as The Goal did for manufacturing.”
–Jez Humble,
co-author Continuous Delivery
“This is the IT swamp draining manual for
anyone who is neck deep in alligators.”
–Adrian Cockcroft,
Cloud Architect at Netflix
“This is The Goal for our decade,
and is for any IT professional who wants
their life back.”
–Charles Betz, IT architect, author
“Architecture and Patterns for IT”
36. @RealGeneKim
Create One Step Environment
Creation Process
Make environments available early in the
Development process
Make sure Dev builds the code and environment
at the same time
Create a common Dev, QA and Production
environment creation process
37. @RealGeneKim
If I had a magic wand,
I’d change the Agile sprints and
definition of “done”:
“At the end of each sprint, we must
have working and shippable code…
demonstrated in an environment
that resembles production.”
38. Deploy Smaller Changes, More Frequently *
Source: http://www.facebook.com/note.php?note_id=14218138919
39. Deploy Smaller Changes, More Frequently *
Decouple feature releases from code
deployments
Deploy features in a disabled state, using feature
flags
Require all developers check code into trunk
daily (at least)
Practice deploying smaller changes, which
dramatically reduces risk and improves MTTR
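The idea of deploying a feature in a disabled state can be sketched in a few lines. This is only an illustrative sketch: the `FLAGS` dictionary, the `is_enabled` helper, and the `new_checkout` flag name are hypothetical and do not come from the talk. Real systems typically read flags from configuration or a flag service so they can be flipped without a deploy.

```python
# Hypothetical in-memory flag store (an assumption for illustration only).
# The new code path is deployed "dark": it ships to production but stays off.
FLAGS = {"new_checkout": False}


def is_enabled(flag: str) -> bool:
    """Return the flag's state; unknown flags default to off."""
    return FLAGS.get(flag, False)


def checkout(cart_total: float) -> str:
    """Route to the new flow only when its flag is on."""
    if is_enabled("new_checkout"):
        return f"new checkout flow: {cart_total:.2f}"
    return f"legacy checkout flow: {cart_total:.2f}"


print(checkout(19.99))        # takes the legacy path while the flag is off
FLAGS["new_checkout"] = True  # "release" the feature without redeploying code
print(checkout(19.99))        # now takes the new path
```

Because the release is just a flag flip, turning the feature back off is equally cheap, which is one way smaller changes can help with recovery time.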
40. Experiment: Reducing Batch Size By 50%
And the customer got the feature in
half the time!
Source: Scott Prugh, Chief Architect, CSG, Inc.
41. @RealGeneKim
“As a lifelong Ops practitioner, I know
we need DevOps to make our work
humane.
In the past, I’ve worked every holiday, on
my birthday, my spouse’s birthday, and
even on the day my son was born.”
Nathan Shimek
Engineering Manager, New Context
@nathan_shimek
42. @RealGeneKim
Breaking The Bottlenecks In The Flow
Environment creation
Code deployment
Test setup and run (mention @rohansingh)
Overly tight architecture
Development
Product management
43. “In November 2011, running even the most minimal
test for CloudFoundry required deploying to 45 virtual
machines, which took a half hour. This was way too
long, and also prevented developers from testing on
their own workstations.
By using containers, within months, we got it down to
18 virtual machines so that any developer can deploy
the entire system to single VM in six minutes.”
— Elisabeth Hendrickson, Director of Quality
Engineering, Pivotal Labs
@testobsessed
44. @RealGeneKim
Blackboard Learn: 2005-Present
[Chart: lines of code (LoC) and commits]
Source: David Ashman, Chief Architect, Blackboard, Inc. (@davidbashman)
The Problem
45. @RealGeneKim
Blackboard Learn Building Blocks
Source: David Ashman, Chief Architect, Blackboard, Inc. (@davidbashman)
46. Top Predictors Of IT Performance (2014)
Version control of all production artifacts
Continuous integration and deployment
Automated acceptance testing
Peer-review of production changes (vs. external
change approval)
High trust culture
Proactive monitoring of the production environment
Win-win relationship between Dev and Ops
Source: Puppet Labs 2014 State Of DevOps
47. @RealGeneKim
The First Way: Outcomes
Creating single repository for code and environments
Determinism in the release process
Consistent Dev, Test and Production environments, all properly
built before deployment begins
Features being deployed daily without catastrophic failures
Decreased lead time
Faster cycle time and release cadence
50. How many times is the andon cord
pulled in a typical day at a Toyota
manufacturing plant?
3,500 times per day
Source: http://www.gembapantarei.com/2008/04/how_many_times_do_you_pull_the_andon_cord_each_day.html
51. Why would Toyota do something so disruptive as
stopping production thousands of times per day?
“It’s the only way we can build 2,000 vehicles
per day – that’s one completed vehicle every
55 seconds.”
52. @RealGeneKim
Google Dev And Ops (2013)
15,000 engineers, working on 4,000+ projects
All code is checked into one source tree
(billions of files!)
5,500 code commits/day
75 million test cases are run daily
"Automated tests transform fear into boredom."
-- Eran Messeri, Google
53. @RealGeneKim
Developers Carry Pagers
“We found that when we woke up developers at
2am, defects got fixed faster than ever”
– Patrick Lightbody,
CEO, BrowserMob
“You build it, you run it.”
– Werner Vogels
CTO, Amazon
54. @RealGeneKim
Developers Carry Pagers
“As a developer, there has never been a more
satisfying point in my career than when I wrote
the code, I pushed the button to deploy it,
I watched the metrics to see if it actually worked
in production, and fixed it if it broke.”
– Tim Tischler
Director of Operations Engr,
Nike, Inc.
57. @RealGeneKim
Pervasive Production Telemetry
“Having a
developer add a
monitoring metric
shouldn’t feel like
a schema
change.”
– John Allspaw,
SVP Tech Ops,
Etsy
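As a rough illustration of how lightweight adding a metric can be, the sketch below uses a hypothetical in-process counter registry. The `METRICS` registry, the `increment` helper, and the metric names are assumptions made for this example; they are not Etsy's tooling. In practice the call would usually forward to a metrics backend.

```python
from collections import Counter

# Hypothetical in-process registry (an assumption for illustration only).
METRICS = Counter()


def increment(name: str, value: int = 1) -> None:
    """Record a metric; no registration or schema step is needed first."""
    METRICS[name] += value


def login(user: str, password_ok: bool) -> bool:
    """Example business function instrumented with one-line metrics."""
    increment("logins.attempted")
    if not password_ok:
        increment("logins.failed")
        return False
    increment("logins.succeeded")
    return True


login("alice", True)
login("bob", False)
print(dict(METRICS))
```

The design point is that a new metric name costs one line at the call site, so developers have little reason to skip instrumentation.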
63. Top Predictors Of IT Performance (2014)
Version control of all production artifacts
Continuous integration and deployment
Automated acceptance testing
Peer-review of production changes (vs. external
change approval)
High trust culture
Proactive monitoring of the production environment
Win-win relationship between Dev and Ops
Source: Puppet Labs 2014 State Of DevOps
64. @RealGeneKim
The Second Way: Outcomes
Defects and security issues getting fixed faster than ever
Disciplined automated testing enabling many
simultaneous small, agile teams to work productively
All groups communicating and coordinating better
Everybody is getting more work done
65. The Third Way:
Continual Experimentation And Learning
66. @RealGeneKim
Break Things Early And Often
“Do painful things more frequently, so you can
make it less painful… We don’t get pushback
from Dev, because they know it makes rollouts
smoother.”
– Adrian Cockcroft,
Former Architect, Netflix
(Now Technology Fellow,
Battery Ventures)
70. @RealGeneKim
The 2014 AWS Reboot
“When we got the news about the emergency EC2
reboots, our jaws dropped. When we got the list of
how many Cassandra nodes would be affected, I
felt ill.
“Then I remembered all the Chaos Monkey
exercises we’ve gone through. My reaction
was, ‘Bring it on!’”
– Christos Kalantzis
Netflix Cloud DB Engineering
Source: http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
71. @RealGeneKim
The 2014 AWS Reboot
“Out of our 2700+ production Cassandra nodes,
218 were rebooted. 22 Cassandra nodes did not
reboot successfully.
“Netflix customers experienced no downtime that
weekend.”
– Bruce Wong
Netflix Chaos Engineering
73. “By November 2011, Kevin Scott,
LinkedIn’s top engineer, had had
enough. The system was taxed as
LinkedIn attracted more users, and
engineers were burnt out.
“To fix the problems, Scott, who’d
arrived from Google that February,
launched Operation InVersion.
“He froze development on new
features so engineers could overhaul
the computing architecture.
“‘We had to tell management we’re
not going to deliver anything new
while all of engineering works on this
project for the next two months,’
Scott says. ‘It was a scary thing.’”
81. @RealGeneKim
Our Mission
Positively influence the
lives of one million IT
professionals by 2017.
82. @RealGeneKim
DevOps Enterprise: Lessons Learned
On Oct 21-23, we held the DevOps Enterprise Summit, a
conference for horses, by horses
Macy’s, Disney, GE Capital, Blackboard, Telstra, US Department of
Homeland Security, CSG, Raytheon, Ticketmaster, Union Bank of
California
Leaders driving DevOps transformations talked about
The business problem they set out to solve
The obstacles they had to overcome
The business value they created
83. @RealGeneKim
Want To Learn More?
To receive the following:
A copy of this presentation
A free 140 page excerpt of The Phoenix Project
Information on the DevOps Enterprise: Lessons
Learned
My recommended reading list for enterprise DevOps
adoption
See early drafts of our upcoming DevOps Cookbook
Just pick up your phone, and send an email:
To: realgenekim@SendYourSlides.com
Subject: lisa
84. Can Large Orgs Be High Performers?
Yes.
But orgs with 10,000+
employees are 40% less likely
to be high performing vs.
500-employee orgs…
Source: Puppet Labs 2014 State Of DevOps
My name is Gene Kim. My area of passion started when I was the CTO and founder of Tripwire in 1999. I started keeping a list that we called “Gene’s list of people with great kung fu.” These were the organizations that simultaneously…
In the next 25 minutes, I’m really excited to share with you some of my key learnings, which I hope will not only be applicable to you, but which you’ll be able to put into practice right away to get some amazing results.
But let me tell you how my journey began…
[ picture of messy data center ] Ten minutes into Bill’s first day on the job, he has to deal with a payroll run failure. Tomorrow is payday, and finance just found out that while all the salaried employees are going to get paid, none of the hourly factory employees will. All their records from the factory timekeeping systems were zeroed out. Was it a SAN failure? A database failure? An application failure? Interface failure? Cabling error?
Who are they auditing? IT operations.
I love IT operations. Why? Because when the developers screw up, the only people who can save the day are the IT operations people.
Memory leak? No problem, we’ll do hourly reboots until you figure that out.
Who here is from IT operations?
Bad day:
Not as prepared for the audit as they thought
Spending 30% of their time scrambling, generating presentation for auditors
Or an outage, and the developer is adamant that they didn’t make the change – they’re saying, “it must be the security guys – they’re always causing outages”
Or, there are 50 systems behind the load balancer, and six systems are acting funny – what’s different, and who made them different?
Or every server is like a snowflake, each having their own personality
We as Tripwire practitioners can help them make sure changes are made visible, authorized, deployed completely and accurately, find differences
Create and enforce a culture of change management and causality
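The "six of fifty systems are acting funny" problem above is essentially configuration drift detection. The following is a minimal sketch under stated assumptions: the server names, the configuration keys, and the majority-fingerprint heuristic are all hypothetical choices for illustration, and this is not a description of how Tripwire itself works.

```python
import hashlib
from collections import Counter


def fingerprint(config: dict) -> str:
    """Hash a config in a canonical key order so equal configs hash equally."""
    canonical = "\n".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_outliers(servers: dict) -> list:
    """Return servers whose fingerprint differs from the most common one."""
    prints = {name: fingerprint(cfg) for name, cfg in servers.items()}
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(name for name, fp in prints.items() if fp != majority)


def diff(baseline: dict, other: dict) -> dict:
    """Map each differing key to its (baseline, other) pair of values."""
    keys = sorted(set(baseline) | set(other))
    return {k: (baseline.get(k), other.get(k))
            for k in keys if baseline.get(k) != other.get(k)}


# Hypothetical fleet: six web servers, one of which has drifted.
servers = {f"web{i:02d}": {"openssl": "1.0.1g", "max_conns": 512}
           for i in range(1, 7)}
servers["web04"] = {"openssl": "1.0.1e", "max_conns": 512}

print(find_outliers(servers))                    # which servers are different
print(diff(servers["web01"], servers["web04"]))  # what is different
```

A fingerprint comparison answers "which systems differ"; answering "who made them different" would additionally need change records, which this sketch does not model.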
EG Parts Unlimited, Inc. DBA Parts Unlimited is in serious trouble. Stock has tumbled 19% in the last 30 days, and is down 52% from its peak three years ago. The company continues to be outmaneuvered by their arch-rival, famous for their ability to anticipate and instantly react to customer needs. Parts Unlimited now trails the competition in sales growth, inventory turns and profitability.
Parts Unlimited has been promising the release of software called “Phoenix” which – if they can ever get it released – should close the gap. It tightly integrates its retailing and e-commerce channels. Already years late, many expect the company to announce another program delay in their analyst earnings call next month. 20 million in, years late and the Board and the Investors are – let’s just say the natives are restless and are looking for heads. Which means not only have some of the players been let go or moved positions, but the board is also looking at outsourcing and/or splitting up the company.
The board has given the team six months to make dramatic improvements.
Source: Flickr: birdsandanchors
Who’s introducing variance? Well, it’s often these guys. Show me a developer who isn’t causing an outage, I’ll show you one who is on vacation.
Primary measurement is deploy features quickly – get to market.
I’ve worked with two of the five largest Internet companies (Google, Microsoft, Yahoo, AOL, Amazon), and I now believe that the biggest differentiator to great time to market is great operations:
Bad day:
We do 6 weeks of testing, but deployment still fails. Why? QA environment doesn’t match production
Or there’s a failure in testing, and no one can agree whether it’s a code failure or an environment failure
Or changes are made in QA, but no one wrote them down, so they didn’t get replicated downstream in production
Believe it or not, we as Tripwire practitioners can even help them – make sure environments are available when we need them, that they’re configured correctly the first time, document all the changes, replicate them downstream
So who are all these constituencies that we can help, and increase our relevance as Tripwire practitioners and champions?
How many people here are in infosec?
Goal: protect critical systems and data
Safeguard organizational commitments
Prevent security breaches, help quickly detect and recover from them
Bad day: no security standards
No one is complying
Yes, we’re 3 years behind. “Whaddya gonna do about it?”
Vs. we (Tripwire owner) can become more relevant and add value by helping infosec leverage all the configuration guidance out there
Measure variance between production and those known good states
Trust and verify that when management says, we’ve trued up the configurations, they’ve actually done it
Why? Now, more than ever, there is an ever-increasing number of regulatory and contractual requirements to protect systems and data
There are many ways to react to this: like, fear, horror, trying to become invisible… All understandable, given the circumstances…
Because infosec can no longer take 4 weeks to turn around a security review for application code, or take 6 weeks to turn around a firewall change.
But, on the other hand, I think it will be the best thing to happen to infosec in the past 20 years. We’re calling this Rugged DevOps, because it’s a way for infosec to integrate into the DevOps process and be welcomed, and not be viewed as the shrill hysterical folks who slow the business down.
Tell story of Amazon, Netflix: they care about availability and security
It’s not a push, it’s a pull – they’re looking for our help (#1 concern: fear of disintermediation and being marginalized)
Eran Feigenbaum
Director of Security, Google Enterprise