Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2oAuPOe.
Sahar Samiei and Willie Wheeler share Expedia’s resiliency journey, starting with resiliency as an afterthought and progressing toward resiliency as a first-class concern. They talk about the importance of partnering with the teams experiencing operational struggles, and equipping them with the data to make the right investments at the right time. Filmed at qconsf.com.
Sahar Samiei is a Senior Product Manager leading the Site Reliability Program at Expedia. She has been with Expedia for eight years, working across different teams, focusing on operations. Willie Wheeler is a Principal Application Engineer at Expedia, with 20 years of professional software development experience. At Expedia he serves as a resiliency champion within the engineering organization.
1. Willie Wheeler (@williewheeler)
Principal Applications Engineer
Sahar Samiei (@saharsamiei)
Senior Technical Product Manager
Expedia’s Journey Toward Site Resilience
Copyright (c) 2017 Expedia, Inc. All rights reserved.
2. Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/expedia-website-resiliency
3. Presented at QCon San Francisco
www.qconsf.com
4. In 2016, Expedia was the 11th-largest Internet company by revenue at $8.77B.
Source: https://en.wikipedia.org/wiki/List_of_largest_Internet_companies
5. The Price of Downtime: A Back-of-the-Envelope Calculation
Uptime    Downtime per year    Potential Revenue Loss
99%       3.65 days            $87.7M
99.9%     8.76 hours           $8.77M
99.99%    52.56 minutes        $877K
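The figures on this slide can be reproduced with a few lines of Python. This is only a sketch of the back-of-the-envelope arithmetic: it assumes revenue accrues evenly over a 365-day year, and the function name `downtime_cost` is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored
ANNUAL_REVENUE_USD = 8.77e9       # 2016 revenue figure quoted in the deck


def downtime_cost(uptime_pct, annual_revenue=ANNUAL_REVENUE_USD):
    """Return (downtime in minutes per year, revenue at risk in USD).

    Assumption: revenue is spread uniformly over the year, so the share of
    revenue at risk equals the share of time the site is down.
    """
    down_fraction = (100.0 - uptime_pct) / 100.0
    return down_fraction * MINUTES_PER_YEAR, down_fraction * annual_revenue


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        minutes, loss = downtime_cost(pct)
        print(f"{pct}% uptime: {minutes / 60:.2f} hours down per year, "
              f"${loss:,.0f} of revenue at risk")
```

Real traffic is seasonal, so an outage during a peak booking period would likely cost more than this uniform estimate suggests.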
6. So keeping our site up protects tens of millions of dollars of revenue per year.
19. Team autonomy is one of Expedia’s core beliefs. It’s great to have latitude to establish your own practices, tools and processes, but that latitude can add complexity.
23. We put travelers at the heart of our thinking.
To ensure maximum availability of our websites, even under unusual conditions, we organized ourselves around an initiative called “Perceived Zero Downtime”.
25. We empower teams by providing tools & best practices
We humbly listen to different approaches
We are not biased
We are open to other teams’ suggestions
26. We embrace team autonomy.
We set up a forum to bring teams from different brands together to share ideas, tools and best practices.
36. Resilience Engineering Lifecycle
[Diagram: services, two databases, and two 3rd-party services]
37. Resilience Engineering Lifecycle
[Diagram: services, two databases, and two 3rd-party services]
1. Prioritize Services
38. Resilience Engineering Lifecycle
[Diagram: services, two databases, and two 3rd-party services, with two of the services shown separately]
1. Prioritize Services
2. Investigate Vulnerabilities
46. 1. Prioritize Services
Resilience Scorecard
Resilience Alerts
Your app failed a Chaos Monkey
experiment. Chaos Monkey
terminated an EC2 instance,
causing an outage.
To resolve this, please use EC2
Auto Scaling Groups.
53. Defend your app with Hystrix
An open source circuit breaker implementation by Netflix:
• Start closed
• Trip (open) when downstream is unhealthy
• Close when downstream is healthy again
[Diagram: UI, API, Database, External API]
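The three states listed on this slide can be sketched as a small state machine. The Python below is not Hystrix (a Java library) and does not mirror its API. The class name, the `failure_threshold` and `reset_timeout` parameters, and the single trial call after the timeout are all assumptions chosen to illustrate the general circuit breaker pattern.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; not thread-safe, not the Hystrix API."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, func, *args, **kwargs):
        if self.is_open:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling load on an unhealthy downstream.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: let this one call through to probe health.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.is_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed probe
            raise
        # Success: downstream looks healthy again, so close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_prices, hotel_id)` (both names hypothetical), and serve a fallback when `CircuitOpenError` is raised.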
54. Defend your app with Bulkheads
• Isolate resources to avoid unintended interactions between apps
• Avoid "tragedy of the commons"
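One common way to build a bulkhead inside a single process is to give each dependency its own bounded pool of concurrent slots, so a slow dependency cannot exhaust capacity that others need. The sketch below assumes that approach. The `Bulkhead` class and its parameters are illustrative and not taken from the talk.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a bulkhead has no free slots left."""


class Bulkhead:
    """Caps concurrent calls to one dependency (illustrative sketch)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately rather than queueing, so callers of a saturated
        # dependency fail fast and do not hold on to shared threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per downstream dependency, saturation of one pool leaves the others untouched, which is the isolation the slide describes.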
55. On our side, it is not a production
incident because no client started
to use this new platform.
59. 4. Experiments in Test
$ python test3.py
Baseline city count is 8.
Response during attack returned a city count of 1.
Response after attack returned a city count of 8.
********* Test success. *********
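The deck does not show the contents of test3.py, so the following is only a guess at the shape of such an experiment: measure a baseline, inject a failure, check the response during the attack, then check recovery. The function names, the three callables, and the pass criteria (the service still answers during the attack and returns to baseline afterward) are all assumptions.

```python
def run_experiment(fetch_city_count, start_attack, stop_attack):
    """Run a baseline / attack / recovery check and report the outcome.

    All three arguments are caller-supplied callables (hypothetical here):
    one queries the service under test, the others toggle the injected failure.
    """
    baseline = fetch_city_count()
    print(f"Baseline city count is {baseline}.")

    start_attack()
    try:
        during = fetch_city_count()
    finally:
        stop_attack()  # always remove the injected failure, even on error
    print(f"Response during attack returned a city count of {during}.")

    after = fetch_city_count()
    print(f"Response after attack returned a city count of {after}.")

    # Assumed criteria: degraded but non-empty response during the attack,
    # and full recovery to the baseline once the attack ends.
    success = during >= 1 and after == baseline
    print("********* Test success. *********" if success
          else "********* Test failure. *********")
    return success


if __name__ == "__main__":
    # Simulated service: returns 8 cities normally, 1 while under attack.
    state = {"attack": False}
    run_experiment(
        fetch_city_count=lambda: 1 if state["attack"] else 8,
        start_attack=lambda: state.update(attack=True),
        stop_attack=lambda: state.update(attack=False),
    )
```

Passing the service call and the failure injection in as callables keeps the check independent of whichever chaos tool or HTTP client a team already uses.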
67. Results to Date
Chaos Monkey running daily in production since May 15
Added resilience tests to four Tier 1 service pipelines
Increased organizational awareness
Established resilience community of practice
68.
69. Results to Date
Dev team engagement has been a struggle:
• Limited team capacity
• Product > resilience
70. Pivot for 2018: Automation
Reduce the cost of resilience engineering through automation
Test by discovery
Visibility
Proxy-based resilience
71. What did we learn?
Resilience has a cost!
Teams have multiple competing priorities
There are major misconceptions
Reduce friction
Prepare for early effort
72. What did work well?
Add complexity incrementally
Try multiple technical approaches
Make adoption easy
Drive for experiments in production
Represent metrics
Start with low-hanging fruit