This document discusses how small teams can get ready for Site Reliability Engineering (SRE). It describes the challenges faced by a small engineering team at a company with around 100 employees and 10 engineers. To address issues with productivity, reliability, and deployment speed, the team implemented several initiatives including adopting SCRUM, adding automated testing, simplifying deployments, and creating easy-to-use development environments. While these changes helped, the team knows there is still work needed in areas like data center operations and establishing formal SLAs and incident management processes as the company and services grow. The presentation concludes by discussing why SRE is preferable to just DevOps and provides resources for further learning.
How Small Teams Get Ready for SRE (public version)
1. Presented by
How Small Teams Get Ready for Site Reliability Engineering (SRE)
Setyo Legowo
Facebook Developer Circles – Bandung
October 1st, 2017
2. Presented by
Sources
SREcon17 Asia/Australia: How Could Small Teams Get Ready for SRE
Zehua Liu, Zendesk Singapore
Facebook DevCircles - Bandung, October 1st, 2017
Source: https://www.usenix.org/sites/default/files/srecon_europe_wide.png
3. Presented by
What is SRE?
Source: https://landing.google.com/sre/interview/ben-treynor.html
4. Presented by
Key Points of SRE
• Hire only coders
• Have an SLA for your service
• Measure and report performance against SLA
• Use error budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to Dev team
• Cap SRE operational load at 50%
• Share 5% of Ops work with Dev team
• Aim for a maximum of 2 events per on-call shift per person
• Minimum on-call group size of 8 people (8 people in 1 location, or 6 in each of 2 locations)
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
Ben Treynor
VP Engineering at Google
Image Source: https://www.usenix.org/sites/default/files/conference-files/ben_treynor_300.png
Source: https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre
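The "error budgets" key point can be made concrete with a small calculation. The sketch below is illustrative only: the function name `error_budget`, the 99.9% target, and the request counts are assumptions chosen for the example, not figures from the talk.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the error budget for one measurement window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # The budget is the number of failures the target tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        # Gate launches: only ship new changes while some budget is left.
        "launch_allowed": remaining > 0,
    }


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 failures, 99.9% target.
    print(error_budget(0.999, 1_000_000, 400))
```

When the remaining budget drops to zero or below, launches would be held back until reliability recovers, which is one way to read "gate launches on them".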
6. Presented by
The Problem
• Small Teams?
• Small company
• Small engineering team
• The case:
• A small team in a big company
• ~100 employees
• ~10 Engineers
• 6 Software Engineers
• 1 DevOps – Infrastructure Engineer
• 3 Mobile Application Engineers
7. Presented by
Growth Problem
• Total visitors grow gradually each month, along with new features
• Issues in productivity and site reliability
• Onboarding new hires
• Slower deployment time
• More incidents
• No clear SLA
Source: https://c1.staticflickr.com/6/5260/5519749611_a95070b507.jpg
8. Presented by
Do we have any solution?
• Started a series of engineering initiatives
• Implement SCRUM instead of FDD
• Automated tests
• Simple deployment
• Easy-to-use development environment
• …
Source: http://www.doncio.navy.mil/uploads/0803IXR47425.jpg
9. Presented by
Dedicated Engineering Resources
• SCRUM Development – Past
• CTO led feature development
• Toil tasks fixed only when encountered
• SCRUM Development – Now
• Hired more engineers
• Tried to eliminate technical debt
• No feature development for operational team
• Develop tools that support developers
10. Presented by
Simple Deployment
• Production Deployment – Past
• Manual: ssh and copy and paste scripts
• Prone to human error
• Only a few engineers could do it
• Could not accommodate new engineers and more frequent deployments
• Production Deployment – Now
• Jenkins → Travis → Jenkins
• DevOps team installs deployment scripts on new apps
• Ownership for engineers
11. Presented by
Easy-to-use Development Environment
• Setup development environment – Past
• Had ~30 setup steps
• Application versions were not uniform, even when engineers installed the same apps
• Hard for new engineers
• Setup development environment – Now
• Spent one quarter dockerizing dev and test environments
• Current development/deployment pipeline:
• Develop locally → Test in Docker → Deploy to staging → Test on staging → Deploy to production
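The development/deployment pipeline above can be sketched as an ordered list of stages where each stage must pass before the next one runs. This is a minimal illustration, not the team's actual tooling: `run_pipeline`, `PIPELINE_STAGES`, and the placeholder checks are hypothetical names introduced here, and a real setup would call Docker and the CI server instead of the stub functions.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True when the stage succeeded.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, check in stages:
        if not check():
            # A failed stage blocks everything after it.
            break
        completed.append(name)
    return completed


def always_ok() -> bool:
    """Placeholder check (assumption); replace with a real test or deploy step."""
    return True


# Stage names follow the slide; the checks are stubs.
PIPELINE_STAGES: List[Stage] = [
    ("Develop locally", always_ok),
    ("Test in Docker", always_ok),
    ("Deploy to staging", always_ok),
    ("Test on staging", always_ok),
    ("Deploy to production", always_ok),
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE_STAGES))
```

Stopping at the first failed stage is the point of the design: a change that fails in Docker or on staging never reaches production.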
12. Presented by
Automated Test
• Automated Test – Past
• No automated tests
• Manual testing done directly by the product owner
• Automated Test – Now
• Automated unit and acceptance tests in Docker
• Manual testing by QA
• Test coverage reports saved in reliable storage
• Automated tests inserted in each deployment step
13. Presented by
Miscellaneous Initiatives
• Change velocity: several deployments each day
• Deploy to staging/production in minutes
• Build useful monitoring dashboards
• And alert notifications
• Rotate monitoring shifts
• Establish post mortem culture
• Report every incident as a post mortem
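Rotating the monitoring shift can be as simple as a round-robin schedule. The sketch below assumes weekly shifts and uses made-up names; `monitoring_rotation` and its parameters are hypothetical and only illustrate the idea, not how the team actually scheduled shifts.

```python
from datetime import date, timedelta
from typing import List, Tuple


def monitoring_rotation(
    engineers: List[str],
    start: date,
    shifts: int,
    shift_days: int = 7,
) -> List[Tuple[date, str]]:
    """Assign consecutive monitoring shifts to engineers in round-robin order.

    Returns a list of (shift start date, engineer) pairs.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")

    schedule: List[Tuple[date, str]] = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        # The modulo wraps around to the first engineer after the last one.
        schedule.append((shift_start, engineers[i % len(engineers)]))
    return schedule


if __name__ == "__main__":
    # Hypothetical team of three, weekly shifts starting on a Monday.
    for shift_start, engineer in monitoring_rotation(
        ["Ani", "Budi", "Citra"], date(2017, 10, 2), 4
    ):
        print(shift_start.isoformat(), engineer)
```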
14. Presented by
Do those initiatives meet all requirements of SRE?
• Yes, but …
• You do not have to do SRE like Go*gle
• Adjust to your needs/issues as you grow, and SRE will come to you
• You don’t even need an SRE team!
• Focus on how to deliver reliable services
15. Presented by
Unfulfilled Goals
• When we become a big company
• Data center operations
• On-premise devices
• Reliability checklist
• SLA, SLI, SLO
• Incident management
• Good for reporting
Source: https://commons.wikimedia.org/wiki/File:Pilgrims_on_the_Way_of_St.James_near_Saint-Martin-des-Champs.JPG
17. Presented by
What is the difference from DevOps?
Image source: https://commons.wikimedia.org/wiki/File:Devops-toolchain.svg
19. Presented by
Watch & Reading List
• How Could Small Teams Get Ready for SRE, by Zehua Liu
https://www.usenix.org/conference/srecon17asia/program/presentation/liu
• Key Points of SRE, by Ben Treynor
https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre
• https://landing.google.com/sre/interview/ben-treynor.html
• Usenix Youtube Channel, https://www.youtube.com/channel/UC4-GrpQBx6WCGwmwozP744Q
• Site Reliability Engineering: How Google Runs Production Systems, Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
• The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations, by Gene Kim, Jez Humble, Patrick Debois, and John Willis
• Linux Foundation Events Youtube Channel,
https://www.youtube.com/channel/UCthvmTSlmIcMH93LIJNe-2w
20. Presented by
Thank You
Setyo Legowo
• Software Engineer at UrbanIndo
• Office e-mail address: setyo@urbanindo.com
• Personal e-mail address: setyolegowo94@gmail.com
• LinkedIn: https://www.linkedin.com/in/setyolegowo/