The document describes several experiments conducted by Javier Turegano's team at REA Group to improve their DevOps processes. They tried placing developers and operations staff together in teams, creating a centralized tooling team, doing staff rotations between roles, integrating automation into delivery teams, organizing teams around business areas rather than technology, and forming dedicated delivery engineering teams. The last approach of dedicated delivery engineering teams with their own QA and operations support seemed to work best by reducing the number of streams of work and allowing teams to focus on delivery.
5. At the beginning...
[Diagram: Delivery Team 1, Delivery Team 2 ... Delivery Team N, each made up of Devs, alongside a separate Site Operations team made up of Ops.]
12. Placements
[Diagram: Delivery Team 1, Delivery Team 2 ... Delivery Team N (Devs) and Site Operations (Ops), with individual engineers placed across the boundary: Ops in Delivery, Devs in Site Operations.]
32. [Diagram: LoB “A” with Team 1 – Midsize initiative X, Team 2 – Small Initiative Y, Team 3 – Big Initiative G and Team 4 – Midsize initiative Z, each staffed with a mix of the roles below, plus a Lead, a Tech Lead and Ops.]
Legend: IM = Iteration Manager, BA = Business Analyst, UX = User Experience, TechL = Tech lead, Dev = Developer, QA = Quality Assurance, Ops = Operations
39. [Diagram: LoB “TOO MANY STREAMS” with Team 1 – Midsize initiative X, Team 2 – Small Initiative Y, Team 3 – Big Initiative G, Team 4 – Midsize initiative Z, Team 6 – Small Initiative Y ... Team N – Small Initiative Y, with Ops shown outside most of the teams.]
40.
41. [Diagram: LoB “A” with Team 1 – Midsize initiative X, Team 2 – Small Initiative Y, Team 3 – Big Initiative G and Team 4 – Midsize initiative Z, plus Team 5 – Delivery Engineering (Ops, Dev and QA).]
44. TL;DR: Which one worked?
There are only a few problems that can't be solved by cake
QUESTIONS?
FEEDBACK?
THANKS!
@setoide
Editor's notes
This is me and my passions.
For the last four and a half years I've been working for REA.
We operate high-traffic sites around the world.
Some of the things that make REA special are:
- Innovation
- Thought leadership in areas like agile, lean and devops
The only constant is change: we are always looking to improve.
This talk is about the different experiments we've run to try to create a devops culture at REA.
As Nigel could probably explain better:
“Complex systems are complex” and organizations like REA are complex in many dimensions: business, engineering, IT systems, etc...
The approach
Change something and observe. Be brave. Repeat.
Delivery vs Site Operations
Ops:
- To modify the code
- To help understand how the application works
Devs:
- To help us deploy to prod
- To help us with some non functional requirements
The night is dark and full of incidents.
"Days since a full night's sleep" counter
3-4 alerts per night
Happy engineer getting off pager.
Ops had to understand and troubleshoot a massive, complex set of systems
Storage/Networks/Systems/Apps/Monitoring/Data/Security etc...
That made hiring difficult because:
Heroes don't scale
Short temporary placements of engineers in a different functional area, normally lasting a few weeks.
Allocated capacity
Working closer to where the action is
Knowledge of full stack
You would never stop learning
Handovers and ramp-up for a new area were difficult
There were still conflicting priorities
Alerts and incidents were still being managed by the central team
Meet ADO, one of our first Devs to be fully knighted by the SiteOps team
Ops in Delivery
Devs in Site Ops
Am I going to fit in there?
As many companies have done
Create a centralized team to drive automation, continuous delivery, cloud adoption, etc...
PROBLEMS:
Painful manual deployments
QA blessing to go to prod
Coordination wall
1 staging fits all
The approach
Centralized team
Build tools ( #cloud + #chef + #git )
Solution that fits all needs
Influence teams towards adoption
This is a simplified version of an E2E environment. It was one of the achievements of the Gandalf team, and for a long time it gave us better opportunities to develop and test changes that affected multiple components.
We all hate pie charts.
Especially knowing that we have Lindsay in the audience.
Just an example of some of the tech challenges the team was going through as they tried to provide stable infrastructure for EVERYONE.
Thousands of environments were created every month for years. We can see the effect of stoppinator cleaning up the environments at the end of the day, when the engineers are not at work.
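As an illustration only, here is a minimal sketch of the kind of out-of-hours cleanup rule described above. It is not stoppinator's actual code: the `Environment` type, the `keep_alive` opt-out flag and the 08:00 to 19:00 working window are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Environment:
    # Hypothetical record for one on-demand environment.
    name: str
    running: bool
    keep_alive: bool = False  # assumed opt-out flag, not from the talk


# Assumed working hours; the real schedule is not given in the talk.
WORKDAY_START = time(8, 0)
WORKDAY_END = time(19, 0)


def outside_working_hours(now: datetime) -> bool:
    """Return True on weekends or outside the assumed workday window."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (WORKDAY_START <= now.time() < WORKDAY_END)


def environments_to_stop(envs, now):
    """Select running environments that may be stopped at time `now`."""
    if not outside_working_hours(now):
        return []
    return [e for e in envs if e.running and not e.keep_alive]
```

A scheduler would call `environments_to_stop` periodically and hand the result to whatever actually shuts the environments down; that part is deliberately left out of the sketch.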
In the future, e2e is replaced by contract-driven development. For an example, see:
https://github.com/realestate-com-au/pact
And decoupled systems using techniques like Hipster Batch copy
Hackdays were a place where having this kind of capability was awesome.
You could create a full environment mirroring the website and modify it at will in minutes.
Send your champions to contaminate other areas with their passion
Longer term allocations to a team
Ops still reported/belonged to the SiteOps team
Different approach
- Champions in each team to build the needed capabilities: automation, monitoring, performance
Some pluses
Priorities dictated by your function area
Engagement with the team
Better understanding of pain points
Early input in the project
Example of optimization from within a team instead of tackling the full-company problem.
The Autobots team was part of one of the Delivery areas and was focused on automating some parts of their delivery process.
They managed to automate some really complex processes:
- Schemabot: Database schema changes in an automated manner.
- Deploybot: Managed the deployment. One of its components, the netscaler gem, was afterward used by multiple teams.
The idea of copying from the open source model, with teams looking at what other teams have come up with, has repeated over time and become one of the most successful patterns at REA.
Different business areas highly independent
Develop + Operation
A very lean layer of Global Infrastructure to support
Thin layers of shared services and vendor mgmt
The principle was to promote TMI: Team Managed Infrastructure.
Cloud – Many accounts
Cons: Does everybody need to know about infrastructure/networks/etc...?
Negative
Priorities dictated by your business area
New Silos
Lost sense of community
Positive
Focus - Get Shit Done
Engagement with the team +++
Input into the roadmap
We give autonomy to the business areas to choose the best tools/practices for their areas.
They will have to support and maintain what they create, which drives accountability.
Can you spot the Ops engineer?
Devs step up (Pager, deployments, metrics, performance, etc...)
Day pager going to devs
Escalate if needed after troubleshooting
Proxy knowledge
Pick up BAU
Deploy something that hasn't been deployed
Tom, our ops engineer, can focus on general improvements to operations like:
Exploring a new CDN
Regression testing in Operations
Automating Security patches
Etc...
If a problem is beyond the knowledge of the engineers, they can escalate it to the Ops representative, and the good thing is that they will cache the knowledge.
The role of the ops in LoBs has evolved:
Their role (boost operations capacity in their area)
Enable previously disabled people
Early input into the projects
War room becomes the exception. For example, this all-hands-on-deck collaboration to tackle Heartbleed as soon as possible.
The previous model was quite successful, but as we became faster the business areas tried to run more streams in parallel, and the Ops capability sometimes wasn't readjusted to match...
How many ops are too many ops?
With areas running so many concurrent projects
Push to regroup again
But how is this different?
Previous investments paying off. Devs++
Focus on areas that can boost the full group
Sometimes called Devops (arrrgggggg) or BAU teams.
Focus: go fast from idea to prod
Examples: MaD walking skeleton, Group Delivery Engineering
Danger: BAU and operations being brought back to this group, undoing the previous benefits
Night pager improved over time.
And finally we had our first grad on Pager.
Kudos to Angus.
The experiments presented here are just examples of what we have tried at some point in time. They had different levels of success, and the results depend on the state of our own business and our own journey.
Run your own experiments. Try new things. Monitor the results.