These are the slides from my opening keynote for GDBC 2018 (video at https://www.youtube.com/watch?v=aIiLhK0NIlY).
I give an overview of how we evolved from TFS into VSTS as a service that ships every sprint. For more depth on the topics I cover, check out the videos on our DevOps at Microsoft page: https://docs.microsoft.com/en-us/azure/devops/devops-at-microsoft/.
4. 3 weeks
Team Foundation Server (TFS)
Visual Studio Team Services (VSTS)
Single master branch, multiple release branches
7. Shared Platform Services (SPS): North Central
TFS SU1: North Central
TFS SU0: West Central
TFS SU7: Australia
8. Today: Micro Services
Services shown in VSTS: TFS (Work Item Tracking, Version Control, Build, Test Case Management), Service Hooks, Release Management, Search, Code Lens, Extension Management, Hosted Build Pool, Cloud Load Test, Blobstore, Feeds, Packaging, SPS (Identity, Account, Commerce, Licensing)
Moving to Containerized Services
20. Sprint 1: August 2010
Service Online: April 23, 2011
Service Preview: June 2012
Team Rooms: August 2013
On-call Duty: October 2013
1ES: Spring 2014
Combined Engineering: November 2014
Test Conversion Completed: April 2017
Sprint 135: May 2018
24. On call rotation
Gather data for root cause & mitigate for customers
Every action recorded
Create & track Repair Items to prevent recurrence and improve detection time
27. Test at the lowest level possible
Fast and reliable
Product is designed for testability
Test code is product code
End to end tests can run in production
28. Over 22 hours for nightly run and 2 days for the full run
Only ~60% of P0 runs passed 100%; each run had many failures
Took days to sift through failures before deployment could start
29. L0 – requires only built binaries, no dependencies
L1 – adds ability to use SQL and file system
Run L0 & L1 in the pull request builds
L2 – test a service via REST APIs
L3 – full environment to test end to end
TRA tests – Legacy functional tests
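The level taxonomy on this slide can be sketched as a small selection rule. This is only an illustration: the `Level` enum, the `TestInfo` record, and the test names are hypothetical, not the team's actual tooling. The one rule taken from the slide is that pull request builds run L0 and L1.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    L0 = 0  # requires only built binaries, no dependencies
    L1 = 1  # may also use SQL and the file system
    L2 = 2  # tests a service via its REST APIs
    L3 = 3  # needs a full environment, end to end


@dataclass(frozen=True)
class TestInfo:
    name: str   # hypothetical test name
    level: Level


def pull_request_suite(tests):
    """Return the tests a pull request build would run: L0 and L1 only."""
    return [t for t in tests if t.level <= Level.L1]


tests = [
    TestInfo("parse_work_item_query", Level.L0),
    TestInfo("save_work_item_to_sql", Level.L1),
    TestInfo("create_build_via_rest", Level.L2),
    TestInfo("push_then_build_then_release", Level.L3),
]
pr_names = [t.name for t in pull_request_suite(tests)]
print(pr_names)
```

Ordering the levels as integers keeps the rule a single comparison, so moving a test to a lower level automatically pulls it into the pull request build.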
31. A strategy adopted by our teams to provide focus and assist with an interrupt culture.
• The team self-organizes each sprint into two distinct sub-teams: Features and Shield
• Rotates each sprint
Team of 10 Engineers
Shield Team: deals with all live-site issues and interruptions
Feature Team: works on committed features (new work)
32. • Conference bridge created
• DRIs brought into the call
• Communication externally and internally
• Pursue multiple theories
• Gather data for root cause & mitigate
• Record changes
• Rotate people during long running LSIs
33. Repair work items are logged in VSTS and linked into the post mortem for traceability
Time-to’s (TTx) are key KPIs that are reviewed for improvements
Each Feature Team has goals for closing repair items
34. If we can’t prevent failure – can we limit the impact?
https://github.com/Netflix/Hystrix/wiki
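One common way to limit the impact of a failing dependency is the circuit breaker pattern that Hystrix popularized. The sketch below is a minimal Python illustration of that pattern, not Hystrix itself and not the VSTS implementation. The class name, thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry once a cool-down has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to stay open before a trial call
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the dependency entirely
            # Cool-down elapsed (half-open): let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()             # degrade gracefully
        # Success closes the circuit, so the service recovers on its own.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_live_data, lambda: cached_data)`, where both names are placeholders for whatever live call and degraded response apply.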
36. Day 1
Ring 0: Binaries, delay 1 hour, Servicing, delay 2 hours
Ring 1: Binaries, delay 1 hour, Servicing, delay 2 hours
Ring 2: Binaries, delay 1 hour, Servicing
Day 2
Ring 3: Binaries, delay 1 hour, Servicing, delay 3 hours
Ring 4: Binaries, delay 1 hour, Servicing
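To see how the delays on this slide add up, the schedule can be written as data and the start offsets computed from it. This is a rough sketch under one loud assumption: each delay is counted from the start of one step to the start of the next, and the time a step itself takes is ignored. The slide does not say how the delays are measured.

```python
# (ring, phase, hours to wait before the next step), transcribed from the slide.
DAY_1 = [
    ("Ring 0", "Binaries", 1),
    ("Ring 0", "Servicing", 2),
    ("Ring 1", "Binaries", 1),
    ("Ring 1", "Servicing", 2),
    ("Ring 2", "Binaries", 1),
    ("Ring 2", "Servicing", 0),
]
DAY_2 = [
    ("Ring 3", "Binaries", 1),
    ("Ring 3", "Servicing", 3),
    ("Ring 4", "Binaries", 1),
    ("Ring 4", "Servicing", 0),
]


def start_offsets(steps):
    """Return (ring, phase, hours since the first step of the day) per step."""
    elapsed = 0
    schedule = []
    for ring, phase, delay_after in steps:
        schedule.append((ring, phase, elapsed))
        elapsed += delay_after  # assumption: step duration itself is ignored
    return schedule


for day, steps in (("Day 1", DAY_1), ("Day 2", DAY_2)):
    for ring, phase, offset in start_offsets(steps):
        print(f"{day} +{offset}h  {ring} {phase}")
```

Under that assumption, the last Day 1 step (Ring 2 Servicing) starts seven hours after Ring 0 Binaries, and the last Day 2 step starts five hours after Ring 3 Binaries.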
37. PR to Merge is 30 mins
600 PR builds per day
~60,000 tests in each build
175 pushes to master
Merge to CI Build is 22 mins
120 builds per day
2,864 projects (C# and C++)
10 GB Build Drop
Merge to SelfTest is 58 mins
6 SelfTest suites triggered in parallel
518 tests executed in <8 mins
Merge to SelfHost is 120 mins
4 SelfHost suites triggered in parallel
3260 tests executed in < 75 mins
38. Why move to containers?
Agility for teams while keeping COGS under control
Faster deployments
Get test results faster
Improve quality of service by simpler auto-scaling
Same for production and engineering environments
Editor's notes
Welcome to the Global DevOps Bootcamp
Hello, I’m Buck Hodges, director of software engineering for Visual Studio Team Services, and today I’m going to talk to you about our journey from a box product to a cloud cadence.
TFS and VSTS provide Git, agile planning, build automation, and more. TFS ships on-premises, and VSTS is the equivalent running as a service on Azure.
As of October 2017:
Single repo
430 people pushed to the repo in the last 30 days
~40 feature teams
Code base is 90+% the same
Teams work in master
No nightly build
Over 3,000 projects (doubled in the last 3 years)
Pull requests build & run unit test validation
Single tenant = every account sign up created a new database in Azure North Central (Chicago)…got to 11,000 DBs
Blast radius
Scale limit (soft)
Geos/sovereignty
VMs – PaaS web and worker roles and moving to Containers
App tiers – serve web UI, web service endpoints
Job agents – background processing like scheduled builds, clean up, commit processing, etc.
DB – only metadata in SQL Azure, multi-tenant
Blob – file data in Azure Storage
Collections are accounts
SU1 was originally the only scale unit…no incremental roll out when there’s only one!
Then SPS (March 2013)
Then SU0
Then more scale units in the US and around the world
Now have SPS SU0 (February 2017)
Organized in deployment rings
Health check runs after each ring is deployed
Today we have four rings with outer rings having multiple scale units in them
Each service has scale units organized in rings
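The ring notes above can be sketched as a rollout loop gated on a health check. This is a minimal sketch, not the actual deployment orchestration: `roll_out`, the `deploy` and `is_healthy` callbacks, and the scale unit names in the example are all hypothetical. It also assumes that a failed health check stops the rollout, which the notes imply but do not state.

```python
def roll_out(rings, deploy, is_healthy):
    """Deploy ring by ring and stop at the first ring whose health check fails.

    rings      -- list of (ring_name, [scale_unit, ...]) in rollout order
    deploy     -- callable that deploys to one scale unit
    is_healthy -- callable run once per ring after all its scale units deploy
    Returns (rings completed, name of the failed ring or None).
    """
    completed = []
    for ring_name, scale_units in rings:
        for scale_unit in scale_units:
            deploy(scale_unit)
        if not is_healthy(ring_name):
            return completed, ring_name  # outer rings are never touched
        completed.append(ring_name)
    return completed, None


# Hypothetical layout: four rings, outer rings holding several scale units.
rings = [
    ("Ring 0", ["su-a"]),
    ("Ring 1", ["su-b"]),
    ("Ring 2", ["su-c", "su-d"]),
    ("Ring 3", ["su-e", "su-f", "su-g"]),
]
deployed = []
completed, failed = roll_out(rings, deployed.append, lambda ring: ring != "Ring 2")
print(completed, failed, deployed)
```

Because the check runs per ring, a bad build in this example reaches the two Ring 2 scale units but none of the three in Ring 3, which is the blast-radius limit the rings are there to provide.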
Micro services
Search, RM, Package, etc.
We require all services to operate the same way
Consistent deployments
Consistent framework
Dark launch – decouples marketing and engineering
Turn new features on completely at least 24 hours ahead of an event
Turn on incrementally
Monitor
Use feature flags for back end changes
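Turning a feature on incrementally is often done by hashing each account into a stable bucket and comparing it with a rollout percentage. The sketch below shows that general technique only. It is not the VSTS feature flag service, and the function names, flag name, and account IDs are hypothetical.

```python
import hashlib


def bucket(flag, account_id):
    """Map (flag, account) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{account_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag, account_id, rollout_percent):
    """True if this account falls inside the current rollout percentage."""
    return bucket(flag, account_id) < rollout_percent


accounts = [f"account-{n}" for n in range(1000)]
for percent in (0, 10, 50, 100):
    enabled = sum(is_enabled("new-nav", a, percent) for a in accounts)
    print(f"{percent:3d}% rollout -> {enabled} of {len(accounts)} accounts")
```

Because the bucket depends only on the flag and the account, raising the percentage never switches the feature off for an account that already had it, and setting it to 100 ahead of an event turns the feature on for everyone.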
We only had TFS SU1 and SPS at the time
https://blogs.msdn.microsoft.com/bharry/2013/11/25/a-rough-patch/
Story about initial release – Twitter was a better monitor than what we had when we first announced the service at //Build in 2011
~60GB per day in 2015
We started with an on-prem product, TFS 2010
We refactored it in production
Shifted to
Thanks for being here. If you would like to learn more about our DevOps journey, go to aka.ms/devops.
Have a lot of fun today learning DevOps.
Now over to Marcel…
We adopted Scrum during the summer of 2010 after we had completed TFS 2010.
TFS 2010 was the first release that supported load balanced application tiers and collections. TFS had a consistent SQL component access layer for quite a while.
VMs – Azure PaaS web and worker roles
App tiers – serve web UI, web service endpoints
Job agents – background processing like scheduled builds, clean up, commit processing, etc.
DB – only metadata in SQL Azure
Blob – file data in Azure Storage
Collections are accounts
SU1 was originally the only scale unit…no incremental roll out when there’s only one!
Deployment mode is a mode in which every sproc call takes a reader lock on the schema, so the schema can’t be changed while the call is running
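The reader-lock idea in this note can be illustrated with an in-process analogue. The sketch below is plain Python threading, not the SQL mechanism the service uses: `SchemaLock` and its methods are hypothetical names, with "sproc calls" modeled as readers and a schema upgrade as the single writer.

```python
import threading
from contextlib import contextmanager


class SchemaLock:
    """Many concurrent readers (sproc calls) or one writer (schema change)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer:          # wait out an in-progress schema change
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()  # a waiting writer may now proceed

    @contextmanager
    def write(self):
        with self._cond:
            while self._writer or self._readers:  # wait for all calls to drain
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()
```

A call would run inside `with lock.read():` and an upgrade inside `with lock.write():`. This simple version lets a steady stream of readers starve the writer, so a real implementation would need some fairness policy.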
Single tenant = every account sign up created a new database in Azure North Central (Chicago)…got to 11,000 DBs
Added multi-tenancy in February 2012
Stats as of September 2017
Other Azure Services – DocumentDB, Data Lake, Service Fabric, Elasticsearch clusters, VMSS (Virtual Machine Scale Sets), AFD, CDN, Azure Traffic Manager, Azure Active Directory services, IaaS
Services = 31:
AEX
ALMSearch
artifact
AX
blobstore
clt
CodeLens
Coss
csstool
dataimport
devtestlabs
DrService
entreq
extmgmt
feeds
gov
kalypso
market
MMS
mps
msdnadmin
MySubscriptions
OrgSearch
pe
pkgs
Portal
ReleaseManagement
sh
sps
spsext
tfs
Key Takeaway:
--5 whys
-- define improvement for both code and process
-- visibility to ensure learnings are applied
Outcomes:
-- Improve how you respond (TTx)
-- Stop from happening again
Limit the impact of a problem
Degrade gracefully
Once problem is over, the service should self-recover quickly
“Release It! – Design and Deploy Production-Ready Software”
Michael T. Nygard
Netflix Hystrix – “Making the Netflix API More Resilient”
Ben Schmaus
https://github.com/Netflix/Hystrix/wiki
Binary deployments are faster: 3 minutes vs. 30 minutes