Our long road to… continuous improvement (DevOps Days Boston 2014)
1. Our long road to… continuous improvement
Kevin Amorin
2. BitSight Team
15 yrs in Enterprise & Startups
Enterprise Internal IT
IT Virtualization Software
SaaS low latency, high volume
SaaS Big Data, Analytics
team:
Issa Ashwash
Isaac Boehman
Sathya Ragavan
Pavel Sadikov
Kevin
3. starting the journey
it’s a journey not a destination
no silver bullet
“always make new mistakes”
do not re-invent wheel
continuous improvement
find and reduce the bottleneck (Lean)
10. infrastructure v1
what does this server do?
symptom: lost time on debugging/managing
products in data center
cause: organic growth, little planning
improvement: redesign infrastructure &
process for naming (host+subnet+app), access
(vpn), users, deployment
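A naming process like the one above can be enforced in code. The sketch below is only an illustration: the exact layout (`<role><index>.<subnet>.<app>`) and the names `make_name` and `parse_name` are assumptions, since the slides name the three components but not how they are combined.

```python
import re

# Hypothetical convention (not from the slides): <role><2-digit index>.<subnet>.<app>
# e.g. "web01.dmz.portal"
NAME_RE = re.compile(
    r"^(?P<host>[a-z]+\d{2})\.(?P<subnet>[a-z0-9]+)\.(?P<app>[a-z0-9]+)$"
)


def make_name(role, index, subnet, app):
    """Build a server name from its host role, subnet and application."""
    return f"{role}{index:02d}.{subnet}.{app}"


def parse_name(name):
    """Split a server name into its parts, rejecting names off-convention."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow convention: {name!r}")
    return match.groupdict()
```

With a rule like this, "what does this server do?" can be answered from the name alone, and a non-conforming name fails loudly at creation time.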
11. build v1
build is broken!
symptom: lost time on broken build debugging
cause: lack of uniform build environment,
committed code with minimal testing and review
improvement: centralized build & CI, branching,
pull requests, additional unit tests
12. deployment v1
that system doesn’t have my fix.
symptom: lost time debugging on wrong code
cause: lack of revisioned artifacts
improvement: central artifact repo, each
package tagged with branch, time, commit
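One way to stamp each package with branch, time and commit is to encode all three in the artifact file name. This is a minimal sketch; the name layout, the `.tar.gz` suffix and the function `artifact_name` are assumptions, not the scheme used in the talk.

```python
from datetime import datetime, timezone


def artifact_name(package, branch, commit, built_at):
    """Return a file name that carries branch, build time (UTC) and commit.

    built_at must be a timezone-aware datetime.
    """
    safe_branch = branch.replace("/", "-")  # keep the name path-safe
    stamp = built_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{package}-{safe_branch}-{stamp}-{commit[:7]}.tar.gz"
```

Given such a name, anyone debugging a system can tell which code it is running without guessing.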
13. communication v1
what new feature?
symptom: lost time with misinformation
cause: siloing of information
improvement: chat, single issue tracking &
process, representatives in standup &
retrospective
14. infrastructure v2
that system is missing a pip module.
symptom: lost time debugging misconfigured
application & systems
cause: lack of consistency of systems &
applications
improvement: config management,
monitoring & alerting
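The "missing pip module" case is the kind of drift a small consistency check can surface before anyone starts debugging. The sketch below is an assumption about what such a check might look like, not the monitoring used in the talk; `missing_modules` is a hypothetical helper and handles top-level module names only.

```python
import importlib.util


def missing_modules(required):
    """Return the top-level module names from `required` that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]
```

A monitoring job could run this against the list of modules an application declares and alert when the result is non-empty.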
15. build v2
ran in my local system…
symptom: code freeze branch would not run
end to end. lost time debugging which change
caused issue
cause: lack of regression/functional tests
improvement: functional tests, required to run
before merge
16. deployment v2
db is not correct?
symptom: database schema/data did not match
code
cause: inconsistent process & manual steps on
schema/data updates
improvement: db schema management tool +
process
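The slides do not say which schema management tool was chosen. As an illustration of the general idea (ordered, versioned migrations applied exactly once), here is a minimal sketch using SQLite; the table names, the `MIGRATIONS` list and the `migrate` function are all hypothetical.

```python
import sqlite3

# Hypothetical ordered migrations: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE companies ADD COLUMN rating INTEGER"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply every migration newer than the recorded schema version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0  # MAX() is NULL when no migration has run yet
    applied = []
    for version, sql in sorted(migrations):
        if version > current:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (version,)
            )
            applied.append(version)
    conn.commit()
    return applied
```

Because the applied version is stored in the database itself, running the same deployment step twice is harmless, which removes the manual steps named as the cause above.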
17. deployment v3
new server didn’t come up right?
symptom: lost time with misconfigured or failed
provisioned nodes in AWS
cause: inconsistent semi-automated provisioning
steps did not have the flexibility needed for a
growing product line
improvement: knife-bs provisioning & deployment
http://github.com/CBitLabs/knife-bs
18. provisioning research & design
Netflix Asgard: Web interface for application
deployments and cloud management in Amazon
Web Services (AWS)
Infochimps Ironfan: Chef orchestration layer --
your system diagram come to life
Chef Knife-ec2: plugin gives knife the ability to
create, bootstrap, and manage EC2 instances.
19. knife-bs
cloud provisioning tool built on top of
opscode/knife-ec2. Using a description of
your infrastructure and stacks (in either
YAML or JSON), knife-bs will build the
correct stack in the correct environment and
bootstrap Chef.
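To make the "description in, stack out" idea concrete, the sketch below expands a small JSON description into a list of nodes to provision. The schema shown here is invented for illustration and is not the knife-bs file format; `DESCRIPTION` and `plan_stack` are hypothetical names, and the real tool is at the URL given earlier.

```python
import json

# Hypothetical description; NOT the actual knife-bs schema.
DESCRIPTION = """
{
  "environments": {
    "staging": {"subnet": "stg"},
    "production": {"subnet": "prd"}
  },
  "stacks": {
    "web": {"nodes": [{"role": "app", "count": 2}, {"role": "db", "count": 1}]}
  }
}
"""


def plan_stack(description, stack, environment):
    """Expand one stack in one environment into the nodes to create."""
    doc = json.loads(description)
    env = doc["environments"][environment]
    plan = []
    for node in doc["stacks"][stack]["nodes"]:
        for i in range(1, node["count"] + 1):
            plan.append({
                "name": f"{node['role']}{i:02d}.{env['subnet']}.{stack}",
                "role": node["role"],
                "environment": environment,
            })
    return plan
```

Keeping the description declarative means the same file can drive staging and production, with only the environment argument changing.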
22. infrastructure v3
can I grab a cluster?
symptom: who is using what, what state is it in?
cause: lack of visibility of ownership & state of
application
improvement: infrastructure web UI which
overlays org meta-data
http://github.com/CBitLabs/atlas
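The overlay idea can be sketched as a join between what the cloud reports and what the organization knows. This is not how atlas is implemented; the field names (`id`, `owner`, `purpose`) and the function `overlay` are assumptions for illustration.

```python
def overlay(instances, metadata):
    """Attach owner and purpose to each instance record.

    instances: list of dicts with at least an "id" key (cloud inventory).
    metadata:  dict mapping instance id to {"owner": ..., "purpose": ...}.
    """
    merged = []
    for inst in instances:
        meta = metadata.get(inst["id"], {})
        merged.append({
            **inst,
            "owner": meta.get("owner", "unclaimed"),
            "purpose": meta.get("purpose", ""),
        })
    return merged
```

Instances that come back as "unclaimed" are the ones someone can safely ask about, which answers "can I grab a cluster?" without a round of chat messages.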