One of the most fundamental challenges of CI/CD is balancing quality, time, and cost. Amazon EC2 Container Service (ECS), along with Docker and Amazon EC2 Container Registry (ECR), has changed the game for many teams by making resource management very simple. For Okta, it has enabled the Continuous Integration team to maximize throughput while minimizing cost. In this session we will show you how Okta has created a flexible CI system with ECS, Docker, ECR, AWS Lambda, AWS CloudFormation, Amazon RDS, and Amazon SQS. Okta runs 30,000 tests with each developer commit and releases 10,000 new lines of code to production each week. The CI system, built 100% on AWS, must be able to handle this load while keeping cost under control. This talk is oriented toward developers looking to achieve efficient resource and cost management without compromising speed or quality.
2. Topics
• Who is Okta
• Okta Engineering: how do we work, how do we ship our code?
• The Challenge of the Developer Productivity Team
• A CI System with Amazon EC2 Container Service and Docker
3. Okta: Connect Everything
What We Do
• Connects all users, devices, applications, and organizations
• SSO, Adaptive MFA, Provisioning, Universal Directory, Mobility
• The broadest and deepest application network
• Leader: Okta (Magic Quadrant)
• Leader: Okta (Forrester Wave)
What We Believe
• We believe that connecting everything will make organizations more productive and more secure.
We Make Customers Successful
11. Okta Engineering: How Do We Work, How Do We Ship Our Code?
• 200 engineers, split into teams with embedded specialists
• 1-week sprints, with weekly deploys to production
• Capability to do more than one hotfix per day at customers’ request or for bugs found in CI or pre-prod
• Every merge to master is a potential release candidate
12. Okta Engineering: How Do We Test Our Code?
• Every topic branch goes through the same rigor in testing as a release candidate.
• Passing automated tests is enforced at commit time.
• Largest repo: 30K tests, takes 60 minutes (22 parallel runs)
• Smallest repo: 100 tests, 5 minutes
• The Developer Productivity team is responsible for supporting engineering.
13. Challenge of Developer Productivity Team
• Developer experience: developers expect fast turnaround time and reliable results.
• Quality: we need to run all the tests required to guarantee quality.
• Cost: we need to run an infrastructure that is as cost-effective as possible.
• Cloud First: we aim to use cloud services first, wherever possible.
21. Vision
• Clean testing environments: isolate test environments from one another, for both parallel and serial runs.
• Dynamic worker scaling: workers should survive the loss of their build server; the worker pool should scale quickly; the number of workers should not affect the memory footprint of the build server.
• Spot instances for cost: run our services at cheaper rates, as we have many short-lived tasks and could certainly handle a few failures.
• Versioned testing: enable testing of infrastructure changes in topic branches.
• Improved queuing system: should survive build server reboots; should not be tied to specific workers or build servers; centralized; should have good visibility; re-queuing of lost tasks.
• Less infrastructure flakiness: push testing and creation of test machines to developers.
• The correct privileges, to maintain security: launch tasks in secure environments.
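The queuing goals in the Vision slides (surviving build server reboots, not being tied to a specific worker, re-queuing lost tasks) resemble the visibility-timeout model of a centralized queue such as Amazon SQS, which the abstract lists among the services used. The deck does not show how the queue is actually implemented, so the following is only a minimal in-memory sketch of that idea. The class `TaskQueue` and all of its method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import collections


class TaskQueue:
    """Toy model of a centralized queue with a visibility timeout (hypothetical)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._pending = collections.deque()  # tasks waiting for a worker
        self._in_flight = {}                 # task -> deadline for the ack

    def put(self, task):
        self._pending.append(task)

    def receive(self, now):
        """Hand the next task to a worker, or return None if nothing is waiting."""
        self._requeue_expired(now)
        if not self._pending:
            return None
        task = self._pending.popleft()
        self._in_flight[task] = now + self.visibility_timeout
        return task

    def ack(self, task):
        """Called by the worker once the task has finished."""
        self._in_flight.pop(task, None)

    def _requeue_expired(self, now):
        # A task whose worker never acknowledged it (for example, a lost Spot
        # instance) becomes visible again once its deadline has passed.
        expired = [t for t, deadline in self._in_flight.items() if deadline <= now]
        for task in expired:
            del self._in_flight[task]
            self._pending.append(task)

    def stats(self):
        """Basic visibility into the queue state."""
        return {"pending": len(self._pending), "in_flight": len(self._in_flight)}
```

Because the queue state lives outside any build server or worker, a worker that disappears simply misses its deadline and the task is handed to another worker on a later `receive` call.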
30. EC2 Container Service and Docker
• Amazon Web Services + Java app tailored to Okta process
• Immutable and disposable build workers, created for one-time use and destroyed when the job is done
• Near ZERO cost on weekends, scales with load
• EC2 Container Service allows us to maximize usage of EC2 instances
• Same containers for multiple types and numbers of builds
• Same machine image can run multiple Docker images
33. Docker Update
• Update the Dockerfile, and our CI system builds the new image and uploads it to our repository
• Update the task definition so that the cluster picks up the new image
34. Dockerfile
FROM docker.aue1d.saasure.com/okta-base:2.0
MAINTAINER Okta
RUN useradd -d /home/container_user -m -s /bin/bash container_user
# Install wget, tar, hostname
RUN yum install -y wget tar hostname
# Install Java 8
RUN yum install -y java-1.8.0-oracle-1.8.0_31
RUN mkdir -p /opt/sage
RUN mkdir -p /var/log/sage
RUN chown container_user /var/log/sage
ADD conf/* /opt/sage/conf/
ADD core/target/core-*.jar /opt/sage/sage.jar
EXPOSE 8882 8883
USER container_user
CMD java $OKTA_SAGE_JAVA_ARGS -jar /opt/sage/sage.jar server /opt/sage/conf/sage.yml
35. Docker Security Conventions
Container repository
• Only allow containers from internal repository
Security scanning of containers: JFrog Xray
Process monitoring on Docker host: cAdvisor from Google
Secrets or any form of config NEVER baked into containers
Start from minimal, audited base OS
Run container as non-privileged user with user namespaces (Docker 1.10+)
Monitor alas.aws.amazon.com for critical updates
36. Docker Source Conventions
3 categories of container definitions
1. “Library” definitions used as the basis for building other images
2. Third-party service definitions e.g. Zookeeper or Elasticsearch
3. Internal service definitions
Repo per internal service
• Dockerfile in same repo => image versioned with code
• Docker Compose for running dependent services
• Pegged versions (no builds)
Single repo for library and third-party service definitions
37. Docker Build Conventions
Integration tests run against code running in container
Build owns creating the immutable version and publishing it to the artifact server
Strict rules around the “FROM” clause
• Must point at internal artifact server
• Must be tagged following SEMVER-SHORT_SHA convention
• Never allow a missing tag or use of the “latest” tag, to keep builds repeatable
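The “FROM” rules above lend themselves to an automated check at build time. The sketch below is one possible way to express them, not the check Okta actually runs. The registry host is taken from the Dockerfile slide; the exact shape of a SEMVER-SHORT_SHA tag (here `MAJOR.MINOR.PATCH-<7 to 12 hex characters>`) and the function name `check_from_lines` are assumptions made for illustration.

```python
import re

# Internal registry host as it appears in the Dockerfile slide.
INTERNAL_REGISTRY = "docker.aue1d.saasure.com"

# Assumed tag shape for SEMVER-SHORT_SHA, e.g. "2.0.1-3f9c2ab".
TAG_PATTERN = re.compile(r"^\d+\.\d+\.\d+-[0-9a-f]{7,12}$")


def check_from_lines(dockerfile_text):
    """Return a list of rule violations found in the FROM lines of a Dockerfile."""
    violations = []
    for number, raw in enumerate(dockerfile_text.splitlines(), start=1):
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip option flags such as --platform=... that may precede the image.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]

        # A colon only marks a tag if it appears in the last path segment;
        # otherwise it is a registry port.
        if ":" in image.rsplit("/", 1)[-1]:
            repo, tag = image.rsplit(":", 1)
        else:
            repo, tag = image, ""

        if not repo.startswith(INTERNAL_REGISTRY + "/"):
            violations.append(f"line {number}: {image} is not from the internal registry")
        if not tag:
            violations.append(f"line {number}: {image} has no tag")
        elif tag == "latest":
            violations.append(f"line {number}: {image} uses the 'latest' tag")
        elif not TAG_PATTERN.match(tag):
            violations.append(f"line {number}: tag '{tag}' is not SEMVER-SHORT_SHA")
    return violations
```

A build step could call `check_from_lines` on each Dockerfile and fail the build when the returned list is non-empty.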
41. Amazon EC2 Container Service Host Management
Userdata installs:
• Slave terminator – T-800
• Base Docker images (optional)
• Credentials – from S3
• Splunk Forwarder – logging
• Cluster target
• Cache – code and libs
42. Amazon EC2 Container Service
Identity and Access Management separation per service
• Either one service per cluster, or use the new Identity and Access Management for Elastic Container Service functionality
Sharing the Docker daemon to allow running Docker within Docker
Pre-fetching large data blobs and making them available on the hosts is an option
Multiple containers: MySQL, Redis, Kinesalite
45. Clean Testing Environments
• Docker images
• Nearly instant machine refresh
• Easy for users to create and upload images that have been tested to work locally
• Efficient machine use
• Amazon EC2 Container Service with EC2 Container Registry and private repository backend
48. Dynamic Worker Scaling
Lambda allocates jobs using bin packing
This is one of the changes we had to make in order to use EC2 Container Service for long-running tasks, rather than services spread across many stateless instances
Disconnects unneeded nodes from the cluster, allowing them to self-terminate when they are idle
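To illustrate why bin packing helps here, the sketch below packs jobs onto as few instances as possible, so that the remaining instances stay empty and can be disconnected. The deck only says "bin packing"; the specific heuristic (best-fit decreasing on memory), the `Instance` class, and the function names are assumptions for illustration, not Okta's Lambda code.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A container host with some free memory (hypothetical model)."""
    name: str
    free_memory_mb: int
    jobs: list = field(default_factory=list)


def pack_jobs(jobs, instances):
    """Place jobs (name -> memory in MB) using best-fit decreasing.

    Returns the names of jobs that did not fit anywhere.
    """
    unplaced = []
    # Largest jobs first, so big jobs are not blocked by scattered small ones.
    for job, needed in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [i for i in instances if i.free_memory_mb >= needed]
        if not candidates:
            unplaced.append(job)
            continue
        # Best fit: choose the instance with the least free memory that still
        # fits, which fills busy hosts before touching empty ones.
        target = min(candidates, key=lambda i: i.free_memory_mb)
        target.jobs.append(job)
        target.free_memory_mb -= needed
    return unplaced


def idle_instances(instances):
    """Instances with no jobs are candidates to leave the cluster."""
    return [i.name for i in instances if not i.jobs]
```

A spread-style placement would put work on every instance and keep all of them busy for the length of the longest job. Packing leaves whole instances empty, which is what makes it possible for idle nodes to terminate themselves.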
58. Versioned Jobs With EC2 Container Service
• Versioned build and test scripts can now be run in versioned Docker containers, using versioned task definitions
• Creates extreme flexibility
• CloudFormation allows us to stand up whole new clusters with all different versions in a matter of minutes for long-term testing
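As a rough sketch of what a versioned task definition could look like, the function below builds the dictionary for one pinned image. The top-level keys (`family`, `containerDefinitions`) and the container fields (`name`, `image`, `memory`, `cpu`, `essential`, `environment`) are fields of the ECS task definition format. The function name, the `-ci` family suffix, the `BUILD_SCRIPT_VERSION` variable, and the resource defaults are hypothetical.

```python
def build_task_definition(service, image_repo, image_tag, script_version,
                          memory_mb=2048, cpu_units=1024):
    """Build an ECS task definition dict that pins both image and script versions."""
    if not image_tag or image_tag == "latest":
        # Mirrors the build convention: unpinned images are not repeatable.
        raise ValueError("image tag must be pinned to an explicit version")
    return {
        "family": f"{service}-ci",
        "containerDefinitions": [
            {
                "name": service,
                "image": f"{image_repo}:{image_tag}",
                "memory": memory_mb,
                "cpu": cpu_units,
                "essential": True,
                # Hypothetical variable telling the container which version
                # of the build and test scripts to run.
                "environment": [
                    {"name": "BUILD_SCRIPT_VERSION", "value": script_version},
                ],
            }
        ],
    }
```

With boto3, a dictionary of this shape could be registered with `ecs_client.register_task_definition(**definition)`, which creates a new revision of the family each time, so every combination of image and script version stays addressable.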
59. EC2 Container Service + Docker Problems
• Docker containers not launching
• EC2 Container Service agent failing
• Docker containers stopping
• Incompatibility with certain services
• Docker OS availability
• Cleanup
• Image size
65. Expand Use
• Use EC2 Container Service for more services
• Allow developers to control their test suites and Docker images more directly
• Developer Environments
• Use Docker for local long-running services
• Use a VM running the same version OS
• Remote updates to keep it in line with CI
• Aim to enable running CI containers right out of the box
66. Result: Happy Engineering Team
• Developers can write more tests, faster.
• Happy devs, timely build/test status feedback.
• Happy quality team, all tests are run at each commit.
• Happy ops team, release candidate produced quickly.
• Happy management, infra budget is under control.
67. Thank You
Join us @Okta - www.okta.com/company/careers/
stackshare.io/okta/okta