The "Holy Grail" of Dev/Ops

A practical guide to what we’ve done at Cloud Posse
Prepared by Erik Osterman
Cloud Posse, LLC June 2017

Democratization of Information

About Me
● Former Director of Cloud Architecture, CBS Interactive in San Francisco
● Ran Operations for TV.com, Metacritic.com, and Clicker.com
● Worked with AWS since 2006 / Private Invite-only Beta
● Advise numerous successful venture backed startups
● Backend Software Developer, Open Source Advocate / Contributor
● Took ~2 years off to travel; visited ~30 countries

This Talk
● ~90 Minutes
● Q&A at the end
● Write questions in the chat
● Actionable, practical advice
● Collection of our “Best Practices”

Best Practices
(my) definition: An opinionated & proven strategy with specific tactics to help
achieve the objectives for some overarching goal.

Emulate Giants
Netflix
Google
Spotify
Twitter

Our Best Practices
Organizational
Software Development, CI/CD, Testing, QA
Infrastructure, Automation, Orchestration
Logging, Monitoring, Alerting, Escalation, Remediation
Security

The Organization
...it all starts here

Realize we’re different.
Managers vs Makers - We work differently
(Paul Graham - YCombinator Founder)
Makers plan in half-day blocks of time
Managers plan to minimize empty 15 minute slots in their calendar
Interrupts are costly for developers and therefore the business

HumanOps (i.e. not cyborgs)
Humans get tired and stressed, they feel happy and sad.
Human issues are system issues.
Human health impacts business health.
Humans need to switch off and on again (aka sleep).
Humans build and fix systems.
Humans > systems
http://www.humanops.com/

Right Tools for the Job
Email == external communication
(not tasks, threaded conversations, cat pics)
Slack == all internal communications; channels for topics #dogs
Quip == all documentation for transparency
(engineering & business)
Zoom == reliable cross-platform conferencing
Asana == issue tracking

Technical Debt is Real
Tradeoffs are inevitable. Pay the tax now or later.
Later usually means bankruptcy & software rewrites
Includes upgrades, refactoring, optimizations, etc
It’s anything that doesn’t move the product forward
But it will hold the product back
This is not just a software problem.
It’s a business problem too.
...and unavoidable

Software Development
Cloud Native Design - the “12 Factor” Pattern
Stable Code Requires Feature Branching / Pull Requests / Code Reviews
Versioning / Version Pinning
Logging
Local Development Environments

Some Bad Practices
Cowboy Coding, committing to master
Hardcoding secrets, hostnames, paths, etc
“Clever” code is often “complicated” code
Writing un-greppable code, terse variable names,
Inconsistent naming conventions, long functions, and………… you get the point.
Using tabs :P

Some Good Ones….
Strict Linting (e.g. eslint, go lint)
Semantic Versioning (semver)
.editorconfig (tabs or spaces? http://editorconfig.org/)
Seed project repositories
CHANGELOG.md
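To make the semver bullet concrete, here is a minimal sketch of parsing and comparing MAJOR.MINOR.PATCH versions. The names `parse_semver` and `is_breaking_upgrade` are hypothetical, and pre-release and build metadata are deliberately left out.

```python
import re
from typing import Tuple

# Matches plain MAJOR.MINOR.PATCH with an optional leading "v".
# Pre-release and build metadata are out of scope for this sketch.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch) as integers so tuples compare numerically."""
    match = _SEMVER.match(version.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_breaking_upgrade(current: str, candidate: str) -> bool:
    """True when the candidate bumps the major version (hypothetical helper)."""
    return parse_semver(candidate)[0] > parse_semver(current)[0]
```

Comparing the parsed tuples rather than the raw strings avoids the classic trap where "1.10.0" sorts before "1.9.9".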

Best Practice: Open Source Pattern*
Leads to much cleaner code with fewer proprietary dependencies
Fewer proprietary dependencies makes it more reusable across projects
If you decide to release it, it demonstrates the kind of engineering you do
It works because the developer’s ego is on the line to write stuff that doesn’t suck
Pro tip: follow the conventions of your favorite framework or package system
* Does not require that the organization releases code as open source

Best Practice: README.md & CHANGELOG.md
Use well-formed Markdown syntax (.md)
Write “README” files on all your projects. Explain the purpose of the project
Show how to get started and where to look for more information
Document breaking changes & upgrade path in CHANGELOG.md
Pro tip: Use a markdown editor if you’re not familiar with the syntax

Best Practice: Use Makefiles
Provide targets for common usage
E.g. deps, build, run, clean
Include them with all repos
Document each target's purpose (##)

Makefile Example
-include .secrets
DB_HOST ?= localhost
# none of these targets produce a file with the same name
.PHONY: build run test
## build a docker image
build:
	docker build -t cloudposse/test .
## run container
run:
	docker run -v $$(pwd):/app \
		-e DB_HOST=$(DB_HOST) \
		-e DB_PASS=$(DB_PASS) \
		-p 8080:80 \
		cloudposse/test
## test
test:
	curl http://localhost:8080/
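One payoff of the `##` convention is that target descriptions become machine readable. The sketch below is an illustration only (the function name `documented_targets` is hypothetical, and it assumes each description sits on the line directly above its target); it extracts a help listing from Makefile text.

```python
import re
from typing import List, Tuple

# A target line looks like "name:"; the negative lookahead skips ":=" assignments.
TARGET = re.compile(r"^([A-Za-z0-9_.-]+):(?!=)")


def documented_targets(makefile_text: str) -> List[Tuple[str, str]]:
    """Pair each target with the `## ...` comment on the line directly above it."""
    targets = []
    previous = ""
    for line in makefile_text.splitlines():
        match = TARGET.match(line)
        if match and previous.startswith("##"):
            targets.append((match.group(1), previous[2:].strip()))
        previous = line
    return targets
```

Undocumented targets are skipped on purpose, which nudges authors to describe anything they expect others to run.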

Best Practice: Local Dev Environments
Onboarding new hires should take minutes not hours
Use fully automated local dev environments
Use same Docker images that will run in staging/production
Bind-mount local volumes to speed up iterations for “live editing”
Pro Tip: Use docker-compose rather than Vagrant, which is too heavy

Best Practice: Developers write Dockerfiles
Always use alpine:3.5 Base images (be wary of unofficial images)
Declare all ENV in Dockerfile (like function arguments to an OS)
Write as few layers as possible (chain with && )
Version Pin Everything
Use 2-stage build process for thin images (C/C++, Golang)

Best Practice: Branch Protection
Essential for security and stability of your codebase
Require PR approval to merge to master
Force branches to be up-to-date
Disallow commits to master
Restrict to squash+merge

Best Practice: Pull Requests
The smaller the better; implement exactly 1 feature
Milestones
Use Labels:
Define PULL_REQUEST_TEMPLATE (## what, ## why, ## dependencies)
Use checkboxes for TODOs
….for clean commit histories in master

What a PR should look like....

Best Practice: Follow PRs with Trailer
http://ptsochantaris.github.io/trailer/

Best Practice: Application Logging
Use JSON structured log events
Libraries will efficiently generate/parse
Human readable, highly consistent
Pro tip: use Sentry to aggregate errors+warnings and log them in issue tracker
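As an illustration of JSON structured log events, here is a minimal sketch built on Python's standard `logging` module. The class name `JsonFormatter` and the chosen field names are assumptions for the example, not a prescribed schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same event so it stays parseable.
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON events to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `build_logger("checkout").info("order %s placed", 42)` would emit one JSON object per event, which log aggregators can index without custom parsing rules.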

Best Practice: Pair Programming
Lose: speed (arguably)
Gain: fewer bugs, business continuity, education, team building/camaraderie
When: implementing complicated features, onboarding, and triaging
Pro tip: Use tmate for instant terminal sharing (https://tmate.io/)

QA
Developers with a focus on test automation
Quality Control
Masters of CI/CD

Best Practice: Bug Blowouts
Set aside 1 day per week to dog food your own app
Prepare test scripts (aka flows) for everyone to follow
Get everyone on board, not just QA.
That means developers, graphic artists, customer support, etc
Monitor logs, submit bugs immediately to issue tracker

Best Practice: Synthetic Testing
Continuous Testing of Critical User Paths
Uses Browser to Automate Tests of Production
Ensure User Registrations, Password Resets, Shopping Carts, and Checkout
work 100% of the time
Pro Tip: Check out Selenium or PhantomJS

Cloud Native Design
Service-Oriented Architectures (SOA)
Single-purpose Services (aka micro services)
Connected through APIs
Highly Decoupled
12 Factor Pattern

“12 Factor” in a Nutshell
Use Environment Variables for all configuration
(credentials, ports, tuning parameters, etc)
Use Backing Services for everything durable
Write all services as stateless & disposable
Automate all admin tasks
(the rest is meh)
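The first point, configuration through environment variables, can be sketched as follows. `DB_HOST` and `DB_PASS` reuse the names from the Makefile example; `PORT`, the `Settings` class and `load_settings` are assumptions made for this illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_pass: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast on a missing secret."""
    if "DB_PASS" not in env:
        raise RuntimeError("DB_PASS must be set in the environment")
    return Settings(
        db_host=env.get("DB_HOST", "localhost"),  # safe default for local dev
        db_pass=env["DB_PASS"],                    # no default: never hardcode secrets
        port=int(env.get("PORT", "8080")),
    )
```

Because nothing is read from files baked into the image, the same container can run in dev, staging and production with only its environment changed.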

Best Practice: X509 Client Certificates
Use CA to Sign SSL Certificates that perform certain functions
Automatic transport & endpoint security for APIs
Highly scalable - no API requests to validate tokens
Don’t Rely on API tokens which are costly to authenticate and don’t secure the
transport layer
Examples: Kubernetes APIs, etcd
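On the server side, requiring a CA-signed client certificate might look like the sketch below, which uses Python's standard `ssl` module. The function name and the three file-path parameters are hypothetical; they are optional here only so the context can be built before any certificates exist.

```python
import ssl
from typing import Optional


def mutual_tls_server_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Server-side TLS context that rejects clients without a trusted certificate."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.verify_mode = ssl.CERT_REQUIRED  # the client must present a certificate
    if ca_file:
        # Trust only certificates signed by our own CA.
        context.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own identity, presented to connecting clients.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

Authentication then happens during the TLS handshake itself, so the service needs no extra round trip to validate a token on each request.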

CI/CD
Frequency reduces Difficulty. The more you deploy, the easier it gets.
Latency between check-in and production is risky. It’s like HFT.
Faster delivery improves software development practices
Consistency improves confidence

Best Practice: Safe Schema Migrations
Ensure applications support same backend schema for adjacent releases
Use feature flags to enable new features of backend schemas
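A rough sketch of what "supporting the same schema for adjacent releases" can mean in application code is shown below. The column names (`first_name`, `last_name`, `full_name`) and the flag are hypothetical; the point is only that the reader tolerates both the old and the new row shape.

```python
from typing import Any, Mapping


def display_name(row: Mapping[str, Any], use_full_name: bool) -> str:
    """Read a user's name in a way that works before and after the migration.

    `use_full_name` stands in for a feature flag that is switched on only
    once the new `full_name` column has been populated.
    """
    if use_full_name and row.get("full_name"):
        return row["full_name"]
    # Fall back to the old columns, which both adjacent releases still understand.
    return f'{row["first_name"]} {row["last_name"]}'
```

With this shape, rolling back a release never requires rolling back the database, since the previous release can still read every row.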

Best Practice: Use a Build Harness
Write terse .travis.yml, circle.yml, Jenkinsfile
Use the same targets in all projects
Use Makefile to automate build, test
Clone harness repo after git checkout
Example: https://github.com/cloudposse/build-harness

Best Practice: Liberal Tagging
Tag all docker images with multiple tags, in addition to release tags
Let $ref = {branch|tag}
Then, tag
$ref
$ref-$build
$git_hash
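The tagging scheme above can be sketched as a small helper. The function name `image_tags` is hypothetical, and replacing "/" in branch names is an added assumption, since Docker tags cannot contain slashes.

```python
from typing import List


def image_tags(ref: str, build: int, git_hash: str) -> List[str]:
    """Return the three tags from the scheme: $ref, $ref-$build, $git_hash."""
    safe_ref = ref.replace("/", "-")  # e.g. "feature/login" is not a valid tag
    return [safe_ref, f"{safe_ref}-{build}", git_hash]
```

A CI job could loop over the returned list and run `docker tag` and `docker push` once per entry, so the same image is reachable by branch, by build and by commit.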

Dev/Ops is not…
a) A dedicated team within the organization
b) A job title
c) A sysadmin
d) A skill
e) All of the above

What Dev/Ops actually is...
A cross-disciplinary engineering culture
Infrastructure is Code
Automation over toil
A path towards “Serverless” (but we’re still far away!)
Site Reliability Engineering (“SRE”)

Infrastructure as Code
Infrastructure is now 100% API driven
“Best Practices” of Development → Infrastructure
Versioned Infrastructure
Automated Remediations

Best Practice: Automated Orchestration
Use Terraform to fully orchestrate environments
(e.g. DNS, instances, volumes, AutoScaling Groups, Load Balancers, Databases)
S3 remote backends to store state for collaboration and backups
Use modules to encapsulate business logic for consistency / manageability
Version pin modules and dependencies to ensure stability

Best Practice: Tools as Containers
Only local dependency should be docker and maybe make =)
Distribute all other local development tools or dependencies as containers
(e.g. terraform, aws, kops, helm, etc...)
Easier to standardize on one OS
Example: https://github.com/cloudposse/geodesic/

Best Practice: 100% Isolation
Use (1) AWS Account per Stage (E.g. production, staging, dev)
Use (1) VPC per Cluster
Use (1) Dedicated TLD per AWS Account
(e.g. foobar.com, foobar.qa, foobar.org)
Use (1) Single Process Containers for all Apps

Best Practice: Identical Environments
Environments should only differ in size, not shape
“Production”, “Staging”, “Dev” are only labels
Run as many parallel environments as we need
Only manual action is initiating build
E.g. other labels: pentest, loadtest, erik
Pro tip: each environment gets its own DNS zone (e.g. erik.cloudposse.org)

What We Want
Reliable - we want things to be online 100% of the time and when things go
wrong, we want them to auto-heal.
Fast - we want to run a site that can scale horizontally as traffic increases
Easy - we shouldn't need rocket scientists to operate it on a day-to-day basis
Affordable - we want it to be easy and cost effective to maintain in the long run
Maintainable - we want to have a development or staging environment that is
identical to production, so we can efficiently work on new versions of the site
without it affecting production
Secure - we don't want to get hacked

Technically, we need this… “Everything”
Horizontal Auto Scaling, Auto Healing, Auto DNS, Auto SSL
Automated deployments and rollbacks, Versioned History
Service Discovery & Load Balancing
Batch Job, Scheduled Job Execution
Storage/Volume Orchestration
...out of the box

Best Practice: Use Kubernetes (sometimes)
Ideally suited for microservices architectures, larger engineering teams
“Infrastructure as Code” - write documents that describe your microservices
(Pods ~ VMs, ReplicaSets ~ clusters, Services ~ Load Balancers)
Comes with Everything out-of-the-box
Cons: more complex to get started, difficult to triage issues, requires SME
Pro tip: Use kops to spin up clusters automatically in AWS and GCE

Best Practice: Use Elastic Beanstalk
Ideally suited for monolithic architectures
Comes with almost Everything out-of-the-box
Supports instances inside private VPC with root SSH access
Formal process for promoting code to production / automatic rollbacks
Pro tip: Use terraform to spin up beanstalk clusters automatically in AWS

Configuration Management
Immutable vs Mutable
Declarative vs Imperative
“WYSIWYG”

Best Practice: Immutable Containers/AMIs
Like “Burning” a copy of your code in an image
Easy to know exactly what is running
Fast to deploy and rollback
Use Docker containers for applications
Use something like CoreOS for underlying host (~dom0)

Best Practice: Imperative Infrastructure
“Give me a load balancer, 2 filesystems, 2 GB ram, 4 CPUs, 4 instances”
There’s no guess work about what is output
Compatible with legacy architectures
There’s less magic

Monitoring
Application - Synthetic Testing
Infrastructure
Real-User Monitoring (RUM)
SLI
Systems don't have feelings. They only have SLAs.

Best Practice: Team Dashboards
Display Service Level Indicators (~ KPIs) relevant for specific teams
Create dashboards for specific services like Kafka and Zookeeper
First place to look when triaging issues
Pro tip: Use Datadog dashboards with namespace filtering on clusters

Alerting
Alert Fatigue == Human Fatigue
Dashboards > Alerts > Email
Human health impacts business health.
Budgets
Metrics driven; not log events
Alerts need to be actionable - with links to documentation

Best Practice: Actionable Alerts

Escalation & Remediation
Automate as much as possible, escalate to a human as a last resort.
KPI~SLI / SLO / SLA
On-call Engineers
PagerDuty - Manage Calendars and Phone/SMS Escalations

Best Practice: #OCE Slack Channel
One channel to reach engineers
Searchable history of events and conversations
Use topic to announce who is on-call
Linked Google Calendar with Relevant Events (E.g. Customer Demo Calendar)

Best Practice: Post-Mortems
Kill the shame game. Human issues are system issues.
5 Whys - Root Cause Analysis (“RCA”)
Use Consistent Template (KISS)
Weekly Retrospectives with past OCEs and Stakeholders
Documented in Quip → Instantly Searchable
Pro Tip: Check out how Google does it:
https://landing.google.com/sre/book/chapters/postmortem-culture.html

Security
100% Security Cannot Be Achieved
Assume systems are insecure
Devalue credentials with MFA

What not to do...
1. Store secrets in git repository
2. Hardcode secrets in configurations
3. Write them in plain-text
4. Manually distribute them
5. Reuse/share keys across users and apps
6. Build homegrown systems to protect secrets
(* unless you’re Netflix, Hashicorp or Google)
...but you already knew that!

Best Practice: Beyond Corp Model
Enterprise zero-trust security model used by Google
Shift access controls from the network perimeter to individual devices/users
Allow employees to work more securely from any location
Do not rely on traditional VPNs

Best Practice: Identity-Aware Proxy (IAP)
Protect internal services using an IAP
Integrates cleanly with your SSO provider
MFA
Pro tip: Use the Bitly OAuth2 Proxy to add auth layer to any service

Best Practice: Bastion Host
Centralized point for accessing systems
Session logs, Slack Login Notifications
Require MFA to authenticate
Disable proxy mode and TCP socket forwarding
Use bastion only for triage, not administration (because that’s scripted!)
Pro Tip: Use Duo Push Notifications + Geofencing

Best Practice: Login Justifications

Best Practice: SSH Key Management
2 options - Github Public Key API or Signed Certificates
● You can’t protect the private key
● You can add multiple factors (a.k.a. MFA)
● Our Solution
○ Use Github Public Key API to distribute public keys
https://github.com/cloudposse/github-authorized-keys
○ Use Duo for MFA Push Notifications + Geofencing
https://github.com/cloudposse/bastion
Pro tip: Check out BLESS by Netflix
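To illustrate the GitHub public key approach, the sketch below turns the JSON returned by GitHub's `GET /users/{username}/keys` endpoint into `authorized_keys` lines. It is not how github-authorized-keys is implemented; the function names and the trailing comment format are assumptions, and the HTTP request itself is left out.

```python
import json
from typing import List


def keys_url(username: str) -> str:
    """URL of GitHub's public endpoint listing a user's SSH public keys."""
    return f"https://api.github.com/users/{username}/keys"


def authorized_keys_lines(api_response: str, username: str) -> List[str]:
    """Convert the endpoint's JSON list into lines for an authorized_keys file.

    Each entry in the response carries the public key under "key"; the
    username is appended as a comment so keys are easy to audit later.
    """
    return [f'{entry["key"]} {username}@github' for entry in json.loads(api_response)]
```

Only public keys ever leave GitHub this way, which is why the second factor still matters: the server cannot know how well the matching private key is protected.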

Duo Slack Integration and Dashboard

Best Practice: SSM Scripted Remediations
Use SSM to execute commands in parallel across machines
(don’t use parallel ssh since that is harder to audit)
Full audit logs of command and output
Use IAM roles to restrict execution
Pro tip: use the aws cli to trigger remediations on the command line

Best Practice: Federated Accounts
Reduce the blast radius when things explode
Use one account per environment: dev, staging, production
Use one account for billing aggregation, IAM federation
Assumed Roles (e.g. read-only, admin, dba)
MFA required to assume roles - to devalue credentials
Pro Tip: Use STS API with MFA to generate short lived AWS credentials
Example: https://github.com/cloudposse/aws-assumed-role

Best Practice: AWS Secrets (Client-side)
Client Side (e.g. Terraform, AWS Cli)
● IAM User Account Access Keys (never shared!)
● Access Keys only permit Assume Role+MFA
● Assumed Roles (limit scope)
● Temporary Session Tokens with STS (expire after 1 hour)
● MFA (devalue credentials)
Solution: https://github.com/cloudposse/aws-assumed-role
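The flow above (access key, then assume role with MFA, then a one-hour session) maps onto a single AWS CLI call. This sketch only assembles the argument list for `aws sts assume-role`; it does not describe how aws-assumed-role works, and the function name and all argument values in the usage note are placeholders.

```python
from typing import List


def assume_role_command(
    role_arn: str,
    session_name: str,
    mfa_serial: str,
    mfa_code: str,
    duration_seconds: int = 3600,
) -> List[str]:
    """Build the `aws sts assume-role` invocation for an MFA-protected role."""
    return [
        "aws", "sts", "assume-role",
        "--role-arn", role_arn,
        "--role-session-name", session_name,
        "--serial-number", mfa_serial,   # ARN or serial of the MFA device
        "--token-code", mfa_code,        # current one-time code from that device
        "--duration-seconds", str(duration_seconds),
    ]
```

Passing the list to `subprocess.run(..., capture_output=True)` would return JSON containing temporary credentials, so the long-lived access key is never used for anything except requesting a session.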

Best Practice: AWS Secrets (Server-side)
Dynamic, Auto Rotating Credentials for Server Applications
Never ever hardcode AWS credentials on EC2 instances
Server Side (e.g. EC2 Instance, Docker Container)
● IAM Instance Profiles with Assumed Roles
● Use Kube2IAM with Kubernetes (kops)
https://github.com/cloudposse/charts/tree/master/incubator/kube2iam-kops
○ Temporary AWS credentials
○ Drop-in compatibility with all official AWS client libraries

Best Practice: Bootstrap Secrets
Secrets you need to provision new clusters on AWS...
● Run terraform inside of Container
● Private S3 Configuration Bucket
● Encrypted Bucket Objects
● Mount S3 Bucket inside container (S3FS)
● Use /dev/shm for caching
Geodesic: https://github.com/cloudposse/geodesic

Best Practice: Password Managers
Store Organizational Secrets in Password Manager
(webhook urls, master account credentials, shared MFA)
Use Vaults specific to some shared objective (e.g. team)
Require MFA for decryption
Avoid Shared Credentials as much as possible (this is a last resort)
SSO > Shared Passwords
Pro tip: Use 1Password for Teams. Abandon all other password managers.

Best Practice: Avoid Password Rules
They don't work
They frustrate average users
They penalize people who use real random password generators
They are often computationally weaker → vulnerable to brute force attacks
https://blog.codinghorror.com/password-rules-are-bullshit/
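The "computationally weaker" claim can be checked by counting. The sketch below assumes a typical composition rule (at least one lowercase letter, uppercase letter, digit and punctuation mark) over the 94 printable ASCII characters, and uses inclusion-exclusion to count how many passwords of a given length survive the rule. The function names are hypothetical.

```python
from itertools import combinations
from math import log2

# Sizes of the four character classes: lower, upper, digits, ASCII punctuation.
CLASS_SIZES = (26, 26, 10, 32)


def unrestricted(length: int) -> int:
    """Number of passwords of this length with no composition rule."""
    return sum(CLASS_SIZES) ** length


def with_every_class(length: int) -> int:
    """Number of passwords containing at least one character from every class."""
    alphabet = sum(CLASS_SIZES)
    total = 0
    for k in range(len(CLASS_SIZES) + 1):
        # Inclusion-exclusion over every set of k classes that is left out.
        for excluded in combinations(CLASS_SIZES, k):
            total += (-1) ** k * (alphabet - sum(excluded)) ** length
    return total


def bits_lost(length: int) -> float:
    """Entropy, in bits, removed by the rule for uniformly random passwords."""
    return log2(unrestricted(length)) - log2(with_every_class(length))
```

Any rule that forbids otherwise valid passwords shrinks the space an attacker has to search, so `bits_lost` is positive for every length where the rule can be satisfied.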

The Bible
https://landing.google.com/sre/book.html

__EOF__
Erik Osterman, Founder
Cloud Posse, LLC
hello@cloudposse.com
https://cloudposse.com/
https://github.com/cloudposse/
