4. About Me
● Former Director of Cloud Architecture, CBS Interactive in San Francisco
● Ran Operations for TV.com, Metacritic.com, and Clicker.com
● Worked with AWS since 2006 / Private Invite-only Beta
● Advise numerous successful venture backed startups
● Backend Software Developer, Open Source Advocate / Contributor
● Took ~2 years off to travel; visited ~30 countries
5. This Talk
● ~90 Minutes
● Q&A at the end
● Write questions in the chat
● Actionable, practical advice
● Collection of our “Best Practices”
6. Best Practices
(my) definition: An opinionated & proven strategy with specific tactics to help
achieve the objectives for some overarching goal.
10. Realize we’re different.
Managers vs Makers - We work differently
(Paul Graham - YCombinator Founder)
Makers plan in half-day blocks of time
Managers plan to minimize empty 15 minute slots in their calendar
Interrupts are costly for developers and therefore the business
11. HumanOps (i.e. not cyborgs)
Humans get tired and stressed, they feel happy and sad.
Human issues are system issues.
Human health impacts business health.
Humans need to switch off and on again (aka sleep).
Humans build and fix systems.
Humans > systems
http://www.humanops.com/
12. Right Tools for the Job
Email == external communication
(not tasks, threaded conversations, cat pics)
Slack == all internal communications; channels for topics #dogs
Quip == all documentation for transparency
(engineering & business)
Zoom == reliable cross-platform conferencing
Asana == issue tracking
13. Technical Debt is Real
Tradeoffs are inevitable. Pay the tax now or later.
Later usually means bankruptcy & software rewrites
Includes upgrades, refactoring, optimizations, etc
It’s anything that doesn’t move the product forward
But it will hold the product back
This is not just a software problem.
It’s a business problem too.
...and unavoidable
15. Software Development
Cloud Native Design - the “12 Factor” Pattern
Stable Code Requires Feature Branching / Pull Requests / Code Reviews
Versioning / Version Pinning
Logging
Local Development Environments
16. Some Bad Practices
Cowboy Coding, committing to master
Hardcoding secrets, hostnames, paths, etc
“Clever” code is often “complicated” code
Writing un-greppable code, terse variable names,
Inconsistent naming conventions, long functions, and………… you get the point.
Using tabs :P
17. Some Good Ones….
Strict Linting (e.g. eslint, golint)
Semantic Versioning (semver)
.editorconfig (tabs or spaces? http://editorconfig.org/)
Seed project repositories
CHANGELOG.md
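As a rough illustration of why semver is useful for version pinning, here is a minimal Python sketch that parses MAJOR.MINOR.PATCH strings and flags a major-version bump. The function names are hypothetical, the optional leading "v" is an assumption about how release tags are written, and pre-release and build metadata are deliberately ignored.

```python
import re
from typing import Tuple

# Matches plain MAJOR.MINOR.PATCH, with an optional leading "v" (tag-style assumption).
SEMVER_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '1.4.2' or 'v1.4.2' into a tuple that compares numerically."""
    match = SEMVER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_breaking_upgrade(current: str, candidate: str) -> bool:
    """A higher MAJOR number signals breaking changes under semver."""
    return parse_version(candidate)[0] > parse_version(current)[0]
```

Comparing the parsed tuples (rather than the raw strings) is what makes 1.10.0 sort after 1.9.3.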
18. Best Practice: Open Source Pattern*
Leads to much cleaner code with fewer proprietary dependencies
Fewer proprietary dependencies makes it more reusable across projects
If you decide to release, it demonstrates the kind of engineering you do
It works because the developer’s ego is on the line to write stuff that doesn’t suck
Pro tip: follow the conventions of your favorite framework or package system
* Does not require that organization releases code as open source
19. Best Practice: README.md & CHANGELOG.md
Use well-formed Markdown syntax (.md)
Write “README” files on all your projects. Explain the purpose of the project
Show how to get started and where to look for more information
Document breaking changes & upgrade path in CHANGELOG.md
Pro tip: Use a markdown editor if you’re not familiar with the syntax
20. Best Practice: Use Makefiles
Provide targets for common usage
E.g. deps, build, run, clean
Include them with all repos
Document each target’s purpose (##)
21. Makefile Example
-include .secrets
DB_HOST ?= localhost

## build a docker image
build:
	docker build -t cloudposse/test .

## run container
run:
	docker run -v $$(pwd):/app \
		-e DB_HOST=$(DB_HOST) \
		-e DB_PASS=$(DB_PASS) \
		-p 8080:80 \
		cloudposse/test

## test
test:
	curl http://localhost:8080/
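The `##` comments are only useful if something reads them. As a sketch of one possible approach (not the deck's own tooling), the Python below collects each `##` comment that sits directly above a target and maps it to that target's name, which is enough to print a help listing. The function name and the assumption that the comment is on the line immediately before the target are illustrative.

```python
import re
from typing import Dict, Optional

# A target line looks like "name:" but not a ":=" variable assignment.
TARGET_RE = re.compile(r"^([A-Za-z0-9_.-]+):(?!=)")


def documented_targets(makefile_text: str) -> Dict[str, str]:
    """Map each target to the '## ' comment on the line directly above it."""
    docs: Dict[str, str] = {}
    pending: Optional[str] = None
    for line in makefile_text.splitlines():
        if line.startswith("## "):
            pending = line[3:].strip()
            continue
        match = TARGET_RE.match(line)
        if match and pending is not None:
            docs[match.group(1)] = pending
        # Any other line breaks the comment/target pairing.
        pending = None
    return docs
```

Printing `documented_targets(open("Makefile").read())` would then list the documented targets with their descriptions.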
22. Best Practice: Local Dev Environments
Onboarding new hires should take minutes not hours
Use fully automated local dev environments
Use same Docker images that will run in staging/production
Bind-mount local volumes to speed up iterations for “live editing”
Pro Tip: Use docker-compose rather than Vagrant, which is too heavy
23. Best Practice: Developers write Dockerfiles
Always use alpine:3.5 Base images (be wary of unofficial images)
Declare all ENV in Dockerfile (like function arguments to an OS)
Write as few layers as possible (chain with && )
Version Pin Everything
Use 2-stage build process for thin images (C/C++, Golang)
24. Best Practice: Branch Protection
Essential for security and stability of your codebase
Require PR approval to merge to master
Force branches to be up-to-date
Disallow commits to master
Restrict to squash+merge
26. Best Practice: Pull Requests
The smaller the better; implement exactly 1 feature
Milestones
Use Labels:
Define PULL_REQUEST_TEMPLATE (## what, ## why, ## dependencies)
Use checkboxes for TODOs
….for clean commit histories in master
29. Best Practice: Application Logging
Use JSON structured log events
Libraries will efficiently generate/parse
Human readable, highly consistent
Pro tip: use Sentry to aggregate errors+warnings and log them in issue tracker
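A minimal sketch of JSON structured logging using only Python's standard `logging` module. The class and function names are hypothetical, and passing extra fields under a `context` key is a convention assumed for this example, not something the logging module prescribes.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied fields passed via extra={"context": {...}}.
        event.update(getattr(record, "context", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON events to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Usage: `build_logger("app").info("user registered", extra={"context": {"user_id": 42}})` emits a single JSON line that includes `message`, `level`, and `user_id`.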
31. Best Practice: Pair Programming
Lose: speed (arguably)
Gain: fewer bugs, business continuity, education, team building/camaraderie
When: implementing complicated features, onboarding, and triaging
Pro tip: Use tmate for instant terminal sharing (https://tmate.io/)
33. Best Practice: Bug Blowouts
Set aside 1 day per week to dog food your own app
Prepare test scripts (aka flows) for everyone to follow
Get everyone on board, not just QA.
That means developers, graphic artists, customer support, etc
Monitor logs, submit bugs immediately to issue tracker
34. Best Practice: Synthetic Testing
Continuous Testing of Critical User Paths
Uses Browser to Automate Tests of Production
Ensure User Registrations, Password Resets, Shopping Carts, and Checkout
work 100% of the time
Pro Tip: Check out Selenium or PhantomJS
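The slide describes browser-driven tests. As a much simpler HTTP-level stand-in (not a replacement for Selenium), the Python sketch below fetches a set of critical paths and reports any that fail or lack expected text. All names, paths, and expected strings are hypothetical; the fetch function is injectable so the check logic can be exercised without a network.

```python
import urllib.request
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(url: str) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def failed_checks(
    base_url: str, expectations: Dict[str, str], fetch: Fetch = http_fetch
) -> List[str]:
    """Return one message per critical path that did not respond as expected."""
    failures: List[str] = []
    for path, expected_text in expectations.items():
        url = base_url.rstrip("/") + path
        try:
            status, body = fetch(url)
        except Exception as exc:  # network errors and HTTP errors both count as failures
            failures.append(f"{url}: request failed ({exc!r})")
            continue
        if status != 200 or expected_text not in body:
            failures.append(f"{url}: status={status}, expected text present={expected_text in body}")
    return failures
```

Run on a schedule against production, an empty result means every listed path answered with HTTP 200 and the expected text.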
36. “12 Factor” in a Nutshell
Use Environment Variables for all configuration
(credentials, ports, tuning parameters, etc)
Use Backing Services for everything durable
Write all services as stateless & disposable
Automate all admin tasks
(the rest is meh)
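A minimal sketch of the "environment variables for all configuration" rule in Python. `DB_HOST` and `DB_PASS` are borrowed from the Makefile example earlier in the deck; `PORT`, its default, and the `Settings` class are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_pass: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment; nothing is hardcoded in the app."""
    return Settings(
        db_host=env.get("DB_HOST", "localhost"),
        db_pass=env["DB_PASS"],  # required: raises KeyError when missing
        port=int(env.get("PORT", "8080")),
    )
```

Failing fast on a missing secret at startup is a design choice here: a crash at boot is easier to diagnose than a connection error at the first request.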
37. Best Practice: X509 Client Certificates
Use a CA to sign SSL certificates that perform certain functions
Automatic transport & endpoint security for APIs
Highly scalable - no API requests to validate tokens
Don’t Rely on API tokens which are costly to authenticate and don’t secure the
transport layer
Examples: Kubernetes APIs, etcd
38. CI/CD
Frequency reduces Difficulty. The more you deploy, the easier it gets.
Latency between check-in and production is risky. It’s like HFT.
Faster delivery improves software development practices
Consistency improves confidence
39. Best Practice: Safe Schema Migrations
Ensure applications support same backend schema for adjacent releases
Use feature flags to enable new features of backend schemas
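One way to picture this, as a sketch under assumptions: a release adds a `full_name` column while keeping `first_name` and `last_name`, and a feature flag decides which one the code reads. The column names and the `USE_FULL_NAME_COLUMN` flag are hypothetical. The same code runs against both the old and the new schema, so adjacent releases can be deployed or rolled back safely.

```python
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, env: Mapping[str, str]) -> bool:
    """Read a boolean feature flag from environment-style configuration."""
    return env.get(name, "false").strip().lower() in TRUTHY


def display_name(row: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Work with both schemas: prefer the new column only when flagged on."""
    if flag_enabled("USE_FULL_NAME_COLUMN", env) and row.get("full_name"):
        return row["full_name"]
    # Old schema, or flag off: fall back to the columns every release has.
    return f"{row['first_name']} {row['last_name']}"
```

The flag is switched on only after the migration has run everywhere, and switching it off again needs no schema change.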
40. Best Practice: Use a Build Harness
Write terse .travis.yml, circle.yml, Jenkinsfile
Use the same targets in all projects
Use Makefile to automate build, test
Clone harness repo after git checkout
Example: https://github.com/cloudposse/build-harness
41. Best Practice: Liberal Tagging
Tag all docker images with multiple tags, in addition to release tags
Let $ref = {branch|tag}
Then, tag
$ref
$ref-$build
$git_hash
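The tagging scheme above can be sketched in a few lines of Python. The function names are hypothetical, and replacing `/` in branch names is an added assumption (Docker tags cannot contain slashes); the repository name `cloudposse/test` is taken from the Makefile example earlier in the deck.

```python
from typing import List


def image_tags(ref: str, build: str, git_hash: str) -> List[str]:
    """Return the tags $ref, $ref-$build and $git_hash for one build."""
    safe_ref = ref.replace("/", "-")  # assumption: make branch names tag-safe
    return [safe_ref, f"{safe_ref}-{build}", git_hash]


def image_refs(repo: str, tags: List[str]) -> List[str]:
    """Expand tags into full image references such as repo:tag."""
    return [f"{repo}:{tag}" for tag in tags]
```

For example, `image_refs("cloudposse/test", image_tags("master", "42", "abc1234"))` yields one reference that moves with the branch, one that identifies the build, and one that identifies the exact commit.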
46. What it actually is...
A cross-disciplinary engineering culture
Infrastructure is Code
Automation over toil
A path towards “Serverless” (but we’re still far away!)
Site Reliability Engineering (“SRE”)
47. Infrastructure as Code
Infrastructure is now 100% API driven
“Best Practices” of Development → Infrastructure
Versioned Infrastructure
Automated Remediations
48. Best Practice: Automated Orchestration
Use Terraform to fully orchestrate environments
(e.g. DNS, instances, volumes, AutoScaling Groups, Load Balancers, Databases)
S3 remote backends to store state for collaboration and backups
Use modules to encapsulate business logic for consistency / manageability
Version pin modules and dependencies to ensure stability
49. Best Practice: Tools as Containers
Only local dependency should be docker and maybe make =)
Distribute all other local development tools or dependencies as containers
(e.g. terraform, aws, kops, helm, etc...)
Easier to standardize on one OS
Example: https://github.com/cloudposse/geodesic/
50. Best Practice: 100% Isolation
Use (1) AWS Account per Stage (E.g. production, staging, dev)
Use (1) VPC per Cluster
Use (1) Dedicated TLD per AWS Account
(e.g. foobar.com, foobar.qa, foobar.org)
Use (1) Single Process Containers for all Apps
51. Best Practice: Identical Environments
Environments should only differ in size, not shape
“Production”, “Staging”, “Dev” are only labels
Run as many parallel environments as we need
Only manual action is initiating build
E.g. other labels: pentest, loadtest, erik
Pro tip: each environment gets its own DNS zone (e.g. erik.cloudposse.org)
52. What We Want
Reliable - we want things to be online 100% of the time and when things go
wrong, we want them to auto-heal.
Fast - we want to run a site that can scale horizontally as traffic increases
Easy - we shouldn't need rocket scientists to operate it on a day-to-day basis
Affordable - we want it to be easy and cost effective to maintain in the long run
Maintainable - we want to have a development or staging environment that is
identical to production, so we can efficiently work on new versions of the site
without it affecting production
Secure - we don't want to get hacked
53. Technically, we need this… “Everything”
Horizontal Auto Scaling, Auto Healing, Auto DNS, Auto SSL
Automated deployments and rollbacks, Versioned History
Service Discovery & Load Balancing
Batch Job, Scheduled Job Execution
Storage/Volume Orchestration
...out of the box
54. Best Practice: Use Kubernetes (sometimes)
Ideally suited for microservices architectures, larger engineering teams
“Infrastructure as Code” - write documents that describe your microservices
(Pods ~ VMs, ReplicaSets ~ clusters, Services ~ Load Balancers)
Comes with Everything out-of-the-box
Cons: more complex to get started, difficult to triage issues, requires SME
Pro tip: Use kops to spin up clusters automatically in AWS and GCE
56. Best Practice: Use Elastic Beanstalk
Ideally suited for monolithic architectures
Comes with almost Everything out-of-the-box
Supports instances inside private VPC with root SSH access
Formal process for promoting code to production / automatic rollbacks
Pro tip: Use terraform to spin up beanstalk clusters automatically in AWS
59. Best Practice: Immutable Containers/AMIs
Like “Burning” a copy of your code in an image
Easy to know exactly what is running
Fast to deploy and rollback
Use Docker containers for applications
Use something like CoreOS for underlying host (~dom0)
60. Best Practice: Imperative Infrastructure
“Give me a load balancer, 2 filesystems, 2 GB ram, 4 CPUs, 4 instances”
There’s no guess work about what is output
Compatible with legacy architectures
There’s less magic
61. Monitoring
Application - Synthetic Testing
Infrastructure
Real-User Monitoring (RUM)
SLI
Systems don't have feelings. They only have SLAs.
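To make an availability SLA concrete, the small Python sketch below converts an availability percentage into the downtime it permits over a period. The function name and the 30-day default period are assumptions for illustration; real SLAs define their own measurement windows.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, while 100% leaves none, which is why the SLI being measured matters as much as the number promised.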
62. Best Practice: Team Dashboards
Display Service Level Indicators (~ KPIs) relevant for specific teams
Create dashboards for specific services like Kafka and Zookeeper
First place to look when triaging issues
Pro tip: Use Datadog dashboards with namespace filtering on clusters
64. Alerting
Alert Fatigue == Human Fatigue
Dashboards > Alerts > Email
Human health impacts business health.
Budgets
Metrics driven; not log events
Alerts need to be actionable - with links to documentation
67. Escalation & Remediation
Automate as much as possible, escalate to a human as a last resort.
KPI~SLI / SLO / SLA
On-call Engineers
PagerDuty - Manage Calendars and Phone/SMS Escalations
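"Automate first, escalate last" can be sketched as a small control loop. This is an illustrative Python outline, not PagerDuty's API: `check`, `remediate`, and `page_human` are hypothetical callables supplied by the caller, and the default of three attempts is arbitrary.

```python
from typing import Callable


def remediate_or_escalate(
    check: Callable[[], bool],
    remediate: Callable[[], None],
    page_human: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Return True if automation restored health, False if a human was paged."""
    for _ in range(max_attempts):
        if check():
            return True
        remediate()
    # Give the last remediation a chance to have worked before paging.
    if check():
        return True
    page_human(f"automated remediation failed after {max_attempts} attempts")
    return False
```

In practice `remediate` might restart a service and `page_human` might open an incident for the on-call engineer; both are left abstract here.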
68. Best Practice: #OCE Slack Channel
One channel to reach engineers
Searchable history of events and conversations
Use topic to announce who is on-call
Linked Google Calendar with Relevant Events (E.g. Customer Demo Calendar)
69. Best Practice: Post-Mortems
Kill the shame game. Human issues are system issues.
5 Whys - Root Cause Analysis (“RCA”)
Use Consistent Template (KISS)
Weekly Retrospectives with past OCEs and Stakeholders
Documented in Quip → Instantly Searchable
Pro Tip: Check out how Google does it:
https://landing.google.com/sre/book/chapters/postmortem-culture.html
71. What not to do...
1. Store secrets in git repository
2. Hardcode secrets in configurations
3. Write them in plain-text
4. Manually distribute them
5. Reuse/share keys across users and apps
6. Build homegrown systems to protect secrets
(* unless you’re Netflix, Hashicorp or Google)
...but you already knew that!
72. Best Practice: Beyond Corp Model
Enterprise zero-trust security model used by Google
Shift access controls from the network perimeter to individual devices/users
Allow employees to work more securely from any location
Do not rely on traditional VPNs
73. Best Practice: Identity-Aware Proxy (IAP).
Protect internal services using an IAP
Integrates cleanly with your SSO provider
MFA
Pro tip: Use the Bitly OAuth2 Proxy to add auth layer to any service
74. Best Practice: Bastion Host
Centralized point for accessing systems
Session logs, Slack Login Notifications
Require MFA to authenticate
Disable proxy mode and TCP socket forwarding
Use bastion only for triage, not administration (because that’s scripted!)
Pro Tip: Use Duo Push Notifications + Geofencing
76. Best Practice: SSH Key Management
2 options - Github Public Key API or Signed Certificates
● You can’t protect the private key
● You can add multiple factors (a.k.a. MFA)
● Our Solution
○ Use Github Public Key API to distribute public keys
https://github.com/cloudposse/github-authorized-keys
○ Use Duo for MFA Push Notifications + Geofencing
https://github.com/cloudposse/bastion
Pro tip: Check out BLESS by Netflix
78. Best Practice: SSM Scripted Remediations
Use SSM to execute commands in parallel across machines
(don’t use parallel ssh since that is harder to audit)
Full audit logs of command and output
Use IAM roles to restrict execution
Pro tip: use the aws cli to trigger remediations on the command line
80. Best Practice: Federated Accounts
Reduce the blast radius when things explode
Use one account per environment: dev, staging, production
Use one account for billing aggregation, IAM federation
Assumed Roles (e.g. read-only, admin, dba)
MFA required to assume roles - to devalue credentials
Pro Tip: Use STS API with MFA to generate short lived AWS credentials
Example: https://github.com/cloudposse/aws-assumed-role
AWS
81. Best Practice: AWS Secrets (Client-side)
Client Side (e.g. Terraform, AWS Cli)
● IAM User Account Access Keys (never shared!)
● Access Keys only permit Assume Role+MFA
● Assumed Roles (limit scope)
● Temporary Session Tokens with STS (expire after 1 hour)
● MFA (devalue credentials)
Solution: https://github.com/cloudposse/aws-assumed-role
82. Best Practice: AWS Secrets (Server-side)
Dynamic, Auto Rotating Credentials for Server Applications
Never ever hardcode AWS credentials on EC2 instances
Server Side (e.g. EC2 Instance, Docker Container)
● IAM Instance Profiles with Assumed Roles
● Use Kube2IAM with Kubernetes (kops)
https://github.com/cloudposse/charts/tree/master/incubator/kube2iam-kops
○ Temporary AWS credentials
○ Drop-in Compatibility with all official AWS client libraries
83. Best Practice: Bootstrap Secrets
Secrets you need to provision new clusters on AWS...
● Run terraform inside of Container
● Private S3 Configuration Bucket
● Encrypted Bucket Objects
● Mount S3 Bucket inside container (S3FS)
● Use /dev/shm for caching
Geodesic: https://github.com/cloudposse/geodesic
84. Best Practice: Password Managers
Store Organizational Secrets in Password Manager
(webhook urls, master account credentials, shared MFA)
Use Vaults specific to some shared objective (e.g. team)
Require MFA for decryption
Avoid Shared Credentials as much as possible (this is a last resort)
SSO > Shared Passwords
Pro tip: Use 1Password for Teams. Abandon all other password managers.
85. Best Practice: Avoid Password Rules
They don't work
They frustrate average users
Penalize people that use real random password generators
They are often computationally weaker → vulnerable to brute force attacks
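A back-of-the-envelope Python sketch of two of the claims above, assuming passwords are chosen uniformly at random (which only holds for generated passwords). The function names are hypothetical; 94 is the number of printable ASCII characters excluding space.

```python
import math


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Bits of entropy in a uniformly random password."""
    return length * math.log2(alphabet_size)


def fraction_rejected(length: int, alphabet_size: int, required_class_size: int) -> float:
    """Share of random passwords that contain no character from a required class."""
    return ((alphabet_size - required_class_size) / alphabet_size) ** length
```

Under these assumptions, 16 random lowercase letters (`entropy_bits(16, 26)`) carry more entropy than 8 random characters from the full 94-symbol set (`entropy_bits(8, 94)`), yet a "must contain a symbol" rule rejects the former. Likewise, `fraction_rejected(8, 94, 10)` shows that a "must contain a digit" rule turns away roughly two in five randomly generated 8-character passwords.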
https://blog.codinghorror.com/password-rules-are-bullshit/