This document provides an overview of key areas to review for production readiness: architecture design, monitoring, logging, documentation, alerting, service level agreements, expected throughput, testing, and deployment strategy. It summarizes best practices and considerations for each area, such as chaos engineering and circuit breakers in architecture review, consistent logging formats, storing documentation near the code, automating level 1 operations, and strategies for testing, deployments, and managing error budgets.
2. About me:
Chris Munns - munns@amazon.com, @chrismunns
• Senior Developer Advocate - Serverless
• New Yorker
• Previously:
• AWS Business Development Manager – DevOps, July ’15 - Feb ‘17
• AWS Solutions Architect, Nov 2011 - Dec 2014
• Formerly on operations teams @Etsy and @Meetup
• A little time at a hedge fund, Xerox, and a few other startups
• Rochester Institute of Technology: Applied Networking and
Systems Administration ’05
• Internet infrastructure geek
4. Production Readiness Review
You don’t need all of these from day one; grow them as your teams grow.
Architecture Design Review
Monitoring
Logging
Documentation
Alerting
Service Level Agreement
Expected Throughput
Testing
Deploy Strategy
6. Architecture Design Review
Netflix Chaos Engineering
1. Define the system’s normal behavior — its “steady state” — based on
measurable output like overall throughput, error rates, latency, etc.
2. Hypothesize about the steady state behavior of an experimental group, as
compared to a stable control group.
3. Expose the experimental group to simulated real-world events such as server
crashes, malformed responses, or traffic spikes.
4. Test the hypothesis by comparing the steady state of the control group and
the experimental group. The smaller the differences, the more confidence we
have that the system is resilient.
TL;DR: intentionally break things, compare the measured impact with the expected impact, and correct any problems uncovered this way.
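A minimal Python sketch of step 4, comparing the steady-state error rates of a control group and an experimental group. The request_error_rate helper here only simulates measurements; a real experiment would observe live traffic and inject faults with a tool such as Chaos Monkey.

import random
import statistics

def request_error_rate(inject_failure):
    """Stand-in for measuring one group's error rate over a time window."""
    baseline = random.gauss(0.5, 0.1)            # simulated steady-state error %
    return baseline + (0.2 if inject_failure else 0.0)

control = [request_error_rate(False) for _ in range(100)]     # stable group
experiment = [request_error_rate(True) for _ in range(100)]   # faults injected

delta = statistics.mean(experiment) - statistics.mean(control)
print(f"steady-state error-rate delta: {delta:.2f} percentage points")
# The smaller the delta, the more confidence we have that the system is resilient.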
7. Architecture Design Review
Highly Available & Redundant
Problem: Failure of a service in a specific location
Solution: Run across multiple Availability Zones or Regions.

Problem: Handling spikes of traffic
Solution: Have auto-scaling in place with EC2, containers, or serverless architectures (see the sketch below).

Problem: Single points of failure (SPOF)
Solution: Be sure services are running in clusters scaled across AZs. Replication > backups.
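A minimal boto3 sketch of the multi-AZ and auto-scaling solutions above. The group, launch-template, and AZ names are hypothetical, and running it requires AWS credentials and an existing launch template.

import boto3

autoscaling = boto3.client("autoscaling")

# Spread instances across multiple AZs so one zone failing is survivable.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web",   # hypothetical group name
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)

# Scale out and in automatically to absorb traffic spikes.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web",
    PolicyName="cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)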
8. Architecture Design Review
Using Standard Libraries & Design Patterns
Standardizing on libraries, languages, and style guides makes onboarding new
developers and troubleshooting issues easier. Enforce these programmatically
where you can (eslint, gofmt, etc.).
Spot situations where code is duplicated and can be refactored.
Look for opportunities to implement good design patterns.
Know your licenses: open-source permissive (MIT/Apache) vs. copyleft
(GPL/MPL).
9. Architecture Design Review
Review for Security Best Practices
Security should always be a top priority
Ensure no credentials are being stored in the application
Code defensively against SQL injection, XSS attacks, and more (see the sketch after this list)
Leverage Static Analysis tools
https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
Consider using Pre-Commit by Yelp
http://pre-commit.com
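A minimal sketch of defensive coding against SQL injection using Python's sqlite3 driver. The table and the hostile input are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "alice' OR '1'='1"   # hostile input

# Vulnerable: string interpolation lets the input rewrite the query:
#   conn.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

# Safe: the driver binds the value, so it can never become SQL.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,))
print(rows.fetchall())   # [] -- the hostile string matched no real user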
10. Architecture Design Review
Leverage other startups or rotate teams to keep fresh eyes on your code
Partner with another startup to help each other with architecture, code review,
interviewing, and more.
Consider rotating developers off of projects every few months to gain fresh
eyes on projects.
13. Monitoring
Performance Metrics
Start by building a dashboard of “important” metrics. Continue iterating on this
as you learn more about your system under inspection. Each system has a
“heartbeat” that will appear off when things are unhealthy.
You always think you have enough metrics being gathered until you need the
one you’re missing. When applications fail, the more data you can observe the
easier it is to get to the root cause.
Averages hide issues. Be sure to leverage percentiles to expose where users
are actually experiencing pain (see the sketch below).
Complicated systems build complicated dependency chains. Small fluctuations
in one part of your stack can manifest in other parts.
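A minimal sketch of why percentiles matter, using Python's statistics module on made-up latency data:

import statistics

latencies_ms = [30] * 99 + [3000]      # 1% of requests are pathologically slow

mean = statistics.mean(latencies_ms)   # 59.7 ms: looks healthy
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p99 = cuts[49], cuts[98]
print(f"mean={mean:.1f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
# The mean and median hide the problem; p99 exposes the 1% of users waiting ~3s.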
14. Monitoring
Application Level Visibility
Provides Insight Into Application Performance
You need visibility into how your application itself is performing.
How long are certain calls to resources taking?
Is that trending up or down?
What part of the application is generating the most errors?
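A minimal sketch of application-level timing in Python, logging how long each resource call takes so the trend can be graphed over time. The fetch_cart function is a hypothetical resource call.

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def timed(func):
    """Log each call's duration so trends can be charted later."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logging.info("call=%s duration_ms=%.1f", func.__name__, elapsed_ms)
    return wrapper

@timed
def fetch_cart(user_id):     # hypothetical call to a backing resource
    time.sleep(0.05)         # stands in for a network round trip
    return {"user": user_id}

fetch_cart(42)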
17. Monitoring
Real User Monitoring (RUM) & Synthetic Monitoring
Synthetic Monitoring
Automated testing of your site and services to measure performance (see the
sketch below).
Real User Monitoring
Shows you exactly how users are interacting with your site or application.
Measures page load times, DNS resolution issues, traffic bottlenecks, and
more.
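A minimal synthetic-check sketch in Python; in practice you would run something like this on a schedule from several locations. The URL is a placeholder.

import time
import urllib.request

def probe(url):
    """One synthetic check: returns (status code, response time in ms)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
        return resp.status, (time.perf_counter() - start) * 1000

status, ms = probe("https://example.com/")   # placeholder URL
print(f"status={status} latency={ms:.0f}ms")
# Run on a schedule (cron, Lambda, etc.) and alert on errors or slow responses.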
25. Logging
Consistent Log Format
Consider using JSON for logging (see the sketch below)
Use log levels correctly [INFO/WARN/CRIT]
Add context for your logging statements
Log behaviors and errors
Consider how analytics will be used on this data
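A minimal JSON log-formatter sketch using Python's standard logging module; the "checkout" logger name is hypothetical.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so analytics tools can parse it."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")   # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.warning("inventory low")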
26. Logging
UTC Timestamps
Centrally aggregated logs make analysis easier
Helps prevent mismatch errors due to DST
Prepares you for multi-region
Log tool interfaces let you adjust time zones per user
[2017-07-13 14:49:24.436245]
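A small sketch producing the bracketed style shown above, always in UTC, with Python's datetime:

from datetime import datetime, timezone

# Always stamp in UTC; render local time zones in the log viewer instead.
stamp = datetime.now(timezone.utc).strftime("[%Y-%m-%d %H:%M:%S.%f]")
print(stamp)   # e.g. [2017-07-13 14:49:24.436245]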
27. Logging
Individual Transaction IDs
The session ID that generated the error
The user who encountered the error
The user’s location in the application
The ID of the transaction or product that caused the error
Be careful about what you log from a security perspective
[Diagram: the same transaction ID (10948281) appears in both the web app and the database, letting you trace a single request across tiers]
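A minimal sketch of stamping every log line with the current transaction ID using Python's contextvars; the field names are illustrative.

import contextvars
import logging
import uuid

transaction_id = contextvars.ContextVar("transaction_id", default="-")

class TransactionFilter(logging.Filter):
    """Stamp every log record with the current transaction ID."""
    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True

logging.basicConfig(format="%(asctime)s txn=%(transaction_id)s %(levelname)s %(message)s")
logging.getLogger().addFilter(TransactionFilter())

transaction_id.set(uuid.uuid4().hex)   # set once at the start of each request
logging.error("payment failed")        # the same txn= value appears on every line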
28. Documentation
Store Your Documentation Close To Your Code: README
What the code does
How to install and run it
How to interact with it (stop, start, restart)
How to configure it
How to troubleshoot it
What metrics and dashboards are available
30. Alerting
"Level 1" Operations Teams Should Be Automated
# monit example: watch the nginx process and restart it without paging a human
check process nginx with pidfile /var/run/nginx.pid
start program = "/etc/init.d/nginx start"
stop program = "/etc/init.d/nginx stop"
group www # (for centos)
33. Alerting
Build Proper Escalation Paths For Alerts
Escalation chain: Primary → (10 minutes) → Secondary → (10 minutes) → Team →
(10 minutes) → Management
Being paged when something fails is great, but you always need a backup.
Pages need to auto-escalate when not acknowledged.
As an alert escalates, it's good to notify a wider range of people to get
more eyes on the issue.
Review alerts that have been ack'd or silenced beyond a tolerable threshold.
34. Alerting
Developers' Code Should Only Burden Themselves
Example: bad application code causes a 40% increase in CPU usage across a
cluster.
Temporary fix: operations adds capacity.
Permanent fix: the developer deploys a hotfix.
The burden of bad code should land on the developers who wrote it, not stay
with operations.
36. Service Level Agreements/Objectives
Services Should Have An SLA/SLO
Internal SLAs/SLOs per service, for example:
/Search: 99.99%
/Cart: 99.999%
/Avatars: 99.9%
These are internal SLAs for the company.
They help identify how much effort should be put into the reliability of each
service.
With microservices, they are important so other teams can reliably build
dependencies on your service.
https://landing.google.com/sre/book/chapters/service-level-objectives.html
37. Service Level Agreements
Understand The Cost Of Adding Each 9
Level of Availability | Percent of Uptime | Downtime per Year | Downtime per Day
1 Nine                | 90%               | 36.5 days         | 2.4 hours
2 Nines               | 99%               | 3.65 days         | 14 minutes
3 Nines               | 99.9%             | 8.76 hours        | 86 seconds
4 Nines               | 99.99%            | 52.6 minutes      | 8.6 seconds
5 Nines               | 99.999%           | 5.25 minutes      | 0.86 seconds
6 Nines               | 99.9999%          | 31.5 seconds      | 86 milliseconds
38. Expected Throughput
Run Load Tests & Understand Your Limits
Before a service goes live, know where your breaking points are.
Know the bare minimum number of instances needed to run your average
throughput
Know the maximum throughput you can handle with your current architecture
Calculate the throughput-per-instance ratio so you can accurately set up
auto-scaling in a cost-optimized way (see the sketch below).
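A minimal closed-loop load-generation sketch in Python. Dedicated tools (wrk, Locust, JMeter, etc.) give far more realistic numbers; the target URL and request counts here are placeholders.

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "https://example.com/"   # placeholder endpoint
REQUESTS, WORKERS = 200, 20

def hit(_):
    try:
        with urllib.request.urlopen(TARGET, timeout=10) as resp:
            resp.read()
            return resp.status
    except Exception:
        return 599   # treat any failure as a server-side error

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    statuses = list(pool.map(hit, range(REQUESTS)))
elapsed = time.perf_counter() - start

print(f"{REQUESTS / elapsed:.0f} requests/sec, errors={sum(s >= 500 for s in statuses)}")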
40. Expected Throughput
Provides Performance Baseline For Future Release
[Chart: Max RPS by release, V1 through V14; y-axis 0-3500 RPS]
As code evolves, so does your
performance.
Understand the impact of additional
libraries, added lines of code, and new
external calls.
Here we see a 63.58% increase in performance from V1 to V14, which
correlates directly with your infrastructure cost.
42. Testing
Adopt Automated Testing Early
Builds confidence in the code being
released
Allows you to test more of your
application in less time
Manual testing can become error-prone
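A minimal automated-test sketch with Python's unittest module; apply_discount is a made-up function standing in for real business logic.

import unittest

def apply_discount(price, percent):
    """Hypothetical business logic under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return round(price * (1 - percent / 100), 2)

class DiscountTest(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(apply_discount(100.0, 20), 80.0)

    def test_rejects_bad_input(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)

if __name__ == "__main__":
    unittest.main()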
45. Deployment Strategy
Database Migrations
Understand what changes to the database need to happen to support new
code releases.
Avoid removing columns; make only additive changes to reduce risk.
Be sure to test migrations against test copies of the database
Keep a revision history of database migrations for reference
Snapshot databases before doing migrations
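A minimal sketch of an additive, versioned migration runner in Python with sqlite3. The table names and the migration itself are illustrative; real projects usually use a tool like Alembic or Flyway.

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for a test copy of the database
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

MIGRATIONS = {
    # Additive only: a new column with a default, never a DROP.
    1: "ALTER TABLE users ADD COLUMN email TEXT DEFAULT ''",
}

applied = {v for (v,) in conn.execute("SELECT version FROM schema_migrations")}
for version, statement in sorted(MIGRATIONS.items()):
    if version not in applied:
        conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations VALUES (?)", (version,))
        conn.commit()

# The schema_migrations table doubles as the revision history.
print(conn.execute("SELECT version FROM schema_migrations").fetchall())  # [(1,)]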
47. Deployment Strategy
Dark Deploys & Feature Flags
Opt in: test new features with selected users.
Kill switch: disable poorly performing features.
Scalable rollouts: do percentage-based rollouts of new features.
Block users: prevent selected users from seeing features.
Run A/B tests: test and compare new features.
Sunset old features: safely decommission old features.
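A minimal feature-flag sketch in Python covering the kill-switch and percentage-rollout cases; the flag names and percentages are made up.

import hashlib

FLAGS = {
    # flag name -> percent of users who should see it (made-up values)
    "new-checkout": 25,
    "beta-search": 0,    # kill switch: 0% disables it everywhere
}

def is_enabled(flag, user_id):
    """Stable percentage rollout: a given user always lands in the same bucket."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < FLAGS.get(flag, 0)

print(is_enabled("new-checkout", "user-123"))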
48. Error Budget
Spend it! It’s there for you to use.
Your error budget exists so you can take calculated risks in your environment.
It lets you save up unspent budget and spend it on major architectural changes
(see the sketch below).
Some companies force the spending of this budget when it goes unused, to make
sure services built on top fail gracefully: if the SLA is 99.99% and the
service is running at 100%, they will intentionally force downtime to bring it
back to 99.99%.
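A quick sketch of how much downtime a given SLO actually buys you:

slo = 0.9999                        # 99.99% availability target
period_seconds = 30 * 24 * 3600     # a 30-day window
budget_seconds = period_seconds * (1 - slo)
print(f"{budget_seconds / 60:.1f} minutes of error budget per 30 days")  # ~4.3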
49. Production Readiness Review
Summary of key areas for a PRR
Architecture Design Review
Monitoring
Logging
Documentation
Alerting
Service Level Agreement
Expected Throughput
Testing
Deploy Strategy
50. Resources
Useful resources related to the topics covered
Production Readiness Review:
https://arxiv.org/pdf/1305.2402.pdf
Netflix Hystrix Circuit Breaker:
https://github.com/Netflix/Hystrix/wiki/How-it-Works
Feature Flags:
https://en.wikipedia.org/wiki/Feature_toggle
Error Budgets:
https://landing.google.com/sre/interview/ben-treynor.html
Monitoring Philosophies:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit