© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chris Munns
Fall 2017
AWS Startup Day
Production Readiness Review
About me:
Chris Munns - munns@amazon.com, @chrismunns
• Senior Developer Advocate - Serverless
• New Yorker
• Previously:
• AWS Business Development Manager – DevOps, July ’15 - Feb ’17
• AWS Solutions Architect, Nov 2011 - Dec 2014
• Formerly on operations teams @Etsy and @Meetup
• Little time at a hedge fund, Xerox and a few other startups
• Rochester Institute of Technology: Applied Networking and
Systems Administration ’05
• Internet infrastructure geek
“Everything fails all the time.”
Werner Vogels, CTO, Amazon.com
Production Readiness Review
You don’t need all of these from day one; grow them as your teams grow.
Architecture Design Review
Monitoring
Logging
Documentation
Alerting
Service Level Agreement
Expected Throughput
Testing
Deploy Strategy
Architecture Design Review
Netflix Chaos Engineering
1. Define the system’s normal behavior — its “steady state” — based on
measurable output like overall throughput, error rates, latency, etc.
2. Hypothesize about the steady state behavior of an experimental group, as
compared to a stable control group.
3. Expose the experimental group to simulated real-world events such as server
crashes, malformed responses, or traffic spikes.
4. Test the hypothesis by comparing the steady state of the control group and
the experimental group. The smaller the differences, the more confidence we
have that the system is resilient.
TL;DR: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
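The four steps above can be sketched as a small comparison between a control group and an experimental group. This is a minimal illustration, not Netflix's tooling: the sample format (`ok`, `latency_ms`), the function names, and the tolerance values are all assumptions made for the example.

```python
from statistics import mean

def steady_state(samples):
    """Summarize a group's measurable output: error rate and mean latency."""
    errors = sum(1 for s in samples if not s["ok"])
    return {
        "error_rate": errors / len(samples),
        "mean_latency_ms": mean(s["latency_ms"] for s in samples),
    }

def hypothesis_holds(control, experiment,
                     max_error_delta=0.01, max_latency_delta_ms=50.0):
    """True when the experimental group stays close to the control group.

    The tolerances are hypothetical; pick values that match your own
    definition of steady state.
    """
    c = steady_state(control)
    e = steady_state(experiment)
    return (abs(e["error_rate"] - c["error_rate"]) <= max_error_delta
            and abs(e["mean_latency_ms"] - c["mean_latency_ms"]) <= max_latency_delta_ms)
```

The smaller the measured differences, the more confidence the experiment gives; a `False` result points at a weakness to fix before it shows up in production.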
Architecture Design Review
Highly Available & Redundant
Problem: Failure of a service in a specific location
Solution: Run across multiple availability zones or regions

Problem: Able to handle spikes of traffic
Solution: Have auto-scaling in place with EC2, Containers, or through leveraging serverless architectures.

Problem: Avoid Single Points of Failure (SPOF)
Solution: Be sure services are running in clusters scaled across AZs. Replication > Backups.
Architecture Design Review
Using Standard Libraries & Design Patterns
Standardizing on libraries, languages, and style guides makes onboarding new
developers and troubleshooting issues easier. Enforce these programmatically
where you can (eslint, gofmt, etc.).
Spot duplicated code that could be refactored.
Look for opportunities to implement good design patterns.
Know your licenses: open source permissive (MIT/Apache) vs copyleft
(GPL/MPL).
Architecture Design Review
Review for Security Best Practices
Security should always be a top priority
Ensure no credentials are being stored in the application
Code defensively for SQL injections, XSS attacks, and more
Leverage Static Analysis tools
https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
Consider using Pre-Commit by Yelp
http://pre-commit.com
Architecture Design Review
Leverage other startups or rotate teams to keep fresh eyes on your code
Partner with another startup to help each other with architecture, code review,
interviewing, and more.
Consider rotating developers off of projects every few months to gain fresh
eyes on projects.
Monitoring
Application vs Service Level Alerting
[Diagram: Web, App, and DB tiers, shown once for Application Level alerting and once for Service Level alerting]
Monitoring
Performance Metrics
Start by building a dashboard of “important” metrics. Continue iterating on this
as you learn more about your system under inspection. Each system has a
“heartbeat” that will appear off when things are unhealthy.
You always think you have enough metrics being gathered until you need the
one you’re missing. When applications fail, the more data you can observe the
easier it is to get to the root cause.
Averages hide issues. Be sure to leverage percentiles to expose where users
are experiencing issues.
Complicated systems build complicated dependency chains. Small fluctuations
in one part of your stack can manifest themselves in other parts.
Monitoring
Application Level Visibility
Provides Insight To Application Performance
You need visibility into how your application itself is performing.
How long are certain calls to resources taking?
Is that trending up or down?
What part of the application is generating the most errors?
Monitoring
Averages vs Percentiles
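A small worked example of why averages hide issues. The latency numbers are invented for illustration, and `percentile` is a simple nearest-rank helper written for this sketch rather than a function from any monitoring product.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies: 95 fast requests and 5 very slow ones.
latencies_ms = [40] * 95 + [2000] * 5

average = mean(latencies_ms)        # pulled up by the slow tail
p50 = percentile(latencies_ms, 50)  # what a typical user sees
p99 = percentile(latencies_ms, 99)  # what the unluckiest users see
```

Here the average is 138 ms, which describes nobody: most users see 40 ms while the slowest 5% wait 2000 ms. The p99 exposes that tail directly.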
Monitoring
Real User Monitoring (RUM) & Synthetic Monitoring
Synthetic Monitoring
Automatic testing of your site and service to measure performance.
Real User Monitoring
Shows you exactly how users are interacting with your site or application.
Measures page load times, DNS resolution issues, traffic bottlenecks, and
more.
Monitoring
Circuit Breakers
[Diagram sequence: an API client calls the Orders, Invoices, and Customers services. Per-call latencies and the total across the slides:]
1. Normal: 81ms, 63ms, 37ms; 181ms total.
2. Slow at handling requests, requests queuing up: 81ms, 63ms, 4082ms; 4226ms total.
3. High Error Rate: 81ms, 63ms, 1ms; 145ms total.
4. Reduced Error Rate: 81ms, 63ms, 91ms; 235ms total.
5. Back to normal: 81ms, 63ms, 37ms; 181ms total.
Monitoring
Circuit Breakers
[State diagram: circuit breaker states Closed (success), Open (fast failing), and Half Open (try one request); a failure opens the circuit.]
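The three states can be sketched in a few lines. This is a minimal, single-threaded illustration of the general pattern, not Hystrix's implementation: the class name, the thresholds, and the assumption that a successful trial request closes the circuit again are all choices made for this example.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker fails fast instead of calling the service."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # timeout elapsed: try one request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial request, or too many failures, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

Failing fast is what keeps a slow dependency from queuing up requests in the caller: while the circuit is open, callers get an immediate error they can handle with a fallback.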
Logging
Consistent Log Format
Consider using JSON for logging
Use log levels correctly [INFO/WARN/CRIT]
Add context for your logging statements
Log behaviors and errors
Consider how analytics will be used on this data
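One way to get a consistent JSON format is a custom formatter on top of Python's standard `logging` module. The sketch below is an assumption-laden example: the field names and the `context` key passed through `extra` are conventions invented here, not a standard.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical key callers pass via `extra=`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Context travels with the statement instead of being baked into the text.
logger.warning("payment declined", extra={"context": {"order_id": "A-1001"}})
```

Because every line is a JSON object with the same keys, a log aggregator can filter and group on `level` or `order_id` without fragile text parsing.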
Logging
UTC Timestamps
Centrally aggregated logs make analysis easier
Helps prevent mismatch errors due to DST
Prepares you for multi-region
Log tool interfaces let you adjust time zones per user
[2017-07-13 14:49:24.436245]
Logging
Individual Transaction IDs
The session ID that generated the error
The user who encountered the error
The user’s location in the application
The ID of the transaction or product that caused the error
Be careful about what you log from a security perspective
[Diagram: Web, App, and Database tiers sharing the same ID, 10948281]
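A rough sketch of carrying one transaction ID through every tier so log lines can be joined later. The header name `X-Transaction-Id` and both helper functions are hypothetical, chosen only for this illustration.

```python
import uuid

TRANSACTION_HEADER = "X-Transaction-Id"  # hypothetical header name

def ensure_transaction_id(headers):
    """Reuse the caller's ID when present; otherwise mint one at the edge."""
    headers = dict(headers)  # do not mutate the caller's mapping
    headers.setdefault(TRANSACTION_HEADER, uuid.uuid4().hex)
    return headers

def log_line(tier, headers, message):
    """Format a log line that always carries the transaction ID."""
    return f"tier={tier} txn={headers[TRANSACTION_HEADER]} msg={message}"

# The web tier mints the ID; downstream tiers pass it along unchanged.
web_headers = ensure_transaction_id({})
app_headers = ensure_transaction_id(web_headers)
```

Searching the aggregated logs for that one ID then returns the request's full path through the system. Keep the ID opaque, as here, so it leaks nothing about the user.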
Documentation
Store Your Documentation Close To Your Code: README
What the code does
How to install and run it
How to interact with it (stop, start, restart)
How to configure it
How to troubleshoot it
What metrics and dashboards are available
Alerting
"Level 1" Operations Teams Should Be Automated
check process nginx with pidfile /var/run/nginx.pid
    start program = "/etc/init.d/nginx start"
    stop program = "/etc/init.d/nginx stop"
    group www    # for CentOS
Alerting
"Level 1" Operations Teams Should Be Automated
EC2 Auto Recovery
Alerting
"Level 1" Operations Teams Should Be Automated
EC2 Auto Scaling
Alerting
Build Proper Escalation Paths For Alerts
Primary -> (10 minutes) -> Secondary -> (10 minutes) -> Team -> (10 minutes) -> Management
Being paged when something fails is great, but you
always need a backup
These need to auto escalate when not acknowledged
As it escalates up it’s good to notify a wider range of
people to get more eyes on the issue
Review alerts that have been ack’d or silenced beyond
a tolerable threshold.
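The escalation path above can be expressed as a small function. This is a sketch under stated assumptions: the chain and the 10-minute step come from the slide, while the function name and the choice to keep earlier responders on the notification list are illustrative.

```python
ESCALATION_CHAIN = ["primary", "secondary", "team", "management"]
STEP_MINUTES = 10  # escalate after each unacknowledged 10-minute window

def who_to_page(minutes_unacknowledged):
    """Return everyone who should have been notified by now.

    Earlier responders stay on the list, so the audience widens as the
    alert escalates.
    """
    level = min(int(minutes_unacknowledged // STEP_MINUTES),
                len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: level + 1]
```

For example, an alert that has gone 25 minutes without acknowledgment reaches the primary, the secondary, and the team.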
Alerting
Developers' Code Should Only Burden Themselves
Bad application code causes 40% increase in CPU usage across a cluster.
Temporary Fix: Operations Add Capacity
Permanent Fix: Developer Deploy Hotfix
Service Level Agreements
Service Level Agreements/Objectives
Services Should Have An SLA/SLO
/Search: 99.99%
/Cart: 99.999%
/Avatars: 99.9%
These are internal SLAs for the
company
Helps identify how much effort should
be put into the reliability of each
service
Important when using microservices
for teams to reliably build
dependencies on your service.
https://landing.google.com/sre/book/chapters/service-level-objectives.html
Service Level Agreements
Understand The Cost Of Adding Each 9
Level of Availability   Percent of Uptime   Downtime per Year   Downtime per Day
1 Nine                  90%                 36.5 days           2.4 hours
2 Nines                 99%                 3.65 days           14 minutes
3 Nines                 99.9%               8.76 hours          86 seconds
4 Nines                 99.99%              52.6 minutes        8.6 seconds
5 Nines                 99.999%             5.25 minutes        0.86 seconds
6 Nines                 99.9999%            31.5 seconds        86 milliseconds
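The downtime figures follow directly from the availability percentage. A minimal sketch, assuming a 365-day year; the function name is invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def allowed_downtime_seconds(availability_percent, period_seconds):
    """Downtime permitted in a period at the given availability level."""
    return (1 - availability_percent / 100) * period_seconds

# Three nines: about 8.76 hours per year and about 86 seconds per day.
three_nines_year_hours = allowed_downtime_seconds(99.9, SECONDS_PER_YEAR) / 3600
three_nines_day_seconds = allowed_downtime_seconds(99.9, SECONDS_PER_DAY)
```

Each added nine divides the allowance by ten, which is why six nines leaves only about 86.4 milliseconds per day.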
Expected Throughput
Run Load Tests & Understand Your Limits
Before a service goes live, know where your breaking points are.
Know the bare minimum number of instances needed to run your average
throughput
Know the maximum throughput you can handle with your current architecture
Calculate the throughput per instance ratio so you can accurately set up
proper auto-scaling in a cost optimized way.
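Once a load test gives you a throughput-per-instance figure, sizing becomes arithmetic. The sketch below is illustrative only: the 25% headroom and the two-instance floor are assumptions, not recommendations from the talk.

```python
import math

def instances_needed(target_rps, rps_per_instance, headroom=0.25, min_instances=2):
    """Instances required to serve target_rps with spare capacity.

    rps_per_instance comes from your own load tests; headroom and
    min_instances are hypothetical defaults.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    required = math.ceil(target_rps * (1 + headroom) / rps_per_instance)
    return max(min_instances, required)
```

With a measured 150 requests per second per instance, `instances_needed(1000, 150)` gives 9, and a quiet period at 100 requests per second falls back to the floor of 2. The same figure can drive the minimum and maximum sizes of an auto-scaling group.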
Expected Throughput
Helps with Cost Optimization & Auto Scaling
Expected Throughput
Provides Performance Baseline For Future Release
[Bar chart: Max RPS for V1 and V14, y-axis from 0 to 3500]
As code evolves, so does your
performance.
Understand the impact of additional
libraries, added lines of code, and new
external calls.
Here we see a 63.58% increase in
performance from V1 to V14. This
directly correlates to your infrastructure
cost.
Testing
Adopt Automated Testing Early
Builds confidence in the code being
released
Allows you to test more of your
application in less time
Manual testing can become error
prone
Testing
Test Driven Development
Red, Green, Refactor
Write a test first; it fails.
Develop code so it passes.
Refactor and optimize the code.
Repeat.
Deployment Strategy
Database Migrations
Understand what changes to the database need to happen to support new
code releases.
Avoid removing columns, only make additions to reduce risk.
Be sure to test migrations against test copies of the database
Keep a revision history of database migrations for reference
Snapshot databases before doing migrations
Deployment Strategy
Canary Pools
[Diagram: a load balancer splits traffic between Version 1 and Version 2, first 10% / 90%, then 100% / 0%, with 0% errors shown for both versions.]
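The decision behind a canary rollout can be sketched as one function: shift more traffic while the canary's error rate stays close to the baseline, and pull it back to zero otherwise. The step size, tolerance, and function name are assumptions for illustration; a real rollout would usually also wait for enough traffic before judging.

```python
def next_canary_weight(current_weight, canary_error_rate, baseline_error_rate,
                       step=10, tolerance=0.001):
    """Return the canary's next traffic share in percent.

    0 means roll back; otherwise advance by `step` up to 100.
    """
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # canary is worse than the current version: roll back
    return min(100, current_weight + step)
```

Starting at 10%, a healthy canary moves to 20% on the next evaluation, while one showing a 5% error rate against a clean baseline drops to 0%.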
Deployment Strategy
Dark Deploys & Feature Flags
Opt In
Test new features with selected
users
Kill Switch
Disable poorly performing features
Scalable Roll Outs
Do % roll outs of new features
Block Users
Prevent selected users from features
Run A/B Tests
Test and compare new features
Sunset Old Features
Safely decommission old features
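Several of these uses fit in one small evaluation function. This is a sketch, not any particular feature-flag product: the parameter names are invented, and hashing the flag name with the user ID is just one common way to give each user a stable bucket.

```python
import hashlib

def bucket(flag, user_id):
    """Map a user to a stable bucket from 0 to 99 for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag, user_id, *, rollout_percent=0,
               opted_in=(), blocked=(), killed=False):
    """Evaluate a flag: kill switch and block list win, then opt-in,
    then the percentage rollout."""
    if killed or user_id in blocked:
        return False
    if user_id in opted_in:
        return True
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag and the user, raising `rollout_percent` from 10 to 50 keeps the original 10% enabled and adds new users, rather than reshuffling who sees the feature.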
Error Budget
Spend it! It’s there for you to use.
Error budget is there for you to take calculated risks in your environment.
Allows you to save up a high budget to spend it on major architectural
changes.
Some companies force the spending of this budget when it’s not utilized to
encourage services built on it to gracefully fail. If the SLA is 99.99% and it’s
running at 100%, they will manually force downtime to stay at 99.99%.
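An error budget is simply the downtime the SLA still permits. A minimal sketch, assuming a 30-day window measured in minutes; the window length and function names are choices made for this example.

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Total downtime the SLO allows over the period, in minutes."""
    return (1 - slo_percent / 100) * period_days * 24 * 60

def budget_remaining_minutes(slo_percent, downtime_minutes, period_days=30):
    """Budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(slo_percent, period_days) - downtime_minutes
```

At 99.99% over 30 days the budget is about 4.3 minutes. A service that has used none of it has room for a risky change; one that has overspent should slow down until the window rolls over.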
Production Readiness Review
Summary of key areas for a PRR
Architecture Design Review
Monitoring
Logging
Documentation
Alerting
Service Level Agreement
Expected Throughput
Testing
Deploy Strategy
Resources
Useful resources related to the topics covered
Production Readiness Review:
https://arxiv.org/pdf/1305.2402.pdf
Netflix Hystrix Circuit Breaker:
https://github.com/Netflix/Hystrix/wiki/How-it-Works
Feature Flags:
https://en.wikipedia.org/wiki/Feature_toggle
Error Budgets:
https://landing.google.com/sre/interview/ben-treynor.html
Monitoring Philosophies:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
Chris Munns
munns@amazon.com
@chrismunns
https://www.flickr.com/photos/theredproject/3302110152/
Operations: Production Readiness Review – How to stop bad things from Happening

Más contenido relacionado

La actualidad más candente

Discover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statementDiscover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statementMárton Kodok
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka confluent
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
Cloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloudCloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloudSourabh Saxena
 
Unleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheUnleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheAmazon Web Services
 
Exalogic Technical Overview
Exalogic Technical OverviewExalogic Technical Overview
Exalogic Technical OverviewAndrey Akulov
 
데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...
데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...
데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...Amazon Web Services Korea
 
Cloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs GoogleCloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs GooglePatrick Pierson
 
Kafka Tutorial: Kafka Security
Kafka Tutorial: Kafka SecurityKafka Tutorial: Kafka Security
Kafka Tutorial: Kafka SecurityJean-Paul Azar
 
Building Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWSBuilding Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWSconfluent
 
Data Streaming with Apache Kafka in the Defence and Cybersecurity Industry
Data Streaming with Apache Kafka in the Defence and Cybersecurity IndustryData Streaming with Apache Kafka in the Defence and Cybersecurity Industry
Data Streaming with Apache Kafka in the Defence and Cybersecurity IndustryKai Wähner
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
AWS vs Azure vs Google (GCP) - Slides
AWS vs Azure vs Google (GCP) - SlidesAWS vs Azure vs Google (GCP) - Slides
AWS vs Azure vs Google (GCP) - SlidesTobyWilman
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리confluent
 
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...Amazon Web Services
 

La actualidad más candente (20)

Discover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statementDiscover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statement
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Cloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloudCloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloud
 
Unleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCacheUnleash the Power of Redis with Amazon ElastiCache
Unleash the Power of Redis with Amazon ElastiCache
 
Exalogic Technical Overview
Exalogic Technical OverviewExalogic Technical Overview
Exalogic Technical Overview
 
데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...
데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...
데이터 분석가를 위한 신규 분석 서비스 - 김기영, AWS 분석 솔루션즈 아키텍트 / 변규현, 당근마켓 소프트웨어 엔지니어 :: AWS r...
 
ElastiCache & Redis
ElastiCache & RedisElastiCache & Redis
ElastiCache & Redis
 
Cloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs GoogleCloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs Google
 
Kafka Tutorial: Kafka Security
Kafka Tutorial: Kafka SecurityKafka Tutorial: Kafka Security
Kafka Tutorial: Kafka Security
 
Building Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWSBuilding Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWS
 
Data Streaming with Apache Kafka in the Defence and Cybersecurity Industry
Data Streaming with Apache Kafka in the Defence and Cybersecurity IndustryData Streaming with Apache Kafka in the Defence and Cybersecurity Industry
Data Streaming with Apache Kafka in the Defence and Cybersecurity Industry
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
AWS vs Azure vs Google (GCP) - Slides
AWS vs Azure vs Google (GCP) - SlidesAWS vs Azure vs Google (GCP) - Slides
AWS vs Azure vs Google (GCP) - Slides
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리
 
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
 
Oracle Cloud Infrastructure
Oracle Cloud InfrastructureOracle Cloud Infrastructure
Oracle Cloud Infrastructure
 

Destacado

Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
 
golang.tokyo #6 (in Japanese)
golang.tokyo #6 (in Japanese)golang.tokyo #6 (in Japanese)
golang.tokyo #6 (in Japanese)Yuichi Murata
 
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis FirehoseStreaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis FirehoseAmazon Web Services
 
MongoDBの可能性の話
MongoDBの可能性の話MongoDBの可能性の話
MongoDBの可能性の話Akihiro Kuwano
 
An introduction and future of Ruby coverage library
An introduction and future of Ruby coverage libraryAn introduction and future of Ruby coverage library
An introduction and future of Ruby coverage librarymametter
 
Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介
Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介
Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介Kentoku
 
AWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグAWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグAmazon Web Services Japan
 
ScalaからGoへ
ScalaからGoへScalaからGoへ
ScalaからGoへJames Neve
 
AndApp開発における全て #denatechcon
AndApp開発における全て #denatechconAndApp開発における全て #denatechcon
AndApp開発における全て #denatechconDeNA
 
神に近づくx/net/context (Finding God with x/net/context)
神に近づくx/net/context (Finding God with x/net/context)神に近づくx/net/context (Finding God with x/net/context)
神に近づくx/net/context (Finding God with x/net/context)guregu
 
Swaggerでのapi開発よもやま話
Swaggerでのapi開発よもやま話Swaggerでのapi開発よもやま話
Swaggerでのapi開発よもやま話KEISUKE KONISHI
 
Fast and Reliable Swift APIs with gRPC
Fast and Reliable Swift APIs with gRPCFast and Reliable Swift APIs with gRPC
Fast and Reliable Swift APIs with gRPCTim Burks
 
メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法
メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法
メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法Takuya Ueda
 
Solving anything in VCL
Solving anything in VCLSolving anything in VCL
Solving anything in VCLFastly
 
So You Wanna Go Fast?
So You Wanna Go Fast?So You Wanna Go Fast?
So You Wanna Go Fast?Tyler Treat
 

Destacado (20)

Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
golang.tokyo #6 (in Japanese)
golang.tokyo #6 (in Japanese)golang.tokyo #6 (in Japanese)
golang.tokyo #6 (in Japanese)
 
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis FirehoseStreaming Data Analytics with Amazon Redshift and Kinesis Firehose
Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
 
What’s New in Amazon Aurora
What’s New in Amazon AuroraWhat’s New in Amazon Aurora
What’s New in Amazon Aurora
 
MongoDBの可能性の話
MongoDBの可能性の話MongoDBの可能性の話
MongoDBの可能性の話
 
Blockchain on Go
Blockchain on GoBlockchain on Go
Blockchain on Go
 
An introduction and future of Ruby coverage library
An introduction and future of Ruby coverage libraryAn introduction and future of Ruby coverage library
An introduction and future of Ruby coverage library
 
Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介
Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介
Spiderストレージエンジンの使い方と利用事例 他ストレージエンジンの紹介
 
AWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグAWS X-Rayによるアプリケーションの分析とデバッグ
AWS X-Rayによるアプリケーションの分析とデバッグ
 
SLOのすすめ
SLOのすすめSLOのすすめ
SLOのすすめ
 
ScalaからGoへ
ScalaからGoへScalaからGoへ
ScalaからGoへ
 
AndApp開発における全て #denatechcon
AndApp開発における全て #denatechconAndApp開発における全て #denatechcon
AndApp開発における全て #denatechcon
 
神に近づくx/net/context (Finding God with x/net/context)
神に近づくx/net/context (Finding God with x/net/context)神に近づくx/net/context (Finding God with x/net/context)
神に近づくx/net/context (Finding God with x/net/context)
 
Microservices at Mercari
Microservices at MercariMicroservices at Mercari
Microservices at Mercari
 
Swaggerでのapi開発よもやま話
Swaggerでのapi開発よもやま話Swaggerでのapi開発よもやま話
Swaggerでのapi開発よもやま話
 
Fast and Reliable Swift APIs with gRPC
Fast and Reliable Swift APIs with gRPCFast and Reliable Swift APIs with gRPC
Fast and Reliable Swift APIs with gRPC
 
メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法
メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法
メルカリアッテの実務で使えた、GAE/Goの開発を効率的にする方法
 
Solving anything in VCL
Solving anything in VCLSolving anything in VCL
Solving anything in VCL
 
So You Wanna Go Fast?
So You Wanna Go Fast?So You Wanna Go Fast?
So You Wanna Go Fast?
 
Google Home and Google Assistant Workshop: Build your own serverless Action o...
Google Home and Google Assistant Workshop: Build your own serverless Action o...Google Home and Google Assistant Workshop: Build your own serverless Action o...
Google Home and Google Assistant Workshop: Build your own serverless Action o...
 

Similar a Operations: Production Readiness Review – How to stop bad things from Happening

Operations: Production Readiness
Operations: Production ReadinessOperations: Production Readiness
Operations: Production ReadinessAmazon Web Services
 
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningStart Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningAmazon Web Services
 
Fast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWSFast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWSAmazon Web Services
 
AWS Summit Auckland - Application Delivery Patterns for Developers
AWS Summit Auckland - Application Delivery Patterns for DevelopersAWS Summit Auckland - Application Delivery Patterns for Developers
AWS Summit Auckland - Application Delivery Patterns for DevelopersAmazon Web Services
 
Compliance as Code Everywhere
Compliance as Code EverywhereCompliance as Code Everywhere
Compliance as Code EverywhereMatt Ray
 
Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023
Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023
Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023VMware Tanzu
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Best practice adoption (and lack there of)
Best practice adoption (and lack there of)Best practice adoption (and lack there of)
Best practice adoption (and lack there of)John Pape
 
Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And ScalabilityJason Ragsdale
 
Getting Started with Amazon Inspector - AWS June 2016 Webinar Series
Getting Started with Amazon Inspector - AWS June 2016 Webinar SeriesGetting Started with Amazon Inspector - AWS June 2016 Webinar Series
Getting Started with Amazon Inspector - AWS June 2016 Webinar SeriesAmazon Web Services
 
OpsWorks for Chef Automate - Auckland AWS
OpsWorks for Chef Automate - Auckland AWS OpsWorks for Chef Automate - Auckland AWS
OpsWorks for Chef Automate - Auckland AWS Matt Ray
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
From Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auFrom Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auevanbottcher
 
(SEC312) Taking a DevOps Approach to Security | AWS re:Invent 2014
(SEC312) Taking a DevOps Approach to Security | AWS re:Invent 2014(SEC312) Taking a DevOps Approach to Security | AWS re:Invent 2014
(SEC312) Taking a DevOps Approach to Security | AWS re:Invent 2014Amazon Web Services
 
Modernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectModernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectDevOps.com
 
DevOps on Windows: How to Deploy Complex Windows Workloads | AWS Public Secto...
DevOps on Windows: How to Deploy Complex Windows Workloads | AWS Public Secto...DevOps on Windows: How to Deploy Complex Windows Workloads | AWS Public Secto...
DevOps on Windows: How to Deploy Complex Windows Workloads | AWS Public Secto...Amazon Web Services
 
Starting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsStarting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsDynatrace
 

Similar a Operations: Production Readiness Review – How to stop bad things from Happening (20)

Operations: Production Readiness
Operations: Production ReadinessOperations: Production Readiness
Operations: Production Readiness
 
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningStart Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
 
Fast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWSFast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWS
 
AWS Summit Auckland - Application Delivery Patterns for Developers
AWS Summit Auckland - Application Delivery Patterns for DevelopersAWS Summit Auckland - Application Delivery Patterns for Developers
AWS Summit Auckland - Application Delivery Patterns for Developers
 
Compliance as Code Everywhere
Compliance as Code EverywhereCompliance as Code Everywhere
Compliance as Code Everywhere
 
Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023
Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023
Simplify and Scale Enterprise Spring Apps in the Cloud | March 23, 2023
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Best practice adoption (and lack there of)
Operations: Production Readiness Review – How to stop bad things from Happening

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chris Munns Fall 2017 AWS Startup Day Production Readiness Review
  • 2. About me: Chris Munns - munns@amazon.com, @chrismunns • Senior Developer Advocate - Serverless • New Yorker • Previously: • AWS Business Development Manager – DevOps, July ’15 - Feb ‘17 • AWS Solutions Architect Nov, 2011- Dec 2014 • Formerly on operations teams @Etsy and @Meetup • Little time at a hedge fund, Xerox and a few other startups • Rochester Institute of Technology: Applied Networking and Systems Administration ’05 • Internet infrastructure geek
  • 3. “Everything fails all the time.” Werner Vogels, CTO, Amazon.com
  • 4. Production Readiness Review: You don’t need all of these from day one; grow them as your teams grow. Architecture Design Review, Monitoring, Logging, Documentation, Alerting, Service Level Agreement, Expected Throughput, Testing, Deploy Strategy.
  • 6. Architecture Design Review Netflix Chaos Engineering 1. Define the system’s normal behavior — its “steady state” — based on measurable output like overall throughput, error rates, latency, etc. 2. Hypothesize about the steady state behavior of an experimental group, as compared to a stable control group. 3. Expose the experimental group to simulated real-world events such as server crashes, malformed responses, or traffic spikes. 4. Test the hypothesis by comparing the steady state of the control group and the experimental group. The smaller the differences, the more confidence we have that the system is resilient. TLDR; Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
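The four steps above can be sketched as a small comparison between a control group and an experimental group. This is only an illustration under stated assumptions: the sample format, the function names, and the tolerance values are hypothetical and do not come from the slides or from any Netflix tooling.

```python
from statistics import mean


def steady_state(samples):
    """Step 1: summarize measurable output (mean latency and error rate)."""
    latencies = [s["latency_ms"] for s in samples]
    errors = sum(1 for s in samples if s["error"])
    return {
        "mean_latency_ms": mean(latencies),
        "error_rate": errors / len(samples),
    }


def hypothesis_holds(control, experiment,
                     max_latency_delta_ms=50.0, max_error_delta=0.01):
    """Steps 2 and 4: the experimental group should stay close to control.

    The tolerances are illustrative assumptions, not recommended values.
    """
    c = steady_state(control)
    e = steady_state(experiment)
    latency_ok = abs(e["mean_latency_ms"] - c["mean_latency_ms"]) <= max_latency_delta_ms
    errors_ok = abs(e["error_rate"] - c["error_rate"]) <= max_error_delta
    return latency_ok and errors_ok


# Step 3 happens outside this code: inject a failure into the experimental
# group, then collect request samples from both groups.
control = [{"latency_ms": 100, "error": False} for _ in range(100)]
experiment = ([{"latency_ms": 120, "error": False} for _ in range(99)]
              + [{"latency_ms": 900, "error": True}])

print("resilient" if hypothesis_holds(control, experiment) else "investigate")
```

The smaller the tolerances you can hold while injecting failures, the more confidence you have in the system.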
  • 7. Architecture Design Review: Highly Available & Redundant. Problem: Failure of a service in a specific location. Solution: Run across multiple availability zones or regions. Problem: Able to handle spikes of traffic. Solution: Have auto-scaling in place with EC2, containers, or through leveraging serverless architectures. Problem: Avoid Single Points of Failure (SPOF). Solution: Be sure services are running in clusters scaled across AZs. Replication > Backups.
  • 8. Architecture Design Review: Using Standard Libraries & Design Patterns. Standardizing on libraries, languages, and style guides makes onboarding new developers and troubleshooting issues easier. Enforce these programmatically where you can (eslint, gofmt, etc.). Spot situations where code is duplicated and can be refactored. Look for opportunities to implement good design patterns. Know your licenses: open source permissive (MIT/Apache) vs copyleft (GNU/MPL).
  • 9. Architecture Design Review: Review for Security Best Practices. Security should always be a top priority. Ensure no credentials are being stored in the application. Code defensively for SQL injections, XSS attacks, and more. Leverage static analysis tools: https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis. Consider using Pre-Commit by Yelp: http://pre-commit.com
  • 10. Architecture Design Review Leverage other startups or rotate teams to keep fresh eyes on your code Partner with another startup to help each other with architecture, code review, interviewing, and more. Consider rotating developers off of projects every few months to gain fresh eyes on projects.
  • 12. Monitoring: Application vs Service Level Alerting. [Diagram: a Web, App, DB stack shown twice, once labeled Application Level and once labeled Service Level.]
  • 13. Monitoring: Performance Metrics. Start by building a dashboard of “important” metrics. Continue iterating on this as you learn more about the system under inspection. Each system has a “heartbeat” that will appear off when things are unhealthy. You always think you have enough metrics being gathered until you need the one you’re missing. When applications fail, the more data you can observe, the easier it is to get to the root cause. Averages hide issues. Be sure to leverage percentiles to expose where users are experiencing issues. Complicated systems build complicated dependency chains. Small fluctuations in one part of your stack can manifest themselves in other parts.
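To make "averages hide issues" concrete, here is a small sketch using made-up latency numbers (the data is an assumption chosen for illustration). Most requests are fast, a few are very slow, and the mean alone does not show how bad the slow ones are.

```python
from statistics import mean, quantiles

# Hypothetical sample: 95 requests at 100 ms and 5 requests at 3000 ms.
latencies_ms = [100] * 95 + [3000] * 5

average = mean(latencies_ms)

# quantiles(..., n=100) returns 99 cut points; index 49 is the 50th
# percentile and index 98 is the 99th percentile.
cuts = quantiles(latencies_ms, n=100)
p50 = cuts[49]
p99 = cuts[98]

print(f"mean={average} ms, p50={p50} ms, p99={p99} ms")
```

The mean lands between the two populations and describes neither, while the median shows what a typical user sees and the 99th percentile shows what the unluckiest users see.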
  • 14. Monitoring: Application Level Visibility Provides Insight Into Application Performance. You need visibility into how your application itself is performing. How long are certain calls to resources taking? Is that trending up or down? What part of the application is generating the most errors?
  • 17. Monitoring: Real User Monitoring (RUM) & Synthetic Monitoring. Synthetic Monitoring: Automatic testing of your site and service to measure performance. Real User Monitoring: Shows you exactly how users are interacting with your site or application. Measures page load times, DNS resolution issues, traffic bottlenecks, and more.
  • 23. Monitoring Circuit Breakers Closed Open Half Open Success Fast Failing Open Try One Request Fail Open Circuit Success Open Circuit
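The transition labels from the slide's diagram survive only in part, so the following is a minimal sketch of the commonly described three-state circuit breaker (closed, open, half open), not a reconstruction of the slide. The class name, thresholds, and timeout are assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call fails fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout_s = reset_timeout_s      # assumed default
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let one trial request through.
            self.state = "half_open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                # Trial failed or too many failures: open the circuit.
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

Wrapping each call to a flaky dependency in `breaker.call(...)` keeps a failing downstream service from tying up callers while it recovers.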
  • 25. Logging: Consistent Log Format. Consider using JSON for logging. Use log levels correctly [INFO/WARN/CRIT]. Add context to your logging statements. Log behaviors and errors. Consider how analytics will be used on this data.
  • 26. Logging: UTC Timestamps. Centrally aggregated logs make analysis easier. Helps prevent mismatch errors due to DST. Prepares you for multi-region. Log tool interfaces let you adjust time zones per user. Example: [2017-07-13 14:49:24.436245]
  • 27. Logging: Individual Transaction IDs. Log the session ID that generated the error, the user who encountered the error, the user’s location in the application, and the ID of the transaction or product that caused the error. Be careful about what you log from a security perspective. [Diagram: Web, App, and Database tiers, with the same ID 10948281 appearing on more than one tier.]
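The three logging slides above (JSON format, UTC timestamps, transaction IDs) can be combined in one small sketch using Python's standard `logging` module. The field names and the `transaction_id` attribute are assumptions for illustration; note that Python names its levels WARNING and CRITICAL rather than WARN and CRIT.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a UTC timestamp."""

    def format(self, record):
        entry = {
            # record.created is seconds since the epoch; convert to UTC.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical field carried via the `extra` argument.
            "transaction_id": getattr(record, "transaction_id", None),
        }
        return json.dumps(entry)


def build_logger(name="app"):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger()
logger.warning("payment declined", extra={"transaction_id": "10948281"})
```

Passing the same transaction ID through every tier lets a central log store join the web, app, and database entries for one request.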
  • 28. Documentation: Store Your Documentation Close To Your Code (README). What the code does. How to install and run it. How to interact with it (stop, start, restart). How to configure it. How to troubleshoot it. What metrics and dashboards are available.
  • 30. Alerting: "Level 1" Operations Teams Should Be Automated. check process nginx with pidfile /var/run/nginx.pid start program = "/etc/init.d/nginx start" stop program = "/etc/init.d/nginx stop" group www (for centos)
  • 31. Alerting "Level 1" Operations Teams Should Be Automated EC2 Auto Recovery
  • 32. Alerting "Level 1" Operations Teams Should Be Automated EC2 Auto Scaling
  • 33. Alerting: Build Proper Escalation Paths For Alerts. Escalation path: Primary, then Secondary, then Team, then Management, with 10 minutes between each step. Being paged when something fails is great, but you always need a backup. Alerts need to auto-escalate when not acknowledged. As an alert escalates, it’s good to notify a wider range of people to get more eyes on the issue. Review alerts that have been ack’d or silenced beyond a tolerable threshold.
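The escalation path above can be expressed as a small lookup. This is a sketch only: the tier names and the 10-minute step come from the slide, while the function name and the choice to keep notifying earlier tiers as the alert widens are assumptions.

```python
ESCALATION_PATH = ["primary", "secondary", "team", "management"]
STEP_MINUTES = 10  # interval between tiers, as shown on the slide


def who_to_page(minutes_unacknowledged):
    """Return every tier that should have been notified by now.

    Earlier tiers stay on the list so the audience widens as the
    alert escalates (an assumption about the intended behavior).
    """
    step = min(minutes_unacknowledged // STEP_MINUTES,
               len(ESCALATION_PATH) - 1)
    return ESCALATION_PATH[: step + 1]


for minutes in (0, 10, 25, 60):
    print(minutes, who_to_page(minutes))
```

A paging tool would call something like this on a timer for each unacknowledged alert and stop once someone acknowledges it.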
  • 34. Alerting: Developers’ Code Should Only Burden Themselves. Bad application code causes a 40% increase in CPU usage across a cluster. Temporary fix: Operations adds capacity. Permanent fix: Developer deploys a hotfix.
  • 36. Service Level Agreements/Objectives: Services Should Have An SLA/SLO. Examples: /Search 99.99%, /Cart 99.999%, /Avatars 99.9%. These are internal SLAs for the company. Helps identify how much effort should be put into the reliability of each service. Important when using microservices, so that teams can reliably build dependencies on your service. https://landing.google.com/sre/book/chapters/service-level-objectives.html
  • 37. Service Level Agreements: Understand The Cost Of Adding Each 9
    Level of Availability | Percent of Uptime | Downtime per Year | Downtime per Day
    1 Nine | 90% | 36.5 days | 2.4 hours
    2 Nines | 99% | 3.65 days | 14 minutes
    3 Nines | 99.9% | 8.76 hours | 86 seconds
    4 Nines | 99.99% | 52.6 minutes | 8.6 seconds
    5 Nines | 99.999% | 5.25 minutes | 0.86 seconds
    6 Nines | 99.9999% | 31.5 seconds | 86 milliseconds
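The downtime figures follow directly from the availability percentage: allowed downtime is the unavailable fraction multiplied by the length of the period. A short sketch, assuming a 365-day year (the function and constant names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumes a non-leap year


def downtime_seconds(availability_percent, period_seconds):
    """Allowed downtime in seconds for a given availability target."""
    return (1 - availability_percent / 100) * period_seconds


for target in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(target, SECONDS_PER_YEAR)
    per_day = downtime_seconds(target, SECONDS_PER_DAY)
    print(f"{target}%: {per_year / 3600:.2f} h/year, {per_day:.3f} s/day")
```

Each added nine cuts the allowed downtime by a factor of ten, which is why each one costs so much more engineering effort than the last.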
  • 38. Expected Throughput: Run Load Tests & Understand Your Limits. Before a service goes live, know where your breaking points are. Know the bare minimum number of instances needed to run your average throughput. Know the maximum throughput you can handle with your current architecture. Calculate the throughput-per-instance ratio so you can set up auto-scaling accurately and in a cost-optimized way.
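Once a load test gives you a throughput-per-instance figure, sizing becomes simple arithmetic. The sketch below is illustrative: the 20% headroom, the minimum of two instances, and the example numbers are assumptions, not values from the talk.

```python
import math


def instances_needed(target_rps, rps_per_instance,
                     headroom=0.2, min_instances=2):
    """Instances required to serve target_rps.

    headroom reserves a fraction of each instance's measured capacity;
    min_instances keeps at least two running to avoid a single point
    of failure. Both defaults are assumptions for illustration.
    """
    usable_rps = rps_per_instance * (1 - headroom)
    return max(min_instances, math.ceil(target_rps / usable_rps))


# Hypothetical load-test result: one instance sustains 500 requests/second.
print(instances_needed(target_rps=1000, rps_per_instance=500))
print(instances_needed(target_rps=100, rps_per_instance=500))
```

The same ratio can feed an auto-scaling policy, for example by scaling on requests per instance rather than on CPU alone.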
  • 39. Expected Throughput Helps with Cost Optimization & Auto Scaling
  • 40. Expected Throughput: Provides Performance Baseline For Future Releases. [Chart: Max RPS for V1 vs V14, axis from 0 to 3500.] As code evolves, so does your performance. Understand the impact of additional libraries, added lines of code, and new external calls. Here we see a 63.58% increase in performance from V1 to V14. This directly correlates to your infrastructure cost.
  • 42. Testing: Adopt Automated Testing Early. Builds confidence in the code being released. Allows you to test more of your application in less time. Manual testing can become error-prone.
  • 43. Testing: Test Driven Development. Red, Green, Refactor. Red: write a test first and watch it fail. Green: develop code so it passes. Refactor: refactor and optimize the code. Repeat.
  • 45. Deployment Strategy: Database Migrations. Understand what changes to the database need to happen to support new code releases. Avoid removing columns; only make additions, to reduce risk. Be sure to test migrations against test copies of the database. Keep a revision history of database migrations for reference. Snapshot databases before doing migrations.
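A minimal sketch of additive, versioned migrations, using Python's built-in `sqlite3` module against an in-memory database. The table names, the `schema_migrations` bookkeeping table, and the migration list are all hypothetical; a real project would more likely use a dedicated migration tool.

```python
import sqlite3

# Revision history: each entry only adds to the schema, never removes.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN display_name TEXT"),
]


def migrate(conn):
    """Apply any migrations not yet recorded; safe to run repeatedly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations "
        "(version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in
               conn.execute("SELECT version FROM schema_migrations")}
    for version, statement in MIGRATIONS:
        if version in applied:
            continue
        conn.execute(statement)
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
        )
    conn.commit()
    return sorted(applied | {version for version, _ in MIGRATIONS})


# Run against a throwaway copy first, as the slide recommends.
test_db = sqlite3.connect(":memory:")
print(migrate(test_db))
```

Because the new column is nullable and nothing is dropped, code from the previous release keeps working while the new release rolls out.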
  • 46. Deployment Strategy: Canary Pools. [Diagram: a load balancer in front of Version 1 and Version 2 pools, shown in two stages. Extracted labels, in order: 10%, 90%; 100%, 0%; 0% Errors, 0% Errors.]
  • 47. Deployment Strategy: Dark Deploys & Feature Flags. Opt In: Test new features with selected users. Kill Switch: Disable poorly performing features. Scalable Roll Outs: Do % roll outs of new features. Block Users: Prevent selected users from features. Run A/B Tests: Test and compare new features. Sunset Old Features: Safely decommission old features.
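Several of the feature-flag uses above (opt in, kill switch, percentage roll outs, blocked users) fit in one small decision function. This is a sketch under assumptions: the function names, parameters, and the hash-based bucketing scheme are illustrative and not taken from any particular feature-flag product.

```python
import hashlib


def bucket(flag, user_id):
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, rollout_percent,
               kill_switch=False, opted_in=(), blocked=()):
    """Decide whether a user sees a feature."""
    if kill_switch or user_id in blocked:
        return False  # kill switch and block list always win
    if user_id in opted_in:
        return True   # opted-in testers see it regardless of percentage
    # Percentage roll out: the same user always lands in the same bucket.
    return bucket(flag, user_id) < rollout_percent


print(is_enabled("new-checkout", "user-42", rollout_percent=10))
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests while giving different flags independent roll out populations.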
  • 48. Error Budget: Spend it! It’s there for you to use. The error budget is there for you to take calculated risks in your environment. It allows you to save up a large budget to spend on major architectural changes. Some companies force the spending of this budget when it’s not utilized, to encourage services that depend on it to fail gracefully. If the SLA is 99.99% and the service is running at 100%, they will manually force downtime to stay at 99.99%.
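An error budget can be tracked as a count of requests that are allowed to fail in a period. The sketch below uses `fractions.Fraction` so the arithmetic is exact; the request counts and function names are hypothetical.

```python
from fractions import Fraction


def allowed_failures(slo_percent, total_requests):
    """Requests that may fail in the period without breaching the SLO.

    slo_percent is passed as a string (for example "99.99") so the
    value is parsed exactly rather than as a binary float.
    """
    return total_requests * (1 - Fraction(slo_percent) / 100)


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Positive means budget left to spend; negative means SLO breached."""
    return allowed_failures(slo_percent, total_requests) - failed_requests


# Hypothetical month: 1,000,000 requests, 40 of which failed.
print(budget_remaining("99.99", 1_000_000, 40))
```

A positive remainder is the room available for risky deploys or deliberate failure injection before the period ends.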
  • 49. Production Readiness Review: Summary of key areas for a PRR. Architecture Design Review, Monitoring, Logging, Documentation, Alerting, Service Level Agreement, Expected Throughput, Testing, Deploy Strategy.
  • 50. Resources: Useful resources related to the topics covered. Production Readiness Review: https://arxiv.org/pdf/1305.2402.pdf; Netflix Hystrix Circuit Breaker: https://github.com/Netflix/Hystrix/wiki/How-it-Works; Feature Flags: https://en.wikipedia.org/wiki/Feature_toggle; Error Budgets: https://landing.google.com/sre/interview/ben-treynor.html; Monitoring Philosophies: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit