SlideShare a Scribd company logo
1 of 128
Cloud Native, Capacity, Performance
and Cost Optimization Tools and
Techniques
CMG Workshop
November 2013
Adrian Cockcroft
@adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Presentation vs. Workshop
• Presentation
– Short duration, focused subject
– One presenter to many anonymous audience
– A few questions at the end

• Workshop
– Time to explore in and around the subject
– Tutor gets to know the audience
– Discussion, rat-holes, “bring out your dead”
Attendee Introductions
• Who are you, where do you work
• Why are you here today, what do you need
• “Bring out your dead”
– Do you have a specific problem or question?
– One sentence elevator pitch

• What instrument do you play?
Content
Cloud Native
Migration Path
Service and API Architectures
Storage Architecture
Operations and Tools
Cost Optimization
More?
Cloud Native
What is it?
Why?
Strive for perfection
Perfect code
Perfect hardware
Perfectly operated
But perfection takes too long…
Compromises…
Time to market vs. Quality
Utopia remains out of reach
Where time to market wins big
Making a land-grab
Disrupting competitors (OODA)
Anything delivered as web services
Land grab
opportunity

Engage
customers

Deliver

Measure
customers

Act

Competitive
move

Observe

Colonel Boyd,
USAF
“Get inside your
adversaries'
OODA loop to
disorient them”

Customer
Pain Point

Analysis

Orient
Model
alternatives

Implement

Decide
Commit
resources

Plan
response
Get buy-in
How Soon?
Product features in days instead of months
Deployment in minutes instead of weeks
Incident response in seconds instead of hours
Cloud Native
A new engineering challenge
Construct a highly agile and highly
available service from ephemeral and
assumed broken components
Inspiration
How to get to Cloud Native
Freedom and Responsibility for Developers
Decentralize and Automate Ops Activities
Integrate DevOps into the Business Organization
Four Transitions
• Management: Integrated Roles in a Single Organization
– Business, Development, Operations -> BusDevOps

• Developers: Denormalized Data – NoSQL
– Decentralized, scalable, available, polyglot

• Responsibility from Ops to Dev: Continuous Delivery
– Decentralized small daily production updates

• Responsibility from Ops to Dev: Agile Infrastructure - Cloud
– Hardware in minutes, provisioned directly by developers
Netflix BusDevOps Organization
Chief Product
Officer

VP Product
Management

VP UI
Engineering

VP Discovery
Engineering

VP Platform

Directors
Product

Directors
Development

Directors
Development

Directors
Platform

Code, independently updated
continuous delivery

Developers +
DevOps

Developers +
DevOps

Developers +
DevOps

Denormalized, independently
updated and scaled data

UI Data
Sources

Discovery
Data Sources

Platform
Data Sources

Cloud, self service updated &
scaled infrastructure

AWS

AWS

AWS
Decentralized Deployment
Asgard Developer Portal
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours

Autoscale Up

Autoscale Down

P
u
s
h
Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
How Netflix Used to Work
Consumer
Electronics

Oracle

Monolithic Web
App

AWS Cloud
Services

MySQL

CDN Edge
Locations

Oracle
Datacenter

Customer Device
(PC, PS3, TV…)

Monolithic
Streaming App
MySQL

Content
Management
Limelight/Level 3
Akamai CDNs

Content Encoding
How Netflix Streaming Works Today
Consumer
Electronics

User Data

Web Site or
Discovery API

AWS Cloud
Services

Personalization

CDN Edge
Locations

DRM
Datacenter

Customer Device
(PC, PS3, TV…)

Streaming API
QoS Logging

OpenConnect
CDN Boxes

CDN
Management
and Steering

Content Encoding
The DIY Question
Why doesn’t Netflix build and run its
own cloud?
Fitting Into Public Scale

1,000 Instances

Public

Startups

100,000 Instances

Grey
Area
Netflix

Private

Facebook
How big is Public?
AWS Maximum Possible Instance Count 4.2 Million – May 2013
Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange

AWS upper bound estimate based on the number of public IP Addresses
Every provisioned instance gets a public IP by default (some VPC don’t)
The Alternative Supplier
Question
What if there is no clear leader for a
feature, or AWS doesn’t have what
we need?
Things We Don’t Use AWS For
SaaS Applications – Pagerduty, Appdynamics
Content Delivery Service
DNS Service
Nov
2012
Streaming
Bandwidth

March
2013
Mean
Bandwidth
+39% 6mo
CDN Scale

Gigabits

Terabits
Akamai

Startups

Limelight
Level 3

AWS CloudFront

Netflix
Openconnect
YouTube

Facebook

Netflix
Content Delivery Service
Open Source Hardware Design + FreeBSD, bird, nginx
see openconnect.netflix.com
DNS Service
AWS Route53 is missing too many features (for now)
Multiple vendor strategy Dyn, Ultra, Route53
Abstracted (broken) DNS APIs with Denominator
Cost
reduction

Lower
margins

Less revenue

Process
reduction

Slow down
developers

Higher
margins

Less
competitive

More
revenue

What Changed?
Get out of the way of innovation
Best of breed, by the hour
Choices based on scale

Speed up
developers

More
competitive
Availability Questions
Is it running yet?
How many places is it running in?
How far apart are those places?
Netflix Outages
• Running very fast with scissors
– Mostly self inflicted – bugs, mistakes from pace of change
– Some caused by AWS bugs and mistakes

• Incident Life-cycle Management by Platform Team
– No runbooks, no operational changes by the SREs
– Tools to identify what broke and call the right developer

• Next step is multi-region active/active
– Investigating and building in stages during 2013
– Could have prevented some of our 2012 outages
Incidents – Impact and Mitigation
Public Relations
Media Impact

PR

Y incidents mitigated by Active
Active, game day practicing

X Incidents
High Customer
Service Calls

CS

YY incidents
mitigated by
better tools and
practices

XX Incidents
Affects AB
Test Results

Metrics impact – Feature disable
XXX Incidents
No Impact – fast retry or automated failover
XXXX Incidents

YYY incidents
mitigated by better
data tagging
Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
Each icon is
three to a few
hundred
instances
across three
AWS zones

Cassandra
memcached

Start Here

Personalization movie group choosers
(for US, Canada and Latam)

Web service
S3 bucket
Three Balanced Availability Zones
Test with Chaos Gorilla
Load Balancers

Zone A

Zone B

Zone C

Cassandra and Evcache
Replicas

Cassandra and Evcache
Replicas

Cassandra and Evcache
Replicas
Isolated Regions
EU-West Load Balancers

US-East Load Balancers

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

More?
Highly Available NoSQL Storage
A highly scalable, available and
durable deployment pattern based
on Apache Cassandra
Single Function Micro-Service Pattern
One keyspace, replaces a single table or materialized view
Single function Cassandra
Cluster Managed by Priam
Between 6 and 144 nodes

Many Different Single-Function REST Clients

Stateless Data Access REST Service
Astyanax Cassandra Client
Over 50 Cassandra clusters
Over 1000 nodes
Over 30TB backup
Over 1M writes/s/cluster

Each icon represents a horizontally scaled service of three to
hundreds of instances deployed over three availability zones
Appdynamics Service Flow Visualization

Optional
Datacenter
Update Flow
Stateless Micro-Service Architecture
Linux Base AMI (CentOS or Ubuntu)
Optional
Apache
frontend,
memcached,
non-java apps

Monitoring
Log rotation
to S3
AppDynamics
machineagent
Epic/Atlas

Java (JDK 6 or 7)
AppDynamics
appagent
monitoring
GC and thread
dump logging

Tomcat
Application war file, base
servlet, platform, client
interface jars, Astyanax

Healthcheck, status
servlets, JMX interface,
Servo autoscale
Cassandra Instance Architecture
Linux Base AMI (CentOS or Ubuntu)
Tomcat and
Priam on JDK

Java (JDK 7)

Healthcheck,
Status
AppDynamics
appagent
monitoring

Cassandra Server

Monitoring
AppDynamics
machineagent
Epic/Atlas

GC and thread
dump logging

Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk
holding Commit log and SSTables
Apache Cassandra
• Scalable and Stable in large deployments
– No additional license cost for large scale!
– Optimized for “OLTP” vs. Hbase optimized for “DSS”

• Available during Partition (AP from CAP)
– Hinted handoff repairs most transient issues
– Read-repair and periodic repair keep it clean

• Quorum and Client Generated Timestamp
– Read after write consistency with 2 of 3 copies
– Latest version includes Paxos for stronger transactions
Astyanax Cassandra Client for Java
Available at http://github.com/netflix

• Features
– Abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
– Batch manager
– Many useful recipes
– New: Entity Mapper based on JPA annotations
C* Astyanax Recipes
•
•
•
•
•
•
•
•
•

Distributed row lock (without needing zookeeper)
Multi-region row lock
Uniqueness constraint
Multi-row uniqueness constraint
Chunked and multi-threaded large file storage
Reverse index search
All rows query
Durable message queue
Contributed: High cardinality reverse index
Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
Cassandra
•Disks
•Zone A

1. Client Writes to local
coordinator
2. Coodinator writes to
other zones
3. Nodes return ack
4. Data written to
internal commit log
disks (no more than
10 seconds later)

2Cassandra
3•Disks 4

Cassandra 3

4

•Disks
•Zone C

1

•Zone B

Token
Aware
Clients

2

Cassandra

Cassandra

•Disks
•Zone B

•Disks
•Zone C

3

Cassandra
•Disks
•Zone A

4

If a node goes offline,
hinted handoff
completes the write
when the node comes
back up.
Requests can choose to
wait for one node, a
quorum, or all nodes to
ack the write
SSTable disk writes and
compactions occur
asynchronously
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to
Client which continues when
2 of 3 local nodes are
committed
3. Local coordinator writes to
remote coordinator.
4. When data arrives, remote
coordinator node acks and
copies to other remote zones
5. Remote nodes ack to local
coordinator
6. Data flushed to internal
commit log disks (no more
than 10 seconds later)

If a node or region goes offline, hinted handoff
completes the write when the node comes back up.
Nightly global compare and repair jobs ensure
everything stays consistent.

100+ms latency

Cassandra
• Disks
• Zone A

Cassandra

6

• Disks
• Zone C

• Disks
• Zone A

2

2

Cassandra

6 3

1

• Disks
• Zone B

Cassandra

5• Disks6
• Zone C

US
Clients

EU
Clients
2

Cassandra

Cassandra

• Disks
• Zone B

• Disks
• Zone C

6

Cassandra
• Disks
• Zone A

Cassandra

4Cassandra
•
4 Disks6
• Zone B
4

Cassandra

Cassandra

• Disks
• Zone B

• Disks
• Zone C

5
6Cassandra
• Disks
• Zone A
Cassandra at Scale
Benchmarking to Retire Risk

More?
Scalability from 48 to 288 nodes on AWS
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Client Writes/s by node count – Replication Factor = 3
1200000
1099837

1000000
800000
600000

Used 288 of m1.xlarge
4 CPU, 15 GB RAM, 8 ECU
Cassandra 0.86
Benchmark config only
existed for about 1hr

537172

400000

366828

200000

174373

0
0

50

100

150

200

250

300

350
Cassandra Disk vs. SSD Benchmark
Same Throughput, Lower Latency, Half Cost
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
2013 - Cross Region Use Cases
• Geographic Isolation
– US to Europe replication of subscriber data
– Read intensive, low update rate
– Production use since late 2011

• Redundancy for regional failover
– US East to US West replication of everything
– Includes write intensive data, high update rate
– Testing now
Benchmarking Global Cassandra
Write intensive test of cross region replication capacity
16 x hi1.4xlarge SSD nodes per zone = 96 total
192 TB of SSD in six locations up and running Cassandra in 20 minutes
Test
Load

1 Million reads
After 500ms
CL.ONE with no
Data loss

Validation
Load

1 Million writes
CL.ONE (wait for
one replica to ack)

Test
Load

US-East-1 Region - Virginia

US-West-2 Region - Oregon

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Inter-Zone Traffic

Inter-Region Traffic
Up to 9Gbits/s, 83ms

18TB
backups
from S3
Copying 18TB from East to West
Cassandra bootstrap 9.3 Gbit/s single threaded 48 nodes to 48 nodes
Thanks to boundary.com for these network analysis plots
Inter Region Traffic Test
Verified at desired capacity, no problems, 339 MB/s, 83ms latency
Ramp Up Load Until It Breaks!
Unmodified tuning, dropping client data at 1.93GB/s inter region traffic
Spare CPU, IOPS, Network, just need some Cassandra tuning for more
Managing Multi-Region Availability
AWS
Route53

DynECT
DNS

UltraDNS

Denominator

Regional Load Balancers

Regional Load Balancers

Zone A

Zone B

Zone C

Zone A

Zone B

Zone C

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Cassandra Replicas

Denominator – manage traffic via multiple DNS providers with Java code
2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May
Failure Modes and Effects
Failure Mode

Probability

Current Mitigation Plan

Application Failure

High

Automatic degraded response

AWS Region Failure

Low

Active-Active multi-region deployment

AWS Zone Failure

Medium

Continue to run on 2 out of 3 zones

Datacenter Failure

Medium

Migrate more functions to cloud

Data store failure

Low

Restore from S3 backups

S3 failure

Low

Restore from remote archive

Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional
failures didn’t make sense. Getting there…
Application Resilience
Run what you wrote
Rapid detection
Rapid Response
Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

• Computers (Datacenter or AWS) randomly die
– Fact of life, but too infrequent to test resiliency

• Test to make sure systems are resilient
– Kill individual instances without customer impact

• Latency Monkey (coming soon)
– Inject extra latency and error return codes
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html

Eureka Services
metadata

AWS
Instances, ASGs, etc.

AppDynamics
Request flow

Edda

Monkeys
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]

Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+
"10.10.1.3/32",
"10.10.1.4/32"
…
}
Cloud Native Big Data
Size the cluster to the data
Size the cluster to the questions
Never wait for space or answers
Netflix Dataoven
From cloud
Services
~100 Billion
Events/day
From C*
Terabytes of
Dimension
data

RDS
Ursula

Metadata
Aegisthus

Data Pipelines
Data Warehouse
Over 2 Petabytes

Gateways

Hadoop Clusters – AWS EMR

Tools

More?
1300 nodes

800 nodes

Multiple 150 nodes Nightly
Cloud Native Development
Patterns
Master copies of data are cloud resident
Dynamically provisioned micro-services
Services are distributed and ephemeral
Datacenter to Cloud Transition Goals
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
– Measured as mean and 99th percentile
– For both first hit (e.g. home page) and in-session hits for the same user

• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
– No central vertically scaled databases
– Leverage AWS elastic capacity effectively

• Available
– Substantially higher robustness and availability than datacenter services
– Leverage multiple AWS availability zones
– No scheduled down time, no central database schema to change

• Productive
– Optimize agility of a large development team with automation and tools
– Leave behind complex tangled datacenter code base (~8 year old architecture)
– Enforce clean layered interfaces and re-usable components
Datacenter Anti-Patterns
What do we currently do in the
datacenter that prevents us from
meeting our goals?
Rewrite from Scratch
Not everything is cloud specific
Pay down technical debt
Robust patterns
Netflix Datacenter vs. Cloud Arch
Central SQL Database

Distributed Key/Value NoSQL

Sticky In-Memory Session

Shared Memcached Session

Chatty Protocols

Latency Tolerant Protocols

Tangled Service Interfaces

Layered Service Interfaces

Instrumented Code

Instrumented Service Patterns

Fat Complex Objects

Lightweight Serializable Objects

Components as Jar Files

Components as Services
More?
Cloud Security
Fine grain security rather than perimeter
Leveraging AWS Scale to resist DDOS attacks
Automated attack surface monitoring and testing
http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
Security Architecture
• Instance Level Security baked into base AMI
– Login: ssh only allowed via portal (not between instances)
– Each app type runs as its own userid app{test|prod}

• AWS Security, Identity and Access Management
– Each app has its own security group (firewall ports)
– Fine grain user roles and resource ACLs

• Key Management
– AWS Keys dynamically provisioned, easy updates
– High grade app specific key management using HSM
More?
AWS Accounts
Accounts Isolate Concerns
• paastest – for development and testing
– Fully functional deployment of all services
– Developer tagged “stacks” for separation

• paasprod – for production
– Autoscale groups only, isolated instances are terminated
– Alert routing, backups enabled by default

• paasaudit – for sensitive services
– To support SOX, PCI, etc.
– Extra access controls, auditing

• paasarchive – for disaster recovery
– Long term archive of backups
– Different region, perhaps different vendor
Cloud Access Control
developers

Cloud Access
audit log
ssh/sudo
Gateway

wwwprod

• Userid wwwprod
Security groups don’t allow
ssh between instances

Dalprod
Cassprod

• Userid dalprod

• Userid cassprod
Our perspiration…
A Cloud Native Open Source Platform
See netflix.github.com
Example Application – RSS Reader
Zuul Traffic
Processing
and Routing

Z
U
U
L
Zuul Architecture
http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html
Ice – AWS Usage Tracking
http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html
NetflixOSS Continuous Build and Deployment
Github
NetflixOSS
Source

Maven
Central

AWS
Base AMI

Cloudbees
Jenkins
Aminator
Bakery

Dynaslave
AWS Build
Slaves

AWS
Baked AMIs

Glisten
Workflow DSL

Asgard
(+ Frigga)
Console

AWS
Account
More?
NetflixOSS Services Scope

AWS Account
Asgard Console

Archaius
Config Service

Multiple AWS Regions

Cross region Priam C*
Eureka Registry
Pytheas
Dashboards

Atlas
Monitoring

Exhibitor
Zookeeper

3 AWS Zones

Edda History
Application Clusters

Genie, Lipstick

Hadoop Services

Evcache

Cassandra

Memcached

Instances

Simian Army

Priam

Autoscale Groups

Persistent Storage

Ephemeral Storage

Zuul Traffic Mgr
Ice – AWS Usage
Cost Monitoring

More?
NetflixOSS Instance Libraries

Initialization
Service
Requests
Data Access
Logging

• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka - service registration client

• Karyon - Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
• Ribbon and Feign - REST Clients for outbound calls

• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns
• Denominator – DNS routing abstraction

• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation

More?
NetflixOSS Testing and Automation

Test Tools

• CassJmeter – Load testing for Cassandra
• Circus Monkey – Test account reservation rebalancing

Maintenance

• Janitor Monkey – Cleans up unused resources
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey – Complains about AWS limits

Availability

• Chaos Monkey – Kills Instances
• Chaos Gorilla – Kills Availability Zones
• Chaos Kong – Kills Regions
• Latency Monkey – Latency and error injection

Security

• Conformity Monkey – architectural pattern warnings
• Security Monkey – security group and S3 bucket permissions

More?
Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Released June in V3.3

IBM Example application “Acme Air”
Based on NetflixOSS running on AWS
Ported to IBM Softlayer with Rightscale

Vendor and end user interest
Openstack “Heat” getting there
Paypal C3 Console based on Asgard
Cost-Aware
Cloud Architectures
Jinesh Varia
@jinman
Technology Evangelist

Adrian Cockcroft
@adrianco
Director, Architecture
Cloud Economics – Agile ROI
Get a faster Return by Speeding up
Investment
Observe

Act

Rapid innovation by
speeding up the
OODA loop
Try, fail, try again, succeed

Orient

Decide
Experiment Often & Adapt Quickly

























• Cost of failure falls dramatically
• Return on (small incremental)
Investments is high
• More risk taking, more innovation
• More iteration, faster innovation
« Want to increase innovation?
Lower the cost of failure »
Joi Ito
Accelerate building a new line of business

Market Replay
Go Global in Minutes
Netflix Examples
• European Launch using AWS Ireland
– No employees in Ireland, no provisioning delay, everything
worked
– No need to do detailed capacity planning
– Over-provisioned on day 1, shrunk to fit after a few days
– Capacity grows as needed for additional country launches

• Brazilian Proxy Experiment
–
–
–
–

No employees in Brazil, no “meetings with IT”
Deployed instances into two zones in AWS Brazil
Experimented with network proxy optimization
Decided that gain wasn’t enough, shut everything down
Product Launch Agility - Rightsized

$
Demand
Cloud
Datacenter
Product Launch - Under-estimated
Product Launch Agility – Over-estimated

$
Return on Agility (Agile ROI) = More Revenue
Key Takeaways on Cost-Aware Architectures….
#1 Business Agility by Rapid Experimentation = Increased Revenue
When you turn off your cloud
resources, you actually stop paying for
50% Savings
Web Servers

Weekly CPU Load

1

5

9

13

17

21

25

29

Week

Optimize during a year

33

37

41

45

49
Instances

Business Throughput
50%+ Cost Saving
Scale up/down
by 70%+

Move to Load-Based Scaling
Pay as you go
AWS Support – Trusted Advisor –
Your personal cloud assistant
Other simple optimization tips

• Don’t forget to…
– Disassociate unused EIPs
– Delete unassociated Amazon
EBS volumes
– Delete older Amazon EBS
snapshots
– Leverage Amazon S3 Object
Expiration
Janitor Monkey cleans up unused resources
Building Cost-Aware Cloud Architectures
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings
When Comparing TCO…
When Comparing TCO…

Make sure that
you are including
all the cost factors
into consideration

Place
Power
Pipes
People
Patterns
Save more when you reserve

On-demand
Instances
• Pay as you go

• Starts from
$0.02/Hour

Reserved
Instances
• One time low
upfront fee +
Pay as you go
• $23 for 1 year
term and
$0.01/Hour

Light
Utilization RI
1-year and
3-year terms

Medium
Utilization RI
Heavy
Utilization RI
Break-even point

Utilization
(Uptime)

ed
es

ow
e + Pay

year

ur

Light
Utilization RI
1-year and 3year terms

Ideal For

10% - 40%

Disaster Recovery
(Lowest Upfront)

(>3.5 < 5.5
months/year)

40% - 75%
Standard Reserved
Medium
(>5.5 < 7 months/year) Capacity
Utilization RI
Heavy
Utilization RI

>75%
(>7 months/year)

Baseline Servers
(Lowest Total Cost)

Savings over
On-Demand

56%
66%
71%
Mix and Match Reserved Types and On-Demand
12

10

On-Demand

Instances

8

6

Light RI

Light RI

Light RI

Light RI

4

2

Heavy Utilization Reserved Instances
0
1

2

3

4

5

6

7

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

Days of Month
Netflix Concept for Regional Failover
Capacity
West Coast
Failover
Use

Normal
Use

East Coast

Light
Reservations

Light
Reservations

Heavy
Reservations

Heavy
Reservations
Building Cost-Aware Cloud Architectures
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings
Variety of Applications and Environments
Every Company has….

Business App Fleet

Marketing Site
Intranet Site
BI App
Multiple Products
Analytics

Every Application has….

Production Fleet

Dev Fleet
Test Fleet
Staging/QA
Perf Fleet
DR Site
Consolidated Billing: Single payer for a group of
accounts
• One Bill for multiple accounts
• Easy Tracking of account
charges (e.g., download CSV of
cost data)

• Volume Discounts can be
reached faster with combined
usage
• Reserved Instances are shared
across accounts (including RDS
Reserved DBs)
Over-Reserve the Production Environment
Total Capacity
Production Env.
Account

100 Reserved

QA/Staging Env.
Account

0 Reserved

Perf Testing Env.
Account

0 Reserved

Development Env.
Account

0 Reserved

Storage Account

0 Reserved
Consolidated Billing Borrows Unused Reservations
Total Capacity
Production Env.
Account

68 Used

QA/Staging Env.
Account

10 Borrowed

Perf Testing Env.
Account

6 Borrowed

Development Env.
Account

12 Borrowed

Storage Account

4 Borrowed
Consolidated Billing Advantages
• Production account is guaranteed to get burst capacity
– Reservation is higher than normal usage level
– Requests for more capacity always work up to reserved
limit
– Higher availability for handling unexpected peak demands

• No additional cost
– Other lower priority accounts soak up unused reservations
– Totals roll up in the monthly billing cycle
Building Cost-Aware Cloud Architectures
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings
#4 Consolidated Billing and Shared Reservations = Savings
Continuous optimization in your
architecture results in
recurring savings
as early as your next month’s bill
Right-size your cloud: Use only what you need
• An instance type
for every purpose
• Assess your
memory & CPU
requirements
– Fit your
application to
the resource
– Fit the resource
to your
application

• Only use a larger
instance when
needed
Reserved Instance Marketplace

Buy a smaller term instance
Buy instance with different OS or type
Buy a Reserved instance in different region

Sell your unused Reserved Instance
Sell unwanted or over-bought capacity
Further reduce costs by optimizing
Instance Type Optimization

Older m1 and m2 families
• Slower CPUs
• Higher response times
• Smaller caches (6MB)
• Oldest m1.xl 15GB/8ECU/48c
• Old m2.xl 17GB/6.5ECU/41c
• ~16 ECU/$/hr

Latest m3 family
• Faster CPUs
• Lower response times
• Bigger caches (20MB)
• Even faster for Java vs. ECU
• New m3.xl 15GB/13 ECU/50c
• 26 ECU/$/hr – 62% better!
• Java measured even higher
• Deploy fewer instances
Building Cost-Aware Cloud Architectures
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings
#4 Consolidated Billing and Shared Reservations = Savings
#5 Always-on Instance Type Optimization = Recurring Savings
Follow the Customer (Run web servers) during the day
16

No. of Reserved
Instances

No of Instances Running

14
12
10

8
Auto Scaling Servers
Hadoop Servers

6
4
2
0
Mon

Tue

Wed

Thur

Fri

Sat

Sun

Week

Follow the Money (Run Hadoop clusters) at night
Total
Instances
Reserved
Table
14 Types

Web
Application
Fleet
Total Instances
Running now = 100

4 AZ-mappings

Unused
Reservations
Calculator

Launch 40

Hadoop Fleet

Total unused Reservations
available = 40 in 2 AZs
(5 min interval)
Soaking up unused reservations
Unused reserved instances is published as a metric
Netflix Data Science ETL Workload
• Daily business metrics roll-up
• Starts after midnight
• EMR clusters started using hundreds of instances
Netflix Movie Encoding Workload
• Long queue of high and low priority encoding jobs
• Can soak up 1000’s of additional unused instances
Building Cost-Aware Cloud Architectures
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings
#4 Consolidated Billing and Shared Reservations = Savings
#5 Always-on Instance Type Optimization = Recurring Savings

#6 Follow the Customer (Run web servers) during the day
Follow the Money (Run Hadoop clusters) at night
Thank you!

Jinesh Varia and Adrian Cockcroft
jvaria@amazon.com @jinman
acockcroft@netflix.com @adrianco
Slideshare.net/Netflix Details
•

Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games
– http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html

•

Lightning Talks March S1E2
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-androadmap

•

Lightning Talks Feb S1E1
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks

•

Asgard In Depth Feb S1E1
– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house

•

Security Architecture
– http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
Takeaways
Cloud Native Manages Scale and Complexity at Speed
NetflixOSS makes it easier for everyone to become Cloud Native
Rethink deployments and turn things off to save money!
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS

More Related Content

What's hot

Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connectAdrian Cockcroft
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The CloudAmazon Web Services
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesAdrian Cockcroft
 
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Adrian Cockcroft
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Adrian Cockcroft
 
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR HadoopCrunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR HadoopAdrian Cockcroft
 
Ciso executive summit 2012
Ciso executive summit 2012Ciso executive summit 2012
Ciso executive summit 2012Bill Burns
 
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer ToolsDevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer ToolsAmazon Web Services
 
DevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer ToolsDevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer ToolsAmazon Web Services
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumAdrian Cockcroft
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud ArchitectureAdrian Cockcroft
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSFAdrian Cockcroft
 
Cloud Native Cost Optimization
Cloud Native Cost OptimizationCloud Native Cost Optimization
Cloud Native Cost OptimizationAdrian Cockcroft
 
Architecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopArchitecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopSudhir Tonse
 
Asgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the CloudAsgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the CloudJoe Sondow
 
Microservices: What's Missing - O'Reilly Software Architecture New York
Microservices: What's Missing - O'Reilly Software Architecture New YorkMicroservices: What's Missing - O'Reilly Software Architecture New York
Microservices: What's Missing - O'Reilly Software Architecture New YorkAdrian Cockcroft
 
Leveraging elastic web scale computing with AWS
 Leveraging elastic web scale computing with AWS Leveraging elastic web scale computing with AWS
Leveraging elastic web scale computing with AWSShiva Narayanaswamy
 

What's hot (20)

Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)
 
Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)Cloud Architecture Tutorial - Running in the Cloud (3of3)
Cloud Architecture Tutorial - Running in the Cloud (3of3)
 
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR HadoopCrunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop
 
Ciso executive summit 2012
Ciso executive summit 2012Ciso executive summit 2012
Ciso executive summit 2012
 
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer ToolsDevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
 
DevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer ToolsDevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
DevOps on AWS: Accelerating Software Delivery with the AWS Developer Tools
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source Platform
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV Forum
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSF
 
Cloud Native Cost Optimization
Cloud Native Cost OptimizationCloud Native Cost Optimization
Cloud Native Cost Optimization
 
Architecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopArchitecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash Workshop
 
Asgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the CloudAsgard, the Grails App that Deploys Netflix to the Cloud
Asgard, the Grails App that Deploys Netflix to the Cloud
 
Microservices: What's Missing - O'Reilly Software Architecture New York
Microservices: What's Missing - O'Reilly Software Architecture New YorkMicroservices: What's Missing - O'Reilly Software Architecture New York
Microservices: What's Missing - O'Reilly Software Architecture New York
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
 
Leveraging elastic web scale computing with AWS
 Leveraging elastic web scale computing with AWS Leveraging elastic web scale computing with AWS
Leveraging elastic web scale computing with AWS
 

Viewers also liked

Spinnaker - Bay Area AWS Meetup - 20160726
Spinnaker - Bay Area AWS Meetup - 20160726Spinnaker - Bay Area AWS Meetup - 20160726
Spinnaker - Bay Area AWS Meetup - 20160726Adam Jordens
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsAdrian Cockcroft
 
Continuous Deployment to the Cloud using Spinnaker
Continuous Deployment to the Cloud using SpinnakerContinuous Deployment to the Cloud using Spinnaker
Continuous Deployment to the Cloud using SpinnakerTim Ysewyn
 
Spinnaker VLDB 2011
Spinnaker VLDB 2011Spinnaker VLDB 2011
Spinnaker VLDB 2011sandeep_tata
 
AWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices WebinarAWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices WebinarAmazon Web Services
 
Running Lean Architectures: How to Optimize for Cost Efficiency
Running Lean Architectures: How to Optimize for Cost Efficiency Running Lean Architectures: How to Optimize for Cost Efficiency
Running Lean Architectures: How to Optimize for Cost Efficiency Amazon Web Services
 
It's Time the Data Center Gets the "Moneyball" Treatment
It's Time the Data Center Gets the "Moneyball" TreatmentIt's Time the Data Center Gets the "Moneyball" Treatment
It's Time the Data Center Gets the "Moneyball" TreatmentTeamQuest Corporation
 
Top 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the CloudTop 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the CloudAmazon Web Services
 
Making the most of your constrained resources optimizing resource allocation ...
Making the most of your constrained resources optimizing resource allocation ...Making the most of your constrained resources optimizing resource allocation ...
Making the most of your constrained resources optimizing resource allocation ...Productivity Intelligence Institute
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 

Viewers also liked (14)

Spinnaker - Bay Area AWS Meetup - 20160726
Spinnaker - Bay Area AWS Meetup - 20160726Spinnaker - Bay Area AWS Meetup - 20160726
Spinnaker - Bay Area AWS Meetup - 20160726
 
Spinnaker Chadev
Spinnaker ChadevSpinnaker Chadev
Spinnaker Chadev
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
 
Continuous Deployment to the Cloud using Spinnaker
Continuous Deployment to the Cloud using SpinnakerContinuous Deployment to the Cloud using Spinnaker
Continuous Deployment to the Cloud using Spinnaker
 
Spinnaker VLDB 2011
Spinnaker VLDB 2011Spinnaker VLDB 2011
Spinnaker VLDB 2011
 
AWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices WebinarAWS Cost Optimisation Best Practices Webinar
AWS Cost Optimisation Best Practices Webinar
 
Running Lean Architectures: How to Optimize for Cost Efficiency
Running Lean Architectures: How to Optimize for Cost Efficiency Running Lean Architectures: How to Optimize for Cost Efficiency
Running Lean Architectures: How to Optimize for Cost Efficiency
 
Resource Optimization Strategy and Innovation
Resource Optimization Strategy and Innovation Resource Optimization Strategy and Innovation
Resource Optimization Strategy and Innovation
 
It's Time the Data Center Gets the "Moneyball" Treatment
It's Time the Data Center Gets the "Moneyball" TreatmentIt's Time the Data Center Gets the "Moneyball" Treatment
It's Time the Data Center Gets the "Moneyball" Treatment
 
Top 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the CloudTop 5 Ways to Optimize for Cost Efficiency with the Cloud
Top 5 Ways to Optimize for Cost Efficiency with the Cloud
 
Making the most of your constrained resources optimizing resource allocation ...
Making the most of your constrained resources optimizing resource allocation ...Making the most of your constrained resources optimizing resource allocation ...
Making the most of your constrained resources optimizing resource allocation ...
 
Kenzan Spinnaker Meetup
Kenzan Spinnaker MeetupKenzan Spinnaker Meetup
Kenzan Spinnaker Meetup
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
NIST 800-171 Simplifying CUI and DFARS Compliance
NIST 800-171 Simplifying CUI and DFARS ComplianceNIST 800-171 Simplifying CUI and DFARS Compliance
NIST 800-171 Simplifying CUI and DFARS Compliance
 

Similar to CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Migrating Enterprise Applications to AWS
Migrating Enterprise Applications to AWSMigrating Enterprise Applications to AWS
Migrating Enterprise Applications to AWSTom Laszewski
 
The State of Serverless Computing | AWS Public Sector Summit 2017
The State of Serverless Computing | AWS Public Sector Summit 2017The State of Serverless Computing | AWS Public Sector Summit 2017
The State of Serverless Computing | AWS Public Sector Summit 2017Amazon Web Services
 
Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...
Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...
Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...Amazon Web Services
 
Building scalable OTT workflows on AWS - Serverless Video Workflows
Building scalable OTT workflows on AWS - Serverless Video WorkflowsBuilding scalable OTT workflows on AWS - Serverless Video Workflows
Building scalable OTT workflows on AWS - Serverless Video WorkflowsAmazon Web Services
 
Build & Deploy Scalable Cloud Applications in Record Time
Build & Deploy Scalable Cloud Applications in Record TimeBuild & Deploy Scalable Cloud Applications in Record Time
Build & Deploy Scalable Cloud Applications in Record TimeRightScale
 
SMC301 The State of Serverless Computing
SMC301 The State of Serverless ComputingSMC301 The State of Serverless Computing
SMC301 The State of Serverless ComputingAmazon Web Services
 
5 Years Of Building SaaS On AWS
5 Years Of Building SaaS On AWS5 Years Of Building SaaS On AWS
5 Years Of Building SaaS On AWSChristian Beedgen
 
Stay productive while slicing up the monolith
Stay productive while slicing up the monolithStay productive while slicing up the monolith
Stay productive while slicing up the monolithMarkus Eisele
 
AWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley Wood
AWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley WoodAWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley Wood
AWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley WoodAmazon Web Services
 
Vancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam ElmalakVancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam ElmalakAmazon Web Services
 
Stay productive while slicing up the monolith
Stay productive while slicing up the monolithStay productive while slicing up the monolith
Stay productive while slicing up the monolithMarkus Eisele
 
Apresentação Microsoft Azure no SASPI 5
Apresentação Microsoft Azure no SASPI 5Apresentação Microsoft Azure no SASPI 5
Apresentação Microsoft Azure no SASPI 5Lucas Chies
 
AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)
AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)
AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)Amazon Web Services
 
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAmazon Web Services
 
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...Amazon Web Services
 
Introducing to serverless computing and AWS lambda - Israel Clouds Meetup
Introducing to serverless computing and AWS lambda - Israel Clouds MeetupIntroducing to serverless computing and AWS lambda - Israel Clouds Meetup
Introducing to serverless computing and AWS lambda - Israel Clouds MeetupBoaz Ziniman
 
[Cloud Computing Day with V-Forum] Going Global on AWS
[Cloud Computing Day with V-Forum] Going Global on AWS[Cloud Computing Day with V-Forum] Going Global on AWS
[Cloud Computing Day with V-Forum] Going Global on AWSAmazon Web Services Korea
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Amazon Web Services
 
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)Amazon Web Services
 

Similar to CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques (20)

Migrating Enterprise Applications to AWS
Migrating Enterprise Applications to AWSMigrating Enterprise Applications to AWS
Migrating Enterprise Applications to AWS
 
The State of Serverless Computing | AWS Public Sector Summit 2017
The State of Serverless Computing | AWS Public Sector Summit 2017The State of Serverless Computing | AWS Public Sector Summit 2017
The State of Serverless Computing | AWS Public Sector Summit 2017
 
Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...
Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...
Migrating Enterprise Applications to AWS: Best Practices & Techniques (ENT303...
 
Building scalable OTT workflows on AWS - Serverless Video Workflows
Building scalable OTT workflows on AWS - Serverless Video WorkflowsBuilding scalable OTT workflows on AWS - Serverless Video Workflows
Building scalable OTT workflows on AWS - Serverless Video Workflows
 
Build & Deploy Scalable Cloud Applications in Record Time
Build & Deploy Scalable Cloud Applications in Record TimeBuild & Deploy Scalable Cloud Applications in Record Time
Build & Deploy Scalable Cloud Applications in Record Time
 
Digital Workloads on AWS
Digital Workloads on AWSDigital Workloads on AWS
Digital Workloads on AWS
 
SMC301 The State of Serverless Computing
SMC301 The State of Serverless ComputingSMC301 The State of Serverless Computing
SMC301 The State of Serverless Computing
 
5 Years Of Building SaaS On AWS
5 Years Of Building SaaS On AWS5 Years Of Building SaaS On AWS
5 Years Of Building SaaS On AWS
 
Stay productive while slicing up the monolith
Stay productive while slicing up the monolithStay productive while slicing up the monolith
Stay productive while slicing up the monolith
 
AWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley Wood
AWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley WoodAWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley Wood
AWS Summit 2013 | India - Web, Mobile and Social Apps on AWS, Kingsley Wood
 
Vancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam ElmalakVancouver keynote - AWS Innovate - Sam Elmalak
Vancouver keynote - AWS Innovate - Sam Elmalak
 
Stay productive while slicing up the monolith
Stay productive while slicing up the monolithStay productive while slicing up the monolith
Stay productive while slicing up the monolith
 
Apresentação Microsoft Azure no SASPI 5
Apresentação Microsoft Azure no SASPI 5Apresentação Microsoft Azure no SASPI 5
Apresentação Microsoft Azure no SASPI 5
 
AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)
AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)
AWS re:Invent 2016: Serverless IoT Back Ends (IOT401)
 
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
 
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
AWS re:Invent 2016: Discovery Channel's Broadcast Workflows and Channel Origi...
 
Introducing to serverless computing and AWS lambda - Israel Clouds Meetup
Introducing to serverless computing and AWS lambda - Israel Clouds MeetupIntroducing to serverless computing and AWS lambda - Israel Clouds Meetup
Introducing to serverless computing and AWS lambda - Israel Clouds Meetup
 
[Cloud Computing Day with V-Forum] Going Global on AWS
[Cloud Computing Day with V-Forum] Going Global on AWS[Cloud Computing Day with V-Forum] Going Global on AWS
[Cloud Computing Day with V-Forum] Going Global on AWS
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)AWS re:Invent 2016: The State of Serverless Computing (SVR311)
AWS re:Invent 2016: The State of Serverless Computing (SVR311)
 

More from Adrian Cockcroft

Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Adrian Cockcroft
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSAdrian Cockcroft
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconAdrian Cockcroft
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Adrian Cockcroft
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Adrian Cockcroft
 
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Adrian Cockcroft
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Adrian Cockcroft
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is uselessAdrian Cockcroft
 

More from Adrian Cockcroft (13)

Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013
 
NetflixOSS Meetup
NetflixOSS MeetupNetflixOSS Meetup
NetflixOSS Meetup
 
Cassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWSCassandra Performance and Scalability on AWS
Cassandra Performance and Scalability on AWS
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3)
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011
 
Migrating to Public Cloud
Migrating to Public CloudMigrating to Public Cloud
Migrating to Public Cloud
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
 
NoSQL for Netflix
NoSQL for NetflixNoSQL for Netflix
NoSQL for Netflix
 

Recently uploaded

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

  • 1. Cloud Native, Capacity, Performance and Cost Optimization Tools and Techniques CMG Workshop November 2013 Adrian Cockcroft @adrianco @NetflixOSS http://www.linkedin.com/in/adriancockcroft
  • 2. Presentation vs. Workshop • Presentation – Short duration, focused subject – One presenter to many anonymous audience – A few questions at the end • Workshop – Time to explore in and around the subject – Tutor gets to know the audience – Discussion, rat-holes, “bring out your dead”
  • 3. Attendee Introductions • Who are you, where do you work • Why are you here today, what do you need • “Bring out your dead” – Do you have a specific problem or question? – One sentence elevator pitch • What instrument do you play?
  • 4. Content Cloud Native Migration Path Service and API Architectures Storage Architecture Operations and Tools Cost Optimization More?
  • 6. Strive for perfection Perfect code Perfect hardware Perfectly operated
  • 7. But perfection takes too long… Compromises… Time to market vs. Quality Utopia remains out of reach
  • 8. Where time to market wins big Making a land-grab Disrupting competitors (OODA) Anything delivered as web services
  • 9. Land grab opportunity Engage customers Deliver Measure customers Act Competitive move Observe Colonel Boyd, USAF “Get inside your adversaries' OODA loop to disorient them” Customer Pain Point Analysis Orient Model alternatives Implement Decide Commit resources Plan response Get buy-in
  • 10. How Soon? Product features in days instead of months Deployment in minutes instead of weeks Incident response in seconds instead of hours
  • 11. Cloud Native A new engineering challenge Construct a highly agile and highly available service from ephemeral and assumed broken components
  • 13. How to get to Cloud Native Freedom and Responsibility for Developers Decentralize and Automate Ops Activities Integrate DevOps into the Business Organization
  • 14. Four Transitions • Management: Integrated Roles in a Single Organization – Business, Development, Operations -> BusDevOps • Developers: Denormalized Data – NoSQL – Decentralized, scalable, available, polyglot • Responsibility from Ops to Dev: Continuous Delivery – Decentralized small daily production updates • Responsibility from Ops to Dev: Agile Infrastructure - Cloud – Hardware in minutes, provisioned directly by developers
  • 15. Netflix BusDevOps Organization Chief Product Officer VP Product Management VP UI Engineering VP Discovery Engineering VP Platform Directors Product Directors Development Directors Development Directors Platform Code, independently updated continuous delivery Developers + DevOps Developers + DevOps Developers + DevOps Denormalized, independently updated and scaled data UI Data Sources Discovery Data Sources Platform Data Sources Cloud, self service updated & scaled infrastructure AWS AWS AWS
  • 18. Ephemeral Instances • Largest services are autoscaled • Average lifetime of an instance is 36 hours Autoscale Up Autoscale Down P u s h
  • 19. Netflix Member Web Site Home Page Personalization Driven – How Does It Work?
  • 20. How Netflix Used to Work Consumer Electronics Oracle Monolithic Web App AWS Cloud Services MySQL CDN Edge Locations Oracle Datacenter Customer Device (PC, PS3, TV…) Monolithic Streaming App MySQL Content Management Limelight/Level 3 Akamai CDNs Content Encoding
  • 21. How Netflix Streaming Works Today Consumer Electronics User Data Web Site or Discovery API AWS Cloud Services Personalization CDN Edge Locations DRM Datacenter Customer Device (PC, PS3, TV…) Streaming API QoS Logging OpenConnect CDN Boxes CDN Management and Steering Content Encoding
  • 22. The DIY Question Why doesn’t Netflix build and run its own cloud?
  • 23. Fitting Into Public Scale 1,000 Instances Public Startups 100,000 Instances Grey Area Netflix Private Facebook
  • 24. How big is Public? AWS Maximum Possible Instance Count 4.2 Million – May 2013 Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange AWS upper bound estimate based on the number of public IP Addresses Every provisioned instance gets a public IP by default (some VPC don’t)
  • 25. The Alternative Supplier Question What if there is no clear leader for a feature, or AWS doesn’t have what we need?
  • 26. Things We Don’t Use AWS For SaaS Applications – Pagerduty, Appdynamics Content Delivery Service DNS Service
  • 28. CDN Scale Gigabits Terabits Akamai Startups Limelight Level 3 AWS CloudFront Netflix Openconnect YouTube Facebook Netflix
  • 29. Content Delivery Service Open Source Hardware Design + FreeBSD, bird, nginx see openconnect.netflix.com
  • 30. DNS Service AWS Route53 is missing too many features (for now) Multiple vendor strategy Dyn, Ultra, Route53 Abstracted (broken) DNS APIs with Denominator
  • 31. Cost reduction Lower margins Less revenue Process reduction Slow down developers Higher margins Less competitive More revenue What Changed? Get out of the way of innovation Best of breed, by the hour Choices based on scale Speed up developers More competitive
  • 32. Availability Questions Is it running yet? How many places is it running in? How far apart are those places?
  • 33.
  • 34. Netflix Outages • Running very fast with scissors – Mostly self inflicted – bugs, mistakes from pace of change – Some caused by AWS bugs and mistakes • Incident Life-cycle Management by Platform Team – No runbooks, no operational changes by the SREs – Tools to identify what broke and call the right developer • Next step is multi-region active/active – Investigating and building in stages during 2013 – Could have prevented some of our 2012 outages
  • 35. Incidents – Impact and Mitigation Public Relations Media Impact PR Y incidents mitigated by Active Active, game day practicing X Incidents High Customer Service Calls CS YY incidents mitigated by better tools and practices XX Incidents Affects AB Test Results Metrics impact – Feature disable XXX Incidents No Impact – fast retry or automated failover XXXX Incidents YYY incidents mitigated by better data tagging
  • 36. Real Web Server Dependencies Flow (Netflix Home page business transaction as seen by AppDynamics) Each icon is three to a few hundred instances across three AWS zones Cassandra memcached Start Here Personalization movie group choosers (for US, Canada and Latam) Web service S3 bucket
  • 37. Three Balanced Availability Zones Test with Chaos Gorilla Load Balancers Zone A Zone B Zone C Cassandra and Evcache Replicas Cassandra and Evcache Replicas Cassandra and Evcache Replicas
  • 38. Isolated Regions EU-West Load Balancers US-East Load Balancers Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas More?
  • 39. Highly Available NoSQL Storage A highly scalable, available and durable deployment pattern based on Apache Cassandra
  • 40. Single Function Micro-Service Pattern One keyspace, replaces a single table or materialized view Single function Cassandra Cluster Managed by Priam Between 6 and 144 nodes Many Different Single-Function REST Clients Stateless Data Access REST Service Astyanax Cassandra Client Over 50 Cassandra clusters Over 1000 nodes Over 30TB backup Over 1M writes/s/cluster Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones Appdynamics Service Flow Visualization Optional Datacenter Update Flow
  • 41. Stateless Micro-Service Architecture Linux Base AMI (CentOS or Ubuntu) Optional Apache frontend, memcached, non-java apps Monitoring Log rotation to S3 AppDynamics machineagent Epic/Atlas Java (JDK 6 or 7) AppDynamics appagent monitoring GC and thread dump logging Tomcat Application war file, base servlet, platform, client interface jars, Astyanax Healthcheck, status servlets, JMX interface, Servo autoscale
  • 42. Cassandra Instance Architecture Linux Base AMI (CentOS or Ubuntu) Tomcat and Priam on JDK Java (JDK 7) Healthcheck, Status AppDynamics appagent monitoring Cassandra Server Monitoring AppDynamics machineagent Epic/Atlas GC and thread dump logging Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
  • 43. Apache Cassandra • Scalable and Stable in large deployments – No additional license cost for large scale! – Optimized for “OLTP” vs. Hbase optimized for “DSS” • Available during Partition (AP from CAP) – Hinted handoff repairs most transient issues – Read-repair and periodic repair keep it clean • Quorum and Client Generated Timestamp – Read after write consistency with 2 of 3 copies – Latest version includes Paxos for stronger transactions
  • 44. Astyanax Cassandra Client for Java Available at http://github.com/netflix • Features – Abstraction of connection pool from RPC protocol – Fluent Style API – Operation retry with backoff – Token aware – Batch manager – Many useful recipes – New: Entity Mapper based on JPA annotations
  • 45. C* Astyanax Recipes • • • • • • • • • Distributed row lock (without needing zookeeper) Multi-region row lock Uniqueness constraint Multi-row uniqueness constraint Chunked and multi-threaded large file storage Reverse index search All rows query Durable message queue Contributed: High cardinality reverse index
  • 46. Astyanax - Cassandra Write Data Flows Single Region, Multiple Availability Zone, Token Aware Cassandra •Disks •Zone A 1. Client Writes to local coordinator 2. Coodinator writes to other zones 3. Nodes return ack 4. Data written to internal commit log disks (no more than 10 seconds later) 2Cassandra 3•Disks 4 Cassandra 3 4 •Disks •Zone C 1 •Zone B Token Aware Clients 2 Cassandra Cassandra •Disks •Zone B •Disks •Zone C 3 Cassandra •Disks •Zone A 4 If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write SSTable disk writes and compactions occur asynchronously
  • 47. Data Flows for Multi-Region Writes Token Aware, Consistency Level = Local Quorum 1. Client writes to local replicas 2. Local write acks returned to Client which continues when 2 of 3 local nodes are committed 3. Local coordinator writes to remote coordinator. 4. When data arrives, remote coordinator node acks and copies to other remote zones 5. Remote nodes ack to local coordinator 6. Data flushed to internal commit log disks (no more than 10 seconds later) If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent. 100+ms latency Cassandra • Disks • Zone A Cassandra 6 • Disks • Zone C • Disks • Zone A 2 2 Cassandra 6 3 1 • Disks • Zone B Cassandra 5• Disks6 • Zone C US Clients EU Clients 2 Cassandra Cassandra • Disks • Zone B • Disks • Zone C 6 Cassandra • Disks • Zone A Cassandra 4Cassandra • 4 Disks6 • Zone B 4 Cassandra Cassandra • Disks • Zone B • Disks • Zone C 5 6Cassandra • Disks • Zone A
  • 48. Cassandra at Scale Benchmarking to Retire Risk More?
  • 49. Scalability from 48 to 288 nodes on AWS http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html Client Writes/s by node count – Replication Factor = 3 1200000 1099837 1000000 800000 600000 Used 288 of m1.xlarge 4 CPU, 15 GB RAM, 8 ECU Cassandra 0.86 Benchmark config only existed for about 1hr 537172 400000 366828 200000 174373 0 0 50 100 150 200 250 300 350
  • 50. Cassandra Disk vs. SSD Benchmark Same Throughput, Lower Latency, Half Cost http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
  • 51. 2013 - Cross Region Use Cases • Geographic Isolation – US to Europe replication of subscriber data – Read intensive, low update rate – Production use since late 2011 • Redundancy for regional failover – US East to US West replication of everything – Includes write intensive data, high update rate – Testing now
  • 52. Benchmarking Global Cassandra Write intensive test of cross region replication capacity 16 x hi1.4xlarge SSD nodes per zone = 96 total 192 TB of SSD in six locations up and running Cassandra in 20 minutes Test Load 1 Million reads After 500ms CL.ONE with no Data loss Validation Load 1 Million writes CL.ONE (wait for one replica to ack) Test Load US-East-1 Region - Virginia US-West-2 Region - Oregon Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Inter-Zone Traffic Inter-Region Traffic Up to 9Gbits/s, 83ms 18TB backups from S3
  • 53. Copying 18TB from East to West Cassandra bootstrap 9.3 Gbit/s single threaded 48 nodes to 48 nodes Thanks to boundary.com for these network analysis plots
  • 54. Inter Region Traffic Test Verified at desired capacity, no problems, 339 MB/s, 83ms latency
  • 55. Ramp Up Load Until It Breaks! Unmodified tuning, dropping client data at 1.93GB/s inter region traffic Spare CPU, IOPS, Network, just need some Cassandra tuning for more
  • 56. Managing Multi-Region Availability AWS Route53 DynECT DNS UltraDNS Denominator Regional Load Balancers Regional Load Balancers Zone A Zone B Zone C Zone A Zone B Zone C Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Cassandra Replicas Denominator – manage traffic via multiple DNS providers with Java code 2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May
  • 57. Failure Modes and Effects Failure Mode Probability Current Mitigation Plan Application Failure High Automatic degraded response AWS Region Failure Low Active-Active multi-region deployment AWS Zone Failure Medium Continue to run on 2 out of 3 zones Datacenter Failure Medium Migrate more functions to cloud Data store failure Low Restore from S3 backups S3 failure Low Restore from remote archive Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  • 58. Application Resilience Run what you wrote Rapid detection Rapid Response
  • 59. Chaos Monkey http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html • Computers (Datacenter or AWS) randomly die – Fact of life, but too infrequent to test resiliency • Test to make sure systems are resilient – Kill individual instances without customer impact • Latency Monkey (coming soon) – Inject extra latency and error return codes
  • 60. Edda – Configuration History http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html Eureka Services metadata AWS Instances, ASGs, etc. AppDynamics Request flow Edda Monkeys
  • 61. Edda Query Examples Find any instances that have ever had a specific public IP address $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0" ["i-0123456789","i-012345678a","i-012345678b”] Show the most recent change to a security group $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2" --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810 +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504 @@ -1,33 +1,33 @@ { … "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32", + "10.10.1.3/32", "10.10.1.4/32" … }
  • 62. Cloud Native Big Data Size the cluster to the data Size the cluster to the questions Never wait for space or answers
  • 63. Netflix Dataoven From cloud Services ~100 Billion Events/day From C* Terabytes of Dimension data RDS Ursula Metadata Aegisthus Data Pipelines Data Warehouse Over 2 Petabytes Gateways Hadoop Clusters – AWS EMR Tools More? 1300 nodes 800 nodes Multiple 150 nodes Nightly
  • 64. Cloud Native Development Patterns Master copies of data are cloud resident Dynamically provisioned micro-services Services are distributed and ephemeral
  • 65. Datacenter to Cloud Transition Goals • Faster – Lower latency than the equivalent datacenter web pages and API calls – Measured as mean and 99th percentile – For both first hit (e.g. home page) and in-session hits for the same user • Scalable – Avoid needing any more datacenter capacity as subscriber count increases – No central vertically scaled databases – Leverage AWS elastic capacity effectively • Available – Substantially higher robustness and availability than datacenter services – Leverage multiple AWS availability zones – No scheduled down time, no central database schema to change • Productive – Optimize agility of a large development team with automation and tools – Leave behind complex tangled datacenter code base (~8 year old architecture) – Enforce clean layered interfaces and re-usable components
  • 66. Datacenter Anti-Patterns What do we currently do in the datacenter that prevents us from meeting our goals?
  • 67. Rewrite from Scratch Not everything is cloud specific Pay down technical debt Robust patterns
  • 68. Netflix Datacenter vs. Cloud Arch Central SQL Database Distributed Key/Value NoSQL Sticky In-Memory Session Shared Memcached Session Chatty Protocols Latency Tolerant Protocols Tangled Service Interfaces Layered Service Interfaces Instrumented Code Instrumented Service Patterns Fat Complex Objects Lightweight Serializable Objects Components as Jar Files Components as Services More?
  • 69. Cloud Security Fine grain security rather than perimeter Leveraging AWS Scale to resist DDOS attacks Automated attack surface monitoring and testing http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
  • 70. Security Architecture • Instance Level Security baked into base AMI – Login: ssh only allowed via portal (not between instances) – Each app type runs as its own userid app{test|prod} • AWS Security, Identity and Access Management – Each app has its own security group (firewall ports) – Fine grain user roles and resource ACLs • Key Management – AWS Keys dynamically provisioned, easy updates – High grade app specific key management using HSM More?
  • 72. Accounts Isolate Concerns • paastest – for development and testing – Fully functional deployment of all services – Developer tagged “stacks” for separation • paasprod – for production – Autoscale groups only, isolated instances are terminated – Alert routing, backups enabled by default • paasaudit – for sensitive services – To support SOX, PCI, etc. – Extra access controls, auditing • paasarchive – for disaster recovery – Long term archive of backups – Different region, perhaps different vendor
  • 73. Cloud Access Control developers Cloud Access audit log ssh/sudo Gateway wwwprod • Userid wwwprod Security groups don’t allow ssh between instances Dalprod Cassprod • Userid dalprod • Userid cassprod
  • 74. Our perspiration… A Cloud Native Open Source Platform See netflix.github.com
  • 75. Example Application – RSS Reader Zuul Traffic Processing and Routing Z U U L
  • 77. Ice – AWS Usage Tracking http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html
  • 78. NetflixOSS Continuous Build and Deployment Github NetflixOSS Source Maven Central AWS Base AMI Cloudbees Jenkins Aminator Bakery Dynaslave AWS Build Slaves AWS Baked AMIs Glisten Workflow DSL Asgard (+ Frigga) Console AWS Account More?
  • 79. NetflixOSS Services Scope AWS Account Asgard Console Archaius Config Service Multiple AWS Regions Cross region Priam C* Eureka Registry Pytheas Dashboards Atlas Monitoring Exhibitor Zookeeper 3 AWS Zones Edda History Application Clusters Genie, Lipstick Hadoop Services Evcache Cassandra Memcached Instances Simian Army Priam Autoscale Groups Persistent Storage Ephemeral Storage Zuul Traffic Mgr Ice – AWS Usage Cost Monitoring More?
  • 80. NetflixOSS Instance Libraries Initialization Service Requests Data Access Logging • Baked AMI – Tomcat, Apache, your code • Governator – Guice based dependency injection • Archaius – dynamic configuration properties client • Eureka - service registration client • Karyon - Base Server for inbound requests • RxJava – Reactive pattern • Hystrix/Turbine – dependencies and real-time status • Ribbon and Feign - REST Clients for outbound calls • Astyanax – Cassandra client and pattern library • Evcache – Zone aware Memcached client • Curator – Zookeeper patterns • Denominator – DNS routing abstraction • Blitz4j – non-blocking logging • Servo – metrics export for autoscaling • Atlas – high volume instrumentation More?
  • 81. NetflixOSS Testing and Automation Test Tools • CassJmeter – Load testing for Cassandra • Circus Monkey – Test account reservation rebalancing Maintenance • Janitor Monkey – Cleans up unused resources • Efficiency Monkey • Doctor Monkey • Howler Monkey – Complains about AWS limits Availability • Chaos Monkey – Kills Instances • Chaos Gorilla – Kills Availability Zones • Chaos Kong – Kills Regions • Latency Monkey – Latency and error injection Security • Conformity Monkey – architectural pattern warnings • Security Monkey – security group and S3 bucket permissions More?
  • 82. Vendor Driven Portability Interest in using NetflixOSS for Enterprise Private Clouds “It’s done when it runs Asgard” Functionally complete Demonstrated March Released June in V3.3 IBM Example application “Acme Air” Based on NetflixOSS running on AWS Ported to IBM Softlayer with Rightscale Vendor and end user interest Openstack “Heat” getting there Paypal C3 Console based on Asgard
  • 83. Cost-Aware Cloud Architectures Jinesh Varia @jinman Technology Evangelist Adrian Cockcroft @adrianco Director, Architecture
  • 84. Cloud Economics – Agile ROI Get a faster Return by Speeding up Investment Observe Act Rapid innovation by speeding up the OODA loop Try, fail, try again, succeed Orient Decide
  • 85. Experiment Often & Adapt Quickly             • Cost of failure falls dramatically • Return on (small incremental) Investments is high • More risk taking, more innovation • More iteration, faster innovation
  • 86. « Want to increase innovation? Lower the cost of failure » Joi Ito
  • 87. Accelerate building a new line of business Market Replay
  • 88. Go Global in Minutes
  • 89. Netflix Examples • European Launch using AWS Ireland – No employees in Ireland, no provisioning delay, everything worked – No need to do detailed capacity planning – Over-provisioned on day 1, shrunk to fit after a few days – Capacity grows as needed for additional country launches • Brazilian Proxy Experiment – – – – No employees in Brazil, no “meetings with IT” Deployed instances into two zones in AWS Brazil Experimented with network proxy optimization Decided that gain wasn’t enough, shut everything down
  • 90. Product Launch Agility - Rightsized $ Demand Cloud Datacenter
  • 91. Product Launch - Under-estimated
  • 92. Product Launch Agility – Over-estimated $
  • 93. Return on Agility (Agile ROI) = More Revenue
  • 94. Key Takeaways on Cost-Aware Architectures…. #1 Business Agility by Rapid Experimentation = Increased Revenue
  • 95. When you turn off your cloud resources, you actually stop paying for
  • 96. 50% Savings Web Servers Weekly CPU Load 1 5 9 13 17 21 25 29 Week Optimize during a year 33 37 41 45 49
  • 98. 50%+ Cost Saving Scale up/down by 70%+ Move to Load-Based Scaling
  • 100. AWS Support – Trusted Advisor – Your personal cloud assistant
  • 101. Other simple optimization tips • Don’t forget to… – Disassociate unused EIPs – Delete unassociated Amazon EBS volumes – Delete older Amazon EBS snapshots – Leverage Amazon S3 Object Expiration Janitor Monkey cleans up unused resources
  • 102. Building Cost-Aware Cloud Architectures #1 Business Agility by Rapid Experimentation = Increased Revenue #2 Business-driven Auto Scaling Architectures = Savings
  • 104. When Comparing TCO… Make sure that you are including all the cost factors into consideration Place Power Pipes People Patterns
  • 105. Save more when you reserve On-demand Instances • Pay as you go • Starts from $0.02/Hour Reserved Instances • One time low upfront fee + Pay as you go • $23 for 1 year term and $0.01/Hour Light Utilization RI 1-year and 3-year terms Medium Utilization RI Heavy Utilization RI
  • 106. Break-even point Utilization (Uptime) ed es ow e + Pay year ur Light Utilization RI 1-year and 3year terms Ideal For 10% - 40% Disaster Recovery (Lowest Upfront) (>3.5 < 5.5 months/year) 40% - 75% Standard Reserved Medium (>5.5 < 7 months/year) Capacity Utilization RI Heavy Utilization RI >75% (>7 months/year) Baseline Servers (Lowest Total Cost) Savings over On-Demand 56% 66% 71%
  • 107. Mix and Match Reserved Types and On-Demand 12 10 On-Demand Instances 8 6 Light RI Light RI Light RI Light RI 4 2 Heavy Utilization Reserved Instances 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 Days of Month
  • 108. Netflix Concept for Regional Failover Capacity West Coast Failover Use Normal Use East Coast Light Reservations Light Reservations Heavy Reservations Heavy Reservations
  • 109. Building Cost-Aware Cloud Architectures #1 Business Agility by Rapid Experimentation = Increased Revenue #2 Business-driven Auto Scaling Architectures = Savings #3 Mix and Match Reserved Instances with On-Demand = Savings
  • 110. Variety of Applications and Environments Every Company has…. Business App Fleet Marketing Site Intranet Site BI App Multiple Products Analytics Every Application has…. Production Fleet Dev Fleet Test Fleet Staging/QA Perf Fleet DR Site
  • 111. Consolidated Billing: Single payer for a group of accounts • One Bill for multiple accounts • Easy Tracking of account charges (e.g., download CSV of cost data) • Volume Discounts can be reached faster with combined usage • Reserved Instances are shared across accounts (including RDS Reserved DBs)
  • 112. Over-Reserve the Production Environment Total Capacity Production Env. Account 100 Reserved QA/Staging Env. Account 0 Reserved Perf Testing Env. Account 0 Reserved Development Env. Account 0 Reserved Storage Account 0 Reserved
  • 113. Consolidated Billing Borrows Unused Reservations Total Capacity Production Env. Account 68 Used QA/Staging Env. Account 10 Borrowed Perf Testing Env. Account 6 Borrowed Development Env. Account 12 Borrowed Storage Account 4 Borrowed
  • 114. Consolidated Billing Advantages • Production account is guaranteed to get burst capacity – Reservation is higher than normal usage level – Requests for more capacity always work up to reserved limit – Higher availability for handling unexpected peak demands • No additional cost – Other lower priority accounts soak up unused reservations – Totals roll up in the monthly billing cycle
  • 115. Building Cost-Aware Cloud Architectures #1 Business Agility by Rapid Experimentation = Increased Revenue #2 Business-driven Auto Scaling Architectures = Savings #3 Mix and Match Reserved Instances with On-Demand = Savings #4 Consolidated Billing and Shared Reservations = Savings
  • 116. Continuous optimization in your architecture results in recurring savings as early as your next month’s bill
  • 117. Right-size your cloud: Use only what you need • An instance type for every purpose • Assess your memory & CPU requirements – Fit your application to the resource – Fit the resource to your application • Only use a larger instance when needed
  • 118. Reserved Instance Marketplace Buy a smaller term instance Buy instance with different OS or type Buy a Reserved instance in different region Sell your unused Reserved Instance Sell unwanted or over-bought capacity Further reduce costs by optimizing
  • 119. Instance Type Optimization Older m1 and m2 families • Slower CPUs • Higher response times • Smaller caches (6MB) • Oldest m1.xl 15GB/8ECU/48c • Old m2.xl 17GB/6.5ECU/41c • ~16 ECU/$/hr Latest m3 family • Faster CPUs • Lower response times • Bigger caches (20MB) • Even faster for Java vs. ECU • New m3.xl 15GB/13 ECU/50c • 26 ECU/$/hr – 62% better! • Java measured even higher • Deploy fewer instances
  • 120. Building Cost-Aware Cloud Architectures #1 Business Agility by Rapid Experimentation = Increased Revenue #2 Business-driven Auto Scaling Architectures = Savings #3 Mix and Match Reserved Instances with On-Demand = Savings #4 Consolidated Billing and Shared Reservations = Savings #5 Always-on Instance Type Optimization = Recurring Savings
  • 121. Follow the Customer (Run web servers) during the day 16 No. of Reserved Instances No of Instances Running 14 12 10 8 Auto Scaling Servers Hadoop Servers 6 4 2 0 Mon Tue Wed Thur Fri Sat Sun Week Follow the Money (Run Hadoop clusters) at night
  • 122.
  • 123. Total Instances Reserved Table 14 Types Web Application Fleet Total Instances Running now = 100 4 AZ-mappings Unused Reservations Calculator Launch 40 Hadoop Fleet Total unused Reservations available = 40 in 2 AZs (5 min interval)
  • 124. Soaking up unused reservations Unused reserved instances is published as a metric Netflix Data Science ETL Workload • Daily business metrics roll-up • Starts after midnight • EMR clusters started using hundreds of instances Netflix Movie Encoding Workload • Long queue of high and low priority encoding jobs • Can soak up 1000’s of additional unused instances
  • 125. Building Cost-Aware Cloud Architectures #1 Business Agility by Rapid Experimentation = Increased Revenue #2 Business-driven Auto Scaling Architectures = Savings #3 Mix and Match Reserved Instances with On-Demand = Savings #4 Consolidated Billing and Shared Reservations = Savings #5 Always-on Instance Type Optimization = Recurring Savings #6 Follow the Customer (Run web servers) during the day Follow the Money (Run Hadoop clusters) at night
  • 126. Thank you! Jinesh Varia and Adrian Cockcroft jvaria@amazon.com @jinman acockcroft@netflix.com @adrianco
  • 127. Slideshare.net/Netflix Details • Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games – http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html • Lightning Talks March S1E2 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-androadmap • Lightning Talks Feb S1E1 – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks • Asgard In Depth Feb S1E1 – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house • Security Architecture – http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
  • 128. Takeaways Cloud Native Manages Scale and Complexity at Speed NetflixOSS makes it easier for everyone to become Cloud Native Rethink deployments and turn things off to save money! http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft @adrianco #netflixcloud @NetflixOSS

Editor's Notes

  1. Hive – thin metadata layer on top of S3Used for ad-hoc analytics (Ursula for merge ETL)HiveQL gets compiled into set of MR jobs (1 -&gt; many)Is a CLI – runs on the gateways, not like a relational DB server, or a service that the query gets shipped toPig – used for ETL (can create DAGs, workflows for Hadoop processes)Pig scripts also get compiled into MR jobsJava – straight up Hadoop, not for the faint of heart. Some recommendation algorithms are in Hadoop.Python/Java – UDFsApplications such as Sting use the tools on some gateway to access all the various componentsNext – focus on two key components: Data &amp; Clusters
  2. We have to be wrong a lot in order to right a lotCloud really helps you to reduce the cost of failure.
  3. Since we’ve invested in facilities around the world, we can offer you global reach at a moment’s notice. It’s cost prohibitive to put your own data center where all your customers are, but with AWS, you get the benefit without having to make the huge investment.
  4. Only happens in the cloud
  5. Our strategy of pricing each service independently gives you tremendous flexibility to choose the services you need for each project and to pay only for what you use
  6. Personal Optimization Assistant
  7. Netflix now serves 2x the customer traffic with the same amount of AWS resources as deployed 10 months ago
  8. Reduced TCO remains one of the core reasons why customers choose the AWS cloud. However, there are a number of other benefits when you choose AWS, such as reduced time to market and increased business agility, which cannot be overlooked.
  9. No Enterprise has only Steady State Workloads.In fact, no system is entirely steady state.
  10. You should use Consolidated Billing for any of the following scenarios:You have multiple accounts today and want to get a single bill and track each account&apos;s charges (e.g., you might have multiple projects, each with its own AWS account).You have multiple cost centers to track.You&apos;ve acquired a project or company that has its own existing AWS account and you want to consolidate it on the same bill with your other AWS accounts.
  11. You should use Consolidated Billing for any of the following scenarios:You have multiple accounts today and want to get a single bill and track each account&apos;s charges (e.g., you might have multiple projects, each with its own AWS account).You have multiple cost centers to track.You&apos;ve acquired a project or company that has its own existing AWS account and you want to consolidate it on the same bill with your other AWS accounts.
  12. You should use Consolidated Billing for any of the following scenarios:You have multiple accounts today and want to get a single bill and track each account&apos;s charges (e.g., you might have multiple projects, each with its own AWS account).You have multiple cost centers to track.You&apos;ve acquired a project or company that has its own existing AWS account and you want to consolidate it on the same bill with your other AWS accounts.
  13. Cloud is highly cost-effective because you can turn off and stop paying for it when you don’t need it or your users are not accessing. Build websites that sleep at night
  14. In addition, Only Use What You Need to Use.