Dynamically Scaling
Netflix in the Cloud



Coburn Watson
Manager - Cloud Performance Engineering
Netflix, Inc.
- World's leading internet television network
- 33 Million subscribers in 40 countries
- Over a billion hours streamed per month
- Approximately 33% of all US Internet traffic at night
- Increasing quantity of original content
- Recent Technical Notables
   - Open Source Software
   - OpenConnect (homegrown CDN)
About Me

- Manage Cloud Performance Engineering team
- Focus on performance since 2000-ish
    - Large-scale billing applications, eCommerce, datacenter mgmt, etc.
    - Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
- Passion for tackling performance at cloud-scale
- Looking for great performance engineers
- cwatson@netflix.com
First things first
- ASG = Autoscaling group
- AWS description:
   "An Auto Scaling group is a representation of multiple Amazon EC2 instances that share similar characteristics, and that are treated as a logical grouping for the purposes of instance scaling and management."
   "An Auto Scaling group starts by launching the minimum number (or the desired number, if specified) of EC2 instances and then increases or decreases the number of running EC2 instances automatically according to the conditions that you define."
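The AWS description above can be read as a bounded instance count. A minimal sketch of that reading, assuming a toy `AutoScalingGroup` class (hypothetical, not an AWS or Asgard API) that starts at the desired size when one is given, otherwise at the minimum, and never leaves its bounds:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoScalingGroup:
    """Toy model of an ASG: a bounded count of identical instances."""
    min_size: int
    max_size: int
    desired: Optional[int] = None  # optional, as in the AWS description

    def __post_init__(self) -> None:
        # Start at the desired count if given, otherwise at the minimum.
        start = self.min_size if self.desired is None else self.desired
        self.running = self._clamp(start)

    def _clamp(self, count: int) -> int:
        # Keep the count inside [min_size, max_size].
        return max(self.min_size, min(self.max_size, count))

    def adjust(self, delta: int) -> int:
        """Apply a scaling adjustment and return the new running count."""
        self.running = self._clamp(self.running + delta)
        return self.running
```

The conditions that trigger `adjust` are left out here; they are the scaling policies covered later in the deck.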




- Within Netflix (almost) all services are created as ASGs
   - Asgard (OSS) simplifies this process
Dynamic Scaling @ Netflix
- EC2 footprint autoscales 2500-3500 instances per day
     - order of tens of thousands of EC2 instances
- Largest ASG* spans 200-600 m2.4xlarge (68.4 GB RAM)

Why:
- Improved scalability during unexpected workloads
- Avoid sizing capacity aggressively high
    - each service team determines their capacity
- Creates "reserved instance troughs" for batch activity
    - on the order of hundreds of thousands of instance hours weekly
* largest "autoscaling" ASG
How?
- Discovery
   - AWS elastic load balancers "speak" autoscaling
   - mid-tier services utilize Eureka (OSS)
- Leverage native AWS autoscaling capabilities
- Publish our own metrics up to CloudWatch (Servo OSS)
- Stateless
How?
Two types of scaling behavior exposed in Asgard
 1. rate-based autoscaling
 2. scheduled action autoscaling
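One way the two behaviors can interact is for a scheduled action to set a time-of-day floor while the rate-based policy moves the size above it. A minimal sketch under that assumption; the function names, hours and sizes are hypothetical and are not taken from Asgard:

```python
def scheduled_min(hour: int, schedule: list[tuple[int, int]], default_min: int) -> int:
    """Return the minimum size in force at `hour` (0-23).

    `schedule` holds (start_hour, min_size) pairs; the latest entry whose
    start_hour is <= hour wins. Before the first entry, default_min applies.
    """
    floor = default_min
    for start_hour, min_size in sorted(schedule):
        if start_hour <= hour:
            floor = min_size
    return floor


def effective_size(rate_based_size: int, hour: int,
                   schedule: list[tuple[int, int]], default_min: int) -> int:
    """Rate-based sizing may exceed the scheduled floor but not drop below it."""
    return max(rate_based_size, scheduled_min(hour, schedule, default_min))


# Hypothetical schedule: raise the floor for the evening, relax it at 23:00.
evening_schedule = [(17, 400), (23, 200)]
```

With this schedule, `effective_size(350, 18, evening_schedule, 200)` holds the group at the 400-instance floor even though the rate-based policy asked for 350.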
AWS Autoscaling
- Define policies on ASG
   - alarm, scaling unit (percent/amount), cooldown, evaluation interval and period
- Cooldowns:
   - ASG-level versus policy-level (both exist)
   - cooldown start tied to last instance ready
   - should be tied closely to application/service startup time
- Execute load or squeeze tests to measure capacity
   - Frequent pushes with SOA mean per-instance capacity can change frequently
   - (insert here) 10 second primer on squeeze tests
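The policy ingredients listed above (alarm, scaling unit as percent or amount, cooldown) can be sketched as a single evaluation step. This is an illustrative model only, assuming a scale-up policy with a simple threshold alarm; `ScalingPolicy` and `evaluate` are hypothetical names, not the AWS API:

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingPolicy:
    threshold: float   # alarm fires when the metric exceeds this value
    adjustment: float  # size of the scaling unit
    percent: bool      # True: adjustment is a percentage of the current size
    cooldown_s: int    # seconds to wait after the last instance became ready


def evaluate(policy: ScalingPolicy, metric: float, current: int,
             now_s: int, last_ready_s: Optional[int]) -> int:
    """Return the instance count after one policy evaluation."""
    # The cooldown clock starts when the last launched instance is ready.
    if last_ready_s is not None and now_s - last_ready_s < policy.cooldown_s:
        return current
    if metric <= policy.threshold:
        return current  # alarm is not firing
    if policy.percent:
        step = max(1, math.ceil(current * policy.adjustment / 100))
    else:
        step = int(policy.adjustment)
    return current + step
```

Setting `cooldown_s` close to the service startup time keeps the policy from firing again before the previous launch has started taking traffic.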
In Action
- Example covers 3 services
   - 2 edge (A,B), 1 mid-tier (C)
   - C has more upstream services than simply A and B




- Multiple autoscaling policy types
   - (A) System Load Average
   - (B) Request-rate based (tomcat requestCount)
   - (C) Request-rate based (internal library numCompleted)
Day in the life, instance counts




  - At peak 1,948 instances
  - without autoscaling: ~ 46.8 k instance hours
  - with autoscaling:    ~ 31.2 k instance hours (~ 33% reduction in usage)
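The figures above are consistent with reading "without autoscaling" as holding the peak instance count for all 24 hours. A short check of the arithmetic under that assumption (the function name is illustrative):

```python
def instance_hours_saved(peak_instances: int, autoscaled_hours: float,
                         hours: int = 24) -> tuple[int, float]:
    """Return (static instance hours, fractional reduction from autoscaling)."""
    # Static sizing: run the peak count for every hour of the day.
    static_hours = peak_instances * hours
    reduction = 1 - autoscaled_hours / static_hours
    return static_hours, reduction


static_hours, reduction = instance_hours_saved(1948, 31_200)
print(static_hours, f"{reduction:.0%}")  # 46752 33%
```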
Day in the life, request rates




   - Total requests: 4.5x peak versus min
   - Per instance stays between 45-90 RPS
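Keeping per-instance load inside a band while total traffic varies amounts to sizing the group from the request rate. A minimal sketch, assuming a hypothetical target of 75 RPS per instance (a value inside the 45-90 band above, not a figure from the talk):

```python
import math


def instances_for(total_rps: float, target_rps_per_instance: float,
                  min_size: int = 1) -> int:
    """Smallest instance count that keeps per-instance RPS at or below target."""
    needed = math.ceil(total_rps / target_rps_per_instance)
    return max(min_size, needed)
```

For example, `instances_for(90_000, 75)` gives 1200 instances, and `instances_for(20_000, 75)` gives 267.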
Day in the life, latency




  - Response variability greatest during initial scale-up events
  - Average response time primarily between 75-150 msec
Day in the life, CPU Utilization




  - Instance counts 3x, Request rate 4.5x (not shown)
  - Avg CPU utilization per instance: ~ 25-55% *

   * service A currently resolving concurrency issue; limits ideal CPU utilization
Unused capacity
- Reserved Instance "troughs" = spare capacity
   - Align services along fewer instance types for fewer, larger pools
- Current usage
   - Stand up "bonus" EMR cluster in off-peak hours
- Planned usage
   - Framework being developed to share unused capacity "fairly" across multiple batch applications
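A trough can be quantified as the reserved instances that sit idle in each hour. The sketch below sums those idle hours and then splits spare capacity evenly between batch applications. The even split is only a placeholder assumption for "fairly"; the deck does not describe how the planned framework allocates capacity, and all names here are hypothetical:

```python
def trough_hours(reserved: int, hourly_usage: list[int]) -> int:
    """Sum reserved-but-idle instance hours over hourly usage samples."""
    return sum(max(0, reserved - used) for used in hourly_usage)


def even_split(spare: int, apps: list[str]) -> dict[str, int]:
    """Divide spare instances evenly; earlier apps absorb any remainder."""
    base, extra = divmod(spare, len(apps))
    return {app: base + (1 if i < extra else 0) for i, app in enumerate(apps)}
```

Hours where usage exceeds the reservation contribute nothing to the trough, which is why the difference is floored at zero.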
Caveats
- AWS Autoscaling
   - Simplified scaling policy capabilities
   - Cooldown is static, not dynamically configurable
- Application resource profiles can change quickly (SOA)
- When something goes wrong...
    1. traffic rates can drop quickly
    2. scale-down can kick in
    3. thundering herd can knock you back down
    - lock out scale-down quickly
    - proactively protect yourself with Hystrix (OSS) against downstream service degradation or failure
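The failure sequence above suggests guarding scale-down when traffic falls faster than normal decline would explain. A minimal sketch of such a guard; the heuristic, the function name and the 30% threshold are assumptions for illustration, not the mechanism Netflix uses:

```python
def allow_scale_down(prev_rps: float, curr_rps: float,
                     max_drop_fraction: float = 0.3) -> bool:
    """Return False when traffic dropped too sharply between two samples."""
    if prev_rps <= 0:
        return False  # no baseline: be conservative and hold capacity
    drop = (prev_rps - curr_rps) / prev_rps
    return drop <= max_drop_fraction
```

Holding capacity through an abrupt drop leaves instances in place for the retry surge that follows once the failing dependency recovers.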
Wrap-up
- Autoscaling is a big win for Netflix
- Dynamically scaling affords improved scalability
- Our Open Source Software simplifies mgmt at scale




- Next Netflix OSS meetup: Wednesday, March 13th @ Netflix
- Great projects, stunning colleagues: jobs.netflix.com

#lspe Q1 2013 dynamically scaling netflix in the cloud

  • 1. Dynamically Scaling Netflix in the Cloud Coburn Watson Manager - Cloud Performance Engineering
  • 2. Netflix, Inc. - World's leading internet television network - 33 Million subscribers in 40 countries - Over a billion hours streamed per month - Approximately 33% of all US Internet traffic at night - Increasing quantity of original content - Recent Technical Notables - Open Source Software - OpenConnect (homegrown CDN)
  • 3. About Me - Manage Cloud Performance Engineering team - Focus on performance since 2000-ish - Large-scale billing applications, eCommerce, datacenter mgmt, etc. - Genentech, McKesson, Amdocs, Mercury Int., HP, etc. - Passion for tackling performance at cloud-scale - Looking for great performance engineers - cwatson@netflix.com
  • 4. First things first - ASG = Autoscaling group - AWS description: "An Auto Scaling group is a representation of multiple Amazon EC2 instances that share similar characteristics, and that are treated as a logical grouping for the purposes of instance scaling and management. " "An Auto Scaling group starts by launching the minimum number (or the desired number, if specified) of EC2 instances and then increases or decreases the number of running EC2 instances automatically according to the conditions that you define." - Within Netflix (almost) all services are created as ASGs - Asgard (OSS) simplifies this process:
  • 5. Dynamic Scaling @ Netflix - EC2 footprint autoscales 2500-3500 instances per day - order of tens of thousands of EC2 instances - Largest ASG* spans 200-600 m2.4xlarge (64GB RAM) Why: - Improved scalability during unexpected workloads - Avoid sizing capacity aggressively high - each service team determines their capacity - Creates "reserved instance troughs" for batch activity - on the order of hundreds of thousands of instance hours weekly * largest "autoscaling" ASG
  • 6. How? - Discovery - AWS elastic load balancers "speak" autoscaling - mid-tier services utilize Eureka (OSS) - Leverage native AWS autoscaling capabilities - Publish our own metrics up to CloudWatch (Servo OSS) - Stateless
  • 7. How? Two types of scaling behavior exposed in Asgard 1. rate-based autoscaling 2. scheduled action autoscaling
  • 8. AWS Autoscaling - Define policies on ASG - alarm, scaling unit (percent/amount), cooldown, evaluation interval and period - Cooldowns: - ASG-level versus policy-level (both exist) - cooldown start tied to last instance ready - should be tied closely to application/service startup time - Execute load or squeeze tests to measure capacity - Frequent pushes with SOA correspond to possible frequent changes in per-instance capacity - (insert here) 10 second primer on squeeze tests
  • 9. In Action - Example covers 3 services - 2 edge (A,B), 1 mid-tier (C) - C has more upstream services than simply A and B - Multiple autoscaling policy types - (A) System Load Average - (B) Request-rate based (tomcat requestCount) - (C) Request-rate based (internal library numCompleted)
  • 10. Day in the life, instance counts - At peak 1,948 instances - without autoscaling: ~ 46.8 k instance hours - with autoscaling: ~ 31.2 k instance hours (~ 33% reduction in usage)
  • 11. Day in the life, request rates - Total requests: 4.5x peak versus min - Per instance stays between 45-90 RPS
  • 12. Day in the life, latency - Response variability greatest during initial scale-up events - Average response time primarily between 75-150 msec
  • 13. Day in the life, CPU Utilization - Instance counts 3x, Request rate 4.5x (not shown) - Avg CPU utilization per instance: ~25-55%* (* service A currently resolving concurrency issue; limits ideal CPU utilization)
  • 14. Unused capacity - Reserved Instance "troughs" = spare capacity - Align services along fewer instance types for fewer, larger pools - Current usage - Stand up "bonus" EMR cluster in off-peak hours - Planned usage - Framework being developed to share unused capacity "fairly" across multiple batch applications
  • 15. Caveats - AWS Autoscaling - Simplified scaling policy capabilities - Cooldown is static, not dynamically configurable - Application resource profiles can change quickly (SOA) - When something goes wrong... 1. traffic rates can drop quickly 2. scale-down can kick in 3. thundering herd can knock you back down - lockout scale-down quickly - proactively protect yourself with Hystrix (OSS) against downstream service degradation or failure
  • 16. Wrap-up - Autoscaling is a big win for Netflix - Dynamically scaling affords improved scalability - Our Open Source Software simplifies mgmt at scale - Next Netflix OSS meetup: Wednesday March 13th @ Netflix - Great projects, stunning colleagues: jobs.netflix.com
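
The instance-hour figures on slide 10 and the request-rate policies on slides 9 and 11 can be checked with a little arithmetic. The sketch below is illustrative only and is not Netflix's or AWS's implementation. The function names are hypothetical. The 90 RPS target and the 200-600 bounds are borrowed from slides 11 and 5 purely as example inputs, and those slides may describe different services.

```python
import math


def instance_hours_saved(static_hours: float, autoscaled_hours: float) -> float:
    """Fractional reduction in instance hours when autoscaling replaces peak sizing."""
    return (static_hours - autoscaled_hours) / static_hours


def desired_instances(total_rps: float, target_rps_per_instance: float,
                      min_size: int, max_size: int) -> int:
    """Instances needed to keep per-instance load at or below the target,
    clamped to the group's minimum and maximum size (hypothetical helper)."""
    needed = math.ceil(total_rps / target_rps_per_instance)
    return max(min_size, min(max_size, needed))


# Slide 10: holding the peak of 1,948 instances for a full day.
static_hours = 1948 * 24                           # 46,752, i.e. ~46.8k
reduction = instance_hours_saved(46_800, 31_200)   # one third, i.e. ~33%

# Assumed inputs: 90 RPS per instance target, group bounded to 200-600.
low = desired_instances(9_000, 90, 200, 600)       # 100 needed, raised to the minimum
mid = desired_instances(45_000, 90, 200, 600)      # 500 needed, within bounds
high = desired_instances(100_000, 90, 200, 600)    # 1,112 needed, capped at the maximum

print(static_hours, round(reduction, 3), low, mid, high)
```

A real policy would also have to honor the cooldown described on slide 8. This sketch only shows the steady-state sizing calculation.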