Netflix Titus, its Feisty Team, and Daemons
Titus - Netflix’s Container Management Platform
Scheduling
● Service & batch job lifecycle
● Resource management
Container Execution
● AWS Integration
● Netflix Ecosystem Support
[Diagram: Service and Batch workloads on top of Job and Fleet Management, Resource Management & Optimization, and Container Execution]
Stats
● Containers launched per week (batch vs. service): 3 million / week
● Batch runtimes (< 1s, < 1m, < 1h, < 12h, < 1d, > 1d)
● Service runtimes (< 1 day, < 1 week, < 1 month, > 1 month)
● Autoscaling
● High churn
The Titus team
● Design
● Develop
● Operate
● Support
* And Netflix Platform Engineering and Amazon Web Services
Titus Product Strategy
Ordered priority focus on
● Developer Velocity
● Reliability
● Cost Efficiency
Easy migration from VMs to containers
Easy container integration with VMs and Amazon Services
Focus on just what Netflix needs
Deeply integrated AWS container platform
IP per container
● VPC, ENIs, and security groups
IAM Roles and Metadata Endpoint per container
● Container view of 169.254.169.254
Cryptographic identity per container
● Using Amazon instance identity document, Amazon KMS
Service job container autoscaling
● Using native AWS CloudWatch, SQS, and Auto Scaling policies and engine
Application Load Balancing (ALB)
Applications using containers at Netflix
● Netflix API, Node.js Backend UI Scripts
● Machine Learning (GPUs) for personalization
● Encoding and Content use cases
● Netflix Studio use cases
● CDN tracking, planning, monitoring
● Massively parallel CI system
● Data Pipeline and Stream Processing
● Big Data use cases (Notebooks, Presto)
● Batch - 4Q 15
● Basic Services - 1Q 16
● Production Services - 4Q 16
● Customer Facing Services - 2Q 17
Q4 2018 Container Usage
Common
● Jobs launched: 255K jobs / day
● Different applications: 1K+ different images
● Isolated Titus deployments: 7
Services
● Single app cluster size: 5K (real), 12K containers (benchmark)
● Hosts managed: 7K VMs (435,000 CPUs)
Batch
● Containers launched: 450K / day (750K / day peak)
● Hosts managed (autoscaled): 55K / month
High Level Titus Architecture
[Diagram components:]
● Batch/Workflow Systems and Service CI/CD
● Titus Control Plane (API, Scheduling, Job Lifecycle Control), with Fenzo, Cassandra, and EC2 Autoscaling
● Mesos
● Titus Hosts on AWS Virtual Machines: Mesos agent, Docker, Titus System Services, User Containers
● Docker Registry
Open Source
Open sourced April 2018
Help other communities by sharing our approach
Lessons Learned
End to End User Experience
Our initial view of containers
[Diagram: Image Registry, Publish, Run Container Workload, Monitor Containers, Deploy new Container Workload, all labeled “The Runtime”]
What about?
[Diagram: the same flow, plus Security Scanning, Change Campaigns, and Ad hoc Performance analysis]
What about?
[Diagram: Local Development, CI/CD, Runtime]
End to end tooling
Container orchestration only part of the problem
For Netflix …
● Local Development - Newt
● Continuous Integration - Jenkins + Newt
● Continuous Delivery - Spinnaker
● Change Campaigns - Astrid
● Performance Analysis - Vector and Flamegraphs
Tooling guidance
● Ensure coverage for entire application SDLC
○ Developing an application before deployment
○ Change management, security and compliance tooling for runtime
● What we added to Docker tooling
○ Curated known base images
○ Consistent image tagging
○ Assistance for multi-region/account registries
○ Consistency with existing tools
Operations and High Availability
Learning how things fail (in order of increasing severity)
● Single container crashes
● Single host crashes
○ Taking down multiple containers
● Control plane fails
○ Existing containers continue to run
○ New jobs cannot be submitted
○ Replacements and scale ups do not occur
● Control plane gets into bad state
○ Can be catastrophic
Case 1 - Single container crashes
● Most orchestrators will recover
● Most often during startup or shutdown
● Monitor for crash loops
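Monitoring for crash loops can be sketched as a sliding-window restart counter. This is a minimal illustration, not Titus code: the class name `CrashLoopDetector`, the threshold of 3 restarts, and the 300 second window are all assumptions chosen for the example.

```python
from collections import deque


class CrashLoopDetector:
    """Flags a container as crash looping when it restarts too often in a time window."""

    def __init__(self, max_restarts=3, window_seconds=300.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = {}  # container name -> deque of restart timestamps

    def record_restart(self, name, timestamp):
        """Record a restart; return True if the container is now in a crash loop."""
        times = self._restarts.setdefault(name, deque())
        times.append(timestamp)
        # Drop restarts that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_restarts


detector = CrashLoopDetector(max_restarts=3, window_seconds=300.0)
# Four restarts within three minutes: only the fourth exceeds the threshold.
flags = [detector.record_restart("api-1", t) for t in (0, 60, 120, 180)]
```

A restart that arrives long after the previous ones starts a fresh window, so an occasional crash that the orchestrator recovers from does not raise an alert.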
Case 2 - Single host crashes
● Need a placement engine that spreads critical workloads
● Need a way to detect and remediate bad hosts
Titus node health monitoring, scheduling
● Extensive health checks
○ Control plane components - Docker, Mesos, Titus executor
○ AWS - ENI, VPC, GPU
○ Netflix Dependencies - systemd state, security systems
[Diagram: Scheduler, Health Check and Service Discovery, and Titus hosts marked healthy (+) or unhealthy (✖)]
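One way to picture how health checks feed scheduling is to treat a host as schedulable only when every check passes. The sketch below is an assumption-laden illustration: the function names and the check names (`docker`, `mesos_agent`, `eni`, `systemd`) are hypothetical labels loosely based on the categories listed above, not the actual Titus checks.

```python
def evaluate_host(checks):
    """Return (schedulable, failed_checks) for one host.

    checks: mapping of check name -> bool (True means the check passed).
    """
    failed = sorted(name for name, ok in checks.items() if not ok)
    return (not failed, failed)


def schedulable_hosts(fleet):
    """Filter the fleet down to hosts on which new containers may be placed."""
    return sorted(host for host, checks in fleet.items() if evaluate_host(checks)[0])


# Hypothetical fleet: host-b has a networking problem, host-c has two failures.
fleet = {
    "host-a": {"docker": True, "mesos_agent": True, "eni": True, "systemd": True},
    "host-b": {"docker": True, "mesos_agent": True, "eni": False, "systemd": True},
    "host-c": {"docker": False, "mesos_agent": True, "eni": True, "systemd": False},
}
placeable = schedulable_hosts(fleet)
```

Returning the list of failed checks alongside the verdict keeps the reason available for whatever remediation step runs next.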
Titus node health remediation
● Rate limiting through centralized service is critical
[Diagram: Scheduler, Titus hosts (healthy + / unhealthy ✖), Events, and Infrastructure Automation (Winston)]
Automation
  perform analysis on host
  perform remediation on host
  if (unrecoverable) {
    tell scheduler to reschedule work
    terminate instance
  }
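The automation flow above, combined with the point about centralized rate limiting, might look roughly like this. It is a sketch under stated assumptions: `RemediationRateLimiter`, `handle_unhealthy_host`, and the limit of 2 terminations per 60 seconds are invented for illustration, and the real scheduler and instance termination calls are replaced by returned strings.

```python
class RemediationRateLimiter:
    """Allows at most max_actions host terminations per window, fleet wide."""

    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = []

    def try_acquire(self, now):
        # Keep only the terminations that are still inside the window.
        self._timestamps = [t for t in self._timestamps if now - t < self.window_seconds]
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True


def handle_unhealthy_host(host, recoverable, limiter, now):
    """Mirror the slide's flow: remediate if possible, otherwise terminate."""
    if recoverable:
        return f"remediated {host}"
    if not limiter.try_acquire(now):
        return f"deferred {host}: rate limit reached"
    # A real system would call the scheduler and the cloud API here.
    return f"rescheduled work and terminated {host}"


limiter = RemediationRateLimiter(max_actions=2, window_seconds=60.0)
results = [handle_unhealthy_host(f"host-{i}", False, limiter, now=float(i)) for i in range(3)]
```

Because the limiter is shared, a bad health check that suddenly marks every host unrecoverable can only take out a bounded number of hosts before a human notices.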
Spotting fleet wide issues using logging
● For the hosts, not the containers
○ Need fleet wide view of container runtime, OS problems
○ New workloads will trigger new host problems
● Titus hosts generate 2B log lines per day
○ Stream processing to look for patterns and remediations
● Aggregated logging - see patterns in the large
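As a rough illustration of seeing patterns in the large, host log lines can be normalized into patterns and counted by how many distinct hosts emit them. Everything here is hypothetical: the function names, the normalization rules, and the sample log lines are assumptions, and a production pipeline at 2B lines per day would run as a stream job rather than an in-memory loop.

```python
import re
from collections import defaultdict


def normalize(line):
    """Collapse variable parts (long hex ids, numbers) so similar lines share a pattern."""
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", line)
    return re.sub(r"\d+", "<n>", line)


def fleet_wide_patterns(host_lines, min_hosts=2):
    """Return {pattern: number of hosts} for patterns seen on at least min_hosts hosts."""
    hosts_by_pattern = defaultdict(set)
    for host, line in host_lines:
        hosts_by_pattern[normalize(line)].add(host)
    return {p: len(h) for p, h in hosts_by_pattern.items() if len(h) >= min_hosts}


# Invented sample lines: the same Docker failure shows up on two hosts.
lines = [
    ("host-1", "docker: failed to start container 3f9a2b7c1d04 after 30s"),
    ("host-2", "docker: failed to start container 9c1e44aa0b12 after 45s"),
    ("host-2", "kernel: eth1 link is down"),
]
patterns = fleet_wide_patterns(lines)
```

Counting hosts rather than lines keeps one noisy host from looking like a fleet-wide problem.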
Case 3 - Control plane hard failures
● White box - monitor time bucketed queue length
● Black box - submit synthetic workloads
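The white box signal can be sketched as a count of queued tasks grouped by how long they have waited. This is an assumed interpretation of "time bucketed queue length": the bucket edges (1s, 10s, 60s), the function names, and the alert rule are illustrative choices, not values from the talk.

```python
def bucket_queue_by_age(enqueue_times, now, bucket_edges=(1.0, 10.0, 60.0)):
    """Count queued tasks by how long they have been waiting.

    Returns a dict keyed by bucket label: "<1s", "<10s", "<60s", ">=60s".
    """
    labels = [f"<{edge:g}s" for edge in bucket_edges] + [f">={bucket_edges[-1]:g}s"]
    counts = dict.fromkeys(labels, 0)
    for enqueued_at in enqueue_times:
        age = now - enqueued_at
        for edge, label in zip(bucket_edges, labels):
            if age < edge:
                counts[label] += 1
                break
        else:
            # Older than every edge: falls into the last, open-ended bucket.
            counts[labels[-1]] += 1
    return counts


def should_alert(counts, oldest_label=">=60s", threshold=0):
    """Alert when more than threshold tasks sit in the oldest bucket."""
    return counts[oldest_label] > threshold


counts = bucket_queue_by_age([99.5, 95.0, 70.0, 10.0], now=100.0)
```

A queue that is long but young is usually just load, while tasks piling up in the oldest bucket suggest the scheduler has stopped making progress.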
Case 4 - Control plane soft failures
I don’t feel so good!
But first, let’s talk about Zombies
● Early on, we had cases where
○ Some but not all of the control plane was working
○ User terminated their containers
○ Containers still running, but shouldn’t have been
● The “fix” - Mesos implicit reconciliation
○ Titus to Mesos - What containers are running?
○ Titus to Mesos - Kill these containers we know shouldn’t be running
○ System converges on consistent state 👍
Disconnected containers
But what if?
[Diagram: Cluster State, Scheduler, Controller, and Host, with the connections between them marked ✖]
● Backing store gets corrupted
● Or control plane reads store incorrectly
Bad things occur
● 12,000 containers “reconciled” in < 1m
● An hour to restore service
[Chart: running containers vs. not running containers]
Guidance
● Know how to operate your cluster storage
○ Perform backups and test restores
○ Test corruption
○ Know failure modes, and know how to recover
At Netflix, we ...
● Moved to less aggressive reconciliation
● Page on inconsistent data
○ Let existing containers run
○ Human fixes state and decides how to proceed
● Automated snapshot testing for staging
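Less aggressive reconciliation with paging on inconsistent data could be approximated by refusing any reconciliation pass that would kill an implausibly large share of running containers. This sketch is an assumption about how such a guard might work: `plan_reconciliation` and the 5% threshold are hypothetical, and the talk does not say which rule Titus uses.

```python
def plan_reconciliation(running, expected, max_kill_fraction=0.05):
    """Decide which running containers to kill, refusing suspiciously large changes.

    running:  set of container ids the hosts report as running
    expected: set of container ids the control plane believes should run
    """
    orphans = running - expected
    if running and len(orphans) / len(running) > max_kill_fraction:
        # Likely corrupted or misread state: let containers run and page a human.
        return {"action": "page", "kill": set(), "suspect": orphans}
    return {"action": "reconcile", "kill": orphans, "suspect": set()}


running = {f"c{i}" for i in range(100)}
# Normal case: one container was terminated by its user but is still running.
normal = plan_reconciliation(running, expected=running - {"c7"})
# Corrupted store: the control plane believes nothing should be running.
corrupt = plan_reconciliation(running, expected=set())
```

In the corrupted case nothing is killed, which matches the guidance to let existing containers run while a human fixes the state.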
Security
● Enforcement
○ Seccomp and AppArmor policies
● Cryptographic identity for each container
○ Leveraging host level Amazon and control plane provided identities
○ Validated by central Netflix service before secrets are provided
Reducing container escape vectors
User namespaces
● Root (or user) in container != Root (or user) on host
● Challenge: Getting it to work with persistent storage
Reducing impact of container escape vectors
user_namespaces (7)
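The statement that root in the container is not root on the host follows from the ID mapping described in user_namespaces(7), where each line of `/proc/<pid>/uid_map` holds an inside start, an outside start, and a length. The sketch below only models that arithmetic. The function name and the example mapping starting at host UID 100000 are assumptions, not the mapping Titus uses.

```python
def map_to_host_uid(container_uid, uid_map):
    """Translate a container UID to a host UID using uid_map-style entries.

    uid_map: list of (inside_start, outside_start, length) tuples, the same
    three fields found on each line of /proc/<pid>/uid_map.
    """
    for inside_start, outside_start, length in uid_map:
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None  # this UID has no mapping in the namespace


# Hypothetical mapping: 65536 container UIDs mapped onto host UIDs from 100000.
uid_map = [(0, 100000, 65536)]
root_on_host = map_to_host_uid(0, uid_map)
```

With this mapping, a process that escapes as container root runs as an unprivileged host UID. The persistent storage challenge mentioned above arises because file ownership on shared volumes is recorded in host UIDs.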
Lock down, isolate control plane
● Hackers are scanning for Docker and Kubernetes
● Reported lack of networking isolation in Google Borg
● We also thought our networking was isolated (wasn’t)
Avoiding user host level access
[Diagram labels: Vector, ssh, perf tools]
Scale - Scheduling Speed
How does Netflix failover?
Kong
Netflix regional failover
[Chart: API calls per region (EU-WEST-1, US-EAST-1), annotated with: Kong evacuation of us-east-1, traffic diverted to other regions, fail back to us-east-1, traffic moved back to us-east-1]
Infrastructure challenge
● Increase capacity during scale up of savior region
● Launch 1000s of containers in 7 minutes
Easy, right?
“we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
Synthetic benchmarks missing
1. Heterogeneous workloads
2. Full end to end launches
3. Docker image pull times
4. Integration with public cloud networking
Titus can do this by ...
● Dynamically changeable scheduling behavior
● Fleet wide networking optimizations
Normal scheduling
[Diagram: VM1, VM2 ... VMn, each with ENI1/ENI2 and a single IP (IP1) in use; a few App 1 and App 2 containers across the VMs; Scheduling Algorithm scale from Spread to Pack]
Trade-off for reliability
Failover scheduling
[Diagram: the same VMs with many more App 1 and App 2 containers added and additional IPs in use (IP2, IP3, IP4); Scheduling Algorithm scale from Spread to Pack]
Trade-off for speed
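The spread versus pack trade-off can be illustrated with a scoring function whose weight is changed at runtime. This is a simplified sketch, not Fenzo's algorithm: `score_host`, `choose_host`, the `pack_weight` parameter, and the host dictionaries are all hypothetical.

```python
def score_host(host, app, pack_weight):
    """Score a host for app; higher is better.

    pack_weight=0.0 favors spreading an app across hosts (reliability).
    pack_weight=1.0 favors packing onto hosts already running the app (speed).
    """
    same_app = host["apps"].count(app)
    used_fraction = len(host["apps"]) / host["slots"]
    spread_score = 1.0 / (1 + same_app)
    pack_score = used_fraction if same_app else 0.0
    return (1 - pack_weight) * spread_score + pack_weight * pack_score


def choose_host(hosts, app, pack_weight):
    """Pick the best-scoring host with a free slot; ties go to the later host name."""
    candidates = [h for h in hosts if len(h["apps"]) < h["slots"]]
    return max(candidates, key=lambda h: (score_host(h, app, pack_weight), h["name"]))["name"]


hosts = [
    {"name": "vm1", "slots": 8, "apps": ["app1", "app1", "app2"]},
    {"name": "vm2", "slots": 8, "apps": ["app2"]},
    {"name": "vm3", "slots": 8, "apps": []},
]
normal = choose_host(hosts, "app1", pack_weight=0.0)    # spreads to a host without app1
failover = choose_host(hosts, "app1", pack_weight=1.0)  # packs next to existing app1
```

During failover the packed choice lands on a host that has likely already pulled the image and configured networking, which is where the launch speed comes from.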
On each host
● Due to normal scheduling, host likely already has ...
○ Docker image downloaded
○ Networking interfaces and security groups configured
● Need to burst allocate IP addresses
○ Opportunistically batch allocate at container launch time
○ Likely if one container was launched more are coming
○ Garbage collect unused later
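Batch allocation with later garbage collection might be sketched as a per-host pool. The class `HostIpPool`, the batch size of 4, the 300 second idle TTL, and the `fake_allocate` stand-in for the cloud networking API are assumptions made for this example.

```python
import itertools


class HostIpPool:
    """Per-host pool that allocates IPs in batches and garbage collects idle ones."""

    def __init__(self, allocate_batch, batch_size=4, idle_ttl=300.0):
        self.allocate_batch = allocate_batch  # callable(n) -> list of new IPs
        self.batch_size = batch_size
        self.idle_ttl = idle_ttl
        self.free = {}        # ip -> time it became free
        self.in_use = set()

    def acquire(self, now):
        if not self.free:
            # One launch suggests more are coming, so fetch a whole batch.
            for ip in self.allocate_batch(self.batch_size):
                self.free[ip] = now
        ip = next(iter(self.free))
        del self.free[ip]
        self.in_use.add(ip)
        return ip

    def release(self, ip, now):
        self.in_use.discard(ip)
        self.free[ip] = now

    def garbage_collect(self, now):
        """Remove and return IPs that have been idle longer than idle_ttl."""
        stale = [ip for ip, since in self.free.items() if now - since > self.idle_ttl]
        for ip in stale:
            del self.free[ip]
        return stale


counter = itertools.count(2)
calls = []


def fake_allocate(n):
    """Stand-in for the cloud API call; records how often it is invoked."""
    calls.append(n)
    return [f"10.0.0.{next(counter)}" for _ in range(n)]


pool = HostIpPool(fake_allocate, batch_size=4)
ips = [pool.acquire(now=0.0) for _ in range(3)]
reclaimed = pool.garbage_collect(now=1000.0)
```

Three launches cost a single allocation call here, and the one address that was never used is handed back once it has sat idle past the TTL.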
Results
[Chart: containers started per minute, us-east-1 / prod]
7500 launched in 5 minutes
Scale - Limits
How far can a single Titus stack go?
● Speed and stability of scheduling
● Blast radius of mistakes
Scaling options
Idealistic
● Continue to improve performance
● Avoid making mistakes
Realistic
● Test a stack up to a known scale level
● Contain mistakes
Titus “Federation”
● Allows a Titus stack to be scaled out
○ For performance and reliability reasons
● Not to help with
○ Cloud bursting across different resource pools
○ Automatic balancing across resource pools
○ Joining various business unit resource pools
Federation Implementation
● Users need only to know of the external single API
○ VIP - titus-api.us-east-1.prod.netflix.net
● Simple federation proxy spans stacks (cells)
○ Route these apps to cell 1, these others to cell 2
○ Fan out & union queries across cells
○ Operators can route directly to specific cells
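The proxy behavior described above (route apps to cells, fan out and union queries) can be sketched as follows. `FederationProxy`, its methods, and the in-memory cell dictionaries are hypothetical stand-ins. The real proxy sits behind the single external API and talks to separate Titus stacks.

```python
class FederationProxy:
    """Routes job submissions to one cell and fans queries out to all cells."""

    def __init__(self, cells, routing_rules, default_cell):
        self.cells = cells                  # cell name -> {job id: app name}
        self.routing_rules = routing_rules  # app name -> cell name
        self.default_cell = default_cell

    def route(self, app):
        """Route these apps to one cell, all others to the default cell."""
        return self.routing_rules.get(app, self.default_cell)

    def submit(self, job_id, app):
        cell = self.route(app)
        self.cells[cell][job_id] = app
        return cell

    def list_jobs(self):
        """Fan out to every cell and union the results."""
        jobs = {}
        for cell_name, cell_jobs in self.cells.items():
            for job_id, app in cell_jobs.items():
                jobs[job_id] = {"app": app, "cell": cell_name}
        return jobs


proxy = FederationProxy(
    cells={"cell01": {}, "cell02": {}},
    routing_rules={"encoding": "cell02"},
    default_cell="cell01",
)
placed = [proxy.submit("job-1", "api"), proxy.submit("job-2", "encoding")]
all_jobs = proxy.list_jobs()
```

Users see one job list regardless of which cell holds each job, while the `cell` field lets an operator go directly to a specific cell.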
Titus Federation
[Diagram: in each region (us-west-2, us-east-1, eu-west-1), a Titus API (Federation Proxy) in front of Titus cell01, Titus cell02, …]
How many cells?
A few large cells
● Only as many as needed for scale / blast radius
Why? Larger resource pools help with
● Cross workload efficiency
● Operations
● Bad workload impacts
Performance and Efficiency
Simplified view of a server
● A fictional “16 vCPU” host
● Left and right are CPU packages
● Top to bottom are cores with hyperthreads
Consider workload placement
● Consider three workloads
○ Static A, B, and C, all of which are latency sensitive
○ Burst D which would like more compute when available
● Let’s start placing workloads
Problems
● Potential performance problems
○ Static C is crossing packages
○ Static B can steal from Static C
● Underutilized resources
○ Burst workload isn’t using all available resources
Node level CPU rescheduling
● After containers land on hosts
○ Eventually, dynamic and cross host
● Leverages cpusets
○ Static - placed on single CPU package and exclusive full cores
○ Burst - can consume extra capacity, but variable performance
● Kubernetes - CPU Manager (beta)
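The static and burst rules can be worked through on the fictional 16 vCPU host. This sketch only computes which CPU ids each workload would receive and does not write any cpuset files. The function names, the core counts requested by Static A, B, and C, and the CPU numbering (hyperthread siblings numbered consecutively) are assumptions. Real hosts often number siblings differently.

```python
def build_topology(packages=2, cores_per_package=4, threads_per_core=2):
    """Return {package: [[cpu ids of core 0], [cpu ids of core 1], ...]}."""
    topology, cpu = {}, 0
    for package in range(packages):
        topology[package] = []
        for _ in range(cores_per_package):
            topology[package].append(list(range(cpu, cpu + threads_per_core)))
            cpu += threads_per_core
    return topology


def place_static(topology, free_cores, cores_needed):
    """Reserve whole cores on a single package; return the cpuset or None.

    free_cores: {package: set of free core indexes}, mutated on success.
    """
    for package in sorted(free_cores):
        if len(free_cores[package]) >= cores_needed:
            chosen = sorted(free_cores[package])[:cores_needed]
            free_cores[package] -= set(chosen)
            return sorted(cpu for core in chosen for cpu in topology[package][core])
    return None


def burst_cpuset(topology, free_cores):
    """Burst workloads share every CPU not reserved by a static workload."""
    return sorted(
        cpu
        for package, cores in free_cores.items()
        for core in cores
        for cpu in topology[package][core]
    )


topology = build_topology()
free = {package: set(range(len(cores))) for package, cores in topology.items()}
static_a = place_static(topology, free, 2)
static_b = place_static(topology, free, 1)
static_c = place_static(topology, free, 2)  # does not fit on package 0, so uses package 1
burst_d = burst_cpuset(topology, free)
```

Static C lands entirely on the second package instead of straddling both, and Burst D receives all remaining CPUs rather than a fixed slice.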
Opportunistic workloads
● Enable workloads to burst into underutilized resources
● Differences between utilized and total
[Chart: utilization over time, showing Resources (Total), Resources (Allocated), and Resources (Utilized)]
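The gaps between the three curves can be expressed as simple arithmetic. This is an illustrative sketch only: the function name, the safety margin, and the example figures (a 64 CPU pool with 48 allocated and 16 in use) are assumptions, not numbers from the talk.

```python
def capacity_gaps(total, allocated, utilized, safety_margin=0.25):
    """Break a resource pool into the gaps that opportunistic workloads could use.

    safety_margin is the fraction of total kept free as headroom.
    """
    headroom = total * safety_margin
    return {
        "unallocated": total - allocated,       # never handed to any workload
        "allocated_idle": allocated - utilized,  # reserved but not being used
        "opportunistic": max(0.0, total - utilized - headroom),
    }


gaps = capacity_gaps(total=64, allocated=48, utilized=16)
```

The `opportunistic` figure is based on the difference between utilized and total, as the slide notes, rather than only on what is unallocated.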
Questions?

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons

  • 1. Netflix Titus, its Feisty Team, and Daemons
  • 3. Titus - Netflix’s Container Management Platform Scheduling ● Service & batch job lifecycle ● Resource management Container Execution ● AWS Integration ● Netflix Ecosystem Support Service Job and Fleet Management Resource Management & Optimization Container Execution Batch
  • 4. Stats Containers Launched Per Week Batch vs. Service 3 Million / Week Batch runtimes (< 1s, < 1m, < 1h, < 12h, < 1d, > 1d) Service runtimes (< 1 day, < 1 week, < 1 month, > 1 month) Autoscaling High Churn
  • 5. The Titus team ● Design ● Develop ● Operate ● Support * And Netflix Platform Engineering and Amazon Web Services
  • 6. Titus Product Strategy Ordered priority focus on ● Developer Velocity ● Reliability ● Cost Efficiency Easy migration from VMs to containers Easy container integration with VMs and Amazon Services Focus on just what Netflix needs
  • 7. Deeply integrated AWS container platform IP per container ● VPC, ENIs, and security groups IAM Roles and Metadata Endpoint per container ● Container view of 169.254.169.254 Cryptographic identity per container ● Using Amazon instance identity document, Amazon KMS Service job container autoscaling ● Using Native AWS Cloudwatch, SQS, Autoscaling policies and engine Application Load Balancing (ALB)
  • 8. Applications using containers at Netflix ● Netflix API, Node.js Backend UI Scripts ● Machine Learning (GPUs) for personalization ● Encoding and Content use cases ● Netflix Studio use cases ● CDN tracking, planning, monitoring ● Massively parallel CI system ● Data Pipeline and Stream Processing ● Big Data use cases (Notebooks, Presto) Batch Q4 15 Basic Services 1Q 16 Production Services 4Q 16 Customer Facing Services 2Q 17
  • 9. Q4 2018 Container Usage Common Jobs Launched 255K jobs / day Different applications 1K+ different images Isolated Titus deployments 7 Services Single App Cluster Size 5K (real), 12K containers (benchmark) Hosts managed 7K VMs (435,000 CPUs) Batch Containers launched 450K / day (750K / day peak) Hosts managed (autoscaled) 55K / month
  • 10. High Level Titus Architecture [Diagram labels: Batch/Workflow Systems; Service CI/CD; Titus Control Plane (● API ● Scheduling ● Job Lifecycle Control); Fenzo; Cassandra; EC2 Autoscaling; Mesos; Titus Hosts on AWS Virtual Machines (Mesos agent, Docker, Titus System Services, User Containers); Docker Registry]
  • 11. Open Source Open sourced April 2018 Help other communities by sharing our approach
  • 13. End to End User Experience
  • 14. Our initial view of containers [Diagram labels: Image Registry; Publish; Run Container Workload; Monitor Containers; Deploy new Container Workload; “The Runtime”]
  • 17. End to end tooling Container orchestration only part of the problem For Netflix … ● Local Development - Newt ● Continuous Integration - Jenkins + Newt ● Continuous Delivery - Spinnaker ● Change Campaigns - Astrid ● Performance Analysis - Vector and Flamegraphs
  • 18. Tooling guidance ● Ensure coverage for entire application SDLC ○ Developing an application before deployment ○ Change management, security and compliance tooling for runtime ● What we added to Docker tooling ○ Curated known base images ○ Consistent image tagging ○ Assistance for multi-region/account registries ○ Consistency with existing tools
  • 20. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state
  • 21. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ○ Taking down multiple containers ● Control plane fails ● Control plane gets into bad state
  • 22. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ● Control plane fails ○ Existing containers continue to run ○ New jobs cannot be submitted ○ Replacements and scale ups do not occur ● Control plane gets into bad state
  • 23. Learning how things fail (increasing severity) ● Single container crashes ● Single host crashes ● Control plane fails ● Control plane gets into bad state ○ Can be catastrophic
  • 24. Case 1 - Single container crashes ● Most orchestrators will recover ● Most often during startup or shutdown ● Monitor for crash loops
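One way to "monitor for crash loops" is to count restarts per container in a sliding time window. The sketch below is only an illustration of that idea: the class name `CrashLoopDetector` and both thresholds are assumptions, not part of Titus.

```python
from collections import deque


class CrashLoopDetector:
    """Flags a container as crash looping when it restarts too often.

    Hypothetical helper for illustration; the thresholds are assumptions.
    """

    def __init__(self, max_restarts=3, window_seconds=300.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = {}  # container id -> deque of restart timestamps

    def record_restart(self, container_id, timestamp):
        """Record a restart; return True if the container is crash looping."""
        events = self._restarts.setdefault(container_id, deque())
        events.append(timestamp)
        # Drop restarts that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_restarts
```

A caller would feed this from restart events and alert (or stop restarting) once it returns True.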
  • 25. Case 2 - Single host crashes ● Need a placement engine that spreads critical workloads ● Need a way to detect and remediate bad hosts
  • 26. Titus node health monitoring, scheduling ● Extensive health checks ○ Control plane components - Docker, Mesos, Titus executor ○ AWS - ENI, VPC, GPU ○ Netflix Dependencies - systemd state, security systems [Diagram labels: Scheduler; Health Check and Service Discovery; Titus hosts marked ✖ + + ✖]
  • 27. Titus node health remediation ● Rate limiting through centralized service is critical [Diagram labels: Scheduler; Titus hosts marked ✖ + +; Events; Infrastructure Automation (Winston)] Automation: perform analysis on host; perform remediation on host; if (unrecoverable) { tell scheduler to reschedule work; terminate instance }
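The remediation flow on this slide, combined with the point that rate limiting through a central service is critical, can be sketched as follows. This is a minimal illustration, not Winston's or Titus's actual interface: `RemediationService`, its methods, and the per-window budget are all hypothetical names and assumptions.

```python
class RemediationService:
    """Central service that rate limits host terminations (illustrative only)."""

    def __init__(self, max_terminations_per_window):
        self.max_terminations = max_terminations_per_window
        self.terminated_in_window = 0
        self.deferred = []  # hosts waiting for the next window

    def start_new_window(self):
        """Reset the termination budget, e.g. from a periodic timer."""
        self.terminated_in_window = 0

    def handle_bad_host(self, host, unrecoverable, reschedule, terminate):
        """Follow the slide's flow: remediate in place, or reschedule and terminate."""
        if not unrecoverable:
            return "remediated"
        if self.terminated_in_window >= self.max_terminations:
            # Budget exhausted: leave the host alone rather than drain the fleet.
            self.deferred.append(host)
            return "deferred"
        reschedule(host)  # tell the scheduler to move the work elsewhere
        terminate(host)   # then terminate the instance
        self.terminated_in_window += 1
        return "terminated"
```

Keeping the budget in one place means a faulty health check that marks many hosts bad at once can only terminate a bounded number of them per window.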
  • 28. Spotting fleet wide issues using logging ● For the hosts, not the containers ○ Need fleet wide view of container runtime, OS problems ○ New workloads will trigger new host problems ● Titus hosts generate 2B log lines per day ○ Stream processing to look for patterns and remediations ● Aggregated logging - see patterns in the large
  • 29. Case 3 - Control plane hard failures White box - monitor time bucketed queue length Black box - submit synthetic workloads
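One possible reading of "time bucketed queue length" is a histogram of queued tasks by how long they have been waiting, so that a growing tail signals a stalled scheduler. The sketch below assumes that reading; the bucket boundaries and the function name `bucket_queue_by_wait` are illustrative assumptions.

```python
from bisect import bisect_left

# Upper bounds of the wait-time buckets in seconds (assumed values).
BUCKET_BOUNDS = [1, 5, 30, 60]


def bucket_queue_by_wait(enqueue_times, now, bounds=BUCKET_BOUNDS):
    """Count queued tasks per wait-time bucket.

    Returns len(bounds) + 1 counts: one per "<= bound" bucket plus a final
    overflow bucket for tasks waiting longer than the largest bound.
    """
    counts = [0] * (len(bounds) + 1)
    for enqueued_at in enqueue_times:
        wait = now - enqueued_at
        counts[bisect_left(bounds, wait)] += 1
    return counts
```

An alert on the overflow bucket being non-zero for several consecutive samples would be one white-box signal; the black-box counterpart is the synthetic workload submission the slide mentions.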
  • 30. Case 4 - Control plane soft failures I don’t feel so good!
  • 31. But first, let’s talk about Zombies
  • 32. Disconnected containers ● Early on, we had cases where ○ Some but not all of the control plane was working ○ User terminated their containers ○ Containers still running, but shouldn’t have been ● The “fix” - Mesos implicit reconciliation ○ Titus to Mesos - What containers are running? ○ Titus to Mesos - Kill these containers we know shouldn’t be running ○ System converges on consistent state 👍
  • 33. But what if? ● Backing store gets corrupted, or ● Control plane reads store incorrectly [Diagram labels: Cluster State; Scheduler; Controller; Host; ✖ ✖ ✖ ✖]
  • 34. Bad things occur ● 12,000 containers “reconciled” in < 1m ● An hour to restore service [Chart labels: Running Containers; Not running Containers]
  • 35. Guidance ● Know how to operate your cluster storage ○ Perform backups and test restores ○ Test corruption ○ Know failure modes, and know how to recover
  • 36. At Netflix, we ... ● Moved to less aggressive reconciliation ● Page on inconsistent data ○ Let existing containers run ○ Human fixes state and decides how to proceed ● Automated snapshot testing for staging
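A sketch of what "less aggressive reconciliation" combined with "page on inconsistent data" could look like: refuse to kill containers when the store's view disagrees with reality by more than a small fraction, and hand the decision to a human instead. The function name `plan_reconciliation` and the 5% threshold are assumptions for illustration; the slides do not describe the actual mechanism.

```python
def plan_reconciliation(expected_ids, running_ids, max_kill_fraction=0.05):
    """Decide what to do with containers the store does not know about.

    Returns {"action": "kill", "kill": {...}} for small discrepancies, or
    {"action": "page", "kill": set()} when the discrepancy is suspiciously
    large, in which case existing containers are left running.
    """
    expected = set(expected_ids)
    running = set(running_ids)
    orphans = running - expected
    if running and len(orphans) / len(running) > max_kill_fraction:
        # The store is more likely wrong than the fleet: page a human.
        return {"action": "page", "kill": set()}
    return {"action": "kill", "kill": orphans}
```

With a guard like this, a corrupted or misread backing store that reports zero expected containers produces a page instead of a fleet-wide kill.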
  • 38. Reducing container escape vectors ● Enforcement ○ Seccomp and AppArmor policies ● Cryptographic identity for each container ○ Leveraging host level Amazon and control plane provided identities ○ Validated by central Netflix service before secrets are provided
  • 39. Reducing impact of container escape vectors: user namespaces ● Root (or user) in container != Root (or user) on host ● Challenge: Getting it to work with persistent storage ● See user_namespaces(7)
  • 40. Lock down, isolate control plane ● Hackers are scanning for Docker and Kubernetes ● Reported lack of networking isolation in Google Borg ● We also thought our networking was isolated (wasn’t)
  • 41. Avoiding user host level access Vector ssh perf tools
  • 43. How does Netflix failover? ✖ Kong
  • 44. Netflix regional failover ● Kong evacuation of us-east-1: traffic diverted to other regions ● Fail back to us-east-1: traffic moved back to us-east-1 [Chart: API Calls Per Region, EU-WEST-1 and US-EAST-1]
  • 45. Infrastructure challenge ● Increase capacity during scale up of savior region ● Launch 1000s of containers in 7 minutes
  • 46. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
  • 47. Easy Right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds” Synthetic benchmarks missing 1. Heterogeneous workloads 2. Full end to end launches 3. Docker image pull times 4. Integration with public cloud networking
  • 48. Titus can do this by ... ● Dynamically changeable scheduling behavior ● Fleet wide networking optimizations
  • 49. Normal scheduling: trade-off for reliability [Diagram: App 1 and App 2 containers placed across VM1, VM2 ... VMn, on ENI1 and ENI 2, each showing IP1; Scheduling Algorithm dial between Spread and Pack]
  • 50. Failover scheduling: trade-off for speed [Diagram: the same VMs and ENIs with many additional App 1 and App 2 containers and additional IPs (IP2, IP3, IP4); Scheduling Algorithm dial between Spread and Pack]
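The Spread versus Pack dial on the two scheduling slides can be illustrated with a toy host picker whose behavior changes with a mode switch. This is not Fenzo's API: `pick_host` and its free-CPU model are simplifying assumptions, and real placement considers many more dimensions (memory, ENIs, images already on the host).

```python
def pick_host(free_cpus_by_host, requested_cpus, mode):
    """Pick a host for one container.

    mode="spread": prefer the emptiest host (limits the impact of a host crash).
    mode="pack":   prefer the fullest host that still fits (reuses warm hosts).
    Returns None if no host has enough free CPUs.
    """
    candidates = {
        host: free
        for host, free in free_cpus_by_host.items()
        if free >= requested_cpus
    }
    if not candidates:
        return None
    ordered = sorted(candidates)  # stable tie-breaking by host name
    if mode == "spread":
        return max(ordered, key=candidates.get)
    if mode == "pack":
        return min(ordered, key=candidates.get)
    raise ValueError("mode must be 'spread' or 'pack'")
```

Because the mode is just an argument, a scheduler built this way could switch from spreading to packing while a regional failover is in progress and switch back afterwards.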
  • 51. On each host ● Due to normal scheduling, host likely already has ... ○ Docker image downloaded ○ Networking interfaces and security groups configured ● Need to burst allocate IP addresses ○ Opportunistically batch allocate at container launch time ○ Likely if one container was launched more are coming ○ Garbage collect unused later
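The burst IP allocation described on this slide (batch allocate at launch time, garbage collect unused addresses later) can be sketched as a small per-host pool. The class `HostIpPool`, the batch size, and the `allocate_batch` callable standing in for the cloud networking call are all hypothetical.

```python
class HostIpPool:
    """Per-host pool that allocates IPs in batches (illustrative sketch)."""

    def __init__(self, allocate_batch, batch_size=4):
        # allocate_batch(n) -> list of n IPs; stands in for the cloud API call.
        self._allocate_batch = allocate_batch
        self.batch_size = batch_size
        self.free = []    # allocated but unassigned IPs
        self.in_use = {}  # container id -> IP

    def assign(self, container_id):
        """Give a container an IP, batch allocating when the pool is empty."""
        if not self.free:
            # If one container was launched, more are probably coming.
            self.free.extend(self._allocate_batch(self.batch_size))
        ip = self.free.pop(0)
        self.in_use[container_id] = ip
        return ip

    def release(self, container_id):
        """Return a stopped container's IP to the free pool."""
        self.free.append(self.in_use.pop(container_id))

    def garbage_collect(self, keep=0):
        """Remove and return unused IPs beyond `keep` so they can be freed."""
        surplus = self.free[keep:]
        self.free = self.free[:keep]
        return surplus
```

The trade-off is one slow allocation call amortized over several launches, at the cost of temporarily holding addresses that may never be used.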
  • 52. Results ● 7500 launched in 5 minutes [Chart: us-east-1 / prod containers started per minute]
  • 54. How far can a single Titus stack go? ● Speed and stability of scheduling ● Blast radius of mistakes
  • 55. Scaling options ● Idealistic ○ Continue to improve performance ○ Avoid making mistakes ● Realistic ○ Test a stack up to a known scale level ○ Contain mistakes
  • 56. Titus “Federation” ● Allows a Titus stack to be scaled out ○ For performance and reliability reasons ● Not to help with ○ Cloud bursting across different resource pools ○ Automatic balancing across resource pools ○ Joining various business unit resource pools
  • 57. Federation Implementation ● Users need only to know of the external single API ○ VIP - titus-api.us-east-1.prod.netflix.net ● Simple federation proxy spans stacks (cells) ○ Route these apps to cell 1, these others to cell 2 ○ Fan out & union queries across cells ○ Operators can route directly to specific cells
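The federation proxy behavior listed on this slide (route apps to cells, fan out and union queries) can be sketched in a few lines. Everything here is illustrative: `FederationProxy`, the in-memory cell dictionaries, and the routing table are assumptions and do not reflect the real Titus API.

```python
class FederationProxy:
    """Single API in front of several cells (illustrative sketch)."""

    def __init__(self, cells, routes, default_cell):
        self.cells = cells                # cell name -> {job id: app name}
        self.routes = routes              # app name -> cell name
        self.default_cell = default_cell  # used for apps without a route

    def cell_for(self, app):
        """Route these apps to cell 1, these others to cell 2."""
        return self.routes.get(app, self.default_cell)

    def submit(self, job_id, app):
        """Send a new job to exactly one cell and report which one."""
        cell = self.cell_for(app)
        self.cells[cell][job_id] = app
        return cell

    def list_jobs(self):
        """Fan the query out to every cell and union the results."""
        merged = {}
        for name in sorted(self.cells):
            for job_id, app in self.cells[name].items():
                merged[job_id] = (name, app)
        return merged
```

Users only see `submit` and `list_jobs`; operators who need to reach a specific cell can bypass the proxy and talk to that cell directly, as the slide notes.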
  • 58. Titus Federation [Diagram: in each of us-west-2, us-east-1, and eu-west-1, a Titus API (Federation Proxy) in front of Titus cell01, Titus cell02, …]
  • 59. How many cells? A few large cells ● Only as many as needed for scale / blast radius Why? Larger resource pools help with ● Cross workload efficiency ● Operations ● Bad workload impacts
  • 61. Simplified view of a server ● A fictional “16 vCPU” host ● Left and right are CPU packages ● Top to bottom are cores with hyperthreads
  • 62. Consider workload placement ● Consider three workloads ○ Static A, B, and C all which are latency sensitive ○ Burst D which would like more compute when available ● Let’s start placing workloads
  • 63. Problems ● Potential performance problems ○ Static C is crossing packages ○ Static B can steal from Static C ● Underutilized resources ○ Burst workload isn’t using all available resources
  • 64. Node level CPU rescheduling ● After containers land on hosts ○ Eventually, dynamic and cross host ● Leverages cpusets ○ Static - placed on single CPU package and exclusive full cores ○ Burst - can consume extra capacity, but variable performance ● Kubernetes - CPU Manager (beta)
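The static and burst placement rules on this slide can be illustrated on the fictional 16 vCPU host (two packages, four cores each, two hyperthreads per core). This is a toy model, not Titus code: the function names are hypothetical, and CPU ids are numbered so that hyperthread siblings are adjacent, which does not match how Linux numbers CPUs on many real machines.

```python
def build_host(packages=2, cores_per_package=4, threads_per_core=2):
    """Return {package: [[cpu ids of one core], ...]} for a fictional host."""
    host, cpu = {}, 0
    for package in range(packages):
        host[package] = []
        for _ in range(cores_per_package):
            host[package].append(list(range(cpu, cpu + threads_per_core)))
            cpu += threads_per_core
    return host


def place_static(host, vcpus, threads_per_core=2):
    """Reserve whole cores on a single package; return the cpuset or None."""
    cores_needed = -(-vcpus // threads_per_core)  # ceiling division
    for package in sorted(host):
        if len(host[package]) >= cores_needed:
            taken = host[package][:cores_needed]
            host[package] = host[package][cores_needed:]
            return [cpu for core in taken for cpu in core]
    return None  # would have to cross packages, so refuse


def burst_cpuset(host):
    """Burst workloads may use every CPU not reserved by a static workload."""
    return [cpu for package in sorted(host) for core in host[package] for cpu in core]
```

Placing three 4 vCPU static workloads this way keeps each one on a single package with exclusive cores, and leaves the remaining two cores for the burst workload.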
  • 65. Opportunistic workloads ● Enable workloads to burst into underutilized resources ● Differences between utilized and total [Chart: utilization over time, showing Resources (Total), Resources (Allocated), Resources (Utilized)]