SlideShare una empresa de Scribd logo
1 de 68
Descargar para leer sin conexión
How Netflix manages
petabyte scale
Apache Cassandra
in the cloud
Joey Lynch, Vinay Chella
Netflix’s Distributed Database Engineers
Who are we?
Vinay Chella
Distributed Systems Engineer
Focusing on Apache Cassandra and Data
Abstractions
Cloud Data Engineering
Netflix
Joey Lynch
Distributed Systems Engineer
Distributed system addict and data wrangler
Cloud Data Engineering
Netflix
Agenda Why use Cassandra?
Scale of Cassandra
Life of Cassandra Cluster
- Where does it start?
- Provisioning
- Keep it running
- Migration / Retiring
Murphy’s law applied
Why
Apache
Cassandra
— Millions of operations per sec
— Global data replication
— Failure isolation at rack level
— Chaos ready database
— Tunable consistency
— Log structured storage engine
— 10’s of thousands instances
— 100’s of global C* clusters
— >6 PB of data
— Millions of requests / second
— Replicating several GiB/sec data
across the globe
Scale
Story of Apache Cassandra and Netflix
Inception Provision Keep it running Migrations
Inception
Where does it all start?
Inception
Inception
Inception
Service philosophy
— Context not control
◆ Education
◆ Tooling
— SLOs are key
◆ Size, rate, latency,
availability
— Every party must be
responsible
Inception
Inception
Inception
Invest in DevEd
Inception
Inception
Better Tooling
Inception
Cost insights
Inception
Maintenance
Inception
Maintenance - Repair Insights
Maintenance - Backup Insights
Inception
Maintenance - Node Insights
Inception
SLOs are Key
Inception
Whom to page?
Inception
Good contracts
make good
partners!
Story of C*
Inception
Provision
Keep it running Migrations
Provision
Get up and running fast!
Provision
Prove it before serving it
Robust AMI build process Pre-flight check
Provision
AMI Build
process
Provision
Pre-flight check EC2 Instance lifecycle
Story of C*
Inception Provision
Keep it running
Migrations
Keep it
Running
Keep it Running
Imperative Control Plane Declarative Control Plane
Keep it Running
Declarative Control PlaneImperative Control Plane
Pros:
— Easy to implement
— Easy to understand
Cons:
— Fragile to failure
— Scaling issues
— Have to implement
parallelism
Pros:
— Lower total cost of
ownership
— Parallel out of the box
Cons
— Requires different
engineering mindset
— Hard to implement
— Have to stop
parallelism
Keep it Running
Declarative ToolsImperative Tools
jenkins.io/
— “Orchestrators”
◆ Jenkins
◆ Fabric
◆ Ansible
— Imperative
operation
ansible.com/fabfile.org/
— Distributed Databases
— Distributed operation
— Similar to Actor model
puppet.com
consul.io/
Declarative Control Plane
1. Obtain desires from
datastore
2. Obtain state of world
3. If state != desire, act!
Keep it Running
Keep it Running
Ring Health Imperative Solution
— O(#nodes) commands
— Single Network Path
— Hard to synthesize global
view
Keep it Running
Ring Health Declarative Solution
Stream Processing!
— Desire: Healthy <30m
— Observe: Full cluster
state
— Action: Remediate or
page
Keep it Running
Ring Health Declarative Solution
— O(#nodes) messages
— Degraded hardware is
hard to SSH into
◆ Failed ephemerals
◆ Failed network
Keep it Running
Hardware Health Imperative Solution
Keep it Running
Hardware Health Declarative Solution
— Desire: Degraded
hardware should die <10m
— Observe: Local disks,
creds, filesystems
— Action: Request
termination
Keep it Running
JVM Health Declarative Solution
— Desire: JVM
Throughput > 90%
— Observe: GC vs
Runtime
— Action: core dump +
terminate
https://github.com/Netflix-Skunkworks/jvmquake
Keep it Running
Backup Declarative Solution
— Desire: Full snapshot
every interval
— Observe:
◆ State of data
◆ State of S3
— Action:
◆ Upload diff
Keep it Running
Case Study: Cassandra Repair
Keep it Running
Case Study: Cassandra Repair
Keep it Running
At Scale Subrange is Required
Keep it Running
Repair Imperative Solution
— O(lots) commands
— Remote JMX = :(
— Single point of failure = :(
— 1 month job duration = D:
Keep it Running
Repair Declarative Solution
— Distributed Control Plane
— Makes progress when
database is available
— Every node follows repair
state machine
https://issues.apache.org/jira/browse/CASSANDRA-14346
Keep it Running
Repair Declarative Solution (#14346)
Keep it Running
Cassandra Ecosystem
Keep it Running
C* Instance ecosystem
PAGE
CDEPre-flight check
AWS IO detection
W
I
N
S
T
O
N
CDE Service
Alert Atlas Mantis
C*
C*
C*
Priam
Bolt
Cluster
Metadata
Remediation
C*
Story of C*
Inception Provision Keep it running
Migration
Migration
Migration
Migration Philosophy
— Immutable infrastructure
— Seamless and transparent migrations
Migration
Software Migration
Migration
Hardware Migration
Desires
S3
Instance
(i2.xl)
agent
C* Data
Incremental
backup of C*
Data
desired
instance?
i2.2xl
Instance
(i2.2xl)
agent
C* Data
Download
latest
SSTables
from S3
Copy differential
data from source
Desires
desired
instance? i2.2xl
Hardware Migration: The Numbers
Naive solution,
sequential transfer
Parallel S3 download
sequential fixup
Parallel download and
zone wide fixup
Migration
Data Migration
Cast Studies
Murphy’s
law
Cautionary Tale: Rarely run automation
— Auto remediation to clean up stale backups,
repair snapshots
◆ Triggers only above FATAL disk threshold
◆ Assumption mismatch
Cautionary Tale: Rarely run automation
— Auto remediation to clean up stale backups,
repair snapshots
◆ Triggers only above FATAL disk threshold
◆ Assumption mismatch
— Result:
◆ Automation ended up deleting real backup
data
Do you restore your backups constantly?
Take Away
Take Away
Invest in context sharing, tooling and
declarative(desire) infrastructure to run
Apache Cassandra at petabyte scale
Thank You.

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflix
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
 
Terraform Basics
Terraform BasicsTerraform Basics
Terraform Basics
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Building and running cloud native cassandra
Building and running cloud native cassandraBuilding and running cloud native cassandra
Building and running cloud native cassandra
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache CassandraA Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 

Similar a How netflix manages petabyte scale apache cassandra in the cloud

20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
Wei Ting Chen
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
DataStax
 

Similar a How netflix manages petabyte scale apache cassandra in the cloud (20)

BigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current TrendsBigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current Trends
 
Velocity 2018 preetha appan final
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan final
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
 
Pull, don’t push: Architectures for monitoring and configuration in a microse...
Pull, don’t push: Architectures for monitoring and configuration in a microse...Pull, don’t push: Architectures for monitoring and configuration in a microse...
Pull, don’t push: Architectures for monitoring and configuration in a microse...
 
Pull, Don't Push! Sensu Summit 2018 Talk
Pull, Don't Push! Sensu Summit 2018 TalkPull, Don't Push! Sensu Summit 2018 Talk
Pull, Don't Push! Sensu Summit 2018 Talk
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard
 
Infrastructure as Code to Maintain your Sanity
Infrastructure as Code to Maintain your SanityInfrastructure as Code to Maintain your Sanity
Infrastructure as Code to Maintain your Sanity
 
Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015Talk at the Boston Cloud Foundry Meetup June 2015
Talk at the Boston Cloud Foundry Meetup June 2015
 
The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
 
Why Distributed Databases?
Why Distributed Databases?Why Distributed Databases?
Why Distributed Databases?
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
 
CampusSDN2017 - Jawdat: SDN Technology Evolvement
CampusSDN2017 - Jawdat: SDN Technology EvolvementCampusSDN2017 - Jawdat: SDN Technology Evolvement
CampusSDN2017 - Jawdat: SDN Technology Evolvement
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 

Más de Vinay Kumar Chella

Más de Vinay Kumar Chella (7)

Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
 
Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0
 
Query and audit logging in cassandra
Query and audit logging in cassandraQuery and audit logging in cassandra
Query and audit logging in cassandra
 
A glimpse of cassandra 4.0 features netflix
A glimpse of cassandra 4.0 features   netflixA glimpse of cassandra 4.0 features   netflix
A glimpse of cassandra 4.0 features netflix
 
Honest performance testing with NDBench
Honest performance testing with NDBenchHonest performance testing with NDBench
Honest performance testing with NDBench
 
Real world repairs
Real world repairsReal world repairs
Real world repairs
 
CassandraSummit2015_Cassandra upgrades at scale @ NETFLIX
CassandraSummit2015_Cassandra upgrades at scale @ NETFLIXCassandraSummit2015_Cassandra upgrades at scale @ NETFLIX
CassandraSummit2015_Cassandra upgrades at scale @ NETFLIX
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

How netflix manages petabyte scale apache cassandra in the cloud