How We Make Adding
and Removing Nodes
Faster and Safer
Asias He, Software Developer
Presenter
Asias He, Software Developer
Asias He is a software developer with over 10 years of programming
experience. In the past, he worked on the Debian Project, the Solaris
kernel, KVM virtualization for Linux, and the OSv unikernel. He
now works on Seastar and ScyllaDB.
Overview of the
node operations
1. Replace operation
■ To replace a dead node
● Token ring does not change
● Uses the same tokens and host_id as the replaced node
■ Suffers from the resumability issue
● If replace fails after 99% of the data is streamed, running replace again must stream all the data again
■ Suffers from the "not streaming latest copy" issue
● Streams data from only one of the replicas, which might not have the latest copy
● Streaming from all the replicas would solve the problem
● But streaming the same data more than once is too heavy and wasteful
Replace operation: Not streaming latest copy
■ What do we expect from a QUORUM read that follows a QUORUM write?
● Strong consistency: Write CL + Read CL > RF (e.g., with RF = 3, QUORUM writes and reads each touch 2 replicas, and 2 + 2 > 3; see the sketch below)
● X2 is newer than X1
Node1 = X2
Node2 = X1 (missed the write)
Node3 = X2
→ Correct quorum read: any two replicas include at least one copy of X2
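To make that guarantee concrete, here is a minimal sketch in Python using the cassandra-driver package, which also speaks to ScyllaDB. The keyspace, table, and column names are hypothetical, for illustration only.

```python
# A minimal sketch of the QUORUM-write / QUORUM-read pattern above, using
# the Python cassandra-driver (compatible with ScyllaDB). The keyspace,
# table, and column names are hypothetical.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["node1", "node2", "node3"])
session = cluster.connect("demo_ks")

# QUORUM write: with RF=3, at least 2 replicas must acknowledge.
write = SimpleStatement(
    "UPDATE kv SET value = %s WHERE key = %s",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("X2", "x"))  # suppose Node2 misses this write

# QUORUM read: any 2 replicas overlap the write quorum, so at least one
# of them holds X2, and the newer timestamp wins -- the read returns X2.
read = SimpleStatement(
    "SELECT value FROM kv WHERE key = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(read, ("x",)).one()
assert row.value == "X2"  # Write CL (2) + Read CL (2) > RF (3)
```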
■ Node 3 dies and Node 4 replaces it
■ What do we expect from a QUORUM read that follows a QUORUM
write?
Replace operation: Not streaming latest copy
Node1 = X2
Node2 = X1
Node4 (replacing) = ?
■ This is what you would expect:
Replace operation: Not streaming latest copy
Node1 = X2 ──(Stream: X2)──▶ Node4 (replacing) = X2
Node2 = X1
■ This is what (may) happen:
Replace operation: Not streaming latest copy
Node1 = X2
Node2 = X1 ──(Stream: X1)──▶ Node4 (replacing) = X1
■ As a result, the quorum read is wrong:
Replace operation: Not streaming latest copy
Node1 = X2
Node2 = X1
Node4 (replacing) = X1
→ Wrong quorum read! A quorum of {Node2, Node4} returns the stale X1
■ What if we have to replace again before repair is run?
■ Node 1 dies and Node 5 replaces it
Replace operation: Not streaming latest copy
Node2 = X1 ──(Stream: X1)──▶ Node5 (replacing) = X1
Node4 = X1
→ New data was lost! No replica holds X2 anymore (simulated in the sketch below)
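The whole failure mode fits in a toy, self-contained model (hypothetical names, not ScyllaDB code): replace streams each range from a single source replica, so it can copy the stale X1, and a second replace can then erase X2 entirely.

```python
# Toy model of the slides above: replace streams from ONE replica only.
replicas = {"node1": "X2", "node2": "X1", "node3": "X2"}  # after the write

def replace(dead, new, source):
    """Replace a dead node by streaming its range from a single replica."""
    del replicas[dead]
    replicas[new] = replicas[source]

replace("node3", "node4", source="node2")  # streams the stale X1
replace("node1", "node5", source="node2")  # again, before any repair runs
print(replicas)  # {'node2': 'X1', 'node4': 'X1', 'node5': 'X1'} -- X2 is gone
```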
2. Rebuild operation
■ To get all the data this node owns from other nodes
● E.g., rebuilding a new DC
● Token ring does not change
■ Suffers from the resumability issue
■ Suffers from the "not streaming latest copy" issue
● Streams data from only one of the replicas, which might not have the latest copy
3. Removenode operation
■ To remove a dead node from the cluster
● Token ring changes
■ Suffers from the resumability issue
■ Suffers from the "not streaming latest copy" issue
● Remaining nodes pull data from other nodes for the new ranges they own
● They stream from only one of the replicas, which might not have the latest copy
4. Decommission operation
■ To remove a live node from the cluster
● Token ring changes
■ Suffers from the resumability issue
■ Does not suffer from the "not streaming latest copy" issue
● The leaving node pushes its data to the other nodes that become the new owners
Node3 (leaving) = X2, Y2
  ──(Stream: X2)──▶ Node1 = X2
  ──(Stream: Y2)──▶ Node2 = Y2
Node 1: new owner of the range for X2
Node 2: new owner of the range for Y2
Node 3: loses the ranges for X2 and Y2
5. Bootstrap operation
■ To add a new node to the cluster
● Token ring changes
■ Suffers from the resumability issue
■ Does not suffer from the "not streaming latest copy" issue
● The new node pulls data from the existing nodes that are losing the token ranges
Node3 (joining) = X2, Y2
  ◀──(Stream: X2)── Node1 = X2
  ◀──(Stream: Y2)── Node2 = Y2
Node 1: loses the range for X2
Node 2: loses the range for Y2
Node 3: new owner of the ranges for X2 and Y2
Node operation summary
Operation      Token ring change   Resumable issue   Latest copy issue
Replace        No                  Yes               Yes
Rebuild        No                  Yes               Yes
Removenode     Yes                 Yes               Yes
Decommission   Yes                 Yes               No
Bootstrap      Yes                 Yes               No
Solutions to
the problems
Repair based node operations
The idea: use repair to sync data between replicas instead of streaming
Benefits of repair based node operations
■ Latest copy is guaranteed
● The node being operated on will always end up with the latest copy
■ Resumable in nature
● Repair skips already-synced data very quickly
● E.g., a restarted replace operation resumes from where it failed
■ No extra data is streamed
● E.g., rebuilding twice will not stream the same data twice
■ Free repair during node operations
● No need to run repair before/after the node operations
● Simplifies the procedure and reduces the chance of mistakes
■ Unified code path for node operations and repair
● Retires the regular streaming code
■ The way you operate the cluster stays the same
● You can still use the nodetool rebuild and decommission commands
Isn't repair a heavy operation?
■ Node operations assume the data is already consistent
● Making data consistent is repair's job
● We recommend running repair before node operations
● Repair + streaming won't be faster than doing only repair
■ Old repair (partition-level repair) is not fast enough
● Over-streaming problem
● Granularity is ~100 partitions
■ New repair (row-level repair, introduced in Scylla 3.1) is fast
● No over-streaming
● Only mismatched rows are synced
● The foundation of repair based node operations (see the sketch below)
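For intuition, here is a conceptual sketch of the row-level idea: compare per-row hashes across replicas and transfer only the rows that disagree. It is a simplified model (replicas as dicts of key → (timestamp, value)), not ScyllaDB's actual repair implementation.

```python
# Conceptual row-level repair: only mismatched rows are transferred.
import hashlib

def row_hash(key, ts, value):
    return hashlib.sha256(f"{key}:{ts}:{value}".encode()).digest()

def repair_range(replicas):
    """Sync one token range across replicas; return rows transferred."""
    transferred = 0
    all_keys = set().union(*(r.keys() for r in replicas))
    for key in all_keys:
        versions = [r.get(key) for r in replicas]
        hashes = {row_hash(key, *v) for v in versions if v is not None}
        if len(hashes) > 1 or None in versions:
            # Mismatch: push the newest version only to replicas that
            # disagree -- matching rows are never re-streamed.
            newest = max(v for v in versions if v is not None)
            for r, v in zip(replicas, versions):
                if v != newest:
                    r[key] = newest
                    transferred += 1
    return transferred

node1 = {"x": (2, "X2")}
node2 = {"x": (1, "X1"), "y": (1, "Y1")}
node4 = {}  # replacing node: repair leaves it holding the latest copies
print(repair_range([node1, node2, node4]))  # 4 rows move, none duplicated
```

Because already-synced rows hash identically and are skipped, rerunning `repair_range` after a failure only touches what is still missing, which is why the operation is resumable by nature.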
Optimize repair for node operations
■ Increased the internal row buffer size
● From 256 KiB (3.1) to 32 MiB (3.2+)
● Good for cross-DC clusters with high-latency links (see the arithmetic below)
■ Improved data transfer efficiency between nodes
● From rpc verb (3.1) to rpc stream (3.2+)
● More efficient for transferring large amounts of data
● Same transport as regular stream based node operations
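As a rough illustration of why the larger buffer matters on high-latency links: if each buffered batch costs about one round trip, per-link throughput is bounded by buffer_size / RTT. A back-of-the-envelope sketch under that assumed model (not measured data):

```python
# Back-of-the-envelope only: assume each buffered batch costs roughly one
# round trip on the 80 ms cross-DC link from the test setup, so per-link
# throughput is bounded by buffer_size / RTT.
RTT = 0.080  # seconds

for size in (256 * 2**10, 32 * 2**20):  # 256 KiB (3.1) vs 32 MiB (3.2+)
    ceiling = size / RTT / 2**20  # MiB/s
    print(f"{size // 2**10:>6} KiB buffer -> ~{ceiling:.0f} MiB/s ceiling")
# 256 KiB -> ~3 MiB/s; 32 MiB -> ~400 MiB/s (a 128x higher bound)
```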
Test results
Repair vs. stream based rebuild operation 1/2

Rebuild from 1 DC   Method   Space after rebuild   Time to rebuild   Notes
us-east             Stream   26 GB                 573 s             Streams 10% of vnode ranges at a time
us-east             Repair   26 GB                 368 s             Repairs more vnode ranges in parallel; ~1.5x less time
● 3 nodes in the cluster, 1 node per DC, 3 DCs
● AWS, i3.2xlarge
● 150M partitions on each node
● RF = { eu-west=1, us-east=1, us-west-2=1 }
● 80 ms latency
● Run rebuild on DC us-west-2
Repair vs. stream based rebuild operation 2/2

Rebuild from 2 DCs    Method   Space after rebuild   Time to rebuild   Notes
us-east and eu-west   Stream   39 GB                 1500 s            Two rebuild operations; streams 2x the data; total time 573 + 927 = 1500 s
us-east and eu-west   Repair   26 GB                 611 s             Single rebuild syncs from both DCs; streams no extra data; ~2.5x less time
● 3 nodes in the cluster, 3 DCs, 1 node per DC
● AWS, i3.2xlarge
● 150M partitions on each node
● RF = { eu-west=1, us-east=1, us-west-2=1 }
● 80 ms latency
● Run rebuild on DC us-west-2
Thank you Stay in touch
Any questions?
Asias He
asias@scylladb.com
@asias_he


Editor's notes

  1. Here is an example of the "not streaming latest copy" issue. We have 3 nodes and perform a quorum write, which the second node misses. When we then do a quorum read, we still get the correct result.
  2. This is what you would expect: the latest copy is streamed to the replacing node.
  3. But this is what may happen: the old copy is streamed to the replacing node.
  4. As a result, the quorum read will be wrong.
  5. In this case, node 1 dies and node 5 replaces it. Unfortunately, node 2 streams the old copy to the replacing node. As a result, the new data is lost!
  6. The first test is a rebuild from 1 DC: 3 nodes in the cluster, ... As we can see, in this test the repair based operation is actually faster. This is mainly because repair internally works on more vnode ranges in parallel than streaming: streaming handles 10% of the vnode ranges at a time, with only one pending stream plan in flight. However, even if repair based rebuild were slower in some cases (for instance, without the parallelism advantage), it would still be acceptable, because repair does more work and is much safer.
  7. The second test is a rebuild from 2 DCs. The stream based operation has to perform two rebuild operations, streaming twice as much data to the rebuilding node. The repair based operation needs only one rebuild to sync from both DCs; it streams no extra data and takes less time. The time difference is around 2.5x.