Raft in Scylla
Konstantin Osipov
ScyllaDB
Konstantin Osipov
Team Lead, ScyllaDB
Kostja is one of the developers behind Scylla lightweight transactions. His current focus is
Raft log replication and its applications to schema changes, topology changes, and tablets.
Recap Scylla Summit 2019
▪ LWT: the first strongly consistent feature
▪ Available in 4.0
▪ Pay per use
2020-12-25: Tested with Jepsen!
UPDATE employees
SET join_date = '2018-05-19' WHERE
firstname = 'John' AND
lastname = 'Doe'
IF join_date != null;
[applied]
False
▪ 3 network round trips per write
▪ Must read the old value before write
LWT use of Paxos
[Diagram: replicas R1, R2, R3. Steps: "Can I propose a value?" → "Check condition" → "Accept new value" → "Learn decision" → "Decision made"]
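The three round trips of a conditional write can be sketched as a toy model. This is a minimal illustration, not Scylla's implementation: the `Replica` class, the `conditional_update` function, and the ballot handling are hypothetical, and the sketch assumes that "Can I propose a value?" also returns the old row, so that the condition check costs no extra round trip.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Replica:
    """Hypothetical replica state for a single row (toy model)."""
    promised: int = 0                         # highest ballot promised so far
    accepted: Optional[tuple] = None          # (ballot, row) accepted but not yet learned
    row: dict = field(default_factory=dict)   # committed row

    def prepare(self, ballot):
        # "Can I propose a value?" Promise the ballot and return the old row.
        if ballot <= self.promised:
            return None
        self.promised = ballot
        return dict(self.row)

    def accept(self, ballot, new_row):
        # "Accept new value", unless a newer ballot was promised meanwhile.
        if ballot < self.promised:
            return False
        self.accepted = (ballot, new_row)
        return True

    def learn(self, new_row):
        # "Learn decision": make the accepted row the committed row.
        self.row = dict(new_row)


def conditional_update(replicas, ballot, condition, update):
    """Return (applied, round_trips) for UPDATE ... IF <condition>."""
    quorum = len(replicas) // 2 + 1
    round_trips = 1                           # round trip 1: prepare + read
    rows = [p for p in (r.prepare(ballot) for r in replicas) if p is not None]
    if len(rows) < quorum:
        return False, round_trips
    current = rows[0]                         # assumption: replicas already agree
    if not condition(current):                # "Check condition" on the old value
        return False, round_trips             # [applied] False
    new_row = {**current, **update}
    round_trips += 1                          # round trip 2: accept
    if sum(r.accept(ballot, new_row) for r in replicas) < quorum:
        return False, round_trips
    round_trips += 1                          # round trip 3: learn
    for r in replicas:
        r.learn(new_row)
    return True, round_trips
```

With an empty row, the condition `join_date != null` fails after the first round trip, which matches the `[applied] False` result on the slide. A production protocol also has to finish proposals left in progress by earlier ballots, which this sketch ignores.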
What is Raft anyway?
▪ Raft provides strong consistency efficiently
▪ Only the leader can accept writes
[Diagram: the Leader sends "Append entries" to two Followers, then "Apply"; "Decision made"]
… 1 network round trip per write on average
Raft log replication
▪ Each node has a copy of Raft log
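A minimal sketch of leader-driven log replication follows. It is a toy under stated assumptions, not Scylla's Raft: the `Node` class and `leader_write` function are hypothetical, and terms, elections, log consistency checks, and retries are all left out. It only shows that the leader appends an entry, replicates it, and applies it once a majority of nodes hold it.

```python
class Node:
    """Hypothetical Raft participant holding its own copy of the log."""

    def __init__(self, name):
        self.name = name
        self.log = []            # replicated log of (key, value) entries
        self.commit_index = 0    # number of log entries applied so far
        self.state = {}          # state machine built from applied entries

    def append_entries(self, entries):
        # Follower side of "Append entries": store the entries and acknowledge.
        self.log.extend(entries)
        return True

    def apply_committed(self, commit_index):
        # "Apply": feed committed entries to the state machine, in log order.
        target = min(commit_index, len(self.log))
        while self.commit_index < target:
            key, value = self.log[self.commit_index]
            self.state[key] = value
            self.commit_index += 1


def leader_write(leader, followers, key, value, reachable):
    """Replicate one entry; return True once a majority stores it."""
    entry = (key, value)
    leader.log.append(entry)
    acks = 1                                  # the leader counts itself
    for follower in followers:
        if follower.name in reachable and follower.append_entries([entry]):
            acks += 1
    majority = (len(followers) + 1) // 2 + 1
    if acks < majority:
        return False                          # not committed, nothing applied
    commit = len(leader.log)
    leader.apply_committed(commit)            # "Decision made"
    for follower in followers:
        if follower.name in reachable:
            follower.apply_committed(commit)
    return True
```

In a three-node group the write succeeds with one follower unreachable, because the leader plus one follower already form a majority. The client waits for a single leader-to-follower exchange, which is where the "1 network round trip per write" figure comes from.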
Scylla plans to use Raft for:
▪ Schema changes
▪ Topology changes
▪ Tablets
Table key ranges: 0-99, 100-199, 200-299
Node A: Tablet 1 (0-99)
Node B: Tablet 2 (100-199)
Node C: Tablet 3 (200-299)
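The mapping in this diagram can be sketched as a sorted range lookup. The `TABLETS` list and the `locate` function are hypothetical names used only for illustration; the ranges and node assignments are the ones shown above.

```python
from bisect import bisect_right

# (first key, last key, tablet, node), sorted by first key, as in the diagram.
TABLETS = [
    (0, 99, "Tablet 1", "Node A"),
    (100, 199, "Tablet 2", "Node B"),
    (200, 299, "Tablet 3", "Node C"),
]
STARTS = [first for first, _, _, _ in TABLETS]


def locate(key):
    """Return (tablet, node) for a key, or raise KeyError if no range covers it."""
    i = bisect_right(STARTS, key) - 1     # last tablet starting at or before key
    if i < 0 or key > TABLETS[i][1]:
        raise KeyError(key)
    _, _, tablet, node = TABLETS[i]
    return tablet, node
```

For example, `locate(150)` returns `("Tablet 2", "Node B")`. Because the table is just data, moving a tablet to another node only changes one row of the mapping.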
Topology changes on Raft
Topology changes in Scylla
▪ Safe when one change is done at a time
▪ Rely on 30+ second timeouts for consistency
▪ Allowed even on a significantly degraded cluster (risking split brain)
Topology changes using Raft
▪ Durable and linearizable
▪ Permit adding multiple nodes
▪ Permit background data rebalancing
▪ Require a majority of replicas alive to succeed
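The majority requirement is simple arithmetic, sketched below. The function names are hypothetical; the point is that two disjoint sides of a network split can never both hold a majority, which is what rules out split-brain topology changes.

```python
def majority(total_nodes):
    """Smallest number of nodes that forms a majority."""
    return total_nodes // 2 + 1


def can_change_topology(total_nodes, alive_nodes):
    """A Raft-based change can commit only if a majority is alive."""
    return alive_nodes >= majority(total_nodes)


def both_sides_can_commit(total_nodes, side_a):
    """True if both sides of a split could commit (never the case)."""
    side_b = total_nodes - side_a
    return (can_change_topology(total_nodes, side_a)
            and can_change_topology(total_nodes, side_b))
```

In a five-node cluster, three live nodes are enough and two are not. The price is availability: a minority partition has to wait until it can reach the majority again.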
Schema changes using Raft
Schema changes in Scylla
▪ Each node owns a copy of the schema
▪ Schema change is first made locally
▪ Then eventually pushed through the cluster
▪ Last-timestamp-wins rule is used for reconciliation
Node A:
> CREATE TABLE e (a int);
OK (hash: a81e, ts: 1609420790)
Node B:
> CREATE TABLE e (a int, b int);
OK (hash: 2fa3, ts: 1609420792)
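The reconciliation rule can be sketched with the two versions from the example above. The `SchemaVersion` type and `reconcile` function are hypothetical, and breaking timestamp ties by hash is an assumption made here only to keep the sketch deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaVersion:
    version_hash: str   # schema hash as printed in the example
    ts: int             # timestamp of the schema change
    columns: tuple      # column names of table e


def reconcile(a, b):
    """Last-timestamp-wins: keep the version with the newer timestamp."""
    # Assumption: ties are broken by hash so that every node picks the same winner.
    return max(a, b, key=lambda v: (v.ts, v.version_hash))


node_a = SchemaVersion("a81e", 1609420790, ("a",))
node_b = SchemaVersion("2fa3", 1609420792, ("a", "b"))
winner = reconcile(node_a, node_b)
```

Here `winner` is the version created on Node B, since its timestamp is two seconds newer. Both CREATE TABLE statements reported OK, yet one definition silently loses.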
INSERT/UPDATE when schemas differ
▪ Each data request carries a schema version
▪ Missing versions can be pulled from peers
Node A (a81e):
> INSERT INTO e (a) VALUES (1);
hash: 2fa3, row: (1, null)
Node B (2fa3):
> INSERT INTO e (a) VALUES (1);
hash: 2fa3, row: (1, null)
Schema changes using Raft
▪ Each node continues to store a copy of the schema
▪ A change is first persisted in a global Raft log
▪ On success, it’s applied on replicas
▪ Schema changes are now linearizable and consistent
▪ Nodes catch up with schema history during boot
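A minimal sketch of the log-first approach is shown below. The `SchemaLog` and `SchemaNode` classes are hypothetical, and a plain Python list stands in for the replicated Raft log; the sketch only illustrates that every node applies the same changes in the same order, and that a booting node replays what it missed.

```python
class SchemaLog:
    """Stand-in for the global Raft log: an ordered list of schema changes."""

    def __init__(self):
        self.entries = []

    def append(self, table, columns):
        # In a real system this returns only after a majority persisted the entry.
        self.entries.append((table, columns))
        return len(self.entries)


class SchemaNode:
    """A node keeping its own copy of the schema, built from the log."""

    def __init__(self):
        self.applied = 0     # number of log entries applied so far
        self.schema = {}     # table name -> column names

    def catch_up(self, log):
        # Apply every entry not seen yet, strictly in log order.
        for table, columns in log.entries[self.applied:]:
            self.schema[table] = columns
            self.applied += 1


log = SchemaLog()
node_a, node_b = SchemaNode(), SchemaNode()
log.append("e", ("a",))
node_a.catch_up(log)
log.append("e", ("a", "b"))
node_a.catch_up(log)
node_b.catch_up(log)     # node B boots late and replays the full history
```

After the replay both nodes hold the same schema and the same applied position, regardless of when each of them joined.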
Tablets
Token based partitioning
▪ Partition key is hashed to an integer (token)
▪ Nodes own ranges of tokens
▪ Provides even distribution of data and traffic
▪ Hotspots if partitions have many clustering rows
[Figure: partition keys (pk), the number of clustering keys (ck) in each partition, their tokens, and the resulting footprint. pk - partition key, ck - clustering key]
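Token-based placement and the resulting hotspots can be sketched as follows. Everything here is an illustrative assumption: the ring size, the three equal ranges, the `blake2b` hash (a stand-in for the partitioner's real hash function), and the row counts in the usage example.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

RING_SIZE = 2 ** 16   # assumption: a small token space for readability

# Each node owns the tokens up to and including its upper bound.
RING = [
    (RING_SIZE // 3, "Node A"),
    (2 * RING_SIZE // 3, "Node B"),
    (RING_SIZE - 1, "Node C"),
]
BOUNDS = [bound for bound, _ in RING]


def token(pk):
    """Hash a partition key to an integer token."""
    digest = hashlib.blake2b(pk.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(pk):
    """Find the node whose token range contains the key's token."""
    return RING[bisect_left(BOUNDS, token(pk))][1]


def rows_per_node(partitions):
    """Sum clustering rows per node; partitions maps pk -> row count."""
    load = Counter()
    for pk, clustering_rows in partitions.items():
        load[owner(pk)] += clustering_rows
    return load


# Hypothetical partitions: most are small, one has many clustering rows.
load = rows_per_node({"a": 1, "b": 2, "c": 3, "g": 21, "t": 11, "u": 1})
```

Hashing spreads partition keys evenly, but all 21 rows of partition `g` land on whichever node owns its token. The balance is per partition, not per row, so one large partition becomes a hotspot.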
Tablet partitioning
▪ Tablet is a new kind of partition
▪ It stores a primary key range, not a single partition key
▪ Tablet ranges are subject to dynamic load balancing
▪ Size of each tablet is configurable (e.g. 64MB)
Raft for Tablets
▪ Manageable number of Raft groups (~100,000)
▪ No client-side timestamps
▪ Provides isolation for ALL queries
▪ Writes do not require a read
▪ No need to repair
▪ Strong consistency of materialized views
Strong consistency by default
Raft in Scylla: summary
▪ Raft extended to efficiently support many groups
▪ Raft and Tablet partitioning = fast strong consistency
▪ Linearizable, more powerful schema and topology changes
▪ High Availability and partition tolerance of Cassandra are mostly unaffected
Thank You
@kostja_osipov
kostja@scylladb.com
Konstantin Osipov
Download Scylla Open Source:
scylladb.com/download
Talk to an expert:
scylladb.com/consultation
Take a test drive:
scylladb.com/test-drive
Experience Scylla for Yourself
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

Eventually, Scylla Chooses Consistency

  • 1. Raft in Scylla Konstantin Osipov ScyllaDB
  • 2. Konstantin Osipov Team Lead, ScyllaDB Kostja’s one of the developers behind Scylla lightweight transactions. His current focus is Raft log replication and its applications to schema, topology changes and tablets. 2
  • 3. Recap Scylla Summit 2019 ▪ LWT: the first strongly consistent feature ▪ Available in 4.0 ▪ Pay per use 3 2020-12-25: Tested with Jepsen! UPDATE employees SET join_date = '2018-05-19' WHERE firstname = 'John' AND lastname = 'Doe' IF join_date != null; [applied] False
  • 4. ▪ 3 network round trips per write ▪ Must read the old value before write LWT use of Paxos 4 R1 R 2 R 3 Decision made Can I propose a value? Check condition Accept new value Learn decision
  • 5. What is Raft anyway? ▪ Raft provides strong consistency efficiently ▪ Only the leader can accept writes 5 Leader Append entries Apply Follower Follower Decision made … 1 network round trip per write on average
  • 6. Raft log replication ▪ Each node has a copy of Raft log 6
  • 7. Scylla plans to use Raft for: ▪ Topology changes 7
  • 8. Scylla plans to use Raft for: ▪ Topology changes ▪ Schema changes 8
  • 9. Scylla plans to use Raft for: ▪ Schema changes ▪ Topology changes ▪ Tablets 9 0-99 100-199 200-299 Table Node A Tablet 1 0-99 Node B Tablet 2 100-199 Node C Tablet 3 200-299
  • 11. Topology changes in Scylla ▪ Safe when one change is done at a time ▪ Rely on 30+ second timeouts for consistency ▪ Allowed on a significantly degraded cluster (split brain) 11 30s 👋 💡 💡
  • 12. Topology changes using Raft ▪ Durable and linearizable ▪ Permit adding multiple nodes ▪ Permit background data rebalancing ▪ Require a majority of replicas alive to succeed 12
  • 14. Schema changes in Scylla ▪ Each node owns a copy of the schema ▪ Schema change is first made locally ▪ Then eventually pushed through the cluster ▪ Last-timestamp-wins rule is used for reconciliation 14 Node A: > CREATE TABLE e (a int); OK (hash: a81e, ts: 1609420790) Node B: > CREATE TABLE e (a int, b int); OK (hash: 2fa3, ts: 1609420792)
  • 15. INSERT/UPDATE when schemas differ ▪ Each data request carries a schema version ▪ Missing versions can be pulled from peers 15 Node A (a81e): > INSERT INTO e (a) VALUES (1); hash: 2fa3, row: (1, null) Node B (2fa3): > INSERT INTO e (a) VALUES (1); hash: 2fa3, row: (1, null)
  • 16. Schema changes using Raft ▪ Each node continues to store a copy of the schema ▪ A change is first persisted in a global Raft log ▪ On success, it’s applied on replicas ▪ Schema changes are now linearizable and consistent ▪ Nodes catch up with schema history during boot 16
  • 18. Token based partitioning ▪ Partition key is hashed to an integer (token) ▪ Nodes own ranges of tokens ▪ Provides even distribution of data and traffic ▪ Hotspots if partitions have many clustering rows 18 ck: pk - partition key, ck - clustering key pk: a b c .. gf .. t.. u 1 2 3 .. 21 3 11 1 token footprint:
  • 19. Tablet partitioning ▪ Tablet is a new kind of partition ▪ It stores a primary key range, not a single partition key ▪ Tablet ranges are subject to dynamic load balancing ▪ Size of each tablet is configurable (e.g. 64MB) 19
  • 20. Raft for Tablets ▪ Manageable number of Raft groups (~100,000) ▪ No client-side timestamps ▪ Provides isolation for ALL queries ▪ Writes do not require a read ▪ No need to repair ▪ Strong consistency of materialized views Strong consistency by default 20
  • 21. Raft in Scylla: summary ▪ Raft extended to efficiently support many groups ▪ Raft and Tablet partitioning = fast strong consistency ▪ Linearizable, more powerful schema and topology changes ▪ High Availability and partition tolerance of Cassandra are mostly unaffected 21
  • 23. Download Scylla Open Source: scylladb.com/download Talk to an expert: scylladb.com/consultation Take a test drive: scylladb.com/test-drive Experience Scylla for Yourself 23
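
The range-to-tablet mapping shown on slide 9 (keys 0-99 on Node A, 100-199 on Node B, 200-299 on Node C) can be illustrated with a small lookup. This is only a sketch under stated assumptions: the `TABLETS` list and the `find_tablet` helper are hypothetical names invented for illustration, not Scylla's tablet metadata or API, and the keys are plain integers as on the slide.

```python
from bisect import bisect_right

# Hypothetical tablet map mirroring slide 9: (first_key, last_key, owner).
# Sorted by first_key; ranges are inclusive and do not overlap.
TABLETS = [
    (0, 99, "Node A"),
    (100, 199, "Node B"),
    (200, 299, "Node C"),
]
STARTS = [first for first, _, _ in TABLETS]


def find_tablet(key):
    """Return (tablet_number, owner) for a key, or None if no tablet covers it."""
    # Index of the last tablet whose first_key is <= key.
    i = bisect_right(STARTS, key) - 1
    if i < 0:
        return None
    _, last, owner = TABLETS[i]
    if key > last:
        return None
    return i + 1, owner
```

Because a tablet owns a contiguous key range rather than a set of hashed tokens, a range scan only needs to visit the tablets whose ranges intersect the scan.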

Editor's notes

  1. Hi, This talk is about Raft in Scylla - our effort to improve a lot of existing Cassandra functionality and add new strongly consistent features.
  2. I’m Konstantin Osipov, I live in Moscow and work on open source databases. In Scylla I’ve been involved with implementation of lightweight transactions.
  3. Before discussing Raft, let’s recap the items we delivered recently. Back at Scylla Summit 2019 we announced support for Cassandra lightweight transactions. Lightweight transactions allow all clients to agree on the state of a database before making a change to it. Prior to that, Scylla lacked any strongly consistent features. We made a considerable effort testing LWT, and just recently completed industry-standard Jepsen testing for it.
  4. In Scylla, LWT are based on the Paxos consensus algorithm. Paxos is a leaderless protocol in which each participant stores little state, which was an advantage considering that, to be compatible with Cassandra, Scylla needed to allow each partition to be independently available. Paxos runs 3 rounds of network messages to commit each transaction. This is 1 round trip less than Cassandra, but is still more than necessary in the optimal case. An important property of LWT is that it works over existing tables and alongside eventually consistent operations. If LWT are not used, the overhead on the rest of the operations is zero. This is what we gained for the fairly high cost of the implementation. We mentioned at the 2019 Summit that Scylla is committed to providing an optimized implementation of strongly consistent reads and writes based on Raft. In this talk I will discuss our progress with Raft and what else we’re going to improve using it.
  5. So what is Raft anyway? It is a leader-based log replication protocol. A very crude explanation of what Raft does is that it elects a leader once, and then the leader is responsible for making all the decisions about the state of the database. This helps avoid extra communication between replicas during individual reads and writes. Each node keeps track of who the current leader is, and forwards requests to the leader. Scylla clients are unaffected, except that the leader now does some more work than the replicas, so the load distribution may be less even. This means Scylla will need to run many Raft instances side by side.
  6. Raft is built around the notion of a replicated log. When the leader receives a request, it first stores an entry for it in its log. Then it pushes the entry to the replicas’ copies of the log. Once the majority of replicas store the entry, the leader applies the entry and instructs the replicas to do the same. In the event of leader failure, a replica with the most up-to-date log becomes the leader.
  7. Raft defines not only how a group makes a decision, but also the protocol for deciding on new members of the group and removing group members. This lays a solid foundation for Scylla topology changes: they translate naturally to Raft configuration changes, assuming there is a Raft group for all of the nodes, and no longer need a proprietary protocol.
  8. Schema changes translate to simply storing a command in the global Raft log and then applying the change on each node which has a copy of the log.
  9. Because of the additional state (the current leader) stored at each peer, it’s not as straightforward to apply Raft to Scylla data manipulation statements. Maintaining a separate leader for each partition would be just too much overhead, considering individual partition updates may be rare. This is why Scylla, alongside Raft, works on a new partitioner, which would reduce the total number of partitions, while still keeping the number high enough to guarantee even distribution of data and work, and would allow balancing the data between partitions more flexibly. For each such partition, called a Tablet, Scylla will run its own instance of the Raft algorithm. In the rest of the talk I will discuss these 3 applications of Raft in more detail.
  10. Let’s begin with the subject of topology changes and discuss how Raft could be used to improve it.
  11. Presently, topology changes in Scylla are eventually consistent. Let’s use node addition as an example. A node wishing to join the cluster advertises itself to the rest of the members through Gossip. For those of you not familiar with the way Gossip works, it’s a great protocol for distributing some infrequently changing information at low cost. It’s very commonly used for failure detection: healthy clusters enjoy low network overhead induced by the failure detector, and the state of a faulty node spreads across the cluster reasonably quickly - a few to several seconds would be a typical interval. Knowing Gossip is not too fast, the joining node waits for (by default) 30s to let the news spread. Nodes begin forwarding relevant updates to the new node once they are aware of it. With updates coming in, the node can start data rebalancing. Node removal or decommission works similarly, except the node spreading the rumour (aka the change coordinator) is not necessarily the same node the rumour is about (just what we are used to in real life). This poses some challenges. The actions performed by the change coordinator are unilateral, and assume the operator avoids starting a conflicting change concurrently. The joining node will proceed after a 30s interval even if one of the nodes in the cluster is down and did not get the news about the new member. Such nodes, once they are back online, will continue serving queries using the old topology until Gossip messages reach them. A repair will then be necessary to restore the configured data replication factor. If a joining node dies mid-way, its added data ranges will remain in the cluster topology and the operator will need to clean them up manually before proceeding with the next change. Since the procedure relies on a fairly slow vehicle to spread the information, it’s hard to split into multiple steps.
When we at Scylla discuss how to add multiple nodes concurrently, we consider breaking a single topology change action into smaller, persistent and resumable steps, such as first adding an empty node, then assigning it some data ranges, then actually moving these ranges. Having to wait 30s for each step to settle in through Gossip is not very practical.
  12. Raft handles these challenges by including topology changes (called configuration changes there) in the protocol core. This part of the Raft protocol is also widely adopted and has undergone extensive scrutiny, so it should be naturally preferred to Scylla’s proprietary solution inherited from Cassandra. The way Raft treats topology changes is similar to the way it handles standard strongly consistent reads and writes. A topology change is done by appending two records to the distributed Raft log. The first record introduces the new topology to the cluster. After the first record is appended to the leader’s log, and until the log with this record is shipped to the majority of nodes, the cluster takes into account the new topology (e.g. a new node) in all writes, but doesn’t abandon the old topology yet - it’s also used for all reads and writes. Once the majority of replicas have received the information about the new topology, the leader adds the second record to the log. This informs replicas that it’s now safe to discard the old topology and fully switch to the new one. This two-step procedure ensures that no two parts of the cluster operate in two different configurations - worst case, some nodes may still be using the joint topology and the old one, or the joint topology and the new one, both of which are safe, but never only the old and only the new topology. With Raft, Scylla topology changes could be split into multiple steps: First, add the new node to the global Raft group configuration, using the procedure just described. Then, commit a record to token_metadata with the new node’s token. This will be linearizable with all topologies. Then, stream ranges to the added node, and update the state of each range as it is streamed. Since all the steps are linearized through the Raft log, it is now possible to permit concurrent topology changes, as long as they don’t conflict.
The only conceivable downside is that if the majority of the cluster nodes are down, it may not be possible to perform topology changes at all. Scylla will need to provide an emergency brake instrument to recover clusters so significantly degraded. One possible solution would be directly editing topology information on the remaining nodes, to let them continue in the state that remains.
  13. Schema changes are operations such as creating and dropping keyspaces, tables, user defined types or functions. If they are using Raft, they also benefit from linearizability.
  14. Currently, schema changes in Scylla are eventually consistent. Each Scylla node has its own copy of the schema. Requests to change the schema are validated against the local copy and then applied locally. A new data item may be added immediately afterwards, before any other cluster node knows about the change. There is no coordination between changes at different nodes, and any node is free to propose a change. The change is eventually propagated through the rest of the cluster. The last-timestamp-wins rule is used to resolve conflicts if two changes against the same object happened concurrently.
  15. Data manipulation is aware of possible schema inconsistency. Each request carries a schema version with it. Scylla is able to execute requests with divergent history, so it will fetch a particular schema version just to execute a request. This guarantees that schema changes are fully available in the presence of network failures. It has some downsides as well. It is possible to submit changes that conflict, e.g. define a table based on a UDT, and drop that UDT. New features, such as triggers, stored functions and UDFs, aggravate the consistency problem.
  16. After switching schema changes to Raft, any node would still be able to propose a change. However, the change will now be forwarded to the Raft leader, where it will be validated against the latest version of the schema. Then, the leader will persist it in a global Raft log, replicated to all nodes of the cluster. Once the majority of replicas confirm persisting their copy of the log, the change will be applied on all replicas. With this approach, all schema changes will form a linear history and divergent/conflicting changes will be impossible. It should open the way to complex but safe dependencies between schema objects, e.g. triggers, constraints or functional indexes. A replica which was down while the cluster was performing schema changes will catch up with them on boot, by streaming the entire history of changes from the leader. There is also a downside. It will no longer be possible to perform a schema change if the majority of the cluster is unreachable or down. It is still possible that a node gets a request for a schema it did not see yet, and will need to fetch the schema for it. For older schemas we will maintain a version history. For newer schemas, we will need to make sure that the history can be fetched from any node, not just the leader. https://docs.google.com/presentation/d/1ZazssA802_bUHcJKy7yPUbiVby8acFxbebf-VbmXRDk/edit#slide=id.ga3bc8bcbea_0_131
  17. Finally, the ultimate feature enabled by Raft is fast and efficient yet strongly consistent tables. Tablet is a term for a database unit of data distribution and load balancing, first introduced in the Google BigTable paper from 2006. Let’s see how tablets work.
  18. Today, Scylla’s partitioning strategy is not pluggable. Compare with the replication strategy: you can change how many replicas a keyspace has, and where these replicas are located. You can also use QUORUM/LOCAL_QUORUM and SERIAL/LOCAL_SERIAL to work efficiently in a cross-DC setup. The Scylla partitioner is not like that: all you can choose is what makes a partition key. The key is always hashed to a token, and the token is mapped to a replica set/shard. Thanks to hashing and the use of vnodes (tokens), the data is evenly distributed across the cluster. Most write and read scenarios produce even load on all nodes of the cluster. Hotspots, while possible, are unlikely. Unfortunately, one size still cannot fit all. Using the same partitioner for all tables can be rather a hindrance if there are a lot of small tables which are frequently scanned. Frequent range scans also require an extra step of merging streams produced by multiple nodes. Certain partitions tend to get hot no matter how good the choice of the partition key is. https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
  19. So in Scylla, we would like to make partitioning strategy a user choice, like the replication factor is today. If a user chooses tablet partitioning, Scylla will store small tables using few tablets. Large partitions (tablets) will be automatically split, and small tablets coalesced if necessary. Other databases that support range-based partitioners include MongoDB, Couchbase, Cockroach… https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
  20. Tables partitioned using tablets will work efficiently with Raft. When Raft is used, the change is stored in the log before it’s applied to the table, so no repair in the Cassandra sense is needed - we may still want to “repair” (i.e. sync up) the logs between replicas, but the base tables will stay consistent at all times. This addresses the problem of consistency of derived data, which has been open in Cassandra for a long time (many of you who track Cassandra development are familiar with materialized view consistency issues). https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
  21. Original Raft does not know about partitions, tokens or shards. It is an abstract algorithm describing replication of an abstract state machine. In Scylla, we have more than one state machine (schema information, topology information, and then each tablet and its replica set is an independent Raft instance), so we want to run many copies of the Raft algorithm simultaneously. This poses new challenges: How do we spawn new copies consistently? How much state will the algorithm take? Can we share the overhead of the algorithm, such as the cost of distributed failure detection, between Raft instances? Where do we store the Raft replication log? Could we avoid the overhead of double logging: Raft log and commit log? Could we make these decisions configurable, depending on the balance of performance and ease of use? We have already addressed many of these issues in Scylla Raft - a reusable library which supports joint consensus configuration changes, a pluggable state machine, logging and failure detection. We’re working on rebuilding Scylla schema on top of it. The first user-visible impact of the effort is expected in the upcoming year. Stay tuned.
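
The majority rule from note 6 and the joint consensus configuration change from note 12 both come down to a quorum check, which can be sketched as follows. This is a minimal illustration, not code from the Scylla Raft library: the function names `has_majority` and `is_committed` and the node names are hypothetical. It assumes the standard Raft rule that, while a joint configuration is in effect, agreement needs separate majorities of both the old and the new member sets.

```python
def has_majority(acks, members):
    """True if more than half of `members` have acknowledged the entry."""
    return len(acks & members) > len(members) // 2


def is_committed(acks, old_members, new_members=None):
    """Commit rule for a log entry (hypothetical helper).

    Outside a configuration change only `old_members` matters.
    During a joint configuration the entry needs a majority of the
    old members AND a majority of the new members.
    """
    if new_members is None:
        return has_majority(acks, old_members)
    return has_majority(acks, old_members) and has_majority(acks, new_members)


# Growing a three-node cluster to five nodes.
old = {"A", "B", "C"}
new = {"A", "B", "C", "D", "E"}
```

With these sets, acknowledgements from A and B commit an entry under the old configuration alone, but not during the joint phase, because two of five is not a majority of the new member set. Acknowledgements from C, D and E are likewise insufficient, since only one old member has agreed. This is why the old and the new configuration can never make decisions independently of each other.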