SlideShare una empresa de Scribd logo
1 de 47
Descargar para leer sin conexión
Spark, File Transfer,
and More
Strategies for Migrating Data to and from
a Cassandra or Scylla Cluster
WEBINAR
Presenter
2
Dan Yasny
Before ScyllaDB, Dan worked in various roles such as
sysadmin, quality engineering, product management,
tech support, integration and DevOps around mainly
open source technology, at companies such as Red Hat and
Dell.
3
+ The Real-Time Big Data Database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ New: Scylla Cloud, DBaaS
+ Open source and enterprise editions
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
About ScyllaDB
Migration overview
+ Online and Offline migrations
+ Common steps: Schema, Existing data
Online migrations
+ Dual Writes
+ Kafka
+ Custom application changes
Existing data migration
+ Highly volatile/TTL’d data
+ CQL COPY
+ SSTableloader
+ Mirror loader (Danloader)
+ Spark Migrator
Agenda
Migration Overview
+ Online migration
+ Added complexity
+ Added load
+ Offline migration
+ Downtime
+ Common steps
+ Schema migration and adjustments
+ Existing data migration
+ Data validation
SSTable
Loader
SSTables CQL
CQL CQL
Client
CQLCQL
Client
Migrate Schema
+ Use DESCRIBE to export each Cassandra Keyspace, Table, UDT
+ cqlsh [IP] "-e DESC SCHEMA" > orig_schema.cql
+ cqlsh [IP] --file 'adjusted_schema.cql'
+ Schema updates when migrating from Cassandra 3.x
+ Different syntax for some schema properties might be needed
https://docs.scylladb.com/operating-scylla/procedures/cassandra_to_scylla_migration_process/#notes-limitations-and-kn
own-issues
Online Migration
Write to DB-OLD
Cold Migration - No Dual Writes
Time
Read from DB-OLD
Write to DB-OLD
Migrate Schema
Cold Migration - No Dual Writes
Time
Read from DB-OLD
Write to DB-OLD
Forklifting Existing Data
Migrate Schema
DBs in Sync
Cold Migration - No Dual Writes
Time
Read from DB-OLD
Write to DB-OLD
Forklifting Existing Data
Validation*
Migrate Schema
DBs in Sync
Cold Migration - No Dual Writes
Time
Fade off DB-OLD
Read from DB-OLD
Write to DB-OLD
Forklifting Existing Data
Validation*
Migrate Schema
DBs in Sync
Cold Migration - No Dual Writes
Time
Read from DB-NEW
Fade off DB-OLD
Read from DB-OLD
Write to DB-NEW
Write to DB-OLD
Dual Writes
Time
Read from DB-OLD
Write to DB-OLD
Migrate
Schema
Dual Writes
Time
Read from DB-OLD
Write to DB-OLD
Dual Writes
Migrate
Schema
Dual Writes
Time
Read from DB-OLD
Write to DB-OLD
Dual Writes
Migrate
Schema
Dual Writes
Time
Read from DB-OLD
Write to DB-NEW
Write to DB-OLD
Dual Writes
Forklifting Existing Data
Migrate
Schema
Dual Writes
Time
Read from DB-OLD
Write to DB-NEW
Write to DB-OLD
Dual Writes
Forklifting Existing Data
Migrate
Schema
DBs in Sync
Dual Writes
Time
Read from DB-OLD
Write to DB-NEW
Dual Reads
Write to DB-OLD
Dual Writes
Forklifting Existing Data
Validation
Migrate
Schema
DBs in Sync
Dual Writes
Time
Read from DB-OLD
Write to DB-NEW
Dual Reads
Write to DB-OLD
Dual Writes
Forklifting Existing Data
Validation
Migrate
Schema
DBs in Sync
Dual Writes
Time
Read from DB-NEW
Read from DB-OLD
Write to DB-NEW
Dual Reads
Write to DB-OLD
Dual Writes
Forklifting Existing Data
Validation
Migrate
Schema
DBs in Sync
Dual Writes
Time
Read from DB-NEW
Fade off DB-OLD
Read from DB-OLD
Write to DB-NEW
Apache Kafka based dual writes
Already using Kafka
+ Start buffering the writes to new cluster
+ Increase the TTL and add buffer time while migration is in progress/about to start
+ Keep sending writes to old cluster
+ Once existing data migration is complete, stream the data to new cluster
+ Use the Kafka-connect framework with the Cassandra connector
+ Use KSQL if you want to run a couple of transformations on the data (EXPERIMENTAL)
Dual Writes: App changes code sample
writes = []
writes.append(db1.execute_async(insert_statement_prepared[0], values))
writes.append(db2.execute_async(insert_statement_prepared[1], values))
results = []
for i in range(0,len(writes)):
try:
row = writes[i].result()
results.append(1)
except Exception:
results.append(0)
# Handle it
if (results[0]==0):
log('Write to cluster 1 failed')
if (results[1]==0):
log('Write to cluster 2 failed')
Existing Data Migration
Existing data migration strategies
+ CQL COPY
+ SSTableloader
+ Mirror loader
+ Spark Migrator
When performing an online migration, always use a strategy
that preserves timestamps, unless all keys are unique
Databases: Under the hood
+ Native CQL to native CQL
+ Scylla Spark Migrator
+ SSTable files to CQL
+ SSTableloader
+ SSTable files to SStable files
+ Mirror loader
+ Arbitrary data files to CQL
+ COPY
Highly volatile data with low TTL
+ Establish dual writes
+ Keep running until last record in the old DB is expired
+ Turn off dual writes
+ Phase out old DB
SQL
NoSQL
CQL COPY
CQL COPY: Database Compatibility
+ Scylla and Cassandra: Common file format
+ Schema Migration vs Schema Redesign
SQL
NoSQL
COPY continued
Some knobs:
+ HEADER
+ CHUNKSIZE
+ DATETIMEFORMAT
+ DELIMITER
+ INGESTRATE
+ MAXBATCHSIZE
+ MAXROWS
+ NULL
+ PREPAREDSTATEMENTS
+ TTL
+ File size matters
+ Skipping unwanted columns
+ Formatting
+ NULL
+ DATETIME
+ Quotes
+ Escapes
+ Delimiters
+ Best practices
+ Small files
+ Separate host
COPY and CSV files
Pros
+ Simplicity
+ CSV transparency
+ Easy to validate
+ Destination schema can have less
columns than the original
+ Can be tweaked, plenty of language
support
+ Can be used for any data ingestion, not
necessarily from Scylla/Cassandra
(incompatible DBs)
+ Compatible with Cassandra, Scylla and
Scylla Cloud
Cons
+ Not for large data sizes
+ Timestamps not preserved - be
careful with online migrations
SSTableloader
+ Create snapshot of each Cassandra node
+ Run sstableloader from each Cassandra node, or from intermediate servers
+ Use throttling -t to limit the leader throughput if needed
+ Run several sstableloaders in parallel
Both Cassandra and Scylla ship with an sstableloader utility.
While they are similar, there are differences between the two:
+ Use Scylla’s sstableloader to migrate to Scylla
SSTableloader
Scylla’s
SSTable
Loader
SSTables CQL
SQL
NoSQL
SStableloader continued
+ No Schema update during forklifting
+ Scylla’s sstableloader has support for simple column renames
+ Assuming RF=3, you end up with 9 copies of the data until compaction happens
Failure handling:
+ What should I do if sstableloader fails?
+ What should I do if a source node fails?
+ What should I do if a destination node fails?
+ How to rollback and start from scratch?
https://docs.scylladb.com/operating-scylla/procedures/cassandra_to_scylla_migration_process
Mirror Loader
Mirror Loader
Preparation:
+ Temporarily mirror the original DB
+ Establish node pairs
+ Clone the token sets between node
pairs
+ nodetool ring
+ Every node has token assignment, which
has to match the data on disk
+ We want to bring up a similar Scylla
setup, using the same token ranges
+ Then let Scylla’s internals do the work
Source DB Temp DB
File copy
SQL
NoSQL
Mirror Loader
+ Start the mirrored Scylla cluster
+ Bring up destination Scylla nodes in a
new DC in the same cluster, using the
temp cluster as seed
+ Start synchronization (repair or rebuild)
+ Wait for synchronization to complete
+ Phase out the Old and Temporary
clusters
Native sync
Source
cluster
Destination
Cluster
temporary DC1
Destination cluster
DC2
Mirror Loader
Pros
+ Very simple approach
+ Direct file copy, no data digesting
+ Very low effect on source DB load
+ Native sync to destination DB
+ Very fast - files are copied and the
temporary DB is ready to be used
immediately.
Cons
+ Supported file formats
+ ka, la, mc
+ Temporary machines required
+ Not usable with Scylla cloud
+ Schema adjustments might be needed
Scylla Spark Migrator
Scylla Spark migrator
+ Highly resilient to failures, and will retry reads and
writes throughout the job
+ Continuously writes savepoint files which can be used
to resume the transfer from the point at which it stopped
+ Access compatible Databases using a native connector
+ High performance parallelized reads and writes
+ Unlimited streaming power
+ Reduce data transfer costs
+ Can be configured to preserve the WRITETIME and TTL attributes of the fields
that are copied
+ Can handle column renames as part of the transfer
SQL
NoSQL
Scylla Spark Migrator
A very simple and easy to use tool
+ Install the standard Spark stack (Java JRE and JDK, SBT)
+ Edit configuration file
+ Run
Links:
+ https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-
migrator/
+ https://github.com/scylladb/scylla-migrator/
Scylla Spark Migrator - configuration
source:
host: 10.0.0.110
port: 9042
credentials:
username: <user>
password: <pass>
keyspace: keyspace1
table: standard1
splitCount: 256
connections: 4
fetchSize: 1000
target:
...
preserveTimestamps: true
savepoints:
path: /tmp/savepoints
intervalSeconds: 60
renames: []
skipTokenRanges: []
STDOUT:
2019-03-25 20:30:04 INFO migrator:405 -
Created a savepoint config at
/tmp/savepoints/savepoint_1553545804.yaml
due to schedule. Ranges added:
Set((49660176753484882,50517483720003777),
(1176795029308651734,1264410883115973030),
(-246387809075769230,-238284145977950153),
(-735372055852897323,-726956712682417148),
(6875465462741850487,6973045836764204908),
(-467003452415310709,-4589291437737669003)
...
Scylla Spark
migrator
Takeaways: Online Migration
This webinar covered ways to do online migration (no downtime) to and from
Apache Cassandra, Scylla, and Scylla Cloud. This requires the writes to be sent to
both clusters
+ Have an existing Kafka infrastructure: replay kafka streams
+ For all other scenarios: small changes to the application to do dual writes
Takeaways: Existing Data Migration
+ If Source database is SQL, MongoDB, etc:
+ Use COPY command
+ If you have control of the topology and can afford temporary extra nodes:
+ Use Mirror Loader
+ If you have access to the original SSTable files:
+ Use SSTableloader
+ Want a fully flexible streaming solution and can afford the extra load in the
source:
+ Use the Scylla Spark Loader
United States
1900 Embarcadero Road
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank you
Q&A
Stay in touch
dyasny@scylladb.com
@dyasny

Más contenido relacionado

Más de ScyllaDB

Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 
Top NoSQL Data Modeling Mistakes
Top NoSQL Data Modeling MistakesTop NoSQL Data Modeling Mistakes
Top NoSQL Data Modeling MistakesScyllaDB
 
NoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & PrinciplesNoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & PrinciplesScyllaDB
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversScyllaDB
 
Overcoming Media Streaming Challenges with NoSQL
Overcoming Media Streaming Challenges with NoSQLOvercoming Media Streaming Challenges with NoSQL
Overcoming Media Streaming Challenges with NoSQLScyllaDB
 
How Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdfHow Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdfScyllaDB
 
How Development Teams Cut Costs with ScyllaDB.pdf
How Development Teams Cut Costs with ScyllaDB.pdfHow Development Teams Cut Costs with ScyllaDB.pdf
How Development Teams Cut Costs with ScyllaDB.pdfScyllaDB
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineScyllaDB
 
NoSQL at Scale: Proven Practices & Pitfalls
NoSQL at Scale: Proven Practices & PitfallsNoSQL at Scale: Proven Practices & Pitfalls
NoSQL at Scale: Proven Practices & PitfallsScyllaDB
 

Más de ScyllaDB (20)

Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 
Top NoSQL Data Modeling Mistakes
Top NoSQL Data Modeling MistakesTop NoSQL Data Modeling Mistakes
Top NoSQL Data Modeling Mistakes
 
NoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & PrinciplesNoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & Principles
 
Optimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database DriversOptimizing Performance in Rust for Low-Latency Database Drivers
Optimizing Performance in Rust for Low-Latency Database Drivers
 
Overcoming Media Streaming Challenges with NoSQL
Overcoming Media Streaming Challenges with NoSQLOvercoming Media Streaming Challenges with NoSQL
Overcoming Media Streaming Challenges with NoSQL
 
How Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdfHow Optimizely (Safely) Maximizes Database Concurrency.pdf
How Optimizely (Safely) Maximizes Database Concurrency.pdf
 
How Development Teams Cut Costs with ScyllaDB.pdf
How Development Teams Cut Costs with ScyllaDB.pdfHow Development Teams Cut Costs with ScyllaDB.pdf
How Development Teams Cut Costs with ScyllaDB.pdf
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
 
NoSQL at Scale: Proven Practices & Pitfalls
NoSQL at Scale: Proven Practices & PitfallsNoSQL at Scale: Proven Practices & Pitfalls
NoSQL at Scale: Proven Practices & Pitfalls
 

Último

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Último (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Spark, File Transfer, and More: Strategies for Migrating Data to and from a Cassandra or Scylla Cluster

  • 1. Spark, File Transfer, and More Strategies for Migrating Data to and from a Cassandra or Scylla Cluster WEBINAR
  • 2. Presenter 2 Dan Yasny Before ScyllaDB, Dan worked in various roles such as sysadmin, quality engineering, product management, tech support, integration and DevOps around mainly open source technology, at companies such as Red Hat and Dell.
  • 3. 3 + The Real-Time Big Data Database + Drop-in replacement for Cassandra + 10X the performance & low tail latency + New: Scylla Cloud, DBaaS + Open source and enterprise editions + Founded by the creators of KVM hypervisor + HQs: Palo Alto, CA; Herzelia, Israel About ScyllaDB
  • 4. Migration overview + Online and Offline migrations + Common steps: Schema, Existing data Online migrations + Dual Writes + Kafka + Custom application changes Existing data migration + Highly volatile/TTL’d data + CQL COPY + SSTableloader + Mirror loader (Danloader) + Spark Migrator Agenda
  • 5. Migration Overview + Online migration + Added complexity + Added load + Offline migration + Downtime + Common steps + Schema migration and adjustments + Existing data migration + Data validation SSTable Loader SSTables CQL CQL CQL Client CQLCQL Client
  • 6. Migrate Schema + Use DESCRIBE to export each Cassandra Keyspace, Table, UDT + cqlsh [IP] "-e DESC SCHEMA" > orig_schema.cql + cqlsh [IP] --file 'adjusted_schema.cql' + Schema updates when migrating from Cassandra 3.x + Different syntax for some schema properties might be needed https://docs.scylladb.com/operating-scylla/procedures/cassandra_to_scylla_migration_process/#notes-limitations-and-kn own-issues
  • 8. Write to DB-OLD Cold Migration - No Dual Writes Time Read from DB-OLD
  • 9. Write to DB-OLD Migrate Schema Cold Migration - No Dual Writes Time Read from DB-OLD
  • 10. Write to DB-OLD Forklifting Existing Data Migrate Schema DBs in Sync Cold Migration - No Dual Writes Time Read from DB-OLD
  • 11. Write to DB-OLD Forklifting Existing Data Validation* Migrate Schema DBs in Sync Cold Migration - No Dual Writes Time Fade off DB-OLD Read from DB-OLD
  • 12. Write to DB-OLD Forklifting Existing Data Validation* Migrate Schema DBs in Sync Cold Migration - No Dual Writes Time Read from DB-NEW Fade off DB-OLD Read from DB-OLD Write to DB-NEW
  • 13. Write to DB-OLD Dual Writes Time Read from DB-OLD
  • 14. Write to DB-OLD Migrate Schema Dual Writes Time Read from DB-OLD
  • 15. Write to DB-OLD Dual Writes Migrate Schema Dual Writes Time Read from DB-OLD
  • 16. Write to DB-OLD Dual Writes Migrate Schema Dual Writes Time Read from DB-OLD Write to DB-NEW
  • 17. Write to DB-OLD Dual Writes Forklifting Existing Data Migrate Schema Dual Writes Time Read from DB-OLD Write to DB-NEW
  • 18. Write to DB-OLD Dual Writes Forklifting Existing Data Migrate Schema DBs in Sync Dual Writes Time Read from DB-OLD Write to DB-NEW
  • 19. Dual Reads Write to DB-OLD Dual Writes Forklifting Existing Data Validation Migrate Schema DBs in Sync Dual Writes Time Read from DB-OLD Write to DB-NEW
  • 20. Dual Reads Write to DB-OLD Dual Writes Forklifting Existing Data Validation Migrate Schema DBs in Sync Dual Writes Time Read from DB-NEW Read from DB-OLD Write to DB-NEW
  • 21. Dual Reads Write to DB-OLD Dual Writes Forklifting Existing Data Validation Migrate Schema DBs in Sync Dual Writes Time Read from DB-NEW Fade off DB-OLD Read from DB-OLD Write to DB-NEW
  • 22. Apache Kafka based dual writes Already using Kafka + Start buffering the writes to new cluster + Increase the TTL and add buffer time while migration is in progress/about to start + Keep sending writes to old cluster + Once existing data migration is complete, stream the data to new cluster + Use the Kafka-connect framework with the Cassandra connector + Use KSQL if you want to run a couple of transformations on the data (EXPERIMENTAL)
  • 23. Dual Writes: App changes code sample writes = [] writes.append(db1.execute_async(insert_statement_prepared[0], values)) writes.append(db2.execute_async(insert_statement_prepared[1], values)) results = [] for i in range(0,len(writes)): try: row = writes[i].result() results.append(1) except Exception: results.append(0) # Handle it if (results[0]==0): log('Write to cluster 1 failed') if (results[1]==0): log('Write to cluster 2 failed')
  • 25. Existing data migration strategies + CQL COPY + SSTableloader + Mirror loader + Spark Migrator When performing an online migration, always use a strategy that preserves timestamps, unless all keys are unique
  • 26. Databases: Under the hood + Native CQL to native CQL + Scylla Spark Migrator + SSTable files to CQL + SSTableloader + SSTable files to SStable files + Mirror loader + Arbitrary data files to CQL + COPY
  • 27. Highly volatile data with low TTL + Establish dual writes + Keep running until last record in the old DB is expired + Turn off dual writes + Phase out old DB SQL NoSQL
  • 29. CQL COPY: Database Compatibility + Scylla and Cassandra: Common file format + Schema Migration vs Schema Redesign SQL NoSQL
  • 30. COPY continued Some knobs: + HEADER + CHUNKSIZE + DATETIMEFORMAT + DELIMITER + INGESTRATE + MAXBATCHSIZE + MAXROWS + NULL + PREPAREDSTATEMENTS + TTL + File size matters + Skipping unwanted columns + Formatting + NULL + DATETIME + Quotes + Escapes + Delimiters + Best practices + Small files + Separate host
  • 31. COPY and CSV files Pros + Simplicity + CSV transparency + Easy to validate + Destination schema can have less columns than the original + Can be tweaked, plenty of language support + Can be used for any data ingestion, not necessarily from Scylla/Cassandra (incompatible DBs) + Compatible with Cassandra, Scylla and Scylla Cloud Cons + Not for large data sizes + Timestamps not preserved - be careful with online migrations
  • 33. + Create snapshot of each Cassandra node + Run sstableloader from each Cassandra node, or from intermediate servers + Use throttling -t to limit the leader throughput if needed + Run several sstableloaders in parallel Both Cassandra and Scylla ship with an sstableloader utility. While they are similar, there are differences between the two: + Use Scylla’s sstableloader to migrate to Scylla SSTableloader Scylla’s SSTable Loader SSTables CQL SQL NoSQL
  • 34. SStableloader continued + No Schema update during forklifting + Scylla’s sstableloader has support for simple column renames + Assuming RF=3, you end up with 9 copies of the data until compaction happens Failure handling: + What should I do if sstableloader fails? + What should I do if a source node fails? + What should I do if a destination node fails? + How to rollback and start from scratch? https://docs.scylladb.com/operating-scylla/procedures/cassandra_to_scylla_migration_process
  • 36. Mirror Loader Preparation: + Temporarily mirror the original DB + Establish node pairs + Clone the token sets between node pairs + nodetool ring + Every node has token assignment, which has to match the data on disk + We want to bring up a similar Scylla setup, using the same token ranges + Then let Scylla’s internals do the work Source DB Temp DB File copy SQL NoSQL
  • 37. Mirror Loader + Start the mirrored Scylla cluster + Bring up destination Scylla nodes in a new DC in the same cluster, using the temp cluster as seed + Start synchronization (repair or rebuild) + Wait for synchronization to complete + Phase out the Old and Temporary clusters Native sync Source cluster Destination Cluster temporary DC1 Destination cluster DC2
  • 38. Mirror Loader Pros + Very simple approach + Direct file copy, no data digesting + Very low effect on source DB load + Native sync to destination DB + Very fast - files are copied and the temporary DB is ready to be used immediately. Cons + Supported file formats + ka, la, mc + Temporary machines required + Not usable with Scylla cloud + Schema adjustments might be needed
  • 40. Scylla Spark migrator + Highly resilient to failures, and will retry reads and writes throughout the job + Continuously writes savepoint files which can be used to resume the transfer from the point at which it stopped + Access compatible Databases using a native connector + High performance parallelized reads and writes + Unlimited streaming power + Reduce data transfer costs + Can be configured to preserve the WRITETIME and TTL attributes of the fields that are copied + Can handle column renames as part of the transfer SQL NoSQL
  • 41. Scylla Spark Migrator A very simple and easy to use tool + Install the standard Spark stack (Java JRE and JDK, SBT) + Edit configuration file + Run Links: + https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla- migrator/ + https://github.com/scylladb/scylla-migrator/
  • 42. Scylla Spark Migrator - configuration source: host: 10.0.0.110 port: 9042 credentials: username: <user> password: <pass> keyspace: keyspace1 table: standard1 splitCount: 256 connections: 4 fetchSize: 1000 target: ... preserveTimestamps: true savepoints: path: /tmp/savepoints intervalSeconds: 60 renames: [] skipTokenRanges: []
  • 43. STDOUT: 2019-03-25 20:30:04 INFO migrator:405 - Created a savepoint config at /tmp/savepoints/savepoint_1553545804.yaml due to schedule. Ranges added: Set((49660176753484882,50517483720003777), (1176795029308651734,1264410883115973030), (-246387809075769230,-238284145977950153), (-735372055852897323,-726956712682417148), (6875465462741850487,6973045836764204908), (-467003452415310709,-4589291437737669003) ... Scylla Spark migrator
  • 44. Takeaways: Online Migration This webinar covered ways to do online migration (no downtime) to and from Apache Cassandra, Scylla, and Scylla Cloud. This requires the writes to be sent to both clusters + Have an existing Kafka infrastructure: replay kafka streams + For all other scenarios: small changes to the application to do dual writes
  • 45. Takeaways: Existing Data Migration + If Source database is SQL, MongoDB, etc: + Use COPY command + If you have control of the topology and can afford temporary extra nodes: + Use Mirror Loader + If you have access to the original SSTable files: + Use SSTableloader + Want a fully flexible streaming solution and can afford the extra load in the source: + Use the Scylla Spark Loader
  • 46. United States 1900 Embarcadero Road Palo Alto, CA 94303 Israel 11 Galgalei Haplada Herzelia, Israel www.scylladb.com @scylladb Thank you