In Apache Cassandra Lunch #121: Migrating to Azure Managed Instance for Apache Cassandra, we discussed different methods for migrating data from existing Cassandra instances to Azure-hosted options.
Migrate On-Prem Cassandra to Azure Managed Instance Using Hybrid Cluster Replication or Offline Migration
1. Version 1.0
Migrating to Azure Managed Instance for Apache Cassandra
Hybrid cluster datacenter replication-based migrations vs. offline Cassandra Migrator migrations.
Obioma Anomnachi
Engineer @ Anant
2. Azure Managed Instance for Apache Cassandra
● Azure-hosted Cassandra instance
○ Managed service with Azure security
○ Automated repairs, updates, backups, and recovery
○ Automated scaling
○ Integrates with existing Cassandra tools
● Compatible with on-premises Cassandra clusters
● Compatible with Cosmos DB Cassandra API
3. Migration Methods
● Using Apache Cassandra native replication
○ Create a hybrid cluster between on-premises and Azure Managed Instance
○ Let Cassandra's native cross-datacenter replication move the data
● Using the Azure Cassandra Migrator to do an offline migration
○ Start with two separate clusters: on-premises and Azure Managed Instance
○ Start an external Spark cluster (Azure recommends Azure Databricks)
○ Create a Scala notebook to run the process
4. Hybrid Cluster Replication
● Method - Create a cluster on premises, extend it with a new datacenter on Azure, and let Cassandra's cross-datacenter replication move data onto the Azure datacenter
○ Requires node-to-node encryption to be enabled on the starting cluster; certs must be uploaded to Azure cloud storage
○ Uses Azure CLI commands to start the resource; cannot be done purely through the resource creation page
● Steps -
○ Create Virtual Network and configure Subnets
■ Add extra permissions needed by Azure Managed Instance for Apache Cassandra
○ Create and configure the resource for Azure Managed Instance
○ Get the gossip certs from the new Azure Managed Instance cluster and install them in the existing datacenter
○ Create a new datacenter
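Once the new datacenter has joined the cluster, cross-datacenter replication only covers a keyspace whose replication settings name that datacenter. The sketch below builds the kind of `ALTER KEYSPACE` statement typically used for this. The keyspace name, datacenter names, and replication factors are hypothetical placeholders, and data written before the change usually still has to be streamed to the new datacenter with a rebuild, which is not shown here.

```python
from typing import Dict


def alter_keyspace_cql(keyspace: str, dc_replication: Dict[str, int]) -> str:
    """Build an ALTER KEYSPACE statement that uses NetworkTopologyStrategy.

    dc_replication maps each datacenter name to its replication factor.
    """
    if not dc_replication:
        raise ValueError("at least one datacenter is required")
    # Render each datacenter as a 'name': rf pair inside the replication map.
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in dc_replication.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical names: "app_data", "onprem-dc", and "azure-dc" are assumptions.
statement = alter_keyspace_cql("app_data", {"onprem-dc": 3, "azure-dc": 3})
print(statement)
```

The statement would be run once per application keyspace, for example through cqlsh against the existing cluster.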
5. Azure Cassandra Migrator
● Method - Run a Spark job that copies data from an existing Cassandra instance to an Azure Cassandra instance
○ Requires a Spark cluster
○ Azure suggests using Azure Databricks and a Scala notebook
■ That isn't necessary; standalone Apache Spark and spark-submit also work
● Azure Cassandra Migrator is a modification of the Scylla Migrator code
○ Has readers and writers that are Cassandra-specific, compared to the several readers included in Scylla Migrator
○ Treats TTL and writetime slightly differently from Scylla Migrator; includes settings for specifying a minimum TTL
○ Has some of the same weaknesses as Scylla Migrator: issues with preserving writetime and TTL for collections (really a Cassandra issue: the information exists in SSTables but is not accessible via query)
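To make the collection limitation concrete, here is a minimal sketch of how a copy job might build its read query. It is not the migrator's actual code. It relies on the CQL `WRITETIME()` and `TTL()` functions, which can be selected for regular columns but, in many Cassandra versions, not for primary key columns or non-frozen collections. The keyspace, table, and column names are hypothetical.

```python
from typing import Dict, List

# CQL type prefixes for non-frozen collections (frozen<...> is one cell).
COLLECTION_PREFIXES = ("list<", "set<", "map<")


def build_copy_select(
    keyspace: str,
    table: str,
    columns: Dict[str, str],
    primary_key: List[str],
) -> str:
    """Build a SELECT that also reads writetime and TTL where CQL allows it.

    columns maps column name to CQL type, in the desired select order.
    """
    parts = []
    for name, cql_type in columns.items():
        parts.append(name)
        is_collection = cql_type.lower().startswith(COLLECTION_PREFIXES)
        if name in primary_key or is_collection:
            # No per-column metadata is queryable here, so the copied
            # value would get a fresh writetime and no TTL on the target.
            continue
        parts.append(f"WRITETIME({name}) AS {name}_writetime")
        parts.append(f"TTL({name}) AS {name}_ttl")
    return f"SELECT {', '.join(parts)} FROM {keyspace}.{table};"


# Hypothetical schema: shop.users with a set<text> collection column.
query = build_copy_select(
    "shop",
    "users",
    {"id": "uuid", "name": "text", "tags": "set<text>"},
    primary_key=["id"],
)
print(query)
```

In this example the `tags` column is read without writetime or TTL, which is exactly the metadata that cannot be carried over to the target cluster.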
6. Other Methods
● Kafka Connect
○ Load data from Cassandra into a Kafka topic, then load that data into Azure Managed Instance for Apache Cassandra using Kafka Connect and a Cassandra sink
● Dual-write proxy
○ Live migration method; does not cover historical data
○ Incoming application data is written to both the old and the new cluster, which defines a fixed time frame for historical data migration (potentially with time for validation)
● CDC
○ A tool that pulls a stream of deltas from the source database and pushes those changes to the target
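The dual-write idea can be sketched with in-memory stand-ins for the two clusters. Everything here is hypothetical (`InMemoryCluster`, `DualWriteProxy`, `backfill`); a real proxy would sit in front of Cassandra drivers. The point is only to show how the first dual write gives a cutoff timestamp, so the historical migration has to cover only rows written before it.

```python
from typing import Dict, Optional, Tuple


class InMemoryCluster:
    """Stand-in for a cluster: key -> (value, write timestamp in microseconds)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: Dict[str, Tuple[str, int]] = {}

    def write(self, key: str, value: str, timestamp_us: int) -> None:
        # Last write wins by timestamp, similar to Cassandra's conflict rule.
        current = self.rows.get(key)
        if current is None or timestamp_us >= current[1]:
            self.rows[key] = (value, timestamp_us)


class DualWriteProxy:
    """Forwards every application write to both the old and the new cluster."""

    def __init__(self, old: InMemoryCluster, new: InMemoryCluster) -> None:
        self.old = old
        self.new = new
        self.first_dual_write_us: Optional[int] = None

    def write(self, key: str, value: str, timestamp_us: int) -> None:
        if self.first_dual_write_us is None:
            # Everything from this timestamp on exists in both clusters.
            self.first_dual_write_us = timestamp_us
        self.old.write(key, value, timestamp_us)
        self.new.write(key, value, timestamp_us)


def backfill(old: InMemoryCluster, new: InMemoryCluster, cutoff_us: int) -> int:
    """Copy rows written before the cutoff; return how many were copied."""
    copied = 0
    for key, (value, timestamp_us) in old.rows.items():
        if timestamp_us < cutoff_us:
            new.write(key, value, timestamp_us)
            copied += 1
    return copied


old_cluster = InMemoryCluster("on-prem")
new_cluster = InMemoryCluster("azure")
old_cluster.write("a", "historical", 100)   # written before the proxy existed

proxy = DualWriteProxy(old_cluster, new_cluster)
proxy.write("b", "live", 200)               # lands in both clusters

copied = backfill(old_cluster, new_cluster, proxy.first_dual_write_us)
print(copied, new_cluster.rows == old_cluster.rows)
```

Because the backfill preserves the original timestamps, a historical row can never overwrite a newer live write on the target.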