Spark can be used to perform maintenance operations on Cassandra data. There are three basic patterns for interacting with Cassandra from Spark: read-transform-write (1:1), read-transform-write (1:m), and read-filter-delete (m:1). Deletes are tricky in Cassandra: you must either select the records to delete and issue deletes through the driver, or select the records to keep, rewrite them, and then delete the partitions they lived in. The deck walks through examples of using Spark for cache maintenance, trimming user history, publishing data, and multitenant backup and recovery.
13. DELETES ARE TRICKY
• Keep tombstones in mind
• Select the records you want to delete, then loop over those and issue deletes through the driver
• OR select the records you want to keep, rewrite them, then delete the partitions they lived in… IN THE PAST…
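The second pattern above hinges on splitting one partition's rows into a keep set and a delete set. A minimal sketch of that split as a pure function (`Row`, `cutoff`, and `splitForDelete` are hypothetical names; in a real job the rows would come from `sc.cassandraTable`, the keep set would be rewritten, and the deletes issued through the driver):

```scala
// Hypothetical row shape; stands in for a row read via the
// Spark Cassandra Connector.
case class Row(userid: String, lastAccess: Long)

// Split one partition's rows: everything at or after `cutoff` is kept
// (rewritten), everything older becomes a delete candidate.
def splitForDelete(rows: Seq[Row], cutoff: Long): (Seq[Row], Seq[Row]) =
  rows.partition(_.lastAccess >= cutoff) // (keep, delete)
```

The same function works for either pattern: iterate the delete set and issue deletes, or rewrite the keep set and drop the old partition.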
17. TIPS & TRICKS
• .spanBy( partition key ) - work on one Cassandra partition at a time
• .repartitionByCassandraReplica() - move each row to the node that owns its replica before writing
• Tune spark.cassandra.output.throughput_mb_per_sec to throttle writes
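The throttle can be set when submitting the job; a sketch of a spark-submit invocation (the job class and jar name are hypothetical, and 5 MB/s is just an illustrative value to tune for your cluster):

```shell
# spark.cassandra.output.throughput_mb_per_sec is the connector setting
# named on the slide; it caps write throughput per core.
spark-submit \
  --conf spark.cassandra.output.throughput_mb_per_sec=5 \
  your-maintenance-job.jar
```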
19. USE CASE: TRIM USER HISTORY
• Cassandra Data Model: PRIMARY KEY( userid, last_access )
• Keep last X records
• .spanBy( partitionKey ), then flatMap a filter over each partition's Seq
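The per-partition trimming step can be sketched as plain Scala (`Access`, `trimHistory`, and the choice of returning the delete candidates are assumptions; in the real job this logic would run inside the flatMap over each .spanBy group):

```scala
// Hypothetical row shape matching PRIMARY KEY( userid, last_access ).
case class Access(userid: String, lastAccess: Long)

// Return the rows that fall outside the newest `keep` records for one
// user's partition; these are the delete candidates the flatMap emits.
def trimHistory(partition: Seq[Access], keep: Int): Seq[Access] =
  partition.sortBy(r => -r.lastAccess).drop(keep)
```

Because .spanBy hands the job one Cassandra partition at a time, the sort only ever touches a single user's history, not the whole table.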
20. USE CASE: PUBLISH DATA
• Cassandra Data Model: publish_date field
• filter by date, map to new RDD matching destination, saveToCassandra()
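The filter-and-map step can be sketched as a pure function (`Staged`, `Published`, and `toPublish` are hypothetical names; in the real job the input would be a cassandraTable RDD and the result would go through saveToCassandra()):

```scala
// Hypothetical source row with the publish_date field from the slide,
// and a hypothetical destination-table shape.
case class Staged(id: String, publishDate: Long, payload: String)
case class Published(id: String, payload: String)

// Keep rows whose publish_date has arrived and reshape them to match
// the destination table.
def toPublish(rows: Seq[Staged], now: Long): Seq[Published] =
  rows.filter(_.publishDate <= now).map(r => Published(r.id, r.payload))
```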
21. USE CASE: MULTITENANT BACKUP AND RECOVERY
• Cassandra Data Model: PRIMARY KEY((tenant_id, other_partition_key), other_cluster, …)
• Backup: filter for tenant_id and .foreach() write to external location.
• Recovery: read backup and upsert.
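The backup-side selection can be sketched as a pure function (`TenantRow` and `backupTenant` are hypothetical names; in the real job the filter runs on the cassandraTable RDD and each surviving row is written to external storage inside .foreach(), while recovery simply reads those rows back and upserts them with saveToCassandra()):

```scala
// Hypothetical row shape with tenant_id leading the partition key.
case class TenantRow(tenantId: String, key: String, value: String)

// Backup step: select only the rows belonging to one tenant.
def backupTenant(rows: Seq[TenantRow], tenant: String): Seq[TenantRow] =
  rows.filter(_.tenantId == tenant)
```

Because tenant_id is the first component of the composite partition key, this filter can be pushed down so only that tenant's partitions are scanned.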