Cassandra Operations at Netflix

Gregg Ulrich


Agenda
 Who we are
 How much we use Cassandra
 How we do it
 What we learned


Who we are
 Cloud Database Engineering
 Development – Cassandra and related tools
 Architecture – data modeling and sizing
 Operations – availability, performance and maintenance
 Operations
 24x7 on-call support for all Cassandra clusters
 Cassandra operations tools
 Proactive problem hunting
 Routine and non-routine maintenance


How much we use Cassandra

30 Number of production clusters
12 Number of multi-region clusters
3 Max regions, one cluster
65 Total TB of data across all clusters
472 Number of Cassandra nodes
72/28 Largest Cassandra cluster (nodes/data in TB)
50k/250k Max reads/writes per second on a single cluster
3* Size of Operations team

* Open position for an additional engineer

I read that Netflix doesn’t have operations
 Extension of Amazon’s PaaS
 Decentralized Cassandra ops is expensive at scale
 Immature product that changes rapidly (and drastically)
 Easily apply best practices across all clusters


How we configure Cassandra in AWS
 Most services get their own Cassandra cluster
 Mostly m2.4xlarge instances, but considering others
 Cassandra and supporting tools baked into the AMI
 Data stored on ephemeral drives
 Data durability – all writes to all availability zones
 Alternate AZs in a replication set
 RF = 3
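
A minimal sketch of why alternating AZs around the ring matters with RF = 3. It assumes a simplified placement rule (each row is stored on the primary node plus the next two nodes clockwise); real placement depends on the replication strategy and snitch in use. The function names and node names are hypothetical.

```python
def build_ring(azs, nodes_per_az):
    """Return the ring order as (node_name, az) pairs, alternating AZs."""
    ring = []
    for slot in range(nodes_per_az):
        for az in azs:
            ring.append((f"{az}-node{slot}", az))
    return ring


def replicas_for(ring, primary_index, rf=3):
    """Simplified placement: the primary plus the next rf - 1 nodes clockwise."""
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]


# Example: 3 AZs with 2 nodes each (illustrative sizes only).
ring = build_ring(["az1", "az2", "az3"], nodes_per_az=2)

# With alternating AZs, every replica set spans all three zones,
# so losing one zone still leaves two copies of every row.
for i in range(len(ring)):
    zones = {az for _, az in replicas_for(ring, i)}
    assert len(zones) == 3
```

If the ring were instead ordered by zone (both az1 nodes, then both az2 nodes, and so on), some replica sets under this simplified rule would place two copies in the same zone.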


Minimum cluster configuration
 Minimum production cluster configuration – 6 nodes
 3 auto-scaling groups
 2 instances per auto-scaling group
 1 availability zone per auto-scaling group


Minimum cluster configuration, illustrated

[Diagram: three auto-scaling groups, one per availability zone (ASG1 in AZ1, ASG2 in AZ2, ASG3 in AZ3), labeled RF=3 and PRIAM]


Tools we use
 Administration
 Priam
 Jenkins
 Monitoring and alerting
 Cassandra Explorer
 Dashboards
 Epic


Tools we use – Priam
 Open-sourced Tomcat webapp running on each instance
 Multi-region token management via SimpleDB
 Node replacement and ring expansion
 Backup and restore
 Full nightly snapshot backup to S3
 Incremental backup of flushed SSTables to S3 every 30 seconds
 Metrics collected via JMX
 REST API to most nodetool functions
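
To make the token management bullet concrete, here is a rough sketch of evenly spaced token assignment. It is not Priam's actual algorithm. It assumes the RandomPartitioner token space (0 to 2**127 - 1), and the `region_offset` parameter is a hypothetical way to keep tokens in different regions from colliding.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner tokens fall in 0 .. 2**127 - 1


def evenly_spaced_tokens(node_count, region_offset=0):
    """Return one token per node, spaced evenly around the ring.

    region_offset is a hypothetical per-region shift so that two regions
    with the same node count do not produce identical tokens.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [
        (i * RING_SIZE // node_count + region_offset) % RING_SIZE
        for i in range(node_count)
    ]


# Illustrative values: six nodes per region, second region shifted by 100.
region_a = evenly_spaced_tokens(6)
region_b = evenly_spaced_tokens(6, region_offset=100)
assert set(region_a).isdisjoint(region_b)
```

Doubling the node count keeps every existing token in place and adds new tokens at the midpoints, which is one reason even spacing makes ring expansion simpler to reason about.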

Tools we use – Cassandra Explorer
• Kiosk mode – no alerting
• High level cluster status (thrift, gossip)
• Warns on a small set of metrics


Tools we use – Epic
• Netflix-wide monitoring and alerting tool based on RRD
• Priam proxies all JMX data to Epic
• Very useful for finding specific issues


Tools we use – Dashboards
• Next level cluster metrics
• Throughput
• Latency
• Gossip status
• Maintenance operations
• Trouble indicators
• Useful for finding anomalies
• Most investigations start here


Tools we use – Jenkins
• Scheduling tool for additional monitors and maintenance tasks
• Push button automation for recurring tasks
• Repairs, upgrades, and other tasks are only performed through Jenkins to preserve history of actions
• On-call dashboard displays current issues and maintenance required


Things we monitor
Cassandra
 Throughput
 Latency
 Compactions
 Repairs
 Pending threads
 Dropped operations
 Java heap
 SSTable counts
 Cassandra log files

System
 Disk space
 Load average
 I/O errors
 Network errors

Other things we monitor
 Compaction predictions
 Backup failures
 Recent restarts
 Schema changes
 Monitors


What we learned
 Having Cassandra developers in house is crucial
 Repairs are incredibly expensive
 Multi-tenanted clusters are challenging
 A down node is better than a slow node
 Better to compact on our terms and not Cassandra’s
 Sizing and tuning are difficult and often done live
 Smaller per-node data size is better


Q&A (and Recommended viewing)
The Best of Times
Taft and Bakersfield are real places

South Park
Later season episodes like F-Word and Elementary School Musical

Caillou
My kids love this show; I don’t know why

Until the Light Takes Us
Scary documentary on Norwegian Black Metal

