
Cassandra for Sysadmins


Quick introduction to the moving parts inside Cassandra and essential commands and tasks for System Administrators.



  1. Cassandra 101 for System Administrators
  2. What is Cassandra?
     • A distributed, columnar database.
     • Originally created at Facebook in 2008; now a top-level Apache project.
     • Combines the best features of Amazon's Dynamo (replication, mostly) and Google's Bigtable (data model).
  3. Who uses Cassandra?
  4. Commercial Support Available:
  5. zOMG BOOKS!
  6. Rocking ~30 billion impressions a month, like a bawse. Used for semi-persistent storage of recommendations.
     • 14 nodes in two data centers.
     • Dell R610, 8 cores, 32G of RAM, 6 x 10K SAS drives.
     • Using 0.8 currently, just upgraded. In production since 0.4.
     • We use Hector.
     • ~70-80G per node, ~550G dataset unreplicated.
     • RP + OldNTS @ RF2. PropFileSnitch, RW @ CL.ONE.
     • Excited for NTS. Excited for TTLs!
  7. How We Use Cassandra (simplified workflow)
     • Tomcat serves recs from Memcached.
     • Flume ships logs to the Hadoop DWH; a bunch of algos run against the log data.
     • Results are crammed into Cassandra. Keyspace per algo.
     • The CacheWarmer sources recs from Cassandra (and other sources) and dumps them in Memcached.
     (Original slide: ASCII diagram of Tomcat -> Memcached -> CacheWarmer -> Cassandra, with logs flowing via Flume to Hadoop/Hive.)
  8. Before I get too technical, relax.
     The following slides may sound complex at first, but at the end of the day, to get your feet wet all you need to do is:
     • yum install / apt-get install
     • Define a seed.
     • Define strategies.
     • service cassandra start
     • Go get a beer.
     In my experience, once the cluster has been set up there is not much else to do other than occasional tuning as you learn how your data behaves.
  9. Why Cassandra?
     • Minimal administration.
     • No single point of failure.
     • Scales horizontally.
     • Writes are durable.
     • Consistency is tunable as needed on reads and writes.
     • Schema is flexible and can be updated live.
     • Handles failure gracefully; Cassandra is crash-only.
     • Replication is easy: rack- and datacenter-aware.
  10. Data Model
      Keyspace = {
        Column Family: {
          Row Key: {
            Column Name: "Column Value",
            Column Name: "Column Value"
          }
        }
      }
      A Keyspace is a container for column families; analogous to a database in MySQL.
      A Column Family is a container for a group of columns; analogous to a table in MySQL.
      A Column is the basic unit: key, value and timestamp.
      A Row is a collection of columns identified by a row key.
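The nesting above can be sketched as plain Python dicts. This is purely illustrative; the keyspace, column family and key names are made up:

```python
import time

# Illustrative only: a Keyspace as nested dicts, mirroring the slide's layout.
# "UserRecs" and "user:42" are hypothetical names.
now = int(time.time() * 1e6)  # Cassandra column timestamps are microseconds

keyspace = {
    "UserRecs": {                        # Column Family (~ table in MySQL)
        "user:42": {                     # Row key
            "rec_1": ("item_987", now),  # Column: name -> (value, timestamp)
            "rec_2": ("item_123", now),
        }
    }
}

# A column is the basic unit: name, value and timestamp.
name, (value, ts) = next(iter(keyspace["UserRecs"]["user:42"].items()))
print(name, value)  # rec_1 item_987
```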
  11. Gossip
      • In the config, define seed(s).
      • Used for intra-cluster communication.
      • The cluster self-assembles.
      • Works with failure detection.
      • Routes client requests.
  12. Pluggable Partitioning
      RandomPartitioner (RP)
      • Orders by MD5 of the key.
      • Most common.
      • Distributes relatively evenly.
      There are others, but you probably will not use them.
  13. Distributed Hash Table: The Ring
      For RandomPartitioner:
      • The ring is a range from 0 to 2**127.
      • Token is MD5(key).
      • Each node is given a slice of the ring: an initial token is defined, and a node owns that token up to the next node's initial token.
      Rock your tokens here:
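The token math above is easy to sketch. This is a minimal illustration of the idea (MD5 of the key mapped onto the 0..2**127 ring, and evenly spaced initial tokens for an n-node cluster), not Cassandra's exact implementation:

```python
import hashlib

RING = 2 ** 127  # RandomPartitioner ring size

def token(key: bytes) -> int:
    # Sketch: token derived from the MD5 of the key, folded onto the ring.
    return int(hashlib.md5(key).hexdigest(), 16) % RING

def initial_tokens(n: int) -> list:
    # Evenly spaced initial tokens for an n-node cluster.
    return [i * RING // n for i in range(n)]

print(initial_tokens(4))
print(0 <= token(b"user:42") < RING)
```

With four nodes, each node's slice is exactly one quarter of the ring.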
  14. Pluggable Topology Discovery
      Cassandra needs to know about your network to direct replica placement. Snitches inform Cassandra about it.
      • SimpleSnitch: the default, good for 1 data center.
      • RackInferringSnitch: infers location from the IP's octets. 10.D.R.N (Data center, Rack, Node).
      • PropertyFileSnitch: IP=DC:RACK (arbitrary values).
      • EC2Snitch: discovers AWS AZs and Regions.
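The RackInferringSnitch's 10.D.R.N convention can be shown in a few lines (a toy parser, not the snitch's actual code):

```python
def rack_inferring(ip: str) -> dict:
    # RackInferringSnitch convention: in 10.D.R.N the second octet is the
    # data center, the third is the rack, the fourth is the node.
    _, dc, rack, node = ip.split(".")
    return {"dc": dc, "rack": rack, "node": node}

# Node 3, rack 2, data center 1:
print(rack_inferring("10.1.2.3"))
```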
  15. Pluggable Replica Placement
      • SimpleStrategy: places replicas on the adjacent nodes on the ring.
      • NetworkTopologyStrategy: used with PropertyFileSnitch; explicitly pick how replicas are placed.
        strategy_options = [{NY1:2, LA1:2}];
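SimpleStrategy's "adjacent nodes on the ring" can be sketched as a ring walk. `simple_strategy` is a hypothetical helper illustrating the placement rule, ignoring details like rack awareness:

```python
def simple_strategy(ring_tokens, key_token, rf):
    # Sketch of SimpleStrategy: the first node whose token is >= the key's
    # token holds the primary replica; the remaining rf-1 replicas go to
    # the next nodes walking clockwise around the ring (wrapping at the top).
    nodes = sorted(ring_tokens)
    primary = next((i for i, t in enumerate(nodes) if t >= key_token), 0)
    return [nodes[(primary + i) % len(nodes)] for i in range(rf)]

# Four nodes at tokens 0/100/200/300, a key hashing to 150, RF=2:
print(simple_strategy([0, 100, 200, 300], 150, 2))  # [200, 300]
```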
  16. Reading & Writing
      The old method uses Thrift, which is usually abstracted by client APIs (e.g. Hector, PyCassa, PHPCass).
      Now we have CQL and JDBC!
      SELECT * FROM ColumnFamily WHERE rowKey=Name;
      Since all nodes are equal, you can read and write to any node. The node you connect to becomes a coordinator for that request and routes your data to the proper nodes.
      Connection pooling to nodes is sometimes handled by the API framework; otherwise use round-robin DNS or HAProxy.
  17. Tunable Consistency
      It is difficult to keep replicas of data consistent across nodes, let alone across continents.
      In any distributed system you have to make tradeoffs between how consistent your dataset is versus how available it is and how tolerant the system is of partitions (a.k.a. the CAP theorem).
      Cassandra chooses to focus on making the data available and partition-tolerant, and empowers you to choose how consistent you need it to be.
      Cassandra is awesomesauce because you choose what is more important to your query: consistency or latency.
  18. Per-Query Consistency Levels
      Latency increases the more nodes you have to involve.
      • ANY: for writes only. Writes to any available node and expects Cassandra to sort it out. Fire and forget.
      • ONE: reads or writes to the closest replica.
      • QUORUM: a write goes to half+1 of the appropriate replicas before the operation is successful. A read is successful when half+1 replicas agree on a value to return.
      • LOCAL_QUORUM: same as above, but only in the local datacenter in a multi-datacenter topology.
      • ALL: for writes, all replicas need to ack the write. For reads, returns the record with the newest timestamp once all replicas reply. In both cases, if we're missing even one replica, the operation fails.
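The "half+1" arithmetic is worth making concrete. A small sketch of how many replica acks each level needs, given a replication factor (`replicas_required` is a made-up helper, not a Cassandra API):

```python
def replicas_required(level: str, rf: int) -> int:
    # How many replicas must respond, out of rf total, per the levels above.
    levels = {
        "ANY": 1,             # writes only; any node, even a non-replica
        "ONE": 1,             # closest replica
        "QUORUM": rf // 2 + 1,  # half+1
        "ALL": rf,            # every replica
    }
    return levels[level]

# With RF=3, QUORUM means 2 of 3 replicas. A QUORUM write plus a QUORUM
# read always overlap in at least one replica, so reads see the write:
rf = 3
print(replicas_required("QUORUM", rf))  # 2
assert replicas_required("QUORUM", rf) * 2 > rf
```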
  19. Cassandra Write Path
      Cassandra identifies which node owns the token you're trying to write, based on your partitioning, replication and placement strategies.
      • Data is written to the CommitLog: sequential writes to disk, kinda like a MySQL binlog. Mostly written to; only read from upon a restart.
      • Data is written to the Memtable, which acts as a write-back cache.
      • When the Memtable hits a (configurable) threshold, it is flushed to disk as an SSTable. An SSTable (Sorted String Table) is an immutable file on disk. More on compaction later.
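The CommitLog/Memtable/SSTable flow above can be simulated in miniature. Everything here (the threshold, the in-memory "SSTables") is a toy model of the mechanism, not Cassandra's storage engine:

```python
# Toy write path: append to a commit log, update the memtable, and freeze
# the memtable into an immutable, sorted "SSTable" when it crosses a
# (hypothetical) threshold.
THRESHOLD = 3
commit_log, memtable, sstables = [], {}, []

def write(key, value):
    commit_log.append((key, value))  # durable, sequential append
    memtable[key] = value            # in-memory write-back cache
    if len(memtable) >= THRESHOLD:
        # Flush: an SSTable is sorted and immutable (a tuple here).
        sstables.append(tuple(sorted(memtable.items())))
        memtable.clear()

for i in range(7):
    write(f"k{i}", i)
print(len(sstables), len(memtable))  # 2 1
```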
  20. Cassandra Read Path
      Cassandra identifies which node owns the token you're trying to read, based on your partitioning, replication and placement strategies.
      • First it checks the Bloom filter, which can save us some time: a space-efficient structure that tests whether a key is on the node. False positives are possible; false negatives are impossible.
      • Then it checks the index, which tells us which SSTable file the data is in, and how far into the SSTable file to look, so we don't need to scan the whole thing.
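A Bloom filter is small enough to sketch whole. This toy version (tiny bit array, hash positions carved out of one MD5) shows the property the slide relies on: a key that was added always reports present, while an absent key may occasionally collide:

```python
import hashlib

M, K = 64, 3  # toy sizes: M-bit array, K hash positions per key

def _positions(key: str):
    # Derive K bit positions from one MD5 digest (illustrative, not
    # Cassandra's hashing scheme).
    digest = hashlib.md5(key.encode()).digest()
    return [digest[i] % M for i in range(K)]

class Bloom:
    def __init__(self):
        self.bits = [False] * M

    def add(self, key: str):
        for p in _positions(key):
            self.bits[p] = True

    def might_contain(self, key: str) -> bool:
        # True means "maybe on this node"; False is definitive.
        return all(self.bits[p] for p in _positions(key))

bf = Bloom()
bf.add("row-1")
print(bf.might_contain("row-1"))  # True: no false negatives, ever
```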
  21. Distributed Deletes
      It is hard to delete stuff in a distributed system: it is difficult to keep track of replicas, and SSTables are immutable.
      • Deleted items are tombstoned (marked for deletion).
      • The data still exists; it just can't be read through the API.
      • Tombstones are cleaned out during major compaction, when SSTables are merged/remade.
  22. Compaction
      • When you have enough disparate SSTable files taking up space, they are merge-sorted into single SSTable files.
      • An expensive process (lots of GC; can eat up half of your disk space).
      • Tombstones are discarded.
      • Manual or automatic.
      • Pluggable in 1.0.
      • Leveled compaction in 1.0.
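Compaction ties the last two slides together: merge SSTables, keep the newest timestamp per column, and drop tombstones. A toy merge under those rules (SSTables as dicts of key -> (value, timestamp); not Cassandra's actual algorithm):

```python
TOMBSTONE = object()  # stand-in for a delete marker

def compact(sstables):
    # Merge: for each key, the entry with the newest timestamp wins;
    # tombstoned entries are then discarded entirely.
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return {k: v for k, v in merged.items() if v[0] is not TOMBSTONE}

old = {"a": ("1", 10), "b": ("2", 10)}
new = {"a": (TOMBSTONE, 20)}  # "a" was deleted after it was written
print(compact([old, new]))    # {'b': ('2', 10)}
```

Note why the tombstone must exist at all: without it, the merge would simply resurrect the old value of "a" from the older SSTable.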
  23. Repair
      Anti-Entropy and Read Repair
      • During node repair and QUORUM & ALL reads, ColumnFamilies are compared with replicas and discrepancies are resolved.
      • Put a manual repair in cron to run at an interval <= the value of GCGraceSeconds to catch old tombstones, or risk forgotten deletes.
      Hinted Handoff
      • If a node is down, writes spool on other nodes and are handed off when it comes back.
      • Sometimes left off, since a returning node can get flooded.
  24. Caching
      Key Cache
      • Puts the location of keys in memory; improves seek times for keys on disk.
      • Enabled per ColumnFamily. On by default at 200,000 keys.
      Row Cache
      • Keeps full rows of hot data in memory.
      • Enabled per ColumnFamily. Skinny rows are more efficient.
      The Row Cache is consulted first, then the Key Cache. Will require a bit of tuning.
  25. Hardware
      RAM: depends on use. Stores some objects off-heap.
      CPU: the more cores the better; Cassandra is built with concurrency in mind.
      Disk: Cassandra tries to minimize random IO. Minimum of 2 disks; keep the CommitLog and data on separate spindles. RAID10 or RAID0 as you see fit. I set mine up thus: 1 disk = OS + CommitLog; RAID10 = data (SSTables).
      Network: 1 x 1GigE is fine, the more the better, and Gossip and data can be defined on separate interfaces.
  26. What about Cloud environments?
      EC2Snitch
      • Maps EC2 Regions to DCs.
      • Maps EC2 Availability Zones to Racks.
      • Use NetworkTopologyStrategy.
      Avoid EBS. Use RAID0/RAID10 across ephemeral drives.
      Replicate across Availability Zones.
      Netflix is moving to 100% Cassandra on EC2:
  27. Installing
      RedHat:
      yum -y install apache-cassandra
      Debian: add the repository to /etc/apt/sources.list:
      deb unstable main
      deb-src unstable main
      wget -O- | sudo apt-key add -
      sudo apt-get update
      sudo apt-get install cassandra
  28. Config and Log Files
      /etc/cassandra/conf/
      • cassandra.yaml
      /var/log/cassandra/
      • cassandra.log
      • system.log
      • gc.log (if enabled)
  29. Hot Tips
      • Use the Sun/Oracle JVM (1.6 u22+).
      • Use the JNA library.
        o Keep disk_access_mode as auto.
        o BTW, it is not using all your RAM; it's like FS cache.
      • Don't use autobootstrap; specify the initial token.
      • Super columns impose a performance penalty.
      • Enable GC logging.
      • Don't use a large heap. (Yay off-heap caching!)
      • Don't use swap.
  30. Monitoring
      Install the MX4J jar into the classpath, or query JMX directly.
      curl | grep | awk it into Nagios, Ganglia, Cacti or what have you.
  31. What to Monitor
      • Heap size and usage
      • Garbage collections
      • IO wait
      • Cache hit rate
      • Compaction count
      • CompactionStage: active and pending
      • RowMutationStage (writes): active and pending
      • ReadStage (reads): active and pending
  32. Adding/Removing/Replacing Nodes
      Adding a node
      • Calculate new tokens.
      • Set the correct initial token on the new node.
      • Once it has bootstrapped, run nodetool move on the other nodes.
      Removing a node
      • nodetool decommission drains data to other nodes.
      • nodetool removetoken tells the cluster to get the data from other replicas (faster, but more expensive on live nodes).
      Replacing a node
      • Bring up the replacement node with the same IP and token.
      • Run nodetool repair.
  33. Useful nodetool commands
      • nodetool info - displays node-level info.
      • nodetool ring - displays info on the nodes in the ring.
      • nodetool cfstats - displays ColumnFamily statistics.
      • nodetool tpstats - displays what operations Cassandra is doing right now.
      • nodetool netstats - displays streaming information.
      • nodetool drain - flushes Memtables to SSTables on disk and stops accepting writes. Useful before a restart to make startup quicker (no CommitLog to replay).
  34. nodetool info
  35. nodetool ring
  36. nodetool tpstats
  37. nodetool cfstats
  38. nodetool cfhistograms
  39. Backups
      Single-node snapshot
      • nodetool snapshot / nodetool clearsnapshot
      • Makes a hardlink of the SSTables that you can tarball.
      Cluster-wide snapshot
      • clustertool global_snapshot / clustertool clear_global_snapshot
      • Just does local snapshots on all nodes.
      To restore:
      • Stop the node.
      • Clear the CommitLogs.
      • Zap the *.db files in the Keyspace directory.
      • Copy the snapshot over from the snapshots subdirectory.
      • Start the node and wait for load to decrease.
  40. Shutdown Best Practice
      While Cassandra is crash-safe, you can make a cleaner shutdown and save some time during startup thus:
      • Make other nodes think this one is down:
        nodetool -h $(hostname) -p 8080 disablegossip
      • Wait a few secs, then cut off anyone from writing to this node:
        nodetool -h $(hostname) -p 8080 disablethrift
      • Flush all Memtables to disk:
        nodetool -h $(hostname) -p 8080 drain
      • Shut it down:
        /etc/init.d/cassandra stop
  41. Rolling Upgrades
      From 0.7 you can do rolling upgrades. Check for cassandra.yaml changes!
      On each node, one by one:
      • Shut down as in the previous slide, but take a snapshot after draining.
      • Remove the old jars, rpms, debs. Your data will not be touched.
      • Add the new jars, rpms, debs.
      • /etc/init.d/cassandra start
      • Wait for the node to come back up and for the other nodes to see it.
      When done, before you run repair, on each node run:
      • nodetool -h $(hostname) -p 8080 scrub
      • This rebuilds the SSTables to bring them up to date. It is essentially a major compaction without the compacting, so it is a bit expensive.
      Run repair on your nodes to clean up the data:
      • nodetool -h $(hostname) -p 8080 repair
  42. Join Us! These slides can be found here: