3. Outline
• Cassandra vs SQL Server
• Overview
• Data in Cassandra
• Data Partitioning
• Data Replication
• Data Consistency
• Client Libraries
4. Cassandra vs SQL Server
• Cassandra
o More servers = More capacity.
o Scaling concerns are transparent to the application.
o No single point of failure.
o Horizontal scale.
• SQL Server
o More powerful machine = More capacity.
o Adding capacity requires manual labor from ops people
and substantial downtime.
o There is a limit on how big a single machine can get.
o Vertical scale, Moore’s law scaling.
5. Overview
• Combines features from Dynamo and BigTable
• Distributed
o Data partitioned among all nodes
• Extremely Scalable
o Add new node = Add more capacity
o Easy to add new node
• Fault tolerant
o All nodes are the same
o Read/Write anywhere
o Automatic Data replication
• High Performance
7. Data in Cassandra
• Keyspace ~ Database in RDBMS
• Column Family ~ Table in RDBMS
Keyspace
  ColumnFamily (row key: Boris)
    ID | Addr       | Phone
    1  | ... Taiwan | 09.....
  Each column is stored as:
    { column: Phone, value: 09..., timestamp: 1000 }
  The timestamp is used to resolve conflicts.
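The timestamp rule above can be sketched as a "last write wins" comparison. This is a minimal illustration, not Cassandra's actual code: the `Column` class, the `resolve` function, and the phone values are hypothetical, and ties are simply resolved in favor of the first argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a stored column: name, value, write timestamp."""
    name: str
    value: str
    timestamp: int


def resolve(a: Column, b: Column) -> Column:
    """Return the version with the higher timestamp (last write wins).

    Simplification: on equal timestamps the first argument is kept.
    """
    return a if a.timestamp >= b.timestamp else b


# Two conflicting versions of the same column (made-up values).
old = Column("Phone", "0911", 1000)
new = Column("Phone", "0922", 1005)
winner = resolve(old, new)
print(winner.value, winner.timestamp)
```

Whichever replica answers first, the version carrying the larger timestamp is the one returned to the client.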
8. Data in Cassandra
• Keyspace
o Where the replication strategy and replication factor
are defined.
CREATE KEYSPACE keyspace_name WITH
strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor=2;
• ColumnFamily
CREATE COLUMNFAMILY user (
  id uuid PRIMARY KEY,
  address text,
  userName text
) WITH comment=''
  AND comparator=text
  AND read_repair_chance=0.100000
  AND gc_grace_seconds=864000
  AND default_validation=text
  AND min_compaction_threshold=4
  AND max_compaction_threshold=32
  AND replicate_on_write=True
  AND compaction_strategy_class='SizeTieredCompactionStrategy'
  AND compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
9. Data in Cassandra
• Commit log
o Records every write before it is acknowledged, so data
durability is assured.
• Memtable
o In-memory structure that holds the most recent writes.
• SSTable
o When a memtable is flushed to disk, it becomes an
SSTable.
10. Data Read/Write
• Write
o Data -> Commit log -> Memtable -> (flushed) -> SSTable
• Read
o Search the row cache. On a hit, return the result; no
further action is needed.
o On a row cache miss, read from the memtable(s) and
SSTable(s) that might contain the requested key, collate
the results, and return them.
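The write and read paths above can be sketched with a toy single-node store. This is a simplified model under stated assumptions, not Cassandra's implementation: the `Node` class, its `memtable_limit` parameter, and the flush trigger (a fixed number of keys) are all hypothetical, and everything lives in memory.

```python
class Node:
    """Toy model of one node's storage path (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # key -> (value, timestamp), most recent writes
        self.sstables = []        # flushed memtables, treated as immutable
        self.row_cache = {}       # key -> value for previously read rows
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value, timestamp):
        # 1. Log the write first so it can be replayed after a crash.
        self.commit_log.append((key, value, timestamp))
        # 2. Apply it to the memtable, keeping the newest timestamp.
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        # Drop any cached copy so reads do not return stale data.
        self.row_cache.pop(key, None)
        # 3. Flush when the memtable is "full".
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # The flushed memtable becomes an SSTable; a fresh memtable starts.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # Row cache hit: return immediately.
        if key in self.row_cache:
            return self.row_cache[key]
        # Miss: collect every version from the memtable and the SSTables.
        versions = [t[key] for t in [self.memtable, *self.sstables] if key in t]
        if not versions:
            return None
        # Collate: the version with the highest timestamp wins.
        value, _ = max(versions, key=lambda v: v[1])
        self.row_cache[key] = value
        return value


node = Node(memtable_limit=2)
node.write("k1", "a", 1)
node.write("k2", "b", 2)   # memtable reaches the limit and is flushed
node.write("k1", "c", 3)   # newer version of k1 lands in the new memtable
print(node.read("k1"))
```

The read of `k1` has to merge the older version in the SSTable with the newer one in the memtable, which is the "collate" step from the slide.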
12. Data Partitioning
• The total data managed by the cluster is
represented as a circular space or ring.
• Before a node can join the ring, it must be assigned
a token.
• The token determines the node’s position on the
ring and the range of data it is responsible for.
• Partitioning strategy
o Random Partitioning: default and recommended.
o Ordered Partitioning: sequential writes can cause hot
spots, and load balancing the cluster takes more
administrative overhead.
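How a token maps a key to a node can be sketched as follows. This is an illustration under assumptions: the node names `t1` to `t5`, the evenly spaced tokens, and the helper names `token_for` and `owner` are hypothetical, and taking the MD5 digest modulo the ring size only approximates what the random partitioner does.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token space for this sketch


def token_for(key: str) -> int:
    """Hash a row key onto the ring (MD5-based, simplified)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(ring, key_token):
    """Return the node responsible for key_token.

    ring is a list of (token, node) pairs sorted by token. A node owns
    the range (previous token, its own token], wrapping around the ring.
    """
    positions = [token for token, _ in ring]
    index = bisect.bisect_left(positions, key_token)
    return ring[index % len(ring)][1]


# Five nodes with evenly spaced tokens, so each owns about 1/5 of the ring.
names = ["t1", "t2", "t3", "t4", "t5"]
ring = [(i * RING_SIZE // len(names), name) for i, name in enumerate(names)]

for key in ["k1", "k2", "k3", "k4"]:
    print(key, "->", owner(ring, token_for(key)))
```

Because the hash spreads keys roughly uniformly, evenly spaced tokens give each node a similar share of the data, which is why sequential keys do not create hot spots under random partitioning.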
13. Data Partitioning
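[Figure: Random partitioning. Nodes with tokens t1 to t5 form a ring; keys k1 to k4 are placed on the ring at hash(k1) to hash(k4), and each node stores the keys that fall in its range (the figure labels "Data: k1" and "Data: k3").]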
14. Data Replication
• Replication ensures fault tolerance and removes any
single point of failure.
• Replication is controlled by the parameters
replication factor and replication strategy
of a keyspace.
• Replication factor controls how many copies
of a row are stored in the cluster.
• Replication strategy controls how the data
is replicated across nodes.
15. Data Replication
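[Figure: Replication with random partitioning and RF=3. Nodes with tokens t1 to t5 form a ring; a coordinator node receives the request for key k1, which is placed at hash(k1) and stored as "Data: k1" on its replica nodes.]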
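Replica placement for a given replication factor can be sketched as a clockwise walk around the ring, which is how SimpleStrategy places copies. The small integer tokens, node names, and the `replicas` helper below are hypothetical and chosen only for readability.

```python
import bisect


def replicas(ring, key_token, rf):
    """Return the nodes that store key_token for replication factor rf.

    ring is a list of (token, node) pairs sorted by token. The first
    replica is the node that owns the key's range; the remaining
    replicas are the next nodes clockwise on the ring.
    """
    positions = [token for token, _ in ring]
    start = bisect.bisect_left(positions, key_token) % len(ring)
    count = min(rf, len(ring))  # cannot place more copies than nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Toy ring with made-up tokens 0..80 instead of real 127-bit tokens.
ring = [(0, "t1"), (20, "t2"), (40, "t3"), (60, "t4"), (80, "t5")]

# A key hashing to 15 falls in t2's range, so with RF=3 it is stored
# on t2 and on the next two nodes clockwise.
print(replicas(ring, 15, 3))
```

Any node can act as coordinator for the request; it forwards the operation to the replicas computed this way.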
16. Data Consistency
• Cassandra supports tunable data
consistency.
• Choose between strong and eventual
consistency depending on the need.
• The consistency level can be set per operation,
for both reads and writes.
• Handles multi-data center operations.
17. Consistency Level
Write           Read
Any             -
One             One
Quorum          Quorum
Local_Quorum    Local_Quorum
Each_Quorum     Each_Quorum
All             All
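How the levels in the table trade consistency for availability can be sketched with a little arithmetic for a single data center. A quorum is a majority of the replicas, and a read is guaranteed to see the latest write when the replicas contacted for the read and for the write must overlap, that is when R + W > RF. The function names below are hypothetical; Any and the multi-data-center levels are left out of this sketch.

```python
def quorum(rf: int) -> int:
    """Majority of rf replicas."""
    return rf // 2 + 1


def required_replicas(level: str, rf: int) -> int:
    """Replicas that must respond for a single-data-center level."""
    table = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}
    return table[level]


def is_strongly_consistent(write_level: str, read_level: str, rf: int) -> bool:
    """True when read and write replica sets are guaranteed to overlap."""
    w = required_replicas(write_level, rf)
    r = required_replicas(read_level, rf)
    return r + w > rf


rf = 3
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(write_level, read_level, is_strongly_consistent(write_level, read_level, rf))
```

With RF=3, Quorum writes plus Quorum reads tolerate one replica being down while still giving strong consistency, whereas One/One is faster but only eventually consistent.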
19. Client Library for Java
• Hector
o https://github.com/hector-client/hector.git
o https://github.com/hector-client/hector/wiki/User-Guide
• Astyanax
o https://github.com/Netflix/astyanax.git
• CQL + JDBC
o http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
20. Hector
• High-level, simple object-oriented
interface to Cassandra
• Failover behavior on the client side
• Connection pooling for improved
performance and scalability
• Automatic retry of downed hosts
24. Useful Tools
• cassandra-cli
o <cassandra-dir>/bin
o http://www.datastax.com/docs/1.0/dml/using_cli
• cqlsh
o <cassandra-dir>/bin
o http://www.datastax.com/docs/1.0/references/cql/index
• nodetool
o <cassandra-dir>/bin
o http://www.datastax.com/docs/1.0/references/nodetool
• stress
o <cassandra-dir>/tools/bin
o http://www.datastax.com/docs/1.0/references/stress_java
25. Useful Tools
• OpsCenter
o http://www.datastax.com/products/opscenter
• sstableloader
o <cassandra-dir>/bin
o http://www.datastax.com/dev/blog/bulk-loading
• More tools
o http://en.wikipedia.org/wiki/Apache_Cassandra#Tools_for_Cassandra