The document provides an overview of Apache Cassandra, an open-source distributed database management system. It discusses Cassandra's peer-to-peer architecture that allows for scalability and availability. The key concepts covered include Cassandra's data model using columns, rows, column families and its distribution across nodes using consistent hashing of row keys. The document also briefly outlines Cassandra's basic read and write operations and how it handles replication and failure recovery.
5. OUR WORLD
Traditional DBMSs are still very valuable
Storage (and memory) and compute resources are cheaper than before
But we face a new set of problems
Big data
(Near) real time
Complex and varied requirements
Recommendation
Finding FOAF (friend of a friend)
…
Event-driven triggering
User sessions
…
6. OUR WORLD (CONT)
Complex applications combine different types of problems
Different languages -> more productivity
e.g. functional languages, languages optimized for multiprocessing
Polyglot persistence layer
Performance vs durability?
Reliability?
…
7. TRADITIONAL DBMS
Relational model
Well-defined schema
Access with selection/projection
Derived data from joining/grouping/aggregating (counting …)
Small, refined data
…
But
Painful data model changes
Hard to scale out
Ineffective at handling large volumes of data
Not designed around modern hardware
…
8. TRADITIONAL DBMS (CONT)
Many constraints to uphold ACID
PK/FK checking
Domain/type checking
… checking, checking
Lots of IO / processing
OODBMS, ORDBMS
Good, but … even more checking / processing
Do not play well with disk IO
10. NOSQL (CONT)
Benefits
Higher performance
Higher scalability
Flexible data model
More effective for some cases
Less administrative overhead
Drawbacks
Limited Transactions
Relaxed Consistency
Unconstrained data
Limited ad-hoc query capabilities
Limited administrative tooling
11. CAP
Brewer’s theorem
We can pick only two of:
Consistency
Availability
Partition tolerance
(diagram: CAP triangle)
AP — Amazon Dynamo derivatives: Cassandra, Voldemort, CouchDB, Riak
CP — Neo4j, Bigtable and Bigtable derivatives: MongoDB, HBase, Hypertable, Redis
CA — Relational: MySQL, MSSQL, Postgres
12. DYNAMO + BIGTABLE = CASSANDRA
Dynamo (architecture) + BigTable (data model) = Cassandra
(Apache) Cassandra is a free, open-source, highly scalable,
distributed database system for managing large amounts of data
Written in Java
Runs on the JVM
References :
BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf)
Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
13. DESIGN GOALS
Simple key/value (column) store
Limited only by storage
No support for anything (aggregating, grouping …) beyond the basic operations
(CRUD, range access)
But extendable
Hadoop (MR, HDFS, Pig, Hive …)
ESP
Distributed processing interfaces (e.g. BSP, MR)
Baas.io
…
14. DESIGN GOALS (CONT)
High availability
Decentralized
Every node can accept reads and writes
Replication and replica access
Multi-DC support
Eventual consistency
Less write complexity
Audit and repair on read
Tunable -> trade-offs between consistency, durability and latency
15. DESIGN GOALS (CONT)
Incremental scalability
Equal members
Linear scalability
Unlimited space
Write/read throughput increases linearly as nodes (members) are added
Low total cost
Minimal administrative work
Automatic partitioning
Flush / compaction
Data balancing / moving
Virtual nodes (since v1.2)
Mid-powered nodes deliver good performance
Nodes working together yield powerful performance and huge space
16. FOUNDER & HISTORY
Founder
Avinash Lakshman (one of the authors of Amazon's Dynamo)
Prashant Malik ( Facebook Engineer )
Developers
About 50
History
Open sourced by Facebook in July 2008
Became an Apache Incubator project in March 2009
Graduated to a top-level project in Feb 2010
0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010
0.7 released (added secondary indexes and online schema change) in Jan 2011
0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011
1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011
1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012
1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013
17. PROMINENT USERS
User | Cluster size | Node count | Usage | Now
Facebook | >200 | ? | Inbox search | Abandoned, moved to HBase
Cisco WebEx | ? | ? | User feed, activity | OK
Netflix | ? | ? | Backend | OK
Formspring | ? (26 million accounts, 10 M responses per day) | ? | Social-graph data | OK
Also: Urban Airship, Rackspace, OpenX, Twitter (preparing to move)
19. P2P ARCHITECTURE
All nodes are equal
No single point of failure / decentralized
Compare with:
MongoDB
Broker structures (CUBRID …)
Master / slave
…
20. P2P ARCHITECTURE
Drives linear scalability
References :
http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
22. COLUMN
The basic, primitive type (the smallest increment of data)
A tuple containing a name, a value and a timestamp
The timestamp is important
Provided by the client
Determines the most recent version
On collision, the DBMS chooses the one with the latest timestamp
(diagram: column = name | value | timestamp)
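The collision rule above (latest client-supplied timestamp wins) can be sketched as follows; a minimal illustration, with names of my own choosing, not Cassandra's internal types:

```python
# Minimal sketch of last-write-wins conflict resolution for a column:
# a column is a (name, value, timestamp) tuple; on collision the
# version with the highest client-supplied timestamp wins.
from collections import namedtuple

Column = namedtuple("Column", ["name", "value", "timestamp"])

def resolve(a: Column, b: Column) -> Column:
    """Return the most recent version of the same column."""
    assert a.name == b.name
    return a if a.timestamp >= b.timestamp else b

old = Column("fullname", "smith", 100)
new = Column("fullname", "mike", 200)
assert resolve(old, new).value == "mike"   # newer timestamp wins
```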
23. COLUMN (CONT)
Types
Standard: a column with a simple name (UUID or UTF8 …)
Composite: a column with a composite name (UUID+UTF8 …)
Expiring: marked with a TTL
Counter: has only a name and a value; the timestamp is managed by the server
Super: used to manage wide rows; inferior to composite
columns (DO NOT USE — all sub-columns are serialized together)
(diagram: counter column = name | value; super column = name | {sub-columns, each name | value | timestamp})
24. COLUMN (CONT)
Types (CQL3 based)
Standard: has one primary key.
Composite: has more than one primary key column;
recommended for managing wide rows.
Expiring: gets deleted during compaction.
Counter: counts occurrences of an event.
Super: used to manage wide rows; inferior to composite
columns (DO NOT USE — all sub-columns are serialized together)
DDL:
CREATE TABLE test (
  user_id varchar,
  article_id uuid,
  content varchar,
  PRIMARY KEY (user_id, article_id)
);
<Logical>
user_id | article_id | content
Smith | <uuid1> | Blah1..
Smith | <uuid2> | Blah2..
<Physical> (one wide row, key Smith)
Smith -> {uuid1,content} = Blah1… (timestamp) | {uuid2,content} = Blah2… (timestamp)
SELECT user_id, article_id FROM test
ORDER BY article_id DESC LIMIT 1;
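The logical-to-physical mapping above can be sketched in Python; an illustration of the idea only, not Cassandra's storage engine:

```python
# Sketch: two logical CQL3 rows with PRIMARY KEY (user_id, article_id)
# collapse into one physical wide row keyed by user_id, where each cell
# gets a composite column name (clustering key, CQL column name).
logical_rows = [
    {"user_id": "Smith", "article_id": "uuid1", "content": "Blah1.."},
    {"user_id": "Smith", "article_id": "uuid2", "content": "Blah2.."},
]

physical = {}
for row in logical_rows:
    partition = physical.setdefault(row["user_id"], {})
    # composite column name: (clustering value, non-key column name)
    partition[(row["article_id"], "content")] = row["content"]

assert physical == {
    "Smith": {("uuid1", "content"): "Blah1..", ("uuid2", "content"): "Blah2.."}
}
```

Both logical rows live in the same physical row, which is why wide rows and range scans over the clustering key are cheap.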
25. ROWS
A row contains a row key and a set of columns
A row key must be unique (usually a UUID)
Supports up to 2 billion columns per (physical) row
Columns are sorted by name (the column name is indexed)
Primitive
Secondary index
Direct column access
(diagram: row = row key -> [name | value | timestamp] x N)
26. COLUMN FAMILY
A container for columns and rows
No fixed schema
Each row is uniquely identified by its row key
Each row can have a different set of columns
Rows are sorted by row key
Comparator / validator
Static/dynamic CF
If the column type is super column, the CF is called a "Super Column Family"
Like a "table" in the relational world
(diagram: CF = multiple rows, each row key -> columns [name | value | timestamp])
28. TOKEN RING
A node is an instance (typically one per server)
The ring maps each row to a node
Token range: 0 to 2^127 - 1
A token is associated with each row key
Node
Assigned a unique token (e.g. token 5 to Node 5)
A node's range runs from the previous node's token (exclusive) to its own token (inclusive)
token 4 < Node 5's range <= token 5
(diagram: ring of Nodes 1-8; Node 5 owns (Token 4, Token 5])
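The ownership rule above can be sketched with a tiny token ring; the tokens are small illustrative integers, real tokens span 0 to 2^127 - 1:

```python
# Sketch of token-ring ownership: the owner of a row token is the
# first node whose token is >= the row token, wrapping around at the
# end of the ring (so Node 1 owns the range above the highest token).
import bisect

ring = {10: "Node 1", 30: "Node 2", 50: "Node 3", 70: "Node 4"}
tokens = sorted(ring)

def owner(row_token: int) -> str:
    i = bisect.bisect_left(tokens, row_token)
    return ring[tokens[0]] if i == len(tokens) else ring[tokens[i]]

assert owner(35) == "Node 3"   # 35 falls in (30, 50], owned by Node 3
assert owner(30) == "Node 2"   # a token equal to a node token belongs to it
assert owner(99) == "Node 1"   # wraps around the ring
```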
30. REPLICATION
Whichever node the client contacts for a read/write acts as the
coordinator node
A locator (replica placement strategy) determines where replicas are located
Replicas are used for
Consistency checks
Repair
Ensure W + R > N for consistency
Local cache (row cache)
With a replication factor of 4, N-1 additional copies are replicated
The simple locator treats ring order as proximity
(diagram: 8-node ring; the coordinator locates the first (original)
replica, then the simple locator places the remaining replicas clockwise)
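The simple locator described above (ring order as proximity) can be sketched as: place the original on the node owning the token range, then the remaining replicas on the next nodes clockwise. An illustration, not the actual SimpleStrategy code:

```python
# Sketch of simple replica placement: the primary node stores the
# original, and the next RF-1 nodes clockwise hold the replicas.
def replicas(ring_nodes, primary_index, rf):
    n = len(ring_nodes)
    return [ring_nodes[(primary_index + k) % n] for k in range(rf)]

nodes = [f"Node {i}" for i in range(1, 9)]
# Replication factor 4: the original plus 3 clockwise replicas
assert replicas(nodes, 6, 4) == ["Node 7", "Node 8", "Node 1", "Node 2"]
```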
31. REPLICATION (CONT)
Multi-DC support
Allows specifying how many replicas go in each DC
Within a DC, replicas are placed on different racks
Relies on a snitch to place replicas
Strategies (provided by the snitch)
Simple (single DC)
RackInferringSnitch
PropertyFileSnitch
EC2Snitch
EC2MultiRegionSnitch
(diagram: replicas spread across DC1 and DC2)
32. ADD / REMOVE NODE
Data transfer between nodes is called "streaming"
If node 5 is added,
nodes 3, 4 and 1 (assuming RF is 2) are involved in streaming
If node 2 is removed,
node 3 (which holds the next higher token and node 2's replicas) serves instead
(diagrams: 4-node ring -> 5-node ring after adding Node 5;
4-node ring -> 3-node ring after removing Node 2)
33. VIRTUAL NODES
Supported since v1.2
Real-time migration support?
Shuffle utility
One node has many tokens
=> one node owns many ranges
(diagram: cluster with num_tokens = 4; Node 1 and Node 2 each own four scattered ranges)
34. VIRTUAL NODES (CONT)
Less administrative work
Saves cost
When adding/removing a node,
many nodes co-operate
No need to determine tokens by hand
Shuffle to re-balance
Less time spent on changes
Smart balancing
No need to balance manually
(the number of tokens should be sufficiently high)
(diagram: cluster with num_tokens = 4; adding Node 3 takes small ranges from both Node 1 and Node 2)
36. CLUSTER
The total amount of data managed by the cluster is represented as a
ring
A cluster of nodes
Has one or more keyspaces
Partitioning strategy defined
Authentication
37. GOSSIP
A gossip protocol is used for cluster membership
Failure detection at the service level (alive or not)
Every node in the system learns every other node's status
Implemented as
Sync -> Ack -> Ack2
Information: status, load, bootstrapping
Basic statuses: Alive / Dead / Join
Runs every second
Status disseminates in O(log N) rounds (N is the number of nodes)
Seed nodes
PHI (accrual failure detection) decides dead vs alive over a time window
(threshold 5 -> detection in 15~16 s)
Data structure:
HeartBeat < ApplicationState < EndpointState < EndpointStateMap
(diagram: nodes N1-N6 gossiping)
39. WRITE / UPDATE
CommitLog
Abstracted mmapped type
File & memory sync -> on system failure, this is your guardian angel ^^
Java NIO
Uses the C-heap (= native heap)
Logs data (a write then a delete? Both entries still exist in the log)
Rolling segment structure
Memtable
In-memory buffer and workspace
Sorted by row key
On reaching a threshold or a periodic point, written to disk as a persistent
table structure (SSTable)
40. WRITE / UPDATE (LOCAL LEVEL)
Write
1. Write to the commit log
2. Write/update the memtable
3. Write to disk (flush) -> SSTable
CommitLog:
Write : “1”:{“name”:”fullname”,”value”:”smith”}
Write : “2”:{“name”:”fullname”,”value”:”mike”}
Delete : “1”
Write : “3”:{“name”:”fullname”,”value”:”osang”}
…
Memtable:
Key | Name | Value
1 | fullname | smith
2 | fullname | mike
3 | fullname | osang
… | … | …
SSTable | SSTable | SSTable
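The three steps above can be sketched as follows; the class, names and flush threshold are illustrative, not Cassandra internals:

```python
# Sketch of the local write path: append to the commit log first
# (durability), then apply to the memtable; when a threshold is
# reached, flush the memtable to an immutable, sorted SSTable.
class Node:
    def __init__(self, flush_threshold=2):
        self.commitlog = []        # durable, append-only
        self.memtable = {}         # in-memory workspace
        self.sstables = []         # immutable on-disk tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commitlog.append((key, value))   # 1. commit log
        self.memtable[key] = value            # 2. memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush to SSTable

    def flush(self):
        # SSTables are written sorted by row key
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

n = Node()
n.write("1", "smith")
n.write("2", "mike")      # reaches the threshold, triggers a flush
assert n.sstables == [{"1": "smith", "2": "mike"}]
assert n.memtable == {}
assert len(n.commitlog) == 2
```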
41. SSTABLE
SSTable = Sorted String Table
Best suited to a log-structured DB
Stores large numbers of key-value pairs
Immutable
Created by a "flush"
Merged by (major/minor) compaction
May hold columns that exist in different versions (timestamps)
The most recent one is chosen
42. READ (LOCAL LEVEL)
Read
Memtable:
Key | Name | Value
2 | fullname | mike
3 | fullname | osang
… | … | …
SSTables, each with a Bloom filter (BF) and index (IDX)
(diagram: a read consults the memtable and the matching SSTables)
43. READ (CLUSTER LEVEL, +READ REPAIR)
Read operation -> coordinator -> locator
1. Data is transferred from the original/replica nodes (per the consistency level)
2. Digests are compared across the replicas
(original: right, replica: right, replica: wrong)
3. If digests differ, the right (most recent) one is chosen
and the stale replica is recovered
44. DELETE
Adds a tombstone (a special type of column)
Garbage collected during compaction
GC grace seconds: 864000 (default, 10 days)
Issue
If a faulty node recovers after GCGraceSeconds, the deleted data can
be resurrected
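The grace-period rule above can be stated in a few lines; a sketch of the timing logic, not Cassandra's compaction code:

```python
# Sketch of tombstone purging: a delete writes a tombstone, and
# compaction may only purge it after gc_grace_seconds have passed.
# A replica that was down longer than that never sees the tombstone,
# which is how deleted data can be resurrected.
GC_GRACE_SECONDS = 864_000  # default: 10 days

def purgeable(tombstone_ts: int, now: int) -> bool:
    return now - tombstone_ts > GC_GRACE_SECONDS

deleted_at = 1_000_000
assert not purgeable(deleted_at, deleted_at + 3_600)             # too soon
assert purgeable(deleted_at, deleted_at + GC_GRACE_SECONDS + 1)  # safe to purge
```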
46. DETECTION
Dynamic threshold for marking nodes down
An accrual detection mechanism calculates a per-node threshold
Automatically takes network conditions, workload and
other factors affecting the perceived heartbeat rate into account
From 3rd-party clients
Hector
Failover
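The accrual (PHI) idea above can be sketched under a simplified exponential-arrival assumption; Cassandra's actual detector models heartbeat inter-arrival times more carefully, so the exact times differ:

```python
# Sketch of an accrual failure detector: phi grows with the time
# since the last heartbeat, scaled by the observed mean interval;
# the node is convicted once phi crosses a threshold.
import math

def phi(time_since_last: float, mean_interval: float) -> float:
    # Exponential arrival assumption: P(no heartbeat for t) = e^(-t/mean)
    return -math.log10(math.exp(-time_since_last / mean_interval))

# With 1 s heartbeats, phi ~ t / (mean * ln 10): a threshold of 5
# fires after roughly 11-12 s under this simplified model.
assert phi(1.0, 1.0) < 5
assert phi(12.0, 1.0) > 5
```

The appeal of the accrual approach is that the threshold adapts: if heartbeats naturally arrive more slowly (slow network, heavy load), the mean interval grows and phi rises more slowly too.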
47. HINTED-HANDOFF
The coordinator stores a hint if a node is down or fails to
acknowledge a write
A hint consists of the target replica and the mutation (column
object) to be replayed
Uses the Java heap (may move off-heap in a later release)
Hints are only saved for a limited time (default 1 hour) after a replica fails
When the failed node comes back up, the missed
writes are streamed to it
49. READ REPAIR
Background work
Configured per CF
If replicas are inconsistent, chooses the most recently written value and
replaces the stale ones
50. ANTI-ENTROPY REPAIR
Ensures all data on the replicas is made consistent
Uses a Merkle tree
A tree of hashes over data blocks
Verifies inconsistency
The repairing node requests a Merkle hash (over a piece of a CF)
from the replicas and compares; if inconsistent, it streams data from a
replica and does a read-repair
(diagram: CF split into blocks 1, 2, 3, …; leaf hashes combined pairwise up to a root hash)
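The tree-of-hashes comparison can be sketched as follows; an illustration of the idea, not Cassandra's repair implementation:

```python
# Sketch of a Merkle tree: hash the data blocks, then hash pairs of
# hashes up to a single root. Equal roots mean equal data; differing
# roots let repair descend to find exactly which blocks differ.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])          # duplicate an odd leaf
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

a = [b"block1", b"block2", b"block3"]
b = [b"block1", b"blockX", b"block3"]
assert merkle_root(a) == merkle_root(list(a))   # same data, same root
assert merkle_root(a) != merkle_root(b)         # any difference changes the root
```

Exchanging one root hash per CF piece is what makes anti-entropy cheap: full data only streams for subtrees whose hashes disagree.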
52. BASIC
Full ACID compliance in a distributed system is a bad idea
(network, …)
Single-row updates are atomic (including internal indexes);
everything else is not
Relaxing consistency does not equal data corruption
Tunable consistency
Speed vs precision
Every read and write operation decides (from the client) how consistent
the requested data should be
54. CONDITION (CONT)
N is 3
Operations
1. Write 3
2. Write 5
3. Write 1
Final written values across the three replicas: 3, 5, 1
Worst case replica states after the writes:
W is 1 -> 1, 5, 1
W is 2 -> 3, 1, 1 or 1, 1, 1
Possible read results:
R is 1 -> 3, 5 or 1
R is 2 -> 1
R is 3 -> 1
(W + R) > N ensures that at least one latest value can be selected
This is eventual consistency
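The W + R > N guarantee above can be checked exhaustively for N = 3; a small verification sketch, not production code:

```python
# Sketch of why W + R > N guarantees overlap: any set of W replicas
# that acknowledged the write and any set of R replicas answering the
# read must intersect, so a read always sees at least one replica
# holding the latest value.
from itertools import combinations

N = 3
for W in range(1, N + 1):
    for R in range(1, N + 1):
        overlap_always = all(
            set(w) & set(r)
            for w in combinations(range(N), W)
            for r in combinations(range(N), R)
        )
        assert overlap_always == (W + R > N)

# QUORUM on both sides (RF // 2 + 1) always satisfies W + R > N
quorum = N // 2 + 1
assert quorum + quorum > N
```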
55. READ CONSISTENCY LEVELS
One
Two
Three
Quorum
Local Quorum
Each Quorum
All
Specifies how many replicas must respond
before a result is returned to the client
Quorum: (replication factor / 2) + 1,
rounded down to a whole number
Local Quorum / Each Quorum are used in multi-DC setups
(once satisfied, the result returns right away)
56. WRITE CONSISTENCY LEVELS
ANY
One
Two
Three
Quorum
Local Quorum
Each Quorum
All
Specifies how many replicas must succeed
before an acknowledgement is returned to the client
Quorum: (replication factor / 2) + 1,
rounded down to a whole number
Local Quorum / Each Quorum are used in multi-DC setups
The ANY level counts hinted handoff as a success
(once satisfied, the acknowledgement returns right away)
58. CACHE
The key/row caches can persist their data to files
Key cache
For frequently accessed keys
Holds the locations of keys (pointing to columns)
In memory, on the JVM heap
Row cache
Optional
Holds the entire set of columns of the row
In memory, off-heap (since v1.1) or on the JVM heap
If you have huge rows, this can cause an OOME (OutOfMemoryError)
59. CACHE
Mmapped disk access
On a 64-bit JVM, used for data and the index summary (default)
Provides virtual mmapped space in memory for SSTables
On the C-heap (native heap)
GC makes this behave like a cache
Frequently accessed data lives a long period; otherwise GC purges it
If the data is already in memory, it is returned directly (= cache)
(Problem) the C-heap is only reclaimed when it is full
(Problem) with open SSTables, Cassandra can map the entire size
of the open SSTables; otherwise a native OOME
If you want efficient key/row/mmapped caches, add
sufficient nodes to the cluster
60. BLOOM FILTERS
Each SSTable has one
Used to check whether a requested row key exists in the SSTable before
doing any (disk) seeks
Per row key, several hashes are generated and the corresponding buckets
are marked
On lookup, each bucket for the key's hashes is checked; if any is empty the key
does not exist
False positives are possible, but false negatives are not
(diagram: Key 1 and Key 2 hashed by Hash A, B, C into marked buckets;
different keys can share buckets, hence false positives)
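The bucket-marking scheme above fits in a few lines; a minimal sketch with arbitrarily chosen sizes, not Cassandra's filter implementation:

```python
# Minimal Bloom filter sketch: adding a key sets several hash-derived
# buckets; a lookup that finds any empty bucket is a definite miss,
# while finding all buckets set is only a probable hit.
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _buckets(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for b in self._buckets(key):
            self.bits |= 1 << b

    def might_contain(self, key: str) -> bool:
        return all(self.bits >> b & 1 for b in self._buckets(key))

bf = BloomFilter()
bf.add("row-key-1")
assert bf.might_contain("row-key-1")   # never a false negative
# absent keys usually report False; false positives remain possible
```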
61. INDEX
Primary index
Per CF
An index over the CF's row keys
Efficient access via the index summary (1 row key out of every 128 is
sampled)
In memory, on the JVM heap (moving off-heap in a later release)
(read path: Bloom filter -> key cache -> index summary -> primary
index -> offset calculation -> SSTable)
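The 1-in-128 sampling above can be sketched as a binary search over the in-memory sample; an illustration of the idea, with made-up key names:

```python
# Sketch of the primary-index summary: keep every 128th row key in
# memory, then binary-search the sample to find which on-disk index
# segment to scan for a requested key.
import bisect

SAMPLE_RATE = 128
index_keys = [f"key{i:06d}" for i in range(100_000)]  # sorted row keys
summary = index_keys[::SAMPLE_RATE]                   # in-memory sample

def segment_start(key: str) -> int:
    """Offset (in index entries) where the disk scan for `key` begins."""
    i = bisect.bisect_right(summary, key) - 1
    return max(i, 0) * SAMPLE_RATE

start = segment_start("key000300")
assert start <= 300 < start + SAMPLE_RATE   # scan at most 128 entries
```

The trade-off: memory proportional to 1/128 of the keys, at the cost of scanning at most one 128-entry segment of the full index on disk.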
62. INDEX (CONT)
Secondary index
Over column values
Supports composite types
Implemented as a hidden CF
keyed by the indexed column value,
whose columns point back to the rows
Write/update/delete operations on it are atomic
Good for values shared across many rows;
conversely, poor for unique values (-> use a dynamic CF for
indexing those)
63. COMPACTION
Combines data from SSTables
Merges row fragments
Rebuilds primary and secondary indexes
Removes expired columns marked with tombstones
Deletes the old SSTables on completion
"Minor" compactions merge only SSTables of similar size; "major" compactions
merge all SSTables in a given CF
Size-tiered compaction
Leveled compaction
Since v1.0
Based on LevelDB
Temporarily uses up to twice the space and causes spikes in disk IO
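The "similar size" grouping used by size-tiered compaction can be sketched as follows; the bucketing ratio is an illustrative parameter, not the exact Cassandra heuristic:

```python
# Sketch of size-tiered compaction's trigger: group SSTables into
# buckets of similar size; a bucket that accumulates enough tables
# (min_threshold, default 4 in Cassandra) is ripe for a minor compaction.
def buckets(sstable_sizes, bucket_ratio=2.0):
    groups = []
    for size in sorted(sstable_sizes):
        for g in groups:
            if size <= g[0] * bucket_ratio:   # similar size -> same bucket
                g.append(size)
                break
        else:
            groups.append([size])
    return groups

sizes = [10, 11, 12, 13, 100, 110, 1000]
tiers = buckets(sizes)
assert [10, 11, 12, 13] in tiers   # four similar tables: compaction candidate
assert [100, 110] in tiers
```

Merging similar-sized tables is what keeps each row's fragments from spreading across wildly different generations of SSTables.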
64. ARCHITECTURE
Write: no race conditions, not bound by disk IO
Read: slower than write, but fast (DHT, caches …)
Load balancing
Virtual nodes
Replication
Multi-DC
73. RESOURCE
Memory
Off-heap & Heap
OOME Problem
CPU
GC
Hashing
Compression / Compaction
Network Handling
Context Switching
Lazy Problem
IO
Bottleneck for everything
74. MEMORY
Heap (GC management)
Permanent (-XX:PermSize, -XX:MaxPermSize)
JVM Heap (-Xmx, -Xms, -Xmn)
C-Heap (=Native Heap)
OS Shared
Thread Stack (-Xss)
Objects accessed via JNI
Off-Heap
OS Shared
GC managed by Cassandra
75. MEMORY (CONT)
Heap
Permanent
JVM Heap
Memtable
KeyCache
IndexSummary (moving off-heap in a later
release)
Buffer
Transport
Socket
Disk
C-Heap
Thread Stack
File Memory Map (Virtual space)
Data / Index buffer (default)
CommitLog
v1.2
Off-Heap (OS shared)
RowCache
BloomFilter
Index -> CompressionMetaData -> ChunkOffset
76. MEMORY (CONT)
Memtable
Managed
Total size (default 1/3 of the JVM heap; the largest memtable per CF is flushed when reached)
Emergency: if heap usage stays above a fraction of the max after a full GC (CMS), flush the
largest memtable (each time) -> prevents full GC / OOME
KeyCache
Managed
Total size (100 MB or 5% of the max heap)
Emergency: if heap usage stays above a fraction of the max after a full GC (CMS),
reduce the max cache size -> prevents full GC / OOME
RowCache / CommitLog
Managed
Total size (disabled by default) -> prevents OOME
77. MEMORY (CONT)
Thread stacks
Not managed
But -Xss is set to 180k (default)
Check Thrift's (transport-level RPC server) serving type (sync,
hsha, async (has bugs))
Set min/max threads for connections (default unlimited)
v1.2
78. MEMORY (CONT)
Transport buffer
Thrift
Supports many languages and cross-language use
Provides server/client interfaces and serialization
An Apache project, created by Facebook
Framed buffer (default max 16 MB, variable size)
4k, 16k, 32k, … 16 MB
Determined by the client
Per connection
Adjust the max frame buffer size (client, server)
Set min/max threads for connections (default unlimited)
v1.2
(diagram: client <-> Thrift <-> data service)
79. MEMORY (LAST)
C-Heap / Off-Heap
Shared with the OS -> other applications can cause problems
File memory map (virtual space)
Reclaimed on full GC
0 <= total size <= the size of the open SSTables
If allocation fails -> native OOME
But
Generally only a limited portion of each SSTable is accessed
GC makes space
Worst case (if an OOME occurs):
yaml -> disk_access_mode: standard (restart required)
Add sufficient nodes
yaml -> disk_access_mode: auto after joining
v1.2
80. CPU
GC
CMS
Marking phase: low thread priority -> but a high usage rate (not a problem)
CMSInitiatingOccupancyFraction is 75 (default)
UseCMSInitiatingOccupancyOnly
Full GC
Frequency is what matters -> may indicate a problem (e.g. the Thrift transport buffer)
Add nodes, or analyze memory usage and adjust the configuration
Minor GC
It's OK
Compaction
Running slowly is fine,
so lower its priority with "-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1"
Sustained high CPU load? -> that is when you need to add nodes
81. SWAPPING
Swapping is a big problem for real-time applications
IO blocks -> threads block -> gossip/compaction/flush … delayed ->
causing further problems
Disable swapping or keep it to a minimum
Disable the swap partition
Or enable JNA + kernel configuration
JNA: mlockall (keeps heap memory in physical memory)
Kernel
vm.swappiness=0 (under memory pressure, swapping is still possible)
vm.overcommit_memory=1
Or vm.overcommit_memory=2 (overcommit managed)
vm.overcommit_ratio=? (e.g. 0.75)
Max memory = swap partition size + ratio * physical memory size
E.g.: 8G = 2G + 0.75 * 8G
82. MONITORING
System Monitoring
CPU / Memory / Disk
Nagios, Ganglia, Cacti, Zabbix
Network Monitoring
Per Client
NfSen (network flow monitoring, see:
http://nfsen.sourceforge.net/#mozTocId376385)
Cluster Monitoring / Maintaining
OpsCenter
83. CHECK THREAD
Use the “top” command
Press “H” to show the per-thread view
Press “P” to sort by CPU usage rate
Pick the PID of the heaviest thread
Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)
Run “jstack <parent PID> > filename.log” to save the Java stack to a file
Search for the PID in hex, e.g.
313C
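The decimal-to-hex step above can also be done without a web converter; a quick Python check (jstack prints thread ids as lowercase hex "nid" values):

```python
# Convert a thread PID to hex for matching against jstack output.
pid = 12604
assert format(pid, "x") == "313c"    # the hex value to search for
assert int("313c", 16) == 12604      # and back again
```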
84. CHECK HEAP
Use a heap dump file produced by “jmap” or at OOME
Use “jhat” or another tool to analyze it
Check [B (byte arrays)
and the objects referencing them