4. INTRODUCTION: SELECTED CASES
Who uses Cassandra?
eBay has Cassandra supporting multiple
applications (Social Signals, Hunch, and many
time series use cases) with clusters spanning
several data centers.
Netflix is using Cassandra on AWS as a key
infrastructure component of its globally distributed
streaming product.
Shazam uses a Cassandra cluster to power its
recommendation system.
And many others; see http://www.datastax.com/cassandrausers
5. INTRODUCTION: MAIN ADVANTAGES
The main advantages of Cassandra are:
• Fast writes.
• Tunable consistency.
• Decentralization.
• Integration with Hadoop.
7. ARCHITECTURE: FAST WRITES
Cassandra is very fast on writes because it uses a
log-structured merge tree (LSM-tree).
[Figure: the process of inserting a new record into Cassandra]
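The write path of a log-structured merge tree can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the class name TinyLSM, the memtable_limit parameter, and the in-memory lists standing in for disk files are all assumptions made for illustration.

```python
class TinyLSM:
    """Toy log-structured store: commit log + memtable + immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list, stands in for the on-disk commit log
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # flushed tables, oldest first; never modified after flush
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        # A write is one sequential append plus one in-memory update,
        # so no random disk I/O is needed on the write path.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable table and start a new one.
        # (Real systems also recycle commit log segments here; omitted.)
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


db = TinyLSM(memtable_limit=2)
db.write("a", 1)
db.write("b", 2)   # second write reaches the limit and triggers a flush
db.write("a", 3)   # newer value for "a" lives in the fresh memtable
```

Note that the old value of "a" is never overwritten in place; the read simply finds the newer one first. That is what keeps writes cheap and pushes the merging work to reads and background compaction.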
8. ARCHITECTURE: FAST WRITES
How the LSM-tree is implemented: memtables and SSTables
1. Commit log – all data is written to the
commit log for durability.
2. SSTables are immutable. A row is typically stored across multiple
SSTable files.
3. Each SSTable has a bloom filter associated with it. The
bloom filter is used to check whether a requested row key may exist in
the SSTable before doing any disk seeks.
4. Deleted data is not immediately removed from disk; it is marked with
a tombstone. A deleted column can reappear if a replica missed the delete
and the tombstone is purged before that replica is repaired.
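The bloom filter check in front of each SSTable can be illustrated with a small Python sketch. The sizes, the MD5-based hashing scheme, and the names BloomFilter and read_row are assumptions chosen for brevity; Cassandra's own filter implementation differs.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def read_row(sstables, key):
    """sstables: list of (BloomFilter, dict) pairs, newest first."""
    for bloom, rows in sstables:
        if not bloom.might_contain(key):
            continue  # definitely absent: skip this table without touching its data
        if key in rows:
            return rows[key]
    return None
```

Because a negative answer is always correct, the filter can only save disk seeks; a false positive costs one wasted lookup but never returns wrong data.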
9. ARCHITECTURE: NETWORK ARCHITECTURE
• All nodes are peers
(no master).
• The client specifies a set of Cassandra nodes and
connects to the first live node.
• Nodes use a gossip protocol to exchange cluster state.
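The idea behind gossip can be sketched as follows: each node keeps a versioned view of the cluster and periodically merges it with a randomly chosen peer. This is a simplified model under stated assumptions; the (version, status) tuples and the function names are hypothetical, and Cassandra's real protocol adds heartbeats, generations, and failure detection.

```python
import random


def merge(mine, theirs):
    """Keep, per node, the entry with the highest version number."""
    merged = dict(mine)
    for node, (version, status) in theirs.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, status)
    return merged


def gossip_round(views, rng):
    """Each node exchanges state with one randomly chosen peer (push-pull)."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        combined = merge(views[name], views[peer])
        views[name] = combined
        views[peer] = dict(combined)


# Two nodes that initially know only about themselves.
views = {
    "n1": {"n1": (1, "up")},
    "n2": {"n2": (1, "up")},
}
gossip_round(views, random.Random(0))
```

With no master to ask, every node learns about the others only through these pairwise exchanges, so information spreads through the cluster in a few rounds.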
11. PARTITIONING & REPLICATION: DATA PARTITIONING
Partitioner – determines where the first replica of a row lives in the ring.
• RandomPartitioner – default strategy; spreads load roughly evenly
across all nodes.
• ByteOrderedPartitioner – orders rows lexically by key bytes; allows
range scans, but is not recommended because it can create hot spots.
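How a hash-based partitioner maps a row key to a node on the ring can be sketched like this. The token function is only in the spirit of RandomPartitioner (an MD5-derived integer); the exact token range and the names token_for, owner, and first_replica are assumptions for illustration.

```python
import bisect
import hashlib


def token_for(key):
    # Hash the row key to a large integer token (simplified, MD5-based).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def owner(ring, token):
    """ring: list of (token, node) sorted by token. Returns the owning node."""
    tokens = [t for t, _ in ring]
    # First node whose token is >= the given token; wrap around past the end.
    i = bisect.bisect_left(tokens, token) % len(ring)
    return ring[i][1]


def first_replica(ring, key):
    return owner(ring, token_for(key))


# Four nodes with evenly spaced tokens over the 128-bit hash space.
ring = [(i * 2**128 // 4, f"node{i}") for i in range(4)]
node = first_replica(ring, "jsmith")
```

Because the hash scatters keys uniformly, each node receives a similar share of rows, at the cost of losing key order (and therefore ordered range scans).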
12. PARTITIONING & REPLICATION: REPLICATION
Replication = replication factor
+ replica placement strategy
Replica placement strategy:
SimpleStrategy:
• default strategy;
• does not take network topology into account.
NetworkTopologyStrategy:
• preferred when you know the network layout (data centers and racks)
of your nodes.
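SimpleStrategy's placement rule is easy to sketch: starting from the node chosen by the partitioner, additional replicas go to the next nodes clockwise on the ring. The function name and arguments below are hypothetical, and the rack- and data-center-aware logic of NetworkTopologyStrategy is not shown.

```python
def simple_strategy_replicas(ring_nodes, first_index, replication_factor):
    """ring_nodes: nodes in ring (token) order.

    Walk clockwise from the first replica, ignoring racks and data centers.
    """
    count = min(replication_factor, len(ring_nodes))  # cannot exceed cluster size
    return [ring_nodes[(first_index + i) % len(ring_nodes)] for i in range(count)]


nodes = ["a", "b", "c", "d"]
replicas = simple_strategy_replicas(nodes, first_index=3, replication_factor=3)
```

Since neighbors on the ring may share a rack or data center, this rule gives no protection against a correlated failure, which is why the topology-aware strategy is preferred when the layout is known.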
14. DATA MANAGEMENT: DATA ACCESSING
READS + WRITES:
• Tunable consistency. The consistency level specifies
how many replicas must respond to a read or write
request (writes are still sent to all replicas).
• Batches – set a global consistency level and a
client-supplied timestamp for all columns
written by the statements in the batch.
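The arithmetic behind tunable consistency can be made concrete with a short sketch. The level names ONE, TWO, THREE, QUORUM, and ALL are Cassandra consistency levels; the helper functions themselves are hypothetical and cover only these levels.

```python
def quorum(replication_factor):
    # A quorum is a strict majority of the replicas.
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Number of replicas that must respond at the given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def reads_see_latest_write(read_level, write_level, replication_factor):
    # The read set and write set are guaranteed to overlap when R + W > RF.
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM reads plus QUORUM writes give 2 + 2 > 3, so every read overlaps the latest write; ONE plus ONE gives 1 + 1, which does not, trading consistency for lower latency.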
15. DATA MANAGEMENT: ACID
• Atomicity – writes are atomic at the row level.
• Consistency – tunable consistency.
• Isolation – writes are invisible until they are
complete.
• Durability – writes are durable.
• Supporting mechanisms: read repair, anti-entropy node repair, hinted
handoff.
17. DATA MODEL: CASSANDRA'S DATA MODEL
Relational databases – you design the
schema based on entities and
relationships.
Cassandra – you design the schema based
on the queries you want to
perform.
18. DATA MODEL: INDEXES
An index is a data structure that allows for
fast, efficient lookup of data matching a given
condition.
Primary key – the unique key used to identify
each row in a table.
Secondary indexes – refer to indexes on
column values.
19. DATA MODEL: CQL3
cqlsh> INSERT INTO users (user_name, password)
       VALUES ('jsmith', 'ch@ngem3a');
cqlsh> SELECT * FROM users WHERE user_name='jsmith';

 user_name | password  | state
-----------+-----------+-------
    jsmith | ch@ngem3a |  null