Cassandra is a leading wide column family database. In these slides, we discuss Cassandra's internals and what makes it an excellent choice for write-heavy workloads. We also delve into how to get the most out of its read side as well.
2. CONTENTS
● Why NoSQL
● Features of Cassandra
● Gossip Protocol
● Data Distribution in Cassandra
● Write Path
● Read Path
3. WHY NOSQL
● Within corporations, around 80% of data is
unstructured.
● RDBMSs face availability and scalability issues.
● NoSQL databases offer horizontal scalability and high
availability, in some cases at the cost of strong
consistency and ACID semantics.
14. Cassandra Architecture
In Cassandra, all the nodes are identical.
A Cassandra cluster has no special nodes, i.e. the
cluster has no masters, slaves, or elected leaders.
16. Tracking Nodes
Let's see how Cassandra keeps track of nodes in a
cluster.
● Gossip Protocol
● Snitches
17. Gossip protocol
A node (the initiator) in a cluster chooses a peer
randomly to gossip with.
It sends the metadata it has about itself and other
nodes in the cluster, and receives the metadata/updates
that the peer has.
24. Main points
● Every node gossips with every other node in a
cluster every second.
● The Gossiper class maintains a list of nodes that
are alive and dead.
● The gossiper runs every second on a timer on
every node of a cluster.
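The round structure above can be sketched in a few lines of Python. This is a toy model, not Cassandra's actual Gossiper: the node names, the heartbeat-version map, and the merge rule are all illustrative.

```python
import random

# Toy gossip model: each node tracks a heartbeat version for every
# peer it has heard about, and exchanges keep the newest version.
class Node:
    def __init__(self, name):
        self.name = name
        self.state = {name: 1}  # peer name -> latest heartbeat version seen

    def tick(self):
        # Each round, a node bumps its own heartbeat version.
        self.state[self.name] += 1

    def gossip_with(self, peer):
        # Exchange metadata; each side keeps the newest version per node.
        merged = {k: max(self.state.get(k, 0), peer.state.get(k, 0))
                  for k in set(self.state) | set(peer.state)}
        self.state = dict(merged)
        peer.state = dict(merged)

nodes = [Node(n) for n in ("A", "B", "C")]
for _ in range(5):  # five one-second gossip rounds
    for node in nodes:
        node.tick()
        node.gossip_with(random.choice([p for p in nodes if p is not node]))
```

After a handful of rounds, knowledge of every node's latest heartbeat has spread through the cluster with high probability, which is why gossip converges quickly despite each node talking to only one random peer per round.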
26. Snitches
The job of a snitch is to determine relative host
proximity for each node in a cluster, which is used to
determine which nodes to read from and write to.
27. Example: Snitch in Read Operation
When reading data, Cassandra must contact a number
of replicas determined by the consistency level. For
fast read operations, it selects a single replica to
query for the full object, and requests hash digests from
the others to ensure the latest version of the
requested data is returned.
The snitch finds the closest replica, and the coordinator
node queries it for the full data.
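A digest read can be sketched like this. The replica names, the numeric `distance` standing in for the snitch's proximity ranking, and the use of MD5 as the digest are all assumptions for illustration.

```python
import hashlib

# Toy model of a digest read: one full read from the closest replica,
# digest-only reads from the rest.
replicas = {
    "node1": {"value": "v2", "distance": 1},  # closest replica
    "node2": {"value": "v2", "distance": 5},
    "node3": {"value": "v2", "distance": 9},
}

def digest(value):
    # Hash of the value, standing in for a replica's digest response.
    return hashlib.md5(value.encode()).hexdigest()

# The snitch picks the closest replica for the full read...
closest = min(replicas, key=lambda n: replicas[n]["distance"])
full_value = replicas[closest]["value"]

# ...and the coordinator asks the remaining replicas only for digests.
digests_match = all(
    digest(replicas[n]["value"]) == digest(full_value)
    for n in replicas if n != closest
)
```

If the digests all match, the coordinator returns the full value; a mismatch means some replica is stale and triggers reconciliation.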
31. Rings and Tokens
● Each node in the ring is assigned one or more
ranges of data described by a token, which
determines its position in the ring.
● A token is a 64-bit integer ID used to identify each
partition.
32. Partitioners
● A partitioner is a hash function for computing the
token of a partition key.
● Each row of data is distributed within the ring
according to the value of the partition key token
calculated by the partitioner at every node.
● Murmur3Partitioner is the default partitioner.
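The idea can be sketched as follows. MD5 stands in for MurmurHash3 purely so the example is self-contained; only the shape matters: partition key in, signed 64-bit token out.

```python
import hashlib

# Partitioner sketch: hash the partition key down to a signed 64-bit
# integer token, the same range Murmur3Partitioner tokens live in.
def token(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

The same key always maps to the same token, which is what lets every node independently compute where a row belongs.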
33. Virtual Nodes
● Cassandra’s 1.2 release introduced the concept of
virtual nodes: instead of assigning a single token
to a node, many tokens are assigned.
● By default, each node will be assigned 256 of
these tokens, meaning that it contains 256 virtual
nodes.
35. Advantages
● Tokens are generated automatically by cassandra.
● Smaller Partitions.
● Less load on nodes.
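A minimal vnode ring might look like this, assuming random token assignment and MD5 in place of Murmur3; 8 tokens per node stand in for the default 256.

```python
import bisect
import hashlib
import random

random.seed(7)   # deterministic token assignment for the example
NUM_TOKENS = 8   # Cassandra's default num_tokens is 256; fewer here

# Each physical node is assigned many tokens (virtual nodes).
ring = []  # sorted (token, node) pairs
for node in ("A", "B", "C"):
    for _ in range(NUM_TOKENS):
        ring.append((random.randrange(-2**63, 2**63), node))
ring.sort()
tokens = [t for t, _ in ring]

def owner(key: str) -> str:
    # Hash the key (MD5 standing in for Murmur3) and walk clockwise to
    # the first vnode at or past the key's token, wrapping at the end.
    t = int.from_bytes(hashlib.md5(key.encode()).digest()[:8],
                       "big", signed=True)
    i = bisect.bisect_left(tokens, t) % len(ring)
    return ring[i][1]
```

Because each node holds many small token ranges scattered around the ring, adding or removing a node redistributes load across all remaining nodes rather than dumping it on one neighbor.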
36. Replication Strategies
● Cassandra replicates data across nodes in a
manner transparent to the user, and the replication
factor is the number of nodes in your cluster that
will receive copies (replicas) of the same data.
● If your replication factor is 3, then three nodes in
the ring will have copies of each row.
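SimpleStrategy's placement can be sketched as walking the ring clockwise from the key's token and collecting the first RF distinct physical nodes; the tokens and node names below are made up for illustration.

```python
import bisect

# Toy ring: sorted (token, node) pairs; each physical node appears at
# several positions, as with vnodes.
ring = sorted([(10, "A"), (30, "B"), (50, "C"), (70, "A"), (90, "B")])
tokens = [t for t, _ in ring]

def replicas_for(key_token, rf):
    # Start at the first ring position past the key's token, then walk
    # clockwise collecting rf *distinct* physical nodes.
    start = bisect.bisect_right(tokens, key_token) % len(ring)
    chosen = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen
```

With RF = 3 on this three-node ring, every row ends up on all three nodes, matching the slide's example.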
38. Consistency Levels
● For read queries, the consistency level specifies
how many replica nodes must respond to a read
request before returning the data.
● For write operations, the consistency level
specifies how many replica nodes must respond
for the write to be reported as successful to the
client.
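The arithmetic behind these levels is simple. A sketch, using Cassandra's level names but an invented `write_succeeds` helper:

```python
# QUORUM means a majority of replicas must acknowledge. With RF = 3,
# two acknowledgements satisfy both reads and writes; QUORUM reads
# plus QUORUM writes always overlap in at least one replica (2 + 2 > 3).
def quorum(rf: int) -> int:
    return rf // 2 + 1

def write_succeeds(acks: int, rf: int, level: str) -> bool:
    needed = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]
    return acks >= needed
```

The same counting applies on the read side: the consistency level just sets how many replica responses the coordinator waits for.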
43. Tombstones
When you execute a delete operation, the data is not
immediately deleted. Instead, it’s treated as an
update operation that places a tombstone on the
value. A tombstone is a deletion marker that is
required to suppress older data in SSTables until
compaction can run.
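Reconciliation with tombstones can be sketched as last-write-wins on cell timestamps, with `None` standing in for the tombstone marker:

```python
# A cell is (timestamp, value); value None marks a tombstone written
# by a delete. The newest cell wins on read.
def reconcile(cells):
    """Return the visible value, or None if a tombstone wins."""
    ts, value = max(cells, key=lambda c: c[0])
    return value

history = [
    (100, "alice"),  # original insert
    (200, "bob"),    # update
    (300, None),     # delete: the tombstone shadows the older values
]
```

In real Cassandra the tombstone is retained for `gc_grace_seconds` before compaction drops it, so replicas that missed the delete cannot resurrect the old value.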
46. Bloom Filters
● Bloom filters condense a larger data set into a
digest string using a hash function.
● The digest strings are stored in memory and are
used to improve performance by reducing the
need for disk access on key lookups.
● So a Bloom filter is a special kind of cache. When
a query is performed, the Bloom filter is checked
first before accessing disk.
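A minimal Bloom filter sketch follows; the sizing and the salted-MD5 hashing scheme are illustrative, not Cassandra's actual implementation.

```python
import hashlib

# k hash functions set/check k bits in a fixed-size bit array.
class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False => definitely absent (the SSTable can be skipped);
        # True  => probably present (a false positive is possible).
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row-1")
```

A Bloom filter never gives false negatives, which is exactly the property that makes it safe to skip disk reads for keys it reports as absent.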
48. Replica synchronization
Read repair refers to the synchronization of replicas
as data is read. If, during a read, any replicas have
out-of-date values, a read repair is performed immediately
to update the out-of-date replicas.
Anti-entropy repair (manual repair) is a manually
initiated operation performed on nodes as part of a
regular maintenance process. This type of repair is
executed by running nodetool repair on a node, which
executes a major compaction.
Under high loads, joins make our queries slow,
so we tend to denormalize our tables.
Big companies use it to effectively manage their big data.
It started with Facebook's Inbox search in 2009.
In Cassandra we have a cluster, which is a group of several nodes.
A node is a Cassandra server/instance that we run on a machine.
There is no master-slave architecture in Cassandra and no special nodes; every node is the same and has the same responsibilities.
No single point of failure means that if any node in the cluster fails, it does not affect any functionality (reads/writes) of Cassandra.
Cassandra stores replicas on various nodes, so if a node fails, the data belonging to that node can still be retrieved.
If we add nodes to our cluster, the throughput increases linearly without degrading performance.
Cassandra handles heavy data loads gracefully.
We set the replication factor per keyspace in Cassandra.
Replication factor = how many replicas we want for our data in the system.
Consistency can be set per read/write query.
Cassandra offers partition tolerance and availability and is eventually consistent.
A row is indexed by its partition key and can be searched only by partition key.
We define the partition key while defining the table itself.
We have to set a replication factor and strategy for every keyspace in Cassandra.
So how do nodes in Cassandra store information about other nodes in a cluster?
Via a communication protocol: gossip.
Explain replicas for read and write path.
The partitioner is present at every node of the cluster.
This partition key token generated by the partitioner is compared to the token values for the various nodes to identify the range, and therefore the node, that owns the data.
Token ranges are represented by the org.apache.cassandra.dht.Range class.
Early versions of Cassandra assigned a single token to each node, in a fairly static manner, requiring you to calculate tokens for each node.
To understand read and write paths we must understand Replication Strategies and consistency level.
SimpleStrategy is for a single data center only. If you ever intend to have more than one data center, use the NetworkTopologyStrategy.
Because Cassandra is eventually consistent, updates to other replica nodes may continue in the background. ALL, QUORUM, ONE are some of the consistency levels available.
Consistency level can be configured on a cluster, datacenter, or individual I/O operation basis. Consistency among participating nodes can be set globally and also controlled on a per-operation basis (for example insert or update) using Cassandra’s drivers and client libraries.
Suppose a write request is sent to Cassandra, but a replica node where the write belongs is not available. The coordinator will then create a hint for that node and store it; once it detects via gossip that the node is back online, the coordinator will send the hint to it.
Consider a cluster consisting of three nodes, A, B, and C, with a replication factor of 3. When a row K is written to the coordinator (node A in this case), even if node C is down, a consistency level of ONE or QUORUM can be met. Why? Both nodes A and B will receive the data, so the consistency-level requirement is met. A hint is stored for node C and written when node C comes up. In the meantime, the coordinator can acknowledge that the write succeeded.
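The example can be sketched as follows; `hints`, `write`, and `node_back_online` are invented names modeling only the coordinator's bookkeeping.

```python
# Hinted handoff sketch: unavailable replicas get a stored hint
# instead of the write, replayed when gossip sees them come back.
hints = {}  # down node -> list of pending mutations

def write(replica_nodes, alive, mutation, required_acks):
    acks = 0
    for node in replica_nodes:
        if node in alive:
            acks += 1  # replica applied the write
        else:
            hints.setdefault(node, []).append(mutation)  # store a hint
    return acks >= required_acks

def node_back_online(node):
    # Gossip detected the node again: hand its hints back for replay.
    return hints.pop(node, [])

# The slides' example: RF = 3, node C down, QUORUM (2 of 3) still met.
ok = write(["A", "B", "C"], alive={"A", "B"},
           mutation="K=v1", required_acks=2)
```

The write succeeds immediately from the client's point of view, while the hint quietly brings node C up to date later.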
A compaction operation in Cassandra is performed in order to merge SSTables.
During compaction, the data in SSTables is merged: the keys are merged, columns are combined, tombstones are discarded, and a new index is created.
Compaction is the process of freeing up space by merging large accumulated data files.
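The merge step can be sketched over toy SSTables, modeled here as dicts of key → (timestamp, value) with `None` as a tombstone; real compaction is far more involved, but the key-merge idea is the same.

```python
# Compaction sketch: keep only the newest cell per key across all
# input SSTables, then drop keys whose winning cell is a tombstone.
def compact(*sstables):
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Discard tombstones: the deletion marker has done its job.
    return {k: v for k, v in merged.items() if v[1] is not None}

old = {"k1": (100, "a"), "k2": (100, "b")}
new = {"k1": (200, None), "k3": (150, "c")}  # k1 was deleted later
```

Running `compact(old, new)` keeps `k2` and `k3` and drops `k1`, whose newest cell is the tombstone.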