Cassandra is a leading wide column family database. In these slides, we discuss Cassandra's internals and what makes it an excellent choice for write-heavy workloads. We also delve into how to get the most out of its read side as well.
2. CONTENTS
● Why NoSQL
● Features of Cassandra
● Gossip Protocol
● Data Distribution in Cassandra
● Write Path
● Read Path
3. WHY NOSQL
● Within corporations, around 80% of data is
unstructured.
● RDBMSs face availability and scalability issues.
● NoSQL databases offer horizontal scalability and high
availability, in some cases at the cost of strong
consistency and ACID semantics.
14. Cassandra Architecture
In Cassandra, all the nodes are identical.
A Cassandra cluster has no special nodes, i.e. the
cluster has no masters, slaves, or elected leaders.
16. Tracking Nodes
Let's see how Cassandra keeps track of nodes in a
cluster.
● Gossip Protocol
● Snitches
17. Gossip protocol
A node (the initiator) in a cluster chooses a peer
randomly to gossip with.
It sends the metadata it has about itself and other
nodes in the cluster, and receives the metadata/updates
that the peer has.
24. Main points
● Every node gossips with every other node in a
cluster every second.
● The Gossiper class maintains a list of nodes that
are alive and dead.
● The gossiper runs every second on a timer on
every node of a cluster.
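The round structure above can be sketched in a few lines of Python. This is a toy model, not Cassandra's actual Gossiper: the node names, the heartbeat-version map, and the merge rule are all illustrative.

```python
import random

# Toy gossip model: each node tracks a heartbeat version for every
# peer it has heard about, and exchanges keep the newest version.
class Node:
    def __init__(self, name):
        self.name = name
        self.state = {name: 1}  # peer name -> latest heartbeat version seen

    def tick(self):
        # Each round, a node bumps its own heartbeat version.
        self.state[self.name] += 1

    def gossip_with(self, peer):
        # Exchange metadata; each side keeps the newest version per node.
        merged = {k: max(self.state.get(k, 0), peer.state.get(k, 0))
                  for k in set(self.state) | set(peer.state)}
        self.state = dict(merged)
        peer.state = dict(merged)

nodes = [Node(n) for n in ("A", "B", "C")]
for _ in range(5):  # five one-second gossip rounds
    for node in nodes:
        node.tick()
        node.gossip_with(random.choice([p for p in nodes if p is not node]))
```

After a handful of rounds, knowledge of every node's latest heartbeat has spread through the cluster with high probability, which is why gossip converges quickly despite each node talking to only one random peer per round.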
26. Snitches
The job of a snitch is to determine relative host
proximity for each node in a cluster, which is used to
determine which nodes to read from and write to.
27. Example: Snitch in Read Operation
When reading data, Cassandra must contact a number
of replicas determined by the consistency level. For
fast read operations, it selects a single replica to
query for the full object, and requests hash digests from
the others to ensure the latest version of the
requested data is returned.
The snitch finds the closest replica, and the coordinator
node queries it for the full data.
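A digest read can be sketched like this. The replica names, the numeric `distance` standing in for the snitch's proximity ranking, and the use of MD5 as the digest are all assumptions for illustration.

```python
import hashlib

# Toy model of a digest read: one full read from the closest replica,
# digest-only reads from the rest.
replicas = {
    "node1": {"value": "v2", "distance": 1},  # closest replica
    "node2": {"value": "v2", "distance": 5},
    "node3": {"value": "v2", "distance": 9},
}

def digest(value):
    # Hash of the value, standing in for a replica's digest response.
    return hashlib.md5(value.encode()).hexdigest()

# The snitch picks the closest replica for the full read...
closest = min(replicas, key=lambda n: replicas[n]["distance"])
full_value = replicas[closest]["value"]

# ...and the coordinator asks the remaining replicas only for digests.
digests_match = all(
    digest(replicas[n]["value"]) == digest(full_value)
    for n in replicas if n != closest
)
```

If the digests all match, the coordinator returns the full value; a mismatch means some replica is stale and triggers reconciliation.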
31. Rings and Tokens
● Each node in the ring is assigned one or more
ranges of data described by a token, which
determines its position in the ring.
● A token is a 64-bit integer ID used to identify each
partition.
32. Partitioners
● A partitioner is a hash function for computing the
token of a partition key.
● Each row of data is distributed within the ring
according to the value of the partition key token
calculated by the partitioner at every node.
● Murmur3Partitioner is the default partitioner.
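The idea can be sketched as follows. MD5 stands in for MurmurHash3 purely so the example is self-contained; only the shape matters: partition key in, signed 64-bit token out.

```python
import hashlib

# Partitioner sketch: hash the partition key down to a signed 64-bit
# integer token, the same range Murmur3Partitioner tokens live in.
def token(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

The same key always maps to the same token, which is what lets every node independently compute where a row belongs.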
33. Virtual Nodes
● Cassandra’s 1.2 release introduced the concept of
virtual nodes: instead of assigning a single token
to a node, many tokens are assigned.
● By default, each node will be assigned 256 of
these tokens, meaning that it contains 256 virtual
nodes.
35. Advantages
● Tokens are generated automatically by cassandra.
● Smaller Partitions.
● Less load on nodes.
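A minimal vnode ring might look like this, assuming random token assignment and MD5 in place of Murmur3; 8 tokens per node stand in for the default 256.

```python
import bisect
import hashlib
import random

random.seed(7)   # deterministic token assignment for the example
NUM_TOKENS = 8   # Cassandra's default num_tokens is 256; fewer here

# Each physical node is assigned many tokens (virtual nodes).
ring = []  # sorted (token, node) pairs
for node in ("A", "B", "C"):
    for _ in range(NUM_TOKENS):
        ring.append((random.randrange(-2**63, 2**63), node))
ring.sort()
tokens = [t for t, _ in ring]

def owner(key: str) -> str:
    # Hash the key (MD5 standing in for Murmur3) and walk clockwise to
    # the first vnode at or past the key's token, wrapping at the end.
    t = int.from_bytes(hashlib.md5(key.encode()).digest()[:8],
                       "big", signed=True)
    i = bisect.bisect_left(tokens, t) % len(ring)
    return ring[i][1]
```

Because each node holds many small token ranges scattered around the ring, adding or removing a node redistributes load across all remaining nodes rather than dumping it on one neighbor.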
36. Replication Strategies
● Cassandra replicates data across nodes in a
manner transparent to the user, and the replication
factor is the number of nodes in your cluster that
will receive copies (replicas) of the same data.
● If your replication factor is 3, then three nodes in
the ring will have copies of each row.
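SimpleStrategy's placement can be sketched as walking the ring clockwise from the key's token and collecting the first RF distinct physical nodes; the tokens and node names below are made up for illustration.

```python
import bisect

# Toy ring: sorted (token, node) pairs; each physical node appears at
# several positions, as with vnodes.
ring = sorted([(10, "A"), (30, "B"), (50, "C"), (70, "A"), (90, "B")])
tokens = [t for t, _ in ring]

def replicas_for(key_token, rf):
    # Start at the first ring position past the key's token, then walk
    # clockwise collecting rf *distinct* physical nodes.
    start = bisect.bisect_right(tokens, key_token) % len(ring)
    chosen = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen
```

With RF = 3 on this three-node ring, every row ends up on all three nodes, matching the slide's example.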
38. Consistency Levels
● For read queries, the consistency level specifies
how many replica nodes must respond to a read
request before returning the data.
● For write operations, the consistency level
specifies how many replica nodes must respond
for the write to be reported as successful to the
client.
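The arithmetic behind these levels is simple. A sketch, using Cassandra's level names but an invented `write_succeeds` helper:

```python
# QUORUM means a majority of replicas must acknowledge. With RF = 3,
# two acknowledgements satisfy both reads and writes; QUORUM reads
# plus QUORUM writes always overlap in at least one replica (2 + 2 > 3).
def quorum(rf: int) -> int:
    return rf // 2 + 1

def write_succeeds(acks: int, rf: int, level: str) -> bool:
    needed = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]
    return acks >= needed
```

The same counting applies on the read side: the consistency level just sets how many replica responses the coordinator waits for.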
43. Tombstones
When you execute a delete operation, the data is not
immediately deleted. Instead, it’s treated as an
update operation that places a tombstone on the
value. A tombstone is a deletion marker that is
required to suppress older data in SSTables until
compaction can run.
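Reconciliation with tombstones can be sketched as last-write-wins on cell timestamps, with `None` standing in for the tombstone marker:

```python
# A cell is (timestamp, value); value None marks a tombstone written
# by a delete. The newest cell wins on read.
def reconcile(cells):
    """Return the visible value, or None if a tombstone wins."""
    ts, value = max(cells, key=lambda c: c[0])
    return value

history = [
    (100, "alice"),  # original insert
    (200, "bob"),    # update
    (300, None),     # delete: the tombstone shadows the older values
]
```

In real Cassandra the tombstone is retained for `gc_grace_seconds` before compaction drops it, so replicas that missed the delete cannot resurrect the old value.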
46. Bloom Filters
● Bloom filters condense a larger data set into a
digest string using a hash function.
● The digest strings are stored in memory and are
used to improve performance by reducing the
need for disk access on key lookups.
● So a Bloom filter is a special kind of cache. When
a query is performed, the Bloom filter is checked
first before accessing disk.
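A minimal Bloom filter sketch follows; the sizing and the salted-MD5 hashing scheme are illustrative, not Cassandra's actual implementation.

```python
import hashlib

# k hash functions set/check k bits in a fixed-size bit array.
class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False => definitely absent (the SSTable can be skipped);
        # True  => probably present (a false positive is possible).
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row-1")
```

A Bloom filter never gives false negatives, which is exactly the property that makes it safe to skip disk reads for keys it reports as absent.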
48. Replica synchronization
Read repair refers to the synchronization of replicas
as data is read. If, during a read, any replicas have
out-of-date values, a read repair is performed immediately
to update the out-of-date replicas.
Anti-entropy repair (manual repair) is a manually
initiated operation performed on nodes as part of a
regular maintenance process. This type of repair is
executed by running nodetool repair on a node, which
executes a major compaction.
Under high loads, joins make our queries slow,
so we tend to denormalize our tables.
Big companies use it to effectively manage their big data.
It started with Facebook's Inbox search in 2009.
In Cassandra we have a cluster, which is a group of several nodes.
A node is a Cassandra server/instance that we run on a machine.
There is no master-slave architecture in Cassandra and no special nodes; every node is the same and has the same responsibilities.
No single point of failure means that if any node in the cluster fails, it does not affect any functionality (reads/writes) of Cassandra.
Cassandra stores replicas on various nodes, so if a node fails, the data belonging to that node can still be retrieved.
If we add nodes to our cluster, the throughput increases linearly without degrading performance.
Cassandra handles heavy data loads gracefully.
We set the replication factor per keyspace in Cassandra.
Replication factor = how many replicas we want for our data in the system.
Consistency can be set per read/write query.
Cassandra offers partition tolerance and availability and is eventually consistent.
A row is indexed by its partition key and can be searched only by partition key.
We define the partition key while defining the table itself.
We have to set a replication factor and strategy for every keyspace in Cassandra.
So how do nodes in Cassandra store information about other nodes in a cluster?
Via a communication protocol: gossip.
Explain replicas for read and write path.
The partitioner is present at every node of the cluster.
This partition key token generated by the partitioner is compared to the token values for the various nodes to identify the range, and therefore the node, that owns the data.
Token ranges are represented by the org.apache.cassandra.dht.Range class.
Early versions of Cassandra assigned a single token to each node, in a fairly static manner, requiring you to calculate tokens for each node.
To understand read and write paths we must understand Replication Strategies and consistency level.
SimpleStrategy is for a single data center only. If you ever intend to have more than one data center, use the NetworkTopologyStrategy.
Because Cassandra is eventually consistent, updates to other replica nodes may continue in the background. ALL, QUORUM, ONE are some of the consistency levels available.
Consistency level can be configured on a cluster, datacenter, or individual I/O operation basis. Consistency among participating nodes can be set globally and also controlled on a per-operation basis (for example insert or update) using Cassandra’s drivers and client libraries.
Suppose a write request is sent to Cassandra, but a replica node where the write belongs is not available. The coordinator will then create a hint for that node and store it; once it detects via gossip that the node is back online, the coordinator will send the hint to it.
Consider a cluster consisting of three nodes, A, B, and C, with a replication factor of 3. When a row K is written to the coordinator (node A in this case), even if node C is down, a consistency level of ONE or QUORUM can be met. Why? Both nodes A and B will receive the data, so the consistency-level requirement is met. A hint is stored for node C and written when node C comes up. In the meantime, the coordinator can acknowledge that the write succeeded.
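The example can be sketched as follows; `hints`, `write`, and `node_back_online` are invented names modeling only the coordinator's bookkeeping.

```python
# Hinted handoff sketch: unavailable replicas get a stored hint
# instead of the write, replayed when gossip sees them come back.
hints = {}  # down node -> list of pending mutations

def write(replica_nodes, alive, mutation, required_acks):
    acks = 0
    for node in replica_nodes:
        if node in alive:
            acks += 1  # replica applied the write
        else:
            hints.setdefault(node, []).append(mutation)  # store a hint
    return acks >= required_acks

def node_back_online(node):
    # Gossip detected the node again: hand its hints back for replay.
    return hints.pop(node, [])

# The slides' example: RF = 3, node C down, QUORUM (2 of 3) still met.
ok = write(["A", "B", "C"], alive={"A", "B"},
           mutation="K=v1", required_acks=2)
```

The write succeeds immediately from the client's point of view, while the hint quietly brings node C up to date later.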
A compaction operation in Cassandra is performed in order to merge SSTables.
During compaction, the data in SSTables is merged: the keys are merged, columns are combined, tombstones are discarded, and a new index is created.
Compaction is the process of freeing up space by merging large accumulated data files.
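The merge step can be sketched over toy SSTables, modeled here as dicts of key → (timestamp, value) with `None` as a tombstone; real compaction is far more involved, but the key-merge idea is the same.

```python
# Compaction sketch: keep only the newest cell per key across all
# input SSTables, then drop keys whose winning cell is a tombstone.
def compact(*sstables):
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Discard tombstones: the deletion marker has done its job.
    return {k: v for k, v in merged.items() if v[1] is not None}

old = {"k1": (100, "a"), "k2": (100, "b")}
new = {"k1": (200, None), "k3": (150, "c")}  # k1 was deleted later
```

Running `compact(old, new)` keeps `k2` and `k3` and drops `k1`, whose newest cell is the tombstone.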