Cassandra tutorial
1. Cassandra Tutorial
Apache Cassandra is a free, open source,
distributed database management
system. It is highly scalable and designed
to manage very large amounts of
structured data. It provides high
availability with no single point of failure.
2. NoSQL Database
• A NoSQL database (sometimes read as "Not Only SQL") is a
database that provides a mechanism to store and retrieve data other
than the tabular relations used in relational databases. These
databases are typically schema-free, support easy replication, have a
simple API, are eventually consistent, and can handle huge amounts of data.
• The primary objectives of a NoSQL database are
• simplicity of design,
• horizontal scaling,
• finer control over availability.
• NoSQL databases use different data structures from
relational databases, which makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it
must solve.
3. • Apache Cassandra is an open source distributed database
system that is designed for storing and managing large
amounts of data across commodity servers. Cassandra can
serve as both a real-time operational data store for online
transactional applications and a read-intensive database for
large-scale business intelligence systems.
• Originally created at Facebook, Cassandra is designed to have
peer-to-peer symmetric nodes, instead of master or named
nodes, so that there is no single point of failure.
Cassandra automatically partitions data across all the nodes
in the database cluster, but the administrator has the power to
determine what data will be replicated and how many copies
of the data will be created.
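The partitioning and replication described above can be sketched with a toy token ring. This is a minimal illustration, not Cassandra's actual partitioner or replication strategy: the `ToyRing` class, the node names, and the use of MD5 as the hash are all assumptions made for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Hypothetical token ring: one token per node, replicas walk clockwise."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Every node is symmetric: it simply owns a position on the ring.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, row_key: str):
        """Return the nodes that would hold copies of row_key."""
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_factor)]


# The administrator chooses how many copies exist (the replication factor).
ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)
```

Because placement depends only on the hash of the row key, any node can compute which peers own a given row, which is one reason no master node is needed.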
4. Features of Cassandra
• Cassandra Features:
• Elastic scalability - Cassandra is highly scalable; it lets you add more hardware to
accommodate more customers and more data as required.
• Always on architecture - Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it maintains a
quick response time.
• Flexible data storage - Cassandra accommodates structured, semi-structured, and
unstructured data. It can dynamically accommodate changes to
your data structures as your needs change.
• Easy data distribution - Cassandra provides the flexibility to distribute data where you
need by replicating data across multiple data centers.
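• Transaction support - Cassandra does not offer full ACID transactions across rows. It provides
atomicity and isolation for writes within a single partition, durability through the commit log,
and tunable consistency.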
• Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.
5. Components of Cassandra
• Cassandra uses the Gossip Protocol in the background to allow the nodes
to communicate with each other and detect any faulty nodes in the cluster.
• The key components of Cassandra are as follows −
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in
Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After
the commit log, the data is written to the mem-table. Sometimes a
single column family has multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table
when its contents reach a threshold value.
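• Bloom filter − A Bloom filter is a fast, probabilistic data structure
for testing whether an element is a member of a set. It may report false
positives but never false negatives. It acts as a special kind of
cache: Bloom filters are consulted on reads to skip SSTables that cannot
contain the requested key.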
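The Bloom filter entry in the list above can be made concrete with a small sketch. This is a generic Bloom filter, not Cassandra's implementation: the `ToyBloomFilter` class, its bit-array size, and its use of SHA-256 are assumptions chosen for readability.

```python
import hashlib


class ToyBloomFilter:
    """Hypothetical Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions from one key by salting the hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = ToyBloomFilter()
for row_key in ["user:1", "user:2"]:
    sstable_filter.add(row_key)

print(sstable_filter.might_contain("user:1"))
```

A read can skip any SSTable whose filter answers "definitely absent", which saves a disk access at the cost of a little memory.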
6. Apache Cassandra data types
• Apache Cassandra supports the most
common data types, including ascii, bigint, blob,
boolean, counter, decimal, double, float, int, text,
timestamp, uuid, varchar and varint.
• Cassandra's data model offers the convenience of
column indexes with the performance of
log-structured updates, strong support for
denormalization and materialized views, and
built-in caching.
• Data access is performed using Cassandra Query
Language (CQL), which resembles SQL.
7. Cassandra Query Language
• Users can access Cassandra through its nodes using
Cassandra Query Language (CQL). CQL treats the
database (keyspace) as a container of tables.
Programmers work with CQL either through cqlsh, an
interactive shell, or through application language drivers.
• Clients can contact any of the nodes for their read and write
operations. That node (the coordinator) acts as a proxy
between the client and the nodes holding the data.
8. • Data storage in Cassandra is row-oriented, meaning that
all contents of a row are serialized together on disk.
Every row has a unique key. Each row can
hold up to 2 billion columns. Furthermore, each row
must fit onto a single server, because data is partitioned
solely by row key.
• To understand why databases like Cassandra, HBase and
BigTable (I’ll call them DSS, Distributed Storage
Services, from now on) were designed the way they are,
we’ll first have to understand what they were built to be
used for.
9. • DSS were designed to handle enormous
amounts of data, stored in billions of rows on large clusters.
Relational databases incorporate a lot of features that make it hard to
distribute them efficiently over multiple machines. DSS simply
remove some or all of these ties. Operations that require scanning
extensive parts of the dataset are not allowed, meaning no JOINs
or rich queries.
• Cassandra is a NoSQL column-family implementation that supports
the Bigtable data model and uses the architectural aspects introduced
by Amazon Dynamo.
10. Column family
• Cassandra consists of many storage nodes and stores each row
within a single storage node. Within each row, Cassandra
always stores columns sorted by their column names. Using
this sort order, Cassandra supports slice queries where given a
row, users can retrieve a subset of its columns falling within a
given column name range. For example, a slice query with
range tag0 to tag9999 will get all the columns whose names
fall between tag0 and tag9999.
• Keyspace – a group of column families. It is
only a logical grouping of column families and provides an
isolated scope for names.
• Finally, a super column resides within a column family and
groups several columns under one key.
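The slice query described in this slide can be sketched with a row that keeps its column names sorted. The `ToyRow` class and the sample column names are hypothetical and only illustrate the idea of range lookup over sorted names.

```python
import bisect


class ToyRow:
    """Hypothetical row that keeps column names sorted, as the slide describes."""

    def __init__(self):
        self.names = []   # sorted column names
        self.values = {}  # column name -> value

    def put(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)  # insert while preserving order
        self.values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end."""
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, end)
        return [(n, self.values[n]) for n in self.names[lo:hi]]


row = ToyRow()
for name in ["tag7", "author", "tag0", "tag15", "title"]:
    row.put(name, f"value-of-{name}")

print([n for n, _ in row.slice("tag0", "tag9999")])
# → ['tag0', 'tag15', 'tag7']
```

Note that the order here is lexicographic on strings, so "tag15" sorts before "tag7". Because the names are already sorted, a slice costs two binary searches plus the size of the result.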
11. • Cassandra provides very fast writes, which are actually
faster than its reads; it can transfer about
80-360 MB/sec per node. It achieves this using two
techniques. Cassandra keeps most of the data in memory
at the responsible node, and any updates are done in
memory and written to persistent storage (the file system) in a
lazy fashion. To avoid losing data, however, Cassandra writes
all transactions to a commit log on disk. Unlike updating
data items on disk, writes to commit logs are append-only
and therefore avoid rotational delay while writing to the
disk. For more information on disk-drive performance
characteristics, see Resources.
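The two techniques above (in-memory updates with lazy flushing, plus an append-only commit log) can be sketched as a toy store. This is a simplified illustration under stated assumptions, not Cassandra's storage engine: the `ToyStore` class, the JSON file format, and the flush threshold are all hypothetical.

```python
import json
import os
import tempfile


class ToyStore:
    """Hypothetical write path: commit log, then memtable, then lazy flush."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.commit_log_path = os.path.join(directory, "commit.log")
        self.memtable = {}
        self.sstables = []  # paths of flushed files, oldest first

    def write(self, key, value):
        # 1. Append to the commit log first so the write survives a crash.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Apply the update in memory.
        self.memtable[key] = value
        # 3. Flush lazily, only once the memtable reaches the threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstables.append(path)
        self.memtable = {}
        # Flushed entries no longer need replay, so the log can be truncated.
        open(self.commit_log_path, "w", encoding="utf-8").close()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):  # newest file first
            with open(path, encoding="utf-8") as f:
                table = json.load(f)
            if key in table:
                return table[key]
        return None


with tempfile.TemporaryDirectory() as tmp:
    store = ToyStore(tmp, flush_threshold=2)
    store.write("k1", "v1")
    store.write("k2", "v2")   # reaches the threshold and triggers a flush
    store.write("k1", "v1b")  # the newer value lives in the memtable
    print(store.read("k1"), store.read("k2"), len(store.sstables))
```

A write touches the disk only through a sequential append, while a read may have to check the memtable and several flushed files, which is consistent with the slide's point that writes are cheaper than reads.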
12. • Unless a write has requested full consistency, Cassandra writes data to enough
nodes without resolving any data inconsistencies; it resolves
inconsistencies only at the first read. This process is called "read repair."
• Healing from failure is manual
• If a node in a Cassandra cluster has failed, the cluster will continue to work if
you have replicas. Full recovery, which is to redistribute data and compensate
for missing replicas, is a manual operation through a command line tool
called nodetool. Also, while the manual operation happens, the system will be
unavailable.
• It remembers deletes
• Cassandra is designed so that it continues to work without a problem even if a
node goes down (or gets disconnected) and comes back later. One consequence is
that this complicates data deletions. For example, assume a node is down. While it is
down, a data item is deleted on the other replicas. When the unavailable node
comes back, it will reintroduce the deleted data item during the syncing process
unless Cassandra remembers that the data item has been deleted.
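Read repair and remembered deletes can be sketched together with timestamped values and a deletion marker (Cassandra calls such markers tombstones). The `ToyReplica` class, the `read_with_repair` function, and the integer timestamps are assumptions for illustration; real reconciliation involves more detail.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class ToyReplica:
    """Hypothetical replica storing key -> (timestamp, value or TOMBSTONE)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, timestamp, value):
        """Keep the write only if it is newer than what the replica holds."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)


def read_with_repair(replicas, key):
    """Pick the newest version across replicas and push it to stale ones."""
    versions = [r.data[key] for r in replicas if key in r.data]
    if not versions:
        return None
    newest = max(versions, key=lambda v: v[0])
    for r in replicas:
        r.apply(key, *newest)  # read repair: stale replicas catch up
    return None if newest[1] is TOMBSTONE else newest[1]


a, b, c = ToyReplica("a"), ToyReplica("b"), ToyReplica("c")
for replica in (a, b, c):
    replica.apply("item", 1, "hello")

# Replica c is down while the item is deleted on a and b.
for replica in (a, b):
    replica.apply("item", 2, TOMBSTONE)

# c comes back still holding the old value; the tombstone wins on read.
result = read_with_repair([a, b, c], "item")
print(result, c.data["item"][1] is TOMBSTONE)
# → None True
```

If the delete had simply removed the key on a and b, the only surviving version would be the stale value on c, and the read would bring the item back. Keeping a timestamped marker is what lets the newer delete win.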