This Cassandra training course is designed to provide the knowledge and skills needed to become a successful Cassandra developer. The course covers in-depth knowledge of concepts such as clusters, keyspaces, column families, replication, Cassandra’s data model, Cassandra’s architecture, performance tuning, how to read and write data, and finally how to integrate Cassandra with Hadoop.
2. What are we going to learn today?
New Problems which can’t be handled by traditional RDBMS
Tradeoff between Consistency, Availability, Partition Tolerance (CAP theorem)
What are the different solutions available?
What is Cassandra?
Use-Cases for Cassandra
Cassandra Features – Tunable Consistency, P2P Architecture, Elastic Scalability, Column Orientation
Demo Application using Cassandra
8. So, What Is Common?
Huge Data
Fast Random access
Variable Schema
Need of Compression
High Availability
Need for Consistency
Need of Distribution (Sharding)
12. Using Cassandra
[Diagram: an elastically scaled web application and applications changing data, both backed by an elastically scaled Cassandra tier. Labels: 1000 TPS, 5000 TPS, 100 ~ 200 SQL transactions, 300 ~ 500 SQL transactions.]
13. eCommerce (Travel Portal)
Both B2B & B2C Consumers
High volume of shopping transactions (> 500 Million Visits / Day)
High volume supply changes (Manual & System) generated.
Huge Inventory Database (Millions of hotels)
High Read/Write (Thousands Reads & Writes/Second)
Application has to be 99.99% Available
Fault Tolerant & Reliable.
Fast & Quick Shopping Experience.
Elastic Scale
Innovative Recommendations & Algorithms.
Should be fast for new changes
Should be cost effective for maintenance.
Development Approaches
Legacy Way (Pure RDBMS)
Augmented (RDBMS + Caching, Heavy Database Hardware)
Using Cassandra
Cassandra Use Case - Summary
14. Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database.
What is Apache Cassandra
Cassandra Features
Open Source
Distributed
Decentralized
Elastically Scalable
Highly Available
Fault Tolerant
Tunably Consistent
Column Oriented
15. Distributed And Decentralised
[Diagram: a centralised post office, where customers are routed to separate Currency Exchange (CCY), Stationery and Letter/Courier counters, compared with a decentralised post office, where every counter handles CCY, Stationery and Letters/Couriers.]
16. Every Node Is Identical.
Peer-to-peer architecture that uses a gossip protocol to keep the list of nodes in sync.
No Single Point of Failure.
No Special Host to Coordinate Activities.
Easier to operate and maintain because all nodes are the same.
Distributed And Decentralised
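The idea of nodes keeping their list of peers in sync through gossip can be illustrated with a small simulation. This is only a minimal sketch of the general gossip idea, not Cassandra's actual protocol: the function names, the node names and the single bootstrap contact are all assumptions made for illustration. The contact node is used only so that new nodes know one address to start from; it does not coordinate anything afterwards.

```python
import random


def gossip_round(views, rng):
    """Run one gossip round, mutating `views` (node name -> set of known nodes)."""
    for node in sorted(views):
        peers = sorted(views[node] - {node})
        if not peers:
            continue  # this node does not know anybody else yet
        peer = rng.choice(peers)
        # Push-pull exchange: both sides end up with the union of their lists.
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def simulate(node_names, seed=0, max_rounds=1000):
    """Gossip until every node knows the full membership list."""
    rng = random.Random(seed)
    contact = node_names[0]  # hypothetical bootstrap contact, not a coordinator
    views = {name: {name, contact} for name in node_names}
    everyone = set(node_names)
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_no, views
    raise RuntimeError("membership did not converge")


if __name__ == "__main__":
    rounds, views = simulate(["n1", "n2", "n3", "n4", "n5"])
    print(f"all nodes agree on the membership list after {rounds} round(s)")
```

Because every node runs the same exchange, no node is special: any node can leave the loop and the remaining ones still converge on a shared list.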
17. Types of Scalability
Vertical Scalability
Horizontal Scalability
What is Elastic Scalability?
This is a special property of Horizontal Scalability.
The cluster can seamlessly scale up and scale back down without major disruption.
Elastic Scalability
18. Cluster must accept new nodes without major disruption or reconfiguration.
ADD A NODE AND MOVE ON!!
[Diagram: the decentralised post office with additional identical counters added, each handling CCY, Stationery and Letters/Couriers.]
Process does not have to be restarted
Do not have to change application queries
Don’t have to manually rebalance data
Elastic Scalability
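One way to see why a node can be added without reshuffling all the data is a consistent-hashing ring, sketched below. This is an illustrative toy under stated assumptions, not Cassandra's partitioner: the `Ring` class, the use of MD5 for tokens, the 64 virtual nodes per machine, and the node and key names are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def token(value):
    """Map a string to an integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: a key belongs to the first node token clockwise from it."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.tokens = []  # sorted ring positions
        self.owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions so load spreads more evenly.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owners[t] = node

    def owner(self, key):
        idx = bisect.bisect_right(self.tokens, token(key)) % len(self.tokens)
        return self.owners[self.tokens[idx]]


def keys_moved_by_adding(nodes, new_node, keys):
    """Return (moved_keys, ownership_after) when `new_node` joins the ring."""
    ring = Ring(nodes)
    before = {key: ring.owner(key) for key in keys}
    ring.add_node(new_node)
    after = {key: ring.owner(key) for key in keys}
    moved = [key for key in keys if before[key] != after[key]]
    return moved, after


if __name__ == "__main__":
    keys = [f"hotel-{i}" for i in range(10_000)]
    moved, after = keys_moved_by_adding(["node-a", "node-b", "node-c"], "node-d", keys)
    print(f"{len(moved)} of {len(keys)} keys changed owner")
```

In this sketch only the keys that now fall on the new node's ring positions change owner; every other key stays where it was, which is the property that makes "add a node and move on" plausible.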
19. Highly Available
No Downtime
High Availability And Fault Tolerance
21. Cassandra was designed specifically from the ground up to take full advantage of multiprocessor/multicore machines, and to run across many dozens of these machines housed in multiple data centres.
It scales consistently and seamlessly to hundreds of terabytes.
Shows exceptional performance under heavy loads.
Consistently shows very fast throughput for writes per second on a basic commodity workstation.
High Performance
23. Use if your application has :-
Big Data (Billions Of Records Rows & Columns)
Very High Velocity Random Reads & Writes.
Flexible Sparse / Wide Column Requirements.
No Multiple Secondary Index Needs.
Low Latency
Use Cases
eCommerce Inventory Cache Use Cases
Time Series / Events Use Cases.
Feed Based Activities / Use Cases.
Where to use Cassandra
24. Where NOT to use Cassandra
Don’t use if your application has :-
• Secondary Indexes.
• Relational Data.
• Transactional (Rollback, Commit)
• Primary & Financial Records.
• Stringent Security & Authorization Needs On Data
• Dynamic Queries on Columns.
• Searching Column Data
• Low Latency
25. Cassandra Installation & Configuration
• conf/cassandra.yaml
• Tools
Key Space Setup
Column Family / Data Model Setup
• Key
• Columns & Data Types
• Indexes (Primary & Secondary)
• Programmatic Consistency
Thrift Hector API
CQL3 API
Application Demo
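The setup steps listed on this slide (keyspace, column family with a key and typed columns, secondary index) roughly correspond to three CQL3 statements. The helper below only builds those statements as strings so they can be inspected or pasted into a CQL shell; it is a sketch, and the keyspace `demo`, the table `hotels`, its columns and the replication factor of 3 are hypothetical values, not taken from the demo application. The helpers do no escaping, so they are for illustration only.

```python
def create_keyspace_cql(keyspace, replication_factor):
    """Build a CREATE KEYSPACE statement using SimpleStrategy replication."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


def create_table_cql(keyspace, table, key, columns):
    """Build a CREATE TABLE statement; `columns` maps column name -> CQL type."""
    cols = ", ".join(f"{name} {cql_type}" for name, cql_type in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} "
        f"({cols}, PRIMARY KEY ({key}));"
    )


def create_index_cql(keyspace, table, column):
    """Build a CREATE INDEX statement for a secondary index on one column."""
    return f"CREATE INDEX ON {keyspace}.{table} ({column});"


if __name__ == "__main__":
    # Hypothetical schema for a hotel inventory table.
    columns = {"hotel_id": "uuid", "name": "text", "city": "text"}
    print(create_keyspace_cql("demo", 3))
    print(create_table_cql("demo", "hotels", "hotel_id", columns))
    print(create_index_cql("demo", "hotels", "city"))
```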
On this foil, we shall explain how, with the advent of distributed systems, one solution cannot solve all the problems stated in the preceding foils. Cassandra can be used for Twitter and Expedia because they need high scale and availability and can compromise on consistency. These use cases also don’t have dynamic queries, so Cassandra fits very well. The BookMyShow use case requires consistency along with scale; we can trade off availability in that case, so MongoDB can be used. In the case of Facebook Messenger, consistency is very much required along with massive scale. The data is short, temporal, and forms a large set that is rarely accessed. HBase can be used in this case.
Another Classification of NoSQL DBs based on implementation
Let’s take the scenario of a post office. There are three counters: currency exchange, stationery, and letters and couriers. In the centralized approach, a router or counter forwards each customer to the respective counter. Drawback: the system fails if the router fails. In the decentralized approach, all the counters are identical and there is no router in between.
If any node goes down, another node can do its job, since every node is identical.
The client can control how many replicas to block on for each update. This is done by setting the consistency level relative to the replication factor.
Strong consistency is the ability to guarantee that an update is propagated to all locations where that piece of data resides. In a single data centre setup, this guarantees that all of the servers that should have a copy of the data will have it before the client receives a success acknowledgement. In terms of performance, this usually costs a few extra milliseconds to write the data to several servers.
Eventual consistency means that the client is acknowledged as soon as part of the cluster acknowledges the write. For example, a single server could acknowledge receiving the data and immediately begin propagating it to the other servers. This setting is best when application performance matters most.
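The relationship between consistency level and replication factor can be made concrete with a little arithmetic. The sketch below uses the commonly cited overlap rule (reads plus writes must touch more replicas than the replication factor) as its definition of strong consistency; that rule and the helper names are assumptions for illustration and are not stated in the slides. The level names mirror a few of Cassandra's consistency levels.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replica acknowledgements a consistency level blocks on."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def is_strongly_consistent(write_level, read_level, replication_factor):
    """True when every read must overlap the latest write (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3
    for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
        verdict = is_strongly_consistent(write_level, read_level, rf)
        print(f"RF={rf} write={write_level} read={read_level} strong={verdict}")
```

With a replication factor of 3, writing and reading at ONE trades consistency for speed, while QUORUM on both sides blocks on two replicas each and so always reads at least one replica that saw the latest write.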
We can explain some of these. We need not go into detail here, since we shall be explaining them in the course.