Cassandra20141009

Agenda
 Quick Review Of Cassandra
 New Developments In Cassandra
 Basic Data Modeling Concepts
 Materialized Views
 Secondary Indexes
 Counters
 Time Series Data
 Expiring Data

Cassandra High Level
Cassandra's architecture is based on the
combination of two technologies
 Google BigTable – Data Model
 Amazon Dynamo – Distributed
Architecture
 Cassandra = C*

Architecture Basics &
Terminology
 Nodes are single instances of C*
 Cluster is a group of nodes
 Data is organized by keys (tokens) which
are distributed across the cluster
 Replication Factor (RF) determines how
many copies of each key are kept
 Data Center Aware
 Consistency Level – powerful feature to
tune consistency vs speed vs availability.

More Architecture
 Information on who has what data and
who is available is transferred using
gossip.
 No single point of failure (SPOF), every
node can service requests.
 Data Center Aware

CAP Theorem
 Distributed Systems Law:
 Consistency
 Availability
 Partition Tolerance
(you can only really have two in a distributed system)
 Cassandra is AP with Eventual
Consistency

Consistency
 Cassandra uses the concept of Tunable
Consistency, which makes it very
powerful and flexible for system needs.
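As a rough illustration of tunable consistency, cqlsh lets you set the consistency level for a session with its CONSISTENCY command. The sketch below assumes a keyspace with a replication factor of 3 and a users table keyed by user_id; the UUID value is made up.

```cql
-- cqlsh session; assumes replication factor 3 (an assumption, not from the slides)

-- Fastest option: only one replica has to respond
CONSISTENCY ONE;
SELECT * FROM users WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Majority option: 2 of the 3 replicas have to respond
CONSISTENCY QUORUM;
SELECT * FROM users WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
```

With RF = 3 a quorum is 2 replicas. If both reads and writes use QUORUM, the read and write replica sets always overlap (R + W > RF), so a read sees the latest acknowledged write, at the cost of extra latency and lower availability.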

Data Model Architecture
 Keyspace – container of column families
(tables). Defines RF among others.
 Table – column family. Contains
definition of schema.
 Row – a “record” identified by a key
 Column - a key and a value
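The hierarchy above could look like this in CQL. This is only a sketch: the keyspace name demo and the products table are hypothetical, and SimpleStrategy is chosen on the assumption of a single data center.

```cql
-- Keyspace: container for tables, defines the replication factor
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE demo;

-- Table (column family): holds the schema definition
CREATE TABLE products (
  product_id int PRIMARY KEY,
  name text,
  price decimal
);

-- Row: one record identified by its key; name and price are its columns
INSERT INTO products (product_id, name, price) VALUES (1, 'widget', 9.99);
```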

Keys
 Primary Key
 Partition Key – identifies a row
 Cluster Key – sorting within a row
 Using CQL these are defined together
as a compound (composite) key
 Compound keys are how you implement
“wide rows” which we will look at a lot!

Single Primary Key
create table users (
  user_id UUID PRIMARY KEY,
  firstname text,
  lastname text,
  emailaddress text
);
** Cassandra Data Types
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/cql_data_types_c.html

Compound Key
create table users (
  emailaddress text,
  department text,
  firstname text,
  lastname text,
  PRIMARY KEY (emailaddress, department)
);
 Partition Key plus Cluster Key
 emailaddress is partition key
 department is cluster key

Compound Key
create table users (
  emailaddress text,
  department text,
  country text,
  firstname text,
  lastname text,
  PRIMARY KEY ((emailaddress, department), country)
);
 Composite Partition Key plus Cluster Key
 emailaddress and department together form the partition key
 country is cluster key

Deletions
 Distributed systems present a unique
problem for deletes. If data were actually
removed while a node was down and missed
the delete notice, that node would try to
recreate the record when it came back online.
So…
 Tombstone - The data is replaced with a
special value called a tombstone, which
works within the distributed architecture.
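A small sketch of what this means in practice, assuming a users table whose partition key is emailaddress. The delete writes a tombstone, and the table's gc_grace_seconds setting controls how long that tombstone is kept before compaction may drop it.

```cql
-- Writes a tombstone for the partition instead of removing data immediately
DELETE FROM users WHERE emailaddress = 'jsmith@example.com';

-- Tombstones are retained for gc_grace_seconds (default 864000, i.e. 10 days)
-- so that replicas that were down can still learn about the delete.
ALTER TABLE users WITH gc_grace_seconds = 864000;
```

A node that stays down longer than gc_grace_seconds should be repaired or rebuilt before rejoining, otherwise deleted data can reappear.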

New Rules
 Writes Are Cheap
 Denormalize All You Need
 Model Your Queries, Not Data
(understand access patterns)
 Application Worries About Joins

What’s New In 2.0
Conditional DDL
IF EXISTS or IF NOT EXISTS
Drop Column Support
ALTER TABLE users DROP lastname;
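Conditional DDL might be used as follows to make a schema script safe to re-run. The keyspace demo and the table old_users are hypothetical names chosen for the example.

```cql
-- Creates the keyspace only if it is missing; no error on a second run
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Same idea for tables
CREATE TABLE IF NOT EXISTS demo.users (
  emailaddress text PRIMARY KEY,
  firstname text,
  lastname text
);

-- Drops the table only if it is present
DROP TABLE IF EXISTS demo.old_users;
```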

More New Stuff
 Triggers
CREATE TRIGGER myTrigger
ON myTable
USING 'com.thejavaexperts.cassandra.updateevt';
 Lightweight Transactions (CAS)
UPDATE users
SET firstname = 'tim'
WHERE emailaddress = 'tpeters@example.com'
IF firstname = 'tom';
** Not like an ACID Transaction!!

CAS & Transactions
 CAS - compare-and-set operations. A
single, atomic operation compares the
value of a column in the database and
applies a modification depending on
the result of the comparison.
 Consider the performance hit. CAS is (was)
considered an anti-pattern.
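Besides the conditional UPDATE, another common compare-and-set form is an insert that only succeeds when the row is absent. The sketch assumes a users table whose primary key is emailaddress alone.

```cql
-- Creates the row only if no row with this key exists yet
INSERT INTO users (emailaddress, firstname, lastname)
VALUES ('tpeters@example.com', 'tom', 'peters')
IF NOT EXISTS;
```

The result set contains an [applied] column that reports whether the write took effect. The check runs through Paxos and needs several round trips between replicas, which is where the performance hit comes from.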

Data Modeling… The
Basics
 Cassandra now feels very familiar to
RDBMS/SQL users.
 It very nicely hides the underlying data
storage model.
 You still have all the power of Cassandra;
it is all in the key definition.
RDBMS = model data
Cassandra = model access (queries)

Side-Note On Querying
 Create table with compound key
 Select using ALLOW FILTERING
 Counts
 Select using IN or =
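The steps listed above could be walked through roughly like this. The table mirrors the compound key example from the earlier slide, the data values are made up, and exact ALLOW FILTERING behaviour varies between Cassandra versions.

```cql
CREATE TABLE users (
  emailaddress text,
  department text,
  firstname text,
  lastname text,
  PRIMARY KEY (emailaddress, department)
);

-- Equality on the partition key: served from a single partition
SELECT * FROM users WHERE emailaddress = 'tpeters@example.com';

-- IN on the partition key: one lookup per listed key
SELECT * FROM users
WHERE emailaddress IN ('tpeters@example.com', 'jsmith@example.com');

-- Count the rows in one partition
SELECT COUNT(*) FROM users WHERE emailaddress = 'tpeters@example.com';

-- No partition key given: Cassandra has to scan, so it asks for ALLOW FILTERING
SELECT * FROM users WHERE department = 'engineering' ALLOW FILTERING;
```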

Batch Operations
 Saves Network Roundtrips
 Can contain INSERT, UPDATE,
DELETE
 Atomic by default (all or nothing)
 Can use timestamp for specific ordering

Batch Operation Example
BEGIN BATCH
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('brian.enochson@gmail.com', 'brian', 'enochson', 'USA');
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('tpeters@example.com', 'tom', 'peters', 'DE');
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('jsmith@example.com', 'jim', 'smith', 'USA');
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('arogers@example.com', 'alan', 'rogers', 'USA');
DELETE FROM users WHERE emailaddress = 'jsmith@example.com';
APPLY BATCH;
 select in cqlsh
 List in cassandra-cli with timestamp

More Data Modeling…
 No Joins
 No Foreign Keys
 No Third (or any other) Normal Form
Concerns
 Redundant Data Encouraged. Apps
maintain consistency.

Secondary Indexes
 Let you define indexes so that queries can
use columns other than the partition key.
 Each node has a local index for its data.
 They have uses, but shouldn’t be used
all the time without consideration.
 We will look at alternatives.

Secondary Index Example
 Create a table
 Try to select with column not in PK
 Add Secondary Index
 Try select again.
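The four steps above might look like the following. The table layout and the index name users_country_idx are assumptions made for the example.

```cql
CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  firstname text,
  lastname text,
  country text
);

-- Rejected: country is not part of the primary key and is not indexed
SELECT * FROM users WHERE country = 'USA';

-- Add a secondary index on the column
CREATE INDEX users_country_idx ON users (country);

-- Accepted now: each node consults its local index
SELECT * FROM users WHERE country = 'USA';
```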

When to use?
 Low Cardinality – small number of unique
values
 High Cardinality – high number of distinct
values
 Secondary Indexes are good for Low
Cardinality. So country codes, department
codes etc. Not email addresses.

Materialized View
 If you want full distribution, you can use
what is called the Materialized View pattern.
 Remember redundant data is fine.
 Model the queries

Materialized View Example
 Show normal table with compound key and
querying limitations
 Create Materialized View Table With
Different Compound Key, support alternate
access.
 Selects use partition key.
 Secondary indexes are local, not distributed
 ALLOW FILTERING can cause performance issues
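A minimal sketch of the pattern, assuming the application maintains the second table itself. The name users_by_department and the sample values are hypothetical; the point is that the same data is stored twice under different keys.

```cql
-- Base table: partitioned by emailaddress
CREATE TABLE users (
  emailaddress text,
  department text,
  firstname text,
  lastname text,
  PRIMARY KEY (emailaddress, department)
);

-- Hand-maintained "view": same data, partitioned by department
CREATE TABLE users_by_department (
  department text,
  emailaddress text,
  firstname text,
  lastname text,
  PRIMARY KEY (department, emailaddress)
);

-- The application writes to both tables, here in one logged batch
BEGIN BATCH
  INSERT INTO users (emailaddress, department, firstname, lastname)
  VALUES ('tpeters@example.com', 'engineering', 'tom', 'peters');
  INSERT INTO users_by_department (department, emailaddress, firstname, lastname)
  VALUES ('engineering', 'tpeters@example.com', 'tom', 'peters');
APPLY BATCH;

-- Each lookup now reads a single partition, with no index and no filtering
SELECT * FROM users WHERE emailaddress = 'tpeters@example.com';
SELECT * FROM users_by_department WHERE department = 'engineering';
```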

Counters
 Updated in 2.1 and now work in a more
distributed and accurate manner.
 Table organization, example
 How to update, view etc.
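A counter table could be organized and used as sketched below. The page_views table is a hypothetical example; in a counter table every column outside the primary key must be a counter.

```cql
CREATE TABLE page_views (
  page text PRIMARY KEY,
  views counter
);

-- Counters are never inserted; the first UPDATE creates the row
UPDATE page_views SET views = views + 1 WHERE page = '/home';
UPDATE page_views SET views = views + 5 WHERE page = '/home';

-- Read the current value
SELECT page, views FROM page_views WHERE page = '/home';
```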

Time Series Example….
 Time series table model.
 Need to consider interval for event
frequency and wide row size.
 Make the partition key the thing being
tracked plus the unit of the time interval.

Time Series Data
 Because writes are fast, Cassandra is
well suited for storing time series data.
 The Cassandra wide row is a perfect fit
for modeling time series / time based
events.
 Let’s look at an example….

Event Data
 Notice primary key and cluster key.
 Insert some data
 View in CQL, then in CLI as wide row
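One possible shape for such an event table is sketched here. The sensor_events table, its columns and the one-day bucket are assumptions; the bucket size should be picked from the event frequency so that a single wide row does not grow without bound.

```cql
-- Partition key = what is tracked + day bucket; clustering key = event time
CREATE TABLE sensor_events (
  sensor_id text,
  day text,
  event_time timestamp,
  reading double,
  PRIMARY KEY ((sensor_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

INSERT INTO sensor_events (sensor_id, day, event_time, reading)
VALUES ('sensor-1', '2014-10-09', '2014-10-09 08:00:00', 20.5);
INSERT INTO sensor_events (sensor_id, day, event_time, reading)
VALUES ('sensor-1', '2014-10-09', '2014-10-09 08:05:00', 20.7);

-- Time-range slice inside one partition (one wide row)
SELECT event_time, reading FROM sensor_events
WHERE sensor_id = 'sensor-1' AND day = '2014-10-09'
  AND event_time >= '2014-10-09 08:00:00'
  AND event_time <  '2014-10-09 09:00:00';
```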

TTL – Self Expiring Data
 Another technique is to store data with a
defined lifespan.
 For instance session identifiers,
temporary passwords etc.
 For this Cassandra provides a Time To
Live (TTL) mechanism.

TTL Example…
 Create table
 Insert data using TTL
 Can update a specific column with a TTL
 Show using selects.
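The steps above might be sketched as follows for a session store. The sessions table and its values are hypothetical; TTL values are given in seconds.

```cql
CREATE TABLE sessions (
  session_id text PRIMARY KEY,
  user_email text,
  token text
);

-- All inserted columns expire after one hour
INSERT INTO sessions (session_id, user_email, token)
VALUES ('abc123', 'tpeters@example.com', 't0k3n')
USING TTL 3600;

-- Give one column its own, shorter TTL
UPDATE sessions USING TTL 60 SET token = 'n3wt0k' WHERE session_id = 'abc123';

-- Remaining lifetime of the column, in seconds
SELECT session_id, TTL(token) FROM sessions WHERE session_id = 'abc123';
```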

Questions
 Email: brian.enochson@gmail.com
 Twitter: @benochso
 G+: https://plus.google.com/+BrianEnochson
