4. Who is using DataStax?
4
Collections /
Playlists
Recommendation /
Personalization
Fraud detection
Messaging
Internet of Things /
Sensor data
5. What is Apache Cassandra?
Apache Cassandra™ is a massively scalable NoSQL database.
• Continuous availability
• High performing writes and reads
• Linear scalability
• Multi-data center support
6. 6
The NoSQL Performance Leader
Source: Netflix Tech Blog
Netflix Cloud Benchmark…
“In terms of scalability, there is a clear winner throughout our experiments.
Cassandra achieves the highest throughput for the maximum number of nodes in
all experiments with a linear increasing throughput.”
Source: Solving Big Data Challenges for Enterprise Application Performance Management benchmark paper presented at the Very Large Database Conference,
2013.
End Point Independent NoSQL Benchmark
Highest in throughput…
Lowest in latency…
7. 10
50
3070
80
40
20
60
Client
Client
Replication Factor = 3
We could still
retrieve the data
from the other 2
nodes
Token Order_id Qty Sale
70 1001 10 100
44 1002 5 50
15 1003 30 200
Node failure or it goes
down temporarily
Cassandra is Fault Tolerant
9. 9
Writes in Cassandra
Data is organized into Partitions
1. Data is written to a Commit Log for a node (durability)
2. Data is written to MemTable (in memory)
3. MemTables are flushed to disk in an SSTable based on
size.
SSTables are immutable
Client
Memory
SSTables
Commit Log
Flush to Disk
13. CQL - DevCenter
13
A SQL-like query language for communicating with Cassandra
DataStax DevCenter – a free, visual query tool for creating and running CQL
statements against Cassandra and DataStax Enterprise.
15. CQL - Basics
15
CREATE TABLE users (
username text,
password text,
create_date timestamp,
PRIMARY KEY (username, create_date desc);
INSERT INTO users (username, password, create_date)
VALUES ('caroline', 'password1234', '2014-06-01 07:01:00');
SELECT * FROM users WHERE username = ‘caroline’ AND create_date = ‘2014-06-
01 07:01:00’;
Predicates
On the partition key: = and IN
On the cluster columns: <, <=, =, >=, >, IN
16. Collection Data Types
16
CQL supports having columns that contain collections of data.
The collection types include:
Set, List and Map.
Favor sets over list – better performance
CREATE TABLE users (
username text,
set_example set<text>,
list_example list<text>,
map_example map<int,text>,
PRIMARY KEY (username)
);
17. Plus much more…
17
Light Weight Transactions
INSERT INTO customer_account (customerID, customer_email) VALUES
(‘LauraS’, ‘lauras@gmail.com’) IF NOT EXISTS;
UPDATE customer_account SET customer_email=’laurass@gmail.com’
IF customer_email=’lauras@gmail.com’;
Counters
UPDATE UserActions SET total = total + 2
WHERE user = 123 AND action = ’xyz';
Time to live (TTL)
INSERT INTO users (id, first, last) VALUES (‘abc123’, ‘abe’,
‘lincoln’) USING TTL 3600;
Batch Statements
BEGIN BATCH
INSERT INTO users (userID, password, name) VALUES ('user2',
'ch@ngem3b', 'second user')
UPDATE users SET password = 'ps22dhds' WHERE userID = 'user2'
INSERT INTO users (userID, password) VALUES ('user3',
'ch@ngem3c')
DELETE name FROM users WHERE userID = 'user2’
APPLY BATCH;
19. DataStax Java Driver
19
• Written for CQL 3.0
• Uses the binary protocol introduced in
Cassandra 1.2
• Uses Netty to provide an asynchronous architecture
• Can do asynchronous or synchronous queries
• Has connection pooling
• Has node discovery and load balancing
http://www.datastax.com/download
20. Add .JAR Files to Project
20
Easiest way is to do this with Maven, which is a software project
management tool
21. Add .JAR Files to Project
21
In the pom.xml file, select the Dependencies tab
Click the Add… button in the left column
Enter the DataStax Java driver info
22. Connect & Write
22
Cluster cluster = Cluster.builder()
.addContactPoints("10.158.02.40", "10.158.02.44")
.build();
Session session = cluster.connect("demo");
session.execute(
"INSERT INTO users (username, password) ”
+ "VALUES(‘caroline’, ‘password1234’)"
);
Note: Cluster and Session objects should be long-lived and re-used
24. Asynchronous Read
24
ResultSetFuture future = session.executeAsync(
"SELECT * FROM users");
for (Row row : future.get()) {
String userName = row.getString("username");
String password = row.getString("password");
}
Note: The future returned implements Guava's ListenableFuture interface. This
means you can use all Guava's Futures1 methods!
1http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/Futures.html
25. Read with Callbacks
25
final ResultSetFuture future =
session.executeAsync("SELECT * FROM users");
future.addListener(new Runnable() {
public void run() {
for (Row row : future.get()) {
String userName = row.getString("username");
String password = row.getString("password");
}
}
}, executor);
26. Parallelize Calls
26
int queryCount = 99;
List<ResultSetFuture> futures = new
ArrayList<ResultSetFuture>();
for (int i=0; i<queryCount; i++) {
futures.add(
session.executeAsync("SELECT * FROM users "
+"WHERE username = '"+i+"'"));
}
for(ResultSetFuture future : futures) {
for (Row row : future.getUninterruptibly()) {
//do something
}
}
29. Load Balancing
29
Determine which node will next be contacted once a
connection to a cluster has been established
Cluster cluster = Cluster.builder()
.addContactPoints("10.158.02.40","10.158.02.44")
.withLoadBalancingPolicy(
new DCAwareRoundRobinPolicy("DC1"))
.build();
Policies are:
• RoundRobinPolicy
• DCAwareRoundRobinPolicy (default)
• TokenAwarePolicy
30. RoundRobinPolicy
30
• Not data-center aware
• Each subsequent request after initial connection to the
cluster goes to the next node in the cluster
• If the node that is serving as the coordinator fails during a
request, the next node is used
31. DCAwareRoundRobinPolicy
31
• Is data center aware
• Does a round robin within the local data center
• Only goes to another
data center if there is
not a node available
to be coordinator in
the local data center
32. TokenAwarePolicy
32
• Is aware of where the replicas for a given token live
• Instead of round robin, the client chooses the node that
contains the primary replica to be the chosen coordinator
• Avoids unnecessary time taken to go to any node to have it
serve as coordinator to then contact the nodes with the
replicas
33. Additional Information & Support
33
• Community Site
(http://planetcassandra.org)
• Documentation
(http://www.datastax.com/docs)
• Downloads
(http://www.datastax.com/download)
• Getting Started
(http://www.datastax.com/documentation/gettingstarted/index.html)
• DataStax
(http://www.datastax.com)
Always on:
Peer to peer architecture – all nodes are equal; each node is responsible for an assigned range (or ranges) of data
Clients can write (or read0 data to any node in the ring – native drivers can round robin across a DC and distribute load to a coordinator node
Coordinate node writes (or reads) copies of data to nodes which own each copy
In the case of a failure (such as a drive going down),
2 out of the 3 nodes are still on, so the ability to write and read data still works for the majority of nodes and therefore C* is always on
Multi-DC is very, very easy to configure with Cassandra
Datacenters are active – active: write to either DC and the other one will get a copy
In the case of a datacenter outage, applications can carry on a retry policy which flips over to the other datacenter which also has a copy of the data;
Outbrain story – Hurricane Sandy
Rack-aware
What is Quorum? i.e. 10 nodes, rf = 6 (q=4 -> can tolerate 2 down nodes)
similar to sql but does not have all sql options
DevCenter / cqlsh
keyspace = db instance
simple strategy vs NetworkTopologyStrategy
replication factor
user defined types
batch -> all or nothing - atomic
Asynchronous: Netty is an asynchronous event-driven network application framework. Not blocking threads -> can scale
It greatly simplifies and streamlines network programming such as TCP and UDP socket server / support request pipeline
Other drivers: C#, Python, ODBC + community drivers
high frequency / multi threaded
get push requests back from C* i.e., topology changes, node status changes, schema changes
v2.0.2
contact points = discover cluster. not connect to these nodes
Retrieves metadata from the cluster, including:
Name of the cluster , IP address, rack, and data center for each node
Session instances are thread-safe and normally a single session instance is all you need per application per keyspace
schemaless -> no create date
exectureAsync => returns future implementing Guavas futures => large scale > execute long query > callbacks
Execute async to get a future on your resultset
- which means from a single thread you can push a bunch of requests to be execute in paralell
- look at futures in Java
example: read w callback
break into small queries, execute as many in C* -> send as much as C* can take
fire ton of select, get futures back & iterate & do …
breaking into small queries -> better latency, easier to retry small query
prepared, compiled, execute
for CQL statements executed multiple times
for performance: precompiled & cached on server. C* parses once & exec multiple times
use ? as placeholder for literal values
gets passed as BoundStatement object
BS gets executed by Session object
programmatically build query
When you need to generate queries dynamically, quite easy
Or if your just familiar with it
Note: the default consistency level of ConsistencyLevel.ONE
Round Robin: not DC aware; request goes to next node in cluster
DCAwareRoundRobin: goes to next node in next DC
client connects to node where data resides. performance advantage
HIRING
HQ: SF
offices in Austin, London, Paris
delivering Apache C* to the enterprise
DataStax is the company that delivers Cassandra to the enterprise. First, we take the open source software and put it through rigorous quality assurance tests including a 1000 node scalability test. We certify it and provide the worlds most comprehensive support, training and consulting for Cassandra so that you can get up and running quickly. But that isn’t all DataStax does. We also build additional software features on top of DataStax including security, search, analytics as well as provide in memory capabilities that don’t come with the open source Cassandra product. We also provide management services to help visualize your nodes, plan your capacity and repair issues automatically. Finally, we also provide developer tools and drivers as well as monitoring tools. DataStax is the commercial company behind Apache Cassandra plus a whole host of additional software and services.
Spark integration - 100x faster analytics - integrated real-time analytics
performance tuning - supply diagnostic info on how cluster is doing
byoh