The document summarizes several examples of using Cassandra for different use cases: transactions, paging, analytics, and risk-sensitivity calculations. For each use case it provides an overview of the scenario, where to find and run the example code, the relevant data model and schema, and the method used. It emphasizes how Cassandra features such as lightweight transactions, cursors, and aggregation patterns allow these kinds of applications to be implemented efficiently.
1. Cassandra Hands On
Niall Milton, CTO, DigBigData
Examples courtesy of Patrick Callaghan, DataStax
Sponsored By
2. Introduction
— We will be walking through Cassandra use cases
from Patrick Callaghan on github.
— https://github.com/PatrickCallaghan/
— Patrick sends his apologies: due to the Aer Lingus
strike on Friday he couldn’t get a flight back to the
UK
— This presentation will cover the important points
from each sample application
5. Scenario
— We want to add products, each with a quantity to
an order
— Orders come in concurrently from random buyers
— Products that have sold out will return “OUT OF
STOCK”
— We want to use lightweight transactions to
guarantee that we do not allow orders to complete
when no stock is available
6. Lightweight Transactions
— Guarantee a serial isolation level, i.e. linearizable (ACID-like) consistency
— Uses the Paxos consensus algorithm to achieve this in a
distributed system. See:
— http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
— Every node is still equal, no master or locks
— Allows for conditional inserts & updates
— The cost of linearizable consistency is higher latency,
not suitable for high volume writes where low latency is
required
8. Schema
1. create keyspace if not exists datastax_transactions_demo
   WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
2. create table if not exists products (productId text,
   capacityleft int, orderIds set<text>, PRIMARY KEY (productId));
3. create table if not exists buyers_orders (buyerId text,
   orderId text, productId text, PRIMARY KEY (buyerId, orderId));
9. Model
public class Order {
private String orderId;
private String productId;
private String buyerId;
…
}
10. Method
— Find current product quantity at CL.SERIAL
— This allows us to execute a PAXOS query without
proposing an update, i.e. read the current value
SELECT capacityLeft FROM products WHERE productId = '1234'
e.g. capacityLeft = 5
11. Method Contd.
— Do a conditional update using IF operator to make
sure product quantity has not changed since last
quantity check
— Note the use of the set collection type here.
— This statement will only succeed if the IF condition is
met
UPDATE products SET orderIds = orderIds + {'3'},
capacityleft = 4 WHERE productId = '1234'
IF capacityleft = 5;
12. Method Contd.
— If the last query succeeds, simply insert the order.
INSERT INTO buyers_orders (buyerId, orderId,
productId) VALUES ('1', '3', '1234');
— This guarantees that no order will be placed where
there is insufficient quantity to fulfill it.
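Every conditional statement reports whether it actually took effect: Cassandra returns a result row with a boolean [applied] column, and when the condition fails it also returns the current values of the columns named in the condition. A sketch of what cqlsh would show for the conditional update above (the stock values assume the example data):

```
cqlsh> UPDATE products SET orderIds = orderIds + {'3'}, capacityleft = 4
   ... WHERE productId = '1234' IF capacityleft = 5;

 [applied]
-----------
      True

cqlsh> UPDATE products SET orderIds = orderIds + {'4'}, capacityleft = 3
   ... WHERE productId = '1234' IF capacityleft = 5;

 [applied] | capacityleft
-----------+--------------
     False |            4
```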
13. Comments
— Using LWT incurs a cost of higher latency because
replicas must reach consensus over several round
trips before a value is committed / returned.
— CL.SERIAL does not propose a new value but is
used to read the possibly uncommitted PAXOS
state
— The IF operator can also be used as IF NOT EXISTS,
which is useful for user creation, for example
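For example, guarding account creation could look like the sketch below (the users table and its columns are hypothetical, not part of the demo schema); the insert only takes effect if no row with that primary key already exists:

```
INSERT INTO users (username, email)
VALUES ('pcallaghan', 'patrick@example.com')
IF NOT EXISTS;
```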
15. Scenario
— We have 1000s of products in our product
catalogue
— We want to browse these using a simple select
— We don’t want to retrieve all at once!
16. Cursors
— We are often dealing with wide rows in Cassandra
— Reading entire rows or multiple rows at once could
lead to OOM errors
— Traditionally this meant using range queries to
retrieve content
— Cassandra 2.0 (and the Java driver) introduce cursors
— Makes row-based queries more efficient (no need to
use the token() function)
— This will simplify client code
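For contrast, the pre-cursor approach the last point refers to can be sketched as repeated range queries on the partition key's token, feeding the last value seen into the next query (column names from the products schema; the page size of 100 is an arbitrary choice):

```
SELECT productId, capacityleft FROM products LIMIT 100;
-- then, using the last productId returned as the page boundary:
SELECT productId, capacityleft FROM products
WHERE token(productId) > token('<last productId seen>')
LIMIT 100;
```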
18. Schema
create table if not exists products (productId text,
capacityleft int, orderIds set<text>,
PRIMARY KEY (productId));
— N.B. With the default partitioner, products will be
ordered by their Murmur3 hash value. The old way, we
would need to use the token() function to retrieve
them in that order
19. Model
public class Product {
private String productId;
private int capacityLeft;
private Set<String> orderIds;
…
}
20. Method
1. Create a simple select query for the products
table.
2. Set the fetch size parameter
3. Execute the statement
Statement stmt = new
SimpleStatement("SELECT * FROM products");
stmt.setFetchSize(100);
ResultSet resultSet =
this.session.execute(stmt);
21. Method Contd.
1. Get an iterator for the result set
2. Use a while loop to iterate over the result set
Iterator<Row> iterator = resultSet.iterator();
while (iterator.hasNext()){
Row row = iterator.next();
// do stuff with the row
}
22. Comments
— Very easy to transparently iterate in a memory
efficient way over a large result set
— Cursor state is maintained by the driver.
— Allows for failover between page requests, i.e. the
state is not lost if a page fails to load from one node
in the replica set; the page will be requested from
another node
— See: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
24. Scenario
— We don’t have Hadoop but want to run some Hive-style
analytics on our large dataset
— Example: Get the Top10 financial transactions
ordered by monetary value for each user
— May want to add more complex filtering later
(where value > 1000) or even do mathematical
groupings, percentiles, means, min, max
25. Cassandra for Analytics
— Useful for many scenarios when no other analytics
solution is available
— Using cursors, queries are bounded & memory efficient
depending on the operation
— Can be applied anywhere we can do iterative or recursive
processing, SUM, AVG, MIN, MAX etc.
— NB: The example code also includes a
CQLSSTableWriter, which is fast & convenient if we want
to create SSTables from large datasets manually rather
than send millions of insert queries to Cassandra
27. Schema
create table IF NOT EXISTS transactions (
accid text,
txtnid uuid,
txtntime timestamp,
amount double,
type text,
reason text,
PRIMARY KEY(accid, txtntime)
);
28. Model
public class Transaction {
private String txtnId;
private String accountId;
private double amount;
private Date txtnDate;
private String reason;
private String type;
…
}
29. Method
— Pass a blocking queue into the DAO method, which cursors the
data; this allows us to pop items off as they are added
— NB: Could also use a callback here to update the queue
public void
getAllProducts(BlockingQueue<Transaction>
processorQueue)
Statement stmt = new SimpleStatement("SELECT * FROM
transactions");
stmt.setFetchSize(2500);
ResultSet resultSet = this.session.execute(stmt);
30. Method Contd.
1. Get an iterator for the result set
2. Use a while loop to iterate over the result set, add each row
into the queue
while (iterator.hasNext()) {
Row row = iterator.next();
Transaction transaction =
createTransactionFromRow(row); //convenience
queue.offer(transaction);
}
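The hand-off between the DAO and the processor can be sketched without a live cluster; here a plain String stands in for a Transaction row, and a producer thread plays the role of the DAO cursoring over the result set (the class and method names are illustrative, not from the example code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueHandOff {

    // Consumer side: block on take() until the producer offers a row,
    // stopping after `expected` items have been processed.
    public static int drain(BlockingQueue<String> queue, int expected)
            throws InterruptedException {
        int processed = 0;
        while (processed < expected) {
            queue.take();   // blocks until an item is available
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        // Producer thread: stands in for the DAO iterating the result set.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                queue.offer("txn-" + i);
            }
        });
        producer.start();
        int n = drain(queue, 10);
        producer.join();
        System.out.println(n); // prints 10
    }
}
```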
31. Method Contd.
1. Use Java Collections & Transaction comparator to
track Top results
private Set<Transaction> orderedSet = new
BoundedTreeSet<Transaction>(10, new
TransactionAmountComparator());
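BoundedTreeSet here is the Solr utility class; a minimal sketch of the same idea, assuming its top-N semantics (keep the set sorted by the comparator, evict the lowest-ranked element when capacity is exceeded), demonstrated with Double in place of Transaction. The class name BoundedTopSet is ours, to avoid confusion with the real Solr class:

```java
import java.util.Comparator;
import java.util.TreeSet;

// Minimal bounded sorted set: holds at most maxSize elements,
// evicting the comparator's "last" (lowest-ranked) element on overflow.
class BoundedTopSet<E> extends TreeSet<E> {
    private final int maxSize;

    BoundedTopSet(int maxSize, Comparator<? super E> comparator) {
        super(comparator);
        this.maxSize = maxSize;
    }

    @Override
    public boolean add(E e) {
        boolean added = super.add(e);
        if (size() > maxSize) {
            remove(last());   // drop the element ranked lowest
        }
        return added;
    }
}
```

Tracking the Top 10 transactions by amount then amounts to constructing the set with capacity 10 and a comparator that orders transactions by amount descending, as in the slide above.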
32. Comments
— Entirely possible, but probably not to be thought of as a
complete replacement for dedicated analytics solutions
— Issues are token distribution across replicas and mixed write
and read patterns
— Running analytics or MR operations can be a read heavy
operation (as well as memory and i/o intensive)
— Transaction logging tends to be write heavy
— Cassandra can handle it, but in practice it is better to split
workloads except for smaller cases, where latency doesn’t
matter or where the cluster is not generally under significant
load
— Consider DSE Hadoop, Spark, Storm as alternatives
34. Scenario
— In financial risk systems, positions have sensitivities to
certain variables
— Positions are hierarchical: each is associated with a trader
at a desk, which is part of an asset type in a certain
location.
— E.g. Frankfurt/FX/desk10/trader7/position23
— Sensitivity values are inserted for each position. We
need to aggregate them for each level in the hierarchy
— The sum of all sensitivities over time is the new
sensitivity, as they are represented as deltas.
35. Scenario
— E.g. Aggregations for:
— Frankfurt/FX/desk10/trader7
— Frankfurt/FX/desk10
— Frankfurt/FX
— As new positions are entered the risk sensitivities will
change and will need to be aggregated for each level
for the new value to be available
36. Queries
select * from risk_sensitivities_hierarchy
where hier_path = 'Paris/FX';
select * from risk_sensitivities_hierarchy
where hier_path = 'Paris/FX/desk4' and
sub_hier_path='trader3';
select * from risk_sensitivities_hierarchy
where hier_path = 'Paris/FX/desk4' and
sub_hier_path='trader3' and
risk_sens_name='irDelta';
38. Schema
create table if not exists risk_sensitivities_hierarchy (
hier_path text,
sub_hier_path text,
risk_sens_name text,
value double,
PRIMARY KEY (hier_path, sub_hier_path,
risk_sens_name)
) WITH compaction={'class': 'LeveledCompactionStrategy'};
NB: Notice the use of LCS, as we want the table to be efficient for
reads too
40. Method
— Write a service to write new sensitivities to
Cassandra periodically.
insert into risk_sensitivities_hierarchy
(hier_path, sub_hier_path, risk_sens_name,
value) VALUES (?, ?, ?, ?)
41. Method Contd.
— In our aggregator do the following periodically
— Select data for hierarchies we wish to aggregate
select * from risk_sensitivities_hierarchy where
hier_path = 'Frankfurt/FX/desk10/trader4'
— Will get all positions related to this hierarchy
— Add the values (represented as deltas) together to get
the new sensitivity
— E.g. S1 = -3, S2 = 2, S3 = -1
— Write the result back for 'Frankfurt/FX/desk10/trader4'
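The aggregation step itself is just a sum over the deltas returned for a hierarchy; a minimal sketch (the class and method names are illustrative, not from the example code):

```java
// Sum sensitivity deltas to produce the new aggregate sensitivity.
public class SensitivityAggregator {

    public static double aggregate(double[] deltas) {
        double sum = 0.0;
        for (double d : deltas) {
            sum += d;
        }
        return sum;
    }

    public static void main(String[] args) {
        // E.g. S1 = -3, S2 = 2, S3 = -1  ->  new sensitivity -2
        System.out.println(aggregate(new double[] {-3, 2, -1})); // prints -2.0
    }
}
```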
42. Comments
— A simple way to maintain up-to-date risk sensitivities
on an ongoing basis based on previous data
— Will mean (N hierarchies) * (N variables) queries
are executed periodically (keep an eye on this)
— Cursors, blocking queue and bounded collections
help us achieve the same result without reading
entire rows
— Has other applications, such as roll-ups for streaming
data, provided you have reasonably low cardinality
in terms of (time resolution) * (number of variables).
43. — Thanks to Patrick Callaghan for the hard work coding
the examples!
— Questions?