In this talk, Tzach Livyatan, VP Product at ScyllaDB, discusses NoSQL Data Modeling 101. He covers:
- NoSQL vs SQL data modeling
- What partition keys and clustering keys are, and how to choose them
- What are Materialized Views in NoSQL and ScyllaDB
13. Key / Value Example
INSERT INTO pet_owner(pet_chip_id,owner,pet_name) VALUES (a2a60505-3e17-4ad4-8e1a-f11139caa1cc, 642adfee-6ad9-4ca5-aa32-a72e506b8ad8, 'Buddy');
INSERT INTO pet_owner(pet_chip_id,owner,pet_name) VALUES (80d39c78-9dc0-11eb-a8b3-0242ac130003, 642adfee-6ad9-4ca5-aa32-a72e506b8ad8, 'Rocky');
INSERT INTO pet_owner(pet_chip_id,owner,pet_name) VALUES (92cf4f94-9dc0-11eb-a8b3-0242ac130003, b4a63c18-9dc0-11eb-a8b3-0242ac130003, 'Rin Tin Tin');
SELECT * FROM pet_owner;
SELECT * FROM pet_owner WHERE pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003;
SELECT * FROM pet_owner WHERE pet_name = 'Rocky'; -- (?)
14. Key / Value Example
UPDATE pet_owner SET pet_name = 'Cat' WHERE pet_chip_id = 92cf4f94-9dc0-11eb-a8b3-0242ac130003;
DELETE FROM pet_owner WHERE pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003;
SELECT * FROM pet_owner;
16. Choosing a Partition Key
■ High Cardinality
■ Even Distribution
Avoid
■ Low Cardinality
■ Hot Partition
■ Large Partition
https://www.codedrome.com/zipfs-law-in-python/
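To make the cardinality point concrete, here is a minimal Python sketch. It is a toy model, not Scylla's partitioner: the MD5-plus-modulo mapping, the three-node cluster, and the key names are all assumptions chosen for illustration. It counts how many rows each node would own when the partition key is a user ID versus a two-valued column.

```python
import hashlib
from collections import Counter

NODES = 3  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Toy partitioner: hash the partition key and map it to a node.

    Assumption: MD5 + modulo stands in for the real partitioner.
    """
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def rows_per_node(keys) -> Counter:
    """Count how many rows land on each node for the given partition keys."""
    return Counter(node_for(k) for k in keys)


# High cardinality: one partition per user ID, spread over all nodes.
by_user = rows_per_node(f"user-{i}" for i in range(30_000))

# Low cardinality: every row falls into one of only two partitions.
by_team = rows_per_node(("Team Angel", "Team Spike")[i % 2] for i in range(30_000))

print("user_id ->", dict(by_user))
print("team    ->", dict(by_team))
```

With only two possible key values, at most two nodes can ever hold data, and each of those partitions grows without bound: low cardinality, large partitions, and hot partitions tend to show up together.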
17. Choosing a Partition Key
■ User Name
■ User ID
■ User ID + Time
■ Sensor ID
■ Sensor ID + Time
■ Customer
■ State
■ Age
■ Favorite NBA Team
■ Team Angel or Team Spike
https://commons.wikimedia.org/
18. Wide Partition Example
Query:
SELECT * FROM heartrate_v10 WHERE
pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003 LIMIT 1;
SELECT * FROM heartrate_v10 WHERE
pet_chip_id = 80d39c78-9dc0-11eb-a8b3-0242ac130003 AND
time >= '2021-05-01 01:00+0000' AND
time < '2021-05-01 01:03+0000';
https://gist.github.com/tzach/7486f1a0cc904c52f4514f20f14d2a97
25. Example - Query by Owner
SELECT * FROM heartrate_v10 WHERE pet_chip_id = a2a60505-3e17-4ad4-8e1a-f11139caa1cc;
SELECT * FROM heartrate_v10 WHERE owner = 642adfee-6ad9-4ca5-aa32-a72e506b8ad8;
SELECT * FROM heartrate_v10 WHERE owner = 642adfee-6ad9-4ca5-aa32-a72e506b8ad8 ALLOW FILTERING;
https://gist.github.com/tzach/4b9dadbc6e8a9c50369da05631c5e13e
Try
TRACING ON;
TRACING OFF;
26. Solution - Materialized Views
CREATE TABLE heartrate_v10 (
pet_chip_id uuid, owner uuid, time timestamp, heart_rate int,
PRIMARY KEY (pet_chip_id, time)
);
CREATE MATERIALIZED VIEW heartrate_by_owner AS
SELECT * FROM heartrate_v10
WHERE owner IS NOT NULL AND pet_chip_id IS NOT NULL AND time IS NOT NULL
PRIMARY KEY(owner, pet_chip_id, time);
SELECT * FROM heartrate_by_owner WHERE owner = 642adfee-6ad9-4ca5-aa32-a72e506b8ad8;
ALTER MATERIALIZED VIEW heartrate_by_owner [WITH table_options];
DROP MATERIALIZED VIEW heartrate_by_owner;
https://docs.scylladb.com/getting-started/mv/
Tzach - VP of Product
Session is available in Scylla U as a course
Let’s go over some important terms:
A Cluster is a collection of nodes that Scylla uses to store the data. The nodes are logically distributed like a ring. A minimum cluster typically consists of at least three nodes. Data is automatically replicated across the cluster, depending on the Replication Factor. This cluster is often referred to as a ring architecture, based on a hash ring — the way the cluster knows how to distribute data across the different nodes.
A Keyspace is a top-level container that stores tables with attributes that define how data is replicated on nodes. It defines a number of options that apply to all the tables it contains, the most important of which is the replication strategy used by the Keyspace. A keyspace is comparable to the concept of a database Schema in the relational world. Since the keyspace defines the replication factor of all underlying tables, if we have tables that require different replication factors we would store them in different keyspaces.
A Table is how Scylla stores data and can be thought of as a set of rows and columns.
A Partition is a collection of sorted rows, identified by a unique partition key. More on keys later on in this session. Each partition is stored on a node and replicated across nodes.
A Row in Scylla is a unit that stores data. Each row has a primary key that uniquely identifies it in a Table. Each row stores data as pairs of column names and values. In case a Clustering Key is defined, the rows in the partition will be sorted accordingly. More on that later on.
CQL is a query language that is used to interface with Scylla. It allows us to perform basic functions such as insert, update, select, delete, create, and so on.
CQL is similar to SQL in some ways; however, there are some differences.
Keyspace options:
replication: The replication strategy and options to use for the keyspace.
durable_writes: Whether to use the commit log for updates on this keyspace (disable this option at your own risk!).
Share a terminal
> ty-share
Before we create a table, we need to know:
Data types
Keys
Table
Collections describe a group of items connected to a single key, which helps simplify data modeling
Remember to use the appropriate collection for each use case
Keep collections small to prevent high latency when querying the data
Sets are ordered alphabetically or based on the natural sorting method of the type. Examples: multiple email addresses or phone numbers per user
Lists keep objects in the order defined by the user
A Map is a name and a pair of typed values (key and value), which is very helpful for logging sequential events
Summary:
Collections help users organize their data
Collections should be used only where appropriate, because of their performance impact
A Partition Key is one or more columns that are responsible for data distribution across the nodes. It determines on which node a given row is stored.
Every table must have a Partition Key.
In the example below the Partition Key is the ID column. A consistent hash function, also known as the partitioner, is used to determine to which nodes data is written.
Scylla transparently partitions data and distributes it to the cluster. Data is replicated across the cluster. A Scylla cluster is visualized as a ring, where each node is responsible for a range of tokens and each row is mapped to a token by hashing its partition key.
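The ring idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Scylla's implementation: MD5 stands in for the real partitioner, each node gets a single token, and the `Ring` class and node names are hypothetical. It hashes a partition key to a token, finds the node that owns that token range, and walks clockwise to pick the remaining replicas.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Toy partitioner: hash the partition key to a 64-bit token (assumption: MD5)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Hypothetical hash ring with one token per node."""

    def __init__(self, nodes, replication_factor):
        self.rf = replication_factor
        # Each node owns the token range that ends at its own token.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key):
        """Return the RF nodes that store this key, walking clockwise on the ring."""
        start = bisect.bisect_left(self.tokens, token(partition_key))
        n = len(self.entries)
        # The modulo wraps around the end of the ring back to the first node.
        return [self.entries[(start + i) % n][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("80d39c78-9dc0-11eb-a8b3-0242ac130003")
print(owners)
```

The same key always maps to the same replicas, which is what lets any node act as coordinator and route a request without a central lookup.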
Allows fast queries by pet, and only by pet!
PRIMARY KEY = Partition + Clustering Key
Why is a large partition a problem? Is it a problem?
Large partitions may lead to hot partitions
Index implementation (no longer an issue in Scylla)
By default, sorting is based on the natural (ASC) order of the clustering columns. What happens if we want to reverse the order? What if our query is to find the heart rate by pet_chip_id and time, but we want to look at the ten most recent records?
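A small Python sketch of why the stored order matters for "most recent" queries. It models one partition as a list of (time, heart_rate) rows; the `latest` helper and the sample data are hypothetical, and the sort call only stands in for the order in which rows are kept on disk.

```python
from datetime import datetime, timedelta

# One partition: rows for a single pet_chip_id, keyed by the clustering column `time`.
start = datetime(2021, 5, 1, 1, 0)
rows = [(start + timedelta(seconds=10 * i), 80 + i % 7) for i in range(100)]


def latest(partition_rows, limit, stored_descending):
    """Return the `limit` most recent rows of one partition, newest first.

    `stored_descending` models the clustering order of the table: with a
    descending order the newest rows sit at the front of the partition, so
    the answer is simply the first `limit` rows.
    """
    ordered = sorted(partition_rows, key=lambda r: r[0], reverse=stored_descending)
    if stored_descending:
        return ordered[:limit]        # newest rows are at the front
    return ordered[-limit:][::-1]     # newest rows are at the back


newest_desc = latest(rows, 10, stored_descending=True)
newest_asc = latest(rows, 10, stored_descending=False)
print(newest_desc[0])
```

Both variants return the same ten rows; the difference is that with a descending clustering order the query reads a prefix of the partition, which is the shape a LIMIT query handles best.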
Now that we see that we are able to query each individual pet, what about their owners?
Let’s try
Scylla will output an error message saying that the query might hurt performance. If you want to run the query anyway, you should use ALLOW FILTERING
Works
Scylla raises an error because we are querying a regular column that is not indexed. This hurts performance because Scylla has to do a full scan, reading all the data first and filtering it afterwards.
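The cost difference can be sketched with a toy in-memory table in Python. Everything here is an illustrative assumption (the dict-based table, the helper names, the sample data); it only counts how many rows each kind of query has to read.

```python
from collections import defaultdict

# Toy table: partition key (pet_chip_id) -> rows stored in that partition.
table = defaultdict(list)


def insert(pet_chip_id, owner, heart_rate):
    table[pet_chip_id].append(
        {"pet_chip_id": pet_chip_id, "owner": owner, "heart_rate": heart_rate}
    )


def select_by_pet(pet_chip_id):
    """Partition-key lookup: touches a single partition."""
    rows = table.get(pet_chip_id, [])
    return rows, len(rows)  # (result, rows read)


def select_by_owner_allow_filtering(owner):
    """No partition key in the WHERE clause: read every row, then filter."""
    rows_read = 0
    result = []
    for partition in table.values():
        for row in partition:
            rows_read += 1
            if row["owner"] == owner:
                result.append(row)
    return result, rows_read


# 100 pets, 10 owners, 5 heart-rate readings per pet.
for i in range(100):
    for t in range(5):
        insert(f"pet-{i}", f"owner-{i % 10}", 80 + t)

pet_rows, pet_read = select_by_pet("pet-7")
owner_rows, owner_read = select_by_owner_allow_filtering("owner-7")
print(len(pet_rows), pet_read)      # rows returned vs rows read
print(len(owner_rows), owner_read)  # rows returned vs rows read
```

The lookup by pet reads exactly what it returns, while the filtered query reads the whole table to return a fraction of it, and that gap grows with the table.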
Use TRACING to see how much the performance is affected
One way to solve this problem: create one table keyed by owner ID and another keyed by pet ID, and have the application do dual writes. The problem here is that we now need to make sure that both tables stay synchronized.
MVs - a Materialized View is a new table that is updated automatically from the base table
Show syntax
But if we create a view with owner as the partition key, we can then query the view by its partition key (owner)
Every time the Coordinator receives an insert, Scylla writes to the base table and propagates the corresponding updates to the MV replicas
They are kept in sync and work like any other table, except that Scylla rejects writes made directly to a materialized view
No magic: that's a trade-off between read latency and disk space
Every MV that you create requires additional disk space
When you query the MV directly, Scylla reads from the MV itself, which gives low latency
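The bookkeeping a materialized view saves the application can be sketched in Python. This is a conceptual model only, not how Scylla implements views: the `HeartrateTable` class is hypothetical, keeps the base table and an owner-keyed copy in plain dicts, and updates both on every write.

```python
class HeartrateTable:
    """Hypothetical base table plus one automatically maintained view."""

    def __init__(self):
        self.base = {}      # (pet_chip_id, time) -> row
        self.by_owner = {}  # owner -> {(pet_chip_id, time): row}

    def _drop_view_row(self, key):
        """Remove the view row that belongs to an existing base row, if any."""
        old = self.base.get(key)
        if old is not None and old["owner"] is not None:
            self.by_owner.get(old["owner"], {}).pop(key, None)

    def insert(self, pet_chip_id, owner, time, heart_rate):
        key = (pet_chip_id, time)
        self._drop_view_row(key)  # the owner may have changed
        row = {"pet_chip_id": pet_chip_id, "owner": owner,
               "time": time, "heart_rate": heart_rate}
        self.base[key] = row
        if owner is not None:  # mirrors `WHERE owner IS NOT NULL` in the view
            self.by_owner.setdefault(owner, {})[key] = row

    def delete(self, pet_chip_id, time):
        key = (pet_chip_id, time)
        self._drop_view_row(key)
        self.base.pop(key, None)

    def select_by_owner(self, owner):
        """Read from the view only: one lookup by the view's partition key."""
        return [row for _, row in sorted(self.by_owner.get(owner, {}).items())]


t = HeartrateTable()
t.insert("pet-1", "alice", 1, 90)
t.insert("pet-2", "alice", 1, 85)
t.insert("pet-3", "bob", 1, 70)
t.insert("pet-1", "bob", 1, 91)   # same primary key, new owner: the view row moves
t.delete("pet-3", 1)

alice_rows = t.select_by_owner("alice")
bob_rows = t.select_by_owner("bob")
view_row_count = sum(len(rows) for rows in t.by_owner.values())
print(alice_rows, bob_rows, view_row_count)
```

Every base row with a non-null owner exists a second time in the view, which is the disk-space side of the trade-off, and every write touches both structures, which is the work dual writes would otherwise push onto the application.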