2. Presenter
Shlomi Livne, VP of R&D
Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB he led the
research and development team at Convergin, which was
acquired by Oracle.
6. Development Cycle NoSQL Databases
Think about the queries you are going to run
Create a Data Model
7. Development Cycle NoSQL Databases
Think about the queries you are going to run
Create a Data Model
Use cassandra-stress (or other) to validate (*)
Develop
10. Development Cycle NoSQL Databases
Think about the queries you are going to run
Create a Data Model
Use cassandra-stress (or other) to validate (*)
Develop
CQL Optimization
11. Development Cycle NoSQL Databases
Think about the queries you are going to run
Create a Data Model
Use cassandra-stress (or other) to validate (*)
Develop
Scale test
13. Development Cycle NoSQL Databases
Think about the queries you are going to run
Create a Data Model
Use cassandra-stress (or other) to validate (*)
Develop
Scale test
Deploy
16. Development Cycle NoSQL Databases
Think about the queries you are going to run
Create a Data Model
Use cassandra-stress (or other) to validate (*)
Develop
Scale test
Deploy
Disk Access
17. What will Disk Access track
■ Disk Access looks at:
● Number of I/O operations
● Overall number of bytes read
■ When sstables are read from disk, two components are involved (everything else is in memory):
● Data - stores the actual data
● Index - provides lookup into the data file "blocks" that contain the partition (if the partition is large, it also contains a promoted index)
19. Disk Access - Why
● The memory : disk ratio is increasing:
○ EC2 i3 family memory : disk ratio is 1:30
○ EC2 i3en family memory : disk ratio is 1:78
○ More queries will be served from disk
● There are workloads you will always prefer to run from disk (background analytics)
21. An IoT application (+)
Total amount of data points: 526 billion temperature readings
1,000,000 sensors, representing homes in an area
1 reading per minute, 365 days (1 year storage requirement)
22. Analytics over the entire data?
How long would it take at normal speeds?
● At 200,000 points/second: 730 hours (30 days)
● At 1 million points/second: 146 hours (almost a week)
We need more if analytics are a part of the pipeline. That means we need Scylla, we need a good application, and we need hardware.
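The arithmetic is easy to sanity-check. A minimal back-of-envelope script (Python; the totals come straight from the previous slide and assume one reading per sensor per minute):

points = 1_000_000 * 1440 * 365        # 1M sensors, 1 reading/minute, 1 year ~= 526 billion
for rate in (200_000, 1_000_000):      # scan rates in points/second
    hours = points / rate / 3600
    print(f"{rate:>9,} points/s -> {hours:,.0f} hours (~{hours / 24:.0f} days)")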
23. Why climb Mount Everest?
"Because it's there." - George Leigh Mallory
What kind of performance are we after?
24. Data Model
CREATE TABLE readings (
sensor_id int,
date date,
time time,
temperature float,
PRIMARY KEY ((sensor_id, date), time))
What kind of queries can we reasonably support?
■ SELECT * from readings where sensor_id = ? and date = ?;
■ SELECT * from readings where sensor_id = ? and date = ? and time > ?;
25. Analytics Application Option 1
■ Let the server do as much work as possible
SELECT sensor_id,
date,
min(temperature) as minTemperature,
max(temperature) as maxTemperature
FROM readings where sensor_id = ? and date = ?
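A minimal sketch of what an Option 1 worker could look like, assuming the Python cassandra-driver (which also works with Scylla); the contact point, the iot keyspace, and the scan_partitions helper are illustrative assumptions, not part of the original application:

from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node1"])    # assumed contact point
session = cluster.connect("iot")       # assumed keyspace

stmt = session.prepare(
    "SELECT min(temperature) AS min_t, max(temperature) AS max_t "
    "FROM readings WHERE sensor_id = ? AND date = ?")

def scan_partitions(sensor_ids, dates):
    # One CQL read (and, in theory, one index I/O plus one data I/O)
    # per (sensor_id, date) partition assigned to this worker.
    for sensor_id in sensor_ids:
        for date in dates:
            row = session.execute(stmt, (sensor_id, date)).one()
            yield sensor_id, date, row.min_t, row.max_t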
26. Application
(Example) Total amount of data to scan: 1.44 billion points/day
[Diagram: a coordinator sets the time frame; workers (loader machines) compute the average, min, and max of all sensors against the ScyllaDB cluster.]
27. Disk Access Analysis Option 1 (in theory)
● For simplification let's assume:
○ Every partition:
■ is fully stored in a single sstable
■ is placed in exactly one data block
○ Bloom filters do not produce false positives
● Analysis:
Number of partitions: 365 * 10^6 = 365 Million
I/O for index: 365 Million
I/O for data: 365 Million
28. Analytics Application Option 2
■ Do range scans and use CQL GROUP BY (new in 3.2)
SELECT sensor_id,
date,
min(temperature) as minTemperature,
max(temperature) as maxTemperature
FROM readings where token(sensor_id, date) > X and
token(sensor_id, date) < Y GROUP BY sensor_id, date
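A rough sketch of an Option 2 worker, under the same assumptions as the Option 1 sketch; the vnode tokens are read from the driver's cluster metadata, and each consecutive pair of ring tokens bounds one vnode token range:

from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node1"])    # assumed contact point
session = cluster.connect("iot")       # assumed keyspace

stmt = session.prepare(
    "SELECT sensor_id, date, "
    "min(temperature) AS min_t, max(temperature) AS max_t "
    "FROM readings "
    "WHERE token(sensor_id, date) > ? AND token(sensor_id, date) <= ? "
    "GROUP BY sensor_id, date")

tokens = sorted(t.value for t in cluster.metadata.token_map.ring)
for start, end in zip(tokens, tokens[1:]):
    # One scan per vnode token range; the wrap-around range between the
    # last and first token must be split in two, omitted here for brevity.
    for row in session.execute(stmt, (start, end)):
        pass  # feed row.min_t / row.max_t into the per-day aggregates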
29. Application
(Example) Total amount of data to scan: 1.44 billion points/day
[Diagram: same coordinator / worker / ScyllaDB cluster layout as slide 26.]
30. Disk Access Analysis Option 2 (in theory)
● For simplification let's assume:
○ The application breaks requests up by vnode token ranges
○ Every partition:
■ is fully stored in a single sstable
■ is placed in exactly one data block (and is the only one there)
■ vnode token ranges do not share data blocks
○ Bloom filters do not produce false positives
● Analysis:
Number of scans: Number of vnode token ranges
I/O for index: Number of vnode token ranges * Number of shards
I/O for data: Number of data blocks
31. Disk Access Comparison
                 Option 1: Single Partition       Option 2: Range Scans
Number of ops    Number of partitions:            Number of scans = number of vnode
                 365 * 10^6 = 365 Million         token ranges: 83 * 256 = 21,248
I/O for index    365 Million                      Token ranges * number of shards:
                                                  83 * 256 * 54 = 1,147,392
I/O for data     365 Million                      365 Million
32. Billy using Full Scan - (theoretical) gain
1. The number of I/O ops for the Index on the cluster drops from 365 Million to ~1.2 Million
● In reality SSTable Bloom filters are not perfect, so single-partition reads will be attempted on sstables that don't have the partition - an even bigger win for scans
2. The number of CQL operations on the cluster drops from 365 Million to ~22K
● Returning a result per partition: 365M / 5000 (page size) = 73K pages (in the optimal case), so we will need more than 22K requests
3. In reality partitions do share data blocks - they are not perfectly aligned
33. Putting Disk Access into practice
● Queries
● Data model
● Some test data (at small scale)
● Docker
● Scylla Nightly (pre Scylla 3.2)
○ Tracing including disk access:
'... mc-132-big-Index.db: finished bulk DMA read of size 538 at offset 0, successfully read 4096 bytes [shard 0]'
● A simple script that parses system_traces.events after running a traced query (see the sketch below)
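The script itself is not shown in the deck; a rough sketch of what such a parser could look like, assuming the Python cassandra-driver and the DMA-read message format quoted above (the session id of the traced query has to be supplied by the caller):

import re
from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect()   # assumed contact point
dma = re.compile(
    r"-(Index|Data)\.db: finished bulk DMA read .*? successfully read (\d+) bytes")

def summarize(session_id):
    # Tally [ops, bytes] per component (Index vs Data) for one traced query.
    stats = {"Index": [0, 0], "Data": [0, 0]}
    rows = session.execute(
        "SELECT activity FROM system_traces.events WHERE session_id = %s",
        (session_id,))
    for row in rows:
        m = dma.search(row.activity or "")
        if m:
            stats[m.group(1)][0] += 1
            stats[m.group(1)][1] += int(m.group(2))
    return stats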
36. Billy on small scale
■ 1000 sensors, 100 dates, 1 sample per minute
■ 1 M partitions, 1440 M rows
■ # shards: 4
Results
              Single Partition   Range Scan   Gain
Index I/O     ~1.3 M             3,318        X 392
Index Bytes   ~2.8 GB            ~6.9 M       X 424
Data I/O      ~1 M               10,738       X 93
Data Bytes    ~14.4 GB           ~1.3 GB      X 11
37. Billy using Full Scan - the gain is even bigger
1. Read-ahead for the full scans makes better use of the disk
● Single Partition avg data bytes: 14748600348 / 1024089 = ~14.4K
● Range Scan avg data bytes: 1355390291 / 10738 = ~126K
2. AIO reads are sent to the disk aligned to Index/Data placement - yet disks do block-size reads:
● Doing 2 reads for two halves of a disk block will result in reading the block twice and returning part of it each time.
38. Should Range Scans always be used for analytics?
■ No
■ Not when Number of Partitions < Number of Token Ranges * Number of Shards (restated as code below)
■ What if we are doing a partial scan - what should we do?
a. Example: What was the max & min temperature over the last 7/30/90 days?
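The rule in the second bullet, restated as code (a hedged reading: a full scan pays at least one index I/O per token range per shard, so below that partition count, per-partition reads are cheaper):

def prefer_single_partitions(num_partitions, num_token_ranges, num_shards):
    # A full scan costs at least one index I/O per (token range, shard)
    # pair; with fewer partitions than that, one read per partition wins.
    return num_partitions < num_token_ranges * num_shards

# Billy: 365M partitions vs 83 * 256 * 54 ~= 1.15M pairs -> full scan wins.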
39. Billy+: Partial Scan
SELECT sensor_id,
date,
min(temperature) as minTemperature,
max(temperature) as maxTemperature
FROM readings where token(sensor_id, date) > X and
token(sensor_id, date) < Y and date >= Z GROUP BY sensor_id,
date ALLOW FILTERING
40. Billy+: Partial Scan
● If we go back to the simplifications, ~7% seems to be a good mark:
○ Partial scan < 7% of the data: use single partitions
○ Partial scan > 7% of the data: use a full scan and filter
● General case: it depends on how big the partitions are
○ Larger partitions carry a higher penalty when read unnecessarily
              Single Partition   Range Scan
Total I/O     ~2.3 M             14,056 (0.6%)
Total Bytes   ~17.7 GB           ~1.3 GB (7.7%)
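With the caveat from the slide that the real threshold depends on partition sizes, the ~7% mark could be applied as a simple planner check (the function name and threshold handling are illustrative only):

def choose_scan_strategy(scanned_fraction, threshold=0.07):
    # Below the ~7% mark, per-partition reads win; above it, a filtered
    # full scan wins. Larger partitions push the effective threshold
    # down, since reading them unnecessarily costs more.
    if scanned_fraction < threshold:
        return "single partitions"
    return "full scan + filter"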
43. Evaluating a data model
We need this done faster. For simplicity, let's add static min/max columns to each partition that cache the info - does this help?
CREATE TABLE readings (
sensor_id int,
date date,
time time,
temperature float,
temp_min float static,
temp_max float static,
PRIMARY KEY ((sensor_id, date), time))
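The deck does not show how temp_min/temp_max get populated; one hedged option is for each writer to track the running extremes of the partitions it owns and write them alongside every reading (reusing the session from the earlier sketches; safe with a single writer per sensor, while concurrent writers would need a stronger scheme):

ins = session.prepare(
    "INSERT INTO readings (sensor_id, date, time, temperature, "
    "temp_min, temp_max) VALUES (?, ?, ?, ?, ?, ?)")

extremes = {}   # (sensor_id, date) -> [min, max] seen so far by this writer

def record(sensor_id, date, time_, temperature):
    lo, hi = extremes.setdefault((sensor_id, date), [temperature, temperature])
    lo, hi = min(lo, temperature), max(hi, temperature)
    extremes[(sensor_id, date)] = [lo, hi]
    # temp_min / temp_max are static, i.e. shared by the whole partition,
    # so the latest write simply overwrites the previously cached extremes.
    session.execute(ins, (sensor_id, date, time_, temperature, lo, hi))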
44. ■ Do range scan’s and use CQL PER PARTITION LIMIT (new in 3.1)
SELECT sensor_id,
date,
temp_min,
temp_max
FROM readings where token(sensor_id, date) > X and
token(sensor_id, date) < Y PER PARTITION LIMIT 1
45. Results
              Range Scan   Range Scan pre-computed   Gain
Index I/O     3,318        2,874                     X 1.15
Index Bytes   ~6.9 M       ~5.9 M                    X 1.15
Data I/O      10,738       3,520                     X 3.05
Data Bytes    ~1.3 GB      ~430 M                    X 3.15
47. CQL BYPASS CACHE
■ Scylla uses read-through caching - if the information read is not in the cache, it will be added
■ CQL BYPASS CACHE allows overriding that for a specific query - don't read via the cache / don't populate the cache (example below)
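For example, the pre-computed range scan from slide 44 could be issued with the clause appended (same assumed driver session as in the earlier sketches), so a one-off analytics pass neither reads through nor populates the cache:

stmt = session.prepare(
    "SELECT sensor_id, date, temp_min, temp_max "
    "FROM readings "
    "WHERE token(sensor_id, date) > ? AND token(sensor_id, date) <= ? "
    "PER PARTITION LIMIT 1 "
    "BYPASS CACHE")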
48. CQL PER PARTITION LIMIT
Limits the number of rows that are returned for each partition
cqlsh:ks> select * from samples ;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
11 | 1 | 3
11 | 2 | 4
cqlsh:ks> select * from samples PER PARTITION LIMIT 1;
pk | ck | val
----+----+-----
10 | 1 | 1
11 | 1 | 3
49. CQL GROUP BY
■ The GROUP BY option allows condensing into a single row all selected rows that share the same values for a set of columns (limited to the partition key, optionally followed by clustering keys)
■ Aggregate functions will produce a separate value for each group.
cqlsh:ks> select * from samples ;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
11 | 1 | 3
11 | 2 | 4
select pk, min(val), max(val) from samples GROUP BY pk;
pk | system.min(val) | system.max(val)
----+-----------------+-----------------
10 |               1 |               2
11 |               3 |               4
50. CQL LIKE
■ Filtering using LIKE syntax
■ No need for indexing
cqlsh:ks> select * from samples ;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
11 | 1 | 3
11 | 2 | 4
cqlsh:ks> select * from samples where pk like '%0' ALLOW FILTERING;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
52. ● Disk Access:
○ Is another way to evaluate data models
○ It's especially useful for analytics / background batch processing jobs - since those will access data from disk
● Scylla 3.1 includes:
○ CQL:
■ BYPASS CACHE
■ PER PARTITION LIMIT
● Upcoming Scylla 3.2 will include:
○ Tracing with Disk Access
○ CQL:
■ GROUP BY
■ LIKE
■ Non-frozen UDTs (not covered)
● Optimized(*) full scans reduce the overall amount of disk access - when compared to aggregated single-partition scans
53. Thank you - Stay in touch
Any questions?
Shlomi Livne
shlomi@scylladb.com
@shlomilivne
Editor's notes
NoSQL - you need to start with the queries
Data Model is built to answer those queries
Testing the DataModel and the queries -
Some start with c-s or another simulation tool
This is more complex than it sounds - simulating the data distribution and the request distribution over the data set is not simple
Next step is to develop
And once you start you find some queries need to be updated / the data model needs to be changed
Last year we showed how CQL optimization using Monitoring can find development bugs earlier
Next - you move to scale testing - trying to emulate the real production data
In this scale test - you find that you may get large partitions - and that changes ...
You deploy
And you find yourself with hot partitions / large partitions that you may have not detected in scale testing
So this requires changes
Disk Access analysis can be done around Data Model verification and can detect some issues that would otherwise be found further down the line
Billy is the internal code name for the system that Glauber + … presented at the keynote session doing more than 1B ops per second
Or phrasing it differently - Glauber just showed you we can do it
My session is about showing you how we can do it EVEN BETTER
We do expose metrics for disk access, yet attributing them 100% to a single query is not possible (so we looked for a different way - not to mislead you)
Index Bytes = 2926800234
BYPASS CACHE goes hand in hand with Workload Prioritization
Workload Prioritization assures that the analytic workload co-exists side by side with the online workload
BYPASS CACHE allows enforcing this even further, assuring that analytics are only done from disk and do not "pollute" the cache