What happens to a request that reaches Scylla, and why should one care? Understanding how Scylla executes your queries can help you make better architectural decisions and also better understand the performance of your application.
Are my rows too big? Should I make that other column a part of my partition key instead? This talk will cover the interaction between nodes, shards and the role of Scylla's internal components like memtables, cache and sstables. I will explain how different types of queries are executed and how to plan your queries for maximum performance.
Scylla Summit 2017: Planning Your Queries for Maximum Performance
1. Planning your queries for maximum performance
Shlomi Livne
VP R&D, ScyllaDB
2. Shlomi Livne
Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB, he led the research and development team at Convergin, which was acquired by Oracle.
3. How Scylla executes your queries
4. Cluster View
[Diagram: a client connected to a cluster of eight nodes (1 to 8), with nodes labeled Coordinator and Replica.]
5. Coordinator Tasks
1. Prepare the statement
2. Single partition queries
   a. Select replicas (using cache heat info) and send query/digest requests, each requesting a page of results
   b. Compare the digests; if there is a mismatch:
      i. Request data from the selected replicas
      ii. Repair the data on the replicas
   c. Return the result
3. Partition scan queries
   a. Split the request up based on the ring
   b. Send requests for data using ranges, each requesting a page of results
   c. Merge the results
   d. Return the result
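The single-partition flow above (data request to one replica, digest requests to the others, repair on mismatch) can be sketched in a few lines of Python. This is only an illustration under assumptions: replicas are modeled as in-memory dicts, the digest is an MD5 over the cells (the real digest algorithm is not stated in the slides), and reconciliation keeps the cell with the newest write timestamp. The names digest, reconcile and coordinator_read are hypothetical.

```python
import hashlib


def digest(row):
    """Hash of a row's cells; row maps column -> (value, write_timestamp)."""
    h = hashlib.md5()
    for col in sorted(row):
        value, ts = row[col]
        h.update(f"{col}={value}@{ts};".encode())
    return h.hexdigest()


def reconcile(rows):
    """Merge rows cell by cell, keeping the cell with the newest timestamp."""
    merged = {}
    for row in rows:
        for col, (value, ts) in row.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    return merged


def coordinator_read(replicas):
    """replicas: list of dicts, each standing in for one replica's copy of the row."""
    data = replicas[0]                            # data request to one replica
    digests = [digest(r) for r in replicas[1:]]   # digest requests to the rest
    if all(d == digest(data) for d in digests):
        return data                               # digests match: return the result
    merged = reconcile(replicas)                  # mismatch: request full data from all
    for r in replicas:                            # repair the data on the replicas
        r.clear()
        r.update(merged)
    return merged
```

When the digests agree, only one replica ships the full row; the other replicas send just a hash, which is what keeps the common case cheap.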
6. Replica Tasks
1. Receive a data/digest/range request
2. Split the request up according to shards
3. On each shard:
   a. Execute the request, merging data from memtables + cache/sstables
   b. For a data request:
      i. Prepare a result and return it (compute a digest if RF > 1)
   c. For a digest request:
      i. Compute the digest and return it
   d. For a partition scan request:
      i. Return the partition range data (do not prepare a result)
7. Replica Shard Read Diagram
[Diagram: a read request on a shard is served from the memtable, the row cache and four sstables, each with Bloom Filter, Summary, Index, Compression and Data components; the output is the result.]
8. Replica Shard Read Diagram
[Diagram: read of P8:R1 with the row present in the row cache.
Memtable: P8:R1:C=3
Row Cache: P8:R1:A=8,B=7
sstable 1: Bloom Filter P8, Summary P8, Index P8, Data P8:R1:A=8
sstable 2: Bloom Filter P8, Index P8, Data P8:R1:B=7
sstable 3: Bloom Filter P8, no further entries for P8
sstable 4: no entries for P8]
9. Replica Shard Read Diagram
[Diagram: same state as slide 8; the read of P8:R1 returns A=8,B=7,C=3, combining the row cache (A=8,B=7) with the memtable (C=3).]
10. Replica Shard Read Diagram
[Diagram: read of P8:R1 with an empty row cache. Memtable: P8:R1:C=3. The sstables are as on slide 8.]
11-14. Replica Shard Read Diagram
[Diagram, animation steps: contents unchanged from slide 10; the row cache is still empty.]
15. Replica Shard Read Diagram
[Diagram: the row cache now holds P8:R1:A=8,B=7, the data stored in the two sstables that contain the row; the memtable still holds P8:R1:C=3.]
16-17. Replica Shard Read Diagram
[Diagram: the read of P8:R1 returns A=8,B=7,C=3; the row cache holds P8:R1:A=8,B=7 and the memtable holds P8:R1:C=3.]
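The read sequence in the diagrams (cache hit, then cache miss followed by sstable reads and cache population) can be approximated with the toy model below. It is a sketch under assumptions, not Scylla's implementation: the SSTable class and read_row function are hypothetical, the Bloom filter is modeled as a plain set of partitions it answers "maybe" for, and cells are merged without write timestamps (sstables are assumed to hold disjoint columns, as in the diagram). Treating the third sstable as a Bloom filter false positive is a reading of the diagram rather than something the slides state.

```python
class SSTable:
    def __init__(self, rows, bloom_keys):
        self.rows = rows              # {(partition, row): {column: value}}
        self.bloom_keys = bloom_keys  # partitions the Bloom filter answers "maybe" for

    def might_contain(self, partition):
        return partition in self.bloom_keys


def read_row(key, memtable, row_cache, sstables):
    """Return the merged row for key = (partition, row)."""
    partition, _ = key
    if key not in row_cache:
        # Cache miss: read only the sstables whose Bloom filter may hold the partition.
        on_disk = {}
        for sst in sstables:
            if sst.might_contain(partition):
                on_disk.update(sst.rows.get(key, {}))
        row_cache[key] = on_disk      # populate the cache for later reads
    # Merge cached (on-disk) data with the not-yet-flushed memtable data.
    result = dict(row_cache[key])
    result.update(memtable.get(key, {}))
    return result


key = ("P8", "R1")
memtable = {key: {"C": 3}}
row_cache = {}
sstables = [
    SSTable({key: {"A": 8}}, {"P8"}),
    SSTable({key: {"B": 7}}, {"P8"}),
    SSTable({}, {"P8"}),              # Bloom filter says "maybe", but no data
    SSTable({}, set()),               # Bloom filter rules the partition out
]
print(read_row(key, memtable, row_cache, sstables))
```

A second read of the same key finds the row in row_cache and skips the sstables entirely, which mirrors the cache-hit case on slides 8 and 9.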
18. Row Cache
▪ Cache stores complete row data
▪ In addition to storing existing rows, the cache stores information about the completeness of clustering ranges (continuity), so that reads covering the ranges between cached rows do not count as cache misses.
▪ Cache is populated on:
o Queries
o Memtable flush:
• Data is merged, to keep the cache up to date with the newly written sstables.
• Data is inserted, in case there is no data for that partition on disk.
19. Selecting sstables
▪ Given a partition key (pk), the current set of sstables is reduced so that sstable X is included iff:
o min_partition_key(sstable X) <= pk <= max_partition_key(sstable X)
o bloom_filter(sstable X, pk) = True
▪ Scylla 2.0: sstables will be read in parallel
▪ Scylla 2.1:
o The reduced set of sstables is searched newest to oldest until a result can be constructed and we can prove that older sstables are not relevant.
o Sstable read parallelism will grow, starting from a single sstable
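The two inclusion conditions above (key range check, then Bloom filter check) can be illustrated with a small self-contained sketch. The BloomFilter and SSTable classes and the select_sstables function are hypothetical; keys are plain integers compared directly, whereas a real cluster orders partitions by token, and the filter parameters (1024 bits, 3 hashes) are arbitrary.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    def __init__(self, name, keys):
        self.name = name
        self.min_pk, self.max_pk = min(keys), max(keys)
        self.bloom = BloomFilter()
        for k in keys:
            self.bloom.add(k)


def select_sstables(sstables, pk):
    """Keep an sstable iff pk is within its key range and its Bloom filter says "maybe"."""
    return [s for s in sstables
            if s.min_pk <= pk <= s.max_pk and s.bloom.might_contain(pk)]


tables = [SSTable("a", [1, 5, 9]), SSTable("b", [10, 20]), SSTable("c", [2, 3, 8])]
print([s.name for s in select_sstables(tables, 5)])
```

The range check is a cheap comparison that needs no hashing, so it is applied first; the Bloom filter then removes most of the sstables whose range covers the key but which do not contain it.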
20. 7 Rules to Optimize Your Queries
21. Rule #1 - Use Prepared statements
▪ With unprepared statements, the coordinator needs to pre-process every query:
o A lot of repetitive work that could be done only once
o Adds overhead to the execution of a query, which directly translates to throughput and latency
▪ With unprepared statements, the driver is not able to send the request to a coordinator node that holds the data (an additional hop)
▪ tip: compare scylla_query_processor_statements_prepared to the number of executed requests, scylla_transport_requests_served
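A minimal sketch of the prepare-once, execute-many pattern follows. It assumes a session object shaped like the one in the Python cassandra-driver, whose Session.prepare and Session.execute methods are used here; the RowReader class is hypothetical, and the table ks.cf with partition key pk is borrowed from the sample query later in the talk.

```python
class RowReader:
    """Prepares the SELECT once and reuses it for every lookup."""

    def __init__(self, session):
        self.session = session
        # Parsed once; later executions send only the statement id and the bound values.
        self.select = session.prepare("SELECT * FROM ks.cf WHERE pk = ?")

    def get(self, pk):
        # Bound values are passed separately from the statement text.
        return self.session.execute(self.select, (pk,))
```

The anti-pattern this avoids is building a new query string per request (for example with string formatting), which makes the coordinator parse every request from scratch.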
22. Sample: single Scylla server, using c-s
Results                      Unprepared   Prepared
op rate                      13037        18704
partition rate               13037        18704
row rate                     13037        18704
latency mean                 1.5          1.1
latency median               1.3          1
latency 95th percentile      2.9          1.6
latency 99th percentile      6.2          2.5
latency 99.9th percentile    12.2         7.1
latency max                  31.1         16.9
Total partitions             100000       100000
23. Rule #2 - Use Paging
▪ Paging disabled: the coordinator is forced to prepare a single result that holds all the data and send it back:
o If the coordinator is not able to return a response (it cannot allocate enough memory for the single result), an error is returned to the client
o tip: compare scylla_transport_unpaged_queries to scylla_cql_reads to detect whether many of your read queries are unpaged
24. Rule #3 - Use correct Page Size
▪ Drivers enable paging by default, with a default page_size of 5000 rows (java, python, gocql)
▪ CQL requires returning at least one result and allows returning fewer results than the page size
▪ Scylla utilizes this:
o Scylla caps a page at ~1MB of memory, so it returns fewer rows than requested when rows are large
o Do not use the number of returned results as an indication of whether there are more results
25. [Figure; the only recoverable label is "Has more pages".]
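Why the row count is not a reliable end-of-results signal can be shown with a small simulation. Everything here is hypothetical: fetch_page stands in for a server that honors page_size but also stops at a byte cap (the ~1MB figure from Rule #3), and read_all is a client loop that relies only on the "has more pages" flag.

```python
PAGE_BYTES_CAP = 1_000_000  # ~1MB cap per page, as described in Rule #3


def fetch_page(rows, start, page_size):
    """Return (page, next_position, has_more); always returns at least one row if any remain."""
    page, used, i = [], 0, start
    while i < len(rows) and len(page) < page_size:
        if page and used + len(rows[i]) > PAGE_BYTES_CAP:
            break  # byte cap reached before page_size rows were collected
        page.append(rows[i])
        used += len(rows[i])
        i += 1
    return page, i, i < len(rows)


def read_all(rows, page_size=5000):
    """Client loop driven by the has_more flag, not by len(page) == page_size."""
    out, pos, has_more, pages = [], 0, True, 0
    while has_more:
        page, pos, has_more = fetch_page(rows, pos, page_size)
        out.extend(page)
        pages += 1
    return out, pages


rows = [b"x" * 100_000] * 25            # 25 rows of 100 KB each
first_page, _, more = fetch_page(rows, 0, 5000)
print(len(first_page), more)            # a short page, yet more data remains
print(read_all(rows)[1])                # number of pages needed for all rows
```

A client that stopped as soon as a page came back with fewer than page_size rows would silently drop most of this partition.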
26. Scylla 2.0: does the default page_size make sense?
Test: duration in milliseconds of fetching a single wide partition of 10^8 bytes, split into rows, using different page sizes
page size   10^6 rows of 100 bytes   10^5 rows of 1000 bytes   10^4 rows of 10^4 bytes   1000 rows of 10^5 bytes
10          timed out                2104.492031               331.087871                173.932543
50          5679.087615              737.148927                202.113023                168.165375
100         4034.920447              573.046783                186.384383                168.951807
500         2663.383039              415.760383                183.894015                173.015039
1000        2451.570687              395.313151                182.976511                168.427519
5000        2285.895679              400.031743                184.942591                169.345023
10000       2281.701375              399.769599                183.369727                169.738239
50000       2273.312767              396.099583                183.107583                170.000383
27. C* 3.11.0: does the default page_size make sense?
Test: duration in milliseconds of fetching a single wide partition of 10^8 bytes, split into rows, using different page sizes
page size   10^6 rows of 100 bytes   10^5 rows of 1000 bytes   10^4 rows of 10^4 bytes   1000 rows of 10^5 bytes
10          timed out                4030.726143               903.872511                364.380159
50          12876.51328              1535.115263               419.430399                300.941311
100         8992.587775              1202.716671               405.274623                316.407807
500         6400.507903              907.542527                354.680831                348.651519
1000        6077.546495              874.512383                360.972287                370.409471
5000        5620.367359              791.674879                422.051839                358.612991
10000       5490.343935              793.772031                389.021695                360.447999
50000       5662.310399              913.833983                383.516671                355.467263
tip: consider changing the page size if your rows are large
28. Rule #4 - Beware of Multi Partition CQL IN queries
▪ Multi-partition CQL IN queries force the coordinator node to split the query up into single-partition queries and aggregate the results.
29. Rule #5 - Beware of Single Partition CQL IN queries
Question: Should I split the CQL IN query?
Sample:
▪ CQL: “Select * from ks.cf where pk = X and ck in (Y1, Y2, … Yn)”
Translated to:
▪ CQL:
o “Select * from ks.cf where pk = X and ck = Y1”
o “Select * from ks.cf where pk = X and ck = Y2”
o …
o “Select * from ks.cf where pk = X and ck = Yn”
30-33. [Four slides containing figures only; no text survived extraction.]
34. Question: Should I split the CQL IN query?
Answer: It depends on how wide your rows are
Comments:
▪ Prior to Scylla 2.0, single-partition CQL IN queries performed very badly in some wide-partition cases.
▪ All reported results are using Scylla 2.0
35. Rule #6 - There’s a faster way to do full scans
▪ The blog post efficient-full-table-scans-with-scylla outlined an algorithm to do full scans; at a high level:
o split the range up into small sub-ranges
o run “enough” sub-ranges in parallel
▪ The follow-up blog post, How to scan 475 million partitions 12x faster using efficient full table scan, provided a sample implementation applying this
▪ Is there an even “faster” way?
36. ▪ Yes, there is:
o Using the token ownership of nodes in the ring, one can select ranges of tokens. Once a “range” has been processed, the next “range” can be selected based on the ownership in the ring.
o An even more optimized solution would use the “sharding” information and aim ranges at specific shards on a machine, so that all cores are executing requests in parallel.
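The baseline algorithm from Rule #6 (split the token range into small sub-ranges and run "enough" of them in parallel) can be sketched as below. This is an illustration under assumptions: split_ring and full_scan are hypothetical names, run_query is a caller-supplied function that executes one statement with its two bound tokens and returns rows, the table ks.cf and key pk come from the earlier sample, and the token bounds are those of the Murmur3 partitioner. The sketch does not use ring ownership or shard information, which the slides describe as the further optimization.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 partitioner token bounds

SCAN = "SELECT * FROM ks.cf WHERE token(pk) > %s AND token(pk) <= %s"


def split_ring(n):
    """Split the full token range into n contiguous (start, end] sub-ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def full_scan(run_query, sub_ranges=1024, parallelism=16):
    """Run one range query per sub-range, keeping `parallelism` queries in flight."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        pages = pool.map(lambda r: run_query(SCAN, r), split_ring(sub_ranges))
        return [row for page in pages for row in page]
```

Choosing many more sub-ranges than the parallelism level keeps each query small and lets fast ranges finish without waiting for slow ones; picking ranges by node and shard ownership, as described above, would additionally spread the in-flight queries evenly across cores.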
37. Rule #7 - Use the tools
▪ Probabilistic tracing
▪ Slow query tracing
▪ Wireshark
▪ CQL Trace
▪ Enable client-side tracing
38. THANK YOU
shlomi@scylladb.com
@ShlomiLivne
Any questions?