Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016

SASI, Cassandra on the full text search ride
DuyHai DOAN – Apache Cassandra evangelist

@doanduyhai
1 5 minutes introduction to Apache Cassandra™
2 SASI introduction
3 SASI cluster-wide
4 SASI local read/write path
5 Query planner
6 Some benchmarks
7 Take away
2

@doanduyhai
5 minutes introduction to Apache
Cassandra™

@doanduyhai
The tokens
4
Random hash of #partition à token = hash(#p)
Hash: ] –x, x ]
hash range: 264 values
x = 264/2
C*
C*
C*
C*
C* C*
C* C*

@doanduyhai
Token ranges
5
A: −x,−
3x
4
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
B: −
3x
4
,−
2x
4
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
C: −
2x
4
,−
x
4
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
D: −
x
4
,0
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
E: 0,
x
4
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
F:
x
4
,
2x
4
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
G:
2x
4
,
3x
4
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
H :
3x
4
,x
⎤
⎦
⎥
⎥
⎤
⎦
⎥
⎥
H
A
E
D
B C
G F

@doanduyhai
Distributed tables
6
H
A
E
D
B C
G F
user_id1
user_id2
user_id3
user_id4
user_id5
CREATE TABLE users(
user_id int,
…,
PRIMARY KEY(user_id)
),

@doanduyhai
Distributed tables
7
H
A
E
D
B C
G F
user_id1
user_id2
user_id3
user_id4
user_id5

@doanduyhai
Coordinator node
8
Responsible for handling requests (read/write)
Every node can be coordinator
• masterless
• no SPOF
• proxy role
H
A
E
D
B C
G F
coordinator
request
1
2 3

@doanduyhai
What is SASI ?
11
•  SSTable-Attached Secondary Index à new 2nd index impl that follows
SSTable life-cycle
•  Objective: provide more performant & capable 2nd index

@doanduyhai
Who created it ?
12
Open-source contribution by an engineers team

@doanduyhai
Why is it better than native 2nd index ?
13
•  follow SSTable life-cycle (flush, compaction, rebuild …) à more optimized
•  new data-strutures
•  range query (<, ≤, >, ≥) possible
•  full text search options

@doanduyhai
Distributed index
16
On cluster level, SASI works exactly like native 2nd index
H
A
E
D
B C
G F
UK user1 user102 … user493
US user54 user483 … user938

@doanduyhai
Distributed search algorithm
17
H
A
E
D
B C
G F
coordinator
1st round
Concurrency factor = 1

@doanduyhai
18
H
A
E
D
B C
G F
coordinator
Not enough
results ?

@doanduyhai
19
H
A
E
D
B C
G F
coordinator
2nd round

@doanduyhai
20
H
A
E
D
B C
G F
coordinator
Still not enough
results ?

@doanduyhai
21
H
A
E
D
B C
G F
coordinator
3rd round

@doanduyhai
Concurrency factor formula
22
•  more details at:
http://www.planetcassandra.org/blog/cassandra-native-secondary-index-
deep-dive/

@doanduyhai
Caveat 1: non restrictive filters
23
H
A
E
D
B C
G F
coordinator
Hit all
nodes
eventually
L

@doanduyhai
Caveat 1 solution : always use LIMIT
24
H
A
E
D
B C
G F
coordinator
SELECT *
FROM …
WHERE ...
LIMIT 1000

@doanduyhai
Caveat 2: 1-to-1 index (user_email)
25
H
A
E
D
B C
G F
coordinator
Not found WHERE user_email = ‘xxx'

@doanduyhai
26
H
A
E
D
B C
G F
coordinator
Still no result
WHERE user_email = ‘xxx'

@doanduyhai
27
H
A
E
D
B C
G F
coordinator
At best 1 user found
At worst 0 user found
WHERE user_email = ‘xxx'

@doanduyhai
Caveat 2 solution: materialized views
28
For 1-to-1 index/relationship, use materialized views instead
CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE user_id IS NOT NULL and user_email IS NOT NULL
PRIMARY KEY (user_email, user_id)
But range queries ( <, >, ≤, ≥) not possible …

@doanduyhai
Caveat 3: fetch all rows for analytics use-case
29
H
A
E
D
B C
G F
coordinator
Client

@doanduyhai
Caveat 3 solution: use co-located Spark
30
H
A
E
D
B C
G F
Local index ﬁltering in Cassandra
Aggregation in Spark
Local index query

@doanduyhai
SASI local read/write path

@doanduyhai
SASI Life-cycle: in-memory
32
Commit log1
. . .
1
Commit log2
Commit logn
Memory
. . .
MemTable
Table1
MemTable
Table2
MemTable
TableN
2
Index
MemTable1
Index
MemTable2
. . .
Index
MemTableN
3
ACK the client

@doanduyhai
Local write path data structures
33
Index mode, data type Data structure Usage
PREFIX, text Guava ConcurrentRadixTree name LIKE 'John%'
CONTAINS, text Guava ConcurrentSufﬁxTree
name LIKE ’%John%'
name LIKE ’%ny’
PREFIX, other JDK ConcurrentSkipListSet
age = 20
age >= 20 AND age <= 30
SPARSE, other JDK ConcurrentSkipListSet
age = 20
age >= 20 AND age <= 30
suitable for 1-to-N index with N ≤ 5

@doanduyhai
SASI Life-cycle: flush to SSTable
34
Commit log1
. . .
1
Commit log2
Commit logn
Memory
Table1
SStable1
Table2 Table3
SStable2 SStable3
4
OnDiskIndex1
OnDiskIndex2
OnDiskIndex3

@doanduyhai
SASI Life-cycle: compaction
35
SSTable1 SSTable2 SSTable3
New SSTable
OnDiskIndex1 OnDiskIndex2 OnDiskIndex3
New OnDiskIndex

@doanduyhai
Local write path summary
36
Index ﬁles are built
•  on memtable flush
•  on compaction flush
To avoid OOM, index ﬁles are split into chunk of
•  1Gb for memtable flush
•  max_compaction_flush_memory_in_mb for compaction flush
à consequences: SASI has impact on write bandwidth (CPU & disk I/O)

@doanduyhai
Local read path
37
•  first, optimize query using Query Planer (see later)
•  then load chunks (4k) of index files from disk into memory
•  perform binary search to find the indexed value(s)
•  retrieve the corresponding partition keys and push them into the Partition
Key Cache
à Yes, currently SASI only keep partition key(s) so on wide partition it’s not very
optimized ...

@doanduyhai
OnDiskIndex files
38
SStable1
SStable2
user_id4 FR user_id1 US user_id5 FR
user_id3 UK user_id2 DE
OnDiskIndex1
FR US
OnDiskIndex2
UK DE
B+Tree-like
data structures

@doanduyhai
OnDiskIndex Layout
39
Header
Block
Data Block
Pointer Block
Data Block
Meta
Pointer
Block Meta
Level Index
Offset
4k Multiple of 4k
Multiple of 4k
Levels
Count
Meta Data Info

@doanduyhai
Header Block Layout
40
Descriptor
Version
Header Block layout
variable
Term
Size
short
Index
Mode
Min
Term
Max
Term
Min
Pk
Max
Pk
Has
Partial
short short short short variable byte

@doanduyhai
OnDiskIndex Layout
41
Header
Block
Data Block
Pointer Block
Data Block
Meta
Pointer
Block Meta
Level Index
Offset
4k Multiple of 4k
Multiple of 4k
Levels
Count
Meta Data Info

@doanduyhai
Data Block layout
42
Terms Count
4k
Offset Array: [0, 10, 22, …] Term Block TokenTree Block
4k
Terms Count Offset Array: [0, 23, 35, …] Term Block TokenTree Block
…
Padding
Padding
Padding
Padding
Padding
Padding
Padding
Padding

@doanduyhai
OnDiskIndex Layout
43
Header
Block
Data Block
Pointer Block
Data Block
Meta
Pointer
Block Meta
Level Index
Offset
4k Multiple of 4k
Multiple of 4k
Levels
Count
Meta Data Info

@doanduyhai
Pointer Block building
44
Data Block1
Pointer Block1
4k
Data Block2 Data BlockN
…
LastTerm1 LastTerm2 LastTermN…
Pointer Block2
…
4k
Pointer BlockN
Pointer BlockN+1
LastTermM LastTermM+1 … LastTermO
Pointer BlockN+2
…
Root Pointer Block
Pointer Level 1
Pointer Level 2
Pointer Root
Level
Data Level

@doanduyhai
Binary search using OnDiskIndex files
45
Data Block1 Data Block2 Data BlockN
Pointer Block Pointer Block …
Root Pointer Block
Pointer Level 2
Pointer Level 3
Pointer Root Level
Data Level
Pointer Block
Pointer Block Pointer Block Pointer Block…
Pointer Block Pointer Block Pointer Block… Pointer Level 1
Data Block3 …

@doanduyhai
Term Block Binary Search
46
Term1 Term50 Term100
val < Term100 ?
Term25 Term75
Term50 Term75 Term100
val > Term50 ?
Term75Term50 Term63
val < Term75 ?
…
Term57
val = Term57 ?

@doanduyhai
Query planner
48
•  build predicates tree
•  predicates push-down & re-ordering
•  predicate fusions for != operator

@doanduyhai
Query optimization example
49
WHERE age < 100 AND fname LIKE 'p%' AND fname != 'pa%' AND age > 21

@doanduyhai
50
AND is associative
and commutative

@doanduyhai
51
!= transformed to
exclusion on range scan

@doanduyhai
52
AND is associative
and commutative

@doanduyhai
Hardware specs
13 bare-metal machines
• 6 CPU HT (12 vcores)
• 64Gb RAM
• 4 SSDs in RAID0 for a total of 1.5Tb
Data set
• 13 billions of rows
• 1 numerical index with 36 distinct values
• 2 text index with 7 distinct values
• 1 text index with 3 distinct values
54

@doanduyhai
Benchmark results
Full table scan using co-located Spark (no LIMIT)
55
Predicate count Fetched rows Query time in sec
1 36 109 986 609
2 2 781 492 330
3 1 044 547 372
4 360 334 116

@doanduyhai
Benchmark results
Full table scan using co-located Spark (no LIMIT)
56
Predicate count Fetched rows Query time in sec
1 36 109 986 609
2 2 781 492 330
3 1 044 547 372
4 360 334 116

@doanduyhai
Benchmark results
Beware of disk space usage for full text search !!!
Table albums with ≈ 110 000 records, 6.8Mb data size
57

@doanduyhai
SASI vs search engines
SASI vs Solr/ElasticSearch ?
•  Cassandra is not a search engine !!! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) à Apache Spark for analytics
If you don’t need the above features, SASI is for you!
59

@doanduyhai
SASI sweet spots
SASI is a relevant choice if
•  you need multi criteria search and you don't need ordering/grouping/scoring
•  you mostly need 100 to 10000 of rows for your search queries
•  you always know the partition keys of the rows to be searched for (this one applies to
native secondary index too)
•  you want to index static columns (SASI has no penalty since it indexes the whole
partition)
60

@doanduyhai
SASI blind spots
SASI is a poor choice if
•  you have strong SLA on search latency, for example few millisecs requirement
•  ordering of the search results is important for you
61

@doanduyhai63
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/

Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (17)

Similar a Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016

Similar a Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016 (20)

Más de Duyhai Doan

Más de Duyhai Doan (6)

Último

Último (20)

Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016