1. SASI, Cassandra on full text search ride
DuyHai DOAN
Apache Cassandra Evangelist
2. @doanduyhai
Who Am I ?
Duy Hai DOAN
Apache Cassandra Evangelist
• talks, meetups, confs
• open-source devs (Achilles, Apache Zeppelin…)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
3. Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 450+ employees
• Headquartered in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
4. SASI Index
• What is SASI ?
• Distributed Index
• Life-cycle
• Query Planner
7. How ?
New secondary index re-designed from scratch
• follow SSTable life-cycle (flush, compaction)
• new data structures
• full text search options
• no dependency on Apache Lucene
SASI = SSTable-Attached Secondary Index
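A minimal sketch of how such an index is declared in CQL. The users table layout, the country column and the index name are assumptions for illustration; only the index class name comes from Cassandra itself.

```cql
-- Hypothetical table: column names and types are assumptions.
CREATE TABLE users (
    user_id    bigint PRIMARY KEY,
    user_email text,
    name       text,
    country    text,
    age        int
);

-- SASI is declared as a custom index; its mode is passed in OPTIONS.
CREATE CUSTOM INDEX users_country_idx ON users (country)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

-- The indexed column can then be used in the WHERE clause.
SELECT * FROM users WHERE country = 'FR';
```

The index files are written next to each SSTable, which is why the index follows the flush and compaction life-cycle listed above.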
11. Index on user country
[Diagram: cluster ring with nodes A to H; each node holds its own local index, for example:]
FR user1 user102 … user493
US user54 user483 … user938
FR user87 user176 … user987
FR user17 user409 … user787
19. Caveat 2: 1-to-1 index (user_email)
[Diagram: cluster ring with nodes A to H and a coordinator]
WHERE user_email LIKE '%xxx%'
Not found
20. Caveat 2: 1-to-1 index (user_email)
[Diagram: cluster ring with nodes A to H and a coordinator]
WHERE user_email LIKE '%xxx%'
Still no result
21. Caveat 2: 1-to-1 index (user_email)
[Diagram: cluster ring with nodes A to H and a coordinator]
WHERE user_email LIKE '%xxx%'
At best 1 user found
At worst 0 user found
22. Caveat 2 solution: use materialized views
For 1-to-1 index/relationship, use materialized views instead
CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE user_id IS NOT NULL AND user_email IS NOT NULL
PRIMARY KEY (user_email, user_id);
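With such a view, an exact-match lookup by email could look like the sketch below (the email value is made up). Because user_email is the partition key of the view, the read targets a single partition instead of asking every node to search its local index.

```cql
-- Exact-match lookup on the view's partition key (hypothetical value).
SELECT * FROM user_by_email
WHERE user_email = 'john.doe@example.com';
```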
24. Caveat 3 solution: use co-located Apache Spark
[Diagram: cluster ring with nodes A to H, each node running a local index query]
Local index filtering in Cassandra
Aggregation in Spark
27. SASI Life-cycle: in-memory
[Diagram: write path in 3 numbered steps]
• Commit log (Commit log 1 … Commit log n)
• Memory: MemTable (Table 1 … Table N) and its Index MemTable (1 … N)
• ACK the client
28. IndexMemtable
Index mode, data type | Data structure | Usage
PREFIX, text | ConcurrentRadixTree (concurrent-trees library) | name LIKE 'John%'
CONTAINS, text | ConcurrentSuffixTree (concurrent-trees library) | name LIKE '%John%' ; name LIKE '%ny'
PREFIX, other | JDK ConcurrentSkipListSet | age = 20 ; age >= 20 AND age <= 30
SPARSE, other | JDK ConcurrentSkipListSet | age = 20 ; age >= 20 AND age <= 30
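The modes in the table map to the 'mode' option at index creation. The sketch below assumes a hypothetical users table with a text column name and an int column age; the index names are made up as well. SPARSE is declared the same way, with 'mode': 'SPARSE'.

```cql
-- CONTAINS on a text column: enables suffix and substring matching.
CREATE CUSTOM INDEX users_name_idx ON users (name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'CONTAINS'};

-- PREFIX on a non-text column: enables equality and range predicates.
CREATE CUSTOM INDEX users_age_idx ON users (age)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

-- Queries taken from the Usage column of the table.
SELECT * FROM users WHERE name LIKE '%John%';
SELECT * FROM users WHERE name LIKE '%ny';
SELECT * FROM users WHERE age = 20;
SELECT * FROM users WHERE age >= 20 AND age <= 30;
```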
42. Hardware specs
13 bare-metal machines
• 6 hyper-threaded CPU cores (12 vcores)
• 64 GB RAM
• 4 SSDs in RAID 0 for a total of 1.5 TB
Data set
• 13 billion rows
• 1 numerical index with 36 distinct values
• 2 text indexes with 7 distinct values
• 1 text index with 3 distinct values
49. Conclusion
Is it available ?
• yes in Cassandra 3.5
Future enhancement ?
• index on collections (List, Set & Map) !
• OR clause (WHERE (xxx OR yyy) AND zzz )
• != operator
50. Conclusion
SASI vs Solr/ElasticSearch ?
• Cassandra is not a search engine !!! (database = durability)
• always slower because of 2 passes (SASI index read + original Cassandra data read)
• no scoring
• no ordering (ORDER BY)
• no grouping (GROUP BY) → Apache Spark for analytics
Still, SASI covers 80% of search use-cases and people are happy !