Being Closer to Cassandra

Oleg Anastasyev
lead platform developer
Odnoklassniki.ru
Top 10 of World’s social networks
40M DAU, 80M MAU, 7M peak
~ 300 000 www req/sec,
20 ms render latency
>240 Gbit out
> 5 800 iron servers in 5 DCs
99.9% java

#CASSANDRAEU

* Odnoklassniki means “classmates” in English
Cassandra @
* Since 2010

- branched 0.6
- aiming at: full operation on DC failure,
  scalability, ease of operations

* Now

- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures

Case #1. The fast

Like! 103 927


You and 103 927
Like! widget
* It's everywhere

- A dozen on every page
- On feeds (AKA timeline)
- On 3rd-party websites elsewhere on the internet

* It's on everything

- Pictures and Albums
- Videos
- Posts and comments
- 3rd party shared URLs


Like! 103 927
Like! widget
* High load

- 1 000 000 reads/sec, 3 000 writes/sec

Like! 103 927
Hard load profile

- Read most
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities

Classic solution
SQL table:

RefId:long   RefType:byte   UserId:long   Created
9999999999   PICTURE(2)     11111111111   11:00

to render "You and 4256":

SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?             = N >= 1  (98% are NONE)
SELECT COUNT (*) WHERE RefId,RefType=?,?                  = M > N   (80% are 0)
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)   = N*140
Cassandra solution

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId), userId )
)

LikeCount (
  refType byte,
  refId bigint,
  likers counter,
  PRIMARY KEY ( (refType, refId) )
)

so, to render "You and 4256":

SELECT FROM LikeCount WHERE RefId,RefType=?,?              (80% are 0)
SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,?   (98% are NONE)

= N*20%
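The two-step read above (counter first, then the viewer's own like) can be sketched with plain maps standing in for the two column families; `LikeRender` and all its names are illustrative, not the production API:

```java
import java.util.*;

// Sketch of the Like! render path: hit LikeCount first (80% of entities
// have zero likes), and only then check LikeByRef for the viewer's own like
// (98% of those lookups find nothing). Maps stand in for the two CFs.
public class LikeRender {
    // (refType, refId) -> total likers
    static final Map<String, Long> likeCount = new HashMap<>();
    // (refType, refId) -> set of userIds who liked it
    static final Map<String, Set<Long>> likeByRef = new HashMap<>();

    static String key(int refType, long refId) { return refType + ":" + refId; }

    static void like(int refType, long refId, long userId) {
        String k = key(refType, refId);
        if (likeByRef.computeIfAbsent(k, x -> new HashSet<>()).add(userId))
            likeCount.merge(k, 1L, Long::sum);
    }

    // Returns "You and <n>", "<n>", or null when there is nothing to render.
    static String render(int refType, long refId, long viewerId) {
        String k = key(refType, refId);
        long total = likeCount.getOrDefault(k, 0L);
        if (total == 0) return null;              // 80% of reads stop here
        boolean mine = likeByRef.getOrDefault(k, Set.of()).contains(viewerId);
        return mine ? "You and " + (total - 1) : String.valueOf(total);
    }

    public static void main(String[] args) {
        like(2, 42L, 1L);
        like(2, 42L, 2L);
        System.out.println(render(2, 42L, 1L)); // "You and 1"
        System.out.println(render(2, 42L, 9L)); // "2"
        System.out.println(render(2, 7L, 1L));  // null
    }
}
```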
>11 M iops
* Quick workaround ?

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId, userId) )
)

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

- Forces an Order Preserving Partitioner
  (the random partitioner doesn't scale for this)
- Key range scans
- More network overhead
- Partitions count >10x, dataset size >2x
By column bloom filter
* What it does

- Includes pairs of (PartKey, ColumnKey) in
  SSTable *-Filter.db

* The good

- Eliminated 98% of reads
- Fewer false positives

* The bad

- They become too large
  GC Promotion Failures
  .. but fixable (CASSANDRA-2466)
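A minimal sketch of the idea, assuming nothing about Cassandra's actual Filter.db layout: membership is hashed over the (partition key, column key) pair, so a point read like (refType:refId, userId) can skip SSTables that cannot contain that exact cell. Sizing and the hashing scheme here are invented for illustration:

```java
import java.util.BitSet;

// Sketch of a by-column bloom filter: membership is tested on the pair
// (partitionKey, columnKey). Double hashing derives k bit indexes from
// one 64-bit fingerprint; this is illustrative, not Cassandra's code.
public class ColumnBloomFilter {
    private final BitSet bits;
    private final int numBits, numHashes;

    ColumnBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Combine both keys into one 64-bit fingerprint.
    private long fingerprint(String partKey, String colKey) {
        long h = 1125899906842597L; // arbitrary prime seed
        for (char c : (partKey + "\u0000" + colKey).toCharArray())
            h = 31 * h + c;
        return h;
    }

    // index_i = h1 + i*h2 (mod numBits)
    public void add(String partKey, String colKey) {
        long fp = fingerprint(partKey, colKey);
        int h1 = (int) (fp & 0x7fffffff), h2 = (int) ((fp >>> 32) | 1);
        for (int i = 0; i < numHashes; i++)
            bits.set(Math.abs((h1 + i * h2) % numBits));
    }

    public boolean mightContain(String partKey, String colKey) {
        long fp = fingerprint(partKey, colKey);
        int h1 = (int) (fp & 0x7fffffff), h2 = (int) ((fp >>> 32) | 1);
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(Math.abs((h1 + i * h2) % numBits))) return false;
        return true;
    }

    public static void main(String[] args) {
        ColumnBloomFilter f = new ColumnBloomFilter(1 << 16, 3);
        f.add("2:9999999999", "11111111111");
        System.out.println(f.mightContain("2:9999999999", "11111111111")); // true
        System.out.println(f.mightContain("2:9999999999", "22222222222")); // almost surely false
    }
}
```

A "no" from the filter is definitive (no false negatives), which is what lets 98% of the existence reads skip the SSTable entirely.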
Are we there yet ?

[diagram: application server → cassandra; 1. COUNT(), 2. EXISTS; > 400 000]

- min 2 roundtrips per render (COUNT + RR)
- THRIFT is slow, especially with lots of connections
- EXISTS() alone is 200 Gbit/sec (140*8*1Mps*20%)
Co-locate!

[diagram: odnoklassniki-like service (Remote Business Intf, get() : LikeSummary)
co-located on the same node as cassandra, the Counters Cache and the Social Graph Cache]

- one-nio remoting (faster than java nio)
- topology aware clients
co-location wins
* Fast TOP N friend likers query

1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read candidates until N friends are found

* Custom caches

- Tuned for the application

* Custom data merge logic

- ... so you can detect and resolve conflicts
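The three numbered steps above can be sketched as follows; the graph cache and the store are passed in as predicates, and every name here is hypothetical:

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of the co-located TOP N friend-likers query: walk the viewer's
// friends (from the graph cache), drop most non-likers cheaply with a
// bloom-filter check, and verify survivors against storage until N
// friends are confirmed.
public class FriendLikers {
    static List<Long> topNFriendLikers(List<Long> friends,
                                       Predicate<Long> bloomMightHaveLiked,
                                       Predicate<Long> storeHasLiked,
                                       int n) {
        List<Long> result = new ArrayList<>();
        for (long friend : friends) {
            if (result.size() >= n) break;          // stop early: we only render N
            if (!bloomMightHaveLiked.test(friend))  // cheap, definitive "no"
                continue;
            if (storeHasLiked.test(friend))         // confirm (filters false positives)
                result.add(friend);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Long> actualLikers = Set.of(2L, 5L, 8L);
        // A bloom filter may say "maybe" for a non-liker (false positive: 6)
        Set<Long> bloomMaybe = Set.of(2L, 5L, 6L, 8L);
        List<Long> friends = List.of(1L, 2L, 3L, 5L, 6L, 8L);
        System.out.println(topNFriendLikers(friends, bloomMaybe::contains,
                                            actualLikers::contains, 2)); // [2, 5]
    }
}
```

The early break is the point of co-location: once N friends are found, the remaining friends are never read from storage at all.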
Listen for mutations
// Implement it
interface StoreApplyListener {
    // called before a mutation is applied to the CF
    boolean preapply(String key,
                     ColumnFamily data);
}

// and register with CFS
store = Table.open(..)
             .getColumnFamilyStore(..);
store.setListener(myListener);

* Register it

- between commit log replay and gossip

* Hooks RowMutation.apply()

- extends the original mutation path
- + covers replica writes, hints, ReadRepairs
Like! optimized counters
* Counters cache

- Off heap (sun.misc.Unsafe)
- Compact (30M in 1G RAM)
- Reads hit the local node's cache only

* Replicated cache state

- solves the cold replica cache problem
- by making (NOP) mutations
- fewer reads, long tail aware

LikeCount (
  refType byte,
  refId bigint,
  ip inet,
  counter int,
  PRIMARY KEY ( (refType, refId), ip )
)
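The deck's cache is built on sun.misc.Unsafe; the same off-heap idea can be sketched with the standard ByteBuffer.allocateDirect API, which also keeps the slots outside the GC'd heap. The layout (fixed 16-byte slots, linear probing) is illustrative, not the production format:

```java
import java.nio.ByteBuffer;

// Off-heap counter cache sketch: a fixed-size open-addressed table of
// (key, count) slots living outside the Java heap, so millions of
// counters don't burden the GC. Key 0 is reserved to mark empty slots.
public class OffHeapCounters {
    private static final int SLOT = 16;           // 8-byte key + 8-byte count
    private final ByteBuffer mem;
    private final int slots;

    OffHeapCounters(int slots) {
        this.slots = slots;
        this.mem = ByteBuffer.allocateDirect(slots * SLOT); // off-heap, zeroed
    }

    private int find(long key) {  // linear probing
        int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % slots;
        for (int probes = 0; probes < slots; probes++, i = (i + 1) % slots) {
            long k = mem.getLong(i * SLOT);
            if (k == key || k == 0) return i;
        }
        throw new IllegalStateException("cache full");
    }

    public void add(long key, long delta) {
        int i = find(key);
        mem.putLong(i * SLOT, key);
        mem.putLong(i * SLOT + 8, mem.getLong(i * SLOT + 8) + delta);
    }

    public long get(long key) {
        int i = find(key);
        return mem.getLong(i * SLOT) == key ? mem.getLong(i * SLOT + 8) : 0;
    }

    public static void main(String[] args) {
        OffHeapCounters c = new OffHeapCounters(1 << 16);
        c.add(42L, 1);
        c.add(42L, 1);
        System.out.println(c.get(42L)); // 2
        System.out.println(c.get(7L));  // 0
    }
}
```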
Read latency variations
* C* read behavior

1. Choose 1 node for data and N for digest
2. Wait for data and digest
3. Compare and return (or RR)

* Nodes suddenly slow down

- SEDA hiccup, commit log rotation, sudden IO
  saturation, network hiccup or partition, page
  cache miss

* The bad

- You get latency spikes
- You have to wait (and timeout)
Read Latency leveling
* “Parallel” read handler

1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good

- Minimal latency response
- Constant load when DC fails

* The (not so) bad

- “Additional” work and traffic
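A minimal sketch of such a handler, with replica calls simulated by Callables; CL here just means "number of responses to wait for", and all names are invented:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the "parallel" read handler: fire the data read at every
// replica at once and return as soon as CL responses arrive, so one slow
// node (SEDA hiccup, commitlog rotation, page-cache miss) no longer
// defines the request latency.
public class ParallelRead {
    static String readWithCL(List<Callable<String>> replicas, int cl,
                             long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            CompletionService<String> ecs = new ExecutorCompletionService<>(pool);
            for (Callable<String> r : replicas) ecs.submit(r);
            String result = null;
            for (int got = 0; got < cl; got++) {     // wait for CL responses only
                Future<String> f = ecs.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (f == null) throw new TimeoutException("CL not reached");
                result = f.get();        // real code would compare/merge responses here
            }
            return result;
        } finally {
            pool.shutdownNow();          // abandon the slow stragglers
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> fast = () -> "v1";
        Callable<String> slow = () -> { Thread.sleep(5000); return "v1"; };
        // CL=2 of 3: two fast replicas answer; the slow one is never waited on
        System.out.println(readWithCL(List.of(fast, fast, slow), 2, 1000)); // v1
    }
}
```

The cost named on the slide is visible in the sketch: all three replicas do the read work even though only two answers are used.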

More tiny tricks
* On SSD io

- Deadline IO elevator
- 64k -> 4k read request size

* HintLog

- Commit log for hints
- Wait for all hints on startup

* Selective compaction

- Compacts most read CFs more often

Case #2. The fat

* Messages in chats

- Last page is accessed on open
- long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- Read-most (120k reads/sec, 8k writes/sec)
Messages have structure

Message (
  chatId, msgId,
  created, type, userIndex, deletedBy, ...,
  text
)

MessageCF (
  chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- All chat’s messages in a single partition
- Single blob for message data
  to reduce overhead

- The bad
  Conflicting modifications can happen
  (users, anti-spam, etc..)
LW conflict resolution

Messages (
  chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

Writer A: get -> (version:ts1, data:d1)
Writer B: get -> (version:ts1, data:d1)

A: write( ts1, data2 ) = delete(version:ts1); insert(version: ts2=now(), data2)
B: write( ts1, data3 ) = delete(version:ts1); insert(version: ts3=now(), data3)

Result: both (ts2, data2) and (ts3, data3) survive - merged on read
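The version-column scheme can be sketched in a few lines; the merge policy below (newest cell wins) is a stand-in for whatever the application's real conflict resolution does:

```java
import java.util.*;

// Sketch of the version-column scheme: each write deletes the version it
// read and inserts a new cell keyed by now(). Concurrent writers that both
// read ts1 leave two live cells (ts2 and ts3) instead of silently losing
// one update; the conflict becomes visible and is merged on read.
public class VersionedCells {
    // msgId -> (version -> data); TreeMap keeps versions ordered
    static final Map<Long, NavigableMap<Long, String>> rows = new HashMap<>();

    static void write(long msgId, long readVersion, long newVersion, String data) {
        NavigableMap<Long, String> cells =
            rows.computeIfAbsent(msgId, k -> new TreeMap<>());
        cells.remove(readVersion);         // delete(version:ts_read)
        cells.put(newVersion, data);       // insert(version: now(), data)
    }

    static String read(long msgId) {       // merged on read
        NavigableMap<Long, String> cells = rows.getOrDefault(msgId, new TreeMap<>());
        if (cells.isEmpty()) return null;
        if (cells.size() > 1) {
            // conflict detected: several writers survived; resolve and clean up
            Map.Entry<Long, String> winner = cells.lastEntry();
            cells.clear();
            cells.put(winner.getKey(), winner.getValue());
        }
        return cells.lastEntry().getValue();
    }

    public static void main(String[] args) {
        write(1L, 0L, 100L, "d1");         // initial state: (ts1=100, d1)
        // two clients both read version 100, then write concurrently
        write(1L, 100L, 200L, "d2");
        write(1L, 100L, 300L, "d3");
        System.out.println(read(1L));      // "d3" - newest cell wins here
    }
}
```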
Specialized cache
* Again. Because we can

- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state (keys AND values) to a local (AKA system) CF:
  seq read, much faster startup
- In-memory compression:
  2x more memory almost free
Disk mgmt
* 4U HDDx24, up to 4TB/node

- Size-tiered compaction can yield a single 4 TB sstable file
- RAID10 ? LCS ?

* Split the CF into 256 pieces
* The good

- Smaller, more frequent memtable flushes
- Same compaction work, in smaller sets
- Pieces can be distributed across disks
Disk Allocation Policies
* Default is

- “Take the disk with the most free space”

* So some disks get

- Too many read iops

* Generational policy

- Each disk holds the same # of same-generation files
- Works better for HDD
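A sketch of the generational idea, assuming simple per-disk, per-generation file counts; the real policy's bookkeeping is not shown in the deck:

```java
import java.util.*;

// Sketch of the generational disk allocation policy: instead of "most free
// space" (which piles hot, freshly-compacted files onto one spindle), give
// every disk the same number of files of each compaction generation, so
// read iops spread evenly across HDDs.
public class GenerationalPolicy {
    private final int disks;
    // perGen.get(disk)[generation] -> how many gen-N files that disk holds
    private final Map<Integer, int[]> perGen = new HashMap<>();

    GenerationalPolicy(int disks, int maxGen) {
        this.disks = disks;
        for (int d = 0; d < disks; d++) perGen.put(d, new int[maxGen]);
    }

    // Place a new sstable of the given generation on the disk that currently
    // holds the fewest files of that generation.
    public int allocate(int generation) {
        int best = 0;
        for (int d = 1; d < disks; d++)
            if (perGen.get(d)[generation] < perGen.get(best)[generation]) best = d;
        perGen.get(best)[generation]++;
        return best;
    }

    public static void main(String[] args) {
        GenerationalPolicy p = new GenerationalPolicy(3, 4);
        // four gen-0 flushes: each disk gets one before any disk gets a second
        System.out.println(p.allocate(0)); // 0
        System.out.println(p.allocate(0)); // 1
        System.out.println(p.allocate(0)); // 2
        System.out.println(p.allocate(0)); // 0
    }
}
```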

Case #3. The ugly
feed my Frankenstein

* Chats overview

- small dataset (230GB)
- has hot set, short tail (5%)
- list reorders often
- 130k read/s, 21k write/s

Conflicting updates
* List<Overview> is a single blob

.. or you’ll have a lot of tombstones

* Lots of conflicts

- concurrent updates of a single column

* Need conflict detection
* Have a merge algorithm
Vector clocks
* Voldemort

- byte[] key -> byte[] value + VC
- Coordination logic on clients
- Pluggable storage engines

* Plugged in

- C* 0.6 SSTables persistence
- Fronted by a specialized cache
  (we love caches)
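The vector-clock comparison this Voldemort-style coordination relies on can be sketched as follows; the representation (node -> counter map) is illustrative:

```java
import java.util.*;

// Sketch of vector-clock comparison: each replica increments its own slot
// on write, and two values conflict only when neither clock dominates the
// other - which is exactly when a merge algorithm must run.
public class VectorClock {
    final Map<String, Long> counts = new HashMap<>();

    void increment(String node) { counts.merge(node, 1L, Long::sum); }

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    static Order compare(VectorClock a, VectorClock b) {
        boolean aAhead = false, bAhead = false;
        Set<String> nodes = new HashSet<>(a.counts.keySet());
        nodes.addAll(b.counts.keySet());
        for (String n : nodes) {
            long av = a.counts.getOrDefault(n, 0L), bv = b.counts.getOrDefault(n, 0L);
            if (av > bv) aAhead = true;
            if (bv > av) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // true conflict: merge needed
        if (aAhead) return Order.AFTER;
        if (bAhead) return Order.BEFORE;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        VectorClock a = new VectorClock(), b = new VectorClock();
        a.increment("n1");                    // a = {n1:1}
        b.increment("n1"); b.increment("n2"); // b = {n1:1, n2:1}
        System.out.println(compare(a, b));    // BEFORE: b saw everything a did
        a.increment("n3");                    // now neither dominates
        System.out.println(compare(a, b));    // CONCURRENT
    }
}
```

Unlike the timestamp scheme of Case #2, CONCURRENT here is detected structurally rather than by observing two surviving cells.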

Performance
* 3 node cluster, RF = 3

- Intel Xeon CPU E5506 2.13GHz
- RAM: 48 GB, 1x HDD, 1x SSD

* 8 byte key -> 1 KB value
* Results

- 75k/sec reads, 15k/sec writes
Why cassandra ?
* Reusable distributed DB components

- fast persistence, gossip,
  reliable async messaging, failure detectors,
  topology, seq scans, ...

* Has structure

- beyond byte[] key -> byte[] value

* Delivered on its promises
* Implemented in Java
THANK YOU
Oleg Anastasyev
oa@odnoklassniki.ru
odnoklassniki.ru/oa
@m0nstermind

github.com/odnoklassniki

shared-memory-cache
java off-heap cache using shared memory

one-nio
RMI faster than java nio, with fast and compact automagic Java serialization

#CASSANDRASUMMITEU


C* Summit EU 2013: Being Closer to Cassandra at Ok.ru
