Being Closer to Cassandra

Oleg Anastasyev
lead platform developer
Odnoklassniki.ru
Top 10 of World’s social networks
40M DAU, 80M MAU, 7M peak
~ 300 000 www req/sec,
20 ms render latency
>240 Gbit out
> 5 800 iron servers in 5 DCs
99.9% java

#CASSANDRAEU

* Odnoklassniki means “classmates” in English
Cassandra @
* Since 2010

- branched 0.6
- aiming at: full operation on DC failure,
  scalability, ease of operations

* Now

- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures

Case #1. The fast

Like! 103 927


You and 103 927
Like! widget
* It's everywhere

- A dozen on every page
- On feeds (AKA timeline)
- On 3rd-party websites elsewhere on the internet

* It's on everything

- Pictures and Albums
- Videos
- Posts and comments
- 3rd party shared URLs


Like! 103 927
Like! widget
* High load

- 1 000 000 reads/sec, 3 000 writes/sec

Like! 103 927
Hard load profile

- Read most
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities

Classic solution
SQL table:

RefId:long   RefType:byte   UserId:long   Created
9999999999   PICTURE(2)     11111111111   11:00

to render "You and 4256":

SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?             = N >= 1  (98% are NONE)
SELECT COUNT (*) WHERE RefId,RefType=?,?                  = M > N   (80% are 0)
SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)   = N*140
Cassandra solution

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId), userId )
)

LikeCount (
  refType byte,
  refId bigint,
  likers counter,
  PRIMARY KEY ( (refType, refId) )
)

so, to render "You and 4256":

SELECT FROM LikeCount WHERE RefId,RefType=?,?              (80% are 0)
SELECT * FROM LikeByRef WHERE RefId,RefType,UserId=?,?,?   (98% are NONE)

= N*20%
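The two-step read above (counter first, then the viewer's own like) can be sketched with plain maps standing in for the two column families; `LikeRender` and all its names are illustrative, not the production API:

```java
import java.util.*;

// Sketch of the Like! render path: hit LikeCount first (80% of entities
// have zero likes), and only then check LikeByRef for the viewer's own like
// (98% of those lookups find nothing). Maps stand in for the two CFs.
public class LikeRender {
    // (refType, refId) -> total likers
    static final Map<String, Long> likeCount = new HashMap<>();
    // (refType, refId) -> set of userIds who liked it
    static final Map<String, Set<Long>> likeByRef = new HashMap<>();

    static String key(int refType, long refId) { return refType + ":" + refId; }

    static void like(int refType, long refId, long userId) {
        String k = key(refType, refId);
        if (likeByRef.computeIfAbsent(k, x -> new HashSet<>()).add(userId))
            likeCount.merge(k, 1L, Long::sum);
    }

    // Returns "You and <n>", "<n>", or null when there is nothing to render.
    static String render(int refType, long refId, long viewerId) {
        String k = key(refType, refId);
        long total = likeCount.getOrDefault(k, 0L);
        if (total == 0) return null;              // 80% of reads stop here
        boolean mine = likeByRef.getOrDefault(k, Set.of()).contains(viewerId);
        return mine ? "You and " + (total - 1) : String.valueOf(total);
    }

    public static void main(String[] args) {
        like(2, 42L, 1L);
        like(2, 42L, 2L);
        System.out.println(render(2, 42L, 1L)); // "You and 1"
        System.out.println(render(2, 42L, 9L)); // "2"
        System.out.println(render(2, 7L, 1L));  // null
    }
}
```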
>11 M iops
* Quick workaround ?

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId, userId) )
)

SELECT TOP N * RefId,RefType=? WHERE IsFriend(?,UserId)

- Forces an Order Preserving Partitioner
  (the random partitioner doesn't scale for this)
- Key range scans
- More network overhead
- Partitions count >10x, dataset size >2x
By column bloom filter
* What it does

- Includes pairs of (PartKey, ColumnKey) in
  SSTable *-Filter.db

* The good

- Eliminated 98% of reads
- Fewer false positives

* The bad

- They become too large
  GC Promotion Failures
  .. but fixable (CASSANDRA-2466)
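A minimal sketch of the idea, assuming nothing about Cassandra's actual Filter.db layout: membership is hashed over the (partition key, column key) pair, so a point read like (refType:refId, userId) can skip SSTables that cannot contain that exact cell. Sizing and the hashing scheme here are invented for illustration:

```java
import java.util.BitSet;

// Sketch of a by-column bloom filter: membership is tested on the pair
// (partitionKey, columnKey). Double hashing derives k bit indexes from
// one 64-bit fingerprint; this is illustrative, not Cassandra's code.
public class ColumnBloomFilter {
    private final BitSet bits;
    private final int numBits, numHashes;

    ColumnBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Combine both keys into one 64-bit fingerprint.
    private long fingerprint(String partKey, String colKey) {
        long h = 1125899906842597L; // arbitrary prime seed
        for (char c : (partKey + "\u0000" + colKey).toCharArray())
            h = 31 * h + c;
        return h;
    }

    // index_i = h1 + i*h2 (mod numBits)
    public void add(String partKey, String colKey) {
        long fp = fingerprint(partKey, colKey);
        int h1 = (int) (fp & 0x7fffffff), h2 = (int) ((fp >>> 32) | 1);
        for (int i = 0; i < numHashes; i++)
            bits.set(Math.abs((h1 + i * h2) % numBits));
    }

    public boolean mightContain(String partKey, String colKey) {
        long fp = fingerprint(partKey, colKey);
        int h1 = (int) (fp & 0x7fffffff), h2 = (int) ((fp >>> 32) | 1);
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(Math.abs((h1 + i * h2) % numBits))) return false;
        return true;
    }

    public static void main(String[] args) {
        ColumnBloomFilter f = new ColumnBloomFilter(1 << 16, 3);
        f.add("2:9999999999", "11111111111");
        System.out.println(f.mightContain("2:9999999999", "11111111111")); // true
        System.out.println(f.mightContain("2:9999999999", "22222222222")); // almost surely false
    }
}
```

A "no" from the filter is definitive (no false negatives), which is what lets 98% of the existence reads skip the SSTable entirely.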
Are we there yet ?

[diagram: application server → cassandra; 1. COUNT(), 2. EXISTS; > 400 000]

- min 2 roundtrips per render (COUNT + RR)
- THRIFT is slow, especially with lots of connections
- EXISTS() alone is 200 Gbit/sec (140*8*1Mps*20%)
Co-locate!

[diagram: odnoklassniki-like service (Remote Business Intf, get() : LikeSummary)
co-located on the same node as cassandra, the Counters Cache and the Social Graph Cache]

- one-nio remoting (faster than java nio)
- topology aware clients
co-location wins
* Fast TOP N friend likers query

1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read candidates until N friends are found

* Custom caches

- Tuned for the application

* Custom data merge logic

- ... so you can detect and resolve conflicts
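The three numbered steps above can be sketched as follows; the graph cache and the store are passed in as predicates, and every name here is hypothetical:

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of the co-located TOP N friend-likers query: walk the viewer's
// friends (from the graph cache), drop most non-likers cheaply with a
// bloom-filter check, and verify survivors against storage until N
// friends are confirmed.
public class FriendLikers {
    static List<Long> topNFriendLikers(List<Long> friends,
                                       Predicate<Long> bloomMightHaveLiked,
                                       Predicate<Long> storeHasLiked,
                                       int n) {
        List<Long> result = new ArrayList<>();
        for (long friend : friends) {
            if (result.size() >= n) break;          // stop early: we only render N
            if (!bloomMightHaveLiked.test(friend))  // cheap, definitive "no"
                continue;
            if (storeHasLiked.test(friend))         // confirm (filters false positives)
                result.add(friend);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Long> actualLikers = Set.of(2L, 5L, 8L);
        // A bloom filter may say "maybe" for a non-liker (false positive: 6)
        Set<Long> bloomMaybe = Set.of(2L, 5L, 6L, 8L);
        List<Long> friends = List.of(1L, 2L, 3L, 5L, 6L, 8L);
        System.out.println(topNFriendLikers(friends, bloomMaybe::contains,
                                            actualLikers::contains, 2)); // [2, 5]
    }
}
```

The early break is the point of co-location: once N friends are found, the remaining friends are never read from storage at all.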
Listen for mutations
// Implement it
interface StoreApplyListener {
    // called before a mutation is applied to the CF
    boolean preapply(String key,
                     ColumnFamily data);
}

// and register with CFS
store = Table.open(..)
             .getColumnFamilyStore(..);
store.setListener(myListener);

* Register it

- between commit log replay and gossip

* Hooks RowMutation.apply()

- extends the original mutation path
- + covers replica writes, hints, ReadRepairs
Like! optimized counters
* Counters cache

- Off heap (sun.misc.Unsafe)
- Compact (30M in 1G RAM)
- Reads hit the local node's cache only

* Replicated cache state

- solves the cold replica cache problem
- by making (NOP) mutations
- fewer reads, long tail aware

LikeCount (
  refType byte,
  refId bigint,
  ip inet,
  counter int,
  PRIMARY KEY ( (refType, refId), ip )
)
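The deck's cache is built on sun.misc.Unsafe; the same off-heap idea can be sketched with the standard ByteBuffer.allocateDirect API, which also keeps the slots outside the GC'd heap. The layout (fixed 16-byte slots, linear probing) is illustrative, not the production format:

```java
import java.nio.ByteBuffer;

// Off-heap counter cache sketch: a fixed-size open-addressed table of
// (key, count) slots living outside the Java heap, so millions of
// counters don't burden the GC. Key 0 is reserved to mark empty slots.
public class OffHeapCounters {
    private static final int SLOT = 16;           // 8-byte key + 8-byte count
    private final ByteBuffer mem;
    private final int slots;

    OffHeapCounters(int slots) {
        this.slots = slots;
        this.mem = ByteBuffer.allocateDirect(slots * SLOT); // off-heap, zeroed
    }

    private int find(long key) {  // linear probing
        int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % slots;
        for (int probes = 0; probes < slots; probes++, i = (i + 1) % slots) {
            long k = mem.getLong(i * SLOT);
            if (k == key || k == 0) return i;
        }
        throw new IllegalStateException("cache full");
    }

    public void add(long key, long delta) {
        int i = find(key);
        mem.putLong(i * SLOT, key);
        mem.putLong(i * SLOT + 8, mem.getLong(i * SLOT + 8) + delta);
    }

    public long get(long key) {
        int i = find(key);
        return mem.getLong(i * SLOT) == key ? mem.getLong(i * SLOT + 8) : 0;
    }

    public static void main(String[] args) {
        OffHeapCounters c = new OffHeapCounters(1 << 16);
        c.add(42L, 1);
        c.add(42L, 1);
        System.out.println(c.get(42L)); // 2
        System.out.println(c.get(7L));  // 0
    }
}
```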
Read latency variations
* C* read behavior

1. Choose 1 node for data and N for digest
2. Wait for data and digest
3. Compare and return (or RR)

* Nodes suddenly slow down

- SEDA hiccup, commit log rotation, sudden IO
  saturation, network hiccup or partition, page
  cache miss

* The bad

- You get latency spikes
- You have to wait (and timeout)
Read Latency leveling
* “Parallel” read handler

1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good

- Minimal latency response
- Constant load when DC fails

* The (not so) bad

- “Additional” work and traffic
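A minimal sketch of such a handler, with replica calls simulated by Callables; CL here just means "number of responses to wait for", and all names are invented:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the "parallel" read handler: fire the data read at every
// replica at once and return as soon as CL responses arrive, so one slow
// node (SEDA hiccup, commitlog rotation, page-cache miss) no longer
// defines the request latency.
public class ParallelRead {
    static String readWithCL(List<Callable<String>> replicas, int cl,
                             long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            CompletionService<String> ecs = new ExecutorCompletionService<>(pool);
            for (Callable<String> r : replicas) ecs.submit(r);
            String result = null;
            for (int got = 0; got < cl; got++) {     // wait for CL responses only
                Future<String> f = ecs.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (f == null) throw new TimeoutException("CL not reached");
                result = f.get();        // real code would compare/merge responses here
            }
            return result;
        } finally {
            pool.shutdownNow();          // abandon the slow stragglers
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> fast = () -> "v1";
        Callable<String> slow = () -> { Thread.sleep(5000); return "v1"; };
        // CL=2 of 3: two fast replicas answer; the slow one is never waited on
        System.out.println(readWithCL(List.of(fast, fast, slow), 2, 1000)); // v1
    }
}
```

The cost named on the slide is visible in the sketch: all three replicas do the read work even though only two answers are used.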

More tiny tricks
* On SSD io

- Deadline IO elevator
- 64k -> 4k read request size

* HintLog

- Commit log for hints
- Wait for all hints on startup

* Selective compaction

- Compacts most read CFs more often

Case #2. The fat

* Messages in chats

- Last page is accessed on open
- long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- Read-most (120k reads/sec, 8k writes/sec)
Messages have structure

Message (
  chatId, msgId,
  created, type, userIndex, deletedBy, ...,
  text
)

MessageCF (
  chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- All chat’s messages in a single partition
- Single blob for message data
  to reduce overhead

- The bad
  Conflicting modifications can happen
  (users, anti-spam, etc..)
LW conflict resolution

Messages (
  chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

Writer A: get -> (version:ts1, data:d1)
Writer B: get -> (version:ts1, data:d1)

A: write( ts1, data2 ) = delete(version:ts1); insert(version: ts2=now(), data2)
B: write( ts1, data3 ) = delete(version:ts1); insert(version: ts3=now(), data3)

Result: both (ts2, data2) and (ts3, data3) survive - merged on read
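The version-column scheme can be sketched in a few lines; the merge policy below (newest cell wins) is a stand-in for whatever the application's real conflict resolution does:

```java
import java.util.*;

// Sketch of the version-column scheme: each write deletes the version it
// read and inserts a new cell keyed by now(). Concurrent writers that both
// read ts1 leave two live cells (ts2 and ts3) instead of silently losing
// one update; the conflict becomes visible and is merged on read.
public class VersionedCells {
    // msgId -> (version -> data); TreeMap keeps versions ordered
    static final Map<Long, NavigableMap<Long, String>> rows = new HashMap<>();

    static void write(long msgId, long readVersion, long newVersion, String data) {
        NavigableMap<Long, String> cells =
            rows.computeIfAbsent(msgId, k -> new TreeMap<>());
        cells.remove(readVersion);         // delete(version:ts_read)
        cells.put(newVersion, data);       // insert(version: now(), data)
    }

    static String read(long msgId) {       // merged on read
        NavigableMap<Long, String> cells = rows.getOrDefault(msgId, new TreeMap<>());
        if (cells.isEmpty()) return null;
        if (cells.size() > 1) {
            // conflict detected: several writers survived; resolve and clean up
            Map.Entry<Long, String> winner = cells.lastEntry();
            cells.clear();
            cells.put(winner.getKey(), winner.getValue());
        }
        return cells.lastEntry().getValue();
    }

    public static void main(String[] args) {
        write(1L, 0L, 100L, "d1");         // initial state: (ts1=100, d1)
        // two clients both read version 100, then write concurrently
        write(1L, 100L, 200L, "d2");
        write(1L, 100L, 300L, "d3");
        System.out.println(read(1L));      // "d3" - newest cell wins here
    }
}
```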
Specialized cache
* Again. Because we can

- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state (keys AND values) to a local (AKA system) CF:
  seq read, much faster startup
- In-memory compression:
  2x more memory almost free
Disk mgmt
* 4U HDDx24, up to 4TB/node

- Size-tiered compaction can yield a single 4 TB sstable file
- RAID10 ? LCS ?

* Split the CF into 256 pieces
* The good

- Smaller, more frequent memtable flushes
- Same compaction work, in smaller sets
- Pieces can be distributed across disks
Disk Allocation Policies
* Default is

- “Take the disk with the most free space”

* So some disks get

- Too many read iops

* Generational policy

- Each disk holds the same # of same-generation files
- Works better for HDD
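A sketch of the generational idea, assuming simple per-disk, per-generation file counts; the real policy's bookkeeping is not shown in the deck:

```java
import java.util.*;

// Sketch of the generational disk allocation policy: instead of "most free
// space" (which piles hot, freshly-compacted files onto one spindle), give
// every disk the same number of files of each compaction generation, so
// read iops spread evenly across HDDs.
public class GenerationalPolicy {
    private final int disks;
    // perGen.get(disk)[generation] -> how many gen-N files that disk holds
    private final Map<Integer, int[]> perGen = new HashMap<>();

    GenerationalPolicy(int disks, int maxGen) {
        this.disks = disks;
        for (int d = 0; d < disks; d++) perGen.put(d, new int[maxGen]);
    }

    // Place a new sstable of the given generation on the disk that currently
    // holds the fewest files of that generation.
    public int allocate(int generation) {
        int best = 0;
        for (int d = 1; d < disks; d++)
            if (perGen.get(d)[generation] < perGen.get(best)[generation]) best = d;
        perGen.get(best)[generation]++;
        return best;
    }

    public static void main(String[] args) {
        GenerationalPolicy p = new GenerationalPolicy(3, 4);
        // four gen-0 flushes: each disk gets one before any disk gets a second
        System.out.println(p.allocate(0)); // 0
        System.out.println(p.allocate(0)); // 1
        System.out.println(p.allocate(0)); // 2
        System.out.println(p.allocate(0)); // 0
    }
}
```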

Case #3. The ugly
feed my Frankenstein

* Chats overview

- small dataset (230GB)
- has hot set, short tail (5%)
- list reorders often
- 130k read/s, 21k write/s

Conflicting updates
* List<Overview> is a single blob

.. or you’ll have a lot of tombstones

* Lots of conflicts

- concurrent updates of a single column

* Need conflict detection
* Have a merge algorithm
Vector clocks
* Voldemort

- byte[] key -> byte[] value + VC
- Coordination logic on clients
- Pluggable storage engines

* Plugged in

- C* 0.6 SSTables persistence
- Fronted by a specialized cache
  (we love caches)
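The vector-clock comparison this Voldemort-style coordination relies on can be sketched as follows; the representation (node -> counter map) is illustrative:

```java
import java.util.*;

// Sketch of vector-clock comparison: each replica increments its own slot
// on write, and two values conflict only when neither clock dominates the
// other - which is exactly when a merge algorithm must run.
public class VectorClock {
    final Map<String, Long> counts = new HashMap<>();

    void increment(String node) { counts.merge(node, 1L, Long::sum); }

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    static Order compare(VectorClock a, VectorClock b) {
        boolean aAhead = false, bAhead = false;
        Set<String> nodes = new HashSet<>(a.counts.keySet());
        nodes.addAll(b.counts.keySet());
        for (String n : nodes) {
            long av = a.counts.getOrDefault(n, 0L), bv = b.counts.getOrDefault(n, 0L);
            if (av > bv) aAhead = true;
            if (bv > av) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // true conflict: merge needed
        if (aAhead) return Order.AFTER;
        if (bAhead) return Order.BEFORE;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        VectorClock a = new VectorClock(), b = new VectorClock();
        a.increment("n1");                    // a = {n1:1}
        b.increment("n1"); b.increment("n2"); // b = {n1:1, n2:1}
        System.out.println(compare(a, b));    // BEFORE: b saw everything a did
        a.increment("n3");                    // now neither dominates
        System.out.println(compare(a, b));    // CONCURRENT
    }
}
```

Unlike the timestamp scheme of Case #2, CONCURRENT here is detected structurally rather than by observing two surviving cells.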

Performance
* 3 node cluster, RF = 3

- Intel Xeon CPU E5506 2.13GHz
- RAM: 48 GB, 1x HDD, 1x SSD

* 8 byte key -> 1 KB value
* Results

- 75k/sec reads, 15k/sec writes
Why cassandra ?
* Reusable distributed DB components

- fast persistence, gossip,
  reliable async messaging, failure detectors,
  topology, seq scans, ...

* Has structure

- beyond byte[] key -> byte[] value

* Delivered on its promises
* Implemented in Java
THANK YOU
Oleg Anastasyev
oa@odnoklassniki.ru
odnoklassniki.ru/oa
@m0nstermind

github.com/odnoklassniki

shared-memory-cache
java off-heap cache using shared memory

one-nio
RMI faster than java nio, with fast and compact automagic Java serialization

#CASSANDRASUMMITEU


C* Summit EU 2013: Being Closer to Cassandra at Ok.ru
