Cassandra
storage internals
     Nicolas Favre-Felix
   Cassandra Europe 2012
What this talk covers

• What happens within a Cassandra node

• How Cassandra reads and writes data

• What compaction is and why we need it

• How counters are stored, modified, and read
Concepts
• Memtables        • On heap, off-heap

• SSTables         • Compaction

• Commit Log       • Bloom filters

• Key cache        • SSTable index

• Row cache        • Counters
Why is this important?
• Understand what goes on under the hood

• Understand the reasons for these choices

• Diagnose issues

• Tune Cassandra for performance

• Make your data model efficient
A word about hard drives

• Main driver behind Cassandra’s storage choices

• The last moving part

• Fast sequential I/O (150 MB/s)

• Slow random I/O (120-200 IOPS)
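
As a rough illustration of the gap between these two figures, the sketch below compares writing one million small rows with one random I/O each against appending them in a single sequential stream. The 1 KB row size, the row count, and the 150 IOPS midpoint are assumptions chosen for the example, not measurements from the talk.

```python
# Back-of-the-envelope comparison; row size and row count are assumed values.
ROWS = 1_000_000
ROW_SIZE_BYTES = 1_000          # assumption: 1 KB per row
SEQUENTIAL_MB_PER_S = 150       # figure quoted on the slide
RANDOM_IOPS = 150               # assumption: inside the slide's 120-200 range

def random_write_seconds(rows: int, iops: int) -> float:
    """One random I/O per row."""
    return rows / iops

def sequential_write_seconds(rows: int, row_size: int, mb_per_s: int) -> float:
    """All rows appended in one sequential stream."""
    total_mb = rows * row_size / 1_000_000
    return total_mb / mb_per_s

random_s = random_write_seconds(ROWS, RANDOM_IOPS)
sequential_s = sequential_write_seconds(ROWS, ROW_SIZE_BYTES, SEQUENTIAL_MB_PER_S)
print(f"random: {random_s:.0f} s, sequential: {sequential_s:.1f} s")
```

With these assumed numbers the random approach takes close to two hours, while the sequential one takes a few seconds.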
What SSDs bring
• Fast sequential I/O

• Fast random I/O

• Higher cost

• Limited lifetime

• Performance degradation
Disk usage with B-trees

• Important data structure in relational databases

• In-place overwrites (random I/O)

• log_B(N) random accesses for reads and writes
Disk usage with Cassandra
 • Made for spinning disks

 • Sequential writes, much less than 1 I/O per insert

 • Several layers of cache

 • Random reads, approximately 1 I/O per read

 • Generally “write-optimised”
Writing to Cassandra
Let’s add a row with a few columns



Row Key   Column   Column   Column   Column
The Cassandra write path
[Diagram: new data goes to the Memtable, in the JVM, and to the commit log, on disk.]
The Commit Log
• Each write is added to a log file

• Guarantees durability after a crash

• 1-second window during which data is still in RAM

• Sequential I/O

• A dedicated disk is recommended
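
The behaviour described above can be sketched as an append-only file that is forced to disk at most once per interval. This is a minimal illustration, not Cassandra's implementation: the `CommitLog` class, the tab-separated record format, and the `replay` helper are all hypothetical names made up for the example.

```python
import os
import tempfile
import time

class CommitLog:
    """Append-only log, synced to disk at most once per sync interval."""

    def __init__(self, path: str, sync_interval_s: float = 1.0):
        self._file = open(path, "ab")
        self._sync_interval_s = sync_interval_s
        self._last_sync = time.monotonic()

    def append(self, key: str, value: str) -> None:
        # Sequential I/O: records are only ever added at the end of the file.
        self._file.write(f"{key}\t{value}\n".encode("utf-8"))
        if time.monotonic() - self._last_sync >= self._sync_interval_s:
            self.sync()

    def sync(self) -> None:
        # Until this runs, recent writes may exist only in RAM buffers.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._last_sync = time.monotonic()

    def close(self) -> None:
        self.sync()
        self._file.close()

def replay(path: str) -> list[tuple[str, str]]:
    """Re-read every logged write, for example after a crash."""
    with open(path, "rb") as f:
        return [tuple(line.decode("utf-8").rstrip("\n").split("\t", 1)) for line in f]

# Usage: log two writes, then recover them from the file.
log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
log = CommitLog(log_path)
log.append("row1", "name=alice")
log.append("row2", "name=bob")
log.close()
recovered = replay(log_path)
```

The sync interval is what creates the window during which acknowledged data is still only in RAM; a shorter interval narrows the window at the cost of more frequent syncs.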
Memtables
• In-memory Key/Value data structure

• Implemented with ConcurrentSkipListMap

• One per column family

• Very fast inserts

• Columns are merged in memory for the same key

• Flushed at a certain threshold, into an SSTable
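
A minimal sketch of these properties, using a Python dict plus a sorted key list as a stand-in for ConcurrentSkipListMap (it is not thread-safe). The `Memtable` class, its method names, and a flush threshold counted in columns are assumptions made for the example.

```python
from bisect import insort

class Memtable:
    """Sorted in-memory map: row key -> {column name: (timestamp, value)}."""

    def __init__(self, flush_threshold: int = 4):
        self._rows: dict[str, dict[str, tuple[int, str]]] = {}
        self._sorted_keys: list[str] = []
        self._column_count = 0
        self._flush_threshold = flush_threshold   # assumption: counted in columns

    def put(self, key: str, column: str, value: str, timestamp: int) -> None:
        row = self._rows.get(key)
        if row is None:
            row = self._rows[key] = {}
            insort(self._sorted_keys, key)         # keep row keys ordered
        current = row.get(column)
        if current is None:
            self._column_count += 1
        # Merge in memory: the most recent timestamp wins for a column.
        if current is None or timestamp >= current[0]:
            row[column] = (timestamp, value)

    def is_full(self) -> bool:
        return self._column_count >= self._flush_threshold

    def flush(self) -> tuple:
        """Return rows in key order, as they would be written out, and reset."""
        sstable = tuple(
            (k, dict(sorted(self._rows[k].items()))) for k in self._sorted_keys
        )
        self._rows, self._sorted_keys, self._column_count = {}, [], 0
        return sstable

# Usage: the second write to user:1/name replaces the first in memory.
memtable = Memtable(flush_threshold=3)
memtable.put("user:2", "name", "bob", timestamp=1)
memtable.put("user:1", "name", "alice", timestamp=2)
memtable.put("user:1", "name", "alicia", timestamp=3)
memtable.put("user:1", "email", "a@example.com", timestamp=4)
sstable = memtable.flush() if memtable.is_full() else ()
```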
Dumping a Memtable on disk
[Diagram: the full Memtable in the JVM is written out as an SSTable on disk, next to the commit log, and a new Memtable takes its place.]
The SSTable

• One file, written sequentially

• Columns are in order, grouped by row

• Immutable once written, no updates!
SSTables start piling up!
[Diagram: one Memtable in the JVM; on disk, the commit log and a dozen SSTables.]
SSTables
• Can’t keep all of them forever

• Need to reclaim disk space

• Reads could touch several SSTables

• Scans touch all of them

• In-memory data structures per SSTable
Compacting SSTables
Compaction
• Merge SSTables of similar size together

• Remove overwrites and deleted data (timestamps)

• Improve range query performance

• Major compaction creates a single SSTable

• I/O intensive operation
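
The merge step can be sketched as a streaming merge of sorted inputs that keeps only the newest version of each column. This is a simplified illustration: the `(key, column, timestamp, value)` tuple layout and the use of `None` as a deletion marker are assumptions, and deleted data is dropped immediately here, whereas a real system retains deletion markers for a grace period.

```python
import heapq

TOMBSTONE = None  # assumption: a deleted column is represented by a None value

def compact(*sstables):
    """Merge sorted SSTables of (key, column, timestamp, value) into one.

    For each (key, column) only the newest timestamp survives, and columns
    whose newest version is a tombstone are dropped.
    """
    # Each input is sorted by (key, column); newest versions come first on ties.
    merged = heapq.merge(*sstables, key=lambda c: (c[0], c[1], -c[2]))
    result = []
    previous = None
    for key, column, timestamp, value in merged:
        if (key, column) == previous:
            continue                      # older version: overwritten
        previous = (key, column)
        if value is not TOMBSTONE:
            result.append((key, column, timestamp, value))
    return result

# Usage: k1 is overwritten and k3 is deleted by the newer SSTable.
older = [("k1", "a", 1, "x"), ("k2", "a", 1, "y"), ("k3", "a", 1, "z")]
newer = [("k1", "a", 2, "x2"), ("k3", "a", 5, TOMBSTONE)]
compacted = compact(older, newer)
```

Because both inputs are already sorted, the merge reads each file once from start to end, which is why compaction is I/O intensive but sequential.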
Recent improvements

• Pluggable compaction

• Different strategies, chosen per column family

• SSTable compression

• More efficient SSTable merges
Reading from Cassandra
• Reading all these SSTables would be very inefficient

• We have to read from memory as much as possible

• Otherwise we need to do 2 things efficiently:

 • Find the right SSTable to read from

 • Find where in that SSTable to read the data
First step for reads

• The Memtable!

• Read the most recent data

• Very fast, no need to touch the disk
[Diagram: the row cache sits off-heap (no GC); the Memtable is in the JVM; the commit log and SSTable are on disk.]
Row cache

• Stores a whole row in memory

• Off-heap, not subject to Garbage Collection

• Size is configurable per column family

• Last resort before having to read from disk
Finding the right SSTable
[Diagram: one Memtable in the JVM; on disk, the commit log and several SSTables.]
Bloom filter
• Saved with each SSTable

• Answers “contains(Key) :: boolean”

• Saved on disk but kept in memory

• Probabilistic data structure

• Configurable proportion of false positives

• No false negatives
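
A minimal Bloom filter sketch showing why false positives are possible but false negatives are not. The `BloomFilter` class, the bit-array size, the number of hashes, and the salted SHA-256 hashing are choices made for this example and are not taken from Cassandra.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function.
        for i in range(self._num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self._num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )

# Usage: every added key is reported as possibly present.
bloom = BloomFilter()
keys = [f"row-{i}" for i in range(50)]
for k in keys:
    bloom.add(k)
all_found = all(bloom.might_contain(k) for k in keys)
```

Adding a key only ever sets bits, so a key that was added always finds its bits set. More bits per key lowers the proportion of false positives, which is the trade-off behind the configurable setting.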
Bloom filter
[Diagram: three SSTables on disk, each with its own Bloom filter held in the JVM. A lookup asks each filter exists(key)? and gets true/false back.]
Reading from an SSTable
• We need to know where in the file our data is saved
• Keys are sorted, why don’t we do a binary search?
 • Keys are not all the same size
 • Jumping around in a file is very slow
 • Log2(N) random I/O, ~20 for 1 million keys
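
The figure of about 20 can be checked with a short calculation. The 150 IOPS value used for the time estimate is an assumption taken from the middle of the range quoted earlier for spinning disks.

```python
import math

def binary_search_probes(num_keys: int) -> int:
    """Worst-case number of probes for a binary search over sorted keys."""
    return math.ceil(math.log2(num_keys))

probes = binary_search_probes(1_000_000)
ASSUMED_RANDOM_IOPS = 150                  # assumption, not a measurement
seconds_per_lookup = probes / ASSUMED_RANDOM_IOPS
print(probes, round(seconds_per_lookup, 2))
```

At that rate a single lookup would cost more than a tenth of a second in disk seeks alone.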
Reading from an SSTable
Let’s index key ranges in the SSTable
[Diagram: index entries pointing into the SSTable: key k-128 at position 12098, key k-256 at position 23445, key k-384 at position 43678.]
SSTable index
• Saved with each SSTable

• Stores key ranges and their offsets: [(Key, Offset)]

• Saved on disk but kept in memory

• Avoids searching for a key by scanning the file

• Configurable key interval (default: 128)
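
A sparse index of this kind can be sketched with a binary search over the sampled keys followed by a short scan of one interval. In this illustration an in-memory list of (key, offset) pairs stands in for the SSTable file, and `build_index` and `find_offset` are hypothetical helper names.

```python
from bisect import bisect_right

def build_index(sorted_entries, interval=128):
    """Sample every `interval`-th (key, offset) pair from a sorted SSTable."""
    return sorted_entries[::interval]

def find_offset(index, sorted_entries, key, interval=128):
    """Locate `key` by jumping to its index interval, then scanning it."""
    sampled_keys = [k for k, _ in index]
    slot = bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None                      # key sorts before the first entry
    start = slot * interval
    # Only this one interval is scanned, not the whole file.
    for candidate, offset in sorted_entries[start:start + interval]:
        if candidate == key:
            return offset
    return None

# Usage: 1000 keys produce 8 index entries at the default interval.
entries = [(f"k-{i:05d}", i * 100) for i in range(1000)]
index = build_index(entries)
offset = find_offset(index, entries, "k-00300")
```

A smaller interval means a larger index in memory but a shorter scan per lookup.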
SSTable index
[Diagram: the Bloom filter and the SSTable index are held in the JVM, alongside the Memtable; the commit log and SSTable are on disk.]
Sometimes not enough

• Storing key ranges is limited

• We can do better by storing the exact offset

• This saves approximately one I/O
The key cache
[Diagram: the key cache sits in the JVM between the Bloom filter and the SSTable index; the commit log and SSTable are on disk.]
Key cache

• Stores the exact location in the SSTable

• Stored in heap

• Avoids having to scan a whole index interval

• Size is configurable per column family
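
One way to picture the key cache is as a bounded map from (SSTable, row key) to an exact file offset. The sketch below uses least-recently-used eviction, which is an assumption for the example; the `KeyCache` class and its methods are hypothetical names.

```python
from collections import OrderedDict

class KeyCache:
    """Bounded LRU map: (sstable id, row key) -> exact offset in that SSTable."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, sstable_id, key):
        offset = self._entries.get((sstable_id, key))
        if offset is not None:
            self._entries.move_to_end((sstable_id, key))   # mark as recently used
        return offset

    def put(self, sstable_id, key, offset):
        self._entries[(sstable_id, key)] = offset
        self._entries.move_to_end((sstable_id, key))
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)              # evict least recently used

# Usage: with room for two entries, the untouched one is evicted first.
cache = KeyCache(capacity=2)
cache.put(1, "a", 10)
cache.put(1, "b", 20)
cache.get(1, "a")
cache.put(1, "c", 30)
```

A hit returns the offset directly, so the read can skip the scan of an index interval.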
[Diagram, repeated over seven slides, numbering the steps of a read: 1. Memtable (in the JVM), 2. Row cache (off-heap, no GC), 3. Bloom filter, 4. Key cache, 5. SSTable index, 6. SSTable (on disk).]
Distributed counters

• 64-bit signed integer, replicated in the cluster

• Atomic inc and dec by an arbitrary amount

• Counting with read-inc-write would be inefficient

• Stored differently from regular columns
Consider a cluster with 3 nodes, RF=3
Internal counter data
• List of increments received by the local node
• Summaries (Version,Sum) sent by the other nodes
• The total value is the sum of all counts

[Diagram: one node's counter data. Local increments: +5, +2, -3. Received from one replica: version 3, count 5. Received from the other replica: version 5, count 10.]
Incrementing a counter
• A coordinator node is chosen (blue node)

• Stores its increment locally

• Reads back the sum of its increments

• Forwards a summary to other replicas: (v.4, sum 5)

• Replicas update their records: received from the coordinator, version 4, count 5

[Diagram: the coordinator's local increments are +5, +2, -3, then +1 once the new increment is stored.]
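
The increment steps can be sketched as follows. This is an illustration under stated assumptions: the `CounterReplica` class is hypothetical, and the version is taken to be the number of local increments, which matches the (v.4, sum 5) example but is not confirmed as the real versioning scheme.

```python
class CounterReplica:
    """One replica's view of a single counter."""

    def __init__(self, name: str):
        self.name = name
        self.local_increments: list[int] = []
        # node name -> (version, sum) summaries received from other nodes
        self.received: dict[str, tuple[int, int]] = {}

    def increment(self, delta: int, replicas: list["CounterReplica"]) -> tuple[int, int]:
        """Act as coordinator: store locally, read back the sum, forward a summary."""
        self.local_increments.append(delta)
        # Assumption: version = number of increments stored so far.
        summary = (len(self.local_increments), sum(self.local_increments))
        for replica in replicas:
            replica.receive(self.name, summary)
        return summary

    def receive(self, sender: str, summary: tuple[int, int]) -> None:
        # Keep a summary only if it is newer than the one already held.
        current = self.received.get(sender)
        if current is None or summary[0] > current[0]:
            self.received[sender] = summary

# Usage: replay the increments from the slides with "a" as coordinator.
a, b, c = CounterReplica("a"), CounterReplica("b"), CounterReplica("c")
for delta in (5, 2, -3):
    a.increment(delta, [b, c])
summary = a.increment(1, [b, c])
```

The read of the local sum before forwarding is the "read on write" cost that regular column writes avoid.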
Reading a counter

• Replicas return their counts and versions

• Including what they know about other nodes

• Only the most recent versions are kept
[Diagram: two replicas each return one (version, count) record per node. The first returns (v. 3, count 5), (v. 6, count 12), (v. 2, count 8); the second returns (v. 3, count 5), (v. 5, count 10), (v. 4, count 5). One node in the diagram is labelled version: 6, count: 12.]
Counter value: 5 + 12 + 5 = 22
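
The rule that only the most recent versions are kept can be sketched as a merge over the replica responses. The node names are hypothetical, and assigning each record to a node by its position in the response is an assumption about the diagram; the values are the ones shown on the slides.

```python
def merge_counter(responses):
    """Combine replica responses, keeping the highest version seen per node.

    Each response maps a node name to its (version, count) summary.
    """
    latest = {}
    for response in responses:
        for node, (version, count) in response.items():
            if node not in latest or version > latest[node][0]:
                latest[node] = (version, count)
    total = sum(count for _, count in latest.values())
    return total, latest

# Usage: the two responses from the slides, keyed by assumed node names.
responses = [
    {"node1": (3, 5), "node2": (6, 12), "node3": (2, 8)},
    {"node1": (3, 5), "node2": (5, 10), "node3": (4, 5)},
]
total, latest = merge_counter(responses)
```

For node3 the newer record (v. 4, count 5) replaces the stale (v. 2, count 8), which is why the result is 22 rather than 25.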
Storage problems
Tuning
• Cassandra can’t really use large amounts of RAM

• Garbage Collection pauses stop everything

• Compaction has an impact on performance

• Reading from disk is slow

• These limitations restrict the size of each node
Recap
• Fast sequential writes

• ~1 I/O for uncached reads, 0 for cached

• Counter increments need a read on write; regular columns don’t

• Know where your time is spent (monitor!)

• Tune accordingly
Questions?


http://www.flickr.com/photos/kubina/326628918/sizes/l/in/photostream/
http://www.flickr.com/photos/alwarrete/5651579563/sizes/o/in/photostream/
http://www.flickr.com/photos/pio1976/3330670980/sizes/o/in/photostream/
http://www.flickr.com/photos/lwr/100518736/sizes/l/in/photostream/
• In-kernel backend
• No Garbage Collection
• No need to plan heavy compactions
• Low and consistent latency
• Full versioning, snapshots
• No degradation with Big Data

Más contenido relacionado

La actualidad más candente

Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Clusterpercona2013
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Lucidworks (Archived)
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters
 
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALEPostgreSQL Experts, Inc.
 
Webinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera ClusterWebinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera ClusterSeveralnines
 
OpenZFS data-driven performance
OpenZFS data-driven performanceOpenZFS data-driven performance
OpenZFS data-driven performanceahl0003
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0enissoz
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsRomain Jacotin
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!BertrandDrouvot
 
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax Academy
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Yoshinori Matsunobu
 
006 performance tuningandclusteradmin
006 performance tuningandclusteradmin006 performance tuningandclusteradmin
006 performance tuningandclusteradminScott Miao
 
Benchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseBenchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseChristopher Choi
 
Replication Solutions for PostgreSQL
Replication Solutions for PostgreSQLReplication Solutions for PostgreSQL
Replication Solutions for PostgreSQLPeter Eisentraut
 
M|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB ServerM|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB ServerMariaDB plc
 

La actualidad más candente (19)

Migrating to XtraDB Cluster
Migrating to XtraDB ClusterMigrating to XtraDB Cluster
Migrating to XtraDB Cluster
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
 
Webinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera ClusterWebinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera Cluster
 
OpenZFS data-driven performance
OpenZFS data-driven performanceOpenZFS data-driven performance
OpenZFS data-driven performance
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
The Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systemsThe Google Chubby lock service for loosely-coupled distributed systems
The Google Chubby lock service for loosely-coupled distributed systems
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
 
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The Sequel
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)
 
006 performance tuningandclusteradmin
006 performance tuningandclusteradmin006 performance tuningandclusteradmin
006 performance tuningandclusteradmin
 
Benchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseBenchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBase
 
Replication Solutions for PostgreSQL
Replication Solutions for PostgreSQLReplication Solutions for PostgreSQL
Replication Solutions for PostgreSQL
 
M|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB ServerM|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB Server
 

Similar a Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State DrivesRick Branson
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme MakeoverHBaseCon
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionSchubert Zhang
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalabilityjbellis
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: CassandraAcunu
 
深入了解Redis
深入了解Redis深入了解Redis
深入了解Redisiammutex
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...JAXLondon2014
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonUri Cohen
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanWebinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanVerverica
 
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...peknap
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...DataStax Academy
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
HBase: Extreme makeover
HBase: Extreme makeoverHBase: Extreme makeover
HBase: Extreme makeoverbigbase
 
[G2]fa ce deview_2012
[G2]fa ce deview_2012[G2]fa ce deview_2012
[G2]fa ce deview_2012NAVER D2
 
Cassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write pathCassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write pathJoshua McKenzie
 

Similar a Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix (20)

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
 
DaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solutionDaStor/Cassandra report for CDR solution
DaStor/Cassandra report for CDR solution
 
Cachememory
CachememoryCachememory
Cachememory
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
 
Shignled disk
Shignled diskShignled disk
Shignled disk
 
Progressive NOSQL: Cassandra
Progressive NOSQL: CassandraProgressive NOSQL: Cassandra
Progressive NOSQL: Cassandra
 
Flash 101
Flash 101Flash 101
Flash 101
 
深入了解Redis
深入了解Redis深入了解Redis
深入了解Redis
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Memory (Computer Organization)
Memory (Computer Organization)Memory (Computer Organization)
Memory (Computer Organization)
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanWebinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
 
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra...
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
HBase: Extreme makeover
HBase: Extreme makeoverHBase: Extreme makeover
HBase: Extreme makeover
 
[G2]fa ce deview_2012
[G2]fa ce deview_2012[G2]fa ce deview_2012
[G2]fa ce deview_2012
 
Cassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write pathCassandra 2.1 boot camp, Read/Write path
Cassandra 2.1 boot camp, Read/Write path
 

Más de Acunu

Acunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu and Hailo: a realtime analytics case study on Cassandra
Acunu and Hailo: a realtime analytics case study on CassandraAcunu
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu Analytics and Cassandra at Hailo All Your Base 2013
Acunu Analytics and Cassandra at Hailo All Your Base 2013 Acunu
 
Understanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsUnderstanding Cassandra internals to solve real-world problems
Understanding Cassandra internals to solve real-world problemsAcunu
 
Acunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu Analytics: Simpler Real-Time Cassandra Apps
Acunu Analytics: Simpler Real-Time Cassandra AppsAcunu
 
All Your Base
All Your BaseAll Your Base
All Your BaseAcunu
 
Realtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraRealtime Analytics with Apache Cassandra
Realtime Analytics with Apache CassandraAcunu
 
Realtime Analytics with Apache Cassandra - JAX London
Realtime Analytics with Apache Cassandra - JAX LondonRealtime Analytics with Apache Cassandra - JAX London
Realtime Analytics with Apache Cassandra - JAX LondonAcunu
 
Real-time Cassandra
Real-time CassandraReal-time Cassandra
Real-time CassandraAcunu
 
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...Acunu
 
Realtime Analytics with Cassandra
Realtime Analytics with CassandraRealtime Analytics with Cassandra
Realtime Analytics with CassandraAcunu
 
Acunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu Analytics @ Cassandra London
Acunu Analytics @ Cassandra LondonAcunu
 
Exploring Big Data value for your business
Exploring Big Data value for your businessExploring Big Data value for your business
Exploring Big Data value for your businessAcunu
 
Realtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraRealtime Analytics on the Twitter Firehose with Cassandra
Realtime Analytics on the Twitter Firehose with CassandraAcunu
 
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...
Cassandra EU 2012 - Overview of Case Studies and State of the Market by 451 R...Acunu
 
Cassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraCassandra EU 2012 - Putting the X Factor into Cassandra
Cassandra EU 2012 - Putting the X Factor into CassandraAcunu
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsAcunu
 
Next Generation Cassandra
Next Generation CassandraNext Generation Cassandra
Next Generation CassandraAcunu
 
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans
Cassandra EU 2012 - CQL: Then, Now and When by Eric Evans Acunu
 
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...
Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam...Acunu
 




Cassandra EU 2012 - Storage Internals by Nicolas Favre-Felix

  • 1. Cassandra storage internals Nicolas Favre-Felix Cassandra Europe 2012
  • 2. What this talk covers • What happens within a Cassandra node • How Cassandra reads and writes data • What compaction is and why we need it • How counters are stored, modified, and read
  • 3. Concepts • Memtables • On heap, off-heap • SSTables • Compaction • Commit Log • Bloom filters • Key cache • SSTable index • Row cache • Counters
  • 4. Why is this important? • Understand what goes on under the hood • Understand the reasons for these choices • Diagnose issues • Tune Cassandra for performance • Make your data model efficient
  • 5. A word about hard drives
  • 6. A word about hard drives • Main driver behind Cassandra’s storage choices • The last moving part • Fast sequential I/O (150 MB/s) • Slow random I/O (120-200 IOPS)
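The gap between the two figures on this slide can be made concrete with a little arithmetic. The sketch below uses the 150 MB/s and 120-200 IOPS numbers from the slide; the 4 KiB size of each random request is an assumption made only for illustration.

```python
# Figures from the slide; the 4 KiB request size is an assumption.
SEQUENTIAL_BYTES_PER_S = 150 * 10**6
RANDOM_IOPS_RANGE = (120, 200)
BYTES_PER_RANDOM_IO = 4096


def random_bytes_per_s(iops, bytes_per_io=BYTES_PER_RANDOM_IO):
    """Throughput when every request needs its own seek."""
    return iops * bytes_per_io


for iops in RANDOM_IOPS_RANGE:
    rate = random_bytes_per_s(iops)
    ratio = SEQUENTIAL_BYTES_PER_S / rate
    print(f"{iops} IOPS: {rate / 10**6:.2f} MB/s random, "
          f"sequential is ~{ratio:.0f}x faster")
```

Under that assumption a spinning disk moves data a couple of hundred times faster sequentially than randomly, which is the motivation for the sequential write path described in the following slides.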
  • 7. What SSDs bring • Fast sequential I/O • Fast random I/O • Higher cost • Limited lifetime • Performance degradation
  • 8. Disk usage with B-trees • Important data structure in relational databases • In-place overwrites (random I/O) • log_B(N) random accesses for reads and writes, where B is the branching factor of the tree
  • 9. Disk usage with Cassandra • Made for spinning disks • Sequential writes, much less than 1 I/O per insert • Several layers of cache • Random reads, approximately 1 I/O per read • Generally “write-optimised”
  • 11. Writing to Cassandra Let’s add a row with a few columns Row Key Column Column Column Column
  • 12. The Cassandra write path (diagram) In the JVM: new data goes into the Memtable. On disk: the commit log.
  • 13. The Commit Log • Each write is added to a log file • Guarantees durability after a crash • 1-second window during which data is still in RAM • Sequential I/O • A dedicated disk is recommended
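The commit log idea (append every write sequentially, sync to disk periodically, replay after a crash) can be sketched in a few lines. This is a toy illustration, not Cassandra's file format: the `CommitLog` class, the length-prefixed record layout and the `replay` helper are all assumptions. The `sync_interval_s` parameter mirrors the 1-second window mentioned on the slide.

```python
import os
import tempfile
import time


class CommitLog:
    """Hypothetical append-only log with periodic fsync."""

    def __init__(self, path, sync_interval_s=1.0):
        self.file = open(path, "ab")
        self.sync_interval_s = sync_interval_s
        self.last_sync = time.monotonic()

    def append(self, record):
        # Length-prefixed records, always written at the end of the file.
        self.file.write(len(record).to_bytes(4, "big") + record)
        # Until the next sync, the record may only exist in RAM.
        if time.monotonic() - self.last_sync >= self.sync_interval_s:
            self.sync()

    def sync(self):
        self.file.flush()
        os.fsync(self.file.fileno())
        self.last_sync = time.monotonic()

    def close(self):
        self.sync()
        self.file.close()


def replay(path):
    """Read every complete record back, as would happen after a restart."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            records.append(f.read(int.from_bytes(header, "big")))
    return records


with tempfile.TemporaryDirectory() as directory:
    log_path = os.path.join(directory, "commit.log")
    log = CommitLog(log_path)
    log.append(b"row1:name=alice")
    log.append(b"row1:email=a@example.com")
    log.close()
    recovered = replay(log_path)
```

Because the file is only ever appended to, the disk head does not need to seek between writes, which is why a dedicated disk helps.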
  • 14. Memtables • In-memory Key/Value data structure • Implemented with ConcurrentSkipListMap • One per column family • Very fast inserts • Columns are merged in memory for the same key • Flushed at a certain threshold, into an SSTable
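A minimal sketch of the Memtable behaviour listed above: columns for the same key are merged in memory, and the whole structure is flushed in sorted order once a threshold is reached. Several things here are assumptions made for illustration: a plain `dict` sorted at flush time stands in for `ConcurrentSkipListMap`, the threshold counts columns, and rows are sorted by raw key.

```python
class Memtable:
    """Hypothetical in-memory table: row key -> {column: (timestamp, value)}."""

    def __init__(self, flush_threshold=4):
        self.rows = {}
        self.flush_threshold = flush_threshold  # assumption: counted in columns
        self.column_count = 0

    def put(self, row_key, column, value, timestamp):
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        if current is None:
            self.column_count += 1
        # Merge in memory: the write with the newest timestamp wins.
        if current is None or timestamp >= current[0]:
            columns[column] = (timestamp, value)

    def is_full(self):
        return self.column_count >= self.flush_threshold

    def flush(self):
        """Return rows and columns in sorted order, ready for one sequential write."""
        sstable = [(key, sorted(columns.items()))
                   for key, columns in sorted(self.rows.items())]
        self.rows = {}
        self.column_count = 0
        return sstable


memtable = Memtable(flush_threshold=3)
memtable.put("user:2", "name", "bob", timestamp=1)
memtable.put("user:1", "name", "alice", timestamp=1)
memtable.put("user:1", "name", "alicia", timestamp=2)  # overwrites in memory
memtable.put("user:1", "email", "a@example.com", timestamp=2)
full = memtable.is_full()
flushed = memtable.flush()
```

The overwrite of `name` never reaches disk as a separate entry, which is one reason inserts cost much less than one I/O each.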
  • 15. Dumping a Memtable on disk (diagram) In the JVM: a full Memtable. On disk: the commit log.
  • 16. Dumping a Memtable on disk (diagram) In the JVM: a new Memtable. On disk: the commit log and an SSTable.
  • 17. The SSTable • One file, written sequentially • Columns are in order, grouped by row • Immutable once written, no updates!
  • 18. SSTables start piling up! (diagram) In the JVM: the Memtable. On disk: the commit log and a growing pile of SSTables.
  • 19. SSTables • Can’t keep all of them forever • Need to reclaim disk space • Reads could touch several SSTables • Scans touch all of them • In-memory data structures per SSTable
  • 21. Compaction • Merge SSTables of similar size together • Remove overwrites and deleted data (timestamps) • Improve range query performance • Major compaction creates a single SSTable • I/O intensive operation
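The merge step of compaction can be sketched as a streaming merge of already-sorted inputs, keeping only the newest version of each column and dropping deleted data. This is a simplified illustration under stated assumptions: each SSTable is modelled as a list of `(row_key, column, timestamp, value)` tuples sorted by row and column, a deleted column is represented by a `None` value, and deletions are dropped immediately (Cassandra keeps tombstones for a grace period, which this sketch ignores).

```python
import heapq
from itertools import groupby

TOMBSTONE = None  # assumption: a deleted column is stored with a None value


def compact(sstables):
    """Merge sorted SSTables into one, newest timestamp wins per column."""
    # Streaming merge of sorted inputs: sequential reads, no random access.
    stream = heapq.merge(*sstables, key=lambda cell: cell[:3])
    output = []
    for _, versions in groupby(stream, key=lambda cell: cell[:2]):
        newest = list(versions)[-1]  # highest timestamp sorts last
        if newest[3] is not TOMBSTONE:
            output.append(newest)
    return output


older = [("k1", "a", 1, "x"), ("k1", "b", 1, "y"), ("k2", "a", 1, "z")]
newer = [("k1", "a", 2, "x2"), ("k2", "a", 3, TOMBSTONE)]
compacted = compact([older, newer])
```

In this example the overwritten value of `k1/a` and the deleted column `k2/a` both disappear, so the output is smaller than the sum of its inputs.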
  • 22. Recent improvements • Pluggable compaction • Different strategies, chosen per column family • SSTable compression • More efficient SSTable merges
  • 24. Reading from Cassandra • Reading all these SSTables would be very inefficient • We have to read from memory as much as possible • Otherwise we need to do 2 things efficiently: • Find the right SSTable to read from • Find where in that SSTable to read the data
  • 25. First step for reads • The Memtable! • Read the most recent data • Very fast, no need to touch the disk
  • 26. (diagram) Off-heap (no GC): the row cache. In the JVM: the Memtable. On disk: the commit log and the SSTable.
  • 27. Row cache • Stores a whole row in memory • Off-heap, not subject to Garbage Collection • Size is configurable per column family • Last resort before having to read from disk
  • 28. Finding the right SSTable (diagram) In the JVM: the Memtable. On disk: the commit log and many SSTables.
  • 29. Bloom filter • Saved with each SSTable • Answers “contains(Key) :: boolean” • Saved on disk but kept in memory • Probabilistic data structure • Configurable proportion of false positives • No false negatives
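A small Bloom filter shows where the "no false negatives, some false positives" property comes from. This is a generic sketch, not Cassandra's implementation: the `BloomFilter` class, its default sizes and the use of SHA-256 to derive bit positions are all assumptions chosen to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    """Hypothetical fixed-size Bloom filter over string keys."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bloom = BloomFilter()
for key in ("k-128", "k-256"):
    bloom.add(key)
```

Every key that was added sets all of its bits, so a lookup for it can never return False. A key that was never added may still find all of its bits set by other keys, which is the false positive case; using more bits per key lowers that proportion.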
  • 30. Bloom filter (diagram) In the JVM: the Memtable and one Bloom filter per SSTable, each answering exists(key)? with true/false. On disk: the commit log and the SSTables.
  • 31. Reading from an SSTable • We need to know where in the file our data is saved • Keys are sorted, why don’t we do a binary search? • Keys are not all the same size • Jumping around in a file is very slow • log2(N) random I/Os, ~20 for 1 million keys
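The "~20 for 1 million keys" figure follows directly from the logarithm, and combining it with the 120-200 IOPS quoted earlier in the talk gives a rough latency. The helper name below is hypothetical, and the calculation assumes every probe of the binary search costs one random I/O.

```python
import math


def binary_search_probes(num_keys):
    """Worst-case number of probes for a binary search over num_keys entries."""
    return math.ceil(math.log2(num_keys))


probes = binary_search_probes(1_000_000)
best_case_s = probes / 200   # at 200 IOPS
worst_case_s = probes / 120  # at 120 IOPS
print(f"{probes} random I/Os, roughly {best_case_s * 1000:.0f}"
      f" to {worst_case_s * 1000:.0f} ms per lookup")
```

A lookup costing on the order of a tenth of a second is far too slow, hence the index described in the next slides.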
  • 32. Reading from an SSTable Let’s index key ranges in the SSTable (diagram) Key k-128 at position 12098, key k-256 at position 23445, key k-384 at position 43678.
  • 33. SSTable index • Saved with each SSTable • Stores key ranges and their offsets: [(Key, Offset)] • Saved on disk but kept in memory • Avoids searching for a key by scanning the file • Configurable key interval (default: 128)
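A sparse index of this kind can be sketched with a sorted list and a binary search in memory. The function names and the synthetic keys and offsets below are assumptions for illustration; only the idea of sampling one `(Key, Offset)` pair per interval (default 128) comes from the slide.

```python
from bisect import bisect_right


def build_sparse_index(sorted_entries, interval=128):
    """Keep one (key, offset) pair out of every `interval` entries."""
    return sorted_entries[::interval]


def scan_start(index, key):
    """Offset of the last sampled key <= key, or None if key sorts before all."""
    keys = [sampled_key for sampled_key, _ in index]
    i = bisect_right(keys, key) - 1
    return index[i][1] if i >= 0 else None


# Synthetic SSTable contents: 1000 sorted keys, 100 bytes apart.
entries = [(f"k-{i:05d}", i * 100) for i in range(1000)]
index = build_sparse_index(entries)
start = scan_start(index, "k-00300")
```

Here the lookup for `k-00300` lands on the sampled key `k-00256`, so the reader seeks once and then scans forward at most one interval instead of scanning the whole file.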
  • 34. SSTable index (diagram) In the JVM: the Memtable, the Bloom filter and the SSTable index. On disk: the commit log and the SSTable.
  • 35. Sometimes not enough • Storing key ranges is limited • We can do better by storing the exact offset • This saves approximately one I/O
  • 36. The key cache (diagram) In the JVM: the Memtable, the Bloom filter, the key cache and the SSTable index. On disk: the commit log and the SSTable.
  • 37. Key cache • Stores the exact location in the SSTable • Stored in heap • Avoids having to scan a whole index interval • Size is configurable per column family
  • 38-44. (diagram, repeated on slides 38 to 44) Numbered steps of a read: 1. Memtable (in the JVM) 2. Row cache (off-heap, no GC) 3. Bloom filter 4. Key cache 5. SSTable index 6. SSTable (on disk, next to the commit log)
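The order in which a read consults these structures can be sketched as a chain of lookups that stops at the first hit. This is a deliberately simplified model with hypothetical names: dictionaries stand in for the Memtable, row cache, key cache and index, a `set` stands in for the Bloom filter, and the index maps keys to exact offsets rather than ranges. A real read also has to merge columns found in the Memtable and in several SSTables by timestamp, which this sketch does not do.

```python
def read_row(key, memtable, row_cache, sstables):
    """Return (row, steps) where steps lists the structures consulted."""
    steps = ["memtable"]
    if key in memtable:
        return memtable[key], steps
    steps.append("row cache")
    if key in row_cache:
        return row_cache[key], steps
    for sstable in sstables:
        steps.append("bloom filter")
        if key not in sstable["bloom"]:
            continue  # definitely not in this SSTable, skip the disk
        steps.append("key cache")
        offset = sstable["key_cache"].get(key)
        if offset is None:
            steps.append("sstable index")
            offset = sstable["index"].get(key)
        if offset is not None:
            steps.append("sstable")  # the only step that touches the disk
            return sstable["data"][offset], steps
    return None, steps


sstable = {"bloom": {"k1"}, "key_cache": {}, "index": {"k1": 0},
           "data": {0: "row-1"}}
cold_read = read_row("k1", {}, {}, [sstable])
hot_read = read_row("k2", {"k2": "fresh"}, {}, [sstable])
miss = read_row("zz", {}, {}, [sstable])
```

The `miss` case shows the value of the Bloom filter: a key that is in no SSTable is rejected from memory without any disk access.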
  • 46. Distributed counters • 64-bit signed integer, replicated in the cluster • Atomic inc and dec by an arbitrary amount • Counting with read-inc-write would be inefficient • Stored differently from regular columns
  • 47. Consider a cluster with 3 nodes, RF=3
  • 48. Internal counter data • List of increments received by the local node • Summaries (Version,Sum) sent by the other nodes • The total value is the sum of all counts
  • 49. Internal counter data • List of increments received by the local node • Summaries (Version,Sum) sent by the other nodes • The total value is the sum of all counts (diagram) Local increments: +5 +2 -3. Received from one node: version 3, count 5. Received from another node: version 5, count 10.
  • 50. Incrementing a counter • A coordinator node is chosen (blue node) Local increments +5 +2 -3
  • 51. Incrementing a counter • A coordinator node is chosen • Stores its increment locally Local increments +5 +2 -3 +1
  • 52. Incrementing a counter • A coordinator node is chosen • Stores its increment locally • Reads back the sum of its increments Local increments +5 +2 -3 +1
  • 53. Incrementing a counter • A coordinator node is chosen • Stores its increment locally • Reads back the sum of its increments • Forwards a summary to other replicas: (v.4, sum 5) Local increments +5 +2 -3 +1
  • 54. Incrementing a counter • A coordinator node is chosen • Stores its increment locally • Reads back the sum of its increments • Forwards a summary to other replicas • Replicas update their records (diagram) Received from the coordinator: version 4, count 5.
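The increment steps above can be sketched as follows. The `CounterReplica` class and its method names are hypothetical, and the sketch assumes that a node's version is simply the number of increments it has stored locally, which matches the "(v.4, sum 5)" summary after the four increments +5, +2, -3, +1 but is not confirmed by the slides.

```python
class CounterReplica:
    """Hypothetical per-node state for one distributed counter."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_increments = []  # increments this node coordinated
        self.remote = {}            # other node id -> (version, sum)

    def increment(self, delta, replicas):
        # 1. Store the increment locally.
        self.local_increments.append(delta)
        # 2. Read back the sum of local increments (the read on write).
        summary = (len(self.local_increments), sum(self.local_increments))
        # 3. Forward the summary to the other replicas.
        for replica in replicas:
            replica.receive(self.node_id, summary)
        return summary

    def receive(self, node_id, summary):
        known = self.remote.get(node_id)
        if known is None or summary[0] > known[0]:
            self.remote[node_id] = summary  # keep the most recent version

    def value(self):
        return (sum(self.local_increments)
                + sum(count for _, count in self.remote.values()))


a, b, c = CounterReplica("a"), CounterReplica("b"), CounterReplica("c")
for delta in (+5, +2, -3):
    a.increment(delta, [b, c])
summary = a.increment(+1, [b, c])
```

Step 2 is the reason counter increments need a read while regular column writes do not.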
  • 55. Reading a counter • Replicas return their counts and versions • Including what they know about other nodes • Only the most recent versions are kept
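  • 57. Reading a counter (diagram) One replica returns {v. 3, count 5; v. 6, count 12; v. 2, count 8}. Another returns {v. 3, count 5; v. 5, count 10; v. 4, count 5}. One node is labelled version: 6, count: 12.
  • 58. Reading a counter (diagram) One replica returns {v. 3, count 5; v. 6, count 12; v. 2, count 8}. Another returns {v. 3, count 5; v. 5, count 10; v. 4, count 5}. One node is labelled version: 6, count: 12. Counter value: 5 + 12 + 5 = 22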
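The total of 22 can be reproduced by keeping, for each node, the entry with the highest version and summing the counts. The function name and the node labels `A`, `B` and `C` are hypothetical; the versions and counts are the ones shown in the diagram.

```python
def merge_counter(replica_responses):
    """Keep the newest (version, count) per node, then sum the counts."""
    newest = {}
    for response in replica_responses:
        for node, (version, count) in response.items():
            if node not in newest or version > newest[node][0]:
                newest[node] = (version, count)
    return sum(count for _, count in newest.values())


first_replica = {"A": (3, 5), "B": (6, 12), "C": (2, 8)}
second_replica = {"A": (3, 5), "B": (5, 10), "C": (4, 5)}
total = merge_counter([first_replica, second_replica])
```

Node `B` contributes 12 (version 6 beats version 5) and node `C` contributes 5 (version 4 beats version 2), giving 5 + 12 + 5 = 22.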
  • 60. Tuning • Cassandra can’t really use large amounts of RAM • Garbage Collection pauses stop everything • Compaction has an impact on performance • Reading from disk is slow • These limitations restrict the size of each node
  • 61. Recap • Fast sequential writes • ~1 I/O for uncached reads, 0 for cached • Counter increments read on write, columns don’t • Know where your time is spent (monitor!) • Tune accordingly
  • 63. • In-kernel backend • No Garbage Collection • No need to plan heavy compactions • Low and consistent latency • Full versioning, snapshots • No degradation with Big Data