Jeff Jirsa
Using TimeWindowCompactionStrategy for Time Series Workloads
1 Who Am I?
2 LSM DBs
3 TWCS
4 The 1%
5 Things Nobody Else Told You About Compaction
6 Q&A
Who Am I?
(Or: Why You Should Believe Me)
We’ve Spent Some Time With Time Series
• We keep some data from sensors for a fixed time period
• Processes
• DNS queries
• Executables created
• It’s a LOT of data
• 2015 Talk: One million writes per second with 60 nodes
• Multiple Petabytes Per Cluster
We’ve Spent Some Time With Time Series
• TWCS was written to solve problems CrowdStrike faced in production
• It wasn’t meant to be clever, it was meant to be efficient and easy to reason about
• I’m on the pager rotation; this directly impacts my quality of life
We’ve Spent Some Time With Time Series
• I have better things to do on my off time
Log Structured – Database, Not Cabins
If You’re Going To Use Cassandra, Let’s Make Sure We Know How It Works
Log Structured Merge Trees
• Cassandra write path:
1. First the Commitlog
2. Then the Memtable
3. Eventually flushed to an SSTable
• Each SSTable is written exactly once
• Over time, Cassandra combines those data files:
• Duplicate cells are merged
• Obsolete data is purged
• On reads, Cassandra searches for data in each SSTable, merging any existing records and returning the result
Real World, Real Problems
• If you can’t get compaction happy, your cluster will never be happy
• The write path relies on efficient flushing
• If your compaction strategy falls behind, you can block flushes (CASSANDRA-9882)
• The read path relies on efficient merging
• If your compaction strategy falls behind, each read may touch hundreds or thousands of sstables
• IO bound clusters are common, even with SSDs
• Dynamic Snitch - latency + “severity”
What We Hope For
• We accept that we need to compact sstables sometimes, but we want to do it when we have a good reason
• Good reasons:
• Data has been deleted and we want to reclaim space
• Data has been overwritten and we want to avoid merges on reads
• Our queries span multiple sstables, and we’re having to touch a lot of sstables on each read
• Bad Reasons:
• We hit some magic size threshold and we want to join two non-overlapping files together
• We’re aiming for a situation where the merge on read is tolerable
• Bloom filter is your friend – let’s read from as few sstables as possible
• We want as few tombstones as possible (this includes expired data)
• Tombstones create garbage, garbage creates sadness
Use The Defaults?
It’s Not Just Naïve, It’s Also Expensive
The Basics: SizeTieredCompactionStrategy
• Each time min_threshold (4) files of the same size appear, combine them into a new file
• Over time, you’ll naturally end up with a distribution of old data in large files, new data in small files
• Deleted data in large files stays on disk longer than desired because those files are very rarely compacted
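For reference, this is roughly what that looks like as table options; a sketch in CQL, with the keyspace and table names hypothetical. min_threshold is the “(4)” above:

  ALTER TABLE ks.events WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': '4',   -- compact once 4 similarly-sized sstables exist
    'max_threshold': '32'   -- never merge more than 32 files in one compaction
  };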
SizeTieredCompactionStrategy
If each of the smallest blocks represents 1 day of data, and each write had a 90-day TTL, when do you actually delete files and reclaim disk space?
SizeTieredCompactionStrategy
• Expensive IO:
• Far more writes than necessary, you’ll recompact old data weeks after it was written
• Reads may touch a ton of sstables – we have no control over how data will be arranged on disk
• Expensive Operationally:
• Expired data doesn’t get dropped until you happen to re-compact the table it’s in
• You have to keep up to 50% spare disk
TWCS
Because Everything Else Made Me Sad
Kübler-Ross Stages of Grief
• Denial
• Anger
• Bargaining
• Depression
• Acceptance
Sad Operator: Stages of Grief
• Denial
• STCS and LCS aren’t gonna work, but DTCS will fix it
• Anger
• DTCS seemed to be the fix, and it didn’t work, either
• Bargaining
• What if we tweak all these sub-properties? What if we just fix things one at a time?
• Depression
• Still SOL at ~hundred node scale
• Can we get through this? Is it time for a therapist’s couch?
Sad Operator: Stages of Grief
• Acceptance
• Compaction is pluggable, we’ll write it ourselves
• Designed to be simple and efficient
• Group sstables into logical buckets
• STCS in the newest time window
• No more confusing options, just Window Size + Window Unit
• Base time seconds? Max age days? Overloading min_threshold for grouping? Not today.
• “12 Hours”, “3 Days”, “6 Minutes”
• Pick a window size that gives you 20-30 buckets on disk (see the sketch below)
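As a sketch (keyspace/table names hypothetical), a “12 Hours” configuration is just two options. On Cassandra 3.0.8+ the class is simply TimeWindowCompactionStrategy; the JAR built from github.com/jeffjirsa/twcs uses its own package name, so check its README for the exact class string:

  ALTER TABLE ks.sensor_data WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',  -- Window Unit
    'compaction_window_size': '12'      -- Window Size
  };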
That’s It.
• 90 day TTL
• Unit = Days, # = 3
• Each file on disk spans 3 days of data (except the first window), expect ~30 + first window
• Expect to have at least 3 days of extra data on disk*
• 2 hour TTL
• Unit = Minutes, # = 10
• Each file on disk represents 10 minutes of data, expect 12-13 + first window
• Expect to have at least 10 minutes of extra data on disk*
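A minimal sketch of the first case above in CQL (schema hypothetical): 3-day windows plus a table-level 90-day TTL, yielding ~30 windows on disk:

  CREATE TABLE ks.sensor_data (
    sensor_id text,
    ts        timestamp,
    value     blob,
    PRIMARY KEY (sensor_id, ts)
  ) WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '3' }
    AND default_time_to_live = 7776000;  -- 90 days, in seconds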
Example: IO (Real Cluster)
Example: Load (Real Cluster)
The Only Real Optimization You Need
• Align your partition keys to your TWCS windows
• Bloom filter reads will only touch a single sstable
• Deletion gets much easier because you get rid of overlapping ranges
• Bucketing partitions keeps partition sizes reasonable (< 100 MB), which saves you a ton of GC pressure
• If you’re using a 30-day TTL and 1-day TWCS windows, put a “day_of_year” field into the partition key (see the sketch below)
• Use parallel async reads to read more than one day at a time
• Spread reads across multiple nodes
• Each node should touch exactly 1 sstable on disk (watch timezones)
• That sstable is probably hot for all partitions, so it’ll be in page cache
• Extrapolate for other windows (you may have to chunk things up into 3-day buckets or 30-minute buckets, but it’ll be worth it)
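A sketch of the 30-day TTL / 1-day window case (schema and names hypothetical): the day bucket in the partition key keeps each partition inside a single TWCS window, and the application fans reads out across buckets:

  CREATE TABLE ks.events (
    sensor_id   text,
    day_of_year int,        -- bucket aligned to the 1-day TWCS window
    ts          timestamp,
    value       blob,
    PRIMARY KEY ((sensor_id, day_of_year), ts)
  ) WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '1' }
    AND default_time_to_live = 2592000;  -- 30 days

  -- Issue one of these per day, async and in parallel:
  SELECT value FROM ks.events
  WHERE sensor_id = 'abc123' AND day_of_year = 120;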
What We’ve Discussed Is Good Enough For 99%
Of Time Series Use Cases
But let’s make sure the 1% knows what’s up
Out Of Order Writes
• If we mix write timestamps with “USING TIMESTAMP”…
• Life isn’t over; it just potentially blocks expiration
• Goal:
• Avoid mixing timestamps within any given sstable
• Options:
• Don’t mix in the memtable
• Don’t use the memtable
Out Of Order Writes
• Don’t commingle in the memtable
• If we have a queue-like workflow, consider the following option:
• Pause the Kafka consumer / Celery worker / etc.
• “nodetool flush”
• Write old data with “USING TIMESTAMP” (see the sketch after this list)
• “nodetool flush”
• Resume consumers/workers for new data
• Positives: No commingled data
• Negatives: Have to pause ingestion
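The “write old data” step above looks like this (a sketch, reusing the hypothetical table from earlier; values are illustrative). Remember the client-supplied timestamp is in microseconds since the epoch:

  -- Between the two flushes: backfill with the event's original timestamp
  INSERT INTO ks.events (sensor_id, day_of_year, ts, value)
  VALUES ('abc123', 120, '2016-04-29 00:00:00+0000', 0xCAFE)
  USING TIMESTAMP 1461888000000000;  -- 2016-04-29 00:00:00 UTC, in microseconds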
Out Of Order Writes
• Don’t use the memtable
• CQLSSTableWriter
• Yuki has a great blog at: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
• Write sstables offline
• Stream them in with sstableloader
• Positives: No commingled data, no pausing ingestion, incredibly fast, easy to parallelize
• Negatives: Requires code (but it’s not difficult code; your ops team should be able to do it)
Per-Window Major Compaction
• At the end of each window, you’re going to see a major compaction for all sstables in that window
• Expect a jump in CPU usage, disk usage, and disk IO
• The DURATION of these increases depends on your write rate and window size
• Larger windows will take longer to compact because you’ll have more data on disk
• If this is a problem for you, you’re under-provisioned
Per-Window Major Compaction
CPU Usage
During the end-of-window major, CPU usage on ALL OF THE NODES (in all DCs) will increase at the same time. This will likely impact your read latency.
When you validate TWCS, make sure your application works well at this transition.
We can surely fix this; we just need to find a way to avoid cluttering the options.
Per-Window Major Compaction
Disk Usage
During the daily major, disk usage on ALL OF THE NODES will increase at the same time.
Per-Window Major Compaction
Disk Usage
In some cases, you’ll see the window major compaction run twice because of the timing of flush. You can manually flush (via cron) to work around it if it bothers you.
This is on my list of things to fix: there’s no reason to do two majors. Better to either delay the first major until we’re sure it’s time, or keep a history that we’ve already done a window major compaction and skip it the second time.
There Are Things Nobody Told You About
Compaction
The More You Know…
Things Nobody Told You About Compaction
• Compaction Impacts Read Performance More Than Write Performance
• Typical advice is use LCS if you need fast reads, STCS if you need fast writes
• LCS optimizes reads by limiting the # of potential SSTables you’ll need to touch on the read path
• The goal of LCS (fast reads/low latency) and the act of keeping levels are in competition with each other
• It takes a LOT of IO for LCS to keep up, and it’s generally not a great fit for most time series use cases
• LCS will negatively impact your read latencies in any sufficiently busy cluster
Things Nobody Told You About Compaction
• You can change the compaction strategy on a single node using JMX
• The change won’t persist through restarts, but it’s often a great way to test / canary before rolling it out to the full cluster
• You can change other useful things in JMX, too. No need to restart to change:
• Compaction threads
• Compaction throughput
• If you see an IO impact from changing compaction strategies, you can slow-roll it out to the cluster using JMX, then make it permanent with the schema change shown below.
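Once the JMX canary looks good, the permanent cluster-wide change is an ordinary schema change; a sketch, reusing the hypothetical table from earlier (in 3.0+ the per-node JMX equivalent is the table MBean’s CompactionParametersJson attribute):

  ALTER TABLE ks.sensor_data WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '3'
  };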
Things Nobody Told You About Compaction
• Compaction Task Prioritization
Things Nobody Told You About Compaction
• Compaction Task Prioritization
• Just kidding, stuff’s going to run in an order you don’t like.
• There’s nothing you can do about it (yet)
• If you run Cassandra long enough, you’ll eventually OOM or run a box out of disk doing cleanup or bootstrap or validation compactions or similar
• We run watchdog daemons that watch for low disk/RAM conditions and interrupt cleanups/compactions
• Not provided, but it’s a 5-line shell script
• 2.0 -> 2.1 was a huge change
• Cleanup / Scrub used to be single threaded
• Someone thought it was a good idea to make it parallel (CASSANDRA-5547)
• Now cleanup/scrub blocks normal sstable compactions
• If you run parallel operations, be prepared to interrupt and restart them if you run out of disk, RAM, or if your sstable count gets too high (CASSANDRA-11179). Consider using -seq or userDefinedCleanup (JMX)
• CASSANDRA-11218 (priority queue)
Things Nobody Told You About Compaction
• “Fully Expired”
• Cassandra is super conservative
• Find the global minTimestamp across every overlapping sstable, compacting sstable, and memtable
• This is the oldest “live” data
• Build a list of “candidates” that we think are fully expired
• See if the candidates are completely older than that global minTimestamp
• Operators are not as conservative
• CASSANDRA-7019 / Philip Thompson’s talk from yesterday
• When you’re running out of disk space, Cassandra’s definition may seem silly
• Any out-of-order write can “block” a lot of data from being deleted
• Read repair, hints, whatever
• It used to be so hard to figure out; Cassandra now has `sstableexpiredblockers`
Things Nobody Told You About Compaction
• Tombstone compaction sub-properties
• Show of hands if you’ve ever set these on a real cluster
Things Nobody Told You About Compaction
• Tombstone compaction sub-properties
• Cassandra has logic to try to eliminate mostly-expired sstables
• Three basic knobs:
1. What % of the table must be tombstones for it to be worth compacting?
2. How long has it been since that file has been created?
3. Should we try to compact the tombstones away even if we suspect it’s not going to be successful?
Things Nobody Told You About Compaction
• Tombstone compaction sub-properties
• Cassandra has logic to try to eliminate mostly-expired sstables
• Three basic knobs:
1. What % of the table must be tombstones for it to be worth compacting?
• tombstone_threshold (0.2 -> 0.8)
2. How long has it been since that file has been created?
• tombstone_compaction_interval (how much IO do you have?)
3. Should we try to compact the tombstones away even if we suspect it’s not going to be successful?
• unchecked_tombstone_compaction (false -> true)
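All three knobs are ordinary compaction sub-properties; a sketch with illustrative (not recommended) values, table name hypothetical:

  ALTER TABLE ks.events WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1',
    'tombstone_threshold': '0.8',              -- fraction of droppable data before a file is a candidate
    'tombstone_compaction_interval': '86400',  -- minimum sstable age, in seconds
    'unchecked_tombstone_compaction': 'true'   -- try even when overlap checks suggest it won't drop much
  };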
Q&A
Spoilers
TWCS is available in mainline Cassandra in 3.0.8 and newer.
If you’re running 2.0, 2.1, or 2.2, you can build a JAR from source on github.com/jeffjirsa/twcs
You PROBABLY don’t need to do anything special to change from DTCS -> TWCS
Thanks!
CrowdStrike Is Hiring
Talk to me about TWCS on Twitter: @jjirsa
Find me on IRC: jeffj on Freenode (#cassandra)
If you’re running 2.0, 2.1, or 2.2, you can build a JAR from source on https://github.com/jeffjirsa/twcs