SignalFx ingests, processes, runs analytics against, and ultimately stores massive numbers of time series streaming in parallel into our service, which provides an analytics-based monitoring platform for modern applications.
We chose to build our time series database (TSDB) on Cassandra for its read and write performance at high load. This presentation covers the evolution of optimizations we've applied to squeeze the most performance out of the TSDB to date, and some steps we'll be taking in the future.
SignalFx: Making Cassandra Perform as a Time Series Database
1. Making Cassandra perform as a time series database
Paul Ingram
psi@signalfx.com
2. Introduction
• real time streaming analytics for monitoring and alerting
• ingest many billions of points of timeseries data per day
• ingest at 1 second resolution
• all of this data ends up in cassandra
#CassandraSummit
3. What we’re talking about
• a metric is an abstract quantity such as CPU load or heap size
• a source is some entity which measures and reports metrics
• a datapoint is a value for a metric from a source at some time
• a timeseries is a sequence of those datapoints over time
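The vocabulary above could be modeled with a minimal sketch like the following (field names are illustrative, not SignalFx's actual schema):

```python
from collections import defaultdict, namedtuple

# A datapoint: a value for a metric, from a source, at some time.
DataPoint = namedtuple("DataPoint", ["source", "metric", "timestamp", "value"])

def group_into_timeseries(points):
    """Group datapoints by (source, metric); each group, ordered by
    timestamp, is one timeseries."""
    series = defaultdict(list)
    for p in points:
        series[(p.source, p.metric)].append(p)
    for key in series:
        series[key].sort(key=lambda p: p.timestamp)
    return dict(series)
```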
9. buffered writes rationale (version 1)
• writing every datapoint individually is very expensive
• buffer data in memory
• write many points in a batch statement
• buffers are dropped when they have been written to cassandra
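The buffering scheme above could look roughly like this sketch, where `batch_write` stands in for executing a Cassandra batch statement and the flush threshold is an illustrative choice:

```python
class BufferedWriter:
    """Buffer datapoints in memory; write many points per batch.

    `batch_write` is a stand-in for executing a Cassandra batch
    statement; the threshold here is purely illustrative.
    """
    def __init__(self, batch_write, flush_threshold=100):
        self.batch_write = batch_write
        self.flush_threshold = flush_threshold
        self.buffer = []

    def add(self, point):
        self.buffer.append(point)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.batch_write(self.buffer)  # one batch, many points
            self.buffer = []               # drop the buffer once written
```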
13. packed writes rationale (version 2)
• writing data point-by-point means a column for each datapoint
• pack a buffer of datapoints into a block and write the block
• this will reduce the number of columns and write operations
• will have more impact on storage than on performance
• schema and overall flow remain the same
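Packing might be sketched as below: a buffer of (timestamp, value) pairs becomes one binary blob, so a whole buffer is a single column write instead of one column per point. The 8-byte big-endian int plus 8-byte double layout is an assumption for illustration:

```python
import struct

# One packed point: 8-byte big-endian timestamp + 8-byte double value.
POINT = struct.Struct(">qd")

def pack_block(points):
    """points: iterable of (timestamp, value) pairs -> one bytes blob."""
    return b"".join(POINT.pack(ts, v) for ts, v in points)

def unpack_block(blob):
    """Inverse of pack_block: recover the (timestamp, value) pairs."""
    return [POINT.unpack_from(blob, off)
            for off in range(0, len(blob), POINT.size)]
```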
16. redo-log rationale (version 3)
• if the ingest server dies, we lose the buffered data
• fix this with more cassandra
• write a persistent log of data as it’s written to the memory-tier
• when an ingest server restarts it will reload its memory-tier from this log
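The redo-log idea reduces to an append-only durable log that is replayed on restart; a minimal sketch, assuming a JSON-lines file as the log format (SignalFx stores its log in Cassandra, not a local file):

```python
import json
import os

class RedoLog:
    """Append-only redo log: every datapoint accepted into the memory
    tier is also appended here, so a restarted ingest server can replay
    the log to rebuild its buffers."""
    def __init__(self, path):
        self.path = path

    def append(self, point):
        with open(self.path, "a") as f:
            f.write(json.dumps(point) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the appended record durable

    def replay(self):
        """Re-read every logged datapoint, in write order."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]
```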
21. what we found
• matching the workload to the database is very important
• load is much more dependent on rate of writes than on volume of data written
• for our very write-heavy workload we saw 4x performance improvement by doing fewer, larger writes
• it turns out to be cheaper to write data twice efficiently than once naively