OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created: one that can store and serve billions of data points forever without destructive downsampling, one that can scale to millions of metrics, and one where plotting real-time graphs is easy and fast. In this presentation we review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous, high-performance, thread-safe client for HBase. Specific topics include the schema, how it impacts performance, and how it allows concurrent writes without coordination across a distributed cluster of OpenTSDB instances.
1. Lessons Learned from OpenTSDB
Or why OpenTSDB is the way it is, and how it changed iteratively to correct some of the mistakes made
Benoît “tsuna” Sigoure
tsuna@stumbleupon.com
2. Key concepts
• Data Points
(time, value)
• Metrics
proc.loadavg.1m
• Tags
host=web42 pool=static
• Metric + Tags = Time Series
• Order of magnitude: >10⁶ time series, >10¹² data points
put proc.loadavg.1m 1234567890 0.42 host=web42 pool=static
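As an illustration of these concepts, here is a small sketch (not OpenTSDB code; the function name is mine) that splits a telnet-style `put` line into the time series identity and the data point. Sorting the tags makes the series identity canonical, so the same metric + tags always name the same time series:

```python
def parse_put(line):
    """Parse `put <metric> <timestamp> <value> <tag=val> ...`."""
    parts = line.split()
    assert parts[0] == "put"
    metric, ts, value = parts[1], int(parts[2]), float(parts[3])
    tags = dict(t.split("=", 1) for t in parts[4:])
    # Metric + tags uniquely identify a time series; sort tags so the
    # identity does not depend on the order they were sent in.
    series = (metric, tuple(sorted(tags.items())))
    return series, (ts, value)

series, point = parse_put(
    "put proc.loadavg.1m 1234567890 0.42 host=web42 pool=static")
print(series)  # ('proc.loadavg.1m', (('host', 'web42'), ('pool', 'static')))
print(point)   # (1234567890, 0.42)
```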
3. OpenTSDB @ StumbleUpon
• Main production monitoring system for ~2 years
• Storing hundreds of billions of data points
• Adding over 1 billion data points per day
• 13000 data points/s → 130 QPS on HBase
• If you had a 5-node cluster, this load would hardly make it sweat
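A quick back-of-the-envelope check of these figures. Note that the ~100 data points per HBase call is inferred from the 13000:130 ratio; the batch size is not stated on the slide:

```python
points_per_sec = 13_000   # incoming data points per second
hbase_qps = 130           # resulting calls per second on HBase

# Implied batching factor: ~100 data points per HBase RPC.
print(points_per_sec / hbase_qps)   # 100.0

# Daily volume, consistent with "over 1 billion data points per day".
print(points_per_sec * 86_400)      # 1123200000
```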
4. Do’s
• Wider rows to seek faster
before: ~4KB/row, after: ~20KB
• Make writes idempotent and independent
before: start rows at arbitrary points in time
after: align rows on 10m (then 1h) boundaries
• Store more data per KeyValue
Remember you pay for the key along each value
in a row, so large keys are really expensive
5. Don’ts
• Use HTable / HTablePool in app servers
asynchbase + Netty or Finagle = performance++
• Put variable-length fields in composite keys
They’re hard to scan
• Exceed a few hundred regions per RegionServer
“Oversharding” introduces overhead and makes
recovering from failures more expensive
7. How OpenTSDB came to be the way it is
Questions:
• How to store time series data efficiently in HBase?
• How to enable concurrent writes without
synchronization between the writers?
• How to save space/memory when storing
hundreds of billions of data items in HBase?
8. Time Series Data in HBase, Take 1

   Key            Column (“don’t care”)
   1234567890     1
   1234567892     2
   1234567894     3

The row keys are timestamps; the cell values are the data points.
Simplest design: only 1 time series, 1 row with a single KeyValue per data point.
Supports time-range scans.
9. Time Series Data in HBase, Take 2

   Key                Column
   foo  1234567890    1
   foo  1234567892    3
   fool 1234567890    2

Metric name first in the row key for data locality.
Problem: can’t store the metric as text in the row key due to space concerns.
10. Time Series Data in HBase, Take 3

   Key                Column       Separate lookup table:
   0x1 1234567890     1              Key    Value
   0x1 1234567892     3              0x1    foo
   0x2 1234567890     2              0x2    fool
                                     foo    0x1
                                     fool   0x2

Use a separate table to assign unique IDs to metric names (and tags, not shown here).
IDs give us a predictable length and achieve the desired data locality.
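The idea behind the lookup table can be sketched in a few lines. This is an illustrative, in-memory stand-in: the class name and sequential-ID scheme are mine, not OpenTSDB’s code, and the real UID table lives in HBase with both forward and reverse mappings:

```python
UID_WIDTH = 3  # bytes per ID; fixed width makes row key sizes predictable

class UidTable:
    """In-memory sketch of a bidirectional name <-> ID mapping."""
    def __init__(self):
        self.name_to_id = {}
        self.id_to_name = {}

    def get_or_create(self, name):
        uid = self.name_to_id.get(name)
        if uid is None:
            # Assign the next sequential ID, encoded on a fixed number of bytes.
            uid = (len(self.name_to_id) + 1).to_bytes(UID_WIDTH, "big")
            self.name_to_id[name] = uid
            self.id_to_name[uid] = name
        return uid

uids = UidTable()
print(uids.get_or_create("foo").hex())   # 000001
print(uids.get_or_create("fool").hex())  # 000002
print(uids.get_or_create("foo").hex())   # 000001 (stable on repeat lookups)
```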
11. Time Series Data in HBase, Take 4

   Key                +0   +2
   0x1 1234567890     1    3
   0x1 1234567892     3
   0x2 1234567890     2

Reduce the number of rows by storing multiple consecutive data points in the same row.
Fewer rows = faster to seek to a specific row.
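A hypothetical sketch of how a Take 4 cell could be built: the row key is the metric ID plus a timestamp, and the column qualifier is the small offset from that timestamp. The helper name and exact byte layout here are illustrative, not OpenTSDB’s actual encoding:

```python
import struct

def make_cell(metric_uid, base_ts, point_ts, value):
    """Build (row key, qualifier, value) for one data point."""
    # Row key: fixed-width metric UID followed by a 4-byte base timestamp.
    row_key = metric_uid + struct.pack(">I", base_ts)
    # Qualifier: 2-byte offset of this point from the row's base timestamp.
    qualifier = struct.pack(">H", point_ts - base_ts)
    return row_key, qualifier, value

row, qual, val = make_cell(b"\x00\x00\x01", 1234567890, 1234567892, 3)
print(row.hex())                     # 000001499602d2
print(struct.unpack(">H", qual)[0])  # 2
```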
12. Time Series Data in HBase, Take 4

Misleading table representation:

   Key                +0   +2
   0x1 1234567890     1    3
   0x1 1234567892     3
   0x2 1234567890     2

Gotcha #1: wider rows don’t save any space*

Actual stored table:

   Key                Column   Value
   0x1 1234567890     +0       1
   0x1 1234567890     +2       3
   0x2 1234567890     +0       2

* Until magic prefix compression happens in upcoming HBase 0.94
13. Time Series Data in HBase, Take 4

   Key                +0   +2
   0x1 1234567890     1    3
   0x1 1234567892     3
   0x2 1234567890     2

The devil is in the details: when to start new rows?
Naive answer: start on the first data point; after some time, start a new row.
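The naive policy can be sketched as a tiny simulation (class and method names are hypothetical). Each TSD remembers, per series, where its current row started — state that is local to that one TSD, which is exactly what goes wrong once several TSDs handle the same series:

```python
class NaiveTsd:
    """Sketch of the naive row-start policy: stateful per TSD instance."""
    ROW_SPAN = 600  # seconds before an existing row is considered full

    def __init__(self):
        self.row_start = {}  # series -> base timestamp of its current row

    def base_for(self, series, ts):
        start = self.row_start.get(series)
        if start is None or ts - start >= self.ROW_SPAN:
            start = ts  # arbitrary: the row begins at this data point
            self.row_start[series] = start
        return start

tsd1, tsd2 = NaiveTsd(), NaiveTsd()
print(tsd1.base_for("foo", 1234567890))  # 1234567890
# Client fails over to tsd2 and retransmits the next point; tsd2 has no
# shared state, so it opens a second, overlapping row for the series:
print(tsd2.base_for("foo", 1234567892))  # 1234567892
```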
14. Time Series Data in HBase, Take 4

   Key                +0
   0x1 1000000000     1

Client sends “foo 1000000000 1” to TSD1.
First data point: start a new row.
15. Time Series Data in HBase, Take 4

   Key                +0   +10   ...
   0x1 1000000000     1    2     ...

Client sends “foo 1000000010 2” to TSD1.
Keep adding points until...
16. Time Series Data in HBase, Take 4

   Key                +0   +10   ...   +599
   0x1 1000000000     1    2     ...   42

Client sends “foo 1000000599 42” to TSD1.
... some arbitrary limit, say 10 min.
17. Time Series Data in HBase, Take 4

   Key                +0   +10   ...   +599
   0x1 1000000000     1    2     ...   42
   0x1 1000000600     51

Client sends “foo 1000000610 51” to TSD1.
Then start a new row.
18. Time Series Data in HBase, Take 4

   Key                +0
   0x1 1234567890     1

But this scheme fails with multiple TSDs.
Client sends “foo 1234567890 1” to TSD1, which creates a new row.
19. Time Series Data in HBase, Take 4

   Key                +0   +2
   0x1 1234567890     1    3

Client sends “foo 1234567892 3” to TSD1, which adds to the row.
20. Time Series Data in HBase, Take 4

   Key                +0   +2
   0x1 1234567890     1    3
   0x1 1234567892     3      ← Oops!

Maybe a connection failure occurred and the client is retransmitting data to another TSD: “foo 1234567892 3” reaches TSD2, which creates a new row, while TSD1 would have added to the existing one.
21. Time Series Data in HBase, Take 5

   Key                +90   +92
   0x1 1234567800     1     3
   0x2 1234567800     2

The base timestamp is always a multiple of 600.
In order to scale easily and keep TSDs stateless, make writes independent & idempotent.
New rule: rows are aligned on 10 min. boundaries.
22. Time Series Data in HBase, Take 6

   Key                +1890   +1892
   0x1 1234566000     1       3
   0x2 1234566000     2

The base timestamp is always a multiple of 3600.
1 data point every ~10s => 60 data points / row. Not much.
Go to wider rows to further increase seek speed. One-hour rows = 6x fewer rows.
23. Time Series Data in HBase, Take 6

   Key                +1890   +1892
   0x1 1234566000     1       3
   0x2 1234566000     2

Remember: wider rows don’t save any space!

Actual stored table:

   Key                Column   Value
   0x1 1234566000     +1890    1
   0x1 1234566000     +1892    3
   0x2 1234566000     +1890    2

The key is easily 4x bigger than column + value, and it is repeated for every cell.
24. Time Series Data in HBase, Take 7

   Key                +1890   +1892   +1890,+1892
   0x1 1234566000     1       3       1, 3
   0x2 1234566000     2

Solution: “compact” columns by concatenation.

Actual stored table:

   Key                Column        Value
   0x1 1234566000     +1890         1
   0x1 1234566000     +1890,+1892   1, 3
   0x1 1234566000     +1892         3
   0x2 1234566000     +1890         2

Space savings on disk and in memory are huge: data is 4x-8x smaller!
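The compaction step can be sketched as follows. This is an illustrative string-based version with a hypothetical function name; real OpenTSDB concatenates the binary qualifiers and values of a finished row into a single KeyValue:

```python
def compact(cells):
    """cells: list of (offset, value) for one row, sorted by offset.

    Returns a single (qualifier, value) pair that replaces all the
    individual cells, so the large row key is stored only once.
    """
    qualifier = ",".join(f"+{off}" for off, _ in cells)
    value = ",".join(str(v) for _, v in cells)
    return qualifier, value

print(compact([(1890, 1), (1892, 3)]))  # ('+1890,+1892', '1,3')
```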
25. ¿ Questions ?
opentsdb.net
Fork me on GitHub
Summary
• Use asynchbase
• Use Netty or Finagle
• Wider table > taller table
• Short family names
• Make writes idempotent
• Make writes independent
• Compact your data
• Have predictable key sizes

Think this is cool? We’re hiring.
Benoît “tsuna” Sigoure
tsuna@stumbleupon.com