4. INTRODUCTION
Thanos is a set of components that can be composed into a highly available metric system with
unlimited storage capacity, which can be added seamlessly on top of existing Prometheus
deployments.
The current release, 0.5.0, is designed to move old metrics (those that have reached the retention period on Prometheus nodes) to S3-like object storage for long-term keeping.
Collected metrics can be reviewed via Grafana; the Prometheus query dashboard shows only the data still stored on the Prometheus instances themselves.
VictoriaMetrics is a fast, cost-effective, and scalable time-series database that can be used as long-term remote storage for Prometheus. It uses its own data compression, which allows storing more data in the same disk space.
Cortex provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
5. [Diagram: monitored services 1…N are scraped by a Prometheus node with local storage; data that reaches the retention period is stored into long-term storage. Grafana uses both Prometheus and the long-term storage as data sources (DataSource 1 and 2); Prometheus sends alerts to Alertmanager.]
6. Why do we need Long-Term storage:
To store historical data about your workloads
To review incidents
To plan scaling based on seasonal load
To find bottlenecks in the infrastructure during continuous runs under load
What solutions can be used for storing long-term historical time series:
Cortex, InfluxDB, Kafka, Graphite, …, Thanos, VictoriaMetrics *
* https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
LONG-TERM STORAGE OVERVIEW
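All of the remote-endpoint solutions above receive data through Prometheus's remote_write protocol. A minimal, hypothetical configuration sketch (the endpoint URL and queue settings below are illustrative placeholders, not values from the deck; VictoriaMetrics, for example, accepts remote writes at /api/v1/write):

```yaml
# prometheus.yml -- forward samples to a long-term storage backend.
# The URL is a placeholder; each backend documents its own write path.
remote_write:
  - url: "http://long-term-storage.example:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 10000   # batch size per outgoing request
```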
7. THANOS ARCHITECTURE
[Diagram: a Prometheus pod (Prometheus, Prometheus config reloader, ConfigMap reloader, Thanos sidecar), a Thanos query pod (Thanos query), a Thanos compact pod (Thanos compact), and a Thanos store gateway pod (Thanos store gateway).]
9. [Diagram: Grafana or the Thanos UI sends queries to the Thanos query pod, which fans out to Prometheus 1, Prometheus 2, and the Thanos store gateway pod backed by the object storage bucket.]
10. ADVANTAGES AND DISADVANTAGES
Advantages:
- Infinite retention without reconfiguring storage
- Collected data stays available even if the infrastructure is recreated (the data lives in the bucket)
- Global query view over data collected from multiple Prometheus instances and the bucket
- Horizontal scalability
- Metrics compaction
- Full monitoring stack
Disadvantages:
- Complicated infrastructure
11. HOW IT WAS TESTED
[Diagram: 500 nodes (NODE_0 … NODE_499), each reporting 1000 metrics (METRIC_0 … METRIC_999) every 15 seconds.]
12. 24 Hours
[Screenshot: scroll bar over the 500 reporters]
500 nodes × 1000 metrics, 4 times per minute, for 24 hours = 2 880 000 000 points
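The data-point arithmetic on this slide can be checked directly (a quick sketch using the figures from the test setup):

```python
nodes = 500                 # NODE_0 … NODE_499
metrics_per_node = 1000     # METRIC_0 … METRIC_999
reports_per_minute = 4      # one report every 15 seconds
minutes = 24 * 60           # 24 hours

points = nodes * metrics_per_node * reports_per_minute * minutes
print(points)  # 2880000000 -- the 2 880 000 000 points above
```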
17. MEMORY USAGE STABILIZATION ON CLUSTER NODES
[Charts: scrape duration and GKE cluster details; note the memory allocation pattern.]
The scrape interval is 30 seconds, covering two 15-second reporting intervals, and Prometheus needs about 4.37 s to scrape 1 000 000 metrics.
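From these figures one can estimate ingestion throughput; a back-of-the-envelope sketch using the slide's numbers:

```python
metrics = 1_000_000         # metrics per scrape cycle
scrape_duration_s = 4.37    # measured scrape time from the slide
scrape_interval_s = 30.0

throughput = metrics / scrape_duration_s      # samples ingested per second
busy_fraction = scrape_duration_s / scrape_interval_s

print(f"{throughput:,.0f} samples/s, scraping {busy_fraction:.0%} of the time")
```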
21. ADVANTAGES AND DISADVANTAGES
Advantages:
- Infinite retention (with reconfiguring storage)
- Global query view over data collected from storage
- Horizontal scalability
- Metrics compaction (multiple times better; floating-point values are converted to integers)
- Simple infrastructure
Disadvantages:
- No integration with Alertmanager
- Cloud storages are not supported yet
  https://github.com/VictoriaMetrics/VictoriaMetrics/issues/129
- More load on hosts
23. 24 Hours
[Screenshot: scroll bar over the 500 reporters]
500 nodes × 1000 metrics, 4 times per minute, for 24 hours = 2 880 000 000 points
26. THANOS vs VICTORIAMETRICS: PRICE COMPARISON

Thanos:
- 12-15 GiB of metrics per day (2.88 billion points)
- 16 GiB memory used on nodes
- 2.1-2.4 CPU cores used on nodes
- Storage price (Cloud Storage*): 15 GiB × 365 days = 5475 ≈ 5500 GiB
  Storage total: $126.50 per month; ~$1500 per year
  * Based on retention, data can be moved to a coldline storage class

VictoriaMetrics:
- 2.8-3 GiB of metrics per day on each storage node (2.88 billion points)
- 16 GiB memory used on nodes
- 2.8-4 CPU cores used on nodes
- Storage price (Persistent Disk Standard): 3 GiB × 365 days = 1095 ≈ 1100 GiB
  $52.80 per month × number of storage nodes
  Storage total: 52.80 × 3 = $158.40 per month
  * https://github.com/VictoriaMetrics/VictoriaMetrics/issues/134
  If one of the storage nodes is lost, part of the data becomes unavailable
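The totals above imply per-GiB monthly rates of roughly $0.023 (Cloud Storage) and $0.048 (Persistent Disk Standard); the rates are inferred from the slide's totals, not quoted prices. A sketch reproducing the arithmetic:

```python
def monthly_storage_cost(gib_per_day, price_per_gib_month, replicas=1):
    """Cost of one year of accumulated metrics, billed monthly (sketch;
    per-GiB rates are back-derived from the slide, not a price list)."""
    yearly_gib = gib_per_day * 365
    return yearly_gib * price_per_gib_month * replicas

thanos = monthly_storage_cost(15, 0.023)               # one bucket
victoria = monthly_storage_cost(3, 0.048, replicas=3)  # three storage nodes

print(f"Thanos: ${thanos:.2f}/month, VictoriaMetrics: ${victoria:.2f}/month")
```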
27. Thanos vs VictoriaMetrics

Instance price: 3 × n1-standard-4 (4 vCPU, 15 GB memory), $97.49 monthly estimate; 3 × $97.49 = $292
Standard Provisioned Space: 1,500 GB - $60

                                  Thanos          VictoriaMetrics
CPU usage                         50%             65%
Memory usage                      16 GB           16 GB
Metrics per day                   15 GB           9 GB
Metrics per minute                2 000 000       2 000 000
Metrics per one day               2 880 000 000   2 880 000 000
Scrape interval (1M metrics)      4.373 s         4.553 s
Historical data access            303-525 ms      179-492 ms
(500 time series)
31. To produce downsampled data, the Compactor continuously aggregates series down to five-minute and one-hour resolutions. For each raw chunk, encoded with TSDB's XOR compression, it stores several types of aggregations (e.g. min, max, or sum) in a single block. This allows the Querier to automatically choose the aggregate that is appropriate for a given PromQL query.
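The aggregation step above can be sketched as follows; this is an illustrative Python rendering of the idea, not Thanos's actual Go implementation:

```python
from collections import defaultdict

def downsample(samples, resolution_s=300):
    """Aggregate raw (timestamp, value) samples into fixed-size windows,
    keeping min/max/sum/count per window, so a query engine can later
    pick whichever aggregate matches the PromQL function being run."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % resolution_s].append(value)
    return {
        window: {"min": min(vs), "max": max(vs), "sum": sum(vs), "count": len(vs)}
        for window, vs in sorted(buckets.items())
    }

raw = [(0, 1.0), (15, 3.0), (30, 2.0), (300, 5.0)]
print(downsample(raw))
```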
34. VM Gorilla compression analysis
The only problem is that the result may exceed 64 bits, the default integer size in modern computers. How to deal with it? Normalize the integer by dividing by 10^M, where M is the minimum value that lets all the time-series values fit into 64 bits, removing common trailing decimal zeros.
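The normalization idea can be sketched like this; a hypothetical Python illustration (the function name is made up and this is not VictoriaMetrics' actual code):

```python
from decimal import Decimal

def normalize(values):
    """Represent decimal values as 64-bit integers plus one shared
    power-of-ten exponent, so that value == int_value * 10**exponent."""
    e = 0
    # Scale up until every value becomes an integer.
    while any(Decimal(str(v)).scaleb(-e) % 1 != 0 for v in values):
        e -= 1
    ints = [int(Decimal(str(v)).scaleb(-e)) for v in values]
    # Strip common trailing decimal zeros: divide all ints by 10**M.
    while ints and any(ints) and all(i % 10 == 0 for i in ints):
        ints = [i // 10 for i in ints]
        e += 1
    assert all(-2**63 <= i < 2**63 for i in ints), "does not fit into int64"
    return ints, e

print(normalize([1.23, 4.5, 6.0]))   # ([123, 450, 600], -2)
print(normalize([1000.0, 2000.0]))   # ([1, 2], 3)
```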