How to reduce expenses on monitoring with VictoriaMetrics
Roman Khavronenko | github.com/hagen1778
Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
What this talk is about
1. Best ways to store and process metrics
2. Open source tools only
3. For people familiar with Prometheus, Thanos, Mimir, or VictoriaMetrics
Expenses!
You can either have a faster car…
…or be a smarter driver!
What can you gain from a simple replacement?
Prometheus remote-write benchmark
Prometheus vs VictoriaMetrics benchmark
# the number of nodeexporter instances to scrape
targetsCount: 1000
# how frequently to scrape nodeexporter targets
scrapeInterval: 15s
# rules evaluation interval
# https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1
queryInterval: 30s
# scrapeConfigUpdatePercent is a churn rate generated once
# per scrapeConfigUpdateInterval
scrapeConfigUpdatePercent: 5
scrapeConfigUpdateInterval: 10m
16 times faster!
1.9 times faster!
1.7 times less memory!
2.5 times less!
Summary after 7d benchmark (1k nodeexporter targets)
Prometheus:
  CPU avg used: 0.79 / 3 cores
  Disk occupied: 83.5 GiB
  Mem max used: 8.12 GiB / 12 GiB
  Read latency avg:
    50th - 70.5ms
    99th - 7s
VictoriaMetrics:
  CPU avg used: 0.76 / 3 cores
  Disk occupied: 33 GiB
  Mem max used: 4.5 GiB / 12 GiB
  Read latency avg:
    50th - 4.3ms
    99th - 3.6s
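The savings implied by the summary can be checked with a little arithmetic. The sketch below copies the disk, memory, and read-latency figures from the summary and divides the Prometheus value by the VictoriaMetrics value; the dictionary keys and the `ratios` helper are names made up for this illustration. The memory ratio computed from the "max used" figures comes out a little different from the 1.7 callout, which may be based on another statistic.

```python
# Figures copied from the 7-day benchmark summary (latencies converted to ms).
prometheus = {
    "disk_gib": 83.5,
    "mem_max_gib": 8.12,
    "latency_p50_ms": 70.5,
    "latency_p99_ms": 7000.0,
}
victoriametrics = {
    "disk_gib": 33.0,
    "mem_max_gib": 4.5,
    "latency_p50_ms": 4.3,
    "latency_p99_ms": 3600.0,
}


def ratios(baseline, candidate):
    """Return baseline / candidate for every metric present in both dicts."""
    return {key: baseline[key] / candidate[key] for key in baseline if key in candidate}


if __name__ == "__main__":
    for metric, ratio in ratios(prometheus, victoriametrics).items():
        print(f"{metric}: {ratio:.1f}x")
```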
Network data transfer costs
4.5 times less!
Improving network compression
1. Increase compression level, trade CPU for network savings:
a. -remoteWrite.vmProtoCompressLevel
2. Increase batch size, trade latency for compression:
a. -remoteWrite.maxBlockSize
b. -remoteWrite.maxRowsPerBlock
c. -remoteWrite.flushInterval
3. Reduce entropy to improve compression:
a. -remoteWrite.significantFigures
b. -remoteWrite.roundDigits
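The batch-size trade-off in item 2 can be illustrated with a rough sketch: the same rows are compressed in blocks of different sizes and the total compressed bytes are compared. This is not how vmagent encodes data; `zlib` from the standard library stands in for the real codec, and the metric name, labels, and helper functions are invented for the example.

```python
import zlib


def make_rows(n):
    """Build n fake text-encoded samples; metric and label names are made up."""
    return [
        f'node_cpu_seconds_total{{instance="host-{i % 50}",mode="idle"}} {1000 + i}\n'.encode()
        for i in range(n)
    ]


def compressed_size(rows, batch_size, level=6):
    """Total bytes sent when rows are compressed in blocks of batch_size rows."""
    total = 0
    for start in range(0, len(rows), batch_size):
        block = b"".join(rows[start:start + batch_size])
        total += len(zlib.compress(block, level))
    return total


if __name__ == "__main__":
    rows = make_rows(10_000)
    for batch_size in (10, 100, 1000, 10_000):
        print(batch_size, compressed_size(rows, batch_size))
```

Larger blocks give the compressor more repeated label text to work with, at the cost of holding samples longer before they are sent.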
How to be smarter about data
Keeping only significant figures
instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781
instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236
rules:
  - record: instance:cpu_utilization:ratio_avg
    expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])
Keeping only significant figures
Applying --vm-significant-figures=8 to recording rules
0.05055757575781 becomes 0.050557576
This reduced storage from 1.2 bytes to 0.8 bytes per sample
See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus
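A minimal sketch of the idea, assuming plain decimal rounding: `round_significant` keeps a fixed number of significant figures, and `compressed_bytes_per_sample` compares how well text-encoded values compress before and after rounding. Both helpers are hypothetical, `zlib` is a stand-in codec, and the random input data is invented, so the byte counts will not match the 1.2 and 0.8 figures from the slide.

```python
import random
import zlib


def round_significant(value, figures):
    """Round value to the given number of significant decimal figures."""
    if value == 0:
        return 0.0
    # Scientific notation keeps exactly `figures` digits in the mantissa.
    return float(f"{value:.{figures - 1}e}")


def compressed_bytes_per_sample(values):
    """Compress text-encoded values with zlib and return bytes per sample."""
    payload = "\n".join(repr(v) for v in values).encode()
    return len(zlib.compress(payload, 9)) / len(values)


if __name__ == "__main__":
    print(round_significant(0.05055757575781, 8))
    rng = random.Random(42)  # fixed seed so the run is repeatable
    raw = [0.05 + rng.random() * 0.001 for _ in range(5000)]
    rounded = [round_significant(v, 4) for v in raw]
    print(compressed_bytes_per_sample(raw), compressed_bytes_per_sample(rounded))
```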
Understanding the data - query tracing
VictoriaMetrics supports query tracing for detecting bottlenecks during query processing.
It is similar to EXPLAIN ANALYZE in PostgreSQL!
https://play.victoriametrics.com
Query tracing demo!
If query tracing demo didn't work…
Typical query takes 4s to execute… Why?
Let's check the trace!
91% of the time was spent in vmselect, aggregating
9.4k series with 13 million data samples!
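Reading a trace by hand gets tedious for large queries, so it can help to rank the steps by the time they spent themselves. The sketch below assumes a trace is a nested JSON object whose nodes carry `duration_msec`, `message`, and `children` fields; that shape is an assumption here, and the sample trace, its messages, and its timings are invented for illustration.

```python
def self_time(node):
    """Time spent in a node itself, excluding time attributed to its children."""
    children = node.get("children") or []
    return node["duration_msec"] - sum(child["duration_msec"] for child in children)


def slowest_steps(trace, top=3):
    """Walk the trace tree and return (self_time, message) for the slowest steps."""
    steps = []
    stack = [trace]
    while stack:
        node = stack.pop()
        steps.append((self_time(node), node["message"]))
        stack.extend(node.get("children") or [])
    return sorted(steps, reverse=True)[:top]


# Hypothetical trace: field names are assumed, messages and timings are made up.
SAMPLE_TRACE = {
    "duration_msec": 4000.0,
    "message": "query",
    "children": [
        {"duration_msec": 300.0, "message": "fetch series", "children": []},
        {"duration_msec": 3640.0, "message": "aggregate samples", "children": []},
    ],
}

if __name__ == "__main__":
    for duration, message in slowest_steps(SAMPLE_TRACE):
        share = 100 * duration / SAMPLE_TRACE["duration_msec"]
        print(f"{message}: {duration:.0f} ms ({share:.0f}%)")
```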
How to improve query speed?
1. Add more resources to monitoring.
2. Or… be smarter about data!
Cardinality explorer demo!
https://play.victoriametrics.com
If cardinality explorer demo didn't work…
Cardinality explorer: summary
VictoriaMetrics lets you explore time series cardinality to identify:
● Metric names with the highest number of series
● Labels with the highest number of series
● Values with the highest number of series for the selected label
● label=value pairs with the highest number of series
● Labels with the highest number of unique values
➔ Built into VictoriaMetrics components
➔ Supports specifying a Prometheus URL
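The statistics listed above are simple counts over label sets. As a rough sketch of what such a report computes (not the explorer's actual implementation), the hypothetical `cardinality_report` function below takes a list of series, each represented as a dict of labels with the metric name under `__name__`, and returns the top entries for four of the views; the sample series are invented.

```python
from collections import Counter, defaultdict


def cardinality_report(series, top=3):
    """Count series per metric name, per label, and per label=value pair."""
    metric_names = Counter(labels.get("__name__", "") for labels in series)
    labels_by_series = Counter()
    label_value_pairs = Counter()
    unique_values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            if name == "__name__":
                continue
            labels_by_series[name] += 1
            label_value_pairs[f"{name}={value}"] += 1
            unique_values[name].add(value)
    by_unique = sorted(
        ((name, len(values)) for name, values in unique_values.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return {
        "metric_names": metric_names.most_common(top),
        "labels_by_series": labels_by_series.most_common(top),
        "label_value_pairs": label_value_pairs.most_common(top),
        "labels_by_unique_values": by_unique[:top],
    }


# Made-up label sets for illustration only.
SAMPLE_SERIES = [
    {"__name__": "http_requests_total", "path": "/a", "instance": "h1"},
    {"__name__": "http_requests_total", "path": "/b", "instance": "h1"},
    {"__name__": "http_requests_total", "path": "/c", "instance": "h2"},
    {"__name__": "up", "instance": "h1"},
]

if __name__ == "__main__":
    for view, rows in cardinality_report(SAMPLE_SERIES).items():
        print(view, rows)
```

A label with many unique values, such as `path` in the sample, is usually the first candidate to drop or aggregate away.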
Streaming aggregation vs Recording rules
With recording rules, the number of time series stored in the TSDB
is data-in plus recording rule results.
With streaming aggregation, the number of time series stored in the TSDB
is only what needs to be persisted.
How to use streaming aggregation
- match: "grpc_server_handled_total"  # time series selector
  interval: "2m"                      # on 2m interval
  outputs: ["total"]                  # aggregate as counter
  without: ["grpc_method"]            # group without label
Result:
grpc_server_handled_total:2m_without_grpc_method_total
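To make the rule above more concrete, here is a simplified sketch of its two effects: how the output series name is assembled, and how dropping a label merges series. The `output_name` pattern is inferred from the result shown above (the `_by_` branch is an assumption by analogy), and `aggregate_without` only sums per-series increases for a single interval; real counter aggregation also has to handle resets and state across intervals, which this sketch ignores. The sample label values are invented.

```python
from collections import defaultdict


def output_name(metric, interval, output, by=(), without=()):
    """Build the aggregated series name from the rule fields."""
    name = f"{metric}:{interval}"
    if by:
        name += "_by_" + "_".join(by)
    if without:
        name += "_without_" + "_".join(without)
    return f"{name}_{output}"


def aggregate_without(increases, without):
    """Sum counter increases for one interval after dropping the given labels.

    increases is a list of (labels_dict, increase) pairs; series that differ
    only in the dropped labels collapse into a single output series.
    """
    totals = defaultdict(float)
    for labels, increase in increases:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        totals[key] += increase
    return dict(totals)


if __name__ == "__main__":
    print(output_name("grpc_server_handled_total", "2m", "total", without=["grpc_method"]))
    sample = [
        ({"grpc_method": "Get", "instance": "a"}, 5.0),
        ({"grpc_method": "Put", "instance": "a"}, 3.0),
        ({"grpc_method": "Get", "instance": "b"}, 2.0),
    ]
    print(aggregate_without(sample, {"grpc_method"}))
```

Three input series become two stored series here, which is the saving the rule is after.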
How to use streaming aggregation
https://play.victoriametrics.com
Streaming aggregation: summary
1. Aggregates incoming samples in streaming mode before data is written to remote storage
2. Applies to all metrics received via any supported data ingestion protocol and/or scraped from Prometheus-compatible targets
3. StatsD alternative
4. Recording rules alternative
5. Reduces the number of stored samples
6. Reduces the number of stored series
7. Compatible with tools supporting the Prometheus remote write protocol
Complexity penalty
Cortex architecture
Mimir architecture
VictoriaMetrics architecture
Complexity penalty
● Complex systems are harder to maintain
● Complex systems are harder to teach and learn
● Complex systems are more expensive to scale
Additional materials
1. Snapshot of Grafana dashboard from the benchmark
2. Benchmark repo for reproducing the test
3. Save network costs with VictoriaMetrics remote write protocol
4. VictoriaMetrics: achieving better compression than Gorilla for time series data
5. Streaming aggregation
6. VictoriaMetrics playground
Questions?
● https://github.com/VictoriaMetrics
● https://github.com/hagen1778