https://www.youtube.com/playlist?list=PLAiEy9H6ItrKC5PbH7KiELiSEIKv3tuov
-What is Prometheus?
-Difference Between Nagios vs Prometheus
-Architecture
-Alertmanager
-Time series DB
-PromQL (Prometheus Query Language)
-Live Demo
-Grafana
2. Takeaways:
• What is Prometheus?
• Difference Between Nagios vs Prometheus
• PromQL (Prometheus Query Language)
• Time series DB
• Grafana
• Live Demo
3. What is Prometheus?
• Prometheus is an open-source systems monitoring and alerting
toolkit originally built at SoundCloud.
• Inspired by Google’s Borgmon Monitoring System
• Written in Go (also known as Golang), a language syntactically
similar to C. Go is widely used in production at Google and in
many other organizations and open-source projects.
• It is now a standalone open source project and maintained
independently of any company. To emphasize this, and to clarify
the project's governance structure, Prometheus joined the CNCF
in 2016 as the second hosted project, after Kubernetes.
• The core Prometheus server is a single binary, with no
dependencies like Zookeeper, Consul, Cassandra, Hadoop or the
internet. All it needs is local disk, preferably an SSD.
• It is a systems and service monitoring system. It collects metrics
from configured targets at given intervals, evaluates rule
expressions, displays the results, and can trigger alerts if some
condition is observed to be true.
https://appinventiv.com/blog/mini-guide-to-go-programming-language/
4. ABOUT
• The Linux Foundation is the parent organization of the CNCF.
• Open-source cloud computing for applications. Not
to be confused with OpenStack, which is for
infrastructure.
• Netflix pioneered the concept of cloud native as a
practical tool
• Cloud native is a term used to describe container-
based environments. Cloud native technologies are
used to develop applications built with services
packaged in containers, deployed as microservices
and managed on elastic infrastructure through
agile DevOps processes and continuous delivery
workflows.
• August 9, 2018 - CNCF Announces Prometheus
Graduation.
https://www.cncf.io/webinars/what-is-cloud-native-and-why-does-it-exist/
5. Why Prometheus?
Multi-Dimensional Data Model – Ex: instance, service, endpoint, and method.
Operational Simplicity
Scalable data Collection
Powerful query Language.
All of these features existed in various systems.
However, Prometheus combined them all.
6. Nagios – an Overview
• The Industry Standard In IT Infrastructure Monitoring
• First launched in 1999. Nagios is officially sponsored by Nagios Enterprises.
• Nagios Core is a free and open-source computer-software application that monitors systems,
networks and infrastructure. Nagios offers monitoring and alerting services for servers, switches,
applications and services. It alerts users when things go wrong and alerts them a second time when
the problem has been resolved.
• NDOUTILS - The NDOUTILS addon is designed to store all configuration and event data from Nagios
in a database. It requires a MariaDB or MySQL database for storing Nagios Core data.
• RRDtool and Highcharts are included to create customizable graphs that can be displayed in
dashboards.
• (Nagios Core vs Nagios XI) Nagios Core is open source whereas Nagios XI is a commercial,
enterprise version of Nagios.
• Historical performance data used to generate graphs is stored in Round Robin Database
(RRD) files.
• Rrdcached - On a Nagios XI server, rrdcached collects host and service performance data and then
flushes it to the appropriate rrd files at a specified interval. This reduces the amount of disk activity
needed to keep a large number of rrd files current for performance graphs.
7. Nagios vs Prometheus
• Nagios is primarily about alerting based on the exit codes of
scripts.
• Nagios is host-based. Each host can have one or more services
and each service can perform one check.
• There is no notion of labels or a query language.
• Nagios has no storage per se, beyond the current check state.
There are plugins which can store data, for example for
visualisation.
• Nagios XI - Using Grafana With Existing Performance Data:
Grafana uses the existing performance data files (RRD) to
generate the graphs.
• Overall, Nagios is suitable for basic monitoring of small and/or
static systems where blackbox probing is sufficient. If you want
to do whitebox monitoring, or have a dynamic or cloud based
environment, then Prometheus is a good choice.
11. Architecture - Explanation
• Prometheus scrapes metrics from instrumented jobs, either directly or via an
intermediary push gateway for short-lived jobs. It stores all scraped samples
locally and runs rules over this data to either aggregate and record new time
series from existing data or generate alerts.
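• The Prometheus project considers pulling slightly better than pushing, for example because a failed scrape makes it easy to tell that a target is down.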
• For cases where you must push, we offer the Pushgateway as occasionally you
will need to monitor components which cannot be scraped. The Prometheus
Pushgateway allows you to push time series from short-lived service-level batch
jobs to an intermediary job which Prometheus can scrape.
• Limitation: not suited to billing. Statistics collected for monitoring will likely not be
detailed and complete enough for use cases that need 100% accuracy.
• Grafana or other API consumers can be used to visualize the collected data.
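The pull model described above can be pictured with a small sketch. It assumes a target returns plain-text lines of the form `name{labels} value` and parses them into samples, roughly what a scraper does after fetching a target's metrics page. The parser is deliberately simplified (no escaping, no timestamps, no use of HELP/TYPE lines) and the metric names in the payload are hypothetical.

```python
import re

# Simplified pattern for one exposition line: name, optional {labels}, value.
# Assumption: no escaped quotes or braces inside label values.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse scraped text into a list of (name, labels, value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, as an instrumented job might expose it.
scraped = """
# HELP my_api_requests_total Requests served (hypothetical metric).
# TYPE my_api_requests_total counter
my_api_requests_total{method="GET",endpoint="/orders"} 25
my_api_requests_total{method="POST",endpoint="/orders"} 7
process_open_fds 12
"""

for name, labels, value in parse_exposition(scraped):
    print(name, labels, value)
```

Each parsed sample would then be stored with the scrape timestamp, which is what makes it a point in a time series.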
12. Alertmanager
• Grouping: Useful during larger outages when many systems fail at once and
hundreds to thousands of alerts may be firing simultaneously
• Inhibition is a concept of suppressing notifications for certain alerts if certain
other alerts are already firing.
• Silences are a straightforward way to simply mute alerts for a given time
• The following external systems are supported:
Email
Generic Webhooks
HipChat
OpsGenie
PagerDuty
Pushover
Slack
• To make Prometheus highly available: Run identical Prometheus servers on two or
more separate machines. Identical alerts will be deduplicated by the Alertmanager.
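The Alertmanager ideas above (deduplication, inhibition, grouping) can be sketched in a few lines. This is a toy model, not Alertmanager's implementation: alerts are plain dictionaries of labels, and the alert names, label names and function names are all hypothetical.

```python
from collections import defaultdict


def dedupe(alerts):
    """Drop alerts whose full label set was already seen (e.g. sent by two identical servers)."""
    seen = set()
    unique = []
    for alert in alerts:
        key = frozenset(alert.items())
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def inhibit(alerts, source_name, target_name, equal):
    """Suppress target alerts while a source alert with the same `equal` labels is firing."""
    sources = [a for a in alerts if a["alertname"] == source_name]
    kept = []
    for alert in alerts:
        muted = alert["alertname"] == target_name and any(
            all(alert.get(name) == src.get(name) for name in equal) for src in sources
        )
        if not muted:
            kept.append(alert)
    return kept


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels: one notification per bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(name, "") for name in group_by)].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "a"},
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "a"},  # duplicate from a second server
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "b"},
    {"alertname": "HighLatency", "cluster": "eu", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "us", "instance": "c"},
]

unique = dedupe(alerts)
kept = inhibit(unique, "InstanceDown", "HighLatency", ["cluster", "instance"])
groups = group_alerts(kept, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
```

In this toy run the latency alert for instance "a" is muted because the same instance is already reported as down, and the two remaining InstanceDown alerts end up in a single group.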
13. Time Series Database (TSDB)
• What is a time series? The value of something tracked over time.
• Each series is identified by its labels (key/value pairs). Identifier -> (t0, v0), (t1, v1), (t2, v2), (t3, v3), .... Each data
point is a tuple of a timestamp and a value. For the purpose of monitoring, the
timestamp is an integer and the value any number.
Example: This could be temperature once a day, or requests to your API once a minute.
The latter could look like:
my_api_requests: 5@1:00PM 2@1:01PM 18@1:02PM
• The data model is fundamentally the same as that of OpenTSDB.
• Prometheus includes a local on-disk time series database, but also optionally
integrates with remote storage systems
• Ingested samples are grouped into blocks of two hours. Each two-hour block
consists of a directory containing one or more chunk files that contain all time
series samples for that window of time, as well as a metadata file and index file
(which indexes metric names and labels to time series in the chunk files). When
series are deleted via the API, deletion records are stored in separate tombstone
files (instead of deleting the data immediately from the chunk files).
• A limitation of the local storage is that it is not clustered or replicated. Hence using
RAID for disk availability, snapshots for backups, capacity planning, etc., is
recommended for improved durability. Alternatively, external storage may be used
via the remote read/write APIs.
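A toy in-memory model can make the data layout above more concrete. It assumes a series is identified by a metric name plus its sorted label pairs, and that samples are bucketed into two-hour windows by timestamp. The class and function names are hypothetical, and nothing here reflects the real on-disk chunk, index or tombstone formats.

```python
from collections import defaultdict

BLOCK_SECONDS = 2 * 60 * 60  # two-hour blocks, as described above


def series_id(name, labels):
    """Identify a series by metric name plus sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


def block_start(timestamp):
    """Return the start of the two-hour window that contains timestamp."""
    return timestamp - (timestamp % BLOCK_SECONDS)


class TinyTSDB:
    """In-memory toy: block start -> series id -> list of (timestamp, value)."""

    def __init__(self):
        self.blocks = defaultdict(lambda: defaultdict(list))

    def append(self, name, labels, timestamp, value):
        block = self.blocks[block_start(timestamp)]
        block[series_id(name, labels)].append((timestamp, value))

    def samples(self, name, labels):
        """Collect one series' samples across all blocks, oldest block first."""
        sid = series_id(name, labels)
        out = []
        for start in sorted(self.blocks):
            out.extend(self.blocks[start].get(sid, []))
        return out


db = TinyTSDB()
labels = {"endpoint": "/orders"}  # hypothetical label set
for ts, value in [(0, 5), (60, 2), (120, 18), (7200, 9)]:
    db.append("my_api_requests", labels, ts, value)

print(sorted(db.blocks))                      # block start times in seconds
print(db.samples("my_api_requests", labels))  # the series across both blocks
```

The last sample falls exactly two hours after the first, so it lands in a second block while still belonging to the same series.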
14. TSDB Configuration:-
• Prometheus has several flags that allow configuring the local storage.
The most important ones are:
--storage.tsdb.path: This determines where Prometheus writes its database. Defaults to data/.
--storage.tsdb.retention.time: This determines when to remove old data. Defaults to 15d.
--storage.tsdb.retention.size: This determines the maximum number of bytes that storage blocks can use. The oldest
data will be removed first. Defaults to 0 or disabled.
--storage.tsdb.wal-compression: This flag enables compression of the write-ahead log (WAL). Depending on your data,
you can expect the WAL size to be halved with little extra CPU load.
• TSDB Storage as follows
16. • Prometheus means Forethinker
• Prometheus is a Titan. A titan is also an
extremely important person: Albert Einstein
was a titan in the world of science.
• A Trickster figure, he was a champion of
mankind known for his wily intelligence,
who stole fire from Zeus and the gods and
gave it to mortals.
• Prometheus is also a 2012 science fiction film,
named after the spaceship in it.
Are You a Titan or just wearing Titan Watch?
17. Let’s Start - Prometheus
• Prerequisite: Configure prometheus.yml (e.g. scrape interval, target servers to be monitored, Alertmanager configuration, etc.)
• The config file is written in YAML format. Prometheus can reload its configuration at runtime. A configuration reload is triggered by sending a
SIGHUP to the Prometheus process or sending an HTTP POST request to the /-/reload endpoint (when the --web.enable-lifecycle flag is
enabled).
• The kill command can send signals to commands and processes. However, commands only respond if they are
programmed to recognize those signals. On Linux, kill -l lists signals numbered up to 64. Particularly useful ones include:
SIGHUP (1) - Hangup detected on controlling terminal or death of controlling process.
SIGKILL (9) - Kill signal, i.e. kill the running process.
SIGSTOP (19) - Stop process.
SIGCONT (18) - Continue process if stopped.
To send a kill signal to PID 1234 use: kill -9 1234
To send a SIGHUP signal to PID 1234 use: kill -1 1234
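The two reload paths described above can be sketched as follows. This assumes a Unix-like system (for SIGHUP) and a Prometheus server reachable at http://localhost:9090, which is only an assumed address. The helper names are hypothetical. Nothing is sent when the block runs: it only builds the request and prints it.

```python
import os
import signal
import urllib.request


def reload_by_signal(pid):
    """Ask a running Prometheus process to reload its config (same as kill -1 <pid>)."""
    os.kill(pid, signal.SIGHUP)


def build_reload_request(base_url):
    """Build the HTTP POST for /-/reload; the server needs --web.enable-lifecycle."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_by_http(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Assumed local server address; only the request object is built here.
request = build_reload_request("http://localhost:9090")
print(request.get_method(), request.full_url)
```

Calling `reload_by_http("http://localhost:9090")` or `reload_by_signal(<pid>)` against a real server would then trigger the reload.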
18. Prometheus – Exporter
• Exporters bridge the gap between Prometheus and systems that don't export metrics
in the Prometheus format.
• There are official and externally contributed exporters available, e.g. for MySQL, Oracle DB,
Dell/IBM hardware, Jira, Hadoop storage, Apache HTTP, AWS APIs, Docker, SNMP, etc.
https://prometheus.io/docs/instrumenting/exporters/
• Build your own exporter. Example use cases:
Whether an important cron job succeeded.
Any new error from the TimesTen DB error.log.
Online selling website perspective: total order success vs failure.
Order data metrics for dashboard integration.
Whether an important file was received/processed.
Top-selling product/category.
5-star to 1-star review metric analysis.
etc.
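A home-grown exporter can be very small. The sketch below uses only the Python standard library and assumes two hypothetical metrics (a cron-job success flag and order counts). The state is hard-coded for illustration, where a real exporter would read it from a status file, database or log. The port and all names are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state, hard-coded for illustration.
STATE = {"cron_last_success": 1, "orders_success": 120, "orders_failure": 3}


def render_metrics(state):
    """Render the state in the Prometheus text exposition format."""
    lines = [
        "# HELP cronjob_last_run_success 1 if the last cron run succeeded.",
        "# TYPE cronjob_last_run_success gauge",
        f"cronjob_last_run_success {state['cron_last_success']}",
        "# HELP shop_orders_total Orders processed, by result.",
        "# TYPE shop_orders_total counter",
        f'shop_orders_total{{result="success"}} {state["orders_success"]}',
        f'shop_orders_total{{result="failure"}} {state["orders_failure"]}',
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics and 404 on everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(STATE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Print what a scrape would return; call serve() to actually expose it.
print(render_metrics(STATE))
```

For anything beyond a sketch, the official client libraries handle metric types, escaping and concurrency, so hand-rendering the text format is mainly useful for understanding what an exporter does.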
20. PromQL - Prometheus Query Language
• Prometheus provides a functional query
language.
• It lets the user select and aggregate time series data
in real time. The result of an expression can either
be shown as a graph, viewed as tabular data in
Prometheus's expression browser, or consumed
by external systems via the HTTP API.
• The Prometheus query language allows you to
slice and dice the dimensional data for ad-hoc
exploration, graphing, and alerting.
21. Time Series Selectors
• Instant Vector - A single value per time series, all sharing the same timestamp. In the simplest
form, only a metric name is specified.
• Range Vector - Any number of values per time series between two timestamps. A
range duration is appended in square brackets ([]) at the end of a
vector selector.
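The difference between the two selector types can be sketched on a single series. This assumes samples are (timestamp, value) pairs sorted by time, that an instant selection picks the newest sample within a 5-minute lookback, and that a range selection returns every sample in a window ending at the evaluation time. The exact boundary handling in Prometheus may differ between versions, so treat the window edges here as an assumption.

```python
import bisect

LOOKBACK = 300  # assumed 5-minute lookback for instant selection, in seconds


def instant_value(samples, eval_time, lookback=LOOKBACK):
    """Return the newest sample at or before eval_time, or None if it is too old."""
    timestamps = [t for t, _ in samples]
    idx = bisect.bisect_right(timestamps, eval_time) - 1
    if idx < 0 or eval_time - timestamps[idx] > lookback:
        return None
    return samples[idx]


def range_values(samples, eval_time, window):
    """Return every sample in the window (eval_time - window, eval_time]."""
    return [(t, v) for t, v in samples if eval_time - window < t <= eval_time]


# One series scraped every 60 seconds (timestamps in seconds, values made up).
series = [(0, 5.0), (60, 2.0), (120, 18.0), (180, 7.0)]

print(instant_value(series, 130))      # one sample: (120, 18.0)
print(range_values(series, 130, 120))  # like a [2m] range: [(60, 2.0), (120, 18.0)]
```

Functions such as rate() need several samples per series, which is why they take a range vector and not an instant vector.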
22. Metric types
• Counter: A counter is a cumulative metric that
represents a single monotonically increasing counter
whose value can only increase or be reset to zero on
restart. For example, you can use a counter to represent
the number of requests served, tasks completed, or
errors.
• Gauge: A gauge is a metric that represents a single
numerical value that can arbitrarily go up and down, e.g.
temperatures or current memory usage.
• Histogram: A histogram samples observations (usually
things like request durations or response sizes) and
counts them in configurable buckets.
• Summary: Similar to a histogram, a summary samples
observations (usually things like request durations and
response sizes). It also calculates configurable
quantiles over a sliding time window.
https://povilasv.me/prometheus-tracking-request-duration/
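The behaviour of the first three metric types can be sketched with toy classes. These are not the client library's classes: the names and methods are hypothetical, and a summary is left out because its sliding-window quantiles need more machinery than fits a short example.

```python
import bisect


class Counter:
    """Cumulative value that only goes up (a restart would start it again at zero)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Single value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in buckets and tracks their sum."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.counts = [0] * len(self.bounds)  # per bucket; made cumulative on read
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self):
        """Return (upper_bound, number of observations <= upper_bound) pairs."""
        total, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            total += count
            out.append((bound, total))
        return out


requests = Counter()
requests.inc()
requests.inc(2)

memory = Gauge()
memory.set(512.0)
memory.set(300.0)

durations = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds (made up)
for seconds in [0.05, 0.3, 0.3, 2.0]:
    durations.observe(seconds)

print(requests.value, memory.value, durations.cumulative())
```

The cumulative bucket counts are the shape that histogram_quantile() expects, with one series per upper bound.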
23. Operators
• Binary Comparison Operators:
==, !=, >, <, >=, <=
• Binary Arithmetic Operators:
+, -, *, /, % (modulo), ^ (power/exponentiation)
• Logical/set Binary operators:
and (intersection), or (union), unless (complement)
• Built-in aggregation operators:
sum, min, max, avg, stddev, stdvar, count, count_values, bottomk, topk, quantile
- These operators can either be used to aggregate over all label dimensions or preserve
distinct dimensions using by or without.
https://blog.pvincent.io/2017/12/prometheus-blog-series-part-2-metric-types/
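How by and without shape an aggregation can be sketched on plain label dictionaries. This is a toy model in which samples are (labels, value) pairs: with by, only the listed labels are kept as the group key; with without, the listed labels are dropped; with neither, everything collapses into one group. The sample values and the function name are hypothetical.

```python
from collections import defaultdict


def aggregate(samples, func, by=None, without=None):
    """Aggregate (labels, value) samples, grouping by or without the given labels."""
    groups = defaultdict(list)
    for labels, value in samples:
        if without is not None:
            key = {k: v for k, v in labels.items() if k not in without}
        else:
            key = {k: v for k, v in labels.items() if k in (by or ())}
        groups[tuple(sorted(key.items()))].append(value)
    return {key: func(values) for key, values in groups.items()}


# Made-up filesystem sizes for two instances.
samples = [
    ({"instance": "a", "device": "sda1"}, 100.0),
    ({"instance": "a", "device": "sdb1"}, 50.0),
    ({"instance": "b", "device": "sda1"}, 70.0),
]

print(aggregate(samples, sum))                      # all dimensions collapsed
print(aggregate(samples, max, by=["instance"]))     # keep only instance
print(aggregate(samples, sum, without=["device"]))  # drop only device
```

With only two labels the by and without forms group identically here. They diverge as soon as a series carries labels that neither list mentions.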
24. Basic Functions
• PromQL has 46 functions & growing…
• Most of the mathematical functions &
day, month, year, minute, hour, time are
available.
• In practice, the ones used most are:
rate()
irate() - irate should only be used when graphing
volatile, fast-moving counters.
increase()
label_join()/label_replace()
<aggregation>_over_time()
min_over_time
max_over_time
avg_over_time
sum_over_time
count_over_time
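The idea behind increase(), rate() and the _over_time functions can be sketched on the samples of one range. This is a simplified model: it handles counter resets but does not extrapolate to the window boundaries the way Prometheus does, so real results will differ slightly. The sample values are made up.

```python
def increase(samples):
    """Total counter increase across the samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter restarted from zero
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


def avg_over_time(samples):
    """Plain average of the sample values in the window."""
    return sum(v for _, v in samples) / len(samples)


# (timestamp in seconds, counter value); the drop at t=120 simulates a restart.
window = [(0, 100.0), (60, 160.0), (120, 20.0), (180, 80.0)]

print(increase(window))
print(rate(window))
print(avg_over_time(window))
```

The reset handling is the reason rate() and increase() belong on counters, while the _over_time functions simply summarize whatever values are in the window.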
25. Wow! Functions
• delta()
• holt_winters()
• predict_linear()
• clamp_max()
• clamp_min()
• histogram_quantile()
Holt-Winters
https://www.otexts.org/fpp/7/5
New Relic Doc
Averages unfortunately have the big drawback
of hiding the distribution and preventing the discovery
of outliers/deviations.
Quantiles are a better measurement for this kind
of metric, as they make the distribution
visible. For example, if the request latency
0.5-quantile (50th percentile) is 100ms, it
means that 50% of requests completed under
100ms. Similarly, if the 0.99-quantile (99th
percentile) is 4s, it means that 1% of requests
responded in more than 4s.
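Estimating a quantile from histogram buckets can be sketched as follows. This is a simplified take on the idea behind histogram_quantile(): it assumes cumulative (upper bound, count) buckets sorted by bound, a final +Inf bucket, and observations spread evenly inside each bucket. The bucket values are made up and the function is not Prometheus's implementation.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into the open-ended bucket
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Request durations in seconds: 100 requests, cumulative counts per bucket.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]

for q in (0.5, 0.95, 0.99):
    print(q, estimate_quantile(q, buckets))
```

The 0.99 estimate cannot exceed the largest finite bucket bound, which is why bucket boundaries should be chosen around the latencies that matter.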
predict_linear()
26. Demo Queries
• max by(instance)(node_filesystem_size_bytes)
• max without(device, fstype, mountpoint)(node_filesystem_size_bytes)
• sum without(device, fstype, mountpoint)(node_filesystem_size_bytes)
• sum(node_filesystem_size_bytes)
• round(sum(node_filesystem_size_bytes)/1024/1024/1024)
• round(sum by(instance, device)(node_filesystem_size_bytes)/1024/1024/1024)
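• rate(node_load1[5m]) (runs, but node_load1 is a gauge; rate() is meant for counters, and deriv() is the gauge counterpart)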
• rate(node_cpu_seconds_total{mode="system"}[5m])
• min_over_time(node_load1[5m])
• max_over_time(node_load1[5m])
• avg_over_time(node_load1[5m])
• sum_over_time(node_load1[5m])
• count_over_time(node_load1[5m])
• delta(node_hwmon_temp_celsius[1h])
• clamp_max(node_load1,1.2)
• clamp_min(clamp_max(node_load1,1.2),1.05)
• predict_linear(node_load1[1h],4*3600)
• quantile without(cpu)(0.9, rate(node_cpu_seconds_total{mode="system"}[5m]))
• topk(3, sum by (mode) (node_cpu_seconds_total))
• bottomk(3, sum by (le) (alertmanager_http_request_duration_seconds_bucket))
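Any of the demo queries above can also be run outside the expression browser through the HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint and, when called, parses the JSON response. The server address http://localhost:9090 is an assumption and the helper names are hypothetical. Running the block only prints the URL and sends nothing.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_query(base_url, promql, timeout=10):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return payload["data"]["result"]


# Assumed local server address; only the URL is built here.
demo_query = "max by(instance)(node_filesystem_size_bytes)"
demo_url = build_query_url("http://localhost:9090", demo_query)
print(demo_url)
```

With a server running, `run_query("http://localhost:9090", demo_query)` would return one entry per instance, each carrying its labels and the latest value.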
27. Grafana – Demo
• Download and install grafana as described in url https://grafana.com/grafana/download/beta
• After installation, use the commands below to start the service, check its status, or stop it. There are other ways
too; see the installation guide for more details (attached logs)
gmv-evo@gmvevo:~/Downloads$ sudo systemctl start grafana-server
gmv-evo@gmvevo:~/Downloads$ sudo systemctl status grafana-server
gmv-evo@gmvevo:~/Downloads$ sudo systemctl stop grafana-server
• Open http://localhost:3000 and complete the login setup.
• Configure Prometheus as the data source and import the Node Exporter dashboard:
https://grafana.com/grafana/dashboards/1860
29. Out of Syllabus – Trigger to look out
• Remote Endpoints and Storage - long term storage
• Alertmanager - Webhook Receiver (Gmail, etc)
• Prometheus Concerns - fixed by Cortex and Thanos
https://grafana.com/blog/2019/11/21/promcon-recap-two-households-both-alike-in-dignity-cortex-and-thanos/
• Prometheus open bugs and fixes:
https://github.com/prometheus/prometheus/issues?
• Cloud Monitoring : Nagios vs. Prometheus
• Google's mtail - Extract Prometheus metrics from application logs.
• Prometheus is a system to collect and process metrics, not an event
logging system. For event logging, the ELK stack is the answer.
30. Study Material –Free & Cost
Free
• https://prometheus.io/docs/introduction/overview/
• https://promcon.io/2019-munich/stream/
• Prometheus Monitoring : The Definitive Guide in 2019
• subreddit collecting all Prometheus-related resources on the internet.
• https://training.robustperception.io/ - Introduction to Prometheus
• SoundCloud - What makes Prometheus a “next generation” monitoring
system?
Cost
• Understanding PromQL by Robust Perception
• Prometheus: Up & Running by O'Reilly
31. Thanks for Listening!!!
be happy and make happy @how? given by my aasan:-
Go below what you have # Dream above what you have # First love what you have
Spread info what you have # Get info what others have # Help as per what you have