The Netdata Agent is free, open source single-node monitoring software. Netdata Cloud is a free, closed source, software-as-a-service that brings together metadata from endpoints running the Netdata Agent, giving a complete view of the health and performance of an infrastructure. All the metrics remain on the Netdata Agent, making Netdata Cloud the focal point of a distributed, infinitely scalable, low cost solution.
The heart of Netdata Cloud is Pulsar. Almost every message coming from and going to the open source agents passes through Pulsar. Pulsar's support for a practically unlimited number of topics has given us the flexibility we needed; in some cases, every single Netdata Agent has its own unique Pulsar topic. A single message from an agent, or from a service that processes a front-end request, can trigger several other Pulsar messages, as we also use Pulsar for communication between microservices (using a CQRS pattern with shared subscriptions for scalability).
The reliable persistence of messages has allowed us to replay old events to rebuild old and build new materialized views and debug specific production issues. It's also what will enable us to implement an event sourcing pattern, for a new set of features we want to introduce shortly.
We have had a few issues with a specific client and our shared subscriptions that we're working on resolving, but overall Pulsar has proven to be one of the most reliable parts of our infrastructure and we decided to proceed with a managed services agreement.
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for Free - Pulsar Summit NA 2021
1. How Pulsar enables Netdata to offer unlimited
infrastructure monitoring for free
Pulsar Virtual Summit North America 2021
2. Speaker
Christopher Akritidis
COO, Netdata
Christopher rallies the day-to-day efforts of the Netdata team with 20
years of experience in delivering IT-empowered business solutions.
Before Netdata, he focused on optimizing performance and efficiency
for telecom companies. When he’s not working, he enjoys science,
philosophy, science fiction, strategy games, reading, and writing.
3. Agenda
● Introduction / Presentation summary
● Who we are and what we want to achieve
● The challenges of a free infrastructure monitoring solution
● Netdata Cloud Architecture
● Pulsar features that we rely on
● Pulsar features missed - Challenges for the future
5. Presentation Summary
● The Netdata Agent is free, open source single-node monitoring software. Netdata Cloud is a free, closed
source, software-as-a-service that brings together metadata from endpoints running the Netdata Agent,
giving a complete view of the health and performance of an infrastructure.
● All the metrics remain on the Netdata Agent, making Netdata Cloud the focal point of a decentralized,
scalable, low cost solution.
● The heart of Netdata Cloud is Pulsar. Almost every message coming from and going to the open source
agents passes through it, often generating a series of other messages.
○ Pulsar's millions of topics and key-shared subscriptions allow us to deal with an arbitrary number
of agents and requests from the front-end.
○ The reliable persistence of messages has allowed us to replay old events to rebuild old and build
new materialized views and debug specific production issues. It's also what will enable us to
implement an event sourcing pattern, for a new set of features we want to introduce shortly.
○ Decoupled and tiered storage help keep costs low, as the infrastructure expands.
● We have had a few challenges with Pulsar, especially regarding the Go client with shared subscriptions
and future needs for event sourcing, but we rely on it heavily.
6. Who we are and what we want to achieve
7. Some History
● 2014: Out of frustration with the existing solutions, Costa Tsaousis starts working on a side project.
● 2016: Netdata GitHub launch. After almost 2 years in development, the project launches as an open source project.
● 2018: Netdata Inc. With tens of thousands of GitHub ✨ gained in a matter of weeks, Costa founds Netdata Inc.
● 2020: Netdata Cloud launch. Netdata delivers a free-forever Cloud solution for easily and visually monitoring entire infrastructures.
8. Reach
● Over 54K ✨ on GitHub
● Hundreds of thousands of active installations
● 440 individual contributors to the core Netdata codebase
● You can find us on:
○ Community Forums : https://community.netdata.cloud
○ Reddit : https://www.reddit.com/r/netdata/
○ Twitter: https://twitter.com/linuxnetdata
9. Why the success?
● Built by monitoring professionals.
● Empower users by removing the requirement to set everything up.
○ Run a single command to install it on every machine.
○ Sane defaults for data source detection (200+) and out-of-the-box (OOB) alerts.
○ Instant stunning metric visualizations of all detected sources.
● Open-source, with a dedication to the success of the user.
● Extendable (easily collect data from any source and export to common time-series databases, or TSDBs)
● Every metric, every second, but ridiculously efficient.
○ Programmed in C
○ Per-second metrics with minimal overhead because the data are stored on
the data source
○ Data queried only when required
○ In-house TSDB for short to medium term storage
10. Limitations of the FOSS agent
● Great to monitor a single, traditional machine, but what about:
○ Data replication
○ Ephemeral instances
○ Infrastructure-level metrics, alerts, patterns
● Step 1 - Centralization points in the user infrastructure
● Step 2 - Netdata Cloud
11. Centralization points (streaming)
Caveats:
● See only one node at a time
● Access management
● Slow and unreliable feedback cycle
for new features
● Slow rollout
● Difficult to monetize
12. Enter Netdata Cloud
● Organize Netdata Agents into groups (War Rooms)
● Collaborate with your team by joining the same Space
● Instantly view charts of all the nodes in a group for faster root-cause
analysis for the entire infrastructure
● Centralize Alarm management from all agents
● Offer unlimited monitoring for free forever, charge later for advanced
user control and auditing, increased metadata retention, and enterprise
plugins.
Monitor and troubleshoot the entire
infrastructure, immediately, collaboratively.
13. The challenges of free infrastructure
monitoring
14. Unique challenges
● Needs to:
○ Be real time, but without sacrificing the number of metrics or the
per-second sampling rate (eventual consistency is problematic).
○ Scale to hundreds of thousands (eventually millions) of monitored
instances
○ Provide strong auditing capabilities of any significant change,
including every alarm status update, metrics monitored etc.
○ Be free forever!
15. Unique solution
● Don’t centralize metrics.
○ Cloud needs to have metadata about the monitored nodes, the
metrics collected and individual alarms raised.
○ Queries about the metrics themselves need to be sent to agents
and served by them.
● Have the FOSS agent execute most of the required processing.
○ Even centralized alert thresholds require agents to evaluate
individual thresholds.
○ ML can’t be done centrally.
● Conclusion: We need persistent, bidirectional channels with the agents
and near-real-time updates to the metadata and alerts in the cloud.
21. Decoupled storage from brokers
This feature allows us to scale horizontally much more efficiently than with
any solution that would couple storage with message handling. We typically
run 6 brokers under normal load, with 4 BookKeeper nodes (bookies). Under
heavy load, we currently rarely need to autoscale beyond 8 brokers. The 4
bookies have kept up so far.
22. Tiered Storage / offloading
● Tiered storage allows us to reduce costs by keeping recent data in fast
(and expensive) storage, close to Pulsar, while offloading older data
that is only used occasionally (e.g. when replaying a topic to materialize
a view for some new functionality) to cheaper, "cold" storage.
23. Unlimited retention
● We call CockroachDB our source of truth but the reality is that any time
we want to spin up a new service and populate its materialized views,
what we usually do is replay the Pulsar messages that this service
needs to subscribe to.
● We are currently in the process of creating jobs that would actually read
the entries in CockroachDB in order to update materialized views, but
even that will rely on creating individual messages in Pulsar, exactly like
the ones that the service is supposed to process.
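The replay idea can be sketched without any Pulsar dependency. The snippet below is only an illustration: the event fields (`node`, `alarm`, `status`), the sample log, and the function name are assumptions made for the example, not Netdata Cloud's actual message schema. With a real client, the list would be a topic read from the earliest retained message by a new subscription.

```python
from collections import defaultdict


def build_alarm_view(events):
    """Fold an ordered event log into a per-node set of active alarms."""
    view = defaultdict(set)
    for event in events:
        node, alarm, status = event["node"], event["alarm"], event["status"]
        if status == "CLEAR":
            view[node].discard(alarm)  # alarm is no longer active
        else:
            view[node].add(alarm)      # WARNING / CRITICAL keep it active
    # Drop nodes that ended up with no active alarms.
    return {node: alarms for node, alarms in view.items() if alarms}


# Hypothetical retained topic contents, oldest message first.
event_log = [
    {"node": "node-a", "alarm": "cpu_usage", "status": "WARNING"},
    {"node": "node-b", "alarm": "disk_full", "status": "CRITICAL"},
    {"node": "node-a", "alarm": "cpu_usage", "status": "CLEAR"},
    {"node": "node-a", "alarm": "ram_usage", "status": "WARNING"},
]

# A brand new service populates its view purely by replaying the log.
view = build_alarm_view(event_log)
```

Because the view is a pure function of the ordered log, replaying the same messages always produces the same state, which is what makes rebuilding old views and building new ones from retained messages practical.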
24. Millions of topics
● We mainly use the almost unlimited number of topics for our
communication with the front end. A call to a REST API may cause our
microservices to issue requests via Pulsar. While the request is in-flight
we need to make sure that the required messages are picked up by the
microservice instance that is serving each request. This is why we need
a topic per pod to handle the responses. All these topics live in a
dedicated Pulsar namespace with no persistence. With pods always
getting created and destroyed, we end up with many such topics that
need to be routinely cleaned up.
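A minimal sketch of the topic-per-pod reply pattern follows. It uses an in-memory stand-in for the broker rather than a Pulsar client, and the topic names, pod identifiers, and message fields are hypothetical choices made for this illustration.

```python
import uuid
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a message broker: one FIFO queue per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def receive(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None


# Hypothetical topic names; the reply namespace is non-persistent.
REQUEST_TOPIC = "persistent://cloud/requests/node-queries"


def reply_topic_for(pod_id):
    return f"non-persistent://cloud/replies/{pod_id}"


def send_request(broker, pod_id, payload):
    """Called by the pod serving a REST call; returns the correlation id."""
    request = {
        "correlation_id": str(uuid.uuid4()),
        "reply_to": reply_topic_for(pod_id),  # responses come back to this pod only
        "payload": payload,
    }
    broker.publish(REQUEST_TOPIC, request)
    return request["correlation_id"]


def serve_one_request(broker):
    """Called by a backend microservice; answers on the requester's own topic."""
    request = broker.receive(REQUEST_TOPIC)
    response = {
        "correlation_id": request["correlation_id"],
        "result": request["payload"].upper(),  # placeholder for real work
    }
    broker.publish(request["reply_to"], response)


broker = InMemoryBroker()
id_a = send_request(broker, "pod-a", "list nodes")
id_b = send_request(broker, "pod-b", "list alarms")
serve_one_request(broker)
serve_one_request(broker)
reply_a = broker.receive(reply_topic_for("pod-a"))
reply_b = broker.receive(reply_topic_for("pod-b"))
```

Each pod only ever reads its own reply topic, so the response for an in-flight REST call cannot be picked up by a different instance. The cost is the one noted above: every pod that is created leaves a topic behind that has to be cleaned up.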
25. Key-shared subscriptions
● We leverage key-shared subscriptions to guarantee in-order delivery of
messages that share the same key. This allows us to write idempotent
consumers with decreased code complexity. It also allows us to avoid
costly operations like transactions when side effects take place, because
we know for sure that only one consumer at a time can process a message
about a given key, which removes that class of race condition.
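The routing property behind key-shared subscriptions can be illustrated with a small simulation. This is a simplified sketch, not Pulsar's dispatcher: it assumes a fixed number of consumers and maps keys with a SHA-256 hash modulo the consumer count, whereas the real broker assigns hash ranges to consumers. The keys and payloads are made up for the example.

```python
import hashlib
from collections import defaultdict


def consumer_for(key, consumer_count):
    """Map a message key to one consumer index, stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % consumer_count


def dispatch(messages, consumer_count):
    """Deliver (key, payload) pairs to per-consumer inboxes in arrival order."""
    inboxes = defaultdict(list)
    for key, payload in messages:
        inboxes[consumer_for(key, consumer_count)].append((key, payload))
    return inboxes


# Hypothetical alarm updates for two nodes, interleaved on one topic.
messages = [
    ("node-a", "alarm raised"),
    ("node-b", "alarm raised"),
    ("node-a", "alarm cleared"),
    ("node-b", "alarm cleared"),
]

inboxes = dispatch(messages, consumer_count=3)
```

Since every message for a given key lands in the same inbox and inboxes preserve arrival order, a consumer sees "alarm raised" before "alarm cleared" for its keys and never competes with another consumer over the same key.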
26. Delayed publishing
Some Netdata alarms are configured to trigger notifications after a given
“delay”. This feature prevents the flurry of notifications that can arise when
a metric moves quickly above and below a certain threshold within
a short period of time. Pulsar’s delayed publishing permitted us to
implement this feature without introducing repeating jobs that would
constantly check timestamps of pending notifications. So we are able to
preserve an event-based pattern that guarantees near-real-time processing
of the pending requests, without maintaining additional tables and running
additional queries.
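The following sketch shows how delayed delivery can suppress notifications for a flapping alarm without a polling job. It simulates the delay with a heap and explicit timestamps instead of a Pulsar client. The 60-second delay, the `change` counter used to detect superseded messages, and all names are assumptions made for this illustration, not the logic Netdata Cloud actually runs.

```python
import heapq
import itertools


class DelayedTopic:
    """Toy topic whose messages only become visible after a delay."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker for equal timestamps

    def publish_after(self, now, delay, message):
        heapq.heappush(self._heap, (now + delay, next(self._sequence), message))

    def due(self, now):
        """Return every message whose delivery time has been reached."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


NOTIFY_DELAY = 60   # seconds, hypothetical alarm configuration
topic = DelayedTopic()
latest_change = {}  # alarm name -> number of its most recent status change
sent = []           # notifications that actually went out


def on_status_change(now, alarm, status):
    """Each status change schedules one delayed notification message."""
    latest_change[alarm] = latest_change.get(alarm, 0) + 1
    message = {"alarm": alarm, "status": status, "change": latest_change[alarm]}
    topic.publish_after(now, NOTIFY_DELAY, message)


def deliver(now):
    """Consumer side: notify only if no newer change superseded the message."""
    for message in topic.due(now):
        if message["change"] == latest_change[message["alarm"]]:
            sent.append((now, message["alarm"], message["status"]))


# The alarm flaps three times within 20 seconds.
on_status_change(0, "cpu_usage", "WARNING")
on_status_change(10, "cpu_usage", "CLEAR")
on_status_change(20, "cpu_usage", "WARNING")

for tick in (60, 70, 80):
    deliver(tick)
```

Only the last status change produces a notification, and the consumer is driven entirely by message arrival. No table of pending notifications is scanned on a timer.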
28. Multi-tenancy and Geo-replication
● We haven’t used these features yet, but they were key in our selection
of a message broker, to guarantee that we won’t have issues in the
future.
● For us, it’s not just a matter of high availability/DR. Our users and more
importantly the infrastructures they monitor are all over the world.
Since we communicate heavily with the agents running on those
infrastructures, we will need to have presence in all cloud providers, to
guarantee the lowest possible latency.
29. Schema support
● Schema support is an interesting feature, because validation is
especially important for us. The FOSS agent is not fully in our control
and the wide variety of OSs and infrastructures our agents run on
creates a lot of unpredictable edge cases. Of course that’s without even
talking about the ability of any user to modify the source code either
with good intentions or not.
● However, when we implemented our first services we weren’t aware of
the feature and ended up validating ProtoBufs manually. This is
certainly one feature we will try out soon.
30. Subscription message selector support
● Our next big challenge and a key to our path to monetization is the
“Feed”. We toyed extensively with the idea of event sourcing using
Pulsar but, without powerful subscription message selectors, we can’t
offer the ability to filter the messages that the front-end requires to:
a. Provide a historical event log.
b. Update its state in real time, without polling for changes.
31. Go client feature parity with Java client
● We have faced challenges with the Pulsar Go client, especially with
key-shared subscriptions, for many months. Successive attempts to fix
the issues we had with race conditions and stuck producers /
consumers were largely unsuccessful for a while, forcing us to
restart practically all services every 30 minutes. We have recently
engaged StreamNative for a managed services agreement and the
latest Go client version does seem to be behaving better, but as of
May 24, 2021 we still haven’t been able to stop the automated restarts.
AlexM: Programmed in C - do we need it this way? Why? What is the reason? Suggestion: Programmed in C and Go for the best performance, efficiency and maintainability.
AlexM: General suggestion - not everybody will understand abbreviations like OOB and TSDB
AlexM: the same for FOSS - we understand what it means but others?
AlexM: As a preparation for the next slide we need to focus people here that Parent Node aggregates Data and keeps your data within your infrastructure but how to use this data, how to make it more useful -> Cloud solution
AlexM: It is safer to use Unique monitoring challenges - as we are saying here “free forever”, but we are planning to start charging clients for reasons explained before
AlexM: Nodes - outside of Netdata control, managed by clients, challenging to synchronize and aggregate data for presentation on the Cloud
AlexM - We need to be clear across presentation what is Node and what is Agent