The kernel knows more than our programs. Stop bloating our applications with copy-and-paste instrumentation code for metrics. Let's go look under the hood!
Nowadays every application exposes its metrics via an HTTP endpoint readable by Prometheus. Nevertheless, this very common pattern, by definition, only exposes metrics about the specific application being observed.
This talk, and its companion slides, presents the idea, and a reference implementation (https://github.com/bpftools/kube-bpf), of using eBPF programs to collect and automatically expose application and kernel metrics via a Prometheus endpoint.
It walks through the architecture of the proposed reference implementation - a Kubernetes operator with a custom resource for eBPF programs - and finally links to a simple demo showing how to use it to grab and present metrics without touching any application running on the demo cluster.
---
Talk given at Cloud_Native Rejekts EU - Barcelona, Spain - on May 18th, 2019
Prometheus as exposition format for eBPF programs running on Kubernetes
1. Prometheus as exposition
format for eBPF programs
running on k8s
Leonardo Di Donato. Open Source Software Engineer @ Sysdig.
2019.05.18 - Cloud_Native Rejekts EU - Barcelona, Spain
3. @leodido
• Old buzzword.
• Is this SNMP? 😂
• Focus on collecting, persisting, and alerting
on just any data!
• It might also simply become garbage.
• Data lake.
• Doing it well requires a strategy.
• Uninformed monitoring equals hope.
Monitoring
The missing buzzwords
Wait, another really cool buzzword is Tracing!
• The ability of a system to give humans
insights.
• Humans can observe, understand, and act on
the presented state of an observable system.
• The ability to make deductions about internal
state by looking only at the boundaries (inputs
vs outputs).
• Never truly achieved. Ongoing process and
mindset.
• Avoid black box data. Extract fine-grained
and meaningful data.
Observability
4. @leodido
• Monitoring landscape very fragmented
• Many solutions
• with ancient tech
• Proprietary data formats
• often incompletely implemented, or undocumented, or ...
• Hierarchical data models
• Metrics? W00t?
Before Prometheus
But there’s a thing ...
• De-facto standard
• Cloud-native metric monitoring
• Ease of use
• Explosion of /metrics endpoints
After Prometheus
The journey so far
5. What if we could exploit the Prometheus
(or OpenMetrics) exposition format’s
awesomeness without having to
instrument every application by hand?
Can we avoid clogging our applications
by using eBPF superpowers?
eBPF superpowers
@leodido
6. What eBPF is
You can now write mini programs that run on events like disk I/O,
executed in a safe virtual machine in the kernel.
The in-kernel verifier refuses to load eBPF programs with invalid
pointer dereferences, calls exceeding the maximum stack depth, or
loops without an upper bound.
It imposes a stable Application Binary Interface (ABI).
BPF on steroids 🚀
A core part of the Linux kernel.
@leodido
7. @leodido
userspace
program
bpf() syscall
eBPF program ...
user-space
kernel
eBPF map
BPF_MAP_CREATE
BPF_MAP_LOOKUP_ELEM
BPF_MAP_UPDATE_ELEM
BPF_MAP_DELETE_ELEM
BPF_MAP_GET_NEXT_KEY
http://bit.ly/bpf_map_types 📎
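The map commands above boil down to key/value operations on kernel-resident storage shared between eBPF programs and userspace. This is a minimal semantic sketch in Go, not the real bpf() syscall: it models a BPF hash map (fixed-size keys and values in the kernel; uint32 keys and uint64 values here) and the effect of the BPF_MAP_* commands.

```go
package main

import "fmt"

// sketchMap mimics the behavior of a BPF hash map as driven by the
// BPF_MAP_* commands listed above. In the kernel these operations go
// through the bpf() syscall; here they are plain map operations.
type sketchMap map[uint32]uint64

// Lookup mirrors BPF_MAP_LOOKUP_ELEM: fetch the value for a key.
func (m sketchMap) Lookup(k uint32) (uint64, bool) { v, ok := m[k]; return v, ok }

// Update mirrors BPF_MAP_UPDATE_ELEM: insert or overwrite a value.
func (m sketchMap) Update(k uint32, v uint64) { m[k] = v }

// Delete mirrors BPF_MAP_DELETE_ELEM: remove a key.
func (m sketchMap) Delete(k uint32) { delete(m, k) }

func main() {
	packets := sketchMap{} // conceptually BPF_MAP_CREATE
	packets.Update(6, 551) // e.g. protocol 6 (TCP) saw 551 packets
	if n, ok := packets.Lookup(6); ok {
		fmt.Println(n)
	}
}
```

In the real flow, the eBPF program updates the map from kernel context on each event, while a userspace program periodically iterates it (BPF_MAP_GET_NEXT_KEY + lookups) to read the counters out.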
BPF_PROG_TYPE_SOCKET_FILTER
BPF_PROG_TYPE_KPROBE
BPF_PROG_TYPE_TRACEPOINT
BPF_PROG_TYPE_RAW_TRACEPOINT
BPF_PROG_TYPE_XDP
BPF_PROG_TYPE_PERF_EVENT
BPF_PROG_TYPE_CGROUP_SKB
BPF_PROG_TYPE_CGROUP_SOCK
BPF_PROG_TYPE_SOCK_OPS
BPF_PROG_TYPE_SK_SKB
BPF_PROG_TYPE_SK_MSG
BPF_PROG_TYPE_SCHED_CLS
BPF_PROG_TYPE_SCHED_ACT
📎 http://bit.ly/bpf_prog_types
eBPF program
How does eBPF work?
8. • fully programmable
• can trace everything in a system
• not limited to a specific application
• unified tracing interface for both kernel and
userspace
• [k,u]probes, (dtrace)tracepoints and so on
are also used by other tools
• minimal (negligible) performance impact
• attaches JIT-compiled native instrumentation
code
• no long suspensions of execution
Advantages
• requires a fairly recent kernel
• definitely not for debugging
• no knowledge of the calling higher level
language implementation
• not fully running in user space
• kernel-user context switch (usually negligible)
when eBPF instruments a user process
• still not as portable as other tracers
• VM primarily developed in the Linux kernel
(ports to other platforms are in progress, btw)
Disadvantages
Why use eBPF at all to trace userspace processes?
10. 📎 http://bit.ly/k8s_crd
An extension of the
K8S API that lets you
store and retrieve
structured data.
Custom resources
📎 http://bit.ly/k8s_shared_informers
The actual control
loop that watches the
shared state using the
workqueue.
Shared informers
📎 http://bit.ly/k8s_custom_controllers
It declares and
specifies the desired
state of your resource,
continuously trying to
match it with the
actual state.
Controllers
Customize all the things
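The controller pattern above can be sketched as a reconcile loop: compare the desired state declared in a custom resource with the actual state, and act to close the gap. This is an illustrative shape only; the names (BPFProgram, reconcile) are hypothetical and not kube-bpf's real API.

```go
package main

import "fmt"

// BPFProgram represents the actual state tracked for one custom
// resource: whether its eBPF object has been loaded and attached.
type BPFProgram struct {
	Name   string
	Loaded bool
}

// reconcile drives actual state toward desired state, the way a
// Kubernetes controller does on every event from its shared informer.
func reconcile(desired map[string]bool, actual map[string]*BPFProgram) {
	for name, want := range desired {
		prog, exists := actual[name]
		if !exists {
			prog = &BPFProgram{Name: name}
			actual[name] = prog
		}
		if want && !prog.Loaded {
			// In the real operator this is where the eBPF object would
			// be loaded, attached, and its maps exposed on /metrics.
			prog.Loaded = true
			fmt.Println("loaded", name)
		}
	}
}

func main() {
	desired := map[string]bool{"packet-counter": true} // from the custom resource
	actual := map[string]*BPFProgram{}                 // nothing running yet
	reconcile(desired, actual)
}
```

A real controller would run this on a workqueue fed by shared informers, retrying on failure, rather than once in main.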
13. @leodido
Count packets by protocol Count sys_enter_write by process ID
macro to generate sections inside the object file (later interpreted by the ELF BPF loader)
14. @leodido
Compile and inspect
This is important because it communicates the
current running kernel version!
Tricky and controversial legal thing about
licenses ...
The bpf_prog_load() wrapper also has a license
parameter to provide the license that applies to
the eBPF program being loaded.
No GPL-compatible license?
The kernel won’t load your eBPF!
Exceptions apply ...
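The license gate can be pictured as a simple string check: the kernel compares the license passed to bpf_prog_load() against its list of GPL-compatible licenses, and refuses GPL-only helpers (or the program itself) otherwise. The strings below mirror commonly accepted entries, but this is an illustrative sketch; the authoritative list lives in the kernel sources, not here.

```go
package main

import "fmt"

// gplCompatible sketches the kernel's license check for eBPF programs.
// The set of accepted strings is illustrative, modeled on well-known
// GPL-compatible module licenses.
func gplCompatible(license string) bool {
	switch license {
	case "GPL", "GPL v2", "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL":
		return true
	}
	return false
}

func main() {
	fmt.Println(gplCompatible("GPL"))         // accepted
	fmt.Println(gplCompatible("Proprietary")) // refused
}
```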
eBPF
Maps
19. @leodido
# HELP test_packets No. of packets per protocol (key), node
# TYPE test_packets counter
test_packets{key="00001",node="127.0.0.1"} 8      # <- ICMP
test_packets{key="00002",node="127.0.0.1"} 1      # <- IGMP
test_packets{key="00006",node="127.0.0.1"} 551    # <- TCP
test_packets{key="00008",node="127.0.0.1"} 1      # <- EGP
test_packets{key="00017",node="127.0.0.1"} 15930  # <- UDP
test_packets{key="00089",node="127.0.0.1"} 9      # <- OSPF
test_packets{key="00233",node="127.0.0.1"} 1      # <- ?
# EOF
It is a WIP project but already open source! 🎺
Check it out @ gh:bpftools/kube-bpf 🔗
ip-10-12-0-136.ec2.internal:9387/metrics
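Rendering this exposition is mechanical: iterate the eBPF map entries (protocol number -> packet count) and print them in the Prometheus text format. A minimal sketch, assuming the metric and label names from the slide; the rendering code itself is illustrative, not kube-bpf's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// render turns eBPF map entries into the Prometheus text exposition
// format shown above: HELP/TYPE preamble, one sample per map key with
// the key zero-padded to five digits, and a trailing EOF marker.
func render(name, node string, counts map[uint32]uint64) string {
	out := fmt.Sprintf("# HELP %s No. of packets per protocol (key), node\n# TYPE %s counter\n", name, name)
	keys := make([]uint32, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	// Sort for a stable, readable ordering of the samples.
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	for _, k := range keys {
		out += fmt.Sprintf("%s{key=%q,node=%q} %d\n", name, fmt.Sprintf("%05d", k), node, counts[k])
	}
	return out + "# EOF\n"
}

func main() {
	fmt.Print(render("test_packets", "127.0.0.1", map[uint32]uint64{6: 551, 17: 15930}))
}
```

Serving this string from an HTTP handler on each node is what makes the eBPF maps scrapeable by any standard Prometheus setup.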
20. @leodido
# HELP test_dummy No. sys_enter_write calls per PID (key), node
# TYPE test_dummy counter
test_dummy{key="00001",node="127.0.0.1"} 8
test_dummy{key="00295",node="127.0.0.1"} 1
test_dummy{key="01278",node="127.0.0.1"} 1158
test_dummy{key="04690",node="127.0.0.1"} 209
test_dummy{key="04691",node="127.0.0.1"} 889
# EOF
It is a WIP project but already open source! 🎺
Check it out @ gh:bpftools/kube-bpf 🔗
ip-10-12-0-122.ec2.internal:9387/metrics
21. @leodido
It is a WIP project but already open source! 🎺
Check it out @ gh:bpftools/kube-bpf 🔗
22. @leodido
kubectl-trace
More eBPF + k8s
Run a bpftrace program (from a file)
Ctrl-C tells the
program to
plot the results
using hist()
The output histogram
Maps
23. @leodido
• Prometheus exposition format is here to stay given how simple it is 📊
• OpenMetrics will introduce improvements on such giant shoulders 📈
• We cannot monitor and observe everything from inside our applications 🎯
• We might want to have a look at the orchestrator (context) our apps live
and die in 🕸
• Kubernetes can be extended to achieve such levels of integration 🔌
• ELF is cool 🧝
• We look for better tools (eBPF) for grabbing our metrics and even more 🔮
• Nearly zero footprint ⚡
• Enable a wider range of available data 🌊
• Do not touch our applications directly 👻
• There is a PoC doing some magic at gh:bpftools/kube-bpf 🧞
Key takeaways
24. Thanks.
Reach me out @leodido on twitter & github!
SEE Y’ALL AROUND AT KUBECON
http://bit.ly/prometheus_ebpf_k8s