Each of us operates distributed systems. Some of us operate traditional infrastructure
with database, web, and load-balancing tiers. Others require infrastructure that is
more bespoke and may incorporate non-traditional storage solutions (such as Riak).
Regardless of where each of us falls on this spectrum, the network closely describes the
behavior of our applications. Furthermore, it is the only place we can look to understand
the emergent behavior of applications working in concert. In this talk, we take a
radiological view of network-derived imagery and discuss what it can tell us about our
systems as a whole.
Distributed systems-radiology
1. Modern Radiology for
Distributed Systems
Dietrich Featherston
@d2fn
Thursday, October 11, 12
2. This is a talk about
monitoring
3. But not just any kind of
monitoring
Non-invasive monitoring
4. non-invasive monitoring
measures taken to describe the
state of a system with minimal
changes to the system being
monitored
5. Insight
Radiographic
Imagery
Invasiveness
6. preventative care
measures taken to prevent
diseases or injuries rather than
curing them or treating their
symptoms
7. Non-invasive monitoring
techniques focus primarily
on host-based metrics
Why is this a problem?
9. Information emitted about nodes in the network: n
Information emitted about edges in the network: n²
[Figure axis: Network size]
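To make the n versus n² contrast on this slide concrete, here is a minimal sketch. The function names are hypothetical, and the edge count assumes a full mesh in which every host may talk to every other host in both directions, giving n(n-1) directed pairs, which grows as n². The slide itself states only n and n².

```python
def node_series(n: int) -> int:
    """Number of per-host metric sources in a network of n hosts."""
    return n


def edge_series(n: int) -> int:
    """Upper bound on directed host-to-host edges (assumed full mesh)."""
    return n * (n - 1)


if __name__ == "__main__":
    # Compare how the two quantities grow as the network gets larger.
    for n in (10, 100, 1000):
        print(f"n={n:>5}  nodes={node_series(n):>5}  possible edges={edge_series(n):>7}")
```

Under this assumption, growing a network from 10 to 1000 hosts multiplies the host-based series by 100 but the possible edge series by roughly 10,000.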
10. We analyze cell-structure
because we can’t envision
the whole organism
We react to disease and
injury because we lack
preventative care
11. We lack preventative care for
applications because our
non-invasive monitoring
techniques are growing less
and less meaningful
12. Radiology is useful in
illuminating non-invasive
monitoring of distributed
systems
16. Context is
everything
17. How do we use
context?
18. !!! Context
Your Big
Dumb Data
19. Human
brain
Diagnoses
+
med school
Radiographic
Imagery
20. E.T. Signal
Processing
VLA Output
21. Application Topology
Signal Processing
Expert Brain
Application
Behavior
Network Data
22. Dimensions (11)      Measurements (8)
    epoch seconds        egress packets
    epoch minutes        egress octets
    epoch hours          ingress packets
    node id              ingress octets
    source ip            retransmits
    source port          errors
    dest ip              app-rtt
    dest port            handshake-rtt
    interface
    country
    network/asn
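One way to picture a single record with these 11 dimensions and 8 measurements is as a pair of small data classes. This is only an illustrative sketch: the class names, field names, types, and the microsecond unit for the round-trip times are assumptions, not the schema actually used at Boundary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FlowDimensions:
    """The 11 dimensions a measurement can be grouped or filtered by."""
    epoch_seconds: int
    epoch_minutes: int
    epoch_hours: int
    node_id: int
    source_ip: str
    source_port: int
    dest_ip: str
    dest_port: int
    interface: str
    country: str
    network_asn: str


@dataclass
class FlowMeasurements:
    """The 8 measurements recorded for one combination of dimensions."""
    egress_packets: int = 0
    egress_octets: int = 0
    ingress_packets: int = 0
    ingress_octets: int = 0
    retransmits: int = 0
    errors: int = 0
    app_rtt_usec: float = 0.0        # unit is an assumption
    handshake_rtt_usec: float = 0.0  # unit is an assumption


if __name__ == "__main__":
    print(len(fields(FlowDimensions)), "dimensions,",
          len(fields(FlowMeasurements)), "measurements")
```

Keeping the dimensions immutable and hashable (`frozen=True`) means an instance could serve as a dictionary key when aggregating measurements.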
23. Case Study #1
GC-Death of a distributed
JVM application
42. 1895: Wilhelm Röntgen discovers X-rays. First medical use of X-rays in
human imaging takes place one month later.
43. 1895: Wilhelm Röntgen discovers X-rays. First medical use of X-rays in
human imaging takes place one month later.
1905: First English text on chest radiography
44. 1895: Wilhelm Röntgen discovers X-rays. First medical use of X-rays in
human imaging takes place one month later.
1905: First English text on chest radiography
1920: Society of Radiographers formed
45. Recognition of radiology as
a formal medical discipline
was a cultural problem, not
a technology problem
http://www.bshr.org.uk/page13.html
46. If you want to talk to me about the
query language used to ask questions
of the network data we collect at
Boundary, talk to me afterward or hit me up
on Twitter.
@d2fn
github.com/dietrichf
47. Find 45 minutes of total traffic seen on meters 1, 2, 226, & 301
starting 18 hours ago, broken down by peer ip; retain top 10 by the ratio
of retransmits to packets

    get volume_1s_meter_ip [
      meter in {1, 2, 226, 301};
      epochMillis from -18h for 45m;
    ]
    categorize
      sum(ingress) as ingress,
      sum(egress) as egress,
      sum(ingressPackets + egressPackets) as packets,
      sum(retransmits) as retransmits,
      mean(appRttUsec/1000) as appRttMs
    by epochMillis, ip
    retain
      top 10
      on retransmits/packets
      per epochMillis
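For readers unfamiliar with the query language, the computation this slide describes can be approximated in plain Python. This is a sketch under stated assumptions, not Boundary's implementation: the function name, the representation of rows as dictionaries, and the explicit `start_ms`/`end_ms` window (standing in for "from -18h for 45m") are all hypothetical. Only the column names are taken from the slide.

```python
from collections import defaultdict


def top_retransmit_peers(rows, meters, start_ms, end_ms, k=10):
    """Filter, aggregate by (epochMillis, ip), keep the top k ratios per timestamp."""
    groups = defaultdict(lambda: {"ingress": 0, "egress": 0, "packets": 0,
                                  "retransmits": 0, "rtt_sum": 0.0, "rtt_n": 0})
    for r in rows:
        # Equivalent of the bracketed filter: meter set and time window.
        if r["meter"] not in meters or not (start_ms <= r["epochMillis"] < end_ms):
            continue
        g = groups[(r["epochMillis"], r["ip"])]
        g["ingress"] += r["ingress"]
        g["egress"] += r["egress"]
        g["packets"] += r["ingressPackets"] + r["egressPackets"]
        g["retransmits"] += r["retransmits"]
        g["rtt_sum"] += r["appRttUsec"] / 1000  # microseconds to milliseconds
        g["rtt_n"] += 1

    per_ts = defaultdict(list)
    for (ts, ip), g in groups.items():
        ratio = g["retransmits"] / g["packets"] if g["packets"] else 0.0
        per_ts[ts].append({"ip": ip, "ingress": g["ingress"], "egress": g["egress"],
                           "packets": g["packets"], "retransmits": g["retransmits"],
                           "appRttMs": g["rtt_sum"] / g["rtt_n"], "ratio": ratio})

    # Keep only the k peers with the highest retransmit ratio at each timestamp.
    return {ts: sorted(peers, key=lambda p: p["ratio"], reverse=True)[:k]
            for ts, peers in per_ts.items()}
```

A caller would compute `start_ms` as the current time minus 18 hours and `end_ms` as `start_ms` plus 45 minutes, then pass `meters={1, 2, 226, 301}`.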