This talk will provide motivation for the extensive instrumentation of complex computer systems and make the argument that such systems. This talk will provide practical starting points in Erlang projects and maintain a perspective on the human organization around the computer system. Brian will focus on getting started with instrumentation in a systematic way and follow up with the challenge of interpreting and acting on metrics emitted from a production system in a way which does not overwhelm operators’ ability to effectively control or prioritize faults in the system. He’ll use historical examples and case studies from my work to keep the talk anchored in the practical.
Talk objectives:
Brian hopes to convince the audience of two things:
* that monitoring and instrumentation is an essential component of any long-lived system and
* that it's not so hard to get started, after all.
He’ll keep a clear-eyed view of what works and is difficult in practice so that the audience can make a reasoned decision after the talk.
Target audience:
This talk would appeal to engineers with long-running production employments, operations folks and Erlangers in general.
13. The Problem Domain
• Low latency ( < 100 ms per transaction)
• Firm real-time system
14. The Problem Domain
• Low latency ( < 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions
per second, peak)
15. The Problem Domain
• Low latency ( < 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions
per second, peak)
• Global, 24/7 operation
19. Humans are bad at predicting the
performance of complex systems(…).
Our ability to create large and
complex systems fools us into
believing that we’re also entitled to
understand them.
C A R L O S B U E N O
“ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
42. Important Terms
metric a measurement
entry a receiver and aggregator of metrics
reporter that which samples entries periodically
and ships them to another system
subscription the definition of the regular interval on
which reporters sample entries
43. exometer
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting
decoupled.
• Static and dynamic configuration.
• Very low, predictable runtime overhead.
58. Creating Reporters
•Very easy to add your own reporters and
entries.
•Reporters and entries can be proprietary.
Just have to be loaded at runtime.
•Authors are responsive to issues.
73. • Scheduler threads were locked to CPUs
• Background process comes on every 20
minutes, consumes a lot of CPU time
• No cpu-shield was set up on our
production systems
• OS bumped a scheduler thread off its CPU,
backing up its run-queue