Monitoring Complex Systems: Keeping Your Head on Straight in a Hard World
I do things to/with computers.
I build real-time systems.
I build fault-tolerant systems.
I build critical systems.
AdRoll
Less this.
More this.
Engineering + Mathematics = ads
(you’re welcome)
REAL-TIME BIDDING
The Problem Domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation
I build Complex Systems.
Complex Systems
• Non-linear feedback
• Tightly coupled to external systems
• Difficult to model, understand
Bad things happen when Complex Systems fail.
“Humans are bad at predicting the performance of complex systems (…). Our ability to create large and complex systems fools us into believing that we’re also entitled to understand them.”
Carlos Bueno, “Mature Optimization Handbook”
Complex Systems often create worse problems than those they solve.
The key challenge to sustaining a complex system is maintaining our understanding of it.
What can be done?
Compile-time guarantees are not sufficient.
(Don’t scrimp on them, though.)
Ahead-of-time verification is not sufficient.
(Don’t scrimp on this, either.)
We need insight into the running system.
What are we looking for?
• VM killers
• Application performance regressions
• Abnormal application behavior
• Surprises
INSTRUMENTATION
BEAM is ready to play.
erlang:memory/1
• ets
• binary
• atom
• total
• processes
• system
erlang:statistics/1
• run_queue
• garbage_collection
• io
erlang:system_info/1
• port_count
• process_count
• *_limit
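A minimal sketch of how the three built-ins above can be sampled in one call. The module name vm_sample, the function snapshot/0 and the keys of the returned list are made up for illustration; only the erlang:memory/1, erlang:statistics/1 and erlang:system_info/1 calls come from the slides.

```erlang
-module(vm_sample).
-export([snapshot/0]).

%% Take one sample of the VM-level figures listed on the slides.
%% Returns a property list so it can be logged or shipped as-is.
%% (Key names are illustrative, not part of any library.)
snapshot() ->
    %% statistics(garbage_collection) returns
    %% {NumberOfGCs, WordsReclaimed, 0}.
    {GCs, WordsReclaimed, _} = erlang:statistics(garbage_collection),
    %% statistics(io) returns {{input, In}, {output, Out}}, in bytes.
    {{input, In}, {output, Out}} = erlang:statistics(io),
    [{memory_ets,         erlang:memory(ets)},
     {memory_binary,      erlang:memory(binary)},
     {memory_atom,        erlang:memory(atom)},
     {memory_processes,   erlang:memory(processes)},
     {memory_system,      erlang:memory(system)},
     {memory_total,       erlang:memory(total)},
     {run_queue,          erlang:statistics(run_queue)},
     {gc_count,           GCs},
     {gc_words_reclaimed, WordsReclaimed},
     {io_in_bytes,        In},
     {io_out_bytes,       Out},
     {port_count,         erlang:system_info(port_count)},
     {port_limit,         erlang:system_info(port_limit)},
     {process_count,      erlang:system_info(process_count)},
     {process_limit,      erlang:system_info(process_limit)}].
```

Calling vm_sample:snapshot() from a shell gives a quick look at how close the node is to its process and port limits, which is the kind of "VM killer" the earlier slide asks about.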
What about our own work?
Exometer
Important Terms
metric: a measurement
entry: a receiver and aggregator of metrics
reporter: that which samples entries periodically and ships them to another system
subscription: the definition of the regular interval on which reporters sample entries
exometer
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting are decoupled.
• Static and dynamic configuration.
• Very low, predictable runtime overhead.
Defining Entries
{predefined, [
    {[erlang, memory],
     {function, erlang, memory, ['$dp'], value, [ets, binary]},
     []},
    {[erlang, statistics],
     {function, erlang, statistics, ['$dp'], value, [run_queue]},
     []},
    {[erlang, gc],
     {function, erlang, statistics, [garbage_collection],
      match, {total_coll, rec_wrd, '_'}},
     []},
    {[boodah, freq_cap, not_found], spiral},
    {[boodah, freq_cap, ok], spiral},
    {[boodah, freq_cap, timeout], spiral}
]},
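The function entries above pull their values from the VM, but the spiral entries only report what the application pushes into them. A minimal sketch of that application side, assuming exometer is running and the three [boodah, freq_cap, ...] entries exist. The module freq_cap_metrics and its functions are hypothetical and not from the talk; exometer:update/2 is the library call that feeds a value to an entry.

```erlang
-module(freq_cap_metrics).
-export([metric_name/1, record/1]).

%% Map a frequency-cap lookup outcome to the entry name used in the
%% configuration above. Only the three outcomes defined there are accepted.
metric_name(Result) when Result =:= ok;
                         Result =:= not_found;
                         Result =:= timeout ->
    [boodah, freq_cap, Result].

%% Count one occurrence of the outcome on its spiral entry.
%% Assumes the exometer application is started; exometer:update/2
%% returns an error tuple if the entry does not exist.
record(Result) ->
    exometer:update(metric_name(Result), 1).
```

A call such as freq_cap_metrics:record(timeout) would sit at the point where the lookup result is known, so the hot path pays for one update and nothing else.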
Defining Reporters
{reporters, [
    {exometer_report_statsd, [
        {hostname, "localhost"},
        {port, 8125},
        {type_map, [
            {[erlang, statistics, run_queue], histogram},
            {[erlang, gc, total_coll], histogram},
            {[erlang, gc, rec_wrd], histogram},
            {[erlang, memory, ets], gauge},
            {[erlang, memory, binary], gauge},
            {[boodah, freq_cap, not_found], gauge},
            {[boodah, freq_cap, ok], gauge},
            {[boodah, freq_cap, timeout], gauge}
        ]},
Defining Subscriptions
{report, [
    {subscribers, [
        {exometer_report_statsd, [erlang, statistics], run_queue, 1000},
        {exometer_report_statsd, [erlang, gc], total_coll, 1000},
        {exometer_report_statsd, [erlang, gc], rec_wrd, 1000},
        {exometer_report_statsd, [erlang, memory], ets, 10000},
        {exometer_report_statsd, [erlang, memory], binary, 10000},
        {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000},
        {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000},
        {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000}
    ]}
]}
Doing it dynamically.
1> exometer:new([a, histogram], histogram).
ok
2> exometer:get_value([a, histogram]).
{ok,[{n,0},
     {mean,0},
     {min,0},
     {max,0},
     {median,0},
     {50,0},
     {75,0},
     {90,0},
     {95,0},
     {99,0},
     {999,0}]}
3> exometer_report:add_reporter(exometer_report_tty, []).
ok
4> exometer_report:subscribe(exometer_report_tty, [a, histogram], mean, 1000, []).
ok
exometer_report_tty: a_histogram_mean 1393627070:0
exometer_report_tty: a_histogram_mean 1393627071:0
exometer_report_tty: a_histogram_mean 1393627072:0
These are all loosely coupled at runtime.
Configuration is static, but you can adapt it on the fly.
Creating Reporters
• Very easy to add your own reporters and entries.
• Reporters and entries can be proprietary. They just have to be loaded at runtime.
• Authors are responsive to issues.
Why not…
…folsom?
…statman?
…vmstat?
Okay, great. We have instrumentation. Now what?
MONITORING
This is the hard part.
Visualization
Alerting
Analysis
Visualization tells you how things look but not why.
Periodic Bid Timeouts
Consistent System Load
Correlated Network Traffic Spikes
Correlated Run Queue Spikes
What happened?
• Scheduler threads were locked to CPUs
• A background process ran every 20 minutes and consumed a lot of CPU time
• No cpu-shield was set up on our production systems
• The OS bumped a scheduler thread off its CPU, backing up its run queue
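A backed-up queue on a single scheduler is hard to see in the aggregate run_queue figure. A minimal sketch of a per-queue check, assuming an OTP release that supports erlang:statistics(run_queue_lengths). The module runq_check, its functions and the threshold of 10 are hypothetical and chosen only for illustration.

```erlang
-module(runq_check).
-export([hot_queues/2, report/0]).

%% Pure helper: given a list of run-queue lengths and a threshold,
%% return {PositionInList, Length} for every queue above the threshold.
hot_queues(Lengths, Threshold) ->
    Indexed = lists:zip(lists:seq(1, length(Lengths)), Lengths),
    [{I, L} || {I, L} <- Indexed, L > Threshold].

%% Collect scheduler binding information next to the per-queue lengths.
%% Depending on the OTP release, the list returned by
%% statistics(run_queue_lengths) may carry extra queues after the
%% normal ones, so the position is a list index, not a CPU number.
report() ->
    Lengths = erlang:statistics(run_queue_lengths),
    [{bind_type,         erlang:system_info(scheduler_bind_type)},
     {bindings,          erlang:system_info(scheduler_bindings)},
     {run_queue_lengths, Lengths},
     {hot,               hot_queues(Lengths, 10)}].
```

One long queue next to many empty ones, on a node whose bind type is not unbound, is the pattern described in the bullets above.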
Alerting tells you that something happened, but not why.
A normal day.
Wat?
That’s some cliff.
Timeouts look good.
Errors prior are okay.
What happened?
“Uh, hey guys, you know Facebook is down, right?”
Analysis gives you why but only if you know how to ask for what.
The memory use of a bidder.
ಠ_ಠ
It’s all binaries.
Not in processes.
Not in ETS.
Come on now.
What happened?
A jiffy bug.
A byte here, a byte there, and eventually it turns into real memory.
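The "binaries, not processes, not ETS" question can be asked of the VM directly. A minimal sketch, assuming only erlang:memory/0 from the standard runtime. The module mem_breakdown, its functions and the choice of categories are illustrative and not from the talk.

```erlang
-module(mem_breakdown).
-export([shares/0, shares/1]).

%% Break down the memory of the running node.
shares() ->
    shares(erlang:memory()).

%% Given a property list shaped like the result of erlang:memory/0,
%% return {Category, Bytes, PercentOfTotal} rows, largest first.
%% Categories missing from the input are skipped.
shares(Mem) ->
    Total = proplists:get_value(total, Mem),
    Categories = [processes, ets, binary, atom, code],
    Rows = [{C, B, 100 * B / Total}
            || C <- Categories,
               B <- [proplists:get_value(C, Mem)],
               is_integer(B)],
    lists:reverse(lists:keysort(2, Rows)).
```

Sampled over time, a binary share that keeps climbing while processes and ets stay flat points toward the kind of leak described above.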
Okay, great. We have monitoring and instrumentation. Now all our problems are solved, right?
Not quite.
Instruments make up for our lack of insight.
Monitoring makes up for our frailty.
No solution is perfect.
Instruments may be misleading.
Instruments may be overwhelming.
Instruments may be inaccurate.
Instruments may be ignored.
What can be done?
A little paranoia never hurt anyone.
Use glass displays.
Train.
Keep sight of the main goal.
Recognize that even simple systems go wrong sometimes.
Have resources you’re willing to sacrifice.
Thanks, folks!
<3
@bltroutwine
