This document discusses various techniques for measuring and improving application performance. It begins by explaining the importance of measuring performance at the machine, component, and request levels. This includes collecting metrics on CPU, memory, I/O, logs, and tracing requests. Once issues are identified, the document recommends actions like caching, queueing work, and rearchitecting systems using service-oriented principles to improve performance. It stresses the importance of an ongoing process of measuring, analyzing data, taking action, and verifying the impact of changes.
5. speed: where is it?
DNS
First HTTP Request your boxes
Subsequent HTTP Requests
DNS, connection
Fulfill HTTP Request (“Time to first byte”)
Download + render page contents
(+js)
17. Why you care (performance)
• Speed optimization
• A lot on client side, but not all
• Troubleshooting
• Service disruptions -- resolve ASAP
• Concurrency
• How does it scale?
• Money
• The purple bar is expensive.
22. It’s all about tradeoffs
good / evil
risk / reward
fearlessness / sobriety
26. How to make decisions (ideally)
1. Decide what to measure
2. Measure, examine
3. Act
4. Check
28. 1. What to measure
• Depends on what you’re looking for
• Bottlenecks -- db or app server
• Outages -- blocking on services
• Business metrics -- SLA reports, infrastructure utilization
• Measure as much as possible (reasonable)
• You’ll never have all the data you want
33. How to make decisions (ideally)
1. Decide what to measure
2. Measure, examine
3. Act
4. Check
37. 1. What to measure
• Depends on what you’re measuring
• DB = i/o, slow query log, buffer cache
• Server = fastcgi queue
• App = cpu/network
• Cache = ram, eviction, hits
• Tower of Babel?
• Common language: latency
• “Profiling”
38. 2. How to measure
• Machine-level
• Cpu, load, i/o, network
• Component-level
• Logs, instrumentation
• New Relic, Query Analyzer
• Request-level
• Tracing
39. 2. Machine metrics
• You have four basic resources
• CPU
• RAM
• I/O
• Network
• Open-source: Ganglia, Munin, Zabbix, etc.
• Commercial: CloudKick, AppFirst, Librato, etc...
• Everybody uses some form of this
• Facebook monitors over 5 million metrics with Ganglia
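As a rough illustration of what a machine-level sample looks like, here is a minimal sketch using only the Python standard library. The function name `sample_machine_metrics` and the choice of fields are assumptions made for this example; tools like Ganglia or Munin collect far more, and on a schedule.

```python
import os
import shutil
import time

def sample_machine_metrics(path="."):
    """Take one snapshot of a few machine-level numbers (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    # os.getloadavg() only exists on Unix-like systems and can fail there too.
    try:
        load_1min = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1min = None
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1min": load_1min,
        "disk_used_fraction": usage.used / usage.total,
    }

print(sample_machine_metrics())
```

A single snapshot says little on its own; the value comes from graphing these numbers over time.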
41. 2. Machine metrics
• Home run:
• DB has high CPU wait
• Requests are slow -- why?
• Falling short:
• Low CPU usage on app and DB
• Low disk usage on DB
• Requests are slow -- why?
42. 2. Component metrics
• Very heterogeneous
• Throughput metrics
• Error conditions
• Profiling data
• Collect from:
• Logs: tail -f, Splunk, Loggly, Hoptoad
• Service calls: JMX
• Profiling: xhprof, cProfile
• Other: New Relic, Query Analyzers
• Basically everybody does this too in some form
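For the profiling bullet above, a minimal sketch with cProfile from the Python standard library. `slow_sum` and `profile_call` are hypothetical names invented for this example; `slow_sum` is a deliberately wasteful stand-in for real application code.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately wasteful stand-in for application code (hypothetical)."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total

def profile_call(func, *args, top=5):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()

result, report = profile_call(slow_sum, 20000)
print(report)
```

The report lists where time went per function, which is the kind of component-level data machine metrics cannot give you.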
43. 2. Component metrics
• Home run:
• Low CPU usage on app and DB
• Low disk usage on DB
• App instrumentation shows time spent in service calls
• fastcgi queue getting deep
• Requests are slow -- why?
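App instrumentation like the kind described above can start very small. This is a sketch, assuming a hypothetical `timed` context manager and a made-up `handle_request` handler; the sleeps stand in for a real DB query and a real service call.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Maps a label (e.g. a service name) to a list of observed latencies in seconds.
latencies = defaultdict(list)

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[label].append(time.perf_counter() - start)

def handle_request():
    """Hypothetical request handler; the sleeps stand in for real calls."""
    with timed("db_query"):
        time.sleep(0.01)
    with timed("service_call"):
        time.sleep(0.05)

handle_request()
for label, samples in latencies.items():
    avg_ms = sum(samples) / len(samples) * 1000
    print(f"{label}: {avg_ms:.1f} ms avg over {len(samples)} call(s)")
```

With per-label latencies in hand, "time spent in service calls" becomes a number you can compare against DB and CPU time.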
72. 3. Act
• You found your problem
• If not, go back 20 slides and repeat...
• Infrastructure upgrades
• More boxes, better boxes
• Redistribute work / resource scheduling
• Service-oriented architecture (SOA)
• Do less work
• Skip what you can, cache what you can’t
• Do work later
• Deferred processing
73. 3. Caching
• Store things where they can be retrieved more cheaply (faster)
87. 3. C.R.E.A.M.
(Top of list: more speed gain, more invalidations)
• Browser cache
• CDN
• Proxy / optimizer
• Opcode cache
• Application-driven
• App-specific cache
• ORM cache
• Local (runtime) cache
• Database
• Query cache
• Denormalization
(Bottom of list: less speed gain, fewer invalidations)
88. 3. When to cache
• Protect resources
• DB
• Services
• Cover for slow actions
• DB
• Disk hits
• External service calls
• Number-crunching
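To make "cover for slow actions" concrete, here is a minimal sketch of a local (runtime) cache with expiry. `TTLCache` and `load_profile_from_db` are hypothetical names for this example, and the 30-second TTL is an arbitrary assumption; a production setup would more likely use memcached or similar.

```python
import time

class TTLCache:
    """Tiny in-process cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value

def load_profile_from_db():
    """Hypothetical slow action standing in for a DB or service call."""
    time.sleep(0.05)
    return {"name": "example"}

cache = TTLCache(ttl=30)
first = cache.get_or_compute("profile:1", load_profile_from_db)   # miss: hits the "DB"
second = cache.get_or_compute("profile:1", load_profile_from_db)  # hit: served from memory
print(cache.hits, cache.misses)
```

The hit and miss counters matter as much as the cache itself: they are the cache metrics (hits, eviction) the measuring slides ask for.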
89. 3. Deferred work
• Premise: synchronous work is lame
• Go async!
• Mechanism: queue
• RabbitMQ, 0MQ, ActiveMQ, Amazon SQS
[Diagram: app servers, queue (Q), workers/hadoop/??, db/cache]
90. 3. When to queue
• Actions you can decouple from that page load
• Things that don’t have to update in real-time
• Counter updates (queue and aggregate)
• External API calls
• Long-running requests (ajax)
• Batch processing
• Shell commands
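The "counter updates (queue and aggregate)" case can be sketched in-process with the standard library. This is only an illustration: `queue.Queue` and a thread stand in for a real broker (RabbitMQ, SQS, etc.) and separate worker boxes, and `handle_page_view` is a hypothetical handler.

```python
import queue
import threading
from collections import Counter

events = queue.Queue()
totals = Counter()  # stands in for the db/cache the worker writes to

def worker():
    """Drain the queue and aggregate counter updates until a None sentinel arrives."""
    while True:
        item = events.get()
        try:
            if item is None:
                return
            totals[item] += 1
        finally:
            events.task_done()

def handle_page_view(page):
    """Hypothetical request handler: enqueue the update and return immediately."""
    events.put(page)
    return f"rendered {page}"

thread = threading.Thread(target=worker, daemon=True)
thread.start()

for page in ["home", "about", "home"]:
    handle_page_view(page)

events.put(None)  # tell the worker to stop
events.join()     # wait until everything queued has been processed
thread.join()
print(dict(totals))
```

The page load only pays for `events.put`; the counting happens later, off the request path.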
92. 3. SOA
• We’ve got two pages on our website and one box serving it
def fast_action():
    x *= y
    render('fast.tpl')

def slow_action():
    x = compute()
    render('slow.tpl')
• Problem?
• Slow actions starve fast actions!
• How to remedy?
93. 3. SOA
• Take 1: buy more servers
• But if anyone calls slow action on one, we lose
• All servers must be able to handle slow_action’s workload
• Take 2: pull out slow action
def fast_action():
    x *= y
    render('fast.tpl')

def slow_action():
    x = remote_compute()
    render('slow.tpl')
• Who does this????
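The starvation problem and the "pull it out" remedy can be imitated on one machine. This is a sketch under loud assumptions: thread pools stand in for boxes/services, `time.sleep(0.3)` stands in for `compute()`, and `time_fast_request` is a hypothetical helper. It is an analogy for the idea, not an SOA implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fast_action():
    return "fast"

def slow_action():
    time.sleep(0.3)  # stands in for compute()
    return "slow"

def time_fast_request(fast_pool, slow_pool):
    """Submit two slow requests, then one fast one; return how long the fast one took."""
    slow_futures = [slow_pool.submit(slow_action) for _ in range(2)]
    start = time.perf_counter()
    fast_pool.submit(fast_action).result()
    waited = time.perf_counter() - start
    for future in slow_futures:
        future.result()
    return waited

# One shared pool of two workers: both are busy with slow work, so the fast request queues.
with ThreadPoolExecutor(max_workers=2) as shared:
    shared_wait = time_fast_request(shared, shared)

# Slow work pulled out into its own pool: the fast request no longer waits behind it.
with ThreadPoolExecutor(max_workers=2) as fast_pool, ThreadPoolExecutor(max_workers=2) as slow_pool:
    isolated_wait = time_fast_request(fast_pool, slow_pool)

print(f"shared: {shared_wait:.2f}s, isolated: {isolated_wait:.2f}s")
```

In the shared case the fast request waits roughly as long as a slow one takes; once slow work has its own capacity, the fast path is independent of it.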
95. 3. Resource scheduling
[Diagram: resource usage per box -- app: Low, Low, Low; memcached: Low, High, Low; number-cruncher: High, Low, Low]
100. How to make decisions (ideally)
1. Decide what to measure
2. Measure, examine
3. Act
4. Check
101. 4. Did we ruin everything?
• If your metrics were right, things are probably faster
• But they’re different
• ... and probably more complicated
• How do we keep track of it?
• Better tools
• Next month: performance and load testing with Selenium
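One simple way to run the "check" step is to compare latency percentiles from before and after the change. A minimal sketch follows; `summarize` and `compare` are hypothetical helpers, and the sample numbers are made up purely for illustration.

```python
import statistics

def summarize(samples_ms):
    """Return median and 95th-percentile latency for a list of samples."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": statistics.median(samples_ms), "p95": cuts[94]}

def compare(before_ms, after_ms):
    """Report the change in each percentile; negative numbers mean faster."""
    before, after = summarize(before_ms), summarize(after_ms)
    return {name: after[name] - before[name] for name in before}

# Made-up sample data for illustration only.
before = [120, 130, 125, 400, 135, 128, 122, 390, 131, 127]
after = [80, 85, 82, 150, 88, 84, 81, 160, 86, 83]
print(compare(before, after))
```

Looking at p95 as well as the median helps catch changes that speed up the typical request while leaving the slow tail untouched.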
102. Takeaways
• Hard to solve problems without understanding them at a fundamental level
• Get data, visualize
• Machine and component metrics are key
• Sometimes they’re not enough
• Once we know a problem, there’s help
• SOA, Cache, Deferral -- complementary tools
• As web systems become more complicated, we must use more sophisticated tools to monitor and debug them
split this into 4 pages with more info
so, it’s important, but I’d argue that it’s also particularly interesting, and only getting more so
there’s a lot -- too much. here’s a little bit
I took a class called making decisions...
it’s about measuring, then optimizing
can’t tell you how to act -- that’s too specific
but can tell you how to decide how to act