bp

Application Performance
Management
“tightening up your backend”

dan kuebrich
dan@tracelytics.com

speed: where is it?
DNS

First HTTP Request your boxes

Subsequent HTTP Requests

DNS, connection
Fulfill HTTP Request (“Time to first byte”)
Download + render page contents
(+js)
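
A minimal sketch of how the phases listed above could be timed for a single plain-HTTP request, using only the Python standard library. The function name `time_request` and the returned keys are assumptions made for illustration; DNS lookup is folded into the connect phase here, and rendering time in the browser is not covered.

```python
import http.client
import time


def time_request(host, port, path="/"):
    """Return seconds spent in each phase of one plain-HTTP GET."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.connect()                      # name resolution + TCP connect
    connected = time.perf_counter()

    conn.request("GET", path)
    response = conn.getresponse()       # returns once status line and headers arrive
    first_byte = time.perf_counter()

    response.read()                     # download the rest of the body
    done = time.perf_counter()
    conn.close()

    return {
        "connect": connected - start,
        "time_to_first_byte": first_byte - connected,
        "download": done - first_byte,
    }
```

Calling `time_request("example.com", 80)` returns the three durations in seconds; a large `time_to_first_byte` next to a small `connect` points at the backend rather than the network.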

What’s taking so long?

[Waterfall chart: Time to connect (3ms); Time to first byte (1.61s), 33%; "?"]

Why you care (performance)
• Speed optimization
• A lot on client side, but not all

• Troubleshooting
• Service disruptions -- resolve ASAP

• Concurrency
• How does it scale?

• Money
• The purple bar is expensive.

It’s all about tradeoffs

good / evil

risk / reward

fearlessness / sobriety

How to make decisions (ideally)
1. Decide what to measure

2. Measure, examine

3. Act

4. Check

1. What to measure
• Depends on what you’re looking for
• Bottlenecks -- db or app server
• Outages -- blocking on services
• Business metrics -- SLA reports, infrastructure utilization

• Measure as much as possible (reasonable)
• You’ll never have all the data you want

1. What to measure
• Depends on what you’re measuring
• DB = i/o, slow query log, buffer cache
• Server = fastcgi queue
• App = cpu/network
• Cache = ram, eviction, hits

1. What to measure

• Tower of Babel?

• Common language: latency
• “Profiling”
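
One way to put every layer on the common language of latency is to wrap each piece of work in the same timer. The sketch below is illustrative only: the `timed` helper, the layer names, and the sleeps standing in for real work are all assumptions, not part of the talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

latencies = defaultdict(list)   # layer name -> list of durations in seconds


@contextmanager
def timed(layer):
    """Record how long the wrapped block took under the given layer name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[layer].append(time.perf_counter() - start)


def handle_request():
    # Hypothetical request: the sleeps stand in for real work.
    with timed("db"):
        time.sleep(0.02)
    with timed("cache"):
        time.sleep(0.001)
    with timed("app"):
        sum(i * i for i in range(10_000))


def slowest_layer():
    """Return the layer with the largest total recorded latency."""
    return max(latencies, key=lambda layer: sum(latencies[layer]))
```

After a few calls to `handle_request()`, `slowest_layer()` names the layer to look at first, regardless of whether its native metric is i/o, queue depth, or hit rate.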

2. How to measure
• Machine-level
• CPU, load, I/O, network

• Component-level
• Logs, instrumentation
• New Relic, Query Analyzer

• Request-level
• Tracing

2. Machine metrics
• You have four basic resources
• CPU
• RAM
• I/O
• Network

• Open-source: Ganglia, Munin, Zabbix, etc.
• Commercial: CloudKick, AppFirst, Librato, etc...

• Everybody uses some form of this
• Facebook monitors over 5 million metrics with
Ganglia

2. Machine metrics
• Home run:
• DB has high CPU wait
• Requests are slow -- why?

• Falling short:
• Low CPU usage on app and DB
• Low disk usage on DB

2. Component metrics
• Very heterogeneous
• Throughput metrics
• Error conditions
• Profiling data

• Collect from:
• Logs: tail -f, Splunk, Loggly, Hoptoad
• Service calls: JMX
• Profiling: xhprof, cProfile
• Other: New Relic, Query Analyzers

• Basically everybody does this too in some form
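
As a small illustration of the profiling bullet, here is a sketch that runs a function under Python's built-in `cProfile` and returns a text report sorted by cumulative time. The functions `handle_request` and `something_slow` and the wrapper `profile` are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def something_slow():
    # Hypothetical expensive step.
    return sum(i * i for i in range(200_000))


def handle_request():
    results = 24
    for _ in range(5):
        results += something_slow()
    return results


def profile(func):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    out = io.StringIO()
    # Show the five entries with the highest cumulative time.
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return result, out.getvalue()
```

`profile(handle_request)` returns the function's result along with a report in which `something_slow` appears near the top. This tells you where time goes inside one component, but not which caller or upstream request caused it.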

2. Component metrics
• Home run:
• Low CPU usage on app and DB
• Low disk usage on DB
• App instrumentation shows time spent in service
calls
• fastcgi queue getting deep

2. Looking for blame

[Diagram: components A and B; caption: HELP!]

2. Finding blame

[Diagram: components A and B; caption: No, help ME!]

results += 24;

do this a lot:
  something_slow()

return results;

2. Tracing metrics
• Profiling + flow-of-control

• Causal organization

• Lamport’s “happens before”

• Who does this?
• In-house solutions
• Google, Goldman Sachs, others?
• Open-source
• X-Trace, Magpie
• Commercial availability
• DynaTrace, Tracelytics
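
To make "profiling + flow-of-control" concrete, here is a minimal single-process sketch: every timed span records which span caused it, so the timings can be laid out as a causal tree. The `Tracer` and `Span` classes are hypothetical and do not reflect how X-Trace, Magpie, or any commercial product is implemented; a real tracer would also carry the trace ID across process and machine boundaries.

```python
import itertools
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)


@dataclass
class Span:
    trace_id: int
    name: str
    parent_id: Optional[int] = None
    span_id: int = field(default_factory=lambda: next(_ids))
    start: float = 0.0
    end: float = 0.0

    @property
    def duration(self):
        return self.end - self.start


class Tracer:
    """Collects spans for one request; parent links give the causal order."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.spans = []
        self._stack = []    # spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1].span_id if self._stack else None
        s = Span(self.trace_id, name, parent)
        s.start = time.perf_counter()
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()

    def render(self):
        """Return the call tree as indented lines, one per span."""
        lines = []

        def walk(parent_id, depth):
            for s in self.spans:
                if s.parent_id == parent_id:
                    lines.append(f"{'  ' * depth}{s.name} {s.duration * 1000:.1f}ms")
                    walk(s.span_id, depth + 1)

        walk(None, 0)
        return lines
```

Wrapping A's handler, its call into B, and B's `something_slow()` step in nested `tracer.span(...)` blocks yields a tree in which the slow leaf is visible together with the path that led to it.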

3. Act
• You found your problem
• If not, go back 20 slides and repeat...

• Infrastructure upgrades
• More boxes, better boxes

• Redistribute work / resource scheduling
• Service-oriented architecture (SOA)

• Do less work
• Skip what you can, cache what you can’t

• Do work later
• Deferred processing

3. Caching
• Store things where they can be retrieved more cheaply
(faster)
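
The simplest form of this is an in-process cache. A minimal sketch using Python's standard `functools.lru_cache`; `expensive_lookup` and the call counter are hypothetical stand-ins for a slow query or service call.

```python
from functools import lru_cache

calls = {"count": 0}   # counts how often the expensive path actually runs


@lru_cache(maxsize=128)
def expensive_lookup(key):
    # Hypothetical stand-in for a slow query or service call.
    calls["count"] += 1
    return key.upper()
```

Repeated calls with the same key are served from memory, and `expensive_lookup.cache_info()` reports hits and misses. Note that nothing here ever invalidates an entry, which is only acceptable when the underlying data does not change.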

3. C.R.E.A.M.
(top of list: more speed gain, more invalidations)
• Browser cache
• CDN
• Proxy / optimizer
• Opcode cache
• Application-driven
• App-specific cache
• ORM cache
• Local (runtime) cache
• Database
• Query cache
• Denormalization
(bottom of list: less speed gain, fewer invalidations)

3. When to cache
• Protect resources
• DB
• Services

• Cover for slow actions
• DB
• Disk hits
• External service calls
• Number-crunching
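
A sketch of a cache that protects a slow resource while still bounding staleness: entries expire after a time-to-live and can also be invalidated explicitly on writes. The `TTLCache` class and its method names are assumptions for illustration, not an existing library API.

```python
import time


class TTLCache:
    """Cache values for a fixed number of seconds, with manual invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self._store = {}        # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value, calling compute() only on a miss or expiry."""
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, e.g. right after the underlying row is updated."""
        self._store.pop(key, None)
```

Typical use would be something like `cache.get_or_compute(("user", 42), lambda: load_user(42))`, where `load_user` is whatever slow DB or service call is being covered. A short TTL trades some speed gain for fewer explicit invalidations.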

3. Deferred work
• Premise: synchronous work is lame
• Go async!

• Mechanism: queue
• RabbitMQ, 0MQ, ActiveMQ, Amazon SQS

[Diagram: app servers, queue (Q), workers/hadoop/??, db/cache]

3. When to queue
• Actions you can decouple from that page load
• Things that don’t have to update in real-time
• Counter updates (queue and aggregate)
• External API calls
• Long-running requests (ajax)
• Batch processing
• Shell commands

3. Redistribute work
• Service-oriented architecture
• Reusable components
• Co-tenable components

app

3. SOA
• We’ve got two pages on our website and one box
serving it

def fast_action():
    x *= y
    render('fast.tpl')

def slow_action():
    x = compute()
    render('slow.tpl')

• Problem?
• Slow actions starve fast actions!
• How to remedy?

3. SOA
• Take 1: buy more servers
• But if anyone calls slow action on one, we lose
• All servers must be able to handle slow_action’s
workload

• Take 2: pull out slow action
def fast_action():
    x *= y
    render('fast.tpl')

def slow_action():
    x = remote_compute()
    render('slow.tpl')

• Who does this????
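
The starvation effect, and what pulling the slow action out buys, can be shown with a small deterministic simulation of a single first-come-first-served worker. The service times and request names below are made up for illustration, and `simulate_fifo` is a hypothetical helper.

```python
def simulate_fifo(requests):
    """Single worker, first come first served.

    requests: list of (arrival_time, name, service_time), sorted by arrival.
    Returns {name: completion_time - arrival_time}.
    """
    clock = 0.0
    latency = {}
    for arrival, name, service in requests:
        clock = max(clock, arrival) + service   # wait for the worker, then run
        latency[name] = clock - arrival
    return latency


FAST, SLOW = 0.01, 2.0   # hypothetical service times in seconds

mixed = [(0.0, "slow-1", SLOW), (0.1, "fast-1", FAST), (0.2, "fast-2", FAST)]

# One box serving both actions: fast requests queue behind the slow one.
shared = simulate_fifo(mixed)

# "Take 2": fast and slow actions each get their own worker.
fast_only = simulate_fifo([r for r in mixed if r[1].startswith("fast")])
slow_only = simulate_fifo([r for r in mixed if r[1].startswith("slow")])
```

In the shared case `fast-1` takes close to two seconds because it waits behind `slow-1`; once separated it takes about its own 10 ms of service time, while `slow-1` is no slower than before.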

3. Resource scheduling

[Diagram: resource levels per component; resource labels not shown]
app: Low, Low, Low
memcached: Low, High, Low
number-cruncher: High, Low, Low
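
A rough sketch of the scheduling idea: components whose heavy resources do not overlap can share a box. The Low/High profiles come from the slide, but the three resource names (`cpu`, `ram`, `io`), the numeric weights, and the per-box capacity are all assumptions made for illustration.

```python
LEVEL = {"Low": 1, "High": 3}   # hypothetical units per level
CAPACITY = 4                    # hypothetical units per resource on one box

# Profiles from the slide; the resource names are an assumption.
profiles = {
    "app":             {"cpu": "Low",  "ram": "Low",  "io": "Low"},
    "memcached":       {"cpu": "Low",  "ram": "High", "io": "Low"},
    "number-cruncher": {"cpu": "High", "ram": "Low",  "io": "Low"},
}


def fits_on_one_box(components):
    """True if combined demand stays within CAPACITY for every resource."""
    for resource in ("cpu", "ram", "io"):
        demand = sum(LEVEL[profiles[c][resource]] for c in components)
        if demand > CAPACITY:
            return False
    return True
```

Under these made-up numbers, `memcached` and `number-cruncher` fit together because one is heavy on memory and the other on CPU, whereas two number-crunchers on one box would not.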

4. Did we ruin everything?
• If your metrics were right, things are probably faster
• But they’re different
• ... and probably more complicated

• How do we keep track of it?
• Better tools

• Next month: performance and load testing with
Selenium
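
For the "check" step, one option is to compare latency samples taken before and after the change, looking at the tail as well as the median. A minimal sketch with the standard `statistics` module; the function names and the choice of p50/p95 are assumptions.

```python
import statistics


def summarize(samples_ms):
    """Return median and 95th-percentile latency for a list of samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": statistics.median(samples_ms), "p95": cuts[94]}


def compare(before_ms, after_ms):
    """Per-metric change in ms; negative means the system got faster."""
    before, after = summarize(before_ms), summarize(after_ms)
    return {metric: after[metric] - before[metric] for metric in before}
```

A change such as adding a cache can improve the median while making the tail worse (every miss now pays for the cache lookup and the original work), which is the "faster, but different" situation described above.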

Takeaways
• Hard to solve problems without understanding them at
a fundamental level
• Get data, visualize

• Machine and component metrics are key
• Sometimes they’re not enough

• Once we know a problem, there’s help
• SOA, Cache, Deferral -- complementary tools

• As web systems become more complicated, we must
use more sophisticated tools to monitor and debug
them

Thanks!
dan kuebrich
dan@tracelytics.com
