Application Performance
      Management
   “tightening up your backend”


          dan kuebrich
       dan@tracelytics.com
speed: where is it?
            DNS

            First HTTP Request         your boxes

            Subsequent HTTP Requests



DNS, connection
    Fulfill HTTP Request (“Time to first byte”)
                   Download + render page contents
                                  (+js)
Speed
What’s taking so long?
                                    33%



        ...


     Time to connect (3ms)
            Time to first byte (1.61s)
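
The two numbers on this slide can be approximated by hand. A minimal sketch, assuming a plain-HTTP (not HTTPS) server; the function name `time_request` is ours, not from the talk. The connect figure includes DNS resolution, because `socket.create_connection` resolves the host name before connecting.

```python
import socket
import time


def time_request(host, port=80, path="/"):
    """Return (connect_seconds, first_byte_seconds) for one plain-HTTP GET."""
    start = time.perf_counter()
    # DNS lookup + TCP handshake ("DNS, connection" on the slide)
    with socket.create_connection((host, port), timeout=10) as sock:
        connected = time.perf_counter()
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # blocks until the first response byte arrives
        first_byte = time.perf_counter()
    return connected - start, first_byte - start
```

Calling `time_request("example.com")` returns both timings in seconds, measured from the same starting point. Everything between the two values is time spent waiting on the backend.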
What’s taking so long?




           ?
What is in that bar?
Why you care (performance)
• Speed optimization
  • A lot on client side, but not all

• Troubleshooting
  • Service disruptions -- resolve
    ASAP

• Concurrency
  • How does it scale?

• Money
  • The purple bar is expensive.
1996
2011
It’s all about tradeoffs

                good / evil



              risk / reward



          fearlessness / sobriety
How to make decisions (ideally)
1. Decide what to measure

2. Measure, examine

3. Act

4. Check
1. What to measure
• Depends on what you’re looking for
  • Bottlenecks -- db or app server
  • Outages -- blocking on services
  • Business metrics -- SLA reports, infrastructure
    utilization

• Measure as much as possible (reasonable)
  • You’ll never have all the data you want
1. What to measure
• Depends on what you’re measuring
  • DB = i/o, slow query log, buffer cache
  • Server = fastcgi queue
  • App = cpu/network
  • Cache = ram, eviction, hits

• Tower of Babel?

• Common language: latency
  • “Profiling”
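
Latency as a common language can be sketched as one timer wrapped around every component call. This is an illustration under our own assumptions: the `timed` helper, the `latencies` table and the `time.sleep` stand-ins are hypothetical, not from the talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Seconds spent per component, e.g. {"db": [0.021], "cache": [0.001]}
latencies = defaultdict(list)


@contextmanager
def timed(component):
    """Record how long the wrapped block takes, filed under `component`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[component].append(time.perf_counter() - start)


def handle_request():
    with timed("db"):
        time.sleep(0.02)   # stand-in for a query
    with timed("cache"):
        time.sleep(0.001)  # stand-in for a cache lookup


handle_request()
for component, samples in latencies.items():
    print(f"{component}: {sum(samples) * 1000:.1f} ms over {len(samples)} call(s)")
```

Whatever each layer measures natively (queue depth, evictions, i/o), the per-request report reads in the same unit.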
2. How to measure
• Machine-level
  • Cpu, load, i/o, network

• Component-level
  • Logs, instrumentation
  • New Relic, Query Analyzer

• Request-level
  • Tracing
2. Machine metrics
• You have four basic resources
  • CPU
  • RAM
  • I/O
  • Network

• Open-source: Ganglia, Munin, Zabbix, etc.
• Commercial: CloudKick, AppFirst, Librato, etc...

• Everybody uses some form of this
  • Facebook monitors over 5 million metrics with
    Ganglia
2. Machine metrics
• Home run:
  • DB has high CPU wait
  • Requests are slow -- why?

• Falling short:
  • Low CPU usage on app and DB
  • Low disk usage on DB
  • Requests are slow -- why?
2. Component metrics
• Very heterogeneous
  • Throughput metrics
  • Error conditions
  • Profiling data

• Collect from:
  • Logs: tail -f, Splunk, Loggly, Hoptoad
  • Service calls: JMX
  • Profiling: xhprof, cProfile
  • Other: New Relic, Query Analyzers

• Basically everybody does this too in some form
2. Component metrics
• Home run:
  • Low CPU usage on app and DB
  • Low disk usage on DB
  • App instrumentation shows time spent in service
    calls
  • fastcgi queue getting deep
  • Requests are slow -- why?
2. Looking for blame

                       HELP!
            A




            B
2. Finding blame

       No, help ME!
                   A

                   B
                       results += 24;

                       do this a lot:
                           something_slow()

                       return results;
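
One way to get from "B is slow" to "this call inside B is slow" is a function-level profiler. A minimal sketch using Python's `cProfile`, which the component metrics slide names; `a`, `b` and `something_slow` mirror the slide, and their bodies are stand-ins.

```python
import cProfile
import io
import pstats
import time


def something_slow():
    time.sleep(0.01)  # stand-in for the expensive call


def b():
    results = 24
    for _ in range(5):  # "do this a lot"
        something_slow()
    return results


def a():
    return b()


profiler = cProfile.Profile()
profiler.enable()
a()
profiler.disable()

# Sort by cumulative time so the call chain a -> b -> something_slow is at the top.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report attributes time to each function, so the blame lands on `something_slow` rather than on A or B as a whole.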
2. Tracing metrics
• Profiling + flow-of-control

• Causal organization
  • Lamport’s “happens before”

• Who does this?
  • In-house solutions
     • Google, Goldman Sachs, others?
  • Open-source
     • X-Trace, Magpie
  • Commercial availability
     • DynaTrace, Tracelytics
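
The idea of profiling plus causal organization can be sketched in one process: every timed event records which event it happened inside, so the events reassemble into a tree. This is our own toy illustration, not how any of the tools above work. The `span` helper and the event tuple are hypothetical, and a real tracer would have to carry the ids with the request across process and machine boundaries.

```python
import time
from contextlib import contextmanager

events = []   # (span_id, parent_id, name, seconds), appended as each span finishes
_stack = []   # ids of the spans currently open
_next_id = 0


@contextmanager
def span(name):
    """Time a block and remember which open span caused it."""
    global _next_id
    _next_id += 1
    span_id = _next_id
    parent_id = _stack[-1] if _stack else None
    _stack.append(span_id)
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        _stack.pop()
        events.append((span_id, parent_id, name, time.perf_counter() - start))


def print_tree(parent_id=None, depth=0):
    """Print events as a tree: children are indented under their cause."""
    for span_id, parent, name, seconds in sorted(events):
        if parent == parent_id:
            print(f"{'  ' * depth}{name}: {seconds * 1000:.1f} ms")
            print_tree(span_id, depth + 1)


with span("GET /slow"):
    with span("memcached lookup"):
        time.sleep(0.001)  # stand-in
    with span("db query"):
        time.sleep(0.01)   # stand-in

print_tree()
```

The parent links are what a flat profile lacks: they say not only how long the db query took, but which request it was serving.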
3. Act
• You found your problem
  • If not, go back 20 slides and repeat...

• Infrastructure upgrades
   • More boxes, better boxes

• Redistribute work / resource scheduling
  • Service-oriented architecture (SOA)

• Do less work
  • Skip what you can, cache what you can’t

• Do work later
  • Deferred processing
3. Caching
• Store things where they can be retrieved more cheaply
  (faster)
3. C.R.E.A.M.
(Top of the list: more speed gain, more invalidations)
• Browser cache
• CDN
• Proxy / optimizer
• Opcode cache
• Application-driven
  • App-specific cache
  • ORM cache
  • Local (runtime) cache
• Database
  • Query cache
  • Denormalization
(Bottom of the list: less speed gain, fewer invalidations)
3. When to cache
• Protect resources
  • DB
  • Services

• Cover for slow actions
  • DB
  • Disk hits
  • External service calls
  • Number-crunching
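
Covering for a slow action can be sketched as a small application-driven cache with a time-to-live. This is a sketch only: the `cached` helper and the in-process dict are our assumptions, standing in for something like memcached, and `slow_query` is a placeholder for a real DB call.

```python
import time

_cache = {}  # key -> (expires_at, value)


def cached(key, compute, ttl_seconds=60.0):
    """Return the cached value for `key`; call `compute()` on a miss or after expiry."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]
    value = compute()
    _cache[key] = (now + ttl_seconds, value)
    return value


calls = 0


def slow_query():
    """Stand-in for a slow DB query; counts how often it really runs."""
    global calls
    calls += 1
    time.sleep(0.01)
    return ["row1", "row2"]


first = cached("front_page", slow_query)
second = cached("front_page", slow_query)  # served from the cache
print(f"slow_query ran {calls} time(s)")
```

The TTL is the crude answer to invalidation: stale data is tolerated for at most `ttl_seconds`, in exchange for protecting the DB.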
3. Deferred work
• Premise: synchronous work is lame
  • Go async!

• Mechanism: queue
  • RabbitMQ, 0MQ, ActiveMQ, Amazon SQS




                               Q


                app servers              workers/hadoop/??

                              db/cache
3. When to queue
• Actions you can decouple from that page load
  • Things that don’t have to update in real-time
    • Counter updates (queue and aggregate)
  • External API calls
  • Long-running requests (ajax)
    • Batch processing
    • Shell commands
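
The "queue and aggregate" counter case might look like the following. It is a single-process sketch: Python's `queue.Queue` stands in for a real broker such as RabbitMQ or SQS, and `handle_request` and `worker` are hypothetical names.

```python
import queue
import threading
from collections import Counter

page_views = queue.Queue()  # stand-in for the message queue
totals = Counter()          # aggregated counts, written out in bulk later


def worker():
    """Drain the queue and aggregate, off the request path."""
    while True:
        page = page_views.get()
        if page is None:  # sentinel: no more work
            break
        totals[page] += 1


def handle_request(page):
    page_views.put(page)  # cheap enqueue instead of a DB write per hit
    return f"rendered {page}"


thread = threading.Thread(target=worker)
thread.start()

for page in ["/", "/about", "/"]:
    handle_request(page)

page_views.put(None)
thread.join()
print(dict(totals))
```

The page load pays only for the enqueue; the worker can then flush one aggregated update per page instead of one write per view.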
3. Redistribute work
• Service-oriented architecture
  • Reusable components
  • Co-tenable components




                          app
3. SOA
• We’ve got two pages on our website and one box
  serving it

       def fast_action():      def slow_action():
         x *= y                  x = compute()
         render('fast.tpl')      render('slow.tpl')

• Problem?
  • Slow actions starve fast actions!
  • How to remedy?
3. SOA
• Take 1: buy more servers
  • But if anyone calls slow action on one, we lose
  • All servers must be able to handle slow_action’s
    workload

• Take 2: pull out slow action
  def fast_action():             def slow_action():
    x *= y                         x = remote_compute()
    render('fast.tpl')             render('slow.tpl')

• Who does this????
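
The effect of pulling the slow action out can be imitated in one process by giving slow work its own pool of workers. This is an analogy under our own assumptions, not the talk's implementation: the two thread pools stand in for separate boxes, and in the slide's version `remote_compute()` would be a network call to a dedicated service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

fast_pool = ThreadPoolExecutor(max_workers=2)  # "web" capacity
slow_pool = ThreadPoolExecutor(max_workers=2)  # dedicated capacity for slow work


def fast_action():
    return "fast.tpl"


def slow_action():
    time.sleep(0.2)  # stand-in for compute()
    return "slow.tpl"


# Pile up more slow requests than the slow pool can run at once.
slow_jobs = [slow_pool.submit(slow_action) for _ in range(6)]

# A fast request still gets a worker immediately.
start = time.perf_counter()
fast_result = fast_pool.submit(fast_action).result()
fast_latency = time.perf_counter() - start
slow_still_running = not all(job.done() for job in slow_jobs)

print(f"{fast_result} in {fast_latency * 1000:.1f} ms, slow backlog pending: {slow_still_running}")

slow_results = [job.result() for job in slow_jobs]
fast_pool.shutdown()
slow_pool.shutdown()
```

With a single shared pool of two workers, the fast request would have queued behind the slow backlog, which is the starvation the previous slide describes.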
3. Resource scheduling

                         Low
                         Low
                         Low
                   app


     Low                        High
    High                        Low
     Low                        Low
       memcached    number-cruncher
4. Did we ruin everything?
• If your metrics were right, things are probably faster
   • But they’re different
   • ... and probably more complicated

• How do we keep track of it?
  • Better tools

• Next month: performance and load testing with
  Selenium
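
The check step comes down to comparing the same metric before and after the change. A minimal sketch, assuming request latencies were logged in milliseconds; the sample lists below are synthetic placeholders, not measurements from the talk, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(samples_ms):
    """Median, 95th percentile and worst case for a list of latencies."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples_ms)}


# Synthetic latencies (ms) for illustration only.
before_ms = [120, 130, 125, 900, 140, 135, 128, 1610, 122, 131]
after_ms = [118, 121, 119, 300, 125, 123, 120, 410, 117, 122]

before = summarize(before_ms)
after = summarize(after_ms)
for name in ("p50", "p95", "max"):
    print(f"{name}: {before[name]:.1f} ms -> {after[name]:.1f} ms")
```

Looking at the tail (p95, max) as well as the median matters here, because caching and queueing often change the slow requests far more than the typical one.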
Takeaways
• Hard to solve problems without understanding them at
  a fundamental level
  • Get data, visualize

• Machine and component metrics are key
  • Sometimes they’re not enough

• Once we know a problem, there’s help
  • SOA, Cache, Deferral -- complementary tools

• As web systems become more complicated, we must
  use more sophisticated tools to monitor and debug
  them
Thanks!
    dan kuebrich
 dan@tracelytics.com

Editor's notes

  • split this into 4 pages with more info. so, it’s important, but I’d argue that it’s also particularly interesting, and only getting more so
  • there’s a lot -- too much. here’s a little bit
  • I took a class called making decisions... it’s about measuring, then optimizing. can’t tell you how to act -- that’s too specific. but can tell you how to decide how to act
  • todo: replace with helloapp?