Performance problems are one of the most cited concerns about moving to the cloud. But is it really the cloud or the application? What does performance mean anyway when you can scale to thousands of servers? This session will discuss why traditional means of performance management and troubleshooting no longer work and how this affects everything. Most importantly, we will look at how to identify the root cause of performance problems in such dynamic environments. Finally, we will explain how to assess and manage performance when capacity is no longer the issue.
2. What are the risks of moving to the cloud?
IDC Survey (Q4 '09), "The Maturing Cloud: What It Will Take to Win" (published March 2010):

Perception | Primary Benefits      | Biggest Issues
Before     | Reduced IT costs      | Security
After      | Scalability, Agility  | Performance, SLA Management

What are the major risks in the Cloud?
• Security – 87.5%
• Availability – 83.3%
• Performance – 82.9%
(88.6% stated that cloud service providers need to provide SLAs)

"All About The Cloud" Conference (May 2010):
"Security in the Cloud isn't any harder than it is in the Enterprise – it's just different" (Unisys)
"[Application] Performance Management in the Cloud is becoming the hot topic" (THINKstrategies)

Results from actual pilots (March 2010):
Projects fail to deliver acceptable performance
Moving Legacy Applications is harder than thought
5. How do we measure Performance
Response Time
Transaction-Level Metric
Don't use averages: high volatility
Be specific: which type of transaction
Throughput
Volume of Transactions per Timeframe
Average Speed of Transactions
Be specific: which type of transactions
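The advice above, avoid averages and report per transaction type, can be sketched with a small percentile helper. The transaction names and timings below are invented purely for illustration:

```python
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile of a list of response times (seconds)."""
    ordered = sorted(values)
    idx = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[idx]

# Hypothetical measurements: (transaction type, response time in seconds)
samples = [
    ("search", 0.2), ("search", 0.3), ("search", 4.0),   # one outlier
    ("checkout", 0.8), ("checkout", 0.9), ("checkout", 1.1),
]

by_type = defaultdict(list)
for name, seconds in samples:
    by_type[name].append(seconds)

for name, times in by_type.items():
    avg = sum(times) / len(times)
    p90 = percentile(times, 90)
    print(f"{name}: avg={avg:.2f}s p90={p90:.2f}s")
```

For "search", the average (1.50s) hides that most requests are fast while the 90th percentile (4.00s) exposes the outlier; and aggregating "search" with "checkout" would blur both.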
6. What does Scalability mean?
More concurrent Transactions with the same response time
Linearly growing Throughput with linearly more hardware
Scalability depends on Performance
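The linear-growth criterion above can be made concrete with a small check: throughput per node should stay roughly constant as nodes are added. The function name and load-test numbers here are illustrative assumptions:

```python
def scaling_efficiency(baseline_nodes, baseline_tps, nodes, tps):
    """Fraction of ideal linear scaling achieved (1.0 = perfectly linear)."""
    ideal_tps = baseline_tps * (nodes / baseline_nodes)
    return tps / ideal_tps

# Hypothetical load-test result: doubling nodes from 2 to 4
# only raised throughput from 1000 to 1700 tps
eff = scaling_efficiency(baseline_nodes=2, baseline_tps=1000, nodes=4, tps=1700)
print(f"scaling efficiency: {eff:.0%}")  # 1700 of an ideal 2000 tps
```

An efficiency well below 1.0, as here, means the application does not scale linearly; the congestion may sit in the application or in the cloud services underneath it.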
7. Performance in the cloud
“Pure Performance” is never better in a Cloud!
Co-tenancy
Resource sharing
Commodity and generally smaller hardware
Scalability can be better in the Cloud
Rapid elasticity
Depends on Application Design and Performance
Legacy Applications have limitations
End User Performance depends on both and more
Web Delivery Chain
Network!
Can be better than on premise!
9. Traditional Performance Management - Fails
Sniffing and other appliances do not work:
They are based on system metrics, which are corrupted
They do not answer application performance questions
They are not manageable: too many unrelated metrics
They do not deal well with the exponential increase in complexity
10. Why is Cloud Monitoring not enough?
Only System and High Level Response Metrics
No Visibility into Application
(Regressions, MTTR, Application Dependencies)
No Visibility into End User Impact and Business Impact
We need Application Focus
11. What we really care about
Availability and Baseline Performance
End User Performance
Detailed Contribution Times
[Diagram: Web 2.0 client → Load Balancer → WebServer → Frontend(s) → Backend(s) → Private Datacenter]
12. Key Challenge - Volatility
Real vs. Measured
Performance ≠ F(Capacity)
[Chart: measured utilization over time, fluctuating between 0 and 60%]
18. Cloud Designs are simple, yet…
Everything Fails!
Tightly couple End User Delivery Components
Few Tiers
Response Time
Scale Upfront for 100,000s of users
19. Cloud Designs are simple, yet…
Loosely Couple everything else
Throughput
Scale everything independently
Simple Designs still lead to Complex Systems
Complex Systems are hard to manage
21. Context matters
Too much Aggregation will blur the picture
Buying Books
Buying DVDs
Buying Clothes
Context matters!
22. Measure what Matters
The Application and its Business Transactions
Measure End User Performance
Measure Throughput on Transaction Type Level
How Performance affects your business
e.g. Conversion Rate
SLA Window
Cost vs. Gain
Prioritize based on what matters most
23. Identify cause of End User Impact
Flow of a single Transaction
Response Time Hotspots
27. We want to scale the Application and not the Cloud
Auto Scaling on System metrics
Is indirect and not goal-oriented
Fails when the application changes
Scale on Application Metrics and Application Components
Transaction Load
Response Time Contribution and Trend
Throughput Goals
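The rule above, scale on application metrics rather than system metrics, can be sketched as a decision function driven by transaction load and response-time trend. The function name, thresholds, and parameters are illustrative assumptions, not any provider's API:

```python
def scaling_decision(tx_load, p90_response, p90_target, trend_slope):
    """Decide a scaling action from application metrics.

    tx_load      -- transactions/sec currently served
    p90_response -- 90th percentile response time (seconds)
    p90_target   -- response-time goal, e.g. from the SLA (seconds)
    trend_slope  -- change in p90 per minute (positive = degrading)
    """
    if p90_response > p90_target and trend_slope > 0:
        return "scale-out"      # goal missed and getting worse
    if p90_response < 0.5 * p90_target and tx_load > 0:
        return "scale-in"       # far under goal: shed cost
    return "hold"

# Illustrative reading: 400 tps, p90 at 2.4s against a 2.0s goal, trending up
print(scaling_decision(tx_load=400, p90_response=2.4, p90_target=2.0,
                       trend_slope=0.1))
```

Because the inputs are transaction-level goals rather than CPU percentages, the decision survives application changes that would invalidate a fixed system-metric threshold.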
29. Understand your Flow
Understand the Application Flow
Always Capture Performance Data
Everything is transitory
Reproducing problems is hard
Analyze offline
Identify Contributors
31. Reacting Automatically to Issues
Disk Latency Degradation
Too much steal time
Hardware Issues
Detect “Application” Degradation
Terminate!
And start a new one
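The react-automatically pattern above (detect degradation, terminate, start fresh) can be sketched as a control step. The thresholds are invented, and `terminate`/`launch` are injected placeholders standing in for a real cloud provider SDK:

```python
def is_degraded(metrics, max_disk_latency_ms=50, max_steal_pct=10):
    """Detect the degradation symptoms named on the slide."""
    return (metrics["disk_latency_ms"] > max_disk_latency_ms
            or metrics["steal_time_pct"] > max_steal_pct)

def remediate(instance_id, metrics, terminate, launch):
    """Terminate a degraded instance and start a replacement.

    terminate/launch are callables standing in for a cloud API;
    they are assumptions, not a specific provider's interface.
    """
    if is_degraded(metrics):
        terminate(instance_id)
        return launch()  # id of the replacement instance
    return instance_id

# Illustrative run with fake provider calls
new_id = remediate(
    "i-123",
    {"disk_latency_ms": 80, "steal_time_pct": 4},
    terminate=lambda i: print(f"terminating {i}"),
    launch=lambda: "i-456",
)
print(new_id)
```

The point of the slide is that the trigger is an application-visible symptom (disk latency, steal time), not just a dead health check; a degraded instance is cheaper to replace than to diagnose in place.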
32. Make sure you are not blind
Application Monitoring must be highly available
Outside and Inside
Failover, not in the same zone
Automated Deployments
Zero Configuration Monitoring
34. What is the goal?
Performance and Scalability are not self-serving
“Desired” End User Experience
Faster than that is not better
Using fewer resources is cheaper!
35. A Price Performance Index
Dollar Value for acceptable Performance:
90th percentile response time / (Total Cost / Number of Transactions)
Desired Throughput / Total Cost
Mind Volatility
Price Performance Index is comparable
Cost Scalability
Cost per Transaction must remain stable
Performance is not based on Capacity
It is a function of desired User
Experience and associated Cost
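A worked example of the index defined on slide 35. All dollar figures and volumes below are invented for illustration:

```python
def price_performance(total_cost, transactions, p90_seconds, throughput_tps):
    """Price/performance figures per slide 35; inputs are assumptions."""
    cost_per_tx = total_cost / transactions
    # Slide's index: 90th percentile response time over cost per transaction
    response_index = p90_seconds / cost_per_tx
    throughput_per_dollar = throughput_tps / total_cost
    return cost_per_tx, response_index, throughput_per_dollar

# Hypothetical month: $3,000 bill, 6M transactions, 1.2s p90, 250 tps
cost_per_tx, idx, tp = price_performance(3000.0, 6_000_000, 1.2, 250.0)
print(f"cost/tx=${cost_per_tx:.4f} index={idx:.0f} throughput/$={tp:.3f} tps")
```

Because every term is normalized by cost, the resulting figures are comparable across cloud vendors and against an on-premise deployment, which is exactly what the slide argues; the cost per transaction should also stay stable as load grows (cost scalability).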
Surveys find that performance concerns about the cloud are rising (http://callcenterinfo.tmcnet.com/analysis/articles/149923-survey-finds-cloud-application-performance-concern-delaying-adoption.htm): cloud adoption is being delayed due to perceived and measured bad performance. When we look at it, however, the real problem is not the bad performance itself, but that it is not understood what to do in such a case.

- Cloud provider SLAs are purely on availability metrics, and there mostly on the availability of their APIs, not the instances themselves.
- There are no SLAs on actually provided capacity, nor reports on actually consumed capacity.

To make matters worse, due to the technology itself, traditional APM tools fail to deliver these metrics, so the cloud customer is left in the lurch. Is it the Cloud or is it the Application? Or both? Or neither?

So the first thing we need to solve the cloud performance concern is the ability to measure our application and identify the root cause of performance issues, be it the cloud, a third-party service, the application itself, or something further up the delivery chain.

That, however, brings up a far more important question: what does performance mean? And here it can be said that the term performance does not actually change in the cloud. If we define performance as pure speed, then it is independent of the cloud; it does not matter how many instances we have. Speed is defined by the response time of a single transaction under defined circumstances. To keep things simple, let's define performance as the speed of a single transaction when there is nothing else going on.

Raw speed can be impacted by cloud hardware, services, and everything else. While we can measure that by looking at things like node response time, the only way to analyze it is to get visibility into the transaction. Then we see whether it is the application that is slow, squandering resources, waiting for resources, or simply not getting enough CPU.
The beauty is that this can now be compared with speed on premise in a similar distributed setup. A comparison will show the differences, and while we can never analyze cloud issues on premise, we can understand where the cloud has an impact compared to on premise. And we can identify these issues even if we don't compare.

Now about scalability; this is the main case for the cloud. Scalability defines how many parallel transactions can be served without degradation of response time, or, if we talk batch or transaction processing, how throughput increases when adding another node. Now, if "performance" goes down under load, we scale up. If performance is then satisfactory again, we say it scales. If performance goes down although we add resources, then it does not scale. And if we need to add three times the resources for twice the load, it might scale, but not very well.

The important thing to understand is that these kinds of scalability issues can again lie in either the application or the cloud. Only here it will most likely not be a matter of CPU or disk; the most likely congestion will happen in cloud services and the network. And again we see why the currently offered cloud monitoring is not enough to help. While we might be able to see the slowdown of a service under load, we will not see whether it is uniformly slower or only slower for certain requests, so we do not see whether it is really the load that is the problem. The same is true for the network. And for the application itself, it is even worse if we can't look inside.

Scaling on application metrics: understand application impact and business impact. In order to solve this, we must again look inside the application. What's more, we need to understand what the application is doing, which different transactions are doing what, and how they might affect each other. In reality, it is not so much different from an on-premise installation.
But with many more moving parts. However, with proper tools we can master this challenge. Now that we can measure, understand, and diagnose our applications in the cloud, we can also finally understand what performance means in the cloud, or more precisely, how the performance and scalability of our application differ there. We can now define what performance in the cloud means: it means Response Time/$ or Throughput/$. In this scenario, the response time or throughput is something that you define and measure. Once you achieve this, then in the cloud of your choice performance is no longer a "concern". More importantly, this kind of price-performance index allows you not only to compare the cloud against on premise, it allows you to compare cloud vendors to each other!
A common misconception is that scalability takes care of performance. That is not true. Performance is about the speed of a single transaction, or throughput at a given size. Scalability is about getting the same speed with more transactions and more nodes; it is about doubling throughput when doubling the size. This actually means that an application needs to perform in order to scale!
First, a cloud built on shared resources can never perform better than a dedicated environment. But that is not even the question. The real question is:
End User Performance equals Pure Performance + Scalability
Profilers will not work, and cloud monitoring is not application monitoring. Application monitoring in its traditional sense only tells us when something is slow, but not why. This is important because we cannot replicate the problem in a normal environment, and we need to understand it fast: tomorrow we will deploy again, and new changes will make analysis all the harder and might add new problems. On the other hand, if we find it fast, we have the chance of fixing and improving tomorrow without changing our schedule.
As we have seen, even the real utilization cannot tell us about performance. Time is relative. Utilization in the guest is useless. Utilization on the host does not allow us to infer performance. Thresholds cannot be managed. Performance cannot be inferred from resource usage.
This can and should be measured outside the cloud. We can do this via synthetic transaction monitoring, which gives us a good feel for the baseline performance and for overall degradations. Of course, we need to be sure we do this from the most important locations in the world to take backbones into account. Another way of doing this is even closer to the user, called RUM or UEM: measuring the response time directly in the customer's browser via injected JavaScript agents.
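A minimal synthetic-monitoring probe of the kind described above might look like the following sketch. The probe function, baseline, and degradation factor are all assumptions; a real check would issue an HTTP request from each monitored location rather than the placeholder used here:

```python
import time

def synthetic_check(transaction, baseline_s, factor=2.0):
    """Run one synthetic transaction and compare against a baseline.

    transaction -- zero-argument callable performing the request
    baseline_s  -- previously measured baseline response time (seconds)
    factor      -- how far above baseline counts as degraded
    Returns (elapsed seconds, degraded?).
    """
    start = time.monotonic()
    transaction()
    elapsed = time.monotonic() - start
    return elapsed, elapsed > factor * baseline_s

def fetch_home():
    # Placeholder probe; a real check would hit the monitored URL,
    # e.g. urllib.request.urlopen("https://example.com").read()
    time.sleep(0.01)

elapsed, degraded = synthetic_check(fetch_home, baseline_s=0.5)
print(f"{elapsed:.3f}s degraded={degraded}")
```

Running the same probe from several locations separates backbone and delivery-chain effects from application slowness, which is exactly why the note stresses measuring from the most important locations in the world.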
PurePath
If you don’t see anything here, then you really don’t care about it.
One general and one detailed transaction flow with database impact; about Business Transactions.
CPU usage on the Web Server is the cause of the volatility here. This is real usage, not a percentage, which means it is really an application issue. If, on the other hand, we saw wait or I/O growing, then it might well be virtualization that causes the volatility. This is of course only a high-level picture, but I think you get the idea.
Scalability comes before performance in the cloud. Or, to be more specific, scalability trumps resource usage. We used to make a tradeoff between scalability and resource usage like CPU, memory, or disk. That does not hold true in a cloud: we have CPU, we have memory, we have disk. The things that are still limiting factors are the network and the database, and that needs to be taken care of in the design. We can remove sync points in the database with NoSQL and data denormalization. We can take care of the network by using multiple zones, clouds, and CDNs to some degree; but to a larger degree, bandwidth needs to be taken care of in the design.

All of that makes our application more scalable. The downside is that it makes it harder to understand single transactions, harder to monitor, and harder to analyze. And of course, once we have an application, finding scalability issues is not easy, and cloud sizing makes it all the harder.