4. Why benchmark?
How long will the current configuration be adequate?
Will this platform provide adequate performance, now and in the future?
For a specific workload, how does one
platform compare to another?
What configuration (infrastructure and application)
will it take to meet current needs?
What size instance will provide the best
cost/performance for my application?
How will the application running in my
datacenter perform in the cloud?
Are the changes being made to a system going
to have the intended impact on the system?
5. Why can’t these questions be answered?*
• How many users does Drupal
support?
• How much memory does MySQL
require?
• What is the overhead of using
Flash?
• How many requests per second
can Apache handle?
• What instance type will it take to
support 1000 unique users on AWS
running Drupal?
*without clarification
6. Benchmarking is not easy on-premises
It takes time to obtain and build test
configurations
7. Benchmarking is not easy…
Buying the latest equipment each time gets
expensive
8. Benchmarking is not easy…
Generating large-scale load requires huge
temporary spikes in capacity
9. Benchmarking is not easy…
Building up and tearing down test
configurations can be very labor intensive
10. Benchmarking in AWS is fast…
Benchmarking in AWS is fast with parallel
execution
15. The Benchmark Lifecycle
Start with a goal, then cycle:
• Define your workload
• Test Design
• Test Configuration
• Generate Load
• Test Execution – run a series of controlled experiments, carefully controlling changes
• Test Analysis
• Measure against goal
• Report
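The lifecycle above can be sketched as a loop of controlled experiments: pick a goal, change one configuration parameter at a time, run repeated trials, and measure against the goal. All names below (`run_trial`, `goal_ms`, the cache-size parameter) are illustrative stand-ins, not part of the talk.

```python
import statistics

def run_trial(cache_size_mb):
    # Stand-in measurement; replace with your real workload.
    return 120.0 / (1 + cache_size_mb / 64.0)  # simulated latency in ms

def experiment(cache_size_mb, trials=5):
    # One controlled experiment: fixed configuration, repeated trials.
    samples = [run_trial(cache_size_mb) for _ in range(trials)]
    return {"config": cache_size_mb,
            "median_ms": statistics.median(samples)}

goal_ms = 60.0  # the goal you started with
# Carefully controlled changes: vary exactly one parameter per experiment.
results = [experiment(size) for size in (64, 128, 256)]
meets_goal = [r for r in results if r["median_ms"] <= goal_ms]
```

Stopping when the goal is met (and looking for bottlenecks when it is not) is the "Measure against goal" step of the cycle.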
16. 3 ways to use benchmarks
1. Design and run a benchmark from your existing
application and workloads
2. Run a standard benchmark
3. Use published benchmark results
17. 1. Benchmark your application
• Choose which parts of the application to test and in what
combinations (workloads)
• Determine how to generate load and how much of it
• Decide how to measure and what metrics
• Design how reports are generated and what they contain
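A minimal sketch of those bullets: generate load against one chosen part of the application, decide how much load (requests and concurrency), and pick the metrics to report. `handle_request` is a hypothetical stand-in for the code path under test.

```python
import time
import statistics
import concurrent.futures

def handle_request():
    # Stand-in for the application code path being benchmarked.
    time.sleep(0.001)
    return True

def run_load(n_requests=200, concurrency=8):
    # Decide how much load: n_requests total, `concurrency` workers.
    def timed_call(_):
        start = time.perf_counter()
        handle_request()
        return time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(n_requests)))

    # Decide what metrics go into the report: here, p50 and p95 latency.
    return {"p50": latencies[len(latencies) // 2],
            "p95": latencies[int(len(latencies) * 0.95)],
            "requests": len(latencies)}

report = run_load()
```

A real harness would aim load at the workload combinations chosen in the first bullet, but the shape (generate, measure, report) stays the same.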
19. 2. Run a standard benchmark
• Lots of work is already done:
– Workloads defined
– Load generation defined
– Measurement defined
– Reports defined
• Some tuning is needed to build and run
• Run controlled tests and automate for repetition
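"Automate for repetition" can be as simple as wrapping any benchmark run in N automated repetitions and checking run-to-run spread, since a single run proves little. `run_benchmark_once` below is a hypothetical stand-in; a real harness would shell out to the benchmark binary.

```python
import statistics

def run_benchmark_once(run_index):
    # Stand-in: replace with a subprocess call to your benchmark harness.
    return 1000.0 + (run_index % 3) * 5.0  # simulated ops/sec

def repeated_runs(n=6):
    scores = [run_benchmark_once(i) for i in range(n)]
    mean = statistics.mean(scores)
    # Relative spread: how far apart the best and worst runs are.
    spread = (max(scores) - min(scores)) / mean
    return {"mean": mean, "relative_spread": spread, "runs": scores}

summary = repeated_runs()
repeatable = summary["relative_spread"] < 0.05  # e.g. runs agree within 5%
```

If the spread is large, fix the test environment before trusting any single number.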
20. 2. Run a standard benchmark
Is the test relevant to your requirements?
How does the test map to your application?
21. 2. Standard benchmark: example
Testing DynamoDB
– Before shipping DynamoDB, benchmarks were run
to verify latency and scale
– Short window for testing, so the Yahoo! Cloud Serving Benchmark (YCSB) was selected to run scaling tests
• Multiple parallel tests set up to find optimal test
configuration
• Multiple DynamoDB databases provisioned and tests
run in parallel
• DynamoDB server scaling and latency validated
• A number of client-side issues found and fixed
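The slide's approach – many parallel tests to find a good test configuration – can be sketched as below. `run_ycsb` is a hypothetical stand-in: in the real setup each call would launch a YCSB client against its own provisioned DynamoDB table and parse YCSB's output.

```python
import concurrent.futures

# Candidate test configurations to explore in parallel.
CONFIGS = [{"threads": t, "target_rps": r}
           for t in (16, 32, 64) for r in (10_000, 50_000)]

def run_ycsb(config):
    # Stand-in result model: throughput is capped either by the
    # requested target or by client thread capacity.
    achieved = min(config["target_rps"], config["threads"] * 1000)
    return {**config, "achieved_rps": achieved}

# Run every configuration concurrently, as the slide describes.
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(run_ycsb, CONFIGS))

best = max(results, key=lambda r: r["achieved_rps"])
```

Note that the capacity cap in the stand-in mirrors the talk's finding: the client side (logging, connection handling) can be the bottleneck long before the server is.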
23. 3. Use published benchmark results
Similar to running standard benchmarks but
more …
Picture source: http://www.nzei.org.nz/
24. 3. Reading and interpreting a benchmark report
1. What is being measured?
2. Why is it being measured?
3. How is it being measured?
4. How closely does this benchmark
resemble my results?
5. How accurate are the reports and
citations?
6. Are the results repeatable?
26. Cloud Tip: The 4 Rs
– Relevant – the best test is based on your application
– Recent – out-of-date results are rarely useful
– Repeatable – is there enough information to repeat the test? (cold fusion, anyone?)
– Reliable – do you trust the tools, the publisher, and the results?
28. Example: dissecting a benchmark report
• Mistakes in test design
– CPU tests with vastly different instance types
– The “5X” claim comes from comparing Y.Instance5 against X.Instance1

Instance     Cores
X.Instance1  1
X.Instance2  2
X.Instance3  2
X.Instance4  4
X.Instance5  2
X.Instance6  8
X.Instance7  4
X.Instance8  8
Y.Instance1  4
Y.Instance2  4
Y.Instance3  4
Y.Instance4  4
Y.Instance5  4
29. Example: dissecting a benchmark report
• Mistakes in test configuration
– Tests for vendor Y were run on Ubuntu 10.04
– Tests for vendor X were run on CentOS 5.4
30. Example: dissecting a benchmark report
• Mistakes in test analysis
– Report spreadsheet contained several critical
errors
31. Example: dissecting a benchmark report
• Mistakes in test analysis
– The spreadsheet containing the data used to produce the reports contained several critical errors (the slide shows the corrected figures)
32. Example: dissecting a benchmark report
• What the data should have looked like:
– CPU performance (higher is better):
– X.Instance7 is 1.9 times better than
Y.Instance5
33. Example: dissecting a benchmark report
• What the report should have looked like:
– Cost/performance (lower is better)
– X.Instance7 is 2.13 times better than
Y.Instance5
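Cost/performance as used on the slide is hourly price divided by benchmark score (lower is better). The prices and scores below are made up for illustration only – the slide does not publish its underlying data.

```python
def cost_performance(price_per_hour, score):
    # Dollars spent per unit of benchmark performance; lower is better.
    return price_per_hour / score

instances = {
    "X.Instance7": {"price": 0.68, "score": 190.0},  # hypothetical numbers
    "Y.Instance5": {"price": 0.85, "score": 100.0},  # hypothetical numbers
}

ratios = {name: cost_performance(d["price"], d["score"])
          for name, d in instances.items()}

# How many times cheaper per unit of performance X.Instance7 is here.
advantage = ratios["Y.Instance5"] / ratios["X.Instance7"]
```

The point of the slide stands regardless of the exact numbers: a raw-performance ratio and a cost/performance ratio can rank the same two instances differently, so a report should state which one it is claiming.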
34. Interesting Reads
Questions to Ask About Benchmark Studies
1. What is the claim?
2. What is the claimed measurement?
3. What is the actual measurement?
4. Is it an apples-to-apples comparison?
5. Is the playing field level?
6. Was the data reported accurately?
7. Does it matter to you?
Source: http://blog.cloudharmony.com/2011/11/many-are-skeptical-of-claims-that.html
35. Not all benchmark reports are bad…
Benchmarking High Performance I/O with SSD for Cassandra on AWS
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
Benchmarking Cassandra Scalability on AWS - Over a million writes per second
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
36. Benchmarking in the Cloud - Summary
1. Benchmarking on premises is hard
2. AWS is a great place to benchmark
3. The best benchmark is your application
4. Run standard benchmarks with controlled and
repeatable tests
5. Be a careful consumer of published benchmark reports
Of course, everything on the internet is true….
37. Thank you!
Robert Barnes
rabarnes@amazon.com
Editor’s notes
Two versions of the term’s origin: a surveyor’s mark, or the cobbler’s bench – the term “benchmarking” was first used by cobblers measuring people’s feet for shoes; they would place someone’s foot on a “bench” and mark it out to make the pattern for the shoes. Benchmarking measures performance using a specific indicator (cost per unit of measure, productivity per unit of measure, cycle time of x per unit of measure, or defects per unit of measure), resulting in a metric of performance that is then compared to others. There should always be a goal or reason to benchmark – you measure in order to prove something works, or to determine whether it can work.
Consumers are often faced with the challenge of choosing between multiple similar offerings when shopping for goods or services. There is rarely a single measure such as cost or size that makes selecting the best offering simple. For example, when shopping for a car, many people use gas mileage as one of the selection criteria to narrow the set of cars to consider for purchase. In the United States, the Environmental Protection Agency (EPA) dictates precisely how an automobile manufacturer needs to test and report gas mileage. Defining a useful measurement to fairly compare competing products and/or services requires careful planning and can be quite complex to define and execute. Continuing with the EPA mileage example: the 2007 document detailing updates to gas mileage test and reporting methodology was 19 pages long, and the technical support document detailing testing and reporting was 179 pages in length. Why so much detail? Being very prescriptive about how to measure and how to report fuel mileage helps ensure that comparisons of any two vehicles end up being “apples to apples” comparisons, but it entails excruciating levels of detail.
The importance of benchmarking (decision making): the cost of fixing performance problems increases with the stage of development. The later in the software lifecycle you attempt to fix a problem, the more it will cost to fix it.
Benchmarks require running multiple experiments to get reliable results. With the cloud, you can run multiple experiments in parallel and significantly reduce the time it takes to get results. Deploying new configurations can be fully automated and done in minutes. When you are done, you can save results to S3 and tear everything down. The beauty of the cloud is that you pay only for what you use. Running a benchmark to validate your use case is not only cost-effective but also quick, since you don’t have to wait months to procure, assemble, and configure test resources. Typically, it is possible to run benchmark tests that last a few hours and cost a few dollars. See how Netflix was able to run a benchmark that involved 96 EC2 instances in each of 3 availability zones (3.3 million writes per second), costing them a few hundred dollars and a couple of hours. Moreover, unlike traditional datacenter or on-premises benchmarking, you don’t have to wait long for systems to be configured, nor ask for permission to execute these tests. You can run as many tests as you like, as many times as you like, any time you like. You have the flexibility to decide the scale of your tests and are not limited to a small number of fixed resources. http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
You can deploy and tear down configurations rapidly and you only pay for what you use. Generating load can be done with many small instances or a handful of very large instances.
You can rapidly grow and shrink the scale of benchmarks and only pay for what you use. The cloud is highly cost-effective because you can turn it off and stop paying for it when you don’t need it or your users are not accessing it. Build websites that sleep at night.
AWS provides APIs so the entire benchmark lifecycle can be automated.
When you use your own application, you have experiments and results that require the least amount of extrapolation to get reasonable answers. When you run standardized benchmarks, you have to figure out how the test design and configuration relate to your application, but you still control how the test is run and how the results are analyzed. When you use published benchmark results, you have to figure out how the test design, configuration, and execution relate to your application; in some cases, published numbers have to conform to strict reporting standards so that analysis is possible. A disciplined process: start with a goal; use a thoroughly defined scenario; run a series of controlled experiments; take careful notes; carefully control changes; always measure with your goal in mind; stop when you meet your goal, and look for bottlenecks when you don’t.
Anecdote about a major web site failing and using benchmarking to figure out why and how to fix it. After taking over engineering for a major customer-facing web portal, the site started failing under annual peak load. The team had not tested performance for the previous two releases (no workloads defined). There was no dedicated performance-test configuration and no spare hardware available for testing. There were no test programs to generate load. The application did not have enough instrumentation to understand what was failing. After trial and error (and patching to add instrumentation), the team built up test capability and began testing. It would have been great to use AWS to spin up a test cluster quickly to reproduce failures and test proposed fixes.
GitHub repo for YCSB: https://github.com/brianfrankcooper/YCSB
Wiki for YCSB: https://github.com/brianfrankcooper/YCSB/wiki
Tar for YCSB: https://github.com/downloads/brianfrankcooper/YCSB/ycsb-0.1.4.tar.gz
Before DynamoDB launched, we wanted to make sure we had the scalability we promised. We built a DynamoDB plug-in for YCSB to test scale up to 100,000 requests per second, and ran many experiments in parallel to get results quickly. We found a number of areas to improve in the AWS (client) toolkit: the logging level was too high, session cache improvements, and a session token throttle conflict between the YCSB framework thread-connection model and optimal DynamoDB connection management.
Session token throttling – customer impact: multi-threaded clients will receive throttle messages well below provisioned DynamoDB throughput levels. DynamoDB is one of the first services to use STS, and this issue can happen for any service using STS, i.e. any service that does not have a concept of provisioned throughput would also receive this throttling message. The SDK has released a fix for this problem.
Default SDK logging level – the default logging level for DynamoDB was “INFO”, and this level included output for every request and response. Customer impact: the default verbose logging level is a performance bottleneck for multi-threaded clients at scale. Before the fix, maximum throughput for a single JVM was 7K reads/second; after the fix, maximum throughput for a JVM was over 15K reads/second. Resolution: the SDK made request logging “DEBUG” level for DynamoDB.
SDK HTTP connection recycling – the SDK contains code that periodically harvests unused HTTP connections. Customer impact: since HTTP connections include authentication for DynamoDB, new connections are expensive, and the cost of finding and killing connections (while locking the connection pool) affects scalability. A prototype of the SDK in which connections were not killed improved performance by 20 to 25% at scale (some tests demonstrated over 2.5X improvement in throughput with this change).