This is the presentation given for the Docker Meetup in Cordoba, Argentina. Recording should soon be up on http://www.meetup.com/Docker-Cordoba-ARG/events/226995018/
Key Takeaways: Pick your Metrics! Automate It! Fail Bad Builds Faster! Deliver Faster with Better Quality!
To the Docker audience my main point was that just adding Docker doesn't give you free performance and scalability for your app. I walk through many examples of failing apps, the metrics that highlight the problems, and how to automatically detect bad builds by looking at these metrics along your pipeline.
Docker/DevOps Meetup: Metrics-Driven Continuous Performance and Scalability
1. @Dynatrace
Application Quality Metrics for your Pipeline
(and why Docker is not the solution to all of your problems)
Andreas (Andi) Grabner - @grabnerandi
Metrics-Driven DevOps
2.
3.
4. 700 deployments / year
10 + deployments / day
50 – 60 deployments / day
Every 11.6 seconds
5. Example #1: Online Casino
282! Objects on that page
9.68MB Page Size
8.8s Page Load Time
Most objects are images delivered from your main domain
Very long Connect time (1.8s) to your CDN
6. Example #2: Lawyer Website based on SharePoint
11s! To load the Landing Page
879! SQL Queries
8! Missing CSS & JS Files
340! Calls to GetItemById
7.
8.
9.
10. • Waterfall → Agile: 3 years
• 220 Apps - 1 deployment per month
“EVERYONE can do Continuous Delivery”
“Every manual tester does AUTOMATION”
“WE DON’T LOG BUGS – WE FIX THEM!”
Measures Built-In, Visible to Everyone
Promote your Wins, Educate your Peers
34. • Symptoms
  • HTML takes between 60 and 120s to render
  • High GC Time
• Developer Assumptions
  • Bad GC Tuning
  • Probably bad Database Performance, as the rendering was simple
• Result: 2 years of finger-pointing between Dev and DBA
Project: Online Room Reservation System
35. Developers built own monitoring
void roomreservationReport(int officeId)
{
    // Home-grown monitoring: time how long the data load takes
    long startTime = System.currentTimeMillis();
    Object data = loadDataForOffice(officeId);
    long dataLoadTime = System.currentTimeMillis() - startTime;
    // Report the measurement (the reporting call was omitted on the slide);
    // this is how the 45s average below was collected
    System.out.println("Data load for office " + officeId + " took " + dataLoadTime + "ms");
    generateReport(data, officeId);
}
Result:
Avg. Data Load Time: 45s!
DB Tool says:
Avg. SQL Query: <1ms!
36. #1: Loading too much data
24889! Calls to the Database API
High CPU and High Memory Usage to keep all data in Memory
37. #2: On individual connections
12444! individual connections
Classical N+1 Query Problem (see the sketch below)
Individual SQLs really take <1ms
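For illustration, here is a minimal Java/JDBC sketch of that N+1 pattern (the table and column names are made up), followed by the single JOIN that fetches the same data in one roundtrip:

import java.sql.*;
import java.util.*;

class NPlusOneExample {
    // Anti-pattern: one query for the offices, then one more query per office.
    // Each individual SQL is fast (<1ms), but thousands of roundtrips add up.
    static List<String> loadRoomsNaive(Connection con) throws SQLException {
        List<String> rooms = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet offices = st.executeQuery("SELECT id FROM offices")) {
            while (offices.next()) {
                try (PreparedStatement ps = con.prepareStatement(
                         "SELECT room FROM reservations WHERE office_id = ?")) {
                    ps.setInt(1, offices.getInt("id"));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) rooms.add(rs.getString("room"));
                    }
                }
            }
        }
        return rooms;
    }

    // Fix: a single JOIN returns the same data in one roundtrip.
    static List<String> loadRoomsJoined(Connection con) throws SQLException {
        List<String> rooms = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT r.room FROM reservations r JOIN offices o ON r.office_id = o.id")) {
            while (rs.next()) rooms.add(rs.getString("room"));
        }
        return rooms;
    }
}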
38. #3: Putting all data in a temp Hashtable
Lots of time spent in Hashtable.get
Called from their Entity Objects (a sketch of the pattern follows below)
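A minimal sketch of what such a home-grown "O/R mapper" can look like (all names hypothetical): the entire database is loaded up front, and every entity attribute access goes through Hashtable.get – which is exactly where the time showed up:

import java.sql.*;
import java.util.Hashtable;

class HomegrownOrMapper {
    private final Hashtable<String, Object> cache = new Hashtable<>();

    // Loads every row into one big in-memory Hashtable: memory grows with
    // the size of the database, and the GC suffers accordingly.
    void loadEverything(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM rooms")) {
            while (rs.next()) {
                cache.put("rooms." + rs.getInt("id") + ".name", rs.getString("name"));
            }
        }
    }

    // Every attribute access of an entity object ends up in Hashtable.get
    String roomName(int id) {
        return (String) cache.get("rooms." + id + ".name");
    }
}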
39. • … you know what the code you inherited is doing!!
• … you are not making mistakes like this
• Explore the Right Tools
• Built-in Database Analysis Tools
• “Logging” options of frameworks such as Hibernate, …
• JMX, Perf Counters, … of your Application Servers (see the sketch below)
• Performance Tracing Tools: Dynatrace, Ruxit, New Relic, AppDynamics, your profiler of choice …
Lessons Learned – Don’t Assume …
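For example, the JVM already exposes a lot of this through the standard java.lang.management API – a quick sketch, no agent or external tool required:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

class JmxSnapshot {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Heap usage and live thread count – two of the metrics discussed above
        System.out.printf("Heap used: %d MB%n",
            mem.getHeapMemoryUsage().getUsed() / (1024 * 1024));
        System.out.printf("Live threads: %d%n", threads.getThreadCount());
    }
}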
40. Key Metrics
# of SQL Calls
# of same SQL Execs (1+N)
# of Connections
Rows/Data Transferred
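A toy sketch of how the first two of these key metrics could be counted during a test run (in practice a JDBC proxy driver such as P6Spy or an APM agent collects this for you; the class below is purely illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class SqlMetrics {
    private static final Map<String, AtomicInteger> execs = new ConcurrentHashMap<>();

    // Call from wherever SQL gets executed, e.g. a wrapped Statement
    static void record(String sql) {
        execs.computeIfAbsent(sql, s -> new AtomicInteger()).incrementAndGet();
    }

    // Key Metric: total # of SQL calls in this test
    static int totalSqlCalls() {
        return execs.values().stream().mapToInt(AtomicInteger::get).sum();
    }

    // Key Metric: # of execs of the same SQL – a high value is the 1+N smell
    static int maxSameSqlExecs() {
        return execs.values().stream().mapToInt(AtomicInteger::get).max().orElse(0);
    }
}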
43.
26.7s Execution Time
33! Calls to the same Web Service
171! SQL Queries through LINQ by this Web Service – requesting similar data for each call
Architecture Violation: direct access to the DB straight from the frontend logic
44. Key Metrics
# Service Calls, # Containers
# of Threads, Sync and Wait
# SQL executions
# of SAME SQL’s
Payload (kB) of Service Calls
47. Distance calculation issues
480km of biking in 1 hour!
Solution: a Unit Test in the Live App reports Geo Calc Problems (a sketch follows below)
Finding: only happens on certain Android versions
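A minimal sketch of such a plausibility check (the 60 km/h cycling threshold is made up): compute the haversine distance between two GPS fixes and flag impossible speeds – 480km of "biking" in one hour fails immediately:

class GeoPlausibility {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle (haversine) distance between two lat/lon points, in km
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Hypothetical check: no cyclist sustains more than 60 km/h for an hour
    static boolean plausibleBikingSpeed(double distanceKm, double hours) {
        return distanceKm / hours <= 60.0;
    }
}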
60. Quality Metrics in your pipeline
What you currently measure:
# Test Failures
Overall Duration
Execution Time per test
What you should measure:
# calls to API
# executed SQL statements
# Web Service Calls
# JMS Messages
# Objects Allocated
# Exceptions
# Log Messages
# HTTP 4xx/5xx
Request/Response Size
Page Load/Rendering Time
…
61. Extend your Continuous Integration
Build #   Test Case     Status   # SQL   # Excep   CPU
Build 17  testPurchase  OK        12      0        120ms
          testSearch    OK         3      1         68ms
Build 18  testPurchase  FAILED    12      5         60ms
          testSearch    OK         3      1         68ms
Build 19  testPurchase  OK        75      0        230ms
          testSearch    OK         3      1         68ms
Build 20  testPurchase  OK        12      0        120ms
          testSearch    OK         3      1         68ms
Status comes from the Test & Monitoring Framework Results; # SQL, # Excep and CPU are the Architectural Data.
Build 18: we identified a regression – the exceptions are probably the reason for the failed test.
Build 19: problem fixed, but now we have an architectural regression.
Build 20: problem solved – now we have the functional and architectural confidence.
Let’s look behind the scenes.
62. #1: Analyzing every Unit & Integration Test
#2: Metrics for each test
#3: Detecting regressions based on these measures
Unit/Integration Tests are auto-baselined! Regressions auto-detected!
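Conceptually the regression check boils down to something like this hypothetical sketch (real tools such as Dynatrace baseline the metrics automatically):

import java.util.Map;

class RegressionCheck {
    // Compare each metric of the current build against the auto-learned
    // baseline; anything beyond the tolerance is flagged as a regression.
    static boolean hasRegression(Map<String, Double> baseline,
                                 Map<String, Double> current,
                                 double tolerance) {
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            Double now = current.get(e.getKey());
            if (now != null && now > e.getValue() * (1 + tolerance)) {
                System.out.printf("Regression in %s: %.1f -> %.1f%n",
                    e.getKey(), e.getValue(), now);
                return true;
            }
        }
        return false;
    }
}

With a 20% tolerance, a jump from 12 to 75 SQL calls – the Build 19 situation above – is flagged immediately even though the test itself passed.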
69. #1: Pick your App Metrics
# of Service Calls
Bytes Sent & Received
# of Worker Threads
# of SQL Calls, # of Same SQLs
# of DB Connections
70. #2: Figure out how to monitor them
Get Dynatrace Free Trial at http://bit.ly/dtpersonal
Video Tutorials on YouTube Channel: http://bit.ly/dttutorials
Online Webinars every other week: http://bit.ly/onlineperfclinic
Share Your PurePath with me: http://bit.ly/sharepurepath
More blogs on http://blog.dynatrace.com
If you are new to DevOps and Continuous Delivery, check out these two books: Continuous Delivery by Jez Humble and David Farley, and The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford.
Many companies that have a “DevOps strategy” too often just follow the Unicorns.
Several companies have changed the way they develop and deploy software over the years. Here are some examples (numbers from 2011 – 2014):
Cars: from 2 deployments to 700 per year
Flickr: 10+ per day
Etsy: lets every new employee make a code change on their first day of employment and push it through the pipeline into production: THAT’S the right approach towards the required culture change
Amazon: every 11.6s
Remember: these are very small changes – which is also a key goal of continuous delivery. The smaller the change, the easier it is to deploy, the less risk it carries, the easier it is to test, and the easier it is to take it out in case it causes a problem.
If “being DevOps” just means increasing the number of deployments, then you are bound to fail. Here is an example of a bad web application: deploying it more frequently will only land you in more war rooms.
Another example from a SharePoint app that allows production deployments by SharePoint admins. A simple change made directly in production can have a very negative impact, e.g. deploying a new WebPart with a data-driven performance hotspot.
Don’t just copy the Unicorns – don’t be driven just by the number of deployments.
The problem is that when you blindly copy what you read, you may end up with a very ugly copy of a Unicorn. It’s not about copying everything or thinking that you have to release as frequently as the Unicorns. It is about adopting a lot of their best practices, but in a way that makes sense for you. For you it might be enough to release once a month or once a week.
Listen to the next generation of Unicorns, e.g. those speaking at Velocity or other conferences: Target, CapitalOne, IG, ...
These were my highlights from this year’s talks:
http://apmblog.dynatrace.com/2015/05/27/velocity-2015-our-conference-highlights/
http://apmblog.dynatrace.com/2015/05/28/velocity-2015-highlights-from-day-2/
http://apmblog.dynatrace.com/2015/05/29/velocity-2015-highlights-from-last-day/
Despite all these stories, the main challenge remains ...
Don’t just try to deploy faster …
… as you may just end up failing faster and more often!
Don’t become the next news headline, as United did in the summer of 2015.
Or the FIFA World Cup app that, one week before the World Cup, crashed for 80% of its Android users – caused by a memory leak in an outdated UI library.
I love metrics – and I think we should make deployment decisions based on key metrics. But we should also monitor deployments in production to learn whether a deployment was really good.
The BASIC metric EVERYONE has to have: Synthetic Availability Monitoring -> if it fires, clearly something went wrong.
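The check itself can be as simple as this JDK-only sketch (real synthetic monitoring adds scheduling, multiple locations and alerting on top):

import java.net.HttpURLConnection;
import java.net.URL;

class AvailabilityCheck {
    static boolean isUp(String url) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setConnectTimeout(5000);
            con.setReadTimeout(10000);
            int status = con.getResponseCode();
            return status >= 200 && status < 400; // 2xx/3xx counts as available
        } catch (Exception e) {
            return false; // timeouts and connection failures count as "down"
        }
    }
}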
Even if the deployment seemed good because all features work and response time is the same as before: if your resource consumption goes up like this, the deployment is NOT good, as you are now paying a lot of money for that extra compute power:
http://apmblog.dynatrace.com/2015/06/30/fighting-technical-debt-memory-leak-detection-in-production/
http://apmblog.dynatrace.com/2014/10/28/hands-tutorial-5-steps-identify-java-net-memory-leaks/
Layer Breakdown perfectly shows which layer of your app is not scaling:
http://apmblog.dynatrace.com/2015/01/22/key-performance-metrics-load-tests-beyond-response-time-part/
Got a marketing campaign? If you roll it out, do it smart: start with a small number of users – monitor their behavior – fix errors, if there are any, before rolling out the rest of the campaign:
http://apmblog.dynatrace.com/2015/02/26/omni-channel-monitoring-in-real-life/
A lot of people don’t look at these metrics and just add new code on top of an ever-growing pile of technical debt.
Based on a recent study:
80% of dev teams’ time overall is spent on bug fixing instead of building cool new features
$60B in annual costs of bad software instead of investing it in cool new features to spearhead the competition
Yes – we are focusing on quality TOO LATE.
When it’s too late, we end up here.
We need to leave that status quo. And there are two numbers that tell us it is not as hard to do as it may seem.
Based on my experience, 80% of the problems are caused by just 20% of the problem patterns. Focusing on the 20% of potential problems that cause 80% of the pain is a very good starting point.
Sounds super nice on paper – so how do we get there?
This story is from Joe – a DB guy at a very large telco – arguing with his developers over performance problems of an online room reservation system, which had evolved from a small project implemented by an intern into an application used across the entire organization.
The devs built custom monitoring to prove their point, contradicting what Joe’s DB tools had to say.
Reading this transaction flow showed what the real problem was: loading too much data from the database, causing high memory usage and therefore high CPU to clean up the garbage.
Every SQL statement was executed on its own connection.
The intern back then had implemented his own O/R mapper, loading the full database content into a Hashtable using individual queries.
This was a monolithic app for searching sports club websites. The executed sample search returned 33 sports clubs. Before this app was “migrated” to microservices, everything ran in a single monolith taking about 1s to execute. After the “migration” to (micro)services, the same call takes 26.7s, including 33 calls to the new microservice and 171 roundtrips to the database.
A mobile app with a GPS distance calculation problem. It couldn’t be reproduced in test – so they moved the test to production to find out which devices actually had the problem.
http://apmblog.dynatrace.com/2013/07/23/too-fast-for-the-user/
Like many mobile apps, you might rely on 3rd-party services for your users to log in. Make sure you monitor the response time and success rate of these calls and how they impact your end users.
An overloaded Kia website went down during the Super Bowl:
http://apmblog.dynatrace.com/2014/03/05/bloated-web-pages-can-make-or-break-the-day-lessons-learned-from-super-bowl-advertisers/
GoDaddy does something different: they serve a special “bare minimum, statically optimized” website during the spike period -> that’s smart:
http://apmblog.dynatrace.com/2014/02/19/dns-tcp-and-size-application-performance-best-practices-of-super-bowl-advertisers/
So – we have seen a lot of metrics. The goal now is that you start with one metric. Pick a single metric and take it back to your engineering team (Dev, Test, Ops and Business). Sit down and agree on what this metric means for everyone, how to measure it, and how to report it.
Also remember that for most of the use cases discussed, and the metrics derived from them, a single-user test is all we need. Even though we can identify performance, scalability and architectural issues, in most cases we don’t need a load test. Single-user tests or unit tests are good enough.
If you are already executing tests, then that is great – BUT – you are only testing functionality. It is time to look “underneath the hood” and automatically find all these other problems we just talked about by looking at the right metrics.
Here is how we do this: in addition to looking at functional and unit test results, which only tell us whether the functionality works, we also look at these backend metrics for every test. With that we can immediately identify whether code changes result in any performance, scalability or architectural regressions. Knowing this allows us to stop a bad build early.
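As a hypothetical sketch, such a check can be a plain JUnit test that asserts on the architectural metrics captured for the scenario (reusing the illustrative SqlMetrics helper from earlier; the thresholds are made up):

import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class PurchaseMetricsTest {
    @Test
    public void purchaseStaysWithinArchitecturalBudget() {
        // ... execute the functional purchase scenario here ...
        assertTrue("Too many SQL calls – possible architectural regression",
            SqlMetrics.totalSqlCalls() <= 12);
        assertTrue("Same SQL executed repeatedly – possible 1+N regression",
            SqlMetrics.maxSameSqlExecs() <= 3);
    }
}

The build server fails on this assertion just like on any functional test failure – which is exactly how a bad build gets stopped early.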
This is what it can look like in a real-life example: analyzing key performance, scalability and architectural metrics for every single test.
Dynatrace can either show the data in our own dashboards, or you can integrate the data through our REST APIs with your build server, such as Jenkins or Bamboo – and even “BREAK THE BUILD” if something is bad!
Make sure you don’t stop at pre-production. Once you deploy your application, you also want to monitor how it is doing in the wild. The same technical metrics are important to monitor, but also correlate them with business metrics such as conversion rates, bounce rates, revenue, ...
Docker fans: make sure you monitor your Docker environments to identify any bottlenecks – whether caused by Docker or by your app making inefficient use of Docker/container resources!
http://apmblog.dynatrace.com/2015/07/21/how-to-get-visibility-into-docker-clusters-running-kubernetes/
More screenshots, tips and tricks on Docker/container monitoring, plus a “dockerized” app monitored with Dynatrace, are in the same post.
So – our goal is to deploy new features faster, to get them in front of our paying end users or employees.