These are the slides of my JavaOne presentation. The abstract goes like this:
How do companies developing business-critical Java enterprise Web applications increase releases from 40 to 300 per year and still remain confident about a spike of 1,800 percent in traffic during key events such as Super Bowl Sunday or Cyber Monday? It takes a fundamental change in culture. Although DevOps is often seen as a mechanism for taming the chaos, adopting an agile methodology across all teams is only the first step. This session explores best practices for continuous delivery with higher quality for improving collaboration between teams by consolidating tools and for reducing overhead to fix issues. It shows how to build a performance-focused culture with tools such as Hudson, Jenkins, Chef, Puppet, Selenium, and Compuware APM/dynaTrace
24.
What are the real questions?
Individual Users? ALL users?
Is it the APP? Or Delivery Chain?
Code problem? Infrastructure?
One transaction? ALL transactions?
In AppServer? In Virtual Machine?
26.
Problem: What Devs would like to have
Top Contributor is related to String handling
99% of that time comes from RegEx Pattern Matching
Page Rendering is the main component
30.
Problem: Attitudes like this don’t help either
Image taken from https://www.scriptrock.com/blog/devops-whats-hype-about/
Shopzilla CIO (in 2010): “… when they get in the war room - the developers and ops teams describe the problem as the enemy, not each other”
31.
Problem: Very “expensive” to work on these issues
~80% of problems caused by ~20% of patterns
YES we know this
80% Dev Time in Bug Fixing
$60B Defect Costs
BUT
54.
Solution: DevOps + Performance Focus
Culture
“Shared Responsibility”
Agile Process for ALL Teams
Performance as Key Requirement
X-Team Collaboration and Education
Automation
Measurement, Collaboration and Deployment
Automate Performance and Architectural Problem Detection
Measurement
“Visible” KPIs for each Team
Focus on Performance, Architectural and Deployment Measures
Sharing
Expertise, Tool and Data Sharing
“Easy” sharing of Performance, Deployment and Production Data
http://www.opscode.com/blog/2010/07/16/what-devops-means-to-me/
57.
Measurement: Define KPIs accepted by all teams
# of SQL Executions
# of Log Lines
MBs / User
Time for Deployment
Time for Rollback
Response Times
Perf Test Code Coverage
59.
DevOps Collaboration – TODO LIST FOR YOU!!
Access to Production Data
Shared Reporting and Task Management
Diagnostic Tools
Shared Performance KPIs and Tooling
Know-How Exchange
60.
Recap – Problem – Root Cause – Solution - Result
DevOps +
Performance Culture
Automation
Measurement
Collaboration
62.
Performance Focus in Test Automation
Test Framework Results               | Architectural Data
Build #    Test Case     Status      | # SQL   # Excep   CPU
Build 17   testPurchase  OK          | 12      0         120ms
           testSearch    OK          | 3       1         68ms
Build 18   testPurchase  FAILED      | 12      5         60ms
           testSearch    OK          | 3       1         68ms
Build 19   testPurchase  OK          | 75      0         230ms
           testSearch    OK          | 3       1         68ms
Build 20   testPurchase  OK          | 12      0         120ms
           testSearch    OK          | 3       1         68ms
Build 18: We identified a regression
Build 19: Problem solved
Let's look behind the scenes
Build 18: Exceptions probably reason for failed tests
Build 19: Problem fixed but now we have an architectural regression
Build 20: Now we have the functional and architectural confidence
63.
Performance Focus in Test Automation
Analyzing All Unit / Performance Tests
Analyzing Metrics such as DB Exec Count
Jump in DB Calls from one Build to the next
Who knows what that is? It’s the FIFA World Cup Trophy.
Teams are currently competing in the qualifiers for Brazil 2014.
This is “my” Austrian national soccer team. Their GOAL is to qualify for Brazil 2014. After many failed attempts in the past we hired a new coach whose goal is to form a new team that PERFORMS well enough to qualify.
To get there, the team played many test games, which gave them a lot of confidence because they played against teams that were “easier” to beat. After these tests we even started the qualification with some wins against teams that we expected to beat.
So, at the end of these “test and easy qualification games” we thought: “ALL GOOD – THE ROAD IS OPEN FOR 2014 – NOT ONLY WILL WE QUALIFY BUT WE ALSO BELIEVE WE HAVE SUCH A STRONG TEAM THAT WE WILL ALSO DO WELL AT THE WORLD CUP”
Then reality kicked in when we faced our first “real competitor”: the first qualification game against a team whose quality is at the level we have to expect at the World Cup.
The competing team was Germany, and based on these images you can see how the game went.
The coach is responsible for watching the game and seeing how things are going. As in other sports, soccer has a couple of Key Performance Indicators such as Ball Possession, Fouls and the actual score.
The first 5 minutes actually didn’t look too bad.
After the first 5 minutes the game changes, with Germany taking over the game in their typical way. The KPIs make this very clear.
The coach is responsible for reacting based on these values and on how the game goes.
The coach should use more data for detailed analysis on what is going wrong in the game
One of his options is to substitute players, or even change tactics.
Does this succeed based on the KPIs that we have seen before?
Well – not always. Just replacing players – putting some in that are faster in chasing the ball doesn’t always help
Story
New Build Deployed on Thursday Evening
Everything runs smooth on Friday Daytime
An Ad Campaign hits the Air Friday Night
The site crashes under load -> ALERTS GO OFF
Restarting Server -> SERVER DOESN’T START
Adding more Servers -> PROBLEM REMAINS
Calling in the “App Experts” and Pizza Delivery!
They get
Ops’ problem description: “App Server crashed”, “Out of file handles”
Users’ problem description: “It is slow”, “It crashed”
They get
High CPU, Memory or Bandwidth Issues
Log files: GBs of log files with 99.9% “useless” information
There is lots of data, but does a high CPU Utilization really mean that this machine has a problem and needs to be restarted?
What could be the problem if your user experience tool tells you that people have bad response times?
And what do we do with all this disconnected data?
They need
Application data: Executed Transactions, Load, CPU, Memory, Disk usage, ...
Impacted transactions with context information: User Actions, Call stack, Thread Overview, Method Parameters, SQL Calls, Invoked Service Calls
Involved Application Components: Web Server, App Servers, Database
Impact of service calls: Performance, Availability, Response Codes
Error Details: HTTP Errors, Exceptions, warning/severe log messages
30%: What we hear from talking to people is that a lot of production problems happen at times that are not very “developer friendly” -> RUN THROUGH STORY
60%: Restarting a crashed application server or adding an additional server to handle the load often doesn’t solve the problem either -> That’s when it’s time to call in the Application Experts (Developers or Testers)
100%: Devs (and probably anybody else as well) are not happy to get called at 2AM to look at a problem. They also know that it’s not going to be an easy fix because there is probably not enough data available to fix this, so it’s going to be a lot of trial and error with a team (Ops) that is reluctant to do trial and error.
More talking points:
The Challenge with Outside Business Hours problems
Restarts are not the silver bullet
Application Experts to fix problems unlikely to be available at 2AM
This leads to “CritSit”, “War Room”, … including Dev, Test, Ops …
The Challenge with Production Problem Analysis
Ops often doesn’t know what information is required by Dev & Test
Ops typically doesn’t want to give Devs access to machines for triage
Leads to Tension between Dev, Test and Ops
INTERESTING FACT: 80/20 Rule
20% of Problem Patterns responsible for 80% of Problems
Most problems could have been found early on
PREVENTION is POSSIBLE
Because RESTARTING Applications IS NOT the solution
Best Case: You are just “hiding” a problem
Worst Case: App doesn’t start anymore
Because ROOT CAUSE is often NOT FOUND in log files
Which log files to look at? App Server, Web Server, OS Event Log, …?
Even Splunk can’t help if there is not sufficient information
Because CHANGING APP behavior CANNOT always be done through config files
You can’t turn off a Memory Leak via a switch
Trial & Error Changes, e.g: Increasing pool sizes will just “shift” the problem
Because They (DEV, TEST, ARCHITECTS) are the APPLICATION EXPERTS
They know WHERE to look and WHAT to look for
They can fix the code and advise on other deployment options
Well – I guess there is just not more to say about this. The attitude between these teams doesn’t help in solving issues any faster
We all know this statistic in one form or another, so it is clear that the problems handled in War Rooms are VERY EXPENSIVE.
BUT
What is interesting is that these problems are typically not detected earlier because the focus of engineering is on building new features instead of on performance and scalable architecture.
Many of these problems could easily be found earlier on. LET’S have a look at the common problems that we constantly run into …
Depending on the audience you want to show or hide some of the following slides
Resource Pool Exhaustion
Misconfiguration or failed deployment, e.g: default config from dev
Actual resource leak -> can be identified with Unit/Integration Tests
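The point that an actual resource leak can be identified with unit or integration tests could be sketched roughly like this. It is only an illustration under assumptions: `TrackingPool`, `leakyLookup`, `safeLookup` and `leakedAfter` are hypothetical names that do not come from the talk, and the pool is a simple counter rather than a real connection pool.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical pool stand-in that only counts borrowed resources. */
final class TrackingPool {
    private final AtomicInteger inUse = new AtomicInteger();

    /** Handle given to callers; closing it returns the resource to the pool. */
    final class Handle implements AutoCloseable {
        private boolean closed;

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                inUse.decrementAndGet();
            }
        }
    }

    Handle borrow() {
        inUse.incrementAndGet();
        return new Handle();
    }

    int inUse() {
        return inUse.get();
    }
}

final class ResourceLeakCheck {
    /** Buggy code under test: borrows a handle and never closes it. */
    static void leakyLookup(TrackingPool pool) {
        TrackingPool.Handle h = pool.borrow();
        // ... work with h, but h.close() is never called
    }

    /** Fixed code under test: try-with-resources always returns the handle. */
    static void safeLookup(TrackingPool pool) {
        try (TrackingPool.Handle h = pool.borrow()) {
            // ... work with h
        }
    }

    /** Runs the action repeatedly and reports how many handles are still borrowed. */
    static int leakedAfter(int iterations, Consumer<TrackingPool> action) {
        TrackingPool pool = new TrackingPool();
        for (int i = 0; i < iterations; i++) {
            action.accept(pool);
        }
        return pool.inUse();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(100, ResourceLeakCheck::leakyLookup));
        System.out.println("safe:  " + leakedAfter(100, ResourceLeakCheck::safeLookup));
    }
}
```

A test would simply assert that the "still borrowed" count is zero after the code under test has run; with a real pool the same number is usually exposed as an "active connections" metric.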
Resource Pool Exhaustion (same as before – just different Pool)
Using the same deployment tools in Test and Ops can prevent this
Testing with real load can detect that
Deployment Issues leading to heavy logging resulting in high I/O and CPU
Using the same deployment tools in Test and Ops can prevent this
Analyzing Log Output per Component in Dev prevents this problem
Too many and too slow Database Queries
Dev and Test need to have a “production-like” database – updates on a “sample database” won’t show slow updates
Access Patterns such as N+1 can be identified with Unit Tests
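How a unit test can identify an N+1 access pattern could be sketched as follows. The idea is to count the queries a piece of code issues and to assert on that count. Everything here is an assumption made for illustration: `CountingDb` is a hypothetical in-memory stand-in, not a real database driver or an API of any tool named in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory data source that counts every "query" it serves. */
final class CountingDb {
    private final Map<Integer, List<String>> itemsByOrder;
    private int queryCount;

    CountingDb(Map<Integer, List<String>> itemsByOrder) {
        this.itemsByOrder = itemsByOrder;
    }

    List<Integer> findOrderIds() {
        queryCount++;
        return new ArrayList<>(itemsByOrder.keySet());
    }

    List<String> findItems(int orderId) {
        queryCount++;
        return itemsByOrder.get(orderId);
    }

    Map<Integer, List<String>> findAllItems() {
        queryCount++;
        return itemsByOrder;
    }

    int queryCount() {
        return queryCount;
    }
}

final class NPlusOneCheck {
    /** N+1 access: one query for the ids, then one more query per order. */
    static int countItemsNPlusOne(CountingDb db) {
        int total = 0;
        for (int id : db.findOrderIds()) {
            total += db.findItems(id).size();
        }
        return total;
    }

    /** Batched access: a single query returns everything that is needed. */
    static int countItemsBatched(CountingDb db) {
        int total = 0;
        for (List<String> items : db.findAllItems().values()) {
            total += items.size();
        }
        return total;
    }

    /** Three sample orders with six items in total. */
    static Map<Integer, List<String>> sampleData() {
        return Map.of(1, List.of("a", "b"), 2, List.of("c"), 3, List.of("d", "e", "f"));
    }

    public static void main(String[] args) {
        CountingDb slow = new CountingDb(sampleData());
        CountingDb fast = new CountingDb(sampleData());
        System.out.println(countItemsNPlusOne(slow) + " items, " + slow.queryCount() + " queries");
        System.out.println(countItemsBatched(fast) + " items, " + fast.queryCount() + " queries");
    }
}
```

Both variants return the same functional result, so a purely functional test passes either way; only the assertion on the query count exposes the difference.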
Too much data requested from Database
Dev and Test need to have a “production-like” database – otherwise these problem patterns can only be found in prod
Educate Developers on “the power of SQL” – instead of loading everything in memory and performing filters/aggregations/… in the App
Memory Leaks: Too much data in Cache
Can be found in test with “production-like” data sets and tests that do not only test the same “search” query -> get feedback from Prod
Educate developers on memory and cache strategies
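One cache strategy developers can be educated on is simply bounding the cache. The sketch below is an assumption of what such a strategy might look like, not something shown in the talk: `BoundedCache` is a hypothetical name, built on the standard `LinkedHashMap` access-order mode and its `removeEldestEntry` hook.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded LRU cache: the least recently used entry is evicted first. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, needed for LRU behavior
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("search:shoes", "result 1");
        cache.put("search:hats", "result 2");
        cache.get("search:shoes");             // touch: shoes is now most recently used
        cache.put("search:bags", "result 3");  // evicts hats, the least recently used
        System.out.println(cache.keySet());
    }
}
```

A test that issues many different “search” queries, rather than the same one over and over, would show whether the cache size stays at the configured bound.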
Synchronization issues caused by deadlocks
Can be found with small scale performance unit tests by developers
Educate developers on synchronization/multi-threading strategies
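A small-scale test that looks for deadlocks could be sketched with the JDK's own `ThreadMXBean.findDeadlockedThreads()`. The scenario below is an assumption made for illustration: `DeadlockCheck` and its methods are hypothetical names, and the two daemon threads deliberately take two locks in opposite order to provoke the deadlock the check should report.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class DeadlockCheck {
    /** Returns the number of threads the JVM currently reports as deadlocked. */
    static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? 0 : ids.length;
    }

    /** Starts two daemon threads that take the same two locks in opposite order. */
    static void startOpposingLockers() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startLocker(lockA, lockB, bothHoldFirstLock);
        startLocker(lockB, lockA, bothHoldFirstLock);
    }

    private static void startLocker(Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // never reached once both threads hold their first lock
                }
            }
        });
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    /** Polls until a deadlock is reported or the timeout expires; returns the count seen. */
    static int waitForDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int count = deadlockedThreadCount();
        while (count == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            count = deadlockedThreadCount();
        }
        return count;
    }

    public static void main(String[] args) {
        startOpposingLockers();
        System.out.println("deadlocked threads: " + waitForDeadlock(5000));
    }
}
```

In a real test the two lockers would be replaced by the application code under concurrent load, and the test would fail as soon as the reported count is greater than zero.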
Not following WPO (Web Performance Optimization) Rules
Non-optimized content, e.g: compression, merging, …
Educate developers and automate WPO checks
Not leveraging Browser-side Caching
Misconfigured CDNs or missing cache settings -> automate cache configuration deployment
Educate developers; Educate testers to do “real life” testing (CDN, …)
Slow or failing 3rd party content
Impacts page load time; Ops is required to monitor 3rd party services
Educate devs to optimize loading; Educate test to include 3rd party testing
Why is this a problem?
Biz pushes features. To deliver more features, development adopted agile methodologies to ship more releases with more features in a shorter timeframe.
To save costs we outsource. Some companies also grew through acquisitions, leaving us with dev teams that are distributed across the globe.
To be faster we use 3rd Party Code as we do not want to re-invent the wheel. However, not every 3rd party component or service is really fit for the requirements we have in our production environment. It may work well on the workstation for a single user but often fails in a larger environment.
3rd Party Services or Content
Average US Sports Website loads content from 29! domains
3rd Party Components in Application Code
Hibernate, Spring, .NET Enterprise Blocks …
GWT, ExtJS, jQuery
Amazon Web Services, Google API, …
Feature-richness vs. NO CHANGE
Not well communicated what change is ahead. No “Integration” of Ops Teams in Agile Process
A big step is to tear down these walls between these teams.
CAMS is taken from the Opscode (creators of Chef) blog: http://www.opscode.com/blog/2010/07/16/what-devops-means-to-me/
Culture: People and process first. If you don’t have culture, all automation attempts will be fruitless.
Automation: This is one of the places you start once you understand your culture. At this point, the tools can start to stitch together an automation fabric for Devops. Tools for release management, provisioning, configuration management, systems integration, monitoring and control, and orchestration become important pieces in building a Devops fabric.
Measurement: If you can’t measure, you can’t improve. A successful Devops implementation will measure everything it can as often as it can… performance metrics, process metrics, and even people metrics.
Sharing: Sharing is the loopback in the CAMS cycle. Creating a culture where people share ideas and problems is critical. Jody Mulkey, the CIO at Shopzilla, told me that when they get in the war room the developers and operations teams describe the problem as the enemy, not each other. Another interesting motivation in the Devops movement is the way sharing Devops success stories helps others. First, it attracts talent, and second, there is a belief that by exposing ideas you can create a great open feedback that in the end helps them improve.
The change that is required is already well understood in the DevOps movement that’s been going on for years – BUT – it is important to add Performance as a Key Requirement to Culture, Automation, Measurement and Sharing.
Culture: PERFORMANCE is a key requirement for everything that is done throughout the delivery chain. We have heard that a lot of the problems that lead to a War Room scenario are problems that could be found earlier if there were a focus on Performance and Quality throughout the organization.
Automation: Automation is Key for DevOps and Agile Development. What needs to change is that performance and architectural problems are automatically detected in the development and delivery process. This can be achieved by focusing automated testing on exactly these problems, whether it is in C/I or in the “traditional” test area.
Measurement: We can only measure success if we have Key Performance Indicators for each team, e.g: Test Coverage %, Number of Tests Executed, Throughput, Response Time, Number of Deployments, … An additional focus must be on measures that allow us to track performance and architectural issues. This allows us to identify and prevent performance regressions as soon as they get introduced.
Sharing:
Agile Development (Stories & Tasks) excludes
Performance and Scalability Requirements from Test
Testability Requirements from Test
Deployment and Stability Requirements from Ops
Requirements: currently these are mainly brought in by the business side, which demands more features. What is missing are the requirements from Test and Ops.
Agile Process excludes Test and Ops
Not part of Standups, Reviews, Plannings
No active sharing of data, requirements, feedback
No common toolset/platform/metrics that makes sharing easy
Collaboration: Test & Ops are not part of the agile process. There is no active involvement in the standups, reviews or planning meetings. The lack of common tools and a different understanding of quality, metrics and requirements also make it hard to share data.
Sharing Tools
The different teams currently use their own sets of tools that help them in their day-to-day work in their “local” environment.
Developers focus on development tools that help them with developing code, debugging and analyzing basic problems.
Testers use their load testing tools and combine them with some system monitoring tools, e.g: to capture CPU, Memory, Network Utilization.
Ops uses their tools to analyze network traffic, host health, logs, …
When these teams need to collaborate to identify the root cause of a problem they typically speak different languages. Developers are used to debuggers, thread and memory dumps. But what they get is things like “the system is slow with that many Virtual Users on the system where Host CPU starts showing a problem”.
When there is a production problem both developers and testers are typically not satisfied with network statistics or operating system event logs that don’t tell them what really went on in the application. Test won’t be able to reproduce the problem with that information, nor will devs be able to debug through their code based on it.
To make troubleshooting easier, developers would like to install their tools in test and ops, but these tools are typically not fit for high load and production environments. Debuggers have too much overhead, and they require restarts and changes to the system -> Ops doesn’t like change!!
Some examples of KPIs
Number of SQL Statements executed
-> tells Ops what to expect in production
-> tells architects whether to optimize this with a cache or a different DB Access Strategy
Number of Log Lines
-> tells Ops how to optimize storage for Logging
-> tells architects whether there is LOG SPAM happening
Memory Consumption per User Session
-> tells Ops how to scale their production environment
-> tells architects whether you are “wasteful” with heap space
Time for a single Deployment
Time for rollbacks
Automation: C/I currently only executes tests that cover functionality, e.g: unit and maybe integration or some functional tests (Selenium, …). What is missing is the concept of already executing small scale performance and scalability tests that would allow us to automatically detect those problem patterns discussed earlier. With that we could already eliminate the need for MOST War Room situations.
So, to sum up, here are some action items (ToDo List)
1a: Share and develop tools that are used across team boundaries, e.g: add more diagnostics tools to test or share deployment tools developed by Test with Ops.
1b: A critical component is testing early and testing in real-life environments. Test teams need to empower developers by giving them access to their performance test frameworks so that Devs can also test performance and architectural KPIs early on. Ops needs to work with Test and provide access to production or staging so that Test can perform their large-scale load and performance tests in a realistic environment.
2a: It is important to establish a shared reporting and task management system. Very often we see companies that share Wiki instances and task tracking systems where they have status report pages from all teams as well as tracking of issues that are found in dev, test and production.
2b: It is also important to share the same toolset across all tiers so that Dev, Test and Ops get the data they need in a form that can easily be shared and is understood by everybody.
When we recap the initial problem that we described and its root causes, we can say that we have a good solution for these problems.
DevOps is the way to go – BUT – it requires a big focus on Performance, Architecture, Scalability and Deployment.
It requires more Automation to find these problems early on.
It requires more Measurement, as measures allow us to identify these deficiencies throughout the Agile process.
It requires active sharing of the data, which will bring the teams even closer together so that they are working on a “SHARED GOAL”.
Following all of this will result in 100% confidence when rolling out a production release, without the need for a war room.
When we look at the results of your Testing Framework from Build to Build we can easily spot functional regressions. In our example we see that testPurchase fails in Build 18. We notify the developer, the problem gets fixed, and with Build 19 we are back to functional correctness.
Looking behind the scenes
The problem is that Functional Testing only verifies the functionality visible to the caller of the tested function. Using dynaTrace we are able to analyze the internals of the tested code. We analyze metrics such as Number of Executed SQL Statements, Number of Exceptions thrown, Time spent on CPU, Memory Consumption, Number of Remoting Calls, Transferred Bytes, …
In Build 18 we can see a nice correlation of Exceptions to the failed functional test. We can assume that one of these exceptions caused the problem. For a developer it would be very helpful to get exception information, which helps to quickly identify the root cause of the problem and solve it faster.
In Build 19 the Testing Framework indicates ALL GREEN. When we look behind the scenes we see that we have a big jump in SQL Statements as well as CPU Usage. What just happened? The Developer fixed the functional problem but introduced an architectural regression. This needs to be looked into, otherwise this change will have a negative impact on the application once it is tested under load.
In Build 20 all these problems are fixed. We are still meeting our functional goals and are back to an acceptable number of SQL Statements, Exceptions, CPU Usage, …
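The build-over-build comparison described here could be automated along the following lines. This is a sketch under assumptions and not how dynaTrace does it: `TestMetrics`, `findRegressions` and the 20% tolerance are hypothetical choices, while the numbers are the testPurchase values from the slide (Builds 17 to 20).

```java
import java.util.ArrayList;
import java.util.List;

/** Architectural metrics captured for one test case in one build (hypothetical type). */
final class TestMetrics {
    final int sqlCount;
    final int exceptionCount;
    final int cpuMillis;

    TestMetrics(int sqlCount, int exceptionCount, int cpuMillis) {
        this.sqlCount = sqlCount;
        this.exceptionCount = exceptionCount;
        this.cpuMillis = cpuMillis;
    }
}

final class ArchitecturalRegressionCheck {
    /**
     * Compares a build against a baseline. SQL count and CPU time may grow by the
     * given tolerance (0.2 = 20%, an assumed threshold); any additional exception is flagged.
     */
    static List<String> findRegressions(TestMetrics baseline, TestMetrics current, double tolerance) {
        List<String> findings = new ArrayList<>();
        if (current.sqlCount > baseline.sqlCount * (1 + tolerance)) {
            findings.add("SQL");
        }
        if (current.exceptionCount > baseline.exceptionCount) {
            findings.add("Exceptions");
        }
        if (current.cpuMillis > baseline.cpuMillis * (1 + tolerance)) {
            findings.add("CPU");
        }
        return findings;
    }

    public static void main(String[] args) {
        TestMetrics build17 = new TestMetrics(12, 0, 120); // baseline
        TestMetrics build18 = new TestMetrics(12, 5, 60);
        TestMetrics build19 = new TestMetrics(75, 0, 230);
        TestMetrics build20 = new TestMetrics(12, 0, 120);
        System.out.println("Build 18: " + findRegressions(build17, build18, 0.2));
        System.out.println("Build 19: " + findRegressions(build17, build19, 0.2));
        System.out.println("Build 20: " + findRegressions(build17, build20, 0.2));
    }
}
```

With Build 17 as the baseline, Build 19 is flagged even though its functional tests are green, which is exactly the case the functional test results alone cannot show.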
Web Architectural Metrics
# of JS Files, # of CSS, # of redirects
Size of Images