Slides from my presentation at JavaOne 2016 on how to keep your CI/CD pipeline under control. Don't let it grow to unmanageable build times! Learn to tell when your pipeline is too slow and you need to do something about it, and when it's fine and you can just carry on with your life.
11. Continuous Integration: check everything is still
working after every commit
Continuous Deployment: every successful
commit turns into a release
What is CI / CD?
20. Slow feedback
Broken builds mask issues
Development paralysis
Impact on ability to meet our SLAs
Missed business opportunities
The Problems Of Size
21. Live with it
Partial CD: only quick tests
Phased CD: split into components
Test Deprecation Policy
Microservices
How Organisations Manage Size
46. Build Time (BT): time an individual build takes
to run
Change Rate (CR): percentage of all the commits
in the whole system that land on an individual
build
Useful Metrics
54. Weighted Impact Time (WIT): impact time of a build
weighted according to its change rate
WIT(A) = IT(A) * CR(A)
Useful Metrics
55. Average Impact Time (AIT): total time needed, on
average, to execute all necessary builds after any
given commit anywhere in the system
AIT = WIT(A) + WIT(B) + ... + WIT(Z)
Useful Metrics
57. Average Impact Time
Average Impact Time is what indicates how well you
have scaled your system
Sample Thresholds
58. Maximum Impact Time
In a worst-case scenario, a build won’t take longer
than this.
Sample Thresholds
59. Maximum Impact Time for Critical Components
The same, but only for your most sensitive modules
(log-in, payment gateway, etc.)
Beware of dependencies!
Sample Thresholds
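The three sample thresholds above could be checked automatically once the metrics are available. The following is a minimal sketch, not something taken from the talk: the function name, the build names, the impact times and all the limit values (in minutes) are hypothetical and only there for illustration.

```python
def check_thresholds(impact_time, average_impact_time, critical, limits):
    """Return a list of threshold violations (empty if everything is fine).

    impact_time: dict of build name -> impact time in minutes
    average_impact_time: AIT for the whole system, in minutes
    critical: set of build names considered critical (log-in, payments...)
    limits: dict with the keys "average", "maximum" and "critical_maximum"
    """
    violations = []
    if average_impact_time > limits["average"]:
        violations.append(
            f"average impact time {average_impact_time} > {limits['average']}"
        )
    for build, minutes in sorted(impact_time.items()):
        if minutes > limits["maximum"]:
            violations.append(
                f"{build}: impact time {minutes} > {limits['maximum']}"
            )
        elif build in critical and minutes > limits["critical_maximum"]:
            # Critical builds get a stricter limit than the rest.
            violations.append(
                f"{build} (critical): impact time {minutes} > "
                f"{limits['critical_maximum']}"
            )
    return violations


# Hypothetical figures, in minutes.
IMPACT_TIME = {"parent-pom": 45, "login": 12, "reports": 20}
LIMITS = {"average": 10, "maximum": 60, "critical_maximum": 10}

violations = check_thresholds(IMPACT_TIME, 15, {"login"}, LIMITS)
for line in violations:
    print(line)
```

With these made-up numbers the average threshold and the critical threshold for the log-in build are both exceeded, while the general maximum is not.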
EITHER BY FOLLOWING THIS, OR BECAUSE YOU HAVE IT, THE NEXT STEP IS MANAGING THE NETWORK
35
This is good to measure what you should change if you need to change something. But do you need to?
PERFORMANCE ESTABLISH THRESHOLD, MEASURE, CHANGE IF ABOVE
46
CALCULATE IN DIFFERENT WAYS, DEPENDING ON OUR PARALLEL EXECUTION CAPABILITIES
If we don’t have the ability to run builds in parallel, then we’ll run A and then B and C (or C and B). In any case, the impact time will be the sum of all of them.
CLICK
If we allow parallel execution, then both B and C will be triggered at the same time after A, which means we’ll only have to wait for the slowest of the two.
CLICK
Bear in mind these are only approximations. In real life your ability to run things in parallel may be limited by the total number of slaves (maybe you can only run up to 5 builds in parallel) or by other shared resources (maybe you only have one staging database and two builds cannot get hold of it at the same time). But, despite being approximations, they are a good way to establish a baseline to track and compare.
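The two ways of calculating Impact Time described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the talk: the build times (in minutes) and the function names are made up, the dependency graph is assumed to have no cycles, and the parallel case assumes there are always enough free slaves.

```python
# Hypothetical build times in minutes.
BUILD_TIME = {"A": 10, "B": 5, "C": 8}
# Which builds are triggered directly after each build (its dependants).
DEPENDANTS = {"A": ["B", "C"], "B": [], "C": []}


def downstream(build, dependants):
    """Return the build plus everything that depends on it, directly or not."""
    seen = set()
    stack = [build]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependants.get(current, []))
    return seen


def impact_time_serial(build, build_time, dependants):
    """No parallel execution: every affected build runs one after another."""
    return sum(build_time[b] for b in downstream(build, dependants))


def impact_time_parallel(build, build_time, dependants):
    """Unlimited parallel execution: we only wait for the slowest chain."""
    slowest = max(
        (impact_time_parallel(d, build_time, dependants)
         for d in dependants.get(build, [])),
        default=0,
    )
    return build_time[build] + slowest


print(impact_time_serial("A", BUILD_TIME, DEPENDANTS))    # 10 + 5 + 8 = 23
print(impact_time_parallel("A", BUILD_TIME, DEPENDANTS))  # 10 + max(5, 8) = 18
```

A limit on the number of parallel builds or on shared resources would give a value somewhere between these two figures.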
CLICK
There is something interesting to note about Impact Time: it grows as you go up the hierarchy. This graph shows the Build Time as the size of the bubbles, but the Impact Time of each bubble also includes, directly or indirectly, that of its dependants. This means that the Parent POM file will be the build with the highest Impact Time, since whenever we change that build we have to rebuild absolutely everything. Now, is that a problem? Maybe not, because it’s also the least modified build (hence its colour). This leads us to conclude that we need to assess the relationship between Impact Time and Change Rate, which brings us to the next metric.
CLICK
This value allows us to compare which builds are the ones causing the highest impact over a period of time, letting us know when an impactful build is infrequent enough so as not to be a problem. And then, by combining all the weighted impact times.
CLICK
We get to the Average Impact Time, which will tell us how long, on average, it takes for our build system to rebuild all the necessary modules after a commit anywhere in the system. Now we’re really getting on to something, because now that we have all these metrics we have a way to define (CLICK) useful thresholds for us.
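As a rough sketch of how these metrics combine, the snippet below computes Change Rate, Weighted Impact Time and Average Impact Time from WIT(A) = IT(A) * CR(A) and AIT = WIT(A) + WIT(B) + ... The impact times, the commit counts and the function names are hypothetical; in practice they would come from your build server and your version control history.

```python
def change_rates(commit_counts):
    """CR: share of all commits in the system that land on each build."""
    total = sum(commit_counts.values())
    if total == 0:
        raise ValueError("no commits recorded")
    return {build: count / total for build, count in commit_counts.items()}


def weighted_impact_times(impact_time, change_rate):
    """WIT(X) = IT(X) * CR(X) for every build X."""
    return {build: impact_time[build] * change_rate[build]
            for build in impact_time}


def average_impact_time(impact_time, change_rate):
    """AIT = WIT(A) + WIT(B) + ... + WIT(Z)."""
    return sum(weighted_impact_times(impact_time, change_rate).values())


# Hypothetical figures: impact times in minutes, commits over some period.
IMPACT_TIME = {"A": 18, "B": 5, "C": 8}
COMMITS = {"A": 1, "B": 4, "C": 3}

CR = change_rates(COMMITS)
WIT = weighted_impact_times(IMPACT_TIME, CR)
AIT = average_impact_time(IMPACT_TIME, CR)

for build in sorted(WIT):
    print(build, WIT[build])
print("AIT", AIT)
```

In this made-up example A has by far the highest Impact Time but the lowest Weighted Impact Time, because it is rarely changed.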
Now, let’s take a moment to reflect on all this. We’re defining metrics based on build duration, but also on change rate. And we are considering architectural changes, restructuring of modules, based on these data. But let’s take a closer look at this temperature graph. It is driven by dependencies among builds, but also by where I am making changes. That means that some of the attributes of this graph will change over time as developers focus on different parts of the system so as to develop different features. That means that the optimal shape of the system will change according to the data of our build, and what was a good idea yesterday may not be so much today.
Let’s also note that all these graphs are created manually. And I also had to do the analysis manually. I had to do these manually because there aren’t any tools (that I know of) that can provide this information for you. And, useful as this is, you can’t do it too often because CLICK manual processing takes time.
CI/CD can be your worst bottleneck
Keeping your CI/CD fast is a performance tuning activity; approach it as such
No proper tools available, help me build them