First webinar from the PEnDAR project, outlining the distributed system design challenge and beginning to investigate the application of advanced systems engineering techniques to all phases of system design, to ensure that critical cost and performance targets can be met. The goal is to make expensive performance failures and cost overruns a thing of the past!
We started this project because we are seeing cost/performance hazards become visible late in the development process – too late to save some projects! We see this as a multi-$B problem worldwide.
What we are investigating is how to enable verification and validation of cost and performance as opposed to simply functional aspects, and to do this for distributed and hierarchical systems, supporting both initial development and ongoing maintenance and incremental development.
We aim to provide early visibility of system cost/performance hazards to avoid costly failures and maximise the chances of successful in-budget delivery of acceptable end-user outcomes. This project is a feasibility study of how techniques that exist to achieve this can be brought into more general use in the industry.
The project partners are:
Predictable Network Solutions, with a long history of addressing performance issues in large-scale distributed systems
Test and Verification Solutions, who have expertise in test and verification, particularly for safety-critical and automotive applications
Vodafone Group, in the role of a system-of-systems integrator, aiming to deliver innovative services, often across a wide geographical area
The project is supported by InnovateUK.
The focus group is an opportunity to articulate some of what we know about the boundaries of the possible in distributed computing, and to get insight into achieving viability and sustainability in large and complex developments.
We aim to find out where and how these techniques can be applied, what the major benefits are, and whether a tool or a service solution would be the best fit. What are the constraints on the applicability of new approaches: established procedures, market inertia, etc.?
As yet there’s nothing to sell!
What is the core challenge in delivering distributed shared-resource systems that deliver satisfactory performance at reasonable cost? How can they be feasible and sustainable? How can the tail risks be constrained – i.e. problems that occur infrequently and so may not manifest until after the system has been deployed?
Managing constraints on system developers:
cosmic, coming from the laws of physics
ludic, arising from the way systems are constructed and how various ‘games’ of chance get played out
ecological, arising from the ecosystem of suppliers and vendors and what is actually available to work with.
System requirements are often vague and/or contradictory, and change during (and after) development.
Complexity forces hierarchical decomposition of the problem, creating boundaries that may hinder optimal development, including commercial boundaries with third-party suppliers
Time pressure forces parallel development that may not fit naturally with the hierarchy, and encourages leaving tricky issues for later, when they will tend to cause re-work and overruns, and leave tail-risks.
Cost and resource constraints force sharing of resources, both within the system and with other systems (for example when network infrastructure or computing resources are shared), and may also require re-use of existing assets (own or third-party) that may not be ideal, in particular their behaviour may be inadequately quantified.
Everyone knows this – not everyone deals with it smoothly.
How can we decompose a top-level requirement into requirements on subsystems so that we can have confidence that meeting all the lower-level requirements will satisfy the top-level one?
For functional outcomes (‘doing the right thing’) there are various ways of dealing with this.
Performance outcomes mean delivering an outcome within a time bound (‘doing the right thing in the right time’), not just delivering outcomes at a particular rate.
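To make that concrete, here is a minimal sketch (Python, with made-up numbers) of one way a probabilistic time bound can be checked against a decomposition: if subsystem delays are independent and the stages are sequential, the end-to-end delay distribution is the convolution of the per-subsystem distributions. The subsystem names and figures below are purely illustrative, not drawn from the project.

```python
import numpy as np

# Hypothetical per-subsystem delay distributions, discretised into 1 ms bins:
# entry i is P(delay == i ms). All numbers are illustrative only.
store  = np.array([0.0, 0.6, 0.3, 0.1])       # e.g. a storage lookup
net    = np.array([0.0, 0.0, 0.5, 0.4, 0.1])  # e.g. a network hop
render = np.array([0.2, 0.7, 0.1])            # e.g. assembling the response

# For a sequential composition of independent stages, the end-to-end
# delay distribution is the convolution of the stage distributions.
total = np.convolve(np.convolve(store, net), render)

# Hypothetical top-level requirement: P(end-to-end delay <= 8 ms) >= 0.95.
p_ok = total[:9].sum()
print(f"P(delay <= 8 ms) = {p_ok:.3f} -> "
      f"{'meets' if p_ok >= 0.95 else 'misses'} the bound")
```

Checks of this kind let per-subsystem delay budgets be validated against the top-level bound before integration.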
We want to minimise the amount of integration testing and re-work we have to do, particularly when these become a bug-fixing cycle.
At the same time we want to manage risks, especially those to do with moving from a lab/pilot phase to a wider deployment, where loading and scaling factors may be problematic.
There are established approaches, but these may be running out of steam.
Let’s consider two of the dimensions of this problem:
(1) To what extent are resources dedicated or shared (either within the system or even with unrelated systems)?
(2) How far is the system (and its resources) distributed?
Examples range from standalone microcontrollers with their own (if small) dedicated resources to virtualised cloud apps (note that IoT links these two domains!). Traditional avionic systems are distributed but consist of dedicated hardware units connected by dedicated communication links, whereas the shift is towards software modules sharing processing platforms, connected by links shared between different functions.
From the bottom left to the top right, performance hazards go from well-managed to very unconstrained.
Think about your own experiences – where do your systems of concern fit on this diagram?
Virtualisation is driving us towards more shared resources; cost constraints are forcing the use of pre-packaged services located remotely.
We’re now going to run through some of the technical dimensions of this challenge.
This captures what we have learnt about system delivery problems over the last decade. There’s a lot here so we’re going to break it down!
The key task with shared-resource systems is to find a way to quantify and manage the performance/resource tradeoff.
Quantifying and managing the performance/resource tradeoff (yellow centre) is specific to each particular system; the issues around it can be dealt with by applying generic techniques. Analysis of the central problem is complemented by a synthesis of other techniques.
The three key aspects to consider are:
Scale – how are the resource/performance trades affected by the scale of the system?
Exception/failure – how are these managed, given that they become inevitable in a shared, distributed system?
Variability – how variable are the resources and the demand for outcomes?
Scale has two dimensions:
Space – either in terms of physical distance, affecting transmission times, or in terms of numbers of users/demands on the system, which together create a notion of ‘density’ that can drive the economics of the solution.
Time – on long timescales the question is one of capacity, on short ones of schedulability.
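As one concrete illustration of the short-timescale question (a classical textbook check, not the project’s method): the Liu & Layland utilisation bound gives a cheap sufficient test of whether a set of periodic tasks can share a processor under rate-monotonic scheduling. The task figures below are made up.

```python
def rm_schedulable(tasks):
    """Sufficient (not necessary) Liu & Layland test: a periodic task set is
    rate-monotonic schedulable if total utilisation <= n * (2**(1/n) - 1).
    `tasks` is a list of (compute_time, period) pairs in the same time unit."""
    n = len(tasks)
    utilisation = sum(c / t for c, t in tasks)
    return utilisation <= n * (2 ** (1 / n) - 1)

# Illustrative task set: (compute time, period) in ms.
print(rm_schedulable([(1, 4), (2, 8), (1, 16)]))  # 0.5625 <= 0.7798 -> True
```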
Exception and failure are specifically not a question of ‘coding errors’ or hardware faults (although those are a factor) but more one of temporary shortage of resources, resulting, for example, in the loss of a packet or a deadline being missed.
Two approaches to handling this are mitigation (re-transmitting a packet, for example) and propagation (packet loss resulting in a failed transfer), requiring handling at a higher layer. These interact, and the optimal approach will depend on the frequency and severity of the failures and the costs of handling them in different ways.
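A minimal sketch of that trade-off (Python, all figures assumed purely for illustration): comparing the expected cost of mitigating a packet loss locally by retransmission against propagating the failure to a higher layer.

```python
# All parameters are assumed, purely for illustration.
p_loss      = 0.01    # probability a packet is lost
t_send      = 5.0     # ms to transmit the packet
t_timeout   = 200.0   # ms to detect a loss before retransmitting
c_propagate = 5000.0  # ms-equivalent cost of a failed transfer handled above

# Mitigation: retransmit until success; expected sends = 1/(1 - p_loss),
# and every loss also incurs a timeout before the retransmission.
retries = p_loss / (1 - p_loss)          # expected number of retransmissions
cost_mitigate = t_send * (1 + retries) + t_timeout * retries

# Propagation: send once; on loss, pay the higher-layer recovery cost.
cost_propagate = t_send + p_loss * c_propagate

print(f"mitigate:  {cost_mitigate:.1f} ms expected")
print(f"propagate: {cost_propagate:.1f} ms expected")
```

With these particular numbers local mitigation wins; raise the loss rate or the timeout and the balance shifts – exactly the frequency/severity/cost interaction described above.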
Variability applies both to resources and to load, and its key aspect is correlation:
Positively correlated, e.g. by TV advert breaks
Negatively correlated, e.g. use of one part of the system precludes simultaneous use of another
Uncorrelated, basically a random effect.
Correlations can be externally generated or be a result of the operation of the system
We need to consider both the impact on individual outcomes and the impact on the ability of the rest of the system to deliver collective outcomes.
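A small simulation sketch (assumed parameters) of why the correlation structure matters: two demand patterns with the same mean load can require very different peak capacity.

```python
import numpy as np

rng = np.random.default_rng(0)
users, steps = 1000, 10_000

# Uncorrelated demand: each user independently active with probability 0.1.
uncorrelated = rng.random((steps, users)) < 0.1

# Positively correlated demand: a shared trigger (think: an advert break)
# lifts every user's activity at the same instants; mean load is kept ~0.1.
trigger = rng.random(steps) < 0.05
p_active = np.where(trigger, 0.6, 0.074)
correlated = rng.random((steps, users)) < p_active[:, None]

print("mean activity:", uncorrelated.mean(), correlated.mean())  # both ~0.1
print("peak concurrent users:",
      uncorrelated.sum(1).max(), correlated.sum(1).max())  # very different
```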
Once the core is understood, the rest is manageable with the right tools.
Need to support stages in the SDLC.
In Design:
Feasibility: can you deliver the outcomes with sufficient timeliness and acceptable use of resources?
Hierarchical decomposition
Acceptance criteria
Verification requires checking quantified outcomes, in a way that is ‘cheap’ enough to re-apply during the system lifetime.
Rocket science used to be something only world superpowers could do – now you only need to be a billionaire! It’s well enough understood to be reproducible, and is just (complex) engineering. Brain surgery requires experience, skill and gut feel – not easy to teach! Outcomes are hard to quantify.
Any CDF whose curve is always to the left and above this one represents an outcome that is “acceptable”. If the black line crosses the blue line we have a performance hazard.
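For illustration, a minimal sketch of how such a check might be automated: express the acceptable curve as a set of (delay, minimum fraction complete) points and test whether the observed CDF stays above them. The quantile spec and sample data below are invented.

```python
import numpy as np

def performance_hazard(observed_ms, required):
    """Return True if the empirical CDF of observed delays drops below the
    required curve, given as (delay_ms, min_fraction_complete) points."""
    obs = np.sort(np.asarray(observed_ms))
    for delay, min_frac in required:
        frac = np.searchsorted(obs, delay, side="right") / len(obs)
        if frac < min_frac:
            return True   # observed curve crosses the requirement: hazard
    return False

# Illustrative requirement: 50% within 10 ms, 95% within 50 ms, 99.9% within 200 ms.
spec = [(10.0, 0.50), (50.0, 0.95), (200.0, 0.999)]
samples = np.random.default_rng(1).exponential(scale=12.0, size=100_000)
print("performance hazard:", performance_hazard(samples, spec))
```

Because the check is just a pass over recorded delays, it is cheap enough to re-apply throughout the system lifetime.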
Responses were scored from 1 to 5.
Looking at a more formal approach to managing cost/performance hazards – do the benefits and costs of this balance out?
There’s a push to use standard commodity infrastructure for safety/mission critical purposes – saves a lot of costs but also introduces risk. Need to be able to make a safety case! Virtualisation is coming in everywhere – what are the risks?
Case studies done inside the project show that getting intentions quantified can be hard; however, explaining that allowing for some possibility of delay or failure can dramatically reduce delivery costs may encourage engagement.
Even functional verification can be considered ‘too expensive’.