1. The challenges of emerging software eco-systems
19 May 2013
Neil Siegel, Ph.D.
Sector Vice-President & Chief Engineer, Northrop Grumman
2. Bottom-line up front
• New opportunities for software-intensive configurations are arriving on the market
• They will likely be adopted quickly, and at large scale
• There remains, however, a significant potential for unintended problems
• Case studies
• Let’s talk about our role in creating the fixes
3. Emerging software eco-systems
• Cloud, cyber-physical, and so forth
• These constructs, together with the necessary tools, processes, training, reference designs / design patterns, and so forth, can be considered new software “eco-systems”
• By extending the range of societal problems that can be solved by software, these new eco-systems provide the opportunity to increase our value to society
• Such complex structures exhibit emergent behavior
• Obtaining this “1+1=3” is often the reason for interconnecting formerly-separate elements
• There is, however, often unintended emergent behavior, some of which has adverse effects
• We need to take the lead in addressing these problems – only we have the expertise
4. Case-studies (1 of 5)
1. Clients #1 & #2 – private cloud
• 2 million to 20 million users per client, most of them public-facing
• A smaller set of users (~10%) performs mission-critical functions
• Customer believes that there is significant synergy in co-hosting public-facing and mission-critical capabilities
• Observed result: the process to manage fine-grained access control is labor-intensive, and hence too expensive and considered error-prone
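A rough way to see why fine-grained access control becomes labor-intensive at this scale is to count the grant entries an administrator must create and audit. The figures below are illustrative assumptions (user counts from the case study, resource and role counts invented for the sketch), not the clients’ actual numbers:

```python
# Illustrative sketch: administrative burden of per-user grants vs. role-based
# grants. All resource/role counts are assumptions for the example.

def per_user_grants(n_users, n_resources):
    # One explicit (user, resource) entry per combination to manage and audit.
    return n_users * n_resources

def role_based_grants(n_users, n_roles, n_resources_per_role):
    # One user-to-role assignment per user, plus one grant per (role, resource).
    return n_users + n_roles * n_resources_per_role

users = 2_000_000               # low end of the client base in the case study
critical = int(users * 0.10)    # the ~10% doing mission-critical functions

print(per_user_grants(critical, 50))        # 10,000,000 entries
print(role_based_grants(critical, 20, 50))  # 201,000 entries
```

Even with roles, 200,000 user-to-role assignments still have to be kept current as people change jobs, which is consistent with the observed result that the process stays labor-intensive and error-prone.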
5. Case-studies (2 of 5)
2. Client #3 – transitioning an enterprise to a private cloud
• > 100 separate mission applications, connected into complicated and time-critical mission threads, each formerly on its own stand-alone computer
• Very large ingest rate
• Transitioned to a private cloud: > 80% decrease in hardware foot-print, and a significant decrease in software SLOC count (every application formerly had its own infrastructure, etc.)
6. Case-studies (2 of 5 – continued)
2. Client #3 – transitioning an enterprise to a private cloud
• But . . .
a) some of the applications could not be re-coded to a modern computer target (re-certification considered too expensive, source code not available (!), etc.)
b) mission threading made the use of middleware appropriate; selected and ported most of the applications to a single middleware structure
c) most needed cyber-protection retrofits, and other modifications
• The new system is realizing the desired benefits and preserved the mission timelines, but the transition effort was material, and this was not well-understood by the client at the beginning of the process. Since the process was competitive, that was a real obstacle.
7. Case-studies (3 of 5)
3. Client #4 – forensics
• Capability hosted in a public cloud
• Serious hacking attack
• The client wants us to do analysis / forensics; there may be a need for law-enforcement participation
• We don’t have access to the same sort of information that we would if this were hosted on a dedicated mission computer in the client’s (or our own) facilities
• It is not clear what law enforcement can gain access to, or how . . . there is a complicated “stack” of participants, yet our client’s contract is only with the “top” of that stack
8. Case-studies (4 of 5)
4. Client #5 – big-data
• Customer “fell in love” with a commercial product that a vendor offered, and required us to implement the system using that product
• It appears that at the high end of the performance spectrum, there is an extreme performance sensitivity to the specifics of the mission (often 10x or more)
• Vendors either don’t understand this, or don’t like to acknowledge it . . . and work very hard to convince customers to choose their stuff
9. Case-studies (4 of 5 – continued)
4. Client #5 – big-data
• In the end:
• The COTS product hit less than 10% of the promised performance
• The customer brought in lots of other vendors to perform triage; nothing worked
• Eventually had to throw it all away, and allow us to start all over, based on detailed mission engineering (which worked)
• Not good for either the customer or the vendor! (nor for us – a “Cassandra” effect)
10. Case-studies (5 of 5)
5. Clients #6 & #7 – cyber-physical systems
• Significant functional and convenience benefits of using digital systems to interconnect formerly-stand-alone capabilities
• Significant (apparent) cost savings using the internet to replace dedicated / bespoke control systems
• But . . .
a) reliability / availability too low
b) variance in port-to-port timing too high
c) the result is far too fragile, and too susceptible to a wide range of attacks – hooking things up to the internet creates new attack channels
11. Some of the lessons-learned from the case studies
Cloud
• Co-location of data increases the down-side of a potential spill
• Fine-grained access control is labor-intensive
• True enterprise transition can be quite labor-intensive
• How to do forensics
Big-data
• Extreme performance sensitivity to mission specifics
• How to test, when increasingly we cannot control the data we process
Cyber-physical
• Authentication
• Built-in protection against unreasonable use
• Socially-induced complexities
• Too much variance
• Reliability / availability far too low
Reliability
• In general, the MTBF for software stinks
• People have pointed the way for a long time, but time-to-market and the quest for more functionality still drive design decisions
12. Special discussion: reliability / availability / capacity / port-to-port timing
• I spent many years of my career as the designated “fix-it” person for problem programs
• My finding: weak designs allowed too much unplanned dynamic behavior
• Good at implementing the dynamic behavior they wanted . . .
• . . . but weak at preventing other dynamic behavior
• Usually pretty good at implementing the functional and presentation requirements
• But unplanned dynamic behavior resulted in low reliability / availability, poor MTBF, low capacity, and high variance in port-to-port timing (including human response times)
• I extended this to a more systematic study in (Siegel 2011)
• Including a technique for designs that are effective at preventing unplanned dynamic behavior
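One common way to prevent one family of unplanned dynamic behavior is to make every internal queue bounded and refuse excess work explicitly, so queueing delay stays within planned limits instead of growing without bound under load. This sketch illustrates that general idea only; it is not the specific technique from (Siegel 2011):

```python
# Illustrative sketch (not the Siegel 2011 technique): admission control via a
# bounded stage queue, so overload produces explicit rejections rather than
# unbounded, unplanned queueing delay and timing variance.
from collections import deque

class BoundedStage:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, item):
        # Refuse work beyond the planned bound: the dynamic behavior under
        # overload is designed-in, not emergent.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(item)
        return True

stage = BoundedStage(capacity=3)
accepted = [stage.offer(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

The design choice is that rejection is visible and testable, whereas an unbounded queue silently converts overload into latency that no requirement ever planned for.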
13. We now live in the middle of critical missions
• Capers Jones’ data shows a huge range of software quality outcomes: 20x latent defects, and so forth
• Consistent with others’ data, and confirmed by my own benchmarks (100x to 1,000x MTBF)
• Yet it seems like time-to-market and the quest for functionality dominate decision-making
• For example, typical software desktop / application MTBF: 10 hours
• Modern commercial jetliner MTBF: >> 1,000,000 hours
• Techniques to do better exist / have been proposed:
• Model-driven design / development
• Formal validation methodologies
• Domain-specific languages (to reduce SLOC count)
• And yet . . . they remain at the margins of most practice
If we are going to be implementing critical missions – the next-generation power grid, advanced health-care prognostics, self-driving cars, etc. – this is probably not good enough
14. Cyber-physical
• In my view, this is the high-leverage area in the near future
• It is also the area of highest risk
• Why:
• Cyber vulnerabilities (my view: we must learn to build secure systems from insecure components)
• Privacy breaches
• Direct and obvious safety impacts
• We have often focused on mean & peak performance / capacity / port-to-port timing, rather than on reducing variance
• Failure to include built-in blocks against unreasonable use
• The new environment limits our ability to control, or even understand, all of the data we will be asked to process
As we get into more consumer-facing / safety-critical missions, expectations will rise
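The point about mean performance hiding the real problem can be made concrete with two hypothetical port-to-port timing samples (the numbers and the 20 ms deadline below are invented for the sketch): identical means, very different variance, and only the high-variance one misses a control deadline.

```python
# Illustrative sketch: mean port-to-port timing can hide deadline misses;
# variance is what matters for a cyber-physical control loop.
# All sample values and the deadline are assumptions for the example.
from statistics import mean, pstdev

steady = [10, 10, 11, 9, 10, 10]   # ms; low variance
bursty = [2, 3, 2, 35, 3, 15]      # ms; same mean, high variance
deadline_ms = 20

for name, samples in [("steady", steady), ("bursty", bursty)]:
    misses = sum(t > deadline_ms for t in samples)
    print(name, mean(samples), round(pstdev(samples), 1), misses)
```

Both samples average 10 ms, so a mean-oriented benchmark rates them identical; only the variance reveals that the bursty path would intermittently miss the deadline.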
15. Expectations of our field
An on-off switch is expected to work 100% of the time. A baseball player is a huge success if he gets a hit 30% of the time.
• Achieve the benefits, but minimize the unintended adverse consequences
• Our track-record is pretty good, but expectations are very high . . . and will get higher
16. Summary
• New software eco-systems offer us an enormous opportunity . . .
• . . . that comes with increased exposure and public expectations
• Significant down-side risks, and insufficient attention to mitigating them
• Significant variation in the quality of outcomes, which indicates that we are not yet a mature field-of-practice
• Some solutions exist, both old and new
• But the variation cited above indicates that these are not always adopted
• If we wait for laws and regulations, we won’t like the outcome
• Cyber-physical is the domain upon which to focus
17. About the presenter: Neil Siegel, Ph.D., sector vice-president & chief engineer at Northrop Grumman, has been responsible for several of the Nation’s most successful military systems, including the Blue-Force Tracker, the Army’s first unmanned air vehicle, and many others. He is a member of the National Academy of Engineering, a Fellow of the IEEE, the recipient of the prestigious Simon Ramo Medal for systems engineering, the Army’s Order of Saint Barbara, and other awards and honors.