1. The Quality
Attribute of
Upgradability
Len Bass with
Hiroshi Wada, Ingo Weber, Liming Zhu,
Ross Jeffery
NICTA Copyright 2012 From imagination to impact
2. About NICTA
National ICT Australia
• Federal and state funded research
company established in 2002
• Largest ICT research resource in
Australia
• National impact is an important
success metric
• ~700 staff/students working in 5 labs
across major capital cities
• 7 university partners
• Providing R&D services, knowledge transfer to Australian (and global) ICT
industry
NICTA technology is in over 1 billion mobile phones
3. Consider the following sequence.
• You have prepared an upgrade to an existing large
enterprise system
– You have coded it
– You have tested it
– It is ready!!
• Alternatively, the IT department (or you) get a package
from a third party – a vendor or open source – that has
been coded and tested.
• What happens then?
– ~10% of the time the upgrade will fail.
5. This is the upgradability problem
• How do we make upgrading a system less
problematic?
• Talk outline
– Characteristics of the upgrade problem
– FMEA analysis
• Possible causes of failure
• Failure prevention, detection, and recovery
– Relation to existing product and process quality work
6. Upgrades to enterprise systems are a very
common occurrence
Upgrade frequency of some common systems
Application            Average release interval
Facebook (platform)    < 7 days
Google Docs            < 50 days
MediaWiki              21 days
Joomla                 30 days
This frequency would suggest it is important to get
the upgrades correct
7. Unfortunately, Upgrades Fail Often
• 4.6-10 component failures each month in three
large-scale Internet services. Mostly during
regular maintenance
• Average and maximum failure rates from a
survey of systems administrators are 8.6% and
50%.
• Some claim that user-visible failures from
upgrades outweigh user-visible failures from
software errors.
8. Why is this?
• Installation is complicated.
– Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for
Linux are ~250 pages each
– Apache description of addresses and ports (one out of 16
descriptions) has the following elements:
• Choosing and specifying ports for the server to listen to
• IPv4 and IPv6
• Protocols
• Virtual Hosts
– The number of configuration options that must be set can be
large
• Hadoop has 206 options
• HBase has 64
– Many dependencies are not visible until execution
9. Provides Research Agenda
• Indeed, the surprise is not that upgrades fail
8.6% of the time but that they are successful
91.4% of the time.
• Rich area for research.
10. What kind of problem is this - product?
• ISO 25010 provides
– A quality in use model composed of five
characteristics (some of which are further subdivided
into subcharacteristics) that relate to the outcome of
interaction when a product is used in a particular
context of use.
– I.e. is upgradability a quality of the system being
upgraded?
• The answer is yes.
11. What kind of problem is this – process?
• ITIL (Information Technology Infrastructure Library)
– Change Management aims to ensure that
standardised methods and procedures are used for
efficient handling of all changes.
• SPICE – ISO 15504
– process assessment provides the means of
characterizing the current practice within an
organizational unit in terms of the capability of the
selected processes.
• Is upgradability a quality of the process used to manage
information technology?
• The answer is yes.
12. Upgradability is a hybrid quality problem
• A hybrid quality problem is one in which
improvement involves both product and process
and in which the product has process
awareness.
• Many product centered conferences
– Dependability
– Security
–…
• Some process centered conferences
– Software Process Improvement
– SPICE
– SPEG
–…
13. Hybrid quality improvement is not well
served by the academic community
• Hybrid quality improvement – as we shall see – involves
close interaction between product, process and tools to
support the process.
• Venues that should emphasize this interaction include
– Profes (Product focused Software Development and
Process Improvement)
– ASQ (Conference on Quality and Improvement)
• Yet an examination of the CFPs and proceedings for
these conferences shows a distinction between process
activities and product characteristics
• We will present the results of a FMEA (Failure Mode and
Effects Analysis) style analysis for upgradability and then
return to the hybrid quality issue
14. FMEA
• Failure Mode and Effects Analysis is an
inductive technique for analyzing failure
modes.
• FMEA involves describing
– Potential failure modes
– The severity and likelihood of these failures.
• We will focus on the first portion and generate
the potential failure modes as well as potential
prevention, detection, and recovery from these
failures.
• I.e. we are performing an FMEA style
analysis, not an FMEA, per se.
15. Scenario for Upgradability
• We are concerned with the following scenario
– Version N+1 of an enterprise system is available for
deployment.
• Version N+1 can be deployed by developers
• Version N+1 can be deployed by the Information Technology
Department (The Release Manager if there is one).
– Version N+1 is completely coded and tested by its
developers.
• Measures can include
– Downtime
– Resources (hardware or personnel) required to
perform the upgrade
– Number of failed attempts to install upgrade
16. Fundamental goals during upgrade
• The literature identifies four fundamental goals
while upgrade is occurring.
– Efficiently manage resources
– Completely and correctly specify configurations
– Manage multiple versions to avoid problems with
version mismatch.
– Maintain consistency of persistent data.
• Failures are caused by the violation of one of
these fundamental goals.
– Our FMEA analysis will look at potential causes for
violations of one of these goals.
17. Activities during an upgrade of a system
• Make the upgrade available.
• Prepare the environment. Ensure that there are
sufficient resources available for installation and
that assumed software is available.
• Configuration
• Deployment
• Activation
18. Organization of next portion of the
presentation
• For each activity
– Potential fault (a fault is a failure in waiting)
– Prevention of the fault
– Detection of the fault
– Correction of the fault
• Research opportunity
• Blank cell
• Cell with only partial coverage
19. Make Upgrade available
Fault possibility: prevention, detection, and recovery measures
• Element omitted/included incorrectly in installing software: manifest; bill of lading; recreate distribution
• System corrupted during movement: hash code, checksum; retransmit
• Source of distribution from an untrusted site: digital signature
• Forgotten/misplaced credentials: separate secret; independent channel for new credentials
• Credential verifier unavailable: codify acceptable credentials in distribution
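The manifest and checksum measures for this activity could be sketched roughly as follows. This is only an illustration: the manifest layout (file name mapped to a SHA-256 digest) and the names `sha256_of` and `verify_distribution` are assumptions, not something taken from the talk.

```python
from __future__ import annotations

import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_distribution(
    manifest: dict[str, str], files: dict[str, bytes]
) -> dict[str, list[str]]:
    """Compare received files against a manifest of expected digests.

    manifest: file name -> expected SHA-256 hex digest (hypothetical layout)
    files:    file name -> content actually received
    """
    # Elements listed in the manifest but absent from the distribution.
    missing = sorted(name for name in manifest if name not in files)
    # Elements present in the distribution but not listed in the manifest.
    unexpected = sorted(name for name in files if name not in manifest)
    # Elements whose content changed in transit.
    corrupted = sorted(
        name
        for name, digest in manifest.items()
        if name in files and sha256_of(files[name]) != digest
    )
    return {"missing": missing, "unexpected": unexpected, "corrupted": corrupted}
```

A non-empty `missing` or `unexpected` list would point to recreating the distribution; a non-empty `corrupted` list would point to retransmission.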
20. Prepare environment
Fault possibility: prevention, detection, and recovery measures
• Incorrect versions of support libraries: include version number in specification; encode hash of APIs; utilize services to announce incompatibilities
• Multiple versions of support libraries simultaneously required: include version number in name; libraries expose version numbers; linkers version aware
• Insufficient resources: rolling upgrade
• Schema modification on database: convert data to new schema prior to upgrade
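A check for incorrect versions of support libraries might look like the sketch below, assuming libraries expose dotted numeric version strings and the upgrade's specification states an allowed range for each. The function names and the range convention (minimum inclusive, maximum exclusive) are assumptions made for illustration.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '1.10.1' into (1, 10, 1)."""
    return tuple(int(part) for part in text.split("."))


def find_incompatible(
    required: dict[str, tuple[str, str]], installed: dict[str, str]
) -> list[str]:
    """List support libraries that are absent or outside the required range.

    required:  library name -> (minimum version inclusive, maximum exclusive)
    installed: library name -> version the library reports
    """
    problems = []
    for name, (low, high) in sorted(required.items()):
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: not installed")
        elif not (parse_version(low) <= parse_version(have) < parse_version(high)):
            problems.append(f"{name}: found {have}, need >={low},<{high}")
    return problems
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.10" sorts before "1.2".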
21. Configuration
Fault possibility: prevention, detection, and recovery measures
• Missing parameter: parameter database; parameter built into tool; static analysis of code
• Incorrectly specified parameter: abstract specification; check syntax; validate against a specification
• Inconsistent parameters: constraint checker
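The three configuration faults could be detected along the lines of the sketch below. The parameter names, the specification layout, and the single constraint are hypothetical; a real system would draw them from its own parameter database.

```python
# Hypothetical specification: parameter name -> expected type and whether required.
SPEC = {
    "port": {"type": int, "required": True},
    "heap_mb": {"type": int, "required": True},
    "max_heap_mb": {"type": int, "required": False},
}

# Hypothetical cross-parameter constraints: (description, predicate over the config).
CONSTRAINTS = [
    (
        "heap_mb <= max_heap_mb",
        lambda c: "max_heap_mb" not in c or c["heap_mb"] <= c["max_heap_mb"],
    ),
]


def check_config(config, spec, constraints):
    """Return a list of configuration faults; an empty list means none found."""
    errors = []
    for name, rule in spec.items():
        if name not in config:
            if rule["required"]:
                errors.append(f"missing parameter: {name}")
        elif not isinstance(config[name], rule["type"]):
            errors.append(f"incorrectly specified parameter: {name}")
    for name in config:
        if name not in spec:
            errors.append(f"unknown parameter: {name}")
    # Constraints are only meaningful once every parameter is individually valid.
    if not errors:
        for description, predicate in constraints:
            if not predicate(config):
                errors.append(f"inconsistent parameters: {description}")
    return errors
```

For example, `check_config({"port": 8080, "heap_mb": 512, "max_heap_mb": 256}, SPEC, CONSTRAINTS)` reports the inconsistent heap settings even though each value is well formed on its own.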
22. Deployment
Fault possibility: prevention, detection, and recovery measures
• Insufficient resources: pre-allocate during preparation; rolling upgrade
• Inconsistent hardware: verify during preparation
• Operator error: undo mechanism
23. Activation
Fault possibility: prevention, detection, and recovery measures
• Discovered hidden dependency: monitoring; recovery block
• Multiple simultaneous versions: separation; dynamic software update; version aware code and data; automatic translation of data when old schema is used; version aware load balancer
24. Our activities in this space so far (green
cells)
• Mixed version race condition solution
• Operator undo
25. What is the “mixed version race condition”?
• Common practice when pushing an upgrade to a
large number of servers is to perform the
upgrades one (or several) servers at a time
• This means that version N+1 (the new version)
will be available on some servers and version N
(the old version) will be available on other
servers.
• Suppose version N+1 has functionality not
available in version N
26. Now consider the following sequence
1. A client (browser) issues a request that is
routed by the load balancer to an instance of
version N+1
2. Version N+1 sends JavaScript assuming new
functionality back to the client.
3. Client sends an AJAX request that utilizes new
functionality and the load balancer routes it to
an instance of version N.
4. Error because version N does not have the new
functionality.
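The four steps can be reproduced in a toy simulation. The sketch assumes a version-unaware round-robin load balancer and models "new functionality" simply as a minimum server version; `Server` and `run_sequence` are illustrative names only.

```python
from itertools import cycle


class Server:
    """A server instance running one version of the application."""

    def __init__(self, version: int):
        self.version = version

    def handle(self, required_version: int) -> str:
        # A request that relies on functionality from a newer version fails.
        if required_version > self.version:
            return "ERROR"
        return "OK"


def run_sequence(servers):
    """Replay steps 1-4 against a version-unaware round-robin balancer."""
    balancer = cycle(servers)
    # Steps 1-2: the first request is served, and the returned JavaScript
    # assumes the functionality of whichever version produced it.
    first = next(balancer)
    script_version = first.version
    # Steps 3-4: the AJAX request may be routed to a different instance.
    second = next(balancer)
    return second.handle(script_version)
```

With `[Server(2), Server(1)]` (the upgrade has reached only the first server) the sequence ends in an error, while the reverse order happens to succeed, which is what makes this a race condition.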
27. Mixed Version Race Condition
Client (browser) and server:
1. Start rolling upgrade (server)
2. Initial request (client to server)
3. HTTP reply with embedded JavaScript (from the new version)
4. AJAX callback (client to server)
5. Callback reaches the old version: ERROR
28. What does the solution involve?
1. Label communication between instances and
the client with version information
2. Modify load balancer so that messages are
routed to an appropriate version
3. Modify load balancer so that messages are
balanced across all child instances.
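A single version-aware load balancer could combine the three steps roughly as sketched below. This is not the implementation from the talk: the class name, the exact-match routing rule, and the least-loaded tie-breaking are all assumptions chosen to keep the example small.

```python
from __future__ import annotations

from typing import Optional


class VersionAwareBalancer:
    """Routes labelled requests to a matching version, spreading load evenly."""

    def __init__(self, servers: dict[str, int]):
        self.servers = servers  # server name -> version it runs
        self.load = {name: 0 for name in servers}  # requests handled so far

    def route(self, required_version: Optional[int] = None) -> str:
        # Step 1: the request carries a version label (None for a first request).
        # Step 2: only instances running that version are eligible.
        candidates = [
            name
            for name, version in self.servers.items()
            if required_version is None or version == required_version
        ]
        if not candidates:
            raise LookupError(f"no instance runs version {required_version}")
        # Step 3: pick the least-loaded eligible instance (name breaks ties).
        chosen = min(candidates, key=lambda name: (self.load[name], name))
        self.load[chosen] += 1
        return chosen
```

Unlabelled requests go to the least-loaded instance overall, which offsets the extra traffic that labelled requests pin to one version.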
29. Why is this a hard problem?
• Large installations have multiple distributed load balancers that must
be kept in sync, i.e. some load balancers may know about the new
version and some may not
• Not enough to put version number in message
– Suppose second request goes to a load balancer that does not yet know about
version N+1.
• Must keep messages balanced so that all servers handle roughly
the same number of requests.
[Figure: two /service load balancers in front of /service/vN, /service/vN+1, and /service/vN, which front six servers]
30. Operator undo
• After performing an operation in AWS, the operator may want
to go back to the original state, i.e. undo the
operation
• Not always that straightforward:
– Attaching volume is no problem while the instance is
running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on
number of running instances
• Cannot terminate additional instances, as the rule would
create new ones!
– Deleted / terminated / released resources are gone!
31. Undo for System Operators
[Figure: the administrator issues begin-transaction, do, do, do, rollback; added operations: commit and pseudo-delete]
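One way to read pseudo-delete is that a deletion inside a transaction only hides the resource, and the real deletion is deferred until commit. The sketch below illustrates that reading with an in-memory store; `ResourceStore` and its methods are hypothetical and do not correspond to any AWS API.

```python
class ResourceStore:
    """In-memory stand-in for cloud resources, supporting undo of deletions."""

    def __init__(self):
        self.resources = {}  # resource id -> attributes
        self.hidden = set()  # ids pseudo-deleted in the open transaction
        self.in_transaction = False

    def begin_transaction(self):
        self.in_transaction = True

    def delete(self, rid):
        if self.in_transaction:
            # Pseudo-delete: hide the resource but keep it recoverable.
            self.hidden.add(rid)
        else:
            del self.resources[rid]

    def visible(self):
        """Ids of resources the operator can currently see."""
        return sorted(r for r in self.resources if r not in self.hidden)

    def rollback(self):
        # Undo: pseudo-deleted resources simply become visible again.
        self.hidden.clear()
        self.in_transaction = False

    def commit(self):
        # Only now are the resources really gone.
        for rid in self.hidden:
            del self.resources[rid]
        self.hidden.clear()
        self.in_transaction = False
```

The sketch covers only deletion; undoing other operations, such as changes to auto-scaling rules, needs more than hiding a resource.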
32. Approach
[Figure: the administrator issues begin-transaction, do, do, do, rollback; the undo system senses cloud resource states]
33. Approach
[Figure: the administrator issues begin-transaction, do, do, do, rollback; the undo system senses cloud resource states and derives a goal state and an initial state]
34. Approach
[Figure: the administrator issues begin-transaction, do, do, do, rollback; the undo system senses cloud resource states, derives a goal state and an initial state, plans a set of actions, generates code, and executes it]
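The plan step can be illustrated with a deliberately simple substitute for a real planner: a diff between two sensed states. The sketch assumes that the goal state is the one sensed at begin-transaction and the initial state is the one sensed at rollback, and that a state is a mapping from resource id to attributes; both the representation and the names `plan_undo` and `execute` are assumptions.

```python
from __future__ import annotations


def plan_undo(initial: dict[str, dict], goal: dict[str, dict]) -> list[tuple]:
    """Derive a set of actions that turns the initial state into the goal state."""
    actions: list[tuple] = []
    # Resources created since the transaction began must be removed.
    for rid in sorted(initial.keys() - goal.keys()):
        actions.append(("delete", rid))
    # Resources that disappeared must be recreated.
    for rid in sorted(goal.keys() - initial.keys()):
        actions.append(("create", rid, goal[rid]))
    # Resources that changed must be restored to their earlier attributes.
    for rid in sorted(initial.keys() & goal.keys()):
        if initial[rid] != goal[rid]:
            actions.append(("update", rid, goal[rid]))
    return actions


def execute(state: dict[str, dict], actions: list[tuple]) -> dict[str, dict]:
    """Apply the planned actions to a copy of the state and return the result."""
    result = {rid: dict(attrs) for rid, attrs in state.items()}
    for action in actions:
        if action[0] == "delete":
            del result[action[1]]
        else:  # "create" and "update" both set the resource's attributes
            result[action[1]] = dict(action[2])
    return result
```

A plain diff ignores ordering constraints between actions (for example, the auto-scaling rule that recreates terminated instances), which is where a planner would be needed.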
35. Upgradability as a process and product quality
• Architecture of the system being upgraded can
affect the process of installation
– Suppose the system checks for version information
from dependent libraries. Then the process must
encompass descriptions of what to do if an error
condition occurs.
• Process of upgrade can affect the architecture of
the product.
– Suppose the process is supported by a tool that
checks the health of the installation of version N+1.
Then the system must make visible the information
used by this tool.
36. Summary
• Upgrade is an important problem
– Upgrade failures affect user satisfaction
– Upgrade failures happen frequently
• Upgrade involves the interaction of product and
process quality issues.
– Communities are focussed on improving the quality of
either the process or the product, not the joint
process/product quality.
• Multiple opportunities for research exist.
37. Q&A
Thank You!
Research study opportunities in dependable cloud computing:
• Software Architecture
• Data Management
• Performance Engineering
• Autonomic Computing
To find out more, send your CV and undergraduate details to
students@nicta.com.au