Slide presentation of "How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform"

How Bad Can a Bug Get?
An Empirical Analysis of Software Failures
in the OpenStack Cloud Computing Platform
Domenico Cotroneo*, Luigi De Simone*, Pietro Liguori*,
Roberto Natella*, Nematollah Bidokhti**
*DIETI, Università degli Studi di Napoli Federico II, Italy
**Futurewei Technologies, Inc., USA
*{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it **nbidokht@futurewei.com
ESEC/FSE 2019, Tallinn, Estonia, 26-30 August, 2019

Problem: The fragility of cloud computing infrastructure software
Gunawi et al., 2016. “Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages”. In Proc. SoCC

Our case study: OpenStack
Sub-systems: Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift
Example: instance creation request
1. Failure notified by a timely API error (Fail-stop)
2. Log messages with CRITICAL or ERROR severity
   2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception in API method …
3. Failure is isolated

Contribution
- Empirical analysis of high-severity failures in the OpenStack cloud computing platform:
  RQ1: Are failures actually “fail-stop”?
  RQ2: Are failures logged?
  RQ3: Are failures propagated across sub-systems?
- Artifacts for reproducing our experimental environment in a virtual machine:
  - DOI: 10.6084/m9.figshare.8242877

Fault Injection Methodology
Injected bug:
    # ~/nova/compute/api.py
    # ORIGINAL CODE
    # self.compute_task_api.schedule_and_build_instances(instanceID, build_parameters)
    # BUGGY CODE (missing parameter)
    self.compute_task_api.schedule_and_build_instances(instanceID)
Workload
Workload Logs
- API Errors
  - openstack instance create
- Assertion (Healthy) Checks
  - Network Status: Active
  - Instance Status: Error
OpenStack sub-systems Logs
    2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions Unexpected exception in API method …

Overview of a fault injection experiment
Original Python code:
    iface_name = self.get_interface_name(network, port)
Injected Python code:
    if bug_trigger:
        # BUGGY CODE (FAULTY ROUND)
        # Missing Parameter (MP)
        iface_name = self.get_interface_name(network)
    else:
        # CORRECT CODE (FAULT-FREE ROUND)
        iface_name = self.get_interface_name(network, port)
TIMELINE: Faulty round (fault ON), Fault-free round (fault OFF), Clean-up
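The two-round scheme can be tried out in isolation. The following is a minimal sketch, not the authors' tooling: the function names, the `BUG_TRIGGER` environment variable, and the returned interface name are all hypothetical, chosen only to show how one trigger switches the same code between a faulty and a fault-free round.

```python
import os


def get_interface_name(network, port):
    """Stand-in for the real method; returns a made-up interface name."""
    return f"tap-{network}-{port}"


def bug_trigger_enabled():
    # Assumption: the trigger is read from an environment variable so that it
    # can be flipped between rounds without touching the code again.
    return os.environ.get("BUG_TRIGGER", "0") == "1"


def plug(network, port):
    if bug_trigger_enabled():
        # Faulty round: Missing Parameter (MP), 'port' is dropped.
        return get_interface_name(network)
    # Fault-free round: original call.
    return get_interface_name(network, port)


def run_round(trigger_on):
    """Run one round and report whether the call succeeded."""
    os.environ["BUG_TRIGGER"] = "1" if trigger_on else "0"
    try:
        return ("ok", plug("net0", "port0"))
    except TypeError as exc:
        # In Python, a missing positional argument surfaces as a TypeError.
        return ("error", type(exc).__name__)


faulty = run_round(True)
fault_free = run_round(False)
```

In this toy case the missing parameter fails immediately at the call site. The interesting cases in the study are those where the effect of the bug is not reported this promptly.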

Which bugs should we inject?
We went through problem reports on Launchpad to identify recurring bug-fixing changes in OpenStack
[Bar chart: number of bug fixes (y-axis, 0 to 25) per fault type (x-axis): API, DICT, SQL, RPC, SYSTEM, AGENT/PLUGIN]
--- nova/virt/libvirt_conn.py 2011-01-25 12:44:26 +0000
+++ nova/virt/libvirt_conn.py 2011-01-25 20:42:26 +0000
@@ -1268,13 +1268,13 @@
if(ip_version == 4):
# Allow DHCP responses
dhcp_server = self._dhcp_server_for_instance(instance)
- our_rules += ['-A %s -s %s -p udp --sport 67 --dport 68' %
- (chain_name, dhcp_server)]
+ our_rules += ['-A %s -s %s -p udp --sport 67 --dport 68 '
+ '-j ACCEPT ' % (chain_name, dhcp_server)]
elif(ip_version == 6):
             Sub-system
Fault type   Nova   Cinder   Neutron   ALL
MFC           110       55        36   201
WPV            60       40        36   136
MP             57       38        36   131
WRV           149       96        59   304
TE             63       40        36   139
ALL           439      269       203   911
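The totals in the table can be recomputed from the per-cell counts. The snippet below is a small consistency check that uses only the numbers from the table; the variable names are arbitrary.

```python
# Counts per fault type and sub-system, copied from the table.
faults = {
    "MFC": {"Nova": 110, "Cinder": 55, "Neutron": 36},
    "WPV": {"Nova": 60, "Cinder": 40, "Neutron": 36},
    "MP": {"Nova": 57, "Cinder": 38, "Neutron": 36},
    "WRV": {"Nova": 149, "Cinder": 96, "Neutron": 59},
    "TE": {"Nova": 63, "Cinder": 40, "Neutron": 36},
}

# Row totals (the "ALL" column) and column totals (the "ALL" row).
row_totals = {ft: sum(per_sub.values()) for ft, per_sub in faults.items()}
col_totals = {
    sub: sum(per_sub[sub] for per_sub in faults.values())
    for sub in ("Nova", "Cinder", "Neutron")
}
grand_total = sum(row_totals.values())
```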

Fail-stop Behavior
Workload (TIMELINE): Add Role, Create Keypair, Create Security Group, Create Router, Create Network, Create Instance, Create Floating IP, Create Volume, Reboot Instance, Create Image, Create Domain, Create Project, Create User, Create Subnetwork, Set Gateway, Add Floating IP to Instance, Attach Volume to Instance, Cleanup Resources
API Error: openstack instance create
When an API call generates an error, the workload is aborted
Assertion Checks on the status of the virtual resources
Network Status: Active
Non Fail-stop Behavior
Workload (TIMELINE): Add Role, Create Keypair, Create Security Group, Create Router, Create Network, Create Instance, Create Floating IP, Create Volume, Reboot Instance, Create Image, Create Domain, Create Project, Create User, Create Subnetwork, Set Gateway, Add Floating IP to Instance, Attach Volume to Instance, Cleanup Resources
Instance Status: Error
No API Error!
The workload continues the execution regardless of the assertion check(s)
Failure delay
API Error: Cannot 'attach_volume' instanceID while it is in vm_state error
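The outcome of an experiment can be labeled from two observations: whether an API error occurred and whether any assertion check failed. The sketch below is one possible encoding under that assumption. The category names follow the slides, while the function names and the boolean inputs are hypothetical simplifications.

```python
def classify(api_error, assertion_failures):
    """Map the two observations of an experiment to a failure type."""
    if api_error and not assertion_failures:
        return "API Error Only"
    if api_error and assertion_failures:
        return "Assertion Failure(s) & API Error"
    if assertion_failures:
        return "Assertion Failure(s) Only"
    return "No Failure"


def is_fail_stop(failure_type):
    # Only a failure notified by a timely API error counts as fail-stop.
    return failure_type == "API Error Only"


# The case on this slide: the instance is in Error state, and the API error
# only arrives later, on the attach-volume call.
late = classify(api_error=True, assertion_failures=True)
# No API error at all, but a virtual resource is in an incorrect state.
silent = classify(api_error=False, assertion_failures=True)
```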

RQ1: Does OpenStack Show a Fail-Stop Behavior?
[Bar chart: percentage of experiments (y-axis, 0% to 100%) per failure type (x-axis), for Nova, Cinder, Neutron and all sub-systems]
Failure types:
- API Error Only (Fail-Stop): failures notified by a timely API error
- Assertion Failure(s) & API Error (Non Fail-Stop): failures that were notified with a delay
- Assertion Failure(s) Only (Non Fail-Stop): failures with no API error (but virtual resources are in incorrect state)
Bar labels (in slide order): 40%, 37%, 23%, 35%, 46%, 18%, 60%, 32%, 7%, 44%, 38%, 18%

RQ1: Does OpenStack Show a Fail-Stop Behavior?
Failure type                                                 Subsystem   Median Latency [s]
Assertion Failure(s) followed by API Error (Non Fail-stop)   Nova                    152.25
                                                             Cinder                   74.52
                                                             Neutron                 144.72
API Error Only (Fail-stop)                                   Nova                      3.73
                                                             Cinder                    0.30
                                                             Neutron                   0.30
Long API error latency (2 minutes on average)
[Plot: probability (y-axis, 0 to 1) versus time (s) (x-axis, 0 to 400), for Nova, Neutron and Cinder]
RQ2: Is OpenStack Able to Log Failures?
- In 8.5% of the experiments, there were no log messages with CRITICAL or ERROR severity
Logging coverage
Subsystem   API Errors Only   Assertion Failure(s) and API Errors   Assertion Failure(s) Only
Nova        90.32%            82.56%                                80.77%
Cinder      100%              100%                                  95.65%
Neutron     98.67%            95%                                   66.67%
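Logging coverage can be read as the share of failed experiments whose logs contain at least one CRITICAL or ERROR message. The sketch below computes it under that reading. The line layout is assumed from the single log line shown in the slides, and the second sample line is made up for illustration.

```python
import re

# Assumed layout, based on the log line shown in the slides:
# "<date> <time> <SEVERITY> <logger> <message>"
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (?P<severity>[A-Z]+) "
)
HIGH_SEVERITY = {"CRITICAL", "ERROR"}


def is_logged(log_lines):
    """True if at least one line has CRITICAL or ERROR severity."""
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("severity") in HIGH_SEVERITY:
            return True
    return False


def logging_coverage(experiments):
    """Percentage of experiments with at least one high-severity message."""
    logged = sum(1 for lines in experiments if is_logged(lines))
    return 100.0 * logged / len(experiments)


experiments = [
    # Log line from the slides (trailing ellipsis dropped).
    ["2019-08-27 15:13:20.106 ERROR nova.api.openstack.extensions "
     "Unexpected exception in API method"],
    # Made-up line: a failed experiment that left only an INFO message.
    ["2019-08-27 15:13:21.000 INFO example.logger made-up message"],
]
coverage = logging_coverage(experiments)
```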

RQ3: Do Failures Propagate Across OpenStack?
Faulty Round
[Graph: injection points (Injection in Nova, Injection in Cinder, Injection in Neutron), sub-systems (Nova, Cinder, Neutron), API errors (Nova API Error, Cinder API Error, Neutron API Error) and assertion failures (Failure SSH, Failure Instance Active, Failure Volume Attached, Failure Volume Created)]
The failures propagate across OpenStack services in a significant number of cases (37.5% of the failures)
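One simple way to flag propagation is to compare the sub-system where the bug was injected with the sub-systems where the failure was observed. The sketch below uses that rule as an assumption; the authors' actual criterion may differ, and the four experiments are made-up examples.

```python
def propagated(injected, affected):
    """True if the failure shows up in a sub-system other than the injected one."""
    return any(sub != injected for sub in affected)


# Made-up experiments: (sub-system with the injected bug, sub-systems where
# errors or assertion failures were observed).
experiments = [
    ("Neutron", {"Neutron"}),
    ("Neutron", {"Neutron", "Nova"}),
    ("Cinder", {"Nova"}),
    ("Nova", {"Nova"}),
]

# Percentage of experiments in which the failure left the injected sub-system.
share = 100.0 * sum(propagated(inj, aff) for inj, aff in experiments) / len(experiments)
```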

RQ3: Do Failures Propagate Across OpenStack?
Fault-Free Round after fault removal
[Graph: injection points (Injection in Nova, Injection in Cinder, Injection in Neutron), sub-systems (Nova, Cinder, Neutron), API errors (Nova API Error, Cinder API Error, Neutron API Error) and assertion failures (Failure SSH, Failure Instance Active, Failure Volume Attached, Failure Volume Created)]
Persistent Failures
Even after the fault is disabled (fault-free round), OpenStack still experiences failures (7.5% of the cases).

Conclusion (Answers) (1/2)
- RQ1: Are failures actually “fail-stop”?
  - Answer: In the majority of the cases, OpenStack does not behave in a “fail-stop” way (late or no API error)
  - Suggestions: Mitigate failures by actively checking the status of virtual resources as in our assertion checks (e.g., checks incorporated in a monitoring solution)
- RQ2: Are failures logged?
  - Answer: In a small fraction of the experiments, there was no indication of the failure in the logs
  - Suggestions: Improve logging in the source code (e.g., by checking for errors returned by the faulty function calls)
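The suggestion of actively checking the status of virtual resources could take the form of a polling check. The following is a minimal sketch under assumptions: `get_status` is a hypothetical callable that would query the cloud API in a real deployment, and the status strings are illustrative.

```python
import time


def wait_for_status(get_status, expected, timeout_s=5.0, poll_s=0.01,
                    error_states=("ERROR",)):
    """Poll a resource until it reaches the expected status.

    Returns (True, status) on success, or (False, last status seen) if the
    resource enters an error state or the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == expected:
            return True, status
        if status in error_states or time.monotonic() >= deadline:
            return False, status
        time.sleep(poll_s)


# Simulated resource that ends up in an error state without any API error.
statuses = iter(["BUILD", "BUILD", "ERROR"])
ok, last = wait_for_status(lambda: next(statuses), expected="ACTIVE")
```

A check like this reports the incorrect state as soon as it is visible, instead of waiting for a later API call to fail.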

Conclusion (Answers) (2/2)
- RQ3: Are failures propagated across sub-systems?
  - Answer: In a significant share of the failures (37.5%), the injected bugs propagated across several OpenStack sub-systems. There were also relevant cases of failures that caused subtle residual effects on OpenStack
  - Suggestions: Improve resource clean-up on errors, to prevent propagation across service API calls and across subsystems.
Use our artifact to support future research on mitigating the impact of software bugs (DOI: 10.6084/m9.figshare.8242877)
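The clean-up suggestion for RQ3 can be illustrated with a roll-back on error. This is a sketch only: `FakeCloud` is an in-memory stand-in written for this example, not an OpenStack client, and its method names are hypothetical.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration only)."""

    def __init__(self):
        self.volumes = set()

    def create_volume(self, name):
        self.volumes.add(name)
        return name

    def attach_volume(self, volume, instance):
        # Made-up failure: the attach step always fails in this example.
        raise RuntimeError(f"cannot attach {volume} to {instance}")

    def delete_volume(self, name):
        self.volumes.discard(name)


def create_and_attach(cloud, volume, instance):
    """Create a volume and attach it; remove the volume if attaching fails."""
    created = cloud.create_volume(volume)
    try:
        cloud.attach_volume(created, instance)
    except Exception:
        # Roll back the partially created resource, then report the error.
        cloud.delete_volume(created)
        raise
    return created


cloud = FakeCloud()
try:
    create_and_attach(cloud, "vol0", "vm0")
    outcome = "ok"
except RuntimeError:
    outcome = "error"
```

After the failed call no volume is left behind, so a later request does not start from residual state.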
