The slide presentation of the paper "Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning" at The 32nd International Symposium on Software Reliability Engineering (ISSRE 2021)
1. Enhancing the Analysis of Software Failures in
Cloud Computing Systems with Deep Learning
Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
DIETI, Università degli Studi di Napoli Federico II, Italy
{cotroneo, luigi.desimone, pietro.liguori, roberto.natella}@unina.it
The 32nd International Symposium on Software Reliability Engineering
2. ISSRE, October 25 - 28, 2021 pietro.liguori@unina.it - 2
Cloud Computing Infrastructure
Analyzing how faults can turn into service failures (Failure
Mode Analysis) is very difficult and time-consuming, even for
expert developers
• Huge volumes of data (hundreds of MBs, thousands of events)
• Large number of fault experiments
• High complexity, non-determinism
[Figure: cloud computing infrastructure (IaaS). Labels: faults (storage, network, software, etc.); failures (data loss, resource unavailable, etc.); clients issuing service requests; sys. admins; failure data.]
3. Our case study: OpenStack
Silent failures occur as omissions, delays, or out-of-order events in these workflows
[Figure: OpenStack services (Nova, Horizon, Cinder, Neutron, Glance, Keystone, Swift) handling an instance creation request. Workflow labels: auth-token validation, get image id, get IP address, volume attachment.]
4. Events in Fault-Injection Experiments
5. Contribution
A novel approach for discovering the classes of
failure ("failure modes") of cloud computing systems,
using fault injection and deep learning
Case study on a dataset of thousands of failures of
the OpenStack cloud computing platform
The raw failure data (logs, event traces) are clustered into
a few failure modes, which eases interpretation by developers
and sysadmins
6. Contribution (cont.)
The failure dataset containing the events collected in
OpenStack during our fault-injection experiments is
publicly available on GitHub:
https://github.com/dessertlab/Failure-Dataset-OpenStack
The paper is available on ScienceDirect:
https://doi.org/10.1016/j.jss.2021.111043
7. Failure Mode Analysis Based on Plain Sequences of Events
[Figure: workflow. Stages shown: instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; execution with fault injection, producing traces under fault-injected conditions; vector representation; clustering into clusters of failure modes (FAIL #1, FAIL #2, FAIL #3); visualization.]
Occurrence vector example: for the trace AACABBA, the occurrence vector is <A = 4, B = 2, C = 1>, i.e., the events A, B, and C happened 4, 2, and 1 times, respectively, during the failure.
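The occurrence-vector example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `occurrence_vector` and the fixed event alphabet are illustrative assumptions, not part of the slides or of the authors' tooling.

```python
from collections import Counter


def occurrence_vector(trace, alphabet):
    """Count how many times each event type of `alphabet` appears in `trace`."""
    counts = Counter(trace)
    # Event types absent from the trace get a count of 0.
    return [counts[event] for event in alphabet]


# Hypothetical alphabet of event types, taken from the slide's example.
alphabet = ["A", "B", "C"]
vector = occurrence_vector("AACABBA", alphabet)
print(dict(zip(alphabet, vector)))  # {'A': 4, 'B': 2, 'C': 1}
```

Note that this representation keeps only event counts, so two traces with the same events in a different order map to the same vector.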
8. Failure Mode Analysis Based on Anomaly Detection
[Figure: workflow. Stages shown: instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; fault-free execution, producing traces under fault-free conditions; execution with fault injection, producing traces under fault-injected conditions; model training of normal behavior; anomaly detection; clustering into clusters of failure modes (FAIL #1, FAIL #2, FAIL #3); visualization.]
Example traces shown in the figure: AACABBA; AABBBBCA, AABBBABCC, AABBABBC
Anomaly vector (spurious anomalies, missing anomalies): < A = 1, B = 0, C = 1, A = 0, B = 2, C = 1 >
Cotroneo, Domenico, et al. "Enhancing failure propagation analysis in cloud computing
systems." 2019 IEEE 30th International Symposium on Software Reliability Engineering
(ISSRE). IEEE, 2019.
9. Proposed Solution: Deep Embedded Clustering (DEC)
[Figure: workflow. Stages shown: instrumentation of the communication libraries (REST APIs, Message Queues, …) on each node; execution with fault injection, producing traces under fault-injected conditions; vector representation; autoencoder (encoder and decoder, trained on the reconstruction error); clustering (encoder, embedded features, cluster layer) into clusters of failure modes (FAIL #1, FAIL #2, FAIL #3); visualization.]
This solution can also be used in
combination with anomaly detection, by
applying it to anomaly vectors
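The slides do not spell out how the cluster layer works. As a rough illustration, the sketch below follows the formulation commonly given for DEC: a Student's t kernel turns distances between embedded features and cluster centroids into soft assignments, and a sharpened target distribution drives the clustering loss (a KL divergence). The function names, the toy embeddings, and the choice `alpha=1.0` are assumptions made for this example; in a real pipeline the embeddings would come from the trained encoder.

```python
import numpy as np


def soft_assignment(z, centroids, alpha=1.0):
    """Soft cluster assignments q[i, j] from a Student's t kernel."""
    # Squared Euclidean distance between each embedded point and each centroid.
    dist_sq = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened targets p[i, j]: square q and normalize by cluster frequency."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_divergence(p, q):
    """Clustering loss KL(P || Q), minimized when fine-tuning the encoder."""
    return float(np.sum(p * np.log(p / q)))


# Toy embedded features (hypothetical encoder output) and two centroids.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

q = soft_assignment(z, centroids)
p = target_distribution(q)
labels = q.argmax(axis=1)  # hard cluster label per experiment
loss = kl_divergence(p, q)
```

In a full implementation, the encoder weights and the centroids would be updated by gradient descent on this loss, with the target distribution recomputed periodically.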
11. Clustering without Anomaly Detection
Clustering Approach           Workload
                              DEPL    NET     STO
k-medoids w/o fine-tuning     0.70    0.80    0.80
k-medoids with fine-tuning    0.74    0.85    0.82
DEC                           0.86    0.86    0.92
DEC achieves clusters with higher purity
compared to traditional clustering, both without
and with manual fine-tuning of feature weights
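The slides do not define purity. Assuming the standard definition (each cluster is credited with its most frequent ground-truth class, and the credited experiments are divided by the total), it can be computed as in the sketch below. The function name and the toy labels are illustrative and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    members = defaultdict(list)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster].append(truth)
    # For each cluster, count the samples of its most frequent true class.
    majority_total = sum(
        Counter(truths).most_common(1)[0][1] for truths in members.values()
    )
    return majority_total / len(true_labels)


# Toy example: six experiments, two clusters, three ground-truth failure modes.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["instance", "instance", "volume", "volume", "volume", "network"]
score = purity(clusters, truth)  # 4 of 6 experiments match their cluster's majority
```

A purity of 1.0 means every cluster contains experiments from a single failure mode.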
12. Clustering with Anomaly Detection
Clustering Approach           Workload
                              DEPL    NET     STO
k-medoids w/o fine-tuning     0.80    0.78    0.87
k-medoids with fine-tuning    0.94    0.86    0.90
DEC                           0.84    0.83    0.89
DEC approaches the performance of manually-
tuned clustering with anomaly detection
13. Failure Modes Distribution
[Figure: bar chart. Categories: Instance Failure, Volume Failure, Network Failure, SSH Failure, Cleanup Failure, No Failure. Series: Ground Truth, k-medoids, k-med with fine-tuning, DEC. Vertical axis: 0 to 1800.]
14. Conclusion
We presented a novel approach for analyzing failure
data from cloud systems, by using unsupervised
learning algorithms and deep learning
We presented results on failure data from the popular
OpenStack cloud computing platform
• The approach can achieve performance comparable to, or in
some cases even better than, manually-tuned clustering
• The approach performs better than unsupervised clustering
without feature engineering