Introduction to the field of AIOps, large-scale monitoring, and observability. Provides an example illustrating how deep learning can be used to analyze distributed traces to reveal exactly which component is causing a problem in microservice applications.
Presentation given at the National University of Ireland, Galway (NUI Galway)
on 2019.08.20.
Thanks to Prof. John Breslin
AIOps: Anomaly Detection of Distributed Traces
1. AIOps: Anomaly Detection of Distributed Traces
Prof. Dr. Jorge Cardoso
E-mail: jorge.cardoso@huawei.com
Planet-scale AIOps / Chief Architect
Ireland and Munich Research Centers
Definition (Gartner) [AIOps]
AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly
enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight.
National University of Ireland, Galway (NUI Galway)
2019.08.20
2. HUAWEI TECHNOLOGIES CO., LTD. 1
AIOps: Anomaly Detection of Distributed Traces
Abstract and bio
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms can no longer be done
manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting AI
to provide tools for autonomous cloud operations. This talk will introduce the field of AIOps (artificial intelligence for IT
operations) and present what the next generation of AI-driven IT Operations tools and platforms will be. Our current research
in Munich explores how AIOps, and, namely, deep learning, machine learning, distributed traces, graph analysis, time-series
analysis, sequence analysis, and log analysis can be used to effectively detect and localize failures in cloud infrastructures to
reduce the workload of SRE operators. In particular, I will describe how distributed traces can be analyzed using deep
learning to effectively identify anomalies.
Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Ireland and Munich
Research Centers. Previously he worked for several major companies such as SAP Research
(Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise
Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology
(Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal).
His latest book, “Fundamentals of Service Systems”, was published by Springer. His
current research involves the development of the next generation of Cloud Operations and Analytics
using AI, Cloud Reliability and Resilience, and High Performance Business Process Management
systems. He has a Ph.D. in Computer Science from the University of Georgia (USA).
Jorge Cardoso
GitHub | Slideshare.net | GoogleScholar
Previous Positions: SAP Research (Germany), The Boeing Company (USA)
Interests: AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
3. HUAWEI TECHNOLOGIES CO., LTD. 2
Operations and Maintenance
Global Trends
[Figure: global O&M trends spanning core networks, IoT infrastructure, network appliances, radio access networks, content delivery networks, virtual CPE, and mobile edge computing, applied in verticals such as energy, manufacturing, healthcare, networking, and smart cities/buildings, with customers such as PHILIPS Medical and Geely]
The cloud has evolved from a disruptor to an obvious solution for enterprise IT. However, the
increasing complexity of public, private, edge, and hybrid cloud environments presents new
challenges for cost-efficient operations and maintenance (O&M) tasks
4. HUAWEI TECHNOLOGIES CO., LTD. 3
Underlying Problem
Increasing Complexity of Systems
The management of IT O&M becomes more complex every year
► Digitalization with cloud, mobile, and edge
► Large-scale microservice and serverless systems
► Increase in IT size, and event/alert volumes
► SLA guarantee for critical applications
► Nano/microservices break existing monitoring tools
Microservices/serverless add complexity to communications between
services, making systems far more difficult to maintain and troubleshoot.
With a monolithic application, when there is a problem, only one product
owner needs to be contacted.
[Figure: Huawei Cloud (HC Cloud) resource hierarchy, from data centers (1..10) down to servers (1..20,000), sockets, cores (1..16), and VMs]
Huawei Cloud is an example of a planet-scale complex system:
40 availability zones, 23 geographical regions
(from Germany, France, South/Central America, Hong Kong, and Russia to Thailand and South Africa)
20M VMs (3 VMs/core), billions of metrics
(for illustrative purposes only)
However, existing O&M tools still rely on old approaches which
involve manual rule configuration, averages and standard deviations, and data
wrangling to manage physical assets.
5. HUAWEI TECHNOLOGIES CO., LTD. 4
Industry Trends
Bringing AI to O&M
Startups in the field of AIOps
The market is seeing interesting mergers and acquisitions by important vendors
► Cloud providers are also emerging as ITIM tool providers purely for their IaaS offerings
At the moment, these tools have visibility limited to the cloud providers' own environment
► Nonetheless, they are valuable assets which can also be offered to external customers
VMware’s acquisition of Wavefront is interesting from an analytics perspective
ScienceLogic has acquired AppFirst, enabling it to acquire granular application metrics
Cisco has acquired AppDynamics and Perspica, enabling it to acquire real-time monitoring
and AI capabilities
CA Technologies’ acquisition of Automic brings in an additional automation technology to its
existing portfolio
Spinoff of Hewlett Packard Enterprise’s (HPE's) software business to Micro Focus
VMware acquired Wavefront (2017)
Cisco acquired AppDynamics (2017)
Cisco acquired Perspica for ML (Oct 2017)
CA acquired Automic (2017)
Google acquired StackDriver (2014)
Amazon AWS
6. HUAWEI TECHNOLOGIES CO., LTD. 5
Industry Trends
Bringing AI to O&M
“Virtyt Koshi, the EMEA general manager for virtualization
vendor Mavenir, reckons Google is able to run all of its data
centers in the Asia-Pacific with only about 30 people, and that a
telco would typically have about 3,000 employees to manage
infrastructure across the same area."
Reducing O&M Costs
Google: 30 people
Others: 3000 people
Automation can
reduce the number
of incidents by 50%
“We began applying machine learning two years ago (2016) to operate our data centers more efficiently… over the past few
months, DeepMind researchers began working with Google’s data center team to significantly improve the system’s utility. Using
neural networks trained on different operating scenarios and parameters, we created a more efficient and adaptive framework to
understand data center dynamics and optimize efficiency.” Eric Schmidt, December 2018
7. HUAWEI TECHNOLOGIES CO., LTD. 6
AIOps @ Huawei Cloud
Components and Troubleshooting Scenarios
Troubleshooting Scenarios
Response Time Analysis
A service started responding to requests more slowly than normal
The change happened suddenly as a consequence of a regression in
the latest deployment
System Load
The demand placed on the system, e.g., REST API requests per
second, has increased since yesterday
Error Analysis
The rate of requests that fail -- either explicitly (HTTP 5xx) or
implicitly (HTTP 2xx with wrong content) -- is increasing slowly, but
steadily
System Saturation
The resources (e.g., memory, I/O, disk) used by key controller
services are rapidly reaching threshold levels
https://intl.huaweicloud.com/
System’s Components
► OBS. Object Storage Service is a stable, secure, and efficient cloud
storage service
► EVS. Elastic Volume Service offers scalable block storage for
servers
► VPC. Virtual Private Cloud enables users to create private, isolated
virtual networks
► ECS. Elastic Cloud Server is a cloud server that provides
scalable, on-demand computing resources
8. HUAWEI TECHNOLOGIES CO., LTD. 7
AIOps @ Huawei Cloud
Monitoring
Logs. Service, microservices, and applications
generate logs, composed of timestamped
records with a structure and free-form text,
which are stored in system files.
Metrics. Examples of metrics include CPU
load, memory available, and the response
time of an HTTP request.
Traces. Traces record the workflow and tasks
executed in response to an HTTP request.
Events. Major milestones which occur within a
data center can be exposed as events.
Examples include alarms, service upgrades,
and software releases.
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332",
"duration": 124957, “annotations": [{"key": "http.status_code", "value": "200"}, {"key":
"http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value":
"HTTP"}, "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}]
{“tags": [“mem”, “192.196.0.2”, “AZ01”], “data”: [2483, 2669, 2576, 2560, 2549, 2506,
2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
{"id": "dns_address_match“, "timestamp": 1529029301238, …}
{"id": "ping_packet_loss“, "timestamp": 152902933452, …}
{"id": "tcp_connection_time“, "timestamp": 15290294516578, …}
{"id": "cpu_usage_average “, "timestamp": 1529023098976, …}
2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace -
default default] Exception during message handling
System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data:
Logs, Metrics, Traces, Events, Topologies
9. HUAWEI TECHNOLOGIES CO., LTD. 8
Anomaly Detection | Fault Diagnosis | Fault Prediction | Fault Recovery
AIOps @ Huawei Cloud
O&M Troubleshooting Tasks
[Figure: Reference architecture for AIOps platforms. Data Source Connectors ingest Metrics, Tracing, Events, App Logs, and Topology data; a Distribution Layer feeds Real-time Ingestion and Historical Data Storage; Stream Processing and Batch Processing populate a Time Series Registry and an Analytical Data Store; a Stability Analysis layer (Machine Learning and Analytics) provides Anomaly Detection, Fault Diagnosis, Fault Prediction, Fault Recovery, Fault Prevention, and Topology Analysis; Visualization, Exploration, and Reporting serve Business users, Operators, Developers, and DevOps]
► Anomaly detection. Determine what constitutes normal system
behavior, and then discern departures from that normal
system behavior
► Fault diagnosis (root cause analysis). Identify links of
dependency that represent causal relationships to discover the
true source of an anomaly
► Fault Prediction. Use of historical or streaming data to predict
incidents with varying degrees of probability
► Fault recovery (Incident Remediation). Explore how decision
support systems can manage and select recovery processes to
repair failed systems
O&M Troubleshooting Tasks
iForesight is an intelligent new tool aimed at SRE cloud maintenance teams.
It uses artificial intelligence to enable them to quickly detect anomalies
when cloud services are slow or unresponsive.
iForesight 3.0
Operational AI & AIOps
10. HUAWEI TECHNOLOGIES CO., LTD. 9
SAP HANA In-Memory Predictive Analytics
Typical toolbox of algorithms for AIOps
https://www.slideshare.net/SAPTechnology/hana-sps08-newpredictiveanalysislibrary
► Generic algorithms are building blocks for AIOps, but they rarely
work as ready-to-use solutions
► Their applicability is often illustrated with toy examples which do
not hold in production
► Algorithms need to be wisely combined, modified, and complemented
to build new techniques (and methods) which are useful for O&M
in a particular context
‒ e.g., unstable systems, high noise-to-signal ratio, multi-modal data
11. HUAWEI TECHNOLOGIES CO., LTD. 10
AIOps @ Huawei Cloud
Trace Analysis
[Figure: distributed traces across service nodes N1..N4. User commands (image delete, image list, flavor set, flavor show, flavor unset, …) invoke the services keystone -> nova -> glance -> network. A trace t1 is a sequence of spans A, B, C, Z, where spans may carry warnings (A, B, C, Y) or errors (Z) and calls branch with different probabilities (e.g., 30% / 70%)]
Tracing has applicability in many fields besides microservices, e.g., end-to-end enterprise applications orchestrating mobile
apps, edges, websites, storage, and centralized systems.
12. HUAWEI TECHNOLOGIES CO., LTD. 11
► Monitoring and Incident Detection
Anomaly detection in large-scale systems is widely studied
IT monitoring systems typically use application logs and
resource metrics to detect failures
⎻ Distributed tracing is becoming the third pillar of
microservices observability
⎻ Logs and metrics were previously investigated (see [1-4])
► Single- and Bi-Modal Models
Use of single- and bi-modal models to capture traces as
sequences of events and their response times
Use of a sequential model representation based on long
short-term memory (LSTM) networks
► Results
Detect anomalies in Huawei Cloud infrastructure
The novel approach outperforms other deep-learning
methods based on traditional architectures
Trace Analysis
Using Multimodal Deep Learning
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019, 14-17.05, Cyprus.
J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany.
Long Short Term Memory (LSTM)
LSTMs [1] are models which capture sequential data by
using an internal memory. They are very good for analyzing
sequences of values and predicting the next point of a
given time series. This makes LSTMs adequate for Machine
Learning problems that involve sequential data (see [3])
such as speech recognition, machine translation, visual
recognition and description, and distributed traces
analysis.
[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation
9.8 (1997): 1735-1780.
[2] Sequential processing in LSTM (from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
[3] LSTM model description (from Andrej Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
13. HUAWEI TECHNOLOGIES CO., LTD. 12
A span (event) is a vector of key-value pairs $(k_i, v_i)$ describing a
characteristic of a microservice at time $t_i$
A trace $T$ is an enumerated collection of events (i.e., spans)
sorted by timestamp: $T = (e_0, e_1, \dots, e_i)$ [16]
An event contains
⎻ trace id, event id, parent id
⎻ protocol, host ip, status code, url
⎻ response time, timestamp
Traces can have different lengths
⎻ $T_p = (e_0^p, e_1^p, e_2^p, \dots, e_i^p)$ and $T_q = (e_0^q, e_2^q, e_1^q, \dots, e_i^q)$ are different ($e_1$ and $e_2$ are swapped) but originate from the same activity
⎻ Possibly caused by concurrent systems
Label each trace with
⎻ concat(url, status code, host ip)
⎻ $N_l$ labels
Pad trace vectors up to length $T_l$ or truncate longer traces
Trace Analysis
Approach
Trace structure
⎻ Trace one-hot categorical encoding [17]
⎻ $D_1 = (N_t, T_l, N_l)$
⎻ $N_t$ = # traces, $T_l$ = max length, $N_l$ = # labels
Response time
⎻ Group response times by label
⎻ Min-max scaling to [0, 1]
⎻ $D_2 = (N_t, T_l, 1)$
⎻ Last dimension is the response time
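As an illustration of the preprocessing steps above, the following minimal Python sketch builds the label per event, pads or truncates traces, and produces the $D_1$ and $D_2$ tensors; field and function names are illustrative assumptions rather than the authors' implementation (response times are scaled globally here for brevity, whereas the slides group them by label).

import numpy as np

PAD = "!0"  # padding label, also used for the shifted output sequences

def span_label(span):
    # Label each event with concat(url, status code, host ip)
    return f'{span["url"]}_{span["status_code"]}_{span["host_ip"]}'

def encode_traces(traces, T_l):
    # traces: list of traces; each trace is a list of span dicts with
    # "url", "status_code", "host_ip", "response_time"
    labels = sorted({span_label(s) for t in traces for s in t} | {PAD})
    index = {l: i for i, l in enumerate(labels)}
    N_t, N_l = len(traces), len(labels)

    D1 = np.zeros((N_t, T_l, N_l), dtype=np.float32)  # one-hot trace structure
    D2 = np.zeros((N_t, T_l, 1), dtype=np.float32)    # response times

    for i, trace in enumerate(traces):
        trace = trace[:T_l]                            # truncate long traces
        for j, span in enumerate(trace):
            D1[i, j, index[span_label(span)]] = 1.0
            D2[i, j, 0] = span["response_time"]
        for j in range(len(trace), T_l):               # pad short traces
            D1[i, j, index[PAD]] = 1.0

    # Min-max scale response times to [0, 1]
    D2 = (D2 - D2.min()) / (D2.max() - D2.min() + 1e-9)
    return D1, D2, index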
14. HUAWEI TECHNOLOGIES CO., LTD. 13
We model structural anomaly detection in traces
as a sequence-to-sequence, multi-class
classification problem
LSTM network architecture
⎻ SAD = Structural Anomaly Detection ($D_1$)
Model input
⎻ $T_k = (e_0, e_1, \dots, e_{T_l})$
Model output
⎻ For each prefix $(e_0, e_1, \dots, e_{i-1})$: a probability distribution over the $N_l$ unique labels from $L$, representing the probability of the next label $e_i$ in the sequence
Compare the predicted output against the observed label value
⎻ input $(e_0, e_1, \dots, e_{T_l})$ → output $(e_1, \dots, e_{T_l}, '!0')$
⎻ The output is shifted by one event and padded
Update network weights using categorical cross-entropy loss minimization via gradient descent
Trace Analysis
Single-Modality LSTM (1)
SAD = Structural Anomaly Detection ($D_1$)
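A sketch of the structural (SAD) single-modality network described above, written with Keras as in the evaluation setup; layer sizes and names are illustrative, not the exact configuration from the paper.

from tensorflow.keras import layers, models

def build_sad_model(T_l, N_l, hidden=64):
    # Sequence-to-sequence multi-class model: for every position in the trace,
    # predict a probability distribution over the N_l labels of the next event.
    inputs = layers.Input(shape=(T_l, N_l), name="structure")          # D1
    x = layers.LSTM(hidden, return_sequences=True)(inputs)
    x = layers.LSTM(hidden, return_sequences=True)(x)
    outputs = layers.TimeDistributed(
        layers.Dense(N_l, activation="softmax"), name="next_label")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

# Training target: the input sequence shifted by one event and padded,
# i.e. y[i, j] = one-hot(label of event j+1) and y[i, T_l-1] = one-hot('!0').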
15. HUAWEI TECHNOLOGIES CO., LTD. 14
Evaluate if trace $T_{test}$ is anomalous
⎻ $T_{test} = (e_0, e_1, \dots, e_{T_l})$
The network calculates
⎻ A probability distribution $P = \{l_0: p_0, l_1: p_1, \dots, l_{N_l}: p_{N_l}\}$
⎻ The probability of each label of $L$ to appear as the next label value in a trace, given the previous values
The output layer has a softmax function to
distribute the probability over labels
⎻ $\sigma(z)_i = \dfrac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
Such that
⎻ $\sum_{i=1}^{N_l} p_i = 1$
Detection
⎻ Accept the top-$k$ predicted labels as behaviorally correct
⎻ Otherwise, report an anomaly along with the events which contributed to the decision
Trace Analysis
Single-Modality LSTM (2)
SAD = Structural Anomaly Detection ($D_1$)
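One way to realize the top-$k$ detection rule above, given the SAD network from the previous sketch; $k$ is a tuning parameter and padding positions would be skipped in practice.

import numpy as np

def detect_structural_anomaly(model, trace_onehot, k=3):
    # trace_onehot: array of shape (T_l, N_l), one-hot encoded test trace T_test.
    # Flag the trace as anomalous if, at some position, the observed next label
    # is not among the top-k labels of the predicted softmax distribution.
    probs = model.predict(trace_onehot[np.newaxis, ...])[0]   # shape (T_l, N_l)
    anomalous_events = []
    for i in range(len(trace_onehot) - 1):
        observed = int(np.argmax(trace_onehot[i + 1]))        # observed next label
        top_k = np.argsort(probs[i])[-k:]                     # k most probable labels
        if observed not in top_k:
            anomalous_events.append(i + 1)                    # event that contributed
    return len(anomalous_events) > 0, anomalous_events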
16. HUAWEI TECHNOLOGIES CO., LTD. 15
We model response time anomaly detection in
traces as a regression task
LSTM network architecture
⎻ RTAD = Response Time Anomaly Detection ($D_2$)
Approach
⎻ At each timestep $i$, given $(rt_0, rt_1, \dots, rt_{i-1})$
⎻ Predict the response time $rt_i$ of the next event
Update network weights using mean squared error loss
via gradient descent
Detection
⎻ Compute the squared error distance
⎻ $error_i = (rt_i - rt_i^p)^2$, where $rt_i^p$ is the predicted value at timestep $i$
⎻ Errors are fitted by a Gaussian $N(0, \sigma^2)$
⎻ Report the trace as anomalous if the squared error between prediction and input at time $i$ is outside the 95% confidence interval
Trace Analysis
Single-Modality LSTM (3)
RTAD = Response Time Anomaly Detection ($D_2$)
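A sketch of the response-time detection rule: estimate $\sigma$ from prediction residuals on normal traces and report a trace whose squared error falls outside the 95% interval of $N(0, \sigma^2)$; function names are illustrative.

import numpy as np

def fit_sigma(residuals):
    # residuals: (rt_i - rt_i^p) collected on normal traces, assumed ~ N(0, sigma^2)
    return float(np.sqrt(np.mean(np.square(residuals))))

def is_rt_anomalous(rt_true, rt_pred, sigma, z=1.96):
    # Report the trace as anomalous if the squared error at any timestep lies
    # outside the 95% confidence interval, i.e. |rt_i - rt_i^p| > 1.96 * sigma.
    sq_err = (np.asarray(rt_true) - np.asarray(rt_pred)) ** 2
    return bool(np.any(sq_err > (z * sigma) ** 2))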
17. HUAWEI TECHNOLOGIES CO., LTD. 16
The correlation between trace
structure and response time can be
exploited to improve accuracy/recall
LSTM network architecture
⎻ Concatenation of both single-
modality architectures in the
second hidden layer
Approach
⎻ input: $(e_0, e_1, \dots, e_{T_l})$, $(rt_0, rt_1, \dots, rt_{T_l})$
⎻ output: $(e_1, \dots, e_{T_l}, '!0')$, $(rt_1, rt_2, \dots, rt_{T_l}, 0)$
Update network weights using the sum
of the mean squared error and categorical
cross-entropy losses
Trace Analysis
Multimodal LSTM (1)
Detection
⎻ Performed by comparing the output element-wise
with the input for both modalities using the strategy
developed in the single-modality architectures
⎻ Anomaly source: 1) response time, 2) structural, or 3)
both
SAD ($D_1$) and RTAD ($D_2$)
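A sketch of the multimodal network: the two single-modality branches are concatenated at the second hidden layer and trained on the sum of the categorical cross-entropy and mean squared error losses; layer sizes and names are illustrative assumptions.

from tensorflow.keras import layers, models

def build_multimodal_model(T_l, N_l, hidden=64):
    struct_in = layers.Input(shape=(T_l, N_l), name="structure")       # D1
    rt_in = layers.Input(shape=(T_l, 1), name="response_time")         # D2

    # First hidden layer: one LSTM per modality
    h_struct = layers.LSTM(hidden, return_sequences=True)(struct_in)
    h_rt = layers.LSTM(hidden, return_sequences=True)(rt_in)

    # Second hidden layer: concatenation of both modalities
    merged = layers.Concatenate()([h_struct, h_rt])
    shared = layers.LSTM(hidden, return_sequences=True)(merged)

    # Two output heads: next label (softmax) and next response time (regression)
    label_out = layers.TimeDistributed(
        layers.Dense(N_l, activation="softmax"), name="next_label")(shared)
    rt_out = layers.TimeDistributed(layers.Dense(1), name="next_rt")(shared)

    model = models.Model([struct_in, rt_in], [label_out, rt_out])
    model.compile(optimizer="adam",
                  loss={"next_label": "categorical_crossentropy",
                        "next_rt": "mse"})   # total loss = sum of both terms
    return model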
18. HUAWEI TECHNOLOGIES CO., LTD. 17
► Datasets
The system under study has 1000+ services running in a production OpenStack-based cloud [19]
Traces collected using Zipkin [16] over 50 days: >4.5M events; 1M traces
► Evaluation Platform
Python using Keras, model with batch size = 512, learning rate of 0.001, and 400 epochs
PC with an NVIDIA GTX 1060 GPU
► Preprocessing
To avoid outliers, select labels that appear more than 1000 times (105 unique labels)
Distribution of trace lengths is imbalanced: >90% have lengths <10 events
Select only 1000 samples of each trace length
⎻ Requires <1% of all the recorded data
⎻ Efficient and fast for training
For robustness, we also select the traces with lengths between 4 and 20
► Performance
Time to train the multimodal LSTM on 1% of the 1M traces: approx. 30 min
Prediction time per trace: <50 ms
Trace Analysis
Evaluation
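Training with the reported configuration (batch size 512, learning rate 0.001, 400 epochs) could look like the sketch below; it reuses build_multimodal_model from the earlier sketch, and placeholder tensors stand in for the preprocessed data.

import numpy as np
from tensorflow.keras.optimizers import Adam

T_l, N_l = 20, 105          # max trace length and unique labels (see preprocessing above)
N_t = 1000                  # number of training traces (placeholder)

# Placeholders standing in for D1/D2 and the targets shifted by one event.
D1 = np.zeros((N_t, T_l, N_l), dtype="float32")
D2 = np.zeros((N_t, T_l, 1), dtype="float32")
y_label, y_rt = D1.copy(), D2.copy()

model = build_multimodal_model(T_l, N_l)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss={"next_label": "categorical_crossentropy", "next_rt": "mse"})
model.fit([D1, D2], [y_label, y_rt], batch_size=512, epochs=400)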
19. HUAWEI TECHNOLOGIES CO., LTD. 18
Trace Analysis
Evaluation
Best results in terms of accuracy of structural anomaly
detection are achieved using the multimodal LSTM predictions
[Plots compare Multimodal LSTM vs. Single-modality LSTM and Multimodal Dense vs. Single-modality Dense]
Single-modality LSTM achieves a comparable accuracy, while
single- and multimodal dense architectures have low
accuracies
Multimodal LSTM achieves a better accuracy than the single-
modality LSTM for k=1, otherwise it is similar
Accuracy when the anomaly is injected in traces with
different sizes
Multimodal slightly outperforms the single-modality
approach in 9 out of 15 trace lengths for k = 3 and k = 5
Significantly better results are achieved for k = 1 for
almost all of the trace lengths
20. HUAWEI TECHNOLOGIES CO., LTD. 19
Trace Analysis
Evaluation
For SAD, the multimodal approach performs significantly
better for k = 1
For RTAD, single-task models have lower performance
than both multimodal models
Multimodal approach achieves a higher accuracy for
RTAD
Models have high accuracy even when the length of the
trace increases. This is because the LSTMs are able to
learn long-term dependencies in sequential tasks
21. HUAWEI TECHNOLOGIES CO., LTD. 20
[1] F. Schmidt, A. Gulenko, M. Wallschläger, A. Acker, V. Hennig, F. Liu, and O. Kao, “IFTM: Unsupervised anomaly detection for
virtualized network function services,” in 2018 IEEE International Conference on Web Services (ICWS), July 2018, pp. 187–194.
[2] A. Gulenko, F. Schmidt, A. Acker, M. Wallschläger, O. Kao, and F. Liu, “Detecting anomalous behavior of black-box services
modeled with distance-based online clustering,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), Jul 2018,
pp. 912–915.
[3] M. Du, F. Li, G. Zheng, and V. Srikumar, “DeepLog: Anomaly detection and diagnosis from system logs through deep learning,” in
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 1285–1298.
[4] F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschläger, A. Acker, and O. Kao, “Unsupervised anomaly event detection for cloud
monitoring using online ARIMA,” in 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC
Companion), Dec 2018, pp. 71–76.
[16] OpenZipkin, “openzipkin/zipkin,” 2018. [Online]. Available: https://github.com/openzipkin/zipkin
[19] A. Shrivastwa, S. Sarat, K. Jackson, C. Bunch, E. Sigler, and T. Campbell, OpenStack: Building a Cloud Environment. Packt
Publishing, 2016.
Further reading
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud
2019, July 3-8, 2019, Milan, Italy.
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019,
14-17.05, Cyprus.
J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany.
Trace Analysis
References