Sunku Rangarnath on why service providers often fail to implement complete service assurance solutions that cover all three elements: monitoring, reporting & provisioning of the infrastructure. Service Assurance requires deeper tracking of infrastructure & service metrics, automated intervention on threshold violations using trend analysis against configured parameters & finally configuration of hardware resources & service levels based on service priority.
This talk presents a range of closed loop platform automation domains, focusing on the real-time and near-real-time loops touching the platform. We discuss the integration of infrastructure telemetry, analytics & policy management interfaces, and introduce the concept of a Node Agent, using a noisy neighbor demo, for VM/container orchestrators to achieve intervention-free service assurance solutions based on Closed Loop Automation.
5. Platform Observability & Service Assurance (SA)
• Observability: Ability to expose the state of the platform to ensure Service Level Objectives are met
• Observability Considerations: Logging, Metrics & Tracing
• Communications Service Provider Context:
  • Care about overall Service Assurance
  • Both Monitoring & Observability are important
• Service Assurance
  • Application of policies to ensure services meet a pre-defined service quality level
  • FCAPS (Fault, Configuration, Accounting, Performance & Security) attributes on existing network infrastructure
6. Three Key Elements of SA Platform
• Monitoring: Enabling deeper management and tracking of specific service levels
• Presentation: Reporting to enable reaction to service level changes
• Provisioning: Enable configuration of service levels based on workload or service priority
Figure: Service Assurance elements mapping to ETSI NFV Model
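The three elements can be read as one pass through a loop: monitor, present, provision. The sketch below is only an illustration of that reading. The `ServiceLevel` type, the latency metric, the priority scale and the "double the CPU shares" action are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical service level target used only for this sketch."""
    name: str
    priority: int          # assumed scale: higher means more important
    max_latency_ms: float  # assumed service level target


def monitor(samples, level):
    """Monitoring: return the samples that violate the service level."""
    return [s for s in samples if s > level.max_latency_ms]


def present(level, violations, total_samples):
    """Presentation: summarise violations so that something can react."""
    ratio = len(violations) / total_samples if total_samples else 0.0
    return {
        "service": level.name,
        "violations": len(violations),
        "violation_ratio": ratio,
    }


def provision(report, level, cpu_shares, min_priority=5):
    """Provisioning: give more resources to high-priority services in violation.

    Doubling the shares is an arbitrary placeholder action.
    """
    if report["violations"] > 0 and level.priority >= min_priority:
        return cpu_shares * 2
    return cpu_shares
```

A caller would chain the three functions: feed the output of `monitor` into `present`, then pass the report to `provision` to obtain the new resource setting.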
7. Collectd Monitoring Agent
Collectd: Why & What
• Statistics collection daemon
• Uses read plugins to collect metrics and write plugins to send them to an endpoint
• Open source
• Widely adopted
• Configurable Collection Interval
Various Plugin types:
• Input/Output
• Binding Plugins
• Logging Plugins
• Notification Plugins
• Other: Network plugin with both send/receive feature
Figure: Collectd Architecture
https://github.com/collectd/collectd
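As a small illustration of how a metric can be handed to collectd, the sketch below formats a `PUTVAL` line of the kind used by collectd's plain-text protocol (for example by the exec and unixsock plugins). The helper name `format_putval` and its parameters are assumptions made for this example; only the line layout follows collectd's documented `PUTVAL "host/plugin-instance/type-instance" interval=N time:value` form.

```python
def format_putval(host, plugin, type_, values, interval=10,
                  plugin_instance="", type_instance="", timestamp=None):
    """Build one PUTVAL line for collectd's plain-text protocol.

    format_putval is a hypothetical helper, not part of collectd itself.
    """
    # Instances are optional and are appended with a dash when present.
    plugin_part = f"{plugin}-{plugin_instance}" if plugin_instance else plugin
    type_part = f"{type_}-{type_instance}" if type_instance else type_
    identifier = f"{host}/{plugin_part}/{type_part}"

    # "N" asks collectd to use the current time instead of an explicit epoch.
    time_field = "N" if timestamp is None else str(int(timestamp))
    value_list = ":".join([time_field] + [str(v) for v in values])

    return f'PUTVAL "{identifier}" interval={interval} {value_list}'


if __name__ == "__main__":
    print(format_putval("node1", "cpu", "cpu", [42],
                        plugin_instance="0", type_instance="idle",
                        timestamp=1700000000))
```

Run as a script, this prints `PUTVAL "node1/cpu-0/cpu-idle" interval=10 1700000000:42`.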
8. Platform Telemetry Exposure & Integration
Figure: Platform telemetry exposure and integration. Labels recoverable from the diagram (grouping is approximate):
• Legend: Standard Open APIs; Intel Components; Open Platform Collectors; In progress; Done/Integrated
• NFVI: Compute, Network, Storage; Hypervisor [RT/SA KVM4NFV extensions]; Virtualised Compute, Virtualised Network, Virtualised Storage
• Fast Path: Triggers on events or counters; VM Stall Detection/RT Stall Detection; Local Corrective Action (e.g. Working/Protect Failover)
• Slow Path: Periodic Pull 1/15mins
• Platform sources: PMU* counters, NIC counters, vSwitch counters, RAS, Hypervisor/Container Counters, IPFIX, sFlow, SYSLOG, IPMI, Redfish, Out Of Band Telemetry
• Intel Infrastructure Management Technologies: Intel® Run Sure Technology (MCA*, PCIe AER, Resilient System Technology, Resilient Memory Technology, SDDC, DDDC+1, Mirroring); Intel® RDT (CMT, CAT, MBM, CDP); Power; Intel® Node Manager; Intel® Management Engine; Intel® Rapid Storage Technology (RAID/NVMe)
• Common / Standard Open APIs: SNMP API, Enterprise MIB, Perfmon MIB, NFV Platform MIB
• Collectd Plugins: Gnocchi, VES Plugin, Kafka, Prometheus
• OpenStack VIM: Ceilometer, Aodh, Vitrage, Congress
• Monitoring/Analytics Systems: Container Monitoring Solutions (Prometheus ….); Includes NetFlow Collectors; Vendor SA Middleware
PMU*: Performance Monitoring Unit
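The slide distinguishes a fast path (triggers on events or counters) from a slow path (periodic pull). A minimal sketch of that split is shown below. The `TelemetryRouter` class, the counter names and the 900-second default, which reads "1/15mins" as a pull every 15 minutes, are assumptions for illustration only.

```python
class TelemetryRouter:
    """Hypothetical router: immediate triggers plus a periodic snapshot."""

    def __init__(self, trigger_thresholds, pull_interval_s=900):
        self.trigger_thresholds = trigger_thresholds  # counter name -> limit
        self.pull_interval_s = pull_interval_s        # assumed 15 minutes
        self.latest = {}                              # last value per counter
        self.last_pull = None                         # time of last snapshot

    def update(self, counter, value):
        """Fast path: return a trigger as soon as a counter reaches its limit."""
        self.latest[counter] = value
        limit = self.trigger_thresholds.get(counter)
        if limit is not None and value >= limit:
            return {"counter": counter, "value": value, "limit": limit}
        return None

    def pull(self, now):
        """Slow path: return a snapshot only once per pull interval."""
        if self.last_pull is not None and now - self.last_pull < self.pull_interval_s:
            return None
        self.last_pull = now
        return dict(self.latest)
```

In this reading, a local agent would act on the value returned by `update` without waiting for the analytics system, while the result of `pull` feeds the slower monitoring pipeline.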
10. Networking Closed Loops – High Level Architecture
Figure: Independent Closed Loops: SDN, Cloud & Virtual Mgt, Platform. Labels recoverable from the diagram (grouping is approximate):
• Layers: Application, Control, Infrastructure
• Application: Business Applications (Setting of Policy); Analytics Systems
• Control: SDN/NMS (Network Services); Cloud and Virtual Management (MANO, EMS, VNFM)
• Infrastructure: Platform Resources; Forwarding Plane (Interfaces, Traffic); Local Platform Agent; Platform Telemetry
• Flows: Platform Telemetry to telemetry distribution or storage or …..; Policy Based Provisioning; Control Loops
11. Closed Loops – Networking Stack
Figure: High Speed Control Loops are Close to the Platform. Labels recoverable from the diagram (grouping is approximate):
• Services: Application Layer; Network Data Analytics
• Management & Control: Orchestration, Management, Policy; Cloud & Virtual Management; Network Control
• Infrastructure: Operating Systems; Data Path; Hardware/Disaggregated Hardware
• Closed Loop Reaction Time: Micro-seconds/Milliseconds; Seconds/Mins; Mins/Hours/Days
• Domain Knowledge: Local to Platform; End to End
• Loop roles: HW Enabled Loops (eg RAS); Enforce DP Loops (HA etc.); Enforce Local Policy; Enforce Network Domain Policy; Deployment Policies; Map Policies; Analyze/Plan Policies
12. Closed Loops – Business Cases
Figure labels recoverable from the diagram (grouping is approximate):
• Business Use Cases: Improved Customer Experience; Cloud Optimization & Efficiency; Edge Placement; Service Healing; Differentiated QoS; Service Optimization; Energy Optimization; Capacity Optimization; Cloud Configurations; Security (Threat Detection, Threat Response)
• Business Applications; Analytics; AI/ML/DL
• NFV Orchestrator (NFVO) [eg ONAP/OSM]; VNF Manager (VNFM)
• Telemetry I/F: Open Stack, Kubernetes, Bare Metal
• Local Policy Enforcement Agent(s) For Local Dynamic Control
• Platform(s): Feature Exposure, Provisioning, Telemetry
• Intel Infrastructure Management Tech: Intel RDT, Power, Intel RunSure, Monitoring/Storage
• collectd (Actively Contributing)
• Policy Based Provisioning; Control Loops
13. Closed Loop Resiliency Demo
Goal: Maximize Service Availability of a virtual Broadband Network Gateway (vBNG) in a memory error scenario
Figure 1: Service Recovery Timeline
Figure 1 Source: OpenSAF and VMware from the Perspective of High Availability, Ali Nikzad, Ferhat Khendek, Maria Toeroe (Concordia University, Ericsson), SVM’2013, Zurich, October 2013
Figure 2: Closed Loop Resiliency Demo with Kubernetes
More Details on Demo: https://networkbuilders.intel.com/social-hub/video/closed-loop-platform-automation-workload-resiliency-demo
14. Closed Loop Automation (CLA) – Communities, Standards
• Open Network Automation Platform (ONAP) – Closed Loop Automation Management Platform (CLAMP)
• OPNFV Working Group for CLA
• ETSI Zero-touch network and Service Management (ZSM)
• ETSI Experiential Networked Intelligence (ENI)
Figures: Ex: OPNFV WG; Ex: ONAP CLAMP
15. Use Cases & Gaps
• 5G Network Slicing
• Demand based Energy Savings
• Workload Resiliency
• Noisy Neighbor Detection & Avoidance
• And many more….
Figure: 5G Network Slicing Architecture
Source: https://www.researchgate.net/figure/5G-network-slicing-architecture_fig1_324175599
Gaps, Ongoing Work
• Telemetry tagging
• Policy delivery & management across VIM to NFVI
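For the noisy neighbor use case listed above, one simple detection rule is to flag any unprotected workload whose share of the last-level cache exceeds a configured limit. The sketch below assumes per-workload cache occupancy in bytes is already available (for instance from cache monitoring telemetry). The function name, the workload names and the 50% limit are hypothetical, and the talk does not specify the rule actually used in the demo.

```python
def find_noisy_neighbors(llc_occupancy_bytes, llc_size_bytes,
                         protected, max_share=0.5):
    """Return unprotected workloads that occupy too much of the cache.

    llc_occupancy_bytes: workload name -> last-level cache occupancy in bytes
    protected: workloads that must never be flagged (e.g. high-priority VNFs)
    max_share: assumed limit on the fraction of cache one workload may hold
    """
    if llc_size_bytes <= 0:
        raise ValueError("llc_size_bytes must be positive")

    noisy = []
    for workload, occupancy in llc_occupancy_bytes.items():
        if workload in protected:
            continue
        if occupancy / llc_size_bytes > max_share:
            noisy.append(workload)
    return sorted(noisy)
```

An avoidance step would then act on the returned names, for example by restricting their cache allocation or rescheduling them. That step is platform specific and is left out of this sketch.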
16. Summary
• Platform Observability & Monitoring play a crucial role in ensuring service assurance
• Platform telemetry, alongside application telemetry, heavily differentiates the services
• Various levels of closed loops are required for autonomous networks
• Realtime & Near-Realtime closed loops require automation
Call To Action
• Collaborate through Open Source Communities
• Figure out use cases of interest
• Leverage relevant infrastructure telemetry
18. Service Assurance “Phased” Evolution for NFV/SDN
• Strategic Framework for SA “Phase” Evolution
Phase 1 - Equivalence (Virtualized + Interworking with existing management systems)
Phase 2 - Automated by MANO+SDN Controller
Phase 3 - Predict failures and adapt automatically
Phase 1: Platform Service Assurance - Equivalence
• Platform Service Assurance supporting:
  • Intel RAS Technologies
  • Cache Config & Monitoring
  • BIOS Config & Reporting
  • Fastpath DPDK Interface Reporting
  • Fastpath DPDK Keep Alive
  • Virtual Switch Health
  • QAT Watchdog
  • Host Health
  • …….
Phase 2: Platform Service Assurance (MANO + SDN Controller)
• VIM and above, support:
  • Enable RAS Technologies
  • Enable Watchdog Metrics
  • Enable DPDK and Keep Alive
  • Enable Host Health
  • Policy Based Provisioning
  • …
Phase 3: Predictive Platform Service Assurance
• Predict Failures and Adapt Automatically:
  • Automated and Adaptive to changes notified in metrics
  • Closed loop and Dynamic SA environment
Evolving from Equivalence towards NFV/SDN Automation
Figure labels: Never Stops; Solution of the day; Under Construction
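Phase 3 depends on predicting a failure before a threshold is actually crossed. One simple form of trend analysis is a least-squares line through recent samples, extrapolated to the configured threshold. The sketch below shows that idea only. The function name and the choice of a linear model are assumptions, not the method used in the talk.

```python
def time_to_threshold(timestamps, values, threshold):
    """Estimate seconds until a rising metric reaches `threshold`.

    Fits a least-squares line through (timestamp, value) samples and
    extrapolates it. Returns None when there is no upward trend or too
    little data, and 0.0 when the fitted line has already crossed.
    """
    n = len(timestamps)
    if n < 2 or n != len(values):
        return None

    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        return None  # all samples share one timestamp, no slope to fit

    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(timestamps, values))
    slope = cov / var_t
    if slope <= 0:
        return None  # flat or falling metric never reaches the threshold

    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - timestamps[-1])
```

A closed loop could compare the returned estimate with the time an adaptation needs (for example a workload migration) and act only when the margin becomes too small.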
19. Platform Plugins Contributed by Intel
Plugin Domain | Description
Intel RunSure/RAS | Mcelog, PCIe AER, logparser: Metrics & notifications pertaining to Intel RunSure technologies
Intel_RDT | Resource Director Technology related metrics
Virt | Libvirt related metrics
OVS | Ovs_stats, ovs_events: Metrics related to Open vSwitch
DPDK | Dpdk_stats, dpdk_events, hugepages: DPDK related metrics
OpenStack | Gnocchi, Aodh: Integration in OpenStack projects
Cloud | Write_Kafka, Write_Prometheus, VES: Integration into various cloud platforms
Storage | RAID, NVMe: Storage related metrics
Power/Energy | CPUFreq, Turbostat: Frequency & power related metrics
Platform | IPMI, RedFish, PMU: Out of Band metrics & platform counters
Infrastructure Metrics are as Crucial as Application Metrics
20. OPNFV Barometer
Barometer Strategy:
• Ensure platform metrics/events are accessible through open industry standard interfaces.
• Demonstrate IA platform technologies can be monitored, consumed and actioned in real time
One Click Install:
• Easy install/configuration for customers
• One command to install Collectd/InfluxDB/Grafana
• Three container approach for Collectd:
  • Stable Container: latest stable branch
  • Master Container: up to date with master
  • Experimental Container: cherry pick features of interest
Source: https://opnfv-barometer.readthedocs.io/en/latest/release/userguide/docker.userguide.html