This webinar describes several ways of providing High Availability, Reliability and Resiliency in KVM and OpenStack for NFV. It also gives a brief overview of Stratus' Software Defined Availability (SDA), an elegant way of bringing transparent and seamless Resiliency to all VNFs without code changes.
Stratus Fault-Tolerant Cloud Infrastructure Software for NFV using OpenStack
1. ACHIEVING AVAILABILITY AND RESILIENCY
IN OPENSTACK FOR NFV
Stratus Webinar
May 26, 2015
Ali Kafel | Senior Director, Business Development | Ali.Kafel@Stratus.com Twitter: @akafel
Steve Hauser | CTO | stephen.hauser@stratus.com
2. Agenda
NFV Overview
Defining Availability, Reliability and Resiliency
Achieving Resiliency in Applications vs Infrastructure
Software Defined Availability (SDA)
• Seamless service continuity, with no required code changes.
• Selectable levels of availability, for different control and forwarding applications
• Increasing traditional 45% utilization towards 80% to 90% utilization
5. Network Functions Virtualization
Decoupling with NFV
[Figure: three stages. Monolithic Vertical Integration: Vendors A, B and C each supply a full stack (RAN, Backhaul, GPRS/1X, MSC, HLR, SMSC). Delamination: individual functions (RAN, Backhaul, EPC, PCEF, Diameter Core, MME, OCS/OFCS, HSS, PCRF, IMS) come from Vendors A to F. Virtualization: Linux-hosted EPC, PCRF, Firewall, IMS, … run on Commodity Hyper Scale COTS Computing as a Liquid Pool of Dynamically Allocated Resources, with Automation and Orchestration.]
9. Network Functions Virtualization With Software Defined Networks
Decoupling with NFV
SDN Separates Control From Forwarding
[Figure: stack built on Commodity Hyper Scale COTS Computing, Commodity High Volume Networking and Commodity High Volume Storage, with Virtualization and Orchestration. Linux-hosted VNFs: EPC, PCRF, HSS, IMS, …. Virtualized SDN: Linux-hosted Optical Transport, L3 Routing and L2 Switching control planes, controlling the Optical Transport, L3 Routing and L2 Switching forwarding layers. Virtualized OSS/BSS: Linux-hosted Billing, Customer Care and NOC.]
11. Defining Availability, Reliability and Resiliency
Availability
• Percentage of time that equipment is in an operable state, i.e. able to
provide access to information or resources
• Availability = Uptime / Total time
Reliability
• How long a system performs its intended function.
• MTBF = total time in service / number of failures
Resiliency
• The ability to recover quickly from failures, to return to
its original form, state, etc. (just before the failure)
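The two formulas on this slide can be tried out with a few lines of Python. This is a minimal sketch: the function names and the sample year (8760 hours, two failures, one hour of downtime) are illustrative assumptions, not figures from the webinar.

```python
def availability(uptime_hours: float, total_hours: float) -> float:
    """Availability = uptime / total time, as a fraction between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return uptime_hours / total_hours


def mtbf(service_hours: float, failures: int) -> float:
    """MTBF = total time in service / number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return service_hours / failures


def allowed_downtime_minutes(target: float, period_hours: float = 8760.0) -> float:
    """Downtime budget in minutes for an availability target over a period."""
    return (1.0 - target) * period_hours * 60.0


# Hypothetical year: 8760 hours in total, 2 failures, 1 hour down overall.
year = 8760.0
print(f"availability: {availability(year - 1.0, year):.5f}")
print(f"MTBF (hours): {mtbf(year - 1.0, 2):.1f}")
print(f"five-nines budget (min/year): {allowed_downtime_minutes(0.99999):.2f}")
```

Note that neither number says anything about Resiliency: a system can score well on both and still lose its state at every failure.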
12. Defining Availability, Reliability and Resiliency
Therefore, a Highly Available (HA) system may not be
Highly Reliable (HRel) or Highly Resilient (HRes)
A Fault Tolerant (FT) system is Highly Available, Highly
Reliable and Highly Resilient (state is preserved)
13. Fault Tolerant Systems Never Stop
Stateful Fault Tolerance = HA + HRel + HRes
When Seconds Count… Loss of Revenue, Reputation, Safety, Life
Lose Transactions, Lose Reputation, Lose Revenue, Lose Customers
[Figure: behaviour after a failure. High Availability: the outage may be a few seconds, minutes or hours, and the original state is lost. Stateful Fault Tolerant: the original state is preserved!]
15. Three ways to provide Stateful FT in VNFs
(Each option stacks Applications / VNFs on an Operating Environment on Hardware; the options are compared by Costs & Resources.)

In the Hardware
Pros
• Transparent – no code change
• Fast & Simple Deployment
• No special App Software
Cons
• Very expensive
• Inefficient utilization
• Special Hardware
• Rigid

In the Applications
Pros
• App specific state can be Customized
Cons
• Every App must be modified
• Longer time to deploy
• Complex
• Rigid

In the Software Infrastructure (Operating Environment with Resilience Layer)
Pros
• Transparent – no code change
• Fast & Simple Deployment
• No special App Software – deploy any
• No Special Hardware – use commodity
• Multiple Levels of Resiliency Supported
• Higher efficiency of resiliency – N+k
Cons
• Higher efficiency may not be possible on very large monolithic Apps
16. But Fault Tolerance is more than just State Protection; it is about the
complete Fault Management Cycle with multiple levels of resiliency
[Figure: Fault Management Cycle with the stages Detection, Localization, Isolation, Recovery and Repair (Restore Redundancy), plus a State Protection label.]
We call this Software Defined Availability (SDA). It has 4 characteristics:
1. Selectable Resiliency for each VNF
2. Seamless Protection for all VNFs
3. Agility with 3rd party ecosystem
4. Efficiency of Redundancy
17. Stratus’ Software Defined Availability (SDA) Solution
provides a highly resilient Cloud and NFVI
1. Seamless Protection for all VNFs
• Software Defined, transparent Service Continuity, performed automatically by the
infrastructure, without Application code changes
2. Selectable Resiliency for each VNF
• Deploy each VNF with selectable levels of resiliency including High Availability and
stateful Fault Tolerance (state protection), with Geo-Redundancy, without application
awareness
3. Agility with 3rd party ecosystem and any VNF
• Protect all VNFs in any KVM/OpenStack environment seamlessly, with No complex
code development, testing and support – for optimal partner ecosystem
4. Efficiency of Redundancy
• Unlike traditional approaches for Fault Tolerance, which limit Utilization to sub-50%, get
dramatic increase in Efficiency of Redundancy, at 80% to 90%
18. 1 | Selectable Resiliency for each VNF: Software Defined
Availability (SDA) with selectable levels of resiliency
Deliver Availability as an infrastructure service to virtual and cloud ecosystems
Any application with any availability need, with application transparency
The Right Level of Resiliency for Each Component
FT protected Control Elements
With SR-IOV enabled high-performance, low latency Forwarding Elements
[Figure: Monolithic VNFs (Firewall, MME, IMS, Web Server) beside a Componentized VNF made of three VNF-C Forwarding Elements (Stateless Fast Path Forwarding Elements) and one VNF-C Control Element (Stateful Control Element).]
19. 2 | Seamless Protection: When needed, application
states are protected without application awareness, in
the VM Operation - Statepointing
• VM instances are paired between hosts in the cloud infrastructure
• State of the primary is captured regularly and applied to the secondary standby
• If a fault occurs on the primary, the secondary takes over from the most recent
Statepoint without data loss
• Controls when information (network, storage I/O) is allowed to leave
the guest
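The capture-and-resume idea can be illustrated with a toy model. This is a sketch under loud assumptions: the classes Guest and StatepointPair are hypothetical, the guest "state" is just a dictionary, and nothing here reflects how the Stratus product or QEMU actually captures memory.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Guest:
    """Toy guest VM whose whole state is a dict of counters."""
    state: dict = field(default_factory=dict)

    def run_epoch(self, updates: dict) -> None:
        # Work done during one run epoch mutates the guest state.
        self.state.update(updates)


@dataclass
class StatepointPair:
    """Primary guest plus the last statepoint held on the secondary host."""
    primary: Guest = field(default_factory=Guest)
    statepoint: dict = field(default_factory=dict)
    epoch: int = 0

    def take_statepoint(self) -> None:
        # Capture the primary's state and apply it to the standby.
        self.statepoint = copy.deepcopy(self.primary.state)
        self.epoch += 1

    def fail_over(self) -> Guest:
        # The secondary resumes from the most recent statepoint.
        return Guest(state=copy.deepcopy(self.statepoint))


pair = StatepointPair()
pair.primary.run_epoch({"sessions": 10})
pair.take_statepoint()
pair.primary.run_epoch({"sessions": 12})  # not yet captured when the fault hits
survivor = pair.fail_over()
print(survivor.state)
```

The survivor resumes with the state of the last statepoint, so work from the uncaptured epoch is rolled back. That is why the last bullet matters: if output from the uncaptured epoch had already left the guest, the outside world would have seen state that no longer exists.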
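[Figure: Statepoint timeline. The Primary Host runs the guest through epochs N-1, N and N+1 and produces statepoints SP N-1 and SP N, which are applied on the Secondary Host. After a fault on the primary, the secondary runs epochs N+1 and N+2 and produces SP N+1. A Third Host (created post primary failure) starts the guest from image and is brought up to date with statepoints SP N, SP N+1 through SP N+X.]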
20. Act.-Stby. Statepoint Processes & Egress Network Barrier
[Figure: timeline of the Guest VM (Active), QEMU (Active) and QEMU (Standby) across VM epochs n-1, n and n+1. In each epoch the guest enqueues packets on the Guest Egress Queue, the QEMU Monitor takes snapshots in phases P1 to P5, and the queue is held behind barriers (n-1; then n & n-1; then n+1 & n). Numbered steps: 1; 2, P; 3, C; 4; 5, R; with an async transfer.]
Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed
Pause, Capture, Resume (PCR): phases of the Statepoint process when VM execution is suspended
Note: For simplicity, n-2 interactions are not shown.
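The egress barrier can be pictured as a queue that only releases packets whose epoch has been committed. The sketch below is a hypothetical illustration: the class EgressQueue and its method names are invented here, and dropping uncommitted packets on failover is an assumption about the intent, not something the slide states.

```python
from collections import deque


class EgressQueue:
    """Holds guest egress packets until their epoch's statepoint is committed."""

    def __init__(self) -> None:
        self._pending = deque()  # (epoch, packet) pairs, in arrival order
        self.sent = []           # packets released to the network

    def enqueue(self, epoch: int, packet: str) -> None:
        # Packets produced during an epoch wait behind that epoch's barrier.
        self._pending.append((epoch, packet))

    def release_through(self, committed_epoch: int) -> None:
        """Remove the barrier for every epoch up to and including committed_epoch."""
        while self._pending and self._pending[0][0] <= committed_epoch:
            self.sent.append(self._pending.popleft()[1])

    def drop_uncommitted(self) -> int:
        """Assumed failover behaviour: discard packets from uncommitted epochs."""
        dropped = len(self._pending)
        self._pending.clear()
        return dropped


queue = EgressQueue()
queue.enqueue(1, "ack-a")
queue.enqueue(1, "ack-b")
queue.enqueue(2, "ack-c")
queue.release_through(1)   # statepoint for epoch 1 is safely on the standby
print(queue.sent)
print(queue.drop_uncommitted())
```

Holding output this way trades a little latency for consistency: nothing a client has received can ever depend on state that a failover would roll back.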
21. 3 | Agility with 3rd party ecosystem and any VNF
NFV and SDN Allow Low Cost Commodity HW
But When Failures Happen, Service Continuity can be Affected
Does Not Provide Five 9’s (99.999%) Reliability
[Figure: the NFV and SDN stack (Commodity Hyper Scale COTS Computing, Commodity High Volume Networking and Storage, Virtualization, Linux-hosted VNFs, Virtualized SDN, Virtualized OSS/BSS, Orchestration), shown without a resilience layer.]
22. We solved it by inserting A Virtualized Cloud Resilience Layer for
NFV and SDN
[Figure: the same NFV and SDN stack, now including the Stratus Automated Virtualized Resilience Layer.]
23. Stratus Provides
A Virtualized Cloud Resilience Layer for NFV and SDN
[Figure: the NFV and SDN stack with the Stratus Automated Virtualized Resilience Layer.]
24. 4 | Efficiency of Redundancy:
We have designed Shadow Secondary VMs, placed under anti-affinity rules (on
different hosts from their primaries), to take up far fewer resources than their primaries,
Yielding High Utilization and Low Additional Reserve Capacity
[Figure: primaries A, B, C, D and their shadow secondaries A1, B1, C1, D1.]
25. But before we get into details, let’s look at how Traditional
Fault Tolerance is Achieved by Full HW Redundancy
• Cloud Computing environments often utilize volumes of High Density
Commodity Servers for Computing
[Figure: racks of high density cloud servers.]
26. Server Workloads that need fault tolerance typically need
redundancy to run another copy in LockStep
[Figure: workloads running on racks of high density cloud servers.]
27. Server Workloads that need fault tolerance typically need
redundancy to run another copy in LockStep
• Which has typically been Supported by twice the Hardware
[Figure: racks of high density cloud servers running workloads, beside racks of redundant servers held as just-in-case workload capacity.]
28. Server Workloads that need fault tolerance typically need
redundancy to run another copy in LockStep
• Which has typically been Supported by twice the Hardware,
Arranged Rigidly In Mated Pairs
[Figure: servers arranged in mated pairs.]
29. • When a Failure happens, a Backup Takes Over until the original is
replaced, thus preserving Service Continuity
• However, backup replacement can take days and a great deal of
human intervention, during which another failure would be
disastrous
30. But Resource Utilization is 50% at Best
“Traditional telecom networks operate great at 45% utilization, but as
AT&T becomes a software company, a reasonable goal could be 80%
to 90% utilization” John Donovan, Senior EVP AT&T
[Figure: Problem: 45% utilization, 55% unutilized backup capacity.]
31. Stratus Resilient Cloud Technology Provides
Fully Stateful Fault Tolerance at up to 80% Utilization
• Stratus Virtualized Resilience requires much less Backup Capacity
for fully stateful Functional Fault Tolerance
[Figure: Problem: 45% utilization, 55% unutilized backup capacity. Solution (Virtualized Resilience): 80% utilization, 20% unutilized backup capacity. 37.5% Savings.]
32. Stratus Resilient Cloud Technology Provides
Fully Stateful Fault Tolerance at up to 80% Utilization
[Figure: Problem: 45% utilization, 55% unutilized backup capacity. Solution (Virtualized Resilience): 80% utilization, 20% unutilized backup capacity. 77.8% More Capacity.]
• Stratus Virtualized Resilience could alternatively provide 78% more
Actively Utilized Capacity using the same resources
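The headline percentages can be checked with simple ratios. The slides do not say how the figures were derived, so the sketch below is one reading that reproduces them; the function names are illustrative. The 77.8% figure follows from the 45% baseline, while the 37.5% figure matches a 50% baseline (the "50% at Best" case) rather than 45%.

```python
def resource_savings(old_util: float, new_util: float) -> float:
    """Fraction of hardware saved for a fixed workload when utilization rises."""
    return 1.0 - old_util / new_util


def extra_capacity(old_util: float, new_util: float) -> float:
    """Fractional gain in actively used capacity on the same hardware."""
    return new_util / old_util - 1.0


print(f"more capacity, 45% to 80%: {extra_capacity(0.45, 0.80):.1%}")
print(f"savings, 50% to 80%:       {resource_savings(0.50, 0.80):.1%}")
print(f"savings, 45% to 80%:       {resource_savings(0.45, 0.80):.2%}")
```

With the 45% baseline the same savings formula gives 43.75%, so the quoted 37.5% is the more conservative of the two readings.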
33. Instead of the traditional 1+1 approach, the
Stratus Resilient Cloud Technology uses
Software Defined Availability (SDA), which
Increases Utilization and The Efficiency of
Resiliency, and Decreases Cost
It’s based on an n+k De-Clustered redundancy
approach where Shadow Secondary VMs are
placed under anti-affinity rules (on different hosts from their primaries)
and take up far fewer resources than their primaries
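A rough way to compare 1+1 with n+k is to ask what share of total capacity does useful work. The model below is a simplifying assumption made for illustration, not a formula from the webinar: each primary uses one capacity unit, each shadow secondary uses a small fraction of its primary, and k units sit idle in reserve. The pool size and shadow cost in the example are hypothetical.

```python
def redundancy_utilization(n_primaries: float, k_reserve: float,
                           shadow_cost: float = 0.0) -> float:
    """Share of total capacity doing useful work.

    n_primaries: capacity units running primary workloads
    k_reserve:   capacity units held idle as reserve
    shadow_cost: capacity each shadow secondary uses, as a fraction of its primary
    """
    total = n_primaries * (1.0 + shadow_cost) + k_reserve
    return n_primaries / total


print(f"traditional 1+1:      {redundancy_utilization(1, 1):.1%}")
print(f"hypothetical 20+2:    {redundancy_utilization(20, 2, 0.08):.1%}")
```

The point of the model is that utilization stays near 50% only while the reserve is as large as the workload; once k is much smaller than n and shadows are cheap, it moves into the 80% to 90% range.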
34. Software Defined Availability Increases Utilization and The
Efficiency of Resiliency, and Decreases Cost
[Figure: chart with axes Agility and Efficiency of Redundancy, ranging from Simple (HW, Monolithic Fwd + CTRL) to Sophisticated (SW Virtualized, De-Coupled Fwd + CTRL). Points: 1+1, labelled "Most Traditional Telco Systems are in this Category", and Software Defined Availability.]
35. Software Defined Availability Increases Utilization and The
Efficiency of Resiliency, and Decreases Cost
[Figure: the same chart with more points. 1+1 ("Most Traditional Telco Systems are in this Category") and N+1 for monolithic FWD + CTRL systems; C+C and F+F for de-coupled CTRL and FWD; and, under Software Defined Availability, 1+.06 for CTRL and F+k for FWD with SR-IOV, where k<<F.]
36. Asymmetric StateSync™ Redundancy
Coordinated VM Interleave Improves Performance
on High Latency Links
[Figure: processor activity on the Primary (Compute and StatePoint™) and on the Secondary (6%-10%), joined by the StatePoint™ Sync Link.]
40. N+k De-Clustered Redundancy
Server 4 Apps ABCD Are Backed Up On Separate Servers
Which could be anywhere in the Pool of Servers
[Figure: Server 4 hosting apps A, B, C and D.]
42. N+k De-Clustered Redundancy
Apps ABCD on All 5 Servers Are Backed Up On Separate Servers
In this example, the servers back each other up
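De-clustered placement can be sketched as a simple scheduler that puts each shadow on any host except the one running its primary, spreading the shadows across the pool. The function place_secondaries and the host names are hypothetical; a real deployment would rely on the scheduler's own anti-affinity support rather than code like this.

```python
from __future__ import annotations


def place_secondaries(primaries: dict[str, str], hosts: list[str]) -> dict[str, str]:
    """Assign each VM's shadow secondary to a host other than its primary's.

    Picks the least-loaded eligible host each time, so shadows spread over
    the pool instead of forming fixed mated pairs.
    """
    if len(hosts) < 2:
        raise ValueError("anti-affinity needs at least two hosts")
    load = {h: 0 for h in hosts}
    placement = {}
    for vm, primary_host in sorted(primaries.items()):
        candidates = [h for h in hosts if h != primary_host]
        target = min(candidates, key=lambda h: (load[h], h))
        placement[vm] = target
        load[target] += 1
    return placement


pool = ["server1", "server2", "server3", "server4", "server5"]
on_server4 = {"A": "server4", "B": "server4", "C": "server4", "D": "server4"}
print(place_secondaries(on_server4, pool))
```

With four apps on server4 and four other hosts in the pool, each shadow lands on a different server, so losing server4 spreads the recovery load instead of concentrating it on one mate.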
43. N+k De-Clustered Redundancy
Primaries, Secondaries, plus Reserve Capacity Shown for Each
[Figure: primaries A, B, C, D with secondaries A1, B1, C1, D1. Legend: Secondary; Shadow VMs Stand Up; Reserve Capacity. Stand Up can happen on other machines with Lower Priority Pre-Emption.]
44. Upon Node Failure, Secondaries are Activated
With No Loss of State
45. One of “k” Reserve Servers is Activated
While the Failed Node is Logically Removed
[Figure: labels Cloud Server Resource Pool and Recycle. Legend: Secondary; Shadow VMs Stand Up; Reserve Capacity. Stand Up can happen on other machines with Lower Priority Pre-Emption.]
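The failure sequence on the last two slides (activate secondaries, bring in a reserve server, logically remove the failed node) can be sketched as bookkeeping over a pool. The Pool class and host names below are hypothetical and the model is deliberately simplified: it tracks placement only and does not rebuild the shadows that were lost.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    active: set = field(default_factory=set)       # hosts running workloads
    reserve: list = field(default_factory=list)    # the "k" spare hosts
    primary: dict = field(default_factory=dict)    # vm -> host of its primary
    secondary: dict = field(default_factory=dict)  # vm -> host of its shadow

    def handle_failure(self, failed: str) -> list:
        """Promote shadows of VMs on the failed host; return the promoted VMs."""
        self.active.discard(failed)               # logically remove the failed node
        if self.reserve:
            self.active.add(self.reserve.pop(0))  # activate one reserve server
        promoted = []
        for vm, host in sorted(self.primary.items()):
            if host == failed:
                # The shadow takes over as the new primary.
                self.primary[vm] = self.secondary.pop(vm)
                promoted.append(vm)
        # Shadows that lived on the failed host are gone and need rebuilding.
        for vm in [v for v, h in self.secondary.items() if h == failed]:
            del self.secondary[vm]
        return promoted


pool = Pool(
    active={"s1", "s2", "s4"},
    reserve=["r1"],
    primary={"A": "s4", "B": "s4", "C": "s1"},
    secondary={"A": "s1", "B": "s2", "C": "s4"},
)
print(pool.handle_failure("s4"))
print(sorted(pool.active))
```

After the failure the promoted VMs run unprotected until new shadows are stood up, which is the step where the activated reserve server, or lower-priority capacity that can be pre-empted, comes into play.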
47. Stratus Resilient Cloud Technology
Dramatically Improves The Efficiency of Redundancy
Benefits: Either
• Enables up to 37.5% Resource Savings to provide Redundancy
• Enables up to 77.8% More Capacity for Protected Redundant Workloads
Or a combination of the two
[Figure: in both cases, Problem: 45% utilization, 55% unutilized backup capacity; Solution (Virtualized Resilience): 80% utilization, 20% unutilized backup capacity.]
48. “Traditional telecom networks operate great at 45% utilization, but as
AT&T becomes a software company, a reasonable goal could be 80%
to 90% utilization.”
John Donovan, Senior Executive Vice President, AT&T
49. Beyond the Virtualized Resilience Layer, the Resilience
Management Layer enables Automation
[Figure: architecture diagram with these elements. Standard Server Platform – Commodity Off-The-Shelf (COTS). NFVI Compute Domain [Linux/KVM+QEMU, OpenStack, OVS+Availability Services], with Linux Host OS, vSwitch and the Virtualized Resilience Layer. VNFCs Instantiated in NFVI, each as a VNFC on a GuestOS in a VM, Running any Guest OS. NFVI domain [SDN Controller]. Core OpenStack with the Heat Orchestration API and Heat Template. Resilience Management Layer with Resiliency Workload Management, Discovery And Tagging Tool, Authoring Tool/Service Catalog, Service Templates and the VNF Service Template. MANO/VIM with VNFMs. Orchestrator(s). OSS/BSS.]
50. Stratus Cloud Solutions – Two Technologies
Availability Services
• Continuous Availability Including Stateful Fault Tolerance
• Based upon Linux technology and KVM
• Available on multiple distributions
• Based on Stratus everRun technology which is field proven, with 12,000+ licenses deployed
Workload Services
• Deployment of workloads
• Automation of availability events
• Layers between Orchestrators and OpenStack distributions
[Figure label: Resilience Management]
51. In Summary: The Stratus Cloud Solution for telcos and
Communications Infrastructures offers:
1. Seamless Protection for all VNFs
• Software Defined, transparent Service Continuity, performed automatically by the
infrastructure, without Application code changes
2. Selectable Resiliency for each VNF
• Deploy each VNF with selectable levels of resiliency including High Availability
and stateful Fault Tolerance (state protection), with Geo-Redundancy, without
application awareness
3. Agility with 3rd party ecosystem and any VNF
• Protect all VNFs in any KVM/OpenStack environment seamlessly, with No
complex code development, testing and support – for optimal partner ecosystem
4. Efficiency of Redundancy
• Unlike traditional approaches for Fault Tolerance, which limit Utilization to sub-
50%, get dramatic increase in Efficiency of Redundancy, at 80% to 90%
52. Seeing is believing: ETSI PoC#35
Availability Management with Stateful Fault Tolerance,
Telcos include AT&T, NTT & iBasis
Contact us to:
1. See this demo and learn more about seamless Software-based
Fault Tolerance in VNFs and other Cloud applications
2. Get a copy of the slides or ask further questions
Ali.Kafel@Stratus.com
Twitter: @akafel