LF_OVS_17_Ingress Scheduling
1. Ingress Scheduling in OvS-DPDK
Billy O’Mahony – Intel
Jan Scheurich – Ericsson
November 16-17, 2017 | San Jose, CA
2. Introduction
• Use cases for traffic prioritization in NFV
• State of the art in the OvS-DPDK datapath
• Rx queue prioritization in the DPDK datapath
• Traffic classification and queue selection on the NIC
• Next steps
3. Scenario: NFVI on Converged Data Center
VIM control plane sharing the physical network with tenant data
[Diagram: compute node running OvS; tenant VMs attach to br-int via vhostuser ports; VIM components and their local agents attach to br-ctl via host networking; br-int and br-ctl connect through tagged ports to br-prv, which bonds dpdk0 and dpdk1 towards ToR A and ToR B]
VIM = Virtual Infrastructure Manager; for example the OpenStack Nova and Neutron services and their local agents
4. Use Case 1: In-band OvS Control Plane
LACP bond supervision
LACP = Link Aggregation Control Protocol
Here: in-band heart-beat between OVS and each ToR
[Diagram: same setup as slide 3; LACP packets are exchanged over dpdk0/dpdk1 between the bond and ToR A / ToR B]
5. Use Case 1: In-band OvS Control Plane
BFD tunnel monitoring: BFD packets are sent inside the tunnel
BFD = Bidirectional Forwarding Detection
Here: heart-beat between OVS instances connected through a tunnel mesh
[Diagram: same setup as slide 3; BFD packets travel inside the tunnels between OVS instances]
6. Use Case 2: VIM Control Plane
VIM control plane; in OpenStack: RabbitMQ, REST API calls, …
[Diagram: same setup as slide 3; VIM control-plane traffic flows between the VIM components (Nova, Neutron, …) on the host and the physical network via br-ctl]
7. Use Case 2: VIM Control Plane
OvS control plane: OpenFlow and OVSDB
OpenFlow and OVSDB are special cases of the VIM control plane
[Diagram: same setup as slide 3; the OpenFlow and OVSDB connections of OVS share the physical network via br-ctl]
8. Status Quo in OvS-DPDK Datapath
[Diagram: RSS on the NIC spreads incoming traffic over the Rx queues polled by PMD threads 1 and 2, which forward to the tenant VM; BFD and LACP are handled by the ovs-vswitchd thread; VIM components and host networking attach via br-ctl; a HW scheduler on the NIC arbitrates the Tx queues]
10. Scenario: Egress Link Overload
[Diagram: same datapath as slide 8, with the egress link overloaded]
Egress link bandwidth exhausted by tenant data.
PMD Tx queues full; tenant data being dropped.
Separate Tx queue for ovs-vswitchd: the HW scheduler can provide a fair share of link bandwidth.
11. Measurements – Impact of PMD Overload
PMD polling a physical port overloaded with 64B packets

Offered load on physical port [Kpps]   2000    2200    2400    2600    2800    3200    3600    4000
Offered load on phy port [Gbit/s]      1.54    1.69    1.84    2.00    2.15    2.46    2.76    3.07
PMD overload factor [%]                        0%      9%      18%     27%     45%     64%     82%
PMD utilization [%]                    99.95%  99.99%  100%    100%    100%    100%    100%    100%
Phy port rx drop [%]                   0%      0%      8%      15%     21%     31%     39%     45%
ping -f average RTT [ms]               0.45    0.50    3.02    3.03    3.15    3.10    3.69    3.95
ping -f packet drop [%]                0       0       10%     16%     21%     37%     45%     49%
BFD flappings [1/min]                                  1.85    3.75    5.66    5.71    5.03
Num flaps                              0       0       17      17      43      20      37
OpenFlow connection timeouts in OVS    0       0       0       0       0       0       3       3
Connection closed by peer (ODL)                                                        20      15
Connection reset by peer (ODL)                                                         2       0
• Packet drops in the Rx queue of the physical port affect tenant data, BFD and OVS control plane packets equally
• "ping -f" to the br-ctl interface is used to quantify the control plane impact
  • Ping packet drop is in line with the overall packet drop
  • RTT jumps from 50 µs to 3 ms
• BFD flapping occurs already at moderate overload
  • The rate increases with overload
• Above 45% packet drop the OpenFlow control channel breaks due to missed Echo Replies
Source: Ericsson
Test setup: dual-socket Xeon E5-2697 v3 @ 2.60 GHz, 14 cores + HT, 896K L1, 3584K L2, 35 MB L3 cache; NIC: Intel Fortville X710, 4 x 10 Gbit/s; OvS 2.6, 1 PMD, 1 phy port, 1 vhostuser port; VM: TRex DPDK traffic source/sink
12. Measurements – Impact of Egress Link Overload
• Egress link overload does not affect the control plane
• Outgoing packets are forwarded by the ovs-vswitchd thread, which has its own dedicated Tx queue in the Fortville NIC
• The NIC schedules packets from each of the Tx queues in a fair manner, so the ovs-vswitchd queue gets sufficient bandwidth on the link
• Incoming packets are not affected, as neither the link nor the PMDs are overloaded
• No BFD flapping
10G link from OvS overloaded with outgoing traffic from the VM (1500 byte packets)

Offered load from VM [Kpps]            800      900      1000     1200     1600
Offered load from VM [Gbit/s]          9.80     11.03    12.26    14.71    19.61
Transmitted load on phy port [Gbit/s]  9.81     9.90     9.88     9.88     9.88
Link overload [%]                      0%       11%      24%      49%      99%
PMD utilization [%]                    41.45%   46.30%   50.20%   56.36%   69.89%
ping -f average RTT [ms]               0.109    0.205    0.206    0.210    0.204
ping -f packet drop [%]                0%       0%       0%       0%       0%
BFD flappings [1/min]
Num flaps                              0        0        0        0        0
OpenFlow connection timeouts in OVS    0        0        0        0        0
Connection closed by peer (ODL)
Connection reset by peer (ODL)

Source: Ericsson
Test setup: dual-socket Xeon E5-2697 v3 @ 2.60 GHz, 14 cores + HT, 896K L1, 3584K L2, 35 MB L3 cache; NIC: Intel Fortville X710, 4 x 10 Gbit/s; OvS 2.6, 1 PMD, 1 phy port, 1 vhostuser port; VM: TRex DPDK traffic source/sink
13. Use Case 3: QoS for Tenant Data
All tenant data traffic is equal!?
Well, some packets are more equal than others!
• Virtual Network Functions send/receive a large variety of network traffic
  • Top prio: critical internal control plane (e.g. cluster membership)
  • …
  • Min prio: bulk user plane
• VNFs need prioritization for their critical traffic in the NFVI
• How to orchestrate and implement the necessary QoS end-to-end?
• Will need additional priority levels and packet marking (e.g. IP DiffServ)
14. Desired Ingress Prioritization on Physical Ports
• Priority 1: In-band control plane
  • Untagged LACP packets
  • BFD packets inside tunnel, based on the IP DSCP of the outer IP header
• Priority 2: VIM control plane
  • Certain prioritized VLAN tags
• Priority 3+: Prioritized tenant data
  • E.g. based on the IP DSCP of the outer IP header
• Base priority
  • Non-prioritized traffic, spread through RSS over multiple Rx queues
15. Ingress Scheduling
"Schedulers arrange and/or rearrange packets for output."
-- http://www.tldp.org/HOWTO/html_single/Traffic-Control-HOWTO/#e-scheduling
[Diagram: a single Rx queue feeds the PMD, which forwards to a Tx queue; BFD and LACP are handled by ovs-vswitchd; a priority packet, e.g. control plane, sits in the Rx queue behind other traffic]
16. Ingress Scheduling – Implementation
[Diagram: the physical port now has two Rx queues (RX Queue x2); prioritized packets are steered into a dedicated priority queue in front of the PMD]
• The DPDK rte_flow API installs rxq assignment filters on supported vendors' NICs (see the sketch below)
• The PMD empties the priority queue before reading the non-priority queue
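The rte_flow part of this could look roughly like the sketch below: it matches the untagged LACP EtherType (0x8809) used in the first configuration example later in the deck and steers hits into a dedicated priority Rx queue. This is an illustrative sketch, not the OVS-DPDK patch itself; install_lacp_priority_filter, the prio_queue parameter and the DPDK 17.x field names are assumptions.

```c
/* Sketch only: steer untagged LACP frames (EtherType 0x8809) into a
 * dedicated "priority" Rx queue via rte_flow. Not the OVS patch code;
 * struct field names follow DPDK 17.x. */
#include <rte_byteorder.h>
#include <rte_flow.h>

static struct rte_flow *
install_lacp_priority_filter(uint16_t port_id, uint16_t prio_queue)
{
    struct rte_flow_attr attr = { .ingress = 1 };

    /* Match EtherType 0x8809 (slow protocols, i.e. LACP). */
    struct rte_flow_item_eth eth_spec = { .type = RTE_BE16(0x8809) };
    struct rte_flow_item_eth eth_mask = { .type = RTE_BE16(0xffff) };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_spec, .mask = &eth_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    /* Action: deliver matching packets to the priority Rx queue. */
    struct rte_flow_action_queue queue = { .index = prio_queue };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    struct rte_flow_error error;
    if (rte_flow_validate(port_id, &attr, pattern, actions, &error) != 0) {
        return NULL;    /* NIC/driver does not support this filter. */
    }
    return rte_flow_create(port_id, &attr, pattern, actions, &error);
}
```

The idea on the slides is that OVS would install such a filter on the physical port whenever an ingress_sched prioritization condition is configured, and report a failure through the error key described on the error-reporting slide when the NIC cannot apply it.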
17. Ingress Scheduling – Implementation
1. Move the packet prioritization decision to the NIC
2. Place prioritized packets on a separate Rx queue
3. Read preferentially from the "priority" RxQ (see the polling sketch below). Keep it simple:
   • Read from the priority queue until it is empty
   • Service the other queues
   • Repeat
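A minimal sketch of such a strict-priority polling loop in a DPDK application is shown below. It is not the OVS PMD code; poll_port(), process_burst() and the queue layout are assumptions made for illustration.

```c
/* Sketch of strict-priority Rx polling: drain the priority queue first,
 * then service the remaining queues. Illustrative only; helper names and
 * queue layout are assumptions, not OVS code. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Hypothetical packet handler; a real datapath would forward the packets. */
static void
process_burst(struct rte_mbuf **pkts, uint16_t n)
{
    for (uint16_t i = 0; i < n; i++) {
        rte_pktmbuf_free(pkts[i]);
    }
}

static void
poll_port(uint16_t port_id, uint16_t prio_queue, uint16_t n_rxq)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb;

    /* 1. Read from the priority queue until it is empty. */
    do {
        nb = rte_eth_rx_burst(port_id, prio_queue, pkts, BURST_SIZE);
        process_burst(pkts, nb);
    } while (nb == BURST_SIZE);

    /* 2. Service the other (non-priority) queues once each. */
    for (uint16_t q = 0; q < n_rxq; q++) {
        if (q == prio_queue) {
            continue;
        }
        nb = rte_eth_rx_burst(port_id, q, pkts, BURST_SIZE);
        process_burst(pkts, nb);
    }
    /* 3. Repeat: the PMD main loop simply calls poll_port() again. */
}
```

Under overload this keeps the priority queue drained at the expense of the non-priority queues, which is the protection behaviour measured on the following slides.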
18. Ingress Scheduling – Latency effect
• ~99.9% of packets already have a latency < 20 µs
• There are 10x to 50x fewer packets in any given latency bucket – good
• But worst-case latency does not improve
Source: Intel
Test setup: dual-socket Xeon E5-2695 v3 @ 2.30 GHz, 14 cores, no HT, 896K L1, 3584K L2, 35 MB L3 cache; NIC: Intel Fortville X710, 4 x 10 Gbit/s; OvS 2.7.90, 1 PMD, 2 phy ports, hardware traffic source/sink
19. Ingress Scheduling – Overload protection
[Diagram: the same two-queue setup with PMD 1 overloaded; the non-priority Rx queue fills up and drops packets while the priority queue is still read first; BFD and LACP are handled by ovs-vswitchd]
20. Ingress Scheduling – Traffic Protection
• Overload the PMD with 64-byte DPDK traffic on dpdk0
  → 100% PMD load in pmd-stats-show
  → 25% rx packet drop on dpdk0
• Add iperf3 UDP traffic (256 bytes) in parallel over dpdk1
• Measurement result:
[Test setup: a traffic-generator server (bare metal) runs dpdk-pktgen and an iperf3 UDP client; the SUT server runs OvS (bridge br-prv, one PMD, ports dpdk0/dpdk1), a VM with DPDK testpmd attached via vhostuser, and an iperf3 UDP server; both servers are connected through a ToR switch; the pktgen traffic enters on dpdk0 (low priority), the iperf3 traffic enters on dpdk1 (high or low priority depending on the condition). Source: Ericsson]
                          Condition 1:           Condition 2:
                          dpdk1 low priority     dpdk1 high priority
iperf3 UDP throughput     not measured           1 Gbit/s / 460 Kpps 1)
iperf3 UDP packet loss    28%                    0%

1) iperf3 throughput limited by the UDP/IP stack on the client side

Test setup: dual-socket Xeon E5-2680 v4 @ 2.40 GHz, 14 cores + HT, 896K L1, 3584K L2, 35 MB L3 cache; NIC: Intel Fortville X710, 4 x 10 Gbit/s; OvS 2.6, 1 PMD, all ports and the VM on NUMA node 0
21. Ingress Scheduling – Configuration
• $ ovs-vsctl set Interface phy1
      ingress_sched:eth_type=0x8809
Fields as per ovs-fields(7) and ovs-ofctl add-flow; not all netdevs/NICs will support all combinations.
A single prioritization condition.
22. Ingress Scheduling – Configuration (future)
• $ ovs-vsctl set Interface phy1
      ingress_sched:vlan_tci=0x1123/0x1fff,ip,ip_dscp=0x5
Several different prioritization conditions: a AND b.
23. Ingress Scheduling – Configuration (future)
• $ ovs-vsctl set Interface phy1
      ingress_sched:
          filter=vlan_tci=0x1123/0x1fff
          filter=ip,ip_dscp=0x5
Several different prioritization conditions: a OR b.
24. Ingress Scheduling – Configuration (future)
• $ ovs-vsctl set Interface phy1
      ingress_sched:
          prio=1,
          filter,vlan_tci=0x1123/0x1fff,
          filter,eth_type=0x8809,
          prio=2,
          filter,ip,ip_dscp=0x5,
Traffic priority levels: support several levels of prioritization, e.g. High and Low but also, for instance, a Critical level.
25. Ingress Scheduling – Configuration (future)
• $ ovs-vsctl set Interface phy1
      ingress_sched:
          prio=2,
          filter,ip,ip_dscp=0x5,
          prio=1,
          filter=vlan_tci=0x1123/0x1fff,
          filter,eth_type=0x8809
Filter priority: filter groups are applied in the order in which they appear on the configuration line.
26. Ingress Scheduling – Error reporting
• ovsdb-schema:
      <table name="Interface"…
          <column name="ingress_sched" key="err">
          If the specified ingress scheduling could not be applied, Open vSwitch sets this column to an error description in human-readable form. Otherwise, Open vSwitch clears this column.
27. Ingress Scheduling – RxQs & RSS
• $ ovs-vsctl set Interface phy1
      options:n_rxq=4
• $ ovs-vsctl set Interface phy1
      ingress_sched:
          prio=2,
          filter,ip,ip_dscp=0x5,
          prio=1,
          filter=vlan_tci=0x1123/0x1fff,
          filter,eth_type=0x8809
n_rxq configures the RSS queues; the ingress_sched filters add additional priority queues on top (see the sketch below).
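One way to realize this split at the DPDK level could be to configure extra Rx queues beyond the RSS set and point the NIC's RSS redirection table (RETA) only at the non-priority queues, so that the priority queues are filled solely by the rte_flow filters. The sketch below illustrates that idea; it is an assumption about one possible mechanism, not code from the OVS patch, and restrict_rss_to_first_queues() is a made-up helper.

```c
/* Sketch only: restrict RSS to the first n_rss queues of a port, leaving
 * higher-numbered queues untouched by RSS so that rte_flow QUEUE actions
 * alone can populate them. One possible mechanism, not the OVS patch. */
#include <string.h>
#include <rte_ethdev.h>

static int
restrict_rss_to_first_queues(uint16_t port_id, uint16_t n_rss)
{
    struct rte_eth_dev_info info;
    rte_eth_dev_info_get(port_id, &info);
    if (info.reta_size == 0) {
        return -1;    /* NIC does not expose a configurable RETA. */
    }

    struct rte_eth_rss_reta_entry64 reta_conf[info.reta_size / RTE_RETA_GROUP_SIZE];
    memset(reta_conf, 0, sizeof(reta_conf));

    for (uint16_t i = 0; i < info.reta_size; i++) {
        reta_conf[i / RTE_RETA_GROUP_SIZE].mask |=
            1ULL << (i % RTE_RETA_GROUP_SIZE);
        /* Spread the hash buckets round-robin over the RSS queues only. */
        reta_conf[i / RTE_RETA_GROUP_SIZE].reta[i % RTE_RETA_GROUP_SIZE] = i % n_rss;
    }
    return rte_eth_dev_rss_reta_update(port_id, reta_conf, info.reta_size);
}
```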
28. Ingress Scheduling – Next Steps
• Avoid poor rxq -> pmd assignment
[Diagram: Rx-queue-to-PMD assignment across four PMDs]
29. Ingress Scheduling – Next Steps
• Use the rte_flow API for offload
• Extend to several priorities
  • Priorities of overlapping filters
  • Multiple traffic priorities
• Working with the RFC 'Flow Offload' feature…
• …
• Prioritization to the guest…
30. Summary
• OvS-DPDK in an NFVI context needs ingress scheduling to protect priority traffic against PMD overload
• SW priority-queue handling in the PMD loop is effective
  • Could be upstreamed first, with the priority configurable per port
• Off-loading classification and queue selection to the NIC through the rte_flow API allows a generic solution
  • Interaction with the RFC Flow Classification Offload
• Work in progress
  • Lots left to figure out
  • We are open to suggestions and collaboration