7. Original Approach - Recomputing
● Compute OVS flows by reprocessing all inputs when
○ Any input changes
○ Or even when nothing relevant changed (just unrelated events)
● Benefit
○ Relatively easy to implement and maintain
● Problems
○ ovn-controller at 100% CPU on all compute nodes
○ High control plane latency
8. Solution - Incremental Processing Engine
● DAG representing dependencies
● Each node contains
○ Data
○ Links to input nodes
○ Change-handler for each input
○ Full recompute handler
● Engine
○ DFS post-order traversal of the DAG from the final output node
○ Invoke change-handlers for inputs that changed
○ Fall back to full recompute if, for ANY of its inputs:
■ The change-handler is not implemented for that input, or
■ The change-handler cannot handle the particular change (returns false)
(Figure: example DAG - input nodes feeding intermediate nodes, converging on a final output node)
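The traversal and fallback rules above can be sketched as follows. The node structure and handler signatures are illustrative, not the actual ovn-controller code.

```python
# Minimal sketch of the incremental processing engine: DFS post-order over
# the dependency DAG, per-input change-handlers, full-recompute fallback.

class Node:
    def __init__(self, name, inputs=(), change_handlers=None, recompute=None):
        self.name = name
        self.inputs = list(inputs)                    # links to input nodes
        self.change_handlers = change_handlers or {}  # input name -> handler
        self.recompute = recompute                    # full recompute handler
        self.changed = False

def run(node, visited=None):
    """DFS post-order traversal from the final output node."""
    visited = set() if visited is None else visited
    if node.name in visited:
        return
    visited.add(node.name)
    for inp in node.inputs:                # visit inputs first (post-order)
        run(inp, visited)
    changed = [i for i in node.inputs if i.changed]
    if not changed:
        return
    # Try the per-input change-handlers; fall back to a full recompute if a
    # handler is missing for a changed input or a handler returns False.
    for i in changed:
        handler = node.change_handlers.get(i.name)
        if handler is None or not handler(node, i):
            if node.recompute:
                node.recompute(node)
            break
    node.changed = True  # sketch: assume this node's data changed
```

A node with handlers for every changed input is updated incrementally; one missing handler (or one returning false) triggers its full recompute, exactly once.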
10. CPU Efficiency Improvement
● Create and bind 10k ports on 1k HVs
○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz)
○ 10k ports all under the same logical router
○ Batch size: 100 lports
○ Bind ports one by one within each batch
○ Wait for all ports to come up before the next batch
11. Latency Improvement
● End-to-end latency on top of 10k existing logical ports
○ Create one more logical port and bind the port on an HV
○ Wait until northd generates lflows and creates the port-binding in SB
○ Wait until ovn-controller claims the port on the HV
○ Wait until northd generates all lflows
○ Wait until OVS flows are programmed on all HVs
12. Tests at Larger Scale
● Next bottlenecks:
○ OVS flow installation
○ Port-binding handling when the binding happens locally
13. What’s next for Incremental-Processing (WIP)
● Incremental flow installation
○ Low-hanging fruit - with the help of incremental flow computing
● Implement more change handlers as needed
○ E.g. support incremental processing when port-binding happens locally - further improves end-to-end latency
● New implementation: Differential Datalog (DDlog)
○ Data-flow approach
○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling)
● Upstream?
○ Not upstreamed, because DDlog is the preferred long-term solution
○ For those who need this:
■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc
■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11
■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
14. OVN-Controller Other Improvements (WIP)
● Reduce data size per-HV
○ Problem: External Provider Network connects everything
○ Solution: Don’t cross external network boundary when calculating connected datapaths
● On-demand tunnel port creation
○ Problem: Too many OVS ports when there are a lot of HVs
○ Solution: Create a tunnel to a remote host only if local ports are logically connected to ports on that host
15. SB DB Scaling Challenges
● Factors
○ Number of clients (HVs & GWs)
○ Size of data
○ Rate of changes
● Problems
○ Probe handling
○ Data resync during restart/failover
○ Clustered-mode problems
(Figure: OVN architecture - CMS (OpenStack/K8S) → North-bound ovsdb → Northd → South-bound ovsdb on the central node; OVN-Controller and OVS on each HV/GW speak the OVSDB protocol (RFC7047) to the SB DB; abstraction layers: Virtual Network Abstractions → Logical Flows → OpenFlows)
16. SB DB Probe
● Default 5 sec probe interval causes connection flapping
○ ovsdb-server responses can occasionally take longer than 5 sec
■ DB log compression
■ Large transaction handling
○ Clients reconnecting adds more load to the server - cascade failure
■ Clients resync data from server (solved - see next slide)
● Solution
○ Increase probe interval
■ Client side (on HVs):
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000
■ Server side (DON'T FORGET!!):
ovn-sbctl -- --id=@conn_uuid create Connection target="ptcp:6642:0.0.0.0" inactivity_probe=0 -- set SB_Global . connections=@conn_uuid
○ Rely on external monitoring for HV connectivity
17. Data re-sync during DB reconnect
● Problem
○ OVSDB client caching => NOT a problem
○ Server restart/failover: re-syncs data for all clients => this is the problem!
● Solution - OVSDB fast re-sync (in master -> v2.12)
○ Track and maintain recent transaction history on disk and in memory
○ New OVSDB protocol method monitor_cond_since requests changes since the last point before the connection was lost
○ Note: currently works for clustered mode only
● Test Result - 1k HVs, 10k ports
○ Before: SB DB at 100% CPU, >30 min to recover
○ After: no CPU spike, all connections restored in <1 min (probe interval)
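The fast re-sync idea can be sketched as follows: the server keeps a short history of recent transactions; a reconnecting client presents the last transaction ID it saw and receives only the deltas if that ID is still in the history, otherwise the full snapshot. All names and the HISTORY_LEN limit here are illustrative, not the actual ovsdb-server implementation.

```python
# Hypothetical model of fast re-sync via a tracked transaction history.

HISTORY_LEN = 100  # number of recent transactions the server retains

class FastResyncServer:
    def __init__(self, snapshot):
        self.snapshot = dict(snapshot)  # full DB contents
        self.history = []               # [(txn_id, delta), ...]

    def commit(self, txn_id, delta):
        self.snapshot.update(delta)
        self.history.append((txn_id, delta))
        del self.history[:-HISTORY_LEN]  # drop entries beyond the window

    def monitor_cond_since(self, last_txn_id):
        """Return (in_history, payload): deltas after last_txn_id if it is
        still tracked, otherwise the full snapshot (classic re-sync)."""
        ids = [t for t, _ in self.history]
        if last_txn_id in ids:
            idx = ids.index(last_txn_id)
            return True, [d for _, d in self.history[idx + 1:]]
        return False, dict(self.snapshot)
```

This is why recovery stays cheap: after a brief outage the client's last transaction ID is almost always still within the window, so no full-state transfer happens.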
18. OVSDB Clustered Mode
● Raft based clustering (experimental support since v2.9)
● Problems at scale
○ High CPU load (solved in master)
○ Follower update latency (solved in master)
○ Leader flapping (WIP, workaround ready)
○ Client reconnect (solved in master)
19. OVSDB Clustered Mode - High CPU
● OVSDB Raft Implementation
○ Followers preprocess transactions before sending them to the leader - shares some of the leader's load
○ The preprocessed transaction is sent to the leader together with a prerequisite version ID
● Problem
○ Lots of prerequisite-check failures and retries at large scale
■ Different HVs update chassis/port_binding at the same time through different follower nodes
○ Continuous retry causes 100% CPU
● Solution (in master -> v2.12)
○ Retry only when the follower has applied the largest local Raft log index
■ Otherwise, the prerequisite is already out of date, so don't waste CPU
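The retry gate above reduces to a single comparison; this is a sketch of the condition, not the real ovsdb-server code.

```python
# A follower retries a transaction whose prerequisite check failed only once
# it has applied entries up to the highest Raft log index it knows about;
# before that point, the prerequisite version it would send is guaranteed
# to be stale, so the retry would just burn CPU.

def should_retry(applied_index, last_known_index):
    """True only when the follower has caught up with the log."""
    return applied_index >= last_known_index
```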
20. OVSDB Clustered Mode - Follower Latency
● Original behavior: leader sends Raft log update to follower nodes when:
○ A new change is proposed, or
○ A heartbeat is sent
● Problem
○ Updates submitted through a follower node suffer high latency
● Solution (in master -> v2.12)
○ Send log to followers as soon as a new entry is committed
● Test result: 100 updates through same follower from same client
○ Before: >30 sec
○ After: 500 ms
21. OVSDB Clustered Mode - Leader Flapping
● Problem: heartbeat timeout, triggering re-election
○ Large transaction execution
○ Raft log compression (snapshot)
● Solution
○ Quick and dirty: Increase election timeout (hardcoded)
○ Short term: Make election timeout configurable at cluster level (WIP)
○ Longer term: Separate thread for Raft RPC (WIP)
■ Still need to configure timeout for snapshot scenarios
22. OVSDB Clustered Mode - Client Reconnect
● Problem: during leader failover, all clients of new leader will reconnect
○ DB state changes to “disconnected” when there is no leader (temporarily)
○ Client tries to reconnect to a new node
● Solution (in master -> v2.12)
○ Don’t change state to “disconnected” if
■ The current node is a candidate, and
■ The election has not timed out yet
23. Scale Test for Clustered Mode
● Setup
○ 3-node cluster, 1k HVs
○ Election timeout: 10s (hardcoded in the test)
● Test
○ Keep creating and binding ports up to 10k
○ Periodically kill->wait(10s)->start each ovsdb-server randomly
● Test passed at scale!
○ All port creation and binding completed correctly.
○ Fast-resync helped!
24. Further Improvement: SB-DB Scale-out Replicas (TODO)
● How to support more HVs - 2k? 5k? 10k?
○ More nodes in the cluster? Doesn’t scale.
○ Multi-threading OVSDB? Would help, but...
● Precondition: no writes to SB from HVs
○ Chassis/Encap/Port-binding updated by CMS/northd only
○ Does not use dynamic ARP (mac-binding)
● How
○ Use OVSDB replication mode to create N read-only replicas
○ Shard HV connections across the read-only replicas
○ An HV can fail over to other replicas
(Figure: CMS → NB ovsdb → Northd → SB ovsdb, with SB Replica 1..n each serving its own shard of HVs)
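Sharding HV connections across read-only replicas could look like the sketch below: deterministic hash-based assignment with failover to the remaining replicas. All names here are hypothetical, not an existing OVN mechanism.

```python
# Illustrative sharding of HV connections over read-only SB replicas.
import hashlib

def pick_replica(hv_name, replicas, failed=()):
    """Deterministically map an HV onto a live replica."""
    alive = [r for r in replicas if r not in failed]
    if not alive:
        raise RuntimeError("no SB replica available")
    # Stable hash of the chassis name spreads HVs evenly across replicas.
    h = int(hashlib.sha256(hv_name.encode()).hexdigest(), 16)
    return alive[h % len(alive)]
```

Because assignment is deterministic, every HV reconnects to the same replica after a restart, and only the HVs on a failed replica move.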
26. OVN-Northd Incremental Processing (WIP from community)
● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
○ Inputs - NB DB tables (logical routers, switches, ports, etc.)
○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
○ Rules to convert inputs to outputs
● Differential Datalog
○ An open-source datalog language for incremental data-flow processing
○ Defining inputs and outputs as relations
○ Defining rules to generate outputs from inputs
● Efforts can be reused by OVN-Controller
○ OVSDB - DDlog wrappers
○ Process framework changes
28. Some More Scaling Problems
● Security Group / Network policy using ACLs
● Nested workloads (K8S containers)
29. ACLs
● Used by Security Group (OpenStack) / Network Policy (K8S)
● Typical use case: members of the same group are allowed to access each other
● Naked (no address sets) => O(N^2)
● Using Address Set => O(N)
● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
Naked:
outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
...
outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
With Address Set:
outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
...
outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
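The two shapes above can be generated mechanically; this sketch (helper names are illustrative) makes the size difference concrete: naked ACLs inline all N member IPs into each of the N flows (O(N^2) total match terms), while address sets leave one reference per flow (O(N)).

```python
# Sketch of "naked" vs. address-set logical flow generation for a group ACL.

def naked_flows(port_uuids, member_ips):
    """One flow per port, each listing every member IP inline."""
    ip_list = ", ".join(member_ips)
    return [f"outport == {p} && ip4 && ip4.src == {{{ip_list}}}"
            for p in port_uuids]

def address_set_flows(port_uuids, as_name):
    """One flow per port, each referencing a single named address set."""
    return [f"outport == {p} && ip4 && ip4.src == ${as_name}"
            for p in port_uuids]
```

Adding one member to the group rewrites every naked flow, but only the address set's contents in the second form.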
30. Solution - Port Group (Released in v2.10)
● All-in-one
● Greatly simplified CMS implementation
○ networking-ovn
○ ovn-kubernetes
● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group
○ E.g.
■ N members in a port-group, all M ports on HV1 belong to this group
■ Number of OVS flows on HV1 will be M + N, instead of M * N
outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
(CMS creates a port-group instead of an address-set; OVN-Northd generates the address-set for you)
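The flow-count arithmetic from the example above (M ports on the HV, all in one port-group with N members) is simply:

```python
# Conjunction replaces the cross-product of (port, member) flows with one
# conjunction flow per port plus one per member.

def flows_without_conjunction(m, n):
    return m * n  # one flow per (port, member) pair

def flows_with_conjunction(m, n):
    return m + n  # M port-dimension flows + N member-dimension flows
```

For 10 local ports and a 1000-member group, that is 1,010 flows instead of 10,000.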
31. Further Improvement - Group-ID in Packet (TODO)
● Problem - still too many OVS flows
○ Best case: M + N, if all M ports on the HV belong to the same group
○ Worst case: M * N, if ports are distributed randomly
■ M ports on the HV, each belonging to a different group, each group with N members
● Solution (just an idea)
○ Encode the port-group in tunnel metadata
■ Only M flows in all cases
■ Best part: no local flow changes needed for remote member changes
○ Challenge: what if a port belongs to multiple groups?
■ Limit the number of groups for a single port
■ Fall back to the old way if the limit is exceeded
○ Limitation: works for ingress (to-lport) rules only
outport == @port_group1 && src_group_id == <group1 id>
(src_group_id is matched from tunnel metadata)
32. Scaling Nested Workloads
● Use Case
○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn)
○ Run Kubernetes on top of the VMs
● Problem
○ How to connect the pods at scale?
33. ARP Proxy
● OVN doesn’t support MAC learning (MAC-port binding learning), but IP-MAC bindings can be learned through ARP
● How
○ The LR sends ARP requests for Pod IPs
○ An ARP proxy in the VM replies with the VM’s MAC for all Pod IPs on the VM
● Works, but
○ Requires VM and Pods on the same subnet
○ Unreliable when the SB DB connection fails
○ Scale: O(N), N = number of pods, usually much bigger than the number of VMs
■ Note: the IP-MAC binding incremental-processing change handler is implemented - no recompute
(Figure: Pods (10.0.0.102-105) behind an ARP proxy in the VM (10.0.0.2, aa:bb:cc:dd:ee:ff); LR ARP cache (dynamic): 10.0.0.102, 10.0.0.103, 10.0.0.104, … => aa:bb:cc:dd:ee:ff; bindings stored via ovn-controller in the SB IP-MAC Binding Table)
34. LR Static Route
● Assign Pod subnet(s) per VM (minion)
● How
○ Configure static routes in the OVN LR for pod subnets: next hop = VM IP
● Considerations
○ De-couples VM and Pod subnets
○ Declarative, more reliable than ARP
○ May waste more IPs, but the subnet size is flexible
○ Scale: O(S), S = number of pod subnets
■ Worst case O(N), N = number of pods, if the subnet size is /32
(Figure: Pods 10.0.0.2-5/25 on a VM at 172.0.0.2/24; LR routing table (static): 10.0.0.0/25 => 172.0.0.2, 10.0.0.128/25 => 172.0.1.100, …)
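The per-minion scheme above amounts to one LR static route per pod subnet with the VM (minion) IP as next hop; a minimal sketch (the helper name and data shape are hypothetical):

```python
# Build LR static-route entries from a pod-subnet -> minion-IP assignment.
# Route count is O(S) in the number of pod subnets, not O(N) in pods.

def pod_subnet_routes(subnet_to_vm_ip):
    """{pod_subnet_cidr: vm_ip} -> LR static-route entries."""
    return [f"{subnet} => {vm_ip}"
            for subnet, vm_ip in sorted(subnet_to_vm_ip.items())]
```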