Large Scale Overlay Networks with OVN:
Problems and Solutions
Han Zhou (hzhou8@ebay.com)
Open Infrastructure Summit - Denver, 2019
Agenda
● Background
● Control-plane components scaling
○ OVN-Controller
○ South-bound DB
○ OVN-Northd
● Scaling ACL
● Scaling nested workloads (containers on VMs)
Background of OVN
● SDN solution developed by OVS (Open vSwitch) community
● OpenStack support - neutron ML2 plugin: networking-ovn
● Kubernetes support - CNI plugin: ovn-kubernetes
● Main Features
○ Full L2/L3 virtualization with overlay networks (Geneve, STT, VXLAN)
○ L2 gateway, L3 gateway (centralized/distributed) & NAT with HA
○ L4 ACLs (stateful FW) with address-set, port-group and packet logging
○ Distributed Load-Balancer
○ L2/L3 Port-security
○ ARP responder, static/dynamic ARP
○ Flat/VLAN physical networks
○ Native DHCP, Metadata
○ Parent-child ports for nested workloads
○ QoS
○ IPSec
○ Policy-based routing
○ ...
Distributed Control Plane
● Logical/physical separation
● Distributed local controllers
● Database Approach (ovsdb)
[Figure: the CMS (OpenStack/K8S) writes virtual network abstractions to the North-bound ovsdb; Northd translates them into logical flows in the South-bound ovsdb; the OVN-Controller on each HV and GW connects to the South-bound ovsdb over the OVSDB protocol (RFC7047) and programs OpenFlows into the local OVS. North-bound ovsdb, Northd and South-bound ovsdb form the "Central" part.]
OVN-Controller Scaling Challenges
● Factors
○ Size of data
○ Rate of changes
● Challenges
○ Large volume of data to process
■ E.g. 10k logical ports generate >40k logical flows and 10k port-bindings
○ Logical flow parsing is CPU intensive
○ Cloud workloads change frequently
○ Many inputs feed the flow computation
Dependency Graph of OVN-Controller
[Figure: dependency graph of OVN-Controller, containing the nodes listed below.]
● SB OVSDB input: logical_flow, chassis, encap, mc_group, dp_binding, port_binding, mac_binding, dhcp, dhcpv6, dns, gw_chassis, addr_set, port_group
● Local OVSDB input: open_vswitch, bridge, port, qos
● Other nodes: Address Sets (converted), Port Groups (converted), MFF OVN Geneve
● Runtime Data: Local_datapath, Local_lports, Local_lport_ids, Active_tunnels, Ct_zone_bitmap, Pending_ct_zones, Ct_zones
● Flow Output: Desired_flow_table, Group_table, Meter_table, Conj_id_ofs
Original Approach - Recomputing
● Compute OVS flows by reprocessing all inputs when
○ Any input changes
○ Or even when there is no change at all (but just unrelated events)
● Benefit
○ Relatively easy to implement and maintain
● Problems
○ 100% CPU of ovn-controller process on all compute nodes
○ High control plane latency
Solution - Incremental Processing Engine
● DAG representing dependencies
● Each node contains
○ Data
○ Links to input nodes
○ Change-handler for each input
○ Full recompute handler
● Engine
○ Traverse the DAG in DFS post-order, starting from the final output node
○ Invoke change-handlers for inputs that changed
○ Fall back to recompute if, for ANY of a node's inputs:
■ Change-handler is not implemented for that input, or
■ Change-handler cannot handle the particular change (returns false)
[Figure: example DAG in which input nodes feed intermediate nodes, which feed the output node.]
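The engine rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ovn-controller implementation (which is written in C): the `Node` class, the `run` function, the handler names and the toy "flows" are all hypothetical. Input nodes are marked `changed` by the caller, and each change-handler returns `False` when it cannot handle a change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One node of the dependency DAG (hypothetical structure)."""
    name: str
    recompute: Callable[["Node"], None]            # full recompute handler
    inputs: List["Node"] = field(default_factory=list)
    # Change-handler per input name; returns False if it cannot cope.
    handlers: Dict[str, Callable[["Node", "Node"], bool]] = field(default_factory=dict)
    data: dict = field(default_factory=dict)
    changed: bool = False


def run(node: Node, visited: Optional[set] = None) -> None:
    """DFS post-order: bring every input up to date before the node itself."""
    visited = set() if visited is None else visited
    if node.name in visited:
        return
    visited.add(node.name)
    for inp in node.inputs:
        run(inp, visited)
    if not node.inputs:
        return  # pure input node: 'changed' is set by the caller
    changed_inputs = [i for i in node.inputs if i.changed]
    if not changed_inputs:
        node.changed = False
        return
    for inp in changed_inputs:
        handler = node.handlers.get(inp.name)
        if handler is None or not handler(node, inp):
            node.recompute(node)  # fall back to a full recompute
            break
    node.changed = True


def recompute_flows(node: Node) -> None:
    """Rebuild the toy flow set from all rows of all inputs."""
    node.data["flows"] = {f"flow-{r}" for i in node.inputs for r in i.data["rows"]}
    node.data["recomputes"] = node.data.get("recomputes", 0) + 1


def on_port_binding_change(node: Node, inp: Node) -> bool:
    """Handles additions incrementally; anything else forces a recompute."""
    if inp.data.get("removed"):
        return False
    node.data["flows"] |= {f"flow-{r}" for r in inp.data["added"]}
    return True


def no_op(node: Node) -> None:
    pass


port_binding = Node("port_binding", no_op,
                    data={"rows": ["p1"], "added": [], "removed": []})
logical_flow = Node("logical_flow", no_op, data={"rows": ["lf1"]})
flow_output = Node("flow_output", recompute_flows,
                   inputs=[port_binding, logical_flow],
                   handlers={"port_binding": on_port_binding_change})

recompute_flows(flow_output)                 # initial full computation

# A port is added: the change-handler applies it without a recompute.
port_binding.data["rows"].append("p2")
port_binding.data["added"] = ["p2"]
port_binding.changed = True
run(flow_output)
recomputes_after_increment = flow_output.data["recomputes"]

# logical_flow changes, but has no change-handler: full recompute.
port_binding.changed = False
port_binding.data["added"] = []
logical_flow.data["rows"].append("lf2")
logical_flow.changed = True
run(flow_output)
```

In this toy run, the port addition is absorbed by the change-handler, while the logical_flow change triggers the fallback because no handler is registered for that input.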
Change Handler Implemented
[Figure: the dependency graph of OVN-Controller again, with a legend entry "Input with change handler implemented" marking the inputs that already have a change handler.]
CPU Efficiency Improvement
● Create and bind 10k ports on 1k HVs
○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz)
○ 10k ports all under the same logical router
○ Batch size 100 lports
○ Bind ports one by one within each batch
○ Wait for all ports to be up before the next batch
Latency Improvement
● End-to-end latency on top of 10k existing logical ports
○ Create one more logical port and bind the port on an HV
○ Wait until northd generates lflows and creates the port-binding in SB
○ Wait until ovn-controller claims the port on the HV
○ Wait until northd generates all lflows
○ Wait until OVS flows are programmed on all HVs
Tests at Larger Scale
● Next bottlenecks:
○ OVS flow installation
○ Port-binding handling when the binding happens locally
What’s next for Incremental-Processing (WIP)
● Incremental flow installation
○ Low hanging fruit - with the help of incremental flow computing
● Implement more change handlers as needed
○ E.g. support incremental processing when port-binding happens locally - further improve
end-to-end latency
● New implementation: Differential Datalog (DDlog)
○ Data-flow approach
○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling)
● Upstream?
○ Not in upstream, because DDlog is the preferred long term solution
○ For those who need this:
■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc
■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11
■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
OVN-Controller Other Improvements (WIP)
● Reduce data size per-HV
○ Problem: External Provider Network connects everything
○ Solution: Don’t cross external network boundary when calculating connected datapaths
● On-demand tunnel port creation
○ Problem: Too many OVS ports when there are a lot of HVs
○ Solution: Create a tunnel to a remote host only if that host has ports that are logically connected to local ports.
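Both ideas above can be sketched as a graph walk. This is a minimal illustration, not the ovn-controller code: the function names, the datapath names (`ls1`, `lr1`, `ext`, ...) and the choice to include the external network itself in the result while not crossing it are all assumptions made for the example.

```python
from collections import deque


def connected_datapaths(local, links, external):
    """BFS from datapaths that have local ports.

    The walk reaches an external provider network but does not
    continue through it (assumed behaviour for this sketch).
    """
    seen = set(local)
    queue = deque(local)
    while queue:
        dp = queue.popleft()
        if dp in external:
            continue  # do not cross the external network boundary
        for peer in links.get(dp, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def tunnel_peers(connected, port_locations, me):
    """Remote chassis hosting at least one port on a connected datapath."""
    return {chassis for dp, chassis in port_locations
            if dp in connected and chassis != me}


# Hypothetical topology: ls1 - lr1 - ext - lr2 - ls2
links = {
    "ls1": ["lr1"],
    "lr1": ["ls1", "ext"],
    "ext": ["lr1", "lr2"],
    "lr2": ["ext", "ls2"],
    "ls2": ["lr2"],
}
# (datapath, chassis) pairs for bound ports
port_locations = [("ls1", "hv1"), ("ls1", "hv2"), ("ls2", "hv3")]

bounded = connected_datapaths({"ls1"}, links, external={"ext"})
unbounded = connected_datapaths({"ls1"}, links, external=set())
peers = tunnel_peers(bounded, port_locations, me="hv1")
```

With the boundary respected, hv1 only needs the datapaths on its own side of `ext` and a tunnel to hv2; without it, every datapath in the example becomes "connected".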
SB DB Scaling Challenges
● Factors
○ Number of clients (HVs & GWs)
○ Size of data
○ Rate of changes
● Problems
○ Probe handling
○ Data resync during restart/failover
○ Clustered-mode problems
SB DB Probe
● Default 5 sec probe interval causes connection flapping
○ Ovsdb-server response can occasionally exceed 5 sec
■ DB log compression
■ Large transaction handling
○ Reconnecting clients add more load to the server - cascading failure
■ Clients resync data from server (solved - see next slide)
● Solution
○ Increase probe interval
■ Client side (on HVs)
● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000
■ Server side (DON’T FORGET!!)
● ovn-sbctl -- --id=@conn_uuid create Connection \
target="ptcp:6642:0.0.0.0" \
inactivity_probe=0 -- set SB_Global . connections=@conn_uuid
○ Rely on external monitoring for HV connectivity
Data re-sync during DB reconnect
● Problem
○ OVSDB client caching => NOT a problem
○ Server restart/failover: re-sync data for all clients. => This is the problem!
● Solution - OVSDB fast re-sync (in master -> v2.12)
○ Track and maintain recent transaction history on disk and in memory.
○ New method monitor_cond_since in OVSDB protocol, to request only the changes since the last point before the connection was lost.
○ Note: for now it works in clustered mode only.
● Test Result - 1k HVs, 10k ports
○ Before: SB DB 100% CPU, >30 min to recover.
○ After: No CPU spike, all connections restored in <1 min (probe interval).
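The idea behind fast re-sync can be illustrated with a toy model. This is a conceptual sketch only, not the OVSDB wire protocol or the ovsdb-server implementation: the `Server` and `Client` classes are hypothetical, transaction IDs are plain integers here, and row deletions are not modelled. The server keeps a bounded history of recent transactions, and a reconnecting client asks for everything after the last transaction it saw; if that point has fallen out of the history, it receives a full snapshot instead.

```python
from collections import deque


class Server:
    """Toy database server with a bounded transaction history."""

    def __init__(self, history_limit=100):
        self.db = {}
        self.history = deque(maxlen=history_limit)  # (txn_id, updates)
        self.last_txn_id = 0

    def commit(self, updates):
        self.last_txn_id += 1
        self.db.update(updates)
        self.history.append((self.last_txn_id, dict(updates)))
        return self.last_txn_id

    def changes_since(self, txn_id):
        """Return (found, last_txn_id, payload).

        found=True  -> payload is the list of missed transactions
        found=False -> payload is a full snapshot of the database
        """
        if txn_id == self.last_txn_id:
            return True, self.last_txn_id, []
        if any(t == txn_id for t, _ in self.history):
            missed = [(t, u) for t, u in self.history if t > txn_id]
            return True, self.last_txn_id, missed
        return False, self.last_txn_id, dict(self.db)


class Client:
    """Toy client that caches the database and remembers its position."""

    def __init__(self):
        self.cache = {}
        self.last_txn_id = 0
        self.rows_received = 0

    def sync(self, server):
        found, last_id, payload = server.changes_since(self.last_txn_id)
        if found:
            for _, updates in payload:
                self.cache.update(updates)
                self.rows_received += len(updates)
        else:
            self.cache = dict(payload)
            self.rows_received += len(payload)
        self.last_txn_id = last_id


server = Server(history_limit=10)
for i in range(1000):
    server.commit({f"port{i}": "up"})

client = Client()
client.sync(server)            # first connect: full snapshot

for i in range(1000, 1003):    # changes made while the client is away
    server.commit({f"port{i}": "up"})
client.sync(server)            # reconnect: only the missed transactions

stale = Client()
stale.last_txn_id = 5          # too old: no longer in the history
stale.sync(server)             # falls back to a full snapshot
```

The reconnecting client transfers three rows instead of the whole table, which is the effect the test result above attributes to fast re-sync.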
OVSDB Clustered Mode
● Raft based clustering (experimental support since v2.9)
● Problems at scale
○ High CPU load (solved in master)
○ Follower update latency (solved in master)
○ Leader flapping (WIP, workaround ready)
○ Client reconnect (solved in master)
OVSDB Clustered Mode - High CPU
● OVSDB Raft Implementation
○ Preprocessing on followers before sending to leader - shares some of the leader's load
○ Send preprocessed transaction to leader together with a prerequisite version ID
● Problem
○ Many prerequisite check failures and retries at large scale
■ Different HVs update chassis/port_binding at the same time through different follower nodes
○ Continuous retry causes 100% CPU
● Solution (in master -> v2.12)
○ Retry only when the follower has applied the largest local Raft log index
■ Otherwise, the prerequisite is already out-of-date, so don’t waste CPU
OVSDB Clustered Mode - Follower Latency
● Original behavior: leader sends Raft log update to follower nodes when:
○ A new change is proposed, or
○ A heartbeat is sent
● Problem
○ Updates made through a follower node suffer high latency
● Solution (in master -> v2.12)
○ Send log to followers as soon as a new entry is committed
● Test result: 100 updates through same follower from same client
○ Before: >30 sec
○ After: 500 ms
OVSDB Clustered Mode - Leader Flapping
● Problem: heartbeat timeout, triggering re-election
○ Large transaction execution
○ Raft log compression (snapshot)
● Solution
○ Quick and dirty: Increase election timeout (hardcoded)
○ Short term: Make election timeout configurable at cluster level (WIP)
○ Longer term: Separate thread for Raft RPC (WIP)
■ Still need to configure timeout for snapshot scenarios
OVSDB Clustered Mode - Client Reconnect
● Problem: during leader failover, all clients of new leader will reconnect
○ DB state changes to “disconnected” when there is no leader (temporarily)
○ Client tries to reconnect to a new node
● Solution (in master -> v2.12)
○ Don’t change state to “disconnected” if
■ Current node is a candidate, and
■ Election has not timed out yet
Scale Test for Clustered Mode
● Setup
○ 3-node cluster, 1k HVs
○ Election timeout: 10s (hardcoded in the test)
● Test
○ Keep creating and binding ports up to 10k
○ Periodically kill->wait(10s)->start each ovsdb-server randomly
● Test passed at scale!
○ All port creation and binding completed correctly.
○ Fast-resync helped!
Further Improvement: SB-DB Scale-out Replicas (TODO)
● How to support more HVs - 2k? 5k? 10k?
○ More nodes in cluster? Doesn’t scale.
○ Multi-threading OVSDB? Would help, but...
● Precondition: no write to SB from HV
○ Chassis/Encap/Port-binding update by CMS/northd only
○ Does not use dynamic ARP (mac-binding)
● How
○ Use replication mode of OVSDB to create N read-only replicas
○ HV connections sharding on read-only replicas
○ HV can failover to other replicas
[Figure: CMS -> NB ovsdb -> Northd -> SB ovsdb -> SB Replica 1 … SB Replica n; each replica serves its own group of HVs.]
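One possible way to shard HV connections across replicas with failover is sketched below. The slides do not say how sharding would be done; rendezvous-style hashing is an assumption chosen for this example, and the function names and replica names are hypothetical. Each HV derives a deterministic preference order over the replicas and connects to the first one that is reachable.

```python
import hashlib


def replica_order(hv_name, replicas):
    """Deterministic per-HV preference order over the replicas."""
    def score(replica):
        return hashlib.sha256(f"{hv_name}|{replica}".encode()).hexdigest()
    return sorted(replicas, key=score)


def pick_replica(hv_name, replicas, is_up):
    """First reachable replica in this HV's preference order, or None."""
    for replica in replica_order(hv_name, replicas):
        if is_up(replica):
            return replica
    return None


replicas = ["sb-replica-1", "sb-replica-2", "sb-replica-3"]
hvs = [f"hv{i}" for i in range(300)]

# Normal operation: every replica is reachable.
assignment = {hv: pick_replica(hv, replicas, lambda r: True) for hv in hvs}

# One replica fails: only the HVs that were using it move elsewhere.
failed = "sb-replica-2"
after_failure = {hv: pick_replica(hv, replicas, lambda r: r != failed)
                 for hv in hvs}
moved = [hv for hv in hvs if assignment[hv] != after_failure[hv]]
```

A useful property of this scheme is that a replica failure only moves the HVs that were connected to the failed replica, so the surviving replicas see a bounded reconnect burst.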
OVN-Northd Scaling Challenges
● Factors
○ Size of data
○ Rate of changes
● Problems
○ Recompute
OVN-Northd Incremental Processing (WIP from community)
● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
○ Inputs - NB DB tables (logical routers, switch, port, etc.)
○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
○ Rules to convert inputs to outputs
● Differential Datalog
○ An open-source datalog language for incremental data-flow processing
○ Defining inputs and outputs as relations
○ Defining rules to generate outputs from inputs
● Efforts can be reused by OVN-Controller
○ OVSDB - DDlog wrappers
○ Process framework changes
Recap Scaling Bottlenecks
● OVN-Northd
● OVN-SB DB
● OVN-Controller
Some More Scaling Problems
● Security Group / Network policy using ACLs
● Nested workloads (K8S containers)
ACLs
● Used by Security Group (OpenStack) / Network Policy (K8S)
● Typical use case: members of the same group are allowed to access each other
● Naked (member IPs listed inline in every ACL) => O(N^2)
outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
...
outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
● Using Address Set => O(N)
outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
...
outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
● In both cases, #Flows in OVS is always O(M*N) (M = number of ports on the HV)
Solution - Port Group (Released in v2.10)
● All-in-one
● Greatly simplified CMS Implementation
○ networking-ovn
○ ovn-kubernetes
● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group
○ E.g.
■ N members in a port-group, all M ports on HV1 belong to this group
■ Number of OVS flows on HV1 will be M + N, instead of M * N
outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
○ @port_group1: CMS creates port-group instead of address-set
○ $port_group1_ip4: OVN-Northd generates address-set for you
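The M * N versus M + N difference can be made concrete by enumerating the flows. This is a counting sketch only: the flow strings are illustrative and are not real OpenFlow or ovs-ofctl syntax, and the function names are hypothetical. The count follows the slide; a real conjunctive match also needs one extra flow that matches the conjunction ID, which is a constant and is left out here.

```python
from itertools import product


def expand_per_port(local_ports, member_ips):
    """One flow per (outport, source IP) pair: M * N flows."""
    return [f"outport={p},ip_src={ip}"
            for p, ip in product(local_ports, member_ips)]


def expand_conjunction(local_ports, member_ips, conj_id=1):
    """Two-dimensional conjunctive match: M + N flows."""
    dim1 = [f"outport={p} -> conjunction({conj_id}, 1/2)" for p in local_ports]
    dim2 = [f"ip_src={ip} -> conjunction({conj_id}, 2/2)" for ip in member_ips]
    return dim1 + dim2


M, N = 10, 1000
local_ports = [f"port{i}" for i in range(M)]       # ports on this HV
member_ips = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(N)]  # group members

per_port_flows = expand_per_port(local_ports, member_ips)
conjunction_flows = expand_conjunction(local_ports, member_ips)
```

For 10 local ports in a group of 1000 members, this gives 10,000 flows in the per-port form and 1,010 in the conjunctive form.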
Further Improvement - Group-ID in Packet (TODO)
● Problem - still too many OVS flows
○ Best case: M + N, if all M ports on the HV belong to the same group.
○ Worst case: M * N, if ports are distributed randomly.
■ M ports on HV, each belongs to a different group, each group has N members
● Solution (just an idea)
○ Encode the port-group in tunnel metadata
■ Only M flows in all cases
■ Best part: no local flow change needed for remote member changes
○ Challenge: what if a port belongs to multiple groups
■ Limit the number of groups for a single port
■ Fall back to the old way if the limit is exceeded
○ Limitation: works for ingress (to-lport) rules only
outport == @port_group1 && src_group_id == <group1 id>
○ src_group_id: from tunnel metadata
Scaling Nested Workloads
● Use Case
○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn)
○ Run Kubernetes on top of the VMs
● Problem
○ How to connect the pods at scale?
ARP Proxy
● OVN doesn’t support MAC-learning (MAC-Port binding learning), but IP-MAC binding can be learned through ARP
● How
○ LR sends ARP requests for Pod IPs
○ ARP proxy in the VM replies with the VM’s MAC for all Pod IPs on the VM
● Works, but
○ Requires VM and Pods on the same subnet
○ Unreliable when SB DB connection fails
○ Scale: O(N), N = number of pods, usually much bigger than the number of VMs
■ Note: IP-MAC Binding incremental processing change handler is implemented - no re-compute.
[Figure: an HV running OVS and OVN Controller hosts a VM at 10.0.0.2 (aa:bb:cc:dd:ee:ff) with an ARP Proxy and Pods 10.0.0.102 to 10.0.0.105; OVN Controller is connected to the SB IP-MAC Binding Table.]
LR ARP Cache (dynamic):
10.0.0.102 => aa:bb:cc:dd:ee:ff
10.0.0.103 => aa:bb:cc:dd:ee:ff
10.0.0.104 => aa:bb:cc:dd:ee:ff
...
LR Static Route
● Assign Pod subnet(s) per VM (minion)
● How
○ Configure static routes in OVN LR for pod subnets: next hop = VM IP
● Considerations
○ De-couples VM and Pod subnets
○ Declarative, more reliable than ARP
○ May waste more IPs, but size of subnet is flexible
○ Scale: O(S), S = number of pod subnets
■ Worst case O(N), N = number of pods, if subnet size is /32.
[Figure: an HV running OVS hosts a VM at 172.0.0.2/24 with Pods 10.0.0.2/25 to 10.0.0.5/25.]
LR Routing Table (static):
10.0.0.0/25 => 172.0.0.2
10.0.0.128/25 => 172.0.1.100
10.0.0.1/25 => 172.0.1.3
...
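How a logical router would resolve a pod IP to a VM next hop with such routes can be sketched with the standard `ipaddress` module. This is a plain longest-prefix-match illustration, not OVN code: the `next_hop` function is hypothetical, and only the first two routes of the routing table above are used.

```python
import ipaddress

# Pod subnet -> next hop (VM IP), taken from the first two routes above.
routes = {
    ipaddress.ip_network("10.0.0.0/25"): ipaddress.ip_address("172.0.0.2"),
    ipaddress.ip_network("10.0.0.128/25"): ipaddress.ip_address("172.0.1.100"),
}


def next_hop(dst, table):
    """Longest-prefix match: return the VM IP that hosts the pod, or None."""
    dst = ipaddress.ip_address(dst)
    matches = [net for net in table if dst in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]


hop_a = next_hop("10.0.0.5", routes)      # pod in 10.0.0.0/25
hop_b = next_hop("10.0.0.200", routes)    # pod in 10.0.0.128/25
hop_c = next_hop("192.168.1.1", routes)   # no matching pod subnet
```

The table size grows with the number of pod subnets rather than the number of pods, which is the O(S) scaling noted above.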
References
● OVS/OVN
○ http://www.openvswitch.org/
● Networking-OVN
○ https://docs.openstack.org/networking-ovn/latest/
● OVN-Kubernetes
○ https://github.com/openvswitch/ovn-kubernetes/
● OVN-Scale-Test
○ https://github.com/openvswitch/ovn-scale-test
● GO-OVN library
○ https://github.com/eBay/go-ovn

Más contenido relacionado

La actualidad más candente

[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN
[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN
[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN
OpenStack Korea Community
 

La actualidad más candente (20)

Openstack Neutron, interconnections with BGP/MPLS VPNs
Openstack Neutron, interconnections with BGP/MPLS VPNsOpenstack Neutron, interconnections with BGP/MPLS VPNs
Openstack Neutron, interconnections with BGP/MPLS VPNs
 
20150511 jun lee_openstack neutron 분석 (최종)
20150511 jun lee_openstack neutron 분석 (최종)20150511 jun lee_openstack neutron 분석 (최종)
20150511 jun lee_openstack neutron 분석 (최종)
 
Kolla talk at OpenStack Summit 2017 in Sydney
Kolla talk at OpenStack Summit 2017 in SydneyKolla talk at OpenStack Summit 2017 in Sydney
Kolla talk at OpenStack Summit 2017 in Sydney
 
OpenStack Networking
OpenStack NetworkingOpenStack Networking
OpenStack Networking
 
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
 
The Basic Introduction of Open vSwitch
The Basic Introduction of Open vSwitchThe Basic Introduction of Open vSwitch
The Basic Introduction of Open vSwitch
 
OpenStack networking (Neutron)
OpenStack networking (Neutron) OpenStack networking (Neutron)
OpenStack networking (Neutron)
 
macvlan and ipvlan
macvlan and ipvlanmacvlan and ipvlan
macvlan and ipvlan
 
[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN
[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN
[OpenStack 하반기 스터디] Interoperability with ML2: LinuxBridge, OVS and SDN
 
Open vSwitch 패킷 처리 구조
Open vSwitch 패킷 처리 구조Open vSwitch 패킷 처리 구조
Open vSwitch 패킷 처리 구조
 
Packet flow on openstack
Packet flow on openstackPacket flow on openstack
Packet flow on openstack
 
OpenStack Neutron behind the Scenes
OpenStack Neutron behind the ScenesOpenStack Neutron behind the Scenes
OpenStack Neutron behind the Scenes
 
2014 OpenStack Summit - Neutron OVS to LinuxBridge Migration
2014 OpenStack Summit - Neutron OVS to LinuxBridge Migration2014 OpenStack Summit - Neutron OVS to LinuxBridge Migration
2014 OpenStack Summit - Neutron OVS to LinuxBridge Migration
 
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack UpPushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
 
VXLAN and FRRouting
VXLAN and FRRoutingVXLAN and FRRouting
VXLAN and FRRouting
 
OpenStack Quantum Intro (OS Meetup 3-26-12)
OpenStack Quantum Intro (OS Meetup 3-26-12)OpenStack Quantum Intro (OS Meetup 3-26-12)
OpenStack Quantum Intro (OS Meetup 3-26-12)
 
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
 
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
 
Understanding Open vSwitch
Understanding Open vSwitch Understanding Open vSwitch
Understanding Open vSwitch
 
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
 

Similar a Large scale overlay networks with ovn: problems and solutions

Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
Belmiro Moreira
 
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red_Hat_Storage
 

Similar a Large scale overlay networks with ovn: problems and solutions (20)

OVN Controller Incremental Processing
OVN Controller Incremental ProcessingOVN Controller Incremental Processing
OVN Controller Incremental Processing
 
Baker: Scaling OVN with Kubernetes API Server
Baker: Scaling OVN with Kubernetes API ServerBaker: Scaling OVN with Kubernetes API Server
Baker: Scaling OVN with Kubernetes API Server
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
OpenNebulaConf2018 - Is Hyperconverged Infrastructure what you need? - Boyan ...
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
 
99.999% Available OpenStack Cloud - A Builder's Guide
99.999% Available OpenStack Cloud - A Builder's Guide99.999% Available OpenStack Cloud - A Builder's Guide
99.999% Available OpenStack Cloud - A Builder's Guide
 
haproxy_Load_Balancer.pptx
haproxy_Load_Balancer.pptxhaproxy_Load_Balancer.pptx
haproxy_Load_Balancer.pptx
 
haproxy_Load_Balancer.pdf
haproxy_Load_Balancer.pdfhaproxy_Load_Balancer.pdf
haproxy_Load_Balancer.pdf
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
 
Speed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with AlluxioSpeed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with Alluxio
 
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBasehbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase
 
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage TieringHadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
M|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change MethodsM|18 Battle of the Online Schema Change Methods
M|18 Battle of the Online Schema Change Methods
 
What's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon ValleyWhat's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon Valley
 
What's new in Neutron Juno
What's new in Neutron JunoWhat's new in Neutron Juno
What's new in Neutron Juno
 
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
 

Último

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 

Último (20)

UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 

Large scale overlay networks with ovn: problems and solutions

  • 1. Large Scale Overlay Networks with OVN: Problems and Solutions Han Zhou (hzhou8@ebay.com) Open Infrastructure Summit - Denver, 2019
  • 2. Agenda ● Background ● Control-plane components scaling ○ OVN-Controller ○ South-bound DB ○ OVN-Northd ● Scaling ACL ● Scaling nested workloads (containers on VMs)
  • 3. Background of OVN ● SDN solution developed by OVS (Open vSwitch) community ● OpenStack support - neutron ML2 plugin: networking-ovn ● Kubernetes support - CNI plugin: ovn-kubernetes ● Main Features ● Full L2/L3 virtualization with overlay networks (Geneve, STT, VxLAN) ● L2 gateway, L3 gateway (centralized/distributed) & NAT with HA ● L4 ACLs (stateful FW) with address-set, port-group and packet logging ● Distributed Load-Balancer ● L2/L3 Port-security ● ARP responder, static/dynamic ARP ● Flat/Vlan physical networks ● Native DHCP, Metadata ● Parent-child ports for nested workloads ● QoS ● IPSec ● Policy-based routing ● ...
  • 4. ● Logical/physical separation ● Distributed local controllers ● Database Approach (ovsdb) Northd North-bound ovsdb South-bound ovsdb Central North-bound ovsdb Northd South-bound ovsdb Distributed Control Plane OVN-Controller OVS HV HV … OVSDB protocol (RFC7047) HV GW CMS (OpenStack/K8S) Virtual Network Abstractions Logical Flows OpenFlows
  • 5. Northd North-bound ovsdb South-bound ovsdb Central North-bound ovsdb Northd South-bound ovsdb OVN-Controller Scaling Challenges OVN-Controller OVS HV HV … OVSDB protocol (RFC7047) HV GW CMS (OpenStack/K8S) Virtual Network Abstractions Logical Flows OpenFlows ● Factors ○ Size of data ○ Rate of changes ● Challenges ○ Big size of data to be processed ■ E.g. 10k logical ports generates >40k logical flows and 10k port-bindings ○ Logical flow parsing is CPU intensive ○ Cloud workload changes frequently ○ Lots of inputs for flow computation
  • 6. Dependency Graph of OVN-Controller [Diagram] ● SB OVSDB inputs: logical_flow, chassis, encap, mc_group, dp_binding, port_binding, mac_binding, dhcp, dhcpv6, dns, gw_chassis, addr_set, port_group ● Local OVSDB inputs: OVS open_vswitch, OVS bridge, OVS port, OVS qos ● Other nodes: Address Sets (converted), Port Groups (converted), MFF OVN Geneve ● Runtime Data: Local_datapath, Local_lports, Local_lport_ids, Active_tunnels, Ct_zone_bitmap, Pending_ct_zones, Ct_zones ● Flow Output: Desired_flow_table, Group_table, Meter_table, Conj_id_ofs
  • 7. Original Approach - Recomputing ● Compute OVS flows by reprocessing all inputs when ○ Any input changes ○ Or even when there is no change at all (but just unrelated events) ● Benefit ○ Relatively easy to implement and maintain ● Problems ○ 100% CPU of ovn-controller process on all compute nodes ○ High control plane latency
  • 8. Solution - Incremental Processing Engine ● DAG representing dependencies ● Each node contains ○ Data ○ Links to input nodes ○ Change-handler for each input ○ Full recompute handler ● Engine ○ Traverses the DAG in DFS post-order from the final output node ○ Invokes change-handlers for inputs that changed ○ Falls back to recompute if, for ANY of its inputs: ■ Change-handler is not implemented for that input, or ■ Change-handler cannot handle the particular change (returns false) [Diagram: example DAG with input, intermediate, and output nodes]
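The engine described on slide 8 can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Node` class, the `run` function, and the two demo handlers are hypothetical names invented for illustration, and the real ovn-controller engine is written in C with far more inputs. The sketch shows the DFS post-order walk, the per-input change handlers, and the fallback to full recompute when a handler is missing or returns False.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One node of the dependency DAG (hypothetical structure)."""
    name: str
    inputs: List["Node"] = field(default_factory=list)
    # Full recompute handler: rebuilds node.data from all inputs.
    recompute: Optional[Callable[["Node"], None]] = None
    # Change handlers keyed by input name; a handler returns False when
    # it cannot process the particular change incrementally.
    handlers: Dict[str, Callable[["Node", "Node"], bool]] = field(default_factory=dict)
    data: object = None
    changed: bool = False  # for input nodes, set by whoever feeds the data


def run(node, visited=None):
    """DFS post-order: process every input first, then the node itself."""
    visited = set() if visited is None else visited
    if id(node) in visited:
        return
    visited.add(id(node))
    for inp in node.inputs:
        run(inp, visited)
    if not node.inputs:
        return  # pure input node, nothing to compute
    node.changed = False
    for inp in node.inputs:
        if not inp.changed:
            continue
        node.changed = True
        handler = node.handlers.get(inp.name)
        if handler is None or not handler(node, inp):
            node.recompute(node)  # fall back to full recompute
            break


calls = []  # records which path was taken, for the demo only


def recompute_flows(node):
    calls.append("recompute")
    ports, lflows = (i.data for i in node.inputs)
    node.data = {(p, f) for p in ports for f in lflows}


def on_port_change(node, ports):
    known = {p for p, _ in node.data}
    if known - ports.data:
        return False  # port removal is not handled incrementally here
    calls.append("incremental")
    lflows = node.inputs[1].data
    node.data |= {(p, f) for p in ports.data - known for f in lflows}
    return True


ports = Node("port_binding", data={"p1"}, changed=True)
lflows = Node("logical_flow", data={"f1", "f2"}, changed=True)
out = Node("flow_output", [ports, lflows], recompute_flows,
           {"port_binding": on_port_change}, data=set())

run(out)                  # logical_flow has no handler: full recompute
lflows.changed = False
ports.data.add("p2")
run(out)                  # port added: handled incrementally
ports.data.discard("p1")
run(out)                  # port removed: handler returns False, recompute
```

In the demo, only the second run avoids a recompute, which mirrors the point of the slide: the benefit depends on how many inputs have a change handler that can cope with the changes that actually occur.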
  • 9. Change Handler Implemented [Diagram: the OVN-Controller dependency graph from slide 6, with a legend for SB OVSDB inputs, local OVSDB inputs, and inputs with a change handler implemented]
  • 10. CPU Efficiency Improvement ● Create and bind 10k ports on 1k HVs ○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz) ○ 10k ports all under the same logical router ○ Batch size 100 lports ○ Bind ports one by one within each batch ○ Wait for all ports to be up before the next batch
  • 11. Latency Improvement ● End-to-end latency on top of 10k existing logical ports ○ Create one more logical port and bind the port on HV ○ Wait until northd generates lflows and creates port-binding in SB ○ Wait until ovn-controller claims the port on HV ○ Wait until northd generates all lflows ○ Wait until OVS flows are programmed on all HVs
  • 12. Tests at Larger Scale ● Next bottlenecks: ○ OVS flow installation ○ Port-binding handling when the binding happens locally
  • 13. What’s next for Incremental-Processing (WIP) ● Incremental flow installation ○ Low hanging fruit - with the help of incremental flow computing ● Implement more change handlers as needed ○ E.g. support incremental processing when port-binding happens locally - further improve end-to-end latency ● New implementation: Differential Datalog (DDlog) ○ Data-flow approach ○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling) ● Upstream? ○ Not in upstream, because DDlog is the preferred long term solution ○ For those who need this: ■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc ■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11 ■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
  • 14. OVN-Controller Other Improvements (WIP) ● Reduce data size per-HV ○ Problem: External Provider Network connects everything ○ Solution: Don’t cross external network boundary when calculating connected datapaths ● On-demand tunnel port creation ○ Problem: Too many OVS ports when there are a lot of HVs ○ Solution: Create a tunnel to a remote host only if that host has ports logically connected to local ports.
  • 15. SB DB Scaling Challenges ● Factors ○ Number of clients (HVs & GWs) ○ Size of data ○ Rate of changes ● Problems ○ Probe handling ○ Data resync during restart/failover ○ Clustered-mode problems [Diagram: same OVN architecture as slide 4]
  • 16. SB DB Probe ● Default 5 sec probe interval causes connection flapping ○ Ovsdb-server response can occasionally exceed 5 sec ■ DB log compression ■ Large transaction handling ○ Reconnecting clients add more load to the server - cascading failure ■ Clients resync data from server (solved - see next slide) ● Solution ○ Increase probe interval ■ Client side (on HVs) ● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000 ■ Server side (DON’T FORGET!!) ● ovn-sbctl -- --id=@conn_uuid create Connection target="ptcp:6642:0.0.0.0" inactivity_probe=0 -- set SB_Global . connections=@conn_uuid ○ Rely on external monitoring for HV connectivity
  • 17. Data re-sync during DB reconnect ● Problem ○ OVSDB client caching => NOT a problem ○ Server restart/failover: re-sync data for all clients. => This is the problem! ● Solution - OVSDB fast re-sync (in master -> v2.12) ○ Track and maintain recent transaction history on disk and in memory. ○ New method monitor_cond_since in OVSDB protocol, to request only the changes made since the last transaction seen before the connection was lost. ○ Note: currently works for clustered mode only. ● Test Result - 1k HVs, 10k ports ○ Before: SB DB 100% CPU, >30 min to recover. ○ After: No CPU spike, all connections restored in <1 min (probe interval).
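The idea behind fast re-sync on slide 17 can be illustrated with a toy model. The sketch below is not the OVSDB wire protocol: the `Server` class, the integer transaction ids, and the `monitor_since` method are assumptions made for illustration (real OVSDB identifies transactions differently and also tracks deletions). It shows a server that keeps a bounded history of recent transactions and answers a reconnecting client with either a small delta or, if the client is too far behind, a full dump.

```python
from collections import deque


class Server:
    """Toy key/value database with a bounded transaction history."""

    def __init__(self, history_limit=100):
        self.rows = {}
        self.history = deque(maxlen=history_limit)  # (txn_id, key, value)
        self.last_txn = 0

    def commit(self, key, value):
        self.last_txn += 1
        self.rows[key] = value
        self.history.append((self.last_txn, key, value))

    def monitor_since(self, last_seen):
        """Return (found, last_txn, updates) for a reconnecting client."""
        # Oldest transaction still kept; if nothing is kept, only a client
        # that is already fully up to date can be served from history.
        oldest = self.history[0][0] if self.history else self.last_txn + 1
        if last_seen is not None and oldest - 1 <= last_seen <= self.last_txn:
            updates = {k: v for t, k, v in self.history if t > last_seen}
            return True, self.last_txn, updates
        # History does not reach back far enough: send everything.
        return False, self.last_txn, dict(self.rows)


server = Server(history_limit=3)
for i in range(1, 6):
    server.commit(f"port{i}", f"hv{i}")

recent_client = server.monitor_since(3)  # only missed transactions 4 and 5
stale_client = server.monitor_since(1)   # too old: receives a full dump
```

The history limit is the design knob: a longer history lets more clients resume with a delta after a restart or failover, at the cost of memory and disk on the server.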
  • 18. OVSDB Clustered Mode ● Raft based clustering (experimental support since v2.9) ● Problems at scale ○ High CPU load (solved in master) ○ Follower update latency (solved in master) ○ Leader flapping (WIP, workaround ready) ○ Client reconnect (solved in master)
  • 19. OVSDB Clustered Mode - High CPU ● OVSDB Raft Implementation ○ Preprocessing on followers before sending to leader - shares some of the leader’s load ○ Send preprocessed transaction to leader together with a prerequisite version ID ● Problem ○ Many prerequisite check failures and retries at large scale ■ Different HVs update chassis/port_binding at the same time through different follower nodes ○ Continuous retry causes 100% CPU ● Solution (in master -> v2.12) ○ Retry only when the follower has applied the largest local Raft log index ■ Otherwise, the prerequisite is already out-of-date, so don’t waste CPU
  • 20. OVSDB Clustered Mode - Follower Latency ● Original behavior: leader sends Raft log update to follower nodes when: ○ A new change is proposed, or ○ A heartbeat is sent ● Problem ○ Updates sent through a follower node suffer high latency ● Solution (in master -> v2.12) ○ Send log to followers as soon as a new entry is committed ● Test result: 100 updates through same follower from same client ○ Before: >30 sec ○ After: 500 ms
  • 21. OVSDB Clustered Mode - Leader Flapping ● Problem: heartbeat timeout, triggering re-election ○ Large transaction execution ○ Raft log compression (snapshot) ● Solution ○ Quick and dirty: Increase election timeout (hardcoded) ○ Short term: Make election timeout configurable at cluster level (WIP) ○ Longer term: Separate thread for Raft RPC (WIP) ■ Still need to configure timeout for snapshot scenarios
  • 22. OVSDB Clustered Mode - Client Reconnect ● Problem: during leader failover, all clients of new leader will reconnect ○ DB state changes to “disconnected” when there is no leader (temporarily) ○ Client tries to reconnect to a new node ● Solution (in master -> v2.12) ○ Don’t change state to “disconnected” if ■ Current node is candidate, and ■ Election has not timed out yet
  • 23. Scale Test for Clustered Mode ● Setup ○ 3-node cluster, 1k HVs ○ Election timeout: 10s (hardcoded in the test) ● Test ○ Keep creating and binding ports up to 10k ○ Periodically kill->wait(10s)->start each ovsdb-server randomly ● Test passed at scale! ○ All port creation and binding completed correctly. ○ Fast-resync helped!
  • 24. Further Improvement: SB-DB Scale-out Replicas (TODO) ● How to support more HVs - 2k? 5k? 10k? ○ More nodes in cluster? Doesn’t scale. ○ Multi-threading OVSDB? Would help, but... ● Precondition: no write to SB from HV ○ Chassis/Encap/Port-binding update by CMS/northd only ○ Does not use dynamic ARP (mac-binding) ● How ○ Use replication mode of OVSDB to create N read-only replicas ○ Shard HV connections across the read-only replicas ○ HVs can fail over to other replicas [Diagram: CMS -> NB ovsdb -> Northd -> SB ovsdb -> SB Replica 1 … SB Replica n, each replica serving a group of HVs]
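Slide 24 leaves open how HV connections would be sharded across the read-only replicas. One possible way, shown here purely as an illustration and not as anything proposed in the talk, is rendezvous (highest-random-weight) hashing: each HV picks the replica with the highest hash score, and when a replica goes down only the HVs that were connected to it move elsewhere. The function name `pick_replica` and the replica and HV names are hypothetical.

```python
import hashlib


def pick_replica(hv_name, replicas, down=frozenset()):
    """Pick an SB replica for one HV using rendezvous hashing."""
    alive = [r for r in replicas if r not in down]
    if not alive:
        raise RuntimeError("no SB replica available")

    def weight(replica):
        # Deterministic score for the (HV, replica) pair.
        return hashlib.sha256(f"{hv_name}|{replica}".encode()).hexdigest()

    return max(alive, key=weight)


replicas = ["sb-replica-1", "sb-replica-2", "sb-replica-3"]
hvs = [f"hv{i}" for i in range(1000)]

before = {hv: pick_replica(hv, replicas) for hv in hvs}
after = {hv: pick_replica(hv, replicas, down={"sb-replica-2"}) for hv in hvs}

# Only HVs that were on the failed replica should have moved.
moved = [hv for hv in hvs if before[hv] != after[hv]]
```

Because every HV computes its choice locally from the replica list, no central assignment table is needed, which fits the slide's precondition that HVs only read from the SB DB.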
  • 25. OVN-Northd Scaling Challenges ● Factors ○ Size of data ○ Rate of changes ● Problems ○ Recompute [Diagram: same OVN architecture as slide 4]
  • 26. OVN-Northd Incremental Processing (WIP from community) ● OVN-Northd is a perfect target user of Differential Datalog (DDlog) ○ Inputs - NB DB tables (logical routers, switch, port, etc.) ○ Outputs - SB DB tables (logical flows, port-bindings, etc.) ○ Rules to convert inputs to outputs ● Differential Datalog ○ An open-source datalog language for incremental data-flow processing ○ Defining inputs and outputs as relations ○ Defining rules to generate outputs from inputs ● Efforts can be reused by OVN-Controller ○ OVSDB - DDlog wrappers ○ Process framework changes
  • 27. Recap: Scaling Bottlenecks ● OVN-Northd ● OVN-SB DB ● OVN-Controller [Diagram: same OVN architecture as slide 4]
  • 28. Some More Scaling Problems ● Security Group / Network policy using ACLs ● Nested workloads (K8S containers)
  • 29. ACLs ● Used by Security Group (OpenStack) / Network Policy (K8S) ● Typical use case: members of same group are allowed to access each other ● Naked (no address set) => O(N^2): each of the N ports has an ACL listing all N member IPs ○ outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN} ○ outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN} ○ ... ○ outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN} ● Using Address Set => O(N): each ACL references one shared set ○ outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1 ○ outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1 ○ ... ○ outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1 ● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
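The O(N^2) versus O(N) comparison on slide 29 can be made concrete by generating the two styles of ACL match strings and counting how many IP terms have to be stored. This is a minimal sketch: the helper names, the group size of 100, and the generated port names and addresses are made up for illustration, while the match syntax follows the examples on the slide.

```python
def naive_acls(ports, ips):
    """One ACL per port, each embedding the full list of member IPs."""
    ip_set = "{" + ", ".join(ips) + "}"
    return [f"outport == {p} && ip4 && ip4.src == {ip_set}" for p in ports]


def address_set_acls(ports, set_name):
    """One ACL per port, each referencing a shared address set by name."""
    return [f"outport == {p} && ip4 && ip4.src == ${set_name}" for p in ports]


n = 100  # group size, chosen arbitrarily for the demo
ports = [f"<port{i}_uuid>" for i in range(1, n + 1)]
ips = [f"10.0.0.{i}" for i in range(1, n + 1)]

naive = naive_acls(ports, ips)
with_set = address_set_acls(ports, "as_ip4_sg1")

# IP terms stored in the ACLs themselves: N per ACL, N ACLs.
naive_ip_terms = sum(acl.count(",") + 1 for acl in naive)
# With an address set the IPs are stored once, in the set.
set_ip_terms = len(ips)
```

Adding one member to the group touches every one of the N ACLs in the first style, but only the address set in the second, which is where the saving in NB data and update volume comes from.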
  • 30. Solution - Port Group (Released in v2.10) ● All-in-one: outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4 ○ CMS creates port-group instead of address-set ○ OVN-Northd generates address-set for you ● Greatly simplified CMS implementation ○ networking-ovn ○ ovn-kubernetes ● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group ○ E.g. ■ N members in a port-group, all M ports on HV1 belong to this group ■ Number of OVS flows on HV1 will be M + N, instead of M * N
  • 31. Further Improvement - Group-ID in Packet (TODO) ● Problem - still too many OVS flows ○ Best case: M + N, if all M ports on HV belong to the same group. ○ Worst case: M * N, if ports are distributed randomly. ■ M ports on HV, each belongs to a different group, each group has N members ● Solution (just an idea) ○ Encode port-group in tunnel metadata: outport == @port_group1 && src_group_id == <group1 id> (src_group_id is read from tunnel metadata) ■ Only M flows in all cases ■ Best part: no local flow change needed for remote member changes ○ Challenge: what if a port belongs to multiple groups ■ Limit the number of groups for a single port ■ Fall back to old way if exceeds ○ Limitation: works for ingress (to-lport) rules only
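The flow counts quoted on slides 30 and 31 can be reproduced with a small calculation. The sketch below is a simplified counting model, not how OVN actually emits flows: it assumes each local port belongs to exactly one group, and the function names and the example numbers (10 local ports, 1000 members per group) are invented for illustration.

```python
from collections import Counter


def flows_cross_product(local_ports, group_size):
    """Without conjunction: one flow per (local port, member IP) pair."""
    return sum(group_size[g] for g in local_ports.values())


def flows_conjunction(local_ports, group_size):
    """With conjunction: per group on this HV, m local ports + N members."""
    per_group = Counter(local_ports.values())
    return sum(m + group_size[g] for g, m in per_group.items())


def flows_group_id(local_ports):
    """Group ID carried in tunnel metadata: one flow per local port."""
    return len(local_ports)


M, N = 10, 1000

# Best case: all M local ports belong to the same group.
best = {f"p{i}": "sg1" for i in range(M)}
best_sizes = {"sg1": N}

# Worst case: every local port belongs to a different group of N members.
worst = {f"p{i}": f"sg{i}" for i in range(M)}
worst_sizes = {f"sg{i}": N for i in range(M)}

best_counts = (flows_cross_product(best, best_sizes),
               flows_conjunction(best, best_sizes),
               flows_group_id(best))
worst_counts = (flows_cross_product(worst, worst_sizes),
                flows_conjunction(worst, worst_sizes),
                flows_group_id(worst))
```

In this model the worst case with conjunction comes out as M * (N + 1), which is the M * N order of magnitude given on the slide, while the group-ID idea stays at M flows regardless of how ports are spread over groups.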
  • 32. Scaling Nested Workloads ● Use Case ○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn) ○ Run Kubernetes on top of the VMs ● Problem ○ How to connect the pods at scale?
  • 33. ARP Proxy ● OVN doesn’t support MAC-learning (MAC-Port binding learning), but IP-MAC binding can be learned through ARP ● How ○ LR sends ARP requests for Pod IPs ○ ARP proxy in the VM replies with VM’s MAC for all Pod IPs on the VM ● Works, but ○ Requires VM and Pods on same subnet ○ Unreliable when SB DB connection fails ○ Scale: O(N), N = number of pods, usually much bigger than number of VMs ■ Note: IP-MAC Binding incremental processing change handler is implemented - no re-compute. [Diagram: HV running OVS and OVN Controller, hosting a VM at 10.0.0.2 (aa:bb:cc:dd:ee:ff) with an ARP proxy and Pods 10.0.0.102, 10.0.0.103, 10.0.0.104, 10.0.0.105; SB IP-MAC Binding Table; LR ARP Cache (dynamic): 10.0.0.102 => aa:bb:cc:dd:ee:ff, 10.0.0.103 => aa:bb:cc:dd:ee:ff, 10.0.0.104 => aa:bb:cc:dd:ee:ff, ...]
  • 34. LR Static Route ● Assign Pod subnet(s) per VM (minion) ● How ○ Configure static routes in OVN LR for pod subnets: next hop = VM IP ● Considerations ○ De-couples VM and Pod subnets ○ Declarative, more reliable than ARP ○ May waste more IPs, but size of subnet is flexible ○ Scale: O(S), S = number of pod subnets ■ Worst case O(N), N = number of pods, if subnet size is /32. [Diagram: HV running OVS, hosting a VM at 172.0.0.2/24 with Pods 10.0.0.2/25, 10.0.0.3/25, 10.0.0.4/25, 10.0.0.5/25; LR Routing Table (static): 10.0.0.0/25 => 172.0.0.2, 10.0.0.128/25 => 172.0.1.100, 10.0.0.1/25 => 172.0.1.3, ...]
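The static-route approach on slide 34 amounts to a longest-prefix-match lookup from Pod IP to VM IP. The sketch below uses Python's standard `ipaddress` module and the first two routes shown in the slide's routing table; the `next_hop` helper and the extra /32 route are hypothetical and only there to show how a more specific route takes precedence.

```python
import ipaddress

# Pod subnet -> VM (next hop), taken from the slide's routing table.
ROUTES = {
    "10.0.0.0/25": "172.0.0.2",
    "10.0.0.128/25": "172.0.1.100",
}


def next_hop(dst, routes=ROUTES):
    """Return the next hop of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst)
    best = None  # (network, hop) of the most specific match so far
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


# A hypothetical /32 route for one pod: the O(N) worst case in miniature.
per_pod_routes = dict(ROUTES)
per_pod_routes["10.0.0.5/32"] = "172.0.9.9"
```

With one /25 route per VM the table grows with the number of pod subnets; if every pod needs its own /32 entry, as in `per_pod_routes`, the table grows with the number of pods instead.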
  • 35. References ● OVS/OVN ○ http://www.openvswitch.org/ ● Networking-OVN ○ https://docs.openstack.org/networking-ovn/latest/ ● OVN-Kubernetes ○ https://github.com/openvswitch/ovn-kubernetes/ ● OVN-Scale-Test ○ https://github.com/openvswitch/ovn-scale-test ● GO-OVN library ○ https://github.com/eBay/go-ovn