A novel way of creating overlay networks for OpenNebula is presented here, using BGP Ethernet VPN (EVPN) with VXLAN data-plane encapsulation to provide scalable Layer 2 over IP networks.
7. Limitations and problems of the old DC network
Bandwidth limitation
STP
Slow convergence
IPv6 routing in CPU
Unsupported gear (too old)
8. Requirements for the new DC network
Open standards and protocols
Must work in an IPv6-only setup
No more STP (all links active? L3 only?)
Bandwidth on demand
All current customer setups must be supported
Must work with our current billing software
9. What if...
Every switch tracks all attached hosts / IP addresses
Switch creates a host route (/32 in the IPv4 world, /128 in the IPv6 world) for every directly-attached IP host
Host routes are redistributed into a routing protocol, allowing every other switch in the network to route towards any other host
Traffic to unknown destinations is dropped instead of forwarded out all ports
Possible ???
12. VXLAN in summary
RFC7348
24-bit VNI field
Minimum recommended L2 MTU = VM MTU + 50 bytes
Recommended L2 MTU > 1600 bytes (to allow for VLAN tags and IPv6)
Source UDP port = payload hash (inner Ethernet header)
Destination UDP port = 4789 (the Linux default is 8472)
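To unpack the 50 bytes, assuming an untagged IPv4 underlay: 14 (inner Ethernet) + 20 (outer IPv4) + 8 (UDP) + 8 (VXLAN) = 50, so a 1500-byte VM MTU needs at least 1550 bytes on the wire; an outer IPv6 header adds another 20 bytes, hence the >1600 recommendation.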
13. EVPN in summary
RFC7432
Multi-tenant control plane for L2/L3 VPNs
Uses a new BGP address family
Works with many data-plane encapsulations
Carries IP+MAC reachability information
MAC/IP advertisement (EVPN route type 2)
VTEP advertisements (EVPN route type 3)
IP prefix route (EVPN route type 5)
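As a sketch of what this looks like in practice, here is a minimal FRR configuration activating the EVPN address family on an iBGP session, using the overlay AS 65101 from slide 20 (the neighbor address is invented):

# activate the l2vpn evpn address family and advertise all local VNIs
vtysh <<'EOF'
configure terminal
router bgp 65101
 neighbor 192.0.2.1 remote-as 65101
 address-family l2vpn evpn
  neighbor 192.0.2.1 activate
  advertise-all-vni
 exit-address-family
EOF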
15. Underlay network
The underlay network's single purpose is to ensure reachability of the loopback interfaces, because these are used as VXLAN tunnel endpoints!
19. Underlay design
Only one address family needed in the underlay
Only p2p /31 links between spines and leafs
One AS for all spines and one unique AS per switch pair
eBGP to make loopbacks (VTEPs) reachable
BGP timers tweaked; no BFD needed
Very simple to set up and troubleshoot
MTU >9000
Fewer than 300 routes in BGP for the whole underlay setup, which makes convergence really speedy (Gonzales)
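A sketch of the matching leaf-side FRR configuration under these assumptions (all addresses and AS numbers here are invented: spine AS 65000, leaf AS 64512, leaf loopback 198.51.100.1):

# eBGP over the two /31 uplinks; only the local loopback (VTEP) is advertised
vtysh <<'EOF'
configure terminal
router bgp 64512
 neighbor 192.0.2.0 remote-as 65000
 neighbor 192.0.2.2 remote-as 65000
 address-family ipv4 unicast
  network 198.51.100.1/32
 exit-address-family
EOF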
20. Overlay design
Dual-stack address family
One overlay AS (65101) for all spines and leafs
Spines are BGP RRs for the overlay network
BGP timers tweaked; no BFD needed
Line failure in the core network has no impact on the overlay RIB, which makes convergence sub-second.
All overlay (production) traffic in non-default VRFs
VRF_ID * 10000 + VLAN_ID = VNI_ID
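Worked example of this formula: VRF_ID 1 and VLAN_ID 601 give VNI 1 * 10000 + 601 = 10601, which matches the L2VNI (vtep10601) created on the hypervisor in slide 29.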
21. Overlay design
MC-LAG as first hop redundancy
vARP (all active gateway)
No access to defaultVrf; No unwanted tunnel access
Loopback as source for ICMP replies
Filter advertised routes learned from spines
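One way to implement that last filter is an outbound AS-path route-map, sketched here with the invented AS numbers from the underlay example above (spine AS 65000, leaf AS 64512):

# deny routes whose AS path contains the spine AS when advertising back out
vtysh <<'EOF'
configure terminal
bgp as-path access-list FROM-SPINES permit _65000_
route-map TO-SPINES deny 10
 match as-path FROM-SPINES
route-map TO-SPINES permit 20
router bgp 64512
 neighbor 192.0.2.0 route-map TO-SPINES out
EOF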
24. EVPN + VXLAN on Hypervisor
Required:
Linux distro with kernel >= 4.5
FRRouting >= 5.1-dev, built with the Cumulus option
Recommended:
ifupdown2 >=1.0
25. Step 1: Underlay
1 or more /31 uplink(s)
1 loopback address in defaultVrf for VTEP endpoint
MgmtVrf for in-band management (netns)
BGP session(s) on uplink(s) to leaf switches
Make loopback reachable to all other loopbacks/vteps
MTU >1600
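A sketch of this step, reusing the loopback address that appears in the VTEP commands of slide 29; the uplink neighbor address and both AS numbers are invented:

# VTEP loopback in defaultVrf
ip addr add 213.136.24.130/32 dev lo
# eBGP session on the uplink, advertising the loopback towards the leaf
vtysh <<'EOF'
configure terminal
router bgp 64601
 neighbor 192.0.2.9 remote-as 64512
 address-family ipv4 unicast
  network 213.136.24.130/32
 exit-address-family
EOF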
26. Step 2: Overlay Data Plane
Create VRF (internetVrf)
Create at least two bridges (L2VNI + L3VNI) per VRF
Create one VTEP for each bridge, with the loopback IP address as local address
Attach VTEP interface to bridge
Attach VNET interface to bridge
Configure mac + ip address on L2VNI bridge
Filter ARP traffic on VTEP interface
Enable forwarding + sysctl tuning
27. Create VRF
ip link add internetVrf type vrf table 1000
ip link set internetVrf up
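The table number ties the VRF to a dedicated kernel routing table. A quick sanity check, assuming an iproute2 with VRF support:

ip vrf show
ip route show vrf internetVrf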
28. Create two bridges
brctl addbr br-vlan601
ip link set br-vlan601 master internetVrf
ip link set br-vlan601 up
brctl addbr br-vlan4003
ip link set br-vlan4003 master internetVrf
ip link set br-vlan4003 up
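brctl comes from the legacy bridge-utils package; on systems without it, the same bridges can be created with iproute2 alone:

ip link add br-vlan601 type bridge
ip link add br-vlan4003 type bridge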
29. Create VTEP for each bridge
ip link add vtep10601 type vxlan id 10601 proxy nolearning dstport 4789 local 213.136.24.130
ip link add vtep20003 type vxlan id 20003 proxy nolearning dstport 4789 local 213.136.24.130
30. Attach VTEP + VNET to bridge
ip link set vtep10601 mtu 9000
ip link set vtep10601 up
brctl addif br-vlan601 vtep10601
ip link set vtep20003 mtu 9000
ip link set vtep20003 up
brctl addif br-vlan4003 vtep20003
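Once the EVPN control plane is up, remote MACs learned from type-2 routes should appear as FDB entries on the VTEP devices; one way to verify:

bridge fdb show dev vtep10601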
31. Configure MAC + IP, drop ARP
ip addr add 213.136.24.161/28 dev br-vlan601
ip link set dev br-vlan601 address 02:62:69:74:67:77
ebtables -A OUTPUT -p arp -o vtep+ -j DROP
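This drop rule complements the proxy flag set on the VTEPs in slide 29: the kernel answers ARP locally from the neighbor entries that FRR installs from EVPN type-2 routes, so ARP requests never need to be flooded into the tunnels.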
33. OpenNebula support
cat /var/lib/one/remotes/etc/vnm/OpenNebulaNetwork.conf
...
# Multicast protocol for multi destination BUM traffic. Options:
# - multicast, for IP multicast
# - evpn, for BGP EVPN control plane
:vxlan_mode: evpn
# Tunnel endpoint communication type. Only for evpn vxlan_mode.
# - dev, tunnel endpoint communication is sent to PHYDEV
# - local_ip, first ip addr of PHYDEV is used as address for the communication
:vxlan_tep: local_ip
# Additional ip link options, uncomment the following to disable learning for EVPN mode
:ip_link_conf:
:nolearning:
:proxy:
:srcport: 49152 65535
:dstport: 4789
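Note how this maps onto the manual setup above: the :ip_link_conf: options add the same nolearning and proxy flags used in slide 29, and :srcport:/:dstport: correspond to the UDP port behaviour summarized in slide 12.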
39. Credits
Jeroen Louwes (BIT) ← as he did all the work
Sebastian Mangelkramer (convinced OpenNebula to integrate this in ONE, issue #2161)
Vincent Bernat (great blog posts about everything networking, including numerous examples of L3 routing to hypervisors)
40. Symmetric IRB vs Asymmetric IRB
Asymmetric IRB
Route on ingress switch
Bridge from ingress switch to destination MAC
Ingress switch needs MAC-IP entries for all destinations
Easier to troubleshoot
41. Symmetric IRB vs Asymmetric IRB
Symmetric IRB
Route on ingress switch
Intermediate segment across the network (L3VNI)
Route on egress switch
Requires extra intermediate VNI per VRF
Scalable
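A sketch of how the intermediate VNI is mapped to its VRF in FRR, reusing internetVrf and the L3VNI 20003 created in slides 27-30:

# map the L3VNI to the VRF so routed EVPN traffic uses it as transit segment
vtysh <<'EOF'
configure terminal
vrf internetVrf
 vni 20003
exit-vrf
EOF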
42. Future wishes and plans
Migrating the BGP RR role from spine switches to an external route reflector
Blocking unknown unicast
IPv6-only underlay
All hypervisors (and APs) are VTEPs (this talk)
Upgrade current core-ring from EAPS (Extreme) to VXLAN