2. I am:
Raymond Burkholder
In and Out of:
Software Development
Linux Administration
Network Management
System Monitoring
raymond@burkholder.net
ray@oneunified.net
https://blog.raymond.burkholder.net
4. Items To Talk About
● Virtualization
● Redundancy & Resiliency
● Networking
● Firewall
● Connectivity
● Open Source Tools:
– iproute2 – kernel tools for building sophisticated connections
– Open vSwitch – layer 2 switching and firewalling
– Free Range Routing – layer 2/3 route distribution with BGP, EVPN, anycast
– LXC – containers, lighter weight than Docker
– nftables – successor to iptables for ACLs with connection tracking
– SaltStack – living documentation, automation, orchestration
Overall Goals: a) total remote access, b) total re-creation of the solution via automation
5. Monitoring Replica – Cloud ‘nn’
nftables
dnsmasq
cache-ng
salt
check_mk
smtp
Free Range Routing
Open vSwitch
6. Console Serial Connections
[diagram: Cloud01 / Cloud02 / Cloud03 – dual console servers, dual PDUs, dual Mellanox switches, host and storage nodes, serial console cabling to both sides]
Dual Console Servers for Diagnostics - Side A & Side B
7. Ethernet Management
[diagram: Cloud01 / Cloud02 / Cloud03 – Console Server A and B, PDU A and B, Mellanox Switch A and B, host and storage nodes, Ethernet management cabling]
Ethernet Management Ports distributed across Cloud interfaces
[any Cloudxx can get to any other’s serial interface via one of two console servers]
8. Hand in Hand
● eBGP vs iBGP
– Multiple ASNs vs Single ASN (eBGP used in this installation)
● VxLAN vs LAN
– ~16 million segment IDs (24-bit VNI) vs ~4000 (12-bit VLAN ID)
– VXLAN, also called Virtual Extensible LAN, is designed to provide
layer 2 overlay networks on top of a layer 3 network by using
MAC-in-UDP (MAC Address-in-User Datagram Protocol) encapsulation.
In simple terms, VXLAN can offer the same services as VLAN
does, but with greater extensibility and flexibility.
● EVPN (Ethernet VPN) via MP-BGP (Multiprotocol BGP) is used
for auto-distribution of VxLAN MAC/IP bindings
Layer 2 is cocaine. It has never been right — and yet people keep packaging it in various ways and
selling its virtues and capabilities. -- @trumanboyes
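As a quick illustration of the larger ID space, a VXLAN segment carries a 24-bit VNI while an 802.1Q VLAN carries a 12-bit ID; a minimal sketch in the same style of commands used later in the deck (interface names here are examples only):
# illustrative only: a VXLAN interface with a 24-bit VNI (~16 million possible IDs)
ip link add vx1421 type vxlan id 1421 dstport 4789 local 10.20.1.1 nolearning
# compare: an 802.1Q sub-interface is limited to a 12-bit VLAN ID (1-4094)
ip link add link enp2s0f1 name enp2s0f1.421 type vlan id 421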
9. Light vs Heavy Virtualization
● LXC – (Linux Containers) is an operating-system-level
virtualization method for running multiple isolated Linux systems
(containers) on a control host using a single Linux kernel.
● KVM - (Kernel-based Virtual Machine) is a full virtualization
solution for Linux on x86 hardware containing virtualization
extensions ... that provides the core virtualization
infrastructure ... where one can run multiple virtual machines
running unmodified Linux or Windows images. Each virtual
machine has private virtualized hardware: a network card, disk,
graphics adapter, etc.
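For context only, a minimal LXC lifecycle on a host might look like the following (container name, template and release are illustrative, not the deck's actual build process):
# illustrative only: create, start and enter a lightweight container
lxc-create -n ntp01 -t download -- -d debian -r bookworm -a amd64
lxc-start -n ntp01
lxc-attach -n ntp01 -- ip addr show
lxc-ls --fancy    # list containers with state and addresses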
10. Virtualization Selection
● Since no customer applications are running on the
management cloud hosts, light virtualization in the form of LXC
containers is used
● Goal is to keep the base host install as plain and simple as
possible – all services and management functionality should be
segregated into individual containers
● Containers and their configurations can then be destroyed and
rebuilt at will as bugs and upgrades require (see the sketch below)
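A hedged sketch of that destroy-and-rebuild cycle, assuming a container named bind01 and a Salt state named lxc.bind01 (both names are hypothetical):
# illustrative only: throw the container away and let Salt rebuild it
lxc-stop -n bind01
lxc-destroy -n bind01
salt 'host01.ny1' state.apply lxc.bind01   # hypothetical state that recreates the container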
13. Resiliency
● Choices:
– Consul (dns for service resolution)
● Requires heartbeats and checks for each service type
– HAProxy (layer 4/7 load balancing – userland)
● Overkill for this type of service load
– IPVS (layer 4 kernel-based load balancing)
● Only local to the machine
– BGP AnyCast (routing based load distribution)
● Proven routing based resiliency
14. AnyCast
● Add Container Unique Loopback Address
● Add Service Common Loopback Address – advertised into BGP
by each common service container
● When container dies, common loopback address disappears.
● Loopback addresses are weighted in BGP so that local consumers
prefer the local service instance (sketch below)
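A minimal sketch of what this looks like inside one container, using addresses that appear in the routing tables later in the deck (the FRR statements are illustrative, not the production configuration):
# illustrative only: inside the ntp01 container
ip addr add 10.20.1.21/32 dev lo     # container-unique loopback
ip addr add 10.20.2.105/32 dev lo    # service-common (anycast) loopback, shared with ntp02
# advertise the connected loopbacks into eBGP via FRR
vtysh -c 'conf t' \
      -c 'router bgp 64705' \
      -c 'address-family ipv4 unicast' \
      -c 'redistribute connected'
# if the container dies, its advertisement is withdrawn and traffic follows
# the remaining path to the same service loopback on the other host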
15. Host Functions
● Host functions are minimized. Management functions relegated
to containers
● Host has main BGP router, connects to BGP instances of each
of the other two hosts
● Configured to handle the VxLAN/EVPN MAC/IP advertisements
to/from each container
● Keeps container traffic ‘segregated’ from host ‘native’ routing
tables – virtualizes networking within and across the hosts
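The MAC/IP advertisements that FRR handles for the containers can be inspected with commands along these lines (illustrative; output omitted, VNI number taken from later slides):
# illustrative only: inspect EVPN state on a host
vtysh -c 'show evpn vni'               # local VNIs and their VTEPs
vtysh -c 'show evpn mac vni 1421'      # MACs learned/advertised for VNI 1421
vtysh -c 'show bgp l2vpn evpn summary'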
16. eBGP
● The next set of slides shows eBGP routing tables to illustrate the
resiliency created by routing
● A non-production, two-box cloud is shown as an example
17. host01.ny1 neighbors
host01.ny1# sh ip bgp sum
IPv4 Unicast Summary:
BGP router identifier 10.20.1.1, local AS number 64601 vrf-id 0
BGP table version 62
RIB entries 55, using 8360 bytes of memory
Peers 9, using 174 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
host02.ny1(10.20.3.2) 4 64602 100218 100229 0 0 0 07w4d05h 18
pprx01.ny1(10.20.5.11) 4 64701 100132 100147 0 0 0 09w6d12h 2
nacl01.ny1(10.20.5.12) 4 64702 100139 100157 0 0 0 09w6d06h 2
ntp01.ny1(10.20.5.13) 4 64705 100132 100148 0 0 0 09w6d12h 2
dmsq01.ny1(10.20.5.14) 4 64703 100133 100149 0 0 0 09w6d12h 2
bind01.ny1(10.20.5.15) 4 64706 100133 100150 0 0 0 09w6d12h 2
prxy01.ny1(10.20.5.17) 4 64704 100132 100146 0 0 0 09w6d12h 2
smtp01.ny1(10.20.5.18) 4 64707 100132 100145 0 0 0 09w6d12h 2
fw01.ny1(10.20.5.19) 4 64708 100130 100148 0 0 0 09w6d12h 1
Total number of neighbors 9
host01 has private ASN 64601, host02 has ASN 64602
18. host02.ny1 neighbors
host02.ny1# sh ip bgp sum
IPv4 Unicast Summary:
BGP router identifier 10.20.1.2, local AS number 64602 vrf-id 0
BGP table version 54
RIB entries 55, using 8360 bytes of memory
Peers 9, using 174 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
host01.ny1(10.20.3.3) 4 64601 100233 100223 0 0 0 07w4d05h 18
pprx02.ny1(10.20.6.11) 4 64801 100135 100145 0 0 0 09w6d12h 2
nacl02.ny1(10.20.6.12) 4 64802 100135 100145 0 0 0 09w6d12h 2
ntp02.ny1(10.20.6.13) 4 64805 100135 100145 0 0 0 09w6d12h 2
dmsq02.ny1(10.20.6.14) 4 64803 100135 100146 0 0 0 09w6d12h 2
bind02.ny1(10.20.6.15) 4 64806 100136 100147 0 0 0 09w6d12h 2
prxy02.ny1(10.20.6.17) 4 64804 100135 100145 0 0 0 09w6d12h 2
smtp02.ny1(10.20.6.18) 4 64807 100135 100144 0 0 0 09w6d12h 2
fw02.ny1(10.20.6.19) 4 64808 100134 100145 0 0 0 09w6d12h 1
Total number of neighbors 9
Containers on host01 have private ASN 647xx, host02 containers use ASN 648xx
19. host01.ny1 loopbacks view A
host01.ny1# sh ip bgp
BGP table version is 62, local router ID is 10.20.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 10.20.1.1/32 0.0.0.0 0 32768 ?
*> 10.20.1.2/32 10.20.3.2 0 0 64602 ?
*> 10.20.1.17/32 10.20.5.11 0 0 64701 ?
*> 10.20.1.18/32 10.20.5.12 0 0 64702 ?
*> 10.20.1.19/32 10.20.5.14 0 0 64703 ?
*> 10.20.1.20/32 10.20.5.17 0 0 64704 ?
*> 10.20.1.21/32 10.20.5.13 0 0 64705 ?
*> 10.20.1.22/32 10.20.5.15 0 0 64706 ?
*> 10.20.1.23/32 10.20.5.18 0 0 64707 ?
*> 10.20.1.24/32 10.20.5.19 0 0 64708 ?
*> 10.20.1.33/32 10.20.3.2 0 64602 64801 ?
*> 10.20.1.34/32 10.20.3.2 0 64602 64802 ?
*> 10.20.1.35/32 10.20.3.2 0 64602 64803 ?
*> 10.20.1.36/32 10.20.3.2 0 64602 64804 ?
*> 10.20.1.37/32 10.20.3.2 0 64602 64805 ?
*> 10.20.1.38/32 10.20.3.2 0 64602 64806 ?
*> 10.20.1.39/32 10.20.3.2 0 64602 64807 ?
*> 10.20.1.40/32 10.20.3.2 0 64602 64808 ?
... on next slide
Loopbacks 10.20.1.x/32 are unique per container
Containers on host01 are seen as local hops
Containers on host02 are seen as two hops away via host02
20. host01.ny1 loopbacks view B
* 10.20.2.101/32 10.20.3.2 0 64602 64801 ?
*> 10.20.5.11 0 0 64701 ?
* 10.20.2.102/32 10.20.3.2 0 64602 64802 ?
*> 10.20.5.12 0 0 64702 ?
* 10.20.2.103/32 10.20.3.2 0 64602 64803 ?
*> 10.20.5.14 0 0 64703 ?
* 10.20.2.104/32 10.20.3.2 0 64602 64804 ?
*> 10.20.5.17 0 0 64704 ?
* 10.20.2.105/32 10.20.3.2 0 64602 64805 ?
*> 10.20.5.13 0 0 64705 ?
* 10.20.2.106/32 10.20.3.2 0 64602 64806 ?
*> 10.20.5.15 0 0 64706 ?
* 10.20.2.107/32 10.20.3.2 0 64602 64807 ?
*> 10.20.5.18 0 0 64707 ?
* 10.20.3.2/31 10.20.3.2 0 0 64602 ?
*> 0.0.0.0 0 32768 ?
*> 10.20.5.0/24 0.0.0.0 0 32768 ?
*> 10.20.6.0/24 10.20.3.2 0 0 64602 ?
Displayed 28 routes and 36 total paths
Loopbacks 10.20.2.x/32 are unique per service
Service loopbacks are seen on two separate containers on two different hosts, with the local container taking precedence
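To confirm which instance of an anycast service is currently preferred, something like the following can be run on a host (addresses from the table above; output omitted):
# illustrative only: which path does the service loopback resolve to right now?
vtysh -c 'show ip bgp 10.20.2.105/32'   # both paths, best marked with '>'
ip route get 10.20.2.105                # the path the kernel will actually use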
30. Nftables YAML to Config to Running
A simple zone-based firewall configuration in YAML in the pillar file:
policy:
  local-private:
    from: local
    to: private
    default: accept
  local-public:
    from: local
    to: public
    default: accept
  private-local:
    from: private
    to: local
    default: accept
  private-public:
    from: private
    to: public
    default: drop
  public-private:
    from: public
    to: private
    default: drop
  public-local:
    from: public
    to: local
    default: drop
    rule:
      # salt clients
      - proto: tcp
        saddr:
          - 192.168.195.100
          - 172.16.42.192/27
          - 172.16.43.192/27
          - 172.16.42.224/28
          - 172.16.43.224/28
        dport:
          - 4505
          - 4506
      - proto: tcp
        saddr:
          - 192.168.195.100
        sport:
          - 4505
          - 4506
      # ssh from anywhere
      - proto: tcp
        dport: 22
Excerpt from the auto-generated configuration file, based upon the above YAML:
# excerpt from /etc/nftables.conf:
add chain ip filter public_local
add rule ip filter public_local tcp dport {4505,4506} ip saddr {192.168.195.100,172.16.42.192/27,172.16.43.192/27,172.16.42.224/28,172.16.43.224/28} accept
add rule ip filter public_local tcp sport {4505,4506} ip saddr {192.168.195.100} accept
add rule ip filter public_local tcp dport 22 accept
add rule ip filter input iifname eth443 goto public_local
add rule ip filter public_local iifname eth443 counter goto loginput
add rule ip filter public_local log prefix "public_local:DROP:" group 0 counter drop
Once the configuration file is loaded into the kernel via nftables, the resulting installed ruleset can be viewed:
# excerpt from nft list ruleset:
chain public_local {
  tcp dport { 4505, 4506} ip saddr { 172.16.42.192-172.16.42.239, 172.16.43.192-172.16.43.239, 192.168.195.100} accept
  tcp sport { 4505, 4506} ip saddr { 192.168.195.100} accept
  tcp dport ssh accept
  iifname "eth443" counter packets 155364 bytes 7981388 goto loginput
  log prefix "public_local:DROP:" group 0 counter packets 0 bytes 0 drop
}
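The load-and-verify step itself is not shown on the slide; in practice it amounts to something along these lines:
# illustrative only: validate, load, and inspect the generated ruleset
nft -c -f /etc/nftables.conf            # syntax check without installing
nft -f /etc/nftables.conf               # install into the kernel
nft list chain ip filter public_local   # view the running chain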
31. Example 2 - Network Constructs
[diagram: per-VLAN plumbing on a host – enp2s0f1 (physical, VXLAN encap on IP) feeds vxPub421 (Linux VXLAN interface, MAC/IP to FRR, encap over the network); vxPub421 and the veth leg vbPub421 are members of brPub421 (Linux bridge, connects to FRR); the other veth leg voPub421 plus the internal port vlan421 attach to ovsbr0 (Open vSwitch bridge); the LXC containers edge01 and fw01 connect to ovsbr0 via veth pairs ve-edge01-v421 / ve-fw01-v421, appearing as eth421 inside each container]
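The resulting plumbing can be inspected with standard tooling (illustrative; output omitted):
# illustrative only: inspect the Linux-bridge and Open vSwitch sides of the construct
brctl show brPub421        # members: vxPub421, vbPub421
ovs-vsctl show             # ovsbr0 ports: voPub421 (tag 421), vlan421 (internal), container veths
ip -d link show vxPub421   # vxlan id 1421, dstport 4789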
32. Ex2 - Map Salt -> Interface/BGP Config
# less pillar/net/example/ny1/host01.sls
enp2s0f1:
  description: enp2s0f1.host02.ny1.example.net
  auto: True
  inet: manual
  addresses:
    - 10.20.3.3/31
  bgp:
    prefix_lists:
      plIpv4ConnIntMgmt:
        - prefix: 10.20.3.2/31
    neighbors:
      - remoteas: 64602
        peer:
          ipv4: 10.20.3.2
        password: oneunified
  Mtu: 9000
# Portion of /etc/network/interfaces:
# description: enp2s0f1.host02.ny1.example.net
auto enp2s0f1
iface enp2s0f1
address 10.20.3.3/31
Mtu 9000
# part of bgp route-map
ip prefix-list plIpv4ConnIntMgmt seq 5 permit 10.20.5.0/24
ip prefix-list plIpv4ConnIntMgmt seq 10 permit 10.20.3.2/31
route-map rmIpv4Connected permit 110
match ip address prefix-list plIpv4ConnLoop
set community 64601:1001
!
route-map rmIpv4Connected permit 120
match ip address prefix-list plIpv4ConnIntMgmt
set community 64601:1002 64601:1202
!
route-map rmIpv4Connected permit 130
match ip address prefix-list plIpv4ConnInt
set community 64601:1002
!
route-map rmIpv4Connected deny 190
# linux bash
# ip route show 10.20.3.2/31
10.20.3.2/31 dev enp2s0f1 proto kernel scope link src 10.20.3.3
# free range routing vtysh
host01.ny1# sh ip route 10.20.3.2/31
Routing entry for 10.20.3.2/31
Known via "connected", distance 0, metric 0, best
Last update 07w1d22h ago
* directly connected, enp2s0f1
# vtysh sh run excerpt
router bgp 64601
bgp router-id 10.20.1.1
bgp log-neighbor-changes
no bgp default ipv4-unicast
bgp default show-hostname
coalesce-time 1000
neighbor 10.20.3.2 remote-as 64602
neighbor 10.20.3.2 password oneunified
This excerpt of a pillar file is used to build the interface configuration and the BGP configuration,
with the run-time results shown above. Parameters in the pillar file are kept together to
facilitate readability and clarify relationships.
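The render step itself is not shown; in broad terms it is driven with standard Salt commands such as the following (the state name 'net' is hypothetical):
# illustrative only: view the pillar data and apply the state that renders it
salt 'host01.ny1' pillar.get enp2s0f1
salt 'host01.ny1' state.apply net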
33. VNI -> Pillar for VxLAN
# cat pillar/net/example/ny1/vni.sls
#
# the vni is used to build the second part of a route distinguisher (rd)
# type 0: 2 byte ASN, 4 byte value
# type 1: 4 byte IP, 2 byte value
# type 2: 4 byte ASN, 2 byte value
# if vlans are kept in the range of 1 - 999:
# use a realm of 1 - 64, use an rd of
# ip:rrvvv
# up to 16m vxlan identifiers can be used; this will need to evolve if/when
# scale requires it
# but... since ebgp is used predominantly, which provides a unique asn to each
# device, it is conceivable that type 0 RDs could be used, which would provide
# for the full 16 million vxlan identifiers
vni:
  - id: 1012
    desc: vlan12 10.20.7.0/24
    member:
      - 10.20.1.1
      - 10.20.1.2
  - id: 1101
    desc: edge0[1-2] v101
    member:
      - 10.20.1.1
      - 10.20.1.2
  - id: 1421
    desc: public services
    member:
      - 10.20.1.1
      - 10.20.1.2
Some pillar files have information shared across multiple instances – common configuration
elements are factored out and included in the top.sls file where necessary
34. Auto Config: BGP, Interfaces, Links
# excerpt from BGP configuration file
address-family l2vpn evpn
neighbor 10.20.3.2 activate
vni 1101
rd 10.20.1.1:1101
route-target import 10.20.1.2:1101
route-target export 10.20.1.1:1101
exit-vni
vni 1012
rd 10.20.1.1:1012
route-target import 10.20.1.2:1012
route-target export 10.20.1.1:1012
exit-vni
vni 1421
rd 10.20.1.1:1421
route-target import 10.20.1.2:1421
route-target export 10.20.1.1:1421
exit-vni
advertise-all-vni
exit-address-family
# excerpt from /etc/network/interfaces:
# description: shared external containers
auto vlan421
iface vlan421
pre-up brctl addbr brPub421
pre-up brctl stp brPub421 off
up ip link set dev brPub421 up
pre-up ip link add vxPub421 type vxlan id 1421 dstport 4789 local 10.20.1.1 nolearning
pre-up brctl addif brPub421 vxPub421
up ip link set dev vxPub421 up
pre-up ip link add vbPub421 type veth peer name voPub421
pre-up brctl addif brPub421 vbPub421
pre-up ovs-vsctl --may-exist add-port ovsbr0 voPub421 tag=421
up ip link set dev vbPub421 up
up ip link set dev voPub421 up
down ip link set dev vbPub421 down
down ip link set dev voPub421 down
pre-up ovs-vsctl --may-exist add-port ovsbr0 vlan421 tag=421 -- set interface vlan421 type=internal
post-down ovs-vsctl --if-exists del-port ovsbr0 vlan421
# ip link show dev brPub421
17: brPub421: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group
default qlen 1000
link/ether 6e:56:4f:62:7c:82 brd ff:ff:ff:ff:ff:ff
# ip link show vxPub421
18: vxPub421: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master brPub421 state UNKNOWN
mode DEFAULT group default qlen 1000
link/ether ee:38:74:6c:99:3f brd ff:ff:ff:ff:ff:ff
# ip link show voPub421
19: voPub421@vbPub421: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system
state UP mode DEFAULT group default qlen 1000
link/ether 9a:e4:51:35:89:83 brd ff:ff:ff:ff:ff:ff
# ip link show vbPub421
20: vbPub421@voPub421: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master brPub421 state
UP mode DEFAULT group default qlen 1000
link/ether 6e:56:4f:62:7c:82 brd ff:ff:ff:ff:ff:ff
# ip link show vlan421
21: vlan421: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group
default qlen 1000
link/ether 62:06:81:20:29:09 brd ff:ff:ff:ff:ff:ff
a) a simple config is used to build ...
b) ... the complicated interface configuration in the diagram shown previously ...
c) ... with the resulting instances installed into the kernel
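Beyond ip link show, the EVPN side of the same construct can be checked with commands along these lines (illustrative; output omitted):
# illustrative only: confirm MAC/IP reachability is being exchanged for VNI 1421
bridge fdb show dev vxPub421            # remote MACs pointing at the far-end VTEP
vtysh -c 'show bgp l2vpn evpn route'    # type-2 (MAC/IP) and type-3 (IMET) routes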
35. Process
● With salt state, pillar and reactor files defined for all services
and configuration elements, only two commands are necessary
to rebuild any one of the three cloud management boxes:
– destroy the boot sector
– reboot
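In concrete terms that amounts to something like the following, run on the box being retired (the device name and a legacy MBR layout are assumptions):
# illustrative only: wipe the boot sector so the box falls back to PXE, then reboot
dd if=/dev/zero of=/dev/sda bs=512 count=1
reboot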
36. Process
● Upon reboot, the physical box obtains the PXE boot installation
files, allocates and formats the file system, installs the operating
system, installs the Salt agent, and automatically reboots
● Upon the reboot, the Salt agent contacts one of the remaining
Salt Masters, and automatically starts provisioning the system
and services as defined in the Salt state and pillar files.
● LXC containers are instantiated and started at this time
● The Salt agent in each container contacts the Salt Master to
initiate the build of that specific container, using services
supplied by the containers on the surviving hosts
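None of this requires manual intervention, but the progress of a rebuild can be watched from a surviving Salt master with standard commands (illustrative; minion naming is an assumption):
# illustrative only: watch the automated rebuild from a surviving salt master
salt-run manage.up             # which minions are currently responding
salt-run jobs.active           # provisioning jobs still in flight
salt 'host03.ny1*' test.ping   # confirm the rebuilt host and its containers are back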