(1) Network engineers sometimes face switching loops and broadcast storms even with loop prevention enabled. This can occur when connected to other networks that have STP disabled or loose VLAN policies.
(2) Tests show that a single ping in another network can cause a broadcast storm affecting directly connected switches through high traffic volumes. While the storm originates elsewhere, connected switches still must process each frame.
(3) It is recommended to connect to other networks at their routers rather than switches to avoid being directly exposed to potential broadcast storms. Using keepalive and careful VLAN configuration can also help prevent issues.
2. ▪ Once in a while, network engineers working in IIGs or ISPs in Bangladesh have to face
a phenomenon: a switching loop. The part of our network backbone that is
switch-based has all the recommended loop prevention mechanisms. Even
so, broadcast storms sometimes take place. This paper discusses my findings
on what may have caused these occurrences, and my recommendations. I first wrote
about this topic 14 months ago in an article on LinkedIn. I believe the
topic is still relevant.
3. ▪ In the NovoCom and InterCloud networks, we run RSTP on the switching
backbone.
▪ Furthermore, the switching network has a tree-like topology with no rings. So
even with STP disabled, there would be no scope for a loop to occur.
▪ But however unlikely it seems, we sometimes face broadcast storms.
4.
5. ▪ These broadcasts come to us from our partner networks. The affected devices are
our partner/client-facing switches.
▪ More often than not, we are compelled to connect to our clients or partners at their
switches rather than their routers.
▪ These networks may be fully switch-based and may connect to other
uplinks/partners/clients through their switches. They might even have STP
disabled.
6. ▪ When the loop occurs, one or more of our switches reach almost 100 percent CPU
utilization, management IPs become unreachable, and all clients on these devices
face up to 100 percent packet loss.
▪ At 100 percent CPU utilization, it is no longer possible to check logs and interface
traffic to isolate which traffic is causing the outage.
7. ▪ Once we can access the switches, we have to shut down clients and partners,
manually or physically, to see which network's disconnection makes things normal again.
▪ This is a very time-consuming process.
▪ It is quite frustrating to see our network affected even though we follow the
recommended design and configurations.
8. ▪ One interesting point: we learn, in real time or
later, that several service providers are facing the same
issue at exactly the same time.
▪ In the earlier days, we used to think something like this was
happening:
9.
10. ▪ But even if the networks are
physically connected in this manner,
a loop should not occur.
▪ This is because in our network we follow
a strict policy of allowing only specific
VLANs on client-connected interfaces,
and a given VLAN is never
repeated.
11. ▪ So in the diagram, VLANs allowed at
Interface A do not exist at Interface Z.
▪ Therefore broadcast domains are
completely separated.
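The separation argument above can be checked mechanically. The following is a minimal sketch in Python, assuming each interface's allow-list is available as a set of VLAN IDs. The function name `shared_vlans` and the VLAN IDs are hypothetical, since the slide does not list the actual VLANs on Interface A and Interface Z.

```python
def shared_vlans(allowed_a, allowed_z):
    """Return the VLAN IDs allowed on both interfaces, sorted."""
    return sorted(set(allowed_a) & set(allowed_z))


# Hypothetical allow-lists; the real VLAN IDs are not given in the slides.
interface_a = {100, 101, 102}
interface_z = {500}

overlap = shared_vlans(interface_a, interface_z)
if overlap:
    print("Shared broadcast domains on VLANs:", overlap)
else:
    print("No VLAN is repeated; the broadcast domains are separate.")
```

An empty result means no broadcast domain spans both interfaces, which is the condition the policy is meant to guarantee.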
12. ▪ To understand this, I performed some very simple tests with BDCOM switches. I used
BDCOM(tm) S5612 Software, Version 2.2.0C Build 42666. Here are a few of the
findings:
▪ When loops are created intentionally, BDCOM switches detect and prevent them
just fine with STP running.
▪ I used a ring of 4 switches to create a loop for specific VLANs. Running STP on
even just 1 of the switches still prevented loops by blocking certain interfaces.
13. ▪ But the problem occurs when the scenario looks
something like this, as in my lab test:
14.
15. ▪ I intentionally created a loop at A and B
to see how it affects Switch Z.
▪ Here the circled portion replicates a
network that is connected to us but
whose STP/VLAN policies we have
no control over.
▪ But our focus is Switch Z. Switch Z
represents our device that is
connected to a client/partner.
16. ▪ A single ping to a non-existent IP
creates a broadcast storm at switches
A and B and takes CPU utilization to
100 percent.
▪ The broadcast storm occurs in
VLANs 1 and 200. But Switch Z should
be discarding every frame that does
not carry the VLAN 500 tag.
▪ So Switch Z itself is not taking part in
the loop. But it still becomes unreachable,
and its CPU utilization reaches 100
percent.
17. ▪ Switch Z could have saved itself if STP
had blocked the port connected to
switch A. But a switch (with STP
enabled) detects a loop when it sends
out a BPDU and receives that BPDU on
another port.
18. ▪ But Switch Z never receives, from
switch A via interface Z, a BPDU that
it had sent out through its other
interfaces.
▪ So STP has no reason to
conclude that there is a loop, and
it does not put the interface into
BLK mode.
19. ▪ A single ping from switch A/B to a
non-existent IP creates about 600
Mbps of traffic at the connected
interface of Switch Z.
▪ Switch Z is supposed to discard these
broadcast packets because they do
not belong to VLAN 500, but it still has
to inspect every frame, check its VLAN tag,
and then drop it. Dealing with so many
broadcasts leads to CPU utilization of
100 percent.
▪ I captured packets from interface Z,
and they are all broadcasts.
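To give a sense of how many frames 600 Mbps represents, here is a rough back-of-the-envelope sketch in Python. The frame sizes are assumptions, since the slides do not report the captured frame length. The 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap) is standard for Ethernet, but the 600 Mbps figure may or may not include it. The function name `frames_per_second` is hypothetical.

```python
def frames_per_second(rate_bps, frame_bytes=64, overhead_bytes=20):
    """Approximate frame rate for a given bit rate.

    frame_bytes is the Ethernet frame size (assumed, not measured);
    overhead_bytes covers preamble, SFD and inter-frame gap.
    """
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return rate_bps / bits_per_frame


rate_bps = 600_000_000  # about 600 Mbps, as observed on interface Z

for size in (64, 128, 512):  # assumed frame sizes in bytes
    print(f"{size}-byte frames: about {frames_per_second(rate_bps, size):,.0f} per second")
```

At the 64-byte minimum frame size this works out to roughly 890,000 frames per second, every one of which Switch Z has to receive and check before dropping it.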
20.
21. ▪ Afterwards I replaced the BDCOM with a
Cisco ME 3400 and used it as Switch Z.
The result is the same: CPU utilization
spikes to over 95 percent.
22. ▪ When a router receives a broadcast,
it simply drops it. Even so,
connecting a router (a Mikrotik CCR
1016-12G in my test) still results in
very high CPU utilization.
▪ And all of this results from a
broadcast storm that was created by
a single ping.
23. ▪ So what actually happens when
several service provider networks are affected at the
same time is something like this:
24.
25. ▪ Not only the network where the
broadcast storm originates, but all the
attached networks are affected.
26.
27. ▪ During real L2 looping incidents, we
find several of our switches becoming
unreachable.
▪ But in my lab setup, when I connect
another switch X to Switch Z, as shown
in the diagram below, only Switch Z becomes
unreachable; switch X is not affected.
28. ▪ An explanation for this, in my opinion,
is that in my lab tests and in our own
production environment, we allow
only specific VLANs on all interfaces.
▪ Therefore, although the directly
connected switch has 100 percent
CPU utilization, it does not propagate
the broadcasts to the next switch.
29. ▪ But my assumption is that many networks may leave all their trunk interfaces at the
default config and allow all VLANs, including VLAN 1.
▪ If a network Q is such a network and is connected to a broadcast storm originator
network P, then a broadcast storm from its neighbor network P will not only affect
its edge switch, but will reach the farthest corners of its network.
▪ As a result, all other networks connected to different switches of network Q are
also affected.
▪ Maybe this is why the looping incidents occur on such a large scale and take down so
many networks at the same time.
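The difference between the two cases can be illustrated with a small model. This is a simplified sketch in Python, not a simulation of real switch behavior. It assumes a switch is "affected" as soon as storm frames arrive on one of its ports, and floods them onward only if the receiving port allows the VLAN. The function name `affected_switches`, the port map, and the VLAN IDs 500 and 600 on the strict trunks are all hypothetical; P, Z and X follow the naming used in the slides.

```python
from collections import deque


def affected_switches(ports, origin, vlan):
    """Return the set of switches that receive storm frames in `vlan`.

    `ports` maps (switch, neighbour) to the set of VLANs allowed on that
    switch's port towards the neighbour, or None for a default trunk that
    allows every VLAN. Both directions of each link must be present.
    """
    def allows(switch, neighbour):
        allowed = ports[(switch, neighbour)]
        return allowed is None or vlan in allowed

    affected = {origin}          # switches that must inspect the frames
    accepted = {origin}          # switches that also flood them onward
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for (switch, neighbour) in ports:
            if switch != current or not allows(current, neighbour):
                continue
            affected.add(neighbour)  # frames hit the neighbour's port
            if allows(neighbour, current) and neighbour not in accepted:
                accepted.add(neighbour)
                queue.append(neighbour)
    return affected


# Strict allow-lists on our side (hypothetical VLAN IDs): P -- Z -- X
strict = {("P", "Z"): None, ("Z", "P"): {500},
          ("Z", "X"): {600}, ("X", "Z"): {600}}
# Every trunk left at the default "allow all VLANs" setting.
default = {key: None for key in strict}

print(sorted(affected_switches(strict, "P", 200)))
print(sorted(affected_switches(default, "P", 200)))
```

With the strict allow-lists only P and Z are affected, which matches the lab observation that switch X stays reachable. With every trunk left at the default, the storm also reaches X and would continue to any switch behind it.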
30. ▪ (1) Never disable STP in your switching network.
▪ It is almost never advisable to disable STP. If you want to make STP convergence
faster, you can use the PortFast and BPDU Guard commands on the interfaces where
routers/servers/PCs are connected. But disabling STP altogether is not
recommended.
▪ One reason for keeping STP disabled, I assume, is having many VLANs in the network
and wanting complete control over the direction of traffic flow.
31. ▪ (1) Never disable STP in your switching network.
▪ Another reason for disabling STP could be the efficient use of all device ports and
links, because STP may keep some ports blocked.
▪ However, this can be achieved by using PVST+ and manually setting the primary and
secondary root bridges for each VLAN.
▪ This requires extensive and thorough planning, but it will enable you to
dictate the traffic flow for each VLAN as per your preferences.
32. ▪ (2) Connect with your client/partner/peer at their routers rather than their
switches.
▪ Running STP will not save you if your directly connected network is the broadcast
storm originator. So try to persuade your client/partner to let you connect at
their router. Your switch will then never have to face a storm.
33. ▪ (3) Use the keepalive command:
▪ It is advisable to apply the ‘keepalive’ command on client-facing interfaces.
▪ If the neighboring switch has 100 percent CPU utilization due to a broadcast storm, it
will be unable to respond to the keepalive query. The command then shuts the
interface down, and the switch protects itself.
▪ In my lab tests, the keepalive command worked in 4 out of 6 cases, shutting down the
interface before the switch was affected by the broadcast storm.
34. ▪ These conclusions are based on my own observations and studies.
▪ The findings come from lab tests.
▪ Many other factors may contribute in a production environment.
35. ▪ Even after deploying all loop prevention mechanisms, you may still face broadcast
storms.
▪ These broadcast storms originate in a neighbor network over which you
have no control.
▪ They will affect your directly connected devices.
▪ Following the recommendations may save you in such scenarios.