Radia Perlman, a distinguished engineer at Sun Microsystems, named one of the 20 most influential people in the industry in the 25th anniversary issue of Data Communications magazine, and the original inventor of the 802.1D spanning-tree specification, recently had a few words to say about the protocol: "It's time to redo (one of the Internet's most widely used technologies) in a way that is more robust and gives more efficient paths."1
Introducing Spanning Tree Protocol
Chapter 2, "Defeating a Learning Bridge's Forwarding Process," explained how Ethernet switches build their forwarding tables by learning source MAC addresses from data traffic. When an Ethernet frame arrives on a switch port in VLAN X with a destination MAC address for which there is no entry in the forwarding table, the switch floods the frame. That is, it sends a copy of the frame to every single port in VLAN X (except the port that originally received the frame). Although this is perfectly fine in a single-switch environment, interesting side effects are observed in multiswitch topologies, as Figure 3-1 shows. The figure represents a simple network composed of two LAN switches interconnected by two Ethernet links.
Figure 3-1 Basic Network Setup
In the next steps, MAC addresses are conveniently shortened to a single-letter format for clarity. A legitimate Ethernet MAC address is actually made up of 6 bytes. The following sequence of events occurs when an application on the top PC (MAC address A) communicates with the bottom PC (MAC address B):
1. The top PC sends a frame to the bottom PC (destination MAC address B).
2. Switch 1 learns that MAC address A is off port 0/1.
3. Switch 1 looks up MAC address B; no match is found.
4. Switch 1 sends out the frame on links X and Y (a process known as flooding).
5. Switch 2 receives the frame from A to B on link X and updates its forwarding table. (A is on link X.)
6. A split second later, switch 2 receives the exact same frame on link Y; this time, it causes a new update to the forwarding table. This is known as a race condition: whichever MAC address arrives first wins the race and gets installed in the forwarding table.
7. Switch 2 looks up MAC address B; no match is found. (B hasn't talked yet.)
8. Switch 2 sends out the frame on port 0/2 and link Y (or X, depending on the outcome of the race condition described in Steps 5 and 6).
9. Switch 1 and PC B both receive the frame; however, this frame causes switch 1 to again update its forwarding table. (MAC address A is now off link Y or X.)
10. Return to Step 3 and loop forever. Even if B talks, nothing changes because both switches constantly update their forwarding tables with incorrect information (because of the never-ending packet loop).
There is no such thing as a Time to Live (TTL) field in Ethernet headers. No routing protocol distributes information related to MAC addresses and their whereabouts. Simply put, short of a power or link failure, nothing can stop the packets from looping endlessly between switch 1 and 2. There's no need for a broadcast or multicast frame; a simple unicast frame does fine.
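The loop can be reproduced with a toy model. The following is a minimal sketch, not a faithful switch implementation: the `Switch` class, the `simulate` function, and the event cap are all assumptions made for illustration, and only the port and link names (0/1, 0/2, X, Y) come from the scenario above.

```python
from collections import deque


class Switch:
    """Toy learning bridge: learns source MACs and floods unknown destinations."""

    def __init__(self, name, ports):
        self.name = name
        self.ports = ports
        self.table = {}  # MAC address -> port

    def receive(self, src, dst, in_port):
        self.table[src] = in_port  # learn, or overwrite, where src lives
        if dst in self.table:
            return [self.table[dst]]
        # Unknown destination: flood out every port except the ingress port.
        return [p for p in self.ports if p != in_port]


def simulate(max_events=12):
    """Inject one unicast frame from A to B and follow its copies.

    max_events is an artificial cap (hypothetical); without it the loop
    below would never terminate, which is the point of the example.
    """
    s1 = Switch("switch1", ["0/1", "X", "Y"])
    s2 = Switch("switch2", ["0/2", "X", "Y"])
    peer = {"switch1": s2, "switch2": s1}
    queue = deque([(s1, "0/1")])  # the frame enters switch 1 on port 0/1
    history = []
    events = 0
    while queue and events < max_events:
        sw, in_port = queue.popleft()
        events += 1
        out_ports = sw.receive("A", "B", in_port)
        history.append((sw.name, sw.table["A"]))
        for port in out_ports:
            if port in ("X", "Y"):  # inter-switch links deliver to the peer
                queue.append((peer[sw.name], port))
    return history, len(queue)


if __name__ == "__main__":
    history, pending = simulate()
    for name, port in history:
        print(f"{name}: MAC A now believed to be on {port}")
    print(f"copies still in flight when the cap was hit: {pending}")
```

In this model, switch 1 quickly stops pointing MAC address A at port 0/1 and starts flapping between links X and Y, and there are always copies of the frame left in flight when the cap is reached.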
The problem is hardly new. After Radia Perlman's work in the early 1990s, the IEEE ratified her protocol work into a standard known as 802.1D. 802.1D defines the original Spanning Tree Protocol (STP), whose task is to disable redundant paths from one end of the Layer 2 network to another, thereby achieving two goals: no packet duplication or loops while still providing automatic traffic rerouting in case of failure. If switch 1 or switch 2 (or both) were running the STP, the topology represented in Figure 3-1 would logically appear as what's shown in Figure 3-2.
Figure 3-2 Loop-Free Topology Calculated by STP
With link Y disabled by the spanning-tree algorithm running on switch 2, packets from the top PC to the bottom PC can no longer loop forever.
STP is an extremely pervasive protocol; it keeps virtually every existing Ethernet-based LAN loop free.
Types of STP
Today, various flavors of STP exist, either as IEEE specs (802.1Q Common STP, 802.1w Rapid STP, 802.1s Multiple STP) or as proprietary vendor extensions. All of them function in similar fashion; they are typically differentiated only by the time they need to recalculate an alternate topology in case of a link failure. Proper STP operation is critical, yet the protocol is fragile, as this chapter is about to demonstrate.
Understanding 802.1D and 802.1Q Common STP
Originally defined in 1993, the IEEE 802.1D document specifies an algorithm and a protocol to create a loop-free topology in a Layer 2 network. (At that time, there was no concept of VLAN.) The algorithm also ensures automatic reconfiguration after a link or device failure. The protocol converges slowly by today's standards: up to 50 seconds (sec) with the default protocol timers. The 802.1Q specification later augmented the 802.1D by defining VLANs, but it stopped short of recommending a way to run an individual spanning-tree instance per VLAN—something many switch vendors naturally implemented using proprietary extensions to the 802.1D/Q standards.
Understanding 802.1w Rapid STP
Incorporated in the 2004 revision of the 802.1D standard, the 802.1w (Rapid Reconfiguration of Spanning Tree) introduced significant changes, primarily in terms of convergence speeds. According to the IEEE, motivations behind 802.1w include the following:
- The desire to develop an improved mode of bridge operation that, while retaining the plug-and-play benefits of spanning tree, discards some of the less desirable aspects of the existing STP (in particular, the significant time it takes to reconfigure and restore service on link failure/restoration).
- The realization that, although small improvements in spanning-tree performance are possible by manipulating the existing default parameter values, it is necessary to introduce significant changes to the way the spanning-tree algorithm operates to achieve major improvements.
- The realization that it is possible to develop improvements to spanning tree's operation that take advantage of the increasing prevalence of structured wiring approaches, while still retaining compatibility with equipment based on the original spanning-tree algorithm.
The bottom line is that 802.1w usually converges in less than a second. All Cisco switches running recent software versions make 802.1w the default STP.
Understanding 802.1s Multiple STP
The 802.1s supplement to IEEE 802.1Q adds the facility for bridges to use multiple spanning trees, providing for traffic belonging to different VLANs to flow over potentially different paths within the virtual bridged LAN. The primary driver behind the development of 802.1s is the increased scalability it provides in large bridged networks. Indeed, an arbitrary number of VLANs can be mapped to a spanning-tree instance, rather than running a single spanning-tree instance per VLAN. The loop-breaking algorithm now runs at the instance level instead of at the individual VLAN level. With 802.1s, you can, for example, map a thousand VLANs to a single spanning-tree instance. This means that all these VLANs follow a single logical topology (a blocked port blocks for all those VLANs), but the reduction in terms of CPU cycles is significant.
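The VLAN-to-instance mapping can be pictured as a simple lookup table. The sketch below is illustrative only: the function names and the split of 1000 VLANs across two instances are assumptions, and treating unmapped VLANs as members of instance 0 reflects the usual MST default rather than anything stated above.

```python
def build_vlan_map(instance_to_vlans):
    """Invert {instance: iterable of VLAN IDs} into {VLAN ID: instance}."""
    vlan_to_instance = {}
    for instance, vlans in instance_to_vlans.items():
        for vlan in vlans:
            if vlan in vlan_to_instance:
                raise ValueError(f"VLAN {vlan} is mapped to two instances")
            vlan_to_instance[vlan] = instance
    return vlan_to_instance


def instance_for(vlan, vlan_to_instance, default_instance=0):
    """Return the spanning-tree instance that decides forwarding for a VLAN."""
    return vlan_to_instance.get(vlan, default_instance)


# Hypothetical design: VLANs 1-500 follow instance 1, VLANs 501-1000 instance 2.
mapping = build_vlan_map({1: range(1, 501), 2: range(501, 1001)})

if __name__ == "__main__":
    print(len(mapping), "VLANs share", len(set(mapping.values())), "instances")
    print("VLAN 750 follows instance", instance_for(750, mapping))
```

With this layout the switch computes two topologies instead of a thousand, at the price that every VLAN in an instance shares that instance's blocked ports.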
STP Operation: More Details
To understand the attacks that a hacker is likely to carry out against STP, network administrators must gain a solid understanding of STP's inner workings. The protocol builds a loop-free topology that looks like a tree. At the base of the tree is a root bridge—an election process takes place to determine which bridge becomes the root. The switch with the lowest bridge ID (a concatenation of a 16-bit user-assigned priority and the switch's MAC address) wins. The root-bridge election process begins by having every switch in the domain believe it is the root and claiming it throughout the network by means of Bridge Protocol Data Units (BPDU). BPDUs are Layer 2 frames multicast to a well-known MAC address in case of IEEE STP (01-80-C2-00-00-00) or vendor-assigned addresses, in other cases. When receiving a BPDU from a neighbor, a bridge compares the sender's bridge ID with its own to determine which switch has the lowest ID. Only the one with the lowest ID keeps on generating BPDUs, and the process continues until a single switch wins the designated root-bridge election. STP assigns roles and functions to network ports. Every nonroot bridge has one root port: It is the port that leads to the root bridge.
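The "lowest bridge ID wins" rule is easy to express in code. This is a minimal sketch under stated assumptions: the helper names are invented for illustration, the switch names are hypothetical, and the election is collapsed into a single comparison instead of the BPDU exchange that real bridges perform.

```python
def bridge_id(priority, mac):
    """Build the 8-byte bridge ID: 16-bit priority followed by the 6-byte MAC."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if not 0 <= priority <= 0xFFFF or len(mac_bytes) != 6:
        raise ValueError("priority must fit in 16 bits and the MAC in 6 bytes")
    return priority.to_bytes(2, "big") + mac_bytes


def elect_root(bridges):
    """Return the name of the bridge with the numerically lowest bridge ID.

    Equal-length byte strings compare in the same order as the big-endian
    integers they encode, so min() on the raw IDs is sufficient.
    """
    return min(bridges, key=lambda name: bridges[name])


# Hypothetical three-switch domain; 32768 is the commonly used default priority.
switches = {
    "sw1": bridge_id(32768, "00:d0:00:f6:ba:04"),
    "sw2": bridge_id(32768, "00:02:fc:90:08:38"),
    "sw3": bridge_id(8192, "00:d0:00:66:2c:0a"),
}

if __name__ == "__main__":
    print("root bridge:", elect_root(switches))
```

Because the priority occupies the most significant bytes, sw3 wins on priority alone. If sw3 were removed, the tie on priority between sw1 and sw2 would be broken by the lower MAC address. This is also why anyone able to inject a BPDU with a lower bridge ID can influence the election.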
STP uses a path cost–based method to build its loop-free tree. Every port is configured with a port cost—most switches are capable of autoassigning costs based on link speed.
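As a rough illustration of cost-based selection, the sketch below adds the receiving port's cost to the cost advertised in a BPDU and picks the cheapest port toward the root. The cost table uses the widely cited short (16-bit) default values associated with 802.1D-1998; the function names and the tie-break on port name are simplifying assumptions (a real bridge breaks ties on sender bridge ID and port ID).

```python
# Link speed in Mbps -> default port cost (short values, assumed here).
SHORT_PATH_COST = {10: 100, 100: 19, 1000: 4, 10000: 2}


def root_path_cost(advertised_cost, link_speed_mbps):
    """Cost to the root through a port: BPDU cost plus the port's own cost."""
    return advertised_cost + SHORT_PATH_COST[link_speed_mbps]


def pick_root_port(offers):
    """Choose the port with the lowest total cost to the root.

    offers maps a port name to (cost advertised in the received BPDU,
    speed of the receiving link in Mbps).
    """
    return min(offers, key=lambda port: (root_path_cost(*offers[port]), port))


if __name__ == "__main__":
    # Hypothetical switch: a direct 100-Mbps link to the root (advertised
    # cost 0) and a gigabit link to a neighbor that is itself one gigabit
    # hop from the root (advertised cost 4).
    offers = {"0/1": (0, 100), "0/2": (4, 1000)}
    for port, (cost, speed) in offers.items():
        print(port, "->", root_path_cost(cost, speed))
    print("root port:", pick_root_port(offers))
```

In this example the two-hop gigabit path (total cost 8) beats the direct 100-Mbps link (total cost 19), so the direct link is the one that ends up not being used toward the root.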
A port's cost is inversely proportional to its bandwidth. Each time a port receives a BPDU, the port's path cost is added to the path cost contained in the BPDU. The root sends BPDUs with the path cost equal to 0, and the cost keeps increasing as the network diameter increases. When a switch receives two BPDUs because of redundant links in the network, the port that received the BPDU with the higher cost is logically disabled; it is put in blocked mode. The bridge that is responsible for forwarding packets on a given segment is called the designated bridge. After a while, ranging from less than a second to just under a minute depending on the STP flavor, the network converges and a single-rooted loop-free tree is built. Before a port transitions to forwarding, it goes through several states:
- Disabled. The port is electrically inactive and does not send or receive any traffic. Once enabled, the port transitions to the next state (blocking).
- Blocking. Discards all data frames except BPDUs.
- Listening. Switches listen to BPDUs to build the loop-free tree. Data packets are not forwarded (15 sec by default with 802.1D timers).
- Learning. Forwarding tables are built using the source MAC addresses of data frames; data frames are not forwarded.
- Forwarding. Data traffic is forwarded. At this point, the port is fully operational.
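The states above, combined with the default 802.1D timers, account for the worst-case convergence time of 50 seconds. The sketch below is a simplified timeline, not a state machine from the standard: it assumes a port waits a full max_age in blocking before moving on, and that learning lasts forward_delay just as listening does. The function name is hypothetical.

```python
def walk_to_forwarding(max_age=20, forward_delay=15):
    """Return (state, seconds elapsed on entry) for a worst-case transition.

    Assumptions: the port spends max_age seconds in blocking waiting for
    stale information to expire, then forward_delay seconds each in
    listening and learning.
    """
    timeline = [
        ("blocking", max_age),
        ("listening", forward_delay),
        ("learning", forward_delay),
    ]
    elapsed = 0
    steps = []
    for state, duration in timeline:
        steps.append((state, elapsed))
        elapsed += duration
    steps.append(("forwarding", elapsed))
    return steps


if __name__ == "__main__":
    for state, entered_at in walk_to_forwarding():
        print(f"t={entered_at:>2}s  enter {state}")
```

With default timers the port enters forwarding at t = 50 seconds; during that whole interval it forwards no data traffic.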
After the network converges, STP network-wide timers maintain its stability. (A network can be a VLAN.)
In 802.1D, bridges actually have no idea whether their BPDUs are heard by neighboring switches. For example, the root bridge is not sure that everyone acknowledges its presence—the protocol contains no provision to ensure this. The protocol simply relies on the timers (as just explained) to assume BPDUs are properly delivered to every bridge in the network. Table 3-1 represents an 802.1D BPDU.
Table 3-1. 802.1D BPDU Frame Format
| Field | Value |
| --- | --- |
| Destination MAC | 01 80 c2 00 00 00 (IEEE reserved BPDU MAC) |
| Source MAC | 00 00 0c a0 01 96 (Port's MAC address) |
| LENGTH | 00 26 |
| LLC HEADER | |
| Destination Service Access Point | 42 |
| Source Service Access Point | 42 |
| Unnumbered Information | 03 |
| PROTOCOL | 00 00 |
| PROTOCOL VERSION | 00 |
| BPDU TYPE | 00 |
| BPDU FLAGS | 00 |
| ROOT ID | 20 00 00 d0 00 f6 ba 04 |
| PATH COST | 00 00 00 00 |
| BRIDGE ID | 20 00 00 d0 00 f6 ba 04 |
| PORT | 81 14 |
| MESSAGE AGE | 00 00 |
| MAXIMUM AGE | 14 00 |
| HELLO TIME | 02 00 |
| FORWARD DELAY | 0f 00 |
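To make the layout in Table 3-1 concrete, the following sketch decodes the 35 bytes that follow the LLC header, using the exact values from the table. The function and key names are assumptions chosen for this example. The timer fields are carried in units of 1/256 of a second, which is why a max age of 20 seconds appears as 14 00.

```python
import struct

# BPDU payload from Table 3-1, starting at the PROTOCOL field.
BPDU_HEX = (
    "0000"              # protocol identifier
    "00"                # protocol version
    "00"                # BPDU type (configuration)
    "00"                # flags
    "200000d000f6ba04"  # root ID
    "00000000"          # root path cost
    "200000d000f6ba04"  # bridge ID
    "8114"              # port ID
    "0000"              # message age
    "1400"              # maximum age
    "0200"              # hello time
    "0f00"              # forward delay
)


def split_bridge_id(raw):
    """Split an 8-byte bridge ID into (priority, MAC address string)."""
    priority = int.from_bytes(raw[:2], "big")
    mac = ":".join(f"{b:02x}" for b in raw[2:])
    return priority, mac


def parse_config_bpdu(payload):
    """Decode a 35-byte 802.1D configuration BPDU into a dictionary."""
    (proto, version, bpdu_type, flags, root_id, path_cost,
     sender_id, port_id, msg_age, max_age, hello, fwd_delay) = struct.unpack(
        "!HBBB8sI8sHHHHH", payload[:35])
    root_priority, root_mac = split_bridge_id(root_id)
    bridge_priority, bridge_mac = split_bridge_id(sender_id)
    return {
        "protocol": proto, "version": version, "type": bpdu_type,
        "flags": flags,
        "root_priority": root_priority, "root_mac": root_mac,
        "path_cost": path_cost,
        "bridge_priority": bridge_priority, "bridge_mac": bridge_mac,
        "port_id": port_id,
        # Timers are encoded in 1/256-second units.
        "message_age": msg_age / 256, "max_age": max_age / 256,
        "hello_time": hello / 256, "forward_delay": fwd_delay / 256,
    }


if __name__ == "__main__":
    for key, value in parse_config_bpdu(bytes.fromhex(BPDU_HEX)).items():
        print(f"{key:16} {value}")
```

The root ID and the bridge ID are identical and the path cost is 0, so this particular BPDU was generated by the root bridge itself (or by a bridge claiming to be the root).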
In a converged network, the root bridge sends a BPDU out each port every hello interval (2 sec, by default). Every BPDU contains an age field that represents how long it has been in transit. It starts from 0 at the root and increases as the BPDU makes its way through the switched network. A maximum valid age is defined for the network (max_age parameter—20 sec, by default). When a BPDU is received on a port, the switch extracts the age contained in the BPDU and starts running a port clock initialized with that value. For example, if the BPDU is 6 sec old, the clock starts counting from 6. Normally, the next BPDU is supposed to arrive 2 sec later, but because of various conditions (packet loss, unreliable software, excessive CPU utilization, unidirectional links, and so on), BPDUs are known to sometimes fail to show on time. Meanwhile, the port clock runs until it reaches max_age. If it reaches max_age, the bridge starts the election process again, claiming to be the root! Ports go back to blocking/listening/learning before finally forwarding, potentially causing massive traffic blackouts.
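The arithmetic behind the port clock is simple enough to sketch. The helper names below are hypothetical, and the model ignores everything except the age carried in the last BPDU and the two default timers quoted above.

```python
def seconds_until_expiry(message_age, max_age=20):
    """Seconds the stored BPDU information remains valid without a refresh."""
    return max(0, max_age - message_age)


def hello_intervals_tolerated(message_age, max_age=20, hello_time=2):
    """Whole hello intervals that can elapse before the information expires."""
    return seconds_until_expiry(message_age, max_age) // hello_time


if __name__ == "__main__":
    # The example from the text: the last BPDU received was 6 seconds old.
    print("expires in", seconds_until_expiry(6), "seconds")
    print("hello intervals before expiry:", hello_intervals_tolerated(6))
```

For a BPDU that arrives already 6 seconds old, the information expires after 14 more seconds, or seven hello intervals. The farther a bridge sits from the root, the smaller this margin becomes.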
Another property of the STP is its ability to influence the forwarding table's aging time by using a particular bit in the BPDU. Figure 3-3 shows the Flags field found in every BPDU.
Figure 3-3 BPDU Packet Capture—TC Bit
In 802.1D, the Flags field uses two bits: the high-order bit (1000 0000), which is the topology-change acknowledgment (TC-ACK) bit, and the low-order bit (0000 0001), which is the topology-change (TC) bit. A connectivity event is first reported with a topology-change notification (TCN) BPDU. It is a lightweight BPDU, identified by its own BPDU type, whose purpose is to inform the upstream switches all the way to the root bridge that a connectivity event occurred on this switch. A switch sends a TCN BPDU whenever a link or port transitions up or down. Bridges located between the originator of the TCN BPDU and the root immediately acknowledge the reception of the TCN BPDU by setting the TC-ACK bit in the next BPDU they send downstream, without being certain that the root still exists. When the TCN BPDU finally reaches the root bridge, the root sets the TC bit in the BPDUs it generates. This notifies every bridge to reduce its forwarding table's aging time to forward_delay sec (15, by default). The TC bit is set for a certain period of time (max_age + forward_delay sec, or 35 sec with timers using default values). Figure 3-4 shows a scenario where this mechanism plays a crucial role in restoring network connectivity faster.
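The two Flags values discussed above are plain bit masks. The following sketch decodes them and shows the effect on the forwarding-table aging time. The names are illustrative, and the 300-second normal aging time is an assumption (a common default), not a value given in this chapter.

```python
TC_FLAG = 0x01      # low-order bit:  0000 0001
TC_ACK_FLAG = 0x80  # high-order bit: 1000 0000


def decode_flags(flags):
    """Report which of the two 802.1D flag bits are set in a Flags byte."""
    return {
        "tc": bool(flags & TC_FLAG),
        "tc_ack": bool(flags & TC_ACK_FLAG),
    }


def aging_time(tc_seen, normal_aging=300, forward_delay=15):
    """Forwarding-table aging time in seconds.

    While BPDUs carrying the topology-change indication are being received,
    entries age out after forward_delay seconds instead of the normal value.
    """
    return forward_delay if tc_seen else normal_aging


if __name__ == "__main__":
    for value in (0x00, 0x01, 0x80):
        flags = decode_flags(value)
        print(f"{value:08b} -> {flags}, aging {aging_time(flags['tc'])} s")
```

Because no bridge verifies who set the bit, a single BPDU with the low-order bit set is enough to shorten the aging time in this model.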
Figure 3-4 TC Bit Plays a Crucial Role
Suppose traffic flows between PC A and PC B through switches 1, 2, 3, and 4, and all forwarding tables are correctly populated, with switch 1 pointing to switch 2 to reach B. Now, the link between switches 2 and 3 fails. As a result, switch 4 removes the link to switch 1 from its blocked mode and puts it in forwarding. Traffic from A arrives on switch 1, only to be sent to switch 2. Indeed, nobody told switch 1 that it should use switch 4 to reach B. Naturally, this creates a temporary traffic "black hole." In this particular case, relying on the usual forwarding-table aging time alone is not sufficient. Thanks to the TCN/TC-ACK bits, however, switch 1's forwarding table can age out faster and soon point to the correct switch 1-to-4 link to reach B.
Many vendors have augmented the original 802.1D and 802.1w specs to provide a per-VLAN 802.1D or 802.1w for better flexibility in network design. Cisco's own proprietary version of 802.1D and 802.1w is called per-VLAN (rapid) spanning-tree plus (PVST+). Other than a Cisco-specific destination MAC address and a Subnetwork Access Protocol (SNAP) frame header, the BPDU payload contains exactly the same information as a regular 802.1D or 802.1w BPDU, as Table 3-2 shows.
Table 3-2. Cisco PVST+ BPDU in VLAN 10
| Field | Value | Explanation |
| --- | --- | --- |
| DMAC | 01 00 0c cc cc cd | Cisco SSTP BPDU MAC |
| SMAC | 00 02 fc 90 08 38 | Port MAC |
| PROTOCOL TYPE IDENTIFIER | 81 00 | 802.1Q Ethertype |
| TAG CONTROL INFO | 00 0a | COS and VLAN ID (VLAN 10) |
| LENGTH | 00 32 | |
| 802.2 Logical Link Control HEADER | | |
| DSAP | aa | Indicates SNAP encap |
| SSAP | aa | |
| UI | 03 | |
| SNAP HEADER | | |
| VENDOR ID | 00 00 0c | Cisco Systems |
| TYPE | 01 0b | SSTP |
| PROTOCOL | 00 00 | |
| PROTOCOL VERSION | 00 | |
| BPDU TYPE | 00 | |
| BPDU FLAGS | 00 | |
| ROOT ID | 20 00 00 d0 00 66 2c 0a | |
| PATH COST | 00 00 00 00 | |
| BRIDGE ID | 20 00 00 d0 00 66 2c 0a | Bridge ID in VLAN 10 |
| PORT | 81 41 | |
| MESSAGE AGE | 00 00 | |
| MAXIMUM AGE | 14 00 | |
| ROOT HELLO TIME | 02 00 | |
| ROOT FORWARD DELAY | 0f 00 | |
| VLAN ID Type Length Value | | |
| PAD | 34 | |
| TYPE | 00 00 | |
| LENGTH | 00 02 | |
| VLAN ID | 00 0a | VLAN 10 |
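The two VLAN-related fields in Table 3-2 can be decoded as follows. This is a sketch under assumptions: the function names are invented, the 802.1Q tag control information is split using the standard 3-bit priority, 1-bit flag, and 12-bit VLAN ID layout, and the trailing field is treated as a generic type/length/value triplet using only the bytes shown in the table (the PAD byte is skipped).

```python
import struct


def parse_tci(tci_bytes):
    """Split the 2-byte 802.1Q tag control information field."""
    (tci,) = struct.unpack("!H", tci_bytes)
    return {
        "priority": tci >> 13,       # 3-bit class of service
        "dei": (tci >> 12) & 0x1,    # 1-bit drop eligible / CFI flag
        "vlan_id": tci & 0x0FFF,     # 12-bit VLAN ID
    }


def parse_vlan_tlv(tlv_bytes):
    """Decode a type/length/value triplet carrying a VLAN ID."""
    tlv_type, length = struct.unpack("!HH", tlv_bytes[:4])
    value = tlv_bytes[4:4 + length]
    if len(value) != length:
        raise ValueError("TLV is truncated")
    return {"type": tlv_type, "length": length,
            "vlan_id": int.from_bytes(value, "big")}


if __name__ == "__main__":
    tag = parse_tci(bytes.fromhex("000a"))            # TAG CONTROL INFO
    tlv = parse_vlan_tlv(bytes.fromhex("00000002000a"))  # TYPE, LENGTH, VLAN ID
    print("802.1Q tag:", tag)
    print("trailing TLV:", tlv)
    print("VLAN IDs agree:", tag["vlan_id"] == tlv["vlan_id"])
```

Both fields decode to VLAN 10, so the VLAN is announced twice in the same frame: once in the 802.1Q tag and once in the trailing TLV.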