Operation of BGP
The section “BGP Basics” in Chapter 1 introduced you to the fundamental facts about BGP. To recap:
Unique among the common IP routing protocols, BGP sends only unicast messages and forms a separate point-to-point connection with each of its peers.
BGP is an application layer protocol using TCP (port 179) for this point-to-point connection and relies on the inherent properties of TCP for session maintenance functions such as acknowledgment, retransmission, and sequencing.
BGP is a vector protocol, although it is called a path vector rather than a distance vector protocol because it sees the route to a destination as a path through a series of autonomous systems rather than as a series of router hops.
A BGP route describes the path vector using a route attribute called the AS_PATH, which sequentially lists the autonomous system numbers comprising the path to the destination.
The AS_PATH attribute is a shortest path determinant. Given multiple routes to the same destination, the route with an AS_PATH listing the fewest AS numbers is assumed to be the shortest path.
The AS numbers on the AS_PATH list are used for loop detection; a router receiving a BGP route with its own AS number in the AS_PATH assumes a loop and discards the route.
If a router has a BGP session to a neighbor with a different AS number, the session is called external BGP (EBGP); if the neighbor has the same AS number as the router, the session is called internal BGP (IBGP). The neighbors are called, respectively, external or internal neighbors.
This chapter builds on these basic facts to describe the operation of BGP.
BGP Message Types
Before establishing a BGP peer connection, the two neighbors must perform the standard TCP three-way handshake and open a TCP connection to port 179. TCP provides the fragmentation, retransmission, acknowledgment, and sequencing functions necessary for a reliable connection, relieving BGP of those duties. All BGP messages are unicast to the one neighbor over the TCP connection.
BGP uses four basic message types:
Open
Keepalive
Update
Notification
This section describes how these messages are used; for a complete description of the message formats and the variables of each message field, see the section “BGP Message Formats.”
Open Message
After the TCP session is established, both neighbors send Open messages. Each neighbor uses this message to identify itself and to specify its BGP operational parameters. The Open message includes the following information:
BGP version number: This specifies the version (2, 3, or 4) of BGP that the originator is running; the IOS default is BGP-4. Prior to IOS 12.0(6)T, IOS would autonegotiate the version: If a neighbor is running an earlier version of BGP, it rejects the Open message specifying version 4; the BGP-4 router then changes to BGP-3 and sends another Open message specifying this version. If the neighbor rejects that message, an Open specifying version 2 is sent. BGP-4 has now become so prevalent that as of 12.0(6)T IOS no longer autonegotiates, but you can still configure a session to speak to a neighbor running version 2 or 3 with neighbor version.
Autonomous system number: This is the AS number of the originating router. It determines whether the BGP session is EBGP (if the AS numbers of the neighbors differ) or IBGP (if the AS numbers are the same).
Hold time: This is the maximum number of seconds that can elapse before the router must receive either a Keepalive or an Update message. The hold time must be either 0 seconds (in which case, keepalives must not be sent) or at least 3 seconds; the default IOS hold time is 180 seconds. If the neighbors’ hold times differ, the smaller of the two times becomes the accepted hold time. The default hold time can be changed for the entire BGP process with the configuration statement timers bgp or for a specific neighbor or peer group with neighbor timers.
BGP identifier: This is an IPv4 address that identifies the neighbor. IOS determines the BGP identifier in exactly the same way as it determines the OSPF router ID: The numerically highest loopback address is used; if no loopback interface is configured with an IP address, the numerically highest IP address on a physical interface is selected. Alternatively, you can manually specify the BGP identifier with bgp router-id.
Optional parameters: This field is used to advertise support for such optional capabilities as authentication, multiprotocol support, and route refresh.
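The selection order for the BGP identifier described above can be sketched in a few lines of Python. This is only an illustration of the rule, not IOS code; the function name and its arguments are hypothetical, and the interface addresses are supplied by the caller as plain strings.

```python
import ipaddress

def select_bgp_identifier(loopbacks, physical, configured=None):
    """Pick a BGP identifier using the order described in the text.

    loopbacks and physical are lists of IPv4 address strings
    (hypothetical inputs); configured models a manual bgp router-id.
    """
    if configured is not None:
        # A manually specified identifier takes precedence.
        return ipaddress.IPv4Address(configured)
    # Loopback addresses are preferred whenever at least one exists.
    candidates = loopbacks or physical
    if not candidates:
        raise ValueError("no IPv4 address available for the BGP identifier")
    # "Numerically highest" means comparing addresses as 32-bit numbers,
    # not as text strings.
    return max(ipaddress.IPv4Address(a) for a in candidates)
```

Comparing the addresses as IPv4Address objects matters: as plain strings, "9.0.0.1" would sort above "10.0.0.1", which is not the numeric order the rule calls for.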
Keepalive Message
If a router accepts the parameters specified in its neighbor’s Open message, it responds with a Keepalive. Subsequent Keepalives are sent every 60 seconds by IOS default, or at an interval equal to one-third of the agreed-upon hold time. Like the hold time, the keepalive interval can be changed for the entire BGP process with timers bgp or on a per-neighbor or per-peer-group basis with neighbor timers.
Note that although BGP offloads several reliability functions to the underlying TCP session, it does use its own keepalive rather than using the TCP keepalive.
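The hold time rules from the Open message description and the one-third keepalive relationship can be expressed as a short Python sketch. The function names are hypothetical, and the integer division is an assumption made only to keep the arithmetic simple.

```python
def negotiate_hold_time(local, remote):
    """Return the accepted hold time in seconds: the smaller of the two."""
    for value in (local, remote):
        # A hold time must be 0 or at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError(f"invalid hold time: {value}")
    return min(local, remote)

def keepalive_interval(hold_time):
    """Return one-third of the hold time, or None when keepalives are off."""
    if hold_time == 0:
        return None  # a hold time of 0 means keepalives must not be sent
    return hold_time // 3

agreed = negotiate_hold_time(180, 90)   # neighbors propose 180 and 90
interval = keepalive_interval(agreed)   # one-third of the agreed value
```

With the IOS default hold time of 180 seconds on both sides, the same arithmetic yields the default 60-second keepalive interval.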
Update Message
The Update message advertises feasible routes, withdrawn routes, or both. The Update message includes the following information:
Network Layer Reachability Information (NLRI): This is one or more (Length, Prefix) tuples that advertise destination prefixes and their lengths. If 206.193.160.0/19 were advertised, for example, the Length portion would specify the /19 and the Prefix portion would specify 206.193.160. However, as discussed at the beginning of Chapter 3, “BGP and NLRI,” and covered more extensively in Chapter 6, “Multiprotocol BGP,” the NLRI can be more than just a unicast IPv4 prefix.
Path attributes: The path attributes, described in a later section of the same name, are characteristics of the advertised NLRI. The attributes provide the information that allows BGP to choose a shortest path, detect routing loops, and determine routing policy.
Withdrawn routes: These are (Length, Prefix) tuples describing destinations that have become unreachable and are being withdrawn from service.
Although multiple prefixes might be included in the NLRI field, each Update message describes only a single BGP path (because the path attributes describe only a single path, but that path might lead to multiple destinations). This, again, emphasizes that BGP takes a higher view of an internetwork than an IGP, whose routes always lead to a single destination IP address.
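To make the (Length, Prefix) tuple concrete, the following Python sketch encodes and decodes a single unicast IPv4 prefix. It assumes the standard BGP-4 encoding, in which a one-octet length in bits is followed by only as many prefix octets as are needed to hold those bits. The function names are hypothetical, and other NLRI types are not handled.

```python
import ipaddress

def encode_nlri(prefix):
    """Encode one IPv4 prefix as a (Length, Prefix) tuple in wire form."""
    net = ipaddress.IPv4Network(prefix)
    n_octets = (net.prefixlen + 7) // 8  # minimum octets to hold the bits
    return bytes([net.prefixlen]) + net.network_address.packed[:n_octets]

def decode_nlri(data):
    """Decode one (Length, Prefix) tuple back into an IPv4Network."""
    length = data[0]
    n_octets = (length + 7) // 8
    # Pad the truncated prefix back out to four octets.
    padded = data[1:1 + n_octets] + bytes(4 - n_octets)
    return ipaddress.IPv4Network((int.from_bytes(padded, "big"), length))

encoded = encode_nlri("206.193.160.0/19")
```

For 206.193.160.0/19, the Length octet is 19 and the Prefix portion is the three octets 206, 193, and 160, matching the example in the text.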
Notification Message
The Notification message is sent whenever an error is detected and always causes the BGP connection to close. The section “BGP Message Formats” includes a list of possible errors that can cause a Notification message to be sent.
An example of the use of a Notification message is the negotiation of a BGP version between neighbors. If, after establishing a TCP connection, a BGP-4 speaker receives an Open message specifying version 3, the router responds with a Notification message stating that the version is not supported, and the session is closed.
BGP Finite State Machine
The stages of a BGP neighbor connection establishment and maintenance can be described in terms of a finite state machine. Figure 2-9 and Table 2-1 show the complete BGP finite state machine and the input events that can cause a state transition.
Figure 2-9 BGP Finite State Machine
Table 2-1 Input Events (IE) of Figure 2-9
IE | Description
1 | BGP Start
2 | BGP Stop
3 | BGP Transport connection open
4 | BGP Transport connection closed
5 | BGP Transport connection open failed
6 | BGP Transport fatal error
7 | ConnectRetry timer expired
8 | Hold timer expired
9 | Keepalive timer expired
10 | Receive Open message
11 | Receive Keepalive message
12 | Receive Update message
13 | Receive Notification message
The following sections provide a brief description of each of the six neighbor states, as shown in Figure 2-9.
Idle State
BGP always begins in the Idle state, in which it refuses all incoming connections. When a Start event (IE 1) occurs, the BGP process initializes all BGP resources, starts the ConnectRetry timer, initiates a TCP connection to the neighbor, listens for a TCP connection initiated by the neighbor, and changes its state to Connect. The Start event is caused by an operator configuring a BGP process or resetting an existing process, or by the router software resetting the BGP process.
An error causes the BGP process to transition to the Idle state. From there, the router may automatically try to issue another Start event. However, limitations should be imposed on how the router does this—constantly trying to restart in the event of persistent error conditions causes flapping. Therefore, after the first transition back to the Idle state, the router sets the ConnectRetry timer and cannot attempt to restart BGP until the timer expires. IOS’s initial ConnectRetry time is 120 seconds; this value cannot be changed. The ConnectRetry time for each subsequent attempt is twice the previous time, meaning that consecutive wait times increase exponentially.
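The doubling of the ConnectRetry time can be illustrated with a short Python sketch. The function name is hypothetical; the 120-second starting value is the IOS figure quoted above.

```python
def connect_retry_waits(attempts, initial=120):
    """Return the wait time in seconds before each successive restart.

    Each wait is twice the previous one, starting from the initial value.
    """
    return [initial * 2 ** i for i in range(attempts)]

waits = connect_retry_waits(4)
```

Four consecutive failures therefore produce waits of 120, 240, 480, and 960 seconds, which shows how quickly a persistently failing session backs off.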
Connect State
In this state, the BGP process is waiting for the TCP connection to the neighbor to be completed. If the TCP connection is successful, the BGP process clears the ConnectRetry timer, completes initialization, sends an Open message to the neighbor, and transitions to the OpenSent state. If the TCP connection is unsuccessful, the BGP process continues to listen for a connection to be initiated by the neighbor, resets the ConnectRetry timer, and transitions to the Active state.
If the ConnectRetry timer expires while in the Connect state, the timer is reset, another attempt is made to establish a TCP connection with the neighbor, and the process stays in the Connect state. Any other input event causes a transition to Idle.
Active State
In this state, the BGP process tries to initiate a TCP connection with the neighbor. If the TCP connection is successful, the BGP process clears the ConnectRetry timer, completes initialization, sends an Open message to the neighbor, and transitions to OpenSent. The IOS default Hold time is 180 seconds (3 minutes) and can be changed with the timers bgp statement.
If the ConnectRetry timer expires while BGP is in the Active state, the process transitions back to the Connect state and resets the ConnectRetry timer. It also initiates a TCP connection to the peer and continues to listen for connections from the peer. If the neighbor attempts to establish a TCP session from an unexpected IP address, the ConnectRetry timer is reset, the connection is refused, and the local process stays in the Active state. Any other input event (except a Start event, which is ignored in the Active state) causes a transition to Idle.
OpenSent State
In this state, an Open message has been sent, and BGP is waiting to hear an Open from its neighbor. When an Open message is received, all its fields are checked. If errors exist, a Notification message is sent and the state transitions to Idle.
If no errors exist in the received Open message, a Keepalive message is sent and the Keepalive timer is set. The Hold time is negotiated, and the smaller value is agreed upon. If the negotiated Hold time is zero, the Hold and Keepalive timers are not started. The peer connection is determined to be either internal or external, based on the peer’s AS number, and the state is changed to OpenConfirm.
If a TCP disconnect is received, the local process closes the BGP connection, resets the ConnectRetry timer, begins listening for a new connection to be initiated by the neighbor, and transitions to Active. Any other input event (except a Start event, which is ignored) causes a transition to Idle.
OpenConfirm State
In this state, the BGP process waits for a Keepalive or Notification message. If a Keepalive is received, the state transitions to Established. If a Notification is received, or a TCP disconnect is received, the state transitions to Idle.
If the Hold timer expires, an error is detected, or a Stop event occurs, a Notification is sent to the neighbor and the BGP connection is closed, changing the state to Idle.
Established State
In this state, the BGP peer connection is fully established and the peers can exchange Update, Keepalive, and Notification messages. If an Update or Keepalive message is received, the Hold timer is restarted (if the negotiated hold time is nonzero). If a Notification message is received, the state transitions to Idle. Any other event (again, except for the Start event, which is ignored) causes a Notification to be sent and the state to transition to Idle.
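The state descriptions above can be condensed into a small table-driven sketch in Python. It models only state changes, not the timers or the messages sent, and it makes three simplifying assumptions that go beyond the text: a received Open is taken to have passed its checks, a Start event is ignored in every state other than Idle, and expiry of the Keepalive timer leaves the state unchanged. All names are hypothetical; the event numbers follow Table 2-1.

```python
from enum import Enum

class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"

class Event(Enum):  # values are the IE numbers of Table 2-1
    START = 1
    STOP = 2
    TCP_OPEN = 3
    TCP_CLOSED = 4
    TCP_OPEN_FAILED = 5
    TCP_FATAL_ERROR = 6
    CONNECT_RETRY_EXPIRED = 7
    HOLD_TIMER_EXPIRED = 8
    KEEPALIVE_TIMER_EXPIRED = 9
    RECV_OPEN = 10
    RECV_KEEPALIVE = 11
    RECV_UPDATE = 12
    RECV_NOTIFICATION = 13

# Transitions that do not end in Idle; everything else falls back to Idle.
TRANSITIONS = {
    (State.IDLE, Event.START): State.CONNECT,
    (State.CONNECT, Event.TCP_OPEN): State.OPEN_SENT,
    (State.CONNECT, Event.TCP_OPEN_FAILED): State.ACTIVE,
    (State.CONNECT, Event.CONNECT_RETRY_EXPIRED): State.CONNECT,
    (State.ACTIVE, Event.TCP_OPEN): State.OPEN_SENT,
    (State.ACTIVE, Event.CONNECT_RETRY_EXPIRED): State.CONNECT,
    (State.OPEN_SENT, Event.RECV_OPEN): State.OPEN_CONFIRM,  # Open assumed valid
    (State.OPEN_SENT, Event.TCP_CLOSED): State.ACTIVE,
    (State.OPEN_CONFIRM, Event.RECV_KEEPALIVE): State.ESTABLISHED,
    (State.OPEN_CONFIRM, Event.KEEPALIVE_TIMER_EXPIRED): State.OPEN_CONFIRM,  # assumption
    (State.ESTABLISHED, Event.RECV_UPDATE): State.ESTABLISHED,
    (State.ESTABLISHED, Event.RECV_KEEPALIVE): State.ESTABLISHED,
    (State.ESTABLISHED, Event.KEEPALIVE_TIMER_EXPIRED): State.ESTABLISHED,  # assumption
}

def next_state(state, event):
    """Return the state that follows the given input event."""
    if event is Event.START and state is not State.IDLE:
        return state  # assumption: Start is ignored outside Idle
    return TRANSITIONS.get((state, event), State.IDLE)

def run(events, state=State.IDLE):
    """Feed a sequence of events through the machine."""
    for event in events:
        state = next_state(state, event)
    return state
```

Feeding the events Start, TCP open, Receive Open, and Receive Keepalive through run() walks a session from Idle to Established, and a subsequent Receive Notification returns it to Idle.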
Path Attributes
A path attribute is a characteristic of an advertised BGP route. Although the term is specific to BGP, the concept is not unfamiliar to you: Every route advertisement, no matter what the originating routing protocol, has attributes. For example, every route advertisement has information (an address prefix) representing some destination, some quantification (metric) of the destination enabling comparison to other routes to the same destination, and some directional information about the destination, such as a next-hop address. BGP routes have the same attributes you are familiar with from other protocols but can also include a number of other attributes that are designed to be manipulated for the creation and communication of routing policies.
Each path attribute falls into one of four categories:
Well-known mandatory
Well-known discretionary
Optional transitive
Optional nontransitive
First, an attribute is either well known, meaning that it must be recognized by all BGP implementations, or it is optional, meaning that the BGP implementation is not required to support the attribute.
Well-known attributes are either mandatory, meaning that they must be included in all BGP Update messages, or they are discretionary, meaning that they may or may not be sent in a specific Update message.
An optional attribute is either transitive, meaning that a BGP process should accept the Update in which it is included—even if the process doesn’t support the attribute—and should pass the attribute on to its peers, or it is nontransitive, meaning that a BGP process that does not recognize the attribute can quietly ignore the Update in which it is included and not advertise the path to its other peers. In simple terms, the attribute either can or cannot transit a router.
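The practical difference among the categories shows up when a router receives an attribute it does not recognize. The following Python sketch summarizes the three possible outcomes. The function name and the returned labels are hypothetical, and treating an unrecognized well-known attribute as an error is stated here as an assumption that follows from the requirement that every implementation recognize such attributes.

```python
def handle_unrecognized_attribute(optional, transitive):
    """Return what a router does with an attribute it does not recognize."""
    if not optional:
        # Well-known attributes must be recognized by all implementations,
        # so failing to recognize one is treated as an error (assumption).
        return "error"
    if transitive:
        # Optional transitive: accept the Update and pass the attribute on.
        return "accept-and-forward"
    # Optional nontransitive: quietly ignore it and do not pass it on.
    return "ignore"
```

Using the classes in Table 2-2 as a guide, an unrecognized optional transitive attribute such as COMMUNITIES would be passed along, whereas an unrecognized optional nontransitive attribute such as MULTI_EXIT_DISC would not.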
Table 2-2 lists the BGP path attributes. The three well-known mandatory attributes, because they are required to be in every BGP Update, are described in the following subsections. A Cisco-specific attribute called weight is also covered in this section. The other attributes are described within the context of their primary use as policy enablers (Chapter 4), for scaling (Chapter 5), or for carrying multiple NLRI types (Chapter 6).
Table 2-2 BGP Path Attributes
Attribute | Class | RFC | Application
ORIGIN | Well-known mandatory | 4271 | Policy
AS_PATH | Well-known mandatory | 4271 | Policy, loop detection
NEXT_HOP | Well-known mandatory | 4271 | Policy
LOCAL_PREF | Well-known discretionary | 4271 | Policy
ATOMIC_AGGREGATE | Well-known discretionary | 4271 | Address aggregation
AGGREGATOR | Optional transitive | 4271 | Address aggregation
COMMUNITIES | Optional transitive | 1997 | Scaling
EXTENDED COMMUNITY | Optional transitive | 4360 | Scaling
MULTI_EXIT_DISC (MED) | Optional nontransitive | 4271 | Policy
ORIGINATOR_ID | Optional nontransitive | 4456 | Scaling, loop detection, policy
CLUSTER_LIST | Optional nontransitive | 4456 | Scaling, loop detection, policy
AS4_PATH | Optional transitive | 6793 | Scaling, policy
AS4_AGGREGATOR | Optional transitive | 6793 | Scaling, address aggregation
Multiprotocol Reachable NLRI | Optional nontransitive | 4760 | Multiprotocol BGP
Multiprotocol Unreachable NLRI | Optional nontransitive | 4760 | Multiprotocol BGP
ORIGIN Attribute
ORIGIN is a well-known mandatory attribute that specifies the origin of the routing update. When BGP has multiple routes to the same destination, it uses the ORIGIN as one factor in determining the preferred route. It specifies one of the following origins:
IGP: The Network Layer Reachability Information (NLRI) was learned from a protocol internal to the originating AS. An IGP origin gets the highest preference of the ORIGIN values. IOS gives BGP routes an origin of IGP if they are learned from an IGP routing table via the BGP network statement, as described in Chapter 3.
EGP: The NLRI was learned from the Exterior Gateway Protocol. EGP is preferred second to IGP. Because EGP is obsolete, you should never encounter this origin type; it’s an artifact of the days when we transitioned from EGP to BGP.
Incomplete: The NLRI was learned by some other means. Incomplete is the lowest-preferred ORIGIN value. Incomplete does not imply that the route is in any way faulty, only that the information for determining the origin of the route is incomplete. Routes that BGP learns through redistribution carry the incomplete origin attribute because there is no way to determine the original source of the route.
Although ORIGIN is still a mandatory part of the BGP standard, it was created to help—as the second of the three possible origins indicates—with the transition from EGP to BGP. It might have some limited use in a few “corner case” policy configurations but for the most part should be considered a legacy attribute.
AS_PATH Attribute
AS_PATH is a well-known mandatory attribute that uses a sequence of AS numbers to describe the inter-AS path, or AS-level route, to the destination specified by the NLRI. When an AS originates a route—when it advertises NLRI about a destination within its own AS to an external neighbor—it adds its AS number to the AS_PATH. As subsequent BGP speakers advertise the route to external peers, they prepend their own AS numbers to the AS_PATH (see Figure 2-10). The result is that the AS_PATH describes all the autonomous systems it has passed through, beginning with the most recent AS and ending with the originating AS.
Figure 2-10 AS Numbers Are Prepended (Added to the Front of) the AS_PATH List
Note that a BGP router adds its AS number to the AS_PATH only when an Update is sent to a neighbor in another AS. That is, an AS number is prepended to the AS_PATH only when the route is advertised between EBGP peers. If the route is advertised between IBGP peers—peers within the same autonomous system—no AS number is added.
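The rule that an AS number is prepended only on EBGP advertisements can be sketched in Python as follows. The function name is hypothetical, and the AS_PATH is modeled as a simple list of AS numbers with the most recent AS first.

```python
def as_path_for_advertisement(as_path, local_as, peer_as):
    """Return the AS_PATH to place in an Update sent to the given peer."""
    if peer_as == local_as:
        # IBGP: the path is passed along unchanged.
        return list(as_path)
    # EBGP: the local AS number is added to the front of the list.
    return [local_as] + list(as_path)

# AS100 originates a route and advertises it to AS300,
# which then advertises it to AS500.
at_as300 = as_path_for_advertisement([], local_as=100, peer_as=300)
at_as500 = as_path_for_advertisement(at_as300, local_as=300, peer_as=500)
```

The resulting list at AS500 is (300, 100): the most recent AS is first and the originating AS is last.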
Usually, having multiple instances of the same AS number on the list would make no sense and would defeat the purpose of the AS_PATH attribute. In one case, however, adding multiple instances of a particular AS number to the AS_PATH proves useful. Remember that outgoing route advertisements directly influence incoming traffic. Normally, the route from AS500 to AS100 in Figure 2-10 passes through AS300 because the AS_PATH of that route is shorter (that is, lists fewer AS numbers). But what if the link to AS200 is AS100’s preferred path for incoming traffic? The links along the (400,200,100) path might all be 10G, for example, whereas the links along the (300,100) path are only 1G. Or perhaps AS200 is the primary provider, and AS300 is only the backup provider. Outgoing traffic is sent to AS200, so it is desired that incoming traffic follow the same path.
AS100 can influence its incoming traffic by changing the AS_PATH of its advertised route (Figure 2-11). By adding multiple instances of its own AS number to the list sent to AS300, AS100 can make routers at AS500 think that the (400,200,100) path is the shorter path. This procedure of adding extra AS numbers to the AS_PATH is called AS path prepending.
Figure 2-11 AS100 Has Prepended Two Extra Instances of Its AS Number to the AS_PATH Advertised to AS300, to Influence the Path Choice Made at AS500
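The effect of prepending on the choice made at AS500 can be checked with a short Python sketch. It assumes that AS_PATH length is the only deciding factor, which holds only when all earlier steps of the decision process are tied; the function name is hypothetical.

```python
def shortest_as_path(paths):
    """Return the path that lists the fewest AS numbers."""
    return min(paths, key=len)

via_as400 = [400, 200, 100]

# Without prepending, the route through AS300 is shorter.
normal_choice = shortest_as_path([[300, 100], via_as400])

# AS100 prepends two extra instances of its AS number toward AS300.
prepended_choice = shortest_as_path([[300, 100, 100, 100], via_as400])
```

After prepending, the path through AS300 lists four AS numbers against three for the path through AS400 and AS200, so AS500 prefers the latter.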
The AS_PATH attribute has been presented so far as consisting of an ordered sequence of AS numbers that describes the path to a particular destination. There are actually two types of AS_PATH:
AS_SEQUENCE: This is the ordered list of AS numbers, as previously described.
AS_SET: This is an unordered list of the AS numbers along a path to a destination.
These two types are distinguished in the AS_PATH attribute with a type code, as described in the section “BGP Message Formats.”
Recall that the second function of the AS_PATH is loop prevention. If a BGP speaker sees its own AS number in a received route from an external peer, it knows that a loop has occurred and ignores the route. When aggregation is performed, however, some AS_PATH detail is lost. For example, AS3113 in Figure 2-12 is aggregating the prefixes advertised by AS225, AS237, and AS810. Because AS3113 originates the aggregate prefix, the AS_PATH associated with it contains only that AS number. As a result, the potential for a loop increases.
Figure 2-12 The Aggregation at AS3113 Causes a Loss of the AS_PATH Information of the Aggregate’s Constituent Prefixes
Suppose, for example, AS810 has an alternative connection to another AS, as shown in Figure 2-13. The aggregate from AS3113 is advertised to AS6571 and from there back to AS810.
Because the AS numbers “behind” the aggregation point are not included in the AS_PATH, AS810 does not detect the potential loop. Next, suppose a network within AS810, such as 206.25.225.0/24, fails. The routers within that AS match the aggregate route from AS6571, and a loop occurs.
If you think about it, the loop-prevention function of the AS_PATH does not require that the AS numbers be included in any particular order. All that is necessary is that a receiving router recognize whether its AS number is a part of the AS_PATH. This is where AS_SET comes in.
When a BGP speaker creates an aggregate from NLRI learned from other autonomous systems, it can include all those AS numbers in the AS_PATH as an AS_SET. For example, Figure 2-14 shows the network of Figure 2-12 with an AS_SET added to the aggregate route.
The aggregating router still begins an AS_SEQUENCE, so receiving routers can trace the path back to the aggregator, but an AS_SET is included to prevent routing loops. In this example, you also can see why the AS_SET is an unordered list. Behind the aggregator in AS3113 are branching paths to the autonomous systems in which the aggregated routes reside. There is no way for an ordered list to describe these separate paths.
Figure 2-13 The Loss of AS_PATH Information at Aggregation Points Weakens the AS_PATH Loop Avoidance Function
Figure 2-14 Including an AS_SET in the AS_PATH of an Aggregate Route Restores the Loop Avoidance That Was Lost in the Aggregation
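Because loop detection needs only a membership test, it can be sketched in a few lines of Python. The function name is hypothetical, and the AS_PATH is modeled as an ordered list (the AS_SEQUENCE) plus an unordered set (the AS_SET), using the AS numbers from the figures.

```python
def loop_detected(local_as, as_sequence, as_set=frozenset()):
    """Return True if the local AS number appears anywhere in the AS_PATH."""
    return local_as in as_sequence or local_as in as_set

# The aggregate from AS3113 as AS810 receives it from AS6571.
sequence = [6571, 3113]

# Without an AS_SET, AS810 cannot tell that the aggregate covers its prefixes.
without_set = loop_detected(810, sequence)

# With the constituent AS numbers carried as an AS_SET, the loop is detected.
with_set = loop_detected(810, sequence, as_set={225, 237, 810})
```

The order of the numbers in the set plays no part in the test, which is why an unordered list is sufficient for this purpose.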
AS_SET involves a trade-off. You already understand that one of the advantages of route summarization is route stability. If a network that belongs to the aggregate fails, the failure is not advertised beyond the aggregation point. But if an AS_SET is included with the aggregate’s AS_PATH, this stability is reduced. If the link to AS225 in Figure 2-14 fails, for example, the AS_SET changes; this change must be advertised beyond the aggregation point. However, the visibility of constituent AS numbers associated with an aggregate route is much less of a concern than the visibility of many prefixes behind an aggregate.
As it turns out, AS_SET is seldom used in the public Internet at aggregation points. Given the potential instability discussed in the previous paragraph, plus the potential for accidentally including private AS numbers in the AS_SET, and other complexities, RFC 6472 recommends that AS_SET not be used except where a few “corner cases” might justify it. Although AS_SET is still supported by most Internet-grade BGP implementations, including Cisco IOS, RFC 6472 suggests that a future update of BGP might remove AS_SET support.
NEXT_HOP Attribute
As the name implies, this well-known mandatory attribute describes the IP address of the next-hop router on the path to the advertised destination. Unlike typical IGPs, however, the IP address described by the BGP NEXT_HOP attribute is not always the address of a neighboring router. The following rules apply:
If the advertising router and receiving router are in different autonomous systems (external peers), the NEXT_HOP is the IP address of the advertising router’s interface.
If the advertising router and the receiving router are in the same AS (internal peers), and the Update’s NLRI refers to a destination within the same AS, the NEXT_HOP is an IP address belonging to the neighbor that advertised the route.
If the advertising router and the receiving router are internal peers and the Update’s NLRI refers to a destination in a different AS, the NEXT_HOP is the IP address of the external peer from which the route was learned.
Figure 2-15 illustrates the first rule. Here, the advertising router and receiving router are in different autonomous systems. The NEXT_HOP is the interface address of the external peer. So far, this behavior is the same as would be expected of any routing protocol.
Figure 2-15 If a BGP Update Is Advertised via EBGP, the NEXT_HOP Attribute Is the IP Address of the External Peer
Figure 2-16 illustrates the second rule. This time, the advertising router and the receiving router are in the same AS, and the destination advertised is also in the AS. The NEXT_HOP associated with the NLRI is the IP address of the originating router.
Figure 2-16 If a BGP Update Is Advertised via IBGP and the Advertised Destination Is in the Same AS, the NEXT_HOP Attribute Is the IP Address of the Originating Router
The advertising router and the receiving router do not share a common data link, but the IBGP TCP connection is passed through an IGP-speaking router. The receiving router must perform a recursive route lookup (recursive lookups are discussed in Routing TCP/IP, Volume I) to send a packet to the advertised destination. For example, suppose the router at 172.16.101.2 in Figure 2-16 must forward a packet with a destination address of 172.16.5.30. It looks up the destination and matches the prefix 172.16.5.0/24; that route indicates a next hop of 172.16.83.2. Because that IP address does not belong to one of the router’s directly connected subnets, the router must then look up the route to 172.16.83.2. That route, learned via the IGP, indicates a next hop of 172.16.101.1. The packet can now be forwarded. This example is important for understanding the dependency of IBGP on the IGP.
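The recursive lookup in this example can be sketched in Python. The routing table is modeled as a dictionary that maps prefixes to next-hop addresses, with None marking a directly connected subnet. The /24 lengths of the 172.16.83.0 and 172.16.101.0 entries are assumptions made for illustration, as are the function names.

```python
import ipaddress

def lookup(table, address):
    """Longest-prefix match; returns the next hop (None if connected)."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        raise LookupError(f"no route to {address}")
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

def resolve(table, destination, max_depth=8):
    """Follow next hops until one lies on a directly connected subnet."""
    target = destination
    for _ in range(max_depth):
        hop = lookup(table, target)
        if hop is None:
            return target  # target is reachable on a connected subnet
        target = hop       # look up the route to the next hop itself
    raise RuntimeError("recursion limit reached")

table = {
    ipaddress.ip_network("172.16.5.0/24"): "172.16.83.2",    # learned via IBGP
    ipaddress.ip_network("172.16.83.0/24"): "172.16.101.1",  # learned via IGP (assumed /24)
    ipaddress.ip_network("172.16.101.0/24"): None,           # connected (assumed /24)
}

forward_to = resolve(table, "172.16.5.30")
```

The first lookup returns 172.16.83.2, the second returns 172.16.101.1, and the third finds that 172.16.101.1 is on a connected subnet, so the packet is forwarded to that address. If the IGP route were missing, the second lookup would fail, which is the dependency of IBGP on the IGP.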
Figure 2-17 illustrates the third rule. Here, a route has been learned via EBGP and is then passed to an internal peer. Because the destination is in a different AS, the NEXT_HOP of the route passed across the IBGP connection is the interface of the external router from which the route was learned.
Figure 2-17 If a BGP Update Is Advertised via IBGP and the Advertised Destination Is in a Different AS, the NEXT_HOP Attribute Is the IP Address of the External Peer from Which the Route Was Learned
In Figure 2-17, the IBGP peer must perform a recursive route lookup to forward a packet to 207.135.64.0/19. However, a potential problem exists. The subnet 192.168.5.0, to which the next-hop address belongs, is not part of AS509. Unless the AS border router advertises it into AS509, the IGP—and hence the internal peers—will not know about this subnet. And if the subnet is not in the routing tables, the next-hop address for 207.135.64.0/19 is unreachable, and packets for that destination are dropped. Actually, although the route to 207.135.64.0/19 is installed in the internal peer’s BGP table, it is not installed in the routing table because the next-hop address is invalid for that router.
One solution to the problem is, of course, to ensure that the external subnet linking the two autonomous systems is known to the internal routers. Although you could use static routes, the practical method is to run the IGP in passive mode on the external interfaces. In some cases, this might be undesirable. An alternative solution—and the solution that is considered best practice—is to use a configuration option called next-hop-self to cause the AS border router in AS509 to set its own IP address in the NEXT_HOP attribute, in place of the IP address of the external peer. The internal peers would then have a next-hop router address of 172.16.83.2, which is known to the IGP. The configuration of next-hop-self is covered in Chapter 3.
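The three NEXT_HOP rules, together with the effect of next-hop-self, can be summarized in a Python sketch. The function and parameter names are hypothetical, as are the external AS number 65001 and the external peer address 192.168.5.1, which is simply an address chosen from the 192.168.5.0 subnet for illustration.

```python
def next_hop(advertiser_as, receiver_as, advertiser_addr,
             external_peer_addr=None, next_hop_self=False):
    """Return the NEXT_HOP carried in an Update.

    external_peer_addr is set only when the route was learned from an
    external peer, meaning the destination lies in a different AS.
    """
    if advertiser_as != receiver_as:
        return advertiser_addr      # rule 1: external peers
    if external_peer_addr is None:
        return advertiser_addr      # rule 2: internal peers, internal destination
    if next_hop_self:
        return advertiser_addr      # border router substitutes its own address
    return external_peer_addr       # rule 3: internal peers, external destination

# The border router in AS509 passes an externally learned route to an internal peer.
default_hop = next_hop(509, 509, "172.16.83.2", external_peer_addr="192.168.5.1")
self_hop = next_hop(509, 509, "172.16.83.2", external_peer_addr="192.168.5.1",
                    next_hop_self=True)
```

Without next-hop-self, the internal peer receives a next hop on the external subnet; with it, the internal peer receives 172.16.83.2, which the IGP already knows how to reach.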
Weight
Weight is a Cisco-specific BGP path attribute that applies only to routes within an individual router. It is not communicated to other routers. The weight is a number between 0 and 65,535 that can be assigned to a route; the higher the weight, the more preferable the route. When choosing a best path, the BGP decision process considers weight above all other route characteristics except specificity. By default, all routes learned from a peer have a weight of 0, and all routes generated by the local router have a weight of 32,768.
Weights can be set for individual routes, or for routes learned from a specific neighbor. For example, peer A and peer B might be advertising the same routes to a BGP speaker. By assigning a higher weight to the routes received from peer A, the BGP speaker prefers the routes through that peer. This preference is entirely local to the single router; weights are not included in the BGP updates or in any other way communicated to the BGP speaker’s peers. Accordingly, weights are valuable for influencing the routing decisions of a single router without changing the routing decisions of any other router.
Weights are useful when you want one BGP router in an AS to treat some prefixes differently from the way other routers in the AS treat the same prefixes. But this can also be dangerous. Because weight affects the BGP decision process in only a single router, you should carefully consider the implications of using it. Misuse can easily lead to unexpected or inconsistent routing results such as loops.
BGP Decision Process
The BGP Routing Information Base (RIB) consists of three parts:
Adj-RIBs-In: Stores unprocessed routing information that has been learned from BGP Updates received from peers. The routes contained in Adj-RIBs-In are considered feasible routes.
Loc-RIB: Contains the routes that the BGP speaker has selected by applying the decision process to the routes contained in Adj-RIBs-In. These routes populate the routing table (RIB) along with routes discovered by other routing protocols.
Adj-RIBs-Out: Contains the routes that the BGP speaker advertises to its peers in BGP Updates. The outgoing routing policies determine what routes are placed in Adj-RIBs-Out.
These three parts of the RIB may be three distinct databases, or the RIB may be a single database with pointers to distinguish the three parts.
The BGP decision process selects routes by applying incoming routing policies to the routes in the Adj-RIBs-In and by entering the selected or modified routes into the Loc-RIB. The decision process entails three phases:
Phase 1 calculates the degree of preference for each feasible route in the Adj-RIBs-In. It is invoked whenever a router receives a BGP Update from a peer in a neighboring AS containing a new route, a changed route, or a withdrawn route. Each route is considered separately, and a nonnegative integer is derived that indicates the degree of preference for that route.
Phase 2 chooses the best route out of all the available routes to a particular destination and installs the route in the Loc-RIB. It is invoked only after phase 1 has been completed. Loops are also detected in Phase 2 by examining the AS_PATH. Any routes with the local AS number in the AS_PATH are dropped.
Phase 3 adds the appropriate routes to the Adj-RIBs-Out for advertisement to peers. It is invoked after the Loc-RIB has changed, and only after phase 2 has been completed. Route aggregation, if it is to be performed, happens during this phase.
Barring a routing policy that dictates otherwise, phase 2 always selects the most specific route to a particular destination out of all feasible routes to that destination. It is important to note that if the address specified by the route’s NEXT_HOP attribute is unreachable, the route is not selected. This has particular ramifications for IBGP: A route cannot be selected if it is not “synchronized” with the IGP (refer to Chapter 3).
You should have an appreciation by now of the multiple attributes that can be assigned to a BGP route to enforce routing policy within a single router, to internal peers, to adjacent autonomous systems, and beyond. A sequential set of rules is needed for considering these attributes as tie-breakers when a router must select among multiple, equally specific routes to the same destination. This set of rules is the BGP decision process. The decision process used by IOS is as follows:4
Prefer the route with the highest weight. This is an IOS-specific function, as described in the previous section.
If the weights are equal, prefer the route with the highest LOCAL_PREF value.
If the LOCAL_PREF values are the same, prefer the route that was originated locally on the router and injected into BGP with the network or aggregate statement or through redistribution. That is, prefer a route that was learned from an IGP or from a direct connection on the same router. Note that a route injected into BGP via the network statement or redistribution is preferred over a local aggregate injected by the aggregate-address statement. All these means of injecting prefixes are covered in Chapter 4.
If the LOCAL_PREF is the same and no route was locally originated, prefer the route with the shortest AS_PATH.
If the AS_PATH length is the same, prefer the path with the lowest ORIGIN code. IGP is lower than EGP, which is lower than Incomplete.
If the ORIGIN codes are the same, prefer the route with the lowest MED (MULTI_EXIT_DISC) value. By default, this comparison is done only if the AS number is the same for all the routes being considered.5
If the MED is the same, prefer EBGP routes over Confederation EBGP routes, and prefer Confederation EBGP routes over IBGP routes.
If the routes are still equal, prefer the route with the shortest path to the BGP NEXT_HOP. This is the route with the lowest IGP metric to the next-hop address.
If the routes are still equal, are from the same neighboring AS, and BGP multipath is enabled with the maximum-paths statement, install all the equal-cost routes in the Loc-RIB.
If the routes are still equal and are external, prefer the path that was received first. This helps reduce flapping by not allowing a newer route to take precedence over an older one. If the bgp bestpath compare-routerid statement is enabled, this step is skipped.
If multipath is not enabled, prefer the route with the lowest BGP router ID, or if route reflection (Chapter 5) is used, prefer the route with the lowest ORIGINATOR_ID.
If the routes are still equal and route reflection is used, prefer the route with the shortest CLUSTER_LIST.
If the routes are still equal, prefer the route advertised from the neighbor with the lowest IP address.
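Because the rules are applied strictly in order, most of them can be modeled as a single sort key. The following Python sketch is a simplification, not the IOS implementation: the Route class and its field names are hypothetical, MED is compared unconditionally rather than only between routes from the same AS, and the multipath step, the oldest-external-path step, and the ORIGINATOR_ID substitution are left out.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

# ORIGIN codes: IGP is lower than EGP, which is lower than Incomplete.
ORIGIN_IGP, ORIGIN_EGP, ORIGIN_INCOMPLETE = 0, 1, 2
# Hypothetical ranking of session types (lower is preferred).
EBGP, CONFED_EBGP, IBGP = 0, 1, 2


@dataclass(frozen=True)
class Route:
    """Hypothetical container for the attributes the tie-breakers examine."""
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: int = ORIGIN_IGP
    med: int = 0
    session: int = EBGP
    igp_metric: int = 0
    router_id: str = "0.0.0.0"
    cluster_list_len: int = 0
    neighbor: str = "0.0.0.0"


def preference_key(route):
    """Return a tuple in which the smallest value identifies the best route."""
    return (
        -route.weight,                 # highest weight wins
        -route.local_pref,             # highest LOCAL_PREF wins
        not route.locally_originated,  # locally originated wins
        len(route.as_path),            # shortest AS_PATH wins
        route.origin,                  # lowest ORIGIN code wins
        route.med,                     # lowest MED wins (simplified)
        route.session,                 # EBGP, then confederation EBGP, then IBGP
        route.igp_metric,              # closest NEXT_HOP wins
        int(IPv4Address(route.router_id)),
        route.cluster_list_len,        # shortest CLUSTER_LIST wins
        int(IPv4Address(route.neighbor)),
    )


def best_route(routes):
    """Pick the winner among equally specific routes to one destination."""
    return min(routes, key=preference_key)
```

Python compares tuples element by element, so a later attribute is consulted only when every earlier one is equal, which matches the tie-breaker behavior described in the list.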
BGP Message Formats
BGP messages are carried within TCP segments using TCP port 179. The maximum message size is 4096 octets, and the minimum size is 19 octets. All BGP messages have a common header (Figure 2-18). Depending on the message type, a data portion might or might not follow the header.
Figure 2-18 BGP Message Header
Marker is a 16-octet field that was intended to detect loss of synchronization between BGP peers and to authenticate messages when authentication is supported. However, its use is deprecated in RFC 4271, and modern BGP implementations set the field to all ones in all cases; it continues to exist in the message header for backward compatibility with older implementations.
Length is a 2-octet field that indicates the total length of the message, including the header, in octets.
Type is a 1-octet field specifying the message type. Table 2-3 indicates the possible type codes.
Table 2-3 BGP Type Codes
Code | Type
1 | Open
2 | Update
3 | Notification
4 | Keepalive
5 | Route Refresh (covered in Chapter 4)
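As an illustration of the common header, the following Python sketch packs and parses the 19-octet Marker, Length, and Type fields. The function names are hypothetical, and the checks shown are only the ones implied by the field descriptions (all-ones marker, length between 19 and 4096 octets).

```python
import struct

HEADER_LEN = 19         # minimum BGP message size, in octets
MAX_MESSAGE_LEN = 4096  # maximum BGP message size, in octets
MARKER = b"\xff" * 16   # 16-octet marker set to all ones
TYPE_NAMES = {
    1: "Open",
    2: "Update",
    3: "Notification",
    4: "Keepalive",
    5: "Route Refresh",
}


def pack_header(msg_type, body=b""):
    """Prefix a message body with Marker, Length, and Type."""
    length = HEADER_LEN + len(body)
    if length > MAX_MESSAGE_LEN:
        raise ValueError("message exceeds 4096 octets")
    return MARKER + struct.pack("!HB", length, msg_type) + body


def parse_header(data):
    """Return (total length, type name) from the first 19 octets."""
    if len(data) < HEADER_LEN:
        raise ValueError("need at least 19 octets")
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if marker != MARKER:
        raise ValueError("marker is not all ones")
    if not HEADER_LEN <= length <= MAX_MESSAGE_LEN:
        raise ValueError("bad message length")
    return length, TYPE_NAMES.get(msg_type, "Unknown")
```

The "!" in the struct format selects network byte order with no padding, so the three fields occupy exactly 16 + 2 + 1 = 19 octets.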
Open Message
The Open message, whose format is shown in Figure 2-19, is the first message sent after a TCP connection has been established. If a received Open message is acceptable, a Keepalive message is sent to confirm the Open. After the Open has been confirmed, the BGP connection is in the Established state and Update, Keepalive, and Notification messages can be sent.
Figure 2-19 BGP Open Message Format
The minimum length of the Open message including the BGP message header is 29 octets.
The BGP Open message contains the following fields:
Version: A 1-octet field specifying the BGP version running on the originator.
My Autonomous System: A 2-octet field specifying the AS number of the originator.
Hold Time: A 2-octet number indicating the number of seconds the sender proposes for the hold time. A receiver compares the value of the Hold Time field and the value of its configured hold time and accepts the smaller value or rejects the connection. The hold time must be either 0 or at least 3 seconds.
BGP Identifier: The BGP router ID of the originator. Unless a router ID is specified in the BGP configuration, IOS sets its router ID as either the highest IP address of any of its loopback interfaces, or if no loopback interface is configured, the highest IP address of any of its physical interfaces.
Optional Parameters Length: A 1-octet field indicating the total length of the following Optional Parameters field, in octets. If the value of this field is zero, no Optional Parameters field is included in the message.
Optional Parameters: A variable-length field containing a list of optional parameters. Each parameter is specified by a 1-octet type field, a 1-octet length field, and a variable-length field containing the parameter value.
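The field layout above can be checked with a short Python sketch that builds and parses the portion of the Open message that follows the common header. The function names are hypothetical, the version defaults to 4 (BGP-4) as an assumption, and the optional parameters are treated as an opaque byte string rather than decoded.

```python
import socket
import struct


def pack_open_body(my_as, hold_time, router_id, opt_params=b"", version=4):
    """Build the Open fields that follow the 19-octet message header."""
    if hold_time != 0 and hold_time < 3:
        raise ValueError("hold time must be 0 or at least 3 seconds")
    return (
        struct.pack("!BHH", version, my_as, hold_time)  # 1 + 2 + 2 octets
        + socket.inet_aton(router_id)                   # 4-octet BGP Identifier
        + struct.pack("!B", len(opt_params))            # Optional Parameters Length
        + opt_params
    )


def parse_open_body(body):
    """Split the Open fields into a dictionary."""
    if len(body) < 10:
        raise ValueError("Open body is shorter than 10 octets")
    version, my_as, hold_time = struct.unpack("!BHH", body[:5])
    opt_len = body[9]
    return {
        "version": version,
        "my_as": my_as,
        "hold_time": hold_time,
        "bgp_identifier": socket.inet_ntoa(body[5:9]),
        "optional_parameters": body[10:10 + opt_len],
    }


def negotiate_hold_time(configured, received):
    """Accept the smaller of the configured and proposed hold times."""
    return min(configured, received)
```

With no optional parameters the body is 10 octets long, which together with the 19-octet header gives the 29-octet minimum stated earlier.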
Update Message
The Update message, whose format is shown in Figure 2-20, is used to advertise a single feasible route to a peer, to withdraw multiple unfeasible routes, or both.
Figure 2-20 BGP Update Message Format
The BGP Update message contains the following fields:
Withdrawn Routes Length6: A 2-octet field indicating the total length of the following Withdrawn Routes field, in octets. A value of zero indicates that no routes are being withdrawn and that no Withdrawn Routes field is included in the message.
Withdrawn Routes: A variable-length field containing a list of routes to be withdrawn from service. Each route in the list is described with a (Length, Prefix) tuple in which the Length is the length of the prefix and the Prefix is the IP address prefix of the withdrawn route. If the Length part of the tuple is zero, the Prefix matches all routes.
Total Path Attribute Length: A 2-octet field indicating the total length of the following Path Attribute field, in octets. A value of zero indicates that attributes and NLRI are not included in this message.
Path Attributes: A variable-length field listing the attributes associated with the NLRI in the following field. Each path attribute is a variable-length triple of (Attribute Type, Attribute Length, Attribute Value). The Attribute Type part of the triple is a 2-octet field consisting of four flag bits, four unused bits, and an Attribute Type code (see Figure 2-21). Table 2-4 shows the most common Attribute Type codes and the possible Attribute Values for each Attribute Type.
Figure 2-21 Attribute Type Part of the Path Attributes Field in the Update Message
Network Layer Reachability Information: A variable-length field containing a list of (Length, Prefix) tuples. The Length indicates the length in bits of the following prefix, and the Prefix is the IP address prefix of the NLRI. A Length value of zero indicates a prefix that matches all IP addresses.
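The (Length, Prefix) tuples used in both the Withdrawn Routes and NLRI fields can be sketched in Python as follows. The function names are hypothetical and only IPv4 is handled. One detail not stated in the field descriptions above, but specified in RFC 4271, is that the Prefix occupies only as many octets as are needed to hold Length bits.

```python
import ipaddress


def encode_prefix(prefix):
    """Encode one IPv4 prefix as a (Length, Prefix) tuple."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8  # minimum octets needed for the prefix bits
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def decode_prefixes(data):
    """Decode a run of (Length, Prefix) tuples into prefix strings."""
    prefixes = []
    i = 0
    while i < len(data):
        bits = data[i]
        if bits > 32:
            raise ValueError("prefix length exceeds 32 bits")
        nbytes = (bits + 7) // 8
        raw = data[i + 1:i + 1 + nbytes]
        if len(raw) < nbytes:
            raise ValueError("truncated prefix")
        # Pad the prefix back out to 4 octets; trailing bits are not masked here.
        address = ipaddress.IPv4Address(raw + b"\x00" * (4 - nbytes))
        prefixes.append(f"{address}/{bits}")
        i += 1 + nbytes
    return prefixes
```

A Length of zero is followed by no prefix octets at all, which is how a single octet can describe a prefix that matches all IP addresses.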
Table 2-4 Attribute Types and Associated Attribute Values7
Attribute Type Code | Attribute Type | Attribute Value Code | Attribute Value
1 | ORIGIN | 0 | IGP
1 | ORIGIN | 1 | EGP
1 | ORIGIN | 2 | Incomplete
2 | AS_PATH | 1 | AS_SET
2 | AS_PATH | 2 | AS_SEQUENCE
2 | AS_PATH | 3 | AS_CONFED_SEQUENCE
2 | AS_PATH | 4 | AS_CONFED_SET
3 | NEXT_HOP | 0 | Next-hop IP address
4 | MULTI_EXIT_DISC | 0 | 4-octet MED
5 | LOCAL_PREF | 0 | 4-octet LOCAL_PREF
6 | ATOMIC_AGGREGATE | 0 | None
7 | AGGREGATOR | 0 | AS number and IP address of aggregator
8 | COMMUNITY | 0 | 4-octet community identifier
9 | ORIGINATOR_ID | 0 | 4-octet router ID of originator
10 | CLUSTER_LIST | 0 | Variable-length list of cluster IDs
14 | MP_REACH_NLRI | 0 | Variable-length Multiprotocol BGP NLRI
15 | MP_UNREACH_NLRI | 0 | Variable-length Multiprotocol BGP NLRI
16 | EXTENDED COMMUNITIES | 0 | 8-octet extended community identifier
17 | AS4_PATH | 0 | AS path with 4-octet AS numbers
18 | AS4_AGGREGATOR | 0 | 4-octet AS number and IP address of aggregator
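To show how the (Attribute Type, Attribute Length, Attribute Value) triples are walked, here is a minimal Python sketch. The function name is hypothetical. The names and bit positions of the four flags (Optional, Transitive, Partial, Extended Length) are taken from RFC 4271 rather than from the text above, and attribute values are returned as raw bytes without further decoding.

```python
import struct

# Attribute Type codes from the table above.
ATTRIBUTE_NAMES = {
    1: "ORIGIN", 2: "AS_PATH", 3: "NEXT_HOP", 4: "MULTI_EXIT_DISC",
    5: "LOCAL_PREF", 6: "ATOMIC_AGGREGATE", 7: "AGGREGATOR", 8: "COMMUNITY",
    9: "ORIGINATOR_ID", 10: "CLUSTER_LIST", 14: "MP_REACH_NLRI",
    15: "MP_UNREACH_NLRI", 16: "EXTENDED COMMUNITIES", 17: "AS4_PATH",
    18: "AS4_AGGREGATOR",
}

# Flag bits in the high-order half of the first octet (per RFC 4271).
FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40
FLAG_PARTIAL = 0x20
FLAG_EXTENDED_LENGTH = 0x10


def parse_path_attributes(data):
    """Split the Path Attributes field into a list of attribute dictionaries."""
    attributes = []
    i = 0
    while i < len(data):
        flags, code = data[i], data[i + 1]
        if flags & FLAG_EXTENDED_LENGTH:
            # Extended Length set: the Attribute Length is 2 octets.
            (length,) = struct.unpack("!H", data[i + 2:i + 4])
            i += 4
        else:
            length = data[i + 2]
            i += 3
        attributes.append({
            "name": ATTRIBUTE_NAMES.get(code, f"Unknown ({code})"),
            "optional": bool(flags & FLAG_OPTIONAL),
            "transitive": bool(flags & FLAG_TRANSITIVE),
            "partial": bool(flags & FLAG_PARTIAL),
            "value": data[i:i + length],
        })
        i += length
    return attributes
```

The sketch assumes well-formed input; a real parser would also reject truncated attributes and report them with the Update Message Error codes.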
Keepalive Message
Keepalive messages are exchanged at an interval of one-third the hold time, but the interval is never less than 1 second. If the negotiated hold time is 0, Keepalives are not sent.
The Keepalive message consists of only the 19-octet BGP message header, with no additional data.
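The Keepalive rules are simple enough to express directly. In this Python sketch the function names are hypothetical; the interval is computed as a plain one-third of the negotiated hold time with a floor of 1 second, and the message is just the common header with a Length of 19 and a Type of 4.

```python
import struct

HEADER_LEN = 19     # a Keepalive is the header and nothing else
KEEPALIVE_TYPE = 4  # type code for Keepalive


def keepalive_interval(hold_time):
    """Return seconds between Keepalives, or None if they are not sent."""
    if hold_time == 0:
        return None
    return max(1.0, hold_time / 3)


def build_keepalive():
    """Return the 19-octet Keepalive message."""
    return b"\xff" * 16 + struct.pack("!HB", HEADER_LEN, KEEPALIVE_TYPE)
```

For example, a negotiated hold time of 180 seconds gives a 60-second Keepalive interval, and the minimum nonzero hold time of 3 seconds gives the 1-second floor.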
Notification Message
Notification messages, whose format is shown in Figure 2-22, are sent when an error condition is detected. The BGP connection is closed immediately after the message is sent.
Figure 2-22 BGP Notification Message Format
The BGP Notification message contains the following fields:
Error Code: A 1-octet field indicating the type of error.
Error Subcode: A 1-octet field providing more-specific information about the error. Table 2-5 shows the possible error codes and associated error subcodes.
Data: A variable-length field used to diagnose the reason for the error. The contents of the Data field depend on the error code and subcode.
Table 2-5 BGP Notification Message Error Codes and Error Subcodes
Error Code | Error | Error Subcode | Subcode Detail
1 | Message Header Error | 1 | Connection not synchronized
1 | Message Header Error | 2 | Bad message length
1 | Message Header Error | 3 | Bad message type
2 | Open Message Error | 1 | Unsupported version number
2 | Open Message Error | 2 | Bad peer AS
2 | Open Message Error | 3 | Bad BGP identifier
2 | Open Message Error | 4 | Unsupported optional parameter
2 | Open Message Error | 5 | Authentication failure (deprecated in RFC 4271)
2 | Open Message Error | 6 | Unacceptable hold time
3 | Update Message Error | 1 | Malformed attribute list
3 | Update Message Error | 2 | Unrecognized well-known attribute
3 | Update Message Error | 3 | Missing well-known attribute
3 | Update Message Error | 4 | Attribute flags error
3 | Update Message Error | 5 | Attribute length error
3 | Update Message Error | 6 | Invalid ORIGIN attribute
3 | Update Message Error | 7 | AS routing loop (deprecated in RFC 4271)
3 | Update Message Error | 8 | Invalid NEXT_HOP attribute
3 | Update Message Error | 9 | Optional attribute error
3 | Update Message Error | 10 | Invalid network field
3 | Update Message Error | 11 | Malformed AS_PATH
4 | Hold Timer Expired | 0 | -
5 | Finite State Machine Error | 0 | -
6 | Cease | 0 | -
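A Notification body can be decoded with a pair of lookup tables built from Table 2-5. The Python sketch below is illustrative only: the function name is hypothetical, the input is the message body after the 19-octet header, and the Data field is returned as raw bytes because its interpretation depends on the error code and subcode.

```python
ERROR_CODES = {
    1: "Message Header Error",
    2: "Open Message Error",
    3: "Update Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

ERROR_SUBCODES = {
    1: {1: "Connection not synchronized", 2: "Bad message length",
        3: "Bad message type"},
    2: {1: "Unsupported version number", 2: "Bad peer AS",
        3: "Bad BGP identifier", 4: "Unsupported optional parameter",
        5: "Authentication failure", 6: "Unacceptable hold time"},
    3: {1: "Malformed attribute list", 2: "Unrecognized well-known attribute",
        3: "Missing well-known attribute", 4: "Attribute flags error",
        5: "Attribute length error", 6: "Invalid ORIGIN attribute",
        7: "AS routing loop", 8: "Invalid NEXT_HOP attribute",
        9: "Optional attribute error", 10: "Invalid network field",
        11: "Malformed AS_PATH"},
}


def parse_notification(body):
    """Return (error name, subcode detail or None, data bytes)."""
    if len(body) < 2:
        raise ValueError("Notification body needs Error Code and Error Subcode")
    code, subcode = body[0], body[1]
    error = ERROR_CODES.get(code, f"Unknown ({code})")
    detail = ERROR_SUBCODES.get(code, {}).get(subcode)
    return error, detail, body[2:]
```

Error codes 4 through 6 have no entry in the subcode table, so their detail is reported as None.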