PoS Network Designs
Resiliency is an important concern in SP networks. Outages result in lost revenue and might cause customers to cancel their service. SPs enter into Service Level Agreements (SLAs) with their customers. These SLAs guarantee certain levels of service and differ in many respects depending on the amount of risk the customer is willing to take. The more risk the customer accepts, the looser the SLA and the lower the cost to the customer. The downside is that such a customer is not guaranteed the same level of service as a customer who accepted less risk and paid more for a stringent SLA.
PoS supports the optical 1+1 automatic protection switching (APS) mechanism. A customer desiring this level of protection orders two circuits from the SP: one for working traffic and one for protect traffic. SPs offer discounts for circuits that are used for protect traffic. The CPE router in this design could be a single point of failure, depending on how the circuits terminate at the CPE. If both circuits terminate on one line card of one router, both the line card and the router are single points of failure. A slightly more resilient method still uses one router but places the working and protect circuits on separate line cards. This scenario provides fault tolerance in the case of a line card failure, but not a router failure. A higher level of fault tolerance can be achieved if each circuit terminates on a separate router.
All of these survivability network designs are connected to one ADM at the service provider. APS 1+1 protection schemes are normally implemented per add/drop multiplexer. Ring failure is handled by the SP's robust SONET ring protection mechanisms. ADMs are carrier-class devices and must maintain a level of Five-Nines reliability. Five-Nines reliability refers to the amount of uptime a customer should expect from that network. Five-Nines reliability represents an uptime of 99.999 percent.
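To make the Five-Nines figure concrete, the following minimal sketch converts an availability percentage into the downtime it permits per year. The function name and the use of a 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999 percent uptime leaves a little over five minutes of downtime per year.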
One Router
Figure 9-9 shows a design where there is one router at the customer premises with two optical interfaces used for APS 1+1 protection. Although a one-router CPE design does not provide the highest level of resiliency, this design does offer some advantages, including the following:
No routing convergence upon failure of the working circuit or optical interfaces.
1+1 APS optical protection. Switchover can be achieved in less than 60 ms.
Low-complexity network configuration.
Figure 9-9 One-Router CPE Design
Two Routers
In a two-router design, each router has one optical connection to the SP's add/drop multiplexer. Fault tolerance increases with this design because the CPE router is no longer a single point of failure. The routers can also be located in different areas of the building to protect against problems that are confined to one area.
Figure 9-10 illustrates the two-router design philosophy. Although each router has one optical interface to the ADM, one link is working (active) while the other link is protecting (standby) the working link.
Figure 9-10 Two-Router CPE Design
Resources are wasted if only one router is actively forwarding traffic. To fully use both routers, you could use a design including four circuits. The design requires twice the number of interfaces and circuits, but this might still be cost advantageous depending on the amount of bandwidth required and the router hardware employed. Figure 9-11 displays an environment that includes two routers and four circuits in which both routers are in a working state for one circuit. The optical protection scheme is logically divided into APS protection groups that the routers monitor. A large router such as the Cisco 12000 can accommodate hundreds of APS groups.
Figure 9-11 Two Routers with Four Circuits
Advantages of having two PoS-connected routers include the following:
Router redundancy in addition to circuit redundancy
Load balancing of traffic
Disadvantages of having two routers connected to the single ADM in the SONET/SDH network include the following:
Convergence time: The Layer 3 routing protocol must converge after an optical circuit failure.
Complex network configuration: APS groups.
Cost: The costs associated with setting up and maintaining the design.
Failure recovery using two routers cannot achieve the sub-60-ms time that the one-router alternative offers. The Layer 3 routing protocol implemented in the infrastructure needs to reconverge around the failure. This is not an issue with one-router designs because both of the PoS interfaces on one router can have the same IP address with the PoS APS 1+1 configuration commands. This feature is allowable because only one of the interfaces is active at any one time.
PoS Protection Schemes
Packet over SONET protection uses the SONET APS 1+1 protection scheme. APS 1+1 looks at the K1 and K2 bytes of the SONET line overhead to determine whether issues exist with the SONET ring. A failure in the SONET network that affects the customer's working path causes a failover at the client site. This failover occurs in under 50 ms in the SONET Layer 1 network. The router interface uses a keepalive to determine whether the other side of the connection is alive. Keepalives are sent every 10 seconds by default, and the loss of three consecutive keepalives results in the interface going to an up/down state. After Layer 2 is lost, the Layer 3 routing protocol must converge around the link failure. Waiting for Layer 2 and Layer 3 to go through this procedure can take a long time (more than 30 seconds). Configuring the keepalive interval to 1 second lowers the Layer 2 detection time to 3 seconds. Because PoS interfaces are Layer 3 implementations, the interfaces need to rely on a hierarchical error-recovery method such as that shown in Figure 9-12.
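The keepalive arithmetic above can be sketched as follows. The function names are hypothetical, and the Layer 3 convergence time is an assumed input, because it depends on the routing protocol and network design.

```python
def keepalive_detection_seconds(interval_s: float, missed_limit: int = 3) -> float:
    """Time for Layer 2 to declare the link down once keepalives stop arriving."""
    return interval_s * missed_limit


def worst_case_recovery_seconds(interval_s: float, l3_convergence_s: float) -> float:
    """Layer 2 detection followed by Layer 3 reconvergence (assumed to be sequential)."""
    return keepalive_detection_seconds(interval_s) + l3_convergence_s


if __name__ == "__main__":
    # Default 10-second keepalives versus keepalives tuned to 1 second.
    print("default detection:", keepalive_detection_seconds(10), "seconds")
    print("tuned detection:", keepalive_detection_seconds(1), "seconds")
    # l3_convergence_s=5 is an illustrative value, not a measured one.
    print("tuned worst case:", worst_case_recovery_seconds(1, l3_convergence_s=5), "seconds")
```

The sketch treats detection and reconvergence as strictly sequential, which is a simplification of the hierarchical recovery the text describes.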
Figure 9-12 PoS Hierarchical Error Recovery
Layer 3 makes rerouting decisions during network failures to provide intelligent resiliency to the network. Layer 3 rerouting is needed during a link failure if the Layer 3 IP address changes. In the one-router design, the working and protect interfaces share the same IP address, so the Layer 3 routing protocol never needs to reconverge. PoS interfaces in different routers require different IP addresses, so a switchover always results in routing protocol reconvergence.
APS 1+1 Protection
SONET APS 1+1 is used for any PoS design that has more than one optical interface. APS provides optical protection during times of optical failure in the SP network. Protection-switching information is carried in the K1 and K2 bytes of the SONET overhead. The CPE listens to the K1 and K2 bytes generated by the SONET network and generates K1 and K2 bytes when a failure occurs on the customer side. If the working interface of an APS 1+1 group fails, the protect interface can quickly assume its traffic load. The Layer 1 APS 1+1 recovery mechanism completes within 60 ms.
NOTE
SONET rings have a 50-ms switchover time rather than the 60-ms switchover used in the Bellcore APS 1+1 specification. The APS 1+1 specification provides 10 ms for failure detection and 50 ms for switch initiation, which collectively equal 60 ms.
SONET APS works at Layer 1 providing switchover times significantly faster than any protocols operating at Layer 2 or 3.
The SONET protection mechanism used for PoS on Cisco products uses APS 1+1 with either unidirectional or bidirectional switching. (You can read more about 1+1 uni- and bidirectional switching in Chapter 3, "SONET Overview.")
The SONET APS 1+1 architecture designates that there will be two circuits and each will carry the same traffic. One circuit is considered the working circuit; the other is the protect line. This differs from 1:1 or 1:n electrical-protection schemes because the backup equipment in electrical-protection schemes only carries traffic upon failure. The working and protect lines of APS 1+1 are both always transporting traffic. The receiving device(s) only process the traffic being received on the working circuit.
Protection mechanisms are more complex when circuits are terminated on different routers. The protect router must somehow be notified of the failure. An additional protocol is needed to provide this signaling. This protocol is a Cisco proprietary mechanism called the Protect Group Protocol (PGP).
If a signal fail (SF) or a signal degrade (SD) condition is detected, the hardware switches from the working circuit to the protect circuit. APS 1+1 has revertive capabilities that allow the hardware to switch back to the working circuit automatically after the original signal has been restored for the configured time interval. A circuit that automatically switches back to the original facility is called a revertive circuit. The configurable reversion time prevents the system from switching back to a working circuit that is flapping (repeatedly going up and down). Flapping is sometimes referred to as switch oscillation and should be avoided so that the SP equipment can meet its SLAs. If the revertive option is not used, the hardware does not automatically revert to the working circuit after a switch to the protect circuit; a system administrator must perform this function manually. Bidirectional switching is the default operation in Cisco routers.
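The revertive behavior described above can be modeled with a small state machine. This is a minimal sketch, not Cisco's implementation: the class name, the one-sample-per-second polling model, and the `working_ok` flag (meaning no SF or SD condition on the working circuit) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RevertiveSelector:
    """Hypothetical model of an APS 1+1 selector with a reversion timer."""

    revert_after_s: Optional[float]  # None models non-revertive operation
    active: str = "working"
    switch_count: int = 0
    _healthy_since: Optional[float] = None

    def observe(self, now_s: float, working_ok: bool) -> str:
        """Feed one health sample for the working circuit; return the active circuit."""
        if not working_ok:
            # SF/SD on the working circuit: restart the timer and use protect.
            self._healthy_since = None
            if self.active == "working":
                self._switch_to("protect")
            return self.active
        if self.active == "protect" and self.revert_after_s is not None:
            if self._healthy_since is None:
                self._healthy_since = now_s
            # Revert only after the working signal has stayed clean long enough.
            if now_s - self._healthy_since >= self.revert_after_s:
                self._switch_to("working")
                self._healthy_since = None
        return self.active

    def _switch_to(self, target: str) -> None:
        self.active = target
        self.switch_count += 1


def run(samples, revert_after_s):
    """Replay one health sample per second and return the final selector state."""
    selector = RevertiveSelector(revert_after_s=revert_after_s)
    for second, working_ok in enumerate(samples):
        selector.observe(float(second), working_ok)
    return selector


if __name__ == "__main__":
    # A working circuit that flaps three times and then stays up.
    flapping = [True, False, True, False, True, False] + [True] * 6
    for timer in (0.0, 5.0, None):
        result = run(flapping, timer)
        print(f"revert timer {timer}: {result.switch_count} switches, active={result.active}")
```

With a zero-second timer the selector follows every flap, whereas a timer longer than the flap period produces one switch to protect and one reversion. With no timer (non-revertive), traffic stays on the protect circuit until an operator intervenes.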
The K1/K2 bytes from the line overhead of the SONET frame indicate the current status of the APS connection and convey any requests for action. In standard APS, the two ends of the connection use this signaling channel to maintain synchronization.
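As an illustration of how status and requests fit into two bytes, the sketch below decodes K1 and K2 using the linear APS bit layout commonly documented for GR-253 and ITU-T G.841: request code and channel number in K1; bridged channel, architecture bit, and mode in K2. The code points and function names are assumptions to verify against the specification, not values taken from this chapter.

```python
# Assumed K1 request codes (bits 1-4, most significant nibble).
K1_REQUESTS = {
    0b1111: "lockout of protection",
    0b1110: "forced switch",
    0b1101: "signal fail (high priority)",
    0b1100: "signal fail (low priority)",
    0b1011: "signal degrade (high priority)",
    0b1010: "signal degrade (low priority)",
    0b1000: "manual switch",
    0b0110: "wait to restore",
    0b0100: "exercise",
    0b0010: "reverse request",
    0b0001: "do not revert",
    0b0000: "no request",
}

# Assumed K2 mode/status codes (bits 6-8, least significant three bits).
K2_MODES = {
    0b100: "unidirectional",
    0b101: "bidirectional",
    0b110: "RDI-L",
    0b111: "AIS-L",
}


def decode_k1(k1: int):
    """Return (request, channel) from a K1 byte."""
    return K1_REQUESTS.get(k1 >> 4, "reserved"), k1 & 0x0F


def decode_k2(k2: int):
    """Return (bridged_channel, architecture, mode) from a K2 byte."""
    architecture = "1:n" if k2 & 0x08 else "1+1"
    return k2 >> 4, architecture, K2_MODES.get(k2 & 0x07, "reserved")


if __name__ == "__main__":
    print(decode_k1(0xC1))
    print(decode_k2(0x15))
```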
With Cisco PoS, the working and protect channels are synchronized through an independent communications channel that is not part of the standard SONET APS system. This independent channel works whether the interfaces are on the same or different routers. This low-bandwidth connection is the Cisco PGP.
Cisco Protect Group Protocol (PGP)
PGP is the Cisco proprietary APS communication channel that is used between routers to complement APS 1+1 protection signaling. APS 1+1 is normally performed within a single router, but PGP enables this functionality to span multiple routers for added resiliency.
Performing APS 1+1 operation between routers creates some Layer 3 convergence issues. The standard Layer 2 mechanism used to determine whether an interface is down is the keepalive function. To accommodate fast reconvergence times, the keepalive update timer should be changed to 1 second and the hold timer changed to 3 seconds. PGP is the signaling channel used to inform the router with the protect facility about the failure. PGP operation closely resembles that of Cisco Hot Standby Router Protocol (HSRP): it performs a heartbeat operation over a low-speed interface and tracks the status of certain ports. You can configure different protection groups to monitor multiple ports. PGP is a connectionless protocol that uses User Datagram Protocol (UDP) port 172 for message transfer. Figure 9-13 displays two routers that are configured in the same APS group. Notice that PGP updates are propagated bidirectionally between the working and protect routers to exchange information regarding the status of the PoS interface.
Figure 9-13 Protection Group Protocol Operation
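The HSRP-like heartbeat idea can be sketched as a hold-timer check on the protect router. This is a hypothetical model only: it does not reproduce the PGP message format, and the class name, method names, and the choice of a 3-second hold time fed by 1-second hellos are assumptions for illustration.

```python
from typing import Optional


class PeerHeartbeat:
    """Hypothetical hold-timer tracker kept by the protect router for its peer."""

    def __init__(self, hold_s: float = 3.0) -> None:
        # Hellos are assumed to arrive about once per second.
        self.hold_s = hold_s
        self.last_heard_s: Optional[float] = None
        self.peer_interface_up = True

    def on_hello(self, now_s: float, peer_interface_up: bool) -> None:
        """Record a hello from the working router and the PoS status it reports."""
        self.last_heard_s = now_s
        self.peer_interface_up = peer_interface_up

    def peer_alive(self, now_s: float) -> bool:
        """The peer is alive if a hello arrived within the hold time."""
        return self.last_heard_s is not None and now_s - self.last_heard_s < self.hold_s

    def protect_should_take_over(self, now_s: float) -> bool:
        """Take over if the peer is silent or reports its working interface down."""
        return not self.peer_alive(now_s) or not self.peer_interface_up


if __name__ == "__main__":
    heartbeat = PeerHeartbeat(hold_s=3.0)
    heartbeat.on_hello(0.0, peer_interface_up=True)
    print("take over at t=2.0:", heartbeat.protect_should_take_over(2.0))
    print("take over at t=3.5:", heartbeat.protect_should_take_over(3.5))
```

In this sketch, a protect router that has never heard a hello treats the peer as down, which is a simplification a real implementation would likely handle more carefully.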
Figure 9-14 displays a network in which an outage occurs between POP B and POP C on the working facility. The routers at POP B and POP C will have knowledge of this outage through a loss of signal (LOS) condition, and PGP will notify the other router that it will now become the working interface. The other routers in the network will learn of this occurrence through the K1/K2 byte signaling occurring throughout the network.
Figure 9-14 PGP Link Selection
PoS Convergence
Convergence time is the amount of time required for all routers in a network to learn of changes in the network topology. Routers must propagate new route information from one end of the network to the other. Routing protocols are implemented to exchange this information. The routing protocol chosen should scale well enough to meet the future needs of the network. The faster the routing protocol can converge, the less downtime occurs.
Scalable IP network routing protocols, such as Open Shortest Path First (OSPF), Integrated IS-IS, and Border Gateway Protocol (BGP), are responsible for recovering from error conditions in the network. Although the SONET APS 1+1 protection switching mechanisms guarantee a restoration time of 60 ms, the PoS interfaces are Layer 3 implementations and require some degree of routing protocol convergence. Typical convergence times for scalable routing protocols are several seconds or more, depending on the environment and routing protocol design.
Figure 9-15 displays a design in which one router is used for the PoS interfaces. With this design, both of the PoS interfaces in the router can be configured with the same IP address. If a failure occurs, the router can switch over within the APS 1+1 switchover time of 60 ms. Nothing changes for the Layer 3 routing protocol on either the LAN or WAN side of the router. The Layer 2 keepalive mechanism might not even be aware of this switchover, because it completes in less than the lowest keepalive interval of 1 second. Regardless, three keepalives must be missed before an interface is declared down.
Figure 9-15 1-Router APS 1+1 Convergence
Figure 9-16 displays an environment that requires a higher degree of fault tolerance. This design uses two routers to implement the APS 1+1 group to protect the design from a router failure. The added resiliency creates some Layer 3 convergence issues because the interfaces cannot have the same IP address if they reside on different physical routers. When the failure occurs, PGP is used to determine that the working interface has gone down, and the protect interface takes over. After this switchover has occurred, the Layer 3 routing protocol must communicate this information on both the LAN and WAN sides so that the end-to-end network learns of the failure and the new path. It is best to use HSRP on the LAN side if the PoS routers represent the default gateways out of the network. HSRP hello and hold timers should be configured to match those of PGP.
Figure 9-16 2-Router APS 1+1 Convergence
Flapping
Flapping is the operation of a transmission line regularly transitioning from an up/up to an up/down state in a short period of time. Intermittent failures can result in the APS protection mechanism switching between the working and protection traffic repeatedly, causing many fluctuations in the network. If a two-router PoS model is implemented, the Layer 3 routing protocols will flap, too. You can see this issue in Figure 9-17.
APS switches traffic upon failures, but the routing protocol must send out routing updates. If another failure happens (Failure 2), the failure results in another APS switchover and more routing updates. Subsequent failures (Failures 3 and 4) repeat the process. The result of this flapping is that the network could end up spending all the time sending routing updates and reconverging around repeating failures instead of sending data across the network.
Figure 9-17 Flapping in a 2-Router PoS Design
The issue is manageable by setting the reversion timer to a time greater than that necessary for the Layer 3 routing protocol to converge. Flapping interfaces then cannot bring down the network, because the working circuit must remain stable for that amount of time before any switch back takes place.
PoS Reflector Mode
PoS Reflector mode is a process that is used to inform the remote router of a change in the network topology due to a line failure. Figure 9-18 displays an environment with two routers where a failure has occurred on the working line. As soon as the protect router receives information of the down interface through PGP, the protect router initiates a packet to the other side of the connection to speed up convergence. The packet contains the router ID information needed by the routing protocol to create the new Layer 3 adjacency. The remote router can now change the IP adjacency information immediately and reduce the convergence time dramatically.
Figure 9-18 PoS Reflector Mode
Load Balancing
Load balancing refers to the capability to have traffic traverse two separate paths simultaneously to maximize the resources at the site. Load balancing is possible in a PoS APS 1+1 environment where four circuits are present. APS groups are configured on each router. One router is the working router for Group 1, and the other router is the working router for Group 2. The routers protect each other, using the PGP mechanism to alert the other side of failures. Figure 9-19 shows this design. You can use Multigroup HSRP (MHSRP) on the LAN side to actively forward traffic to both of these devices while providing the necessary resiliency. Layer 3 convergence must still be considered end to end.
Figure 9-19 PoS Load Balancing
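The group assignment in this design can be sketched as a small data structure in which each APS group names a working router and a protect router. The router names, the `ApsGroup` type, and the failure model (an entire router failing) are hypothetical and chosen only to show how the two groups back each other up.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ApsGroup:
    group_id: int
    working_router: str
    protect_router: str


def forwarding_router(group: ApsGroup, failed_routers: Set[str]) -> Optional[str]:
    """Return the router carrying the group's traffic, or None if both are down."""
    if group.working_router not in failed_routers:
        return group.working_router
    if group.protect_router not in failed_routers:
        return group.protect_router
    return None


# Each router is working for one group and protect for the other.
GROUPS = [
    ApsGroup(1, working_router="RouterA", protect_router="RouterB"),
    ApsGroup(2, working_router="RouterB", protect_router="RouterA"),
]

if __name__ == "__main__":
    for failed in (set(), {"RouterA"}):
        placement = {g.group_id: forwarding_router(g, failed) for g in GROUPS}
        print(f"failed={sorted(failed)} -> {placement}")
```

In normal operation each router forwards one group's traffic. If one router fails, the surviving router carries both groups.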
Alarms and PoS
Customers want to be notified of problems and errors that occur on their lines. PoS uses the same alarms as SONET alarm reporting. PoS uses the information carried in the Section, Line, and Path overhead bytes to determine and report errors. This includes such items as the following:
Loss of signal (LOS): Signal failure due to a loss of light on the receive interface. A loss of light can also be thought of as receiving an all-0s pattern before descrambling. A downstream AIS should be sent when an LOS is detected.
Loss of frame (LOF): Condition caused by receiving A1 and A2 framing bytes that do not carry the 2-byte pattern F628 hexadecimal. An LOF condition is registered after no valid framing information has been received for 3 ms. The receipt of two consecutive valid A1/A2 frames clears this condition. A line alarm indication signal (AIS) must be sent downstream when this condition occurs.
Bit interleaved parity (BIP) errors: B3 BIP-8 errors occur at the path layer. The PoS interface is a path-terminating equipment (PTE) device, so it checks the path parity carried in the B3 byte.
Loss of pointer (LOP): When a pointer processor cannot obtain a valid pointer, an LOP state is declared, and a downstream AIS must be sent. Recall that the H1 through H3 bytes of the LOH are used for the pointer functionality.
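Two of the checks in this list lend themselves to a short sketch: matching the A1/A2 framing pattern and computing a bit-interleaved parity. BIP-8 is even parity taken separately over each of the eight bit positions, which is equivalent to XORing all covered bytes together. The function names are hypothetical, and the sketch ignores scrambling and the exact span of bytes each parity byte covers.

```python
from functools import reduce

A1, A2 = 0xF6, 0x28  # framing pattern F628 hexadecimal


def framing_ok(a1: int, a2: int) -> bool:
    """True if the received framing bytes match the expected A1/A2 pattern."""
    return (a1, a2) == (A1, A2)


def bip8(covered_bytes: bytes) -> int:
    """Even parity per bit position across all bytes, computed as a running XOR."""
    return reduce(lambda parity, byte: parity ^ byte, covered_bytes, 0)


def parity_violations(covered_bytes: bytes, received_parity: int) -> int:
    """Count bit positions where the computed and received parity disagree."""
    return bin(bip8(covered_bytes) ^ received_parity).count("1")


if __name__ == "__main__":
    payload = bytes([0b10101010, 0b11110000])
    print("framing ok:", framing_ok(0xF6, 0x28))
    print("BIP-8:", hex(bip8(payload)))
    print("violations:", parity_violations(payload, bip8(payload) ^ 0x01))
```

Each disagreeing bit position counts as one parity violation, which is the kind of event an error counter would accumulate.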
Threshold registers record all the normal SONET error counters over the past 15 minutes and the past 24 hours. You can view these by using IOS show commands. When a counter exceeds its configured threshold, a threshold crossing alarm (TCA) is raised, meaning the device needs to notify the management station of the alarm.
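The threshold-crossing behavior can be sketched as a counter that raises a single alarm when it first exceeds its limit within an interval. The class name, the raise-once policy, and the explicit interval reset are assumptions for illustration; they do not describe the actual IOS counters.

```python
from dataclasses import dataclass


@dataclass
class ThresholdCounter:
    """Hypothetical error counter for one monitoring interval (for example, 15 minutes)."""

    name: str
    threshold: int
    count: int = 0
    alarmed: bool = False

    def record(self, errors: int = 1) -> bool:
        """Add errors; return True only when this call first crosses the threshold."""
        self.count += errors
        if self.count > self.threshold and not self.alarmed:
            self.alarmed = True
            return True  # the caller would notify the management station here
        return False

    def start_new_interval(self) -> None:
        """Clear the counter and alarm state at the interval boundary."""
        self.count = 0
        self.alarmed = False


if __name__ == "__main__":
    b3_errors = ThresholdCounter(name="B3 path parity", threshold=3)
    for burst in (3, 1, 1):
        raised = b3_errors.record(burst)
        print(f"count={b3_errors.count} tca_raised={raised}")
```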