Defining Key Terms
Before proceeding further, it would be useful to define some key terms.
Availability and Unavailability
The phrase "availability of a system such as a router or network" denotes the probability (with values in the 0.0 to 1.0 range such as 0.1, 0.2, and so forth) that the system or network can be used when needed. Alternatively, the phrase describes the fraction of the time that the service is available. As a benchmark, carrier-class network equipment requires availability in the range of five-nines (0.99999), which means the equipment is available for service 99.999 percent of the time.
The term unavailability is defined as the probability that a system or network is not available when needed, or as the fraction of the time that service is not available. An alternative and often more convenient expression for unavailability (because of its additive properties) is downtime per year. Downtime in minutes per year is obtained by multiplying unavailability by the number of minutes in a year (365 days per year times 24 hours per day times 60 minutes per hour, or 525,600 minutes). Service providers commonly use yet another expression for unavailability, especially when referring to voice calls: defects per million (DPM). DPM measures the number of defective units (for example, failed call attempts) out of a sample of one million units (1,000,000 call attempts), and is obtained by multiplying unavailability by 1,000,000. From these definitions, it follows that 0.99999 availability is equivalent to 0.00001 unavailability, 5.256 minutes of downtime per year, or 10 DPM.
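To make these conversions concrete, the following short Python sketch (illustrative only; the function names are not from any standard library) derives unavailability, downtime per year, and DPM from an availability figure:

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a year

def unavailability(availability):
    """Unavailability is the complement of availability."""
    return 1.0 - availability

def downtime_minutes_per_year(availability):
    """Expected downtime per year, expressed in minutes."""
    return unavailability(availability) * MINUTES_PER_YEAR

def dpm(availability):
    """Defects per million, for example failed call attempts per 1,000,000."""
    return unavailability(availability) * 1_000_000

# Five-nines availability:
a = 0.99999
print(unavailability(a))             # approximately 0.00001
print(downtime_minutes_per_year(a))  # approximately 5.256 minutes per year
print(dpm(a))                        # approximately 10 DPM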
Reliability and Its Relationship to Availability
The phrase "reliability of a system or network" is defined as the probability that the system or network will perform its intended function without failure over a given period of time. A commonly used measure of reliability is known as mean time between failures (MTBF), which is the average expected time between failures. A service outage caused by a failure is represented as mean time to repair (MTTR). That is the average time expected to be required to restore a system from a failure. MTTR includes time required for failure detection, fault diagnosis, and actual repair. Availability is related to MTBF and MTTR as follows:
Availability = MTBF/(MTBF + MTTR)
This relationship shows that increasing MTBF and decreasing MTTR improves availability. This means that the availability of a router can be improved by increasing the reliability of its hardware and software components. Similarly, improving the reliability of its constituent elements such as routers, switches, and transport facilities can enhance the availability of a network.
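As a simple illustration of this formula, the following Python sketch computes availability from MTBF and MTTR values; the figures are hypothetical and chosen only for illustration:

def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical component that fails on average once a year (MTBF = 8760 hours)
# and takes 4 hours to detect, diagnose, and repair (MTTR = 4 hours).
print(availability(8760, 4))   # approximately 0.99954 (roughly "three nines")

# Halving MTTR improves availability, as does increasing MTBF.
print(availability(8760, 2))   # approximately 0.99977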
In general, reliability is just one of several factors that can influence the availability of a system. For example, in addition to reliability of constituent network elements, network availability is strongly influenced by the fault-tolerance capability of the network elements, as described in the following section.
Fault Tolerance and Its Effect on Availability
Fault tolerance describes the characteristic of a system or component that is designed so that, in the event of a component failure, a backup or "redundant" component can immediately take its place with no loss of service. Fault tolerance can be provided through software, hardware, or a combination of the two. The switchover from the failing component to the backup component is transparent to the outside world: from a viewpoint outside the system, no failure has occurred.
A network is said to be fault tolerant or survivable if it can maintain or restore an acceptable level of service performance during network failures. Network-level fault tolerance relies on software or hardware to quickly detect the failure and switch to a known backup path/link. The backup paths may be provided at multiple transport layers, including wavelength-division multiplexing (WDM), Synchronous Optical Network/Synchronous Digital Hierarchy (SONET/SDH), and MPLS.
As described in the previous section, improving MTBF can increase overall system availability. However, by using redundant components, one can reduce system downtime by orders of magnitude and get closer to the carrier-class goal of five-nines availability while keeping the MTBF and MTTR the same. The effectiveness of a redundancy scheme depends on its switchover success rate (the probability of a successful switchover from active to standby component when the active component fails). Generally, it is difficult to achieve a perfect (100 percent) switchover success rate. In practice, a redundancy scheme that can achieve a 99 percent or better switchover success rate is considered a good design.
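One simplified way to illustrate the combined effect of redundancy and switchover success rate is the rough approximation sketched below in Python. It assumes a hypothetical 1:1 redundant pair in which an outage occurs only when the switchover itself fails or when both components happen to be down at the same time; real availability models are more detailed, so treat this strictly as an illustration:

def unavailability_single(mtbf_hours, mttr_hours):
    """Unavailability of a single component: MTTR / (MTBF + MTTR)."""
    return mttr_hours / (mtbf_hours + mttr_hours)

def unavailability_redundant(mtbf_hours, mttr_hours, switchover_success):
    """Rough approximation for a 1:1 redundant pair.

    Assumes an outage occurs only when the switchover fails
    (probability 1 - switchover_success) or when both components
    are down at once.
    """
    u = unavailability_single(mtbf_hours, mttr_hours)
    return (1.0 - switchover_success) * u + switchover_success * u * u

MINUTES_PER_YEAR = 365 * 24 * 60

# Hypothetical figures: MTBF = 8760 hours, MTTR = 4 hours, 99 percent switchover success.
u1 = unavailability_single(8760, 4)
u2 = unavailability_redundant(8760, 4, 0.99)
print(u1 * MINUTES_PER_YEAR)   # roughly 240 minutes of downtime per year
print(u2 * MINUTES_PER_YEAR)   # roughly 2.5 minutes per year, orders of magnitude less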
To summarize, redundancy is one of the key building blocks of high availability. Redundancy not only prevents equipment failures from causing service outages, but also provides a means for in-service planned maintenance and upgrade activities.