Many enterprises are making fundamental changes to their business processes by using advanced IT applications to achieve enhanced productivity and operational efficiencies. As a result, the underlying network architecture to support these applications is evolving to better accommodate this new model.
As data availability becomes a critical requirement, many businesses are devoting more resources to ensure continuous operation. Enterprises are provisioning dedicated networks to guarantee performance metrics for applications without compromising security.
Although maintaining uninterruptible access to all data center applications is desirable, the economics of business continuance require network operators to prioritize applications according to their importance to the business. As a result, data centers need a range of business-continuance solutions to meet this goal, from simple tape backup and remote replication to synchronous mirroring and mirrored distributed data centers.
Enterprises can enhance application resilience in several ways, including the following:
- Removing single points of server failure by deploying high-availability clusters or load-balancing technology across web and application servers
- Extending the deployment of these clusters in different data centers to protect against major disruptions
User access is as important as downtime protection and data recovery. Following a disruption, how long can the business afford for users to be without access to applications? Companies are employing technologies such as Global Site Selector that enable users to manually or automatically connect to an alternative site running the application they need.
Businesses run tens and often hundreds of applications, each of which might have different continuance requirements, measured in terms of time to recovery and acceptable data loss. IT groups need to match the characteristics and cost of a business-continuance solution to the potential business impact, and consider which technologies to deploy where problems affect data, applications, and user access.
Cisco delivers scalable, secure, and cost-effective technology that helps enterprises build end-to-end backup and recovery solutions and disaster recovery solutions. These solutions include the following:
- High-availability data center networking and storage-area networks for nonstop access to applications and data
- Synchronized distributed data centers for continuous service over WANs in the event of site disruptions
- Synchronous disk mirroring and replication over WANs for fast recovery and zero data loss
- Asynchronous data replication over IP networks for remote data protection
- Consolidated backup to tape or near-line disk and remote electronic vaulting over enterprise-wide storage networks for consistent protection of distributed data
Each of these solutions requires the appropriate network infrastructure to help ensure that user-specific availability, performance, distance, and latency requirements are met. In addition, enterprises require a resilient, integrated business-continuance network infrastructure to protect three key areas in the event of a disruption:
- Data
- Applications
- User access
Overview of High-Availability Clusters
High-availability (HA) clusters operate by using redundant computers, or nodes, that continue to provide services when system components fail. Normally, if a server running a particular application crashes, the application is unavailable until the problem is resolved. HA clustering remedies this situation by detecting hardware or software faults and immediately providing access to the application on another system without requiring administrative intervention. This process is known as failover.
HA clusters are often used for key databases, file sharing on a network, business applications, and customer services such as e-commerce websites. HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure. These implementations include multiple network connections and data storage connected via storage-area networks (SANs).
HA clusters usually are built with two separate networks:
- The public network: Used to access the active node of the cluster from outside the data center
- The private network: Used to interconnect the nodes of the cluster for private communications within the data center and to monitor the health and status of each node in the cluster
Public Network Attachment
For the public network (facing the cluster nodes), the server is often dual-homed, with one network interface card (NIC) configured in active state and one NIC configured in standby state. If the link to the active NIC fails, or the NIC loses connectivity with its default gateway, the operating system performs a failover to the standby NIC. A NIC failover on the public network has no effect on cluster availability because the cluster heartbeat mechanism and the active/standby NIC mechanism for public access are two separate, independent health-check mechanisms.
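The active/standby behavior described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not any operating system's or teaming driver's actual implementation; the `Nic` class, its fields, and `select_active_nic` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    """Hypothetical model of one public-facing NIC on a cluster node."""
    name: str
    link_up: bool = True
    gateway_reachable: bool = True

    def usable(self) -> bool:
        # The text names two failover triggers: loss of link and
        # loss of connectivity with the default gateway.
        return self.link_up and self.gateway_reachable


def select_active_nic(active: Nic, standby: Nic):
    """Return the NIC that should carry public traffic, or None.

    Traffic stays on the active NIC while it is usable and moves to
    the standby NIC only when the active one fails either check.
    """
    if active.usable():
        return active
    if standby.usable():
        return standby
    return None


nic_a = Nic("nic-a")
nic_b = Nic("nic-b")
print(select_active_nic(nic_a, nic_b).name)  # nic-a

nic_a.gateway_reachable = False  # link is still up, but the gateway is lost
print(select_active_nic(nic_a, nic_b).name)  # nic-b
```

Note that nothing in this decision consults the cluster heartbeat, which reflects the point made above: a public NIC failover is local to the server and does not by itself change cluster membership.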
The network design must provide the highest availability for the LAN infrastructure. To achieve this goal, the teaming service or dual-homing should be distributed between different access switches, which in turn should be connected to different aggregation switches, as illustrated in Figure 1-1.
Figure 1-1 Extending the public network.
Private Network Attachment
The private network primarily carries cluster heartbeat, or keepalive, messages. Other server-to-server communications that occur on this private network include the following:
- Cluster data
- Cluster file system data
- Application data (back-end)
The private network is a nonrouted network that shares the same Layer 2 VLAN between the nodes of the cluster even when extended across multiple sites. In a campus cluster environment, heartbeats are sent via the private network from node to node of the HA cluster using a proprietary Layer 2 (nonroutable) protocol. The servers manage the I/O by sending traffic over all interfaces and by preventing traffic from being sent over a failing path. This approach provides resiliency in the event of a NIC failure on a server.
The heartbeat is the most important traffic that the cluster carries over the private network interconnection. If all heartbeat paths go down for more than 10 seconds (a value applicable to most HA clusters), a split-brain situation can occur, which prompts the cluster framework to check the number of votes and decide which server or servers will continue as members of the cluster. Nodes that lose cluster membership assume that they are isolated, and any applications running on those nodes terminate. Surviving nodes know that the nonsurviving nodes have stopped, and the cluster then restarts the applications on the surviving nodes.
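The timeout-then-vote sequence described above can be illustrated with a short Python sketch. The 10-second value comes from the text; everything else is an assumption made for illustration: the function names are hypothetical, each node is assumed to hold one vote, and a strict majority is assumed to be the rule for keeping membership. Real cluster frameworks differ in how they assign and count votes.

```python
HEARTBEAT_TIMEOUT_S = 10.0  # from the text; typical for most HA clusters


def all_paths_down(last_heard, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Return True when every private path has been silent too long.

    last_heard maps a path name to the timestamp (in seconds) of the
    last heartbeat received on that path. One healthy path is enough
    to keep the peer considered reachable.
    """
    return all(now - seen > timeout for seen in last_heard.values())


def keeps_membership(partition_votes, total_votes):
    """Assumed rule: a partition survives only with a strict majority."""
    return 2 * partition_votes > total_votes


# Two private paths; the most recent heartbeat arrived at t=100.
last_heard = {"path-1": 100.0, "path-2": 95.0}
print(all_paths_down(last_heard, now=108.0))  # False
print(all_paths_down(last_heard, now=111.0))  # True

# Three nodes with one vote each, split into groups of two and one.
print(keeps_membership(2, 3))  # True
print(keeps_membership(1, 3))  # False
```

Under this assumed majority rule, an even split of a two-node cluster leaves neither side with enough votes (`keeps_membership(1, 2)` is `False`), which is why such clusters commonly rely on an additional tie-breaking vote.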
Although some HA cluster vendors recommend disabling Spanning Tree Protocol (STP) for the private interconnect infrastructure, such a drastic measure is neither necessary nor recommended when using Cisco Catalyst switches. Cisco provides the PortFast feature, which puts an access port into forwarding mode immediately after link up without losing loop-detection capabilities. To avoid connectivity delays, PortFast must be enabled on all access interfaces connecting cluster nodes. This rule also applies to any other servers connected to the switch. The IEEE defines the same concept in the Rapid STP (802.1w) standard under the edge port designation. In addition, Cisco supports Per-VLAN Spanning Tree, which maintains a separate spanning-tree instance for each VLAN configured in the network.
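As a rough illustration of applying PortFast to every access interface that connects a cluster node, the following Python sketch renders a configuration fragment for a list of ports. The helper `portfast_config`, the interface names, and the VLAN number are hypothetical. The generated lines use the classic Catalyst IOS forms `switchport mode access`, `switchport access vlan`, and `spanning-tree portfast`; the exact syntax varies by platform and software release, so treat the output as a template to check against the switch documentation rather than a definitive configuration.

```python
def portfast_config(interfaces, vlan):
    """Render an IOS-style fragment enabling PortFast on access ports.

    interfaces: iterable of interface names (hypothetical examples below)
    vlan: access VLAN used for the cluster private network (assumed value)
    """
    lines = []
    for name in interfaces:
        lines += [
            f"interface {name}",
            " switchport mode access",
            f" switchport access vlan {vlan}",
            " spanning-tree portfast",  # forward immediately after link up
            "!",
        ]
    return "\n".join(lines)


# Hypothetical ports that connect the two cluster nodes.
print(portfast_config(["GigabitEthernet1/0/1", "GigabitEthernet1/0/2"], vlan=100))
```

Generating the fragment from a single list of node-facing ports makes it harder to miss an interface, which matters because one port without PortFast is enough to delay connectivity for that node after a link event.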