Scalability and High Availability
One of the attractive cost benefits of SIP trunking is the ability to centralize PSTN access for the enterprise into a single large pipe. Doing so, however, creates several design considerations, chief among them scalability and high availability:
- Scalability: Routing all calls from the entire enterprise over a single or a small number of centralized SIP trunk access points means a SIP trunk capacity of several hundred to several thousand sessions for all but the smallest enterprises.
This implies border element session capacity that often far outstrips any single TDM gateway in the typical enterprise. Most enterprise gateways are in the 1 to 16 T1/E1 range, which equates to at most 384 sessions (16 T1 at 24 channels each) or 480 sessions (16 E1 at 30 channels each). Even a T3 gateway, a relative rarity in the average enterprise, presents only 672 sessions.
Some of the redundancy schemes covered in the remainder of this section also address scalability, through higher-capacity equipment and load balancing across clusters of individual boxes.
- High Availability: The more sessions that are concentrated into a single physical pipe, the larger the business impact of this single point of failure on your organization. For this reason, few enterprises truly deploy a single SIP trunk entry point into their networks; there are almost always multiple points.
Redundancy also becomes a much more pressing consideration because of the potentially large session capacity of SIP trunks. TDM gateway redundancy amounted to alternative routing over a different gateway when there was a failure. But when a single failure can now easily affect more than 1,000 calls, and potentially the routing of all PSTN-destined calls, the need to mitigate such a failure escalates.
You can deploy several strategies to protect against the business impact of a SIP trunk failure:
- Local and geographical SIP trunk redundancy
- Border element redundancy
- Load balancing and clustering
- PSTN TDM gateway failover
The handling for emergency calls that you decide on (see the "Emergency Calls" section earlier in this chapter) might affect considerations for the redundancy mechanisms discussed next.
Local and Geographical SIP Trunk Redundancy
For redundancy purposes, there are almost always multiple SIP trunk entry points into an enterprise network, even in a largely centralized design. This ensures that calls have alternative routing points in the event of an equipment failure, a building power outage, or a natural disaster in a particular region. The only realistic alternative to multiple SIP trunk entry points is to have a single SIP trunk and maintain TDM gateway access to the PSTN for failover, a scenario discussed later in this section. For small, single-site businesses, cellular phone access might be a realistic alternative to a single SIP trunk, but this is rarely practical for a multisite enterprise of any size.
Consider three different areas of SIP trunk redundancy:
- Local redundancy: Most SIP trunk services offer at least two IP addresses. For local redundancy the physical medium is most likely shared and terminates into the same building on your premises. Local redundancy protects against equipment failure or power failure to a single piece of equipment. These two IP addresses should ideally terminate onto two redundant border elements. Most providers offer either a primary/secondary or a load-balancing scheme that the enterprise can choose from.
- Geographic redundancy: Most medium-to-large enterprises prefer to bring the two IP addresses, or perhaps two different SIP trunks (that is, four IP addresses, each SIP trunk with local redundancy), into two separate buildings, likely data centers, in two different geographies. This protects against natural disasters and building-wide power or other outages.
- Service provider redundancy: Some enterprises and contact centers get SIP trunks from two different providers, both for least-cost routing opportunities and for redundancy purposes. If one provider is having problems, the other provider's facilities can carry all traffic. This scheme is easy to implement for outbound traffic but harder (due to DID mapping) for inbound traffic.
Border Element Redundancy
SIP trunks terminate on session border controllers, or border elements, in the enterprise. These elements have to be redundant for high-session-capacity SIP trunks, for both scalability and high availability reasons. You can provide redundancy for a particular border element platform in several ways (in addition to the local and geographic redundancy schemes discussed previously):
- In-box hardware redundancy
- Box-to-box hardware redundancy
- Clustering
In-Box Hardware Redundancy
In-box redundancy means duplicate processing components are contained within the platform itself, so that if one hardware component fails, another immediately takes over. In-box redundancy often covers components such as the CPU card, possibly the memory cards, the I/O interface cards, and the control plane and data plane forwarding engines.
The level of hardware redundancy the CUBE provides depends on the hardware platform on which the function is installed. The higher-end platforms offer more hardware redundancy than the lower-end platforms. In-box hardware redundancy is almost invariably seamless, also called stateful failover, so sessions are not dropped and end users on active calls are generally unaware that a hardware failover has occurred.
Box-to-Box Hardware Redundancy (1+1)
Box-to-box redundancy, or 1+1 redundancy, means there are duplicate platforms, acting and configured as a single one, in an active/standby arrangement with a keepalive mechanism between them. If the active hardware platform fails, the standby platform takes over.
One such method is the Hot Standby Router Protocol (HSRP) supported on Cisco IOS routers. With HSRP, transparent hardware failover is possible while maintaining a single SIP trunk (that is, a single visible IP address) to the service provider. How well HSRP works in a particular deployment depends on the service provider's IP addressing rules and the software release deployed on the CUBE.
HSRP redundancy is not inherently stateful, but it can support stateful failover if the higher layers of software add application-level checkpointing on top of the basic router keepalive. The operation of this mechanism is shown in Figure 7-6.
Figure 7-6 Using HSRP for Redundancy
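As an illustration of the concept, a minimal HSRP sketch for the provider-facing interfaces of two CUBE routers might look like the following. The interface names, addresses, and group number are hypothetical; the shared virtual address (192.0.2.1) is the single address the service provider targets.
! Active CUBE: higher priority wins the HSRP election
interface GigabitEthernet0/0
 ip address 192.0.2.2 255.255.255.0
 standby 1 ip 192.0.2.1
 standby 1 priority 110
 standby 1 preempt
!
! Standby CUBE: takes over the virtual address 192.0.2.1 on failure
interface GigabitEthernet0/0
 ip address 192.0.2.3 255.255.255.0
 standby 1 ip 192.0.2.1
 standby 1 priority 100
 standby 1 preempt
If stateful failover of active calls is required, this basic keepalive must be complemented by application-level checkpointing, as noted previously.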
Enterprise TDM gateways do not offer stateful failover redundancy because the session capacity per gateway is limited; therefore, the impact of a failure is limited. If an individual CUBE carries no more sessions than the average enterprise TDM gateway, there might be no reason to incur the cost of deploying high-end hardware with stateful failover capability on the border element either. Instead, border element clustering can provide effective redundancy, as it does for TDM gateways.
Clustering (N+1)
Redundancy via clustering, or N+1 redundancy, means there are duplicate platforms, independent of each other, each carrying a fraction of the traffic and together providing a high-session-count SIP trunk. There is no state sharing or keepalive between the components; if a single element is lost, some calls drop, but the entire SIP trunk does not go down.
The CUBE can be deployed in a clustering architecture with load balancing over the individual components managed by the attached devices or by a SIP proxy element. (Load balancing methods are explored further in the next section.) A clustering architecture has the advantage of a pool of smaller elements, each of which can be taken out of service and upgraded without affecting the entire SIP trunk. The cluster can also be spread over several buildings or geographic locations to address concerns about the impact of a power loss or a natural disaster on a single building or data center.
Load Balancing
SIP trunks from providers usually come with two (sometimes more) IP addresses. As previously discussed, you might want to have multiple border elements fronting this SIP trunk for both redundancy and scalability benefits. If you choose a load-balancing algorithm (as opposed to a primary/secondary active/standby arrangement) for the multiple platforms forming the network border, some network entity is required to do load balancing across the possible destinations.
You can use multiple ways to implement SIP trunk load balancing:
- Service provider load balancing
- DNS
- CUCM route groups and route lists
- Cisco Unified SIP proxy
Service Provider Load Balancing
Many SIP trunk providers offer the enterprise customer a choice between a primary/secondary and a load-balancing algorithm. If load balancing is chosen, it is implemented on either the provider's SIP softswitch or its provider edge SBC.
Domain Name System (DNS)
You can use DNS SRV records (RFC 2782) to provide multiple IP address resolutions for the same hostname. In this way, the individual platforms in the border element cluster can be addressed dynamically using the information returned by DNS. The operation of this mechanism is shown in Figure 7-7.
Figure 7-7 Using DNS SRV for Load Balancing
The attached SIP softswitch (this mechanism can be used on either the service provider side or the enterprise side) queries DNS for the IP addresses of the border elements. The originating softswitch uses these addresses to load balance traffic. If a call is presented to a CUBE that is overloaded (its configured CAC threshold has been reached), it returns a SIP 503 Service Unavailable response, and the softswitch can use the next available address in the DNS SRV record.
Not all service provider SIP trunk offerings support DNS, but when it is supported, this is generally a good method of load balancing. Even when it is not offered, the mechanism can still be used to good effect on the enterprise side of the network border. This method depends on predictable DNS server response times to ensure that post-dial delay (PDD) remains minimal.
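As a minimal sketch, assuming a hypothetical domain and hostnames, an SRV record set that spreads calls evenly across two border elements might look like this:
_sip._udp.trunk.example.com. 300 IN SRV 10 50 5060 cube1.example.com.
_sip._udp.trunk.example.com. 300 IN SRV 10 50 5060 cube2.example.com.
cube1.example.com.           300 IN A   198.51.100.11
cube2.example.com.           300 IN A   198.51.100.12
Because both SRV records carry the same priority (10) and weight (50), a compliant resolver distributes requests evenly between cube1 and cube2; unequal priorities or weights could instead express a primary/secondary arrangement.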
The DNS SRV mechanism can also be used for load-balancing calls outbound from the CUBE to an attached softswitch. If DNS is used for this call path, the SIP INVITE retry timer might need to be tuned to constrain PDD for outbound calls, as shown in Example 7-10.
Example 7-10. SIP Retry Timers
sip-ua
 retry invite 2
CUCM Route Groups and Route Lists
When connecting a CUCM to a cluster of border elements for PSTN SIP trunk access, its Route Group and Route List constructs can be used to implement a load-balancing algorithm for presenting calls outbound from the enterprise to the PSTN. Other SIP softswitches and IP PBXs most likely have comparable alternative routing capabilities that can be used in the same manner. The operation of this mechanism is shown in Figure 7-8.
Figure 7-8 CUCM Route Groups and Route Lists
Configure a Route Group on CUCM pointing to each individual border element. Aggregate these Route Groups into a Route List that points to the SIP trunk. Configure a Route Pattern in the CUCM dial plan to route calls with the appropriate dialed-number patterns to this Route List. Configure CAC on the individual CUBEs to refuse calls under overload conditions, forcing CUCM to reroute to the next Route Group in the Route List; a configuration sketch of such a CAC limit follows.
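As a minimal sketch of the CUBE-side CAC limit, assuming a hypothetical ceiling of roughly 400 concurrent calls per border element, the global call threshold mechanism can be used:
! Stop accepting new calls once 400 calls are active; resume
! accepting calls when the active count falls back to 380
call threshold global total-calls low 380 high 400
call treatment on
The exact threshold should be derived from the platform's tested session capacity and from each element's share of the overall SIP trunk capacity engineering covered later in this chapter.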
Cisco Unified SIP Proxy
The Cisco Unified SIP Proxy can be used with a cluster of border elements as a logical large-scale SIP trunk network border interface to the attached softswitches. That is, the attached softswitches on both the service provider and enterprise sides are unaware of the individual elements, or the number of them, in the CUBE cluster. This is a handy mechanism when:
- You build large-scale SIP trunks where the number of border elements exceeds the two IP addresses given by your provider.
- You want to grow the SIP trunk capacity over time without affecting the configurations of the attached softswitches on either side of the border.
The Cisco Unified SIP Proxy is responsible for the load balancing over the individual border elements, keeps track of their loads, and reroutes traffic when a particular element is overloaded or unavailable. The operation of this mechanism is shown in Figure 7-9.
Figure 7-9 Cisco Unified SIP Proxy and Border Element Cluster
In addition to load balancing, the Cisco Unified SIP Proxy offers many benefits to the SIP trunk interconnect:
- Hides the size of the border element pool from the attached softswitch configurations.
- Offers policy-based SIP trunk call routing such as time-of-day and least-cost routing.
- Offers powerful SIP Normalization capabilities.
- Offers graceful service degradation for upgrades or maintenance of the border elements.
- Offers an easy way to expand the capacity of your SIP trunk when your needs grow.
- Offers intrinsic redundancy because there isn't a single border element but a cluster of them. (The SIP proxy itself must, of course, be deployed in a redundant configuration; otherwise, it becomes a single point of failure.)
PSTN TDM Gateway Failover
An easy and cost-effective way to provide redundancy and failover for a SIP trunk is simply to reroute calls to your existing TDM gateways when the SIP trunk is unavailable or overloaded. This method provides a ready migration path while you ramp up SIP trunk traffic to full production and gives you more time to design and implement some of the other SIP trunk redundancy mechanisms in preparation for a future state in which your network might no longer have TDM connectivity. The operation of this mechanism is shown in Figure 7-10.
Figure 7-10 SIP Trunk to PSTN Failover
Configure call routing to use the SIP trunk as the primary path (using a higher-preference dial-peer) and the TDM gateway as the secondary path (using a lower-preference dial-peer), as illustrated in the sketch that follows. You can use the same physical Cisco platform for both functions so that adding a SIP trunk to your PSTN gateway does not mean adding equipment to the network.
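A minimal dial-peer sketch of this arrangement is shown next. The dialed pattern, provider address, and voice port are hypothetical; note that in Cisco IOS a numerically lower preference value is tried first.
! Primary path: SIP trunk toward the provider (tried first)
dial-peer voice 100 voip
 description Primary - SIP trunk
 destination-pattern 9T
 session protocol sipv2
 session target ipv4:203.0.113.10
 preference 1
!
! Secondary path: local TDM PRI to the PSTN (used on failure or overload)
dial-peer voice 101 pots
 description Backup - TDM PSTN gateway
 destination-pattern 9T
 port 0/0/0:23
 preference 2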
SIP Trunk Capacity Engineering
Part of the scalability assessment for your network is to determine how many concurrent sessions should be supported on the SIP trunk service offering that you get from a service provider. If you have current PSTN traffic statistics from your TDM gateways, this assessment is somewhat easier because the ratio of phones to trunks does not change with SIP trunking. But many enterprise networks do not have detailed, current statistics on these call patterns.
SIP trunk session sizing is also affected if you choose a centralized model, as opposed to the distributed model of traditional TDM trunking, where there is often oversubscription at each site. This oversubscription can be consolidated with a centralized SIP trunk facility, but you still have to engineer for some level of call traffic bursting in unusual situations.
As a ballpark assessment, you can use the same method of estimating trunk capacity (a trunk is equivalent to a SIP trunk session) that you used in traditional voice traffic engineering exercises. An average enterprise business can use a 5:1 trunking ratio, meaning that for every five phones you provision one trunk (SIP session). Enterprises that are primarily internally focused (for example, research facilities or engineering departments) can use a 10:1 ratio. Contact center deployments should use a 1:1 ratio, and phones in this context include both live agents and automated ports serving Interactive Voice Response (IVR) front-end applications.
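As a hypothetical illustration of these ratios, an enterprise with 4,000 general-purpose phones (5:1) and a 200-seat contact center (1:1, counting both agents and IVR ports) would provision on the order of 4,000 / 5 + 200 = 1,000 concurrent SIP trunk sessions, before adding any burst margin indicated by the traffic engineering exercise.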