Service Assurance
As illustrated in Figure 3-12, SLAs have evolved out of necessity from those based only on general network performance in Layers 1 through 3 (measuring metrics such as jitter and availability), to SLAs increasingly focused on network performance for specific applications (as managed by technologies such as a WAN optimization controller), to SLAs based on specific application metrics and business process SLAs based on key performance indicators (KPIs) such as cycle time or productivity rate. Examples of KPIs are the number of airline passengers who check in per hour or the number of new customer accounts provisioned.
- Traditional SPs/MSPs can differentiate themselves from OTPs by providing an end-to-end SLA as opposed to a resource-specific SLA
- End-to-end monitoring of service delivery is critical in this differentiation
Figure 3-12 Expanding the SLA Boundary
Customers expect that their critical business processes (such as payroll and order fulfillment) will always be available and that the service provider will supply sufficient resources to ensure application performance, even if a server fails or a data center becomes unavailable. This requires cloud providers to be able to scale up data center resources, ensure the mobility of virtual machines within the data center and across data centers, and provide supplemental computing resources in another data center, if needed.
With their combined data center and Cisco IP NGN assets, service providers can build relationships with independent software vendors with SaaS offerings, where end customers purchase services from the SaaS provider while the service provider delivers an assured end-to-end application experience.
In addition to SLAs for performance over the WAN and SLAs for application availability, customers expect that their hosted applications will have security protection in an external hosting environment. In many cases, they want the cloud service provider to improve the performance of applications in the data center and over the WAN, minimizing application response times and mitigating the effects of latency and congestion.
With their private IP/MPLS networks, cloud service providers can enhance application performance and availability in the cloud and deliver the visibility, monitoring, and reporting that customers require for assurance. As cloud service providers engineer their solutions, they should consider how they can continue to improve on their service offerings to support not only network and application SLAs but also SLAs for application transactions and business processes.
Service assurance solutions today need to cope with rapidly changing infrastructure configurations and to understand the status of a service against the backdrop of ever-changing customer ownership of that service. The solution also needs to understand the context of a service that can span traditionally separate IT domains, such as the IP WAN and the Data Center Network (DCN).
Ideally, such a solution should be based on a single platform and code base design that eliminates some of the complexities of understanding a service in a dynamic environment. This makes it easier to understand and support the cloud services platform and also eliminates costly and time-consuming product integration work. However, the single-platform design should not detract from the scalability and performance required in a large virtual public cloud environment, and it must support a high-availability (HA) deployment model.
Northbound and southbound integration with third-party tools, with well-defined and documented message formats and workflows that allow direct message interaction and web integration APIs, is a basic requirement for building a functional system.
An IaaS assurance deployment requires a real-time and extensible data model that can support the following:
- Normalized object representation of multiple types of devices and domain managers, their components, and configuration
- Flexible enough to represent networking equipment, operating systems, data center environmental equipment, standalone and chassis servers, and domain managers such as vSphere, vCloud Director, and Cisco UCS
- Able to manage multiple overlapping relationships among and between managed resources
  - Peer relationships, such as common membership in groups
  - Parent-child relationships, such as the relationship between a UCS chassis and blade
  - Fixed dependency relationships, such as the relationship between a process and an operating system
  - Mobile dependency relationships, such as the relationship between a VM and its current host system
  - Cross-silo discovered relationships, such as the relationship between a virtual host and a logical unit number (LUN) that represents a network-attached logical storage volume
- Linkages between managed objects and management data streams, such as the event database and performance metrics
- Security boundaries between sets of managed objects and subsets of users to enable use in multitenant environments
- Developer-extensible to allow common capabilities to be developed for all customers
- Field-extensible to enable services teams and customers to meet uncommon or unique requirements
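The data model requirements above can be made concrete with a small sketch. The following Python is illustrative only and is not taken from any particular assurance product: the names `ManagedObject`, `RelationType`, and `Model`, the `"shared"` tenant label, and the device identifiers are all hypothetical. It shows normalized objects, typed relationships (parent-child and mobile dependency), and a simple tenant filter standing in for a security boundary.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationType(Enum):
    PEER = "peer"                      # common membership in a group
    PARENT_CHILD = "parent_child"      # e.g. chassis -> blade
    FIXED_DEPENDENCY = "fixed_dep"     # e.g. process -> operating system
    MOBILE_DEPENDENCY = "mobile_dep"   # e.g. VM -> current host


@dataclass(frozen=True)
class ManagedObject:
    uid: str
    kind: str    # normalized type, e.g. "chassis", "blade", "vm"
    tenant: str  # hypothetical security boundary label


@dataclass
class Model:
    objects: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)  # (source_uid, type, target_uid)

    def add(self, obj):
        self.objects[obj.uid] = obj
        return obj

    def relate(self, source, rel, target):
        self.relations.add((source.uid, rel, target.uid))

    def move(self, source, rel, new_target):
        """Re-point a relationship, e.g. after a VM migrates to a new host."""
        self.relations = {
            r for r in self.relations
            if not (r[0] == source.uid and r[1] == rel)
        }
        self.relate(source, rel, new_target)

    def targets(self, source, rel, tenant):
        """Return related object ids that the given tenant is allowed to see."""
        return sorted(
            t for s, r, t in self.relations
            if s == source.uid and r == rel
            and self.objects[t].tenant in (tenant, "shared")
        )


model = Model()
chassis = model.add(ManagedObject("ucs-1", "chassis", "shared"))
blade_a = model.add(ManagedObject("blade-a", "blade", "shared"))
blade_b = model.add(ManagedObject("blade-b", "blade", "shared"))
vm = model.add(ManagedObject("vm-42", "vm", "tenant-a"))

model.relate(chassis, RelationType.PARENT_CHILD, blade_a)
model.relate(chassis, RelationType.PARENT_CHILD, blade_b)
model.relate(vm, RelationType.MOBILE_DEPENDENCY, blade_a)

# The VM migrates: only the mobile dependency changes, not the chassis layout.
model.move(vm, RelationType.MOBILE_DEPENDENCY, blade_b)
print(model.targets(vm, RelationType.MOBILE_DEPENDENCY, "tenant-a"))
```

A production model would also need persistence, discovery feeds, and access checks on the source object; the sketch only shows how overlapping relationship types can coexist on the same set of normalized objects.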
The ability to define logical relationships among service elements to represent the technical definition of a service is a critical step in providing a service-oriented impact analysis.
Service elements include the following:
- Physical: Systems, infrastructure, and network devices
- Logical: Aspects of a service that must be measured or evaluated
- Virtual: Software components, for example, processes
- Reference: Elements represented by other domain managers
In addition to understanding the service components, the solution must track the relationships among service elements, which are both fixed and dynamic. Fixed relationships capture definitions, such as the fact that this web application belongs to this service. Dynamic relationships are managed by the model; for example, the model identifies which Cisco UCS chassis is hosting the ESX server on which a virtual machine supporting this service is currently running.
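As a rough illustration of the difference between fixed and dynamic relationships, the following Python sketch resolves the chassis that currently hosts a service. All names (`SERVICE_MEMBERS`, `APP_RUNS_ON_VM`, `vm_to_esx_host`, `esx_host_to_chassis`, and the identifiers in them) are hypothetical; in a real deployment the dynamic mappings would be refreshed by discovery against the relevant domain managers rather than hard-coded.

```python
# Fixed relationships: declared once as part of the service definition.
SERVICE_MEMBERS = {"online-store": ["web-app"]}
APP_RUNS_ON_VM = {"web-app": "vm-42"}

# Dynamic relationships: expected to change as the infrastructure changes.
vm_to_esx_host = {"vm-42": "esx-01"}
esx_host_to_chassis = {"esx-01": "ucs-chassis-1", "esx-02": "ucs-chassis-2"}


def chassis_for_service(service):
    """Walk fixed, then dynamic, relationships to find the hosting chassis."""
    result = set()
    for app in SERVICE_MEMBERS[service]:
        vm = APP_RUNS_ON_VM[app]             # fixed: app -> VM
        host = vm_to_esx_host[vm]            # dynamic: VM -> current ESX host
        result.add(esx_host_to_chassis[host])  # dynamic: host -> chassis
    return result


before = chassis_for_service("online-store")
vm_to_esx_host["vm-42"] = "esx-02"  # the VM migrates to another ESX host
after = chassis_for_service("online-store")
print(before, after)
```

The service definition never changes in this example, yet the answer to "which chassis does this service depend on?" does, which is why the dynamic relationships have to be tracked by the model rather than written into the service definition.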
Service policies evaluate the state of, and the relationships among, elements and provide impact roll-up so that the services affected by a low-level device failure are known. They assist in root cause identification: starting from the service, a failure several levels deep in the infrastructure can be traced, and the service can be reported as up, down, or degraded. (For example, if a single web server in a load-balanced group is down, the service might be degraded.) Finally, service policies provide event storm filtering, roll-up, and windowing functions.
Together, the service elements, relationships, and service policies provide service visualization that allows operations to quickly determine the current state of a service, its service elements, and the current dynamic network and infrastructure resources, and that also allows service definition and tuning. A good example of a service assurance tool that supports these attributes and capabilities can be found at www.zenoss.com.
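A minimal sketch of such policies follows, assuming a service that depends on a load-balanced web group and a single database. The function names (`redundant_group`, `required_all`, `service_state`, `suppress_storm`) and the 60-second window are hypothetical choices made for illustration, not the behavior of any specific product.

```python
from enum import IntEnum


class State(IntEnum):
    UP = 0
    DEGRADED = 1
    DOWN = 2


def redundant_group(states):
    """Load-balanced group: down only if every member is down."""
    states = list(states)
    if states and all(s == State.DOWN for s in states):
        return State.DOWN
    if any(s != State.UP for s in states):
        return State.DEGRADED
    return State.UP


def required_all(states):
    """Hard dependencies: the worst child state wins."""
    return max(states, default=State.UP)


def service_state(web_servers, database):
    """Roll element states up to a single service state."""
    return required_all([redundant_group(web_servers.values()), database])


def suppress_storm(events, window_seconds):
    """Keep the first event per (element, type) within each time window."""
    last_kept = {}
    kept = []
    for ts, element, kind in sorted(events):
        key = (element, kind)
        if key not in last_kept or ts - last_kept[key] >= window_seconds:
            kept.append((ts, element, kind))
            last_kept[key] = ts
    return kept


one_down = service_state({"web-1": State.DOWN, "web-2": State.UP}, State.UP)
all_down = service_state({"web-1": State.DOWN, "web-2": State.DOWN}, State.UP)
db_down = service_state({"web-1": State.UP, "web-2": State.UP}, State.DOWN)
print(one_down.name, all_down.name, db_down.name)

events = [(0, "web-1", "down"), (10, "web-1", "down"),
          (70, "web-1", "down"), (5, "web-2", "down")]
filtered = suppress_storm(events, window_seconds=60)
print(len(filtered))
```

Separating the roll-up rule per relationship type (redundant versus required) is what lets one failed web server degrade the service while a failed database takes it down; the windowing function simply keeps repeated events from the same element from flooding operators.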