Service Life Cycle Management
Previously, we discussed how a service could be designed and, in this chapter, how a service is instantiated and transitioned into an operational state. Viewing this from an Information Technology Infrastructure Library (ITIL)4 perspective, these sections could be viewed as the service design and service transition phases of the ITIL V3 model, respectively. Of course, they don’t cover the entire best-practice recommendations of ITIL V3, but the essence is there. This section covers service life cycle management, which is a term often used to refer to the complete ITIL V3 framework, from service strategy to service operations, including continual improvement. However, this section refers to service life cycle management as the management of an operational service throughout its remaining lifetime until the tenant chooses to decommission and/or delete the service. The reason for this is that depending on the overall cloud operations model that is chosen by the provider and the type of cloud service offered, there will often be a division of responsibility between the provider and consumer at the service operational level, whereas the provider is normally fully responsible for the strategy, design, and transition process areas. Service life cycle management from the consumer's perspective begins when the service is operational.
Making the distinction between decommission and delete is important because the tenant might simply choose to decommission the service (that is, not have the service active) rather than to delete it. Consider the example of the development and test environments. While development is taking place, the test environment might not be needed and vice versa. If those environments incur costs while active, it might be more cost-effective to decommission a service with the view that it will be commissioned again when it is required. Deleting a service obviously removes all the building blocks of the service, releases any resources, and deletes the service definition so that the service can no longer be recovered.
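The difference between decommissioning and deleting can be sketched as a small state machine. This is only an illustration under stated assumptions: the state names, the `Service` class, and the `definition_retained` flag are hypothetical and do not come from any particular cloud platform.

```python
from enum import Enum


class ServiceState(Enum):
    OPERATIONAL = "operational"
    DECOMMISSIONED = "decommissioned"
    DELETED = "deleted"


# Assumed transition rules: a decommissioned service can be commissioned
# again, but a deleted service cannot be recovered.
ALLOWED_TRANSITIONS = {
    ServiceState.OPERATIONAL: {ServiceState.DECOMMISSIONED, ServiceState.DELETED},
    ServiceState.DECOMMISSIONED: {ServiceState.OPERATIONAL, ServiceState.DELETED},
    ServiceState.DELETED: set(),
}


class Service:
    """Hypothetical tenant service that starts in the operational state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = ServiceState.OPERATIONAL
        self.definition_retained = True

    def transition(self, target: ServiceState) -> None:
        if target not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot move {self.name} from {self.state.value} to {target.value}"
            )
        self.state = target
        if target is ServiceState.DELETED:
            # Deletion discards the service definition, so recovery is impossible.
            self.definition_retained = False


# A test environment is parked while development is under way, then reactivated.
test_env = Service("test-environment")
test_env.transition(ServiceState.DECOMMISSIONED)
test_env.transition(ServiceState.OPERATIONAL)
```

Keeping the allowed transitions in a single table makes the one-way nature of deletion explicit: the `DELETED` entry has no outgoing transitions.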
The service operations phase consists of a number of key process areas:
- Incident and problem management
- Event management
- Request fulfillment
- Access management
- Operations management
- Service desk function
ITIL V3 also expects some form of service improvement framework to be in place to support continual improvement, and this is never more important than in the operational phase. This chapter is concerned with orchestrating and automating cloud services, and this doesn’t stop simply because the service is operational. The following sections discuss in more detail the impact that the cloud and, in particular, orchestration and automation have on each of the ITIL V3 areas.
Incident and Problem Management
The primary focus of the incident management process is to manage the life cycle of all incidents, an incident being defined as an unplanned outage or loss of quality to an IT/cloud service. The primary objective of incident management is to return the IT service to users as quickly as possible. If services are being deployed into a multitenant environment, a single incident might affect many different tenants or users, so this becomes a critical process. The primary objectives of problem management are to prevent incidents from happening and to minimize the impact of incidents that cannot be prevented. Both the incident and problem management processes are typically managed by the service desk function, which is usually implemented in IT Service Management (ITSM) service desk software. All workflow and coordinating activities are managed in the service desk software, and this won't change when a service is hosted in a cloud, although the service desk software itself might need to support different usage patterns.
Event Management
The primary focus of event management is to filter and categorize events and to decide on appropriate actions. Event management is one of the main activities of service operations. Within the cloud, event handling and categorization will become a major task. The event manager must receive, categorize, enrich, and correlate alarms from virtual machines and the infrastructure, and it must also process alarms from provisioning and activation systems; in fact, alarms need to be processed from any operational activity that is being automated. Combine this complexity with the exponential demand and rate of change that the cloud offers and the need to have operational awareness of any issues to do with service provisioning or operations, and you have identified one of the major operational challenges of a cloud platform. Orchestration can play a major part in linking event handling to problem and incident resolution, as illustrated in Figure 11-9.
Figure 11-9 Orchestration for Incident and Problem Management
As the number and variety of alarms grow, the need to understand the operational context of an event becomes paramount. Is this event impacting multiple tenants or cloud users or just a single user? Does this event affect a service-level agreement (SLA) or not? Given the potential volume of events and the fact that a cloud is effectively open for business 24 hours a day, this analysis can no longer be performed by operators. The systems that perform event management can do event correlation; however, that is normally done by looking at the event data itself or by applying the event data within a single domain. For example, network faults can be correlated based on the device on which they occur: multiple failures can be traced to a failed uplink on that device or, using a network topology model, to the failure of an upstream device.
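Single-domain correlation against a topology model can be sketched as follows. This is a minimal illustration, not how any particular event manager works: the `UPSTREAM` map, the device names, and the event shape (a dictionary with a `device` key) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical topology model: each device maps to its upstream device
# (None marks the root of the tree).
UPSTREAM = {
    "core-1": None,
    "agg-1": "core-1",
    "access-1": "agg-1",
    "access-2": "agg-1",
}


def root_cause(device, failed):
    """Return the highest failed device on the path from `device` to the root."""
    cause = device
    current = UPSTREAM.get(device)
    while current is not None:
        if current in failed:
            cause = current
        current = UPSTREAM.get(current)
    return cause


def correlate(events):
    """Group device-failure events under their most likely upstream cause."""
    failed = {event["device"] for event in events}
    groups = defaultdict(list)
    for event in events:
        groups[root_cause(event["device"], failed)].append(event["device"])
    return dict(groups)


events = [{"device": "agg-1"}, {"device": "access-1"}, {"device": "access-2"}]
print(correlate(events))
```

In this sketch the three alarms collapse into one group keyed by `agg-1`, because both access devices sit behind the failed aggregation device. Note that the correlation uses only network data; it says nothing about which tenants or SLAs are affected.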
Where event management systems are often weak is in looking across domains and applying that operational context to a number of events. This is where orchestration can help. Orchestration can be used to process correlated and uncorrelated events within an operational context by querying other systems and relating different domain events together. Unfortunately, this will not happen "out of the box." The experience of senior support staff is required to build the orchestration workflows, and in effect, what you are doing is automating the investigation and diagnostic knowledge of engineers, or the known problem database in ITIL speak. Many years ago, Cisco launched a tool called MPLS Diagnostic Expert (MDE) that took the combined knowledge of the Cisco Technical Assistance Center (TAC) and distilled it into a workflow for determining connectivity for IP-VPN. One of the first case studies involved replaying, through the tool, a major service provider outage that had taken one full day to troubleshoot manually; the tool resolved the problem in ten minutes (correctly determining that the culprit, a network operator, had removed a Border Gateway Protocol [BGP] connectivity statement). The workflow was only successful because the subject matter expert responsible for its content really understood the technical context. Now expand that across multiple domains and include an understanding of the real-time state of the operational environment, and you begin to see the scope of the problem; however, the cloud is transformational, and this is one of the major areas that need to be considered by the provider when adopting the cloud. Consumers will normally be unaware of the underlying process and will typically only see the output of the event management and incident/problem management processes.
When the incident is raised directly by the customer, orchestration can still assist with investigation and resolution within the operational context, but the trigger is the incident rather than an event.
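Whether the trigger is an event or a customer-raised incident, the cross-domain step amounts to relating an affected resource to the tenants and SLAs it touches. The sketch below assumes the orchestrator has already queried other systems and holds the results in three hypothetical lookup tables (`HOST_TO_VMS`, `VM_TO_TENANT`, `TENANT_HAS_SLA`); a real workflow would query inventory and SLA systems instead.

```python
from dataclasses import dataclass, field

# Hypothetical data that an orchestrator might obtain by querying other systems.
HOST_TO_VMS = {"host-7": ["vm-a", "vm-b"], "host-9": ["vm-c"]}
VM_TO_TENANT = {"vm-a": "tenant-1", "vm-b": "tenant-2", "vm-c": "tenant-1"}
TENANT_HAS_SLA = {"tenant-1": True, "tenant-2": False}


@dataclass
class OperationalContext:
    resource: str
    tenants: set = field(default_factory=set)
    sla_affected: bool = False

    @property
    def multi_tenant(self) -> bool:
        return len(self.tenants) > 1


def build_context(resource: str) -> OperationalContext:
    """Relate an infrastructure resource to the tenants and SLAs it touches."""
    ctx = OperationalContext(resource)
    for vm in HOST_TO_VMS.get(resource, []):
        tenant = VM_TO_TENANT.get(vm)
        if tenant is not None:
            ctx.tenants.add(tenant)
    ctx.sla_affected = any(TENANT_HAS_SLA.get(t, False) for t in ctx.tenants)
    return ctx


ctx = build_context("host-7")
print(ctx.multi_tenant, ctx.sla_affected)
```

The resulting context answers the two questions raised earlier (does the problem affect multiple tenants, and does it affect an SLA?), which the service desk could then use to set incident priority.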
Request Fulfillment
The primary focus of the request fulfillment process is to fulfill service requests, which in most cases are minor (standard) changes (for example, requests to change a password) or requests for information. Depending on the type of request, the orchestrator can forward the request to the required element manager to process or simply raise a ticket in the service desk to allow an operator to respond to a request for information. Given the self-service nature of the cloud, most standard changes will be preapproved and simply fulfilled by the appropriate element manager.
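The routing decision described above can be sketched in a few lines. The request kinds, the `STANDARD_CHANGES` catalogue, and the element manager names are hypothetical; the only behavior taken from the text is that preapproved standard changes go to an element manager and other requests become service desk tickets.

```python
from dataclasses import dataclass


@dataclass
class ServiceRequest:
    kind: str        # for example "password_change" or "information"
    requester: str


# Hypothetical catalogue of preapproved standard changes and the
# element manager assumed to fulfill each one.
STANDARD_CHANGES = {
    "password_change": "identity-manager",
    "resize_vm": "compute-manager",
}


def route(request: ServiceRequest) -> tuple:
    """Decide where the orchestrator sends a request."""
    manager = STANDARD_CHANGES.get(request.kind)
    if manager is not None:
        # Preapproved standard change: forward straight to the element manager.
        return ("element_manager", manager)
    # Anything else becomes a ticket for an operator to answer.
    return ("service_desk_ticket", request.kind)


print(route(ServiceRequest("password_change", "alice")))
print(route(ServiceRequest("information", "bob")))
```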
Access Management
The primary focus of the access management process is to grant authorized users the right to use a service while preventing access by unauthorized users. The access management process essentially executes policies defined in IT security management and, as such, is a critical process within cloud operations. From a Software as a Service (SaaS) perspective, this is relatively simple because the provider is responsible for all aspects of the application. Access management, therefore, is focused on simply providing access to the application, and the application provides access to the data. In Infrastructure as a Service (IaaS), access must be granted at a more granular level:
- Access to the virtual machine
- Out-of-band access to the virtual machine, in the case of a configuration error
- Access to backups and snapshots
- Access to infrastructure consoles such as restore consoles
All this access needs to be provided in a consistent and secure manner across multiple identity repositories. Orchestration and automation can ensure that as users are added to the cloud, their identity and access rights are provisioned correctly and modified in a consistent manner.
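One way to picture consistent provisioning across multiple identity repositories is an all-or-nothing loop. The `IdentityRepository` class below is a hypothetical in-memory stand-in, not a real directory API, and the rollback assumes the user is new and has no earlier grants to preserve.

```python
class IdentityRepository:
    """Hypothetical stand-in for one identity store (directory, console, and so on)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.grants = {}  # user -> set of rights

    def grant(self, user: str, rights: set) -> None:
        self.grants[user] = set(rights)

    def revoke(self, user: str) -> None:
        self.grants.pop(user, None)


def provision_user(user, rights, repositories):
    """Apply the same rights to every repository, rolling back on failure."""
    done = []
    try:
        for repo in repositories:
            repo.grant(user, rights)
            done.append(repo)
    except Exception:
        # Undo partial work so the repositories never disagree about a new user.
        for repo in done:
            repo.revoke(user)
        raise


vm_access = IdentityRepository("vm-access")
backup_console = IdentityRepository("backup-console")
provision_user("alice", {"console", "snapshots"}, [vm_access, backup_console])
```

The design choice here is that a failure in any one repository leaves the new user with no access anywhere, which is easier to reason about than a partially provisioned identity.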
Operations Management
The primary focus of the operations management process is to monitor and control the IT services and IT infrastructure—in short, the day-to-day routine tasks related to the operation of infrastructure components and applications. Table 11-1 defines typical operational tasks and shows where orchestration and automation have a role to play. The task of facilities management is not covered because of the size and breadth of this subject.
Table 11-1 Operational Tasks
Task | Role of Orchestration |
---|---|
Patching | Typically, this function is carried out by element managers, but if the portal allows the consumer to upload specific patches and apply them, orchestration will be involved to coordinate the automated deployment and installation of the patches. |
Backup and restore | Typically, a backup is scheduled to occur on a regular basis, so this would be handled by the cloud scheduler application, but the initial creation, modification, and deletion of the backup job would be automated and coordinated by the orchestration system. |
Antivirus management | The orchestration system would coordinate the deployment of antivirus agents (if required), but typically the scanning, detection, and remediation of viruses and worms will be handled by the antivirus applications. |
Compliance checking | As with antivirus, the orchestration system would coordinate the deployment of compliance policies, but typically the scanning, detection, and reporting of compliance will be handled by the compliance applications. |
Monitoring | While monitoring is a key component of Continual Service Improvement from the provider perspective, this data is normally exported to the cloud portal to allow the tenants to see how their services are performing, so there is little orchestrator involvement here, apart from setting up the policy defining which data should be exported. |
The Cloud Service Desk
The cloud service desk is the single point of contact between the consumer and provider. As such, it is typically a view within the overall cloud portal that provides access to the support functions and standard change options. The following features are typically required for a cloud-enabled service desk:
- Support multiple tenants
- Support a web-based user interface and be Internet "hardened"
- Allow content to be embedded in another portal
- Support single sign-on
Continual Service Improvement
A dynamic, self-service demand model means that the IT environment and the services that run in the environment need to be continually monitored, analyzed, and optimized. The Continual Service Improvement (CSI) process should implement a monitoring framework that continually measures performance against an expected baseline and optimizes the infrastructure to ensure that any deviation from the baseline is managed effectively. It is in the optimization step that orchestration and automation can play a significant part in coordinating and performing the actions that bring the cloud platform performance back in line with the expected baseline.
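The measure-against-baseline step can be sketched as a simple deviation check. The 10 percent tolerance, the action names (`scale_out`, `scale_in`, `no_action`), and the use of a mean over recent samples are all assumptions for illustration; a real CSI framework would define its own metrics and responses, and the baseline must be nonzero for this calculation.

```python
from statistics import mean


def deviation(samples, baseline):
    """Relative deviation of the observed mean from the expected baseline."""
    return (mean(samples) - baseline) / baseline


def choose_action(samples, baseline, tolerance=0.10):
    """Map a deviation to a hypothetical optimization action."""
    d = deviation(samples, baseline)
    if d > tolerance:
        return "scale_out"
    if d < -tolerance:
        return "scale_in"
    return "no_action"


# Example: CPU utilization (percent) measured against an expected baseline of 60.
print(choose_action([80, 85, 90], baseline=60))
print(choose_action([58, 60, 62], baseline=60))
```

The returned action name is where orchestration would take over, coordinating whatever workflow brings the platform back toward the baseline.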