Introduction
IT operations teams essentially have a prime directive against which their success is constantly measured: to deliver performant applications at the lowest possible cost while maintaining compliance with IT policies.
This goal is thwarted by the almost intractable complexity of modern application architectures—whether virtualized, containerized, monolithic or microservices-based, on premises or in the public cloud, or a combination of them all—as well as by the sheer scale of the workloads under management and the constraints imposed by licensing and placement rules. Having a handle on which components of which applications depend on which pieces of the infrastructure is challenging enough; knowing where a given workload is supposed to—or allowed to—run is more difficult still; and knowing what specific decisions to make at any given time across the multicloud environment to achieve the prime directive is a Herculean task, beyond human scale. As a result, the prime directive often goes unmet.
Workload Optimizer—a separately licensed feature set within the Intersight platform—aims to solve this challenge through application resource management, ensuring that applications get the resources they need when they need them. Workload Optimizer helps applications perform well while minimizing cost in the public cloud, optimizing resource utilization on premises, and complying with workload policies.
Shortfalls of Traditional IT Resource Management
The traditional method of IT resource management has fallen short in the modern data center. This process-based approach typically involves several steps (a minimal code sketch of the loop appears after the list):
Step 1. Setting static thresholds for various infrastructure metrics, such as CPU or memory utilization
Step 2. Generating an alert when these thresholds are crossed
Step 3. Relying on a human being who views the alert to:
Determine whether the alert is anything to worry about. (What percentage of alerts on any given day are simply discarded in most IT shops? 70%? 80%? 90%?)
If the alert is worrisome, determine what action to take to push the metric back below the static threshold.
Step 4. Executing the necessary action and then repeating the cycle: lather, rinse, repeat.
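The loop below is a minimal sketch of Steps 1 and 2 in code. The host names, the stubbed metric source, and the 80% threshold are hypothetical stand-ins, not any real monitoring product’s API; note that Steps 3 and 4 remain entirely with the human.

# Minimal sketch of threshold-based alerting (Steps 1 and 2 only).
# All names are hypothetical; the metric source is simulated.
import random
import time

CPU_THRESHOLD = 0.80  # Step 1: a static threshold, chosen up front

def get_cpu_utilization(host: str) -> float:
    # Stand-in for a real metric collector.
    return random.random()

def monitor(hosts: list[str], cycles: int = 3) -> None:
    for _ in range(cycles):
        for host in hosts:
            util = get_cpu_utilization(host)
            if util > CPU_THRESHOLD:
                # Step 2: raise an alert. Deciding whether it matters,
                # choosing an action, and executing it (Steps 3 and 4)
                # are left to a human being.
                print(f"ALERT: {host} CPU at {util:.0%} "
                      f"(threshold {CPU_THRESHOLD:.0%})")
        time.sleep(1)

monitor(["esx-01", "esx-02"])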
This approach has significant fundamental flaws.
First, most such metrics are merely proxies for workload performance; they don’t measure the health of the workload itself. High CPU utilization on a server may be a positive sign that the infrastructure is well utilized and does not necessarily mean that an application is performing poorly. Even if the thresholds aren’t static but are centered on an observed baseline, there’s no telling whether a deviation from that baseline is good, bad, or merely different from normal.
Second, most thresholds are set low enough to provide human beings time to react to an alert (after having frequently ignored the first or second notifications), meaning expensive resources are not used efficiently.
Third, and maybe most importantly, this approach relies on human beings to decide what to do with any given alert. An IT administrator must somehow divine from all current alerts not just which ones are actionable but which specific actions to take. These actions are invariably intertwined with and will affect other application components and pieces of infrastructure in ways that are difficult to predict. For example:
A high CPU alert on a given host might be addressed by moving a virtual machine (VM) to another host—but which VM?
Which of the other hosts should receive it?
Does that other host have enough memory and network capacity for the intended move?
Will moving that VM create more problems than it solves?
Multiply this analysis by every potential metric and every application workload in the environment, and the problem becomes exponentially more difficult.
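To make the combinatorics concrete, the sketch below shows only the feasibility check for a single proposed move; the VM, hosts, and capacity figures are hypothetical. Even this simplified check must be repeated across every VM, every candidate host, and every resource dimension, and it still says nothing about whether the move is wise.

# Hypothetical sketch: is one proposed VM move even feasible?
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    cpu_ghz: float
    mem_gb: float
    net_mbps: float

@dataclass
class Host:
    name: str
    cpu_free_ghz: float
    mem_free_gb: float
    net_free_mbps: float

def can_host(vm: VM, host: Host) -> bool:
    # A move helps only if the destination has headroom on every
    # resource; relieving CPU pressure by overcommitting memory
    # simply relocates the problem.
    return (vm.cpu_ghz <= host.cpu_free_ghz
            and vm.mem_gb <= host.mem_free_gb
            and vm.net_mbps <= host.net_free_mbps)

vm = VM("app-db-01", cpu_ghz=4.0, mem_gb=32.0, net_mbps=500.0)
candidates = [
    Host("esx-02", cpu_free_ghz=8.0, mem_free_gb=16.0, net_free_mbps=2000.0),
    Host("esx-03", cpu_free_ghz=6.0, mem_free_gb=64.0, net_free_mbps=1000.0),
]
print([h.name for h in candidates if can_host(vm, h)])  # ['esx-03']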
Finally, the standard operating procedure is simply to clear alerts, but, as noted previously, any given alert is not a true indicator of application performance. As every IT administrator has seen time and again, healthy apps can generate red alerts, and “all green” infrastructures can still have poorly performing workloads. A different paradigm is needed, and Workload Optimizer provides such a paradigm.
Paradigm Shift
Workload Optimizer is an analytical decision engine that generates actions (recommendations that are optionally automatable in most cases) that drive the IT environment toward a desired state where workload performance is assured and cost is minimized. It uses economic principles (the fundamental laws of supply and demand) in a market-based abstraction to allow infrastructure entities (for example, hosts, VMs, containers, storage arrays) to shop for commodities such as CPU, memory, storage, or network resources.
This market analysis leads to actions. For example, a physical host that is maxed out on memory (high demand) would sell its memory at a high price to discourage new tenants, whereas a storage array with excess capacity would sell its space at a low price to encourage new workloads. While all this buying and selling takes place behind the scenes within the algorithmic model and does not correspond directly to any real-world dollar values, the economic principles are derived from the behaviors of real-world markets. These market cycles run constantly, in real time, ensuring that actions continually drive the environment toward the desired state. In this paradigm, workload performance and resource optimization are not an either/or proposition; in fact, they must be considered together to make the best decisions possible.
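As an illustration of the pricing idea only (not Workload Optimizer’s actual internal algorithm), the sketch below uses a simple utilization-based price function in which cost rises steeply as a resource fills up; the function, names, and figures are all assumptions.

# Illustrative utilization-based pricing: price climbs sharply as a
# commodity (CPU, memory, storage) approaches full utilization.
def commodity_price(utilization: float) -> float:
    u = min(max(utilization, 0.0), 0.99)  # clamp to avoid division by zero
    return 1.0 / (1.0 - u) ** 2

# A memory-saturated host quotes a high price to discourage new tenants;
# a lightly used storage array is cheap and attracts workloads.
print(f"host memory at 95% used:  price {commodity_price(0.95):,.0f}")
print(f"array space at 30% used:  price {commodity_price(0.30):.2f}")

def cheapest_seller(utilizations: dict[str, float]) -> str:
    # A workload "shops" for the lowest-priced provider of a commodity.
    return min(utilizations, key=lambda s: commodity_price(utilizations[s]))

print(cheapest_seller({"esx-01": 0.92, "esx-02": 0.55, "esx-03": 0.70}))

The quadratic form here is just one common choice; the essential property is only that price grows without bound as utilization approaches capacity, so heavily loaded entities price themselves out of attracting new demand.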
Workload Optimizer can be configured to either recommend or automate infrastructure actions related to placement, resizing, or scaling for either on-premises or cloud resources.
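As a purely hypothetical illustration (not Workload Optimizer’s actual configuration schema), such a policy could be represented per environment and action type like this:

# Hypothetical action policy: recommend vs. automate, per environment.
action_policy = {
    "on_premises": {
        "placement": "automate",   # move VMs between hosts without approval
        "resizing":  "recommend",  # propose vCPU/memory changes for review
    },
    "public_cloud": {
        "scaling":   "recommend",  # suggest instance changes (cost impact)
    },
}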