
Solving Congestion with Storage I/O Performance Monitoring

Chapter Description

This sample chapter from Detecting, Troubleshooting, and Preventing Congestion in Storage Networks explains the use of storage I/O performance monitoring for handling network congestion problems and includes practical case studies.

I/O Operations and Network Traffic Patterns

Traffic in a storage network is the direct result of an application initiating a read or write I/O operation. Because of this, network traffic patterns can be better understood by analyzing the application I/O profile, such as the timing, size, type, and rate of I/O operations. Essentially, the application I/O profile helps in understanding why the network has traffic.

Read I/O Operation in a Fibre Channel Fabric

Figure 5-8 shows a SCSI or NVMe read I/O operation in a Fibre Channel fabric. A host initiates a read I/O operation using a read command, which the host encapsulates in a Fibre Channel frame and sends out its port. The host-connected switchport receives the frame and sends it to the next hop, based on the destination in the frame header. The network of switches, in turn, delivers this frame to the target. Such a frame that carries a read command is called a read command frame (CMND).

Figure 5-8 SCSI or NVMe Read I/O Operation in a Fibre Channel Fabric

The target, after receiving the read command frame, sends the data to the host in one or more FC frames. These frames that carry data are called data frames (DATA). The exact number of data frames returned by the target depends on the I/O size of the read command. A full-size FC frame can transfer up to 2048 bytes (2 KB) of data. Hence, the target sends one data frame if the read I/O size is less than or equal to 2 KB. The size of this frame depends on the data carried by it plus the overhead of the header. However, when the I/O size is larger than 2 KB, the target sends the data in multiple frames. Typically, all these frames are full-size FC frames carrying 2 KB worth of data. If the size requested is not a multiple of 2 KB, then the last frame is smaller than 2 KB. For example, an I/O size of 4 KB results in two full-size FC frames. But if the I/O size is 5 KB, the target may send two full-size FC frames, each carrying 2 KB, and a third frame carrying any remaining data, which is 1 KB.
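
The following minimal Python sketch illustrates this frame-count arithmetic. It is only an illustration of the rule described above, assuming a 2048-byte data payload per full-size FC frame; it is not a tool from the chapter.

# Minimal sketch: count the data frames a target returns for a read I/O,
# assuming each full-size FC frame carries 2048 bytes of data payload.

FULL_FRAME_PAYLOAD = 2048  # bytes of data per full-size FC frame

def read_data_frames(io_size_bytes: int) -> list:
    """Return the data payload size of each frame for a read of io_size_bytes."""
    full_frames, remainder = divmod(io_size_bytes, FULL_FRAME_PAYLOAD)
    frames = [FULL_FRAME_PAYLOAD] * full_frames
    if remainder:
        frames.append(remainder)  # the last frame is smaller than 2 KB
    return frames

print(read_data_frames(4 * 1024))  # 4 KB read -> [2048, 2048]
print(read_data_frames(5 * 1024))  # 5 KB read -> [2048, 2048, 1024]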

After sending all the data to the host, the target indicates the completion of the I/O operation by sending a response, which carries the status. A frame that carries a response is called a response frame (RSP).

Some implementations can optimize the read I/O operations by sending the last data and the response in the same frame if their combined size is below 2 KB. These optimized read I/O operations may not always have dedicated response frames. Regardless of the type of read I/O operation, the effect on network traffic remains the same.

Write I/O Operation in a Fibre Channel Fabric

Figure 5-9 shows a SCSI or NVMe write I/O operation in a Fibre Channel fabric. A host initiates a write I/O operation using a write command, which the host encapsulates in a Fibre Channel frame and sends out its port. The host-connected switchport receives the frame and sends it to the next hop, based on the destination in the frame header. The network of switches, in turn, delivers this frame to the target. Such a frame that carries a write command is called a write command frame (CMND).

Figure 5-9 SCSI or NVMe Write I/O Operation in a Fibre Channel Fabric

The target, after receiving the write command frame, prepares to receive the data and sends a frame to the host indicating that it is ready to receive all or some of the write data. This is called a transfer-ready frame (XFER_RDY). A transfer-ready frame carries the amount of data that the target is ready to receive in one sequence or burst. Refer to Chapter 2, “Understanding Congestion in Fibre Channel Fabrics,” for more details on a Fibre Channel sequence. Typically, this size is the same as the size requested by the write command frame. But sometimes, the target may not have the resources to receive all the data that the host wants to write in a single sequence. For example, a host may want to write 4 MB of data, which it specifies in the write command frame. The target, however, may have the resources to accept only 1 MB of data at a time. Hence, the target sends 1 MB as the burst length in the transfer-ready frame.

The host, after receiving the transfer-ready frame, sends the data to the target in one or more FC frames. These frames are called data frames (DATA). The exact number of data frames sent by the host depends on the burst size in the transfer-ready frame and follows the same rules explained previously for read I/O operations. The difference for write I/O operations is that multiple transfer-ready sequences may be involved if the target returns a burst size that is smaller than the I/O size of the write command.

After receiving all the data that the host requested to write in this I/O operation (which may have arrived in multiple sequences because the target sent one or more transfer-ready frames), the target indicates the completion of the I/O operation by sending a response, which carries the status. A frame that carries a response is called a response frame (RSP).

Some implementations can optimize the write I/O operations by eliminating the transfer-ready frame. In such cases, the target informs the initiator, during the process login (PRLI) state, that it will always keep the resources ready to receive a minimum size (first burst) of data. The initiator sends the data frames immediately after sending the write command frame, without waiting for a transfer-ready frame to arrive. Regardless of the type of write I/O operation, the effect on network traffic is the same.

Network Traffic Direction

Table 5-1 shows the direction of traffic as a result of a read I/O operation in Figure 5-8. Figure 5-10 shows the traffic directions on various network ports due to different sequences of read and write I/O operations.

Table 5-1 Traffic Direction in a Storage Network Because of Read I/O Operation

Frame Type | Host Port | Host-Connected Switchport | ISL Port on Host-Edge Switch | ISL Port on Storage-Edge Switch | Storage-Connected Switchport | Storage Port
Read I/O command frame | Egress | Ingress | Egress | Ingress | Egress | Ingress
Read I/O data frame | Ingress | Egress | Ingress | Egress | Ingress | Egress
Read I/O response frame | Ingress | Egress | Ingress | Egress | Ingress | Egress

Figure 5-10 Network Traffic Direction Because of Read and Write I/O Operations

Table 5-2 explains the direction of traffic because of a write I/O operation in Figure 5-9. Figure 5-10 shows the traffic directions on various network ports due to different sequences of read and write I/O operations.

Table 5-2 Traffic Direction in a Storage Network Because of Write I/O Operation

Frame Type | Host Port | Host-Connected Switchport | ISL Port on Host-Edge Switch | ISL Port on Storage-Edge Switch | Storage-Connected Switchport | Storage Port
Write I/O command frame | Egress | Ingress | Egress | Ingress | Egress | Ingress
Write I/O transfer-ready frame | Ingress | Egress | Ingress | Egress | Ingress | Egress
Write I/O data frame | Egress | Ingress | Egress | Ingress | Egress | Ingress
Write I/O response frame | Ingress | Egress | Ingress | Egress | Ingress | Egress

As is clear from Table 5-1 and Table 5-2, egress traffic on the host port, which is the same as the ingress traffic on the host-connected switchport, is due to:

  • Read I/O command frames

  • Write I/O command frames

  • Write I/O data frames

Similarly, ingress traffic on the host port, which is the same as the egress traffic on the host-connected switchport, is due to:

  • Read I/O data frames

  • Read I/O response frames

  • Write I/O transfer-ready frames

  • Write I/O response frames

Typically, a network switch doesn’t need to know the type of a frame (command, data, transfer-ready, or response frame) in order to send the frame toward its destination. However, without knowing the frame type, the real cause of the observed throughput can’t be explained. This is another reason for monitoring storage I/O performance by using SAN Analytics.

Network Traffic Throughput

The previous section explains the direction of traffic for read and write I/O operations. But not all the frames are of the same size. Read and write I/O data frames are large and usually occur in larger quantities. Hence, they are the major contributors to link utilization. Other frames, such as read and write I/O command frames, response frames, and write I/O transfer-ready frames, are small and relatively few. Hence, they cause much lower link utilization. Table 5-3 shows the typical sizes of different frame types for SCSI and NVMe I/O operations.

Table 5-3 Typical Sizes of Frames for SCSI and NVMe I/O Operations

FC Frame Type | FC Frame Size Using SCSI | FC Frame Size Using NVMe
Read command frame | 68 bytes | 68 bytes
Read data frame | I/O sizes of 2 KB or larger typically result in full-size FC frames (2148 bytes); smaller I/O sizes result in smaller frames | Same as SCSI
Read response frame | 60 bytes | 60 bytes
Write command frame | 68 bytes | 132 bytes
Write transfer-ready frame | 48 bytes | 48 bytes
Write data frame | I/O sizes of 2 KB or larger typically result in full-size FC frames (2148 bytes); smaller I/O sizes result in smaller frames | Same as SCSI
Write response frame | 60 bytes | 68 bytes

Correlating I/O Operations, Traffic Patterns, and Network Congestion

The directions and sizes of various frames in a storage network lead to the following conclusions:

  • Read and write data frames are the major cause of link utilization. Other frames, such as command frames and response frames, are small, and their throughput is negligible compared to that of data frames.

  • Read and write data frames flow only after (or as the result of) command frames.

  • A command frame, based on the size of the requested data (called I/O size), can generate many data frames.

  • Most data frames of an I/O operation are full sized, except the last frame in the sequence.

  • Read data frames flow from storage (target) to hosts (initiators), whereas write data frames flow from hosts to storage.

  • When a host-connected switchport is highly utilized in the egress direction, it’s mostly due to read data frames. Likewise, when a storage-connected switchport is highly utilized in the egress direction, it’s mostly due to write data frames.

  • The key reason for congestion due to slow drain from hosts and for congestion due to overutilization of the host link is multiple concurrent large-size read I/O command frames from the host. In other words, the host is asking for more data than it can process or than can be sent to it on its link.

  • The key reason for congestion due to slow drain from a storage port or due to overutilization of the storage link is the total amount of data being requested by the storage array via multiple concurrent write I/O transfer-ready frames. In other words, the storage array is asking for more data than it can process or than can be sent to it on its link.

These conclusions are extremely useful in understanding the reason for congestion caused by a culprit device or the effect of congestion on the victim devices. These conclusions also explain that host port or switchport monitoring can detect congestion, whereas storage I/O performance monitoring can give insights into why the congestion exists.

For example, Figure 5-11 illustrates congestion due to overutilization of the host links because of large-size read I/O operations. The host connects at 32 GFC. It initiates 5000 read I/O operations per second (IOPS), each requesting to read 1 MB of data from various targets. To initiate these I/O operations, the host sends 5000 command frames per second, each 68 bytes, resulting in a host port egress throughput of approximately 2.7 Mbps (5000 × 68 bytes × 8 bits per byte); this is the same as the ingress throughput on the host-connected switchport. Because the maximum data rate of a 32 GFC port is 28.025 Gbps, these command frames result in about 0.01% utilization, which is negligible.

Figure 5-11 Congestion Due to Overutilization Because of Large-Size Read I/O Operations

The targets, after receiving these command frames, send the data for every I/O operation in approximately 512 full-size frames (2048 bytes per frame). For 5000 IOPS, the targets send 2,560,000 frames/second (5000 × 512), each 2148 bytes (including the header). These data frames lead to a throughput of 44 Gbps (2,560,000 × 2148 bytes × 8 bits per byte). But the host can receive only 28.025 Gbps on the 32 GFC link. This condition results in congestion due to overutilization of the host link. The key point to understand is that the ingress utilization of the host-connected switchport is negligible, yet this minimal throughput results in 100% egress utilization. From the perspective of the network, these are just the percentage utilizations of the links. Only after getting insight into the I/O operations can the real reason for the link utilization be explained.
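
The arithmetic in this example can be reproduced with a short Python sketch. It is only an illustration of the numbers above; it ignores response frames, inter-frame gaps, and primitives.

# Minimal sketch of the Figure 5-11 arithmetic: 5000 read IOPS, 1 MB per read,
# on a 32 GFC host link. Ignores response frames, inter-frame gaps, and primitives.

LINK_GBPS = 28.025        # maximum data rate of a 32 GFC port
CMND_FRAME_BYTES = 68     # read command frame
FULL_FRAME_BYTES = 2148   # full-size FC frame, including header overhead
DATA_PER_FRAME = 2048     # data payload per full-size frame

iops = 5000
io_size = 1024 * 1024     # 1 MB per read I/O operation

cmnd_gbps = iops * CMND_FRAME_BYTES * 8 / 1e9                   # host egress (commands)
frames_per_io = io_size // DATA_PER_FRAME                       # 512 data frames per read
data_gbps = iops * frames_per_io * FULL_FRAME_BYTES * 8 / 1e9   # host ingress (data)

print(f"Command traffic: {cmnd_gbps * 1000:.2f} Mbps "
      f"({cmnd_gbps / LINK_GBPS:.2%} of the link)")
print(f"Data traffic demanded: {data_gbps:.1f} Gbps vs {LINK_GBPS} Gbps available")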

Although the read I/O data frames make up most of the egress traffic on a host-connected switchport, these data frames are just a consequence of the read I/O command frames that were sent by the host port. Because limiting the rate of read I/O command frames can lower the rate of read I/O data frames, limiting the rate of ingress traffic on the host-connected switchport can lower the rate of egress traffic on this port. This logic forms the foundation of Dynamic Ingress Rate Limiting, which is a congestion prevention mechanism explained in Chapter 6, “Preventing Congestion in Fibre Channel Fabrics.”

Case Study 1: A Trading Company That Predicted Congestion Issues Using SAN Analytics

A trading company has thousands of devices connected to a Fibre Channel fabric, and it has multiple such fabrics. Because of the large scale, the company has always had minor congestion issues. However, the severity and number of such issues increased as the company deployed all-flash storage arrays. In an investigation, they found that the newer congestion issues were due to the overutilization of the host links. Most hosts were connected to the fabric at 8 GFC. The older storage arrays were connected at 16 GFC. But the newer all-flash arrays were connected at 32 GFC, which increased the speed mismatch between the hosts and the storage. As explained in Chapter 1, “Introduction to Congestion in Storage Networks,” this speed mismatch, combined with the high performance of all-flash arrays, was the root cause of the increased occurrences of congestion issues.

The trading company understood the problem and its root cause. It also understood that the real solution was to upgrade the hosts because doing so would eliminate the speed mismatch with the all-flash storage arrays, essentially removing one major cause of congestion due to overutilization of the host links. But, due to finite human resources, the company could only upgrade a few hundred hosts every month. At this pace, it would take many years to upgrade all the hosts, and the company would be subject to congestion issues during this time. While the company could not speed up this change, it wanted a prioritized list of the hosts that were most likely to cause congestion. Upgrading hosts in that order, instead of randomly or in an order that ignored the likelihood of congestion, would allow the company to minimize congestion issues.

Background

The trading company uses storage arrays from two major vendors. The hosts include almost all kinds of servers (such as blade and rack-mount servers) from all major vendors. The company uses all major operating systems for hosting hundreds of applications.

The trading company uses Cisco MDS switches (mostly modular directors) in its Fibre Channel fabrics. Most connections were capable of running at 16 GFC. However, while deploying all-flash arrays, the company upgraded the storage connections to 32 GFC. For management and monitoring of the fabric, the company uses Cisco Data Center Network Manager (DCNM), which has since been rebranded as Nexus Dashboard Fabric Controller (NDFC).

Initial Investigation: Finding the Cause and Source of Congestion

The trading company used the following tools for detecting and investigating congestion issues:

  • Alerts from Cisco MDS switches: The company had enabled alerts for Tx B2B credit unavailability by using the TxWait counter and alerts for high link utilization by using the Tx-datarate counter. As the company deployed all-flash arrays, the number of alerts generated due to TxWait didn’t change, but the number of alerts due to Tx-datarate increased.

  • Traffic trends, seasonality, and peak utilization using DCNM: After receiving the alerts from the MDS switches, the trading company used the historic traffic patterns in DCNM. The host ports that generated Tx-datarate alerts showed increased peak utilization. This increased utilization coincided with the time when the company deployed all-flash storage arrays.

These two mechanisms are explained in detail in Chapter 3, “Detecting Congestion in Fibre Channel Fabrics.”

A Better Host Upgrade Plan

The trading company designed the host upgrade plan using two steps:

  • Step 1. Detect the hosts that were already causing congestion and upgrade them first.

  • Step 2. Predict what hosts were most likely to cause congestion and upgrade them next.

Step 1: Detect Congestion

The trading company detected the hosts that needed urgent attention, as explained earlier, in the section “Initial Investigation: Finding the Cause and Source of Congestion.” These were the first ports to be upgraded, and the company prioritized upgrading the ports with slower speeds. But only a small percentage of the hosts made it to this list, and the company still wanted a prioritized list of the other hosts.

Step 2: Predict Congestion

The next step in designing a host upgrade plan (that is, a priority list of hosts) was finding the hosts that were most likely to cause congestion due to overutilization of their links.

In addition, the company wanted to find the hosts that were causing congestion but that could not be detected in Step 1. Any detection approach has a minimum time granularity. Events that are sustained for a shorter duration than the minimum time granularity often remain undetected. For example, even if congestion is detected at a granularity of 1 second, many congestion issues that are sustained for microseconds (sometimes called microcongestion) can’t be detected. This is common with the all-flash storage arrays that have response times in microseconds. Because of this, the usual detection mechanisms used in Step 1 can’t predict the likelihood of congestion.

This is where the insights obtained by using SAN Analytics help. The trading company enabled SAN Analytics on all its storage ports. Although only the storage ports inspected the traffic, the visibility from SAN Analytics was end-to-end at a granularity of every initiator, target, and logical unit (LUN) or ITL flow.

After collecting I/O flow metrics for a week, the company took the following steps (see Figure 5-12):

Figure 5-12 Sorted List of Hosts Based on Peak Read I/O Size for Predicting Congestion Due to Overutilization

  • Step 1. The company extracted the read I/O size, write I/O size, read IOPS, and write IOPS for all the hosts.

  • Step 2. The company made sorted lists of the hosts according to read I/O size and read IOPS. In other words, the company found the hosts with the largest read I/O size and highest read IOPS. Write I/O size and write IOPS were not considered because, as mentioned in the section “Correlating I/O Operations, Traffic Patterns, and Network Congestion,” most traffic due to write I/O operations flows from hosts to targets and does not lead to congestion due to overutilization of the host link.

  • Step 3. The company assumed that the hosts at the top of the list were more likely to cause congestion of their links and upgraded these hosts before upgrading the hosts with smaller read I/O sizes and lower IOPS.

A key consideration in predicting congestion is to focus on the peak values instead of the average values of the I/O flow metrics. This is because high average values indicate that the real-time values are sustained for a while. In this case, sustained traffic could have been detected by the Tx-datarate alert in Step 1, which has a granularity of 10 seconds. But the Tx-datarate counter could miss occasional spikes in traffic that are sustained only for a few milliseconds or even seconds. Such conditions can be found or even predicted by focusing on the peak values of the I/O flow metrics.

Another consideration is to prioritize the I/O size metric over the IOPS metric—for two key reasons. First, as explained earlier in this chapter, in the section “I/O Size,” I/O size is determined by the application or the host, and it is not affected by network congestion. In contrast, IOPS is reduced during network congestion. The second reason is that I/O size is an absolute metric, which means it is directly collected from the frame headers. As a result, its peak value is not affected by averaging. In contrast, IOPS is a derived metric from the average number of I/O operations over a duration such as 30 seconds. Even the most granular value of IOPS must be calculated over a duration, which makes it an average value. This goes against the benefit of the peak values explained earlier.

For collecting data, the trading company used a custom-developed collector that polled the metrics for initiator flows every 30 seconds from the MDS switches and then used the peak values in 6-hour ranges. The collector was custom developed because this use case was very specific and was not available ready-made at that time on the MDS switches or in SAN Insights; the raw metrics were available, but not in an easy-to-interpret format, which is what the custom development provided. This enhancement was later integrated into Cisco NX-OS running on MDS switches and is now available by default.
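
The following Python sketch shows the kind of post-processing such a collector might perform: reducing 30-second samples to a peak value per 6-hour window and ranking initiators by that peak. The sample records and field layout are assumptions for illustration only; the actual collector polled the metrics from the MDS switches, which is not shown here.

# Minimal sketch: reduce 30-second read I/O size samples per initiator to the
# peak value observed in each 6-hour window, then rank initiators by that peak.
# The sample records below are hypothetical; a real collector would poll the
# MDS switch telemetry instead of using in-memory data.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)

# (timestamp, initiator FCID, peak read I/O size in bytes for that 30 s poll)
samples = [
    (datetime(2023, 5, 1, 9, 0, 0), "0x320076", 1_200_000),
    (datetime(2023, 5, 1, 9, 0, 30), "0x320076", 64_000),
    (datetime(2023, 5, 1, 9, 1, 0), "0x4100a0", 32_000),
    (datetime(2023, 5, 1, 16, 0, 0), "0x4100a0", 900_000),
]

def window_start(ts: datetime) -> datetime:
    """Align a timestamp to the start of its 6-hour window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + WINDOW * ((ts - midnight) // WINDOW)

peaks = defaultdict(int)  # (initiator, window start) -> peak read I/O size
for ts, initiator, read_io_size in samples:
    key = (initiator, window_start(ts))
    peaks[key] = max(peaks[key], read_io_size)

# Sort initiators by peak read I/O size, largest first (upgrade these first).
for (initiator, window), peak in sorted(peaks.items(), key=lambda kv: -kv[1]):
    print(f"{initiator}  window={window:%Y-%m-%d %H:%M}  peak read I/O size={peak} B")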

Example 5-2 shows the output of a similar custom development based on the ShowAnalytics command on MDS switches: a sorted list of initiators according to their read I/O sizes. The ShowAnalytics command is a presentation layer for the raw flow metrics and is written in Python. Many use cases are available ready-made, and users can enhance their functionality even further. More details are available at https://github.com/Cisco-SAN/ShowAnalytics-Examples/tree/master/004-advanced-top-iosize. The output in Example 5-2 comes from a modified version of the ShowAnalytics command.

Example 5-2 Finding I/O Sizes of Hosts by Using SAN Analytics

MDS# python bootflash:analytics-top-iosize.py --top --key RIOSIZE

+--------+------------------------------------------+-------------------+
|  PORT  |        VSAN|Initiator|Target|LUN         |      IO SIZE      |
+--------+------------------------------------------+-------------------+
|        |                                          |   Read  |  Write  |
| fc1/35 | 20|0x320076|0x050101|002c-0000-0000-0000 |  1.2 MB |32.0 KB  |
| fc1/34 | 20|0x320076|0x050041|000c-0000-0000-0000 |  1.1 MB |32.0 KB  |
| fc1/33 | 20|0x320076|0x050021|002f-0000-0000-0000 |  1.0 MB |25.6 KB  |
| fc1/35 | 20|0x320076|0x050101|001b-0000-0000-0000 |  1.0 MB |48.0 KB  |
| fc1/33 | 20|0x320076|0x050021|0017-0000-0000-0000 | 992.0 KB|27.4 KB  |
| fc1/33 | 20|0x320076|0x050021|0026-0000-0000-0000 | 992.0 KB|32.0 KB  |
| fc1/33 | 20|0x320076|0x050021|0022-0000-0000-0000 | 960.0 KB|32.0 KB  |
| fc1/34 | 20|0x320076|0x050041|0025-0000-0000-0000 | 960.0 KB|28.0 KB  |
| fc1/35 | 20|0x320076|0x050101|001a-0000-0000-0000 | 960.0 KB|32.0 KB  |
| fc1/34 | 20|0x320076|0x050041|0014-0000-0000-0000 | 928.0 KB|32.0 KB  |
+--------+------------------------------------------+-------------------+

Case Study 1 Summary

The trading company reduced its congestion issues by designing a two-step host upgrade plan. In Step 1, the company used the congestion detection capabilities of Cisco MDS switches and DCNM (NDFC). In Step 2, it used the predictive capabilities of SAN Analytics. Instead of upgrading the hosts randomly, the company prioritized upgrading the hosts that were more likely to cause congestion based on the peak read I/O size values. By following this plan, the company lowered the severity of congestion, and the number of such issues was only a fraction of what it had been at the beginning of the upgrade cycle, when the company started deploying all-flash arrays.

Case Study 2: A University That Avoided Congestion Issues by Correcting Multipathing Misconfiguration

A university observed congestion issues in its storage networks. After enabling alerting on the MDS switches, the university concluded that the congestion was due to the overutilization of a few host links.

The university monitored the read and write I/O throughput on these hosts by using the host-centric approach described earlier in this chapter, in the section “Storage I/O Performance Monitoring in the Host.” The throughput reported by the operating system (Linux) was much lower than the combined capacity of the host ports. This led the university to believe that ample network capacity was still available.

The university wanted to know why these hosts caused congestion due to overutilization even though the I/O throughput was less than the available capacity. Finding the reason for the congestion would pave the way to a solution.

Background

The university used the Port-Monitor feature to automatically detect congestion and generate alerts on Cisco MDS switches. It also enabled SAN Analytics and exported the metrics to DCNM/NDFC SAN Insights for long-term trending and end-to-end correlation of the I/O flow metrics.

Investigation

The university had measured the host I/O throughput at the operating system, which was the combined throughput, but it had not measured the per-path I/O throughput. This was important because its hosts were connected to the storage arrays via two independent and redundant Fibre Channel fabrics (Fab-A and Fab-B). Most of its hosts had two HBAs, each with two ports (for a total of four ports). The first port on both HBAs connected to Fab-A, whereas the second port on both HBAs connected to Fab-B (see Figure 5-13).

Figure 5-13 Per-Path Throughput Monitoring Helps in Finding Multipathing Misconfiguration

The university used SAN Analytics to find the throughput per path, which is also available in DCNM SAN Insights. It found that although the combined throughput reported by SAN Insights was the same as the throughput measured at the operating system, the per-path throughput was not uniformly balanced. The ports connected to Fab-A were up to four times more utilized than the ports connected to Fab-B. When the host I/O throughput spiked, the increase seen on the ports connected to Fab-A was up to four times more than the increase seen on the ports connected to Fab-B. During this spike, the ports connected to Fab-A operated at full capacity, while the ports connected to Fab-B were underutilized. This was the reason for congestion due to the overutilization of host links in Fab-A.

In Figure 5-13, traffic imbalance among the four host links can also be detected by measuring the utilization of host ports or their connected switchports. But if the hosts are within a blade server chassis, finding this traffic imbalance is not possible just by measuring port utilization. For example, in Cisco UCS architecture, the links that connect to the MDS switches can carry traffic for up to 160 servers, each with multiple initiators. Finding the throughput per initiator is possible only after getting flow-level visibility, as provided by SAN Analytics.
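
A minimal sketch of the per-path check is shown below. The flow records, path names, and imbalance threshold are made up for illustration; in practice, the per-initiator, per-port read throughput would come from SAN Analytics or SAN Insights.

# Minimal sketch: aggregate read throughput per host port (path) and flag
# an imbalance across paths. The flow records are hypothetical examples.
from collections import defaultdict

# (host, host port / fabric, read throughput in MB/s)
flow_metrics = [
    ("host-01", "vmhba1-FabA", 800),
    ("host-01", "vmhba2-FabA", 750),
    ("host-01", "vmhba1-FabB", 200),
    ("host-01", "vmhba2-FabB", 180),
]

per_path = defaultdict(float)
for host, path, mbps in flow_metrics:
    per_path[(host, path)] += mbps

combined = sum(per_path.values())
print(f"Combined throughput: {combined:.0f} MB/s")  # what the OS-level view shows

# Flag paths carrying far more than an even share of the combined throughput.
even_share = combined / len(per_path)
for (host, path), mbps in sorted(per_path.items(), key=lambda kv: -kv[1]):
    flag = "  <-- imbalanced" if mbps > 1.5 * even_share else ""
    print(f"{host} {path}: {mbps:.0f} MB/s{flag}")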

Figure 5-14 shows per-path throughput for the host and an end-to-end topology in DCNM/NDFC.

Figure 5-14 Ready-Made View of the per-Path Throughput of Hosts in NDFC/DCNM SAN Insights

The root cause of this congestion was the misconfiguration of multipathing on these hosts. The university solved this congestion issue by correcting the multipathing misconfiguration on these hosts. SAN Analytics played a key role in finding the root cause because it was able to show a host’s combined throughput as well as the per-path throughput.

Case Study 2 Summary

Using SAN Analytics, a university was able to find non-uniform traffic patterns that led to congestion due to overutilization of a few links while other links were underutilized. The insights provided by SAN Analytics pinpointed a problem at the host multipathing layer. The university solved the congestion issues by correcting the multipathing misconfiguration, which resulted in uniform utilization of the available paths.

Case Study 3: An Energy Company That Eliminated Congestion Issues

An energy company observed high TxWait values on its storage-connected switchports, which means the storage arrays had a slower processing rate than the traffic being delivered to them (that is, slow drain). Thus, the storage ports slowed down the sending of R_RDY primitives, leading to zero remaining-Tx-B2B-credits on the connected switchports, which led to high TxWait values.

The company observed the high TxWait values across all of its storage ports. No specific storage array stood out. Also, the TxWait spikes were observed throughout the peak business hours. The company couldn’t pinpoint the high TxWait values to any specific hour.

The energy company wanted to know the reason for the high TxWait values on its storage-connected switchports. Knowing the root cause of this problem would allow it to find a solution before the issue became a business-impacting problem.

Background

The energy company uses storage arrays from a few major vendors. Its hosts include almost all kinds of servers (such as blade and rack-mount servers) from all major vendors. Most of its servers are virtualized using a leading hypervisor. The company uses Cisco MDS switches in its Fibre Channel fabrics. It used the Port-Monitor feature to automatically detect congestion and generate alerts for TxWait and other counters. However, not many alerts were generated because the TxWait values measured by the switchports were lower than the configured thresholds.

The energy company polls the TxWait value from all switchports every 30 seconds by using the MDS Traffic Monitoring (MTM) app (refer to Chapter 3). Cisco NDFC/DCNM Congestion Analysis also provides this information.

Investigation

The energy company needed more details to proceed with the investigation of high TxWait values on the storage-connected switchport because the existing data points were not conclusive. There were no specific time patterns or locations to pinpoint. TxWait values were observed throughout business hours randomly across all the storage-connected switchports. Also, some team members suspected issues within storage arrays. However, this possibility was ruled out because high TxWait values on the connected switchports were seen from all the storage arrays that had different vendors and different architectures.

The energy company took the following steps in investigating this issue:

  • Step 1. The company enabled SAN Analytics on the storage-connected switchports and allowed the I/O flow metrics to be collected for a week.

  • Step 2. Next, the company correlated TxWait values with ECT values on the storage ports. The ECT pattern matched with the TxWait pattern, which was expected because high TxWait values cause a delay in frame transmission, which in turn leads to longer exchange completion times.

  • Step 3. The company also tried matching the pattern of IOPS and throughput, but that didn’t lead to any new revelations.

  • Step 4. The company correlated TxWait with I/O size. It didn’t observe any matching patterns with read I/O size. However, it noticed that the time pattern of the spikes in write I/O size was an exact match with the time pattern of the spikes in TxWait.

  • Step 5. The company believed the spikes in write I/O size could explain the spikes in TxWait on the storage ports. It used this reasoning:

    • Typically, the write I/O size was in the range 512 bytes to 64 KB. During the spikes, the write I/O size increased to 1 MB. A 64 KB write I/O operation results in 32 full-size Fibre Channel frames, and a 1 MB write I/O operation results in 512 full-size Fibre Channel frames.

    • Most traffic due to a write I/O operation flows from hosts to storage ports.

    • The spike in write I/O size caused a burst of frames toward the storage arrays.

    • It was possible that the storage arrays could not process the burst of the frames in a timely manner and used the B2B flow control mechanism to slow down the ingress frame rate. The storage arrays reduced the rate of sending R_RDY primitives, leading to zero remaining-Tx-B2B-credits on the connected switchport, which led to high TxWait values.

  • Step 6. After determining that the large write I/O operations were the reason for the TxWait values on storage-connected switchports, the company wanted to resolve this issue. It had to find which hosts (initiators) and possibly which applications used the large-size write I/O operations.

  • Step 7. The company used SAN Analytics to find the write I/O size for every initiator-target-LUN (ITL) flow on the storage-connected switchports. This detailed information was enough to find the hosts (initiators) that initiated the large-size write I/O operations.

  • Step 8. Using SAN Analytics, the company found that these ITL flows had been active, and they had been doing write I/O operations with typical I/O sizes in the range 512 bytes to 64 KB. The write I/O size spiked to 1 MB just before these ITL flows stopped showing any I/O activity. In other words, the IOPS and throughput of these ITL flows dropped to zero right after the spike in write I/O size to 1 MB. It was an interesting pattern that was commonly seen on all the ITL flows that showed spikes in write I/O size to 1 MB.

  • Step 9. The company located the servers by using the initiator value from the ITL flows. Because these servers were virtualized, the company used the LUN value from the ITL flow to locate the datastore and a virtual disk on the hypervisor. However, it couldn’t find any data store or a virtual disk that was associated with the LUN value.

  • Step 10. Because the data from SAN Analytics showed nonzero IOPS for the ITL flows, the company was confident that these hosts used the storage volume associated with the LUN. Initially, it thought that it was not seeing all the information from the hosts. Later, it suspected that these hosts had stopped using the LUN, which coincided with the traffic pattern in which the ITL flows showed no I/O activity right after a spike in the write I/O size.

  • Step 11. The company suspected some cleanup mechanism before freeing up the disks. The application and virtualization teams found that, as per the company’s compliance guidelines, explicit (eager) zeros are written before the volumes are freed up.

  • Step 12. The company found that many applications were short-lived. When such applications are provisioned, the company creates virtual machines and allocates storage. As soon as an application is shut down, the virtual machine resources are freed. During this process, the company wipes all the data and then writes (eager) zeros on the volumes.

  • Step 13. Next, the company found the disk cleanup process. The hypervisor documentation made it clear that this cleanup process of writing zeros used an I/O size of 1 MB. This value matched with the write I/O size value shown by SAN Analytics on the storage-connected switchport that reported spikes in TxWait values. This also explained why no I/O activity was seen right after the write I/O size spiked.

  • Step 14. The company concluded that the disk cleanup process was the root cause of the spikes in write I/O size, which in turn caused the spikes in TxWait values on the storage-connected switchports. To test this idea, the company followed the same sequence of deploying an application followed by shutting it down. When the virtual machine was freed, the company could match the timestamps on the hypervisor with the spike in write I/O size for the corresponding ITL flow on the storage port, as reported by SAN Analytics. Connecting these end-to-end dots between the storage network and the application gave the company a clear understanding of the root cause of the problem. However, the problem was not yet solved. Because of the compliance guidelines, the company couldn’t stop the disk cleanup process. Also, changing the default write I/O size of the disk cleanup process was perceived to be risky.

  • Step 15. The company’s final approach, which aligned with its compliance guidelines and was agreed upon by all the teams, was to avoid cleaning up the virtual machines during peak business hours. The company changed the workflow to not free up the virtual machine immediately after the application was shut down. Rather, it delayed the cleanup process until off-peak (late-night) hours.

  • Step 16. The company verified this change by using the TxWait values on switchports and write I/O size, as reported by SAN Analytics. It didn’t see spikes in TxWait values anymore. It saw spikes in write I/O size, but now TxWait values didn’t increase, probably because the overall load on the storage arrays was low during the off-peak hours, and thus, the spike of the write I/O size for some flows didn’t cause processing delays with the storage arrays.

Figure 5-15 shows a TxWait graph in NDFC/DCNM Congestion Analysis. This graph has a granularity of 60 seconds. TxWait of 30 seconds in this graph translates to 50% TxWait.

Figure 5-15 TxWait in NDFC/DCNM Congestion Analysis

Figure 5-16 shows a write I/O size time-series graph in NDFC/DCNM SAN Insights. Notice the sudden spike and timestamp.

Figure 5-16 Write I/O Size Spike in NDFC/DCNM SAN Insights

Figures 5-15 and 5-16 are close representations, but they are not sourced from the environment of the energy company. They are shown here to illustrate how the spikes in TxWait values and I/O size can be found and used.
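
The correlation the energy company performed can be approximated with a simple time-series comparison, as in the minimal Python sketch below. The time series and the spike thresholds are hypothetical; the actual data came from TxWait polling and the SAN Analytics write I/O size metric.

# Minimal sketch: find intervals where both TxWait (%) and write I/O size (KB)
# spike, which is the correlation that pointed to the disk cleanup process.
# The two time series and the spike thresholds are hypothetical examples.

# Per-interval samples (for example, one value per 60-second window).
txwait_pct = [1, 2, 1, 35, 40, 2, 1, 30, 1]                # TxWait on storage switchport
write_io_kb = [32, 64, 48, 1024, 1024, 64, 32, 1024, 48]   # peak write I/O size

TXWAIT_SPIKE = 20      # percent
IO_SIZE_SPIKE = 512    # KB (well above the usual 512-byte to 64 KB range)

matches = [
    i for i, (tx, io) in enumerate(zip(txwait_pct, write_io_kb))
    if tx >= TXWAIT_SPIKE and io >= IO_SIZE_SPIKE
]
print(f"Intervals where TxWait and write I/O size spike together: {matches}")
# -> [3, 4, 7]: every TxWait spike coincides with a 1 MB write I/O size spike.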

Case Study 3 Summary

Using SAN Analytics, the energy company was able to find the root cause of high TxWait values on the storage-connected switchports and eliminate this congestion issue. First, it found that the spike in TxWait values was caused by the spike in write I/O size. Then it found the culprit ITL flows and used the initiator and LUN values to locate the hosts and the virtual machine. Finally, it used the traffic pattern—zero I/O activity just after a spike in write I/O size—to conclude that the disk cleanup process was the root cause of the spike in write I/O size. Based on this conclusion, the company solved the problem by delaying the disk cleanup until off-peak hours. This simple step eliminated congestion (TxWait spikes) from the company’s storage-connected switchports, which essentially led to better overall storage performance. This performance optimization wouldn’t have been possible without the insights provided by SAN Analytics.

Case Study 4: A Bank That Eliminated Congestion Through Infrastructure Optimization

A bank had an edge–core design in a storage network that connects thousands of devices. It often received a high egress utilization alert from a switchport connected to Host-1. The high-utilization condition persisted for a few minutes, and it happened a few times every day. While this switchport reported high egress utilization, congestion was seen on the ISL ports, as confirmed using TxWait values on the ISL ports of the upstream switch.

The bank had a large server farm, and many servers were underutilized. It was believed that high egress utilization on the switchport connected to Host-1 could be eliminated by moving some of the workloads to another server. However, instead of randomly moving a workload to another server (which would be a hit-or-miss approach), the bank wanted to make a data-driven decision to make the right change in one attempt. Every change is expensive, and the cost multiplies quickly in large environments.

Background

The bank used storage arrays from a few major vendors. Its host deployment included almost all kinds of servers (such as blade and rack-mount servers). Most of its servers were virtualized using a leading hypervisor. The bank used Cisco MDS switches in its Fibre Channel fabrics. It had enabled automatic monitoring and alerting using the Port-Monitor feature on MDS switches.

Using the high egress utilization (Tx-datarate) alerts, the bank was able to find the following information:

  • When the congestion started: This was based on the timestamp of the Port-Monitor alerts.

  • How long the congestion lasted: This was determined by finding the difference in timestamps between the rising and falling threshold events.

  • Where the source of congestion was located: Port-Monitor alerts reported which switch and switchport were highly utilized. The FLOGI database (via the NX-OS command show flogi database) showed that the affected switchport was connected to Host-1.

  • The congestion severity: This was reported by the Tx-datarate counter on the switchport that connected Host-1 and TxWait on the ISL ports of the upstream switch (refer to Chapter 4, “Troubleshooting Congestion in Fibre Channel Fabrics”).
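
Deriving the start time and duration of congestion from paired rising/falling threshold alerts, as described in the list above, can be sketched in a few lines of Python. The alert records below are hypothetical; a real deployment would parse Port-Monitor syslog or alert exports instead.

# Minimal sketch: derive when congestion started and how long it lasted from
# paired rising/falling Tx-datarate threshold alerts. The alert list is a
# made-up example of what a Port-Monitor alert parser might produce.
from datetime import datetime

# (timestamp, port, event) where event is "rising" or "falling"
alerts = [
    (datetime(2023, 5, 1, 10, 2, 10), "fc1/7", "rising"),
    (datetime(2023, 5, 1, 10, 6, 40), "fc1/7", "falling"),
    (datetime(2023, 5, 1, 14, 30, 0), "fc1/7", "rising"),
    (datetime(2023, 5, 1, 14, 33, 30), "fc1/7", "falling"),
]

open_events = {}  # port -> timestamp of the unmatched rising-threshold alert
for ts, port, event in sorted(alerts):
    if event == "rising":
        open_events[port] = ts
    elif event == "falling" and port in open_events:
        start = open_events.pop(port)
        duration = ts - start
        print(f"{port}: congestion started {start:%H:%M:%S}, lasted {duration}")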

Investigation

The bank needed more details to make a data-driven change to reduce the high ingress utilization of the Host-1 port, which is the same as the egress utilization of the connected switchport. Although the metrics from the switchport and the alerts from the Port-Monitor showed high utilization, granular flow level details were not available.

The bank wanted to move some workload from Host-1 to the other underutilized servers. But it didn’t know which workload to move and to which server.

The bank went through the following steps in investigating this issue:

  • Step 1. The bank enabled SAN Analytics on the host-connected switchports and ran it for a week while the same pattern of overutilization and congestion repeated. This helped in collecting end-to-end I/O flow metrics.

  • Step 2. Using SAN Analytics, the bank found the number of targets (based on IT flows) and the number of logical units, or LUNs (based on ITL flows), that each server was doing I/O operations with. Table 5-4 shows the findings.

    Table 5-4 Distribution of IT and ITL Flows of the Servers

    Server Name | Number of IT Flows | Number of ITL Flows | Number of LUNs (ITL Flows / IT Flows)
    Host-1 | 4 | 40 | 10
    Host-2 | 4 | 20 | 5
    Host-3 | 4 | 12 | 3
    Host-4 | 4 | 80 | 20

    Dividing the number of ITL flows by the number of IT flows gave the bank the number of LUNs that each server was doing I/O operations with. The results indicated that Host-1 was accessing a higher number of LUNs than Host-2 and Host-3 were. Host-4 accessed twice as many LUNs as Host-1, yet it didn’t cause utilization as high as Host-1 did.

  • Step 3. The bank found the throughput for every ITL flow. It focused on read I/O throughput because most egress traffic on host-connected switchports results from read I/O operations. After sorting the ITL flows on the Host-1 connected switchport as per the read I/O throughput, the bank found an ITL flow that had a throughput much higher than the other ITL flows. Also, the pattern of spikes and dips of the read I/O throughput of this ITL flow matched the egress utilization on the Host-1 connected switchport. Clearly, this ITL flow was the major cause of the high utilization of the switchport and, consequently, the reason for congestion on the ISL.

  • Step 4. The bank wanted to find the workload that was using this ITL flow. Host-1 was virtualized, with many virtual machines. The bank used the LUN value of the ITL flow to find the datastore. It found the virtual disk that was created using this datastore and found the virtual machines that were using that virtual disk. To verify that it had located the correct virtual machine, the bank used the I/O throughput as reported by the operating system of the VM and matched it with the throughput reported by SAN Analytics for the detected ITL flow.

  • Step 5. After locating the high-throughput virtual machine on Host-1, the bank wanted to find the best server to which this virtual machine could be moved. Was it Host-2, Host-3, or Host-4?

  • Step 6. The bank ruled out Host-4 because it already had a greater number of ITL flows. The remaining possible options were Host-2 with 20 ITL flows, and Host-3 with 12 ITL flows.

  • Step 7. The bank found more metrics reported by SAN Analytics. Table 5-5 shows these findings.

    Table 5-5 I/O Flow Metrics from SAN Analytics for Host-2 and Host-3

    Server Name | Peak Egress Utilization of the Connected Switchport | Peak IOPS | Peak Read I/O Size
    Host-2 | 30% | 10,000 | 16 KB
    Host-3 | 40% | 2000 | 64 KB

  • It was important to use the peak values in order to make the right decisions because congestion issues are more severe under peak load. Based on this data, the bank decided to move the high-throughput virtual machine from Host-1 to Host-2 because of its lower utilization and lower read I/O size. Had it made the decision based on the number of ITL counts alone, the bank would have chosen Host-3, which was not the best choice. By using the insights provided by SAN Analytics, the bank was able to make a data-driven decision.
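
As a rough illustration of this data-driven selection, the short Python sketch below scores candidate hosts from peak metrics like those in Table 5-5. The scoring formula and weights are assumptions for illustration only, not the bank's actual method.

# Minimal sketch: rank candidate hosts for receiving the high-throughput VM
# using peak metrics such as those in Table 5-5. The scoring formula is an
# arbitrary illustration, not the bank's actual decision logic.

candidates = [
    # name, peak egress utilization of the connected switchport (%), peak IOPS, peak read I/O size (KB)
    {"name": "Host-2", "peak_util_pct": 30, "peak_iops": 10_000, "peak_read_io_kb": 16},
    {"name": "Host-3", "peak_util_pct": 40, "peak_iops": 2_000, "peak_read_io_kb": 64},
]

def headroom_score(host: dict) -> float:
    """Higher is better: favor low peak utilization and small peak read I/O size."""
    util_headroom = 100 - host["peak_util_pct"]        # free link capacity at peak
    io_size_penalty = host["peak_read_io_kb"] / 64.0   # large reads fill links faster
    return util_headroom / (1.0 + io_size_penalty)

for host in sorted(candidates, key=headroom_score, reverse=True):
    print(f"{host['name']}: score={headroom_score(host):.1f}")
# Host-2 ranks first: lower peak utilization and smaller peak read I/O size.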

The bank continued to monitor the servers and repeated these steps for further optimization.

Case Study 4 Summary

The bank received high egress utilization alerts from one of the host-connected switchports, which led to congestion on the ISL. It resolved this issue by moving a high-throughput workload/VM from this host to other underutilized hosts. To make this change, the bank used SAN Analytics to find the number of IT flows and ITL flows. It then found the throughput per flow and sorted the flows according to throughput to find the culprit flow. Next, the bank located the virtual machine by using the LUN value from the ITL flow and correlated it with the datastore and virtual disk on the hypervisor. Finally, it analyzed the peak throughput, IOPS, and I/O sizes of the other servers to find the best host for the high-throughput workload.

The insights provided by SAN Analytics helped the bank resolve this issue with only one change.
