How and Where to Monitor Storage I/O Performance
At a high level, storage I/O performance can be monitored within a host, in a storage array, or in the network. All three are viable options because an I/O operation passes through many layers within the initiator (host) and the target (storage array), and through multiple switches in the network. This section explains these approaches briefly, but the primary focus of this chapter is on monitoring storage I/O performance in the network.
Storage I/O Performance Monitoring in the Host
Most operating systems, such as Linux, Windows, and ESXi, can monitor storage I/O performance. Example 5-1 shows storage I/O performance being monitored in Linux by using the iotop command.
Example 5-1 Storage I/O Performance Monitoring in Linux
[root@stg-tme-lnx-b200-7 ~]# iotop
Total DISK READ :      36.30 M/s | Total DISK WRITE :      36.85 M/s
Actual DISK READ:      36.31 M/s | Actual DISK WRITE:      36.80 M/s
  TID  PRIO  USER     DISK READ   DISK WRITE   SWAPIN      IO>    COMMAND
  941  be/3  root      0.00 B/s     0.00 B/s   0.00 %   3.31 %  [jbd2/dm-101-8]
46303  be/4  root      6.42 M/s     6.37 M/s   0.00 %   1.93 %  fio config_fio_1
  542  be/3  root      0.00 B/s     0.00 B/s   0.00 %   1.89 %  [jbd2/dm-22-8]
26496  rt/4  root      0.00 B/s     0.00 B/s   0.00 %   1.26 %  multipathd
46383  be/4  root      7.13 M/s     7.11 M/s   0.00 %   0.42 %  fio config_fio_1
46284  be/4  root     11.96 M/s    12.34 M/s   0.00 %   0.00 %  fio config_fio_1
46384  be/4  root      5.19 M/s     5.40 M/s   0.00 %   0.00 %  fio config_fio_1
46402  be/4  root      5.61 M/s     5.63 M/s   0.00 %   0.00 %  fio config_fio_1
For the purpose of dealing with network congestion, monitoring storage I/O performance within hosts involves the following considerations:
Per-path storage I/O performance should be monitored. Multiple paths that perform at different levels may exist between the host and the storage array, yet the host may, by default, report only the cumulative performance across all paths (see the per-path sketch after this list).
Metrics from thousands of hosts should be collected and presented in a single dashboard for early detection of congestion.
Collecting the metrics from hosts may require dedicated agents, and there is overhead involved in maintaining them.
Different implementations on different operating systems, such as Linux, Windows, and ESXi, may take non-uniform approaches to collecting the same metrics.
Be aware that measuring performance within hosts makes the measurements themselves prone to issues on that host. In effect, the “monitored” end device is also the device doing the “monitoring.” What happens when it gets congested or becomes a slow-drain device?
Because of organizational silos, hosts and storage arrays may be managed by different teams.
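As a rough illustration of the per-path consideration above, the following Python sketch reads the per-device counters in /proc/diskstats on Linux and reports throughput for each SCSI path device (sdX) separately, rather than only for the multipath (dm) device layered on top of them. It assumes the classic 14-field /proc/diskstats layout and 512-byte sectors, and the sampling interval is an arbitrary choice; a similar per-path view can also be obtained from tools such as iostat -x by looking at the sdX devices underneath the dm device.

#!/usr/bin/env python3
# Sketch: per-path read/write throughput from /proc/diskstats on Linux.
# Assumes the classic 14-field layout (newer kernels append discard/flush
# fields, which are ignored here) and 512-byte sectors.
import time

SECTOR_BYTES = 512
INTERVAL = 5  # seconds; arbitrary choice for this sketch


def read_diskstats():
    """Return {device: (sectors_read, sectors_written)} for sd* path devices."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            dev = fields[2]
            # sd* devices are the individual SCSI paths; the dm-* device on top
            # of them is what cumulative host-level tools typically report.
            if dev.startswith("sd") and not dev[-1].isdigit():
                stats[dev] = (int(fields[5]), int(fields[9]))  # sectors read, written
    return stats


prev = read_diskstats()
while True:
    time.sleep(INTERVAL)
    curr = read_diskstats()
    print(f"{'PATH':<8}{'READ MB/s':>12}{'WRITE MB/s':>12}")
    for dev, (rd, wr) in sorted(curr.items()):
        prd, pwr = prev.get(dev, (rd, wr))
        print(f"{dev:<8}"
              f"{(rd - prd) * SECTOR_BYTES / INTERVAL / 1e6:>12.2f}"
              f"{(wr - pwr) * SECTOR_BYTES / INTERVAL / 1e6:>12.2f}")
    prev = curr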
Storage I/O Performance Monitoring in a Storage Array
Most arrays monitor storage I/O performance. For example, Figure 5-1 shows I/O performance on a Dell EMC PowerMax storage array.
Figure 5-1 Storage I/O Performance Monitoring on a Dell EMC PowerMax Storage Array
The metrics collected by the storage arrays can be used for monitoring I/O performance, but this approach involves similar challenges to the host-centric approach, as explained in the previous section.
Storage I/O Performance Monitoring in a Network
I/O operations are encapsulated within frames for transport across a storage network. The network switches need to look up only the headers to forward the frames toward their destination. In other words, a network, for its typical function of frame forwarding, need not know what is inside a frame. However, monitoring storage I/O performance in the network requires advanced capability on the switches to inspect the transport header (such as Fibre Channel) and the upper-layer protocol headers (such as SCSI and NVMe).
Cisco SAN Analytics monitors storage I/O performance natively within a network because it is integrated by design with Cisco MDS switches. As Fibre Channel frames are switched between the ports of an MDS switch, the ASICs (application-specific integrated circuits) inspect the FC and NVMe/SCSI headers and analyze them to collect I/O performance metrics such as the number of I/O operations per second, how long the I/O operations take to complete, how long they spend in the storage array, how long they spend in the hosts, and so on. Cisco SAN Analytics does not inspect the frame payload because there is no need for it; the metrics can be calculated by inspecting only the headers.
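The following Python sketch is a conceptual illustration, not the MDS ASIC implementation, of how such metrics can be derived purely from header-level events. It assumes hypothetical per-exchange records carrying timestamps for the command frame, the first frame returned by the target, and the status frame, and from these it computes an approximate exchange completion time and data access latency for each I/O.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExchangeEvent:
    """Hypothetical header-level event observed by a switch for one exchange."""
    exchange_id: int
    kind: str          # "CMD" (command), "FIRST_RSP" (first frame back), "STATUS"
    timestamp_us: float


def io_metrics(events: List[ExchangeEvent]) -> List[Dict[str, float]]:
    """Pair events by exchange ID and derive per-I/O timings.

    Exchange completion time (ECT): command frame to status frame.
    Data access latency (DAL): command frame to the first frame returned by
    the target, a rough measure of time spent inside the storage array.
    """
    by_exchange: Dict[int, Dict[str, float]] = {}
    for ev in events:
        by_exchange.setdefault(ev.exchange_id, {})[ev.kind] = ev.timestamp_us

    results = []
    for xid, ts in by_exchange.items():
        if "CMD" in ts and "STATUS" in ts:
            results.append({
                "exchange_id": xid,
                "ect_us": ts["STATUS"] - ts["CMD"],
                "dal_us": ts.get("FIRST_RSP", ts["STATUS"]) - ts["CMD"],
            })
    return results


# One read I/O observed on a switchport (timestamps in microseconds).
sample = [
    ExchangeEvent(0x1A2B, "CMD", 0.0),
    ExchangeEvent(0x1A2B, "FIRST_RSP", 180.0),
    ExchangeEvent(0x1A2B, "STATUS", 240.0),
]
print(io_metrics(sample))  # ECT = 240 us, DAL = 180 us for this exchange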
Cisco SAN Analytics, because of its network-centric approach and unique architecture, has the following merits for monitoring storage I/O performance:
Vendor neutral: Cisco SAN Analytics is not dependent on server vendor (HPE, Cisco, Dell, and so on), host OS vendor (Red Hat, Microsoft, VMware, and so on), or storage array vendor (Dell EMC, HPE, IBM, Hitachi, Pure, NetApp, and so on).
Not dependent on end-device type: Cisco SAN Analytics is not dependent on any of the following:
Server architecture: Rack-mount, blade, and so on
OS type: Linux, Windows, or ESXi
Storage architecture: All-flash, hybrid, non-flash, and so on
Legacy end devices can also benefit because no changes, such as agent installation or firmware updates, are needed on them.
No dependency on the monitoring architecture of end devices: Different products use different logic for collecting similar metrics. For example, some storage arrays collect I/O completion time on the front-end ports, whereas other storage arrays collect it on the back-end ports. Different host operating systems may collect I/O completion time at different layers in the host stack. Cisco SAN Analytics doesn’t have this dependency.
Flow-level monitoring: Cisco SAN Analytics monitors performance for every flow separately. When a culprit switchport is detected, flow-level metrics help in pinpointing the issue to an exact initiator, target, virtual machine, or LUN/namespace ID (see the flow-aggregation sketch after this list).
Flexibility of location of monitoring: Cisco SAN Analytics can monitor storage I/O performance at any of the following locations:
Host-connected switchports: Close to apps and servers
Storage-connected switchports: Close to storage arrays
ISL ports: Flow-level granularity in the core of the network
Granular: Cisco SAN Analytics monitors storage I/O performance at a fine granularity: microseconds for on-switch monitoring and seconds for exporting metrics from the switch.
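To make the idea of flow-level metrics concrete, the following Python sketch shows a hypothetical post-processing step on metric records exported from a switch; the record layout and field names are assumptions for illustration, not the actual export format. It groups records by an initiator/target/LUN (ITL) flow key and summarizes IOPS and average exchange completion time per flow for each export interval.

from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class FlowRecord(NamedTuple):
    """Hypothetical exported record; field names are assumptions for this sketch."""
    initiator: str        # initiator WWPN or FCID
    target: str           # target WWPN or FCID
    lun: str              # LUN or NVMe namespace ID
    interval_start: int   # start of the export interval (epoch seconds)
    io_count: int         # I/Os completed in the interval
    total_ect_us: float   # sum of exchange completion times in the interval


FlowKey = Tuple[str, str, str, int]


def summarize(records: List[FlowRecord], interval_s: int = 30) -> Dict[FlowKey, Dict[str, float]]:
    """Aggregate IOPS and average ECT per initiator/target/LUN flow per interval."""
    grouped: Dict[FlowKey, List[FlowRecord]] = defaultdict(list)
    for r in records:
        grouped[(r.initiator, r.target, r.lun, r.interval_start)].append(r)

    summary = {}
    for key, recs in grouped.items():
        ios = sum(r.io_count for r in recs)
        total_ect = sum(r.total_ect_us for r in recs)
        summary[key] = {
            "iops": ios / interval_s,
            "avg_ect_us": (total_ect / ios) if ios else 0.0,
        }
    return summary


# Two flows behind the same congested switchport; the per-flow view shows
# which initiator/LUN pair is experiencing the high completion times.
records = [
    FlowRecord("10:00:aa", "20:00:bb", "lun-1", 1700000000, 3000, 9.0e5),
    FlowRecord("10:00:cc", "20:00:bb", "lun-7", 1700000000, 1200, 9.6e6),
]
for key, metrics in summarize(records).items():
    print(key, metrics)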
This chapter focuses on using Cisco SAN Analytics to address congestion in storage networks, although the concepts and case studies can be applied to host-centric and storage array-centric approaches as well.