Case Studies
This section describes three case studies based on real customer situations. It examines, at a high level, the challenges these companies faced and how they addressed them with the network automation techniques described throughout this book. These case studies give you a good idea of the benefits of automation in real-world scenarios.
Creating a Machine Learning Model with Raw Data
Raw data is data in the state in which it was collected from the network, without any transformations. You have seen in this chapter that preparing data is an essential step in deriving insights from it. One customer did not appreciate the importance of preparing data, and this is what happened.
Company XYZ has a sensitive business that is prone to cyberattacks. It had deployed several security solutions, but because it had machine learning experts, XYZ wanted to enhance its current portfolio with an in-house solution.
The initial step taken was to decide which part of the network was going to be used, and XYZ decided that the data center was the best fit. XYZ decided that it wanted to predict network intrusion attacks based on flow data from network traffic.
After careful consideration, the company started collecting flow data from the network by using IPFIX. Table 3-4 shows the flow’s format.
Table 3-4 Flow Data Fields
| Duration | Protocol | Src IP | Src Port | Dst IP | Dst Port | Packets | Bytes | Flows | Flags |
|---|---|---|---|---|---|---|---|---|---|
| 813 | TCP | 10.0.0.2 | 56979 | 10.0.0.3 | 8080 | 12024 | 10300 | 1 | AP |
After collecting data for a couple of months, a team manually labeled each collected flow as suspicious or normal.
With the data labeled, the machine learning team built a model using a decision tree classifier.
The results were poor, with accuracy of around 81%, so we were engaged to investigate.
The results of the investigation were clear: XYZ had trained the model with all features in their raw state, unaltered from the time they were collected and without any data transformation. Byte values appeared in different formats (sometimes 1000 bytes and sometimes 1 KB), and similar problems occurred with the packet counts. Another clear misstep was that the training sample contained an overwhelming majority of normal traffic compared to suspicious traffic. This kind of class imbalance can damage the training of a machine learning model.
We retrained the model, this time with some feature engineering (that is, data preparation techniques). We separated each flag into a feature of its own, rebalanced the data set so that the classes were more evenly represented, and converted data fields to the same units of measure (for example, all byte counts in KB), among other techniques.
Our changes achieved 95.9% accuracy.
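The exact code XYZ used is not reproduced here, but the following Python sketch illustrates the kind of feature engineering involved. It assumes the labeled flows have been exported to a hypothetical flows.csv whose columns mirror Table 3-4 (plus a label column and a unit marker); the column names, the downsampling ratio, and the use of pandas and scikit-learn are illustrative assumptions rather than the customer's actual implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical export of the labeled IPFIX flows (file name and columns assumed).
df = pd.read_csv("flows.csv")

# 1. Normalize units: express every byte count in KB. A hypothetical 'unit'
#    column marks rows that were collected in plain bytes.
df["bytes"] = df["bytes"].astype(float)
mask = df["unit"] == "B"
df.loc[mask, "bytes"] /= 1000

# 2. Split the composite TCP flag string (for example, "AP") into one binary
#    feature per flag.
for flag in ["A", "P", "S", "F", "R", "U"]:
    df[f"flag_{flag}"] = df["flags"].fillna("").str.contains(flag).astype(int)

# 3. Encode the protocol as one-hot columns instead of a raw string.
df = pd.get_dummies(df, columns=["protocol"])

# 4. Rebalance the classes by downsampling the dominant 'normal' class
#    (the 3:1 ratio is arbitrary and assumes enough normal rows exist).
normal = df[df["label"] == "normal"]
suspicious = df[df["label"] == "suspicious"]
balanced = pd.concat([normal.sample(n=3 * len(suspicious), random_state=42), suspicious])

# 5. Train the same kind of model (a decision tree) on the engineered features,
#    dropping non-numeric columns such as addresses and the raw flag string.
features = balanced.drop(columns=["label", "flags", "unit", "src_ip", "dst_ip"])
labels = balanced["label"]
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.3f}")
```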
This case study illustrates how much of a difference properly treating the data can make. The ability to derive insights from data, with or without machine learning, depends largely on data quality. We trained exactly the same algorithm the original team had used and gained roughly 14 percentage points of accuracy just by massaging the data.
How a Data Center Reduced Its Mean Time to Repair
Mean time to repair (MTTR) is a metric that reflects the average time taken to troubleshoot and repair a failed piece of equipment or component.
Customer ABC runs critical services in its data center, and outages in those services incur financial losses as well as damage to ABC’s brand reputation. ABC owns a very complex data center topology, with a mix of platforms, vendors, and versions. All of its services are highly redundant because the company’s MTTR was high.
We were engaged to help drive down the MTTR, as doing so could have a big impact on ABC’s business operations. The main cause of the high MTTR we could pinpoint was the amount of knowledge needed to triage and troubleshoot any incident: In an ecosystem with so many different vendors and platforms, the data was very sparse.
We developed an automation solution using Python and natural language processing (NLP) that correlated logs across these platforms and deduplicated them. This enabled ABC’s engineers to understand the logs in a common language.
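ABC’s actual NLP pipeline is not shown here, but the following minimal Python sketch conveys the idea of normalizing vendor-specific log lines into a common structure and deduplicating near-identical messages. The regular expressions, message formats, and similarity threshold are invented for illustration and do not reflect the real platforms or solution.

```python
import re
from difflib import SequenceMatcher

# Vendor-specific patterns mapped to a common (device, severity, event) structure.
PATTERNS = [
    re.compile(r"^(?P<device>\S+) %(?P<facility>[\w-]+)-(?P<severity>\d)-(?P<event>\w+): (?P<text>.*)$"),
    re.compile(r"^(?P<device>\S+) \[(?P<severity>\w+)\] (?P<event>[\w.]+) (?P<text>.*)$"),
]

def normalize(line: str) -> dict | None:
    """Parse a raw log line into a common structure, or return None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None

def deduplicate(events: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop events whose text is nearly identical to one already kept."""
    kept: list[dict] = []
    for event in events:
        duplicate = any(
            SequenceMatcher(None, event["text"], other["text"]).ratio() > threshold
            for other in kept
        )
        if not duplicate:
            kept.append(event)
    return kept

# Invented example log lines in two different vendor formats.
raw_lines = [
    "switch-01 %LINK-3-UPDOWN: Interface Gi1/0/1, changed state to down",
    "switch-01 %LINK-3-UPDOWN: Interface Gi1/0/1, changed state to down",
    "fw-02 [error] link.down Interface eth3 went down",
]
events = [e for e in (normalize(line) for line in raw_lines) if e]
print(deduplicate(events))
```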
In addition, we used Ansible playbooks to apply configurations after a device was replaced. The initial workflow consisted of replacing faulty devices with new ones and manually configuring them using the configuration of the replaced device. This process was slow and error prone.
Now after a device is identified for replacement at ABC, the following workflow occurs:
Step 1. Collect the configuration of the faulty device (using Ansible).
Step 2. Collect information about endpoints connected to the faulty device (using Ansible).
Step 3. Connect and cable the new device (manually).
Step 4. Assign the new device a management connection (manually).
Step 5. Configure the new device (using Ansible).
Step 6. Validate that the endpoint information on the new device matches the information collected from the faulty device in Step 2 (using Ansible).
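ABC implemented these steps with Ansible playbooks. As a plain-Python illustration of the comparison logic behind Step 6, the following sketch checks whether the endpoints learned on each interface before the replacement are still present afterward; the data layout (interface name mapped to a set of MAC addresses) is an assumption made for this example.

```python
def compare_endpoints(before: dict[str, set[str]], after: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per interface, the endpoints seen before the replacement but missing afterward."""
    missing = {}
    for interface, macs in before.items():
        still_missing = macs - after.get(interface, set())
        if still_missing:
            missing[interface] = still_missing
    return missing

# Hypothetical endpoint data collected in Step 2 and again in Step 6.
before = {"Gi1/0/1": {"aa:bb:cc:00:00:01"}, "Gi1/0/2": {"aa:bb:cc:00:00:02"}}
after = {"Gi1/0/1": {"aa:bb:cc:00:00:01"}, "Gi1/0/2": set()}

print(compare_endpoints(before, after))  # {'Gi1/0/2': {'aa:bb:cc:00:00:02'}}
```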
This new approach allows ABC to ensure that configurations are consistent when devices are replaced and that they are applied quickly. Furthermore, it provides assurance regarding the number and type of connected endpoints, which should be the same before and after the replacement.
In Chapter 5, you will see how to create your own playbooks to collect information and configure network devices. In the case of ABC, two playbooks were used: one to retrieve configuration and endpoint information from the faulty device and store it, and another to configure and validate the replacement device using the previously collected information. These playbooks are executed manually from a central server.
Both solutions together greatly reduced the time ABC takes to repair network components. It is important to note that this customer replaced faulty devices with similar ones, in terms of hardware and software. If a replacement involves a different type of hardware or software, some of the collected configurations might need to be altered before they can be applied.
Network Migrations at an Electricity Provider
A large electricity provider was responsible for an entire country. It had gone through a series of campus network migrations to a new technology, but after a couple of months, it had experienced several outages. This led to the decision to remove the new technology, which entailed yet another series of migrations. These migrations had a twist: They had to be done in a much shorter period of time due to customer sentiment and the possibility of further outages while on the current technology stack.
Several automations were applied, mostly for validation, because the migration activities entailed physical moves and automating the configuration steps wasn’t possible.
The buildings at the electricity provider had many floors, and the floors had many devices, including the following:
IP cameras
Badge readers
IP phones
Desktops
Building management systems (such as heating, air, and water)
The procedure consisted of four steps:
Step 1. Isolate the device from the network (by unplugging its uplinks).
Step 2. Change the device software version.
Step 3. Change the device configuration.
Step 4. Plug the device into a parallel network segment.
The maintenance windows were short (around 6 hours), but the go-ahead or rollback decision had to be made no later than 3 hours after the window started because of the time required to roll back to the previous known good state.
We executed the first migration (a single floor) without any automation. The biggest hurdle we faced was verifying that every single endpoint device that was working before was still working afterward. (Some devices may not have been working before, and making them work on the new network was considered out of scope.) Some floors had over 1000 endpoints connected, and we did not know which of them had been working before. We found ourselves, a team of around 30 people, manually testing these endpoints, typically by making a call from an IP phone or swiping a badge in a reader. It was a terrible all-nighter.
After the first migration, we decided to automate verifications. This automation collected many outputs from the network before the migration, including the following:
CDP neighbors
ARP table
MAC address table
Specific connectivity tests (such as ICMP tests)
Having this information about what was working before the migration saved us tens of person-hours.
We parsed this data into tables. After the devices were migrated (manually), we ran the collection mechanisms again, giving us a picture of the before and a picture of the after. From there, it was easy to tell which endpoint devices had been working before but were not working afterward, so we could act on only those devices.
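The real script is only partially shown in Example 1-9 in Chapter 1, so the following Python sketch simply illustrates the before/after comparison idea under assumed data structures: each snapshot has already been parsed into a list of rows, and the rows missing after the migration are written to a report the team can act on. The field names and the CSV report format are illustrative.

```python
import csv

def snapshot_diff(before: list[dict], after: list[dict], key: str) -> list[dict]:
    """Return the rows present in the 'before' snapshot whose key is absent afterward."""
    after_keys = {row[key] for row in after}
    return [row for row in before if row[key] not in after_keys]

# Hypothetical parsed ARP tables taken before and after a floor migration.
arp_before = [
    {"ip": "10.10.1.21", "mac": "aa:bb:cc:00:10:21", "interface": "Vlan110"},
    {"ip": "10.10.1.22", "mac": "aa:bb:cc:00:10:22", "interface": "Vlan110"},
]
arp_after = [
    {"ip": "10.10.1.21", "mac": "aa:bb:cc:00:10:21", "interface": "Vlan110"},
]

missing = snapshot_diff(arp_before, arp_after, key="mac")

# Write the endpoints that need manual attention to a report the team can act on.
with open("missing_endpoints.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["ip", "mac", "interface"])
    writer.writeheader()
    writer.writerows(missing)
```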
We developed the scripts in Python instead of Ansible because of the team’s expertise; it would have been possible to achieve the same with Ansible. (Refer to Example 1-9 in Chapter 1 for a partial snippet of this script.)
There were many migration windows in this project. The first 8-hour maintenance window was dedicated to a single floor; in a later 8-hour window, we successfully migrated and verified six floors.
The time savings from automating the validations were a crucial part of the project’s success. The stress reduction was substantial as well: At the end of each migration window, the team was able to leave knowing that things were working, instead of fearing a call the next morning about nonworking endpoints that had been left unvalidated.