Challenges Being Addressed
As described in the chapter introduction, automation is a necessity for today's growing, sophisticated IT environments. Allow me to share a personal example: if you've been to a Cisco Live conference in the US, you know it is common to deploy a couple thousand wireless access points in the large conference venues in Las Vegas, San Diego, and Orlando. I'm talking about event spaces of a million-plus square feet.
Given that the network operations center (NOC) team is allowed onsite only four to five days before the event starts, that’s not enough time to manually provision everything with a couple dozen event staff volunteers. The thousands of wireless APs are just one aspect of the event infrastructure (see Figure 10-1). There are still the 600+ small form-factor switches that must be spread across the venue to connect breakout rooms, keynote areas, World of Solutions, testing facilities and labs, the DevNet pavilion, and other spaces (see Figure 10-2).
Figure 10-1 Moving a Few Wireless APs
Figure 10-2 Lots of Equipment to Stage
Automation is a “do or die” activity for our businesses: without it, we overwork individuals and that impacts the broader organization. Automation must also extend beyond provisioning into the wide-scale collection of performance, health, and fault information.
Discerning companies are investigating how artificial intelligence and machine learning (AI/ML) can benefit them in obtaining new operational insights and reducing human effort even more.
We might even acknowledge that "change is hard and slow." If you started networking after prior experience with a more programmatic environment, or dealt with other industries where mass quantities of devices were managed effectively, you might wonder why network IT lags. That is a fair question; to be fair, though, enormous strides have been made in the last 10 years by an industry that found its start in ARPANET at the end of the 1960s. Cisco incorporated in 1984, and the industry has been growing in scale and functionality ever since.
Being involved in the latter part of the first wave of network evolution has been a constant career of learning and advancing skills development. The change and expansion of scope and function with networking have been very interesting and fulfilling for me.
Differences of Equipment and Functionality
Some of the challenges with networking stem from the diversity of equipment and functionality. In the late 1960s and early 1970s, the aforementioned ARPANET included few network protocols and functions. A router's purpose was to move traffic across different, isolated network segments of specialized endpoints. The industry progressed from shared media technologies (hubs) to switches. Businesses started acquiring their own servers; they were no longer limited to government agencies and the development labs of colleges and universities. Slowly, home PCs contributed to a burgeoning technology space.
Connectivity technology morphed from local technologies like Token Ring and FDDI to faster and faster Ethernet-based solutions, with hundred-megabit and then gigabit local interfaces, which in turn pushed WAN technologies to keep up.
Basic switches gave way to more intelligent routing and forwarding switches. IP-based telephony was developed. Who remembers that Cisco's original IP telephony solution, CallManager, was delivered on compact disc (CD), as much software was?
Storage was originally directly connected but then became networked, usually with different standards and protocols. The industry then accepted the efficiencies of a common, IP-based network. The rise of business computing being interconnected started influencing home networking. Networks became more interconnected and persistent. Dial-up technologies and ISDN peaked and started a downward trend in light of always-on cable-based technologies to the home. Different routing protocols needed to be created. Multiple-link aggregation requirements needed to be standardized to help with resiliency.
Wireless technologies came on the scene. Servers, which had previously been mere endpoints to the network, now became more integrated. IPv6. Mobile technologies. A lot of hardware innovations but also a lot of protocols and software developments came in parallel. So why the history lesson? Take them as cases in point of why networking IT was slow in automation. The field was changing rapidly and growing in functionality. The scope and pace of change in network IT were unlike those in any other IT disciplines.
Unfortunately, much of the early development relied on consoles and the expectation of a human administrator always creating the service initially and making the sustaining changes. The Information Technology Infrastructure Library (ITIL) and The Open Group Architecture Framework (TOGAF) service management frameworks helped the industry define structure and operational rigor. Some of the concepts seen in Table 10-2 reflect a common vocabulary being established.
Table 10-2 Operational Lifecycle
| Operational Perspective | Function |
|---|---|
| Day-0 | Initial installation |
| Day-1 | Configuration for production purpose |
| Day-2 | Compliance and optimization |
| Day-X | Migration/decommissioning |
The full lifecycle of a network device or service must be considered. All too often the “spin-up” of a service is the sole focus. Many IT managers have stories about finding orphaned recurring charges from decommissioned systems. Migrating and decommissioning a service are just as important as the initial provisioning. We must follow up on reclaiming precious consumable resources like disk space, IP addresses, and even power.
In the early days of compute virtualization, Cisco had an environment called CITEIS (Cisco IT Elastic Infrastructure Services), which was referred to as "cities." CITEIS was built to promote learning, speed development, and customer demos, and to prove the impact of automation. A policy was enacted that any engineer could spin up two virtual machines of any kind as long as they conformed to predefined sizing guidelines. If you needed something different, you could get it, but it would be handled on an exception basis. Now imagine the number of people excited to learn a new technology all piling on the system. VMs were spun up; CPU, RAM, disk space, and IP addresses were consumed, used once or twice, then never accessed again. A lot of resources were allocated. In the journey of developing the network programmability discipline, network engineers also needed to apply operational best practices. New functions were added to email (and later chat-message) the requester to ensure the resources were still needed. If a response was not received in a timely fashion, the resources were archived and decommissioned. If no acknowledgment came after many attempts over a longer period, the archive might be deleted. These kinds of basic functions formed the basis of standard IT operations to ensure proper use and lifecycle management of consumable resources.
With so many different opportunities among routing, switching, storage, compute, collaboration, wireless, and such, it’s also understandable that there was an amount of specialization in these areas. This focused specialization contributed to a lack of convergence because each technology was growing in its own right; the consolidation of staff and budgets was not pressuring IT to solve the issue by building collaborative solutions. But that would change. As addressed later in the topics covering SDN, the industry was primed for transformation.
In today’s world of modern networks, a difference of equipment and functionality is to be expected. Certainly, there are benefits recognized with standardizing device models to provide efficiencies in management and device/module sparing strategies. However, as network functions are separated, as seen later with SDN, or virtualized, as seen with Network Function Virtualization (NFV), a greater operational complexity is experienced. To that end, the industry has responded with model-driven concepts, which we cover in Chapter 11, “NETCONF and RESTCONF.” The ability to move from device-by-device, atomic management considerations to more service and function-oriented models that comprehend the relationships and dependencies among many devices is the basis for model-driven management.
Proximity of Management Tools and Support Staff
Another situation that needed to be addressed was the proximity of management tools and support staff. Early networks were not as interconnected, persistent, or ingrained in as many aspects of our lives as they are now. It was common to deploy multiple copies of management tools across an environment because the connectivity or basic interface link speed among sites often precluded using a central management tool. Those were the days of "Hey, can you drive from Detroit down to Dayton to install another copy of XYZ?"
Support staff existed at many large sites, sometimes with little collaboration among them or consistency of service delivery across states or countries.
Because early networks often metered and charged on traffic volume across the wide area, businesses were almost disincentivized to consolidate monitoring and management. "Why would I want to run more monitoring traffic and increase my cost? I only want 'business-critical traffic' across those WAN links now." However, fortunately, even this way of thinking changed.
Today networks are more meshed, persistent, highly available, and faster connected. There is little need to deploy multiple management tools, unless it is purposeful for scale or functional segmentation. The support teams today may include a “follow the sun” model where three or four different support centers are spread across the globe to allow personnel to serve others in their proximate time zone. As businesses experience higher degrees of automation and orchestration, there is reduced need for on-shift personnel. Consolidation of support teams is possible. This pivot to a more on-call or exception-based support model is desired. The implementation of self-healing networks that require fewer and fewer support personnel is even more desirable. Google’s concept of site reliability engineering (SRE) is an example of addressing the industry’s shortcomings with infrastructure and operations support. The SRE discipline combines aspects of software engineering with infrastructure and operations. SRE aims to enable highly scalable and reliable systems. Another way of thinking about SRE is what happens when you tell a software engineer to do an operations role.
Speed of Service Provisioning
With early networks being "small" and "specialized," there was a certain acceptance of how long it took to provision new services. The network engineer of the late 1990s and early 2000s might have experienced lead times of many months to get new circuits from their WAN service provider. However, this too was an area of transformation in network IT. Networks became more critical to businesses. Soon, having a web presence, in addition to any brick-and-mortar location, was a necessity. This drove a need for faster service provisioning and delivery. Previous manual efforts that included a "truck roll," or someone driving to another location, put too much latency into the process.
Businesses that could provide a service in weeks had a competitive differentiator over those that took months. Then this model progressed to those that could provide services in days versus weeks, and now you see the expectation of minutes, or "while I watch from my browser."
Business models have greatly changed. The aforementioned brick-and-mortar model was the norm. As the Internet flourished, having a web presence became a differentiator, then a requirement. To that end, so many years later, it is very difficult to find impactful domain names to register. Or it may cost a lot to negotiate a transfer from another owner!
Today, the physical presence is not required and is sometimes undesirable. More agile business models mean companies can be operated out of the owner’s home. Fulfillment can be handled by others, and the store or marketplace is handled through a larger e-commerce entity like Amazon, Alibaba, or eBay.
It is impossible to provide services in such a rapid fashion without automation. The customer sitting at a browser expects to see an order confirmation or expected service access right then. Indeed, some customers give up and look for alternative offers if their request is not met as they wait.
This expectation of now forces businesses to consolidate their offers into more consistent or templatized offers. The more consistent a service can be delivered, the better suited it is for automation. It’s the exceptions that tend to break the efficiencies of automation and cause longer service delivery cycles.
This rapid pace of service delivery influenced IT service management and development with DevOps and models like Agile and Lean. Figure 10-3 depicts the Agile methodology.
Figure 10-3 Agile Methodology
Agile, as a software development practice, focuses on extracting requirements and developing solutions with collaborative teams and their users. Planning with an adaptive approach to quick delivery and continuous improvement sets Agile apart from other, less flexible models. Agile is just one software development methodology, but it has a large following and is suggested for consideration in your network programmability journey. Several more of the broad spectrum of methodologies and project management frameworks are described in Table 10-3.
Table 10-3 Software Development Methodologies and Frameworks
| Method Name | Description |
|---|---|
| Agile | Flexible and incremental design process focused on collaboration |
| Kanban | Visual framework promoting what, when, and how to develop in small, incremental changes; complements Agile |
| Lean | Process to create efficiencies and remove waste to produce more with less |
| Scrum | Process with fixed-length iterations (sprints); follows roles, responsibilities, and meetings for well-defined structure; derivative of Agile |
| Waterfall | Sequential design process; fully planned; execution through phases |
Whatever model you choose, take time to understand the pros and cons and evaluate against your organization’s capabilities, culture, motivations, and business drivers. Ultimately, the right software development methodology for you is the one that is embraced by the most people in the organization.
Accuracy of Service Provisioning
Walt Disney is known for sharing this admirable quote: "Whatever you do, do it well." That has been the aspiration of any product or service provider. The same thinking can be applied to network service provisioning: nobody truly intends to partially deploy a service or to deploy something that will fail. One reason accuracy of service provisioning struggled before network programmability hit its stride was the lack of programmatic interfaces.
As we mentioned before, much of the genesis of network IT, and dare we say even IT more broadly, was founded on manual command-line interface interactions. Provisioning a device meant someone was logging into it and typing or pasting a set of configuration directives. The task wasn’t quite as bad as that in Figure 10-4, but it sure felt that way!
Figure 10-4 Not Quite This Painful
A slightly more advanced method might be typing or pasting those directives and putting them into a file to be transferred to the device and incorporated into its running configuration state. However, these manual efforts still required human interaction and an ability to translate intent to a set of configuration statements.
Some automations were, and sometimes still are, simply the collection and push of those same CLI commands (see Figure 10-5), but in an unattended fashion by a script or management application.
Figure 10-5 Automating the CLI
The fact that the foundation has been based on CLI automation seems to imply that the industry was conceding the “best” way to interact with a device was through the CLI. A lot of provisioning automation occurs through CLI with many management applications and open-source solutions.
Yet the CLI, while suited for human consumption, is not optimal for programmatic use. If the command syntax or output varies between releases or among products, the CLI-based solutions need to account for the differences. Consider the command output for show interface in Example 10-1.
Example 10-1 Show Interface Output
```
Switch# show interface te1/0/2
TenGigabitEthernet1/0/2 is up, line protocol is up (connected)
  Hardware is Ten Gigabit Ethernet, address is 0023.ebdd.4006 (bia 0023.ebdd.4006)
  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive not set
  Full-duplex, 10Gb/s, link type is auto, media type is 10GBase-SR
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:04, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 5000 bits/sec, 9 packets/sec
  5 minute output rate 0 bits/sec, 0 packets/sec
     200689496 packets input, 14996333682 bytes, 0 no buffer
     Received 195962135 broadcasts (127323238 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 127323238 multicast, 0 pause input
     0 input packets with dribble condition detected
     7642905 packets output, 1360729535 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
```
What are the options for extracting information like number of multicast packets output?
The use of Python scripts is in vogue, so let’s consider that with Example 10-2, which requires a minimum of Python 3.6.
Example 10-2 Python Script to Extract Multicast Packets
```python
import paramiko
import time
import getpass
import re

username = input('Enter Username: ')
userpassword = getpass.getpass('Enter Password: ')
devip = input('Enter Device IP: ')
devint = input('Enter Device Interface: ')

try:
    devconn = paramiko.SSHClient()
    devconn.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    devconn.connect(devip, username=username, password=userpassword, timeout=60)
    chan = devconn.invoke_shell()
    chan.send('terminal length 0\n')   # disable paging so output is not truncated
    time.sleep(1)
    chan.send(f'show interface {devint}\n')
    time.sleep(2)
    cmd_output = chan.recv(9999).decode(encoding='utf-8')
    devconn.close()
    result = re.search(r'(\d+) multicast,', cmd_output)
    if result:
        print(f'Multicast packet count on {devip} interface {devint} is {result.group(1)}')
    else:
        print(f'No match found for {devip} interface {devint} - incorrect interface?')
except paramiko.AuthenticationException:
    print('User or password incorrect - try again')
except Exception as e:
    err = str(e)
    print(f'ERROR: {err}')
```
There’s a common theme in methodologies that automate against CLI output which requires some level of string manipulation. Being able to use regular expressions, commonly called regex, or the re module in Python, is a good skill to have for CLI and string manipulation operations. While effective, using regex can be difficult skill to master. Let’s call it an acquired taste. The optimal approach is to leverage even higher degrees of abstraction through model-driven and structure interfaces, which relieve you of the string manipulation activities. You can find these in solutions like pyATS (https://developer.cisco.com/pyats/) and other Infrastructure-as-Code (IaC) solutions, such as Ansible and Terraform.
Product engineers intend to maintain consistency across releases, but the rapid rate of change and the intent to bring new innovation to the industry often result in changes to the command-line interface, either in provisioning syntax and arguments or in command-line output. These differences often break scripts and applications that depend on CLI; this affects accuracy in service provisioning. Fortunately, the industry recognizes the inefficiencies and results of varying CLI syntax and output. Apart from SNMP, which generally lacked a strong provisioning capability, one of the first innovations to enable programmatic interactions with network devices was the IETF’s NETCONF (network configuration) protocol.
We cover NETCONF and the follow-on RESTCONF protocol in more detail later in this book. However, we can briefly describe NETCONF as exposing an XML representation of a device's native configuration parameters, which is much more suited to programmatic use. Consider the device configuration shown in XML format in Figure 10-6.
Figure 10-6 Partial NETCONF Device Configuration
Although the format may be somewhat unfamiliar, you can see patterns and understand the basic structure. It is the consistent structure that allows NETCONF/RESTCONF and an XML-formatted configuration to be addressed more programmatically. By referring to tags or paths through the data, you can cleanly extract the value of a parameter without depending on the existence (or lack of existence) of text before and after the specific parameter(s) you need. This capability sets NETCONF/RESTCONF apart from CLI-based methods that rely on regex or other string-parsing methods.
A more modern skillset would include understanding XML formatting and schemas, along with XPath queries, which provide data filtering and extraction functions.
Many APIs output their data as XML- or JSON-formatted results. Having skills with XPath or JSONPath queries complements NETCONF/RESTCONF. Again, we cover these topics later in Chapter 11.
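As a small illustration of path-based extraction, the following sketch uses the Python standard library's xml.etree.ElementTree, whose find()/findall() methods accept a limited XPath subset. The XML shown is a simplified, hypothetical stand-in for the configuration in Figure 10-6, not its actual contents.

```python
# Path-based extraction from XML with the standard library; the XML below is
# a simplified, hypothetical stand-in for a NETCONF-style configuration.
import xml.etree.ElementTree as ET

config = """
<interfaces>
  <interface>
    <name>TenGigabitEthernet1/0/2</name>
    <description>Uplink to core</description>
    <enabled>true</enabled>
  </interface>
</interfaces>
"""

root = ET.fromstring(config)
# find()/findall() accept a limited XPath subset; the path addresses the data
# directly, with no dependence on the text before or after it
for intf in root.findall('./interface'):
    print(intf.find('name').text, '->', intf.find('description').text)
```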
Another way the industry has responded to the shifting sands of CLI is through abstracting the integration with the device with solutions like Puppet, Chef, Ansible, and Terraform. Scripts and applications can now refer to the abstract intent or API method rather than a potentially changing command-line argument or syntax. These also are covered later in this book.
Scale
Another challenge that needs to be addressed with evolving and growing networks is scale. Although early networks, and even some smaller networks today, can get by with the manual efforts of a few staff members, as the network increases in size, user count, and criticality, those models break. Refer back to Figure 9-19 to see the growth of the Internet over the years.
Scalable deployments are definitely constrained when using CLI-based methodologies, especially paste-based ones, because of flow control in terminal emulators and adapters. Slightly more efficiency is gained when using the CLI to initiate a configuration file transfer and merge process.
Let me share a personal example from a customer engagement. The customer was dealing with security access list changes that totaled thousands of lines of configuration text and was frustrated with the time it took to deploy the change. One easy fix was procedural: create a new access list and then flip over to it after it was created. The other advice was showing the customer the inefficiency of CLI flow-control based methods. Because the customer was copying/pasting the access list, they were restricted by the flow control between the device CLI and the terminal emulator.
Strike one: CLI/terminal.
Strike two: Size of access list.
Strike three: Time to import.
Pasting the customer’s access list into the device’s configuration took more than 10 minutes. I showed them the alternative of putting the configuration parameters into a file that could be transferred and merged with the device and the resulting seconds that this approach took instead. Needless to say, the customer started using a new process.
Using NETCONF/RESTCONF protocols to programmatically collect information and inject provisioning intent is efficient. Even so, it is necessary to evaluate the extent of deployment to gauge the next level of automation for scale. Here are some questions to ask yourself:
■ How many devices, nodes, and services do I need to deploy?
■ Do I have dependencies among them that require staggering the change for optimal availability? Any primary or secondary service relationships?
■ How much time is permitted for the change window, if applicable?
■ How quickly can I revert a change if unexpected errors occur?
Increasingly, many environments have no maintenance windows; there is no time that they are not doing mission-critical work. They implement changes during all hours of the day or night because their network architectures support high degrees of resiliency and availability. However, even in these environments, it is important to verify that the changes being deployed do not negatively affect the resiliency.
One more important question left off the preceding list for special mention is “How much risk am I willing to take?” I remember working with a customer who asked, “How many devices can we software upgrade over a weekend? What is that maximum number?” Together, we created a project and arranged the equipment to mimic their environment as closely as possible—device types, code versions, link speeds, device counts. The lab was massive—hundreds of racks of equipment with thousands of devices. In the final analysis, I reported, “You can effectively upgrade your entire network over a weekend.” In this case, it was 4000 devices, which at the time was a decent-sized network. I followed by saying, “However, I wouldn’t do it. Based on what I know of your risk tolerance level, I would suggest staging changes. The network you knew Friday afternoon could be very different from the one Monday morning if you run into an unexpected issue.” We obviously pressed for extensive change testing, but even with the leading test methodologies of the time, we had to concede something unexpected could happen. We saved the truly large-scale changes for those that were routine and low impact. For changes that were somewhat new, such as new software releases or new features and protocols, we established a phased approach to gain confidence and limit negative exposure.
■ Lab testing of single device(s) representing each model/function
■ Lab testing of multiple devices, including primary/backup peers
■ Lab testing of multiple devices, including primary/backup peers to maximum scale possible in lab
■ Production deployment of limited device counts in low-priority environments (10 percent of total)
■ Change observation for one to two weeks (depending on criticality of change)
■ Production deployment of devices in standard priority environments (25 percent of total)
■ Change observation for two to four weeks (depending on criticality of change)
■ Second batch deployment in standard priority environments (25 percent of total)
■ Change observation for two to four weeks (depending on criticality of change)
■ Production deployment of devices in high-priority environments (10 percent of total)
■ Change observation for two to four weeks (depending on criticality of change)
■ Second batch deployment of high-priority environments (10 percent of total)
■ Change observation for two to four weeks (depending on criticality of change)
■ Third batch deployment of high-priority environments (20 percent of total)
As you contemplate scale, if you’re programming your own solutions using Python scripts or similar, it is worthwhile to understand multithreading and multiprocessing. A few definitions of concurrency and parallelism also are in order.
An application completing more than one task at the same time is considered concurrent. Concurrency is working on multiple tasks at the same time but not necessarily simultaneously. Consider a situation with four tasks executing concurrently (see Figure 10-7). If you had a virtual machine or physical system with a one-core CPU, the CPU would decide how to switch among the tasks. Task 1 might go first, then task 3, then some of task 2, then all of task 4, and then a return to complete task 2. Tasks can start, execute their work, and complete in overlapping time periods. The process is effectively to start, complete some (or all) of the work, and then return to incomplete work where necessary, all the while maintaining state and awareness of completion status. One issue to observe is that concurrency works best with tasks that have no dependencies among them. In the world of IT, an overall workflow to enable a new web server may not be a good fit for concurrency. Consider the following activities:
Figure 10-7 Workflow Creating a Web Server
1. Create the virtual network.
2. Create the virtual storage volume.
3. Create the virtual machine vCPUs and vMemory.
4. Associate the VM vNet and vStorage.
5. Install the operating system to the VM.
6. Configure the operating system settings.
7. Update the operating system.
8. Install the Apache service.
9. Configure the Apache service.
Several of these steps depend on a previous step being completed, so this workflow is not well suited to concurrency. However, deploying software images to many devices across the network would be well suited. Consider these actions on a multidevice upgrade process (see Figure 10-8); a short Python sketch of the same pattern follows the list:
Figure 10-8 Concurrency Example
Configure Router-A to download new software update (wait for it to process, flag it to return to later, move on to next router), then . . .
Configure Router-B to download new software update (wait for it to process, flag it to return to later, move on to next router), then . . .
Configure Router-C to download new software update (wait for it to process, flag it to return to later, move on to next router), then . . .
Check Router-A status—still going—move on to next router.
Configure Router-D to download new software update (wait for it to process, flag it to return to later, move on to next router).
Check Router-B status—complete—remove flag to check status; move to next router.
Configure Router-E to download new software update (wait for it to process, flag it to return to later, move on to next router).
Check Router-A status—complete—remove flag to check status; move to next router.
Check Router-C status—complete—remove flag to check status; move to next router.
Check Router-D status—complete—remove flag to check status; move to next router.
Check Router-E status—complete—remove flag to check status; move to next router.
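The pattern just described maps naturally onto Python's standard concurrency tooling. The following is a minimal sketch using concurrent.futures.ThreadPoolExecutor; the start_download() function is a hypothetical placeholder for whatever per-router transfer and status logic you would actually implement.

```python
# A sketch of the concurrent upgrade pattern with a thread pool; the
# start_download() function is a hypothetical placeholder for real per-router
# transfer and status logic (for example, an SSH or API call).
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def start_download(router):
    """Simulate kicking off a software download and waiting on it."""
    time.sleep(random.uniform(1, 5))   # stand-in for transfer/processing time
    return f'{router}: download complete'

routers = ['Router-A', 'Router-B', 'Router-C', 'Router-D', 'Router-E']

# Each submitted task is "flagged" by its Future; as_completed() revisits the
# flags in completion order, just like the status checks in the list above.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(start_download, r) for r in routers]
    for future in as_completed(futures):
        print(future.result())
```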
Parallelism is different in that an application separates tasks into smaller activities to process in parallel on multiple CPUs simultaneously. Parallelism doesn’t require multiple tasks to exist. It runs parts of the tasks or multiple tasks at the same time using multicore functions of a CPU. The CPU handles the allocation of each task or subtask to a core.
Returning to the previous software example, consider it with a two-core CPU. The following actions would be involved in this multidevice upgrade (see Figure 10-9):
Figure 10-9 Parallelism Example
Core-1: Configure Router-A to download new software update (wait for it to process, flag it to return to later, move on to next router), while at the same time on another CPU . . .
Core-2: Configure Router-B to download new software update (wait for it to process, flag it to return to later, move on to next router).
Core-1: Configure Router-C to download new software update (wait for it to process, flag it to return to later, move on to next router).
Core-1: Check Router-A status—still going—move on to next router.
Core-2: Configure Router-D to download new software update (wait for it to process, flag it to return to later, move on to next router).
Core-2: Check Router-B status—complete—remove flag to check status; move to next router.
Core-2: Configure Router-E to download new software update (wait for it to process, flag it to return to later, move on to next router).
Core-1: Check Router-A status—complete—remove flag to check status; move to next router.
Core-1: Check Router-C status—complete—remove flag to check status; move to next router.
Core-1: Check Router-D status—complete—remove flag to check status; move to next router.
Core-2: Check Router-E status—complete—remove flag to check status; move to next router.
Because two tasks are executed simultaneously, this scenario is identified as parallelism. Parallelism requires hardware with multiple processing units, cores, or threads.
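To make the distinction concrete, here is a hedged sketch of the same upgrade fan-out using the standard library's multiprocessing module, which spreads worker processes across CPU cores the way the two-core walkthrough above describes. The upgrade_router() function is again a hypothetical stand-in for real per-device work.

```python
# A sketch of the same fan-out with true parallelism; upgrade_router() is a
# hypothetical stand-in for real per-device work, and the OS schedules the
# worker processes across the available cores.
import multiprocessing

def upgrade_router(router):
    # stand-in for CPU- or I/O-heavy upgrade logic
    return f'{router}: upgraded'

if __name__ == '__main__':
    routers = ['Router-A', 'Router-B', 'Router-C', 'Router-D', 'Router-E']
    # two worker processes mirror the two-core walkthrough in Figure 10-9
    with multiprocessing.Pool(processes=2) as pool:
        for result in pool.imap_unordered(upgrade_router, routers):
            print(result)
```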
To recap, a system is concurrent if it can support two or more tasks in progress at the same time. A system is parallel if it can support two or more tasks executing simultaneously. Concurrency focuses on working with lots of tasks at once. Parallelism focuses on doing lots of tasks at once.
So, what is the practical application of these concepts? In one case, I was dealing with the Meraki Dashboard API, which allows up to five API calls per second. Some API resources, like Get Organization (GET /organizations/{organizationId}), have few key-values to return, so they are very fast. Other API resources, like Get Device Clients (GET /devices/{serial}/clients), potentially return many results, so they may take more time. Using a model of parallelism to send multiple requests across multiple cores, allowing some short-running tasks to return more quickly than others while other work is allocated, provides a quicker experience over doing the entire process sequentially.
To achieve this outcome, I worked with the Python asyncio library and its semaphore feature to allocate work. I understood each activity of work had no relationship or dependency on the running of other activities; no information sharing was needed, and no interference across threads was in scope. In other words, the work was thread safe. The notion of tokens to perform work was easy to comprehend. The volume of work was created with a loop building a list of tasks; then the script would allocate as many tokens as were available in the semaphore bucket. When the script first kicked off, it had immediate access to do parallel processing with the four tokens I had allocated. As short-running tasks completed, tokens were returned to the bucket and made available for the next task. Some tasks ran longer than others, and that was fine because the overall model was not blocking other tasks from running as tokens became available.
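As a minimal sketch of that token model, the following code uses asyncio.Semaphore with four tokens; the fetch() coroutine and its endpoint strings are hypothetical placeholders standing in for real Meraki Dashboard API calls.

```python
# A sketch of the semaphore "token bucket" model with asyncio; fetch() and its
# endpoint strings are hypothetical placeholders, not real Meraki API calls.
import asyncio
import random

async def fetch(endpoint, sem):
    async with sem:   # take a token; it is returned automatically on exit
        await asyncio.sleep(random.uniform(0.2, 2))  # stand-in for an API call
        return f'{endpoint}: done'

async def main():
    sem = asyncio.Semaphore(4)   # four tokens available at any one time
    # build the volume of work as a list of tasks, then let tokens gate it
    tasks = [fetch(f'/devices/serial-{n}/clients', sem) for n in range(10)]
    for coro in asyncio.as_completed(tasks):
        print(await coro)

asyncio.run(main())
```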
Doing More with Less
Continuing in the theme of challenges being addressed, we must acknowledge the business pressures of gaining efficiencies to reduce operation expenses (OpEx) and potentially improve margins, if applicable. Network IT varies between a necessary cost center and a competitive differentiating profit center for many businesses. It is not uncommon for the cost center–focused businesses to manage budgets by reducing resources and attempting to get more productivity from those remaining. The profit center–focused businesses may do the same, but mostly for margin improvement.
Automation, orchestration, and network programmability provide the tools to get more done with less. If tasks are repetitive, automation reduces the burden—and burnout—on staff. Team members are able to focus on more strategic and fulfilling endeavors.
Reflecting on the previous section about scale: if you have many tasks that would benefit from parallel execution and they are not dependent on each other, it makes sense to allocate more threads/cores to the overall work. Efficient use of existing resources is desirable; a system with many cores that often sits idle is a waste of resources.
When building automated solutions, observe the tasks and time the original manual process from end to end. After you have automated the process, measure the runtime of the newly automated process and provide reporting that shows the time and cost savings of the automation. Having practical examples of return on investment (ROI) helps decision makers understand the benefits of automation and encourages its implementation. You're building the automation; you can create your own telemetry and instrumentation!
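As one small way to build that telemetry in, here is a hedged sketch of a timing decorator; the 45-minute manual baseline is an assumed value you would replace with your own end-to-end measurement of the manual process.

```python
# A sketch of simple self-instrumentation: time each automated run and report
# savings against a manually measured baseline (the 45-minute figure here is
# a hypothetical value you would replace with your own measurement).
import time
import functools

MANUAL_BASELINE_SECONDS = 45 * 60   # hypothetical manual baseline

def report_savings(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        saved = MANUAL_BASELINE_SECONDS - elapsed
        print(f'{func.__name__}: {elapsed:.1f}s automated, '
              f'~{saved / 60:.0f} minutes saved versus the manual process')
        return result
    return wrapper

@report_savings
def provision_service():
    time.sleep(2)   # stand-in for the real automated workflow

provision_service()
```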