Case Study: Resolving a Problem Using Proper Troubleshooting Methodology
It is 6 a.m., and you have arrived at work to resolve your CEO's problem. The only data you have is the page you received at 5:30 a.m. that says "CEO's calls keep dropping. Please help ASAP!" You need a bit more information than that to fix the problem.
This case study applies the methodology previously described. You must gather the data before you can begin the analysis.
Gathering the Data
As part of the data-gathering stage, you should do the following:
Identify and isolate the problem
Use topology information to isolate the problem
Gather data from the end users
Determine the problem's timeframe
You find the CEO's administrative assistant and begin your fact-finding mission. He states that at various times during the previous day and once this morning, the CEO was on the phone when, all of a sudden, the call was disconnected. Eager to resolve the problem, you ask the administrative assistant for the following information:
The exact date and times the problem occurred
Whether the dropped calls were incoming or outgoing
What number was dialed if it was an outbound call or what number the call came from if it was an inbound call
The assistant states that the call was dropped around 5:15 a.m. because the CEO was in early to prepare for the stockholders meeting. This is the extent of the information he remembers. Most users do not pay attention to specifics like this unless they have been instructed to, but all is not lost. The CEO has a 7960 phone that stores information locally about missed calls, received calls, and placed calls. You head into the CEO's office and look at the list of received calls and placed calls for the morning. You notice that a call was received at 5:05 a.m. and a call placed at 5:25 a.m. You notice that the second call was placed to the same area code and prefix as the call that was received.
You ask the CEO about the two calls. She remembers that she was on the phone with a customer for about 15 minutes when the call was disconnected. She immediately called the customer back. She also confirms that the first call that was received was the dropped call. Now you know that the problematic call was received at approximately 5:05 a.m. and was dropped just before 5:25 a.m.
While you are looking at the CEO's phone, you also go into the Settings menu (press the settings button > Network Configuration > CallManager 1) to see which CallManager the CEO's phone is registered to. This lets you isolate which CallManager in the cluster is involved in the signaling for this phone.
Armed with this information, you can begin the task of isolating the problem. You refer to your topology diagram to isolate the components that are involved. Figure 1-2 shows a high-level diagram of the network topology.
Figure 1-2 High-Level Topology Diagram
Reinforcing the topology in Figure 1-2, assume the following setup:
A cluster with eight CallManager nodes
32 voice gateway connections to the PSTN for outgoing calls at your main site: 16 for local calls and 16 for international and long distance
32 more voice gateways at your main campus, where all your inbound calls come in. The telephone company has set up the inbound calls so that the 32 gateways are redundant: if one gateway is down, incoming calls can still use any of the remaining gateways.
Two gateways at each remote site used for both inbound and outbound calls. All outbound calls prefer the first gateway, and inbound calls prefer the second gateway, although each can handle both inbound and outbound calls should one fail.
As shown in Figure 1-2, the executive offices are at a remote site across the WAN. With just the information you have so far, you can eliminate a large portion of the network. So far you know that the problematic call was to the CEO. You also know that the problematic call was an inbound call. You ask the CEO and her admin if all the dropped calls were inbound calls. As far as they can remember, they were.
You know that the call this morning came at a time of day when there is little phone activity. Remember that all inbound calls to the remote site come in through Primary Rate Interfaces (PRIs) connected to the remote voice gateways and that inbound calls to the site prefer the second gateway. It is unlikely that all the channels on the preferred PRI were in use during a time of low call volume, so you assume that the call probably came in through the second gateway, although you still keep in the back of your mind that the call might have come in through the first gateway at the remote site.
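The reasoning about which gateway most likely carried the call can be sketched in a few lines of Python. This is only an illustration of the overflow logic described above, not how the telephone company or the gateways actually select a circuit. The `Gateway` class, the function name, the channel counts in use, and the figure of 23 channels per PRI (which assumes a T1 PRI) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    channels_total: int   # bearer channels on the PRI (23 assumes a T1 PRI)
    channels_in_use: int  # hypothetical load at the time of the call

    def has_free_channel(self) -> bool:
        return self.channels_in_use < self.channels_total


def likely_inbound_gateway(preferred, fallback):
    """Return the gateway an inbound call would land on, or None if both PRIs are full."""
    for gw in (preferred, fallback):
        if gw.has_free_channel():
            return gw
    return None


# Hypothetical early-morning load: almost nothing in use on either PRI.
gw1 = Gateway("Site 2 Router/GW 1", channels_total=23, channels_in_use=0)
gw2 = Gateway("Site 2 Router/GW 2", channels_total=23, channels_in_use=2)

# Inbound calls prefer the second gateway and overflow to the first.
early_morning = likely_inbound_gateway(preferred=gw2, fallback=gw1)
print(early_morning.name)
```

With a nearly idle preferred PRI, the sketch picks the second gateway, which matches the working assumption in the text. The first gateway is only selected when every channel on the second is busy, which is why it stays on the suspect list as a less likely candidate.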
You then look at the configuration for the two gateways at Remote Site 2 and note that they are both configured to send incoming calls to CallManager Subscriber 3 as their preferred CallManager and CallManager Backup 1 in case CallManager Subscriber 3 fails.
With the information you have so far, you can narrow down the possible suspect devices to the network shown in Figure 1-3.
Armed with this knowledge, you can immediately isolate the problem to the user's phone and the two gateways being used for inbound calls. Keep in mind that you haven't eliminated the possibility that the problem is on CallManager or is network-related.
Now that you know the problem is related to inbound calls, it makes sense to try to understand the call flow for an inbound call to this user. Determine whether these calls all come directly to the user or if the call flow has any intermediate steps, such as Cisco IP Auto Attendant (Cisco IP AA) or an operator who transfers the call to the end user. For the sake of this example, assume that the user has a Direct Inward Dialing (DID) number, so the call comes straight from the PSTN through a gateway to the user, and a Cisco IP AA or operator is not involved. You have now eliminated Cisco IP AA from the picture, as well as the possibility that other phones or users are involved in this user's problems. This is not to say that other users are not experiencing similar problems, but the focus here is on solving this particular user's problem. If the problem is more widespread than this one user, you will probably find it as you continue to troubleshoot this user's problem.
Figure 1-3 Network After You Narrow Down the Possible Suspects
At this point, the problem has been isolated to the following culprits:
The CEO's phone
CallManager Subscriber 3
Site 2 Router/GW 1 and Site 2 Router/GW 2
The underlying network connecting these devices
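The narrowing that produces this list can be expressed as a simple filter over a device inventory. The sketch below is a hypothetical illustration: the `Device` class, the inventory contents, the site labels, and the extra devices (such as "CallManager Subscriber 4" and "Site 1 Router/GW 1") are invented for the example and do not come from Figure 1-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    kind: str  # "phone", "callmanager", or "gateway"
    site: str


# Hypothetical, heavily abbreviated inventory of the topology.
INVENTORY = [
    Device("CEO's phone", "phone", "Remote Site 2"),
    Device("CallManager Subscriber 3", "callmanager", "Main Site"),
    Device("CallManager Subscriber 4", "callmanager", "Main Site"),
    Device("Site 2 Router/GW 1", "gateway", "Remote Site 2"),
    Device("Site 2 Router/GW 2", "gateway", "Remote Site 2"),
    Device("Site 1 Router/GW 1", "gateway", "Remote Site 1"),
    Device("Main Site inbound GW 1", "gateway", "Main Site"),
]


def suspects(inventory, phone_name, phone_site, preferred_cm):
    """Keep the phone, the gateways at the phone's site, and the preferred CallManager."""
    keep = []
    for device in inventory:
        is_phone = device.name == phone_name
        is_site_gateway = device.kind == "gateway" and device.site == phone_site
        is_preferred_cm = device.kind == "callmanager" and device.name == preferred_cm
        if is_phone or is_site_gateway or is_preferred_cm:
            keep.append(device.name)
    return keep


remaining = suspects(INVENTORY, "CEO's phone", "Remote Site 2", "CallManager Subscriber 3")
print(remaining)
```

The underlying network does not appear in the output because it is not modeled as a device here. It remains a suspect until the signaling analysis says otherwise.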
It might seem like you haven't made much progress in this example, but in reality you have eliminated a large portion of the system as possible culprits. This concludes the data-gathering piece of your investigation. Now it is time to start analyzing the data. After you isolate the problem, you must break it into smaller pieces.
Analyzing the Data
As soon as you have a clear understanding of the problem you're trying to resolve, and you have isolated the piece or pieces of the network that are involved, the next step is to break the problem into pieces to find the root cause. As part of the data analysis stage, you should do the following:
Use deductive reasoning to narrow the list of possible causes
Verify IP network integrity
Determine the proper troubleshooting tools, and use them to find the root cause
Continuing with the case study example, you now know the pieces involved in the puzzle, but you still don't know why the call is being dropped. For the sake of this example, this chapter keeps things general, but later chapters go into far greater detail on exactly what to look for. In this case, the problem is likely caused by the phone, CallManager, the gateway, the PSTN, or the IP network. So how do you determine which one is causing the problem?
One important distinction to make that will become evident as you read through this book is that many problems can be narrowed down to being either signaling-related or voice packet-related. In this case, you are dealing with a signaling-related problem, because the problematic call is being torn down, a problem that must occur in the signaling path between devices.
Because nearly all signaling for a call must go through one or more CallManager servers, the first tool you decide to use is a trace from CallManager Subscriber 3. You can then analyze the trace files to discover the device that disconnects the call from CallManager's perspective; in other words, "Who hung up first?" Using the information provided by the user, you must find the proper trace file and try to reconstruct the call from beginning to end.
A call between the CEO's phone and the voice gateway has two distinct signaling connections. One is the communication between CallManager and the voice gateway. The other is the communication between CallManager and the phone. The phone and voice gateway never directly exchange signaling data. All signaling goes through CallManager.
The trace includes all the messaging between CallManager and both the phone and the gateway. Chapter 3 provides more details on where to find these traces and how to read them.
You know that the call in question was set up around 5:05 a.m., so you look through the traces during that timeframe, searching for the phone number you retrieved from the CEO's phone. After combing through the trace file, you determine that the gateway is sending a message to CallManager, telling it to disconnect the call. The CCM traces (discussed in Chapter 3) indicate which gateway the calls are coming from. This eliminates the CEO's phone as a cause of the problem because the disconnect message is coming from the gateway. Because the user indicated that there were three drops, you can now go through the same process of looking through the CCM trace files for each instance of a dropped call and reconstructing those calls to see if the problem is isolated to one gateway. If you don't know the times that the other calls were dropped, you should just concentrate on the one call you do have data for.
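The "Who hung up first?" search can be sketched as a filter over signaling events. The records below are a deliberately simplified, hypothetical model and are not the actual CCM trace format, which Chapter 3 covers. The `SignalEvent` class, the message labels, the date, and the phone numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SignalEvent:
    when: datetime
    sender: str          # device that sent the message to CallManager
    message: str         # simplified label: "SETUP", "CONNECT", or "DISCONNECT"
    calling_number: str  # number retrieved from the phone's received-calls list


def who_hung_up_first(events, calling_number, start, end):
    """Return the sender of the earliest DISCONNECT for the call in the window, else None."""
    call = sorted(
        (e for e in events
         if e.calling_number == calling_number and start <= e.when <= end),
        key=lambda e: e.when,
    )
    for event in call:
        if event.message == "DISCONNECT":
            return event.sender
    return None


# Hypothetical events reconstructed around the 5:05 a.m. call.
day = (2024, 1, 15)  # arbitrary date chosen for the example
events = [
    SignalEvent(datetime(*day, 5, 5, 0), "Site 2 Router/GW 2", "SETUP", "4085550100"),
    SignalEvent(datetime(*day, 5, 5, 8), "CEO's phone", "CONNECT", "4085550100"),
    SignalEvent(datetime(*day, 5, 12, 30), "Site 2 Router/GW 1", "SETUP", "4085550199"),
    SignalEvent(datetime(*day, 5, 20, 10), "Site 2 Router/GW 2", "DISCONNECT", "4085550100"),
]

first = who_hung_up_first(
    events, "4085550100",
    start=datetime(*day, 5, 0), end=datetime(*day, 5, 30),
)
print(first)
```

Filtering by both number and time window is what makes the search tractable on a busy system: either criterion alone could match many unrelated calls.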
Because CallManager received a message from the gateway telling it to disconnect the call, it is unlikely that a network problem is causing the calls to disconnect. If there were a network problem, you would likely see an indication that there was a problem communicating between CallManager and the gateway. In this case, the gateway had no problem sending the disconnect message to CallManager. It would not hurt to look through the network devices between CallManager and the voice gateway to ensure that there are no network errors, but with a problem like this, the network is an unlikely culprit.
At this point, you have narrowed down the problem to be originating from either the voice gateway or the PSTN. Figure 1-4 shows you've narrowed down the network to only a few devices.
The next step is to go to the suspected gateway and try to determine why one of the calls was dropped. This involves turning on additional debugs on the gateway to determine if the gateway is disconnecting the call or just passing along information from the PSTN about disconnecting the call. Unfortunately, it is unlikely that you had the debugs enabled at the time the problem occurred, so you need to enable the proper debugs and wait for the problem to happen again. This is why it is so important to narrow down the problem to a small subset of devices: You do not want to turn on debugs on dozens of gateways.
Which debugs to use depends on the gateway model and the type of interface to the PSTN. Chapter 6 discusses these considerations in detail. When the problem recurs with the debugs enabled, you discover that a message to disconnect the call is coming from the PSTN. If you are using an ISDN voice circuit for connectivity to the PSTN, the disconnect message is accompanied by a cause code that provides a general reason why the call was disconnected. Depending on what you discover on the gateway debugs, the next step might be to contact the local service provider or perhaps debug the gateway further to find the root cause.
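As an illustration of how a cause code gives a general reason for a disconnect, the following sketch maps a few ISDN cause values, as defined in ITU-T Recommendation Q.850, to their descriptions and general class. The table is a small subset chosen for the example, and the function names are hypothetical. The case study does not say which cause code the PSTN actually sent.

```python
# Subset of ITU-T Q.850 cause values; not a complete list.
Q850_CAUSES = {
    1: "Unallocated (unassigned) number",
    16: "Normal call clearing",
    17: "User busy",
    18: "No user responding",
    21: "Call rejected",
    27: "Destination out of order",
    31: "Normal, unspecified",
    34: "No circuit/channel available",
    38: "Network out of order",
    41: "Temporary failure",
    42: "Switching equipment congestion",
    102: "Recovery on timer expiry",
}

# Q.850 groups cause values into classes of 16 values each.
Q850_CLASSES = [
    "normal event",
    "normal event",
    "resource unavailable",
    "service or option not available",
    "service or option not implemented",
    "invalid message",
    "protocol error",
    "interworking",
]


def cause_class(code: int) -> str:
    """Return the general Q.850 class for a cause value between 1 and 127."""
    if not 1 <= code <= 127:
        raise ValueError(f"cause value out of range: {code}")
    return Q850_CLASSES[code // 16]


def describe_cause(code: int) -> str:
    """Return a one-line description of a cause value for a troubleshooting note."""
    text = Q850_CAUSES.get(code, "unlisted cause")
    return f"{code}: {text} ({cause_class(code)})"


print(describe_cause(16))
print(describe_cause(41))
```

A value in a "normal event" class suggests that the far end or the network cleared the call deliberately, whereas a value in the "resource unavailable" or "protocol error" classes points toward a circuit or provider problem worth raising with the service provider.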
Figure 1-4 Network After You Continue Narrowing Down the Possible Suspects
Conclusions
As this case study has demonstrated, the more information you can obtain about the problem, the easier it is to get to the root cause. For example, without the times the dropped calls occurred, it would have been almost impossible to find them in the trace files on a busy system. In a large enterprise deployment, it is good to arm your help desk with a list of questions to ask depending on the problem being reported.
The point of this example is not to teach you how to troubleshoot a specific problem or to find out exactly why the user's calls are being dropped. It is to show you how to approach a problem in order to isolate it and break it into more manageable pieces. The same principles can be applied to almost any problem you are troubleshooting.
So remember, first put on your detective hat and gather enough information to isolate the problem to a few pieces of the system. Then dig deeper into each component by breaking the problem into more manageable pieces. Finally, apply your expertise to each of the smaller pieces until you find the resolution to your problem.