Examining VoIP Call Legs and VoIP Media Transmission
VoIP Overview
VoIP traffic has both similarities to and differences from traditional telephony. The most apparent difference is the transport method. Traditional telephony uses circuit-switched technology, in which the physical wire leading to one device is electronically connected to the physical circuit of another device. The technology switches (connects) physical electrical circuits. VoIP uses packet-switching technology. The source device creates voice traffic in the form of IP packets, and these packets are forwarded in the same manner as data packets. Each router or switch examines the addressing of the individual packet and sends it toward its destination. When the packet arrives at the destination, the end device extracts the voice from the packet and plays it to the user.
Both traditional and VoIP methods use similar basic signaling functions. Supervisory signaling such as off-hook, address signaling such as dual-tone multifrequency (DTMF) tones, and informational signaling such as dial tone are very much the same in both environments.
Traditional telephony uses a wide array of signaling protocols. Digital circuits use methods such as ISDN, Signaling System 7 (SS7), and Q Signaling (QSIG). Analog ports use protocols such as loop-start, ground-start, wink-start, DTMF, and pulse dialing. VoIP devices use signaling protocols such as H.323, Media Gateway Control Protocol (MGCP), and Session Initiation Protocol (SIP) to control call setup and teardown.
Traditional voice uses a dedicated physical circuit to connect devices. This can be a 4-pair station cable to a phone or a 2-pair T1 PRI that can carry 23 calls to the public switched telephone network (PSTN). VoIP traffic is carried using User Datagram Protocol (UDP)-based Real-Time Transport Protocol (RTP) packets. Because an RTP flow carries media in one direction only, each conversation has several RTP flows associated with it, at least one per direction.
VoIP Components
A VoIP network relies on a collection of specialized hardware and protocols. This section examines some of the primary components found in today's VoIP networks.
Figure 2-1 shows the basic components of an IP telephony network.
Figure 2-1 IP Telephony Network
Descriptions of these terms are as follows:
- IP phone: Provides IP voice to the desktop.
- Gatekeeper: Provides Call Admission Control (CAC), bandwidth control and management, and address translation.
- Gateway: Provides translation between VoIP and non-VoIP networks, such as the PSTN. A gateway also provides physical access for local analog and digital voice devices, such as telephones, fax machines, key sets, and PBXs.
- Multipoint control unit (MCU): Mixes audio/video streams, thus allowing participants in multiple locations to attend the same conference.
- Call agent: Provides call control for IP phones, CAC, bandwidth control and management, and address translation. A Cisco UCM server often serves as a call agent in a Cisco IP telephony deployment.
- Application server: Provides services such as voice mail, unified messaging, and Cisco UCM Attendant Console.
- Videoconference station: Provides access for end-user participation in videoconferencing. The videoconference station contains a video capture device for video input and a microphone for audio input. The user can view video streams and hear the audio that originates at a remote user station. Videoconferencing is a fast-growing area of IP communications. On the desktop end of the scale, Cisco offers Video Advantage, and on the high end, TelePresence.
Other components, such as software voice applications, interactive voice response (IVR) systems, and softphones, provide additional services to meet the needs of enterprise sites.
Major Steps of Voice Processing in VoIP
For transmission over an IP network, the voice waveform must be sampled, quantized, encoded, optionally compressed, and then encapsulated in a VoIP packet. The first four steps are performed by a digital signal processor (DSP) in the originating gateway. The packets are then transported through the IP network to the destination gateway, where the voice portion of the packet is retrieved. A DSP resource decodes the voice payload and reconstructs the analog signal, which is played to the end user.
Converting Voice to VoIP
The conversion of spoken voice and other audio signals to VoIP contains three necessary steps and one optional step:
- Sample the analog signal regularly. The sampling rate must be at least twice the highest frequency in the signal so that playback does not sound choppy or overly smoothed. The sampling rate in telephony is 8000 samples per second (8 kHz), or one sample every 125 microseconds. This rate follows from Nyquist's theorem and reflects the fact that the voice band carried in telephony extends from 0 to 4000 Hz.
- Quantize the sample. Quantization consists of a scale made up of eight major segments. Each segment is subdivided into 16 intervals. The segments are not equally spaced but are actually finest near the origin. Intervals are equal within the segments but different when they are compared between segments. Finer graduations at the origin result in less distortion for low-level tones. The two types of quantization are mu-law and a-law.
- Encode the value into an 8-bit digital form. Encoding maps a value derived from the quantization to an 8-bit number (octet).
- (Optional) Compress the samples to reduce bandwidth. Signal compression is used to reduce the bandwidth usage per call. Refer to Table 1-2 to see the relationship between codec and bandwidth utilization.
The first three bullets describe the Pulse Code Modulation (PCM) process, which is used by the G.711 codec. Compression is performed by low-bit-rate codecs such as G.729, G.728, G.726, or Internet Low Bitrate Codec (iLBC).
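The sample, quantize, and encode steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference G.711 implementation: the function names, the 1-kHz test tone, and the 16-bit linear input format are assumptions made for the example, and the bias and clip constants follow the commonly published mu-law encoding routine. A-law and the optional compression step are not shown.

```python
import math

SAMPLE_RATE_HZ = 8000   # telephony sampling rate: one sample every 125 microseconds
BIAS = 0x84             # 132, added to the magnitude before the segment search
CLIP = 32635            # largest magnitude that still fits in 15 bits after biasing


def sample_tone(freq_hz, duration_ms, amplitude=8000):
    """Step 1: sample a sine tone at 8 kHz, returning 16-bit linear values."""
    count = SAMPLE_RATE_HZ * duration_ms // 1000
    return [
        int(round(amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ)))
        for n in range(count)
    ]


def linear_to_mulaw(sample):
    """Steps 2 and 3: quantize one linear sample and encode it as one octet."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment (0-7): position of the highest set bit among bits 7..14.
    segment = 7
    mask = 0x4000
    while segment > 0 and not (magnitude & mask):
        segment -= 1
        mask >>= 1
    # Interval (0-15): one of 16 equal steps inside the chosen segment.
    interval = (magnitude >> (segment + 3)) & 0x0F
    # Layout is 1 sign bit + 3 segment bits + 4 interval bits; mu-law inverts all bits.
    return ~(sign | (segment << 4) | interval) & 0xFF


samples = sample_tone(freq_hz=1000, duration_ms=20)
octets = bytes(linear_to_mulaw(s) for s in samples)
print(len(samples), "samples ->", len(octets), "octets")
```

Because the segments double in width as the magnitude grows, quiet samples near the origin map to many distinct codes while loud samples share far fewer, which is the "finer graduations at the origin" behavior described in the quantization bullet.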
VoIP Packetization
After the voice waveform is digitized, the DSP collects the digitized data until there is enough to fill the payload of a single packet.
With G.711, either 20 ms or 30 ms worth of voice is transmitted in a single packet. 20 ms worth of voice corresponds to 160 samples per packet. With 20 ms worth of voice per packet, 50 packets are created per second: 1 sec / 20 ms = 50.
The packetization rate has a direct effect on the total amount of bandwidth needed. More packets require more headers, and each header adds 40 bytes to the packet. Table 1-5 shows the effect of packetization rates on bandwidth utilization.
Codecs such as G.729 also compress the digitized output. G.729 creates a codeword for every 10 ms of voice. This "codeword" is a predefined representation of a 10-ms sample of human voice. Two codewords are contained in each packet at 50 packets per second or three codewords at 33.3 packets per second. Because the codewords need fewer bits, the overall bandwidth required is reduced.
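The packetization arithmetic above can be checked with a short calculation. The sketch below assumes the nominal codec rates of 64 kbps for G.711 and 8 kbps for G.729 and counts only the 40-byte IP/UDP/RTP header; Layer 2 overhead is deliberately left out, so the results are lower than what a real link carries. The function names are illustrative.

```python
IP_UDP_RTP_HEADER_BYTES = 40   # combined header size used in this chapter


def packets_per_second(packet_ms):
    """Number of packets created per second for a given packetization period."""
    return 1000 / packet_ms


def payload_bytes(codec_bps, packet_ms):
    """Voice payload carried in one packet."""
    return codec_bps * packet_ms / 1000 / 8


def ip_bandwidth_bps(codec_bps, packet_ms, header_bytes=IP_UDP_RTP_HEADER_BYTES):
    """Per-call bandwidth at the IP layer (no Layer 2 framing)."""
    packet_bytes = payload_bytes(codec_bps, packet_ms) + header_bytes
    return packet_bytes * 8 * packets_per_second(packet_ms)


# Assumed nominal codec bit rates.
for name, codec_bps in (("G.711", 64000), ("G.729", 8000)):
    for packet_ms in (20, 30):
        print(
            f"{name} {packet_ms} ms: "
            f"{payload_bytes(codec_bps, packet_ms):.0f}-byte payload, "
            f"{packets_per_second(packet_ms):.1f} pps, "
            f"{ip_bandwidth_bps(codec_bps, packet_ms) / 1000:.1f} kbps"
        )
```

Moving from 20 ms to 30 ms of voice per packet lowers the packet rate from 50 to about 33.3 per second, so fewer headers are sent and the per-call bandwidth drops, at the cost of added packetization delay.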
VoIP Media Transmission
After a VoIP call is established using the previously discussed signaling protocols, the digitized voice samples need to be transmitted. These voice samples are often called the voice media. Following is a collection of voice media protocols that might be found in a VoIP environment:
- Real-Time Transport Protocol (RTP): RTP is a Layer 4 protocol that is encapsulated inside UDP segments. RTP is the protocol that carries the actual digitized voice samples in a call.
- RTP Control Protocol (RTCP): RTCP is a companion protocol to RTP. Both RTP and RTCP operate at Layer 4 and are encapsulated in UDP. Cisco devices typically use UDP ports 16384 to 32767 for RTP and RTCP. Within that range, RTP uses the even port numbers, whereas RTCP uses the odd port numbers. While RTP is responsible for carrying the voice stream, RTCP carries information about the RTP stream, such as latency, jitter, and the number of packets and octets sent and received.
- Compressed RTP (cRTP): One of the challenges with RTP is its overhead. Specifically, the combined IP, UDP, and RTP headers are approximately 40 bytes in size, whereas a common voice payload size on a VoIP network is only 20 bytes, which includes 20 ms of voice by default. In that case, the header is twice the size of the payload. Fortunately, Cisco supports cRTP, which is commonly referred to as RTP header compression. cRTP can reduce the 40-byte header to 2 or 4 bytes in size (depending on whether UDP checksums are in use), as shown in Figure 2-2.
Figure 2-2 Compressed RTP
- Secure RTP (sRTP): To help prevent an attacker from intercepting and decoding or possibly manipulating voice packets, sRTP supports encryption of RTP packets. In addition, sRTP provides message authentication, integrity checking, and protection against replay attacks.
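To make the media protocols above more concrete, the following Python sketch builds a minimal RTP packet, derives the companion control port from an even RTP port, and compares header overhead with and without header compression. It is an illustration under stated assumptions: the helper names, the SSRC value, and the all-zero payload are hypothetical, the header carries no CSRC entries or extensions, and encryption is not shown.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0            # static RTP payload type for G.711 mu-law
IP_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12  # fixed RTP header with no CSRC entries


def build_rtp_packet(seq, timestamp, ssrc, payload, payload_type=PT_PCMU):
    """Prepend a fixed 12-byte RTP header to a voice payload."""
    first_octet = RTP_VERSION << 6        # no padding, no extension, CSRC count 0
    second_octet = payload_type & 0x7F    # marker bit left clear
    header = struct.pack(
        "!BBHII",                          # network byte order
        first_octet,
        second_octet,
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


def control_port_for(rtp_port):
    """Return the odd control port paired with an even RTP port."""
    if rtp_port % 2:
        raise ValueError("RTP is expected on an even port")
    return rtp_port + 1


def header_share(header_bytes, voice_bytes):
    """Fraction of each packet that is header rather than voice."""
    return header_bytes / (header_bytes + voice_bytes)


# Hypothetical values: a 20-byte payload (20 ms of compressed voice) and an arbitrary SSRC.
packet = build_rtp_packet(seq=1, timestamp=160, ssrc=0x1234ABCD, payload=bytes(20))
full_header = IP_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES

print(len(packet), "bytes of RTP header plus payload")
print("control port for 16384:", control_port_for(16384))
print(f"uncompressed header share: {header_share(full_header, 20):.0%}")
print(f"2-byte compressed header share: {header_share(2, 20):.0%}")
```

With a 20-byte payload, the uncompressed 40-byte header accounts for two-thirds of every packet, whereas a 2-byte compressed header accounts for less than a tenth, which is why header compression matters most on slow links carrying low-bit-rate codecs.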