Voice Activity Detection
In normal voice conversations, someone speaks and someone else listens. Today's toll networks dedicate a bidirectional, 64,000 bits per second (bps) channel to the call, regardless of whether anyone is speaking. Because only one party is usually talking at any moment, at least 50 percent of the total bandwidth is wasted in a normal conversation. The amount of wasted bandwidth can actually be much higher if you take a statistical sampling of the breaks and pauses in a person's normal speech patterns.
When using VoIP, you can use this "wasted" bandwidth for other purposes when voice activity detection (VAD) is enabled. As shown in Figure 7-5, VAD works by measuring the magnitude of speech in decibels (dB) and deciding when to stop framing the voice.
Figure 7-5 Voice Activity Detection
Typically, when the VAD detects a drop-off of speech amplitude, it waits a fixed amount of time before it stops putting speech frames in packets. This fixed amount of time is known as hangover and is typically 200 ms.
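The hangover behavior described above can be sketched in a few lines of Python. This is only an illustration, not the algorithm any particular codec or gateway uses: the 20-ms frame size, the -40 dB threshold, and the function names are assumptions made for the example; only the 200-ms hangover comes from the text.

```python
import math

FRAME_MS = 20          # assumed frame duration (not from the text)
HANGOVER_MS = 200      # typical hangover time cited in the text
THRESHOLD_DB = -40.0   # assumed speech threshold, dB relative to full scale


def frame_level_db(samples):
    """Return the mean-power level of one frame in dB relative to full scale.

    `samples` are floats in the range [-1.0, 1.0]. A silent or empty
    frame returns negative infinity.
    """
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square)


def vad_decisions(levels_db, threshold_db=THRESHOLD_DB,
                  hangover_ms=HANGOVER_MS, frame_ms=FRAME_MS):
    """Return one boolean per frame: True means the frame is packetized.

    A frame at or above the threshold is always sent and re-arms the
    hangover timer. Frames below the threshold are still sent until
    the hangover timer runs out.
    """
    hangover_frames = hangover_ms // frame_ms
    remaining = 0
    decisions = []
    for level in levels_db:
        if level >= threshold_db:
            remaining = hangover_frames   # speech detected: re-arm hangover
            decisions.append(True)
        elif remaining > 0:
            remaining -= 1                # quiet frame, still inside hangover
            decisions.append(True)
        else:
            decisions.append(False)       # hangover expired: suppress frame
    return decisions


# Two loud frames followed by twelve quiet frames.
levels = [-20.0] * 2 + [-60.0] * 12
decisions = vad_decisions(levels)
```

With these assumed values, the hangover covers ten 20-ms frames, so the first ten quiet frames after the speech are still sent and only the remaining two are suppressed.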
With any technology, tradeoffs are made. VAD experiences certain inherent problems in determining when speech ends and begins, and in distinguishing speech from background noise. If you are in a noisy room, the speech level might not rise far enough above the background noise for VAD to tell the two apart. The level at which this happens is known as the signal-to-noise threshold (refer to Figure 7-5). In these scenarios, VAD disables itself at the beginning of the call.
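The signal-to-noise check can be pictured as a simple comparison made when the call starts. The following is a minimal sketch under stated assumptions: the 10-dB minimum and the function names are hypothetical, chosen only for illustration, and real implementations estimate the speech and noise levels in their own ways.

```python
def snr_db(speech_level_db, noise_level_db):
    """Signal-to-noise ratio in dB, given both levels in dB."""
    return speech_level_db - noise_level_db


def should_enable_vad(speech_level_db, noise_level_db, min_snr_db=10.0):
    """Return True if speech stands far enough above the noise for VAD.

    `min_snr_db` is an assumed threshold for this example, not a value
    taken from any standard or product.
    """
    return snr_db(speech_level_db, noise_level_db) >= min_snr_db


quiet_office = should_enable_vad(-20.0, -60.0)   # 40 dB above the noise
noisy_room = should_enable_vad(-30.0, -35.0)     # only 5 dB above the noise
```

In this sketch the quiet office keeps VAD enabled, while the noisy room falls below the assumed threshold, so VAD would be disabled and every frame sent.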
Another inherent problem with VAD is detecting when speech begins. Typically the beginning of a sentence is cut off or clipped (refer to Figure 7-5). This phenomenon is known as front-end speech clipping. Usually, the person listening to the speech does not notice front-end speech clipping.