RTP
The RTP specification, RFC 3550, describes how senders packetize and transmit media to receivers over the network. Using RTP packets alone, receivers can reconstruct and play audio and video streams from a sender and maintain continuous, glitch-free playback. However, to synchronize separate streams, senders and receivers must also use RTCP packets. This section covers RTP packets for unsynchronized stream playback; the next section covers RTCP packets for adding lip sync.
Canonical RTP Model
Figure 7-12 shows the canonical RTP/RTCP model for a video/audio sender and receiver.
Figure 7-12 Canonical RTP/RTCP Model
Figure 7-12 shows five different clocks.
At the sender
- Clock A, used by the audio capture hardware to sample audio data
- Clock B, used by the video capture hardware to sample video data
- Clock C, the "common timebase" clock at the sender, used for the purposes of stream synchronization with RTCP packets
At the receiver
- Clock D, used by the audio playout hardware to play audio data
- Clock E, used by the video display hardware to display video data
A separate crystal oscillator drives each clock, which means that none of the clocks are synchronized to each other. In most video conferencing systems, the sender audio clock also provides the common timebase clock; however, this example considers the most general case, in which they differ.
RTP Time Stamps
Each capture device (microphone and video capture hardware) has a clock that provides the RTP time stamps for its media stream. The units for the RTP time stamps depend on whether the media stream is audio or video:
- For the audio stream, RTP uses a sample clock equal to the audio sample rate. For example, an 8-kHz audio stream uses a sample clock of 8 kHz. In this case, RTP time stamps for audio are actually sample stamps, because each time stamp can be read as a sample index. If an RTP packet has a time stamp of 0 and contains 300 samples, then, assuming the audio is continuous, the following RTP packet has a time stamp of 300.
- For video streams, RTP uses a sample clock equal to 90 kHz. For example, consider an endpoint that encodes a 25-FPS video sequence, derived by encoding every other field of PAL video: if a video frame consists of RTP packets with RTP time stamp 0, the next video frame consists of RTP packets with an RTP time stamp of 1/25 × 90,000 = 3600. The sender may split a large encoded video frame into multiple RTP packets, in which case all RTP packets belonging to the same frame carry the same RTP time stamp. The sketch after this list works through the arithmetic for both cases.
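The following sketch (Python, with hypothetical helper names; not from the specification) illustrates the time stamp arithmetic for both media types:

```python
# RTP time stamp arithmetic for the two examples above.
VIDEO_RTP_CLOCK = 90000  # video streams use a 90-kHz RTP clock
VIDEO_FPS = 25           # for example, every other field of PAL video

def audio_timestamps(start_ts, samples_per_packet, num_packets):
    """Continuous audio: each packet's time stamp advances by its sample count."""
    return [start_ts + i * samples_per_packet for i in range(num_packets)]

def video_timestamps(start_ts, num_frames):
    """One time stamp per frame; all packets of a frame share it."""
    ticks_per_frame = VIDEO_RTP_CLOCK // VIDEO_FPS  # 90000 / 25 = 3600
    return [start_ts + i * ticks_per_frame for i in range(num_frames)]

print(audio_timestamps(0, 300, 3))  # [0, 300, 600]
print(video_timestamps(0, 3))       # [0, 3600, 7200]
```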
Remember that the RTP time stamps for the video stream and the audio stream are not related to each other. In particular, keep the following in mind:
- The video and audio streams do not begin transmission with the same RTP time stamp. According to the RTP specification, the sender must use a randomly selected initial RTP time stamp for each stream, to make known-plaintext attacks more difficult in case the endpoints encrypt the streams.
- The crystal clocks on the audio capture hardware and video capture hardware are different (and therefore, unsynchronized).
Because the crystal clocks used for audio and video may differ, these clocks might drift past each other. Crystal clocks typically have an accuracy of ±100 parts per million (ppm). As an example of clock drift, consider the following scenario, which exaggerates the drift to ±1000 ppm to make the effect easy to see:
- If the audio clock is running at a frequency 1000 ppm below its nominal frequency, it is running 0.1 percent too slowly.
- If the video clock is running at a frequency 1000 ppm above its nominal frequency, it is running 0.1 percent too fast.
In this example, the timebase of the video clock is fast relative to the timebase of the audio clock. Figure 7-13 shows how the timebases crawl past each other over time.
Figure 7-13 Clock Crawl for Nonsynchronized Clocks
In Figure 7-13, time T corresponds to a real-world time span of 1000 seconds. At this point, the audio timebase reads 999 seconds, and the video timebase reads 1001 seconds. Although a drift of ±0.1 percent might not seem like much, it accumulates over time; if this drift is not taken into account, the streams will play 2 seconds out of sync on the receive endpoint after a duration of 1000 seconds. A robust conferencing system must ensure long-term lip sync, ideally with a skew of less than 20 ms between audio and video presentation times.
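A quick worked calculation (a Python sketch using the Figure 7-13 numbers) confirms the 2-second skew and shows how quickly a 20-ms budget is exhausted at this drift rate:

```python
# Drift accumulation for the Figure 7-13 example.
real_elapsed = 1000.0       # seconds of wall-clock time
audio_rate_error = -0.001   # audio clock 0.1 percent slow
video_rate_error = +0.001   # video clock 0.1 percent fast

audio_reading = real_elapsed * (1 + audio_rate_error)  # ~999.0 s
video_reading = real_elapsed * (1 + video_rate_error)  # ~1001.0 s
print(video_reading - audio_reading)                   # ~2.0 s of skew

# Wall-clock time until the skew exceeds a 20-ms lip-sync budget:
print(0.020 / (abs(audio_rate_error) + abs(video_rate_error)))  # ~10 s
```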
Because RTP time stamps in the video and audio streams are not directly related to each other, a receiver cannot determine how to synchronize RTP audio and video streams by looking at RTP packets alone.
To synchronize audio and video, the receiver must be able to map the RTP time stamps from each stream into a common timebase. The RTCP protocol (discussed later) provides this capability.
Using RTP for Buffer-Level Management
Using only RTP packets without RTCP packets, receivers can establish buffer-level management. Receivers must establish an audio jitter buffer level equal to the minimum level required to absorb network jitter and prevent the nonmalleable audio device from starving. Then, during the video conference, receivers must monitor the short-term average jitter buffer level to ensure that it is large enough to absorb arrival-time variations of the currently observed network jitter. In addition to short-term swings, the average buffer level may slowly rise or fall over a long period of time because of differences in the exact frequencies of crystal clocks at the sender and receiver, in which case the receiver must intervene. Buffer-level management is the process of maintaining a relatively constant average jitter buffer level in the face of both short-term variance in the packet arrival times and long-term drift from mismatched sender and receiver clocks.
To achieve buffer-level management, receivers first establish a relationship between incoming RTP packets and the audio device timebase of the receiver as follows:
ATBout = RTPin + Krl
RTPin represents the RTP time stamp of the incoming packet, and ATBout represents the time stamp in the audio device timebase of the receiver. The audio device timebase of the receiver is defined by the audio device playout clock. Krl is an offset, chosen by the receiver, that maps one timebase into another. Both the RTP time stamp and the audio device timebase are in the same units, equal to the sample rate of the audio stream.
The audio playout device on the receiver typically operates using a pull model: After the receiver activates the audio device, the audio device starts issuing continual interrupt requests for data. The receiver responds to each interrupt by transferring data to the audio device. In this model, the audio playout device generates an interrupt to ask for audio and specifies the audio device time ATBout at which the audio must play. The receiver then uses Krl to calculate the corresponding RTPin RTP time stamp. The receiver must supply this data by retrieving it from the decoder, which in turn retrieves it from the jitter buffer. The value of Krl therefore enforces a mapping from RTP time stamp to audio device timebase, and the receiver must comply with this mapping.
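A minimal sketch of this pull model follows; the JitterBuffer class here is a hypothetical stand-in for the decoder and jitter buffer, and only the Krl mapping is the point:

```python
class JitterBuffer:
    """Stand-in for the decoder + jitter buffer, keyed by RTP time stamp."""
    def __init__(self):
        self.blocks = {}  # RTP time stamp -> decoded audio block

    def put(self, rtp_ts, samples):
        self.blocks[rtp_ts] = samples

    def pop(self, rtp_ts):
        return self.blocks.pop(rtp_ts, None)  # None models an underrun

K_RL = 500  # offset chosen by the receiver; its derivation follows below

def on_audio_interrupt(atb_out, jb):
    """The device asks for the block it will play at device time atb_out."""
    rtp_in = atb_out - K_RL  # invert ATBout = RTPin + Krl
    return jb.pop(rtp_in)

jb = JitterBuffer()
jb.put(0, "320 decoded samples")    # first packet, RTP time stamp 0
print(on_audio_interrupt(500, jb))  # device time 500 maps back to RTP 0
```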
The receiver establishes the value of Krl when the first RTP packet arrives. At this time, the receiver assigns a preliminary mapping from the RTP time stamp of the packet to the audio device timebase (ATB) of the receiver. This mapping achieves buffer management but not synchronization; the receiver must add delay to either the video or audio streams (discussed later) to achieve lip sync.
The receiver establishes the minimum Krl offset needed to satisfy jitter buffer level requirements. After the receiver selects Krl, it can set in motion the data pipeline for the audio stream.
The equation to calculate Krl uses several delay values, all of which are in units of the audio sample rate:
- The current level of the jitter buffer is A. A will be nonzero if RTP packets have arrived before the receiver decides on a value of Krl.
- The required nominal jitter buffer level is B.
- The playout hardware delay is C.
- The current audio device timebase time is D.
- The RTP time stamp of the first RTP packet is RTPin1.
The RTP time stamp RTPin1 of the first RTP packet should be mapped to an audio device time ATB1 of
ATB1 = D + (B – A) + C
In other words, starting from the time right now, the stream must wait until its input buffer rises from its current level A to the desired nominal jitter buffer level B, which takes (B – A) units. The number of audio samples (B – A) represents the time during which the receiver "primes the input pipe" by filling the jitter buffer, which feeds the audio playout device. The (B – A) offset is required because the receive endpoint should not start playing audio through the audio playout device until the jitter buffer has achieved its nominal level. Alternatively, the receiver can supply initial silence audio to the audio playout device to quickly set the jitter buffer to its nominal level.
The receive endpoint estimates a desired value for B, based on the expected characteristics of the network packet jitter. A higher level of network jitter requires a larger input jitter buffer.
If more data has arrived than is needed to fill the jitter buffer to the required level, (B – A) will be negative. A negative value for (B – A) means that a portion of audio data at the beginning of the transmission must be discarded to reduce the jitter buffer to its nominal value.
The receiver logic must also take into account a delay of C through the playout hardware. For the audio playout device, the delay consists of the latency from the time the receiver passes a media packet to the playout hardware until the time the playout hardware passes the data to the D/A converter.
The preliminary offset Krl used for this mapping is as follows:
Krl = ATB1 – RTPin1 = D + (B – A) + C – RTPin1
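Plugging illustrative numbers into this formula (a sketch only; a real implementation must also handle 32-bit RTP time stamp wraparound, which is ignored here):

```python
def preliminary_krl(D, B, A, C, rtp_in1):
    """Krl = ATB1 - RTPin1, where ATB1 = D + (B - A) + C.
    All arguments are in audio-sample units."""
    atb1 = D + (B - A) + C  # device time at which the first packet plays
    return atb1 - rtp_in1

# Example: device clock D reads 16000; nominal buffer level B is 2400 samples;
# A = 300 samples have already arrived; playout-hardware delay C is 800 samples;
# the first packet carries a random initial RTP time stamp of 911384.
print(preliminary_krl(D=16000, B=2400, A=300, C=800, rtp_in1=911384))
# -892484: a negative offset is fine; only the mapping it enforces matters
```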
The receiver now uses this value of Krl to map input RTP time stamps to time stamps in the audio device timebase. The receiver might need to change the level of the input buffer over time by changing the value of Krl. However, changing Krl makes the audio stream going to the playout device discontinuous. If the receiver increases Krl, the next packet plays at a later-than-normal time, leaving a gap in the audio stream; by default, the audio playout device is likely to fill this gap with silence. If the receiver decreases Krl, the previous and next packets overlap, and the duplicate samples must be discarded. Either the gap or the discarded samples produce a glitch in the audio stream. The receiver can, however, use two methods to change Krl without an objectionable glitch:
- The receiver can scale (stretch) the incoming audio data up or down by a small amount so that listeners will not notice the change. When decoded data is scaled up, the data rate entering the receiver effectively increases, and the jitter buffer level increases over time. The opposite effect occurs if the receiver scales down the input data. This method can be used to slowly change the jitter buffer level.
- The receiver can wait for a duration of silence in the audio stream and then change Krl, which has the effect of increasing or decreasing the duration of that silence. Listeners will not notice this change in the silence interval. This method can be used to abruptly change the jitter buffer level, as sketched below.
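The following fragment sketches the second method; is_silent() is a hypothetical voice-activity check, and target_delta is the total change the receiver wants to apply to Krl:

```python
def adjust_krl_during_silence(krl, target_delta, audio_block, is_silent):
    """Apply a pending Krl change only while silence is playing, so the
    change lengthens or shortens the silent interval instead of speech."""
    if target_delta != 0 and is_silent(audio_block):
        return krl + target_delta, 0  # whole change applied at once
    return krl, target_delta          # keep waiting for a silent block

# Example: a trivial all-zero-samples silence check.
krl, pending = adjust_krl_during_silence(
    500, 160, b"\x00" * 640, lambda block: max(block) == 0)
print(krl, pending)  # 660 0
```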
If the receiver does not require media synchronization, the only task left for the receiver is to manage the buffer level over time. The receiver can create a similar pipeline for video and perform the same type of buffer management. However, the video stream has the benefit of being a malleable medium, so the value of Krl can be changed on the fly without an objectionable glitch in the output. Another difference between the audio and video paths is this: the receiver typically measures the local video device timebase in units of seconds, instead of in the 90-kHz RTP sample rate, as the final sketch below illustrates.
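The conversion between the two unit systems is a one-liner (a sketch; krl_video is an assumed per-stream offset in 90-kHz ticks):

```python
VIDEO_RTP_CLOCK = 90000  # ticks per second

def video_rtp_to_device_seconds(rtp_ts, krl_video):
    """Map a video RTP time stamp (plus its Krl offset) into seconds."""
    return (rtp_ts + krl_video) / VIDEO_RTP_CLOCK

print(video_rtp_to_device_seconds(3600, 0))  # 0.04 s, one 25-FPS frame time
```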