A distinguishing feature of audio and video is the volume of data required for a typical stream, especially compared to the text and graphics streams traditionally carried on packet networks. CD-quality audio, uncompressed television-quality video, and HDTV (high-definition television) quality video require rates on the order of Mb/s, 100 Mb/s, and Gb/s, respectively. Compression, particularly for video streams[9], can reduce the bandwidth of a combined (television-quality) video and audio stream to 1.5 Mb/s using MPEG-1[10], while for applications with more modest quality requirements, such as video-conferencing, the bandwidth can drop to between 64 and 384 kb/s[11]. Because these applications may require the simultaneous transmission of multiple data streams, the aggregate data rates are high enough to make resource considerations important, even though transmission bandwidth is constantly increasing in new-generation networks and buffer memory costs are dropping.
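The orders of magnitude quoted above can be checked with a short calculation. The sketch below is illustrative only: the CD audio parameters (44.1 kHz, 16 bits, stereo) are the standard ones, but the video frame sizes, the 24 bits per pixel, and the 30 frames per second are assumptions chosen for the example, and the function names are hypothetical.

```python
def audio_rate_bps(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) audio bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def video_rate_bps(width, height, bits_per_pixel, frames_per_second):
    """Raw (uncompressed) video bit rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


# CD-quality audio: 44.1 kHz sampling, 16-bit samples, two channels.
cd_audio = audio_rate_bps(44_100, 16, 2)          # 1_411_200 b/s

# Assumed television-quality frame: 640x480, 24 bits/pixel, 30 frames/s.
tv_video = video_rate_bps(640, 480, 24, 30)       # 221_184_000 b/s

# Assumed HDTV frame: 1920x1080, 24 bits/pixel, 30 frames/s.
hdtv_video = video_rate_bps(1920, 1080, 24, 30)   # 1_492_992_000 b/s

# Compression factor needed to fit the assumed TV stream into 1.5 Mb/s.
mpeg1_target_bps = 1_500_000
compression_factor = tv_video / mpeg1_target_bps

for name, rate in [("CD audio", cd_audio),
                   ("TV video", tv_video),
                   ("HDTV video", hdtv_video)]:
    print(f"{name}: {rate / 1e6:.1f} Mb/s")
print(f"TV video to 1.5 Mb/s needs roughly {compression_factor:.0f}:1 compression")
```

Under these assumptions the raw rates come out at roughly 1.4 Mb/s, 221 Mb/s, and 1.5 Gb/s, consistent with the orders of magnitude given in the text; other frame formats would shift the video figures but not their scale.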
A second relevant aspect is that many audio and video applications are interactive, in the sense that data reception is interleaved with playback of the associated media streams, rather than playback starting only after reception is complete. This requires bounded delays between sender and receiver. In fact, for audio and video to be used effectively in these situations, i.e. without forcing the communicating parties to modify their behavior from that of face-to-face communication, such delays are expected to be small. Studies have determined that a certain amount of delay is imperceptible or, at least, tolerable by humans; various guidelines set this tolerance to between 40 and 600 ms[12][13][14].
A problem related to bounding the maximum transmission delays is that of bounding the delay variance, usually called jitter in this context. To avoid distracting the human user, jitter is usually smoothed out at the receiver by buffering and delaying the playback time of received data. Although this improves playback quality, it increases the total delay experienced at the receiver, a problem for interactive applications. In addition, it increases memory requirements for buffering, which may be a problem[15] considering the amount of data involved, even for very short time periods.
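The trade-off between jitter smoothing, added delay, and buffer memory can be made concrete with a minimal sketch of fixed-delay playout. It assumes that sender and receiver timestamps are directly comparable (i.e. synchronized clocks) and that every data unit is scheduled for playback a fixed `playout_delay_ms` after it was sent. The function names and the sample delays are hypothetical and serve only as an illustration.

```python
def playout_schedule(send_times_ms, arrival_times_ms, playout_delay_ms):
    """Fixed-delay playout: unit i is played at send time + playout delay.

    Returns the playout times and, for each unit, whether it arrived
    too late to be played (arrival after its scheduled playout time).
    """
    playout = [s + playout_delay_ms for s in send_times_ms]
    late = [a > p for a, p in zip(arrival_times_ms, playout)]
    return playout, late


def buffer_bytes(stream_rate_bps, playout_delay_ms):
    """Upper bound on buffered data: everything sent during one playout delay."""
    return stream_rate_bps * playout_delay_ms // (8 * 1000)


# Units sent every 20 ms; assumed network delays vary between 30 and 80 ms.
send_times = [0, 20, 40, 60]
network_delays = [30, 55, 35, 80]
arrivals = [s + d for s, d in zip(send_times, network_delays)]

# A 60 ms playout delay absorbs most of the jitter, but the last unit is late.
_, late_60 = playout_schedule(send_times, arrivals, 60)

# An 80 ms playout delay absorbs all of it, at the cost of 20 ms more latency.
_, late_80 = playout_schedule(send_times, arrivals, 80)

print(late_60)                          # [False, False, False, True]
print(late_80)                          # [False, False, False, False]
print(buffer_bytes(1_500_000, 80))      # 15000 bytes for a 1.5 Mb/s stream
print(buffer_bytes(100_000_000, 80))    # 1000000 bytes for a 100 Mb/s stream
```

The playout delay must cover the largest network delay to avoid late data, so a larger jitter bound translates directly into both higher end-to-end latency and a proportionally larger buffer.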
While applications expecting nearly real-time interaction, with delays practically imperceptible by humans, are more challenging, non-interactive applications that maintain the characteristic of interleaved reception and playback also pose interesting problems. For instance, video distribution based on the "TV" model expects relatively infrequent interactions with the human viewer (e.g. changing the channel); video distribution based on the "VCR" model will include interactions due to user commands to control information flow (e.g. slow motion) and expect them to take effect immediately (by human reaction time standards). In both cases, synchronization is required between sender and receiver that depends not only on transmission events but also on simultaneous playback events at the receiving end.
To take into account and exploit the different characteristics of continuous media applications, and more precisely of their media components, we can classify them into two generic categories[16]: