Guides & Tutorials

**ftvfatboy** · 08-03-2006, 03:24 PM

High-Definition Television

--------------------------------------------------------------------------------

The NTSC standard was created in the 1930s for black-and-white television transmissions. NTSC stands for National Television System Committee. The standard specifies the shape of the signal sent by a television transmitter. The signal is analog, with an amplitude that goes up and down during each scan line in response to the black and white parts of the line. Color was added in 1953, after four years of testing, but it had to be added in such a way that black-and-white television sets could still display the color signal in black and white. The result was phase modulation of the black-and-white carrier, a kludge (television engineers joke that NTSC stands for “never the same color twice”).

With the explosion of computers and digital equipment in the last two decades came the realization that a digital signal is a better, more reliable way of sending images over the air. In such a signal the image is sent pixel by pixel, where each pixel is represented by a number specifying its color. The digital signal is still a wave, but the amplitude of the wave no longer represents the image. Rather, the wave is modulated to carry binary information. The term modulation means that something in the wave is modified to distinguish between the zeros and ones being sent. An FM digital signal, for example, modifies (modulates) the frequency of the wave. This type of wave uses one frequency to represent a binary zero and another to represent a one. The DTV (Digital TV) standard uses a modulation technique called 8-VSB (for vestigial sideband), which provides robust and reliable terrestrial transmission. The 8-VSB modulation technique allows for a broad coverage area, reduces interference with existing analog broadcasts, and is itself resistant to interference.

History of DTV:

The Advanced Television Systems Committee (ATSC), established in 1982, is an international organization developing technical standards for advanced video systems. Even though these standards are voluntary, they are generally adopted by the ATSC members and other manufacturers. There are currently about eighty ATSC member companies and organizations, which represent the many facets of the television, computer, telephone, and motion picture industries. The ATSC Digital Television Standard adopted by the United States Federal Communications Commission (FCC) is based on a design by the Grand Alliance (a coalition of electronics manufacturers and research institutes) that was a finalist in the first round of DTV proposals under the FCC’s Advisory Committee on Advanced Television Systems (ACATS). The ACATS is composed of representatives of the computer, broadcasting, telecommunications, manufacturing, cable television, and motion picture industries. Its mission is to assist in the adoption of an HDTV transmission standard and to promote the rapid implementation of HDTV in the U.S.

The ACATS announced an open competition: Anyone could submit a proposed HDTV standard, and the best system would be selected as the new television standard for the United States. To ensure fast transition to HDTV, the FCC promised that every television station in the nation would be temporarily lent an additional channel of broadcast spectrum.

The ACATS worked with the ATSC to review the proposed DTV standard, and gave its approval to final specifications for the various parts—audio, transport, format, compression, and transmission. The ATSC documented the system as a standard, and ACATS adopted the Grand Alliance system in its recommendation to the FCC in late 1995. In late 1996, corporate members of the ATSC had reached an agreement on the DTV standard (Document A/53) and asked the FCC to approve it. On December 31, 1996, the FCC formally adopted every aspect of the ATSC standard except for the video formats. These video formats nevertheless remain a part of the ATSC standard, and are expected to be used by broadcasters and by television manufacturers in the foreseeable future.

HDTV Specifications:

The NTSC standard in use since the 1930s specifies an interlaced image composed of 525 lines where the odd numbered lines (1, 3, 5, . . .) are drawn on the screen first, followed by the even numbered lines (2, 4, 6, . . .). The two fields are woven together and drawn in 1/30 of a second, allowing for 30 screen refreshes each second. In contrast, a noninterlaced picture displays the entire image at once. This progressive-scan type of image is what’s used by today’s computer monitors. The digital television sets that have been available since mid 1998 use an aspect ratio of 16/9 and can display both interlaced and progressive-scan images in several different resolutions, one of the best features of digital video. These formats include 525-line progressive-scan (525P), 720-line progressive-scan (720P), 1050-line progressive-scan (1050P), and 1080-interlaced (1080I), all with square pixels. Our present, analog, television sets cannot deal with the new, digital signal broadcast by television stations, but inexpensive converters will be available (in the form of a small box that can comfortably sit on top of a television set) to translate the digital signals to analog ones (and lose image information in the process).

The NTSC standard calls for 525 scan lines and an aspect ratio of 4/3. This implies (4/3) × 525 = 700 pixels per line, yielding a total of 525 × 700 = 367,500 pixels on the screen. (This is the theoretical total, since only 483 lines are actually visible.) In comparison, a DTV format calling for 1080 scan lines and an aspect ratio of 16/9 is equivalent to (16/9) × 1080 = 1920 pixels per line, bringing the total number of pixels to 1080 × 1920 = 2,073,600, about 5.64 times more than the NTSC interlaced standard.

In addition to the 1080 × 1920 DTV format, the ATSC DTV standard calls for a lower-resolution format with just 720 scan lines, implying (16/9) × 720 = 1280 pixels per line. Each of these resolutions can be refreshed at one of three different rates: 60 frames/second (for live video) and 24 or 30 frames/second (for material originally produced on film). The refresh rates can be considered temporal resolution. The result is a total of six different formats. Table 6.7 summarizes the screen capacities and the necessary transmission rates of the six formats. With high resolution and 60 frames per second the transmitter must be able to send 1080 × 1920 × 60 = 124,416,000 pixels/sec (about 14.83 Mbyte/sec even at a single bit per pixel, and proportionally more at realistic color depths), which is why this format uses compression. (It uses MPEG-2. Other video formats can also use this compression method.) The fact that DTV can have different spatial and temporal resolutions allows for tradeoffs. Certain types of video material (such as fast-moving horse or car races) may look better at high refresh rates even with low spatial resolution, while other material (such as museum-quality paintings) should ideally be watched in high resolution even with low refresh rates.
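The six spatial/temporal combinations are easy to tabulate. The short Python sketch below only multiplies out the numbers quoted above; the names `FORMATS`, `RATES`, and `pixel_rates` are made up for illustration and do not come from the ATSC documents.

```python
# Illustrative only: raw pixel throughput of the two resolutions
# at the three refresh rates mentioned in the text (six formats).
FORMATS = [(1080, 1920), (720, 1280)]   # (scan lines, pixels per line)
RATES = [24, 30, 60]                    # frames per second


def pixel_rates():
    """Return {(lines, width, fps): pixels per second} for all six formats."""
    table = {}
    for lines, width in FORMATS:
        for fps in RATES:
            table[(lines, width, fps)] = lines * width * fps
    return table


if __name__ == "__main__":
    for (lines, width, fps), rate in sorted(pixel_rates().items()):
        print(f"{lines} x {width} @ {fps} fps: {rate:,} pixels/sec")
```

Multiplying any entry by an assumed number of bits per pixel gives the uncompressed bit rate for that format.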

Digital Television (DTV) is a broad term encompassing all types of digital transmission. HDTV is a subset of DTV indicating 1080 scan lines. Another type of DTV is standard definition television (SDTV), which has picture quality slightly better than a good analog picture. (SDTV has resolution of 640×480 at 30 frames/sec and an aspect ratio of 4:3.) Since generating an SDTV picture requires fewer pixels, a broadcasting station will be able to transmit multiple channels of SDTV within its 6 MHz allowed frequency range. HDTV also incorporates Dolby Digital sound technology to bring together a complete presentation.

**ftvfatboy** · 08-03-2006, 03:25 PM

Video Compression

--------------------------------------------------------------------------------

Video compression is based on two principles. The first is the spatial redundancy that exists in each frame. The second is the fact that most of the time, a video frame is very similar to its immediate neighbors. This is called temporal redundancy. A typical technique for video compression should therefore start by encoding the first frame using a still image compression method. It should then encode each successive frame by identifying the differences between the frame and its predecessor, and encoding these differences. If the frame is very different from its predecessor (as happens with the first frame of a shot), it should be coded independently of any other frame. In the video compression literature, a frame that is coded using its predecessor is called inter frame (or just inter), while a frame that is coded independently is called intra frame (or just intra).

Video compression is normally lossy. Encoding a frame F_i in terms of its predecessor F_(i-1) introduces some distortions. As a result, encoding frame F_(i+1) in terms of F_i increases the distortion. Even in lossless video compression, a frame may lose some bits. This may happen during transmission or after a long shelf stay. If a frame F_i has lost some bits, then all the frames following it, up to the next intra frame, are decoded improperly, perhaps even leading to accumulated errors. This is why intra frames should be used from time to time inside a sequence, not just at its beginning. An intra frame is labeled I, and an inter frame is labeled P (for predictive).

Once this idea is grasped, it is possible to generalize the concept of an inter frame. Such a frame can be coded based on one of its predecessors and also on one of its successors. We know that an encoder should not use any information that is not available to the decoder, but video compression is special because of the large quantities of data involved. We usually don’t mind if the encoder is slow, but the decoder has to be fast. A typical case is video recorded on a hard disk or on a DVD, to be played back. The encoder can take minutes or hours to encode the data. The decoder, however, has to play it back at the correct frame rate (so many frames per second), so it has to be fast. This is why a typical video decoder works in parallel. It has several decoding circuits working simultaneously on several frames.

With this in mind we can now imagine a situation where the encoder encodes frame 2 based on both frames 1 and 3, and writes the frames on the compressed stream in the order 1, 3, 2. The decoder reads them in this order, decodes frames 1 and 3 in parallel, outputs frame 1, then decodes frame 2 based on frames 1 and 3. The frames should, of course, be clearly tagged (or time stamped). A frame that is encoded based on both past and future frames is labeled B (for bidirectional). Predicting a frame based on its successor makes sense in cases where the movement of an object in the picture gradually uncovers a background area. Such an area may be only partly known in the current frame but may be better known in the next frame. Thus, the next frame is a natural candidate for predicting this area in the current frame. The idea of a B frame is so useful that most frames in a compressed video presentation may be of this type. We therefore end up with a sequence of compressed frames of the three types I, P, and B. An I frame is decoded independently of any other frame. A P frame is decoded using the preceding I or P frame. A B frame is decoded using the preceding and following I or P frames. Figure 6.9a shows a sequence of such frames in the order in which they are generated by the encoder (and input by the decoder). Figure 6.9b shows the same sequence in the order in which the frames are output by the decoder and displayed. The frame labeled 2 should be displayed after frame 5, so each frame should have two time stamps, its coding time and its display time.
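To make the reordering concrete, here is a small Python sketch that turns a display-order list of frame types into the order in which an encoder could write them. It assumes the simplest possible rule (every run of B frames is held back until the next I or P frame has been written); the function name `coding_order` is hypothetical and real encoders have more options.

```python
def coding_order(display_types):
    """Return 1-based display indices in the order they are written.

    display_types is a string such as "IBBPBBP" in display order.
    Assumption: each run of B frames is written right after the next
    I or P frame, because that frame is needed to decode the B frames.
    """
    order = []
    pending_b = []
    for index, frame_type in enumerate(display_types, start=1):
        if frame_type == "B":
            pending_b.append(index)      # wait for the future reference
        else:                            # I or P frame
            order.append(index)
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)              # trailing B frames, if any
    return order


if __name__ == "__main__":
    print(coding_order("IBP"))       # the 1, 3, 2 example from the text
    print(coding_order("IBBPBBP"))
```

The display index plays the role of the display time stamp, and the position in the returned list plays the role of the coding time stamp.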

**ftvfatboy** · 08-03-2006, 03:26 PM

Video compression methods.

--------------------------------------------------------------------------------

Subsampling: The encoder selects every other frame and writes it on the compressed stream. This yields a compression factor of 2. The decoder inputs a frame and duplicates it to create two frames.

Differencing: A frame is compared to its predecessor. If the difference between them is small (just a few pixels), the encoder encodes the pixels that are different by writing three numbers on the compressed stream for each pixel: its image coordinates, and the difference between the values of the pixel in the two frames. If the difference between the frames is large, the current frame is written on the output in raw format. Compare this method with relative encoding, Section 1.3.1. A lossy version of differencing looks at the amount of change in a pixel. If the difference between the intensities of a pixel in the preceding frame and in the current frame is smaller than a certain threshold, the pixel is not considered different.
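A rough Python sketch of pixel differencing follows. Frames are plain lists of rows of integer intensities. The function names, the `max_changed` cutoff for deciding that a difference is "small", and the default of one quarter of the pixels are all assumptions made for the example, not values from the text.

```python
def diff_encode(prev, cur, threshold=0, max_changed=None):
    """Encode frame cur relative to frame prev.

    Returns ("diff", [(row, col, delta), ...]) when few pixels changed,
    or ("raw", copy_of_cur) when the frames differ too much.
    A pixel whose change is at most threshold is treated as unchanged,
    so threshold=0 is lossless and threshold>0 is the lossy variant.
    """
    changes = []
    for r, (prev_row, cur_row) in enumerate(zip(prev, cur)):
        for c, (p, q) in enumerate(zip(prev_row, cur_row)):
            delta = q - p
            if abs(delta) > threshold:
                changes.append((r, c, delta))
    if max_changed is None:
        # Hypothetical cutoff: at most a quarter of the pixels.
        max_changed = (len(cur) * len(cur[0])) // 4
    if len(changes) > max_changed:
        return ("raw", [row[:] for row in cur])
    return ("diff", changes)


def diff_decode(prev, encoded):
    """Rebuild a frame from its predecessor and the encoder output."""
    kind, payload = encoded
    if kind == "raw":
        return [row[:] for row in payload]
    frame = [row[:] for row in prev]
    for r, c, delta in payload:
        frame[r][c] += delta
    return frame
```

With `threshold=0` the decoder reproduces the frame exactly; with a positive threshold small changes are silently dropped, which is where the loss comes from.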

Block Differencing: This is a further improvement of differencing. The image is divided into blocks of pixels, and each block B in the current frame is compared to the corresponding block P in the preceding frame. If the blocks differ by more than a certain amount, then B is compressed by writing its image coordinates, followed by the values of all its pixels (expressed as differences) on the compressed stream. The advantage is that the block coordinates are small numbers (smaller than a pixel’s coordinates), and these coordinates have to be written just once for the entire block. On the downside, the values of all the pixels in the block, even those that haven’t changed, have to be written on the output. However, since these values are expressed as differences, they are small numbers. Consequently, this method is sensitive to the block size.

Motion Compensation: Anyone who has watched movies knows that the difference between consecutive frames is small because it is the result of moving the scene, the camera, or both between frames. This feature can therefore be exploited to get better compression. If the encoder discovers that a part P of the preceding frame has been rigidly moved to a different location in the current frame, then P can be compressed by writing the following three items on the compressed stream: its previous location, its current location, and information identifying the boundaries of P. The following discussion of motion compensation is based on [Manning 98]. In principle, such a part can have any shape. In practice, we are limited to equal-size blocks (normally square but can also be rectangular). The encoder scans the current frame block by block. For each block B it searches the preceding frame for an identical block C (if compression is to be lossless) or for a similar one (if it can be lossy). Finding such a block, the encoder writes the difference between its past and present locations on the output. Motion compensation is effective if objects are just translated, not scaled or rotated. Drastic changes in illumination from frame to frame also reduce the effectiveness of this method. In general, motion compensation is lossy.

Frame Segmentation: The current frame is divided into equal-size nonoverlapping blocks. The blocks may be square or rectangles. The latter choice assumes that motion in video is mostly horizontal, so horizontal blocks reduce the number of motion vectors without degrading the compression ratio. The block size is important, since large blocks reduce the chance of finding a match, and small blocks result in many motion vectors. In practice, block sizes that are integer powers of 2, such as 8 or 16, are used, since this simplifies the software.

Search Threshold: Each block B in the current frame is first compared to its counterpart C in the preceding frame. If they are identical, or if the difference between them is less than a preset threshold, the encoder assumes that the block hasn’t been moved.

Block Search: This is a time-consuming process, and so has to be carefully designed. If B is the current block in the current frame, then the previous frame has to be searched for a block identical to or very close to B. The search is normally restricted to a small area (called the search area) around B, defined by the maximum displacement parameters dx and dy. These parameters specify the maximum horizontal and vertical distances, in pixels, between B and any matching block in the previous frame. If B is a square with side b, the search area will contain (b + 2dx)(b + 2dy) pixels (Figure 6.11) and will consist of (2dx + 1)(2dy + 1) distinct, overlapping b×b squares. The number of candidate blocks in this area is therefore proportional to dx·dy.
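An exhaustive block search can be sketched in a few lines of Python. The mean absolute difference used as the distortion measure here is just one common choice and is an assumption of this example, as are the function names `mad` and `full_search` and the convention that the returned vector points from B's position to the matching block in the previous frame.

```python
def mad(cur, prev, bx, by, cx, cy, b):
    """Mean absolute difference between the b x b block of cur whose
    upper-left corner is (bx, by) and the b x b block of prev whose
    upper-left corner is (cx, cy). x counts columns, y counts rows."""
    total = 0
    for j in range(b):
        for i in range(b):
            total += abs(cur[by + j][bx + i] - prev[cy + j][cx + i])
    return total / (b * b)


def full_search(cur, prev, bx, by, b, dx, dy):
    """Test all (2dx + 1)(2dy + 1) displacements and return
    (u, v, distortion) for the best one. Candidates that would fall
    outside the previous frame are skipped."""
    height, width = len(prev), len(prev[0])
    best = (0, 0, mad(cur, prev, bx, by, bx, by, b))
    for v in range(-dy, dy + 1):
        for u in range(-dx, dx + 1):
            cx, cy = bx + u, by + v
            if cx < 0 or cy < 0 or cx + b > width or cy + b > height:
                continue
            d = mad(cur, prev, bx, by, cx, cy, b)
            if d < best[2]:
                best = (u, v, d)
    return best
```

The two nested loops over u and v are what make the cost grow in proportion to dx·dy, which is the motivation for the suboptimal methods.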

Distortion Measure: This is the most sensitive part of the encoder. The distortion measure selects the best match for block B. It has to be simple and fast, but also reliable. If even the best candidate is too different from B, the encoder has to conclude that B has no match in the preceding frame. A natural question at this point is: How can such a thing happen? How can a block in the current frame match nothing in the preceding frame? The answer: Imagine a camera panning from left to right. New objects will enter the field of view from the right all the time. A block on the right side of the frame may thus contain objects that did not exist in the previous frame.

Suboptimal Search Methods: These methods search some, instead of all, the candidate blocks in the (b+2dx)(b+2dy) area. They speed up the search for a matching block, at the expense of compression efficiency. Several such methods are discussed in detail in Section 6.4.1.

Motion Vector Correction: Once a block C has been selected as the best match for B, a motion vector is calculated as the difference between the upper-left corner of C and that of B. Regardless of how the matching was determined, the motion vector may be wrong because of noise, local minima in the frame, or because the matching algorithm is not ideal. It is possible to apply smoothing techniques to the motion vectors after they have been calculated, in an attempt to improve the matching. Spatial correlations in the image suggest that the motion vectors should also be correlated. If certain vectors are found to violate this, they can be corrected. This step is costly and may even backfire. A video presentation may involve slow, smooth motion of most objects, but also swift, jerky motion of some small objects. Correcting motion vectors may interfere with the motion vectors of such objects and cause distortions in the compressed frames.

Coding Motion Vectors: A large part of the current frame (maybe close to half of it) may be converted to motion vectors, so the way these vectors are encoded is crucial; it must also be lossless. Two properties of motion vectors help in encoding them: (1) They are correlated and (2) their distribution is nonuniform. As we scan the frame block by block, adjacent blocks normally have motion vectors that don’t differ by much; they are correlated. The vectors also don’t point in all directions. There are usually one or two preferred directions in which all or most motion vectors point; the vectors are thus nonuniformly distributed. No single method has proved ideal for encoding the motion vectors. Arithmetic coding, adaptive Huffman coding, and various prefix codes have been tried, and all seem to perform well. Here are two different methods that may perform better:
1. Predict a motion vector based on its predecessors in the same row and its predecessors in the same column of the current frame. Calculate the difference between the prediction and the actual vector, and Huffman encode it. This method is important. It is used in MPEG and other compression methods.
2. Group the motion vectors in blocks. If all the vectors in a block are identical, the block is encoded by encoding this vector. Other blocks are encoded as in 1 above. Each encoded block starts with a code identifying its type.
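Method 1 can be illustrated with a small Python sketch. The predictor used here (the component-wise average of the left and upper neighbours, rounded down) is only an assumption chosen to keep the example short; MPEG and other codecs define their own predictors, and the Huffman coding of the residuals is left out. The names `predict`, `residuals`, and `reconstruct` are hypothetical.

```python
def predict(vectors, r, c):
    """Hypothetical predictor: average of the vectors to the left of and
    above position (r, c), or (0, 0) when neither neighbour exists."""
    neighbours = []
    if c > 0:
        neighbours.append(vectors[r][c - 1])
    if r > 0:
        neighbours.append(vectors[r - 1][c])
    if not neighbours:
        return (0, 0)
    n = len(neighbours)
    return (sum(v[0] for v in neighbours) // n,
            sum(v[1] for v in neighbours) // n)


def residuals(vectors):
    """Encoder side: replace each vector by (actual - prediction)."""
    out = []
    for r, row in enumerate(vectors):
        out_row = []
        for c, (vx, vy) in enumerate(row):
            px, py = predict(vectors, r, c)
            out_row.append((vx - px, vy - py))
        out.append(out_row)
    return out


def reconstruct(res):
    """Decoder side: rebuild the vectors in the same scan order."""
    vectors = []
    for r, row in enumerate(res):
        vectors.append([])
        for c, (ex, ey) in enumerate(row):
            px, py = predict(vectors, r, c)
            vectors[r].append((px + ex, py + ey))
    return vectors
```

Because neighbouring vectors are correlated, most residuals end up near (0, 0), which is exactly the kind of skewed distribution a Huffman or arithmetic coder handles well.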

Coding the Prediction Error: Motion compensation is lossy, since a block B is normally matched to a somewhat different block C. Compression can be improved by coding the difference between the current uncompressed and compressed frames on a block by block basis and only for blocks that differ much. This is usually done by transform coding. The difference is written on the output, following each frame, and is used by the decoder to improve the frame after it has been decoded.

**ftvfatboy** · 08-03-2006, 03:26 PM

Suboptimal Search Methods

--------------------------------------------------------------------------------

Video compression includes many steps and computations, so researchers have been looking for optimizations and faster algorithms, especially for steps that involve many calculations. One such step is the search for a block C in the previous frame to match a given block B in the current frame. An exhaustive search is time-consuming, so it pays to look for suboptimal search methods that search just some of the many overlapping candidate blocks. These methods do not always find the best match, but can generally speed up the entire compression process while incurring only a small loss of compression efficiency.

Signature-Based Methods: Such a method performs a number of steps, restricting the number of candidate blocks in each step. In the first step, all the candidate blocks are searched using a simple, fast distortion measure such as pel difference classification. Only the best matched blocks are included in the next step, where they are evaluated by a more restrictive distortion measure, or by the same measure but with a smaller parameter. A signature method may involve several steps, using different distortion measures in each.

Distance-Diluted Search: We know from experience that fast-moving objects look blurred in an animation, even if they are sharp in all the frames. This suggests a way to lose data. We may require a good block match for slow-moving objects, but allow for a worse match for fast-moving ones. The result is a block matching algorithm that searches all the blocks close to B, but fewer and fewer blocks as the search gets farther away from B. Figure 6.12a shows how such a method may work for maximum displacement parameters dx = dy = 6. The total number of blocks C being searched goes from (2dx + 1)·(2dy+1) = 13×13 = 169 to just 65, less than 39%!

Locality-Based Search: This method is based on the assumption that once a good match has been found, even better matches are likely to be located near it (remember that the blocks C searched for matches highly overlap). An obvious algorithm is to start searching for a match in a sparse set of blocks, then use the best-matched block C as the center of a second wave of searches, this time in a denser set of blocks. Figure 6.12b shows two waves of search. The first wave considers widely spaced blocks and selects one as the best match. The second wave searches every block in the vicinity of that best match.

Quadrant Monotonic Search: This is a variant of locality-based search. It starts with a sparse set of blocks C that are searched for a match. The distortion measure is computed for each of those blocks, and the result is a set of distortion values. The idea is that the distortion values increase as we move away from the best match. By examining the set of distortion values obtained in the first step, the second step may predict where the best match is likely to be found. Figure 6.13 shows how a search of a region of 4×3 blocks suggests a well-defined direction in which to continue searching. This method is less reliable than the previous ones since the direction proposed by the set of distortion values may lead to a local best block, whereas the best block may be located elsewhere.

Dependent Algorithms: As has been mentioned before, motion in a frame is the result of either camera movement or object movement. If we assume that objects in the frame are bigger than a block, we conclude that it is reasonable to expect the motion vectors of adjacent blocks to be correlated. The search algorithm can therefore start by estimating the motion vector of a block B from the motion vectors that have already been found for its neighbors, then improve this estimate by comparing B to some candidate blocks C. This is the basis of several dependent algorithms, which can be spatial or temporal.

More Quadrant Monotonic Search Methods: The following suboptimal block matching methods use the main assumption of the quadrant monotonic search method.

Two-Dimensional Logarithmic Search: This multistep method reduces the search area in each step until it shrinks to one block. We assume that the current block B is located at position (a, b) in the current frame. This position becomes the initial center of the search. The algorithm uses a distance parameter d that defines the search area. This parameter is user-controlled with a default value. The search area consists of the (2d + 1)×(2d + 1) blocks centered on the current block B.

Three-Step Search: This is somewhat similar to the two-dimensional logarithmic search. In each step it tests eight blocks, instead of four, around the center of search, then halves the step size. If s = 3 initially, the algorithm terminates after three steps, hence its name.
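A rough Python sketch of the three-step search is given below. Several details are assumptions of this example rather than part of the description above: the sum of absolute differences as distortion measure, rounding up when the step size is halved (so that s = 3 gives the steps 3, 2, 1), and skipping candidates that fall outside the frame. The names `sad` and `three_step_search` are made up.

```python
def sad(cur, prev, bx, by, cx, cy, b):
    """Sum of absolute differences between the b x b block of cur at
    (bx, by) and the b x b block of prev at (cx, cy)."""
    return sum(abs(cur[by + j][bx + i] - prev[cy + j][cx + i])
               for j in range(b) for i in range(b))


def three_step_search(cur, prev, bx, by, b, step=3):
    """Return (u, v, distortion) found by testing the eight blocks at
    distance step around the current center, moving the center to the
    best block, and halving the step until it reaches 1."""
    height, width = len(prev), len(prev[0])
    best_x, best_y = bx, by
    best_d = sad(cur, prev, bx, by, bx, by, b)
    while True:
        center_x, center_y = best_x, best_y
        for v in (-step, 0, step):
            for u in (-step, 0, step):
                if u == 0 and v == 0:
                    continue                 # center already evaluated
                cx, cy = center_x + u, center_y + v
                if cx < 0 or cy < 0 or cx + b > width or cy + b > height:
                    continue
                d = sad(cur, prev, bx, by, cx, cy, b)
                if d < best_d:
                    best_d, best_x, best_y = d, cx, cy
        if step == 1:
            break
        step = (step + 1) // 2               # assumed: halve, rounding up
    return best_x - bx, best_y - by, best_d
```

With step=3 this evaluates at most 1 + 3×8 = 25 blocks instead of the 169 of a full search with dx = dy = 6, at the risk of ending in a local minimum.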

Orthogonal Search: This is a variation of both the two-dimensional logarithmic search and the three-step search. Each step of the orthogonal search involves a horizontal and a vertical search. The step size s is initialized to (d + 1)/2, and the block at the center of the search and two candidate blocks located on either side of it at a distance of s are searched. The location of smallest distortion becomes the center of the vertical search, where two candidate blocks above and below the center, at distances of s, are searched. The best of these locations becomes the center of the next search. If the step size s is 1, the algorithm terminates and returns the best block found in the current step. Otherwise, s is halved, and a new, similar set of horizontal and vertical searches is performed.

One-at-a-Time Search: In this type of search there are again two steps, a horizontal and a vertical. The horizontal step searches all the blocks in the search area whose y coordinates equal that of block B (i.e., that are located on the same horizontal axis as B). Assuming that block H has the minimum distortion among those, the vertical step searches all the blocks on the same vertical axis as H and returns the best of them. A variation repeats this on smaller and smaller search areas.

Cross Search: All the steps of this algorithm, except the last one, search the five blocks at the edges of a multiplication sign “×”. The step size is halved in each step until it gets down to 1. At the last step, the plus sign “+” is used to search the areas located around the top-left and bottom-right corners of the preceding step.

This has been a survey of quadrant monotonic search methods. We follow with an outline of two advanced search methods.

Hierarchical Search Methods: Hierarchical methods take advantage of the fact that block matching is sensitive to the block size. A hierarchical search method starts with large blocks and uses their motion vectors as starting points for more searches with smaller blocks. Large blocks are less likely to stumble on a local maximum, while a small block generally produces a better motion vector. A hierarchical search method is thus computationally intensive, and the main point is to speed it up by reducing the number of operations. This can be done in several ways as follows:
1. In the initial steps, when the blocks are still large, search just a sample of blocks. The resulting motion vectors are not the best, but they are only going to be used as starting points for better ones.
2. When searching large blocks, skip some of the pixels of a block. The algorithm may, for example, use just one-quarter of the pixels of the large blocks, one half of the pixels of smaller blocks, and so on.
3. Select the block sizes such that the block used in step i is divided into several (typically four or nine) blocks used in the following step. This way a single motion vector calculated in step i can be used as an estimate for several better motion vectors in step i + 1.

Multidimensional Search Space Methods: These methods are more complex. When searching for a match for block B, such a method looks for matches that are rotations or zooms of B, not just translations. A multidimensional search space method may also find a block C that matches B but has different lighting conditions. This is useful when an object moves among areas that are illuminated differently. All the methods discussed so far compare two blocks by comparing the luminance values of corresponding pixels. Two blocks B and C that contain the same objects but differ in luminance would be declared different by such methods.

When a multidimensional search space method finds a block C that matches B but has different luminance, it may declare C the match of B and append a luminance value to the compressed frame B. This value (which may be negative) is added by the decoder to the pixels of the decompressed frame, to bring them back to their original values. A multidimensional search space method may also compare a block B to rotated versions of the candidate blocks C. This is useful if objects in the video presentation may be rotated in addition to being moved. The algorithm may also try to match a block B to a block C containing a scaled version of the objects in B. If, for example, B is of size 8×8 pixels, the algorithm may consider blocks C of size 12×12, shrink each to 8×8, and compare it to B. This kind of block search involves many extra operations and comparisons. We say that it increases the size of the search space significantly, hence the name multidimensional search space. It seems that at present there is no multidimensional search space method that can account for scaling, rotation, and changes in illumination and also be fast enough for practical use.
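The luminance-offset idea can be sketched as follows in Python. This is only one plausible way to do it: the offset is taken to be the rounded difference of the mean luminances of the two blocks, and the residual is the sum of absolute differences after the offset has been applied. Both choices, and the name `match_with_offset`, are assumptions of this example.

```python
def match_with_offset(block_b, block_c):
    """Compare blocks B and C while allowing a uniform luminance shift.

    Returns (offset, residual): offset is the value the decoder would
    add to every pixel of C, residual is the remaining sum of absolute
    differences between B and the shifted C.
    """
    count = sum(len(row) for row in block_b)
    total_b = sum(sum(row) for row in block_b)
    total_c = sum(sum(row) for row in block_c)
    offset = round((total_b - total_c) / count)   # may be negative
    residual = 0
    for row_b, row_c in zip(block_b, block_c):
        for pb, pc in zip(row_b, row_c):
            residual += abs(pb - (pc + offset))
    return offset, residual


if __name__ == "__main__":
    b = [[10, 20], [30, 40]]
    c = [[15, 25], [35, 45]]      # same content, uniformly brighter
    print(match_with_offset(b, c))
```

A plain pixel-by-pixel comparison would report a difference of 5 per pixel for these two blocks, while the offset version finds a perfect match at the cost of one extra number per block.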

**ftvfatboy** · 08-03-2006, 03:28 PM

MPEG

Started in 1988, the MPEG project was developed by a group of hundreds of experts under the auspices of the ISO (International Organization for Standardization) and the IEC (International Electrotechnical Commission). The name MPEG is an acronym for Moving Picture Experts Group. MPEG is a method for video compression, which involves the compression of digital images and sound, as well as synchronization of the two. There currently are several MPEG standards. MPEG-1 is intended for intermediate data rates, on the order of 1.5 Mbit/s. MPEG-2 is intended for high data rates of at least 10 Mbit/s. MPEG-3 was intended for HDTV compression but was found to be redundant and was merged with MPEG-2. MPEG-4 is intended for very low data rates of less than 64 Kbit/s. A third international body, the ITU-T, has been involved in the design of both MPEG-2 and MPEG-4. This section concentrates on MPEG-1 and discusses only its image compression features.

The formal name of MPEG-1 is the international standard for moving picture video compression, IS11172-2. Like other standards developed by the ITU and ISO, the document describing MPEG-1 has normative and informative sections. A normative section is part of the standard specification. It is intended for implementers, is written in a precise language, and should be strictly followed in implementing the standard on actual computer platforms. An informative section, on the other hand, illustrates concepts discussed elsewhere, explains the reasons that led to certain choices and decisions, and contains background material. An example of a normative section is the various tables of variable codes used in MPEG. An example of an informative section is the algorithm used by MPEG to estimate motion and match blocks. MPEG does not require any particular algorithm, and an MPEG encoder can use any method to match blocks. The section itself simply describes various alternatives.

The discussion of MPEG in this section is informal. The first subsection (main components) describes all the important terms, principles, and codes used in MPEG-1. The subsections that follow go into more details, especially in the description and listing of the various parameters and variable-size codes. The importance of a widely accepted standard for video compression is apparent from the fact that many manufacturers (of computer games, CD-ROM movies, digital television, and digital recorders, among others) implemented and started using MPEG-1 even before it was finally approved by the MPEG committee. This also was one reason why MPEG-1 had to be frozen at an early stage and MPEG-2 had to be developed to accommodate video applications with high data rates.

There are many sources of information on MPEG. [Mitchell et al. 97] is one detailed source for MPEG-1, and the MPEG consortium [MPEG 98] contains lists of other resources. In addition, there are many web pages with descriptions, explanations, and answers to frequently asked questions about MPEG.

To understand the meaning of the words “intermediate data rate” we consider a typical example of video with a resolution of 360×288, a depth of 24 bits per pixel, and a refresh rate of 24 frames per second. The image part of this video requires 360×288×24×24 = 59,719,680 bits/s. For the sound part, we assume two sound tracks (stereo sound), each sampled at 44 kHz with 16-bit samples. The data rate is 2×44,000×16 = 1,408,000 bits/s. The total is about 61.1 Mbit/s and this is supposed to be compressed by MPEG-1 to an intermediate data rate of about 1.5 Mbit/s (the size of the sound track alone), a compression factor of more than 40! Another aspect is the decoding speed. An MPEG-compressed movie may end up being stored on a CD-ROM or DVD and has to be decoded and played in real time.
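
The arithmetic in the data-rate example can be checked with a short script. The function names are illustrative only; the numbers are the ones quoted in the text.

```python
def raw_video_bitrate(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed video data rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


def raw_audio_bitrate(channels, sample_rate_hz, bits_per_sample):
    """Uncompressed audio data rate in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


video = raw_video_bitrate(360, 288, 24, 24)   # image part of the example
audio = raw_audio_bitrate(2, 44_000, 16)      # two tracks, 44 kHz, 16-bit samples
total = video + audio
target = 1_500_000                            # MPEG-1 intermediate data rate

print(f"video {video:,} bits/s, audio {audio:,} bits/s")
print(f"total {total / 1e6:.1f} Mbit/s, compression factor {total / target:.1f}")
```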

MPEG uses its own vocabulary. An entire movie is considered a video sequence. It consists of pictures, each having three components, one luminance (Y) and two chrominance (Cb and Cr). The luminance component (Section 4.1) contains the black-and-white picture, and the chrominance components provide the color hue and saturation (see [Salomon 99] for a detailed discussion). Each component is a rectangular array of samples, and each row of the array is called a raster line. A pel is the set of three samples. The eye is sensitive to small spatial variations of luminance, but is less sensitive to similar changes in chrominance. As a result, MPEG-1 samples the chrominance components at half the resolution of the luminance component. The term intra is used, but inter and nonintra are used interchangeably.

The input to an MPEG encoder is called the source data, and the output of an MPEG decoder is the reconstructed data. The source data is organized in packs (Figure 6.16b), where each pack starts with a start code (32 bits) followed by a header, ends with a 32-bit end code, and contains a number of packets in between. A packet contains compressed data, either audio or video. The size of a packet is determined by the MPEG encoder according to the requirements of the storage or transmission medium, which is why a packet is not necessarily a complete video picture. It can be any part of a video picture or any part of the audio. The MPEG decoder has three main parts, called layers, to decode the audio, the video, and the system data. The system layer reads and interprets the various codes and headers in the source data, and routes the packets to either the audio or the video layers (Figure 6.16a) to be buffered and later decoded. Each of these two layers consists of several decoders that work simultaneously.

**ftvfatboy** · 08-03-2006, 03:28 PM

MPEG-1 Main Components

MPEG uses I, P, and B pictures, as discussed in Section 6.4. They are arranged in groups, where a group can be open or closed. The pictures are arranged in a certain order, called the coding order, but are output, after decoding, and sent to the display in a different order, called the display order. In a closed group, P and B pictures are decoded only from other pictures in the group. In an open group, they can be decoded from pictures outside the group. Different regions of a B picture may use different pictures for their decoding. A region may be decoded from some preceding pictures, from some following pictures, from both types, or from none. Similarly, a region in a P picture may use several preceding pictures for its decoding, or use none at all, in which case it is decoded using MPEG’s intra methods.

The basic building block of an MPEG picture is the macroblock. It consists of a 16×16 block of luminance (grayscale) samples (divided into four 8×8 blocks) and two 8×8 blocks of the matching chrominance samples. The MPEG compression of a macroblock consists mainly in passing each of the six blocks through a discrete cosine transform, which creates decorrelated values, then quantizing and encoding the results. It is very similar to JPEG compression (Section 4.8), the main differences being that different quantization tables and different code tables are used in MPEG for intra and nonintra, and the rounding is done differently. A picture in MPEG is organized in slices, where each slice is a contiguous set of macroblocks (in raster order) that have the same grayscale (i.e., luminance component). The concept of slices makes sense because a picture may often contain large uniform areas, causing many contiguous macroblocks to have the same grayscale.

**ftvfatboy** · 08-03-2006, 03:30 PM

MPEG-1 Video Syntax

Some of the many parameters used by MPEG to specify and control the compression of a video sequence are described in this section in detail. Readers who are interested only in the general description of MPEG may skip this section. The concepts of video sequence, picture, slice, macroblock, and block have already been discussed. Figure 6.24 shows the format of the compressed MPEG stream and how it is organized in six layers. Optional parts are enclosed in dashed boxes. Notice that only the video sequence of the compressed stream is shown; the system parts are omitted.

The video sequence starts with a sequence header, followed by a group of pictures (GOP) and optionally by more GOPs. There may be other sequence headers followed by more GOPs, and the sequence ends with a sequence-end-code. The extra sequence headers may be included to help in random access playback or video editing, but most of the parameters in the extra sequence headers must remain unchanged from the first header. A group of pictures (GOP) starts with a GOP header, followed by one or more pictures. Each picture in a GOP starts with a picture header, followed by one or more slices. Each slice, in turn, consists of a slice header followed by one or more macroblocks of encoded, quantized DCT coefficients. A macroblock is a set of six 8×8 blocks, four blocks of luminance samples and two blocks of chrominance samples. Some blocks may be completely zero and may not be encoded. Each block is coded in intra or nonintra. An intra block starts with a difference between its DC coefficient and the previous DC coefficient (of the same type), followed by run-level codes for the nonzero AC coefficients and zero runs. The EOB code terminates the block. In a nonintra block, both DC and AC coefficients are run-level coded.

It should be mentioned that in addition to the I, P, and B picture types, there exists in MPEG a fourth type, a D picture (for DC coded). Such pictures contain only DC coefficient information; no run-level codes or EOB is included. However, D pictures are not allowed to be mixed with the other types of pictures, so they are rare and will not be discussed further. The headers of a sequence, GOP, picture, and slice all start with a byte-aligned 32-bit start code. In addition to these video start codes there are other start codes for the system layer, user data, and error tagging. A start code starts with 23 zero bits, followed by a single bit of 1, followed by a unique byte. Table 6.25 lists all the video start codes. The “sequence.error” code is for cases where the encoder discovers unrecoverable errors in a video sequence and cannot encode it as a result. The run-level codes have variable lengths, so some zero bits normally have to be appended to the video stream before a start code, to make sure the code starts on a byte boundary.
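
A start code is easy to locate because of its fixed prefix: 23 zero bits and a 1 bit are the three bytes 00 00 01. The sketch below scans a byte string for that prefix and classifies the byte that follows. It only knows the code values mentioned in this thread (B3, B8, 00, and the slice range 01 to AF); everything else is reported as "other". The function names are made up for the example.

```python
PREFIX = b"\x00\x00\x01"   # 23 zero bits followed by a single 1 bit


def classify(code_byte):
    """Name the start code identified by the byte after the prefix."""
    if code_byte == 0x00:
        return "picture"
    if 0x01 <= code_byte <= 0xAF:      # 1..175, the macroblock row of the slice
        return "slice"
    if code_byte == 0xB3:
        return "sequence_header"
    if code_byte == 0xB8:
        return "group"
    return "other"


def find_start_codes(data):
    """Return a list of (byte offset, kind) for every start code in data."""
    found = []
    pos = data.find(PREFIX)
    while pos != -1 and pos + 3 < len(data):
        found.append((pos, classify(data[pos + 3])))
        pos = data.find(PREFIX, pos + 4)   # continue after the code byte
    return found


stream = bytes.fromhex("000001b3 aabb 000001b8 cc 00000100 dd 00000101 ee")
for offset, kind in find_start_codes(stream):
    print(offset, kind)
```

Because start codes are byte aligned, a plain byte search is enough; no bit-level scanning is needed.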

Video Sequence Layer: This starts with start code 000001B3, followed by nine fixed-length data elements. The parameters horizontal_size and vertical_size are 12-bit parameters that define the width and height of the picture. Neither is allowed to be zero, and vertical_size must be even. Parameter pel_aspect_ratio is a 4-bit parameter that specifies the aspect ratio of a pel. Its 16 values are listed in Table 6.26. Parameter picture_rate is a 4-bit parameter that specifies one of 16 picture refresh rates.

GOP Layer: This layer starts with nine mandatory elements, optionally followed by extensions and user data, and by the (compressed) pictures themselves. The 32-bit group start code 000001B8 is followed by the 25-bit time_code, which consists of the following six data elements: drop_frame_flag (1 bit) is zero unless the picture rate is 29.97 Hz; time_code_hours (5 bits, in the range [0, 23]), data elements time_code_minutes (6 bits, in the range [0, 59]), and time_code_seconds (6 bits, in the same range) indicate the hours, minutes, and seconds in the time interval from the start of the sequence to the display of the first picture in the GOP. The 6-bit time_code_pictures parameter indicates the number of pictures in a second. There is a marker_bit between time_code_minutes and time_code_seconds. Following the time_code there are two 1-bit parameters. The flag closed_gop is set if the GOP is closed (i.e., its pictures can be decoded without reference to pictures from outside the group). The broken_link flag is set to 1 if editing has disrupted the original sequence of groups of pictures.
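
The 25-bit time_code layout described above (1 + 5 + 6 + 1 + 6 + 6 bits) can be illustrated with plain shifts and masks. This is a sketch that treats the time_code as an integer already extracted from the stream. The helper names are made up, and setting marker_bit to 1 is an assumption, since the text does not state its value.

```python
def pack_time_code(drop_frame_flag, hours, minutes, seconds, pictures):
    """Build a 25-bit time_code integer from its fields."""
    return ((drop_frame_flag << 24) | (hours << 19) | (minutes << 13)
            | (1 << 12)                 # marker_bit, assumed to be 1
            | (seconds << 6) | pictures)


def unpack_time_code(tc):
    """Split a 25-bit time_code integer into its six data elements."""
    return {
        "drop_frame_flag": (tc >> 24) & 0x1,      # 1 bit
        "time_code_hours": (tc >> 19) & 0x1F,     # 5 bits
        "time_code_minutes": (tc >> 13) & 0x3F,   # 6 bits
        "marker_bit": (tc >> 12) & 0x1,           # 1 bit
        "time_code_seconds": (tc >> 6) & 0x3F,    # 6 bits
        "time_code_pictures": tc & 0x3F,          # 6 bits
    }


tc = pack_time_code(0, 1, 23, 45, 12)
print(unpack_time_code(tc))
```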

Picture Layer: Parameters in this layer specify the type of the picture (I, P, B, or D) and the motion vectors for the picture. The layer starts with the 32-bit picture_start_code, whose hexadecimal value is 00000100. It is followed by a 10-bit temporal_reference parameter, which is the picture number (modulo 1024) in the sequence. The next parameter is the 3-bit picture_coding_type (Table 6.29), and this is followed by the 16-bit vbv_delay that tells the decoder how many bits must be in the compressed data buffer before the picture can be decoded. This parameter helps prevent buffer overflow and underflow. If the picture type is P or B, then this is followed by the forward motion vectors scale information, a 3-bit parameter called forward_f_code (see Table 6.34). For B pictures, there follows the backward motion vectors scale information, a 3-bit parameter called backward_f_code.
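
Reading these fixed-length fields needs nothing more than a bit reader. The following is a rough sketch, not a full parser: the BitReader class and the function name are made up, the mapping of picture_coding_type values to I, P, B, and D is an assumption about Table 6.29, and the 1-bit full-pel flags placed before each f_code are assumed from the motion compensation discussion later in the thread (the backward flag is not mentioned in the text at all).

```python
class BitReader:
    """Reads bit fields from a bytes object, most significant bit first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position, in bits

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value


PICTURE_TYPES = {1: "I", 2: "P", 3: "B", 4: "D"}  # assumed contents of Table 6.29


def parse_picture_header(data):
    """Parse the leading fields of a picture header into a dict."""
    r = BitReader(data)
    if r.read(32) != 0x00000100:
        raise ValueError("data does not begin with picture_start_code")
    header = {
        "temporal_reference": r.read(10),
        "picture_coding_type": PICTURE_TYPES.get(r.read(3), "reserved"),
        "vbv_delay": r.read(16),
    }
    if header["picture_coding_type"] in ("P", "B"):
        header["full_pel_forward_vector"] = r.read(1)
        header["forward_f_code"] = r.read(3)
    if header["picture_coding_type"] == "B":
        header["full_pel_backward_vector"] = r.read(1)  # assumed to mirror the forward flag
        header["backward_f_code"] = r.read(3)
    return header


# Hand-built P-picture header: temporal_reference 5, type P, vbv_delay 0xFFFF,
# half-pel vectors, forward_f_code 3, then seven padding bits.
bits = "0000000101" + "010" + "1" * 16 + "0" + "011" + "0" * 7
example = bytes.fromhex("00000100") + int(bits, 2).to_bytes(5, "big")
print(parse_picture_header(example))
```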

Slice Layer: There can be many slices in a picture, so the start code of a slice ends with a value in the range [1, 175]. This value defines the macroblock row where the slice starts (a picture can therefore have up to 175 rows of macroblocks). The horizontal position where the slice starts in that macroblock row is determined by other parameters. The quantizer_scale (5 bits) initializes the quantizer scale factor, discussed earlier in connection with the rounding of the quantized DCT coefficients. The extra_bit_slice flag following it is always 0 (the value of 1 is reserved for future ISO standards). Following this, the encoded macroblocks are written.

Macroblock Layer: This layer identifies the position of the macroblock relative to the position of the previous macroblock. It codes the motion vectors for the macroblock, and identifies the zero and nonzero blocks in the macroblock. Each macroblock has an address, or index, in the picture. Index values start at 0 in the upper-left corner of the picture and continue in raster order. When the encoder starts encoding a new picture, it sets the macroblock address to -1. The macroblock_address_increment parameter contains the amount needed to increment the macroblock address in order to reach the macroblock being coded. This parameter is normally 1. If macroblock_address_increment is greater than 33, it is encoded as a sequence of macroblock_escape codes, each incrementing the macroblock address by 33.
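
The escape mechanism for large address increments can be sketched as follows. The sketch works with symbols rather than the actual variable-size codes, and it assumes that the escapes are followed by one ordinary increment in the range 1 to 33; the function names are illustrative.

```python
ESCAPE = "macroblock_escape"   # each escape stands for an increment of 33


def encode_address_increment(increment):
    """Split an increment into escape symbols plus a final value in 1..33."""
    if increment < 1:
        raise ValueError("increment must be at least 1")
    escapes, remainder = divmod(increment - 1, 33)
    return [ESCAPE] * escapes + [remainder + 1]


def decode_address_increment(symbols):
    """Sum the escapes (33 each) and the final increment."""
    return sum(33 if s == ESCAPE else s for s in symbols)


for n in (1, 33, 34, 70):
    print(n, encode_address_increment(n))
```

Since the macroblock address is set to -1 at the start of a picture, an increment of 1 for the first coded macroblock yields address 0, the upper-left corner.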

Block Layer: This layer is the lowest in the video sequence. It contains the encoded 8×8 blocks of quantized DCT coefficients. The coding depends on whether the block contains luminance or chrominance samples and on whether the macroblock is intra or nonintra. In nonintra coding, blocks that are completely zero are skipped; they don’t have to be encoded. The macroblock_intra flag gets its value from macroblock_type. If it is set, the DC coefficient of the block is coded separately from the AC coefficients.

**ftvfatboy** · 08-03-2006, 03:31 PM

Motion Compensation

An important element of MPEG is motion compensation, which is used in inter coding only. In this mode, the pels of the current picture are predicted by those of a previous reference picture (and, possibly, by those of a future reference picture). Pels are subtracted, and the differences (which should be small numbers) are DCT transformed, quantized, and encoded. The differences between the current picture and the reference one are normally caused by motion (either camera motion or scene motion), so best prediction is obtained by matching a region in the current picture with a different region in the reference picture. MPEG does not require the use of any particular matching algorithm, and any implementation can use its own method for matching macroblocks (see Section 6.4 for examples of matching algorithms). The discussion here concentrates on the operations of the decoder.

Differences between consecutive pictures may also be caused by random noise in the video camera, or by variations of illumination, which may change brightness in a nonuniform way. In such cases, motion compensation is not used, and each region ends up being matched with the same spatial region in the reference picture. If the difference between consecutive pictures is caused by camera motion, one motion vector is enough for the entire picture. Normally, however, there is also scene motion and movement of shadows, so a number of motion vectors are needed, to describe the motion of different regions in the picture. The size of those regions is critical. A large number of small regions improves prediction accuracy, whereas the opposite situation simplifies the algorithms used to find matching regions and also leads to fewer motion vectors and sometimes to better compression. Since a macroblock is such an important unit in MPEG, it was also selected as the elementary region for motion compensation.

Another important consideration is the precision of the motion vectors. A motion vector such as (15,-4) for a macroblock M typically means that M has been moved from the reference picture to the current picture by displacing it 15 pels to the right and 4 pels up (a positive vertical displacement is down). The components of the vector are in units of pels. They may, however, be in units of half a pel, or even smaller. In MPEG-1, the precision of motion vectors may be either full-pel or half-pel, and the encoder signals this decision to the decoder by a parameter in the picture header (this parameter may be different from picture to picture). It often happens that large areas of a picture move at identical or at similar speeds, and this implies that the motion vectors of adjacent macroblocks are correlated. This is the reason why the MPEG encoder encodes a motion vector by subtracting the motion vector of the preceding macroblock from it and encoding the difference.

A P picture uses an earlier I picture or P picture as a reference picture. We say that P pictures use forward motion-compensated prediction. When a motion vector MD for a macroblock is determined (MD stands for motion displacement, since the vector consists of two components, the horizontal and the vertical displacements), MPEG denotes the motion vector of the preceding macroblock in the slice by PMD and calculates the difference dMD = MD − PMD. PMD is reset to zero at the start of a slice, after a macroblock is intra coded, when the macroblock is skipped, and when parameter block_motion_forward is zero. The 1-bit parameter full_pel_forward_vector in the picture header defines the precision of the motion vectors (1=full-pel, 0=half-pel). The 3-bit parameter forward_f_code defines the range.
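
The differential coding of motion vectors within a slice can be sketched as below. The list representation is an assumption made for the example: each entry is an (x, y) vector, and None stands for any macroblock that resets PMD (intra coded, skipped, or with block_motion_forward equal to zero). The range limits set by forward_f_code are not modeled.

```python
def differential_vectors(vectors):
    """Return dMD = MD - PMD for each macroblock of one slice.

    vectors: list of (x, y) tuples, or None for a macroblock that resets PMD.
    """
    pmd = (0, 0)                      # PMD is zero at the start of a slice
    result = []
    for md in vectors:
        if md is None:
            pmd = (0, 0)              # reset, nothing to code for this macroblock
            result.append(None)
            continue
        result.append((md[0] - pmd[0], md[1] - pmd[1]))
        pmd = md                      # this vector predicts the next one
    return result


slice_vectors = [(15, -4), (16, -4), None, (3, 2)]
print(differential_vectors(slice_vectors))
```

When neighboring macroblocks move together, most differences are small, which is what makes them cheap to encode.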

**ftvfatboy** · 08-03-2006, 03:31 PM

Pel Reconstruction

The main task of the MPEG decoder is to reconstruct the pels of the entire video sequence. This is done by reading the codes of a block from the compressed stream, decoding them, dequantizing them, and calculating the IDCT. For nonintra blocks in P and B pictures, the decoder has to add the motion-compensated prediction to the results of the IDCT. This is repeated six times (or fewer, if some blocks are completely zero) for the six blocks of a macroblock. The entire sequence is decoded picture by picture, and within each picture, macroblock by macroblock. It has already been mentioned that the IDCT is not rigidly defined in MPEG, which may lead to accumulation of errors, called IDCT mismatch, during decoding.

For intra-coded blocks, the decoder reads the differential code of the DC coefficient and uses the decoded value of the previous DC coefficient (of the same type) to decode the DC coefficient of the current block. It then reads the run-level codes until an EOB code is encountered, and decodes them, generating a sequence of 63 AC coefficients, normally with few nonzero coefficients and runs of zeros between them. The DC and 63 AC coefficients are then collected in zigzag order to create an 8×8 block. After dequantization and inverse DCT calculation, the resulting block becomes one of the six blocks that make up a macroblock (in intra coding all six blocks are always coded, even those that are completely zero).

For nonintra blocks, there is no distinction between DC and AC coefficients and between luminance and chrominance blocks. They are all decoded in the same way.
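
The intra-block steps described above (DC difference, run-level pairs, zigzag placement) can be sketched as follows. The sketch starts from already decoded symbols, so the variable-size codes are not modeled, and dequantization and the IDCT are left out. The function names are made up, and the zigzag scan is generated on the assumption that it is the same diagonal scan used by JPEG.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):                      # s = row + col, one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # even diagonals run from bottom-left to top-right, odd ones the other way
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append((r, s - r))
    return order


def decode_intra_block(dc_diff, prev_dc, run_levels):
    """Rebuild an 8x8 block of quantized coefficients.

    dc_diff:    decoded DC difference for this block
    prev_dc:    DC coefficient of the previous block of the same type
    run_levels: (run, level) pairs read before the EOB code
    Returns (block, dc) so the caller can pass dc on as the next prev_dc.
    """
    dc = prev_dc + dc_diff
    coeffs = [dc]
    for run, level in run_levels:
        coeffs.extend([0] * run)                    # a run of zero AC coefficients
        coeffs.append(level)                        # followed by one nonzero level
    if len(coeffs) > 64:
        raise ValueError("too many coefficients for an 8x8 block")
    coeffs.extend([0] * (64 - len(coeffs)))         # EOB: the rest are zero
    block = [[0] * 8 for _ in range(8)]
    for value, (r, c) in zip(coeffs, zigzag_order()):
        block[r][c] = value
    return block, dc


block, dc = decode_intra_block(5, 100, [(0, 7), (2, -3)])
for row in block[:3]:
    print(row)
```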

**ftvfatboy** · 08-03-2006, 03:32 PM

MPEG-4

MPEG-4 is a new standard for audiovisual data. Although video and audio compression is still a central feature of MPEG-4, this standard includes much more than just compression of the data. As a result, MPEG-4 is huge and this section can only describe its main features. No details are provided. We start with a bit of history. The MPEG-4 project started in May 1991 and initially aimed to find ways to compress multimedia data to very low bitrates with minimal distortions. In July 1994, this goal was significantly altered in response to developments in audiovisual technologies. The MPEG-4 committee started thinking of future developments and tried to guess what features should be included in MPEG-4 to meet them. A call for proposals was issued in July 1995 and responses were received by October of that year. (The proposals were supposed to address the eight major functionalities of MPEG-4, listed below.) Tests of the proposals were conducted starting in late 1995. In January 1996, the first verification model was defined, and the cycle of calls for proposals, proposal implementation, and verification was repeated several times in 1997 and 1998. Many proposals were accepted for the many facets of MPEG-4, and the first version of MPEG-4 was accepted and approved in late 1998. The formal description was published in 1999 with many amendments that keep coming out.

At present (mid-2003), the MPEG-4 standard is designated the ISO/IEC 14496 standard, and its formal description, which is available from [ISO 03], consists of 10 parts, plus new amendments. More readable descriptions can be found in [Pereira and Ebrahimi 02] and [Symes 03].

MPEG-1 was originally developed as a compression standard for interactive video on CDs and for digital audio broadcasting. It turned out to be a technological triumph but a visionary failure. On one hand, not a single design mistake was found during the implementation of this complex algorithm and it worked as expected. On the other hand, interactive CDs and digital audio broadcasting have had little commercial success, so MPEG-1 is used today for general video compression. One aspect of MPEG-1 that was supposed to be minor, namely MP3, has grown out of proportion and is commonly used today for audio. MPEG-2, on the other hand, was specifically designed for digital television and this product has had tremendous commercial success. The lessons learned from MPEG-1 and MPEG-2 were not lost on the MPEG committee members and helped shape their thinking for MPEG-4.

The MPEG-4 project started as a standard for video compression at very low bitrates. It was supposed to deliver reasonable video data in only a few thousand bits per second. Such compression is important for video telephones or for receiving video in a small, handheld device, especially in a mobile environment, such as a moving car. After working on this project for two years, the committee members, realizing that the rapid development of multimedia applications and services would require more and more compression standards, revised their approach. Instead of a compression standard, they decided to develop a set of tools (a toolbox) to deal with audiovisual products in general, today and in the future. They hoped that such a set would encourage industry to invest in new ideas, technologies, and products in confidence, while making it possible for consumers to generate, distribute, and receive different types of multimedia data with ease and at a reasonable cost.

Traditionally, methods for compressing video have been based on pixels. Each video frame is a rectangular set of pixels and the algorithm looks for correlations between pixels in a frame and between frames. The compression paradigm adopted for MPEG-4, on the other hand, is based on objects. (The name of the MPEG-4 project was also changed at this point to “coding of audiovisual objects.”) In addition to producing a movie in the traditional way with a camera or with the help of computer animation, an individual generating a piece of audiovisual data may start by defining objects, such as a flower, a face, or a vehicle, then describing how each object should be moved and manipulated in successive frames. A flower may open slowly, a face may turn, smile, and fade, a vehicle may move toward the viewer and become bigger. MPEG-4 includes an object description language that provides for a compact description of both objects and their movements and interactions.

Another important feature of MPEG-4 is interoperability. This term refers to the ability to exchange any type of data, be it text, graphics, video, or audio. Obviously, interoperability is possible only in the presence of standards. All devices that produce data, deliver it, and consume (play, display, or print) it must obey the same rules and read and write the same file structures. During its important July 1994 meeting, the MPEG-4 committee decided to revise its original goal and also started thinking of future developments in the audiovisual field and of features that should be included in MPEG-4 to meet them. They came up with eight points that they considered important functionalities for MPEG-4.

1. Content-based multimedia access tools. The MPEG-4 standard should provide tools for accessing and organizing audiovisual data. Such tools may include indexing, linking, querying, browsing, delivering files, and deleting them. The main tools currently in existence are listed later in this section.

2. Content-based manipulation and bitstream editing. A syntax and a coding scheme should be part of MPEG-4 to enable users to manipulate and edit compressed files (bitstreams) without fully decompressing them. A user should be able to select an object and modify it in the compressed file without decompressing the entire file.

3. Hybrid natural and synthetic data coding. A natural scene is normally produced by a video camera. A synthetic scene consists of text and graphics. MPEG-4 needs tools to compress natural and synthetic scenes and mix them interactively.

4. Improved temporal random access. Users may want to access part of the compressed file, so the MPEG-4 standard should include tags to make it easy to reach any point in the file. This may be important when the file is stored in a central location and the user is trying to manipulate it remotely, over a slow communications channel.

5. Improved coding efficiency. This feature simply means improved compression. Imagine a case where audiovisual data has to be transmitted over a low-bandwidth channel (such as a telephone line) and stored in a low-capacity device such as a smartcard. This is possible only if the data is well compressed, and high compression rates (or equivalently, low bitrates) normally involve a trade-off in the form of reduced image size, reduced resolution (pixels per inch), and reduced quality.

6. Coding of multiple concurrent data streams. It seems that future audiovisual applications will allow the user not just to watch and listen but also to interact with the image. As a result, the MPEG-4 compressed stream can include several views of the same scene, enabling the user to select any of them to watch and to change views at will. The point is that the different views may be similar, so any redundancy should be eliminated by means of efficient compression that takes into account identical patterns in the various views. The same is true for the audio part (the soundtracks).

7. Robustness in error-prone environments. MPEG-4 must provide error-correcting codes for cases where audiovisual data is transmitted through a noisy channel. This is especially important for low-bitrate streams, where even the smallest error may be noticeable and may propagate and affect large parts of the audiovisual presentation.

8. Content-based scalability. The compressed stream may include audiovisual data in fine resolution and high quality, but any MPEG-4 decoder should be able to decode it at low resolution and low quality. This feature is useful in cases where the data is decoded and displayed on a small, low-resolution screen, or in cases where the user is in a hurry and prefers to see a rough image rather than wait for a full decoding.

Once the above eight fundamental functionalities had been identified and listed, the MPEG-4 committee started the process of developing separate tools to satisfy these functionalities. This is an ongoing process that continues to this day and will continue in the future. An MPEG-4 author faced with an application has to identify the requirements of the application and select the right tools. It is now clear that compression is a central requirement in MPEG-4, but not the only requirement, as it was for MPEG-1 and MPEG-2.
