A Saturday night in front of the TV probably doesn’t make you think of video streaming standards any more than watching a bridge open automatically or getting caught on a traffic camera for running a red light does. But video streaming is what makes it possible to safely open bridges from a control center and secure the streets, just as much as it is what lets you put your feet up and forget the world for a while with your favorite film.
Since 1995, the video streaming standard of choice for TV broadcasting and DVD video has been MPEG‑2. Its successor, MPEG-4 part 2, expanded the possibilities of MPEG-2 in 1998, creating a streaming standard that has largely been adopted by the computer industry.
But the buzzword in the world of video streaming these days is H.264, a.k.a. MPEG-4 part 10. Everybody developing or distributing codecs either already supports it or will very soon. But just what exactly is H.264 and what is so special about it?
Inventing a new video streaming standard
Simply said, H.264 is a video compression standard. For more than a hundred years, the International Organization for Standardization (ISO), the International Electrotechnical Commission (IEC), and the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) have created international standards for a vast array of new technologies. In recent decades, these organizations have worked together to define the basic criteria for streaming technologies, making it possible to compress and transmit media globally.
H.264 is a product of this cooperation. Members of ISO and IEC formed a workgroup in May of 1988 called the Moving Picture Experts Group (MPEG), which is known for the MPEG-2 and MPEG-4 part 2 video and audio compression and transmission standards, published in the 1990s. In 2001, MPEG and a subgroup of ITU-T, the Video Coding Experts Group (VCEG), founded a new workgroup called the Joint Video Team (JVT). Basing their work on the MPEG‑2/4 standards, JVT created the H.264 video compression standard, first published in 2003.
A supple standard
The H.264 standard sets the requirements for formatting compressed video so as to provide improved video quality at lower bit rates than preceding standards. However, it doesn’t actually specify how codecs should go about encoding video streams. It only defines how decoders should function and the tools and mechanisms that may be used, giving this standard unparalleled flexibility and allowing developers to contend for the most efficient encoding.
Still, H.264 isn’t all that different from its predecessors. As with MPEG-2/4, developers have to select a particular profile defined for specific uses. In the case of H.264, there are three: Baseline, Main, and Extended. The Baseline Profile is optimized for videotelephony, videoconferencing, wireless communications, and CCTV installations since it is less demanding on the decoder. The Main Profile works better in television broadcasting and video storage, while more exacting applications that are less concerned with processing power requirements can make use of the Extended Profile.
Like MPEG-2/4 before it, H.264 also uses block-based encoding. This means that H.264 employs motion estimation, transform, quantization, and entropy encoding to compress video, and it reverses these processes to decode image data for viewing.
Adapting raster block sizes to the detail at hand
H.264 distinguishes itself from the other MPEG standards mainly during motion estimation and its two components, motion compensation and motion vectors. Motion estimation is the process by which image information is assessed for similarities that can be reused in subsequent frames. This ultimately reduces the amount of data that is encoded and therefore reduces the bit rate.
Initially, an H.264 or MPEG-2/4 encoder receives either frames (progressive video) or fields (interlaced video) from a camera or other video source. At the start of motion estimation, these images are divided into a raster of macroblocks that are organized into arbitrarily shaped slices.
The raster is one aspect that sets H.264 apart. While MPEG-2/4 separate input frames or fields into a fixed raster of blocks containing 8×8 pixels, H.264 allows block sizes to vary. An H.264 encoder’s raster can therefore include block sizes of 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, or 4×4 pixels. So, less detailed areas, such as a clear blue sky, may use a 16×16 block while more detailed areas, such as the edges of moving vehicles, will probably use the smaller, 4×4 block size.
Adjusting the block size as necessary not only makes H.264 encoding more efficient but also improves the perceived quality of the image. Fixed gridlines are more jarring to the eye than irregular block patterns, so H.264 video also tends to look noticeably better to the viewer.
Deciding which block size to use and where is not something that is defined in the H.264 standard. This allows engineers to creatively compete for the most accurate and efficient motion estimation process. Consequently, a proficient motion estimation process can be what either makes or breaks an H.264 encoder.
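Because the standard leaves this decision open, any example can only show one possible strategy. The sketch below is a minimal illustration, not a method taken from the standard or from any particular encoder: it splits a 16×16 macroblock into smaller squares wherever the pixel variance exceeds a threshold. The function name `choose_partitions`, the variance criterion, and the threshold value are all assumptions made for illustration, and the rectangular sizes (16×8, 8×4, and so on) are left out for brevity.

```python
from statistics import pvariance


def choose_partitions(block, x=0, y=0, size=16, threshold=25.0, min_size=4):
    """Recursively split a square block until each piece is flat enough.

    Returns a list of (x, y, size) tuples. The variance threshold is a
    hypothetical tuning knob, not something the H.264 standard defines.
    """
    pixels = [block[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size <= min_size or pvariance(pixels) <= threshold:
        return [(x, y, size)]
    half = size // 2
    parts = []
    for dy in (0, half):
        for dx in (0, half):
            parts += choose_partitions(block, x + dx, y + dy, half, threshold, min_size)
    return parts


# A flat "clear sky" macroblock, and a copy with a busy top-left corner.
sky = [[120] * 16 for _ in range(16)]
edge = [row[:] for row in sky]
for r in range(4):
    for c in range(4):
        edge[r][c] = 255 if (r + c) % 2 else 0

print(choose_partitions(sky))   # one 16x16 partition
print(choose_partitions(edge))  # small blocks only where the detail is
```

The flat block stays whole, while the detailed one is split only in the quadrant that contains the detail. A real encoder would weigh the bit cost of each choice rather than rely on variance alone.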
Reusing means reducing the bit rate
Another aspect of motion estimation is the process of motion compensation, during which the difference or change between the macroblocks is calculated. Each slice is examined in raster order with either intra- or inter-prediction.
Intra-prediction is when I blocks in I slices are assessed according to the image data found within the current slice. When P blocks in P slices are examined, image data found in the current and previous slices is referenced in an inter-prediction scan.
The discrepancy between image data determined during motion compensation is used to produce a block containing residual information. Only this residual block is encoded, while image data from previous frames is reused. As a result, only the dissimilarities between blocks are encoded and redundant aspects of images are recycled, thus reducing the bit rate.
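The idea of reusing earlier image data and encoding only the residual can be sketched with a simple block-matching search. This is an illustrative sketch under stated assumptions rather than the procedure of any real encoder: it uses one reference frame, an exhaustive search over a small window, and the sum of absolute differences (SAD) as the matching cost. The names `sad`, `crop`, and `best_match` are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def crop(frame, x, y, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[x:x + size] for row in frame[y:y + size]]


def best_match(current, reference, x, y, size=4, search=2):
    """Search a small window in the reference frame around (x, y).

    Returns ((dx, dy), residual): the offset of the best-matching block
    and the pixel-wise difference that would still have to be encoded.
    """
    target = crop(current, x, y, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(target, crop(reference, rx, ry, size))
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    _, dx, dy = best
    predicted = crop(reference, x + dx, y + dy, size)
    residual = [[t - p for t, p in zip(rt, rp)] for rt, rp in zip(target, predicted)]
    return (dx, dy), residual


# The current frame is the reference frame shifted one pixel to the right.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
current = [[reference[r][max(c - 1, 0)] for c in range(8)] for r in range(8)]

vector, residual = best_match(current, reference, x=2, y=2)
print(vector, residual)
```

When the search finds the shifted block, the residual is all zeros, so almost nothing remains to be encoded for that block.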
Whereas MPEG-2/4 consult just one reference frame, H.264 can check a number of previously encoded frames. This gives H.264 the potential to reuse even more image data than the preceding MPEG standards and, as a result, to reduce the bit rate even further, although this also increases the necessary processing power.
Down to a quarter pixel
The direction in which reused pixels should shift in the following frame or field (either vertically or horizontally) is identified through motion vectors. Motion vectors indicate how best to situate data and are therefore a crucial factor in effectively reusing image information.
MPEG-2 and MPEG-4 SP (Simple Profile) generate motion vectors using half-pixel resolution. This means that half pixel increments are used to accurately rearrange data. H.264 goes a step further, subdividing macroblocks and creating motion vectors that can reposition image data with the precision of a quarter of a pixel.
This precision further reduces the amount of data that needs to be encoded, but it also increases the number of candidate pixel positions 16-fold compared with whole-pixel resolution (four times as many in each direction). As a result, H.264 encoders only average or interpolate motion vectors at this precision for areas where there is a lot of motion or the data is most detailed.
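The 16-fold figure follows from simple counting: a quarter-pixel grid has four positions per pixel in each direction, so 4 × 4 = 16 positions for every whole-pixel position. The sketch below counts the candidate offsets at each resolution and samples a value at a fractional position. Bilinear interpolation is used here as a simple stand-in. H.264 itself specifies a longer filter for half-pixel luma positions, so this is not the standard's interpolation. The names `candidate_offsets` and `sample_bilinear` are hypothetical.

```python
import math


def candidate_offsets(search_range, steps_per_pixel):
    """All (dx, dy) offsets in [-range, range) at the given resolution."""
    n = search_range * steps_per_pixel
    return [(i / steps_per_pixel, j / steps_per_pixel)
            for j in range(-n, n) for i in range(-n, n)]


def sample_bilinear(frame, x, y):
    """Brightness at a fractional position, by bilinear interpolation."""
    x0, y0 = math.floor(x), math.floor(y)
    x1 = min(x0 + 1, len(frame[0]) - 1)
    y1 = min(y0 + 1, len(frame) - 1)
    fx, fy = x - x0, y - y0
    top = frame[y0][x0] * (1 - fx) + frame[y0][x1] * fx
    bottom = frame[y1][x0] * (1 - fx) + frame[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


full = len(candidate_offsets(2, 1))     # whole-pixel steps
half = len(candidate_offsets(2, 2))     # half-pixel steps
quarter = len(candidate_offsets(2, 4))  # quarter-pixel steps
print(full, half, quarter, quarter // full)

frame = [[0, 100], [100, 200]]
print(sample_bilinear(frame, 0.25, 0))  # a quarter of the way from 0 to 100
```

The count makes clear why an exhaustive quarter-pixel search everywhere is costly, and why encoders tend to refine to sub-pixel precision only where it pays off.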
Estimating encoder excellence
Motion compensation and the creation of motion vectors take place concurrently during motion estimation. Together they select the best block size, calculate the difference, and generate motion vectors with quarter-pixel precision, reducing the residual difference between image frames and ultimately making H.264 encoders extremely efficient.
Reducing the bit rate comes at a cost, however. It results in increased computational complexity and, therefore, higher processing power requirements. Engineers have to carefully implement statistical mechanisms to analyze the data flow and determine the most efficient way of using the tools and enhancements made possible with H.264. Therefore, an encoder’s quality can be judged by how well it implements motion estimation.
Describing data in the transformation matrix
In contrast to the motion estimation step, the transform phase in the encoding process is relatively similar in H.264 and the MPEG standards. At this point in encoding, all the residual data collected during motion estimation is described using the Discrete Cosine Transform (DCT).
Initially, information from each residual block is depicted as one 16×16 pixel brightness (luma) block and two 8×8 pixel color (chroma) blocks. These image data blocks are analyzed and replaced by a DCT pattern with corresponding coefficients that precisely represent the original information. This transform process results in a matrix of coefficients reflecting the amount of data to be encoded. So, the fewer matrix values there are and the lower their values, the less residual image data there is. Accordingly, this results in a better image with fewer bits.
How transform works is easy to understand if you think about it in terms of modern art. It’s quite easy to describe a new painting that is just a canvas covered in solid blue paint. It is, however, much more difficult to tell someone in detail about the intricacies of a Jackson Pollock painting.
Similarly, DCT coefficients ideally describe a solid grey block, or a block with little or no residual data. The more coefficients that need to be used, the more residual details there are.
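The contrast between the solid canvas and the Pollock can be made concrete with a small sketch. It applies the textbook floating-point two-dimensional DCT to a flat 4×4 block and to a detailed one, then counts the coefficients that are not effectively zero. This is an illustration only: the 4×4 size, the sample pixel values, and the names `dct_2d` and `significant` are assumptions, and real H.264 encoders use an integer approximation of the DCT rather than this floating-point form.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def significant(coeffs, eps=1e-6):
    """Number of coefficients that are not effectively zero."""
    return sum(1 for row in coeffs for c in row if abs(c) > eps)


flat = [[10] * 4 for _ in range(4)]   # the solid-colour canvas
busy = [[52, 55, 61, 66],             # the detailed painting
        [70, 61, 64, 73],
        [63, 59, 55, 90],
        [67, 61, 68, 104]]

print(significant(dct_2d(flat)), significant(dct_2d(busy)))
```

The flat block needs a single coefficient (its average brightness), while the detailed block spreads its information across many coefficients.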
Quantization and the Q-value: Controlling the bit rate
H.264 quantization also doesn’t differ all that much from MPEG-2/4 quantization. This step in the encoding process consists of first dividing the transform coefficients by a dynamic Q value, used to manage the size of the bit stream, and then discarding trivial coefficients in a specified value range by reducing them to zero.
The transform coefficients are initially divided by the Q value, which varies depending on how large the bit rate is allowed to be. A higher Q value results in lower coefficients and fewer bits, but it also diminishes the quality of the image.
In scalar quantization, values within a predetermined range around zero are deemed inconsequential and are therefore reduced to zero. This lowers the bit rate without necessarily impacting the perceived quality of the image.
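Both steps, dividing by the Q value and zeroing the values near zero, fit in a few lines. The sketch below is a simplified illustration with made-up coefficient values. The function name `quantize`, the `dead_zone` parameter, and the rounding rule are assumptions, and the exact quantizer design of a real H.264 encoder differs.

```python
import math


def quantize(coeffs, q, dead_zone=1.0):
    """Divide each coefficient by Q, then zero everything near zero.

    Values whose scaled magnitude is below `dead_zone` become 0; the
    rest are rounded to the nearest integer level.
    """
    def level(c):
        scaled = c / q
        if abs(scaled) < dead_zone:
            return 0
        return int(math.copysign(math.floor(abs(scaled) + 0.5), scaled))

    return [[level(c) for c in row] for row in coeffs]


def nonzero(levels):
    """Count the levels that still have to be transmitted."""
    return sum(1 for row in levels for v in row if v != 0)


# Hypothetical transform coefficients for one 4x4 residual block.
coeffs = [[240.0, -31.0, 6.0, 0.5],
          [18.0, -9.0, 2.0, 0.0],
          [5.0, -3.0, 1.0, 0.0],
          [1.0, 0.0, 0.0, 0.0]]

fine = quantize(coeffs, q=8)     # low Q: more levels survive
coarse = quantize(coeffs, q=32)  # high Q: fewer bits, lower quality
print(nonzero(fine), nonzero(coarse))
```

Raising Q from 8 to 32 leaves only the largest coefficient, which is how the Q value trades image quality for bit rate.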
Both the transform and quantization stages depend on an adept motion estimation process. The advancements H.264 encoding makes in motion estimation are what, in the end, lower the residual image data and allow high quality images to be transformed and quantized to coefficients nearing zero. Therefore, improvements in motion estimation are what ultimately allow better video quality at a lower bit rate. However, H.264 encoding includes additional developments that augment the effectiveness of this streaming standard.
Recognizing and reducing repetition
The next step in block-based encoding is entropy encoding. At this point, data is prepared for transmission in such a way that it can be reconstructed in its entirety by the decoder. This is also known as lossless encoding. Entropy encoding is carried out with the help of a variable-length coder (VLC), which reduces the bit rate by recognizing frequently recurring data patterns and replacing them with shorter instructions, or codewords.
In MPEG-2/4, the VLC sends every value in the quantized transform matrix to the decoder. H.264 instead offers two more advanced entropy coding options: Context-Adaptive Variable-Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC). While CAVLC only compresses data for the quantized transform coefficients, CABAC compresses all data streamed to the decoder.
H.264 VLCs ultimately make streaming redundant data more efficient even though they increase the processing power requirements. CAVLC and CABAC reduce the bit rate by adapting to repeatedly received data sequences when that is statistically proven to be more efficient. So, knowing how and when to implement a particular VLC is just another challenge put to H.264 engineers.
A simple example may help to explain how CAVLC works. Suppose that every time you said, “I’d like a cup of coffee”, you received one, and so, after a while, you started just saying “I’d like”. While this is a very easy way to satiate your coffee craving, should you ever just want a glass of water, you would need to explain yourself without saying “I’d like”.
CAVLC works similarly. If the entropy encoder receives recurring data patterns, it replaces them with a codeword, like 1. However, other sequences then need to be described without using a 1. This can sometimes lead to longer codewords in unique data streams.
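The trade-off in the coffee example, short codewords for frequent patterns at the price of longer ones for rare patterns, is the core of any variable-length code. The sketch below uses classic Huffman coding as a stand-in to show the effect on a list of made-up quantized values. It is not CAVLC, which selects among predefined code tables based on context, and the name `huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(symbols):
    """Build a prefix-free code that gives frequent symbols short codewords."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie_breaker, {symbol: codeword_so_far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical quantized values: mostly zeros, a few small levels, one outlier.
levels = [0] * 12 + [1] * 3 + [-1] * 2 + [5]
codes = huffman_codes(levels)

variable_bits = sum(len(codes[v]) for v in levels)
fixed_bits = 2 * len(levels)  # four distinct symbols need 2 bits each
print(codes, variable_bits, fixed_bits)
```

The frequent value 0 receives a one-bit codeword, the rare value 5 a three-bit one, and the stream as a whole comes out shorter than with fixed-length codes.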
Blurring block borders in the encoding process
One problem plaguing MPEG-2/4 encoders is errors in the image data caused by macroblock edges that are incongruous with adjacent blocks. This is not only disruptive to the viewer but it can also hinder motion estimation.
While these block-edge blunders are easily evened out with a deblocking filter, MPEG-2/4 only applies this deblocking filter in the decoding process. Although the deblocking filter more or less erases blocky edges for the viewer, the original distortions still impede motion estimation during encoding because the reference frames retain the block-edge errors. This ultimately reduces the efficiency of residual data recognition and, therefore, it also diminishes the efficiency of the encoder in general.
H.264 provides a solution for both the visual effects of block edges and the implications they can have for motion estimation by applying a deblocking filter in the encoding process, also known as in-loop deblocking. This allows motion estimation to search deblocked, reconstructed reference frames rather than reference frames that still contain block-edge errors, thereby reducing the discovered residual differences. In‑loop deblocking thus further facilitates adroit motion estimation in H.264 encoding.
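A heavily simplified sketch of the idea: smooth small brightness steps that fall exactly on block boundaries, leave large steps alone because they are probably real edges in the scene, and store the filtered result as the reference for the next frame. The function name `deblock_row`, the threshold, and the smoothing rule are assumptions for illustration. The filter defined by H.264 is considerably more elaborate and adapts its strength to how the neighbouring blocks were coded.

```python
def deblock_row(row, block_size=4, threshold=10):
    """Soften small steps across block boundaries in one row of pixels."""
    out = list(row)
    for edge in range(block_size, len(row), block_size):
        p, q = row[edge - 1], row[edge]
        step = q - p
        if 0 < abs(step) <= threshold:  # small step: likely a coding artifact
            out[edge - 1] = p + step / 4
            out[edge] = q - step / 4
        # Large steps are treated as genuine image edges and left untouched.
    return out


# One row of a reconstructed frame: a faint block seam, then a real edge.
reconstructed_row = [100, 100, 100, 100, 108, 108, 108, 108, 200, 200, 200, 200]
filtered_row = deblock_row(reconstructed_row)
print(filtered_row)

# "In-loop" means the filtered frame, not the unfiltered one, is what the
# encoder (and the decoder) keep as the reference for predicting later frames.
reference_for_next_frame = [filtered_row]
```

The seam between the first two blocks is softened, while the jump to 200 is preserved. Because encoder and decoder apply the same filter, both predict from identical reference frames.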
The Siqura solution
At Siqura, engineers have thoroughly researched the best way of implementing H.264. Initially, they tried running H.264 directly on a digital signal processor (DSP). However, after extensive testing, it became clear that H.264 processing requirements vastly exceed the capacity of the average DSP, and operating H.264 as such would necessitate an impractically powerful DSP.
Therefore, Siqura engineers have incorporated a dedicated hardware encoder chip into codecs to offer both the Baseline profile and aspects of the Main profile while still efficiently utilizing processing power. In this way, the Siqura H.264 solution offers intra- and inter-prediction, CAVLC, slice groups, arbitrary slice order (ASO), redundant slices, and, if processing power allows, CABAC and interlaced video encoding possibilities.
Yet the real successes have been won in the creation of an expert motion estimation process. It cleverly chooses appropriate block sizes, correctly calculates the residual difference using multiple reference frames, and accurately establishes motion vectors using quarter-pixel resolution, cumulatively compensating not only for linear motion but also for rotation. The Siqura solution consequently optimizes CPU usage and maximizes bit rate reduction while still providing premium video quality, making the Siqura H.264 codecs well suited to surveillance applications.