Proforma
| Name: | Andrew Lewis |
|---|---|
| Project originator: | Andrew Lewis |
| Supervisor: | Markus Kuhn |
Original aims of the project
To produce a working lossy video compression and decompression system, with functionality in separate modules. An executable should set parameters and invoke the codec based on detailed configuration provided by a user. The codec should provide runtime analysis for evaluation and optimisation of parameters and algorithm choice.
Work completed
The codec performs well, giving good compression at high quality on natural footage. Many parameters are variable, and features and algorithms can be selected. The codec integrates with a runtime computation engine (Matlab or Octave) to give extra analysis output for evaluation and data visualization. There are a few extensions to the original plan: more advanced motion estimation, caching and rate control.
Contents
- 3.1 Overview
- 3.3 Logging and library services
- 3.4 Module system
- 3.6 Encoding and decoding paths
- 3.7 Runtime measurements
- 4.1 Overview
- 4.2 Structural similarity (SSIM)
- 4.3 Test sequences
- 4.4 Variables
- 4.5 Video codec evaluation
- 4.6 The evaluation setup
- 4.8 Analysis of artifacts
- 4.9 Codec comparison
- 4.10 Structural evaluation and aims
- Bibliography
- A. Standards
- B. DCT implementation
- C. SimpleFrames coding path
1. Introduction
1.1 Overview
In my project I wrote a complete, modular video compression and decompression system (codec), with flexible configuration and monitoring support, and evaluated its performance. The codec performs well on natural video, and I have accomplished the goals outlined in my project proposal, with extensions for integration with a runtime data visualization engine, motion estimation, and various extra pieces of functionality.
1.2 Motivation
Video compression is ubiquitous: entertainment (DVDs, television, web pages, mobile 3G), business (video-conferencing, training) and security systems all make use of the technology. Uncompressed video uses too much bandwidth and storage to be a viable option. Many implementations of compression systems exist, some of which use fast algorithms and optimisation techniques to give excellent performance. In my project I try to achieve some of the performance of these major pieces of software, whilst keeping the code coherent and modular, and provide runtime analysis features, the aim being to investigate the effectiveness of algorithms under different parameter choices, and to create a working compression system.
1.3 Codec requirements
A codec implementation consists of a library or executable which encodes video and a corresponding decoder to operate on the output. The encoder accepts frames (from a camera or separate images, comprising the `source video') and outputs a binary file (the `encoded stream') which represents the video in a more abstract form. The decoder takes this file and reconstructs an approximation of the input frames (`decompressed video').
Two types of compression can be performed: Lossy compression discards irrelevant data, and lossless compression removes redundancy. Both techniques are used in video compression.
Different applications make different demands on the codec, and so trade-offs must be made. Decisions in the implementation and configuration will affect:
The speed of compression and decompression: complexity of the algorithms involved, in space and time;
- The level of compression which can be achieved; and
- The quality of the resulting compressed video.
All of these factors affect each other: increasing the level of lossy compression too much lowers the perceived quality; better compression is normally possible at the expense of increased encode/decode times and resource requirements; and so on.
1.4 Video compression overview
I will now give a general overview of the process of compression and decompression of video, to define important ideas and terms.
The vast majority of current codecs, including the MPEG variants, divide frames up into square blocks either eight or sixteen pixels across called macroblocks. These are further subdivided into smaller squares called blocks, which are transformed to the frequency domain by a discrete cosine transform (DCT). A quantization stage then discards information to approximate the blocks. The encoder emits codewords into the output stream to represent this intermediate form: the symbols guide the decoder, giving it quantized coefficients or telling it when to zero entire blocks, for example. The quantization stage makes this a lossy codec, but the final entropy coding stage uses lossless compression representing more common features with fewer bits.
So far, this is the same scheme used with JPEG. In video compression there is also temporal redundancy: a frame is often similar to predecessors in the stream. Frames which refer to previous frames are known as P-frames (predictive frames, also known as inter or non-intra frames), and frames that are independent and only exploit spatial redundancy are known as I-frames (intra or key frames). Prediction involves estimating motion to derive vectors for each macroblock which tell the decoder a displacement in the last decoded frame from which pixels should be copied to `guess' the value for the block in the current frame; in sequences with movement such predictions are rarely exactly correct, due to objects rotating, scaling, revealing background, or moving by an amount which does not line up with pixel boundaries1.1, and so a residual error is output together with the vectors. Adding the residual to the motion compensated image corrects errors. It is coded similarly to an I-frame, but with signed values.
An alternative to such a block DCT-based scheme uses wavelets, but current codecs' performance is not yet sufficient to encode and decode high resolution video1.2. There are many codec implementations available, using different techniques at differing levels of complexity, and with different licenses. Standardized MPEG-compliant codecs are among the most popular, and my codec is an MPEG-like implementation.
2. Preparation
2.1 Proposal refinements and extensions
My original plan for Matlab integration allowed a developer to extend the set of modules in the codec to provide new functionality implemented in Matlab. However, near the start of the project it became clear that this feature was impractical and not as useful as anticipated. The first issue is that marshalling data across the boundary between the native code application and Matlab is slow, as is often the case with native/managed boundaries; an encoding/decoding path using Matlab for all picture data would certainly by unusable, due to speed and memory constraints. Secondly, the architecture of the codec in native code is very easy to extend, so the only motivation for this feature would come from needing to use Matlab's higher level DSP facilities. Rather than using Matlab for actual computation for compression, I decided to use its data visualization capabilities instead.
I have implemented two of the extensions from the original project proposal: use of motion estimation and compensation in predictive frames and additional performance analysis and visualization in Matlab. The implementation of both of these extensions went very smoothly, and turned out to be worth the extra effort in compression gains (quality, speed and compression ratio) and in evaluation capabilities respectively. As the project progressed I decided to implement support for Octave as an optional replacement for Matlab, added simple rate control based on error estimation, implemented caching on input/output images, added half-pixel resolution motion estimation and compensation, and added an extension to associate four vectors with each macroblock to represent motion more precisely.
I will start by looking at the requirements and preparatory work, before describing some theory which the project makes use of.
2.2 Requirements analysis
The project requires an encoder and decoder. The encoder should output a smaller file than the uncompressed input2.1, and the decoder should produce the coded approximations of input frames at a reasonable speed. (The codec is asymmetrical, meaning that more time can be spent on encoding than decoding.) A test harness which passes custom configuration parameters and references input data into the main codec library is also necessary.
To keep close to standards compliance with MPEG, the main picture data in the binary file are stored as required by the standard, but I replaced headers at various locations in the stream with simpler meta-data, or removed them entirely. The decoder is able to store state about its current position, so the stream syntax does not contain explicit state (e.g., current frame number, location in the picture, and so on). Another factor here is that the syntax of MPEG-4, for example, has many options for different profiles and levels, which I do not require; by including these I would use extra space unnecessarily.
The requirement for prediction brings the codec above MJPEG (Motion-JPEG) in terms of complexity and means that encoding/decoding paths will have to be structured carefully, to avoid the code becoming too complex to maintain. The requirement of modularity entails various tasks at the design stage, to standardize what `modules' have in common and write this framework.
2.3 Preparatory work
Due to the number of unknowns at the start of this project, it was important to prepare correctly, gathering information necessary for the implementation to avoid problems at a later stage. The initial obstacles to beginning coding, in addition to those typical in small/medium sized software engineering tasks (choice of development methodology, pinning down precise descriptions of data structures and algorithms which need to be implemented), were becoming familiar with video compression technology and the MPEG standard [1], sorting out the development environment and compilation for a C++ shared library, and working out how to interface with Matlab/Octave.
2.3.1 Books, standards and tools
Much effort has gone into standardization of video codecs, as it gives many benefits. I made extensive use of the MPEG-4 Visual standard and H.264 recommendation ([1] and [2]). The documents only specify the decoding part of the system, but establish conventions for data formats and give tables of codewords. Appendix A has gives some detail on the standards.
It took a considerable amount of time to understand these documents, as they are necessarily detailed and thorough. Indeed, one of the main difficulties was establishing which tables, flow-charts and sections of the 700 page MPEG-4 standard were relevant to my project. Various extra text books came in useful for general understanding of video compression [3,4,5,8].
I use C++ (compiled using the GNU C++ compiler) with CDT and the Eclipse IDE for development. Octave and Gnuplot are supported by my project as an alternative to Matlab, as they are open-source. I use ImageMagick for reading and writing frames input to the encoder and output from the decoder. Due to the breadth of the project, I used many different tools during development and evaluation, including Perl and BASH, valgrind and various image viewers and command-line tools. To produce this dissertation I used Mediawiki, LATEX, InkScape, Maxima and Gnuplot.
2.3.2 Development process
I used an iterative development process for this project, to give adaptation to changing requirements, because there were many unknowns at the start ([6] Code Complete section 3.2).
|
During preparation I decided to split up development into three stages. Firstly, in the initial iterative stage, I implemented the modules required to encode/decode I-frames in an encoding/decoding path called `SimpleFrames' (Figure 2.1). I added modules in the order shown, gradually giving more functionality. Individually, most modules required implementation of a main algorithm, followed by addition of infrastructure for invocation and handling of different data structures/overloads. Bit-level input and output was also required, implemented in another part of the library separate from the module system, so that it could potentially be utilised by any part of the library. Appendix C is a code sample from SimpleFrames which compresses and decompresses each frame as an I-frame. This development stage was completed on schedule by the end of January.
Secondly, I added prediction code to the encoding path. This involved updating data structures to include predictive data, and modifying algorithms where the input range was different (residual frames have signed input for the DCT, for example). This stage had the most detailed design and specification of the development process, because it was important to keep track of data types and the operations which each module had to perform. I produced Figures 2.2 and 2.3 as part of this design: the key identifies attributes which picture data being passed around may have, and in the main diagrams these attributes are shown on each side of the operations performed on the encoding path, to demonstrate the transformations on the picture data. Lines are colour coded according to the sort of data being transformed, and per-pixel addition/subtraction and bit-size comparison are indicated by circles at the intersections of frame data. Having a specification as detailed as this made writing the encoding paths much easier and helped with strategies for memory management, avoiding copying, and refining the implementations of individual modules. The encoding/decoding paths were completed by the middle of March, as I had moved some analysis features later in the schedule to fit with the evaluation stage.
Lastly, I added extensions. At this stage, addition and alteration of modules was becoming quite simple, demonstrating the ease of use of the module system. I finished off extensions and evaluation code at the end of the schedule at the beginning of April.
This turned out to be a successful software engineering methodology for the project, and each of the three main milestones was met approximately on schedule.
|
2.4 Concepts and theory
I will now give a brief summary of the most important ideas used in the implementation, starting from consideration of frames, then moving onto individual macroblocks, blocks and finally codewords in the stream.
2.4.1 Groups of pictures
A group of pictures begins at an I-frame and ends at the next I-frame. IDCT-mismatch, where the decoder gradually drifts away from the encoder's virtual decoder state, is prevented by requiring the use of an I-frame every 132 frames as in the MPEG standard. Rate control may also determine that an I-frame is appropriate. The first frame in the stream is always an I-frame.
2.4.2 I-frame required
I-frames are independent of surrounding frames, and are coded as follows:
Colour space conversion and chrominance sub-sampling: RGB to

Arrange data into macroblocks, as shown in Figure 2.4. The 4:2:0 layout is used.
Transform each of the six blocks in the macroblock using the DCT.
Quantize coefficients using a matrix: separate quantization factors are used for each location in the transformed block, exploiting the human visual system's non-uniform sensitivity over different spatial frequencies.
- Entropy-encode the resulting macroblocks to the file.
Decode the result (dequantize,inverse transform, change to scan order) and store as a reference frame.
|
2.4.3 I- or P-frame is possible
If the encoder may use either frame type, rate control is used to determine (or estimate) which is smaller. The process is as follows:
Based on the current reference frame, estimate motion and associate a 2-dimensional vector with each macroblock in the new picture indicating where it should get its pixel data from. This motion estimation involves searching for the square region of pixels in the nearby regions of the reference frame which is most similar to the corresponding region in the new frame.
After applying motion vectors to the reference picture (motion compensation) a residual error is calculated by subtracting the the reference from the new frame.
The residual is then coded like an I-frame, except that the values are differences.
Rate control determines whether to code an I- or P-frame based on the size of each. (Alternatively, more advanced rate control can be performed before creating the frames to decide which frame type to produce by detecting scene or region changes.)
The resulting coded frame is decoded to act as the next reference frame.
It is important to use the last encoded frame (decoded) as a reference frame rather than the last input frame, to avoid accumulation of error in prediction: the encoder always has a correct model of what the decoder has available.
2.4.4 Transforming to and from the frequency domain
In natural scenes, neighbouring pixels are often highly correlated. Transform coding makes use of this fact to use fewer bits: various transforms are available which decorrelate the image data, reducing the amount of redundancy.
We have various choices for the transform. Macroblocks
conveniently split up into four
blocks, and fast transforms
exist for input of this size. Each candidate transform should be
rated by the decorrelation it provides, and the potential for a
fast implementation.
The Karhunen-Loéve Transform (KLT) gives optimal decorrelation, but has high computational cost and uses adaptive basis functions; these would have to be transmitted alongside the actual data2.2.
|
The discrete fourier transform (DFT) can be implemented in a fast way. However, Figure 2.5 shows that it does not perform well when the frequency representation of linear spatial changes is truncated and reverse transformed. Its treatment of boundary values also makes it a bad choice. The discrete sine transform represents the constant function as a square wave, making it unsuitable for image compression.
The type II discrete cosine transform (DCT) is the best choice for the linear transform, as it provides good energy compaction, its decorrelation is almost as good as that of the KLT (DCTs are the optimal KLT transforms in some cases, when the statistics of the signal are given by limits of Markov processes [18]) and there are fast implementations.
The DCT is a special case of the general discrete Fourier transform for real, even input. Mirroring samples around the origin satisfies the requirement of even input. To calculate the transform, the naïve approach involves evaluating two matrix multiplications as follows:
| (2.1) | ||
|
where is the
matrix defined with
|
||
![]() |
(2.2) | |
is the input matrix,
if
and
otherwise, and
is the output matrix.
Figure 2.6 shows a visualization of the basis functions for
this transform. Each of the functions is an
array of intensity values. The
action of the DCT is to associate a coefficient with each of
these basis functions. By taking the sum of the basis
function/coefficient products in the inverse transform, the
original sample matrix is reconstructed. The DC2.3
component is outlined in blue, and its coefficient is the average
amplitude of the block in the spatial domain.
|
The one-dimensional transform can be computed on a vector
by evaluating
.
For the two-dimensional transform, the
product computes a one-dimensional transform for each column of
the input matrix, and the rows are then transformed by matrix
multiplication with the transpose of the original matrix
. These two transforms can
be combined together to give
Properties of the cosine function yield a smaller set of constants to pre-compute:
for
integral
;if
,
; andif
,
.
The cosine products can be calculated and stored in a lookup table. This provides a DCT which performs much better than one which computes the cosines at runtime, and functions as a `reference' DCT in my implementation, but there is much further to go in optimising the calculation: I will elaborate on this in the next chapter.
2.4.5 Quantization
Quantization is the process which divides coefficients from
the transform so that they occupy a smaller range, losing
information. Dequantization scales them up to an approximation of
their original values. In I-frames, quantization performs a
division and rounds to the nearest integer (
rounds up). In P-frames, quantization performs a
division rounding down. The behaviour is shown in Figure 2.7. The
rationale behind the `dead zone' in P-frame quantization is that
changes are differential and accumulate over a few P-frames.
|
2.4.6 Variable length coding
The decoder reconstructs precisely those values present in the encoder after quantization, as they are losslessly compressed in the stream. The MPEG standards includes tables with variable length binary codewords, selected using a technique based on Huffman coding, such that more common events have shorter codewords. (No separators are present in the stream, and the codewords exhibit the prefix property.)
|
Coefficient values from a block are coded in the order shown in Figure 2.8. The top-left DC term is coded as a difference from the last corresponding DC block. AC coefficients (the rest) are coded as a zero run length followed by the value, and once only zeros are left in the zig-zag, an end-of-block (EOB) symbol is coded. This represents a block of quantized coefficients very compactly, as zero values are common because of energy compaction and quantization.
3. Implementation
3.1 Overview
The overall structure of a video compression and decompression library is driven by its dual functionality: encoding and decoding. Because these two tasks share many data structures and algorithms, as demonstrated by the fact that the encoder must keep track of the state of a virtual decoder at the current position in the stream in order to predict accurately (and therefore must have the same functionality as the decoder), the codec is broadly structured as two code paths which each use modules providing appropriate algorithms. The module system, logging and configuration parsing are implemented alongside this infrastructure.
Operations in the encoding and decoding paths are performed in-place, and take advantage of C++'s flexibility in memory management and pointer arithmetic. The code re-uses pictures as references and packs data to give spatial and temporal locality of reference. The style of code written had to be conducive to modularity and extensibility, whilst at the same time being fast: to this end, the code is mostly C-like arranged into C++ classes and using some of its more concise syntax.
I will first outline the implementation of some aspects of the test harness and library, and then give detail on various interesting modules of the system.
3.2 Test harness
The test harness is an executable which uses the main libvideocodec library, passing a tree structure of configuration data. It also provides a few functions for code generation and algorithm testing which are not required within the codec library but are still relevant to the project. The overall structure is shown in Figure 3.1.
The requirement of flexibility motivated comprehensive configuration of each part of the encoding/decoding process, which assisted the evaluation: a combination of settings for normal codec features are available, together with settings for doing analysis and evaluation of modules. Rather than using pre-processor directives to configure the codec (requiring recompilation for changes in setup), I put all configuration, analysis and evaluation options in the nested hierarchy on the command line. The cost is small because checks occur during initialization or per-frame and are amenable to branch prediction. The intention is that one can make a set of useful command lines and run them from scripts, analogously to using a configuration file.
The configuration for the codec consists of a polymorphic data type ConfigurationElement which can contain key/value pairs (where keys are strings and values are further elements), a list (of elements), a string value, an integer value or a float. The module system (discussed below) mandates that the root of the configuration tree consists of key/value pairs where the keys are names of modules and the values are elements which are passed to the respective modules during initialization. A configuration element can also be queried to get an element using a path syntax, with names of elements separate by dots.
For future extensibility, the instance of Codec used by an executable is created using a factory method.
3.2.1 Code table generation
Entropy encoding within the encoder makes use of tables of codewords from the standards documents, indexed in the encoder based on values being encoded, and also indexed in the decoder based on codewords seen in the input stream. Because of the decoder's way of decoding the incoming stream, its table should be ordered by length of codewords, so that it can incrementally increase its input `window' size over the stream detecting the first matching codeword.
If a codeword clashes due to a typo and is detected incorrectly, it may, for example, only affect a specific block pattern, which is not even used in certain pictures. Therefore the method for putting tables into the source must keep encoder and decoder tables consistent when corrections/changes are made. It is also laborious to produce the encoder/decoder tables manually. The solution I have adopted is to generate the tables based on a set of master tables kept in a header file in the test executable. When the executable is compiled and run with the appropriate directives, it outputs a partial header file to standard output containing the code for the tables, each in two forms: one optimized for the encoder and one for the decoder. Generating new tables and altering codewords is a simple matter of adding a master table in and running this helper.
Having described the features of the test harness executable, I will now go on to features of the main codec library.
3.3 Logging and library services
A static Logger class is available, which provides both an always-on log to output useful information to the user, and a debug log outputting more detailed information. For efficiency, the debug log is enabled by a preprocessor switch, which only outputs when it is enabled -- this means that when debugging is switched off at compile time, there is no overhead from the debug statements: the preprocessor removes the code and the optimiser will remove the dead code in the method calls. The debug log wraps vprintf(...) so that calls to it can use variable numbers of parameters and a format string.
Other important features in the library external to the modules include:
Header structures for the bitstream syntax (for frames, streams and macroblocks);
A generic PictureInfo structure, which encapsulates the information for a frame and can be passed around between modules.
Abstraction of data type sizes using typedefs, aiding compilation for different architectures.
A small amount of open-source interoperation code for Octave integration, modified slightly to optimize the case of not requiring output back from Octave. This uses a UNIX pipe to access the engine. The interface with Matlab is through a binary provided by MathWorks and a header file.
3.4 Module system
The functionality of the codec is organized into a set of distinct modules, each utilizing picture information and the output of other modules to execute their task. Each module is derived from a base class Module which enforces the use of a module identification name and consumption of configuration information in an instance of ConfigurationElement.
Instances of modules are not created directly. Instead, a ModuleRegistry keeps an STL map<std::string, Module*> which associates the unique string associated with a module (the strings being defined by preprocessor macros) with an instance of that module. When the encoder or decoder is instantiated, the module registry creates those modules which have configuration information passed to them on the command line, storing their instances in this map. If a coding path requests a module when it runs, the registry returns the requested module (if it has already been instantiated), or creates it on-demand with default configuration. At the initialization stage, a module has the option of returning MODULE_CONFIGURATION_ERROR, signalling that invalid or insufficient configuration information was passed to it.
3.5 Modules
3.5.1 Converting between colour spaces
The human visual system's ability to resolve detail differs in
luminance and chrominance [4]. Sub-sampling the chrominance
channel so that it is half the size of the
component in each axis (see Figure 2.4 for the data
layout) results in no noticeable reduction in quality and a
saving of 50% in the amount of data to be coded. Most display
devices use RGB (e.g., LCD monitors have RGB coloured sub-pixel
cells, typically made using pigment/dye filters, and CRT monitors
have RGB phosphors). The colour space YUV is a scaled and shifted
analog version of
, and is the
format used in PAL television. In the context of video encoding,
both terms are sometimes used.
I perform conversion between
and RGB
in-place using fixed point representations. Colour planes are
packed together in memory, as shown for RGB in Figure 3.4, as
this gives spatial locality of reference. The operation can be
pipelined by the CPU and is fast.
|
3.5.2 Implementing the discrete cosine transform
There is much room for improvement in the performance of the DCT developed in the previous chapter. Optimization of the transform is vital due to the volume of DCTs which occur during the encoding of one frame, the complexity of computing the `reference' transform described earlier, and also because the decoder must decode frames quickly (applying the IDCT). I present here my implementation of one of the fastest known versions of the DCT, due to Arai, Agui and Nakajima [20], and the corresponding inverse DCT.
There is a very fast version of the DFT, developed by Winograd [16]. As we have seen, the DFT is not suitable for exploiting spatial redundancy, however, it is possible to compute a DCT via a DFT.
3.5.2.1 Computing the DCT using the DFT
We use a 16-point one-dimensional DFT, where we mirror the
input sequence
with symmetry about the point
so that points zero to seven are the
same as points fifteen to eight. The input data are real, and the
imaginary parts will cancel out due to the even symmetry.
Let
where
. Applying the DFT, we
have
![]() |
(3.1) |
![]() |
(3.2) |
This is in a good form for calculating the DCT. However, we
still need to rearrange the left hand side, which we know must be
real by inspection of the right. Writing the real and imaginary
parts of
as
and
respectively,
, and when this is substituted back in,
This gives the following expression, which can be used to
calculate the DCT in terms of the DFT output values
:
![]() |
(3.3) |
The Winograd form of the DFT [16] calculates real and imaginary
parts separately. Because this equation features only the real
part of the DFT results, we can discard the calculation of the
imaginary parts. The matrix for the calculation of the
values can be factored into a
product of sparse matrices, as described in [3]. Apart from post-scaling, this
leaves only four multiplications by constants, all of which can
be pre-computed. The final algorithm is represented most clearly
using the flow graph show in Figure 3.5, each stage horizontally
representing one sparse matrix product. Temporary variable names
and constants in my implementation are shown. Appropriate scaling
factors for the flow-graph results (on the right hand side) come
from the
factor in
equation 3.3 above.
|
The decoder must reverse this operation by calculating an inverse DCT, or IDCT. The DCT transform matrix is orthogonal: its transpose is its inverse. The flow graph derived from the transpose of the matrix is the same as the original one but with data flowing in the opposite direction (due to the effect of transposing the product of matrices). To implement the IDCT, I worked out this flow-graph (Figure 3.6), labelled nodes with temporaries in the same way, verified that the graph was equivalent to the IDCT using Maxima, and wrote the code based on this representation.
|
This algorithm gives one of the fastest known DCTs. The reference DCT discussed earlier has 64 multiplications by constants for each 1-D DCT; for this fast DCT there are 13 multiplications, but with 8 of these as a final step of the DCT/first step of the IDCT; they can be combined into the scaling factors of quantization, (or dequantization) which is the next step. This (I-)DCT algorithm is one of the fastest known.
In my implementation, I use preprocessor macros to give a concise representation of the algorithm which can be used with data types required for any particular instance of the algorithm. The macros abstract away the notions of SOURCE and DESTINATION assignments so that they can be combined with shifting, rounding and referencing into an array. See Appendix B which gives my DCT/IDCT code for the details of this.
3.5.3 Quantization
Let
be the unquantized value,
be the quantized value and
be
the dequantized value. intra_q_scale and
inter_q_scale are scaling values which can alter the
degree of quantization for rate control for I- and P-frames
respectively. The quantization matrix value for the location of
the coefficient value is
.
takes the value
when
is
negative,
if it is positive non-zero and 0
otherwise. All equations use integer operations. The rounding
term in the P-frame dequantization equation compensates for the
dead zone indicated in Figure 2.7.
3.5.3.1 I-frame (de-)quantization
3.5.3.2 P-frame (de-)quantization
3.5.4 Prediction: FrameDifference and MotionEstimation
P-frames predict picture data using the previous frame as reference. Figure 3.7 shows how I- and P-frames are decoded.
The FrameDifference module handles per-pixel addition or subtraction of frames, using a residual to calculate a final frame based on a reference, or calculating a residual based on two frames respectively. (Recall that a residual error frame corrects any errors remaining after motion estimation.)
Imagine a panning shot (or any shot with a uniform global
motion vector). If we do not compensate for the motion in such a
sequence, the difference between frames will be large, as we may
end up encoding the whole frame's data in each residual, making
the P-frame larger than an I-frame. The better symbol to encode
for this is (conceptually) `These blocks are shifted
pixels across and
pixels down.'
Each PictureInfo instance therefore has two arrays for
displacements (one for each axis), with a pair of elements per
macroblock representing one motion vector.
To derive these vectors, the encoder estimates motion. The vectors are `applied' to a reference as an initial estimate for a frame in the decoder.
3.5.4.1 Motion estimation
In modern video encoders, motion estimation is often the most
time-consuming step. The problem is to find per-macroblock
and
displacements
which give the best estimate of the location where pixel values
should be predicted from in a reference picture. Vectors point in
any spatial direction in the picture and backwards in time.
There is a trade-off between speed and accuracy with motion estimation; no algorithm is specified by the standard, and much work has been done to find fast, accurate techniques. By searching starting with small displacements first, the encoder prioritises displacements with shorter codewords without losing accuracy.
Motion estimation algorithms can typically be characterized by a distortion estimate, search strategy and stopping condition.
The distortion estimate is a metric for evaluating a
particular choice for a vector. I implement the SAE (sum of
absolute errors) measurement on the luminance channel, which
associates a quantitative measurement of error with each
(reference image luminance, new image luminance, macroblock
location, candidate vector) tuple. Let
and
be the reference
and current image
-component pixels at
respectively,
and
be the vector
displacements under consideration and
be the coordinates of the macroblock under consideration.
I use a simple spiral search strategy, to test
locations in approximate order of increasing associated codeword
length, optimizing the most common case. This involves evaluating
the above equation for each macroblock location, varying
and
in
the order of a square spiral of increasing diameter, first
testing the zero displacement vector, then moving outwards one
pixel and testing in the shape of a square one pixel out from the
point, then moving to two pixels out, and so on. The
computational complexity is
in the
search range
. The furthest distance checked
is a parameter of the motion estimation algorithm and care has to
be taken not to exceed the boundaries of the image.
The stopping condition speeds up the estimation by specifying a threshold on the SAE below which we will consider a displacement to have matched. Increasing the threshold will cause the algorithm to terminate sooner, at the expense of accuracy if the actual motion is further than the chosen displacement. Decreasing the threshold will increase the chance of finding the best possible match within the search range but the algorithm may take longer to execute if the real motion is a large displacement.
Motion vectors are differentially encoded, so for a given macroblock the difference from the last encoded vector is encoded (or from the zero vector if this is the first macroblock). This favours uniform motion of large regions.
|
3.5.4.2 Motion compensation
This is a faster, deterministic operation, performed in the decoder, and also in the encoder to keep an accurate representation of what the decoder's state is. The operation takes the output of the motion estimator (pairs of elements for displacements) and applies the vectors to a reference frame, copying pixel values to give an approximation of the current frame to which the residual will be added producing the final frame. Part of a residual frame is shown in Figure 3.8.
To test motion estimation and compensation, I wrote a Perl script to generate frame sequences with an object moving by various displacements each frame.
3.5.5 Entropy encoding and decoding
As we have seen already, the `events' (coefficient values and vectors) within the encoder are put into the final bit-stream as a series of codewords, using shorter codewords to identify more frequently occurring symbols. There are therefore methods which perform this translation, encoding and decoding streams of codewords for each of the two frame types. These methods make extensive use of lookup tables, and have to be fast. Unlike most of the rest of the codec system there is no trade-off with final codec performance in terms of quality -- the only trade-off is choosing an appropriate amount of lossless compression so that compression/decompression is fast without wasting bits. I use efficient non-adaptive codewords in lookup tables, optimized separately for the encoder and decoder.
In addition to getting codewords based on values/vectors, there also needs to be a way to determine the coded length of the stream without getting the actual code bits. This is used for simple rate control and memory allocation, and I implement it alongside the code which outputs to the stream, updating an bit count long integer reference passed to each function.
3.5.6 Input/output and stream syntax
The input/output (I/O) system of the codec needs to be structured carefully to keep it maintainable -- rather than putting code relating to the syntax of the bit-stream, bit-resolution reading/writing and actual I/O code together, I decided to separate the three out, as the interface between each of the three layers can be kept fairly simple. This means stream formatting and parsing are kept isolated from the low-level details, which is an abstraction conducive to maintainability and debugging. It also separates concerns, such as keeping locality of reference and caching, from bit-stream syntax minutiae. The entropy encoder and decoder deal with various clever coding optimizations such as coded block patterns (which specify which blocks in a macroblock are actually coded); a function implementing such coding directly to low-level I/O would become too large to be manageable.
The bit-stream syntax consists of two types of data: headers (which are meta-data) and actual picture data. Headers in the stream specify sufficient data to let the decoder configure itself and construct a blank picture: dimensions of the picture and quantization modes, for example. The headers occur at the start of the stream, except for macroblock headers within the main picture data.
As described earlier, the actual picture information consists of codewords from various lookup tables, specifying symbols which provide actual data (such as the values of quantized DCT coefficients) or state modification instructions (such as macroblock address increment instructions which move the `current macroblock' pointer). The I/O system reflects the distinction in a few ways (see Figures 3.9 and 3.10): firstly, the encoder optimizes the building of a large block of bytes out of bit-level codewords into a buffer (BinaryStreamBuilder), which uses bit masks and shifting to construct words of data in a byte array, whereas initial headers are output separately. Secondly, the StreamFormatter and StreamParser abstract away the precise details of how headers are laid out in the bitstream. Thirdly, methods used by the decoder for input of actual picture data are optimized towards recognizing codewords rather than inputting values from the stream. This can be seen in the MatchBits(...) method which takes a bit-pattern and, if it is seen within the stream at the current location, advances the stream and returns true; otherwise it returns false. The consumer of the codewords can efficiently iterate over the relevant lookup table trying to match codewords of increasing lengths until this method returns true.
|
|
3.6 Encoding and decoding paths
Completion of the modules described above was a major milestone, ending the first stage of development. There was a little overlap with the second stage, as additional modules were added during development of the encoding and decoding paths to deal with prediction.
All of the modules are utilized by encoding and decoding paths, which are central to the codec's functionality. In addition to invoking the required operations, the paths must also handle data and prediction. Figures 2.2 and 2.3 from the preparation chapter give a concise representation of the actions which are performed by the encoding paths, and the conditions on execution of each part of the procedure. The simple representations of pixel data and coefficients in the PictureInfo class help to avoid copying and expensive memory allocation and deallocation: information is re-used where possible.
3.7 Runtime measurements
Timing within the encoding and decoding paths uses the gettimeofday(...) function, working out the number of milliseconds which have passed between two `checkpoints'. If timing is enabled, the results are put into an STL map<pair<const char*,FrameCount>,int64_t> which associates a textual description and frame number with each timing result, and can also store other numerical statistics such as the number of bits coded. The elements are displayed at the end of execution.
Using an external computation engine allows the encoding and decoding paths to output individual macroblocks, show motion vectors and chart runtime.
4. Evaluation
4.1 Overview
In the introduction I identified ways to measure the performance of any video compression system:
- Speed of encoding and decoding;
- Compression ratio achieved; and
- Perceived quality of output.
The first two of these are relatively easy to measure; they are linear, reliable, repeatable, consistent but not independent [19]. They are easy to measure as well: I described timing earlier, and it is easy to keep track of the number of bits coded, which can be used in conjunction with a simple calculation of the number of bits in uncompressed input to give a quantitative measure of compression ratio. Both speed and compression ratio are precisely defined concepts. The main difficulty with making measurements becomes choice of representative input data.
However, perceived quality of output is much harder to measure. The difficulty comes from the fact that the best measurement device for image quality is a human being; setting up experiments where many subjects are asked to rate quality of videos, and then collating the data into quantitative measurements is the best approach. However, I will describe one particular fairly new quantitative metric, structural similarity or SSIM [13], and justify its selection. Experimental research has been done, and, though it is a generalization, it is reasonable to assert that SSIM is a fairly close approximation to the results of human judgements.
4.2 Structural similarity (SSIM)
I use a freely available Octave (or Matlab)
script to make the actual calculations, but I will summarise the
technique used internally to derive the measurement. Structural
similarity calculations take as input two images, one of which is
regarded as a reference (perfect quality) image, while the other
has been subject to distortion of some sort (in this case, lossy
compression in the encoder, and decoding), and outputs a value
between 0 and
. The greater the value, the
higher the assessment of relative quality is expected to be.
The most common technique for quantifying image quality involves calculating the residual error of the two images, then performing some calculations on the residual, based on observations about the human visual system, and computing a single scalar based on these calculations4.1. The problem with this is that the human visual system is not so sensitive to certain types of distortion, and is affected by context, so the approximations made are not good enough to be accurate. Some distortions which are large in the residual may not be as noticeable as others, and many metrics fail to recognize this. There are various other problems associated with other quantitative measures, given in [13].
The structural similarity measurement (SSIM) tries to separate considerations of luminance, contrast and structure, weighting each of these three. The calculations used are fairly simple but give good results and are consistent with other analyses of the human visual system; for example, the luminance formula fits with Weber's law, commonly used to model contextual luminance masking. Contrast masking is also modelled.
4.3 Test sequences
The test sequences are 720p high definition (
, 24-bit
RGB
) sample sequences from
the SVT multi-format test set [14], available from Sveriges
Television. These sequences have been selected to provide
varied content, some `difficult' (including more complex
movement, for example) and some easier. By using these sequences,
I have a representative sample of the various 720p HD inputs to a
codec in general use, and I can identify the behaviour of
measurable quantities in the codec when used on a variety of
inputs. The test sequences I have chosen are:
CrowdRun, which shows runners moving towards and to the left of the camera from an elevated perspective with a little global movement. This sequence is classified as `difficult' by SVT.
ParkJoy, which is a horizontal panning shot following people walking beside a river. This sequence is also classified as `difficult' by SVT.
InToTree, which pans and zooms away from a large house into a tree. There is a fair amount of movement, but overall the sequence is not as challenging and is classified as `easy' by SVT.
OldTownCross, which shows an aerial view of a town, with a very slow pan. There is little movement, and the sequence is classified as `easy' by SVT.
The first three sequences are an ambitious use-case for my codec. The fourth is less difficult, but is representative, real video. All of these sequences are at fifty frames per second, meaning that displacements in adjacent frames are likely to be smaller than in lower frame rate video. This benefits newer codecs such as H.264, which use quarter pixel motion compensation. However, at higher bit-rates my codec will still perform well in terms of picture quality. The first two sequences are particularly difficult because areas of background are uncovered by moving objects.
4.4 Variables
The following are a few of the controllable variables with my codec:
- Input sequence frames;
Quantization scaling factors to use with I- and P-frames;
- Range to check in motion estimation;
- Stopping threshold for motion estimation search;
- Rate control threshold;
Selection of diagnostic information to be provided (normally fixed to be the minimum for whatever measurements are necessary);
- Enable/disable half-pixel motion estimation; and
- Enable/disable using four vectors per macroblock.
The following variables are measurable:
- SSIM value for any (reference, decoded) frame pair;
Time taken to encode a single frame and an entire sequence;
Time taken to decode a single frame and an entire sequence;
- Time taken on each stage of the encoding path;
Number of bits coded per-frame and for the whole sequence; and
Dimensions and colour depth of input frames, allowing the uncompressed input size to be calculated
4.5 Video codec evaluation
The evaluation of a codec should have the dual aim of tuning and tweaking the parameters of the codec to provide maximum performance, and comparing the codec with other codecs based on criteria which are important to users.
Getting the best performance from a video codec is very difficult. There are simply too many variables to vary given that the measured performance of the codec is based on the combination of the output variables. The toughest of these variables is the variability in the input sequences, where input footage can vary widely making the choice of parameters very difficult. Quantization scaling values are also hard to choose: advanced codecs use adaptive techniques which update quantization tables based on context, making the already difficult problem of rate control even harder but giving better compression ratios.
One approach to making this problem more tractable is to use purely quantitative measurements (even for quality -- SSIM, as mentioned above, is a good choice) and then to automate tests which find variable combinations which give good performance with various types of input sequence. My codec is ideal for such instrumentation because it is easy to configure.
4.6 The evaluation setup
Taking advantage of the BASH scripting language on Linux, I set up a testing system which scripts initialization of tests (symbolically linking in input frames), performs the actual test (iterating variables to move over the parameter space which is being searched), parses output (redirecting output, matching relevant parts using regular expressions and making substitutions with sed to compile data in a concise, convenient form) and cleans up. I also wrote further scripts to produce SVG charts in Perl and scripted GNUPlot for 3D-charts. This approach has a number of advantages. Firstly, the test setup is standardized so it is easy to add extra experiments to the current set. Secondly, it is possible to leave the scripts running over a long period and all the results will be collated in a useful way -- the first set of tests (for investigating combinations of I- and P-frame quantization scaling factors, and the effects of motion estimation parameters) which I ran took over nine hours to complete, producing a matrix of bar/line charts which could be inspected easily. Thirdly, new variables, measurements and ways of instrumenting the binary can be added easily into current tests.
This went some way to taming the problems with tuning and evaluating the codec in terms of quantifiable performance. I give a selection of the interesting results here, and compare the codec with two open-source codecs.
4.7 Tuning and selecting parameters
4.7.1 Quantization scaling factors
The standardized quantization formulae include a quantization scaling factor which can be varied. The factor multiplies the value from the quantization matrix. I decided to investigate the effects of using different values for quantization scaling, as they will affect the image quality and bit-rate.
There are some very complex interdependencies within the codec based on these parameters. If I-frames are quantized by a large amount, a lot of residual error will be produced when calculating the next P-frame. If too much quantization is used with P-frames, with very little quantization on I-frames, the quality will visibly vary during the course of a sequence -- a high quality frame will be shown, then quality will degrade gradually, then when another key-frame is shown the picture will suddenly improve again.
4.7.1.1 Procedure
For each sequence to be tested:
Link input frames into the test base (sixteen initial frames from a sequence are used, to give a reasonable coverage of features without taking too long)
Iterate over I-frame quantization scale values (1 to 10) and P-frame quantization scale values (1 to 10) giving 100 combinations in total. For each combination:
- Run the encoder, then the decoder.
Store the size of the stream and compute SSIM values for the output frames.
- Output the results in a convenient form.
- Remove the output stream and frames.
4.7.1.2 Results
The quantity of results collected is too large to show everything, but there are many interesting patterns and relationships. I will identify a few of them here.
Figure 4.1 shows the effect of three values of P-frame
quantization scaling on the structural similarity measure in the
CrowdRun sequence. The first frame of the sequence
(frame zero) is of the same quality in each picture as its
quantization factor is not being altered. The three scaling
factors used characterise the observed quality behaviour over the
whole range of scaling factors (zero to ten): very high quality
is maintained when the DCT coefficients are not scaled down by an
extra factor, but as the factor is increased up to around four or
five, quality begins to drop off. However, as long as it is not
too noticeable this is acceptable -- we merely wish to avoid the
situation where within a particular GOP the image gets worse
before the next I-frame updates the picture and the quality
improves (which could make the video unwatchable). For values
higher than around five, image quality drops off quickly after
the I-frame and continues to drop. When using a value for P-frame
quantization scale of two or more it is wise to increase the
I-frame scaling factor a little to around two or three, because
the extra bits in the I-frame do not make a relative image
quality improvement (the quality is already high) so bits can be
saved by having a lower quality I-frame too. As a very rough
guide, when I look at the output images at a distance of three
times the screen width I become aware of the compression
artifacts when the SSIM value is below
.
The upper chart in Figure 4.2 shows structural similarity (red bars and left scale) over a sequence of sixteen frames from the start of ParkJoy. The number of millions of bits coded is also shown for each frame (the blue line and right scale). This sequence is very challenging: there is a uniform movement with a fairly large magnitude following the people on the further bank, and a wide tree trunk obscures the view as it passes across the frame; this leads to a large amount of residual error (though not enough to nudge the rate control into using an I-frame to update). With the fairly aggressive quantization scaling used for this encoding (P-frame quantization scaling is at five) this causes a noticeable loss in quality, though the bit-rate is still quite high. The lower chart shows the equivalent data for the OldTownCross sequence. Here we see different behaviour compared to the ParkJoy sequence: OldTownCross is classified as `easy' by SVT, as the picture does not change drastically and not much of the picture is uncovered in the duration of the sequence. The higher quantization scaling factor does not have any adverse effects on structural similarity here and gives a significant benefit in bit-rate. It is therefore clear that a rate control system which assigns varying values of quantization scaling based on the `difficulty' of the footage would be useful.4.2
|
|
4.7.2 Motion estimation parameters
The motion estimation algorithm which I use has two main parameters for its operation: the stopping threshold and the search range. As described earlier, the stopping condition and search pattern characterise most motion estimation algorithms in general, so there is good control over motion estimation using just these values.
I used three scenes which have a range of motion from small, global, uniform displacement in OldTownCross, to varied, larger distance displacements in ParkJoy. When motion is non-uniform across the frame, the algorithm will take longer with a lower stopping value and/or larger search range, because it will give up later. If the estimate is close to the real motion, fewer bits will be used coding the residual error resulting in a better bit-rate. There is a clearly a complex interdependency here.
Motion estimation is a large proportion of the computational effort in most codecs, including mine, and the results here would be useful to tune other implementations as they reveal information about characteristics of motion in natural footage.
4.7.2.1 Procedure
For each sequence to be tested:
- Link input frames into the test base
Iterate over stopping conditions (0 to 2000 step 200) and search range values (1 to 15). For each combination:
- Run the encoder, then the decoder.
Store the size of the stream and measure the time taken for the encoding stage4.3.
- Output the results in a convenient form.
- Remove the output stream and frames.
4.7.2.2 Results
The most interesting results from this test were in the range of motion estimation necessary for visual quality. By instrumenting the codec to encode the same sequence five times each time only altering the range of motion estimation whilst keeping the stopping SAE threshold at 0 , so that the search must find the lowest distortion location in range, I found some interesting results about the efficiency of checking to various distances on the stream. Before choosing a search range based on this, it is worth noting that a larger search area will cause motion estimation to take longer in most cases.
Figure 4.3 shows the efficiency of using various motion
estimation search ranges. P-frame quantization scaling is fixed
at five to simulate a fairly high level of compression to
highlight the benefit of good motion estimation. The cost of
motion estimation increases quadratically with the range, so to
improve encoder performance one might first perform this test on
a representative sequence to establish the distribution of
displacements, then set the range parameter based on this. The
graphs suggest that a value of around seven pixels might be
sufficient for non-action sequences. The frame rate of the test
sequences is high (
), so the
same sequences at a lower frame-rate would need a larger search
range, but half the number of frames would have to be
encoded.
|
4.8 Analysis of artifacts
Some interesting details of the compression methods are revealed by looking at its behaviour when a low bit-rate is used.
|
The first artifact in Figure 4.4 demonstrates how high
frequencies (here, the detail in the grass and road) are lost.
This comes from a combination of stronger quantization of higher
spatial frequencies (due to the distribution of values in the
quantization matrices) and the fact that half-pixel motion
estimation interpolates values between pixels linearly. P-frames
are unable to correct the error here because they lack the high
frequencies necessary at a P-frame quantization scaling value of
(the maximum).
The second row shows JPEG-type artifacts from DCT quantization -- blocking is visible in the blocks where the trees meet the sky. These errors can sometimes be corrected by P-frames, but here the spatial frequency is too high. The interpolation for half pixel displacements may cause motion compensation to make the problem less visible (at the expense of `blurring' the image as see in the first row).
Ringing occurs when an object moves in one direction revealing new background but the edges formed are too high frequency to be corrected. In this example, the house moves to the left leaving behind edges which are uncorrectable at this level of quantization.
The last row shows the effect of motion compensation. The edge has already been blurred due to the use of sub-pixel motion compensation (which uses linear interpolation). However, after a small motion displacement, one block is offset leading to a very noticeable high spatial frequency. This demonstrates one of the fundamental difficulties with block-based motion estimation -- the operation introduces high frequencies which P-frames may be unable to correct for. The problem is ameliorated through the use of individual intra-coded blocks (when selected by rate control), smaller regions for motion compensation (i.e., using four vectors instead of one per macroblock, as I implemented in an extension), and so on. All of these approaches cause encoding to take longer.
4.9 Codec comparison
To be competitive with MPEG or H.264 for HD content my codec would need a number of extra features: bi-directional predictive frames/weighted prediction, sub-pixel motion estimation and compensation, an arithmetic encoder for lossless coding, advanced rate control, phase correlation motion estimation, macroblock mode decision, quad-trees describing macroblocks, and so on. However, it is interesting to compare with another codec to see how my codec performs with difficult input, despite my project's priorities being partly different to those of other codecs. I use libavcodec, part of mplayer [11], to encode with MJPEG (which does not exploit temporal redundancy) and MPEG-1.
Encoding one hundred frames of the InToTree sequence
(at
and HD
resolution), my codec produces a 16 MBits/s file with almost
perfect image quality. Using the MJPEG codec gives a file of
around 64 MBits/s in similar quality (SSIM staying greater than
0.8 for the entire sequence). Comparing with libavcodec's MPEG-1
codec is less favourable, though still acceptable: with strong
quantization (quantization scaling factor of greater than twenty
for P-frames), my codec exceeds the quality of MPEG-1 for the
duration of the sequence when my stream is around
times the bitrate, in the
range of around 6 MBits/s. This illustrates a point noted in
[4] that `
the bitrate is very sensitive to the
quantizer_scale, and variable quantization is an
important tool for improving picture quality and controlling
bitrate.' Because my codec has a uniform scaling value over a
frame and sequence it either produces larger files at the same
quality or files of the same size but where P-frames are
quantized so greatly that quality deteriorates. As a guide to the
quality, when file size is matched with MPEG-1 the SSIM values
for MPEG for the test sequence are around 0.8 whereas for my
codec they are around 0.76 -- this discrepancy is sufficient to
be just noticeable.
Enabling the four motion vectors per macroblock extension feature in my codec significantly improves quality for the test sequence; however, this is at the expense of an increase in the number of bits output (the increase is around a factor of two for the one hundred frame test sequence). Sub-pixel motion estimation has a similar effect -- the size of motion vectors which must be represented in general increases by a factor of two. The combination of these two features gives excellent frame quality but most bits are used up in the vectors.
My decoder is quite fast. When not outputting to image files during decoding (i.e., generating colour planes for display4.4) the code is fast enough to decode DVD-resolution or lower in almost real time (around 20 frames per second). There would be a small performance boost from changing the way data are transferred in the decoding path to avoid copying which is currently used for image output. The encoder is quite slow, however. Figure 4.5 shows the timings of the modules in the encoder's path with advanced motion estimation (using four motion vectors per macroblock and half-pixel interpolation) and with simple motion estimation (single vector per macroblock, integral pixels only and reduced search range) respectively4.5.
|
4.10 Structural evaluation and aims
In the context of the overall project, it is important not merely to evaluate the project in terms of its statistical performance, but also to identify other positive features.
My code strikes a good balance between optimization and coherence, especially in comparison with other open-source projects I have looked at. I have managed to optimize modules individually, without breaking loose coupling. This yields advantages in portability, and also makes the code suitable for investigation as a `reference' from which to construct a more heavily optimized, instruction set architecture specific implementation. The code can in principle be compiled for any device with a C++ compiler as I have not made use of non-standard features. This is an advantage over proprietary codecs only available for one platform such as Microsoft's Windows Media Video codec.
The resolution of performance details provided by my codec is also high compared with most other codecs, and not at the expense of performance (due to the use of debug/release building and preprocessor directives). It would not be too hard to apply optimization tools to algorithms within modules. The use of a compiler targeted at multimedia applications might also provide an extra performance boost.
With difficult test input, the relative performance of my codec is good. It would be of practical use in a variety of video compression situations: high quality/lossless archiving of footage; use as a reference when analysing another codec; and encoding footage where specific features are known and the codec can be very finely tuned to provide the best performance.
Finally, my codec is useful as a learning tool, as thorough documentation, both in code and externally, is available for every aspect of the process.
The next change to make with the codec would be using optimized assembly code for time-consuming operations using SIMD extensions (when the architecture supports them). I would then go on to look at using smaller blocks with variable-sized macroblocks, and advanced rate control -- these features are the chief contributors to H.264's amazing compression performance. Smaller block sizes also permit the use of a reversible integer-based transform on the smaller blocks.
5. Conclusion
5.1 Achievements
The product of this project is a maintainable, modular, working video codec which can be used to evaluate algorithms without the difficulty of augmenting one of the more complex proprietary or open-source codecs with extra functionality. In this aim, my project has surpassed expectations: evidence for this is that I was able to implement a variety of extensions in keeping with the idea behind the project whilst maintaining loose coupling of modules and ease of testing. The development process also demonstrates that despite the initial overhead in design and coding infrastructure, a modular approach is worth the effort. I am very pleased with the current state of the whole system architecture -- there has been no gradual decline in maintainability when complexity is added, and the negative effects of abstraction on performance are negligible.
To summarise the functionality of the project, the following features are present:
Video compression and decompression operating on frames of RGB data, using block-based transform coding and an entropy coder;
Highlights of the compression system include one of the fastest known discrete cosine transforms, advanced motion estimation and modularity;
A test harness which parses tree-structured per-module configuration;
- Runtime analysis with Matlab or Octave;
One can measure the performance of the codec as parameters are changed by instrumenting the binary with configuration and parsing its detailed output; and
Half pixel motion estimation and compensation with linear interpolation, four motion vectors per macroblock, two simple rate control techniques and input/output caching.
My codec will be a valuable tool when trying out new algorithms, without the steep learning curve of integration into available codecs which have been developed over a few years. The documentation will also be useful for people wanting to learn about video compression, especially the implementation details for the encoder side, which is only partly standardized, and the information about the fast DCT.
5.2 Experience gained
The area of video compression is fascinating, using a blend of theoretical (information theory, digital signal processing, algorithms, performance analysis, ...) and practical (optimization/operating systems, software engineering, graphics, computer vision) knowledge and results. The variety of experience I have gained from this project is evident from the features and implementation process I have described.
This is also one of the largest software engineering projects I have worked on and has good software engineering as a goal in the modularity requirement. Although laborious at times, the good structuring has paid off in keeping a clear idea of the whole system during implementation of encoding and decoding paths, stream syntax, prediction and more advanced motion estimation.
Bibliography
- 1
- ISO/IEC 14496-2 Information technology -- Coding of audio-visual objects, Part 2: Visual (MPEG-4 Standard third edition), 2004.
- 2
- ITU-T Recommendation H.264: Advanced video coding for generic audiovisual services. 2005.
- 3
- W.B. Pennebaker and J.L. Mitchell: JPEG still image data compression standard. Van Nostrand Reinhold, 1993.
- 4
- J.L. Mitchell, W.B. Pennebaker, C.E. Fogg and D.J. LeGall: MPEG Video Compression Standard. Chapman and Hall, 1996.
- 5
- I.E. Richardson: H.264 and MPEG-4 Video Compression. Wiley, 2003.
- 6
- BBC Research et. al.: Dirac Specification Version 0.11.0. 2007
- 7
- S.C. McConnell: Code Complete (2nd edition). 2004.
- 8
- J. Watkinson: The MPEG Handbook. Focal Press, 2001.
- 9
- D. Vatolin, D. Kulikov, A. Parshin, A. Titarenko and M. Smirnov: MPEG-4 AVC/H.264 Video Codecs Comparison. 2006.
- 10
- D. Vatolin, D. Kulikov and A. Parshin: Videocodecs tuning. http://research.graphicon.ru/video-filtering/videocodecs-tuning.html, 2007.
- 11
- FFMPEG Multimedia System. Open source project, http://ffmpeg.mplayerhq.hu/.
- 12
- x264. Open source project, http://www.videolan.org/developers/x264.html.
- 13
- Z. Wang, A.C. Bovik, H.R. Sheikh and E.P. Simoncelli: Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE transactions on image processing, 2004.
- 14
- L. Haglund, Sveriges Television: The SVT High Definition Multi Format Test Set. 2006.
- 15
- Wikipedia H.264. http://en.wikipedia.org/w/index.php?title=H.264/MPEG-4_AVC&oldid=124806670.
- 16
- S. Winograd: On Computing the Discrete Fourier Transform. Mathematics of Computation, 1978.
- 17
- F. Ozar, H. Demirel, and H. Bilgekul: KLT based Video Coding using Adaptive Eigenspace. Visualization, Imaging, and Image Processing, 2004.
- 18
- K. R. Rao and P. Yip: Discrete Cosine Transform: Algorithms, Advantages, Applications. New York: Academic, 1990.
- 19
- S. Kounev: Performance Metrics notes from the Advanced Systems Topics course, 2007.
- 20
- Y. Arai, T. Agui and M. Nakajima: A Fast DCT-SQ Scheme for Images. IEICE transactions, 1988.
- 21
- W. Cleveland: The Elements of Graphing Data. 1985.
A. Standards
I made use of the MPEG-4 standard and H.264 recommendation for some implementation details and stream syntax.
MPEG-4 Visual is the older of the two, having been introduced in 1998, and is part of the MPEG-4 ISO/IEC standard (ISO/IEC 14496-2). The core of the standard defines the syntax of a compliant bitstream and the process of pixel reconstruction for the decoder. The sections most relevant for my project are 6.3, 6.4 and some of the tables in the appendices. Because this standard shares many features with MPEG-2, I have also used tables and syntax from [4]. It has been asserted that this standard is ahead of its time (e.g. in [5]). In many ways it tries to be too ambitious -- separating out most of the mesh-coding, facial features coding and body animation into separate standards would perhaps have been wise: the evidence for this is that there are currently no widely-used decoders which implement them. I have avoided such features.
H.264 was completed in 2003, standardised with shared technical content in MPEG-4 Part 10 which is ISO/IEC 14496-10 (when I refer to the MPEG-4 standard, I mean the older version not including this part). It is a block-based scheme, and avoids some of the extraneous features of MPEG-4, and at the same time uses different algorithms for some features, a more complicated prediction scheme, and an integer-based DCT and IDCT. Because it gives a significant performance gain over MPEG-4 in compression ratio and quality, but not usually speed, it is likely that this will become the prominent standard over the next few years, after its adoption for Blu-Ray and HD-DVD [15].
B. DCT implementation
My implementation of a fast discrete cosine transform, using preprocessor macros to allow flexibility in input and output:
CodeSamples/aan.lg
C. SimpleFrames coding path
The simple coding path used during the initial stage of development, which encodes then decodes an I-frame; an instance of PictureInfo is passed to each module in turn.
CodeSamples/simpleframes.lg
is the

and
components



arrive on the
left and are transformed into the frequency domain to give
the
on the right. Values flow
to the right and are added at the intersection points.
Values are negated where a line is marked with an arrow,
and the squares indicate multiplication by the constant in
the label. Post-scaling constants are applied separately.