Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_1373693063

ACM Multimedia 97 - Electronic Proceedings
November 8-14, 1997
Crowne Plaza Hotel, Seattle, USA
Optimizing the Data Cache Performance of a
Software MPEG-2 Video Decoder
Peter Soderquist
School of Electrical Engineering
Dept. of Electrical and Computer Engineering
Cornell University
Ithaca, New York 14853
(607) 255-0314
pgs@cs.cornell.edu
http://www.ee.cornell.edu/~pgs/
Miriam Leeser
Dept. of Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts 02115
(617) 373-3814
mel@ece.neu.edu
http://www.ece.neu.edu/
ACM Copyright Notice
Abstract
Multimedia functionality has become an established component of core computer workloads. MPEG-2
video decoding represents a particularly important and computationally demanding application
example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of
this and other multimedia algorithms. However, memory subsystem deficiencies have now become the
major barrier to increased performance, partly as a consequence of this improved CPU performance.
Decoding MPEG-2 video data in software makes significant bandwidth demands on memory
subsystems, which is seriously aggravated by cache inefficiencies. Conventional data caches generate
many times more cache-memory traffic than required, at best double the minimum necessary to
support decoding. Improving efficiency requires understanding the behavior of the decoder and
composition of its data set. We provide an analysis of the memory and cache behavior of software
MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer
relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive
handling of different data types can reduce traffic by 50 percent or more.
Keywords

MPEG-2, digital video, caches, computer architecture
Table of Contents
1 Introduction
2 Background
3 MPEG Overview
4 Related Work
4.1 Media Processors
4.2 SIMD Instruction Extensions
4.3 Video/Graphics Data Buses
4.4 Prefetching
4.5 Our Approach: Cache-Oriented Enhancements
5 Experimental Setup
5.1 Machine and Software Environment
5.2 Steady-State Simulation
5.3 Color Space and Bandwidth
6 Cache Simulation Results
6.1 Cache Size
6.2 Associativity
6.3 Line Size
6.4 Summary
7 Decoder Internal Analysis
7.1 Data Set Composition
7.2 Decoder Program Flow
7.3 System and Audio Interaction
8 Cache-Oriented Enhancements
8.1 Selective Caching
8.2 Cache Locking
8.3 Scratch Memory
8.4 Data Reordering
8.5 Cache Partitioning
8.6 Summary
9 Conclusions
10 Acknowledgments
References
1 Introduction
Multimedia computing has become a practical reality, and computer architectures are changing in
response. Extensions for continuous media processing are a current or planned part of every market-
leading instruction set architecture, as shown in Table 1. The primary feature of these extensions is
SIMD-style processing of small data types. While contemporary microprocessor datapaths are 32 or
64 bits wide, multimedia data usually consists of 16-bit or 8-bit integers. Furthermore, multimedia
algorithms commonly feature an abundance of data parallelism. Therefore, modifying wide datapaths
to process multiple small data types in a single instruction promises significant performance
improvements.
Architecture ISA Extension

x86 MMX (MultiMedia eXtensions)
SPARC VIS (Visual Instruction Set)
PA-RISC MAX-2 (Multimedia Acceleration eXtensions)
Alpha MVI (Motion Video Instruction)
MIPS MDMX (MIPS Digital Media eXtensions)
PowerPC VMX (Video and Multimedia eXtensions)*
Table 1. ISA multimedia extensions for popular architectures (* = proposed)
All of these extensions are targeted, at least partially, at MPEG-2 video decoding. Collectively, they
mount a formidable assault on its computational complexity. Unfortunately they largely fail to address
the significant challenge of MPEG processing for memory subsystems, which have become the
primary performance bottleneck. The high data rate, large set sizes, and distinctive memory access
patterns of MPEG exert a particular strain on caches. Manipulating standard parameters (cache size,
associativity, and line size) fails to cost-effectively reduce excess memory traffic. While miss rate levels
are acceptable, standard caches, even very large and highly associative ones, generate significant
excess cache-memory traffic. Multimedia instruction extensions actually exacerbate this problem by
enabling the CPU to consume and produce more data in fewer cycles.
This cache inefficiency is seriously limiting for desktop multitasking systems. A foregrounded video
process can potentially cripple communications, print spooling, system management, network agent,
and other background tasks. The video playback may also suffer from dropped frames, jitter, blocking,
or other annoying artifacts. For low-power/cost systems and information appliances, cache inefficiency
can have a direct cost impact, requiring the use of faster or higher capacity components than strictly
necessary to achieve specified functionality at the required quality level. This can drive up system cost,
increase power consumption, or even prevent implementation.
To explore the cache efficiency problem and its solutions, we examine the traffic and data cache
behavior of an MPEG-2 decoder running Main Level streams on a general-purpose microprocessor.
Trace-driven cache simulations of typical sequences reveal the memory behavior of the decoder. We
show that, contrary to some predictions, there is a benefit to even relatively small, simple caches.
However, even very large and complex caches generate significant excess traffic. An analysis of the
data types utilized by the decoder and their patterns of access provides the basis for proposed
architectural enhancements. We also show that even relatively simple measures, such as selective
caching of specific data types, can dramatically improve efficiency, reducing cache-memory traffic by
50% or more for a wide range of cache sizes.
<-- Table of Contents
2 Background
Consider the problem of decoding MPEG-2 in software on a general-purpose computing platform. The
machine type of primary interest is a typical desktop PC or workstation, as envisioned in Figure 1. The
diagram also illustrates the typical flow of video data in the system. Imagine the user is viewing, for
example, an entry in a multimedia encyclopedia. Compressed data is transferred from the file system
(e.g. CD-ROM or hard disk) by direct memory access (DMA) and buffered in the main memory. The
CPU reads the data through its cache hierarchy and writes the decoded, dithered RGB video, one
frame at a time, to main memory where it is stored in an image of the displayed video window. Finally,
another DMA transfer brings the video data to the frame buffer where it eventually appears on the

display.
Figure 1. Typical desktop machine with software MPEG-2 decoding traffic
It is clear that there is a large amount of data, much of it time-sensitive, being transferred through the
system. Most of this is concentrated on the main system bus. At the very least, there are two streams
each of encoded and decoded video being concurrently transferred in and out of main memory,
amounting to a minimum sustained load of 63 Mbytes/s. This ignores the bandwidth required by other
applications running on the system and sharing interconnect, main memory, and peripherals. Any
excess memory traffic generated by cache inefficiency will further exacerbate this situation.
Unfortunately, our simulations indicate that standard caches generate at least twice the minimum
required level of cache-memory traffic, typically many times more, amounting to a significant strain on
system capacity. The result is some combination of dropped frames, reduced frame rate, and
degradation of overall system performance.
3 MPEG Overview
Appreciating the challenges of supporting video decoding requires some understanding of the MPEG
standard [4]. MPEG attacks both the spatial and temporal redundancy of video signals to achieve
compression. Video data is broken down into 8 by 8 pixel Blocks and passed through a discrete
cosine transform (DCT). The resulting spatial frequency coefficients are quantized, run-length encoded,
and then further compressed with an entropy coding algorithm. To exploit temporal redundancy, MPEG
encoding uses motion compensation with three different types of frames. I (intra) frames contain a
complete image, compressed for spatial redundancy only. P (predicted) MPEG frames are built from
16 by 16 fragments known as macroblocks. These consist primarily of pixels from the closest previous
I or P frame (the reference frame), translated as a group from their location in the source. This
information is stored as a vector representing the translation, and a DCT-encoded difference term,
requiring far fewer bits then the original image fragment. B (bidirectional) frames can use the closest
two I or P pictures - one before and one after in temporal order - as reference frames. Information not
present in reference frames is encoded spatially on a block-by-block basis. All of data in P and B
frames is also subject to run-length and entropy coding.

Figure 2. Typical sequence of MPEG frames, showing interframe dependencies
Figure 2 shows the interframe dependencies of the different frame types, superimposed on the
displayed frame order. For decoding, these frames must be processed in the non-temporal order
[1,4,2,3,7,5,6,10,8,9], which is a result of these dependencies. Interframe dependencies and the
properties and sequence of frame types determine in critical ways the flow pattern of MPEG data and
the nature of the hardware support required.
Figure 3. MPEG bitstream structure
The decoder reads the MPEG data as a stream of bits. Unique bit patterns, known as startcodes mark
the division between different sections of the data. The bitstream has a hierarchical structure, shown in
simplified form in Figure 3. A Sequence (video clip) consists of groups of pictures (GOP's). A GOP
contains at least one I frame and typically a number of dependent P and B frames; Figure 2 shows a
possible GOP of 9 frames (pictures) followed by the I frame of the subsequent GOP. Pictures consist of
collections of macroblocks called slices. The ``headers'' shown at each level contain parameters
relevant to each bitstream type but no actual image data.
4 Related Work
Deficiencies in the MPEG-2 decoding performance of systems are being actively addressed. This
section discusses several significant methods for performance enhancement under investigation by
researchers both in industry and academia. Some of these are merely proposals, while others have
already been implemented in hardware or are forthcoming. We explore the benefits and disadvantages

of each approach, and assess its impact on the problem of excess cache-memory bandwidth.
4.1 Media Processors
A new breed of microprocessor, the so-called ``media processor'' has emerged in just the past few
years. These chips combine programmability with specialized hardware to support multiple concurrent
multimedia operations, including MPEG-2 decoding, and usually reside on peripheral cards in desktop
systems. Their manufacturers, which include Chromatic Research, Fujitsu, Mitsubishi, Philips, and
Samsung, claim that media processors are the best present and long-term solution for providing
multimedia functionality in computer systems, not the host CPU.
To their advantage, media processors yield reduced memory bandwidth requirements, since they deal
in compressed data, and take a significant computational load off the host. They are also good at
handling the real-time constraints of multimedia, while present-day operating systems and memory
hierarchies are not. Finally, present-day multimedia functions consist of a limited set of operations, tied
to international standards, an efficient target for highly specialized solutions.
However, performing multimedia functionality with the host processor is inherently less expensive than
using extra specialized hardware. Memory systems are also improving to meet the needs of
multimedia, as illustrated by Intel's AGP and Direct RDRAM initiatives. Operating systems are also
likely to become more real-time oriented as they evolve. More fundamentally, multimedia data are a
new first-order data type, on the same order as floating-point. These functions are becoming central to
what computers do, not ``peripheral'' like mass-storage or printing. As applications evolve, media and
other operations will become more and more interleaved - someday viewing a video clip will be as
seamless as reading e-mail or editing a spreadsheet. Host multimedia processing best supports the
construction of highly integrated, fully featured, yet portable applications. Hardware integration trends
and the powerful vested interest of CPU manufacturers also favor this outcome.
Nevertheless, parts of video processing are likely to remain in specialized hardware for quite some
time, such as color space conversion, scaling, and dithering. These are straightforward to implement in
hardware, and to a large extent not data-dependent and therefore not subject to pipeline hazards and
other impediments. These are more akin to truly ``peripheral'' operations.
<-- Related Work
4.2 SIMD Instruction Extensions
Multimedia instruction extensions such as MMX, VIS, MAX-2, and the others in Table 1, achieve
speedup primarily through performing SIMD-style operations on multimedia data types. These ISA
enhancements are an essential step to improving microprocessor performance on multimedia
applications - as well as insuring the continued viability of host processing. However, these operations
tend to increase rather than relieve the strain on memory bandwidth. By consuming more operands per
unit time, multimedia instructions expose the underlying weakness of the cache, system bus, and main
memory. The caches in the MMX-enhanced Pentium are double the size of those in the earlier Pentium
for this very reason. Intel researchers even blame the below expected speedup on MPEG with MMX on
memory and I/O system deficiencies [9].
<-- Related Work
4.3 Video/Graphics Data Buses
There have been proposals for implementing special interconnect to transfer image data from the host

processor to the frame buffer, bypassing slower general-purpose I/O buses. The Intel Advanced
Graphics Port (AGP) is the most prominent such effort, intended to provide a proposed direct link
between main memory and the graphics subsystem. While primarily meant to support the high
bandwidth of 3D graphics rendering on the host processor, it is a definite boon to video as well. This
method provides another significant way to boost the capabilities of host multimedia processing
relative to custom hardware solutions. However, it does nothing in and of itself to alleviate cache-
memory traffic waste or reduce overall memory-system bandwidth needs. The undiminished
requirement for extremely high-bandwidth memory is reflected in Intel's support of the Rambus
architecture for future PC memory systems.
<-- Related Work
4.4 Prefetching
Hardware prefetching has been suggested as a remedy for inadequate MPEG performance [11]. Yet it
is primarily promoted as a means to reduce miss rates and without much consideration of cache-
memory traffic, which tends to increase with prefetching, especially the more aggressive schemes.
Prefetching is essentially a form of latency hiding; for memory-bound problems, it merely exposes
underlying bandwidth problems, and tends to generate excess memory traffic of its own. Selectively
applied, prefetching may provide a boost to compute-intensive phases of MPEG and other multimedia
algorithms. However, the emphasis of our research is not so much on getting values into the cache
early, but keeping them around for as long as they are needed.
<-- Related Work
4.5 Our Approach: Cache-Oriented Enhancements
The methods discussed above largely bypass the issue of excess cache-memory traffic, while some of
them even exacerbate the problem. None of them directly target the issue of cache inefficiency. Even
prefetching focuses on reducing miss rates at the possible expense of memory traffic. We suggest a
different type of solution, driven by the internal dynamics of the decoder itself. Exploiting knowledge of
individual data types - their sizes, access types, and access patterns - can lead to effective
architectural solutions to cache inefficiency.
<-- Related Work
5 Experimental Setup
Solving the problem of excess cache-memory traffic requires first establishing its extent, and how
critical system parameters affect this behavior. Our results come from trace-driven cache simulations of
video stream decoding. The clips themselves are progressive MPEG-2 Main Level sequences, with a
resolution of 704 by 480 pixels at 30 frames/s, and a compressed bitrate of 6 Mbits/s. These are
comparable in perceptual quality to conventional analog broadcast (e.g. NTSC, PAL). Sequences are
chosen as representative of different types of programming.
5.1 Machine and Software Environment

For experimental purposes we use a simplified machine model, limiting the microprocessor to a single
level of caching. We further assume that MPEG-2 decoding is the only active process. The decoder is
a version of mpeg2decode from the MPEG Software Simulation Group [1], running on a SuperSPARC
processor under Solaris. This decoder has the advantage of portability and relative system
independence, with the disadvantage that performance is suboptimal for any given architecture. Since
it originates with the body responsible for MPEG, it is closely written to comply with that standard.
Finally, it has the essential feature for research purposes of being available in source code form.
We have chosen to treat the decoder implementation as a given, implementing minimal modification to
support integration with our simulation tools. Trace generation is performed by QPT2 (the Quick
Profiler and Tracer) [6], a component of the Wisconsin Architectural Research Toolkit (WARTS). We
have implemented the cache simulator, Pacino, to support the unique qualities of continuous media
applications. These are considerably different from conventional benchmarks of the SPEC variety in
their consumption and production of data, and interaction with the operating system.
<-- Experimental Setup
5.2 Steady-State Simulation
To limit the considerable storage and CPU cycle requirements of trace-driven simulation, we restrict
each run to one GOP, the largest logical bitstream unit below a sequence. Simulations run as if at
steady state - i.e. preceded and followed by a large number of other GOP's, as if plucked from the
middle of the sequence. This eliminates the side effects of program startup and termination. To
achieve this the results of running initialization code are not logged, the cache is primed with data to
avoid the effects of cold start, and dirty lines are not flushed at the end of the GOP.
5.3 Color Space and Bandwidth
The simulation framework implements one significant performance optimization beyond the generic
hardware model. Decoded video is sent to the frame buffer in 4:2:0 YCrCb form, the native
representation of MPEG, rather than 4:4:4 RGB. YCrCb is a color representation with one luminance
(grayscale) and two chrominance components per pixel, rather than one each of red, green, and blue. In
4:2:0 YCrCb, the chrominance components are subsampled in both dimensions. We assume that
upsampling, color space conversion, and any dithering are done on the fly by the video display
subsystem. These features are becoming common in workstations and budget PC video accelerators
alike. According to prior work on MPEG-1 decoders, this processing can account for almost 25% of
total execution time if performed in software [8]. In addition, the 4:2:0 representation of an image is only
half the size of the 4:4:4 version, reducing the minimum sustained system bus load to 31.5 Mbytes/s.
Finally, these conversion operations account for so many memory and instruction references that they
make trace-driven simulation prohibitive - a sufficiently compelling reason to avoid them.
6 Cache Simulation Results
Trace-driven cache simulations clarify how data requests from the MPEG-2 decoder translate into what

traffic is seen by the memory. First of all, there is a distinct benefit to caching video decoder data, even
using naive, generic schemes. It has occasionally been suggested that caches are critically inefficient
for video data. Accordingly, several media processors dispense with data caches altogether in favor of
SRAM banks - essentially large register files - managed by software [5,3,2]. Yet we have found that
despite the large set sizes and the magnitude of data consumed and discarded, there is sufficient re-
use of values for caching to significantly reduce the required memory bandwidth.
For example, decoding one Group Of Pictures from a typical MPEG-2 stream generates 2.1*10exp(8)
memory requests. A 16 Kbyte, direct-mapped cache generates 3.9*10exp(7) words of memory traffic.
If we estimate conservatively that each word of cache data represents a memory request, then even a
relatively small, simple cache keeps almost 85% of traffic off of the memory bus. In fact, caches
typically load several words at a time, while many CPU memory requests are for data types smaller
than a word. Therefore, data caches can make far more efficient use of available memory bandwidth
than registers alone. Even very clever register management would have trouble competing with
anything but the smallest caches, so even information appliance-type platforms might benefit from
implementing an actual cache.
The remainder of this section examines in more detail how the basic parameters of a simple cache
(cache size, set associativity, and line size) affect memory traffic as well as misses. All of the caches in
our simulations implement a write-back policy on replacement. Not only is this the most popular write
policy in actual implementations, but it typically generates far less memory traffic than write-through, the
best alternative. Write-allocate, where a write miss causes the loading of the associated line, is also a
common feature of all simulated caches.
If we assume a ``perfect'' cache, the traffic required to support decoding is equal to the size of encoded
stream, which is read in from memory, plus the decoded video data, which is written back. This
absolute minimum cache-memory traffic is used as a yardstick to compare the performance of different
configurations in our simulations.
6.1 Cache Size
The size of a cache is its most significant design parameter, certainly from a cost standpoint. Because
cache size usually increases or decreases by factors of two, the decision of how large a cache to
implement in a system is pivotal. For Main Level MPEG-2 decoding, the cache-memory traffic as a
function of cache size and associativity shows little variation between video sequences. Figure 4
shows a representative plot, assuming a standard line size of 32 bytes. This is superimposed on a
surface showing the minimum possible cache-memory traffic for the sequence, consisting of the
encoded and decoded video streams combined.

Figure 4. Typical cache-memory traffic for MPEG-2 Main Level decoding, over minimum possible
traffic level
The most prominent feature of the traffic function is a large plateau, at a level of 6.3 times the minimum
value. Except for direct-mapped caches, which produce a peak of 16 times the minimum, cache-
memory traffic changes little with increasing cache size for most of the range. Traffic starts to roll off at
1 Mbyte, and plummets at 2 Mbytes. This reflects the 2 Mbyte size of the decoder data set. Larger
caches show negligible improvement, with the additional space providing no extra benefit. However,
the smallest measured value is still almost 2 times higher than the minimum possible value.
These data imply that a 32 Kbyte cache is just as good - or bad - for MPEG-2 Main Level decoding as
a 64 Kbyte or 128 Kbyte one, for example. Improving on cache-memory traffic requires a 1 Mbyte
cache or higher, but a 2 Mbytes or larger cache only brings traffic down to double the absolute
minimum, leaving plenty of room for improvement.
<-- Cache Simulation Results
6.2 Associativity
Increasing set associativity is a popular method for getting higher performance out of smaller caches.
The memory traffic effects of varying set associativity are also visible in Figure 4. Going from a direct-
mapped cache to a 2-way set-associative one can reduce memory traffic by as much as 50% for small
caches. Increasing associativity to 4 can squeeze out almost another 10% improvement over the
direct-mapped case. Set sizes of greater than 4, however, show minimal benefit across all cache
sizes. Since increasingly higher levels of associativity add considerably to cache cost, complexity and
access time, such enhancements are not justified.
While memory traffic is the primary focus, we would prefer to not adversely affect miss rates. As a
function of cache size and set associativity, miss rates show the same behavioral pattern as memory
traffic. However, unlike for cache-memory traffic, the performance is more than satisfactory, with an
average value below 0.5%.
6.3 Line Size
Line size, another fundamental parameter, is less costly to experiment with than cache size. Subblock

placement can help decouple the size of cache lines and that of the memory bus. Unfortunately, there is
contention between miss rate and memory traffic minimization. Low miss rates call for larger lines than
the typical 32 bytes, as illustrated in Figure 5. Larger lines tend to provide superior spatial locality, but
require more data to be read and possibly written back on a miss. For this reason, minimal memory
traffic occurs with the smaller lines. This relationship is readily apparent in Figure 6, where for the
smallest caches, the largest line sizes lead to cache-memory traffic almost 200 times the absolute
minimum.
Figure 5. Typical miss rates for MPEG-2 Main Level decoding over different cache sizes
Figure 6. Typical cache-memory traffic for MPEG-2 Main Level decoding over different cache sizes
Balancing the demands of miss rates and memory traffic requires further investigation. For the time
being, maintaining the ubiquitous 32 byte line size seems sensible. The interests of memory traffic
reduction argue against a switch to larger values.
6.4 Summary
It is clear that conventional cache techniques provide limited relief from excess cache-memory traffic.
The simplest way to improve memory traffic and miss rates is to have a very large cache - but a large

cache is not always feasible, and simulations show that improving MPEG-2 decoding performance
requires very big caches. In any case, one would prefer to extract better performance from smaller,
lower-cost resources, improving performance through increased efficiency rather than brute force.
7 Decoder Internal Analysis
This section explores the internal functioning of a software-based MPEG-2 video decoder, as a first
step to explaining and rectifying its suboptimal use of cache resources. First, we dissect the different
kinds of data used by the decoder, including compressed input data, image output data, and all of the
intermediate types. Second, we examine how these data objects are used in the process of decoding,
and how this affects the performance of the cache.
7.1 Data Set Composition
The data set of an MPEG-2 decoder - the information that must be available to the program in the
course of its execution - consists of several different data types which serve distinct purposes. As a
result, they are quite heterogeneous with respect to the types and patterns of access utilized by the
decoder, as well as the amount of memory space required. The following classes account for the global
and static local values in user space.
Input. The compressed MPEG-2 sequence; data are read in series from a fixed-size buffer and
refreshed by system calls.
Output. Uncompressed picture data in YCrCb format, stored in a video window image buffer.
This data type is write only -- written but never read by the CPU. System calls transfer each
completed picture into the frame buffer.
Tabular. Static, read only information used in the MPEG decoding process, such as various
lookup tables.
Reference. The current frame and the previously decoded frames used to reconstruct it (up to two
for a B frame), in YCrCb form.
Block. The DCT coefficient and pixel values for a single macroblock.
State. Values incidental to the settings and operation of the decoder, yet not part of the image
data per se.
This partitioning of data types may not be explicit within the decoder source, but represents an
abstraction at a higher level of the data needed for decoder operation. Most of these data types are
both read and written. The exceptions are the tabular data, which are read only, and the the output data
which is write only. Within a memory hierarchy, reading and writing are not symmetric operations. Read
misses and write misses in caches have very different latencies. The fact that some MPEG decoder
data types are only read or written can be exploited for performance optimization.
The essential properties of the different data types are summarized in Table 2. Note how in terms of
size, the reference and output types completely dominate the others. With respect to caching, this
means that their presence will tend to repeatedly expel the other types from the cache, except for very
large data caches. It doesn't matter how many times the other values are re-used. Capacity limitations
alone will insure that reference and output data, when updated, will throw the other data types out.
Fraction of

Data Type Access Type Size References
Input read/write 2 KB 2.7%
Output write only 500 KB 3.9%
Tabular read only 5 KB 5.5%
Reference read/write 1500 KB 23.7%
Block read/write 1.5 KB 31.4%
State read/write 0.5 KB 25.6%
Table 2. Summary of decoder data types, sizes, access types, and proportion of memory references
accounted for
Ranking the data types in terms of the number of references rather than by size gives a very different
picture. Block and state data account for by far the most memory requests. Reference data is next, but
output data is in next to last place, accounting for 8 times fewer memory requests than block data. Note
that the percentages don't add up to 100%; approximately 7.2% of data references are due to
temporary variables and library functions.
<-- Decoder Internal Analysis
7.2 Decoder Program Flow
The hierarchical composition of the MPEG bitstream, and its intrinsic sequentiality constrain the
structure of the decoder program. Figure 7 shows the phases of operation for a single GOP from the
perspective of data accesses. As the diagram makes evident, execution proceeds as a set of nested
loops corresponding to different data levels in the bitstream as shown in Figure 3. The ``initialization''
and ``termination'' blocks represent the operations excluded by steady-state simulation.

Figure 7. Model of MPEG-2 software decoder showing phases of operation for a GOP
Table 3 shows which data types are accessed in the different phases of operation. Parsing header
information and reading in block data involves operating on relatively small data types, and results in a
small active cache set. However, during reconstruction, the decoder accesses portions of the reference
frames and copies them into the frame currently under construction. The Inverse DCT (IDCT) phase
focuses on smaller data types, but merging the decoded pixels back into the new picture requires once
again accessing large data types. Writing the completed frame requires traversing significant portions
of reference and output data. Notice that the decoder accesses state data on a fairly continuous basis.
Phases of
Operation
Data Types
input output tabular reference block state
read headers X X X
read block X X X X
reconstruct MB X X
IDCT X X X
merge MB data X X X
write frame X X X
Table 3. Data types accessed in each phase of decoder operation

Each iteration through the macroblock loop accesses a new portion of one or two reference frames.
Each new P or B frame repeats this traversal through its reference frame(s), recalling significant
fractions ofeach frame back into the cache. During the ``write frame'' phase, the decoder spends most
of its time copying the current frame from the reference space to the output space. For all but the
largest caches, this has the effect of flushing out almost everything except video data for the next
displayed picture. The other data types will have to be reloaded for each new picture, or possibly
several times a picture for small caches, resulting in wasted cache-memory traffic.
7.3 System and Audio Interaction
Our simulations assume that MPEG video decoding is the only active process. In reality, video is
usually viewed in conjunction with audio data, which requires the concurrent execution of an MPEG
audio codec and an MPEG system parser to de-multiplex and synchronize the two media. We expect
that running these other processes will not dramatically affect our results, since the amount of data they
operate on at any given time is relatively small. The net effect is that of having a slightly smaller cache,
and our simulations show that memory traffic levels are fairly insensitive to small changes in cache size.
Nevertheless, the presence of these other programs is a good argument for keeping the cache
footprint of the video data as small as possible.
8 Cache-Oriented Enhancements
Our analysis of decoder behavior and data composition motivates a specific approach to performance
optimization, where the goal is to improve cache efficiency for video decoding without adversely
impacting other performance metrics or other applications. In particular, there are great potential
benefits in treating different data types distinctly, guided by their different sizes, access types, and
access patterns.
For example, write-only values, like video output data, are clearly a waste of cache space. Likewise,
while block and state data account for most of the memory requests, their relatively small size means
that larger data types, like reference data, systematically crowd them out of the cache. Preventing
blatant cache pollution and the predation of small data types with high re-use can yield considerable
cache-memory traffic savings. To this end, we are evaluating the following traffic-reduction techniques.
Selective Caching. Exclude specific data from the cache.
Cache Locking. Reserve selected cache lines for particular data objects; lines are untouched by
cache replacement while locked.
Scratch Memory. Implement a small portion of addressable system memory on the processor -
not as cache - for storage of small, frequently-used data.
Data Reordering. Perform cache-conscious modifications of decoder memory accesses.
Cache Partitioning. Allocate different sections of the cache to different data types.
Many of these methods are relatively simple to implement, and most have precedents in other
architectures since they have performance benefits beyond video decoding. Perhaps their efficacy for
MPEG-2 processing might encourage their broader adoption in architectures. In any case, the main
challenge is to successfully apply these various techniques in a manner that enhances decoder

challenge is to successfully apply these various techniques in a manner that enhances decoder
performance while avoiding excessive cost and complexity. The remainder of this section considers
each method on its own.
8.1 Selective Caching
Data objects with no possibility of re-use steal cache space from other data, which leads to excess
memory traffic when the excluded data are loaded in once again. Video output data is a perfect
example, since it is written but never read by the CPU, and occupies a fairly large amount of space.
Bypassing the cache entirely in favor of direct storage to main memory promises more efficient use.
In our simulations, we have found that excluding video output data alone from the cache reduces cache-
memory traffic by up to 50%. Across the configurations considered, not caching video output yields a
25% reduction on average. Improvements in cache-memory traffic are global, yet the shape of the
curve is not substantially modified. The plateau out to 512 Kbytes persists, as does the swift drop down
at 2 Mbytes. Excluding output data is even more helpful for miss rates, which drop by a maximum of
85% from earlier levels, and 60% on average.
There are several distinct ways of implementing this feature. In the PA-RISC architecture, the load and
store instructions can contain ``locality hints'' which signal that the data referred to is unlikely to be used.
If supported by the particular implementation, the processor can elect to bypass the cache with the
data [7]. The UltraSPARC has block load and store instructions, which move groups of data directly
between registers and main memory at high speed [10]. This approach allows for potentially more
efficient use of processor bandwidth. Finally, the PowerPC architecture provides for marking areas of
memory as non-cacheable. Once marked, data is transparently and automatically transferred between
registers and memory buffers. The cache hint and block load/store approaches require more explicit
programming effort.
<-- Cache-Oriented Enhancements
8.2 Cache Locking
One way to protect small but frequently reused data types, like input, state, and tabular values, from
victimization by raw video data is to lock the parts of the cache which contain the vulnerable data. While
locked, these cache lines are invisible to the replacement algorithm, and the contents will not be thrown
out only to be re-loaded when needed again.
This is accomplished by special machine instructions which execute the locking and unlocking. There
are two basic variations on this technique. Static locking simply freezes the tag and contents of the
affected line, allowing for the writing of values but not replacement; the line is associated with the same
portion of main memory until unlocked. Dynamic locking is somewhat more flexible, treating locked
lines as an extension of the register set, with special instructions to copy contents directly to and from
main memory. This scheme is used in the Cyrix MediaGX processor to speed the software emulation
of legacy PC hardware standards.
8.3 Scratch Memory
Another means of providing a safe haven for vulnerable data is to put them in a separate memory.
Many processors today consist to a large extent not of logic but SRAM, in the form of caches. It is easy
to imagine implementing a portion of the memory space itself on the die - not as a cache but
addressable memory. Data which are subject to being prematurely ejected from the cache could then

be kept in this space. The components of the MPEG-2 data which would most benefit from this are
relatively small, so only a few KBytes would be required. This ``scratch memory'' would be relatively
inexpensive to implement in hardware, smaller than most caches, and far simpler than any of them due
to the lack of tags and logic for lookup, replacement, etc. Similar memories have actually been a
featured in Intel microcontrollers for quite some time, for storing Page 0 of the address space.
A scratch memory provides potentially very fast access to important data, since there is no possibility
of a read or write miss, and the delay of searching tags can be eliminated. Intel microcontrollers even
use special addressing modes to access Page 0 memory which are more concise and faster than
those used for off-chip addresses. Unlike with cache locking, cache performance and capacity are not
compromised since the scratch memory is off to the side. Of course, getting a hardware feature used in
embedded systems, where operating systems and even compilers are optional, to work effectively in
an interactive, multitasking environment will take careful consideration, but the potential benefits are
considerable.
8.4 Data Reordering
This category of enhancement is a departure from our standard procedure of not tampering with the
decoder itself. Our goal is not to improve computational efficiency, a task which we leave to the many
others working on that problem. However, we suggest that modifications of specifically how the
decoder software manages and addresses memory can provide significant improvements in cache
efficiency - some of which may require hardware support to properly implement.
For example, there is strictly speaking no essential need for the output data type. B frame data,
considered separately, is never used by the decoder to construct other frames. The only time it is read
is for copying to the output buffer space. I and P pictures need to remain available to serve as
reference frames. Yet it turns out that in the natural flow of decoder data, by the time these frames are
copied to the output space they are no longer needed. If pictures could be transferred directly by DMA
from the reference frame space where they are stored, rather than routed through the output data, this
would result in greater efficiency. It would eliminate all of the cycle-consuming copying of output data,
and transform the B frame data into write-only values which could be efficiently excluded from the
cache.
8.5 Cache Partitioning
It is possible to take the basic idea of cache locking in a slightly different direction. With cache
partitioning, particular data types are relegated to specific sections of the cache, which continue to
behave like cache. However, rather than locking in the physical address associated with a line, or
treating cache lines as extensions of the register set, it is as if each data type has its own separate,
smaller cache space for storing its values.
This method provides very precise control over cache behavior, and an extremely powerful tool for the
optimal management of cache resources. However, it requires a level of hardware complexity not
demanded by the methods discussed previously. Also, the task of efficiently exploiting the available
capabilities in software is considerably more challenging.

8.6 Summary
None of the methods proposed are mutually exclusive, although some of them, like cache locking and
scratch memory, are arguably redundant if used in conjunction. Whether one of these techniques or
some combination of them provides the most efficient solution remains to be seen. Also, each one
raises issues of implementation cost and complexity, interaction with the operating system and other
processes, and software interfacing and portability which need to be more thoroughly addressed. We
will investigate these issues and perform a more detailed evaluation of these architectural
enhancements in future work.
9 Conclusions
The effectiveness of computer memory subsystems, including caches, is currently a significant barrier
to improved performance on multimedia applications. We have documented the memory traffic and
data cache behavior of MPEG-2 video decoding on a general-purpose computer, and demonstrated
that standard caches produce a significant excess of cache-memory traffic. While almost any cache is
superior to none - even simple caches can reduce required memory bandwidth considerably -
experimenting with basic cache parameters like cache size, associativity, and line size has a limited
ability to reduce cache-memory traffic. The best value than can be achieved is double the absolute
minimum traffic required. However, cache-oriented architectural enhancements, driven by an
understanding of decoder behavior, can dramatically improve cache efficiency. These enhancements
include selective caching, scratch memory, data reordering, and cache partitioning. All have in
common an emphasis on treating different elements of the data distinctly, accommodating their unique
properties - size, usage patterns, temporal locality, etc. This approach provides for optimal
management of available cache resources. We expect that further refinement of these techniques will
enable improved video decoding performance of general purpose microprocessors, and encourage
the broader proliferation of digital video platforms and applications.
10 Acknowledgments
This research was supported in part by NSF grant CCR-9696196. Miriam Leeser is the recipient of an
NSF Young Investigator Award. The research is also supported by generous donations from Sun
Microsystems. The authors would like to thank Shantanu Tarafdar and Brian Smith for useful
discussions.
References
[1]

Stefan Eckart et al. ISO/IEC MPEG-2 software video codec. In Proc. Digital Video
Compression: Algorithms and Technologies 1995, pages 100-109. SPIE, Jan. 1995.
[2]
Peter N. Glaskowsky. Fujitsu aims media processor at DVD. Microprocessor Report, 10:11-13,
November 1996.
[3]
Linley Gwennap. Mitsubishi designs DVD decoder. Microprocessor Report, 10:1,6-9, December
1996.
[4]
International Organization for Standardization. ISO/IEC JTC1/SC29/WG11/602 13818-2
Committee Draft (MPEG-2), Nov. 1993.
[5]
Paul Kalapathy. Hardware-software interactions on MPACT. IEEE Micro, 17(2):20-26, Mar./Apr.
1997.
[6]
James R. Larus. Efficient program tracing. IEEE Computer, 26(5):52-61, May 1993.
[7]
Ruby B. Lee. Subworld parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.
[8]
Ketan Patel et al. Performance of a software MPEG video decoder. In Proc. 1st ACM Intl. Conf.
on Multimedia, pages 75-82. ACM, Aug. 1993.
[9]
Alex Peleg et al. Intel MMX for multimedia PCs. Comm. of the ACM, 40(1):25-38, Jan. 1997.
[10]
Marc Tremblay et al. VIS speeds new media processing. IEEE Micro, 16(4):10-20, August 1996.
[11]
Daniel F. Zucker et al. A comparison of hardware prefetching techniques for multimedia.
Technical Report CSL-TR-95-683, Stanford University Depts. of Electrical Engineering and
Computer Science, Stanford, CA, Dec. 1995.

Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_1373693063

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (18)

Destacado

Destacado (10)

Similar a Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_1373693063

Similar a Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_1373693063 (20)

Último

Último (20)

Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_1373693063