ACM Multimedia 97 - Electronic Proceedings
November 8-14, 1997
Crowne Plaza Hotel, Seattle, USA
Optimizing the Data Cache Performance of a
Software MPEG-2 Video Decoder
Peter Soderquist
School of Electrical Engineering
Dept. of Electrical and Computer Engineering
Cornell University
Ithaca, New York 14853
(607) 255-0314
pgs@cs.cornell.edu
http://www.ee.cornell.edu/~pgs/
Miriam Leeser
Dept. of Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts 02115
(617) 373-3814
mel@ece.neu.edu
http://www.ece.neu.edu/
ACM Copyright Notice
Abstract
Multimedia functionality has become an established component of core computer workloads. MPEG-2
video decoding represents a particularly important and computationally demanding application
example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of
this and other multimedia algorithms. However, memory subsystem deficiencies have now become the
major barrier to increased performance, partly as a consequence of this improved CPU performance.
Decoding MPEG-2 video data in software makes significant bandwidth demands on memory
subsystems, demands that are seriously aggravated by cache inefficiencies. Conventional data caches generate
many times more cache-memory traffic than required, at best double the minimum necessary to
support decoding. Improving efficiency requires understanding the behavior of the decoder and
the composition of its data set. We provide an analysis of the memory and cache behavior of software
MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer
relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive
handling of different data types can reduce traffic by 50 percent or more.
Keywords
MPEG-2, digital video, caches, computer architecture
Table of Contents
1 Introduction
2 Background
3 MPEG Overview
4 Related Work
4.1 Media Processors
4.2 SIMD Instruction Extensions
4.3 Video/Graphics Data Buses
4.4 Prefetching
4.5 Our Approach: Cache-Oriented Enhancements
5 Experimental Setup
5.1 Machine and Software Environment
5.2 Steady-State Simulation
5.3 Color Space and Bandwidth
6 Cache Simulation Results
6.1 Cache Size
6.2 Associativity
6.3 Line Size
6.4 Summary
7 Decoder Internal Analysis
7.1 Data Set Composition
7.2 Decoder Program Flow
7.3 System and Audio Interaction
8 Cache-Oriented Enhancements
8.1 Selective Caching
8.2 Cache Locking
8.3 Scratch Memory
8.4 Data Reordering
8.5 Cache Partitioning
8.6 Summary
9 Conclusions
10 Acknowledgments
References
1 Introduction
Multimedia computing has become a practical reality, and computer architectures are changing in
response. Extensions for continuous media processing are a current or planned part of every market-
leading instruction set architecture, as shown in Table 1. The primary feature of these extensions is
SIMD-style processing of small data types. While contemporary microprocessor datapaths are 32 or
64 bits wide, multimedia data usually consists of 16-bit or 8-bit integers. Furthermore, multimedia
algorithms commonly feature an abundance of data parallelism. Therefore, modifying wide datapaths
to process multiple small data types in a single instruction promises significant performance
improvements.
Architecture   ISA Extension
x86            MMX (MultiMedia eXtensions)
SPARC          VIS (Visual Instruction Set)
PA-RISC        MAX-2 (Multimedia Acceleration eXtensions)
Alpha          MVI (Motion Video Instruction)
MIPS           MDMX (MIPS Digital Media eXtensions)
PowerPC        VMX (Video and Multimedia eXtensions)*
Table 1. ISA multimedia extensions for popular architectures (* = proposed)
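The idea behind SIMD-style processing of small data types can be illustrated in software. The following is a minimal sketch, not taken from any of the instruction sets in Table 1: it treats a 64-bit integer as four independent 16-bit lanes and adds all four with a handful of word-wide operations. The names `pack`, `unpack`, and `packed_add` are hypothetical, and the wrap-around behavior on overflow is an assumption (real extensions typically also offer saturating variants).

```python
LANES = 4
LANE_BITS = 16
WORD_MASK = (1 << (LANES * LANE_BITS)) - 1
# Mask with the top bit of every 16-bit lane set (0x8000800080008000).
HIGH_BITS = sum(1 << (LANE_BITS * i + LANE_BITS - 1) for i in range(LANES))
LOW_BITS = WORD_MASK ^ HIGH_BITS


def pack(values):
    """Pack four 16-bit integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (LANE_BITS * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (LANE_BITS * i)) & 0xFFFF for i in range(LANES)]


def packed_add(a, b):
    """Add four 16-bit lanes at once, wrapping within each lane."""
    # Add the low 15 bits of every lane; a carry can reach the lane's own
    # top bit but can never spill into the neighbouring lane.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold the top bits back in with XOR, discarding each lane's carry-out.
    return (low ^ ((a ^ b) & HIGH_BITS)) & WORD_MASK


result = unpack(packed_add(pack([1, 2, 0xFFFF, 40000]),
                           pack([10, 20, 1, 30000])))
print(result)  # [11, 22, 0, 4464]
```

A hardware implementation performs the same lane-isolated addition in a single instruction, which is where the speedup on 8-bit and 16-bit multimedia data comes from.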
All of these extensions are targeted, at least partially, at MPEG-2 video decoding. Collectively, they
mount a formidable assault on its computational complexity. Unfortunately they largely fail to address
the significant challenge of MPEG processing for memory subsystems, which have become the
primary performance bottleneck. The high data rate, large set sizes, and distinctive memory access
patterns of MPEG exert a particular strain on caches. Manipulating standard parameters (cache size,
associativity, and line size) fails to cost-effectively reduce excess memory traffic. While miss rate levels
are acceptable, standard caches, even very large and highly associative ones, generate significant
excess cache-memory traffic. Multimedia instruction extensions actually exacerbate this problem by
enabling the CPU to consume and produce more data in fewer cycles.
This cache inefficiency is seriously limiting for desktop multitasking systems. A foregrounded video
process can potentially cripple communications, print spooling, system management, network agent,
and other background tasks. The video playback may also suffer from dropped frames, jitter, blocking,
or other annoying artifacts. For low-power/cost systems and information appliances, cache inefficiency
can have a direct cost impact, requiring the use of faster or higher capacity components than strictly
necessary to achieve specified functionality at the required quality level. This can drive up system cost,
increase power consumption, or even prevent implementation.
To explore the cache efficiency problem and its solutions, we examine the traffic and data cache
behavior of an MPEG-2 decoder running Main Level streams on a general-purpose microprocessor.
Trace-driven cache simulations of typical sequences reveal the memory behavior of the decoder. We
show that, contrary to some predictions, there is a benefit to even relatively small, simple caches.
However, even very large and complex caches generate significant excess traffic. An analysis of the
data types utilized by the decoder and their patterns of access provides the basis for proposed
architectural enhancements. We also show that even relatively simple measures, such as selective
caching of specific data types, can dramatically improve efficiency, reducing cache-memory traffic by
50% or more for a wide range of cache sizes.
2 Background
Consider the problem of decoding MPEG-2 in software on a general-purpose computing platform. The
machine type of primary interest is a typical desktop PC or workstation, as envisioned in Figure 1. The
diagram also illustrates the typical flow of video data in the system. Imagine the user is viewing, for
example, an entry in a multimedia encyclopedia. Compressed data is transferred from the file system
(e.g. CD-ROM or hard disk) by direct memory access (DMA) and buffered in the main memory. The
CPU reads the data through its cache hierarchy and writes the decoded, dithered RGB video, one
frame at a time, to main memory where it is stored in an image of the displayed video window. Finally,
another DMA transfer brings the video data to the frame buffer where it eventually appears on the
display.
Figure 1. Typical desktop machine with software MPEG-2 decoding traffic
It is clear that there is a large amount of data, much of it time-sensitive, being transferred through the
system. Most of this is concentrated on the main system bus. At the very least, there are two streams
each of encoded and decoded video being concurrently transferred in and out of main memory,
amounting to a minimum sustained load of 63 Mbytes/s. This ignores the bandwidth required by other
applications running on the system and sharing interconnect, main memory, and peripherals. Any
excess memory traffic generated by cache inefficiency will further exacerbate this situation.
Unfortunately, our simulations indicate that standard caches generate at least twice the minimum
required level of cache-memory traffic, typically many times more, amounting to a significant strain on
system capacity. The result is some combination of dropped frames, reduced frame rate, and
degradation of overall system performance.
3 MPEG Overview
Appreciating the challenges of supporting video decoding requires some understanding of the MPEG
standard [4]. MPEG attacks both the spatial and temporal redundancy of video signals to achieve
compression. Video data is broken down into 8 by 8 pixel Blocks and passed through a discrete
cosine transform (DCT). The resulting spatial frequency coefficients are quantized, run-length encoded,
and then further compressed with an entropy coding algorithm. To exploit temporal redundancy, MPEG
encoding uses motion compensation with three different types of frames. I (intra) frames contain a
complete image, compressed for spatial redundancy only. P (predicted) frames are built from
16 by 16 fragments known as macroblocks. These consist primarily of pixels from the closest previous
I or P frame (the reference frame), translated as a group from their location in the source. This
information is stored as a vector representing the translation, and a DCT-encoded difference term,
requiring far fewer bits than the original image fragment. B (bidirectional) frames can use the closest
two I or P pictures - one before and one after in temporal order - as reference frames. Information not
present in reference frames is encoded spatially on a block-by-block basis. All of the data in P and B
frames is also subject to run-length and entropy coding.
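The motion compensation step for a predicted macroblock can be sketched as follows. This is an illustrative simplification rather than the decoder's actual code: it assumes whole-pixel motion vectors, a single 8-bit component stored as a list of rows, a difference term that has already been inverse-transformed, and no bounds checking. The function name `predict_block` and its parameters are hypothetical.

```python
def predict_block(reference, x, y, dx, dy, residual, size=16):
    """Rebuild one size-by-size block of a predicted frame.

    reference: 2-D list of 8-bit samples from the reference frame
    (x, y):    top-left corner of the block in the frame being built
    (dx, dy):  motion vector pointing into the reference frame
    residual:  size-by-size difference term, already inverse-transformed
    """
    block = []
    for row in range(size):
        out_row = []
        for col in range(size):
            # Fetch the translated pixel from the reference frame.
            predicted = reference[y + dy + row][x + dx + col]
            # Add the difference term and clamp to the 8-bit range.
            value = predicted + residual[row][col]
            out_row.append(max(0, min(255, value)))
        block.append(out_row)
    return block


# Tiny 4x4 reference frame and a 2x2 block, for illustration only.
reference = [[r * 4 + c for c in range(4)] for r in range(4)]
print(predict_block(reference, 0, 0, 1, 2, [[1, -1], [300, -300]], size=2))
# [[10, 9], [255, 0]]
```

Every output pixel requires a read from a reference frame at a location determined by the vector, which is why this phase touches large, scattered regions of memory.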
Figure 2. Typical sequence of MPEG frames, showing interframe dependencies
Figure 2 shows the interframe dependencies of the different frame types, superimposed on the
displayed frame order. For decoding, these frames must be processed in the non-temporal order
[1,4,2,3,7,5,6,10,8,9], which is a result of these dependencies. Interframe dependencies and the
properties and sequence of frame types determine in critical ways the flow pattern of MPEG data and
the nature of the hardware support required.
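The reordering can be reproduced with a short sketch. It assumes the frame-type pattern I B B P B B P B B I in display order, which is consistent with the decode order quoted above but is not spelled out in the text; the function name `decode_order` is hypothetical.

```python
def decode_order(frame_types):
    """Map display-order frame types ('I', 'P', 'B') to 1-based decode order."""
    order = []
    pending_b = []  # B frames waiting for their later reference frame
    for number, kind in enumerate(frame_types, start=1):
        if kind == "B":
            pending_b.append(number)
        else:
            # An I or P frame must be decoded before the B frames that
            # precede it in display order, since they depend on it.
            order.append(number)
            order.extend(pending_b)
            pending_b = []
    # Trailing B frames would need a reference beyond the input; emit as-is.
    order.extend(pending_b)
    return order


print(decode_order("IBBPBBPBBI"))  # [1, 4, 2, 3, 7, 5, 6, 10, 8, 9]
```

Each B frame is held back until both of its reference frames have been decoded, so the decoder must keep up to two reference frames resident while it works.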
Figure 3. MPEG bitstream structure
The decoder reads the MPEG data as a stream of bits. Unique bit patterns, known as startcodes, mark
the division between different sections of the data. The bitstream has a hierarchical structure, shown in
simplified form in Figure 3. A Sequence (video clip) consists of groups of pictures (GOPs). A GOP
contains at least one I frame and typically a number of dependent P and B frames; Figure 2 shows a
possible GOP of 9 frames (pictures) followed by the I frame of the subsequent GOP. Pictures consist of
collections of macroblocks called slices. The "headers" shown at each level contain parameters
relevant to each bitstream type but no actual image data.
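As an illustration of how startcodes delimit the bitstream levels, the sketch below scans a byte string for the MPEG startcode prefix (the three bytes 0x00 0x00 0x01) and classifies the byte that follows. The code values used (0xB3 sequence header, 0xB8 group of pictures, 0x00 picture, 0x01 to 0xAF slice) come from the MPEG-2 video specification rather than from this paper, and the helper names are hypothetical. It is a simplified scanner, not a bitstream parser.

```python
START_CODE_PREFIX = b"\x00\x00\x01"


def classify(code):
    """Name the bitstream level introduced by a startcode value."""
    if code == 0x00:
        return "picture"
    if 0x01 <= code <= 0xAF:
        return "slice"
    if code == 0xB3:
        return "sequence_header"
    if code == 0xB8:
        return "group_of_pictures"
    return "other"


def find_startcodes(data):
    """Yield (byte_offset, name) for every startcode found in data."""
    pos = data.find(START_CODE_PREFIX)
    # Stop when no prefix is left or the code byte would fall past the end.
    while pos != -1 and pos + 3 < len(data):
        yield pos, classify(data[pos + 3])
        pos = data.find(START_CODE_PREFIX, pos + 3)


# Synthetic example: sequence header, padding, GOP, picture, slice.
stream = (b"\x00\x00\x01\xb3" + b"\xff" * 4 + b"\x00\x00\x01\xb8"
          + b"\x00\x00\x01\x00" + b"\x00\x00\x01\x01")
print(list(find_startcodes(stream)))
```

The nesting of the names in the output mirrors the hierarchy of Figure 3: sequence, then GOP, then picture, then slice.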
4 Related Work
Deficiencies in the MPEG-2 decoding performance of systems are being actively addressed. This
section discusses several significant methods for performance enhancement under investigation by
researchers both in industry and academia. Some of these are merely proposals, while others have
already been implemented in hardware or are forthcoming. We explore the benefits and disadvantages
of each approach, and assess its impact on the problem of excess cache-memory bandwidth.
4.1 Media Processors
A new breed of microprocessor, the so-called "media processor," has emerged in just the past few
years. These chips combine programmability with specialized hardware to support multiple concurrent
multimedia operations, including MPEG-2 decoding, and usually reside on peripheral cards in desktop
systems. Their manufacturers, which include Chromatic Research, Fujitsu, Mitsubishi, Philips, and
Samsung, claim that media processors, rather than the host CPU, are the best present and long-term
solution for providing multimedia functionality in computer systems.
To their advantage, media processors yield reduced memory bandwidth requirements, since they deal
in compressed data, and take a significant computational load off the host. They are also good at
handling the real-time constraints of multimedia, while present-day operating systems and memory
hierarchies are not. Finally, present-day multimedia functions consist of a limited set of operations, tied
to international standards, which makes them an efficient target for highly specialized solutions.
However, performing multimedia functionality with the host processor is inherently less expensive than
using extra specialized hardware. Memory systems are also improving to meet the needs of
multimedia, as illustrated by Intel's AGP and Direct RDRAM initiatives. Operating systems are also
likely to become more real-time oriented as they evolve. More fundamentally, multimedia data are a
new first-order data type, on the same order as floating-point. These functions are becoming central to
what computers do, not "peripheral" like mass-storage or printing. As applications evolve, media and
other operations will become more and more interleaved - someday viewing a video clip will be as
seamless as reading e-mail or editing a spreadsheet. Host multimedia processing best supports the
construction of highly integrated, fully featured, yet portable applications. Hardware integration trends
and the powerful vested interest of CPU manufacturers also favor this outcome.
Nevertheless, parts of video processing are likely to remain in specialized hardware for quite some
time, such as color space conversion, scaling, and dithering. These are straightforward to implement in
hardware, and to a large extent not data-dependent and therefore not subject to pipeline hazards and
other impediments. These are more akin to truly "peripheral" operations.
4.2 SIMD Instruction Extensions
Multimedia instruction extensions such as MMX, VIS, MAX-2, and the others in Table 1, achieve
speedup primarily through performing SIMD-style operations on multimedia data types. These ISA
enhancements are an essential step to improving microprocessor performance on multimedia
applications - as well as ensuring the continued viability of host processing. However, these operations
tend to increase rather than relieve the strain on memory bandwidth. By consuming more operands per
unit time, multimedia instructions expose the underlying weakness of the cache, system bus, and main
memory. The caches in the MMX-enhanced Pentium are double the size of those in the earlier Pentium
for this very reason. Intel researchers even blame the lower-than-expected MPEG speedup with MMX on
memory and I/O system deficiencies [9].
4.3 Video/Graphics Data Buses
There have been proposals for implementing special interconnect to transfer image data from the host
processor to the frame buffer, bypassing slower general-purpose I/O buses. The Intel Advanced
Graphics Port (AGP) is the most prominent such effort, intended to provide a direct link
between main memory and the graphics subsystem. While primarily meant to support the high
bandwidth of 3D graphics rendering on the host processor, it is a definite boon to video as well. This
method provides another significant way to boost the capabilities of host multimedia processing
relative to custom hardware solutions. However, it does nothing in and of itself to alleviate cache-
memory traffic waste or reduce overall memory-system bandwidth needs. The undiminished
requirement for extremely high-bandwidth memory is reflected in Intel's support of the Rambus
architecture for future PC memory systems.
4.4 Prefetching
Hardware prefetching has been suggested as a remedy for inadequate MPEG performance [11]. Yet it
is primarily promoted as a means to reduce miss rates, without much consideration of cache-memory
traffic, which tends to increase with prefetching, especially under the more aggressive schemes.
Prefetching is essentially a form of latency hiding; for memory-bound problems, it merely exposes
underlying bandwidth problems, and tends to generate excess memory traffic of its own. Selectively
applied, prefetching may provide a boost to compute-intensive phases of MPEG and other multimedia
algorithms. However, the emphasis of our research is not so much on getting values into the cache
early, but keeping them around for as long as they are needed.
4.5 Our Approach: Cache-Oriented Enhancements
The methods discussed above largely bypass the issue of excess cache-memory traffic, while some of
them even exacerbate the problem. None of them directly target the issue of cache inefficiency. Even
prefetching focuses on reducing miss rates at the possible expense of memory traffic. We suggest a
different type of solution, driven by the internal dynamics of the decoder itself. Exploiting knowledge of
individual data types - their sizes, access types, and access patterns - can lead to effective
architectural solutions to cache inefficiency.
5 Experimental Setup
Solving the problem of excess cache-memory traffic requires first establishing its extent, and how
critical system parameters affect this behavior. Our results come from trace-driven cache simulations of
video stream decoding. The clips themselves are progressive MPEG-2 Main Level sequences, with a
resolution of 704 by 480 pixels at 30 frames/s, and a compressed bitrate of 6 Mbits/s. These are
comparable in perceptual quality to conventional analog broadcast (e.g. NTSC, PAL). Sequences are
chosen as representative of different types of programming.
5.1 Machine and Software Environment
For experimental purposes we use a simplified machine model, limiting the microprocessor to a single
level of caching. We further assume that MPEG-2 decoding is the only active process. The decoder is
a version of mpeg2decode from the MPEG Software Simulation Group [1], running on a SuperSPARC
processor under Solaris. This decoder has the advantage of portability and relative system
independence, with the disadvantage that performance is suboptimal for any given architecture. Since
it originates with the body responsible for MPEG, it is closely written to comply with that standard.
Finally, it has the essential feature for research purposes of being available in source code form.
We have chosen to treat the decoder implementation as a given, implementing minimal modification to
support integration with our simulation tools. Trace generation is performed by QPT2 (the Quick
Profiler and Tracer) [6], a component of the Wisconsin Architectural Research Toolkit (WARTS). We
have implemented the cache simulator, Pacino, to support the unique qualities of continuous media
applications. These are considerably different from conventional benchmarks of the SPEC variety in
their consumption and production of data, and interaction with the operating system.
5.2 Steady-State Simulation
To limit the considerable storage and CPU cycle requirements of trace-driven simulation, we restrict
each run to one GOP, the largest logical bitstream unit below a sequence. Simulations run as if at
steady state - i.e. preceded and followed by a large number of other GOPs, as if plucked from the
middle of the sequence. This eliminates the side effects of program startup and termination. To
achieve this, the results of running initialization code are not logged, the cache is primed with data to
avoid the effects of cold start, and dirty lines are not flushed at the end of the GOP.
5.3 Color Space and Bandwidth
The simulation framework implements one significant performance optimization beyond the generic
hardware model. Decoded video is sent to the frame buffer in 4:2:0 YCrCb form, the native
representation of MPEG, rather than 4:4:4 RGB. YCrCb is a color representation with one luminance
(grayscale) and two chrominance components per pixel, rather than one each of red, green, and blue. In
4:2:0 YCrCb, the chrominance components are subsampled in both dimensions. We assume that
upsampling, color space conversion, and any dithering are done on the fly by the video display
subsystem. These features are becoming common in workstations and budget PC video accelerators
alike. According to prior work on MPEG-1 decoders, this processing can account for almost 25% of
total execution time if performed in software [8]. In addition, the 4:2:0 representation of an image is only
half the size of the 4:4:4 version, reducing the minimum sustained system bus load to 31.5 Mbytes/s.
Finally, these conversion operations account for so many memory and instruction references that they
make trace-driven simulation prohibitive - a sufficiently compelling reason to avoid them.
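The halving of the frame size, and the resulting system bus load, can be checked with a back-of-the-envelope calculation. The sketch below assumes 8 bits per sample (3 bytes per pixel for 4:4:4, 1.5 bytes per pixel for 4:2:0), the 704 by 480, 30 frames/s, 6 Mbits/s stream parameters given earlier, and that each stream crosses the bus twice, once into and once out of main memory. How the quoted bus-load figures were actually derived is not stated in the text, so the small difference from them may reflect rounding or accounting details.

```python
WIDTH, HEIGHT, FPS = 704, 480, 30
ENCODED_BITS_PER_S = 6_000_000

# Assumed storage cost per pixel, with 8-bit samples.
BYTES_PER_PIXEL = {"4:4:4": 3.0, "4:2:0": 1.5}


def frame_bytes(fmt):
    """Size of one decoded frame in bytes for the given sampling format."""
    return int(WIDTH * HEIGHT * BYTES_PER_PIXEL[fmt])


def bus_load_bytes_per_s(fmt):
    """Minimum sustained bus load: encoded plus decoded stream, each
    transferred into main memory once and out of it once."""
    encoded = ENCODED_BITS_PER_S / 8
    decoded = frame_bytes(fmt) * FPS
    return 2 * (encoded + decoded)


for fmt in ("4:4:4", "4:2:0"):
    print(fmt, frame_bytes(fmt), bus_load_bytes_per_s(fmt) / 1e6)
```

Under these assumptions a 4:2:0 frame occupies 506,880 bytes against 1,013,760 bytes for 4:4:4, and the estimated loads come to roughly 62.3 and 31.9 million bytes/s, close to the 63 and 31.5 Mbytes/s quoted in the text.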
6 Cache Simulation Results
Trace-driven cache simulations clarify how data requests from the MPEG-2 decoder translate into the
traffic seen by memory. First of all, there is a distinct benefit to caching video decoder data, even
using naive, generic schemes. It has occasionally been suggested that caches are critically inefficient
for video data. Accordingly, several media processors dispense with data caches altogether in favor of
SRAM banks - essentially large register files - managed by software [5,3,2]. Yet we have found that
despite the large set sizes and the magnitude of data consumed and discarded, there is sufficient re-
use of values for caching to significantly reduce the required memory bandwidth.
For example, decoding one Group Of Pictures from a typical MPEG-2 stream generates 2.1 × 10^8
memory requests. A 16 Kbyte, direct-mapped cache generates 3.9 × 10^7 words of memory traffic.
If we estimate conservatively that each word of cache data represents a memory request, then even a
relatively small, simple cache keeps almost 85% of traffic off of the memory bus. In fact, caches
typically load several words at a time, while many CPU memory requests are for data types smaller
than a word. Therefore, data caches can make far more efficient use of available memory bandwidth
than registers alone. Even very clever register management would have trouble competing with
anything but the smallest caches, so even information appliance-type platforms might benefit from
implementing an actual cache.
The remainder of this section examines in more detail how the basic parameters of a simple cache
(cache size, set associativity, and line size) affect memory traffic as well as misses. All of the caches in
our simulations implement a write-back policy on replacement. Not only is this the most popular write
policy in actual implementations, but it typically generates far less memory traffic than write-through, the
best alternative. Write-allocate, where a write miss causes the loading of the associated line, is also a
common feature of all simulated caches.
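The kind of cache modeled here can be sketched in a few lines. The class below is an illustrative stand-in, not the Pacino simulator: it implements a set-associative cache with write-back and write-allocate, and counts the bytes exchanged with memory. LRU replacement is an assumption, since the replacement policy is not stated in the text, and the class and attribute names are hypothetical.

```python
from collections import OrderedDict


class Cache:
    """Write-back, write-allocate, set-associative cache with LRU (assumed)."""

    def __init__(self, size_bytes, line_bytes=32, ways=1):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One OrderedDict per set: tag -> dirty flag, least recent first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0
        self.traffic_bytes = 0

    def access(self, address, is_write):
        self.accesses += 1
        line = address // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
        else:
            self.misses += 1
            if len(lines) >= self.ways:
                _, dirty = lines.popitem(last=False)  # evict the LRU line
                if dirty:
                    self.traffic_bytes += self.line_bytes  # write-back
            lines[tag] = False
            # Line fill; write-allocate means write misses load the line too.
            self.traffic_bytes += self.line_bytes
        if is_write:
            lines[tag] = True  # line now differs from memory


# Toy trace of (address, is_write) pairs; two addresses map to the same set.
trace = [(0, True), (64, False), (0, False)]
for ways in (1, 2):
    cache = Cache(size_bytes=64, line_bytes=32, ways=ways)
    for address, is_write in trace:
        cache.access(address, is_write)
    print(ways, cache.misses, cache.traffic_bytes)
```

On this toy trace the direct-mapped configuration incurs three misses and 128 bytes of traffic, including one write-back, while the 2-way configuration incurs two misses and 64 bytes. The trace is contrived to show a conflict and says nothing about the magnitudes seen with real decoder traces.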
If we assume a "perfect" cache, the traffic required to support decoding is equal to the size of the encoded
stream, which is read in from memory, plus the decoded video data, which is written back. This
absolute minimum cache-memory traffic is used as a yardstick to compare the performance of different
configurations in our simulations.
6.1 Cache Size
The size of a cache is its most significant design parameter, certainly from a cost standpoint. Because
cache size usually increases or decreases by factors of two, the decision of how large a cache to
implement in a system is pivotal. For Main Level MPEG-2 decoding, the cache-memory traffic as a
function of cache size and associativity shows little variation between video sequences. Figure 4
shows a representative plot, assuming a standard line size of 32 bytes. This is superimposed on a
surface showing the minimum possible cache-memory traffic for the sequence, consisting of the
encoded and decoded video streams combined.
Figure 4. Typical cache-memory traffic for MPEG-2 Main Level decoding, over minimum possible
traffic level
The most prominent feature of the traffic function is a large plateau, at a level of 6.3 times the minimum
value. Except for direct-mapped caches, which produce a peak of 16 times the minimum, cache-
memory traffic changes little with increasing cache size for most of the range. Traffic starts to roll off at
1 Mbyte, and plummets at 2 Mbytes. This reflects the 2 Mbyte size of the decoder data set. Larger
caches show negligible improvement, with the additional space providing no extra benefit. However,
the smallest measured value is still almost 2 times higher than the minimum possible value.
These data imply that a 32 Kbyte cache is just as good - or bad - for MPEG-2 Main Level decoding as
a 64 Kbyte or 128 Kbyte one, for example. Improving on cache-memory traffic requires a 1 Mbyte
cache or higher, but a 2 Mbyte or larger cache only brings traffic down to double the absolute
minimum, leaving plenty of room for improvement.
6.2 Associativity
Increasing set associativity is a popular method for getting higher performance out of smaller caches.
The memory traffic effects of varying set associativity are also visible in Figure 4. Going from a direct-
mapped cache to a 2-way set-associative one can reduce memory traffic by as much as 50% for small
caches. Increasing associativity to 4 can squeeze out almost another 10% improvement over the
direct-mapped case. Set sizes of greater than 4, however, show minimal benefit across all cache
sizes. Since increasingly higher levels of associativity add considerably to cache cost, complexity and
access time, such enhancements are not justified.
While memory traffic is the primary focus, we would prefer not to adversely affect miss rates. As a
function of cache size and set associativity, miss rates show the same behavioral pattern as memory
traffic. However, unlike for cache-memory traffic, the performance is more than satisfactory, with an
average value below 0.5%.
6.3 Line Size
Line size, another fundamental parameter, is less costly to experiment with than cache size. Subblock
placement can help decouple the size of cache lines and that of the memory bus. Unfortunately, there is
contention between miss rate and memory traffic minimization. Low miss rates call for larger lines than
the typical 32 bytes, as illustrated in Figure 5. Larger lines tend to provide superior spatial locality, but
require more data to be read and possibly written back on a miss. For this reason, minimal memory
traffic occurs with the smaller lines. This relationship is readily apparent in Figure 6, where for the
smallest caches, the largest line sizes lead to cache-memory traffic almost 200 times the absolute
minimum.
Figure 5. Typical miss rates for MPEG-2 Main Level decoding over different cache sizes
Figure 6. Typical cache-memory traffic for MPEG-2 Main Level decoding over different cache sizes
Balancing the demands of miss rates and memory traffic requires further investigation. For the time
being, maintaining the ubiquitous 32 byte line size seems sensible. The interests of memory traffic
reduction argue against a switch to larger values.
6.4 Summary
It is clear that conventional cache techniques provide limited relief from excess cache-memory traffic.
The simplest way to improve memory traffic and miss rates is to have a very large cache - but a large
cache is not always feasible, and simulations show that improving MPEG-2 decoding performance
requires very big caches. In any case, one would prefer to extract better performance from smaller,
lower-cost resources, improving performance through increased efficiency rather than brute force.
7 Decoder Internal Analysis
This section explores the internal functioning of a software-based MPEG-2 video decoder, as a first
step to explaining and rectifying its suboptimal use of cache resources. First, we dissect the different
kinds of data used by the decoder, including compressed input data, image output data, and all of the
intermediate types. Second, we examine how these data objects are used in the process of decoding,
and how this affects the performance of the cache.
7.1 Data Set Composition
The data set of an MPEG-2 decoder - the information that must be available to the program in the
course of its execution - consists of several different data types which serve distinct purposes. As a
result, they are quite heterogeneous with respect to the types and patterns of access utilized by the
decoder, as well as the amount of memory space required. The following classes account for the global
and static local values in user space.
Input. The compressed MPEG-2 sequence; data are read in series from a fixed-size buffer and
refreshed by system calls.
Output. Uncompressed picture data in YCrCb format, stored in a video window image buffer.
This data type is write only -- written but never read by the CPU. System calls transfer each
completed picture into the frame buffer.
Tabular. Static, read only information used in the MPEG decoding process, such as various
lookup tables.
Reference. The current frame and the previously decoded frames used to reconstruct it (up to two
for a B frame), in YCrCb form.
Block. The DCT coefficient and pixel values for a single macroblock.
State. Values incidental to the settings and operation of the decoder, yet not part of the image
data per se.
This partitioning of data types may not be explicit within the decoder source, but represents an
abstraction at a higher level of the data needed for decoder operation. Most of these data types are
both read and written. The exceptions are the tabular data, which are read only, and the output data,
which are write only. Within a memory hierarchy, reading and writing are not symmetric operations. Read
misses and write misses in caches have very different latencies. The fact that some MPEG decoder
data types are only read or written can be exploited for performance optimization.
The essential properties of the different data types are summarized in Table 2. Note how in terms of
size, the reference and output types completely dominate the others. With respect to caching, this
means that their presence will tend to repeatedly expel the other types from the cache, except for very
large data caches. It doesn't matter how many times the other values are re-used. Capacity limitations
alone will ensure that reference and output data, when updated, will throw the other data types out.
Data Type    Access Type    Size       Fraction of References
Input        read/write     2 KB       2.7%
Output       write only     500 KB     3.9%
Tabular      read only      5 KB       5.5%
Reference    read/write     1500 KB    23.7%
Block        read/write     1.5 KB     31.4%
State        read/write     0.5 KB     25.6%
Table 2. Summary of decoder data types, sizes, access types, and proportion of memory references
accounted for
Ranking the data types in terms of the number of references rather than by size gives a very different
picture. Block and state data account for by far the most memory requests. Reference data is next, but
output data is in next to last place, accounting for 8 times fewer memory requests than block data. Note
that the percentages don't add up to 100%; approximately 7.2% of data references are due to
temporary variables and library functions.
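The figures in Table 2 can be cross-checked with simple arithmetic. The sketch below sums the sizes and reference fractions from the table and compares the output and reference sizes against the size of a 4:2:0 frame. The frame-size check assumes 8-bit samples, 1 KB = 1024 bytes, and a reference space holding three frames (the current frame plus up to two reference frames); these assumptions are ours, not stated in the table.

```python
# (size in KB, percent of memory references), copied from Table 2.
TABLE_2 = {
    "input": (2, 2.7),
    "output": (500, 3.9),
    "tabular": (5, 5.5),
    "reference": (1500, 23.7),
    "block": (1.5, 31.4),
    "state": (0.5, 25.6),
}

total_kb = sum(size for size, _ in TABLE_2.values())
accounted = round(sum(pct for _, pct in TABLE_2.values()), 1)
unaccounted = round(100 - accounted, 1)
block_vs_output = round(TABLE_2["block"][1] / TABLE_2["output"][1], 1)

# One 704x480 frame in 4:2:0 form, assuming 1.5 bytes per pixel.
frame_kb = 704 * 480 * 1.5 / 1024

print(total_kb)         # 2009.0 KB, i.e. about 2 Mbytes
print(accounted)        # 92.8 percent of references
print(unaccounted)      # 7.2 percent left for temporaries and libraries
print(block_vs_output)  # 8.1
print(frame_kb, 3 * frame_kb)  # 495.0 1485.0
```

The total of about 2 Mbytes matches the data set size at which cache-memory traffic drops off in the simulations, and the computed frame sizes of 495 KB and 1485 KB agree with the rounded 500 KB and 1500 KB entries for output and reference data.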
7.2 Decoder Program Flow
The hierarchical composition of the MPEG bitstream and its intrinsic sequentiality constrain the
structure of the decoder program. Figure 7 shows the phases of operation for a single GOP from the
perspective of data accesses. As the diagram makes evident, execution proceeds as a set of nested
loops corresponding to different data levels in the bitstream as shown in Figure 3. The "initialization"
and "termination" blocks represent the operations excluded by steady-state simulation.
Figure 7. Model of MPEG-2 software decoder showing phases of operation for a GOP
Table 3 shows which data types are accessed in the different phases of operation. Parsing header
information and reading in block data involves operating on relatively small data types, and results in a
small active cache set. However, during reconstruction, the decoder accesses portions of the reference
frames and copies them into the frame currently under construction. The Inverse DCT (IDCT) phase
focuses on smaller data types, but merging the decoded pixels back into the new picture requires once
again accessing large data types. Writing the completed frame requires traversing significant portions
of reference and output data. Notice that the decoder accesses state data on a fairly continuous basis.
Phases of
Operation
Data Types
input output tabular reference block state
read headers X X X
read block X X X X
reconstruct MB X X
IDCT X X X
merge MB data X X X
write frame X X X
Table 3. Data types accessed in each phase of decoder operation
Each iteration through the macroblock loop accesses a new portion of one or two reference frames.
Each new P or B frame repeats this traversal through its reference frame(s), recalling significant
fractions of each frame back into the cache. During the ``write frame'' phase, the decoder spends most
of its time copying the current frame from the reference space to the output space. For all but the
largest caches, this has the effect of flushing out almost everything except video data for the next
displayed picture. The other data types will have to be reloaded for each new picture, or possibly
several times a picture for small caches, resulting in wasted cache-memory traffic.
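The flushing effect of the ``write frame'' phase can be illustrated with a toy model. The sketch below is not the simulator used for our results. It assumes a fully associative LRU cache with 32-byte lines and write-allocate behavior, arbitrary base addresses, and a 7 KB set of small data (tabular, block, and state sizes from Table 2) that is not touched during the copy. All names are hypothetical.

```python
from collections import OrderedDict

LINE = 32                                   # assumed cache line size in bytes
KB = 1024
HOT_BASE, HOT_SIZE = 0, 7 * KB              # tabular + block + state data
REF_BASE, OUT_BASE = 4096 * KB, 8192 * KB   # arbitrary, non-overlapping
FRAME = 500 * KB                            # one output frame (Table 2)


class LRUCache:
    """Fully associative LRU cache that tracks resident line numbers only."""

    def __init__(self, size_bytes):
        self.capacity = size_bytes // LINE
        self.lines = OrderedDict()

    def access(self, addr):
        line = addr // LINE
        if line in self.lines:
            self.lines.move_to_end(line)     # mark as most recently used
            return
        self.lines[line] = True              # allocate on read or write miss
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict least recently used line


def surviving_hot_lines(cache_bytes):
    """Load the small data, copy one frame, count small-data lines left."""
    cache = LRUCache(cache_bytes)
    for addr in range(HOT_BASE, HOT_BASE + HOT_SIZE, LINE):
        cache.access(addr)
    for off in range(0, FRAME, LINE):        # copy reference -> output
        cache.access(REF_BASE + off)
        cache.access(OUT_BASE + off)
    return sum((addr // LINE) in cache.lines
               for addr in range(HOT_BASE, HOT_BASE + HOT_SIZE, LINE))


for size_kb in (16, 64, 512, 2048):
    left = surviving_hot_lines(size_kb * KB)
    print(f"{size_kb:5d} KB cache: {left} of {HOT_SIZE // LINE} "
          f"small-data lines survive the frame copy")
```

In this model the small data survives only when the cache is large enough to hold both the frame copy and the small data at once.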
7.3 System and Audio Interaction
Our simulations assume that MPEG video decoding is the only active process. In reality, video is
usually viewed in conjunction with audio data, which requires the concurrent execution of an MPEG
audio codec and an MPEG system parser to de-multiplex and synchronize the two media. We expect
that running these other processes will not dramatically affect our results, since the amount of data they
operate on at any given time is relatively small. The net effect is that of having a slightly smaller cache,
and our simulations show that memory traffic levels are fairly insensitive to small changes in cache size.
Nevertheless, the presence of these other programs is a good argument for keeping the cache
footprint of the video data as small as possible.
8 Cache-Oriented Enhancements
Our analysis of decoder behavior and data composition motivates a specific approach to performance
optimization, where the goal is to improve cache efficiency for video decoding without adversely
impacting other performance metrics or other applications. In particular, there are great potential
benefits in treating different data types distinctly, guided by their different sizes, access types, and
access patterns.
For example, write-only values, like video output data, are clearly a waste of cache space. Likewise,
while block and state data account for most of the memory requests, their relatively small size means
that larger data types, like reference data, systematically crowd them out of the cache. Preventing
blatant cache pollution and the predation of small data types with high re-use can yield considerable
cache-memory traffic savings. To this end, we are evaluating the following traffic-reduction techniques.
Selective Caching. Exclude specific data from the cache.
Cache Locking. Reserve selected cache lines for particular data objects; lines are untouched by
cache replacement while locked.
Scratch Memory. Implement a small portion of addressable system memory on the processor -
not as cache - for storage of small, frequently-used data.
Data Reordering. Perform cache-conscious modifications of decoder memory accesses.
Cache Partitioning. Allocate different sections of the cache to different data types.
Many of these methods are relatively simple to implement, and most have precedents in other
architectures since they have performance benefits beyond video decoding. Perhaps their efficacy for
MPEG-2 processing might encourage their broader adoption in architectures. In any case, the main
challenge is to successfully apply these various techniques in a manner that enhances decoder
performance while avoiding excessive cost and complexity. The remainder of this section considers
each method on its own.
8.1 Selective Caching
Data objects with no possibility of re-use steal cache space from other data, which leads to excess
memory traffic when the excluded data are loaded in once again. Video output data is a perfect
example, since it is written but never read by the CPU, and occupies a fairly large amount of space.
Bypassing the cache entirely in favor of direct storage to main memory promises more efficient use.
In our simulations, we have found that excluding video output data alone from the cache reduces cache-
memory traffic by up to 50%. Across the configurations considered, not caching video output yields a
25% reduction on average. Improvements in cache-memory traffic are global, yet the shape of the
curve is not substantially modified. The plateau out to 512 Kbytes persists, as does the swift drop down
at 2 Mbytes. Excluding output data is even more helpful for miss rates, which drop by a maximum of
85% from earlier levels, and 60% on average.
There are several distinct ways of implementing this feature. In the PA-RISC architecture, the load and
store instructions can contain ``locality hints'' which signal that the data referred to is unlikely to be re-used.
If supported by the particular implementation, the processor can elect to bypass the cache with the
data [7]. The UltraSPARC has block load and store instructions, which move groups of data directly
between registers and main memory at high speed [10]. This approach allows for potentially more
efficient use of processor bandwidth. Finally, the PowerPC architecture provides for marking areas of
memory as non-cacheable. Once marked, data is transparently and automatically transferred between
registers and memory buffers. The cache hint and block load/store approaches require more explicit
programming effort.
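A toy model shows why bypassing the cache for write-only output helps. The sketch below is an illustration under stated assumptions, not our simulator. It assumes a 12 KB fully associative, write-back, write-allocate LRU cache with 32-byte lines, a 7 KB set of small data re-read once per 4 KB chunk of frame data, and uncached stores that reach memory one line at a time. The resulting numbers are not comparable to the measured reductions reported above.

```python
from collections import OrderedDict

LINE, KB = 32, 1024
HOT_BASE, HOT_SIZE = 0, 7 * KB              # small, heavily re-used data
REF_BASE, OUT_BASE = 4096 * KB, 8192 * KB   # arbitrary, non-overlapping
FRAME, CHUNK = 500 * KB, 4 * KB             # frame size; work unit (assumed)


class Cache:
    """Fully associative, write-back, write-allocate LRU cache."""

    def __init__(self, size_bytes):
        self.capacity = size_bytes // LINE
        self.lines = OrderedDict()          # line number -> dirty flag
        self.traffic = 0                    # lines moved to or from memory

    def access(self, addr, write=False, bypass=False):
        line = addr // LINE
        if line in self.lines:
            self.lines.move_to_end(line)
            self.lines[line] = self.lines[line] or write
            return
        self.traffic += 1                   # line fill, or one uncached store
        if bypass:
            return                          # data never enters the cache
        self.lines[line] = write
        if len(self.lines) > self.capacity:
            _, dirty = self.lines.popitem(last=False)
            if dirty:
                self.traffic += 1           # write back evicted dirty line


def decode_frame(cache_bytes, bypass_output):
    """Return cache-memory traffic, in lines, for one modeled frame."""
    cache = Cache(cache_bytes)
    for start in range(0, FRAME, CHUNK):
        for addr in range(HOT_BASE, HOT_BASE + HOT_SIZE, LINE):
            cache.access(addr)
        for off in range(start, start + CHUNK, LINE):
            cache.access(REF_BASE + off)
        for off in range(start, start + CHUNK, LINE):
            cache.access(OUT_BASE + off, write=True, bypass=bypass_output)
    return cache.traffic


cached = decode_frame(12 * KB, bypass_output=False)
bypassed = decode_frame(12 * KB, bypass_output=True)
print(f"output cached:   {cached} lines of traffic")
print(f"output bypassed: {bypassed} lines of traffic")
```

With the output excluded, the small data and one chunk of reference data fit in the modeled cache together, so the small data is fetched only once.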
8.2 Cache Locking
One way to protect small but frequently reused data types, like input, state, and tabular values, from
victimization by raw video data is to lock the parts of the cache which contain the vulnerable data. While
locked, these cache lines are invisible to the replacement algorithm, and the contents will not be thrown
out only to be re-loaded when needed again.
This is accomplished by special machine instructions which execute the locking and unlocking. There
are two basic variations on this technique. Static locking simply freezes the tag and contents of the
affected line, allowing for the writing of values but not replacement; the line is associated with the same
portion of main memory until unlocked. Dynamic locking is somewhat more flexible, treating locked
lines as an extension of the register set, with special instructions to copy contents directly to and from
main memory. This scheme is used in the Cyrix MediaGX processor to speed the software emulation
of legacy PC hardware standards.
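Static locking can be sketched in the same toy setting. The class and method names below are hypothetical and do not correspond to the instructions of any particular processor. The model assumes a 12 KB fully associative LRU cache with 32-byte lines in which locked lines are simply skipped by the replacement policy.

```python
from collections import OrderedDict

LINE, KB = 32, 1024
HOT = range(0, 7 * KB, LINE)                # small data worth protecting
REF_BASE, OUT_BASE = 4096 * KB, 8192 * KB   # arbitrary, non-overlapping
FRAME, CHUNK = 500 * KB, 4 * KB


class LockingCache:
    """LRU cache whose locked lines are invisible to replacement."""

    def __init__(self, size_bytes):
        self.capacity = size_bytes // LINE
        self.unlocked = OrderedDict()       # replaceable lines, LRU order
        self.locked = set()                 # pinned line numbers

    def _evict_if_full(self):
        while len(self.unlocked) + len(self.locked) > self.capacity:
            self.unlocked.popitem(last=False)

    def access(self, addr):
        """Return True on a hit, False on a miss."""
        line = addr // LINE
        if line in self.locked:
            return True
        if line in self.unlocked:
            self.unlocked.move_to_end(line)
            return True
        self.unlocked[line] = True
        self._evict_if_full()
        return False

    def lock(self, addr):
        """Pin a line; return True if it had to be fetched first."""
        line = addr // LINE
        if line in self.locked:
            return False
        if len(self.locked) + 1 >= self.capacity:
            raise ValueError("refusing to lock the entire cache")
        fetched = self.unlocked.pop(line, None) is None
        self.locked.add(line)
        self._evict_if_full()
        return fetched

    def unlock(self, addr):
        line = addr // LINE
        if line in self.locked:
            self.locked.remove(line)
            self.unlocked[line] = True      # becomes replaceable again


def hot_misses(lock_hot):
    """Count memory fetches of small data while one frame streams by."""
    cache = LockingCache(12 * KB)
    misses = 0
    if lock_hot:
        misses += sum(cache.lock(addr) for addr in HOT)
    for start in range(0, FRAME, CHUNK):
        misses += sum(not cache.access(addr) for addr in HOT)
        for off in range(start, start + CHUNK, LINE):
            cache.access(REF_BASE + off)
        for off in range(start, start + CHUNK, LINE):
            cache.access(OUT_BASE + off)
    return misses


print("small-data fetches without locking:", hot_misses(False))
print("small-data fetches with locking:   ", hot_misses(True))
```

Locking leaves fewer lines for the video data, which streams through the cache in either case.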
8.3 Scratch Memory
Another means of providing a safe haven for vulnerable data is to put them in a separate memory.
Many processors today consist to a large extent not of logic but SRAM, in the form of caches. It is easy
to imagine implementing a portion of the memory space itself on the die - not as a cache but
addressable memory. Data which are subject to being prematurely ejected from the cache could then
be kept in this space. The components of the MPEG-2 data which would most benefit from this are
relatively small, so only a few KBytes would be required. This ``scratch memory'' would be relatively
inexpensive to implement in hardware, smaller than most caches, and far simpler than any of them due
to the lack of tags and logic for lookup, replacement, etc. Similar memories have actually been
featured in Intel microcontrollers for quite some time, for storing Page 0 of the address space.
A scratch memory provides potentially very fast access to important data, since there is no possibility
of a read or write miss, and the delay of searching tags can be eliminated. Intel microcontrollers even
use special addressing modes to access Page 0 memory which are more concise and faster than
those used for off-chip addresses. Unlike with cache locking, cache performance and capacity are not
compromised since the scratch memory is off to the side. Of course, getting a hardware feature used in
embedded systems, where operating systems and even compilers are optional, to work effectively in
an interactive, multitasking environment will take careful consideration, but the potential benefits are
considerable.
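The sizes and reference shares in Table 2 give a rough idea of how much scratch memory is worthwhile. The sketch below is an illustration, not an evaluated design. It fills a hypothetical scratch budget greedily by references per KB, and assumes that every reference to a chosen data type would be served from the scratch memory.

```python
# (size in KB, percent of all data references) for the small data types
# in Table 2; reference and output data are far too large to consider.
small_types = {
    "input": (2.0, 2.7),
    "tabular": (5.0, 5.5),
    "block": (1.5, 31.4),
    "state": (0.5, 25.6),
}


def plan_scratch(budget_kb):
    """Greedily pick data types by reference density within the budget."""
    by_density = sorted(small_types.items(),
                        key=lambda item: item[1][1] / item[1][0],
                        reverse=True)
    chosen, used_kb, covered = [], 0.0, 0.0
    for name, (size_kb, share) in by_density:
        if used_kb + size_kb <= budget_kb:
            chosen.append(name)
            used_kb += size_kb
            covered += share
    return chosen, used_kb, covered


for budget in (2.0, 4.0, 9.0):
    chosen, used_kb, covered = plan_scratch(budget)
    print(f"{budget:3.0f} KB budget: {', '.join(chosen)} "
          f"({used_kb:.1f} KB used, {covered:.1f}% of references)")
```

Under these assumptions, state and block data alone occupy 2 KB and account for more than half of all data references.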
8.4 Data Reordering
This category of enhancement is a departure from our standard procedure of not tampering with the
decoder itself. Our goal is not to improve computational efficiency, a task which we leave to the many
others working on that problem. However, we suggest that modifications of specifically how the
decoder software manages and addresses memory can provide significant improvements in cache
efficiency - some of which may require hardware support to properly implement.
For example, there is strictly speaking no essential need for the output data type. B frame data,
considered separately, is never used by the decoder to construct other frames. The only time it is read
is for copying to the output buffer space. I and P pictures need to remain available to serve as
reference frames. Yet it turns out that in the natural flow of decoder data, by the time these frames are
copied to the output space they are no longer needed. If pictures could be transferred directly by DMA
from the reference frame space where they are stored, rather than routed through the output data, this
would result in greater efficiency. It would eliminate all of the cycle-consuming copying of output data,
and transform the B frame data into write-only values which could be efficiently excluded from the
cache.
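The difference between copying a finished picture and handing it off in place can be sketched as follows. Both classes are hypothetical stand-ins and not taken from the decoder. A `memoryview` plays the role of the DMA engine reading the reference buffer directly, and the frame size is the output size from Table 2.

```python
FRAME_BYTES = 500 * 1024    # size of one output frame, per Table 2


class CopyingDecoder:
    """Copies each finished picture from reference space to output space."""

    def __init__(self):
        self.reference = bytearray(FRAME_BYTES)
        self.output = bytearray(FRAME_BYTES)
        self.bytes_copied = 0

    def write_frame(self):
        self.output[:] = self.reference     # CPU reads and writes every byte
        self.bytes_copied += FRAME_BYTES
        return self.output


class HandoffDecoder:
    """Exposes the reference buffer itself to the display side."""

    def __init__(self):
        self.reference = bytearray(FRAME_BYTES)
        self.bytes_copied = 0

    def write_frame(self):
        # No copy: the consumer reads the picture where it already lives.
        return memoryview(self.reference)


copying, handoff = CopyingDecoder(), HandoffDecoder()
for _ in range(30):
    copying.write_frame()
    handoff.write_frame()

print("bytes copied with an output buffer:", copying.bytes_copied)
print("bytes copied with in-place handoff:", handoff.bytes_copied)
```

The handoff variant is only safe if the reference buffer is not overwritten until the display side has finished reading it, which is one of the places where hardware or system support may be needed.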
8.5 Cache Partitioning
It is possible to take the basic idea of cache locking in a slightly different direction. With cache
partitioning, particular data types are relegated to specific sections of the cache, which continue to
behave like cache. However, rather than locking in the physical address associated with a line, or
treating cache lines as extensions of the register set, it is as if each data type has its own separate,
smaller cache space for storing its values.
This method provides very precise control over cache behavior, and an extremely powerful tool for the
optimal management of cache resources. However, it requires a level of hardware complexity not
demanded by the methods discussed previously. Also, the task of efficiently exploiting the available
capabilities in software is considerably more challenging.
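Partitioning can be sketched as several independent LRU regions, with each data type assigned to one of them. The model below is a toy with hypothetical names, not a proposed hardware interface. It assumes 32-byte lines, fully associative partitions, and a 12 KB cache split into 8 KB for small data and 4 KB for video data.

```python
from collections import OrderedDict

LINE, KB = 32, 1024
HOT_SIZE = 7 * KB                           # small, heavily re-used data
REF_BASE, OUT_BASE = 4096 * KB, 8192 * KB   # arbitrary, non-overlapping
FRAME, CHUNK = 500 * KB, 4 * KB


class PartitionedCache:
    """Cache split into independent LRU partitions, chosen by data type."""

    def __init__(self, partition_bytes, assignment):
        self.capacity = {p: size // LINE
                         for p, size in partition_bytes.items()}
        self.lines = {p: OrderedDict() for p in partition_bytes}
        self.assignment = assignment        # data type -> partition name
        self.misses = {t: 0 for t in assignment}

    def access(self, addr, data_type):
        part = self.assignment[data_type]
        lines = self.lines[part]
        line = addr // LINE
        if line in lines:
            lines.move_to_end(line)
            return
        self.misses[data_type] += 1
        lines[line] = True
        if len(lines) > self.capacity[part]:
            lines.popitem(last=False)       # replacement stays in partition


def run(partition_bytes, assignment):
    """Return per-data-type miss counts for one modeled frame."""
    cache = PartitionedCache(partition_bytes, assignment)
    for start in range(0, FRAME, CHUNK):
        for addr in range(0, HOT_SIZE, LINE):
            cache.access(addr, "small")
        for off in range(start, start + CHUNK, LINE):
            cache.access(REF_BASE + off, "reference")
            cache.access(OUT_BASE + off, "output")
    return cache.misses


unified = run({"all": 12 * KB},
              {"small": "all", "reference": "all", "output": "all"})
split = run({"hot": 8 * KB, "video": 4 * KB},
            {"small": "hot", "reference": "video", "output": "video"})
print("unified 12 KB cache:    ", unified)
print("8 KB + 4 KB partitions: ", split)
```

In this model the video data misses equally often in both configurations, while the small data is fetched only once when it has a partition to itself.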
8.6 Summary
None of the methods proposed are mutually exclusive, although some of them, like cache locking and
scratch memory, are arguably redundant if used in conjunction. Whether one of these techniques or
some combination of them provides the most efficient solution remains to be seen. Also, each one
raises issues of implementation cost and complexity, interaction with the operating system and other
processes, and software interfacing and portability which need to be more thoroughly addressed. We
will investigate these issues and perform a more detailed evaluation of these architectural
enhancements in future work.
9 Conclusions
The effectiveness of computer memory subsystems, including caches, is currently a significant barrier
to improved performance on multimedia applications. We have documented the memory traffic and
data cache behavior of MPEG-2 video decoding on a general-purpose computer, and demonstrated
that standard caches produce a significant excess of cache-memory traffic. While almost any cache is
superior to none - even simple caches can reduce required memory bandwidth considerably -
experimenting with basic cache parameters like cache size, associativity, and line size has a limited
ability to reduce cache-memory traffic. The best value that can be achieved is double the absolute
minimum traffic required. However, cache-oriented architectural enhancements, driven by an
understanding of decoder behavior, can dramatically improve cache efficiency. These enhancements
include selective caching, cache locking, scratch memory, data reordering, and cache partitioning. All have in
common an emphasis on treating different elements of the data distinctly, accommodating their unique
properties - size, usage patterns, temporal locality, etc. This approach provides for optimal
management of available cache resources. We expect that further refinement of these techniques will
enable improved video decoding performance of general purpose microprocessors, and encourage
the broader proliferation of digital video platforms and applications.
10 Acknowledgments
This research was supported in part by NSF grant CCR-9696196. Miriam Leeser is the recipient of an
NSF Young Investigator Award. The research is also supported by generous donations from Sun
Microsystems. The authors would like to thank Shantanu Tarafdar and Brian Smith for useful
discussions.
References
[1] Stefan Eckart et al. ISO/IEC MPEG-2 software video codec. In Proc. Digital Video
Compression: Algorithms and Technologies 1995, pages 100-109. SPIE, Jan. 1995.
[2] Peter N. Glaskowsky. Fujitsu aims media processor at DVD. Microprocessor Report, 10:11-13,
November 1996.
[3] Linley Gwennap. Mitsubishi designs DVD decoder. Microprocessor Report, 10:1,6-9, December
1996.
[4] International Organization for Standardization. ISO/IEC JTC1/SC29/WG11/602 13818-2
Committee Draft (MPEG-2), Nov. 1993.
[5] Paul Kalapathy. Hardware-software interactions on MPACT. IEEE Micro, 17(2):20-26, Mar./Apr.
1997.
[6] James R. Larus. Efficient program tracing. IEEE Computer, 26(5):52-61, May 1993.
[7] Ruby B. Lee. Subword parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.
[8] Ketan Patel et al. Performance of a software MPEG video decoder. In Proc. 1st ACM Intl. Conf.
on Multimedia, pages 75-82. ACM, Aug. 1993.
[9] Alex Peleg et al. Intel MMX for multimedia PCs. Comm. of the ACM, 40(1):25-38, Jan. 1997.
[10] Marc Tremblay et al. VIS speeds new media processing. IEEE Micro, 16(4):10-20, August 1996.
[11] Daniel F. Zucker et al. A comparison of hardware prefetching techniques for multimedia.
Technical Report CSL-TR-95-683, Stanford University Depts. of Electrical Engineering and
Computer Science, Stanford, CA, Dec. 1995.
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Último (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_1373693063

ACM Multimedia 97 - Electronic Proceedings
November 8-14, 1997
Crowne Plaza Hotel, Seattle, USA
Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder
Peter Soderquist
School of Electrical Engineering
Dept. of Electrical and Computer Engineering
Cornell University
Ithaca, New York 14853
(607) 255-0314
pgs@cs.cornell.edu
http://www.ee.cornell.edu/~pgs/
Miriam Leeser
Dept. of Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts 02115
(617) 373-3814
mel@ece.neu.edu
http://www.ece.neu.edu/
ACM Copyright Notice
Abstract
Multimedia functionality has become an established component of core computer workloads. MPEG-2 video decoding represents a particularly important and computationally demanding application example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of this and other multimedia algorithms. However, memory subsystem deficiencies have now become the major barrier to increased performance, partly as a consequence of this improved CPU performance. Decoding MPEG-2 video data in software makes significant bandwidth demands on memory subsystems, which is seriously aggravated by cache inefficiencies. Conventional data caches generate many times more cache-memory traffic than required, at best double the minimum necessary to support decoding. Improving efficiency requires understanding the behavior of the decoder and composition of its data set. We provide an analysis of the memory and cache behavior of software MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive handling of different data types can reduce traffic by 50 percent or more.
Keywords
MPEG-2, digital video, caches, computer architecture
Table of Contents
1 Introduction
2 Background
3 MPEG Overview
4 Related Work
4.1 Media Processors
4.2 SIMD Instruction Extensions
4.3 Video/Graphics Data Buses
4.4 Prefetching
4.5 Our Approach: Cache-Oriented Enhancements
5 Experimental Setup
5.1 Machine and Software Environment
5.2 Steady-State Simulation
5.3 Color Space and Bandwidth
6 Cache Simulation Results
6.1 Cache Size
6.2 Associativity
6.3 Line Size
6.4 Summary
7 Decoder Internal Analysis
7.1 Data Set Composition
7.2 Decoder Program Flow
7.3 System and Audio Interaction
8 Cache-Oriented Enhancements
8.1 Selective Caching
8.2 Cache Locking
8.3 Scratch Memory
8.4 Data Reordering
8.5 Cache Partitioning
8.6 Summary
9 Conclusions
10 Acknowledgments
References
1 Introduction
Multimedia computing has become a practical reality, and computer architectures are changing in response. Extensions for continuous media processing are a current or planned part of every market-leading instruction set architecture, as shown in Table 1. The primary feature of these extensions is SIMD-style processing of small data types. While contemporary microprocessor datapaths are 32 or 64 bits wide, multimedia data usually consists of 16-bit or 8-bit integers. Furthermore, multimedia algorithms commonly feature an abundance of data parallelism. Therefore, modifying wide datapaths to process multiple small data types in a single instruction promises significant performance improvements.
Architecture   ISA Extension
x86            MMX (MultiMedia eXtensions)
SPARC          VIS (Visual Instruction Set)
PA-RISC        MAX-2 (Multimedia Acceleration eXtensions)
Alpha          MVI (Motion Video Instruction)
MIPS           MDMX (MIPS Digital Media eXtensions)
PowerPC        VMX (Video and Multimedia eXtensions)*
Table 1. ISA multimedia extensions for popular architectures (* = proposed)
All of these extensions are targeted, at least partially, at MPEG-2 video decoding. Collectively, they mount a formidable assault on its computational complexity. Unfortunately, they largely fail to address the significant challenge of MPEG processing for memory subsystems, which have become the primary performance bottleneck.
The high data rate, large set sizes, and distinctive memory access patterns of MPEG exert a particular strain on caches. Manipulating standard parameters (cache size, associativity, and line size) fails to cost-effectively reduce excess memory traffic. While miss rate levels are acceptable, standard caches, even very large and highly associative ones, generate significant excess cache-memory traffic. Multimedia instruction extensions actually exacerbate this problem by enabling the CPU to consume and produce more data in fewer cycles.
This cache inefficiency is seriously limiting for desktop multitasking systems. A foregrounded video process can potentially cripple communications, print spooling, system management, network agent, and other background tasks. The video playback may also suffer from dropped frames, jitter, blocking, or other annoying artifacts. For low-power/cost systems and information appliances, cache inefficiency can have a direct cost impact, requiring the use of faster or higher capacity components than strictly necessary to achieve specified functionality at the required quality level. This can drive up system cost, increase power consumption, or even prevent implementation.
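To make the idea of SIMD-style processing of small data types concrete, the following is a minimal sketch in Python that adds eight 8-bit values packed into one 64-bit word using ordinary integer operations. It is an illustration of the concept only, not a description of how MMX or any other extension in Table 1 is implemented; the function names (`pack`, `unpack`, `packed_add_u8`) are hypothetical, and only wraparound (non-saturating) addition is shown.

```python
LANES = 8                      # eight 8-bit lanes in one 64-bit word
LOW7 = 0x7F7F7F7F7F7F7F7F      # low 7 bits of every lane
HIGH = 0x8080808080808080      # top bit of every lane


def pack(values):
    """Pack eight 8-bit integers into one 64-bit word (lane 0 in the low byte)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into eight 8-bit integers."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add_u8(a, b):
    """Add all eight lanes at once, wrapping modulo 256 within each lane."""
    # Adding only the low 7 bits keeps every carry inside its own lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is restored with XOR, which cannot carry out.
    return low ^ ((a ^ b) & HIGH)


pixels = [10, 20, 30, 40, 250, 128, 0, 255]
delta = [1, 2, 3, 4, 10, 128, 0, 1]
result = unpack(packed_add_u8(pack(pixels), pack(delta)))
print(result)
```

One wide operation here replaces eight narrow ones, which is also why such extensions let the CPU consume and produce data faster than before.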
To explore the cache efficiency problem and its solutions, we examine the traffic and data cache behavior of an MPEG-2 decoder running Main Level streams on a general-purpose microprocessor. Trace-driven cache simulations of typical sequences reveal the memory behavior of the decoder. We show that, contrary to some predictions, there is a benefit to even relatively small, simple caches. However, even very large and complex caches generate significant excess traffic. An analysis of the data types utilized by the decoder and their patterns of access provides the basis for proposed architectural enhancements. We also show that even relatively simple measures, such as selective caching of specific data types, can dramatically improve efficiency, reducing cache-memory traffic by 50% or more for a wide range of cache sizes.
2 Background
Consider the problem of decoding MPEG-2 in software on a general-purpose computing platform. The machine type of primary interest is a typical desktop PC or workstation, as envisioned in Figure 1. The diagram also illustrates the typical flow of video data in the system. Imagine the user is viewing, for example, an entry in a multimedia encyclopedia. Compressed data is transferred from the file system (e.g. CD-ROM or hard disk) by direct memory access (DMA) and buffered in the main memory. The CPU reads the data through its cache hierarchy and writes the decoded, dithered RGB video, one frame at a time, to main memory where it is stored in an image of the displayed video window. Finally, another DMA transfer brings the video data to the frame buffer where it eventually appears on the display.
Figure 1. Typical desktop machine with software MPEG-2 decoding traffic
It is clear that there is a large amount of data, much of it time-sensitive, being transferred through the system. Most of this is concentrated on the main system bus. At the very least, there are two streams each of encoded and decoded video being concurrently transferred in and out of main memory, amounting to a minimum sustained load of 63 Mbytes/s. This ignores the bandwidth required by other applications running on the system and sharing interconnect, main memory, and peripherals. Any excess memory traffic generated by cache inefficiency will further exacerbate this situation. Unfortunately, our simulations indicate that standard caches generate at least twice the minimum required level of cache-memory traffic, typically many times more, amounting to a significant strain on system capacity. The result is some combination of dropped frames, reduced frame rate, and degradation of overall system performance.
3 MPEG Overview
Appreciating the challenges of supporting video decoding requires some understanding of the MPEG standard [4]. MPEG attacks both the spatial and temporal redundancy of video signals to achieve compression. Video data is broken down into 8 by 8 pixel Blocks and passed through a discrete cosine transform (DCT). The resulting spatial frequency coefficients are quantized, run-length encoded, and then further compressed with an entropy coding algorithm.
To exploit temporal redundancy, MPEG encoding uses motion compensation with three different types of frames. I (intra) frames contain a complete image, compressed for spatial redundancy only. P (predicted) MPEG frames are built from 16 by 16 fragments known as macroblocks. These consist primarily of pixels from the closest previous I or P frame (the reference frame), translated as a group from their location in the source.
This information is stored as a vector representing the translation, and a DCT-encoded difference term, requiring far fewer bits than the original image fragment. B (bidirectional) frames can use the closest two I or P pictures - one before and one after in temporal order - as reference frames. Information not present in reference frames is encoded spatially on a block-by-block basis. All of the data in P and B frames is also subject to run-length and entropy coding.
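Because a B frame needs both its earlier and its later reference frame, the decoder cannot process frames in display order. The sketch below, a minimal illustration under stated assumptions rather than code from any real decoder, derives a processing order from a list of frame types given in display order. The function name `decode_order` and the ten-frame pattern I B B P B B P B B I are assumptions chosen for illustration.

```python
def decode_order(display_types):
    """Return 1-based display indices in the order a decoder must process them."""
    order = []
    pending_b = []  # B frames still waiting for their later reference frame
    for index, frame_type in enumerate(display_types, start=1):
        if frame_type == "B":
            pending_b.append(index)
        else:
            # An I or P frame is a reference: decode it first, then the
            # B frames that were displayed before it but depend on it.
            order.append(index)
            order.extend(pending_b)
            pending_b.clear()
    # Trailing B frames depend on a reference outside this list; they are
    # appended only so that no frame is dropped from the result.
    order.extend(pending_b)
    return order


frames = list("IBBPBBPBBI")  # assumed display-order pattern
order = decode_order(frames)
print(order)  # [1, 4, 2, 3, 7, 5, 6, 10, 8, 9]
```

Each reference frame is pulled ahead of the B frames that precede it on screen, so the decoder must hold two reference frames in memory while it reconstructs the B frames between them.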
Figure 2. Typical sequence of MPEG frames, showing interframe dependencies
Figure 2 shows the interframe dependencies of the different frame types, superimposed on the displayed frame order. For decoding, these frames must be processed in the non-temporal order [1,4,2,3,7,5,6,10,8,9], which is a result of these dependencies. Interframe dependencies and the properties and sequence of frame types determine in critical ways the flow pattern of MPEG data and the nature of the hardware support required.
Figure 3. MPEG bitstream structure
The decoder reads the MPEG data as a stream of bits. Unique bit patterns, known as startcodes, mark the division between different sections of the data. The bitstream has a hierarchical structure, shown in simplified form in Figure 3. A Sequence (video clip) consists of groups of pictures (GOPs). A GOP contains at least one I frame and typically a number of dependent P and B frames; Figure 2 shows a possible GOP of 9 frames (pictures) followed by the I frame of the subsequent GOP. Pictures consist of collections of macroblocks called slices. The "headers" shown at each level contain parameters relevant to each bitstream type but no actual image data.
4 Related Work
Deficiencies in the MPEG-2 decoding performance of systems are being actively addressed. This section discusses several significant methods for performance enhancement under investigation by researchers both in industry and academia. Some of these are merely proposals, while others have already been implemented in hardware or are forthcoming. We explore the benefits and disadvantages of each approach, and assess its impact on the problem of excess cache-memory bandwidth.
4.1 Media Processors
A new breed of microprocessor, the so-called "media processor", has emerged in just the past few years. These chips combine programmability with specialized hardware to support multiple concurrent multimedia operations, including MPEG-2 decoding, and usually reside on peripheral cards in desktop systems. Their manufacturers, which include Chromatic Research, Fujitsu, Mitsubishi, Philips, and Samsung, claim that media processors, rather than the host CPU, are the best present and long-term solution for providing multimedia functionality in computer systems.
To their advantage, media processors yield reduced memory bandwidth requirements, since they deal in compressed data, and take a significant computational load off the host. They are also good at handling the real-time constraints of multimedia, while present-day operating systems and memory hierarchies are not. Finally, present-day multimedia functions consist of a limited set of operations, tied to international standards, an efficient target for highly specialized solutions.
However, performing multimedia functionality with the host processor is inherently less expensive than using extra specialized hardware. Memory systems are also improving to meet the needs of multimedia, as illustrated by Intel's AGP and Direct RDRAM initiatives. Operating systems are also likely to become more real-time oriented as they evolve. More fundamentally, multimedia data are a new first-order data type, on the same order as floating-point. These functions are becoming central to what computers do, not "peripheral" like mass-storage or printing. As applications evolve, media and other operations will become more and more interleaved - someday viewing a video clip will be as seamless as reading e-mail or editing a spreadsheet.
Host multimedia processing best supports the construction of highly integrated, fully featured, yet portable applications. Hardware integration trends and the powerful vested interest of CPU manufacturers also favor this outcome.
Nevertheless, parts of video processing are likely to remain in specialized hardware for quite some time, such as color space conversion, scaling, and dithering. These are straightforward to implement in hardware, and to a large extent not data-dependent and therefore not subject to pipeline hazards and other impediments. These are more akin to truly "peripheral" operations.
4.2 SIMD Instruction Extensions
Multimedia instruction extensions such as MMX, VIS, MAX-2, and the others in Table 1, achieve speedup primarily through performing SIMD-style operations on multimedia data types. These ISA enhancements are an essential step to improving microprocessor performance on multimedia applications - as well as ensuring the continued viability of host processing. However, these operations tend to increase rather than relieve the strain on memory bandwidth. By consuming more operands per unit time, multimedia instructions expose the underlying weakness of the cache, system bus, and main memory. The caches in the MMX-enhanced Pentium are double the size of those in the earlier Pentium for this very reason. Intel researchers even blame the below-expected speedup on MPEG with MMX on memory and I/O system deficiencies [9].
4.3 Video/Graphics Data Buses
There have been proposals for implementing special interconnect to transfer image data from the host processor to the frame buffer, bypassing slower general-purpose I/O buses. The Intel Accelerated Graphics Port (AGP) is the most prominent such effort, intended to provide a direct link between main memory and the graphics subsystem. While primarily meant to support the high bandwidth of 3D graphics rendering on the host processor, it is a definite boon to video as well. This method provides another significant way to boost the capabilities of host multimedia processing relative to custom hardware solutions. However, it does nothing in and of itself to alleviate cache-memory traffic waste or reduce overall memory-system bandwidth needs. The undiminished requirement for extremely high-bandwidth memory is reflected in Intel's support of the Rambus architecture for future PC memory systems.
4.4 Prefetching
Hardware prefetching has been suggested as a remedy for inadequate MPEG performance [11]. Yet it is primarily promoted as a means to reduce miss rates and without much consideration of cache-memory traffic, which tends to increase with prefetching, especially the more aggressive schemes. Prefetching is essentially a form of latency hiding; for memory-bound problems, it merely exposes underlying bandwidth problems, and tends to generate excess memory traffic of its own. Selectively applied, prefetching may provide a boost to compute-intensive phases of MPEG and other multimedia algorithms. However, the emphasis of our research is not so much on getting values into the cache early, but keeping them around for as long as they are needed.
4.5 Our Approach: Cache-Oriented Enhancements
The methods discussed above largely bypass the issue of excess cache-memory traffic, while some of them even exacerbate the problem. None of them directly targets the issue of cache inefficiency. Even prefetching focuses on reducing miss rates at the possible expense of memory traffic.
We suggest a different type of solution, driven by the internal dynamics of the decoder itself. Exploiting knowledge of individual data types - their sizes, access types, and access patterns - can lead to effective architectural solutions to cache inefficiency.
5 Experimental Setup
Solving the problem of excess cache-memory traffic requires first establishing its extent, and how critical system parameters affect this behavior. Our results come from trace-driven cache simulations of video stream decoding. The clips themselves are progressive MPEG-2 Main Level sequences, with a resolution of 704 by 480 pixels at 30 frames/s, and a compressed bitrate of 6 Mbits/s. These are comparable in perceptual quality to conventional analog broadcast (e.g. NTSC, PAL). Sequences are chosen as representative of different types of programming.
5.1 Machine and Software Environment
For experimental purposes we use a simplified machine model, limiting the microprocessor to a single level of caching. We further assume that MPEG-2 decoding is the only active process.
The decoder is a version of mpeg2decode from the MPEG Software Simulation Group [1], running on a SuperSPARC processor under Solaris. This decoder has the advantage of portability and relative system independence, with the disadvantage that performance is suboptimal for any given architecture. Since it originates with the body responsible for MPEG, it is closely written to comply with that standard. Finally, it has the essential feature for research purposes of being available in source code form. We have chosen to treat the decoder implementation as a given, implementing minimal modification to support integration with our simulation tools.
Trace generation is performed by QPT2 (the Quick Profiler and Tracer) [6], a component of the Wisconsin Architectural Research Toolkit (WARTS). We have implemented the cache simulator, Pacino, to support the unique qualities of continuous media applications. These are considerably different from conventional benchmarks of the SPEC variety in their consumption and production of data, and interaction with the operating system.
5.2 Steady-State Simulation
To limit the considerable storage and CPU cycle requirements of trace-driven simulation, we restrict each run to one GOP, the largest logical bitstream unit below a sequence. Simulations run as if at steady state - i.e. preceded and followed by a large number of other GOPs, as if plucked from the middle of the sequence. This eliminates the side effects of program startup and termination. To achieve this, the results of running initialization code are not logged, the cache is primed with data to avoid the effects of cold start, and dirty lines are not flushed at the end of the GOP.
5.3 Color Space and Bandwidth
The simulation framework implements one significant performance optimization beyond the generic hardware model. Decoded video is sent to the frame buffer in 4:2:0 YCrCb form, the native representation of MPEG, rather than 4:4:4 RGB. YCrCb is a color representation with one luminance (grayscale) and two chrominance components per pixel, rather than one each of red, green, and blue. In 4:2:0 YCrCb, the chrominance components are subsampled in both dimensions. We assume that upsampling, color space conversion, and any dithering are done on the fly by the video display subsystem. These features are becoming common in workstations and budget PC video accelerators alike. According to prior work on MPEG-1 decoders, this processing can account for almost 25% of total execution time if performed in software [8]. In addition, the 4:2:0 representation of an image is only half the size of the 4:4:4 version, reducing the minimum sustained system bus load to 31.5 Mbytes/s. Finally, these conversion operations account for so many memory and instruction references that they make trace-driven simulation prohibitive - a sufficiently compelling reason to avoid them.
6 Cache Simulation Results
Trace-driven cache simulations clarify how data requests from the MPEG-2 decoder translate into the traffic seen by the memory. First of all, there is a distinct benefit to caching video decoder data, even using naive, generic schemes. It has occasionally been suggested that caches are critically inefficient for video data. Accordingly, several media processors dispense with data caches altogether in favor of SRAM banks - essentially large register files - managed by software [5,3,2]. Yet we have found that despite the large set sizes and the magnitude of data consumed and discarded, there is sufficient reuse of values for caching to significantly reduce the required memory bandwidth.
For example, decoding one Group Of Pictures from a typical MPEG-2 stream generates 2.1 × 10^8 memory requests. A 16 Kbyte, direct-mapped cache generates 3.9 × 10^7 words of memory traffic. If we estimate conservatively that each word of cache data represents a memory request, then even a relatively small, simple cache keeps almost 85% of traffic off of the memory bus. In fact, caches typically load several words at a time, while many CPU memory requests are for data types smaller than a word. Therefore, data caches can make far more efficient use of available memory bandwidth than registers alone. Even very clever register management would have trouble competing with anything but the smallest caches, so even information appliance-type platforms might benefit from implementing an actual cache.
The remainder of this section examines in more detail how the basic parameters of a simple cache (cache size, set associativity, and line size) affect memory traffic as well as misses. All of the caches in our simulations implement a write-back policy on replacement. Not only is this the most popular write policy in actual implementations, but it typically generates far less memory traffic than write-through, the best alternative. Write-allocate, where a write miss causes the loading of the associated line, is also a common feature of all simulated caches.
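The write-back, write-allocate behavior described above can be sketched as a small trace-driven simulator that counts cache-memory traffic in words. This is a minimal illustration, not the Pacino simulator: the class name `Cache`, the LRU replacement policy, the 4-byte word size, and the toy trace are all assumptions made for the example. As in the steady-state setup, dirty lines are not flushed when the trace ends.

```python
from collections import OrderedDict


class Cache:
    """Set-associative, write-back, write-allocate cache with LRU replacement."""

    def __init__(self, size_bytes, line_bytes=32, ways=1, word_bytes=4):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.words_per_line = line_bytes // word_bytes
        # One OrderedDict per set: tag -> dirty flag, least recently used first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0
        self.traffic_words = 0

    def access(self, address, is_write):
        line = address // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        lines = self.sets[index]
        if tag in lines:                      # hit: refresh LRU position
            lines.move_to_end(tag)
            lines[tag] = lines[tag] or is_write
            return
        self.misses += 1
        if len(lines) >= self.ways:           # set full: evict the LRU line
            _, dirty = lines.popitem(last=False)
            if dirty:                         # write-back only if modified
                self.traffic_words += self.words_per_line
        # Line fill; write misses also load the line (write-allocate).
        self.traffic_words += self.words_per_line
        lines[tag] = is_write


# Toy trace: two addresses that collide in a 1 Kbyte direct-mapped cache.
trace = [(0, True), (1024, False), (0, False), (1024, False)]
direct = Cache(1024, ways=1)
two_way = Cache(1024, ways=2)
for address, is_write in trace:
    direct.access(address, is_write)
    two_way.access(address, is_write)
print(direct.traffic_words, two_way.traffic_words)  # 40 16
```

In this toy trace the direct-mapped cache reloads a line on every access and writes one dirty line back, while the 2-way cache keeps both lines resident, which is the kind of conflict-driven excess traffic that associativity reduces.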
If we assume a "perfect" cache, the traffic required to support decoding is equal to the size of the encoded stream, which is read in from memory, plus the decoded video data, which is written back. This absolute minimum cache-memory traffic is used as a yardstick to compare the performance of different configurations in our simulations.

6.1 Cache Size

The size of a cache is its most significant design parameter, certainly from a cost standpoint. Because cache size usually increases or decreases by factors of two, the decision of how large a cache to implement in a system is pivotal. For Main Level MPEG-2 decoding, the cache-memory traffic as a function of cache size and associativity shows little variation between video sequences. Figure 4 shows a representative plot, assuming a standard line size of 32 bytes. This is superimposed on a surface showing the minimum possible cache-memory traffic for the sequence, consisting of the encoded and decoded video streams combined.
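The yardstick itself is simple arithmetic, sketched below. The 720x480 picture size, 8 bits per sample, the 15-picture group, and the encoded size are assumptions made for illustration rather than values taken from the simulations, and the function names are hypothetical. A 4:2:0 picture stores one luminance sample per pixel plus two chrominance planes at quarter resolution, which comes to 1.5 bytes per pixel.

```python
def picture_bytes_420(width, height):
    """Bytes in one 4:2:0 picture, assuming 8 bits per sample."""
    return width * height * 3 // 2


def minimum_traffic_bytes(encoded_bytes, pictures, width, height):
    """Perfect-cache traffic: read the encoded stream once and
    write each decoded picture once."""
    return encoded_bytes + pictures * picture_bytes_420(width, height)


def traffic_ratio(measured_bytes, minimum_bytes):
    """Measured cache-memory traffic as a multiple of the minimum."""
    return measured_bytes / minimum_bytes


# Illustrative inputs only (assumptions, not measurements).
minimum = minimum_traffic_bytes(encoded_bytes=300_000, pictures=15,
                                width=720, height=480)
print(minimum)
print(traffic_ratio(2 * minimum, minimum))
```

A ratio of 1.0 would correspond to the perfect cache; the plots in this section report traffic as this kind of multiple.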
Figure 4. Typical cache-memory traffic for MPEG-2 Main Level decoding, over minimum possible traffic level

The most prominent feature of the traffic function is a large plateau, at a level of 6.3 times the minimum value. Except for direct-mapped caches, which produce a peak of 16 times the minimum, cache-memory traffic changes little with increasing cache size for most of the range. Traffic starts to roll off at 1 Mbyte, and plummets at 2 Mbytes. This reflects the 2 Mbyte size of the decoder data set. Larger caches show negligible improvement, with the additional space providing no extra benefit. However, the smallest measured value is still almost 2 times higher than the minimum possible value.

These data imply that a 32 Kbyte cache is just as good - or bad - for MPEG-2 Main Level decoding as a 64 Kbyte or 128 Kbyte one, for example. Improving on cache-memory traffic requires a cache of 1 Mbyte or more, but even a cache of 2 Mbytes or more only brings traffic down to double the absolute minimum, leaving plenty of room for improvement.

6.2 Associativity

Increasing set associativity is a popular method for getting higher performance out of smaller caches. The memory traffic effects of varying set associativity are also visible in Figure 4. Going from a direct-mapped cache to a 2-way set-associative one can reduce memory traffic by as much as 50% for small caches. Increasing associativity to 4 can squeeze out almost another 10% improvement over the direct-mapped case. Set sizes greater than 4, however, show minimal benefit across all cache sizes. Since increasingly higher levels of associativity add considerably to cache cost, complexity and access time, such enhancements are not justified.

While memory traffic is the primary focus, we would prefer not to adversely affect miss rates. As a function of cache size and set associativity, miss rates show the same behavioral pattern as memory traffic.
Unlike cache-memory traffic, however, miss-rate performance is more than satisfactory, with an average miss rate below 0.5%.

6.3 Line Size

Line size, another fundamental parameter, is less costly to experiment with than cache size. Subblock
placement can help decouple the size of cache lines from that of the memory bus. Unfortunately, there is contention between miss rate and memory traffic minimization. Low miss rates call for larger lines than the typical 32 bytes, as illustrated in Figure 5. Larger lines tend to provide superior spatial locality, but require more data to be read and possibly written back on a miss. For this reason, minimal memory traffic occurs with the smaller lines. This relationship is readily apparent in Figure 6, where for the smallest caches, the largest line sizes lead to cache-memory traffic almost 200 times the absolute minimum.

Figure 5. Typical miss rates for MPEG-2 Main Level decoding over different cache sizes

Figure 6. Typical cache-memory traffic for MPEG-2 Main Level decoding over different cache sizes

Balancing the demands of miss rates and memory traffic requires further investigation. For the time being, maintaining the ubiquitous 32 byte line size seems sensible. The interests of memory traffic reduction argue against a switch to larger values.

6.4 Summary

It is clear that conventional cache techniques provide limited relief from excess cache-memory traffic. The simplest way to improve memory traffic and miss rates is to have a very large cache - but a large
cache is not always feasible, and simulations show that improving MPEG-2 decoding performance requires very big caches. In any case, one would prefer to extract better performance from smaller, lower-cost resources, improving performance through increased efficiency rather than brute force.

7 Decoder Internal Analysis

This section explores the internal functioning of a software-based MPEG-2 video decoder, as a first step to explaining and rectifying its suboptimal use of cache resources. First, we dissect the different kinds of data used by the decoder, including compressed input data, image output data, and all of the intermediate types. Second, we examine how these data objects are used in the process of decoding, and how this affects the performance of the cache.

7.1 Data Set Composition

The data set of an MPEG-2 decoder - the information that must be available to the program in the course of its execution - consists of several different data types which serve distinct purposes. As a result, they are quite heterogeneous with respect to the types and patterns of access utilized by the decoder, as well as the amount of memory space required. The following classes account for the global and static local values in user space.

Input. The compressed MPEG-2 sequence; data are read in series from a fixed-size buffer and refreshed by system calls.
Output. Uncompressed picture data in YCrCb format, stored in a video window image buffer. This data type is write only - written but never read by the CPU. System calls transfer each completed picture into the frame buffer.
Tabular. Static, read only information used in the MPEG decoding process, such as various lookup tables.
Reference. The current frame and the previously decoded frames used to reconstruct it (up to two for a B frame), in YCrCb form.
Block. The DCT coefficient and pixel values for a single macroblock.
State.
Values incidental to the settings and operation of the decoder, yet not part of the image data per se.

This partitioning of data types may not be explicit within the decoder source, but represents an abstraction at a higher level of the data needed for decoder operation. Most of these data types are both read and written. The exceptions are the tabular data, which are read only, and the output data, which are write only. Within a memory hierarchy, reading and writing are not symmetric operations. Read misses and write misses in caches have very different latencies. The fact that some MPEG decoder data types are only read or written can be exploited for performance optimization.

The essential properties of the different data types are summarized in Table 2. Note how in terms of size, the reference and output types completely dominate the others. With respect to caching, this means that their presence will tend to repeatedly expel the other types from the cache, except for very large data caches. It doesn't matter how many times the other values are re-used. Capacity limitations alone will ensure that reference and output data, when updated, will throw the other data types out.

Data Type    Access Type    Size       Fraction of References
Input        read/write     2 KB       2.7%
Output       write only     500 KB     3.9%
Tabular      read only      5 KB       5.5%
Reference    read/write     1500 KB    23.7%
Block        read/write     1.5 KB     31.4%
State        read/write     0.5 KB     25.6%

Table 2. Summary of decoder data types, sizes, access types, and proportion of memory references accounted for

Ranking the data types in terms of the number of references rather than by size gives a very different picture. Block and state data account for by far the most memory requests. Reference data is next, but output data is in next to last place, accounting for 8 times fewer memory requests than block data. Note that the percentages don't add up to 100%; approximately 7.2% of data references are due to temporary variables and library functions.

7.2 Decoder Program Flow

The hierarchical composition of the MPEG bitstream and its intrinsic sequentiality constrain the structure of the decoder program. Figure 7 shows the phases of operation for a single GOP from the perspective of data accesses. As the diagram makes evident, execution proceeds as a set of nested loops corresponding to different data levels in the bitstream as shown in Figure 3. The "initialization" and "termination" blocks represent the operations excluded by steady-state simulation.
Figure 7. Model of MPEG-2 software decoder showing phases of operation for a GOP

Table 3 shows which data types are accessed in the different phases of operation. Parsing header information and reading in block data involves operating on relatively small data types, and results in a small active cache set. However, during reconstruction, the decoder accesses portions of the reference frames and copies them into the frame currently under construction. The Inverse DCT (IDCT) phase focuses on smaller data types, but merging the decoded pixels back into the new picture requires once again accessing large data types. Writing the completed frame requires traversing significant portions of reference and output data. Notice that the decoder accesses state data on a fairly continuous basis.

Phases of Operation    Data Types (input, output, tabular, reference, block, state)
read headers           X X X
read block             X X X X
reconstruct MB         X X
IDCT                   X X X
merge MB data          X X X
write frame            X X X

Table 3. Data types accessed in each phase of decoder operation
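The nested-loop structure behind these phases can be sketched as follows. This is an illustrative skeleton, not the decoder's source: the function name, the assumption that every macroblock passes through all four macroblock-level phases (intra-coded macroblocks, for instance, would skip reconstruction from reference frames), and the example picture geometry are assumptions. Only the picture, slice, and macroblock nesting comes from the MPEG-2 bitstream hierarchy, and the phase names are those of Table 3.

```python
from collections import Counter


def decode_gop(pictures, slices_per_picture, mbs_per_slice):
    """Count how often each phase runs while decoding one GOP."""
    phases = Counter()
    for _ in range(pictures):
        phases["read headers"] += 1              # picture header
        for _ in range(slices_per_picture):
            phases["read headers"] += 1          # slice header
            for _ in range(mbs_per_slice):
                phases["read block"] += 1        # parse coefficients
                phases["reconstruct MB"] += 1    # fetch from reference frames
                phases["IDCT"] += 1              # transform block data
                phases["merge MB data"] += 1     # add pixels into current frame
        phases["write frame"] += 1               # copy picture to output space
    return phases


# Assumed geometry: 720x480 pictures, 16x16 macroblocks (45 per row, 30 rows),
# one slice per macroblock row, 15 pictures per GOP.
for phase, count in decode_gop(15, 30, 45).items():
    print(phase, count)
```

The counts make the imbalance visible: the macroblock-level phases run over a thousand times per picture, while the frame write runs once but sweeps through the largest data types.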
Each iteration through the macroblock loop accesses a new portion of one or two reference frames. Each new P or B frame repeats this traversal through its reference frame(s), recalling significant fractions of each frame back into the cache. During the "write frame" phase, the decoder spends most of its time copying the current frame from the reference space to the output space. For all but the largest caches, this has the effect of flushing out almost everything except video data for the next displayed picture. The other data types will have to be reloaded for each new picture, or possibly several times a picture for small caches, resulting in wasted cache-memory traffic.

7.3 System and Audio Interaction

Our simulations assume that MPEG video decoding is the only active process. In reality, video is usually viewed in conjunction with audio data, which requires the concurrent execution of an MPEG audio codec and an MPEG system parser to de-multiplex and synchronize the two media. We expect that running these other processes will not dramatically affect our results, since the amount of data they operate on at any given time is relatively small. The net effect is that of having a slightly smaller cache, and our simulations show that memory traffic levels are fairly insensitive to small changes in cache size. Nevertheless, the presence of these other programs is a good argument for keeping the cache footprint of the video data as small as possible.

8 Cache-Oriented Enhancements

Our analysis of decoder behavior and data composition motivates a specific approach to performance optimization, where the goal is to improve cache efficiency for video decoding without adversely impacting other performance metrics or other applications.
In particular, there are great potential benefits in treating different data types distinctly, guided by their different sizes, access types, and access patterns. For example, write-only values, like video output data, are clearly a waste of cache space. Likewise, while block and state data account for most of the memory requests, their relatively small size means that larger data types, like reference data, systematically crowd them out of the cache. Preventing blatant cache pollution and the predation of small data types with high re-use can yield considerable cache-memory traffic savings. To this end, we are evaluating the following traffic-reduction techniques.

Selective Caching. Exclude specific data from the cache.
Cache Locking. Reserve selected cache lines for particular data objects; lines are untouched by cache replacement while locked.
Scratch Memory. Implement a small portion of addressable system memory on the processor - not as cache - for storage of small, frequently-used data.
Data Reordering. Perform cache-conscious modifications of decoder memory accesses.
Cache Partitioning. Allocate different sections of the cache to different data types.

Many of these methods are relatively simple to implement, and most have precedents in other architectures since they have performance benefits beyond video decoding. Perhaps their efficacy for MPEG-2 processing might encourage their broader adoption in architectures. In any case, the main challenge is to successfully apply these various techniques in a manner that enhances decoder
performance while avoiding excessive cost and complexity. The remainder of this section considers each method on its own.

8.1 Selective Caching

Data objects with no possibility of re-use steal cache space from other data, which leads to excess memory traffic when the excluded data are loaded in once again. Video output data is a perfect example, since it is written but never read by the CPU, and occupies a fairly large amount of space. Bypassing the cache entirely in favor of direct storage to main memory promises more efficient use.

In our simulations, we have found that excluding video output data alone from the cache reduces cache-memory traffic by up to 50%. Across the configurations considered, not caching video output yields a 25% reduction on average. Improvements in cache-memory traffic are global, yet the shape of the curve is not substantially modified. The plateau out to 512 Kbytes persists, as does the swift drop down at 2 Mbytes. Excluding output data is even more helpful for miss rates, which drop by a maximum of 85% from earlier levels, and 60% on average.

There are several distinct ways of implementing this feature. In the PA-RISC architecture, the load and store instructions can contain "locality hints" which signal that the data referred to is unlikely to be used again. If supported by the particular implementation, the processor can elect to bypass the cache with the data [7]. The UltraSPARC has block load and store instructions, which move groups of data directly between registers and main memory at high speed [10]. This approach allows for potentially more efficient use of processor bandwidth. Finally, the PowerPC architecture provides for marking areas of memory as non-cacheable. Once marked, data is transparently and automatically transferred between registers and memory buffers.
The cache hint and block load/store approaches require more explicit programming effort.

8.2 Cache Locking

One way to protect small but frequently reused data types, like input, state, and tabular values, from victimization by raw video data is to lock the parts of the cache which contain the vulnerable data. While locked, these cache lines are invisible to the replacement algorithm, and the contents will not be thrown out only to be re-loaded when needed again. This is accomplished by special machine instructions which execute the locking and unlocking.

There are two basic variations on this technique. Static locking simply freezes the tag and contents of the affected line, allowing for the writing of values but not replacement; the line is associated with the same portion of main memory until unlocked. Dynamic locking is somewhat more flexible, treating locked lines as an extension of the register set, with special instructions to copy contents directly to and from main memory. This scheme is used in the Cyrix MediaGX processor to speed the software emulation of legacy PC hardware standards.

8.3 Scratch Memory

Another means of providing a safe haven for vulnerable data is to put them in a separate memory. Many processors today consist to a large extent not of logic but SRAM, in the form of caches. It is easy to imagine implementing a portion of the memory space itself on the die - not as a cache but addressable memory. Data which are subject to being prematurely ejected from the cache could then
be kept in this space. The components of the MPEG-2 data which would most benefit from this are relatively small, so only a few Kbytes would be required. This "scratch memory" would be relatively inexpensive to implement in hardware, smaller than most caches, and far simpler than any of them due to the lack of tags and logic for lookup, replacement, etc. Similar memories have actually been a feature of Intel microcontrollers for quite some time, for storing Page 0 of the address space.

A scratch memory provides potentially very fast access to important data, since there is no possibility of a read or write miss, and the delay of searching tags can be eliminated. Intel microcontrollers even use special addressing modes to access Page 0 memory which are more concise and faster than those used for off-chip addresses. Unlike with cache locking, cache performance and capacity are not compromised since the scratch memory is off to the side. Of course, getting a hardware feature used in embedded systems, where operating systems and even compilers are optional, to work effectively in an interactive, multitasking environment will take careful consideration, but the potential benefits are considerable.

8.4 Data Reordering

This category of enhancement is a departure from our standard procedure of not tampering with the decoder itself. Our goal is not to improve computational efficiency, a task which we leave to the many others working on that problem. However, we suggest that modifying how the decoder software manages and addresses memory can provide significant improvements in cache efficiency - some of which may require hardware support to properly implement.

For example, there is strictly speaking no essential need for the output data type. B frame data, considered separately, is never used by the decoder to construct other frames. The only time it is read is for copying to the output buffer space.
I and P pictures need to remain available to serve as reference frames. Yet it turns out that in the natural flow of decoder data, by the time these frames are copied to the output space they are no longer needed. If pictures could be transferred directly by DMA from the reference frame space where they are stored, rather than routed through the output data, this would result in greater efficiency. It would eliminate all of the cycle-consuming copying of output data, and transform the B frame data into write-only values which could be efficiently excluded from the cache.

8.5 Cache Partitioning

It is possible to take the basic idea of cache locking in a slightly different direction. With cache partitioning, particular data types are relegated to specific sections of the cache, which continue to behave like cache. However, rather than locking in the physical address associated with a line, or treating cache lines as extensions of the register set, it is as if each data type has its own separate, smaller cache space for storing its values. This method provides very precise control over cache behavior, and an extremely powerful tool for the optimal management of cache resources. However, it requires a level of hardware complexity not demanded by the methods discussed previously. Also, the task of efficiently exploiting the available capabilities in software is considerably more challenging.
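The effect of giving each data type its own cache space can be shown with a toy model. The sketch below makes several simplifying assumptions: each partition is fully associative with LRU replacement, writes are ignored, and the class name, partition sizes, and synthetic access pattern are invented for illustration. It compares a single shared 8-line cache with two 4-line partitions while a small, heavily re-used "block" working set competes with streaming "reference" data.

```python
from collections import OrderedDict


class PartitionedCache:
    """Hypothetical cache split into independent LRU partitions."""

    def __init__(self, lines_per_partition):
        self.capacity = dict(lines_per_partition)
        self.parts = {name: OrderedDict() for name in lines_per_partition}

    def access(self, partition, key):
        """Return True on a hit, False on a miss."""
        part = self.parts[partition]
        if key in part:
            part.move_to_end(key)          # mark most recently used
            return True
        if len(part) >= self.capacity[partition]:
            part.popitem(last=False)       # evict within this partition only
        part[key] = True
        return False


def block_misses(cache, partition_of):
    """Re-use 4 block lines between bursts of 8 fresh reference lines."""
    misses = 0
    next_ref = 0
    for _ in range(10):
        for i in range(4):
            if not cache.access(partition_of["block"], ("block", i)):
                misses += 1
        for _ in range(8):
            cache.access(partition_of["reference"], ("reference", next_ref))
            next_ref += 1
    return misses


shared = block_misses(PartitionedCache({"all": 8}),
                      {"block": "all", "reference": "all"})
split = block_misses(PartitionedCache({"block": 4, "reference": 4}),
                     {"block": "block", "reference": "reference"})
print(shared, split)
```

In the shared configuration every burst of reference data evicts the block lines, so they miss again on each pass; with partitions they miss only on first use. Real hardware would also have to decide which partition an address belongs to, which this model takes as given.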
8.6 Summary

None of the methods proposed are mutually exclusive, although some of them, like cache locking and scratch memory, are arguably redundant if used in conjunction. Whether one of these techniques or some combination of them provides the most efficient solution remains to be seen. Also, each one raises issues of implementation cost and complexity, interaction with the operating system and other processes, and software interfacing and portability which need to be more thoroughly addressed. We will investigate these issues and perform a more detailed evaluation of these architectural enhancements in future work.

9 Conclusions

The effectiveness of computer memory subsystems, including caches, is currently a significant barrier to improved performance on multimedia applications. We have documented the memory traffic and data cache behavior of MPEG-2 video decoding on a general-purpose computer, and demonstrated that standard caches produce a significant excess of cache-memory traffic. While almost any cache is superior to none - even simple caches can reduce required memory bandwidth considerably - experimenting with basic cache parameters like cache size, associativity, and line size has a limited ability to reduce cache-memory traffic. The best value that can be achieved is double the absolute minimum traffic required.

However, cache-oriented architectural enhancements, driven by an understanding of decoder behavior, can dramatically improve cache efficiency. These enhancements include selective caching, cache locking, scratch memory, data reordering, and cache partitioning. All have in common an emphasis on treating different elements of the data distinctly, accommodating their unique properties - size, usage patterns, temporal locality, etc. This approach provides for optimal management of available cache resources.
We expect that further refinement of these techniques will enable improved video decoding performance of general purpose microprocessors, and encourage the broader proliferation of digital video platforms and applications.

10 Acknowledgments

This research was supported in part by NSF grant CCR-9696196. Miriam Leeser is the recipient of an NSF Young Investigator Award. The research is also supported by generous donations from Sun Microsystems. The authors would like to thank Shantanu Tarafdar and Brian Smith for useful discussions.

References

[1] Stefan Eckart et al. ISO/IEC MPEG-2 software video codec. In Proc. Digital Video Compression: Algorithms and Technologies 1995, pages 100-109. SPIE, Jan. 1995.
[2] Peter N. Glaskowsky. Fujitsu aims media processor at DVD. Microprocessor Report, 10:11-13, November 1996.
[3] Linley Gwennap. Mitsubishi designs DVD decoder. Microprocessor Report, 10:1,6-9, December 1996.
[4] International Organization for Standardization. ISO/IEC JTC1/SC29/WG11/602 13818-2 Committee Draft (MPEG-2), Nov. 1993.
[5] Paul Kalapathy. Hardware-software interactions on MPACT. IEEE Micro, 17(2):20-26, Mar./Apr. 1997.
[6] James R. Larus. Efficient program tracing. IEEE Computer, 26(5):52-61, May 1993.
[7] Ruby B. Lee. Subword parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.
[8] Ketan Patel et al. Performance of a software MPEG video decoder. In Proc. 1st ACM Intl. Conf. on Multimedia, pages 75-82. ACM, Aug. 1993.
[9] Alex Peleg et al. Intel MMX for multimedia PCs. Comm. of the ACM, 40(1):25-38, Jan. 1997.
[10] Marc Tremblay et al. VIS speeds new media processing. IEEE Micro, 16(4):10-20, August 1996.
[11] Daniel F. Zucker et al. A comparison of hardware prefetching techniques for multimedia. Technical Report CSL-TR-95-683, Stanford University Depts. of Electrical Engineering and Computer Science, Stanford, CA, Dec. 1995.