The Codex of Business Writing Software for Real-World Solutions 2.pptx
Profcompact
1. Profile-Guided Code Compression
Saumya Debray
Department of Computer Science
University of Arizona
Tucson, AZ 85721.
debray@cs.arizona.edu

William Evans
Department of Computer Science
University of British Columbia
Vancouver B.C. Canada, V6T 1Z4.
will@cs.ubc.ca
2010.05.17

ABSTRACT
As computers are increasingly used in contexts where the amount of available memory is limited, it becomes important to devise techniques that reduce the memory footprint of application programs while leaving them in an executable form. This paper describes an approach to applying data compression techniques to reduce the size of infrequently executed portions of a program. The compressed code is decompressed dynamically (via software) if needed, prior to execution. The use of data compression techniques increases the amount of code size reduction that can be achieved; their application to infrequently executed code limits the runtime overhead due to dynamic decompression; and the use of [...]

1. INTRODUCTION
In recent years there has been an increasing trend towards the incorporation of computers into a wide variety of devices, such as palm-tops, telephones, embedded controllers, etc. In many of these devices, the amount of memory available is limited, due to considerations such as space, weight, power consumption, or price. For example, the widely used TMS320-C5x DSP processor from Texas Instruments has only 64 Kwords of program memory for executable code [23]. At the same time, there is an increasing desire to use more and more sophisticated software in such devices, such as encryption software in telephones, speech/image processing software in palm-tops, fault diagnosis software in embedded processors, etc. Since these devices typically have no secondary storage, an [...]
2. Citation Count
[Bar chart: citation count per year, 2002 to 2009; count axis from 0 to 15]
DAC, ASPDAC, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
4. The basic organization
[Figure 1 diagram. (a) Original: call sites C1 to C6 and the infrequently executed functions f, g, and h. (b) Compressed: a never-compressed part containing call sites C1 to C6, the code stubs f.stub, g.stub, and h.stub, and the Decompressor; compressed code for f, g, and h; a function offset table with entries [0], [1], and [2]; and a runtime buffer.]
Figure 1: Code Organization: Before and After Compression
2. THE BASIC APPROACH
2.1 Overview
Figure 1 shows the basic organization of code in our system. [...] a program with three infrequently executed functions, f, [...]
2.2 Buffer Management
The scheme described above is conceptually fairly straightforward but fails to mention several issues whose resolution determines its performance. The most important of these is the issue of function calls in the compressed code. Suppose that in Figure [...]
5. g stub
[Figure 2 diagram, laid out as text]
(a) Original
    f:    offset 0        entry
          ...
    cs0:  offset 96       bsr $ra, g
          offset 97       ...
                          return
(b) Transformed, during runtime after CreateStub has created RestoreStub(f,98)
    never-compressed part:
        EntryStub:           bsr r, Decompress
                             <index(f), 0>
    runtime stub list:
        RestoreStub(f,98):   bsr $ra, Decompress
                             <index(f), 98>
                             <count>
    runtime buffer:
        f:    offset 0       entry
        cs0:  offset 96      bsr $ra, CreateStub
              offset 97      br g
              offset 98      ...
                             return
Figure 2: Managing Function Calls Out of the Runtime Buffer.
[...] executable code, and only discards it to prevent the system from running out of memory.
The main drawback with this approach is that the runtime buffer must be made large enough to hold all of the decompressed functions that can possibly coexist on the call stack.

[...] after f’s call to g. This stub obviously cannot be placed in the runtime buffer, since it may be overwritten there; it must be placed in the never-compressed portion of the program. Since every call from a compressed function requires its own stub, these restore stubs amount to a large fraction of the final executable’s size (e.g., [...]
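
The interplay between the runtime buffer and restore stubs can be illustrated with a small simulation. This is only a sketch under stated assumptions: the buffer holds one decompressed function at a time, the callee g is itself compressed, and a restore stub is discarded once its usage count (the `<count>` field in Figure 2) drops to zero. The names `RuntimeBuffer`, `call_out`, and `return_via_stub` are hypothetical and do not come from the paper's implementation.

```python
class RuntimeBuffer:
    """Toy model: one buffer slot, restore stubs kept outside the buffer."""

    def __init__(self):
        self.resident = None      # function currently decompressed in the buffer
        self.decompressions = 0   # number of times the decompressor ran
        self.restore_stubs = {}   # (function, return offset) -> usage count

    def decompress(self, func):
        # Entry stub behaviour: overwrite the buffer unless func is already there.
        if self.resident != func:
            self.resident = func
            self.decompressions += 1

    def call_out(self, caller, return_offset, callee):
        # CreateStub: make (or reuse) a restore stub for this return point.
        key = (caller, return_offset)
        self.restore_stubs[key] = self.restore_stubs.get(key, 0) + 1
        self.decompress(callee)   # the callee may overwrite the caller
        return key

    def return_via_stub(self, key):
        # RestoreStub: bring the caller back, then resume at return_offset.
        caller, return_offset = key
        self.decompress(caller)
        self.restore_stubs[key] -= 1
        if self.restore_stubs[key] == 0:
            del self.restore_stubs[key]   # assumption: unused stubs are freed
        return return_offset


buf = RuntimeBuffer()
buf.decompress("f")                     # enter f through its entry stub
stub = buf.call_out("f", 98, "g")       # f calls g; control must return to offset 98
resume_at = buf.return_via_stub(stub)   # g returns; f is decompressed again
```

In this run f is decompressed twice and g once, which is the overhead the scheme accepts in exchange for a buffer that need not hold f and g at the same time.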
6. Compression & Decompression
splitting streams approach [9]
by encoding each field using Huffman code
canonical Huffman encoding
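
As an illustration of canonical Huffman encoding for one field stream, the sketch below derives code lengths from symbol frequencies and then assigns codes in (length, symbol) order. It is a generic textbook construction, not the paper's encoder, and the opcode stream is made-up sample data. In a split-stream scheme, each instruction field would get its own stream and its own code in this manner.

```python
import heapq
from collections import Counter


def code_lengths(freqs):
    """Return {symbol: Huffman code length} for a frequency table."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap items: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def canonical_codes(lengths):
    """Assign canonical codes: sort by (length, symbol) and count upward."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)   # widen the code when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Hypothetical opcode stream for one field.
opcodes = ["ldq", "ldq", "ldq", "ldq", "addq", "addq", "bsr", "ret"]
codes = canonical_codes(code_lengths(Counter(opcodes)))
```

Because canonical codes are determined by the code lengths alone, a decompressor only needs the lengths to rebuild the table, which keeps the decoder's tables small.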
7. [...] instance, function calls from within a compressed region are still handled as discussed in Section 2.
Compressible Region
We now face the problem of how to choose regions to compress. We want these regions to be reasonably small so that the runtime buffer can be small, yet we want few control transfers between different regions so that the number of entry stubs is small. This is an optimization problem. The input is a control flow graph for a program, in which each vertex represents a basic block and has size equal to the number of instructions in the block, and each edge represents a control transfer from one block to another. In addition, the input specifies a subset of the vertices that can be compressed. The output is a partition of a subset of the compressible vertices into regions so that the following cost is minimized:

  cost = (never-compressed code) + (compressed code) + (function offset table) + (entry stubs) + (runtime buffer)

In the compressed-code term, each region is counted at its size after compression. The entry-stub term is taken over the set of blocks requiring an entry stub, i.e., [...]; the constant [...] is the number of words required for an entry stub, and [...]
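
The trade-off between region size and stub count can be explored with a small cost calculator. This is a sketch under stated assumptions, since the exact formula is not recoverable from the text here: compressed size is modelled as a fixed ratio of the original, the function offset table takes one word per region, a block needs an entry stub when some edge enters it from outside its region, and the runtime buffer is as large as the largest uncompressed region. The function name `partition_cost` and all parameter names are hypothetical.

```python
def partition_cost(block_size, edges, regions, ratio, stub_words, table_words=1):
    """Estimated footprint in words for one choice of compressible regions.

    block_size:  {block: instruction count}
    edges:       iterable of (src, dst) control transfers
    regions:     list of sets of blocks chosen for compression
    ratio:       assumed compressed/original size ratio
    stub_words:  words per entry stub
    table_words: words per function offset table entry (assumed 1)
    """
    region_of = {b: i for i, r in enumerate(regions) for b in r}
    never_compressed = sum(s for b, s in block_size.items() if b not in region_of)
    compressed = sum(ratio * sum(block_size[b] for b in r) for r in regions)
    offset_table = table_words * len(regions)
    # Assumption: a block needs an entry stub if an edge enters it
    # from outside its own region.
    need_stub = {d for s, d in edges
                 if d in region_of and region_of.get(s) != region_of[d]}
    entry_stubs = stub_words * len(need_stub)
    # Assumption: the buffer must hold the largest decompressed region.
    runtime_buffer = max((sum(block_size[b] for b in r) for r in regions), default=0)
    return never_compressed + compressed + offset_table + entry_stubs + runtime_buffer


# Hypothetical program: hot block A and three cold blocks B, C, D.
sizes = {"A": 10, "B": 40, "C": 40, "D": 40}
cfg = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]

one_region = partition_cost(sizes, cfg, [{"B", "C", "D"}], ratio=0.5, stub_words=2)
three_regions = partition_cost(sizes, cfg, [{"B"}, {"C"}, {"D"}], ratio=0.5, stub_words=2)
```

With these made-up numbers the single large region costs more than the uncompressed program (130 words) because of its buffer, while three small regions pay for extra stubs and table entries but still come out ahead.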
8. Compressible Regions
[Figure 3 plots: normalized code size (y-axis) versus buffer size bound (x-axis: 32, 64, 128, 256, 512, 1024, 2048, 4096), one curve per program in panels (a), (b), and (c), and panel (d) showing the mean, with a legend listing 0.0, 0.00001, and 0.00005. Key: a: adpcm, b: epic, c: g721_dec, d: g721_enc, e: gsm, f: jpeg_dec, g: jpeg_enc, h: mpeg2dec, i: mpeg2enc, j: pgp, k: rasta.]
upper bound of runtime buffer K = 512
Figure 3: Effect of Buffer Size Bound on Code Size
[...] is the number of external function calls within (the decompressed [...]
[...] a value for [...], we get a large number of small compressible regions [...]
9. Cold Code
(the geometric mean of) the relative amount of cold and compressible code in our programs
[Figure 4 plot: fraction of code (y-axis, 0.00 to 1.00) versus threshold (x-axis: 0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0); curves: cold code, compressible code.]
Figure 4: Amount of Cold and Compressible Code (Normalized)
11.
Program     Profiling Input                Timing Input
            file name          size (KB)   file name                size (KB)
adpcm       clinton.pcm        295.0       mlk_IHaveADream.pcm      1475.2
            clinton.adpcm      73.8        mlk_IHaveADream.adpcm    182.1
epic        baboon.tif         262.4       baboon.tif               262.4
            lena.tif           262.4
g721_dec    clinton.g721       73.8        mlk_IHaveADream.g721     368.8
g721_enc    clinton.pcm        295.0       mlk_IHaveADream.pcm      1475.2
gsm         clinton.pcm        295.0       mlk_IHaveADream.pcm      1475.2
jpeg_dec    testimg.jpg        5.8         roses17.jpg              25.1
jpeg_enc    testimg.ppm        101.5       roses17.ppm              681.1
mpeg2dec    sarnoff2.m2v       102.5       tceh_v2.m2v              2310.7
mpeg2enc    sarnoff2.m2v       102.5       tceh_v2.m2v              2310.7
pgp         compression.ps     717.2       TI-320-user-manual.ps    8456.6
rasta       ex5_c1.wav         17.0        phone.pcmle.wav          83.7
Figure 5: Inputs used for profiling and timing runs
12.
[Figure 6 plot: code size reduction (%) (y-axis, 0 to 30) for each program and the geometric mean at thresholds 0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, and 1.0. Key: a: adpcm, b: epic, c: g721_dec, d: g721_enc, e: gsm, f: jpeg_dec, g: jpeg_enc, h: mpeg2dec, i: mpeg2enc, j: pgp, k: rasta, M: geometric mean.]
Figure 6: Code Size Reduction due to Profile-Guided Code Compression at Different Thresholds
[...] been space optimized by about 30% on average. Squash, using the runtime decompression scheme outlined in this paper, compacts squeezed binaries by about another 14–19% on average.

[...] inputs refer to those used to obtain the execution profiles that were used to carry out compression, while the timing inputs refer to the inputs used to generate execution time data for the uncompressed [...]
13. However, as the threshold is increased, the runtime overhead associated with repeated dynamic decompression of code quickly begins to make itself felt. Our experience with this set of programs (and others) indicates that beyond a threshold of [...] the runtime overhead becomes quite noticeable. To obtain a reasonable balance between code size improvements and execution speed, we focus on threshold values of up to 0.00005.
Execution time data were obtained on a workstation with a 667
MHz Compaq Alpha 21264 EV67 processor with a split two-way
set-associative primary cache (64 Kbytes each of instruction and
data cache) and 512 MB of main memory running Tru64 Unix. In
each case, the execution time was obtained as the smallest of 10
runs of an executable on an otherwise unloaded system.
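
The "smallest of 10 runs" measurement can be sketched as follows. This is an illustrative harness only, not the authors' tooling: it measures wall-clock time (an assumption, since the text does not say which timer was used), and the helper name `best_of` and the stand-in workload are hypothetical.

```python
import subprocess
import sys
import time


def best_of(cmd, runs=10):
    """Run cmd `runs` times and return the smallest wall-clock time in seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return min(times)


# Stand-in workload: the Python interpreter running an empty program.
elapsed = best_of([sys.executable, "-c", "pass"], runs=10)
```

Taking the minimum rather than the mean discards runs disturbed by other activity on the machine, which fits the stated goal of measuring on an otherwise unloaded system.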
Figure 7 examines the performance of our programs, both in terms of size and speed, for thresholds ranging from 0.0 to 0.00005. The final set of bars in this figure shows the mean values for code size reduction [...]
[Figure 7 plots: one group of bars per program (adpcm, epic, g721_dec, g721_enc, gsm, jpeg_dec, jpeg_enc, mpeg2dec, mpeg2enc, pgp, rasta) plus the geometric mean, for thresholds 0.0, 0.00001, and 0.00005.
(a) Code Size: code size reduction (%), axis 0 to 30; values labeled on the plot: 13.7, 16.8, 18.8.
(b) Execution Time: execution time (normalized), axis 0.0 to 2.5; values labeled on the plot: 1.00, 1.04, 1.24.]