SlideShare una empresa de Scribd logo
1 de 4
Descargar para leer sin conexión
Short Paper
Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013

An Index-first Addressing Scheme for
Multi-level Caches
Hemant Salwan
Department of Information technology,
Institute of Information Technology and Management, New Delhi, India
salwan.hemant@gmail.com
Abstract—Performance is one of the leading factors in the
current processor designs. A processor’s efficiency highly
depends upon the organization of its cache. In the multi-level
caches, the existing scheme of addressing the cache memory
produces a significant fraction of address conflicts at the lower
level cache. This increases the global miss rate which
diminishes the cache and processor performance. This problem
becomes more severe in multi-processing environment
because of conflicts of inter-process interference. An approach
that performs the cache addressing efficiently in multi-level
caches is proposed in this paper. The experimental results
show that this scheme significantly decreases the global miss
rate of the caches and the cycles per instruction (CPI)
performance metric of the processor.
Index Terms—cache memory, cache addressing, conflict misses,
multi-level caches

I. INTRODUCTION
Cache memory addressing scheme [1] partitions the
physical address into three fields: tag, index and offset. The
index field is used to select one of the cache sets. The tag
field is compared against it for a hit. Figure 1 shows how an
address is divided into three fields. The block offset selects
the desired word from the block.

without conflict. This scenario is most significant in the current multi-processor systems which allow a high degree of
multi processing causing even large caches to crunch under
its pressure.
Previous studies have explored these drawbacks and
proposed the solutions. Non–temporal streaming [5] has
reduced the cache conflict misses by predicting the temporal
behavior of a block. Alternative cache hashing functions were
used in [6] to eliminate the conflicts misses. Cache Miss Look
aside [7] used a hardware buffer to reduce conflict misses.
Compile-time transformations [8] [9] were also used to
eliminate conflict misses. Unfortunately, all of these schemes
either require a specialized hardware or are not dynamic in
nature.
An efficient scheme of addressing the cache is proposed
in this paper which overcomes the above mentioned
limitations in the above said workload environments.
Moreover, it does not require any additional hardware on the
processor chip. Section II describes the index-first cache
addressing scheme in detail and also discuss its comparison
with the traditional scheme. Section III describes the
experimental methodology and the results achieved using
the new scheme. Finally, Section IV concludes the paper.
II. INDEX-FIRST ADDRESSING

Figure 1. Traditional address partitioning

In this proposed cache addressing scheme, we
interchange the tag and index fields of the traditional way to
make it called as index-first addressing. This scheme thus
uses the high order bits to index into the cache and lower
order bits as the tag to compare against the selected cache
block frame for a hit. The address fields are as shown in
Figure 2.

As per to this traditional scheme the contiguous blocks
in memory are mapped to contiguous blocks of the cache.
Although this scheme uniformly distributes the memory block
addresses across the cache block frames, but it suffers with
two major limitations [2] [3] [4]. First, multiple memory blocks
map to the same set of block frames at all the cache levels
which potentially produces global conflict misses in the
caches. Second, due to generally much larger high level cache
Figure 2.Index-first address partitioning
multiple blocks of higher level cache map to a single frame of
lower level cache which causes conflict misses at the lower
The following relations will hold true in the index-first
level. In Figure 4 the types of block addressing conflicts
addressing scheme:
found in the traditional scheme are illustrated. These limitations arise specifically in two workload environments: i) a
single running process whose memory access pattern is scattered across the memory pages instead of being localized to
a region. This scenario is specially encountered when the
By this way of addressing the memory, the whole memory
size of primary cache is very small, and ii) in the multi-procan be seen as composed of a fixed number of regions and
cessing environment where the caches are not large enough
each region is composed of a fixed number of blocks. It is
to accommodate the working sets of all the running processes
illustrated in Figure 3 how this scheme maps in a solo cache.
23
© 2013 ACEEE
DOI: 03.LSCS.2013.2.40
Short Paper
Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013
It should be clear from the figure that all the blocks of a
region maps to a single block frame in the cache and no two
blocks of different regions map to the same block frame.
This logical division of memory into regions is similar to
what is called as paging in memory management. The region
size should be comparable to a page but it is preferable that
the region size be smaller than the page size in order that the
program memory references remain scattered across the
regions as much as possible.

block 2 to frame 2 in traditional and frame 1 in index-first. It
must be noted that all memory accesses will belong to the
category of one of these four mappings. Table I enumerates
the all the possible block mapping conflicts that may occur in
the two addressing schemes. The table shows the caches
that conflict for a particular mapping pair and also the number of memory blocks (called degree) involved in such a conflict. A degree of one means no conflict.
TABLE I. MAPPING CONFLICTS FOR T HE TWO SCHEMES

The above analysis shows the proposed scheme produces
the following effects on the caches:
a. The mapping pair (1, 2) produces the single conflicts
in the L1 cache in the index-first scheme. This effect
can be lowered by keeping the M/m value lower.
This can be done by using a larger sized L1 cache.
b. The mapping pair (1, 3) causes a large no. of single
Figure 3. Address mapping as per the index-first scheme

Figure 4. Addressing conflicts using index-first vs. traditional scheme. Block size is uniform at all the levels

The addressing conflicts that occur in the caches when using index-first scheme versus traditional scheme are as shown
in Fig. 4. Note that it is only the L1$ at which the block mappings of two schemes differ. Each numeric label on the mappings denotes a particular type of memory block access. For
example mapping “1” shows that both schemes map memory
block 1 to frame 1 of the L1$ whereas mapping “2” maps
© 2013 ACEEE
DOI: 03.LSCS.2013.2.40

conflicts in the L1 cache in traditional scheme. Or
conversely, the index-first scheme reduces such
single conflicts in the L1 cache by a factor of 1/m.
This effect can be made higher by increasing the
value of 1/m. This can be done by using a
comparatively smaller L1 cache.
c. The mapping pair (1, 4) produces a large no. of global
24
Short Paper
Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013

Figure 5. Single-trace simulations of a set of benchmarks

conflicts in the traditional scheme whereas many of
these are eliminated in the index-first scheme. This
will reduce the global miss rate in the index-first
scheme again by a factor of 1/m. This effect can also
be made larger by using a smaller sized L1 cache.
The effects (a), (b), and (c) depict the reason of applying
the index-first addressing scheme only at the lower level instead of higher levels. The important observation from the
effect (c) is that index-first scheme services many requests at
L1 level which are serviced by memory in traditional scheme.
This will significantly improves the CPU performance because
of a large difference between L1 cache access time and memory
access time. Another important observation is that the (1, 2)
conflict arise out of the same page, (1, 3) conflicts may or may
not arise out of the same page, and the (1, 4) conflicts only
arise out of the different pages. This implies that in multiprocessing systems index-first scheme will give far better CPU
performance gain as compared to single processing environment because effect (b) and (c) will be dominating in the multiprocessing systems.
III. EXPERIMENT AND RESULTS
Our goal is to evaluate the effect of index-first addressing
scheme on the cache and CPU performance as specified in
the previous section. To do this, global cache miss rate and
instructions per cycle (IPC) were used as the performance
metrics. The cache hierarchy system that was modeled is
detailed in Table II.
A. Simulation environment
The Dinero IV cache simulator [10] was used to run the
traces of various benchmarks. The simulator was modified to
include the index-first addressing scheme. The simulator was
extended to simulate the multi-processing environment by
running 2- and 4- traces simultaneously. In the multi-trace
simulation the switching period of 100 instructions was used.
25
© 2013 ACEEE
DOI: 03.LSCS.2013.2.40

TABLE II. C ACHE H IERARCHY MODEL PARAMETERS

B. Benchmarks
The SBC traces [11] of SPEC CPU2000 benchmarks [12]
suite were used in the trace simulation. However not all the
benchmarks from the suite could be used in the simulation,
so a set of nine benchmarks (five integers and four floating
point) were selected from the benchmark suite for simulation
purposes. Each trace included the memory references of first
two billion of instruction execution.
C. Results
The results of the simulations of running singleton traces
are shown in Figure 5. Fig. 5 (a) shows the global miss rates
of the index-first scheme relative to the traditional scheme
and Fig. 5 (b) shows the corresponding CPI reductions
achieved in the index-first scheme. In index-first scheme,
average global miss rate decreased to around 92% of the
original and lowest to 75% of traditional for lucas. In the
index-first scheme, the average reduction in CPI is over 0.25
and a maximum CPI reduction of 0.63 for lucas.
The relative global miss rates and CPI reductions of the
index-first scheme of running 2-trace and 4-trace simulations
are shown in Figure 6. Figure 6 (a) shows that the average
global miss rate decreased to around 89% of traditional for 2trace simulations and around 85% of traditional for 4-trace
Short Paper
Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013

Figure 6. Multi-trace simulations of a set of 2-trace and 4-trace benchmarks

[3] R. C. Murphy and Peter M. Kogge. “On the memory access
patterns of supercomputer applications: Benchmark selection
and its implications”. IEEE Transactions on Computers, 7(56),
July 2007.
[4] S. Przybylski, M. Horowitz, and J. Hennessy. “Characteristics
of performance-optimal multi-level cache hierarchies”. In Proc.
of the 16th Annual Int. Symposium on Computer Architecture,
pp. 114-121, June 1989.
[5] J. A. Rivers. “Reducing conflicts in direct-mapped caches with
a temporality-based design”. In Proc. of the Int. Conference
on Parallel Processing, pp. 154-163, August 1996.
[6] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. “Using prime
numbers for cache indexing to eliminate conflict misses”. IEEE
Proc. on Software, pp. 288-299, Feb. 2004.
[7] Brian N. Bershad, Dennis Lee, Theodore H. Romer, and J.
Bradley Chen. “Avoiding conflict misses dynamically in large
direct-mapped caches”. In Proc. of the 6 th Int. Conf. on
Architectural Support for Programming Languages and
Operating Systems, pp. 158-170, 1994.
[8] Gabriel Rivera and Chau-Wen Tseng. “Data transformations
for eliminating conflict misses”. In Proc. of the ACM
SIGPLAN Conf. on Programming Language Design and
Implementation, pp 38-49, 1998.
[9] Gabriel Rivera and Chau-Weng Tseng. “Eliminating conflict
misses for high performance architectures”. In Proc. of the
12 th Int. Conf. on Supercomputing, pp. 353-360, 1998.
[10] Mark D. Hill. Dinero IV Trace-driven uniprocessor cache
simulator. “pages.cs.wisc.edu/~markhill/DineroIV”.
[11] A. Milenkovic, M. Milenkovic. “Exploiting streams in
instruction and data address trace compression”. In Proc. of
the IEEE 6th Annual Workshop on Workload Characterization,
October 2003.
[12] J. L. Henning. “SPEC CPU2000: Measuring CPU performance
in the new millennium”. IEEE Computer, July 2000.

simulations. It maximum decreased to 50% of traditional for
(gcc,gap,gzip,perl) and to 79% of traditional for
(lucas,wupwise). Fig. 6 (b) shows that the average CPI
reduction of index-first scheme is over 0.14 for 2-trace and
around 1.2 for 4-trace simulations.
IV. CONCLUSIONS
An index-first cache addressing scheme has been
presented which is an efficient scheme for addressing the
cache memories. Using this scheme the first portion of the
memory address is used for indexing the L1 cache instead of
second as used in the traditional scheme. This scheme works
by eliminating several types of block address conflicts in a
multi-level cache system. This scheme can be implemented
without any additional hardware. The results show that this
scheme outperforms the traditional scheme across all the
benchmarks in both global miss rates and CPI measures.
Moreover, the performance improvements were found to be
many times more in multi-processing environments.
REFERENCES
[1] Alan Jay Smith. “Cache memories”. ACM Computing Surveys,
pp. 473-530, Sept. 1982.
[2] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf.
“The cache performance and optimizations of blocked
algorithms”. ACM SIGARCH Computer Architecture News,
pp. 63-74, 1991.

© 2013 ACEEE
DOI: 03.LSCS.2013.2.40

26

Más contenido relacionado

La actualidad más candente

Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96
Michele Cermele
 
Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)
Sangamesh Ragate
 
System partitioning in VLSI and its considerations
System partitioning in VLSI and its considerationsSystem partitioning in VLSI and its considerations
System partitioning in VLSI and its considerations
Subash John
 
Using the black-box approach with machine learning methods in ...
Using the black-box approach with machine learning methods in ...Using the black-box approach with machine learning methods in ...
Using the black-box approach with machine learning methods in ...
butest
 
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Serhan
 

La actualidad más candente (16)

Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96
 
Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)
 
System partitioning in VLSI and its considerations
System partitioning in VLSI and its considerationsSystem partitioning in VLSI and its considerations
System partitioning in VLSI and its considerations
 
Abstract The Prospect of 3D-IC
Abstract The Prospect of 3D-ICAbstract The Prospect of 3D-IC
Abstract The Prospect of 3D-IC
 
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME...
 
Using the black-box approach with machine learning methods in ...
Using the black-box approach with machine learning methods in ...Using the black-box approach with machine learning methods in ...
Using the black-box approach with machine learning methods in ...
 
Paper on experimental setup for verifying - "Slow Learners are Fast"
Paper  on experimental setup for verifying  - "Slow Learners are Fast"Paper  on experimental setup for verifying  - "Slow Learners are Fast"
Paper on experimental setup for verifying - "Slow Learners are Fast"
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
 
An Area Efficient Mixed Decimation MDF Architecture for Radix 22 Parallel FFT
An Area Efficient Mixed Decimation MDF Architecture for Radix 22  Parallel FFTAn Area Efficient Mixed Decimation MDF Architecture for Radix 22  Parallel FFT
An Area Efficient Mixed Decimation MDF Architecture for Radix 22 Parallel FFT
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
 
Omega
OmegaOmega
Omega
 
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
 
IRJET- An Approach to FPGA based Implementation of Image Mosaicing using Neur...
IRJET- An Approach to FPGA based Implementation of Image Mosaicing using Neur...IRJET- An Approach to FPGA based Implementation of Image Mosaicing using Neur...
IRJET- An Approach to FPGA based Implementation of Image Mosaicing using Neur...
 
A NOVEL SLOTTED ALLOCATION MECHANISM TO PROVIDE QOS FOR EDCF PROTOCOL
A NOVEL SLOTTED ALLOCATION MECHANISM TO PROVIDE QOS FOR EDCF PROTOCOLA NOVEL SLOTTED ALLOCATION MECHANISM TO PROVIDE QOS FOR EDCF PROTOCOL
A NOVEL SLOTTED ALLOCATION MECHANISM TO PROVIDE QOS FOR EDCF PROTOCOL
 
Multiple Binary Images Watermarking in Spatial and Frequency Domains
Multiple Binary Images Watermarking in Spatial and Frequency DomainsMultiple Binary Images Watermarking in Spatial and Frequency Domains
Multiple Binary Images Watermarking in Spatial and Frequency Domains
 
Advance DBMS
Advance DBMSAdvance DBMS
Advance DBMS
 

Destacado (8)

Course_Template
Course_TemplateCourse_Template
Course_Template
 
Addressing Scheme IPv4
Addressing Scheme IPv4Addressing Scheme IPv4
Addressing Scheme IPv4
 
Internet Appendix A
Internet Appendix AInternet Appendix A
Internet Appendix A
 
Day02
Day02 Day02
Day02
 
Introduction to Internet
Introduction to InternetIntroduction to Internet
Introduction to Internet
 
Slide, word and word processing
Slide, word and word processingSlide, word and word processing
Slide, word and word processing
 
Word Processing Slides
Word Processing SlidesWord Processing Slides
Word Processing Slides
 
Microsoft word basics ppt
Microsoft word basics pptMicrosoft word basics ppt
Microsoft word basics ppt
 

Similar a An Index-first Addressing Scheme for Multi-level Caches

Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
Dhritiman Halder
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
ijesajournal
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
ijesajournal
 
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
Vijay Prime
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
ijaceeejournal
 

Similar a An Index-first Addressing Scheme for Multi-level Caches (20)

Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
 
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
 
program partitioning and scheduling IN Advanced Computer Architecture
program partitioning and scheduling  IN Advanced Computer Architectureprogram partitioning and scheduling  IN Advanced Computer Architecture
program partitioning and scheduling IN Advanced Computer Architecture
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
 
Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...
Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...
Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...
 
An efficient multi-level cache system for geometrically interconnected many-...
An efficient multi-level cache system for geometrically  interconnected many-...An efficient multi-level cache system for geometrically  interconnected many-...
An efficient multi-level cache system for geometrically interconnected many-...
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
ACES_Journal_February_2012_Paper_07
ACES_Journal_February_2012_Paper_07ACES_Journal_February_2012_Paper_07
ACES_Journal_February_2012_Paper_07
 
Ijciet 10 01_162
Ijciet 10 01_162Ijciet 10 01_162
Ijciet 10 01_162
 
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESREDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
 
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESREDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
 
Reducing Competitive Cache Misses in Modern Processor Architectures
Reducing Competitive Cache Misses in Modern Processor ArchitecturesReducing Competitive Cache Misses in Modern Processor Architectures
Reducing Competitive Cache Misses in Modern Processor Architectures
 
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
 
week_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxweek_2Lec02_CS422.pptx
week_2Lec02_CS422.pptx
 

Más de idescitation

65 113-121
65 113-12165 113-121
65 113-121
idescitation
 
74 136-143
74 136-14374 136-143
74 136-143
idescitation
 
84 11-21
84 11-2184 11-21
84 11-21
idescitation
 
29 88-96
29 88-9629 88-96
29 88-96
idescitation
 

Más de idescitation (20)

65 113-121
65 113-12165 113-121
65 113-121
 
69 122-128
69 122-12869 122-128
69 122-128
 
71 338-347
71 338-34771 338-347
71 338-347
 
72 129-135
72 129-13572 129-135
72 129-135
 
74 136-143
74 136-14374 136-143
74 136-143
 
80 152-157
80 152-15780 152-157
80 152-157
 
82 348-355
82 348-35582 348-355
82 348-355
 
84 11-21
84 11-2184 11-21
84 11-21
 
62 328-337
62 328-33762 328-337
62 328-337
 
46 102-112
46 102-11246 102-112
46 102-112
 
47 292-298
47 292-29847 292-298
47 292-298
 
49 299-305
49 299-30549 299-305
49 299-305
 
57 306-311
57 306-31157 306-311
57 306-311
 
60 312-318
60 312-31860 312-318
60 312-318
 
5 1-10
5 1-105 1-10
5 1-10
 
11 69-81
11 69-8111 69-81
11 69-81
 
14 284-291
14 284-29114 284-291
14 284-291
 
15 82-87
15 82-8715 82-87
15 82-87
 
29 88-96
29 88-9629 88-96
29 88-96
 
43 97-101
43 97-10143 97-101
43 97-101
 

Último

Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Último (20)

How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 

An Index-first Addressing Scheme for Multi-level Caches

  • 1. Short Paper Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013 An Index-first Addressing Scheme for Multi-level Caches Hemant Salwan Department of Information technology, Institute of Information Technology and Management, New Delhi, India salwan.hemant@gmail.com Abstract—Performance is one of the leading factors in the current processor designs. A processor’s efficiency highly depends upon the organization of its cache. In the multi-level caches, the existing scheme of addressing the cache memory produces a significant fraction of address conflicts at the lower level cache. This increases the global miss rate which diminishes the cache and processor performance. This problem becomes more severe in multi-processing environment because of conflicts of inter-process interference. An approach that performs the cache addressing efficiently in multi-level caches is proposed in this paper. The experimental results show that this scheme significantly decreases the global miss rate of the caches and the cycles per instruction (CPI) performance metric of the processor. Index Terms—cache memory, cache addressing, conflict misses, multi-level caches I. INTRODUCTION Cache memory addressing scheme [1] partitions the physical address into three fields: tag, index and offset. The index field is used to select one of the cache sets. The tag field is compared against it for a hit. Figure 1 shows how an address is divided into three fields. The block offset selects the desired word from the block. without conflict. This scenario is most significant in the current multi-processor systems which allow a high degree of multi processing causing even large caches to crunch under its pressure. Previous studies have explored these drawbacks and proposed the solutions. Non–temporal streaming [5] has reduced the cache conflict misses by predicting the temporal behavior of a block. Alternative cache hashing functions were used in [6] to eliminate the conflicts misses. Cache Miss Look aside [7] used a hardware buffer to reduce conflict misses. Compile-time transformations [8] [9] were also used to eliminate conflict misses. Unfortunately, all of these schemes either require a specialized hardware or are not dynamic in nature. An efficient scheme of addressing the cache is proposed in this paper which overcomes the above mentioned limitations in the above said workload environments. Moreover, it does not require any additional hardware on the processor chip. Section II describes the index-first cache addressing scheme in detail and also discuss its comparison with the traditional scheme. Section III describes the experimental methodology and the results achieved using the new scheme. Finally, Section IV concludes the paper. II. INDEX-FIRST ADDRESSING Figure 1. Traditional address partitioning In this proposed cache addressing scheme, we interchange the tag and index fields of the traditional way to make it called as index-first addressing. This scheme thus uses the high order bits to index into the cache and lower order bits as the tag to compare against the selected cache block frame for a hit. The address fields are as shown in Figure 2. As per to this traditional scheme the contiguous blocks in memory are mapped to contiguous blocks of the cache. Although this scheme uniformly distributes the memory block addresses across the cache block frames, but it suffers with two major limitations [2] [3] [4]. First, multiple memory blocks map to the same set of block frames at all the cache levels which potentially produces global conflict misses in the caches. Second, due to generally much larger high level cache Figure 2.Index-first address partitioning multiple blocks of higher level cache map to a single frame of lower level cache which causes conflict misses at the lower The following relations will hold true in the index-first level. In Figure 4 the types of block addressing conflicts addressing scheme: found in the traditional scheme are illustrated. These limitations arise specifically in two workload environments: i) a single running process whose memory access pattern is scattered across the memory pages instead of being localized to a region. This scenario is specially encountered when the By this way of addressing the memory, the whole memory size of primary cache is very small, and ii) in the multi-procan be seen as composed of a fixed number of regions and cessing environment where the caches are not large enough each region is composed of a fixed number of blocks. It is to accommodate the working sets of all the running processes illustrated in Figure 3 how this scheme maps in a solo cache. 23 © 2013 ACEEE DOI: 03.LSCS.2013.2.40
  • 2. Short Paper Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013 It should be clear from the figure that all the blocks of a region maps to a single block frame in the cache and no two blocks of different regions map to the same block frame. This logical division of memory into regions is similar to what is called as paging in memory management. The region size should be comparable to a page but it is preferable that the region size be smaller than the page size in order that the program memory references remain scattered across the regions as much as possible. block 2 to frame 2 in traditional and frame 1 in index-first. It must be noted that all memory accesses will belong to the category of one of these four mappings. Table I enumerates the all the possible block mapping conflicts that may occur in the two addressing schemes. The table shows the caches that conflict for a particular mapping pair and also the number of memory blocks (called degree) involved in such a conflict. A degree of one means no conflict. TABLE I. MAPPING CONFLICTS FOR T HE TWO SCHEMES The above analysis shows the proposed scheme produces the following effects on the caches: a. The mapping pair (1, 2) produces the single conflicts in the L1 cache in the index-first scheme. This effect can be lowered by keeping the M/m value lower. This can be done by using a larger sized L1 cache. b. The mapping pair (1, 3) causes a large no. of single Figure 3. Address mapping as per the index-first scheme Figure 4. Addressing conflicts using index-first vs. traditional scheme. Block size is uniform at all the levels The addressing conflicts that occur in the caches when using index-first scheme versus traditional scheme are as shown in Fig. 4. Note that it is only the L1$ at which the block mappings of two schemes differ. Each numeric label on the mappings denotes a particular type of memory block access. For example mapping “1” shows that both schemes map memory block 1 to frame 1 of the L1$ whereas mapping “2” maps © 2013 ACEEE DOI: 03.LSCS.2013.2.40 conflicts in the L1 cache in traditional scheme. Or conversely, the index-first scheme reduces such single conflicts in the L1 cache by a factor of 1/m. This effect can be made higher by increasing the value of 1/m. This can be done by using a comparatively smaller L1 cache. c. The mapping pair (1, 4) produces a large no. of global 24
  • 3. Short Paper Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013 Figure 5. Single-trace simulations of a set of benchmarks conflicts in the traditional scheme whereas many of these are eliminated in the index-first scheme. This will reduce the global miss rate in the index-first scheme again by a factor of 1/m. This effect can also be made larger by using a smaller sized L1 cache. The effects (a), (b), and (c) depict the reason of applying the index-first addressing scheme only at the lower level instead of higher levels. The important observation from the effect (c) is that index-first scheme services many requests at L1 level which are serviced by memory in traditional scheme. This will significantly improves the CPU performance because of a large difference between L1 cache access time and memory access time. Another important observation is that the (1, 2) conflict arise out of the same page, (1, 3) conflicts may or may not arise out of the same page, and the (1, 4) conflicts only arise out of the different pages. This implies that in multiprocessing systems index-first scheme will give far better CPU performance gain as compared to single processing environment because effect (b) and (c) will be dominating in the multiprocessing systems. III. EXPERIMENT AND RESULTS Our goal is to evaluate the effect of index-first addressing scheme on the cache and CPU performance as specified in the previous section. To do this, global cache miss rate and instructions per cycle (IPC) were used as the performance metrics. The cache hierarchy system that was modeled is detailed in Table II. A. Simulation environment The Dinero IV cache simulator [10] was used to run the traces of various benchmarks. The simulator was modified to include the index-first addressing scheme. The simulator was extended to simulate the multi-processing environment by running 2- and 4- traces simultaneously. In the multi-trace simulation the switching period of 100 instructions was used. 25 © 2013 ACEEE DOI: 03.LSCS.2013.2.40 TABLE II. C ACHE H IERARCHY MODEL PARAMETERS B. Benchmarks The SBC traces [11] of SPEC CPU2000 benchmarks [12] suite were used in the trace simulation. However not all the benchmarks from the suite could be used in the simulation, so a set of nine benchmarks (five integers and four floating point) were selected from the benchmark suite for simulation purposes. Each trace included the memory references of first two billion of instruction execution. C. Results The results of the simulations of running singleton traces are shown in Figure 5. Fig. 5 (a) shows the global miss rates of the index-first scheme relative to the traditional scheme and Fig. 5 (b) shows the corresponding CPI reductions achieved in the index-first scheme. In index-first scheme, average global miss rate decreased to around 92% of the original and lowest to 75% of traditional for lucas. In the index-first scheme, the average reduction in CPI is over 0.25 and a maximum CPI reduction of 0.63 for lucas. The relative global miss rates and CPI reductions of the index-first scheme of running 2-trace and 4-trace simulations are shown in Figure 6. Figure 6 (a) shows that the average global miss rate decreased to around 89% of traditional for 2trace simulations and around 85% of traditional for 4-trace
  • 4. Short Paper Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013 Figure 6. Multi-trace simulations of a set of 2-trace and 4-trace benchmarks [3] R. C. Murphy and Peter M. Kogge. “On the memory access patterns of supercomputer applications: Benchmark selection and its implications”. IEEE Transactions on Computers, 7(56), July 2007. [4] S. Przybylski, M. Horowitz, and J. Hennessy. “Characteristics of performance-optimal multi-level cache hierarchies”. In Proc. of the 16th Annual Int. Symposium on Computer Architecture, pp. 114-121, June 1989. [5] J. A. Rivers. “Reducing conflicts in direct-mapped caches with a temporality-based design”. In Proc. of the Int. Conference on Parallel Processing, pp. 154-163, August 1996. [6] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. “Using prime numbers for cache indexing to eliminate conflict misses”. IEEE Proc. on Software, pp. 288-299, Feb. 2004. [7] Brian N. Bershad, Dennis Lee, Theodore H. Romer, and J. Bradley Chen. “Avoiding conflict misses dynamically in large direct-mapped caches”. In Proc. of the 6 th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 158-170, 1994. [8] Gabriel Rivera and Chau-Wen Tseng. “Data transformations for eliminating conflict misses”. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation, pp 38-49, 1998. [9] Gabriel Rivera and Chau-Weng Tseng. “Eliminating conflict misses for high performance architectures”. In Proc. of the 12 th Int. Conf. on Supercomputing, pp. 353-360, 1998. [10] Mark D. Hill. Dinero IV Trace-driven uniprocessor cache simulator. “pages.cs.wisc.edu/~markhill/DineroIV”. [11] A. Milenkovic, M. Milenkovic. “Exploiting streams in instruction and data address trace compression”. In Proc. of the IEEE 6th Annual Workshop on Workload Characterization, October 2003. [12] J. L. Henning. “SPEC CPU2000: Measuring CPU performance in the new millennium”. IEEE Computer, July 2000. simulations. It maximum decreased to 50% of traditional for (gcc,gap,gzip,perl) and to 79% of traditional for (lucas,wupwise). Fig. 6 (b) shows that the average CPI reduction of index-first scheme is over 0.14 for 2-trace and around 1.2 for 4-trace simulations. IV. CONCLUSIONS An index-first cache addressing scheme has been presented which is an efficient scheme for addressing the cache memories. Using this scheme the first portion of the memory address is used for indexing the L1 cache instead of second as used in the traditional scheme. This scheme works by eliminating several types of block address conflicts in a multi-level cache system. This scheme can be implemented without any additional hardware. The results show that this scheme outperforms the traditional scheme across all the benchmarks in both global miss rates and CPI measures. Moreover, the performance improvements were found to be many times more in multi-processing environments. REFERENCES [1] Alan Jay Smith. “Cache memories”. ACM Computing Surveys, pp. 473-530, Sept. 1982. [2] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. “The cache performance and optimizations of blocked algorithms”. ACM SIGARCH Computer Architecture News, pp. 63-74, 1991. © 2013 ACEEE DOI: 03.LSCS.2013.2.40 26