ARI. HiPEAC 2014

Viacheslav Fedorov, Sheng Qiu,
Narasimha Reddy, Paul Gratz
Texas A&M University
ARI:
Adaptive Replacement and Insertion
HiPEAC 2014, Vienna, Austria

Conventional Main Memory
● Usually we only care about
speeding up the cache miss path
[Figure: Core 0 to Core 3 with L2 caches, an L3 cache, and Main Memory]

Main Memory: Trends
● New Memories emerging
● DRAM not dense enough
● Replace or augment DRAM
[Figure: Core 0 to Core 3 with L2 caches and an L3 cache; main-memory labels: DRAM, PCM, DRAM cache]

PCM Technology
● Based on Chalcogenide glass
● Exploits two phases
● Amorphous
● Crystalline
● Higher density than DRAM
● Non-volatile
Image: Stanford NanoHeat Lab

DRAM vs PCM
● DRAM is writeback-agnostic
● Write Buffers cushion the impact of writebacks
● State-of-the-art policies target cache misses
● PCM
● High write latency – Write Buffers insufficient
● High write energy – Mobile, embedded devices ?
● Low cell endurance – Limited write cycles ?
Parameter          DRAM     PCM      PCM/DRAM
Row Read           210 mW   78 mW    0.3x
Row Write          195 mW   773 mW   4x
Activate           75 mW    25 mW    0.3x
Standby            90 mW    45 mW    0.5x
Refresh            4 mW     0 mW     0x
Initial Row Read   15 ns    28 ns    2x
Row Write          22 ns    150 ns   7x
Same Row R/W       15 ns    15 ns
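The PCM-to-DRAM ratios annotated on this slide can be rechecked from the table values. The short sketch below only repeats that arithmetic; the dictionary and function names are illustrative and not from the talk, and the slide's annotations appear to be rounded.

```python
# (DRAM, PCM) values copied from the table on this slide.
power_mw = {
    "Row Read": (210, 78),
    "Row Write": (195, 773),
    "Activate": (75, 25),
    "Standby": (90, 45),
    "Refresh": (4, 0),
}
latency_ns = {
    "Initial Row Read": (15, 28),
    "Row Write": (22, 150),
    "Same Row R/W": (15, 15),
}


def pcm_to_dram(table):
    """Return the PCM/DRAM ratio for every parameter in the table."""
    return {name: pcm / dram for name, (dram, pcm) in table.items()}


for label, table in (("power", power_mw), ("latency", latency_ns)):
    for name, ratio in pcm_to_dram(table).items():
        print(f"{label:8s} {name:18s} {ratio:.2f}x")
```

The write path stands out: a PCM row write costs roughly four times the power and seven times the latency of a DRAM row write, while reads and standby are cheaper or comparable.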

Outline
● Introduction
● Motivation
● ARI: Adaptive Replacement and Insertion
● Evaluation
● Summary
● Conclusion

Motivation
● PCM is attractive as a Main Memory, but...
● PCM does not favor writes
● High energy
● High latency
● Low write cycle tolerance
● Solution: reduce writes into Main Memory
● Modify LLC policies to reduce Writebacks
● Mind the Miss rate!

Application behavior in
High-Associativity Caches
● Bi-Polar block distribution due to LRU policy
● 'Hot' blocks tend to group towards MRU side
● 'Cold' blocks towards LRU side in a set
● Hot blocks have higher Hit-ratio
● Cold blocks tend to have similar Hit-ratios
[Figure: Hit distribution in a high-associativity cache (16-way). Hit rate (%) vs. position in LRU stack, MRU to LRU, with the 'Hot' region on the MRU side and the 'Cold' region on the LRU side]

Static LLC policies
● Based on the observed hot-cold distribution
● 16-way cache: 16 static policies, xH16
● Replace any clean block in (16-x) Low-hit blocks
● Drawbacks:
● No single static policy good for all applications
● Fewer writebacks => more cache misses
– When replacing hot blocks
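A static xH16 policy can be sketched as a victim-selection function over one set's LRU stack. This is a minimal illustration and not the authors' implementation: the `Block` type and `choose_victim` name are hypothetical, the scan order (LRU end first) is an assumption because the slide only says "any clean block", and the fallback to plain LRU eviction when every low-hit block is dirty is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    dirty: bool = False


def choose_victim(lru_stack, x):
    """Pick the stack position to evict under a static xH16-style policy.

    lru_stack[0] is the MRU block and lru_stack[-1] the LRU block.
    Positions 0..x-1 are treated as high-hit, positions x.. as low-hit.
    """
    # Assumption: scan the low-hit region from the LRU end towards MRU
    # and take the first clean block, which needs no writeback.
    for pos in range(len(lru_stack) - 1, x - 1, -1):
        if not lru_stack[pos].dirty:
            return pos
    # Assumed fallback: no clean low-hit block, so evict the LRU block
    # and pay for the writeback.
    return len(lru_stack) - 1


stack = [Block(tag=i) for i in range(16)]
stack[14].dirty = True
stack[15].dirty = True
print(choose_victim(stack, 12))  # low-hit region is positions 12..15
print(choose_victim(stack, 14))  # low-hit region is positions 14..15, all dirty
```

A small x gives a wide low-hit region, so a clean victim is easier to find but hot blocks are more likely to be evicted, which is the miss-rate drawback listed above.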

Enter ARI:
Adaptive Replacement and Insertion
● Goal: Reduce LLC writebacks!
● Keep miss rate lower than conventional policies
● How?
● Do not replace dirty cache blocks (as long as possible)
● Place fresh incoming blocks into LLC smartly
● Dynamically choose the best policy

ARI: Operation
● Evict clean blocks from Low-Hit region
● Insert new blocks into top of Low-Hit region
[Figure: Hit rate (%) vs. position in LRU stack, MRU to LRU, split into a High-Hit region and a Low-Hit region]
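The insertion half of this slide can be sketched as a list operation on one set's LRU stack. The function name, the 8-way example, and the use of a threshold `x` for the first low-hit position are illustrative assumptions, not details taken from the talk.

```python
def insert_block(lru_stack, new_block, victim_pos, x):
    """Evict the block at victim_pos, then insert new_block at the top
    of the low-hit region.

    lru_stack[0] is the MRU block; position x is assumed to be the first
    (most recently used) slot of the low-hit region.
    """
    evicted = lru_stack.pop(victim_pos)
    # Blocks between x and the old victim position shift one step
    # towards the LRU end; the high-hit region is left untouched.
    lru_stack.insert(min(x, len(lru_stack)), new_block)
    return evicted


# 8-way example for brevity: positions 0..4 high-hit, 5..7 low-hit.
ways = list("ABCDEFGH")
gone = insert_block(ways, "N", victim_pos=7, x=5)
print(gone, ways)
```

Compared with inserting at MRU, the new block does not push hot blocks down the stack, so it has to earn its way into the high-hit region.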

ARI: Operation
● Application hit-distributions are not static
● Dynamic policy adaptation based on epochs
● Emulate various static thresholds in LLC tags
● Pick the best one for next epoch (25k LLC accesses)
● Misses + Writebacks metric used
[Figure: Hit rate (%) vs. position in LRU stack, MRU to LRU]
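The epoch-based selection described on this slide might look roughly like the sketch below. The 25k-access epoch and the Misses + Writebacks metric come from the slide; the class name, the starting policy, the tie-breaking rule, and resetting the counters each epoch are assumptions made for illustration.

```python
EPOCH_LENGTH = 25_000  # LLC accesses per epoch, as stated on the slide


class ThresholdSelector:
    """Track emulated cost per candidate threshold and pick the best one."""

    def __init__(self, candidates, epoch_length=EPOCH_LENGTH):
        self.candidates = list(candidates)
        self.epoch_length = epoch_length
        self.current = self.candidates[0]  # assumed starting policy
        self.accesses = 0
        self.cost = {x: 0 for x in self.candidates}

    def record(self, x, misses=0, writebacks=0):
        """Add emulated misses and writebacks for candidate threshold x."""
        self.cost[x] += misses + writebacks

    def tick(self):
        """Count one LLC access; at the end of an epoch, switch policy."""
        self.accesses += 1
        if self.accesses < self.epoch_length:
            return self.current
        # Lowest Misses + Writebacks wins; ties go to the earlier candidate.
        self.current = min(self.candidates, key=lambda x: self.cost[x])
        self.accesses = 0
        self.cost = {x: 0 for x in self.candidates}
        return self.current


selector = ThresholdSelector([4, 10, 14], epoch_length=3)
selector.record(4, misses=5, writebacks=9)
selector.record(10, misses=6, writebacks=2)
selector.record(14, misses=9, writebacks=1)
for _ in range(3):
    selector.tick()
print(selector.current)
```

Summing misses and writebacks into a single cost keeps a threshold from winning on writebacks alone while inflating the miss rate.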

ARI: Implementation
● Emulate static thresholds in shadow tags
● Adapt to the hit-distribution dynamically
[Figure: Core 0 to Core 3 with L2 caches and an L3 cache; the L3 holds a Tag Array, a Data Array, and a Shadow Tag Array emulating 4H16, 10H16, and 14H16]

Methodology
● gem5 + DRAMSim2 simulators
● NVIDIA Tegra-like out-of-order, dual-issue CPU
● SPEC2006 and PARSEC suites
● Compared against state-of-the-art policies
● ARI beats them in writeback reduction
● Nearly identical in total performance
System        Single core                                   Multicore
L1 cache      32KB I + 64KB D, 2-way, LRU, 64B block        32KB I + 64KB D, 2-way, LRU, 64B block
L2 cache      256KB, 8-way, LRU, 64B block                  256KB, 8-way, LRU, 64B block (private)
L3 cache      2MB, 16-way, LRU, 64B block                   16MB, 16-way, LRU, 64B block (shared)
Main memory   4GB, DDR3-1333 DRAM, 32-entry write buffer    4GB, DDR3-1333 DRAM, 32-entry write buffer

ARI: Writeback reduction
● ARI beats the competition: 33% WB reduction
Writeback improvement, normalized to LRU policy
DIP: M. Qureshi et al, ISCA '07
DBLK: S. Khan et al, MICRO '10
RRIP: A. Jaleel et al, ISCA '10

ARI: Miss reduction
● ARI achieves a 4.7% miss reduction
Miss rate improvement, normalized to LRU policy
DIP: M. Qureshi et al, ISCA '07
DBLK: S. Khan et al, MICRO '10
RRIP: A. Jaleel et al, ISCA '10

ARI: Performance improvement
● ARI yields a 5% IPC improvement on average
IPC improvement, normalized to LRU policy

ARI: Dynamic behavior
● ARI adapts to program phases
● Achieves lower WBs than the best static policy
[Figure: Writebacks for the soplex application (SPEC 2006) and the mcf application (SPEC 2006)]

ARI: PCM lifetime improvement
● ARI facilitates the use of PCM as Main Memory
[Figure: % PCM lifetime improvement (axis 0% to 60%) for DIP, DBLK, RRIP, and ARI. Annotation: "Decrease lifetime for several apps"]
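A back-of-the-envelope estimate shows how a writeback reduction could translate into PCM lifetime. The sketch assumes that lifetime is inversely proportional to write traffic and that writes are spread evenly over the cells; both are simplifying assumptions, and this is not necessarily how the figures in the talk were obtained.

```python
def lifetime_improvement(write_reduction):
    """Relative lifetime gain, assuming lifetime ~ 1 / write traffic."""
    if not 0 <= write_reduction < 1:
        raise ValueError("write_reduction must be in [0, 1)")
    return 1.0 / (1.0 - write_reduction) - 1.0


gain = lifetime_improvement(0.33)  # 33% fewer writebacks
print(f"{gain:.0%}")
```

Under these assumptions a 33% writeback reduction gives roughly a 49% longer lifetime, which is in line with the improvement of about 50% reported for ARI.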

ARI: Hardware overhead
● 8 sets shadowed per LLC bank (x8)
● p*2 shadow tags (we use p=9)
● 14kB storage overhead in a 16MB LLC
● Epoch counter – 15 bits
● Performance counters, adders
● Not on critical path
● Can be designed for low power

ARI: Summary
● 33% writeback reduction
● 4.7% cache miss rate reduction
● 9% less Main Memory traffic
● System IPC boost of 5%
● Enabling PCM as Main Memory
● 50% lifetime improvement
Win – Win

Conclusion
● DRAM is hitting a scalability wall
● New memories/architectures proposed
● We target PCM as main memory
● Propose ARI: Adaptive Replacement and Insertion
● Simple scheme
● Reduce writebacks to main memory
● Boost the PCM performance and lifetime

Related Work: PCM
G. Dhiman et al.
PDRAM: A hybrid PRAM and DRAM main memory system. DAC ’09
M. K. Qureshi et al.
Enhancing Lifetime and Security of PCM-based Main Memory with
Start-Gap Wear Leveling. MICRO ’09
B. C. Lee et al.
Architecting Phase Change Memory as a Scalable
DRAM Alternative. ISCA ’09
M. K. Qureshi et al.
Scalable high performance main memory system using
phase-change memory technology. ISCA ’09
A. P. Ferreira et al.
Increasing PCM main memory lifetime. DATE ’10

Related Work: PCM
N. H. Seong et al.
Security refresh: prevent malicious wear-out and increase durability
for phase-change memory with dynamically randomized address mapping.
ISCA ’10
H. Yoon et al.
Row buffer locality aware caching policies for hybrid memories. ICCD ’12
Stuecheli et al.
The Virtual Write Queue: Coordinating DRAM and
Last-Level Cache Policies. ISCA ’10
M. K. Qureshi & G. H. Loh
Fundamental latency trade-off in architecting dram caches:
Outperforming impractical SRAM-tags with a simple and practical design.
MICRO ’12

ARI: Total Memory Traffic
[Figure: Total memory traffic, Misses + Writebacks, normalized to LRU (axis 0.6 to 1.4), for 4H16 and ARI. Benchmarks: gcc, bzip, bwaves, mcf, milc, zeus, gromacs, cactusADM, leslie3d, namd, gobmk, soplex, hmmer, sjeng, GemsFDTD, h264ref, astar, sphinx3, avg]
