Viacheslav Fedorov, Sheng Qiu,
Narasimha Reddy, Paul Gratz
Texas A&M University
ARI:
Adaptive Replacement and Insertion
HiPEAC 2013, Vienna, Austria
Conventional Main Memory
● Usually we only care about
speeding up the cache miss path
[Diagram: Core 0 to Core 3, L2$ and L3$ caches, Main Memory]
Main Memory: Trends
● New Memories emerging
● DRAM not dense enough
● Replace or augment DRAM
[Diagram: Core 0 to Core 3, L2$ and L3$ caches, with main memory built from DRAM, PCM, and a DRAM cache]
PCM Technology
● Based on Chalcogenide glass
● Exploits two phases
● Amorphous
● Crystalline
● Higher density than DRAM
● Non-volatile
Image: Stanford NanoHeat Lab
DRAM vs PCM
● DRAM is writeback-agnostic
● Write Buffers cushion the impact of writebacks
● State-of-the-art policies target cache misses
● PCM
● High write latency – Write Buffers insufficient
● High write energy – Mobile, embedded devices?
● Low cell endurance – Limited write cycles?
Parameter | DRAM | PCM | PCM vs DRAM
Power:
Row Read | 210 mW | 78 mW | 0.3x
Row Write | 195 mW | 773 mW | 4x
Activate | 75 mW | 25 mW | 0.3x
Standby | 90 mW | 45 mW | 0.5x
Refresh | 4 mW | 0 mW | 0x
Latency:
Initial Row Read | 15 ns | 28 ns | 2x
Row Write | 22 ns | 150 ns | 7x
Same Row R/W | 15 ns | 15 ns |
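The ratio annotations can be re-derived from the table values. The sketch below is only a cross-check of the table above; the dictionary and function names are made up for illustration.

```python
# Values copied from the DRAM vs PCM table: (DRAM, PCM).
power_mw = {
    "Row Read": (210, 78),
    "Row Write": (195, 773),
    "Activate": (75, 25),
    "Standby": (90, 45),
    "Refresh": (4, 0),
}
latency_ns = {
    "Initial Row Read": (15, 28),
    "Row Write": (22, 150),
    "Same Row R/W": (15, 15),
}


def pcm_to_dram_ratio(table):
    """Return {parameter: PCM value divided by DRAM value}."""
    return {name: pcm / dram for name, (dram, pcm) in table.items()}


power_ratio = pcm_to_dram_ratio(power_mw)
latency_ratio = pcm_to_dram_ratio(latency_ns)

for label, ratios in (("power", power_ratio), ("latency", latency_ratio)):
    for name, ratio in ratios.items():
        print(f"{label:8s} {name:18s} {ratio:.2f}x")
```

The write path is where PCM loses: roughly 4x the row-write power and roughly 7x the row-write latency of DRAM.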
Outline
● Introduction
● Motivation
● ARI: Adaptive Replacement and Insertion
● Evaluation
● Summary
● Conclusion
Motivation
● PCM is attractive as a Main Memory, but...
● PCM does not favor writes
● High energy
● High latency
● Low write cycle tolerance
● Solution: reduce writes into Main Memory
● Modify LLC policies to reduce Writebacks
● Mind the Miss rate!
Application behavior in
High-Associativity Caches
● Bi-Polar block distribution due to LRU policy
● 'Hot' blocks tend to group towards MRU side
● 'Cold' blocks towards LRU side in a set
● Hot blocks have higher Hit-ratio
● Cold blocks tend to have similar Hit-ratios
[Figure: Hit distribution in a high-associativity cache (16-way). x-axis: position in LRU stack, MRU to LRU; y-axis: % hit rate; 'Hot' region on the MRU side, 'Cold' region on the LRU side]
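A hit distribution of this kind can be measured by recording the LRU stack position of every hit. The following is a minimal sketch for a single cache set, not the authors' measurement setup; the function name and the toy trace are assumptions.

```python
from collections import Counter


def stack_hit_histogram(addresses, ways=16):
    """Count hits per LRU stack position (0 = MRU) for one cache set."""
    stack = []        # block addresses, MRU first
    hits = Counter()
    for addr in addresses:
        if addr in stack:
            pos = stack.index(addr)
            hits[pos] += 1
            stack.pop(pos)
        elif len(stack) == ways:
            stack.pop()          # set is full: evict the LRU block
        stack.insert(0, addr)    # accessed block becomes MRU
    return [hits[p] for p in range(ways)]


# Toy trace on a 2-way set: one hit at the MRU position, two at the LRU one.
print(stack_hit_histogram([1, 2, 1, 1, 2], ways=2))  # [1, 2]
```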
Static LLC policies
● Based on the observed hot-cold distribution
● 16-way cache: 16 static policies, xH16
● Replace any clean block in (16-x) Low-hit blocks
● Drawbacks:
● No single static policy good for all applications
● Fewer writebacks => more cache misses
– When replacing hot blocks
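Victim selection under a static xH16 policy might look like the sketch below. The slides do not say in which order the low-hit region is scanned or what happens when every low-hit block is dirty; scanning from the LRU end and falling back to plain LRU are assumptions, as is the function name.

```python
def choose_victim_xh16(dirty_by_position, x):
    """Pick the stack position to evict under a static xH16 policy.

    dirty_by_position: list of bools, index 0 = MRU, last index = LRU.
    x: number of positions on the MRU side treated as the high-hit region.
    """
    ways = len(dirty_by_position)
    # Scan the (ways - x) low-hit positions from the LRU end upwards
    # (assumed order) and take the first clean block.
    for pos in range(ways - 1, x - 1, -1):
        if not dirty_by_position[pos]:
            return pos
    # Assumed fallback: no clean low-hit block exists, evict plain LRU.
    return ways - 1


dirty = [False] * 16
dirty[14] = dirty[15] = True
print(choose_victim_xh16(dirty, x=10))  # 13: first clean block from the LRU end
```

With a larger x the low-hit region shrinks, so a clean victim is found less often and the policy behaves more like plain LRU.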
Enter ARI:
Adaptive Replacement and Insertion
● Goal: Reduce LLC writebacks!
● Keep miss rate lower than conventional policies
● How?
● Do not replace dirty cache blocks (as long as possible)
● Place fresh incoming blocks into LLC smartly
● Dynamically choose the best policy
ARI: Operation
● Evict clean blocks from Low-Hit region
● Insert new blocks into top of Low-Hit region
[Figure: % hit rate vs. position in LRU stack (MRU to LRU), split into a High-Hit region and a Low-Hit region]
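The two rules above can be sketched for a single set as follows. This is an illustrative model only: the class name, the promote-to-MRU behaviour on a hit, and the fallback when all low-hit blocks are dirty are assumptions, not details taken from the slides.

```python
class AriSet:
    """One cache set kept as an LRU stack (index 0 = MRU)."""

    def __init__(self, ways=16, threshold=10):
        self.ways = ways
        self.threshold = threshold   # size of the high-hit region
        self.stack = []              # entries are [tag, dirty]
        self.misses = 0
        self.writebacks = 0

    def access(self, tag, is_write=False):
        for pos, entry in enumerate(self.stack):
            if entry[0] == tag:
                # Hit: promote to MRU as in ordinary LRU (assumption).
                self.stack.pop(pos)
                entry[1] = entry[1] or is_write
                self.stack.insert(0, entry)
                return True
        self.misses += 1
        if len(self.stack) == self.ways:
            self._evict()
        # Insert at the top of the low-hit region, or at the end of the
        # stack while the set is still filling up.
        pos = min(self.threshold, len(self.stack))
        self.stack.insert(pos, [tag, is_write])
        return False

    def _evict(self):
        # Prefer a clean block in the low-hit region, scanning from LRU.
        for pos in range(self.ways - 1, self.threshold - 1, -1):
            if not self.stack[pos][1]:
                self.stack.pop(pos)
                return
        # Assumed fallback: evict the LRU block, writing it back if dirty.
        victim = self.stack.pop()
        if victim[1]:
            self.writebacks += 1


s = AriSet(ways=4, threshold=2)
s.access("a", is_write=True)
for tag in "bcde":
    s.access(tag)
print([entry[0] for entry in s.stack])  # ['a', 'b', 'e', 'd']
```

In the small example the clean block "c" is evicted instead of any dirty block, and "e" enters at position 2, the top of the low-hit region.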
ARI: Operation
● Application hit-distributions are not static
● Dynamic policy adaptation based on epochs
● Emulate various static thresholds in LLC tags
● Pick the best one for next epoch (25k LLC accesses)
● Misses + Writebacks metric used
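The epoch mechanism can be sketched as a per-threshold cost counter that is compared every 25k LLC accesses. The class and method names are hypothetical, and resetting the counters at each epoch boundary is an assumption; the candidate thresholds 4, 10 and 14 are used purely as an example.

```python
EPOCH_LENGTH = 25_000   # LLC accesses per epoch, as stated on the slide


class EpochSelector:
    def __init__(self, candidates, initial):
        self.cost = {x: 0 for x in candidates}   # misses + writebacks
        self.active = initial
        self.accesses = 0

    def record(self, threshold, misses=0, writebacks=0):
        """Add events observed while emulating one candidate threshold."""
        self.cost[threshold] += misses + writebacks

    def tick(self):
        """Call once per LLC access; returns the threshold now in use."""
        self.accesses += 1
        if self.accesses == EPOCH_LENGTH:
            # Lowest misses + writebacks wins the next epoch.
            self.active = min(self.cost, key=self.cost.get)
            self.cost = dict.fromkeys(self.cost, 0)   # assumed reset
            self.accesses = 0
        return self.active


sel = EpochSelector(candidates=[4, 10, 14], initial=10)
sel.record(4, misses=5)
sel.record(10, misses=3, writebacks=4)
sel.record(14, writebacks=2)
for _ in range(EPOCH_LENGTH):
    active = sel.tick()
print(active)  # 14
```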
[Figure: % hit rate vs. position in LRU stack (MRU to LRU)]
ARI: Implementation
● Emulate static thresholds in shadow tags
● Adapt to the hit-distribution dynamically
[Diagram: Core 0 to Core 3, L2$ and L3$ caches; Tag Array, Data Array and Shadow Tag Array, with shadow tags emulating thresholds 4H16, 10H16 and 14H16]
Methodology
● gem5 + DRAMSim2 simulators
● NVIDIA Tegra-like out-of-order, dual-issue CPU
● SPEC2006 and PARSEC suites
● Compared against state-of-the-art policies
● ARI beats them in writeback reduction
● Nearly identical in total performance
System | Single core | Multicore
L1 cache | 32KB I + 64KB D, 2-way, LRU, 64B block | 32KB I + 64KB D, 2-way, LRU, 64B block
L2 cache | 256KB, 8-way, LRU, 64B block | 256KB, 8-way, LRU, 64B block (private)
L3 cache | 2MB, 16-way, LRU, 64B block | 16MB, 16-way, LRU, 64B block (shared)
Main memory | 4GB, DDR3-1333 DRAM, 32-entry write buffer | 4GB, DDR3-1333 DRAM, 32-entry write buffer
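For reference, the number of sets in each cache follows directly from capacity, associativity and block size. A small worked calculation, assuming KB and MB are binary units (1024-based); the helper name is illustrative.

```python
KB, MB = 1024, 1024 * 1024


def num_sets(capacity_bytes, ways, block_bytes=64):
    """Sets = capacity / (associativity * block size)."""
    return capacity_bytes // (ways * block_bytes)


l2_sets = num_sets(256 * KB, ways=8)            # 512
l3_single_core = num_sets(2 * MB, ways=16)      # 2048
l3_multicore = num_sets(16 * MB, ways=16)       # 16384
print(l2_sets, l3_single_core, l3_multicore)
```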
ARI: Writeback reduction
● ARI beats the competition: 33% WB reduction
[Figure: Writeback improvement, normalized to LRU policy]
Policies compared:
DIP: M. Qureshi et al, ISCA '09
DBLK: S. Khan et al, MICRO '10
RRIP: A. Jaleel et al, ISCA '10
ARI: Miss reduction
● ARI achieves a 4.7% miss reduction
[Figure: Miss rate improvement, normalized to LRU policy. Policies compared: DIP, DBLK, RRIP]
ARI: Performance improvement
● ARI yields a 5% IPC improvement on average
[Figure: IPC improvement, normalized to LRU policy]
ARI: Dynamic behavior
● ARI adapts to program phases
● Achieves lower WBs than the best static policy
[Figure: Writebacks for the soplex and mcf applications, SPEC 2006]
ARI: Multicore applications
ARI: PCM lifetime improvement
● ARI facilitates the use of PCM as Main Memory
[Figure: % PCM lifetime improvement (axis 0% to 60%) for DIP, DBLK, RRIP and ARI. Annotation: "Decrease lifetime for several apps"]
ARI: PCM lifetime improvement
ARI: Hardware overhead
● 8 sets shadowed per LLC bank (x8)
● p*2 shadow tags (we use p=9)
● 14kB storage overhead in a 16MB LLC
● Epoch counter – 15 bits
● Performance counters, adders
● Not on critical path
● Can be designed for low power
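Two of the numbers above can be sanity-checked with simple arithmetic: whether a 15-bit counter covers a 25k-access epoch, and how large 14kB is relative to the 16MB LLC. The sketch assumes kB and MB are 1024-based; the variable names are illustrative.

```python
EPOCH_LENGTH = 25_000            # LLC accesses per epoch (from the slides)
EPOCH_COUNTER_BITS = 15

counter_values = 2 ** EPOCH_COUNTER_BITS        # 32768 distinct values
counter_fits = EPOCH_LENGTH <= counter_values   # True
min_bits = EPOCH_LENGTH.bit_length()            # 15: 14 bits would not suffice

shadow_bytes = 14 * 1024         # "14kB", read as KiB (assumption)
llc_bytes = 16 * 1024 * 1024     # 16MB LLC
overhead_pct = 100 * shadow_bytes / llc_bytes

print(counter_fits, min_bits, f"{overhead_pct:.3f}%")
```

Under these assumptions the shadow-tag storage is below 0.1% of the LLC data capacity.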
ARI: Summary
● 33% writeback reduction
● 4.7% cache miss rate reduction
● 9% less Main Memory traffic
● System IPC boost of 5%
● Enabling PCM as Main Memory
● 50% lifetime improvement
Win – Win
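The 33% writeback reduction and the 50% lifetime improvement are roughly consistent under a simple model. The sketch below assumes that PCM lifetime is inversely proportional to the volume of writebacks reaching main memory (ideal wear leveling, no other write sources); this model is an assumption for illustration, not the evaluation method used in the talk.

```python
def lifetime_gain(writeback_reduction):
    """Relative lifetime gain if lifetime scales as 1 / write volume."""
    return 1.0 / (1.0 - writeback_reduction) - 1.0


gain = lifetime_gain(0.33)
print(f"{gain:.0%}")  # 49%
```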
Conclusion
● DRAM is hitting a scalability wall
● New memories/architectures proposed
● We target PCM as main memory
● Propose ARI: Adaptive Replacement and Insertion
● Simple scheme
● Reduce writebacks to main memory
● Boost the PCM performance and lifetime
Thank you!
Questions?
Backup Slides
Related Work: PCM
● G. Dhiman et al. PDRAM: A hybrid PRAM and DRAM main memory system. DAC '09
● M. K. Qureshi et al. Enhancing Lifetime and Security of PCM-based Main Memory with Start-Gap Wear Leveling. MICRO '09
● B. C. Lee et al. Architecting Phase Change Memory as a Scalable DRAM Alternative. ISCA '09
● M. K. Qureshi et al. Scalable high performance main memory system using phase-change memory technology. ISCA '09
● A. P. Ferreira et al. Increasing PCM main memory lifetime. DATE '10
● N. H. Seong et al. Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. ISCA '10
● H. Yoon et al. Row buffer locality aware caching policies for hybrid memories. ICCD '12
● Stuecheli et al. The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. ISCA '10
● M. K. Qureshi & G. H. Loh. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. MICRO '12
ARI: Insertion impact
ARI: Total Memory Traffic
[Figure: Total memory traffic (Misses + Writebacks), normalized to LRU, for 4H16 and ARI. Benchmarks: gcc, bzip, bwaves, mcf, milc, zeus, gromacs, cactusADM, leslie3d, namd, gobmk, soplex, hmmer, sjeng, GemsFDTD, h264ref, astar, sphinx3, avg. y-axis: 0.6 to 1.4]