Improving Real-Time Performance on
Multicore Platforms Using MemGuard
University of Kansas
Dr. Heechul Yun
10/28/2013
Multicore
• Server
• Desktop
• Mobile
• RT/Embedded
Challenges: Shared Resources
[Figure: Unicore: tasks T1 and T2 share one CPU and its memory hierarchy. Multicore: tasks T1 to T8 run on Core 1 to Core 4 (two tasks per core) and share one memory hierarchy.]

• Performance impact
Case Study
• HRT
– Synthetic real-time video capture
– P=20ms, D=13ms
– Cache-insensitive

• X-server
– Scrolling text on a gnome-terminal

• Hardware platform
– Intel Xeon 3530
– 8MB shared L3 cache
– 4GB DDR3 1333MHz DIMM (1ch)

• CPU cores are isolated

[Figure: a desktop PC (Intel Xeon 3530); HRT runs on Core1 and the X-server (Xsrv.) on Core2, sharing the 8MB L3 cache and DRAM]
HRT Time Distribution
• Solo: 99pct = 10.2ms
• w/ Xserver: 99pct = 14.3ms
• 28% deadline violations
• Due to contention in DRAM
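The two statistics on this slide (99th percentile and share of deadline violations) can be computed from a list of per-job execution times. The sketch below is only an illustration: the function names are hypothetical, the nearest-rank percentile definition is an assumption (the talk does not say which definition was used), and the sample values are made up rather than taken from the measurements.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def deadline_miss_ratio(samples, deadline_ms):
    """Fraction of jobs whose execution time exceeded the deadline."""
    misses = sum(1 for t in samples if t > deadline_ms)
    return misses / len(samples)

# Hypothetical execution times in ms (NOT the measured data from the talk).
exec_times_ms = [9.8, 10.1, 10.0, 13.6, 14.2, 9.9, 10.3, 13.9, 10.2, 10.0]

print(percentile_nearest_rank(exec_times_ms, 99))
print(deadline_miss_ratio(exec_times_ms, deadline_ms=13.0))
```

With the D=13ms deadline from the case study, any job that takes longer than 13ms counts as a violation.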
Outline
• Motivation
• Background
– DRAM basics
– Worst-case memory performance
– MemGuard [RTAS’13]

• Improving Real-Time Performance with MemGuard
Background: DRAM Organization
[Figure: Core1 to Core4 share the L3 cache and the Memory Controller (MC), which drives a DRAM DIMM with Bank 1 to Bank 4]

• A DRAM DIMM has multiple banks
• Different banks can be accessed in parallel
Best-case
[Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4; labeled "Fast"]

• Peak = 10.6 GB/s
– DDR3 1333MHz
• Out-of-order processors
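The peak figure can be reproduced with simple arithmetic: transfers per second times bytes per transfer. The sketch below assumes a 64-bit data bus per channel and reads "1333" as 1333 million transfers per second; the function name is hypothetical.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bus_width_bits=64, channels=1):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9

# DDR3-1333, single channel: roughly the 10.6 GB/s quoted on the slide.
print(peak_bandwidth_gb_s(1333))

# DDR2-800 (sold as PC6400), the module used in the Core2 evaluation.
print(peak_bandwidth_gb_s(800))
```

This is an upper bound only; it ignores refresh, bus turnaround, and row misses.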
Most-cases
[Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4; labeled "Mess"]

• Performance = ??

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Worst-case
[Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4; labeled "Slow"]

• 1 bank b/w
– Less than peak b/w
– How much?

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Background: DRAM Operation
[Figure: Bank 1 with Row 1 to Row 5 and a row buffer; READ (Bank 1, Row 3, Col 7) activates Row 3 into the row buffer and reads/writes Col 7 there; precharge closes the row]

• Stateful per-bank access time
– Row miss: 19 cycles
– Row hit: 9 cycles
(*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting)
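The stateful access time can be illustrated with a toy model of one bank that remembers its open row. The 9 and 19 cycle figures come from the slide; the class and function names are hypothetical, and treating the first access to an idle bank as a row miss is a simplifying assumption.

```python
ROW_HIT_CYCLES = 9    # from the slide (PC6400 DDR2, 5-5-5)
ROW_MISS_CYCLES = 19  # from the slide

class Bank:
    """Toy model of one DRAM bank with a single row buffer."""

    def __init__(self):
        self.open_row = None  # no row is in the row buffer yet

    def access(self, row):
        """Return the latency in cycles and update the row buffer."""
        if self.open_row == row:
            return ROW_HIT_CYCLES
        # Simplification: an idle bank is charged the full miss cost.
        self.open_row = row
        return ROW_MISS_CYCLES

def total_cycles(rows):
    """Total latency of accessing the given rows of one bank in order."""
    bank = Bank()
    return sum(bank.access(r) for r in rows)

print(total_cycles([3, 3, 3, 3]))  # one miss, then three hits
print(total_cycles([1, 2, 3, 4]))  # every access is a row miss
```

The same four accesses cost 46 cycles when they stay in one row and 76 cycles when each one touches a different row, which is why access order matters.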
Real Worst-case
[Figure: requests from Core 1 to Core 4 arrive over time in the order Row 1, Row 2, Row 3, Row 4, Row 1, Row 2, …; L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]

• 1 bank & always row miss → ~1.2GB/s
• Each core = ¼ x 1.2GB/s = 300MB/s ?

(*) Intel® 64 and IA-32 Architectures Optimization Reference Manual
Background: Memory Controller (MC)
[Figure: Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig. 13.1]

• Request queue(s)
– Not fair (open-row first → re-ordering)
– Unpredictable queuing delay
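Why open-row-first reordering is unfair can be seen in a small simulation. This is a sketch under strong assumptions: all requests are already queued, there is no cap on how long a request may be bypassed, and the function names and the request tuple layout are hypothetical. Real controllers are more elaborate.

```python
def pick_next(queue, open_rows):
    """Index of the request to serve: oldest row hit, else oldest request.

    queue is a list of (core, bank, row) tuples in arrival order;
    open_rows maps each bank to the row currently in its row buffer.
    """
    for i, (_core, bank, row) in enumerate(queue):
        if open_rows.get(bank) == row:
            return i
    return 0

def serve_all(requests):
    """Serve every request and return the cores in service order."""
    queue = list(requests)
    open_rows = {}
    order = []
    while queue:
        core, bank, row = queue.pop(pick_next(queue, open_rows))
        open_rows[bank] = row
        order.append(core)
    return order

# Core B arrives second but targets a different row of the same bank.
requests = [("A", 0, 1), ("B", 0, 2), ("A", 0, 1), ("A", 0, 1)]
print(serve_all(requests))
```

Core B's request is served last although it arrived second, because core A keeps hitting the open row. B's queuing delay therefore depends on what the other core does.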
Challenges for Real-Time Systems
• Multiple parallel resources (banks)
• Stateful bank access latency
• Queuing delay

• Unpredictable memory performance

MemGuard [RTAS’13]
[Figure: MemGuard runs inside the Operating System; a Reclaim Manager sits above four per-core BW Regulators with budgets of 0.6GB/s, 0.2GB/s, 0.2GB/s and 0.2GB/s; each regulator uses the PMC of its core (Core1 to Core4); the Multicore Processor reaches the DRAM DIMM through the Memory Controller]

• Goal: guarantee minimum memory b/w for each core
• How: b/w reservation + best effort sharing
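The budgets in the figure (0.6 + 0.2 + 0.2 + 0.2 GB/s) add up to 1.2GB/s, the same value as the worst-case bandwidth quoted on the Real Worst-case slide. One plausible reading, stated here as an assumption rather than something the slides spell out, is that reservations can only be guaranteed if their sum stays within that worst-case bandwidth. A minimal check along those lines, with hypothetical names:

```python
GUARANTEED_BW_MB_S = 1200  # ~1.2GB/s worst-case figure quoted in the talk

def reservations_fit(reservations_mb_s, guaranteed_mb_s=GUARANTEED_BW_MB_S):
    """True if the per-core reservations fit in the guaranteed bandwidth."""
    return sum(reservations_mb_s) <= guaranteed_mb_s

# Per-core budgets from the MemGuard figure, in MB/s.
slide_reservations = [600, 200, 200, 200]
print(reservations_fit(slide_reservations))
```

Anything above the guaranteed part would have to come from best-effort sharing, not from a reservation.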
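Reservation
• Idea
– Scheduler regulates per-core memory b/w using h/w counters
– Period = 1 scheduler tick (e.g., 1ms)

[Figure: core activity (computation, memory fetch) and remaining budget over time (0, 1ms, 2ms); when the budget runs out, an RT idle task is scheduled on the core; at the start of the next period, the RT idle task is suspended]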
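The reservation idea can be sketched as a per-core counter that is refilled every period and throttles the core once it is used up. This is an illustration, not the MemGuard implementation: the class and function names are hypothetical, the 64-byte cache line is an assumption, and the real mechanism relies on hardware counter overflow interrupts and an RT idle task rather than a Python method call.

```python
CACHE_LINE_BYTES = 64  # assumption: typical x86 cache-line size

def budget_in_cache_lines(bw_mb_s, period_ms=1, line_bytes=CACHE_LINE_BYTES):
    """Convert a bandwidth reservation into memory accesses per period."""
    # MB/s * 1e6 bytes/MB * (period_ms / 1000) s = bw_mb_s * 1000 * period_ms
    bytes_per_period = bw_mb_s * 1000 * period_ms
    return bytes_per_period // line_bytes

class Regulator:
    """Per-core budget that is replenished at every period boundary."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.throttled = False

    def on_period_start(self):
        """Refill the budget and let the core run again."""
        self.used = 0
        self.throttled = False

    def on_memory_access(self):
        """Return True if the access proceeds, False if the core is throttled."""
        if self.throttled:
            return False
        self.used += 1
        if self.used >= self.budget:
            self.throttled = True  # stands in for scheduling the RT idle task
        return True

print(budget_in_cache_lines(600))  # accesses per 1ms period for 0.6GB/s
```

A 0.6GB/s reservation with a 1ms period corresponds to 9375 cache-line fetches per period under these assumptions.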
Best-Effort Sharing
[Figure: timeline over time (ms) 0, 1, 2 for Core0 (900MB/s) and Core1 (300MB/s), showing guaranteed b/w, throttled intervals, reschedule points, and best-effort b/w]

• Spare Sharing [RTAS’13]
• Proportional Sharing [Unpublished TR]
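One way to read the timeline is that once every core has used up its guaranteed budget within a period, the remaining time is handed out on a best-effort basis until the next period starts. The sketch below encodes only that reading, which is an assumption about the spare-sharing variant. The function name is hypothetical, and the proportional variant is not modelled.

```python
def apply_spare_sharing(throttled):
    """Release all cores if every core has exhausted its guaranteed budget.

    throttled maps a core name to True when that core is currently throttled.
    Returns a new mapping; the input is left unchanged.
    """
    if throttled and all(throttled.values()):
        # Nobody can use guaranteed bandwidth any more in this period,
        # so let all cores compete for the remainder (best effort).
        return {core: False for core in throttled}
    return dict(throttled)

print(apply_spare_sharing({"core0": True, "core1": True}))
print(apply_spare_sharing({"core0": True, "core1": False}))
```

In the second call core1 still has guaranteed budget left, so core0 stays throttled and core1's guarantee is not disturbed.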
w/o MemGuard
• HRT (solo): HRT’s 99pct = 10.2ms
• HRT (w/ Xserver): HRT’s 99pct = 14.3ms, X’s CPU util = 78%

MemGuard, reserve only (HRT=900MB/s, X=300MB/s)
• HRT (solo): HRT’s 99pct = 10.7ms
• HRT (w/ Xserver): HRT’s 99pct = 11.2ms, X’s CPU util = 4%

MemGuard, reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing
• HRT (solo): HRT’s 99pct = 10.7ms
• HRT (w/ Xserver): HRT’s 99pct = 10.7ms, X’s CPU util = 48%

MemGuard, reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing
• HRT (solo): HRT’s 99pct = 10.9ms
• HRT (w/ Xserver): HRT’s 99pct = 12.1ms, X’s CPU util = 61%
Real-Time Performance Improvement
[Figure: results for HRT and X-server]

• Using MemGuard, we can achieve
– No deadline miss for HRT
– Good X-server performance
Conclusion
• Unpredictable memory performance
– Multiple resources (banks), per-bank state, unpredictable queueing delay

• MemGuard
– Guarantee minimum memory bandwidth for each core
– b/w reservation (guaranteed part) + best-effort sharing

• Case-study
– On Intel Xeon multicore platform, using HRT + X-server
– MemGuard can improve real-time performance efficiently

• Limitations and Future Work
– Coarse-grained (one OS tick) enforcement
– Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS’14)

https://github.com/heechul/memguard
Thank you.

Evaluation on Intel Core2
• T1: Synthetic video capture task (HRT)
– Period=20ms (50Hz)
– Deadline=14ms
– Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods)

• T2: Xserver, update screen (SRT)
– Metric: CPU utilization
– Higher CPU utilization → faster screen update

• Platform
– Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers
– PC6400 DDR2 DRAM DIMM x 1

• Three platform configurations
– Exp1: Private L2, Prefetch=off
– Exp2: Private L2, Prefetch=on
– Exp3: Shared L2, Prefetch=on

[Figure: Intel Core2Quad based PC; Core0 to Core3, two L2 caches with prefetchers, DRAM]
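The four T1 metrics can be derived from the recorded execution times of the measured periods. The sketch below is illustrative: the function name is hypothetical, WCET is taken to mean the largest observed value, and the population standard deviation is an assumption because the slides do not say which variant was used.

```python
import statistics

def summarize(exec_times_ms, deadline_ms):
    """Return ACET, observed WCET, stdev and deadline miss ratio."""
    misses = sum(1 for t in exec_times_ms if t > deadline_ms)
    return {
        "acet": statistics.mean(exec_times_ms),
        "wcet": max(exec_times_ms),  # observed maximum, not an analytical bound
        "stdev": statistics.pstdev(exec_times_ms),
        "miss_ratio": misses / len(exec_times_ms),
    }

# Hypothetical samples in ms; the evaluation itself used 1000 periods.
samples_ms = [10.0, 12.0, 14.0, 16.0]
print(summarize(samples_ms, deadline_ms=14.0))
```

A job that finishes exactly at the deadline is not counted as a miss here; that boundary choice is also an assumption.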
Experiment 1: Private L2, Prefetch=off
[Figure: T1’s exec. time (ms), 0 to 18, solo vs. corun, plotted against the deadline for three configurations; annotations: ACET, WCET, “30%”, “Performance guarantee”, “Performance target”]

• Original: T2 = 92%
• MemGuard (Reserve only), Core1=550MB/s, Core2=550MB/s: T2 = 38%
• MemGuard (reclaim + share), Core1=550MB/s, Core2=550MB/s: T2 = 78%
Experiment 2: Prefetcher (Private L2, Prefetch=ON)
[Figure: T1’s exec. time (ms), 0 to 24, solo vs. corun, plotted against the deadline for three configurations; annotations: “60%”, “More slowdown”, “Not enough reserv.”, “Deadline violation”]

• Original: T2 = 94%
• MemGuard (Reserve only), Core1=550MB/s, Core2=550MB/s: T2 = 33%
• MemGuard (reclaim + share), Core1=550MB/s, Core2=550MB/s: T2 = 82%
Experiment 2-2 (Private L2, Prefetch=ON)
[Figure: T1’s exec. time (ms), 0 to 18, solo vs. corun, for three configurations; annotations: “60%”, “Enough reserv.”, “No deadline violation”]

• Original: T2 = 94%
• MemGuard (Reserve only), Core1=900MB/s, Core2=200MB/s: T2 = 14%
• MemGuard (reclaim + share), Core1=900MB/s, Core2=200MB/s: T2 = 69%
Experiment 3: Shared Cache (Shared L2, Prefetch=ON)
[Figure: T1’s exec. time (ms), 0 to 24, solo vs. corun, for three configurations; annotations: “108%”, “Even more slowdown”, “Minimum reserv.”, “No deadline violation”]

• Original: T2 = 92%
• MemGuard (Reserve only), Core1=900MB/s, Core2=200MB/s: T2 = 11%
• MemGuard (reclaim + share), Core1=900MB/s, Core2=200MB/s: T2 = 63%

Editor's notes

  1. Soon more RT/embedded systems will use multicore as well.
  2. In unicore systems, CPU time is the most important shared resource determining an application’s performance. In multicore systems, however, memory performance is also very important, because multiple cores can access memory concurrently and significantly affect each other's performance.
  3. Problem 1: co-ordinating memory slots with tasks → requires program modification (PREM). Problem 2: only 1 core can access memory at a time → does not fully utilize memory-level parallelism.
  4. First, let me explain how the b/w regulator works.
  5. Why do we want to regulate the request rates?
  6. Problem: DRAM