SlideShare una empresa de Scribd logo
1 de 39
Descargar para leer sin conexión
GCMA: Guaranteed
Contiguous Memory Allocator
SeongJae Park <sj38.park@gmail.com>
These slides were presented during
The Kernel Summit 2018
(https://events.linuxfoundation.org/events/linux-kernel-summit-2018/)
This work by SeongJae Park is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
I, SeongJae Park
● SeongJae Park <sj38.park@gmail.com>
● PhD candidate at Seoul National University
● Interested in memory management and parallel programming for OS kernels
Contiguous Memory
Requirements
Who Wants Physically Contiguous Memory?
● Virtual Memory avoids need of physically contiguous memory
○ CPU uses virtual address; virtual regions can be mapped to any physical regions
○ MMU maps the virtual address to physical address
● Particularly, Linux uses demand paging and reclaim
Physical
Memory
Application MMUCPU
Logical address Physical address
Devices using Direct Memory Access (DMA)
● Lots of devices need large internal buffers
(e.g., 12 Mega-Pixel Camera, 10 Gbps Network interface card, …)
● Because the MMU is behind the CPU, devices directly addressing the
memory cannot use the virtual address
Physical
Memory
Application MMUCPU
Logical address Physical address
Device
DMA
Physical address
Systems using Huge Pages
● A system can have multiple sizes of pages, usually 4 KiB (regular) and 2 MiB
● Use of huge pages improve system performance by reducing TLB misses
● Importance of huge pages is growing
○ Modern workloads including big data, cloud, and machine learning are memory intensive
○ Systems equipping tera-bytes are not rare now
○ It’s especially important on virtual machine based environments
● A 2 MiB huge page is just 512 physically contiguous regular pages
Existing Solutions
H/W Solutions: IOMMU, Scatter / Gather DMA
● Idea is simple; add MMU-like additional hardware for devices
● IOMMU and Scatter/Gather DMA gives the contiguous device memory illusion
● Additional H/W means increase of power consumption and price, which is
unaffordable for low-end devices
● Useless for many cases including huge pages
Physical
Memory
Application MMUCPU
Logical address Physicals address
Device
DMA
Device address
IOMMU
Physical address
Even H/W Solutions Impose Overhead
● Hardware based solutions are well known for low overhead
● Though it is low, the overhead is inevitable
Christoph Lameter et al., “User space contiguous memory allocation for DMA”,
Linux Plumbers Conference 2017
Reserved Area Technique
● Reserves sufficient amount of contiguous area at boot time and
let only contiguous memory allocation to use the area
● Simple and effective for contiguous memory allocation
● Memory space utilization could be low if the reserved area is not fully used
● Widely adopted despite of the utilization problem
System Memory
(Narrowed by Reservation)
Reserved
Area
Process
Normal Allocation Contiguous Allocation
CMA: Contiguous Memory Allocator
● Software-based solution in the Linux kernel
● Generously extended version of the Reserved area technique
○ Mainly focus on memory utilization
○ Let movable pages to use the reserved area
○ If contig mem alloc requires pages being used for movable pages,
move those pages out of the reserved area and use the vacant area for contig mem alloc
○ Solves the memory utilization problem well because general pages are movable
● In other words, CMA gives different priority to clients of the reserved area
○ Primary client: contiguous memory allocation
○ Secondary client: movable page allocation
CMA in Real-land
● Measured latency for taking a photo using Camera app on Raspberry Pi 2
under memory stress (Blogbench) for 30 times
● 1.6 Seconds worst-latency with Reserved area technique,
● 9.8 seconds worst-latency with CMA; Unacceptable
● Measured the latency of CMA under the situation
● It’s clear that latency of Camera is bounded by CMA
The Phantom Menace: Latency of CMA
CMA: Why So Slow?
● In short, secondary client of CMA is not so nice as expected
● Moving page out is an expensive task (copying, rmap control, LRU churn, …)
● If someone is holding the page, it could not be moved until the one releases
the page (e.g., get_user_page())
● Result: Unexpected long latency and even failure
● That’s why CMA is not adopted to many devices
● Raspberry Pi does not support CMA officially because of the problem
Buddy Allocator
● Adaptively split and merge adjacent contiguous pages
● Highly optimized and heavily used in the Linux kernel
● Supports only multi-order contiguous pages and up to (MAX_ORDER - 1)
contiguous pages
● Under fragmentation, it does time-consuming compaction and retry or just
fails
○ Not a good news for contiguous memory requesting tasks, either
● THP uses buddy allocator to allocates huge pages
○ Quickly falls back to regular pages if it fails to allocate contiguous pages
○ As a result, it cannot be used on highly fragmented memory
THP Performance under High Fragmentation
● Default: THP disabled, Thp: THP always
● .f suffix: On the fragmented memory
● Under fragmentation, THP benefits disappear because of the fragmentation
(Lowerisbetter)
Guaranteed Contiguous
Memory Allocator
GCMA: Guaranteed Contiguous Memory Allocator
● GCMA is a variant of CMA that guarantees
○ Fast latency and success of allocation
○ While keeping memory utilization as well
● The idea is simple
○ Follow primary / secondary client idea of CMA to keep memory utilization
○ Select secondary client as only nice one unlike CMA
■ For fast latency, should be able to vacate the reserved area as soon as required without
any task such as moving content
■ For success, should be out of kernel control
■ For memory utilization, most pages should be able to be secondary client
■ In short, frontswap and cleancache
Secondary Clients of GCMA
● Pages for a write-through mode Frontswap
○ Could be discarded immediately because the contents are written through to swap device
○ Kernel thinks it’s already swapped out
○ Most anonymous pages could be covered
○ We recommend users to use Zram as a swap device to minimize write-through overhead
● Pages for a Clean cache
○ Could be discarded because the contents in storage is up to date
○ Kernel thinks it’s already evicted out
○ Lots of file-backed pages could be the case
● Additional pros 1: Already expected to not be accessed again soon
○ Discarding the secondary client pages would not affect system performance much
● Additional pros 2: Occurs with only severe workloads
○ In peaceful case, no overhead at all
GCMA: Workflow
● Reserve memory area in boot time
● If a page is swapped out or evicted from the page cache, keep the content of
the page in the reserved area
○ If the system requires the content of a page again, give it back from the reserved area
○ If a contiguous memory allocation requires area being used by those pages, discard those
pages and use the area for contiguous memory allocation
System Memory GCMA Reserved
Area
Swap Device
File
Process
Normal Allocation Contiguous Allocation
Reclaim / Swap
Get back if hit
GCMA Architecture
● Reuse CMA interface
○ User can turn CMA to GCMA entire or use them selectively on single system
● DMEM: Discardable Memory
○ Abstraction for backend of frontswap and cleancache
○ Works as a last chance cache for secondary client pages using GCMA reserved area
○ Manages Index to pages using hash table of RB-tree based buckets
○ Evicts pages in LRU scheme
Reserved Area
Dmem Logic
GCMA Logic
Dmem Interface
CMA Interface
CMA Logic
Cleancache Frontswap
GCMA Implementation
● Implemented on Linux v3.18, v4.10 and v4.17
● About 1,500 lines of code
● Ported, evaluated on Raspberry Pi 2 and a high-end server
● Available under GPL v3 at: https://github.com/sjp38/linux.gcma
● Submitted to LKML (https://lkml.org/lkml/2015/2/23/480)
Evaluations of
GCMA for a Device
Experimental Setup
● Device setup
○ Raspberry Pi 2
○ ARM cortex-A7 900 MHz
○ 1 GiB LPDDR2 SDRAM
○ Class 10 SanDisk 16 GiB microSD card
● Configurations
○ Baseline: Linux rpi-v3.18.11 + 100 MiB swap +
256 MiB reserved area
○ CMA: Linux rpi-v3.18.11 + 100 MiB swap +
256 MiB CMA area
○ GCMA: Linux rpi-v3.18.11 + 100 MiB Zram swap +
256 MiB GCMA area
● Workloads
○ Contig Mem Alloc: Contiguous memory allocation using CMA or GCMA
○ Blogbench: Realistic file system benchmark
○ Camera shot: Repeatedly taking picture using Raspberry Pi 2 Camera module with 10 seconds
interval between each shot
Contig Mem Alloc without Background Workload
● Average of 30 allocations
● GCMA shows 14.89x to 129.41x faster latency compared to CMA
● CMA even failed once for 32,768 contiguous pages allocation even though
there is no background task
4MiB Contig Mem Alloc w/o Background Workload
● Latency of GCMA is more predictable compared to CMA
4MiB Contig Mem Allocation w/ Background Task
● Background workload: Blogbench
● CMA consumes 5 seconds (unacceptable!) for one allocation in bad case
Camera Latency w/ Background Task
● Camera is a realistic, important application of Raspberry Pi 2
● Background task: Blogbench
● GCMA keeps latency as fast as Reserved area technique configuration
Blogbench on CMA / GCMA
● Scores are normalized to scores of Baseline configuration
● ‘/ cam’ means camera workload as a background workload
● System using GCMA even slightly outperforms CMA version owing to its light
overhead
● Allocation using CMA even degrades system performance because it should
move out pages in reserved area
Evaluations of
GCMA for THP
Experimental Setup
● Device setup
○ Intel Xeon E7-8870 v3
○ L2 TLB with 1024 entries
○ 600 GiB DDR4 memory
○ 500 GiB Intel SSD
○ Linux v4.10 based kernels
● Configurations
○ Nothp: THP disabled
○ Thp.bd: Buddy allocator based THP enabled
○ Thp.cma: GCMA based THP enabled
○ Thp.bc: Buddy allocator based, GCMA fallback using THP enabled
○ .f suffix: On fragmented system
● Workloads
○ 429.mcf and 471.omnetpp from SPEC CPU2006
○ TPC-H Strong test
Performance of SPEC CPU 2006
● Memory intensive two workloads are chosen
● Fragmentation decreases performance of regular pages, too
● GCMA for THP shows 2.56x higher performance compared to the original
THP under fragmentation
● The impact is workload dependent, though
Performance of TPC-H
● Power test: Measure latency of each query for OLAP workloads
● Runtime is normalized by Thp.bc.f
● Queries 17 and 19 produces speedup more than 2x
● Four queries (3, 9, 14 and 20) show a speedup of 1.5x-2x
Plan for GCMA
Unifying Solutions Under CMA Interface
● There are many solutions and interfaces for contiguous memory allocations
● Too many interfaces can confuse new coming programmers; We don’t want
to increase the confusion with GCMA
● GCMA is developed to coexist with CMA, rather than substituting it; It is
already using CMA interface
● The interface could select the secondary clients for each CMA region
○ None (Reservation)
○ Migratables (CMA)
○ Discardables (GCMA)
● Currently, we are developing a patchset; It aims to be merged to the mainline
● The patchset will include updated evaluation result
Conclusion
● Contiguous memory allocation needs improvement
○ H/W solutions are expensive for low-end devices, imposes overhead, and cannot provide real
contig memory
○ CMA is slow in many cases
○ Buddy allocator is restricted
● GCMA guarantees fast latency, success of allocation and reasonable memory
utilization
○ It achieves the goals by utilizing nice clients, the pages for Frontswap and the Cleancache
○ Allocation latency of GCMA is as fast as Reserved area technique
○ GCMA can be used for THP allocation fallback
○ 130x and 10x shorter latency for contiguous memory allocation and taking a photo compared
to CMA based system, repectively
○ More than 50% and 100% performance improvement for 7/24, 3/24 realistic workloads
● Planning to release the official version and updated evaluation results in near
future
Thanks

Más contenido relacionado

La actualidad más candente

Design pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesDesign pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesMahmudul Hasan
 
Process scheduling
Process schedulingProcess scheduling
Process schedulingHao-Ran Liu
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationLinaro
 
Linux Locking Mechanisms
Linux Locking MechanismsLinux Locking Mechanisms
Linux Locking MechanismsKernel TLV
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...peknap
 
Using QEMU for cross development
Using QEMU for cross developmentUsing QEMU for cross development
Using QEMU for cross developmentTetsuyuki Kobayashi
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals哲豪 康哲豪
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocatorsHao-Ran Liu
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmAnne Nicolas
 
Linux kernel modules
Linux kernel modulesLinux kernel modules
Linux kernel modulesHao-Ran Liu
 
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeKernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeAnne Nicolas
 
Improving Real-Time Performance on Multicore Platforms using MemGuard
Improving Real-Time Performance on Multicore Platforms using MemGuardImproving Real-Time Performance on Multicore Platforms using MemGuard
Improving Real-Time Performance on Multicore Platforms using MemGuardHeechul Yun
 
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Anne Nicolas
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeAnne Nicolas
 
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsHeechul Yun
 
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsHeechul Yun
 
From printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingFrom printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingThe Linux Foundation
 

La actualidad más candente (20)

Design pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelinesDesign pipeline architecture for various stage pipelines
Design pipeline architecture for various stage pipelines
 
Process scheduling
Process schedulingProcess scheduling
Process scheduling
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
 
Linux Locking Mechanisms
Linux Locking MechanismsLinux Locking Mechanisms
Linux Locking Mechanisms
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
 
Linux Kernel Debugging
Linux Kernel DebuggingLinux Kernel Debugging
Linux Kernel Debugging
 
Using QEMU for cross development
Using QEMU for cross developmentUsing QEMU for cross development
Using QEMU for cross development
 
Linux Preempt-RT Internals
Linux Preempt-RT InternalsLinux Preempt-RT Internals
Linux Preempt-RT Internals
 
Linux kernel memory allocators
Linux kernel memory allocatorsLinux kernel memory allocators
Linux kernel memory allocators
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
 
Mastering Real-time Linux
Mastering Real-time LinuxMastering Real-time Linux
Mastering Real-time Linux
 
Linux kernel modules
Linux kernel modulesLinux kernel modules
Linux kernel modules
 
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry codeKernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
Kernel Recipes 2016 - entry_*.S: A carefree stroll through kernel entry code
 
Improving Real-Time Performance on Multicore Platforms using MemGuard
Improving Real-Time Performance on Multicore Platforms using MemGuardImproving Real-Time Performance on Multicore Platforms using MemGuard
Improving Real-Time Performance on Multicore Platforms using MemGuard
 
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
Kernel Recipes 2016 - Understanding a Real-Time System (more than just a kernel)
 
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens AxboeKernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
Kernel Recipes 2017 - What's new in the world of storage for Linux - Jens Axboe
 
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore SystemsParallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
 
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time SystemsTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
 
Ch6 cpu scheduling
Ch6   cpu schedulingCh6   cpu scheduling
Ch6 cpu scheduling
 
From printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debuggingFrom printk to QEMU: Xen/Linux Kernel debugging
From printk to QEMU: Xen/Linux Kernel debugging
 

Similar a GCMA: Guaranteed Contiguous Memory Allocator

Continguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelContinguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelKernel TLV
 
Virtualization for Emerging Memory Devices
Virtualization for Emerging Memory DevicesVirtualization for Emerging Memory Devices
Virtualization for Emerging Memory DevicesTakahiro Hirofuchi
 
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingLinux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingChinaNetCloud
 
IBM MQ Disaster Recovery
IBM MQ Disaster RecoveryIBM MQ Disaster Recovery
IBM MQ Disaster RecoveryMarkTaylorIBM
 
LCA13: Memory Hotplug on Android
LCA13: Memory Hotplug on AndroidLCA13: Memory Hotplug on Android
LCA13: Memory Hotplug on AndroidLinaro
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms finalYutaka Kawai
 
Software Design for Persistent Memory Systems
Software Design for Persistent Memory SystemsSoftware Design for Persistent Memory Systems
Software Design for Persistent Memory SystemsC4Media
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLinaro
 
Memory Management in Amoeba
Memory Management in AmoebaMemory Management in Amoeba
Memory Management in AmoebaRamesh Adhikari
 
Recent advancements in cache technology
Recent advancements in cache technologyRecent advancements in cache technology
Recent advancements in cache technologyParas Nath Chaudhary
 
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea ArcangeliKernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea ArcangeliAnne Nicolas
 
Enhanced Live Migration for Intensive Memory Loads
Enhanced Live Migration for Intensive Memory LoadsEnhanced Live Migration for Intensive Memory Loads
Enhanced Live Migration for Intensive Memory LoadsSamsung Open Source Group
 
Drupal 7 performance and optimization
Drupal 7 performance and optimizationDrupal 7 performance and optimization
Drupal 7 performance and optimizationShafqat Hussain
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentationKarthik Iyr
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneOpen-NFP
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesKarthik Iyr
 

Similar a GCMA: Guaranteed Contiguous Memory Allocator (20)

Continguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelContinguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux Kernel
 
Virtualization for Emerging Memory Devices
Virtualization for Emerging Memory DevicesVirtualization for Emerging Memory Devices
Virtualization for Emerging Memory Devices
 
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud TrainingLinux Memory Basics for SysAdmins - ChinaNetCloud Training
Linux Memory Basics for SysAdmins - ChinaNetCloud Training
 
IBM MQ Disaster Recovery
IBM MQ Disaster RecoveryIBM MQ Disaster Recovery
IBM MQ Disaster Recovery
 
RAMinate Invited Talk at NII
RAMinate Invited Talk at NIIRAMinate Invited Talk at NII
RAMinate Invited Talk at NII
 
LCA13: Memory Hotplug on Android
LCA13: Memory Hotplug on AndroidLCA13: Memory Hotplug on Android
LCA13: Memory Hotplug on Android
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms final
 
Software Design for Persistent Memory Systems
Software Design for Persistent Memory SystemsSoftware Design for Persistent Memory Systems
Software Design for Persistent Memory Systems
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics Upstreaming
 
Memory Management in Amoeba
Memory Management in AmoebaMemory Management in Amoeba
Memory Management in Amoeba
 
Recent advancements in cache technology
Recent advancements in cache technologyRecent advancements in cache technology
Recent advancements in cache technology
 
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea ArcangeliKernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
Kernel Recipes 2017 - 20 years of Linux Virtual Memory - Andrea Arcangeli
 
Enhanced Live Migration for Intensive Memory Loads
Enhanced Live Migration for Intensive Memory LoadsEnhanced Live Migration for Intensive Memory Loads
Enhanced Live Migration for Intensive Memory Loads
 
ch9_virMem.pdf
ch9_virMem.pdfch9_virMem.pdf
ch9_virMem.pdf
 
Drupal 7 performance and optimization
Drupal 7 performance and optimizationDrupal 7 performance and optimization
Drupal 7 performance and optimization
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentation
 
Measuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data PlaneMeasuring a 25 and 40Gb/s Data Plane
Measuring a 25 and 40Gb/s Data Plane
 
RAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 TalkRAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 Talk
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphones
 

Más de SeongJae Park

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in goSeongJae Park
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalabilitySeongJae Park
 
Brief introduction to kselftest
Brief introduction to kselftestBrief introduction to kselftest
Brief introduction to kselftestSeongJae Park
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)SeongJae Park
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golangSeongJae Park
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.aviSeongJae Park
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golangSeongJae Park
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using GolangSeongJae Park
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_dockerSeongJae Park
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot publicSeongJae Park
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.aviSeongJae Park
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internallySeongJae Park
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologueSeongJae Park
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSSeongJae Park
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflectionSeongJae Park
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution beginSeongJae Park
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without TouchingSeongJae Park
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentationSeongJae Park
 

Más de SeongJae Park (20)

Biscuit: an operating system written in go
Biscuit:  an operating system written in goBiscuit:  an operating system written in go
Biscuit: an operating system written in go
 
Design choices of golang for high scalability
Design choices of golang for high scalabilityDesign choices of golang for high scalability
Design choices of golang for high scalability
 
Brief introduction to kselftest
Brief introduction to kselftestBrief introduction to kselftest
Brief introduction to kselftest
 
Let the contribution begin (EST futures)
Let the contribution begin  (EST futures)Let the contribution begin  (EST futures)
Let the contribution begin (EST futures)
 
Porting golang development environment developed with golang
Porting golang development environment developed with golangPorting golang development environment developed with golang
Porting golang development environment developed with golang
 
An introduction to_golang.avi
An introduction to_golang.aviAn introduction to_golang.avi
An introduction to_golang.avi
 
Develop Android/iOS app using golang
Develop Android/iOS app using golangDevelop Android/iOS app using golang
Develop Android/iOS app using golang
 
Develop Android app using Golang
Develop Android app using GolangDevelop Android app using Golang
Develop Android app using Golang
 
Sw install with_without_docker
Sw install with_without_dockerSw install with_without_docker
Sw install with_without_docker
 
Git inter-snapshot public
Git  inter-snapshot publicGit  inter-snapshot public
Git inter-snapshot public
 
(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi(Live) build and run golang web server on android.avi
(Live) build and run golang web server on android.avi
 
Deep dark-side of git: How git works internally
Deep dark-side of git: How git works internallyDeep dark-side of git: How git works internally
Deep dark-side of git: How git works internally
 
Deep dark side of git - prologue
Deep dark side of git - prologueDeep dark side of git - prologue
Deep dark side of git - prologue
 
DO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCSDO YOU WANT TO USE A VCS
DO YOU WANT TO USE A VCS
 
Experimental android hacking using reflection
Experimental android hacking using reflectionExperimental android hacking using reflection
Experimental android hacking using reflection
 
ash
ashash
ash
 
Hacktime for adk
Hacktime for adkHacktime for adk
Hacktime for adk
 
Let the contribution begin
Let the contribution beginLet the contribution begin
Let the contribution begin
 
Touch Android Without Touching
Touch Android Without TouchingTouch Android Without Touching
Touch Android Without Touching
 
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기   dev festx korea 2012 presentationAOSP에 컨트리뷰션 하기   dev festx korea 2012 presentation
AOSP에 컨트리뷰션 하기 dev festx korea 2012 presentation
 

Último

Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 

Último (20)

Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 

GCMA: Guaranteed Contiguous Memory Allocator

  • 1. GCMA: Guaranteed Contiguous Memory Allocator SeongJae Park <sj38.park@gmail.com>
  • 2. These slides were presented during The Kernel Summit 2018 (https://events.linuxfoundation.org/events/linux-kernel-summit-2018/)
  • 3. This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
  • 4. I, SeongJae Park ● SeongJae Park <sj38.park@gmail.com> ● PhD candidate at Seoul National University ● Interested in memory management and parallel programming for OS kernels
  • 6. Who Wants Physically Contiguous Memory? ● Virtual Memory avoids need of physically contiguous memory ○ CPU uses virtual address; virtual regions can be mapped to any physical regions ○ MMU maps the virtual address to physical address ● Particularly, Linux uses demand paging and reclaim Physical Memory Application MMUCPU Logical address Physical address
  • 7. Devices using Direct Memory Access (DMA) ● Lots of devices need large internal buffers (e.g., 12 Mega-Pixel Camera, 10 Gbps Network interface card, …) ● Because the MMU is behind the CPU, devices directly addressing the memory cannot use the virtual address Physical Memory Application MMUCPU Logical address Physical address Device DMA Physical address
  • 8. Systems using Huge Pages ● A system can have multiple sizes of pages, usually 4 KiB (regular) and 2 MiB ● Use of huge pages improve system performance by reducing TLB misses ● Importance of huge pages is growing ○ Modern workloads including big data, cloud, and machine learning are memory intensive ○ Systems equipping tera-bytes are not rare now ○ It’s especially important on virtual machine based environments ● A 2 MiB huge page is just 512 physically contiguous regular pages
  • 10. H/W Solutions: IOMMU, Scatter / Gather DMA ● Idea is simple; add MMU-like additional hardware for devices ● IOMMU and Scatter/Gather DMA gives the contiguous device memory illusion ● Additional H/W means increase of power consumption and price, which is unaffordable for low-end devices ● Useless for many cases including huge pages Physical Memory Application MMUCPU Logical address Physicals address Device DMA Device address IOMMU Physical address
  • 11. Even H/W Solutions Impose Overhead ● Hardware based solutions are well known for low overhead ● Though it is low, the overhead is inevitable Christoph Lameter et al., “User space contiguous memory allocation for DMA”, Linux Plumbers Conference 2017
  • 12. Reserved Area Technique ● Reserves sufficient amount of contiguous area at boot time and let only contiguous memory allocation to use the area ● Simple and effective for contiguous memory allocation ● Memory space utilization could be low if the reserved area is not fully used ● Widely adopted despite of the utilization problem System Memory (Narrowed by Reservation) Reserved Area Process Normal Allocation Contiguous Allocation
  • 13. CMA: Contiguous Memory Allocator ● Software-based solution in the Linux kernel ● Generously extended version of the Reserved area technique ○ Mainly focus on memory utilization ○ Let movable pages to use the reserved area ○ If contig mem alloc requires pages being used for movable pages, move those pages out of the reserved area and use the vacant area for contig mem alloc ○ Solves the memory utilization problem well because general pages are movable ● In other words, CMA gives different priority to clients of the reserved area ○ Primary client: contiguous memory allocation ○ Secondary client: movable page allocation
  • 14. CMA in Real-land ● Measured latency for taking a photo using Camera app on Raspberry Pi 2 under memory stress (Blogbench) for 30 times ● 1.6 Seconds worst-latency with Reserved area technique, ● 9.8 seconds worst-latency with CMA; Unacceptable
  • 15. ● Measured the latency of CMA under the situation ● It’s clear that latency of Camera is bounded by CMA The Phantom Menace: Latency of CMA
  • 16. CMA: Why So Slow? ● In short, secondary client of CMA is not so nice as expected ● Moving page out is an expensive task (copying, rmap control, LRU churn, …) ● If someone is holding the page, it could not be moved until the one releases the page (e.g., get_user_page()) ● Result: Unexpected long latency and even failure ● That’s why CMA is not adopted to many devices ● Raspberry Pi does not support CMA officially because of the problem
  • 17. Buddy Allocator ● Adaptively split and merge adjacent contiguous pages ● Highly optimized and heavily used in the Linux kernel ● Supports only multi-order contiguous pages and up to (MAX_ORDER - 1) contiguous pages ● Under fragmentation, it does time-consuming compaction and retry or just fails ○ Not a good news for contiguous memory requesting tasks, either ● THP uses buddy allocator to allocates huge pages ○ Quickly falls back to regular pages if it fails to allocate contiguous pages ○ As a result, it cannot be used on highly fragmented memory
  • 18. THP Performance under High Fragmentation ● Default: THP disabled, Thp: THP always ● .f suffix: On the fragmented memory ● Under fragmentation, THP benefits disappear because of the fragmentation (Lowerisbetter)
  • 20. GCMA: Guaranteed Contiguous Memory Allocator ● GCMA is a variant of CMA that guarantees ○ Fast latency and success of allocation ○ While keeping memory utilization as well ● The idea is simple ○ Follow primary / secondary client idea of CMA to keep memory utilization ○ Select secondary client as only nice one unlike CMA ■ For fast latency, should be able to vacate the reserved area as soon as required without any task such as moving content ■ For success, should be out of kernel control ■ For memory utilization, most pages should be able to be secondary client ■ In short, frontswap and cleancache
  • 21. Secondary Clients of GCMA ● Pages for a write-through mode Frontswap ○ Could be discarded immediately because the contents are written through to swap device ○ Kernel thinks it’s already swapped out ○ Most anonymous pages could be covered ○ We recommend users to use Zram as a swap device to minimize write-through overhead ● Pages for a Clean cache ○ Could be discarded because the contents in storage is up to date ○ Kernel thinks it’s already evicted out ○ Lots of file-backed pages could be the case ● Additional pros 1: Already expected to not be accessed again soon ○ Discarding the secondary client pages would not affect system performance much ● Additional pros 2: Occurs with only severe workloads ○ In peaceful case, no overhead at all
  • 22. GCMA: Workflow ● Reserve memory area in boot time ● If a page is swapped out or evicted from the page cache, keep the content of the page in the reserved area ○ If the system requires the content of a page again, give it back from the reserved area ○ If a contiguous memory allocation requires area being used by those pages, discard those pages and use the area for contiguous memory allocation System Memory GCMA Reserved Area Swap Device File Process Normal Allocation Contiguous Allocation Reclaim / Swap Get back if hit
  • 23. GCMA Architecture ● Reuse CMA interface ○ User can turn CMA to GCMA entire or use them selectively on single system ● DMEM: Discardable Memory ○ Abstraction for backend of frontswap and cleancache ○ Works as a last chance cache for secondary client pages using GCMA reserved area ○ Manages Index to pages using hash table of RB-tree based buckets ○ Evicts pages in LRU scheme Reserved Area Dmem Logic GCMA Logic Dmem Interface CMA Interface CMA Logic Cleancache Frontswap
  • 24. GCMA Implementation ● Implemented on Linux v3.18, v4.10 and v4.17 ● About 1,500 lines of code ● Ported, evaluated on Raspberry Pi 2 and a high-end server ● Available under GPL v3 at: https://github.com/sjp38/linux.gcma ● Submitted to LKML (https://lkml.org/lkml/2015/2/23/480)
  • 26. Experimental Setup ● Device setup ○ Raspberry Pi 2 ○ ARM cortex-A7 900 MHz ○ 1 GiB LPDDR2 SDRAM ○ Class 10 SanDisk 16 GiB microSD card ● Configurations ○ Baseline: Linux rpi-v3.18.11 + 100 MiB swap + 256 MiB reserved area ○ CMA: Linux rpi-v3.18.11 + 100 MiB swap + 256 MiB CMA area ○ GCMA: Linux rpi-v3.18.11 + 100 MiB Zram swap + 256 MiB GCMA area ● Workloads ○ Contig Mem Alloc: Contiguous memory allocation using CMA or GCMA ○ Blogbench: Realistic file system benchmark ○ Camera shot: Repeatedly taking picture using Raspberry Pi 2 Camera module with 10 seconds interval between each shot
  • 27. Contig Mem Alloc without Background Workload ● Average of 30 allocations ● GCMA shows 14.89x to 129.41x faster latency compared to CMA ● CMA even failed once for 32,768 contiguous pages allocation even though there is no background task
  • 28. 4MiB Contig Mem Alloc w/o Background Workload ● Latency of GCMA is more predictable compared to CMA
  • 29. 4MiB Contig Mem Allocation w/ Background Task ● Background workload: Blogbench ● CMA consumes 5 seconds (unacceptable!) for one allocation in bad case
  • 30. Camera Latency w/ Background Task ● Camera is a realistic, important application of Raspberry Pi 2 ● Background task: Blogbench ● GCMA keeps latency as fast as Reserved area technique configuration
  • 31. Blogbench on CMA / GCMA ● Scores are normalized to scores of Baseline configuration ● ‘/ cam’ means camera workload as a background workload ● System using GCMA even slightly outperforms CMA version owing to its light overhead ● Allocation using CMA even degrades system performance because it should move out pages in reserved area
  • 33. Experimental Setup ● Device setup ○ Intel Xeon E7-8870 v3 ○ L2 TLB with 1024 entries ○ 600 GiB DDR4 memory ○ 500 GiB Intel SSD ○ Linux v4.10 based kernels ● Configurations ○ Nothp: THP disabled ○ Thp.bd: Buddy allocator based THP enabled ○ Thp.cma: GCMA based THP enabled ○ Thp.bc: Buddy allocator based, GCMA fallback using THP enabled ○ .f suffix: On fragmented system ● Workloads ○ 429.mcf and 471.omnetpp from SPEC CPU2006 ○ TPC-H Strong test
  • 34. Performance of SPEC CPU 2006 ● Memory intensive two workloads are chosen ● Fragmentation decreases performance of regular pages, too ● GCMA for THP shows 2.56x higher performance compared to the original THP under fragmentation ● The impact is workload dependent, though
  • 35. Performance of TPC-H ● Power test: Measure latency of each query for OLAP workloads ● Runtime is normalized by Thp.bc.f ● Queries 17 and 19 produces speedup more than 2x ● Four queries (3, 9, 14 and 20) show a speedup of 1.5x-2x
  • 37. Unifying Solutions Under CMA Interface ● There are many solutions and interfaces for contiguous memory allocations ● Too many interfaces can confuse new coming programmers; We don’t want to increase the confusion with GCMA ● GCMA is developed to coexist with CMA, rather than substituting it; It is already using CMA interface ● The interface could select the secondary clients for each CMA region ○ None (Reservation) ○ Migratables (CMA) ○ Discardables (GCMA) ● Currently, we are developing a patchset; It aims to be merged to the mainline ● The patchset will include updated evaluation result
  • 38. Conclusion ● Contiguous memory allocation needs improvement ○ H/W solutions are expensive for low-end devices, imposes overhead, and cannot provide real contig memory ○ CMA is slow in many cases ○ Buddy allocator is restricted ● GCMA guarantees fast latency, success of allocation and reasonable memory utilization ○ It achieves the goals by utilizing nice clients, the pages for Frontswap and the Cleancache ○ Allocation latency of GCMA is as fast as Reserved area technique ○ GCMA can be used for THP allocation fallback ○ 130x and 10x shorter latency for contiguous memory allocation and taking a photo compared to CMA based system, repectively ○ More than 50% and 100% performance improvement for 7/24, 3/24 realistic workloads ● Planning to release the official version and updated evaluation results in near future