Puma is a kernel-level mechanism that pools unused memory across virtual machines to address the memory fragmentation caused by virtualization. It detects clean pages being evicted from a VM's page cache and stores them remotely in other VMs' page caches. Because the remote pages are clean, the lending VM can simply discard them to recover its memory quickly, unlike memory ballooning, which is slow to recover. Puma integrates with the page frame reclaim algorithm (PFRA); put operations are asynchronous and get operations are synchronous. It can filter out sequential I/O to focus on random accesses, and it improves performance on database workloads.
Puma: Pooling Unused Memory in Virtual Machines for I/O-intensive Apps
1. Puma: Pooling Unused Memory in Virtual Machines
for I/O intensive applications
Maxime Lorrillere, Julien Sopena, Sébastien Monnet and Pierre Sens
contact: maxime.lorrillere@lip6.fr
Kernel Recipes 2015
Maxime Lorrillere Puma Kernel Recipes 2015 1 / 15
4. Introduction Context
Problem: memory fragmentation
[Figure: VM1 and VM2 on Host 1, VM3 and VM4 on Host 2, connected by 10GB Ethernet. Labels: PFRA (page frame reclaim algorithm), cache, Applications, virtio, Swap, Hypervisor (KVM).]
Virtualization allows more flexibility and isolation
Problem: it fragments available memory
⇒ Sharing resources like CPU time is straightforward
⇒ Memory cannot be reassigned as efficiently as CPU time
8. Introduction Related work
Solution: Memory Ballooning [OSDI’02]
[Figure: VM1 and VM2 on Host 1, VM3 and VM4 on Host 2, connected by 10GB Ethernet. Labels: PFRA, cache, Applications, virtio, Balloon, I/O, Swap, Hypervisor (KVM).]
The host asks a VM to inflate its balloon to return free memory
The host asks a VM to deflate its balloon to get more memory
Limitations
⇒ page cache is still fragmented
⇒ slow to recover
10. Introduction Related work
Memory Ballooning – Time to recover memory
1. Generate a lot of I/O on the first VM
2. Try to allocate the memory (malloc) on the second VM
[Figure: time to recover memory, Baseline vs. Auto-ballooning]
⇒ Memory allocations are 20× slower than the baseline
⇒ And that is when it does not crash! (OOM-kill)
11. Introduction Related work
Our contribution: a cooperative page cache
[Figure: VM1 and VM2 on Host 1, VM3 and VM4 on Host 2, connected by 10GB Ethernet, with a Puma instance in the VMs and a remote page cache. Labels: PFRA, cache, Applications, virtio, Hypervisor (KVM), TCP (~100µs), ~10ms.]
Puma’s approach:
Relies on a fast network between VMs and physical machines
Hypervisor, filesystem and block device agnostic
Handles only clean cache pages
⇒ Writes are generally non-blocking
⇒ Simple consistency scheme
⇒ fast to recover memory!
16. Puma design Basics
Puma design
Local page cache eviction – put operation
[Figure: put of page P31 from VM1 to VM2, steps labelled 1 to 5: alloc() on VM1, Reclaim by the PFRA, put(P31), transfer of P31 (with metadata entry 31 on VM1), Store page on VM2.]
Typically triggered by a memory allocation
Puma is integrated into the PFRA to detect page cache eviction
Pages are sent asynchronously to avoid slowdowns
Remote pages are stored into the system page cache
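
The put path above can be illustrated with a small user-space sketch in Python. This is not the kernel implementation: the names `PumaClient`, `RemoteCache`, `evict` and `flush` are hypothetical, a deque stands in for the asynchronous send queue, and a dict stands in for the page cache of the VM that lends memory. When the metadata entry is recorded is also an assumption of the sketch.

```python
from collections import deque


class RemoteCache:
    """Stands in for the page cache of the VM that lends memory."""

    def __init__(self):
        self.pages = {}

    def store(self, page_id, data):
        self.pages[page_id] = data


class PumaClient:
    """Toy model of the put path; every name here is hypothetical."""

    def __init__(self, remote):
        self.remote = remote
        self.metadata = set()      # ids of pages believed to be cached remotely
        self.send_queue = deque()  # evicted pages waiting to be sent

    def evict(self, page_id, data, dirty=False):
        """Called when reclaim evicts a page from the local page cache."""
        if dirty:
            # Only clean pages are handled: a dirty page has to be written
            # to disk first, so this sketch simply refuses it.
            return False
        # Queue the page and return at once, so the allocation that
        # triggered the reclaim is not slowed down by the network.
        self.send_queue.append((page_id, data))
        return True

    def flush(self):
        """Drain the queue; a background worker would do this in practice."""
        while self.send_queue:
            page_id, data = self.send_queue.popleft()
            self.remote.store(page_id, data)
            self.metadata.add(page_id)


remote = RemoteCache()
client = PumaClient(remote)
client.evict(31, b"contents of P31")
stored_before_flush = 31 in remote.pages  # nothing has been sent yet
client.flush()
```

Separating `evict` from `flush` is what makes the put asynchronous in this sketch: the caller of `evict` never waits for the remote store.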
22. Puma design Basics
Puma design
Local page cache miss – get operation
[Figure: get of page P24 by VM1 from VM2, steps labelled 1 to 5: Miss on P24 and get(P24), metadata check for entry 24 (Hit?), req(P24), Lookup on VM2, P24 returned to VM1.]
Integrated into the page cache to detect local cache misses
A local cache miss leads to a (synchronous) get operation
Local metadata are used to know if and where a page is in the cache
Exclusive and non-inclusive caching strategies
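
The get path above can be sketched in the same user-space style. The function `read_page`, the class `RemoteCache` and the `exclusive` flag are hypothetical names. The sketch assumes that "exclusive" means the remote copy is dropped once it has been fetched, and that "non-inclusive" means the remote copy may stay; the slides do not spell this out.

```python
class RemoteCache:
    """Stands in for the page cache of the VM that lends memory."""

    def __init__(self, pages=None):
        self.pages = dict(pages or {})

    def request(self, page_id, exclusive):
        """Look the page up; under the exclusive strategy, drop it afterwards."""
        if exclusive:
            return self.pages.pop(page_id, None)
        return self.pages.get(page_id)


def read_page(page_id, local_cache, metadata, remote, read_from_disk,
              exclusive=True):
    """Return (data, source) for one page read."""
    if page_id in local_cache:
        return local_cache[page_id], "local"
    data = None
    if page_id in metadata:
        # Local metadata says the page was put remotely: do a blocking get.
        data = remote.request(page_id, exclusive)
        if exclusive or data is None:
            metadata.discard(page_id)
    source = "remote"
    if data is None:
        # Never put remotely, or already reclaimed there: read from disk.
        data = read_from_disk(page_id)
        source = "disk"
    local_cache[page_id] = data
    return data, source


def disk(page_id):
    return b"from disk"


remote = RemoteCache({24: b"P24"})
local_cache, metadata = {}, {24}
first = read_page(24, local_cache, metadata, remote, disk)
second = read_page(24, local_cache, metadata, remote, disk)
third = read_page(99, local_cache, metadata, remote, disk)
```

Checking the local metadata before contacting the remote VM means that a page that was never put costs no network round trip.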
27. Puma design Sequential I/O
Puma design
Filtering sequential I/O
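[Figure, two panels. VM1 - get: a miss on P24 leads to get(P24,32), the metadata check does not hit (!Hit), and P24 is tagged S in the metadata (steps 1 to 3). VM1 - put: alloc() triggers Reclaim of P24, which is tagged S in the metadata, followed by put(P24) (steps 1 to 4).]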
Sequential reads are detected through the read-ahead algorithm
“Sequential pages” are tagged into the metadata
When evicted, sequential pages are simply discarded
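
The filtering rule above can be sketched as follows. `SequentialFilter` and its methods are hypothetical names, and the kernel's read-ahead detection is replaced by an explicit `on_readahead(first_page, count)` call, so this only illustrates the tag-then-discard idea, not how the read-ahead algorithm itself recognises a sequence.

```python
class SequentialFilter:
    """Toy model of the sequential tag ("S") kept in the metadata."""

    def __init__(self):
        self.sequential = set()

    def on_readahead(self, first_page, count):
        """A read-ahead window was fetched: tag its pages as sequential."""
        self.sequential.update(range(first_page, first_page + count))

    def should_put(self, page_id):
        """Decide at eviction time whether the page is worth sending."""
        if page_id in self.sequential:
            # Sequential page: discard it instead of sending it.
            self.sequential.discard(page_id)
            return False
        return True


page_filter = SequentialFilter()
page_filter.on_readahead(24, 32)   # mirrors get(P24,32): pages 24..55
decisions = [page_filter.should_put(p) for p in (24, 55, 56, 7)]
```

Pages 24 and 55 fall inside the tagged window and are dropped at eviction, while pages 56 and 7 were never tagged and would be sent to the remote cache.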
28. Puma design Details and optimisations
Implementation details and optimisations
Response time
⇒ Puma is temporarily disabled if the response time becomes too high
Memory footprint
⇒ Metadata: amortized 64 bits/page, 2 MB of metadata per GB of cache
Memory recovery
⇒ Remote cache pages are discarded when reclaimed
Memory management: avoiding deadlocks
Atomic memory allocations
Use of pre-allocated memory pools
[Figure: the put path of P31 (alloc(), Reclaim, put(P31), transfer, metadata entry 31), with a second alloc() appearing inside the put path.]
Consistency
Dirty pages are written to disk before being sent to the cache
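
The memory-footprint figure above (64 bits per page, 2 MB per GB of cache) can be checked with a few lines of arithmetic. The 4 KiB page size is an assumption, since the slides do not state it, and `metadata_bytes` is a hypothetical helper name.

```python
PAGE_SIZE = 4 * 1024          # bytes per page; assumed, not stated on the slide
METADATA_BITS_PER_PAGE = 64   # amortized figure given on the slide
GIB = 1024 ** 3
MIB = 1024 ** 2


def metadata_bytes(cache_bytes, page_size=PAGE_SIZE):
    """Metadata needed to track cache_bytes of remote cache."""
    pages = cache_bytes // page_size
    return pages * METADATA_BITS_PER_PAGE // 8


per_gib = metadata_bytes(GIB)          # 262144 pages * 8 bytes
for_12_gib = metadata_bytes(12 * GIB)  # same rule applied to 12 GB of cache
```

With these assumptions, 1 GB of cache is 262144 pages of 8 bytes of metadata each, which is 2 MB and matches the slide; 12 GB of cache would need 24 MB.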
29. Evaluation Evaluation Overview
Evaluation Overview
Experiment setup on KVM
Puma server: provides from 512 MB to 12 GB of cache
Puma client: 1 GB
Baseline: a single VM without additional cache
Hosts: Intel Xeon E5-2660v2, 5 × 600GB SAS in RAID-0
Benchmarks: Filebench, BLAST, TPC-C, TPC-H, Postmark
Experiments
1. Varying workload on the server side
2. Co-located VMs with a paravirtualised network (virtio)
3. Latency injection
30. Evaluation Varying workload
Dynamic memory balancing
Comparison with memory ballooning
[Figure: Baseline vs. Auto-ballooning vs. Puma]
Reclaiming memory with memory ballooning incurs high latencies (avg: 20 ms)
Puma reclaims memory at a small cost (avg: 1.8 ms)
31. Evaluation Performance evaluation
Sequential I/O filtering
Unfiltered large sequential reads can severely degrade performance
Filtering sequential I/O allows us to focus on random accesses
32. Evaluation Performance evaluation
Performance improvement on database benchmarks
I/Os are a mix of random accesses and medium sized sequences
⇒ Concurrent accesses: sequential accesses are interleaved, which makes them slow
⇒ Non-inclusive strategy: pages are kept in the cache even if accessed sequentially
33. Evaluation Latency injection
Network latency management
Latency injection with Netem [LCA’05]
Speedup decreases as we inject network latency between nodes
When the response time is too high, Puma disables itself to avoid a performance drop
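
One way such a self-disabling mechanism could work is sketched below. The slides only say that Puma is temporarily disabled when the response time becomes too high, so everything else is an assumption made for illustration: the class name `LatencyGuard`, the exponential moving average, the 1 ms threshold and the 5 s cool-down.

```python
class LatencyGuard:
    """Disable remote caching for a while when responses get too slow."""

    def __init__(self, threshold_ms=1.0, alpha=0.2, cooldown_s=5.0):
        self.threshold_ms = threshold_ms  # assumed limit on the average
        self.alpha = alpha                # weight of the newest sample
        self.cooldown_s = cooldown_s      # assumed length of the pause
        self.average_ms = None
        self.disabled_until = 0.0

    def record(self, response_ms, now):
        """Feed the response time of one remote operation."""
        if self.average_ms is None:
            self.average_ms = response_ms
        else:
            self.average_ms += self.alpha * (response_ms - self.average_ms)
        if self.average_ms > self.threshold_ms:
            self.disabled_until = now + self.cooldown_s
            self.average_ms = None  # start from a fresh estimate afterwards

    def enabled(self, now):
        return now >= self.disabled_until


guard = LatencyGuard(threshold_ms=1.0, alpha=0.2, cooldown_s=5.0)
guard.record(0.1, now=0.0)
fast_ok = guard.enabled(now=0.0)
guard.record(10.0, now=1.0)   # average rises to 2.08 ms, above the limit
during_pause = guard.enabled(now=2.0)
after_pause = guard.enabled(now=6.0)
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would read a clock instead.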
34. Conclusion
Conclusion
Summary
⇒ Virtualization leads to a fragmentation of the available cache
⇒ Memory ballooning techniques cannot manage the distribution of the VMs' page caches
Puma: Pooling Unused memory in virtual MAchines
⇒ It is based on an efficient kernel-level remote caching mechanism
⇒ It handles clean cache pages to quickly recover the memory
⇒ It works with co-located VMs and remote VMs