Puma: Pooling Unused Memory in Virtual Machines for I/O-intensive Apps

Maxime Lorrillere, Julien Sopena, Sébastien Monnet and Pierre Sens
contact: maxime.lorrillere@lip6.fr
Kernel Recipes 2015

Introduction: Context
Problem: memory fragmentation
[Figure: Host 1 and Host 2, linked by 10GB Ethernet. On each host, memory holds the applications' anonymous pages and the page cache; the PFRA manages the cache in front of the disk.]

[Figure: VM1 and VM2 on Host 1, VM3 and VM4 on Host 2, each host running a KVM hypervisor. Each VM runs applications with its own PFRA-managed cache and uses virtio; a Swap label appears in the final build. The hosts are linked by 10GB Ethernet.]
Virtualization allows more flexibility and isolation
Problem: it fragments available memory
⇒ Sharing resources like CPU time is straightforward
⇒ Memory cannot be reassigned as efficiently as CPU time

Introduction: Related work
Solution: Memory Ballooning [OSDI’02]
[Figure: the two-host setup (VM1 and VM2 on Host 1, VM3 and VM4 on Host 2, virtio, 10GB Ethernet), with Balloon, Swap and I/O labels added across successive builds.]
The host asks a VM to inflate its balloon to return free memory
The host asks a VM to deflate its balloon to get more memory
Limitations
⇒ page cache is still fragmented
⇒ slow to recover
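
The inflate and deflate requests described above can be pictured with a toy accounting model. This is only an illustrative sketch: the `Host` and `GuestVM` classes and their fields are hypothetical names, not part of KVM or of any balloon driver API.

```python
class GuestVM:
    """Toy guest: memory held by the balloon is unusable by the guest."""

    def __init__(self, name, total_mb):
        self.name = name
        self.total_mb = total_mb
        self.balloon_mb = 0

    @property
    def usable_mb(self):
        return self.total_mb - self.balloon_mb


class Host:
    """Toy host that tracks memory handed back by guest balloons."""

    def __init__(self):
        self.free_mb = 0

    def inflate(self, vm, mb):
        # The guest gives up at most what it can still use.
        mb = min(mb, vm.usable_mb)
        vm.balloon_mb += mb
        self.free_mb += mb
        return mb

    def deflate(self, vm, mb):
        # The guest gets back at most what its balloon holds
        # and what the host currently has free.
        mb = min(mb, vm.balloon_mb, self.free_mb)
        vm.balloon_mb -= mb
        self.free_mb -= mb
        return mb
```

In this model, memory moves between VMs only through explicit inflate and deflate round trips via the host, which is one way to see why recovery can be slow.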

Memory Ballooning – Time to recover memory
1. Generate a lot of I/O on the first VM
2. Try to allocate the memory (malloc) on the second VM
[Figure: memory allocation time, Baseline vs. Auto-ballooning]
⇒ Memory allocations are 20× slower than the baseline
⇒ When it does not crash! (OOM-kill)

Our contribution: a cooperative page cache
[Figure: the two-host setup with Puma running in the VMs and a remote page cache reached over TCP (~100µs); the diagram also carries a ~10ms label.]
Puma’s approach:
Relies on a fast network between VMs and physical machines
Hypervisor, filesystem and block device agnostic
Handles only clean cache pages
⇒ Writes are generally non-blocking
⇒ Simple consistency scheme
⇒ Fast to recover memory!

Puma design: Basics
Local page cache eviction – put operation
[Figure: put of page P31 from VM1 to VM2, built up in five numbered steps: (1) alloc() on VM1, (2) Reclaim by the PFRA, (3) put(P31), (4) P31 is sent to VM2 and entry 31 is kept in VM1's metadata, (5) VM2 stores the page.]
Typically triggered by a memory allocation
Puma is integrated into the PFRA to detect page cache eviction
Pages are sent asynchronously to avoid slowdowns
Remote pages are stored into the system page cache
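
The put path can be sketched in user-space Python as follows. This is a minimal model under stated assumptions, not Puma's kernel code: `RemoteCache`, `PumaClient` and `flush` are hypothetical names, all pages are assumed clean, LRU stands in for the PFRA's victim selection, and asynchronous sending is modelled by a queue that is drained later.

```python
from collections import OrderedDict, deque


class RemoteCache:
    """Stands in for the VM that lends its unused memory (hypothetical)."""

    def __init__(self):
        self.pages = {}

    def store(self, page_id, data):
        self.pages[page_id] = data


class PumaClient:
    """Toy local page cache with LRU eviction and a deferred send queue."""

    def __init__(self, capacity, remote):
        self.capacity = capacity
        self.remote = remote
        self.local = OrderedDict()  # page_id -> data (clean pages only)
        self.metadata = set()       # ids of pages handed to the remote cache
        self.outbox = deque()       # puts waiting to be sent

    def insert(self, page_id, data):
        # Step 1: a new page needs room (the alloc() of the slide).
        while len(self.local) >= self.capacity:
            self._reclaim()
        self.local[page_id] = data

    def _reclaim(self):
        # Step 2: evict the least recently inserted page.
        victim, data = self.local.popitem(last=False)
        # Steps 3-4: queue the put and record the page in local metadata.
        self.outbox.append((victim, data))
        self.metadata.add(victim)

    def flush(self):
        # Step 5: deferred delivery; the remote side stores the page.
        while self.outbox:
            page_id, data = self.outbox.popleft()
            self.remote.store(page_id, data)
```

Because `insert` only enqueues the evicted page, the allocation never waits for the network in this model; the transfer happens whenever `flush` runs.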

Local page cache miss – get operation
[Figure: get of page P24 by VM1 from VM2, built up in five numbered steps: (1) a miss on P24 triggers get(P24), (2) metadata check (Hit?), (3) req(P24), (4) Lookup on VM2, (5) P24 is returned to VM1.]
Integrated into the page cache to detect local cache misses
A local cache miss leads to a (synchronous) get operation
Local metadata are used to know if and where a page is in the cache
Exclusive and non-inclusive caching strategies
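
The get path can be sketched the same way. The function below is an illustration only: `get_page` and its dictionary arguments are hypothetical, the `disk` mapping stands in for the block device, and the reading of the two strategies (exclusive removes the page from the remote cache on a get, non-inclusive leaves a copy there) is this sketch's interpretation of the slide.

```python
def get_page(page_id, local, metadata, remote, disk, exclusive=True):
    """Return (data, source) for a read of page_id."""
    if page_id in local:
        return local[page_id], "local"
    # Steps 1-2: local miss, so consult the local metadata first.
    if page_id in metadata:
        # Steps 3-5: synchronous request; the remote side looks the page up.
        if exclusive:
            data = remote.pop(page_id, None)
            metadata.discard(page_id)
        else:
            data = remote.get(page_id)
        if data is not None:
            local[page_id] = data
            return data, "remote"
        # Stale entry: the remote page was reclaimed in the meantime.
        metadata.discard(page_id)
    # Not available remotely: fall back to the block device.
    data = disk[page_id]
    local[page_id] = data
    return data, "disk"
```

Checking the local metadata before sending a request means a page that was never put costs no network round trip.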

Puma design: Sequential I/O
Filtering sequential I/O
[Figure: on VM1, a get path and a put path for page P24. Get, in three numbered steps: get(P24,32) misses (!Hit in the metadata) and P24 is tagged S. Put, in four numbered steps: alloc(), Reclaim, put(P24), then the tagged page is dropped.]
Sequential reads are detected through the read-ahead algorithm
“Sequential pages” are tagged into the metadata
When evicted, sequential pages are simply discarded
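
The tag-then-discard idea can be illustrated with a small sketch. It does not reproduce the kernel read-ahead algorithm: as a labelled simplification, a read is treated as sequential when it starts exactly where the previous read ended. `SequentialFilter`, `on_read` and `should_put` are hypothetical names.

```python
class SequentialFilter:
    """Tags pages read sequentially and filters them out of put operations."""

    def __init__(self):
        self.sequential = set()    # tagged page ids (the "S" mark on the slide)
        self.next_expected = None  # first page after the previous read

    def on_read(self, first_page, count):
        # Simplified stand-in for read-ahead detection: a read that starts
        # where the previous one ended is treated as sequential.
        if first_page == self.next_expected:
            self.sequential.update(range(first_page, first_page + count))
        self.next_expected = first_page + count

    def should_put(self, page_id):
        # On eviction, tagged pages are dropped instead of being sent.
        if page_id in self.sequential:
            self.sequential.discard(page_id)
            return False
        return True
```

For example, after `on_read(0, 32)` followed by `on_read(32, 32)`, pages 32 to 63 are tagged and `should_put(40)` returns `False`, while a page read at an unrelated offset is still eligible for a put.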

Puma design: Details and optimisations
Implementation details and optimisations
Response time
⇒ Puma is temporarily disabled if the response time becomes too high
Memory footprint
⇒ Metadata: amortized 64 bits/page, 2 MB of metadata per GB of cache
Memory recovery
⇒ Remote cache pages are discarded when reclaimed
Memory management: avoiding deadlocks
⇒ Atomic memory allocations
⇒ Use of pre-allocated memory pools
[Figure: the put path for P31, in which sending the page during reclaim involves a further alloc().]
Consistency
⇒ Dirty pages are written to disk before being sent to the cache
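
The memory footprint figure can be checked with a short calculation. The 4 KiB page size is an assumption (the slide does not state it), and `metadata_bytes` is a hypothetical helper written for this check.

```python
PAGE_SIZE = 4096             # bytes per page; assumed, not stated on the slide
METADATA_BITS_PER_PAGE = 64  # amortized figure from the slide
GIB = 1024 ** 3
MIB = 1024 ** 2


def metadata_bytes(cache_bytes, page_size=PAGE_SIZE):
    """Metadata needed to track cache_bytes of remote cache."""
    pages = cache_bytes // page_size
    return pages * METADATA_BITS_PER_PAGE // 8


print(metadata_bytes(GIB) / MIB)  # 2.0, i.e. 2 MiB of metadata per GiB of cache
```

Under that assumption, 1 GiB of cache is 262,144 pages at 8 bytes each, which matches the 2 MB per GB stated above.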

Evaluation: Overview
Experiment setup on KVM
Puma server: provides from 512 MB to 12 GB of cache
Puma client: 1 GB
Baseline: a single VM without additional cache
Hosts: Intel Xeon E5-2660v2, 5 × 600GB SAS in RAID-0
Benchmarks: Filebench, BLAST, TPC-C, TPC-H, Postmark
Experiments
1. Varying workload on server side
2. Co-located VMs with a paravirtualised network (virtio)
3. Latency injection

Evaluation: Varying workload
Dynamic memory balancing
Comparison with memory ballooning
[Figure: Baseline vs. Auto-ballooning vs. Puma]
Memory ballooning reclaims memory with high latency (avg: 20ms)
Puma reclaims memory at a small cost (avg: 1.8ms)

Evaluation: Performance evaluation
Sequential I/O filtering
Unfiltered long sequences can severely degrade performance
Filtering sequential I/O allows us to focus on random accesses

Performance improvement on database benchmarks
I/Os are a mix of random accesses and medium-sized sequences
⇒ Concurrent accesses: sequential accesses are interleaved, which makes them slow
⇒ Non-inclusive strategy: pages are kept in cache even if accessed sequentially

Evaluation: Latency injection
Network latency management
Latency injection with Netem [LCA’05]
Speedup decreases as we inject network latency between nodes
When the response time is too high, Puma disables itself to avoid a performance drop
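
One way to picture the self-disabling behaviour is a small guard that tracks a moving average of response times. This is a sketch under assumptions, not Puma's actual policy: `LatencyGuard`, the threshold, the cooldown and the averaging weight are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
class LatencyGuard:
    """Disables remote gets for a while when they become too slow."""

    def __init__(self, threshold_s, cooldown_s, alpha=0.5):
        self.threshold_s = threshold_s  # hypothetical acceptable average
        self.cooldown_s = cooldown_s    # hypothetical time spent disabled
        self.alpha = alpha              # weight of the newest sample
        self.avg_s = 0.0
        self.disabled_until = 0.0

    def enabled(self, now):
        return now >= self.disabled_until

    def record(self, response_s, now):
        # Exponentially weighted moving average of observed response times.
        self.avg_s = self.alpha * response_s + (1 - self.alpha) * self.avg_s
        if self.avg_s > self.threshold_s:
            self.disabled_until = now + self.cooldown_s
            self.avg_s = 0.0  # start from a clean average once re-enabled
```

A caller would check `enabled(now)` before issuing a remote get and go straight to disk while the guard is disabled; in real code `now` would typically come from a monotonic clock.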

Conclusion
Summary
⇒ Virtualization leads to a fragmentation of the available cache
⇒ Memory ballooning techniques are not able to manage the distribution of the VMs’ page cache
Puma: Pooling Unused memory in virtual MAchines
⇒ It is based on an efficient kernel-level remote caching mechanism
⇒ It handles clean cache pages to quickly recover the memory
⇒ It works with co-located VMs and remote VMs
