KVM Performance Optimization

Paul Sim
Cloud Consultant
paul.sim@canonical.com
Index
● CPU & Memory
  ○ vCPU pinning
  ○ NUMA affinity
  ○ THP (Transparent Huge Page)
  ○ KSM (Kernel SamePage Merging) & virtio_balloon
● Networking
  ○ vhost_net
  ○ Interrupt handling
  ○ Large Segment Offload
● Block Device
  ○ I/O Scheduler
  ○ VM Cache mode
  ○ Asynchronous I/O
● CPU & Memory
Modern CPU cache architecture and latency (Intel Sandy Bridge)

[Diagram: per-socket cache hierarchy and typical access latencies]
  - L1i / L1d (per core, 32K each)     : 4 cycles, ~0.5 ns
  - L2 or MLC (unified, 256K per core) : 11 cycles, ~7 ns
  - L3 or LLC (shared)                 : 14 ~ 38 cycles, ~45 ns
  - Local memory                       : 75 ~ 100 ns
  - Remote memory (across QPI)         : 120 ~ 160 ns
● CPU & Memory - vCPU pinning

[Diagram: KVM Guest0 ~ Guest4 pinned across Core 0 ~ Core 3 of Node0 and Node1, each node with its own LLC and node memory.]

* Pin a specific vCPU on a specific physical CPU.
* vCPU pinning increases the CPU cache hit ratio.

1. Discover the CPU topology.
   - virsh capabilities
2. Pin a vCPU on a specific core (see the example below).
   - virsh vcpupin <domain> <vcpu-num> <cpu-num>
3. Print vCPU information.
   - virsh vcpuinfo <domain>

* What about a multi-node virtual machine?
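As a concrete illustration, a minimal pinning sketch; the domain name vm-test and host CPUs 2 and 3 are assumptions for illustration, not values from the slides. Pick CPUs that share a node and LLC (check with virsh capabilities).

# Hypothetical domain and CPU numbers - pin vCPU 0 and 1 to host CPUs 2 and 3 at runtime.
virsh vcpupin vm-test 0 2
virsh vcpupin vm-test 1 3

# Equivalent persistent configuration in the domain XML (virsh edit vm-test):
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
</cputune>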
● CPU & Memory - vCPU pinning

* Two times faster memory access than without pinning (benchmark chart).
  - Shorter is better on the scheduling and mutex tests.
  - Longer is better on the memory test.
● CPU & Memory - NUMA affinity

[Diagram: CPU architecture (Intel Sandy Bridge) - four sockets (CPU 0 ~ CPU 3), each with its own cores, shared LLC, memory controller with locally attached memory, and I/O controller with PCI-E lanes, interconnected by QPI links.]
● CPU & Memory - NUMA affinity
janghoon@machine-04:~$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 40 41 42 43 44 45 46 47 48 49
node 0 size: 24567 MB
node 0 free: 6010 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 50 51 52 53 54 55 56 57 58 59
node 1 size: 24576 MB
node 1 free: 8557 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69
node 2 size: 24567 MB
node 2 free: 6320 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39 70 71 72 73 74 75 76 77 78 79
node 3 size: 24567 MB
node 3 free: 8204 MB
node distances:
node 0 1 2 3
0: 10 16 16 22
1: 16 10 22 16
2: 16 22 10 16
3: 22 16 16 10

* Remote memory is expensive!
● CPU & Memory - NUMA affinity
1. Determine where the pages of a VM are allocated.
- cat /proc/<PID>/numa_maps
- cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/<KVM name>/memory.numa_stat
total=244973 N0=118375 N1=126598
file=81 N0=24 N1=57
anon=244892 N0=118351 N1=126541
unevictable=0 N0=0 N1=0

2. Change memory policy mode
- cgset -r cpuset.mems=<Node> sysdefault/libvirt/qemu/<KVM name>/emulator/
3. Migrate pages into a specific node.
- migratepages <PID> from-node to-node
- cat /proc/<PID>/status
* Memory policy modes
  1) interleave : memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target, fall back to other nodes.
  2) bind : only allocate memory from the given nodes. Allocation will fail when there is not enough memory available on these nodes.
  3) preferred : preferably allocate memory on a node, but if memory cannot be allocated there, fall back to other nodes.
* The “preferred” memory policy mode is not currently supported on cgroup.
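Alongside the cgroup commands above, libvirt can express the same policy per domain; a minimal sketch assuming the guest should be bound to node 0 (mode='strict' corresponds to the "bind" policy described above):

<!-- In the domain XML (virsh edit <KVM name>): keep all guest memory on NUMA node 0. -->
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>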
● CPU & Memory - NUMA affinity
- NUMA reclaim

[Diagram: per-node memory layout (free memory, mapped page cache, unmapped page cache, anonymous pages) on Node0 and Node1, before and after local reclaim.]

1. Check if zone reclaim is enabled.
   - cat /proc/sys/vm/zone_reclaim_mode
     0 (default) : the Linux kernel allocates the memory to a remote NUMA node where free memory is available.
     1 : the Linux kernel reclaims unmapped page caches for the local NUMA node rather than immediately allocating the memory to a remote NUMA node.

It is known that a virtual machine causes zone reclaim to occur when KSM (Kernel Same-page Merging) is enabled or huge pages are enabled on the virtual machine side.
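A short sketch for checking and, if it hurts the VM workload, disabling zone reclaim; disabling it is an assumption about your workload, not a recommendation from the slides:

# 0 = allocate from a remote node when the local node is full, 1 = reclaim local page cache first.
cat /proc/sys/vm/zone_reclaim_mode
# Turn local reclaim off (add vm.zone_reclaim_mode=0 to /etc/sysctl.conf to persist).
echo 0 | sudo tee /proc/sys/vm/zone_reclaim_mode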
● CPU & Memory - THP (Transparent Huge Page)
- Paging levels in some 64-bit architectures

  Platform   Page size       Address bits used   Paging levels   Splitting
  Alpha      8KB             43                  3               10 + 10 + 10 + 13
  IA64       4KB             39                  3               9 + 9 + 9 + 12
  PPC64      4KB             41                  3               10 + 10 + 9 + 12
  x86_64     4KB (default)   48                  4               9 + 9 + 9 + 9 + 12
  x86_64     2MB (THP)       48                  3               9 + 9 + 9 + 21
● CPU & Memory - THP (Transparent Huge Page)
- Memory address translation in 64-bit Linux

Linear (virtual) address (48 bit):
  Global Dir (9 bit) | Upper Dir (9 bit) | Middle Dir (9 bit) | Table (9 bit with 4KB pages, 0 bit with 2MB pages) | Offset (12 bit with 4KB pages, 21 bit with 2MB pages)

cr3 points to the Page Global Directory; the walk continues through the Page Upper Directory, the Page Middle Directory and the Page Table to reach the 4KB physical page. With 2MB pages the Page Table level disappears, so the translation needs one step less.
● CPU & Memory - THP (Transparent Huge Page)
Paging hardware with TLB (Translation Lookaside Buffer)

* Translation Lookaside Buffer (TLB): a cache that memory management hardware uses to improve virtual address translation speed.
* The TLB is also a kind of cache memory in the CPU.
* Then, how can we increase the TLB hit ratio?
  - A TLB can hold only 8 ~ 1024 entries.
  - Decrease the number of pages.
    e.g. on a 32GB memory system:
      8,388,608 pages with 4KB page blocks
      16,384 pages with 2MB page blocks
● CPU & Memory - THP (Transparent Huge Page)
THP performance benchmark - MySQL 5.5 OLTP testing

* A guest machine also has to use HugePages for the best effect
● CPU & Memory - THP (Transparent Huge Page)
1. Check current THP configuration
- cat /sys/kernel/mm/transparent_hugepage/enabled
2. Configure THP mode
- echo mode > /sys/kernel/mm/transparent_hugepage/enabled
3. monitor Huge page usage
- cat /proc/meminfo | grep Huge
janghoon@machine-04:~$ grep Huge /proc/meminfo
AnonHugePages:    462848 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

- grep Huge /proc/<PID>/smaps
4. Adjust parameters under /sys/kernel/mm/transparent_hugepage/khugepaged
- grep thp /proc/vmstat
* THP modes
1) always : use HugePages always
2) madvise : use HugePages only in specific regions, madvise(MADV_HUGEPAGE). Default on Ubuntu precise
3) never : not use HugePages
* Currently it only works for anonymous memory mappings but in the future it can expand over the pagecache
layer starting with tmpfs. - Linux Kernel Documentation/vm/transhuge.txt
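Tying the host and guest sides together, a hedged sketch: set the host THP mode and back the guest with huge pages through libvirt. The 'always' mode and the use of <memoryBacking> are illustrative choices, not taken from the slides.

# Host: let khugepaged use huge pages everywhere ('madvise' is the more conservative choice).
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Guest: request hugepage backing in the domain XML (virsh edit <KVM name>):
<memoryBacking>
  <hugepages/>
</memoryBacking>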
● CPU & Memory - KSM & virtio_balloon
- KSM (Kernel SamePage Merging)

[Diagram: KSM merging identical MADV_MERGEABLE pages shared by Guest 0, Guest 1 and Guest 2.]

1. A kernel feature used by KVM that shares identical memory pages between processes, allowing the memory to be over-committed.
2. Only merges anonymous (private) pages.
3. Enable KSM
   - echo 1 > /sys/kernel/mm/ksm/run
4. Monitor KSM
   - files under /sys/kernel/mm/ksm/
5. For NUMA (in Linux 3.9)
   - /sys/kernel/mm/ksm/merge_across_nodes
     0 : merge only pages in the memory area of the same NUMA node
     1 : merge pages across nodes
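A small monitoring sketch using the same /sys/kernel/mm/ksm/ files; reading pages_sharing against pages_shared as a sharing ratio is a common rule of thumb rather than something stated on the slide:

# Start the KSM daemon and see how much is actually being merged.
echo 1 | sudo tee /sys/kernel/mm/ksm/run
grep . /sys/kernel/mm/ksm/pages_shared \
       /sys/kernel/mm/ksm/pages_sharing \
       /sys/kernel/mm/ksm/full_scans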
● CPU & Memory - KSM & virtio_balloon
- Virtio_balloon

[Diagram: host and guest memory with balloon regions inside VM 0 and VM 1; the balloon marks guest memory handed back to the host (libvirt's <memory> versus <currentMemory>).]

1. The hypervisor sends a request to the guest operating system to return some amount of memory back to the hypervisor.
2. The virtio_balloon driver in the guest operating system receives the request from the hypervisor.
3. The virtio_balloon driver inflates a balloon of memory inside the guest operating system.
4. The guest operating system returns the balloon of memory back to the hypervisor.
5. The hypervisor allocates the memory from the balloon elsewhere as needed.
6. If the memory from the balloon later becomes available, the hypervisor can return the memory to the guest operating system.
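A minimal sketch of driving the balloon from the host side with libvirt; the domain name vm-test and the sizes are assumptions:

<!-- Domain XML: maximum memory, current (balloon) target, and the virtio balloon device. -->
<memory unit='KiB'>4194304</memory>
<currentMemory unit='KiB'>2097152</currentMemory>
<devices>
  <memballoon model='virtio'/>
</devices>

# At runtime, ask the guest to inflate/deflate its balloon to a 2 GiB target.
virsh setmem vm-test 2097152 --live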
● Networking - vhost-net
[Diagram: virtio vs vhost-net data path. With plain virtio, the virtio front-end in the guest talks to the virtio back-end inside QEMU, which copies packets to a TAP device in the host kernel and on to the bridge and pNIC, costing extra context switches. With vhost-net, the guest virtqueues are serviced directly by the vhost_net kernel module via DMA, so packets move between the virtqueue and the TAP/bridge/pNIC path without going through QEMU.]
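With a recent kernel and /dev/vhost-net present, libvirt uses the in-kernel back-end for virtio NICs automatically; a minimal sketch that requests it explicitly (the bridge name br0 is an assumption):

<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
  <!-- name='vhost' selects the in-kernel vhost-net back-end; name='qemu' forces the userspace back-end. -->
  <driver name='vhost'/>
</interface>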
● Networking - vhost-net
[Diagram: SR-IOV and PCI pass-through. With PCI pass-through, a physical NIC is assigned directly to a guest through the I/O MMU, bypassing QEMU and the host bridge. With SR-IOV, the physical function (PF) of the NIC exposes multiple virtual functions (VFs), each of which is passed through to a guest as its own vNIC.]
● Networking - vhost-net
● Networking - Interrupt handling
1. Multi-queue NIC
root@machine-04:~# cat /proc/interrupts | grep eth0
 71:      0      0      0      0      0      0      0      0   IR-PCI-MSI-edge   eth0
 72:      0      0    213 679058      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-0
 73:      0      0      0      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-1
 74: 562872      0      0      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-2
 75: 561303      0      0      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-3
 76: 583422      0      0      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-4
 77:     21      5 563963      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-5
 78:     23      0 548548      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-6
 79: 567242      0      0      0      0      0      0      0   IR-PCI-MSI-edge   eth0-TxRx-7

- RPS (Receive Packet Steering) / XPS (Transmit Packet Steering)
  : A mechanism for intelligently selecting which receive/transmit queue to use when receiving/transmitting a packet on a multi-queue device. It can configure CPU affinity per device queue and is disabled by default.
  : Configure the CPU mask:
    - echo 1 > /sys/class/net/<device>/queues/rx-<queue num>/rps_cpus
    - echo 16 > /sys/class/net/<device>/queues/tx-<queue num>/xps_cpus
  Around 20 ~ 100 % improvement.
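A small sketch that applies the same RPS mask to every receive queue of a NIC; the interface name eth0 and the mask f (CPUs 0-3) are assumptions to adapt to your topology:

# Spread receive processing of eth0 over CPUs 0-3 on every rx queue.
for q in /sys/class/net/eth0/queues/rx-*; do
    echo f | sudo tee ${q}/rps_cpus
done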
● Networking - Interrupt handling
2. MSI(Message Signaled Interrupt), MSI-X
- Make sure MSI-X is enabled
janghoon@machine-04:~$ sudo lspci -vs 01:00.1 | grep MSI
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
Capabilities: [a0] Express Endpoint, MSI 00

3. NAPI (New API)
- a feature in the Linux kernel aiming to improve the performance of high-speed networking and avoid interrupt storms.
- Ask your NIC vendor whether it’s enabled by default.

Most modern NICs support multi-queue, MSI-X and NAPI.
However, you may need to make sure these features are configured correctly and working properly.
http://en.wikipedia.org/wiki/Message_Signaled_Interrupts
http://en.wikipedia.org/wiki/New_API
● Networking - Interrupt handling
4. IRQ Affinity (Pin Interrupts to the local node)

[Diagram: two-socket layout (CPU 0 and CPU 1), each with its own cores, memory controller with local memory, and I/O controller with PCI-E lanes, connected by QPI.]

Each node has PCI-E devices connected directly (on Intel Sandy Bridge).
Where is my 10G NIC connected to?
● Networking - Interrupt handling
4. IRQ Affinity (Pin Interrupts to the local node)
1) stop irqbalance service
2) determine node that a NIC is connected to
- lspci -tv
- lspci -vs <PCI device bus address>
- dmidecode -t slot
- cat /sys/devices/pci*/<bus address>/numa_node (-1 : not detected)
- cat /proc/irq/<Interrupt number>/node
3) Pin interrupts from the NIC to a specific node
- echo f > /proc/irq/<Interrupt number>/smp_affinity
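Putting the steps together, a hedged sketch that pins every eth0 interrupt to one CPU mask; eth0 and the mask f are assumptions, so pick a mask covering CPUs on the node reported by /proc/irq/<Interrupt number>/node:

sudo service irqbalance stop
for irq in $(grep eth0 /proc/interrupts | cut -d: -f1); do
    echo f | sudo tee /proc/irq/${irq}/smp_affinity
done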
● Networking - Large Segment Offload
[Diagram: without LSO, the kernel splits a 4K application buffer into MTU-sized segments and hands each header + data pair to the NIC; with LSO, the kernel hands the NIC a single large header + data buffer plus metadata, and the NIC performs the segmentation.]

janghoon@ubuntu-precise:~$ sudo ethtool -k eth0
Offload parameters for eth0:
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
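If one of these offloads is reported off and the NIC supports it, it can be toggled with ethtool; a sketch assuming eth0:

# Enable TCP segmentation offload plus generic segmentation/receive offload, then re-check.
sudo ethtool -K eth0 tso on gso on gro on
sudo ethtool -k eth0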
● Networking - Large Segment Offload

1. 100% throughput performance improvement.
2. GSO/GRO, TSO and UFO are enabled by default on Ubuntu precise 12.04 LTS.
3. Jumbo frames are not very effective.
● Block device - I/O Scheduler
1. Noop : basic FIFO queue.
2. Deadline : I/O requests are placed in a priority queue and are guaranteed to be run within a certain time. Low latency.
3. CFQ (Completely Fair Queueing) : I/O requests are distributed to a number of per-process queues. Default I/O scheduler on Ubuntu.
janghoon@ubuntu-precise:~$ sudo cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
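A short sketch for switching the scheduler at runtime; the device sdb and the choice of deadline are assumptions, and the change does not persist across reboots:

# Move /dev/sdb to the deadline scheduler, then confirm.
echo deadline | sudo tee /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler      # -> noop [deadline] cfq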
● Block device - VM Cache mode
Disable host page cache
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none'/>
<source file='/mnt/VM_IMAGES/VM-test.img'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>

[Diagram: read/write paths from the guest through the host page cache, the disk write cache and the physical disk for each cache mode - writeback and writethrough use the host page cache, while none and directsync bypass it; writethrough and directsync also bypass the disk write cache.]
● Block device - Asynchronous I/O
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/dev/sda7'/>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>

