3. Xen Guest NUMA Project
• Working with Xen Community:
− Andre Przywara <andre.przywara@amd.com>
− Dulloor Rao <dulloor@gmail.com>
− You are welcome to join us
• Generic guest NUMA support both for PV and HVM
− Major difference is basically ACPI tables
− NUMA-specific enlightenments are applicable to both
Xen Summit NA 2010
4. Agenda
• NUMA machines
• Importance of NUMA Awareness
• Motivation of NUMA Guests
• What is required to support an effective NUMA guest?
• Getting host info and resource allocation
• Guest configuration
• Current Status and Next Steps
5. NUMA Machines
[Diagram: a 4-socket Xeon® 7500 system. Each node* contains cores, with memory attached via memory buffers; the nodes connect to I/O hubs.]
*: A socket/package can contain multiple nodes.
7. Importance of NUMA Awareness
Andre Przywara <andre.przywara@amd.com>
lmbench's rd benchmark (normalized to native Linux = 100):

guests |  numa=off (min/avg/max)  |   numa=on (min/avg/max)   | avg increase
   1   |          78.0            |          102.3            |
   7   |   37.4 / 45.6 / 62.0     |   90.6 / 102.3 / 110.9    |    124.4%
  15   |   21.0 / 25.8 / 31.7     |   41.7 /  48.7 /  54.1    |     88.2%
  23   |   13.4 / 17.5 / 23.2     |   25.0 /  28.0 /  30.1    |     60.2%
kernel compile in tmpfs, 1 VCPU, 2 GB RAM, average elapsed time (seconds):

guests | numa=off | numa=on  | increase
   1   | 480.610  | 464.320  |   3.4%
   7   | 482.109  | 461.721  |   4.2%
  15   | 515.297  | 477.669  |   7.3%
  23   | 548.427  | 495.180  |   9.7%
again with 2 VCPUs and make -j2:

guests | numa=off | numa=on  | increase
   1   | 264.580  | 261.690  |   1.1%
   7   | 279.763  | 258.907  |   7.7%
  15   | 330.385  | 272.762  |  17.4%
  23   | 463.510  | 390.547  |  15.7%  (46 VCPUs on 32 pCPUs)

*: 4-socket AMD Magny-Cours machine with 8 nodes, 48 cores, and 96 GB RAM.
http://lists.xensource.com/archives/html/xen-devel/2009-12/msg00000.html
8. Motivation
• More NUMA machines in the market
• Run very large guests efficiently on NUMA machines
− More memory, VCPUs, and I/O spanning multiple nodes
− Higher performance and throughput
• Allow existing OS and apps to run in virtualization with NUMA
enabled (or disabled)
− Populate guest ACPI SRAT (Static Resource Affinity Table) and SLIT
(System Locality Information Table)
− NUMA libraries
• NUMA-specific optimizations/enlightenments
9. Achieving NUMA Performance
• Which processors (i.e. cores) are connected directly to which
blocks of memory?
− SRAT (Static Resource Affinity Table) or PV
• How far apart are the processors from their associated
memory banks?
− SLIT (System Locality Information Table) or PV
• Virtualization Specific Requirements
− Bind VCPUs to node
− Construct guest SRAT and SLIT
• Need to reflect hardware attributes
• Predictable and repeatable
− Use fixed guest configuration
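As a concrete sketch of "construct guest SRAT and SLIT": the guest's SLIT can be built by restricting the host's node-distance matrix to the nodes the guest's memory actually comes from. The function and the distance values below are illustrative, not Xen code; real SLITs are ACPI tables.

```python
# Sketch (not Xen code): build a guest SLIT by restricting the host SLIT
# to the host nodes backing the guest. Distances follow ACPI convention
# (10 = local).

def guest_slit(host_slit, guest_nodes):
    """host_slit[i][j] is the host distance from node i to node j.
    guest_nodes maps guest node index -> host node index."""
    n = len(guest_nodes)
    return [[host_slit[guest_nodes[a]][guest_nodes[b]] for b in range(n)]
            for a in range(n)]

# Hypothetical 4-node host; the guest got memory from host nodes 1 and 3.
host = [
    [10, 16, 16, 22],
    [16, 10, 22, 16],
    [16, 22, 10, 16],
    [22, 16, 16, 10],
]
print(guest_slit(host, [1, 3]))   # → [[10, 16], [16, 10]]
```

The guest then sees a consistent two-node topology whose distances mirror the real hardware, which is what makes its own NUMA-aware decisions meaningful.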
10. Constructing SRAT and SLIT for Guests
• Get platform info from host using host NUMA API (in
upstream)
− XEN_SYSCTL_topologyinfo
• # of cores per node/socket
− XEN_SYSCTL_numainfo
• Equivalent to SRAT and SLIT
• Allocate memory from nodes based on memory allocation
strategy in config file
− CONFINE, SPLIT, STRIPE (next page)
− # of nodes
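A rough sketch of what the toolstack derives from XEN_SYSCTL_topologyinfo-style data: a per-CPU node mapping turned into the cores-per-node view needed for placement. The array layout here is a simplification (the real interface reports separate core/socket/node arrays per CPU):

```python
# Sketch: given a hypothetical cpu_to_node array (index = CPU, value =
# node), derive which CPUs live on which node.

def cpus_by_node(cpu_to_node):
    nodes = {}
    for cpu, node in enumerate(cpu_to_node):
        nodes.setdefault(node, []).append(cpu)
    return nodes

# Hypothetical 8-CPU, 2-node box:
print(cpus_by_node([0, 0, 0, 0, 1, 1, 1, 1]))
# → {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

Together with the XEN_SYSCTL_numainfo data (per-node memory and distances), this is enough to pick nodes and pin VCPUs accordingly.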
11. Guest NUMA Config Options
• Number of nodes means “# of nodes from which memory is
allocated”
− Not necessarily visible to guest
• max_guest_nodes=<N>
− Specify the desired number of nodes. Defaults to the number of system
nodes.
• min_guest_nodes=<N>
− Specify the minimum number of nodes. Memory is allocated from at least
min_guest_nodes nodes; guest creation fails if the allocation cannot meet
this. 1 by default.
• Number of nodes matters for SPLIT and STRIPE (next page)
• Create guest in deterministic way by setting
min_guest_nodes = max_guest_nodes
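A sketch of how these options might appear in a domU config file. The option names come from these slides, but the exact syntax in the proposed patches may differ, and the memory/VCPU values are hypothetical:

```
# domU config fragment (illustrative values)
memory = 8192
vcpus  = 8
# Deterministic two-node guest: min == max
max_guest_nodes = 2
min_guest_nodes = 2
```

Setting min_guest_nodes equal to max_guest_nodes, as above, is the deterministic case the slide describes: the guest is created on exactly that many nodes or not at all.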
12. Guest NUMA Config Options (cont.)
Memory Allocation Strategy:
• CONFINE: Allocate the entire domain memory from a single node.
Fails if this is not possible.
− No need to tell the guest about NUMA at all.
• SPLIT: Allocate domain memory by splitting it equally across the
nodes. Fails if this is not possible.
− Populate the NUMA topology and propagate it to the guest (including PV
querying via hypercall). If the guest is paravirtualized and does not know
about NUMA (missing ELF hint), fail.
• STRIPE: Interleave domain memory across the nodes.
− No need to tell the guest about NUMA at all.
• AUTOMATIC: Try the three strategies one after another (order:
CONFINE, SPLIT, STRIPE)
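The AUTOMATIC fallback order can be sketched as follows. The free-memory figures (in MB) and the simple fit checks are hypothetical; the real allocator works on pages inside Xen:

```python
# Sketch of the AUTOMATIC strategy: try CONFINE, then SPLIT, then STRIPE.
# `free` is per-node free memory in MB; `need` is the guest's memory in MB.

def confine(free, need):
    """All memory from a single node, or None if no node is big enough."""
    for node, mb in enumerate(free):
        if mb >= need:
            return {node: need}
    return None

def split(free, need, nodes):
    """An equal share from each of `nodes` nodes, or None."""
    share = need // nodes
    picks = [n for n, mb in enumerate(free) if mb >= share][:nodes]
    return {n: share for n in picks} if len(picks) == nodes else None

def stripe(free, need):
    """Last resort: interleave across every node with free memory."""
    nodes = [n for n, mb in enumerate(free) if mb > 0]
    return {n: need // len(nodes) for n in nodes}

def automatic(free, need, nodes=2):
    return confine(free, need) or split(free, need, nodes) or stripe(free, need)

# 2-node host, 6 GB free per node:
print(automatic([6144, 6144], 4096))    # → {0: 4096}          (CONFINE)
print(automatic([6144, 6144], 10240))   # → {0: 5120, 1: 5120} (SPLIT)
```

This mirrors the slide's ordering: prefer the layout that hides NUMA entirely, fall back to a real NUMA guest, and stripe only when neither fits.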
13. Considerations on Live Migration
• Number of nodes needs to be the same
• Memory allocation strategy needs to be inherited for live
migration
− CONFINE and STRIPE guests are not really NUMA guests
− SPLIT: SPLIT is used again at live-migration time.
• If target machine has similar NUMA characteristics, it’s possible to do live
migration retaining NUMA performance.
14. Current Status and Next Steps
• Current Status
− Host NUMA API is in upstream
− Rebasing the patches to submit
− Re-measuring performance
− Merge patches from Dulloor and Andre
• Next Steps
− Performance analysis with different workloads
• Scheduling
− I/O NUMA
• DMA across nodes with direct device assignment
− Live Migration
• Anyone?