3. Xen Guest NUMA Project
• Working with Xen Community:
− Andre Przywara <andre.przywara@amd.com>
− Dulloor Rao <dulloor@gmail.com>
− You are welcome to join us
• Generic guest NUMA support both for PV and HVM
− Major difference is basically ACPI tables
− NUMA-specific enlightenments are applicable to both
Xen Summit NA 2010
4. Agenda
• NUMA machines
• Importance of NUMA Awareness
• Motivation of NUMA Guests
• What is required to support an effective NUMA guest?
• Getting host info and resource allocation
• Guest configuration
• Current Status and Next Steps
5. NUMA Machines
[Diagram: a 4-socket Xeon® 7500 system. Each node* contains cores, with memory attached via memory buffers; the nodes connect to I/O hubs.]
*: A socket/package can contain multiple nodes.
7. Importance of NUMA Awareness
Andre Przywara <andre.przywara@amd.com>
lmbench's rd benchmark (normalized to native Linux = 100):

guests |  numa=off (min/avg/max)  |   numa=on (min/avg/max)   | avg increase
   1   |          78.0            |          102.3            |
   7   |   37.4 / 45.6 / 62.0     |   90.6 / 102.3 / 110.9    |    124.4%
  15   |   21.0 / 25.8 / 31.7     |   41.7 /  48.7 /  54.1    |     88.2%
  23   |   13.4 / 17.5 / 23.2     |   25.0 /  28.0 /  30.1    |     60.2%
kernel compile in tmpfs, 1 VCPU, 2 GB RAM, average elapsed time (seconds):

guests | numa=off | numa=on  | increase
   1   | 480.610  | 464.320  |   3.4%
   7   | 482.109  | 461.721  |   4.2%
  15   | 515.297  | 477.669  |   7.3%
  23   | 548.427  | 495.180  |   9.7%
again with 2 VCPUs and make -j2:

guests | numa=off | numa=on  | increase
   1   | 264.580  | 261.690  |   1.1%
   7   | 279.763  | 258.907  |   7.7%
  15   | 330.385  | 272.762  |  17.4%
  23   | 463.510  | 390.547  |  15.7%  (46 VCPUs on 32 pCPUs)

*: 4-socket AMD Magny-Cours machine with 8 nodes, 48 cores, and 96 GB RAM.
http://lists.xensource.com/archives/html/xen-devel/2009-12/msg00000.html
8. Motivation
• More NUMA machines in the market
• Run very large guests efficiently on NUMA machines
− More memory, VCPUs, and I/O spanning multiple nodes
− Higher performance and throughput
• Allow existing OS and apps to run in virtualization with NUMA
enabled (or disabled)
− Populate guest ACPI SRAT (Static Resource Affinity Table) and SLIT
(System Locality Information Table)
− NUMA libraries
• NUMA-specific optimizations/enlightenments
9. Achieving NUMA Performance
• Which processors (i.e. cores) are connected directly to which
blocks of memory?
− SRAT (Static Resource Affinity Table) or PV
• How far apart are the processors from their associated
memory banks?
− SLIT (System Locality Information Table) or PV
• Virtualization Specific Requirements
− Bind VCPUs to node
− Construct guest SRAT and SLIT
• Need to reflect hardware attributes
• Predictable and repeatable
− Use fixed guest configuration
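As a concrete sketch of "construct guest SRAT and SLIT": the guest's SLIT can be built by restricting the host's node-distance matrix to the nodes the guest's memory actually comes from. The function and the distance values below are illustrative, not Xen code; real SLITs are ACPI tables.

```python
# Sketch (not Xen code): build a guest SLIT by restricting the host SLIT
# to the host nodes backing the guest. Distances follow ACPI convention
# (10 = local).

def guest_slit(host_slit, guest_nodes):
    """host_slit[i][j] is the host distance from node i to node j.
    guest_nodes maps guest node index -> host node index."""
    n = len(guest_nodes)
    return [[host_slit[guest_nodes[a]][guest_nodes[b]] for b in range(n)]
            for a in range(n)]

# Hypothetical 4-node host; the guest got memory from host nodes 1 and 3.
host = [
    [10, 16, 16, 22],
    [16, 10, 22, 16],
    [16, 22, 10, 16],
    [22, 16, 16, 10],
]
print(guest_slit(host, [1, 3]))   # → [[10, 16], [16, 10]]
```

The guest then sees a consistent two-node topology whose distances mirror the real hardware, which is what makes its own NUMA-aware decisions meaningful.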
10. Constructing SRAT and SLIT for Guests
• Get platform info from host using host NUMA API (in
upstream)
− XEN_SYSCTL_topologyinfo
• # of cores per node/socket
− XEN_SYSCTL_numainfo
• Equivalent to SRAT and SLIT
• Allocate memory from nodes based on memory allocation
strategy in config file
− CONFINE, SPLIT, STRIPE (next page)
− # of nodes
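A rough sketch of what the toolstack derives from XEN_SYSCTL_topologyinfo-style data: a per-CPU node mapping turned into the cores-per-node view needed for placement. The array layout here is a simplification (the real interface reports separate core/socket/node arrays per CPU):

```python
# Sketch: given a hypothetical cpu_to_node array (index = CPU, value =
# node), derive which CPUs live on which node.

def cpus_by_node(cpu_to_node):
    nodes = {}
    for cpu, node in enumerate(cpu_to_node):
        nodes.setdefault(node, []).append(cpu)
    return nodes

# Hypothetical 8-CPU, 2-node box:
print(cpus_by_node([0, 0, 0, 0, 1, 1, 1, 1]))
# → {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

Together with the XEN_SYSCTL_numainfo data (per-node memory and distances), this is enough to pick nodes and pin VCPUs accordingly.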
11. Guest NUMA Config Options
• Number of nodes means “# of nodes from which memory is
allocated”
− Not necessarily visible to guest
• max_guest_nodes=<N>
− Specify the desired number of nodes. Defaults to the number of system
nodes.
• min_guest_nodes=<N>
− Specify the minimum number of nodes. Memory is allocated from at least
min_guest_nodes nodes; guest creation fails if the allocation cannot meet
this. 1 by default.
• Number of nodes matters for SPLIT and STRIPE (next page)
• Create guest in deterministic way by setting
min_guest_nodes = max_guest_nodes
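A sketch of how these options might appear in a domU config file. The option names come from these slides, but the exact syntax in the proposed patches may differ, and the memory/VCPU values are hypothetical:

```
# domU config fragment (illustrative values)
memory = 8192
vcpus  = 8
# Deterministic two-node guest: min == max
max_guest_nodes = 2
min_guest_nodes = 2
```

Setting min_guest_nodes equal to max_guest_nodes, as above, is the deterministic case the slide describes: the guest is created on exactly that many nodes or not at all.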
12. Guest NUMA Config Options (cont.)
Memory Allocation Strategy:
• CONFINE: Allocate the entire domain memory from a single node.
Fails if this is not possible.
− No need to tell the guest about NUMA at all.
• SPLIT: Allocate domain memory by splitting it equally across the
nodes. Fails if this is not possible.
− Populate the NUMA topology and propagate it to the guest (including PV
querying via hypercall). If the guest is paravirtualized and does not know
about NUMA (missing ELF hint), fail.
• STRIPE: Interleave domain memory across the nodes.
− No need to tell the guest about NUMA at all.
• AUTOMATIC: Try the three strategies one after another (order:
CONFINE, SPLIT, STRIPE)
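The AUTOMATIC fallback order can be sketched as follows. The free-memory figures (in MB) and the simple fit checks are hypothetical; the real allocator works on pages inside Xen:

```python
# Sketch of the AUTOMATIC strategy: try CONFINE, then SPLIT, then STRIPE.
# `free` is per-node free memory in MB; `need` is the guest's memory in MB.

def confine(free, need):
    """All memory from a single node, or None if no node is big enough."""
    for node, mb in enumerate(free):
        if mb >= need:
            return {node: need}
    return None

def split(free, need, nodes):
    """An equal share from each of `nodes` nodes, or None."""
    share = need // nodes
    picks = [n for n, mb in enumerate(free) if mb >= share][:nodes]
    return {n: share for n in picks} if len(picks) == nodes else None

def stripe(free, need):
    """Last resort: interleave across every node with free memory."""
    nodes = [n for n, mb in enumerate(free) if mb > 0]
    return {n: need // len(nodes) for n in nodes}

def automatic(free, need, nodes=2):
    return confine(free, need) or split(free, need, nodes) or stripe(free, need)

# 2-node host, 6 GB free per node:
print(automatic([6144, 6144], 4096))    # → {0: 4096}          (CONFINE)
print(automatic([6144, 6144], 10240))   # → {0: 5120, 1: 5120} (SPLIT)
```

This mirrors the slide's ordering: prefer the layout that hides NUMA entirely, fall back to a real NUMA guest, and stripe only when neither fits.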
13. Considerations on Live Migration
• Number of nodes needs to be the same
• Memory allocation strategy needs to be inherited for live
migration
− CONFINE and STRIPE guests are not really NUMA guests
− SPLIT: SPLIT is used again at live-migration time.
• If target machine has similar NUMA characteristics, it’s possible to do live
migration retaining NUMA performance.
14. Current Status and Next Steps
• Current Status
− Host NUMA API is in upstream
− Rebasing the patches to submit
− Re-measuring performance
− Merge patches from Dulloor and Andre
• Next Steps
− Performance analysis with different workloads
• Scheduling
− I/O NUMA
• DMA across nodes with direct device assignment
− Live Migration
• Anyone?