©2023 Wheeler’s Network
The Evolution of Memory Tiering at Scale
By Bob Wheeler, Principal Analyst
March 2023
www.wheelersnetwork.com
With first-generation chips now available, the early hype around CXL is giving way to realistic performance
expectations. At the same time, software support for memory tiering is advancing, building on prior work around
NUMA and persistent memory. Finally, operators have deployed RDMA to enable storage disaggregation and
high-performance workloads. Thanks to these advancements, main-memory disaggregation is now within reach.
Enfabrica sponsored the creation of this white paper, but the opinions and analysis are those of the author.
Tiering Addresses the Memory Crunch
Memory tiering is undergoing major advancements with the recent AMD and Intel server-processor
introductions. Both AMD’s new Epyc (codenamed Genoa) and Intel’s new Xeon Scalable (codenamed
Sapphire Rapids) introduce Compute Express Link (CXL), marking the beginning of new memory-
interconnect architectures. The first generation of CXL-enabled processors handles Revision 1.1 of the
specification, however, whereas the CXL Consortium released Revision 3.0 in August 2022.
When CXL launched, hyperbolic statements about main-memory disaggregation appeared, ignoring
the realities of access and time-of-flight latencies. With first-generation CXL chips now shipping,
customers must address the requirement that software become tier-aware. Operators or vendors
must also develop orchestration software to manage pooled and shared memory. In parallel with
software, the CXL-hardware ecosystem will take years to fully develop, particularly CXL 3.x com-
ponents including CPUs, GPUs, switches, and memory expanders. Eventually, CXL promises to
mature into a true fabric that can connect CPUs and GPUs to shared memories, but network-attached
memory still has a role.
As Figure 1 shows, the memory hierarchy is becoming more granular, trading access latency against
capacity and flexibility. The top of the pyramid serves the performance tier, where hot pages must be
stored for maximum performance. Cold pages may be demoted to the capacity tier, which storage
devices traditionally served. In recent years, however, developers have optimized software to improve
performance when pages reside in different NUMA domains in multi-socket servers as well as in
persistent (non-volatile) memories such as Intel’s Optane. Although Intel discontinued Optane
development, its large software investment still applies to CXL-attached memories.
FIGURE 1. MEMORY HIERARCHY
(Data source: University of Michigan and Meta Inc.)
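Tier-aware software builds on standard Linux NUMA interfaces, so it is easy to inspect the raw material it works with. Below is a minimal Python sketch that enumerates NUMA nodes and their distances and checks the kernel's page-demotion switch; the sysfs paths are standard on recent Linux kernels, but the demotion knob name varies by kernel version and should be treated as an assumption.

```python
import glob
import os

# List NUMA nodes and their relative distances; tiering software uses these
# distances to rank tiers (CXL-attached DRAM typically appears as a CPU-less
# node with a larger distance than local DRAM).
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "distance")) as f:
        print(os.path.basename(node), "distances:", f.read().strip())

# Recent kernels can demote cold pages to a slower tier instead of reclaiming
# them outright; the exact knob path is kernel-dependent (assumed here).
demotion = "/sys/kernel/mm/numa/demotion_enabled"
if os.path.exists(demotion):
    with open(demotion) as f:
        print("demotion_enabled:", f.read().strip())
```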
Swapping memory pages to SSD introduces a massive performance penalty, creating an opportunity
for new DRAM-based capacity tiers. Sometimes referred to as “far memory,” this DRAM may reside in
another server or in a memory appliance. Over the last two decades, software developers advanced the
concept of network-based swap, which enables a server to access remote memory located in another
server on the network. By using network interface cards that support remote DMA (RDMA), system
architects can reduce the access latency to network-attached memory to less than four microseconds,
as Figure 1 shows. As a result, network swap can greatly improve the performance of some workloads
compared with traditional swap to storage.
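To put that figure in context, the quick comparison below contrasts the sub-4-microsecond RDMA access from Figure 1 with an assumed NVMe SSD read latency; the 80-microsecond SSD value is purely illustrative, as real devices vary widely.

```python
# Illustrative per-access latency comparison (microseconds).
rdma_remote_access_us = 4.0   # network-attached memory via RDMA, per Figure 1
nvme_read_us = 80.0           # assumed typical NVMe read latency; varies by device

print(f"network swap is roughly {nvme_read_us / rdma_remote_access_us:.0f}x "
      f"faster per access than SSD swap")
```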
Memory Expansion Drives Initial CXL Adoption
Although it’s little more than three years old, CXL has already achieved industry support exceeding
that of previous coherent-interconnect standards such as CCIX, OpenCAPI, and HyperTransport.
Crucially, AMD supported and implemented CXL despite Intel developing the original specification.
The growing CXL ecosystem includes memory controllers (or expanders) that connect DDR4 or DDR5
DRAM to a CXL-enabled server (or host). An important factor in CXL’s early adoption is its reuse of
the PCI Express physical layer, enabling I/O flexibility without adding to processor pin counts. This
flexibility extends to add-in cards and modules, which use the same slots as PCIe devices. For the
server designer, adding CXL support requires only the latest Epyc or Xeon processor and some
attention to PCIe-lane assignments.
The CXL specification defines three device types and three protocols required for different use cases.
Here, we focus on the Type 3 device used for memory expansion, and the CXL.mem protocol for cache-
coherent memory access. All three device types require the CXL.io protocol, but Type 3 devices use
this only for configuration and control. Compared with CXL.io as well as PCIe, the CXL.mem protocol
stack uses different link and transaction layers. The crucial difference is that CXL.mem (and CXL.cache)
adopt fixed-length messages, whereas CXL.io uses variable-length packets like PCIe. In Revisions 1.1
and 2.0, CXL.mem uses a 68-byte flow-control unit (or flit), which handles a 64-byte cache line. CXL 3.0
adopts the 256-byte flit introduced in PCIe 6.0 to accommodate forward-error correction (FEC), but it
adds a latency-optimized flit that splits error checking (CRC) into two 128-byte blocks.
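A quick calculation shows what those flit sizes mean for payload efficiency; it uses only the figures stated above and ignores the detailed flit layout.

```python
# Payload efficiency of the CXL 1.1/2.0 flit: one 64-byte cache line
# carried in a 68-byte flow-control unit.
cache_line, flit_2_0 = 64, 68
print(f"CXL 1.1/2.0 flit efficiency: {cache_line / flit_2_0:.1%}")  # ~94.1%

# The CXL 3.0 latency-optimized flit splits the 256-byte flit into two
# 128-byte halves, each with its own CRC, so the receiver can consume the
# first half without waiting for the second.
print(f"latency-optimized half-flit: {256 // 2} bytes")
```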
Fundamentally, CXL.mem brings load/store semantics to the PCIe interface, enabling expansion of
both memory bandwidth and capacity. As Figure 2 shows at left, the first CXL use cases revolve
around memory expansion, starting with single-host configurations. The simplest example is a CXL
memory module, such as Samsung's 512GB DDR5 memory expander with a PCIe Gen5 x8 interface in
an EDSFF form factor. This module uses a CXL memory controller from Montage Technology, and the
vendors claim support for CXL 2.0. Similarly, Astera Labs offers a DDR5 controller chip with a CXL 2.0
x16 interface. The company developed a PCIe add-in card combining its Leo controller chip with four
RDIMM slots that handle up to a combined 2TB of DDR5 DRAM.
Unloaded access latency to CXL-attached DRAM should be around 100ns greater than that of DRAM
attached to a processor’s integrated memory controllers. The memory channel appears as a single
logical device (SLD), which can be allocated to only a single host. Memory expansion using a single
processor and SLD represents the best case for CXL-memory performance, assuming a direct
connection without intermediate devices or layers such as retimers and switches.
FIGURE 2. CXL 1.1/2.0 USE CASES
The next use case is pooled memory, which enables flexible allocation of memory regions to specific
hosts. In pooling, memory is assigned and accessible to only a single host—that is, a memory region
is not shared by multiple hosts simultaneously. When connecting multiple processors or servers to a
memory pool, CXL enables two approaches. The original approach added a CXL switch component
between the hosts and one or more expanders (Type 3 devices). The downside of this method is that
the switch adds latency, which we estimate at around 80ns. Although customers can design such a
system, we do not expect this use case will achieve high-volume adoption, as the added latency
decreases system performance.
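Stacking up the latency figures cited above gives a feel for why the switched topology is unattractive; the 100ns local-DRAM baseline below is our own assumption, while the increments come from the text.

```python
# Illustrative unloaded read latencies (nanoseconds).
local_dram_ns = 100            # assumed baseline for direct-attached DDR5
expander_penalty_ns = 100      # CXL-attached DRAM penalty cited in the text
switch_penalty_ns = 80         # per-switch penalty estimated in the text

direct_sld = local_dram_ns + expander_penalty_ns
switched_pool = direct_sld + switch_penalty_ns

print(f"local DRAM:         ~{local_dram_ns} ns")
print(f"direct CXL SLD:     ~{direct_sld} ns")
print(f"switched CXL pool:  ~{switched_pool} ns")
```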
An alternative approach instead uses a multi-headed (MH) expander to directly connect a small
number of hosts to a memory pool, as shown in the center of Figure 2. For example, startup Tanzanite
Silicon Solutions demonstrated an FPGA-based prototype with four heads prior to its acquisition by
Marvell, which later disclosed a forthcoming chip with eight x8 host ports. These multi-headed
controllers can form the heart of a memory appliance offering a pool of DRAM to a small number of
servers. The command interface for managing an MH expander wasn’t standardized until CXL 3.0,
however, meaning early demonstrations used proprietary fabric management.
CXL 3.x Enables Shared-Memory Fabrics
Although it enables small-scale memory pooling, CXL 2.0 has numerous limitations. In terms of
topology, it’s limited to 16 hosts and a single-level switch hierarchy. More important for connecting
GPUs and other accelerators, each host supports only a single Type 2 device, which means CXL 2.0
can’t be used to build a coherent GPU server. CXL 3.0 enables up to 16 accelerators per host, allowing
it to serve as a standardized coherent interconnect for GPUs. It also adds peer-to-peer (P2P)
communications, multi-level switching, and fabrics with up to 4,096 nodes.
Whereas memory pooling enables flexible allocation of DRAM to servers, CXL 3.0 enables true shared
memory. The shared-memory expander is called a global fabric-attached memory (G-FAM) device, and
it allows multiple hosts or accelerators to coherently share memory regions. The 3.0 specification also
adds up to eight dynamic capacity (DC) regions for more granular memory allocation. Figure 3 shows
a simple example using a single switch to connect an arbitrary number of hosts to shared memory. In
this case, either the hosts or the devices may manage cache coherence.
FIGURE 3. CXL 3.X SHARED MEMORY
For an accelerator to directly access shared memory, however, the expander must implement coher-
ence with back invalidation (HDM-DB), which is new to the 3.0 specification. In other words, for CXL-
connected GPUs to share memory, the expander must implement an inclusive snoop filter. This approach
introduces potential blocking, as the specification enforces strict ordering for certain CXL.mem trans-
actions. The shared-memory fabric will experience congestion, leading to less-predictable latency and
the potential for much greater tail latency. Although the specification includes QoS Telemetry features,
host-based rate throttling is optional, and these capabilities are unproven in practice.
RDMA Enables Far Memory
As CXL fabrics grow in size and heterogeneity, the performance concerns expand as well. For example,
putting a switch in each shelf of a disaggregated rack is elegant, but it adds a switch hop to every
transaction between different resources (compute, memory, storage, and network). Scaling to pods
and beyond adds link-reach challenges, and even time-of-flight latency becomes meaningful. When
multiple factors cause latency to exceed 600ns, system errors may occur. Finally, although load/store
semantics are attractive for small transactions, DMA is generally more efficient for bulk-data transfers
such as page swapping or VM migration.
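A rough latency budget illustrates how quickly a larger fabric can approach the 600ns point mentioned above; every component value below is an assumed placeholder, not a measurement.

```python
# Assumed latency contributions (nanoseconds) for a pod-scale CXL path.
BUDGET_NS = 600  # threshold cited above
components = {
    "CPU memory controller + DRAM": 100,              # assumed
    "CXL expander penalty": 100,                      # roughly 100ns, per the text
    "two switch hops at ~80ns each": 160,             # per-switch estimate from the text
    "time of flight, ~20m round trip at 5ns/m": 100,  # assumed reach and propagation
    "retimers and serialization": 60,                 # assumed
}
total = sum(components.values())
status = "within" if total <= BUDGET_NS else "over"
print(f"total ~{total} ns, {status} the {BUDGET_NS} ns budget")
```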
Ultimately, the coherency domain need be extended only so far. Beyond the practical limits of CXL,
Ethernet can serve the need for high-capacity disaggregated memory. From a data-center perspective,
Ethernet’s reach is unlimited, and hyperscalers have scaled RDMA-over-Ethernet (RoCE) networks to
thousands of server nodes. Operators have deployed these large RoCE networks for storage
disaggregation using SSDs, however, not DRAM.
Figure 4 shows an example implementation of memory swap over RDMA, in this case the Infiniswap
design from the University of Michigan. The researchers’ goal was to disaggregate free memory across
servers, addressing memory underutilization, also known as stranding. Their approach used off-the-
shelf RDMA hardware (RNICs) and avoided application modification. The system software uses an
Infiniswap block device, which appears to the virtual memory manager (VMM) as conventional
storage. The VMM handles the Infiniswap device as a swap partition, just as it would use a local SSD
partition for page swapping.
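To show how little the kernel needs to know, the sketch below enables a hypothetical network-backed block device as swap using standard Linux tools; the device name is invented for illustration, and the real Infiniswap setup differs in its details.

```python
import subprocess

# Hypothetical block device exposed by a network-swap driver; the actual
# device name depends on the implementation.
DEVICE = "/dev/remoteswap0"

# Format the device as swap and enable it with a higher priority than any
# local SSD swap area, so the kernel prefers the remote DRAM tier first.
subprocess.run(["mkswap", DEVICE], check=True)
subprocess.run(["swapon", "--priority", "10", DEVICE], check=True)

# The kernel now lists the device alongside any other swap areas.
with open("/proc/swaps") as f:
    print(f.read())
```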
FIGURE 4. MEMORY SWAP OVER ETHERNET
The target server runs an Infiniswap daemon in user space, handling only the mapping of local
memory to remote block devices. Once memory is mapped, read and write requests bypass the target
server’s CPU using RDMA, resulting in a zero-overhead data plane. In the researchers’ system, every
server loaded both software components so they could serve as both requestors and targets, but the
concept extends to a memory appliance that serves only the target side.
The University of Michigan team built a 32-node cluster using 56Gbps InfiniBand RNICs, although
Ethernet RNICs should operate identically. They tested several memory-intensive applications,
including VoltDB running the TPC-C benchmark and Memcached running Facebook workloads. With
only 50% of the working set stored in local DRAM and the remainder served by network swap, VoltDB
and Memcached delivered 66% and 77%, respectively, of the performance of the same workloads with the
complete working set in local DRAM. By comparison, disk-based swap with the 50% working set
delivered only 4% and 6%, respectively, of baseline performance. Thus, network swap provided an
order of magnitude speedup compared with swap to disk.
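The order-of-magnitude claim follows directly from the reported figures, as the short calculation below shows.

```python
# Performance relative to the all-local-DRAM baseline with a 50% local working set.
network_swap = {"VoltDB": 66, "Memcached": 77}   # percent of baseline
disk_swap = {"VoltDB": 4, "Memcached": 6}        # percent of baseline

for app in network_swap:
    print(f"{app}: network swap ~{network_swap[app] / disk_swap[app]:.0f}x disk swap")
# VoltDB: ~16x, Memcached: ~13x, i.e. roughly an order of magnitude
```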
Other researchers, including teams at Alibaba and Google, advocate for modifying the application to
directly access a remote memory pool, leaving the operating system unmodified. This approach can
deliver greater performance than the more generalized design presented by the University of Michigan.
Hyperscalers have the resources to develop custom applications, whereas the broader market requires
support for unmodified applications. Given the implementation complexities of network swap at scale,
the application-centric approach will likely be deployed first.
Either way, Ethernet provides low latency and overhead using RDMA, and its reach easily handles
row- or pod-scale fabrics. The fastest available Ethernet-NIC ports can also deliver enough bandwidth
to handle one DDR5 DRAM channel. When using a jumbo frame to transfer a 4KB memory page, 400G
Ethernet has only 1% overhead, yielding 49GB/s of effective bandwidth. That figure well exceeds the
31GB/s of effective bandwidth delivered by one 64-bit DDR5-4800 channel. Although 400G RNICs
represent the leading edge, Nvidia shipped its ConnectX-7 adapter in volume during 2022.
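The bandwidth comparison works out as follows; the utilization figure for DDR5 is inferred from the effective number quoted in the text rather than measured.

```python
# 400G Ethernet carrying 4KB pages in jumbo frames, using the 1% overhead
# figure stated in the text.
ethernet_eff_GBps = 400 / 8 * (1 - 0.01)
print(f"400G Ethernet effective: ~{ethernet_eff_GBps:.1f} GB/s")   # ~49.5 GB/s

# One 64-bit DDR5-4800 channel: 4800 MT/s x 8 bytes = 38.4 GB/s peak.
ddr5_peak_GBps = 4.8 * 8
ddr5_effective_GBps = 31   # effective figure quoted in the text
print(f"DDR5-4800: {ddr5_peak_GBps:.1f} GB/s peak, ~{ddr5_effective_GBps} GB/s "
      f"effective ({ddr5_effective_GBps / ddr5_peak_GBps:.0%} utilization)")
```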
The Long Road to Memory Fabrics
Cloud data centers succeeded in disaggregating storage and network functions from CPUs, but main-
memory disaggregation remained elusive. Pooled memory was on the roadmap for Intel’s Rack-Scale
Architecture a decade ago but never came to fruition. The Gen-Z Consortium formed in 2016 to pursue
a memory-centric fabric architecture, but system designs reached only the prototype stage. History
tells us that as industry standards add complexity and optional features, their likelihood of volume
adoption drops. CXL offers incremental steps along the architectural-evolution path, allowing the
technology to ramp quickly while offering future iterations that promise truly composable systems.
Workloads that benefit from memory expansion include in-memory databases such as SAP HANA
and Redis, in-memory caches such as Memcached, and large virtual machines, as well as AI training
and inference, which must handle ever-growing large-language models. These workloads fall off a
performance cliff when their working sets don’t fully fit in local DRAM. Memory pooling can alleviate
the problem of stranded memory, which impacts the capital expenditures of hyperscale data-center
operators. A Microsoft study, detailed in a March 2022 paper, found that up to 25% of server DRAM
was stranded in highly utilized Azure clusters. The company modeled memory pooling across different
numbers of CPU sockets and estimated it could reduce overall DRAM requirements by about 10%.
The case for pure-play CXL 3.x fabric adoption is less compelling, in part because of GPU-market
dynamics. Current data-center GPUs from Nvidia, AMD, and Intel implement proprietary coherent
interconnects for GPU-to-GPU communications, alongside PCIe for host connectivity. Nvidia’s top-
end Tesla GPUs already support memory pooling over the proprietary NVLink interface, solving the
stranded-memory problem for high-bandwidth memory (HBM). The market leader is likely to favor
NVLink, but it may also support CXL by sharing lanes (serdes) between the two protocols. Similarly,
AMD and Intel could adopt CXL in addition to Infinity and Xe-Link, respectively, in future GPUs. The
absence of disclosed GPU support, however, creates uncertainty around adoption of advanced CXL 3.0
features, whereas the move to PCIe Gen6 lane rates for existing use cases is undisputed. In any case, we
expect it will be 2027 before CXL 3.x shared-memory expanders achieve high-volume shipments.
In the meantime, multiple hyperscalers adopted RDMA to handle storage disaggregation as well
as high-performance computing. Although the challenges of deploying RoCE at scale are widely
recognized, these large customers are capable of solving the performance and reliability concerns.
They can extend this deployed and understood technology into new use cases, such as network-based
memory disaggregation. Research has demonstrated that a network-attached capacity tier can deliver
strong performance when system architects apply it to appropriate workloads.
We view CXL and RDMA as complementary technologies, with the former delivering the greatest
bandwidth and lowest latency and the latter offering greater scale. Enfabrica developed an
architecture it calls an Accelerated Compute Fabric (ACF), which collapses CXL/PCIe–switch and
RNIC functions into a single device. When instantiated in a multiterabit chip, the ACF can connect
coherent local memory while scaling across chassis and racks using up to 800G Ethernet ports.
Crucially, this approach removes dependencies on advanced CXL features that will take years to reach
the market.
Data-center operators will take multiple paths to memory disaggregation, as each has different
priorities and unique workloads. Those with well-defined internal workloads will likely lead, whereas
others that prioritize public-cloud instances are apt to be more conservative. Early adopters create
opportunities for vendors that can solve a particular customer’s most pressing need.
Bob Wheeler is an independent industry analyst who has covered semiconductors and networking for more than two
decades. He is currently principal analyst at Wheeler’s Network, established in 2022. Previously, Wheeler was
a principal analyst at The Linley Group and a senior editor for Microprocessor Report. Joining the company in
2001, he authored articles, reports, and white papers covering a range of chips including Ethernet switches,
DPUs, server processors, and embedded processors, as well as emerging technologies. Wheeler’s Network offers
white papers, strategic consulting, roadmap reviews, and custom reports. Our free blog is available at
www.wheelersnetwork.com.
