
Beyond GPU unified memory: fully unified address spaces

Introduction

With CUDA unified memory, the addresses used to access memory are identical on both the CPU and the GPU side.

However, only memory allocated through CUDA (or registered as pinned memory) can be accessed from the GPU. How could the GPU access regular paged memory, including memory-mapped files, without explicit management by the application?
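To make the gap concrete, here is a minimal CUDA sketch with made-up sizes: the first launch operates on a cudaMallocManaged buffer, which any unified-memory system accepts, while the second passes a plain malloc() pointer, which the GPU can only dereference on systems where the driver is able to service page faults on arbitrary system memory.

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void increment(int *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const size_t n = 1 << 20;

    // Unified memory: allocated through CUDA, same address on CPU and GPU.
    int *managed;
    cudaMallocManaged(&managed, n * sizeof(int));
    increment<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();

    // Regular paged memory: a plain malloc() the CUDA runtime knows nothing about.
    // This launch only works where the driver can resolve GPU faults on
    // arbitrary system memory (e.g. an HMM-enabled stack); elsewhere it fails.
    int *paged = (int *)malloc(n * sizeof(int));
    increment<<<(n + 255) / 256, 256>>>(paged, n);
    cudaDeviceSynchronize();

    cudaFree(managed);
    free(paged);
    return 0;
}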

But why?

Making development significantly easier and enabling new programming models. It allows transparent handling of much larger datasets than before through memory overcommitting, while reducing the development effort required from the application developer.
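As a rough sketch of what overcommitting looks like in practice, the snippet below requests more managed memory than the GPU physically has; on Pascal-class and newer GPUs this succeeds, with pages faulted in and evicted on demand (the size is illustrative and assumes enough system memory to back the allocation).

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char *buf, size_t n) {
    // Grid-stride loop so every page of the buffer is written at least once.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        buf[i] = 1;
}

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);

    // Request 1.5x the GPU's physical memory as managed memory.
    size_t n = total_b + total_b / 2;
    char *buf;
    if (cudaMallocManaged(&buf, n) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }

    // The working set exceeds device memory; pages migrate back and forth
    // transparently instead of the allocation failing outright.
    touch<<<1024, 256>>>(buf, n);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}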

The interconnects from CPU to GPU

PCIe coherency (or lack thereof)

PCIe doesn’t provide coherency as part of its protocol. As such, accesses to memory over PCIe are uncached on the remote device’s side, but remain cached on the device to which the memory block is attached. Exposing such buffers to applications is generally a bad idea.

If PCIe BARs are mapped as normal cacheable memory, which appears to be a mode that is present (!) but optional in the Arm BSA, explicit cache management operations are necessary for synchronization. Please just don’t use this.

Note: what is colloquially referred to as PCIe coherency is CPU-side coherency when doing DMA transfers. That is generally not a concern outside of Arm-based platforms. If you are shipping an Arm SoC, please make sure that your PCIe controller handles DMA coherently; you can then signal that the controller is coherent for DMA purposes through the _CCA ACPI attribute.

CXL 2.0 and earlier

CXL 2.0 and earlier have peculiar coherency semantics from a performance perspective. Device bias means that accesses have to stay uncached on the host side, while host bias means that the host is queried each time a memory block is accessed from the device, with a significant impact on performance.

CXL 3.0

CXL 3.0 replaces bias modes with invalidate requests being sent on writes. As such, memory can be cached on both the host and device sides.

NVIDIA NVLink(-C2C)/AMD xGMI

These are vendor-specific interconnects rather than industry standards.

NVIDIA NVLink is present on IBM Power9 (paired with Volta) and NVLink-C2C on Grace. AMD supports the xGMI interconnect (marketed as Infinity Fabric) between MI250X GPUs and applicable EPYC processors.

How to implement it?

Heterogeneous Memory Management

HMM with page faulting allows pages to be migrated dynamically depending on the processor that is using them. However, when used alone, this comes with a performance cliff for data that is not accessed often, because of the cost of handling a page fault.

This does not need new silicon to work, and can be rolled out on regular Volta + PCIe GPU systems.
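As a sketch of the programming model this enables (assuming an HMM-capable driver and GPU; error handling is omitted), a kernel can consume a plain mmap()’d file directly, with no cudaHostRegister and no staging copies:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void sum_bytes(const unsigned char *data, size_t n,
                          unsigned long long *out) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, (unsigned long long)data[i]);
}

int main(int argc, char **argv) {
    // Map the file with plain mmap(): no cudaHostRegister, no cudaMemcpy.
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    unsigned char *file = (unsigned char *)
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    unsigned long long *total;
    cudaMallocManaged(&total, sizeof(*total));
    *total = 0;

    // On an HMM-enabled stack, the GPU faults on the mapped range and the
    // driver migrates or maps the pages on demand; without that support,
    // this launch would fail.
    sum_bytes<<<(st.st_size + 255) / 256, 256>>>(file, st.st_size, total);
    cudaDeviceSynchronize();

    printf("byte sum: %llu\n", *total);
    munmap(file, st.st_size);
    close(fd);
    cudaFree(total);
    return 0;
}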

Hardware coherency

This can be used together with HMM to provide intermediate levels, instead of incurring the overhead of a page fault and migration on each access to remote memory. Such a mechanism can be implemented through access counters, with a threshold beyond which the page is moved. This significantly lowers the overhead of infrequent accesses to remote memory.
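A rough model of the idea, purely illustrative (the names, the threshold, and the migration hook are made up; this is not actual driver or hardware logic): remote accesses are served over the coherent link and merely counted, and a page only migrates once it proves to be hot.

#include <cstdint>
#include <unordered_map>

constexpr unsigned kMigrationThreshold = 256;  // hypothetical tuning knob

struct PageState {
    unsigned remote_accesses = 0;
    bool resident_on_device = false;
};

std::unordered_map<uint64_t, PageState> pages;

// Conceptually invoked on each coherent remote access to a page. The access
// itself completes over the interconnect; no page fault is taken.
void on_remote_access(uint64_t page_addr) {
    PageState &ps = pages[page_addr];
    if (ps.resident_on_device)
        return;
    if (++ps.remote_accesses >= kMigrationThreshold) {
        // migrate_page_to_device(page_addr);  // hypothetical migration hook
        ps.resident_on_device = true;
        ps.remote_accesses = 0;
    }
}

Pages that are touched only occasionally never reach the threshold and keep being served remotely, which is exactly the case where fault-and-migrate alone performs worst.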

However, using hardware coherency means needing both the host and device to share a compatible coherent interconnect. This is relatively rare in current systems.

Currently, compatible solutions are Power9 + Volta GPUs and Grace Hopper systems (coming soon), through NVLink and NVLink-C2C respectively, or, on the AMD side, Radeon Instinct MI250X with a compatible EPYC processor. Note that there’s no GPU on the market with advertised CXL support as of today.

Performance optimization

Predicting access patterns (implicit prefetching), which is done transparently by the driver, and explicit prefetching by the application both help lower the overhead, with the same implications as when using CUDA unified memory.
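For the explicit side, the standard unified-memory hints apply; a small sketch using real CUDA runtime calls (the helper names are hypothetical):

#include <cuda_runtime.h>

// Stage data ahead of a kernel launch instead of paying for on-demand faults
// or remote accesses inside the kernel.
void stage_for_gpu(float *data, size_t bytes, int device, cudaStream_t stream) {
    // Hint that the data will mostly live and be accessed on this device.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    // Asynchronously migrate the pages to the device before they are needed.
    cudaMemPrefetchAsync(data, bytes, device, stream);
}

// Bring results back to system memory before the CPU touches them.
void stage_for_cpu(float *data, size_t bytes, cudaStream_t stream) {
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
}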
