How does NVIDIA implement unified memory on PCIe GPUs and Tegra platforms?

In the prior post, we highlighted why PCIe BARs should be mapped as device memory on 64-bit Arm platforms.

What is unified memory in a dedicated GPU context?

Unified memory is memory that is accessible from both the CPU and the GPU from an application programmer’s perspective. This is often done by using the same pointer address across both the CPU and the GPU(s).
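
As a minimal sketch of that idea, using the standard CUDA runtime API (kernel and variable names are illustrative, error checking omitted), the same cudaMallocManaged pointer is written by the CPU, handed to a kernel, and read back by the CPU:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation, one pointer, valid on both the CPU and the GPU.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU uses the same pointer
    cudaDeviceSynchronize();                         // wait before touching it on the CPU again

    printf("data[0] = %f\n", data[0]);               // CPU reads the result
    cudaFree(data);
    return 0;
}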

How is it implemented?

Pre-Pascal model

Note that this model is still current on Windows platforms.

On NVIDIA GPUs that pre-date the Pascal architecture, page faulting is not supported. As such, all dirty pages are migrated to the GPU at kernel launch.

Unified memory cannot be accessed concurrently by the CPU and GPU in such a configuration. While a GPU task is running, accessing unified memory from the CPU triggers a segmentation fault.
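
A hedged sketch of that restriction, with illustrative names: on a pre-Pascal GPU (or on Windows) the commented-out CPU access between launch and synchronisation would fault, while the access after synchronisation is fine.

#include <cuda_runtime.h>

__global__ void touch(int *value) { *value = 42; }

int main() {
    int *value = nullptr;
    cudaMallocManaged(&value, sizeof(int));

    touch<<<1, 1>>>(value);
    // *value = 0;            // CPU access between launch and synchronisation:
                              // segmentation fault on pre-Pascal / Windows
    cudaDeviceSynchronize();  // pages are migrated back once the GPU is done
    *value = 0;               // valid again from here on

    cudaFree(value);
    return 0;
}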

The new model

From Pascal onwards, page faults are supported on the GPU. As such, pages are migrated on demand to the processor that is using them. Explicit APIs to prefetch are also available.
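
One such prefetch API in the CUDA runtime is cudaMemPrefetchAsync. A small sketch (illustrative names, no error checking, assuming a Pascal-or-newer GPU):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    int device = 0;
    cudaGetDevice(&device);

    char *buf = nullptr;
    cudaMallocManaged(&buf, bytes);

    // Migrate the pages to the current GPU up front instead of paying
    // demand faults on first access from a kernel.
    cudaMemPrefetchAsync(buf, bytes, device);

    // ... launch kernels that use buf ...

    // Prefetch back to the CPU before host-side processing.
    cudaMemPrefetchAsync(buf, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}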

Unified memory can be accessed concurrently from both the CPU and GPU in this configuration. Even atomics work as expected, with everything handled through page migration under the hood.
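
A sketch of what concurrent access allows, under some assumptions: a Pascal-or-newer GPU reporting concurrentManagedAccess, code compiled for sm_60 or newer (for the system-scope atomicAdd_system), and a C++20 host compiler (for std::atomic_ref). All names are illustrative.

#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_add(int *counter, int iters) {
    for (int i = 0; i < iters; ++i)
        atomicAdd_system(counter, 1);  // system scope: coherent with CPU atomics
}

int main() {
    int *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;

    gpu_add<<<1, 1>>>(counter, 1000);

    // The CPU may touch the same allocation while the kernel is still running;
    // the driver keeps things coherent through page migration / mapping.
    std::atomic_ref<int> cpu_counter(*counter);   // C++20
    for (int i = 0; i < 1000; ++i) cpu_counter.fetch_add(1);

    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);           // expected: 2000
    cudaFree(counter);
    return 0;
}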

On systems with multiple GPUs, GPU memory can be accessible from another GPU without requiring page migration.
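
One way to express this through the runtime API is the cudaMemAdvise hints, which keep pages resident on one GPU while establishing a direct mapping for another instead of bouncing pages back and forth. A minimal sketch (device numbers are illustrative, and the hints are advisory only):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);

    // Keep the pages resident in GPU 0's memory...
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);
    // ...and map them into GPU 1's address space so it can access them
    // directly (e.g. over NVLink) instead of faulting and migrating them.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, 1);

    cudaFree(data);
    return 0;
}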

Tegra driver stack

On Tegra, unified memory is cached on both the CPU and GPU, but it maps down to a single underlying allocation instead of using page migration.

Cache management operations are performed at kernel launch to synchronise the two views. As such, concurrent access is not available for unified memory on Tegra platforms today.
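
Portable code can detect which of the models above it is running on by querying the concurrentManagedAccess device attribute, which reports 0 on Tegra today as well as on pre-Pascal and Windows configurations. A small sketch (no error checking):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0, concurrent = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    if (concurrent)
        printf("CPU may access managed memory while kernels are running\n");
    else
        printf("Synchronise (cudaDeviceSynchronize) before touching managed memory on the CPU\n");
    return 0;
}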

On dGPUs driven by the nvgpu driver stack (the Tegra iGPU one), which is very rarely done outside of automotive, unified memory is not implemented. The regular desktop driver stack, which does have unified memory support, has to be used instead.

What isn’t done in those cases?

In all of the cases above, the PCIe BARs used for data are not mapped into user space. Remote accesses through PCIe BARs are painfully slow and uncached, which makes them unsuitable for unified memory: it has to be usable from both the CPU and GPU side for more than just memory copies.

As such, mapping remote PCIe memory with a normal memory type is not necessary for unified memory support on dGPUs.

Is there another memory type that is accessible from both CPU and GPU?

Yes. Pinned memory is always mapped on the host but is also accessible from the GPU. In cases where the memory is only rarely accessed from the GPU side (for example, to perform a copy), this memory type is adequate.
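
A small sketch of that usage with the standard CUDA runtime API: the allocation lives in host memory, and the GPU accesses it across the bus (or the SoC fabric on Tegra) rather than through page migration. Names are illustrative.

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // Pinned (page-locked) host memory, also mapped into the GPU's address space.
    float *host_ptr = nullptr;
    cudaHostAlloc((void **)&host_ptr, bytes, cudaHostAllocMapped);

    // Device-side pointer to the same physical host memory.
    float *dev_ptr = nullptr;
    cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);

    // dev_ptr can be passed to kernels for the occasional read/write or copy;
    // every GPU access goes to host memory, so it is a poor fit for hot data.

    cudaFreeHost(host_ptr);
    return 0;
}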

On Tegra platforms, pinned memory is non-cacheable on the iGPU, and it is also non-cacheable on the CPU for SoCs pre-dating Tegra Xavier (which lack I/O coherency). From Xavier onwards, pinned memory is mapped as cacheable on the CPU.

There is no pinned memory equivalent that is mapped on the GPU and accessible from the host.

Conclusion

GPUs don’t need PCIe BARs to be mapped as normal memory either. In cases where such a mapping is done, better replacements exist to fulfil the same role.

The NVIDIA driver does not rely on mapping device memory as normal memory on 64-bit Arm platforms.
