
The biggest gap in the graphics APIs for GPGPU workloads

The GPGPU ecosystem is built on the assumption that the virtual memory address space is unified between the host and the device(s). That property has held since the Kepler GPU architecture back in 2012.

From the developer’s perspective, this model is much easier to use than its predecessor: a unified-memory allocation has the same address on both the host and device sides, and no explicit copies are required.
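
As one concrete realization of this model, here is a minimal host-side sketch using CUDA managed memory (error handling omitted); a single pointer value is usable from both sides:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // One allocation, one pointer value, usable from both the CPU and the GPU.
  float* data = nullptr;
  cudaMallocManaged(reinterpret_cast<void**>(&data), 1024 * sizeof(float));

  for (int i = 0; i < 1024; ++i) data[i] = 1.0f;  // plain CPU writes

  // A kernel launched here could dereference `data` with the exact same
  // pointer value; no cudaMemcpy is needed in either direction.
  cudaMemset(data, 0, 1024 * sizeof(float));      // runtime call acting on the same pointer
  cudaDeviceSynchronize();

  printf("%f\n", data[0]);                        // the CPU reads the result directly
  cudaFree(data);
  return 0;
}
```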

However, there’s a class of APIs that do not provide those semantics whatsoever today: graphics APIs.

This includes all of D3D12, Metal and Vulkan. In those APIs, the GPU-side and CPU-side virtual addresses for the same allocation can differ. There’s also no guarantee that the two address spaces won’t overlap, so the same numeric address can refer to different allocations on each side.
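
For contrast, a minimal D3D12 sketch of a hypothetical helper (assuming a valid ID3D12Device* and omitting error handling) shows the two addresses an allocation gets: a CPU virtual address from Map() and a separate 64-bit GPU virtual address, with nothing relating the two:

```cpp
#include <d3d12.h>

// Sketch only: `device` is assumed to be a valid ID3D12Device*.
void show_two_addresses(ID3D12Device* device) {
  D3D12_HEAP_PROPERTIES heapProps = {};
  heapProps.Type = D3D12_HEAP_TYPE_UPLOAD;          // CPU-visible heap
  heapProps.CreationNodeMask = 1;
  heapProps.VisibleNodeMask = 1;

  D3D12_RESOURCE_DESC desc = {};
  desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
  desc.Width = 64 * 1024;
  desc.Height = 1;
  desc.DepthOrArraySize = 1;
  desc.MipLevels = 1;
  desc.SampleDesc.Count = 1;
  desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

  ID3D12Resource* buffer = nullptr;
  device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                  D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
                                  __uuidof(ID3D12Resource),
                                  reinterpret_cast<void**>(&buffer));

  // CPU side: a regular pointer into the process address space.
  void* cpuPtr = nullptr;
  D3D12_RANGE noRead = {0, 0};
  buffer->Map(0, &noRead, &cpuPtr);

  // GPU side: a separate 64-bit value in the GPU's own address space.
  // Nothing guarantees it matches (or doesn't collide with) cpuPtr.
  D3D12_GPU_VIRTUAL_ADDRESS gpuVa = buffer->GetGPUVirtualAddress();

  (void)cpuPtr;
  (void)gpuVa;
  buffer->Unmap(0, nullptr);
  buffer->Release();
}
```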

How this issue maps to the GPGPU APIs

SYCL

While SYCL supports the older buffer-accessor model, the software ecosystem for modern SYCL has been built around the assumption that USM (unified shared memory) is available.

As such, these applications will not run on a SYCL implementation only supporting the buffer-accessor model.

A SYCL implementation supporting USM device allocations but not shared allocations (tracking the device pointers manually) would be quite unusual, and perhaps just not worth the effort… It could also break down the moment a host pointer and a device pointer collided at the same address.

This would also force applications to use the buffer-accessor model for any allocations accessible from both the CPU and the accelerator side.
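
To make the contrast concrete, here is a minimal SYCL 2020 sketch of both models (error handling omitted); the buffer-accessor version works without USM, but it never hands the host a raw device pointer:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;

  // USM shared allocation: one pointer, valid on both the host and the device.
  float* data = sycl::malloc_shared<float>(1024, q);
  for (int i = 0; i < 1024; ++i) data[i] = float(i);

  q.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
    data[i] *= 2.0f;        // the device dereferences the same pointer
  }).wait();

  float first = data[0];    // the host reads it back directly, no copy API involved
  sycl::free(data, q);

  // Buffer-accessor fallback: no USM required, but the host never sees a raw
  // device pointer, so pointer-based data structures cannot be expressed.
  std::vector<float> v(1024, 1.0f);
  {
    sycl::buffer<float, 1> buf(v.data(), sycl::range<1>(1024));
    q.submit([&](sycl::handler& h) {
      sycl::accessor acc(buf, h, sycl::read_write);
      h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  } // the buffer's destructor writes the results back into v

  (void)first;
  return 0;
}
```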

CUDA

The unified virtual address space feature has shipped in CUDA since the Kepler generation of GPUs, and it has been the baseline expectation ever since.
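
One small but telling consequence: because host and device pointers share a single address space, the runtime can infer the direction of a copy from the pointer values alone. A minimal sketch:

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
  std::vector<float> host(1 << 20, 1.0f);

  float* device_ptr = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&device_ptr), host.size() * sizeof(float));

  // With a unified virtual address space, cudaMemcpyDefault lets the runtime
  // deduce the copy direction purely from the pointer values.
  cudaMemcpy(device_ptr, host.data(), host.size() * sizeof(float), cudaMemcpyDefault);
  cudaMemcpy(host.data(), device_ptr, host.size() * sizeof(float), cudaMemcpyDefault);

  cudaFree(device_ptr);
  return 0;
}
```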

OpenCL

OpenCL 2.0 and later offer Shared Virtual Memory (SVM) as an implementation of this concept.
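
For reference, a coarse-grained SVM sketch against the OpenCL 2.0 host API; the helper name is hypothetical, and the context, queue and kernel are assumed to have been created elsewhere (error handling omitted):

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Sketch: coarse-grained SVM usage with an already-built kernel.
void svm_example(cl_context context, cl_command_queue queue, cl_kernel kernel) {
  const size_t n = 1024;

  // One pointer, usable by the host (between map/unmap) and by kernels.
  float* data = static_cast<float*>(
      clSVMAlloc(context, CL_MEM_READ_WRITE, n * sizeof(float), 0));

  clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, nullptr, nullptr);
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // host writes through the SVM pointer
  clEnqueueSVMUnmap(queue, data, 0, nullptr, nullptr);

  // The kernel receives the raw SVM pointer, not a cl_mem handle.
  clSetKernelArgSVMPointer(kernel, 0, data);
  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
  clFinish(queue);

  clSVMFree(context, data);
}
```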

What does this mean for CLon12?

CLon12 is the OpenCL implementation layered on top of D3D12 and shipped by Microsoft. It does support SPIR-V kernels. However, it doesn’t provide shared virtual memory, because the underlying D3D12 API does not support it.

As such, CLon12 does not fulfil the role of a cross-vendor runtime that real-world SYCL applications can run on.
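
The gap is visible directly from an application through the standard device query; a minimal sketch of such a check (the helper name is hypothetical), where a runtime without SVM either fails the query or reports zero capabilities:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Returns true if `device` reports any Shared Virtual Memory support.
bool has_svm(cl_device_id device) {
  cl_device_svm_capabilities caps = 0;
  cl_int err = clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES,
                               sizeof(caps), &caps, nullptr);
  return err == CL_SUCCESS && caps != 0;
}
```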

What about using regular OpenCL runtimes on Windows to run SYCL applications?

Regular OpenCL runtimes do not fulfil that role either. AMD only supports OpenCL C (ugh), and NVIDIA’s OpenCL driver only accepts PTX as an intermediate language, making a separate binary slice necessary instead of being able to reuse the SPIR-V one.

Intel is the only vendor in the industry shipping a more or less fully featured OpenCL driver in this respect. Qualcomm does not ship an OpenCL driver on Windows platforms at all, leaving CLon12 as the only option there.
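
Whether a given OpenCL driver can ingest an intermediate language such as SPIR-V is likewise queryable at runtime; a hedged sketch using CL_DEVICE_IL_VERSION (OpenCL 2.1+), where an empty or failed result means SPIR-V cannot be consumed directly:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <string>

// Returns the intermediate languages a device accepts, e.g. "SPIR-V_1.2".
// An empty result means no IL can be ingested via clCreateProgramWithIL.
std::string supported_ils(cl_device_id device) {
  size_t size = 0;
  if (clGetDeviceInfo(device, CL_DEVICE_IL_VERSION, 0, nullptr, &size) != CL_SUCCESS || size == 0)
    return {};
  std::string ils(size, '\0');
  clGetDeviceInfo(device, CL_DEVICE_IL_VERSION, size, ils.data(), nullptr);
  return ils;
}
```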

And mobile platforms?

On Android, SVM tends to be shipped by all the GPU vendors. However, Google does not officially support OpenCL on its software platforms.

Conclusion

The lack of unified shared memory is a big gap in the graphics APIs, and not the kind that can be worked around relatively easily via a compatibility layer.

This causes major problems when porting GPGPU programs to platforms such as Apple Silicon Macs, where Metal does not expose unified shared memory.
