This post is a follow-up to the earlier State of the GPU compute APIs today post on this blog.
Users migrate to newer GPUs, and some are even early adopters. The binary of the app you distribute therefore can’t contain only GPU binaries for one or more architectures fixed at compilation time. It will also have to contain an intermediate-language section that can be compiled at runtime to target a given graphics processor.
Graphics APIs today take the option of shipping an IR instead of a precompiled binary that can only be loaded on a given GPU generation. Some mobile GPU vendors do offer the option of pre-compiling shaders before distribution.
With the huge variety in mobile hardware, that’s no longer a realistic proposition. Devices running vendor-controlled code, such as game consoles, might also use such an option. On PCs, however, those facilities are not used.
A stable API and ABI is defined here as one that keeps running on an existing GPU family forever and can support a future GPU generation without requiring recompilation from source code, bug fixing excluded.
Here, I’ll only describe the GPU compute API choices that (can) target an intermediate representation, so that your binaries can keep running on future hardware. Only general-purpose GPU computing APIs will be described, which excludes DirectML and other field-specific APIs.
Vendor-specific, stable API and ABI
oneAPI Level Zero (Intel): By using SPIR-V as its intermediate representation, which Vulkan also uses, code duplication is kept to a minimum, and compiler advances directly benefit both the compute APIs and the graphics APIs on the hardware (a loading sketch follows below).
This is exclusively a low-level API. oneAPI’s primary language for users is Data Parallel C++, which incorporates SYCL.
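As a rough illustration of how such an IR is consumed, here is a minimal sketch of loading a SPIR-V module through Level Zero’s C API. The function name, the offline-produced SPIR-V buffer and the omitted error handling are assumptions of the sketch, not anything prescribed by oneAPI.

```cpp
// Minimal sketch: loading a SPIR-V module with oneAPI Level Zero.
// `spirv`/`spirvSize` are assumed to hold a SPIR-V binary produced
// offline (e.g. by DPC++ or clang); error handling is omitted.
#include <level_zero/ze_api.h>

ze_module_handle_t loadSpirvModule(ze_context_handle_t context,
                                   ze_device_handle_t device,
                                   const uint8_t* spirv, size_t spirvSize) {
    ze_module_desc_t desc = {};
    desc.stype = ZE_STRUCTURE_TYPE_MODULE_DESC;
    desc.format = ZE_MODULE_FORMAT_IL_SPIRV;  // the stable IR, not a GPU binary
    desc.pInputModule = spirv;
    desc.inputSize = spirvSize;
    ze_module_handle_t module = nullptr;
    zeModuleCreate(context, device, &desc, &module, nullptr);  // driver compiles here
    return module;
}
```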
CUDA (NVIDIA): Fulfils that requirement today by supporting GPUs from second-generation Kepler up to Ampere with the current version of the toolkit. Your binaries can (and very often do) contain a PTX (NVIDIA’s IR) section that will run on any future hardware, with you choosing the minimum required feature level (see the sketch after this entry).
As a platform, CUDA supports both a single-source programming model, and a lower-level programming model.
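To make the PTX path concrete, here is a minimal sketch using the CUDA driver API, which JIT-compiles the embedded PTX for whatever GPU it finds at run time. The embedded kVecAddPtx string, the vec_add kernel name and the absence of error checking are illustrative assumptions of this sketch.

```cpp
// Minimal sketch: JIT-compiling an embedded PTX section with the CUDA driver API.
// `kVecAddPtx` stands in for PTX text embedded at build time
// (e.g. via nvcc -gencode arch=compute_35,code=compute_35).
#include <cuda.h>

extern const char kVecAddPtx[];  // assumed: PTX generated at build time

CUfunction loadPtxKernel() {
    cuInit(0);
    CUdevice device;
    cuDeviceGet(&device, 0);
    CUcontext context;
    cuCtxCreate(&context, 0, device);
    CUmodule module;
    cuModuleLoadData(&module, kVecAddPtx);  // driver compiles PTX for the actual GPU
    CUfunction kernel;
    cuModuleGetFunction(&kernel, module, "vec_add");  // hypothetical kernel name
    return kernel;
}
```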
Vendor-independent, stable API and ABI
OpenCL (Khronos): OpenCL is the industry standard, and its ABI is stable within a given operating system. A binary using OpenCL will keep working on future hardware without requiring changes and, as long as the feature set your app needs is supported, will also run on GPUs from multiple vendors, provided the drivers have no bug affecting your application.
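A short sketch of why the ABI stays stable: the kernel travels as OpenCL C source (or SPIR-V) and the installed driver compiles it at run time, so the same host binary keeps working on newer GPUs. The vec_add kernel and the omitted error handling are illustrative.

```cpp
// Minimal sketch: OpenCL's stable ABI in practice. The kernel ships as
// source and the driver compiles it at run time for the installed GPU.
#include <CL/cl.h>

const char* kSource =
    "__kernel void vec_add(__global const float* a,"
    "                      __global const float* b,"
    "                      __global float* c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

cl_kernel buildKernel() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_program program = clCreateProgramWithSource(context, 1, &kSource, nullptr, nullptr);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);  // driver compiles here
    return clCreateKernel(program, "vec_add", nullptr);
}
```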
C++ AMP (Microsoft): Binaries built with C++ AMP work on any Windows machine with DirectX 11. However, C++ AMP isn’t evolving today, and as such is no longer a good starting point.
Vendor-independent programming models
SYCL, OpenMP, OpenACC and C++ AMP offer higher-level programming models, known as single-source, where host and device code can be mixed in a single file.
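For instance, a minimal single-source sketch in SYCL 2020, with the buffer size and the doubling kernel chosen arbitrarily:

```cpp
// Minimal single-source sketch in SYCL 2020: host logic and the device
// lambda live in the same C++ file; the toolchain splits them apart.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> data(1024, 1.0f);
    sycl::queue q;                                // host code
    {
        sycl::buffer<float> buf(data.data(), sycl::range<1>(data.size()));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(data.size()),
                           [=](sycl::id<1> i) {   // device code, same file
                               acc[i] *= 2.0f;
                           });
        });
    }  // buffer destructor copies the results back into `data`
}
```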
C++ Standard Parallelism is currently supported on GPUs only for devices from a single vendor; hopefully it will become more complete and gain support from other vendors too. For now, it is not the option to choose if your code must be capable of running on devices from multiple GPU vendors.
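For completeness, here is what that looks like: a plain standard algorithm with an execution policy. One vendor’s compiler (e.g. nvc++ with -stdpar) can offload this to the GPU; other toolchains will simply run it on CPU threads.

```cpp
// Minimal sketch of C++ Standard Parallelism: a standard algorithm with an
// execution policy. Whether it runs on a GPU depends entirely on the compiler.
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [](float x) { return x * 2.0f; });  // may be offloaded to the GPU
}
```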
Those programming models do not define a particular ABI for accelerator offloading. As such, each vendor provides its own implementation on top of its own API, which can produce binaries that won’t run on devices from other vendors.
Codeplay’s ComputeCpp is a SYCL implementation that targets OpenCL, and as such runs on any OpenCL device whose drivers support a sufficiently powerful feature set.