AMD ROCm: a wasted opportunity

This post focuses on the opportunities for adoption that ROCm HIP has wasted. I wrote an overview of ROCm earlier at https://threedots.ovh/blog/2021/11/what-is-amd-rocm/. This post deliberately leaves aside the API renames in ROCm HIP, which break source compatibility with CUDA quite needlessly.

Hardware support

Hardware that you can buy

Official support for GPUs really matters for adoption, and on that point ROCm really doesn’t shine.

List of officially supported GPUs on AMD ROCm (from documentation)

This official support list essentially ensures that almost no hobbyists have official access to the platform, which is a bad choice to make.

Amazon EC2

In December 2020, AWS launched the G4ad instances with Radeon Pro V520 GPUs. This GPU uses the Navi12 die. As of May 2022, AMD still didn’t officially support it as part of ROCm.

This is a big wasted opportunity. Having a supported ROCm configuration easily accessible on Amazon EC2 would have helped development and testing quite a lot. It could have been a good starting point for wider adoption of ROCm across the ecosystem.

“Unofficial” support

If a consumer GPU happens to share the same die as a supported part (for example, Radeon VII or 6800 XT), then in practice it can work. However, giving no guarantee of that in the documentation is very unusual for a company seeking adoption of its products. Enablement is not support either: vendors have to stand behind their products and give support guarantees for them.

Worse, different binary slices are used for different dies of the same product line. As an example, for RDNA2, the ROCm math libraries are compiled only for Navi21. This means that on a (smaller) Navi22 die (notably present in the 6700 XT), those components aren’t functional. The workaround is to manually recompile ROCm with support for more targets. Such a roadblock is very discouraging for adopters, and it complicates application distribution too.
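To make the failure mode concrete, here is a minimal sketch of the exact-match lookup described above. It is an illustrative model, not ROCm’s actual loader logic: the function and variable names are hypothetical, and the target names (gfx1030 for Navi21, gfx1031 for Navi22) are the LLVM AMDGPU processor names as I understand them.

```python
# Illustrative model only; this is not ROCm's real loader code.
# Assumed mapping of dies to LLVM AMDGPU target names.
DIE_TO_TARGET = {
    "Navi21": "gfx1030",
    "Navi22": "gfx1031",
}

# Hypothetical math library that ships a binary slice for Navi21 only.
LIBRARY_TARGETS = frozenset({"gfx1030"})


def has_code_object(die, built_targets):
    """Return True only if a slice was built for this die's exact target."""
    return DIE_TO_TARGET[die] in built_targets


def rebuild_with(built_targets, extra_target):
    """Model the workaround: recompile the library with one more target."""
    return built_targets | {extra_target}
```

In this model, `has_code_object("Navi22", LIBRARY_TARGETS)` is false until the library is rebuilt with `rebuild_with(LIBRARY_TARGETS, "gfx1031")`, which mirrors the manual recompilation step.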

Compatibility between different GPUs

This behaviour of having a separate target per GPU die is unlike what NVIDIA does (outside of Tegra, which has its own problems that can force going down the PTX route). On NVIDIA, the SASS/hardware ISA is compatible between different products of the same generation and, even more than that, within the same SM major release. This means that binary slices targeted for Volta can run on Turing GPUs, for example.

NVIDIA also has PTX, an intermediate representation whose slices can be bundled into program binaries to guarantee compatibility with future GPUs yet to be designed, without recompilation. ROCm has no equivalent to PTX today.
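The NVIDIA rules from the two paragraphs above can be sketched as a small compatibility check. This is a simplified model, not an actual CUDA API, and all names are hypothetical. It assumes that a SASS slice runs on devices with the same major compute capability and an equal or higher minor one, that a PTX slice can be JIT-compiled by the driver for any equal or newer compute capability, and it ignores the Tegra exceptions.

```python
# Simplified model of NVIDIA binary compatibility; names are hypothetical.
# Compute capabilities are (major, minor) tuples.
VOLTA = (7, 0)
TURING = (7, 5)
AMPERE = (8, 0)


def sass_runs_on(built_for, device):
    """SASS slice: same major version, and the device's minor is not lower."""
    return built_for[0] == device[0] and device[1] >= built_for[1]


def ptx_runs_on(built_for, device):
    """PTX slice: the driver can JIT it for any equal or newer device."""
    return device >= built_for


def binary_runs_on(sass_slices, ptx_slices, device):
    """A binary runs if any bundled SASS or PTX slice is usable."""
    return (any(sass_runs_on(s, device) for s in sass_slices)
            or any(ptx_runs_on(p, device) for p in ptx_slices))
```

Under this model, a binary carrying only a Volta SASS slice runs on Turing but not on Ampere, while adding a Volta PTX slice covers Ampere as well.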

Meanwhile, Intel uses OpenCL 3.0 and oneAPI Level Zero as the API abstractions provided by the driver. Those use SPIR-V as the intermediate representation, providing compatibility for a given GPU program binary between different hardware generations.

Because AMD has no equivalent of PTX or SPIR-V, binaries have to limit themselves to the hipRTC (the NVRTC clone/API rename) runtime compilation mechanism if compatibility with future hardware is needed. This results in the single-source programming model not being supported.

Support is important

Official support for the whole of a product range is really important for the adoption of a proper GPGPU stack across the ecosystem. Having to recompile ROCm manually with support for a specific GPU die is something that should really be avoided.

Supporting only enterprise products shrinks the software pool massively, because it forgoes the huge user base of people experimenting on their own machines. Those issues combined result in the lack of adoption of ROCm today. Not supporting the world’s most popular desktop operating system doesn’t help either.

Is OpenCL a good alternative on AMD?

AMD’s OpenCL implementation doesn’t support ingestion of SPIR (which was supported previously, then removed) or SPIR-V. This means that the OpenCL implementation is significantly less useful than it could have been.

A complete OpenCL implementation (including SPIR-V ingestion) would allow running identical binaries across both Intel and AMD GPUs, leveraging standardised infrastructure.

So no, it isn’t one…