{"id":494,"date":"2022-10-17T03:25:48","date_gmt":"2022-10-17T01:25:48","guid":{"rendered":"https:\/\/threedots.ovh\/blog\/?p=494"},"modified":"2022-10-17T04:14:06","modified_gmt":"2022-10-17T02:14:06","slug":"the-biggest-gap-in-the-graphics-apis-for-gpgpu-workloads","status":"publish","type":"post","link":"https:\/\/threedots.ovh\/blog\/2022\/10\/the-biggest-gap-in-the-graphics-apis-for-gpgpu-workloads\/","title":{"rendered":"The biggest gap in the graphics APIs for GPGPU workloads"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The GPGPU ecosystem is built on an assumption that the virtual memory address space is unified between the host and the device(s). That has been a property that was true since the Kepler GPU architecture back in 2012. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This model is much easier to use than its predecessor from the developer&#8217;s perspective, with unified memory having the same address on both the host and device sides, without explicit copies being required.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, there&#8217;s a class of APIs that do not provide those semantics whatsoever today: graphics APIs. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This includes all of D3D12, Metal and Vulkan. In those APIs, the GPU and CPU-side virtual address for the same allocation can be different. 
There&#8217;s also no guarantee that the two address ranges do not overlap.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How this issue maps to the GPGPU APIs<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">SYCL<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">While SYCL supports the older buffer-accessor model, the software ecosystem for modern SYCL has been built around the assumption that USM (Unified Shared Memory) is available.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As such, these applications will not run on a SYCL implementation that only supports the buffer-accessor model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A SYCL implementation supporting USM device allocations but not shared allocations (with pointers tracked manually) would be quite unusual, and perhaps just not worth the effort. It could also break down the moment a host pointer and a device pointer happen to share the same address. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This would also force applications to use the buffer-accessor model for any allocation accessible from both the CPU and the accelerator side.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">CUDA<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The unified virtual address space feature has shipped in CUDA since Kepler GPUs. As such, it has been the baseline expectation since then.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">OpenCL<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">From OpenCL 2.0 onwards, <em>Shared Virtual Memory<\/em> (SVM) implements this concept.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What does this mean for CLon12?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">CLon12 is the OpenCL implementation on top of D3D12 shipped by Microsoft. It does support SPIR-V kernels. 
However, it doesn&#8217;t provide <em>shared virtual memory<\/em> as the D3D12 API does not support it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As such, CLon12 does not fulfil the role of a cross-vendor runtime that real-world SYCL applications can run on.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What about using regular OpenCL runtimes on Windows to run SYCL applications?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Regular OpenCL runtimes do not fulfil that role either. AMD only supports OpenCL C (ugh) and NVIDIA&#8217;s OpenCL driver only supports PTX as an intermediate language, making a separate binary slice necessary instead of being able to leverage the SPIR-V one. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Intel is the only player in the industry to ship a more or less fully featured OpenCL driver from this perspective. Qualcomm does not ship an OpenCL driver on Windows platforms, leaving CLon12 as the only option there.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">And mobile platforms?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">On Android, SVM tends to be shipped by all the vendors. However, Google does not officially support OpenCL on their software platforms.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Not having unified shared memory is a big gap in the graphics APIs. And not the kind that can be worked around relatively easily via a compatibility layer. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This causes major problems when porting GPGPU programs to platforms such as Apple Silicon Macs, where Metal does not expose unified shared memory.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The GPGPU ecosystem is built on an assumption that the virtual memory address space is unified between the host and the device(s). That has been a property that was true since the Kepler GPU architecture back in 2012. 
This model is much easier to use than its predecessor from the developer&#8217;s perspective, with unified memory&hellip;&nbsp;<a href=\"https:\/\/threedots.ovh\/blog\/2022\/10\/the-biggest-gap-in-the-graphics-apis-for-gpgpu-workloads\/\" rel=\"bookmark\">Read More &raquo;<span class=\"screen-reader-text\">The biggest gap in the graphics APIs for GPGPU workloads<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"neve_meta_sidebar":"","neve_meta_container":"","neve_meta_enable_content_width":"","neve_meta_content_width":0,"neve_meta_title_alignment":"","neve_meta_author_avatar":"","neve_post_elements_order":"","neve_meta_disable_header":"","neve_meta_disable_footer":"","neve_meta_disable_title":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-494","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts\/494","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/comments?post=494"}],"version-history":[{"count":4,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts\/494\/revisions"}],"predecessor-version":[{"id":498,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts\/494\/revisions\/498"}],"wp:attachment":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/media?parent=494"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/categories?post=494"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/threedots.ovh\/b
log\/wp-json\/wp\/v2\/tags?post=494"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}