For April Fools' Day, this post discusses an alternate approach for the OpenCL on Metal project.
Writing an OpenCL implementation from scratch would take a long time – what if we could leverage existing building blocks for a proof of concept?
CLVK is an experimental, limited OpenCL implementation that runs on top of Vulkan. It uses clspv, a compiler that transforms OpenCL C code into Vulkan compute shaders, with some important limitations.
MoltenVK is a Vulkan (Portability) implementation targeting Apple operating systems, which allows Vulkan applications to run on top of Metal.
$ cd ~/VulkanSDK/1.3.204.1/ && source setup-env.sh && cd -
~/devlab/opencl/clvk
$ git clone --recurse https://github.com/kpet/clvk
[...]
$ cd clvk/external/clspv
$ git diff # small change to use python3
diff --git a/utils/fetch_sources.py b/utils/fetch_sources.py
index f0ba884..a0d6a35 100755
--- a/utils/fetch_sources.py
+++ b/utils/fetch_sources.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
# Copyright 2017 The Clspv Authors. All rights reserved.
#
$ cd ../.. ; ./external/clspv/utils/fetch_sources.py --deps llvm
[...]
$ mkdir build && cd build
$ cmake ..
$ make -j8
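The python3 shebang change shown in the diff above can also be applied without editing the file by hand. What follows is a minimal sketch, assuming a POSIX shell with sed and mktemp available; use_python3_shebang is a hypothetical helper name introduced here for illustration and is not part of clspv.

```shell
# Hypothetical helper (not part of clspv): rewrite a "#!/usr/bin/env python"
# shebang to python3. Files whose first line is anything else are left as-is.
use_python3_shebang() {
    file="$1"
    tmp="$(mktemp)" || return 1
    # Only line 1 is rewritten, and only if it is exactly the python shebang.
    sed '1s|^#!/usr/bin/env python$|#!/usr/bin/env python3|' "$file" > "$tmp" || {
        rm -f "$tmp"
        return 1
    }
    # Copy the contents back in place so the file keeps its executable bit.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage, from clvk/external/clspv:
#   use_python3_shebang utils/fetch_sources.py
```

Another option that leaves the source tree untouched is to invoke the interpreter explicitly, as in python3 ./external/clspv/utils/fetch_sources.py --deps llvm, since the shebang line is ignored when the script is passed to python3 directly.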
Does it work at first glance? Yes, it does.
$ ./clpeak # linking to libOpenCL.dylib instead of OpenCL.framework
Platform: clvk
Device: Apple M1
Driver version : 1.2 CLVK on Vulkan v1.1.189 driver 10105 (Macintosh)
Compute units : 1
Clock frequency : 0 MHz
Global memory bandwidth (GBPS)
float : 55.63
float2 : 57.66
float4 : 57.50
float8 : 57.88
float16 : 62.41
Single-precision compute (GFLOPS)
float : 1267.18
float2 : 1502.12
float4 : 1525.60
float8 : 892.10
float16 : 1470.11
Half-precision compute (GFLOPS)
half : 1333.62
half2 : 1508.59
half4 : 1591.51
half8 : 1525.85
half16 : 1423.71
No double precision support! Skipped
Integer compute (GIOPS)
int : 472.07
int2 : 467.69
int4 : 469.41
int8 : 476.42
int16 : 461.98
Integer compute Fast 24bit (GIOPS)
int : 480.22
int2 : 478.74
int4 : 437.78
int8 : 474.67
int16 : 475.28
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 28.62
enqueueReadBuffer : 28.93
enqueueWriteBuffer non-blocking : 28.99
enqueueReadBuffer non-blocking : 28.97
enqueueMapBuffer(for read) : 550636.56
memcpy from mapped ptr : 28.75
enqueueUnmap(after write) : 727960.19
memcpy to mapped ptr : 28.96
Kernel launch latency : 6.92 us
I tried the same steps you did, and the build was actually successful. In the end it still failed, though, because calling clCreateCommandQueue with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE failed with the most recent versions of the Vulkan SDK and clvk. My goal was to run llama.cpp with CLBlast GPU acceleration on my M1 Max computer. I don't have the knowledge to make that work, so I am giving up for now. Still, this blog post gave me a very helpful hint.