Skip to content

OpenCL on Metal #1 – get_global_id()

OpenCL today is considered as deprecated on macOS, let’s see if we can get OpenCL kernels compiled to AIR to run on Metal.

Let’s start with the prototype of a basic kernel to add array A and B to an output array named result in both MSL and OpenCL C.

Prototype

Metal: void add_arrays(device const float* inA, device const float* inB, device float* result, uint index [[thread_position_in_grid]])

OpenCL: __kernel void add_arrays(__global const float *a, __global const float *b, __global float *result)

Let’s pass a precompiled to AIR OpenCL binary to Metal and see what happens?

Compute pipeline state creation failed.
Error: Undefined symbols: llvm.agx2.thread.position.in.grid.A, referenced from: __entry_add_arrays

llvm.agx2.thread.position.in.grid.A is not available.

; Function Attrs: nounwind optsize readnone
define hidden i64 @_Z13get_global_idj(i32) local_unnamed_addr #0 {
  %2 = tail call i32 @air.get_global_id.i32(i32 %0) #1
  %3 = zext i32 %2 to i64
  ret i64 %3
}

; Function Attrs: nounwind readnone
declare i32 @air.get_global_id.i32(i32) local_unnamed_addr #1

What does that map to? It maps to air.get_global_id.i32. Which is called from get_global_id(int dim).

Metal doesn’t have an equivalent to get_global_id(), it passes through the index to the compute kernel.

How is it done in AIR for Metal?

; Function Attrs: norecurse nounwind
define void @add_arrays(float addrspace(1)* noalias nocapture readonly, float addrspace(1)* noalias nocapture readonly, float addrspace(1)* noalias nocapture, i32) local_unnamed_addr #0 {
  %5 = zext i32 %3 to i64
  %6 = getelementptr inbounds float, float addrspace(1)* %0, i64 %5
  %7 = load float, float addrspace(1)* %6, align 4, !tbaa !22
  %8 = getelementptr inbounds float, float addrspace(1)* %1, i64 %5
  %9 = load float, float addrspace(1)* %8, align 4, !tbaa !22
  %10 = fadd fast float %9, %7
  %11 = getelementptr inbounds float, float addrspace(1)* %2, i64 %5
  store float %10, float addrspace(1)* %11, align 4, !tbaa !22
  ret void
}

!8 = !{void (float addrspace(1)*, float addrspace(1)*, float addrspace(1)*, i32)* @add_arrays, !9, !10}
!9 = !{}
!10 = !{!11, !12, !13, !14}
!11 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read", !"air.arg_type_size", i32 4, !"air.arg_type_align_size", i32 4, !"air.arg_type_name", !"float", !"air.arg_name", !"inA"}
!12 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read", !"air.arg_type_size", i32 4, !"air.arg_type_align_size", i32 4, !"air.arg_type_name", !"float", !"air.arg_name", !"inB"}
!13 = !{i32 2, !"air.buffer", !"air.location_index", i32 2, i32 1, !"air.read_write", !"air.arg_type_size", i32 4, !"air.arg_type_align_size", i32 4, !"air.arg_type_name", !"float", !"air.arg_name", !"result"}
!14 = !{i32 3, !"air.thread_position_in_grid", !"air.arg_type_name", !"uint", !"air.arg_name", !"index"}

As expected in the function prototype, the index is the 4th argument, and is passed that way to the kernel.

Strategies around this

There’s three possible paths forward.

The first path is figuring out what the OpenCL implementation does to enable use of get_global_id() during code generation.

The second strategy is writing a compiler pass to have this information passed when needed throughout the program, as an additional argument.

The third one is not using Apple’s compiler frontend, and going our own way. For that, an OpenCL-profile SPIR-V to AIR path might be preferable engineering-wise.

Leave a Reply

Your email address will not be published. Required fields are marked *