GCC’s NVPTX target

The GNU Compiler Collection has an NVIDIA PTX target, which supports offloading for both OpenMP and OpenACC on NVIDIA GPUs.

It also supports a rather unusual standalone mode for use cases that want to run a single-threaded workload on a GPU. This is, of course, slow by orders of magnitude, but it is very useful for testing the compiler itself.

Building it

Common

nvptx-tools is the equivalent of binutils for this GCC target.

git clone https://github.com/MentorEmbedded/nvptx-tools
cd nvptx-tools
PATH=/usr/local/cuda/bin:$PATH ./configure --with-cuda-driver=/usr/local/cuda
make
sudo make install

Offload mode

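This is the mode that you’ll most likely want to use. Offload mode is currently not functional on 64-bit Arm due to GCC bug 96265.

The commands below assume that HOST_TRIPLET holds the target triplet of the host (for example x86_64-pc-linux-gnu) and that PREFIX is the staging directory passed to DESTDIR.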

git clone git://gcc.gnu.org/git/gcc.git
git clone https://sourceware.org/git/newlib-cygwin.git
cd gcc
ln -s ../newlib-cygwin/newlib newlib
mkdir -p build/nvptx-offload
mkdir -p build/host
cd build/nvptx-offload
../../configure --target=nvptx-none --enable-as-accelerator-for=$HOST_TRIPLET --disable-sjlj-exceptions --enable-newlib-io-long-long --with-build-time-tools=/usr/local/nvptx-none/bin --with-as=/usr/local/bin/nvptx-none-as
make -j160 && sudo make install DESTDIR=$PREFIX
cd ../host
../../configure --build=$HOST_TRIPLET --host=$HOST_TRIPLET --target=$HOST_TRIPLET --enable-languages=c,c++,fortran,lto --enable-offload-targets=nvptx-none=$PREFIX/usr/local/nvptx-none --with-cuda-driver=/usr/local/cuda
make -j160 && sudo make install DESTDIR=$PREFIX
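
Once both compilers are installed, a small OpenMP target region makes a convenient smoke test for the offload build. The function below is only a sketch: the name offload_sum and the file name are made up for illustration, and it sums the integers below n inside a target region. Built without -fopenmp, the pragma is ignored and the loop simply runs on the host, so the result is the same either way.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC or libgomp: sums 0..n-1.
 * With -fopenmp and a working nvptx offload compiler, the loop runs in an
 * OpenMP target region; otherwise it runs on the host. */
long offload_sum(int n)
{
    long sum = 0;

    assert(n >= 0);

    /* map(tofrom: sum) copies the accumulator to the device and back;
     * the reduction clause combines the per-thread partial sums. */
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom: sum)
    for (int i = 0; i < n; i++)
        sum += i;

    return sum;
}
```

Call it from a main function and build it with the freshly installed host compiler, along the lines of gcc -fopenmp -foffload=nvptx-none offload_sum.c (the exact compiler path depends on where PREFIX points). Setting OMP_TARGET_OFFLOAD=MANDATORY in the environment makes libgomp fail instead of silently falling back to the host, which helps confirm that the GPU is actually used.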

Standalone mode

git clone git://gcc.gnu.org/git/gcc.git
git clone https://sourceware.org/git/newlib-cygwin.git
cd gcc
ln -s ../newlib-cygwin/newlib newlib
mkdir -p build/nvptx
cd build/nvptx
../../configure --target=nvptx-none --disable-sjlj-exceptions --enable-newlib-io-long-long --with-build-time-tools=/usr/local/nvptx-none/bin  --with-as=/usr/local/bin/nvptx-none-as
make -j160 && sudo make install DESTDIR=$PREFIX

Fun with standalone mode

Standalone mode lets regular programs run, albeit very slowly (GPUs aren’t exactly known for good single SIMT lane performance) and within the limitations of GCC’s PTX backend. This makes it possible to run the GCC test suite, for example, but we can also have some fun with it.

Let’s write a (totally useless, by the way) test program, hello.c, to see how many clock cycles printing a line takes on hardware:

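#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Implemented in PTX; returns the value of the %clock64 cycle counter. */
extern uint64_t clock64(void);

int main (int argc, char** argv) {
	uint64_t b = clock64();
	puts("hello world from a GPU thread!");
	uint64_t e = clock64();
	/* PRIu64 matches uint64_t; %i expects an int and would be undefined behavior. */
	printf("time elapsed: %" PRIu64 "\n", e - b);
	return 0;
}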

and a PTX implementation of clock64, saved as clock.s:

.version 6.0
.target sm_30
.address_size 64
.visible .func (.param .u64 %value_out) clock64;
.visible .func (.param .u64 %value_out) clock64
{
.reg .u64 %value;
.reg .u64 %r23;
mov.u64 %r23,%clock64;
mov.u64 %value,%r23;
st.param.u64 [%value_out],%value;
ret;
}

We can compile the program and then run it:

$ nvptx-none-gcc -mmainkernel -misa=sm_70 clock.s -O3 hello.c -o hello
$ nvptx-none-run hello
hello world from a GPU thread!
time elapsed: 48359

What does -mmainkernel do?

It adds a startup stub, analogous to the one below, that lets regular unmodified test cases run.

int main(int argc, char** argv);

__attribute__((kernel))
void __main(int *exitval_ptr, int argc, char *argv[])
{
  *exitval_ptr = main(argc, argv);
}