{"id":390,"date":"2022-05-04T03:47:43","date_gmt":"2022-05-04T01:47:43","guid":{"rendered":"https:\/\/threedots.ovh\/blog\/?p=390"},"modified":"2022-05-04T13:25:42","modified_gmt":"2022-05-04T11:25:42","slug":"gccs-nvptx-target","status":"publish","type":"post","link":"https:\/\/threedots.ovh\/blog\/2022\/05\/gccs-nvptx-target\/","title":{"rendered":"GCC&#8217;s NVPTX target"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The <a rel=\"noreferrer noopener\" href=\"https:\/\/gcc.gnu.org\" data-type=\"URL\" data-id=\"https:\/\/gcc.gnu.org\" target=\"_blank\">GNU Compiler Collection<\/a> has an NVIDIA PTX target, supporting both offload modes for <a rel=\"noreferrer noopener\" href=\"https:\/\/www.openmp.org\" target=\"_blank\">OpenMP<\/a> <em>and<\/em> <a rel=\"noreferrer noopener\" href=\"https:\/\/www.openacc.org\" target=\"_blank\">OpenACC<\/a> on NVIDIA GPUs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It also supports a rather special <em>standalone<\/em> mode, for use cases that want to run a <em>single-threaded<\/em> load on GPUs. This is of course orders of magnitude away from being performant, but it is very useful for testing the compiler itself.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Building it<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Common<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/MentorEmbedded\/nvptx-tools\" target=\"_blank\">nvptx-tools<\/a> is the binutils equivalent for this GCC target.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/MentorEmbedded\/nvptx-tools\ncd nvptx-tools\nPATH=\/usr\/local\/cuda\/bin:$PATH .\/configure --with-cuda-driver=\/usr\/local\/cuda\nmake\nsudo make install<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Offload mode<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This is the mode that you&#8217;ll most likely want to use. 
Offload mode is currently not functional on 64-bit Arm due to <a href=\"https:\/\/gcc.gnu.org\/bugzilla\/show_bug.cgi?format=multiple&amp;id=96265\">GCC bug 96265<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone git:\/\/gcc.gnu.org\/git\/gcc.git\ngit clone https:\/\/sourceware.org\/git\/newlib-cygwin.git\ncd gcc\nln -s ..\/newlib-cygwin\/newlib newlib\nmkdir -p build\/nvptx-offload\nmkdir -p build\/host\ncd build\/nvptx-offload\n..\/..\/configure --target=nvptx-none --enable-as-accelerator-for=$HOST_TRIPLET --disable-sjlj-exceptions --enable-newlib-io-long-long --with-build-time-tools=\/usr\/local\/nvptx-none\/bin --with-as=\/usr\/local\/bin\/nvptx-none-as\nmake -j160 &amp;&amp; sudo make install DESTDIR=$PREFIX\ncd ..\/host\n..\/..\/configure --build=$HOST_TRIPLET --host=$HOST_TRIPLET --target=$HOST_TRIPLET --enable-languages=c,c++,fortran,lto --enable-offload-targets=nvptx-none=$PREFIX\/usr\/local\/nvptx-none --with-cuda-driver=\/usr\/local\/cuda\nmake -j160 &amp;&amp; sudo make install DESTDIR=$PREFIX<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Standalone mode<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone git:\/\/gcc.gnu.org\/git\/gcc.git\ngit clone https:\/\/sourceware.org\/git\/newlib-cygwin.git\ncd gcc\nln -s ..\/newlib-cygwin\/newlib newlib\nmkdir -p build\/nvptx\ncd build\/nvptx\n..\/..\/configure --target=nvptx-none --disable-sjlj-exceptions --enable-newlib-io-long-long --with-build-time-tools=\/usr\/local\/nvptx-none\/bin --with-as=\/usr\/local\/bin\/nvptx-none-as\nmake -j160 &amp;&amp; sudo make install DESTDIR=$PREFIX<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Fun with standalone mode<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Standalone mode lets <em>regular<\/em> programs run, albeit very slowly (GPUs aren&#8217;t exactly known for good single SIMT lane performance) and subject to the limitations of GCC&#8217;s PTX backend. 
That makes it possible to run the GCC test suite, for example &#8211; but we can also have some fun with it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s write a (totally useless, by the way) test program to see how long a <code>puts<\/code> call takes on hardware:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;stdio.h&gt;\n#include &lt;stdint.h&gt;\n\n\/* implemented in PTX (clock.s), reads the GPU cycle counter *\/\nextern uint64_t clock64(void);\n\nint main (int argc, char** argv) {\n\tuint64_t b = clock64();\n\tputs(\"hello world from a GPU thread!\");\n\tuint64_t e = clock64();\n\t\/* the difference is a 64-bit cycle count, so print it as unsigned long long *\/\n\tprintf(\"time elapsed: %llu\\n\", (unsigned long long)(e - b));\n\treturn 0;\n}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">and the PTX implementation of <code>clock64<\/code>, saved as <code>clock.s<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>.version 6.0\n.target sm_30\n.address_size 64\n.visible .func (.param .u64 %value_out) clock64;\n.visible .func (.param .u64 %value_out) clock64\n{\n.reg .u64 %value;\n.reg .u64 %r23;\n\/\/ read the %clock64 special register and return it\nmov.u64 %r23,%clock64;\nmov.u64 %value,%r23;\nst.param.u64 &#91;%value_out],%value;\nret;\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We can compile the program and then run it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ nvptx-none-gcc -mmainkernel -misa=sm_70 clock.s -O3 hello.c -o hello\n$ nvptx-none-run hello\nhello world from a GPU thread!\ntime elapsed: 48359<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">What does -mmainkernel do?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It adds a startup stub, analogous to the one below, so that regular, unmodified test cases can run.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>int main(int argc, char** argv);\n\n__attribute__((kernel))\nvoid __main(int *exitval_ptr, int argc, char *argv&#91;])\n{\n  *exitval_ptr = main(argc, argv);\n}\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The GNU Compiler Collection has an NVIDIA PTX target, supporting both offload modes for OpenMP and OpenACC on NVIDIA GPUs. 
It also supports a quite special standalone mode, for use cases that want to run a single-threaded load on GPUs &#8211; this is of course not remotely performant by orders of magnitude &#8211; but is&hellip;&nbsp;<a href=\"https:\/\/threedots.ovh\/blog\/2022\/05\/gccs-nvptx-target\/\" rel=\"bookmark\">Read More &raquo;<span class=\"screen-reader-text\">GCC&#8217;s NVPTX target<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"neve_meta_sidebar":"","neve_meta_container":"","neve_meta_enable_content_width":"","neve_meta_content_width":0,"neve_meta_title_alignment":"","neve_meta_author_avatar":"","neve_post_elements_order":"","neve_meta_disable_header":"","neve_meta_disable_footer":"","neve_meta_disable_title":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-390","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts\/390","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/comments?post=390"}],"version-history":[{"count":6,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts\/390\/revisions"}],"predecessor-version":[{"id":397,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/posts\/390\/revisions\/397"}],"wp:attachment":[{"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/media?parent=390"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/threedots.ovh\/blog\/wp-json\/wp\/v2\/categories?post=390"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/t
hreedots.ovh\/blog\/wp-json\/wp\/v2\/tags?post=390"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}