OpenACC


This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.




OpenACC makes it relatively easy to offload vectorized code to accelerators such as GPUs. Unlike CUDA and OpenCL, where kernels must be coded explicitly, OpenACC minimizes the number of modifications needed to a serial or OpenMP code. The compiler converts the OpenACC code into a binary executable that can make use of accelerators. The performance of OpenACC code can be similar to that of CUDA code, while usually requiring less development effort.

OpenACC directives

Similar to OpenMP, OpenACC can convert a for loop into parallel code that runs on an accelerator. This is done with compiler directives of the form #pragma acc ... placed before a structured block of code, such as a for loop. All supported directives are described in the OpenACC specification. Two directives used in the code examples below are:

  • #pragma acc parallel starts parallel execution of the following structured block on the accelerator. The optional clause vector_length(size) sets the vector length used for vector or SIMD operations; typical sizes are 256 or 512.
  • #pragma acc loop marks the following for loop for execution on the accelerator. The optional clause reduction(op:variable) indicates that the loop performs a reduction into a variable defined outside the loop; each parallel thread works on a private copy of that variable. Typical operators are + (sum) and * (product).
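
As a minimal sketch written for illustration (it is not one of this page's examples), the two directives can also be combined into a single #pragma acc parallel loop construct. The array, its size and its contents below are arbitrary choices:

#include <stdio.h>

int main(void) {
  const int n = 1000000;
  static double x[1000000];   /* data to be summed on the accelerator */
  double sum = 0.0;

  for (int i = 0; i < n; i++)
    x[i] = 1.0 / (i + 1);

  /* Combined form: one directive opens the parallel region and
     distributes the loop; the reduction clause gives each thread a
     private copy of sum and combines the copies at the end. */
  #pragma acc parallel loop reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += x[i];

  printf("sum = %f\n", sum);
  return 0;
}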

Code examples

OpenACC can be used in Fortran, C and C++. We illustrate it here with a simple program that computes a decimal approximation to π by numerically integrating 4/(1+t²) from 0 to 1; since the integral of 1/(1+t²) over that interval equals arctan(1) = π/4, the result is π.

File : pi.c

#include <stdio.h>

const int vl = 512;              /* vector length for SIMD operations on the accelerator */
const long long N = 2000000000;  /* number of integration points */

int main(int argc, char** argv) {
  double pi = 0.0;
  long long i;

  /* Run the loop on the accelerator; the reduction gives each thread
     a private copy of pi and sums the copies at the end. */
  #pragma acc parallel vector_length(vl)
  #pragma acc loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = (double)((i + 0.5) / N);
    pi += 4.0 / (1.0 + t * t);
  }
  printf("pi = %11.10f\n", pi / N);
  return 0;
}


File : pi.cxx

#include <iostream>
#include <iomanip>

const int vl = 512;              // vector length for SIMD operations on the accelerator
const long long N = 2000000000;  // number of integration points

int main(int argc, char** argv) {
  double pi = 0.0;
  long long i;

  // Run the loop on the accelerator; the reduction gives each thread
  // a private copy of pi and sums the copies at the end.
  #pragma acc parallel vector_length(vl)
  #pragma acc loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = double((i + 0.5) / N);
    pi += 4.0 / (1.0 + t * t);
  }
  std::cout << std::fixed;
  std::cout << std::setprecision(10);
  std::cout << "pi = " << pi / double(N) << std::endl;
  return 0;
}


Compilers

PGI

  • Module pgi, any version from 13.10
    • Newer versions support the newest GPU capabilities.

Compilation example:

# TODO
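
A typical invocation with the PGI compiler looks roughly like the line below; the exact flags are an assumption rather than a recipe taken from this page (-acc enables OpenACC and -Minfo=accel reports which loops were offloaded):

pgcc -acc -Minfo=accel -O3 pi.c -o pi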

GCC

  • Module gcc, any version from 9.3.0
    • Newer versions support the newest GPU capabilities.
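
The corresponding module can be loaded with, for example (the version shown is the minimum mentioned above; a newer one can be substituted):

module load gcc/9.3.0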

Compilation example:

gcc -fopenacc -march=native -O3 pi.c -o pi
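
Note that -fopenacc alone may run the parallelized code on the CPU; offloading to an NVIDIA GPU requires a GCC installation built with the nvptx offload compiler. Assuming the loaded gcc module provides it, the offload target can be requested explicitly:

gcc -fopenacc -foffload=nvptx-none -O3 pi.c -o pi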

Tutorial

See our OpenACC Tutorial.

References

  • OpenACC official documentation - Specification 3.1 (PDF): https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.1-final.pdf
  • NVIDIA OpenACC API - Quick Reference Guide (PDF): https://www.nvidia.com/docs/IO/116711/OpenACC-API.pdf