This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.

OpenACC makes it relatively easy to offload vectorizable code to accelerators such as GPUs. Unlike CUDA and OpenCL, where kernels must be written explicitly, OpenACC requires only minimal modifications to an existing serial or OpenMP code. The compiler translates the annotated code into a binary executable that can make use of accelerators. An OpenACC code can perform similarly to an equivalent CUDA code while requiring less development effort.

OpenACC compiler instructions

Similar to OpenMP, OpenACC can convert a for-loop into parallel code that runs on an accelerator. This is achieved by placing a compiler directive of the form `#pragma acc ...` before the for-loop.

`pragma acc parallel`

Starts parallel execution of the following structured block of code on the accelerator.

Optional clauses:

vector_length(size): sets the vector length to use for vector or SIMD operations.
- For example, size can be 256 or 512.

`pragma acc loop`

The following for-loop is converted for execution on the accelerator.

Optional clauses:

reduction(op:variable): the loop performs a reduction into a variable defined outside of the loop. Each parallel thread works on a private copy of that variable, and the private copies are combined with the operator op at the end of the loop.
- Typical operators: + (sum), * (product)

Code examples

OpenACC can be used in Fortran, C and C++. The C and C++ examples below use an integral expression for the arctangent to compute an approximation of $\pi = 4\arctan(1) = 4\int_0^1 \frac{dx}{1+x^2}$. The integral is evaluated with the midpoint rule on $N$ intervals, that is $\pi \approx \frac{1}{N}\sum_{i=0}^{N-1} \frac{4}{1+t_i^2}$ with $t_i = (i + 0.5)/N$.

C and C++

File : pi.c

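#include <stdio.h>

const int vl = 512;              /* requested vector length */
const long long N = 2000000000;  /* number of integration intervals */

int main(void) {
  double pi = 0.0;
  long long i;

  /* Midpoint rule: sum 4 / (1 + t^2) over the midpoints t of N intervals on [0, 1] */
  #pragma acc parallel vector_length(vl)
  #pragma acc loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = (i + 0.5) / N;
    pi += 4.0 / (1.0 + t * t);
  }
  /* Multiply the sum by the interval width 1/N */
  printf("pi = %11.10f\n", pi / N);
  return 0;
}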

File : pi.cxx

#include <iostream>
#include <iomanip>

const int vl = 512;              // requested vector length
const long long N = 2000000000;  // number of integration intervals

int main() {
  double pi = 0.0;
  long long i;

  // Midpoint rule: sum 4 / (1 + t^2) over the midpoints t of N intervals on [0, 1]
  #pragma acc parallel vector_length(vl)
  #pragma acc loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = (i + 0.5) / N;
    pi += 4.0 / (1.0 + t * t);
  }
  // Multiply the sum by the interval width 1/N and print with 10 decimals
  std::cout << std::fixed << std::setprecision(10);
  std::cout << "pi = " << pi / double(N) << std::endl;
  return 0;
}

Compilers

PGI

Module pgi, any version from 13.10 onwards
- Newer versions support the newest GPU capabilities.

Compilation example:

# TODO

GCC

Module gcc, any version from 9.3.0 onwards
- Newer versions support the newest GPU capabilities.

Compilation example:

gcc -fopenacc -march=native -O3 pi.c -o pi

Tutorial

See our OpenACC Tutorial.
