OpenACC Tutorial - Adding directives
- Understand the process of offloading
- Understand what an OpenACC directive is
- Understand the difference between the loop and kernels directives
- Understand how to build a program with OpenACC
- Understand what aliasing is in C/C++
- Learn how to use compiler feedback and how to fix false aliasing
Offloading to a GPU
The first thing to realize when trying to port code to a GPU is that the GPU does not share the same memory as the CPU. In other words, a GPU does not have direct access to the host memory. The host memory is generally larger, but slower, than the GPU memory. To use a GPU, data must therefore be transferred from the main program to the GPU through the PCI bus, which has a much lower bandwidth than either memory. This means that managing data transfers between the host and the GPU is of paramount importance. Transferring the data and the code onto the device is called offloading.
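The directives introduced below can express these transfers explicitly. As a preview, here is a minimal sketch (the array names and sizes are illustrative): the copyin clause transfers x to the device on entry, and copy transfers y in and brings it back when the region ends.
#pragma acc data copyin(x[0:N]) copy(y[0:N])
{
    // x and y now have device copies; the kernels region below works on
    // those copies, and y is copied back to the host when the region ends.
    #pragma acc kernels
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}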
OpenACC directives
OpenACC directives are much like OpenMP directives. They take the form of pragmas in C/C++, and of comments in Fortran. The advantages of this method are numerous. First, since it involves only minor modifications to the code, changes can be made incrementally, one pragma at a time. This is especially useful for debugging purposes, since making a single change at a time allows one to quickly identify which change created a bug. Second, OpenACC support can be disabled at compile time. When OpenACC support is disabled, the pragmas are treated as comments and ignored by the compiler. This means that a single source code can be used to compile both an accelerated version and a normal version. Third, since all of the offloading work is done by the compiler, the same code can be compiled for various accelerator types: GPUs, MIC (Xeon Phi) or CPUs. It also means that supporting a new generation of devices only requires updating the compiler, not changing the code.
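For instance, the OpenACC specification defines a preprocessor macro, _OPENACC, which is only set when OpenACC support is enabled. A minimal sketch of how a single source file can detect which way it was compiled:
#include <stdio.h>

int main(void)
{
#ifdef _OPENACC
    /* _OPENACC is only defined when the compiler is invoked with
       OpenACC enabled; its value encodes the spec version (yyyymm). */
    printf("Compiled with OpenACC support (%d)\n", _OPENACC);
#else
    printf("Compiled without OpenACC support\n");
#endif
    return 0;
}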
In the following example, we take a code composed of two loops. The first one initializes two vectors, and the second performs a SAXPY, a basic scaled vector addition operation (y = a*x + y).
C/C++:
#pragma acc kernels
{
  for (int i=0; i<N; i++)
  {
    x[i] = 1.0;
    y[i] = 2.0;
  }
  for (int i=0; i<N; i++)
  {
    y[i] = a * x[i] + y[i];
  }
}
Fortran:
!$acc kernels
do i=1,N
  x(i) = 1.0
  y(i) = 2.0
end do
y(:) = a*x(:) + y(:)
!$acc end kernels
Both in the C/C++ and the Fortran cases, the compiler will identify two kernels. In C/C++, the two kernels will correspond to the inside of each loop. In Fortran, the kernels will be the inside of the first loop, as well as the inside of the implicit loop that Fortran performs when it does an array operation.
Note that in C/C++, the OpenACC block is delimited using curly brackets, while in Fortran, the directive needs to be repeated at the end of the block, with the end keyword added.
Loops vs Kernels
When the compiler reaches an OpenACC kernels directive, it will analyze the code in order to identify sections that can be parallelized. This often corresponds to the body of loops. When such a case is identified, the compiler will wrap the body of the code into a special function called a kernel. This function makes it clear that each call to the function is independent from any other call. The function is then compiled to enable it to run on an accelerator. Since each call is independent, each of the thousands of cores of the accelerator can run the function for one specific index in parallel.
Loop:
for (int i=0; i<N; i++)
{
  C[i] = A[i] + B[i];
}
The iterations 0 through N-1 are calculated in order.
Kernel:
void loopBody(A, B, C, i)
{
  C[i] = A[i] + B[i];
}
Each compute core calculates one value of i.
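For readers familiar with CUDA, the transformation is conceptually similar to the hand-written kernel below. This is only a sketch of the idea, not the literal code the compiler generates:
// Conceptual CUDA equivalent: each GPU thread computes one index i.
__global__ void loopBody(const double *A, const double *B, double *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)   // guard threads that fall past the end of the arrays
        C[i] = A[i] + B[i];
}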
The kernels directive
The kernels directive is what we call a descriptive directive. It is used to tell the compiler that the programmer thinks this region can be made parallel. At this point, the compiler is free to do whatever it wants with this information. It can use whichever strategy it thinks is best to run the code, including running it sequentially. Typically, it will
- Analyze the code to try to identify parallelism
- If found, identify which data must be transferred and when
- Create a kernel
- Offload the kernel to the GPU
One example of this directive is the following code:
#pragma acc kernels
{
for (int i=0; i<N; i++)
{
C[i] = A[i] + B[i];
}
}
This example is very simple. However, real code is often not that simple, and we then need to rely on compiler feedback to identify the regions that the compiler failed to parallelize.
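With the PGI compilers used in this tutorial, that feedback is requested with the -Minfo=accel flag, which makes the compiler report, for each accelerator region, whether a kernel was generated and which data movements were implied. A sketch of such an invocation (the file name is a placeholder):
pgcc -fast -ta=tesla -Minfo=accel main.c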
Those who have used OpenMP before will be familiar with the directive-based nature of OpenACC. There is however one major difference between OpenMP and OpenACC directives. OpenMP directives are by design prescriptive in nature. This means that the compiler is required to perform the requested parallelization, no matter whether this is good from a performance standpoint or not. This yields very reproducible results from one compiler to the next. This also means that parallelization will be performed the same way, whatever hardware the code runs on. However, not every architecture performs best with code written the same way; for example, it may sometimes be beneficial to switch the order of loops. If one were to parallelize a code with OpenMP and wanted it to perform optimally on multiple different architectures, they would have to write different sets of directives for each architecture.
In contrast, many of OpenACC's directives are descriptive in nature. This means that the compiler is free to compile the code whichever way it thinks is best for the target architecture. This may even mean that the code is not parallelized at all. The same code, compiled to run on a GPU, on a Xeon Phi, or on a CPU, may therefore yield different binary code. This, of course, means that different compilers may yield different performance. It also means that new generations of compilers will do better than previous generations, especially with new hardware.
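To make the contrast concrete, here is the same loop expressed both ways; a minimal sketch:
// OpenMP: prescriptive. The compiler must parallelize this loop as written.
#pragma omp parallel for
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];

// OpenACC: descriptive. The compiler decides whether and how to
// parallelize the region, and may even run it sequentially.
#pragma acc kernels
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];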
Example: porting a matrix-vector product
For this example, we use the code from the exercises repository. More precisely, we will use a portion of the code from the matrix_functions.h file. The equivalent Fortran code can be found in the subroutine matvec contained in the matrix.F90 file. The original code is the following:
for(int i=0;i<num_rows;i++) {
double sum=0;
int row_start=row_offsets[i];
int row_end=row_offsets[i+1];
for(int j=row_start;j<row_end;j++) {
unsigned int Acol=cols[j];
double Acoef=Acoefs[j];
double xcoef=xcoefs[Acol];
sum+=Acoef*xcoef;
}
ycoefs[i]=sum;
}
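For orientation, these arrays follow the usual CSR (compressed sparse row) layout: for row i, the nonzero values are Acoefs[row_offsets[i]] through Acoefs[row_offsets[i+1]-1], and their column indices are stored in cols. A minimal sketch with a 3x3 matrix:
/* Matrix             CSR representation
 * [ 5 0 0 ]          row_offsets = {0, 1, 3, 4}  // where each row starts
 * [ 2 8 0 ]          cols        = {0, 0, 1, 2}  // column of each nonzero
 * [ 0 0 3 ]          Acoefs      = {5, 2, 8, 3}  // value of each nonzero
 */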
The first change we make to this code to try to run it on the GPU is to add the kernels directive. At this stage, we don't worry about data transfer, or about giving more information to the compiler.
#pragma acc kernels
{
for(int i=0;i<num_rows;i++) {
double sum=0;
int row_start=row_offsets[i];
int row_end=row_offsets[i+1];
for(int j=row_start;j<row_end;j++) {
unsigned int Acol=cols[j];
double Acoef=Acoefs[j];
double xcoef=xcoefs[Acol];
sum+=Acoef*xcoef;
}
ycoefs[i]=sum;
}
}
Building with OpenACC
For the purpose of this tutorial, we use version 16.3 of the Portland Group (PGI) compilers; we note that they are free for academic usage. We use the -ta (target accelerator) flag to enable offloading to accelerators.
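For example, assuming a source file named main.c (a placeholder name), build commands might look like the following; the exact set of supported targets depends on the compiler version and hardware:
pgcc -fast -ta=tesla -Minfo=accel main.c        # offload to an NVIDIA GPU
pgcc -fast -ta=multicore -Minfo=accel main.c    # run the parallel regions on the host CPU cores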
As of May 2016, compiler support for OpenACC is still relatively scarce. OpenACC is pushed by NVIDIA, through its Portland Group division, as well as by Cray, and these two compiler suites offer the most advanced OpenACC support. GNU compiler support for OpenACC exists, but is considered experimental in version 5; it is expected to be officially supported in version 6 of the compiler.