OpenACC Tutorial - Adding directives: Difference between revisions

Reviewed the directives section
(Objectives and description)
(Reviewed the directives section)
Line 25: Line 25:


== OpenACC directives == <!--T:4-->
== OpenACC directives == <!--T:4-->
OpenACC directives are much like OpenMP directives. They take the form of <tt>pragma</tt> in C/C++, and comments in Fortran. There are several advantages to using directives. First, since they involve only minor modifications to the code, changes can be made ''incrementally'', one <tt>pragma</tt> at a time. This is especially useful for debugging purposes, since making a single change at a time allows one to quickly identify which change created a bug. Second, OpenACC support can be disabled at compile time. When OpenACC support is disabled, the <tt>pragma</tt> are treated as comments and ignored by the compiler. This means that a single source code can be used to compile both an accelerated version and a normal version. Third, since all of the offloading work is done by the compiler, the same code can be compiled for various accelerator types: GPUs, MIC (Xeon Phi) or CPUs. It also means that a new generation of devices only requires one to update the compiler, not to change the code.
OpenACC directives are much like [[OpenMP]] directives.
They take the form of <tt>pragma</tt> statements in C/C++, and comments in Fortran.
There are several advantages to using directives:
* First, since directives involve only minor modifications to the code, changes can be made ''incrementally'', one <tt>pragma</tt> at a time. This is especially useful for debugging purposes, since making a single change at a time allows one to quickly identify which change created a bug.
* Second, OpenACC support can be disabled at compile time. When OpenACC support is disabled, the <tt>pragma</tt> statements are treated as comments and ignored by the compiler. This means that a single source code can be compiled into both an accelerated version and a normal version.
* Third, since all of the offloading work is done by the compiler, the same code can be compiled for various accelerator types: GPUs or SIMD instructions on CPUs. It also means that a new generation of devices only requires one to update the compiler, not to change the code.  
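
The second point can be illustrated with a minimal sketch, assuming a fixed vector length <tt>N</tt> and two hypothetical helper functions (<tt>built_with_openacc</tt> and <tt>saxpy_checksum</tt>) that are not part of the tutorial code. It relies on the <tt>_OPENACC</tt> macro, which the OpenACC specification says is defined when support is enabled. The same file should build with or without OpenACC support (for example with or without <tt>-fopenacc</tt> in GCC, or <tt>-acc</tt> in the NVIDIA HPC compilers) and return the same result.

```c
#define N 1000  /* hypothetical vector length, chosen for illustration */

/* Reports whether this build was compiled with OpenACC support.
   _OPENACC is defined by the compiler only when OpenACC is enabled. */
int built_with_openacc(void)
{
#ifdef _OPENACC
    return 1;
#else
    return 0;
#endif
}

/* Initializes two vectors, applies SAXPY (y = a*x + y),
   and returns the sum of the elements of y. */
float saxpy_checksum(float a)
{
    float x[N], y[N];
    float sum = 0.0f;

    /* When OpenACC support is disabled, this pragma is ignored
       and the block below runs as ordinary sequential code. */
    #pragma acc kernels
    {
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }
        for (int i = 0; i < N; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    /* Plain host loop, outside the OpenACC region. */
    for (int i = 0; i < N; i++) {
        sum += y[i];
    }
    return sum;
}
```

A compiler without OpenACC support may warn about an unknown <tt>pragma</tt>, but the code should still compile and behave like the normal version.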


<!--T:5-->
<!--T:5-->
In the following example, we take a code consisting of two loops. The first one initializes two vectors, and the second performs a <tt>[https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1 SAXPY]</tt>, which computes <tt>y = a*x + y</tt>, a scaled vector addition.
In the following example, we take a code consisting of two loops.
The first one initializes two vectors, and the second performs a <tt>[https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1 SAXPY]</tt>, which computes <tt>y = a*x + y</tt>, a scaled vector addition.
</translate>
</translate>


Line 62: Line 68:
<translate>
<translate>
<!--T:6-->
<!--T:6-->
In both the C/C++ and the Fortran cases, the compiler will identify '''two''' kernels. In C/C++, the two kernels will correspond to the inside of each loop. In Fortran, the kernels will be the inside of the first loop, as well as the inside of the implicit loop that Fortran performs when it does an array operation.
In both the C/C++ and the Fortran cases, the compiler will identify '''two''' kernels:
* In C/C++, the two kernels will correspond to the inside of each loop.
* In Fortran, the kernels will be the inside of the first loop, as well as the inside of the implicit loop that Fortran performs when it does an array operation.


<!--T:7-->
<!--T:7-->
Note that in C/C++, the OpenACC block is delimited using curly brackets, while in Fortran, the same comment needs to be repeated, with the <tt>end</tt> keyword added.
Note that in C/C++, the OpenACC block is delimited using curly brackets,
while in Fortran, the same comment needs to be repeated with the <tt>end</tt> keyword added (<tt>!$acc end kernels</tt>).


=== Loops vs Kernels === <!--T:8-->
=== Loops vs Kernels === <!--T:8-->
When the compiler reaches an OpenACC <tt>kernels</tt> directive, it will analyze the code in order to identify sections that can be parallelized. This often corresponds to the body of the loop. When such a case is identified, the compiler will wrap the body of the code into a special function called a ''kernel''. This function makes it clear that each call to the function is independent from any other call. The function is then compiled to enable it to run on an accelerator. Since each call is independent, each one of the thousands of cores of the accelerator can run the function for one specific index in parallel.
 
When the compiler reaches an OpenACC <tt>kernels</tt> directive, it will analyze the code in order to identify sections that can be parallelized.
This often corresponds to the body of a loop that has independent iterations.
When such a case is identified, the compiler will first wrap the body of the loop into a special function called a [https://en.wikipedia.org/wiki/Compute_kernel ''kernel''].
This internal code refactoring makes sure that each call to the kernel is independent from any other call.
The kernel is then compiled to enable it to run on an accelerator.
Since each call is independent, each one of the hundreds of cores of the accelerator can run the kernel for one specific index in parallel.
</translate>
</translate>
{| class="wikitable" width="100%"
{| class="wikitable" width="100%"
|-
|-<translate><!--T:9-->
! <translate><!--T:9-->
! LOOP !! KERNEL</translate>
LOOP</translate> !! <translate><!--T:10-->
KERNEL</translate>
|-
|-
| <syntaxhighlight lang="cpp" line>
| <syntaxhighlight lang="cpp" line>
Line 82: Line 96:
}
}
</syntaxhighlight> || <syntaxhighlight lang="cpp" line>
</syntaxhighlight> || <syntaxhighlight lang="cpp" line>
void loopBody(A,B,C,i)
void loopBody(A, B, C, i)
{
{
   C[i] = A[i] + B[i];
   C[i] = A[i] + B[i];
}
}
</syntaxhighlight>
</syntaxhighlight>
|-
|-<translate><!--T:11-->
|<translate><!--T:11-->
| Calculates sequentially from index <tt>i=0</tt> to <tt>i=N-1</tt>, inclusive. || Each compute core calculates for one value of <tt>i</tt>.</translate>
Calculate 0 - N in order</translate> || <translate><!--T:12-->
Each compute core calculates one value of <tt>i</tt>.</translate>
|}
|}
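
The independence of kernel calls can be checked on the host with a small sketch. It assumes a typed version of the loop body from the table; the names <tt>loop_body</tt>, <tt>add_in_order</tt> and <tt>add_reversed</tt> are hypothetical and the code runs sequentially on the CPU. Since each call only touches index <tt>i</tt>, the calls can be made in any order, which is what allows an accelerator to run them in parallel.

```c
#include <stddef.h>

/* Hypothetical kernel: the loop body for one single index i.
   It reads A[i] and B[i] and writes C[i], nothing else. */
static void loop_body(const float *A, const float *B, float *C, size_t i)
{
    C[i] = A[i] + B[i];
}

/* Sequential loop: indices are visited from 0 to n-1. */
void add_in_order(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        loop_body(A, B, C, i);
    }
}

/* Same kernel calls issued in reverse order, from n-1 down to 0.
   The result is identical because no call depends on another. */
void add_reversed(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        loop_body(A, B, C, i);
    }
}
```

A loop whose body reads a value written by another iteration, such as <tt>C[i] = C[i-1] + A[i]</tt>, would not pass this kind of check and could not be turned into independent kernel calls in this way.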
<translate>


<translate>
== The <tt>kernels</tt> directive == <!--T:13-->
== The <tt>kernels</tt> directive == <!--T:13-->
The <tt>kernels</tt> directive is what we call a ''descriptive'' directive. It is used to tell the compiler that the programmer thinks this region can be made parallel. At this point, the compiler is free to do whatever it wants with this information. It can use whichever strategy it thinks is best to run the code, ''including'' running it sequentially. Typically, it will  
The <tt>kernels</tt> directive is what we call a ''descriptive'' directive. It is used to tell the compiler that the programmer thinks this region can be made parallel. At this point, the compiler is free to do whatever it wants with this information. It can use whichever strategy it thinks is best to run the code, ''including'' running it sequentially. Typically, it will  