OpenACC Tutorial - Adding directives

== How is the ported code performing? ==
Since we have completed a first step of porting the code to the GPU, we need to analyze how the code performs, and whether it gives the correct results. Running the original version of the code yields the following (performed on one of Guillimin's GPU nodes):
</translate>
[[File:Openacc profiling1.png|thumbnail|<translate>Click to enlarge</translate>]]
<translate>
{{Command
|./cg.x 
Total Iterations: 100 Total Time: 115.068931s
}}
The results are correct. However, not only do we not get any speedup, we actually get a slowdown by a factor of almost 4! Let's profile the code again using NVIDIA's visual profiler (<tt>nvvp</tt>). This can be done with the following steps:
# Start <tt>nvvp</tt> with the command <tt>nvvp &</tt> (the <tt>&</tt> sign starts it in the background)


== The <tt>parallel loop</tt> directive ==
With the <tt>kernels</tt> directive, we let the compiler do all of the analysis. This is the ''descriptive'' approach to porting code. OpenACC supports a ''prescriptive'' approach through a different directive, called the <tt>parallel</tt> directive. It can be combined with the <tt>loop</tt> directive to form the <tt>parallel loop</tt> directive. An example would be the following code:
</translate>
<syntaxhighlight lang="cpp" line>
#pragma acc parallel loop
for (int i=0; i<N; i++)
{
  C[i] = A[i] + B[i];
}
</syntaxhighlight>
<translate>
Since <tt>parallel loop</tt> is a ''prescriptive'' directive, it forces the compiler to perform the loop in parallel. This means that the <tt>independent</tt> clause introduced above is implicit within a parallel region.
For reasons that we explain below, in order to use this directive in the matrix-vector product example, we need to introduce additional clauses used to manage the scope of data. The <tt>private</tt> and <tt>reduction</tt> clauses control how the data flows through a parallel region.
* With the <tt>private</tt> clause, a copy of the variable is made for each loop iteration, making the value of the variable independent from other iterations.
* With the <tt>reduction</tt> clause, the values of a variable in each iteration are ''reduced'' to a single value. It supports addition (+), multiplication (*), maximum (max) and minimum (min), among other operations.
These clauses were not required with the <tt>kernels</tt> directive, because the <tt>kernels</tt> directive handles this for you.
Going back to the matrix-vector multiplication example, the corresponding code with the <tt>parallel loop</tt> directive would look like this:
</translate>
<syntaxhighlight lang="cpp" line>
#pragma acc parallel loop
  for(int i=0;i<num_rows;i++) {
    double sum=0;
    int row_start=row_offsets[i];
    int row_end=row_offsets[i+1];
#pragma acc loop reduction(+:sum)
    for(int j=row_start;j<row_end;j++) {
      unsigned int Acol=cols[j];
      double Acoef=Acoefs[j];
      double xcoef=xcoefs[Acol];
      sum+=Acoef*xcoef;
    }
    ycoefs[i]=sum;
  }
</syntaxhighlight>
<translate>
Compiling this code yields the following compiler feedback:
</translate>
{{Command
|pgc++ -fast -Minfo{{=}}accel -ta{{=}}tesla:managed main.cpp -o challenge
|result=
matvec(const matrix &, const vector &, const vector &):
    23, include "matrix_functions.h"
          27, Accelerator kernel generated
              Generating Tesla code
              29, #pragma acc loop gang /* blockIdx.x */
              34, #pragma acc loop vector(128) /* threadIdx.x */
                  Sum reduction generated for sum
          27, Generating copyout(ycoefs[:num_rows])
              Generating copyin(xcoefs[:],Acoefs[:],cols[:],row_offsets[:num_rows+1])
          34, Loop is parallelizable
}}
== Parallel loop vs Kernels ==
{| class="wikitable" width="100%"
|-
! Parallel loop !! Kernels
|-
|
<translate>
* It is the programmer's responsibility to ensure that parallelism is safe
* Enables parallelization of sections that the compiler may miss
* Straightforward path from OpenMP
</translate>
||
<translate>
* It is the compiler's responsibility to analyze the code and determine what is safe to parallelize
* A single directive can cover a large area of code
* The compiler has more room to optimize
</translate>
|}
<translate>
Both approaches are equally valid and can perform equally well.


[[OpenACC Tutorial|Back to the lesson plan]]
</translate>