OpenACC Tutorial - Profiling

<!--T:2-->
* Understand what a profiler is
* Understand how to use the NVPROF profiler
* Understand how the code is performing
* Understand where to focus your time and rewrite the most time-consuming routines
</translate>
}}
<translate>
== Code profiling == <!--T:8-->
Why would one need to profile code? Because it's the only way to understand:
* Where time is being spent (hotspots)
* How the code is performing
* Where to focus your development time


<!--T:9-->
What is so important about hotspots in the code?
[https://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law] says that
"Parallelizing the most time-consuming routines (i.e. the hotspots) will have the most impact".
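
This can be quantified. A common formulation of Amdahl's law gives the maximum overall speedup <math>S</math> when a fraction <math>p</math> of the total runtime is parallelized over <math>N</math> processes:

<math>S(N) = \frac{1}{(1 - p) + \frac{p}{N}}</math>

For example, perfectly parallelizing a hotspot that accounts for 95% of the runtime (<math>p = 0.95</math>) over 8 cores gives a speedup of about 5.9, whereas parallelizing a routine that accounts for only 10% of the runtime can never give a speedup greater than about 1.1, no matter how many cores are used.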
== Build the Sample Code == <!--T:10-->
For the following example, we use code from this [https://github.com/calculquebec/cq-formation-openacc Git repository].
You are invited to [https://github.com/calculquebec/cq-formation-openacc/archive/refs/heads/main.zip download and extract the package], and go to the <code>cpp</code> or the <code>f90</code> directory.
The object of this example is to compile and link the code, obtain an executable, and then profile its source code with a profiler.
</translate>


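If you are working in a terminal, the steps below are one possible way to fetch and unpack the code before building it; the name of the extracted directory is an assumption based on the <code>main</code> branch archive:
  wget https://github.com/calculquebec/cq-formation-openacc/archive/refs/heads/main.zip
  unzip main.zip
  cd cq-formation-openacc-main/cpp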
{{Callout
|title=<translate><!--T:3-->
<translate>
<!--T:4-->
Being pushed by [https://www.cray.com/ Cray] and by [https://www.nvidia.com NVIDIA] through its
[https://www.pgroup.com/support/release_archive.php Portland Group] division until 2020 and now through its [https://developer.nvidia.com/hpc-sdk HPC SDK], these two lines of compilers offer the most advanced OpenACC support.
 
<!--T:26-->
As for the [https://gcc.gnu.org/wiki/OpenACC GNU compilers], support for OpenACC 2.x has kept improving since GCC version 6.
As of July 2022, GCC versions 10, 11 and 12 support OpenACC version 2.6.


<!--T:5-->
For the purpose of this tutorial, we use the
[https://developer.nvidia.com/nvidia-hpc-sdk-releases NVIDIA HPC SDK], version 22.7.
Please note that NVIDIA compilers are free for academic usage.  
</translate>
}}
{{Command
|module load nvhpc/22.7
|result=
Lmod is automatically replacing "intel/2020.1.217" with "nvhpc/22.7".
The following have been reloaded with a version change:
  1) gcccore/.9.3.0 => gcccore/.11.3.0        3) openmpi/4.0.3 => openmpi/4.1.4
  2) libfabric/1.10.1 => libfabric/1.15.1    4) ucx/1.8.0 => ucx/1.12.1
}}


|make
|result=
nvc++   -c -o main.o main.cpp
nvc++ main.o -o cg.x
}}
<translate>
<!--T:11-->
Once the executable <code>cg.x</code> is created, we are going to profile its source code:
the profiler will measure function calls by executing and monitoring this program.
'''Important:''' this executable uses about 3GB of memory and one CPU core at nearly 100%.
Therefore, '''a proper test environment should have at least 4GB of available memory and at least two (2) CPU cores'''.
</translate>


{{Callout
|title=<translate><!--T:6-->
<translate>
<!--T:7-->
For the purpose of this tutorial, we use two profilers:
* '''[https://docs.nvidia.com/cuda/profiler-users-guide/ NVIDIA <code>nvprof</code>]''' - a command line text-based profiler that can analyze non-GPU codes.
* '''[[OpenACC_Tutorial_-_Adding_directives#NVIDIA_Visual_Profiler|NVIDIA Visual Profiler <code>nvvp</code>]]''' - a graphical cross-platform analyzing tool for codes written with OpenACC and CUDA C/C++ instructions.
Since our previously built <code>cg.x</code> is not yet using the GPU, we will start the analysis with the <code>nvprof</code> profiler.
</translate>
}}
<translate>


=== NVIDIA <code>nvprof</code> Command Line Profiler === <!--T:15-->
NVIDIA usually provides <code>nvprof</code> with its HPC SDK,
but the proper version to use on our clusters is included with a CUDA module:
</translate>
{{Command
|module load cuda/11.7
}}
<translate>


<!--T:27-->
To profile a pure CPU executable, we need to add the arguments <code>--cpu-profiling on</code> to the command line:
</translate>
{{Command
|nvprof --cpu-profiling on ./cg.x
|result=
...
<Program output >
...
======== CPU profiling result (bottom up):
Time(%)      Time  Name
 83.54%  90.6757s  matvec(matrix const &, vector const &, vector const &)
 83.54%  90.6757s  {{!}} main
  7.94%  8.62146s  waxpby(double, vector const &, double, vector const &, vector const &)
  7.94%  8.62146s  {{!}} main
  5.86%  6.36584s  dot(vector const &, vector const &)
  5.86%  6.36584s  {{!}} main
  2.47%  2.67666s  allocate_3d_poisson_matrix(matrix&, int)
  2.47%  2.67666s  {{!}} main
  0.13%  140.35ms  initialize_vector(vector&, double)
  0.13%  140.35ms  {{!}} main
...
======== Data collected at 100Hz frequency
}}
<translate>
<!--T:28-->
From the above output, the <code>matvec()</code> function is responsible for 83.5% of the execution time, and this function call can be found in the <code>main()</code> function.


== Compiler Feedback == <!--T:16-->
Before working on the routine, we need to understand what the compiler is actually doing by asking ourselves the following questions:
* What optimizations were applied automatically by the compiler?
* What prevented further optimizations?
* Can very minor modifications of the code affect performance?


<!--T:17-->
The NVIDIA compiler offers a <code>-Minfo</code> flag with the following options:
* <code>all</code> - Print almost all types of compilation information, including:
** <code>accel</code> - Print compiler operations related to the accelerator
** <code>inline</code> - Print information about functions extracted and inlined
** <code>loop,mp,par,stdpar,vect</code> - Print various information about loop optimization and vectorization
* <code>intensity</code> - Print compute intensity information about loops
* (none) - If <code>-Minfo</code> is used without any option, it is the same as with the <code>all</code> option, but without the <code>inline</code> information


=== How to Enable Compiler Feedback === <!--T:18-->
* Edit the <code>Makefile</code>:
  CXX=nvc++
  CXXFLAGS=-fast -Minfo=all,intensity
  LDFLAGS=${CXXFLAGS}

<!--T:29-->
* Rebuild
</translate>
{{Command
|make clean; make
|result=
...
nvc++ -fast -Minfo=all,intensity  -c -o main.o main.cpp
initialize_vector(vector &, double):
     20, include "vector.h"
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
dot(const vector &, const vector &):
     21, include "vector_functions.h"
          27, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
          28, FMA (fused multiply-add) instruction(s) generated
waxpby(double, const vector &, double, const vector &, const vector &):
     21, include "vector_functions.h"
          39, Intensity = 1.00
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 2 times
              FMA (fused multiply-add) instruction(s) generated
          40, FMA (fused multiply-add) instruction(s) generated
allocate_3d_poisson_matrix(matrix &, int):
     22, include "matrix.h"
          43, Intensity = 0.0
              Loop not fused: different loop trip count
          44, Intensity = 0.0
              Loop not vectorized/parallelized: loop count too small
          59, Intensity = 0.0
              Loop not vectorized: data dependency
matvec(const matrix &, const vector &, const vector &):
     23, include "matrix_functions.h"
          29, Intensity = (num_rows*((row_end-row_start)*2))/(num_rows+(num_rows+(num_rows+((row_end-row_start)+(row_end-row_start)))))
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
          37, FMA (fused multiply-add) instruction(s) generated
main:
     38, allocate_3d_poisson_matrix(matrix &, int) inlined, size=41 (inline) file main.cpp (29)
          43, Intensity = 0.0
              Loop not fused: different loop trip count
          44, Intensity = 0.0
              Loop not vectorized/parallelized: loop count too small
          45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
          57, Intensity = 0.0
              Loop not fused: function call before adjacent loop
          59, Intensity = 0.0
              Loop not vectorized: data dependency
    42, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
    43, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
    44, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
    45, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
    46, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
    48, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
    49, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
    52, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.0
              Memory copy idiom, loop replaced by call to __c_mcopy8
    53, matvec(const matrix &, const vector &, const vector &) inlined, size=19 (inline) file main.cpp (20)
          29, Intensity = [symbolic], and not printable, try the -Mpfi -Mpfo options
              Loop not fused: different loop trip count
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
    54, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          27, FMA (fused multiply-add) instruction(s) generated
          36, FMA (fused multiply-add) instruction(s) generated
          39, Intensity = 0.67
              Loop not fused: different loop trip count
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
              FMA (fused multiply-add) instruction(s) generated
    56, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: function call before adjacent loop
              Generated vector simd code for the loop containing reductions
    61, Intensity = 0.0
    62, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.0
              Memory copy idiom, loop replaced by call to __c_mcopy8
    65, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: different controlling conditions
              Generated vector simd code for the loop containing reductions
    67, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.67
              Loop not fused: different loop trip count
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
    72, matvec(const matrix &, const vector &, const vector &) inlined, size=19 (inline) file main.cpp (20)
          29, Intensity = [symbolic], and not printable, try the -Mpfi -Mpfo options
              Loop not fused: different loop trip count
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
    73, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: different loop trip count
              Generated vector simd code for the loop containing reductions
    77, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.67
              Loop not fused: different loop trip count
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
    78, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.67
              Loop not fused: function call before adjacent loop
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
    88, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
    89, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
    90, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
    91, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
    92, free_matrix(matrix &) inlined, size=5 (inline) file main.cpp (73)
}}
<translate>


=== Interpretation of the Compiler Feedback === <!--T:19-->
The ''Computational Intensity'' of a loop is a measure of how much work is being done compared to memory operations.

<!--T:20-->
<math>\mbox{Computational Intensity} = \frac{\mbox{Compute Operations}}{\mbox{Memory Operations}}</math>


<!--T:21-->
In the compiler feedback, an <code>Intensity</code> <math>\ge</math> 1.0 suggests that the loop might run well on a GPU.
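To make this measure concrete, here is a small hand-counting sketch; it is not part of the sample code, and the compiler's own counting may differ in detail:
<syntaxhighlight lang="cpp">
// Hypothetical waxpby-style loop: w[i] = alpha * x[i] + beta * y[i]
void waxpby_like(int n, double alpha, const double *x,
                 double beta, const double *y, double *w) {
  for (int i = 0; i < n; i++) {
    // Per iteration: 3 compute operations (2 multiplications + 1 addition)
    //                3 memory operations  (loads of x[i] and y[i], store of w[i])
    // Computational Intensity ~ 3 / 3 = 1.0
    w[i] = alpha * x[i] + beta * y[i];
  }
}
</syntaxhighlight>
This is consistent with the <code>Intensity = 1.00</code> reported above for the <code>waxpby()</code> loop.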


== Understanding the code  == <!--T:22-->
Let's look closely at the main loop in the
[https://github.com/calculquebec/cq-formation-openacc/blob/main/cpp/matrix_functions.h#L29 <code>matvec()</code> function implemented in <code>matrix_functions.h</code>]:
</translate>
<syntaxhighlight lang="cpp" line start="29" highlight="1,5,10,12">
  for(int i=0;i<num_rows;i++) {
    double sum=0;
    int row_start=row_offsets[i];
    int row_end=row_offsets[i+1];
    for(int j=row_start; j<row_end;j++) {
      unsigned int Acol=cols[j];
      double Acoef=Acoefs[j];
      double xcoef=xcoefs[Acol];
      sum+=Acoef*xcoef;
    }
    ycoefs[i]=sum;
  }
</syntaxhighlight>
<translate>
<!--T:23-->
Given the code above, we search for data dependencies:
* Does one loop iteration affect other loop iterations?
** For example, when generating the '''[https://en.wikipedia.org/wiki/Fibonacci_number Fibonacci sequence]''', each new value depends on the previous two values. Therefore, efficient parallelism is very difficult to implement, if not impossible.
* Is the accumulation of values in <code>sum</code> a data dependency?
** No, it’s a '''[https://en.wikipedia.org/wiki/Reduction_operator reduction]'''! And modern compilers are good at optimizing such reductions (see the sketch after this list).
* Do loop iterations read from and write to the same array, such that written values are used or overwritten in other iterations?
** Fortunately, that does not happen in the above code.
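The following minimal sketch (not taken from the sample code) contrasts a true loop-carried dependency with a reduction:
<syntaxhighlight lang="cpp">
#include <vector>

// True data dependency: each value needs the two previous ones,
// so the iterations cannot be computed independently.
std::vector<long> fibonacci(int n) {
  std::vector<long> f(n, 1);
  for (int i = 2; i < n; i++)
    f[i] = f[i - 1] + f[i - 2];   // reads results of earlier iterations
  return f;
}

// Reduction: the partial sums can be accumulated in any order
// (up to floating-point rounding), so this loop can be parallelized.
double dot_like(int n, const double *x, const double *y) {
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += x[i] * y[i];           // accumulation into a single scalar
  return sum;
}
</syntaxhighlight>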


<!--T:25-->
Now that the code analysis is done, we are ready to add directives for the compiler.
 
<!--T:24-->
[[OpenACC Tutorial - Introduction|<- Previous unit: ''Introduction'']] | [[OpenACC Tutorial|^- Back to the lesson plan]] | [[OpenACC Tutorial - Adding directives|Onward to the next unit: ''Adding directives'' ->]]
</translate>