OpenACC Tutorial - Profiling


{{Objectives
|title=<translate><!--T:1-->
Learning objectives</translate>
|content=
<translate>
<!--T:2-->
* Understand what a profiler is
* Understand how to use the NVPROF profiler
* Understand how the code is performing
* Understand where to focus your time and rewrite the most time-consuming routines
</translate>
}}
<translate>
== Code profiling == <!--T:8-->
Why would one need to profile code? Because it's the only way to understand:
* Where time is being spent (hotspots)
* How the code is performing
* Where to focus your development time


<!--T:9-->
What is so important about hotspots in the code?
[https://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law] says that
"Parallelizing the most time-consuming routines (i.e. the hotspots) will have the most impact".


== Build the Sample Code == <!--T:10-->
For the following example, we use a code from this [https://github.com/calculquebec/cq-formation-openacc Git repository].
You are invited to [https://github.com/calculquebec/cq-formation-openacc/archive/refs/heads/main.zip download and extract the package], and go to the <code>cpp</code> or the <code>f90</code> directory.
The goal of this example is to compile and link the code, obtain an executable, and then profile it with a profiler.
</translate>
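To download and extract the package from the command line, assuming <code>wget</code> and <code>unzip</code> are available and that the archive extracts to a directory named <code>cq-formation-openacc-main</code> (the usual GitHub naming), the steps could look like this:
{{Command
|wget https://github.com/calculquebec/cq-formation-openacc/archive/refs/heads/main.zip
}}
{{Command
|unzip main.zip && cd cq-formation-openacc-main/cpp
}}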


{{Callout
|title=<translate><!--T:3-->
Which compiler?</translate>
|content=
<translate>
<!--T:4-->
Pushed by [https://www.cray.com/ Cray] and by [https://www.nvidia.com NVIDIA], through its
[https://www.pgroup.com/support/release_archive.php Portland Group] division until 2020 and now through its [https://developer.nvidia.com/hpc-sdk HPC SDK], these two lines of compilers offer the most advanced OpenACC support.
 
<!--T:26-->
As for the [https://gcc.gnu.org/wiki/OpenACC GNU compilers], support for OpenACC 2.x has kept improving since GCC version 6.
As of July 2022, GCC versions 10, 11 and 12 support OpenACC version 2.6.


<!--T:5-->
For the purpose of this tutorial, we use the
[https://developer.nvidia.com/nvidia-hpc-sdk-releases NVIDIA HPC SDK], version 22.7.
Please note that NVIDIA compilers are free for academic usage.  
</translate>
}}
{{Command
|module load nvhpc/22.7
|result=
Lmod is automatically replacing "intel/2020.1.217" with "nvhpc/22.7".
The following have been reloaded with a version change:
  1) gcccore/.9.3.0 => gcccore/.11.3.0        3) openmpi/4.0.3 => openmpi/4.1.4
  2) libfabric/1.10.1 => libfabric/1.15.1    4) ucx/1.8.0 => ucx/1.12.1
}}


{{Command
|make
|result=
nvc++   -c -o main.o main.cpp
nvc++ main.o -o cg.x
}}


<translate>
<!--T:11-->
Once the executable <code>cg.x</code> is created, we are going to profile its source code:
the profiler will measure function calls by executing and monitoring this program.
'''Important:''' this executable uses about 3GB of memory and one CPU core at near 100%.
Therefore, '''a proper test environment should have at least 4GB of available memory and at least two (2) CPU cores'''.
</translate>
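For example, on our clusters such a test environment could be requested as an interactive job (the account name below is a placeholder):
{{Command
|salloc --cpus-per-task=2 --mem=4G --time=1:00:00 --account=def-someuser
}}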


{{Callout
|title=<translate><!--T:6-->
Which profiler?</translate>
|content=
<translate>
<!--T:7-->
For the purpose of this tutorial, we use two profilers:
* '''[https://docs.nvidia.com/cuda/profiler-users-guide/ NVIDIA <code>nvprof</code>]''' - a text-based, command-line profiler that can also analyze non-GPU codes.
* '''[[OpenACC_Tutorial_-_Adding_directives#NVIDIA_Visual_Profiler|NVIDIA Visual Profiler <code>nvvp</code>]]''' - a graphical, cross-platform analysis tool for codes written with OpenACC and CUDA C/C++ instructions.
Since our previously built <code>cg.x</code> does not yet use the GPU, we will start the analysis with the <code>nvprof</code> profiler.
</translate>
}}
<translate>


=== NVIDIA <code>nvprof</code> Command Line Profiler === <!--T:15-->
NVIDIA usually provides <code>nvprof</code> with its HPC SDK,
but the proper version to use on our clusters is included with a CUDA module:
</translate>
{{Command
|module load cuda/11.7
}}
<translate>


<!--T:27-->
To profile a pure CPU executable, we need to add the arguments <code>--cpu-profiling on</code> to the command line:
</translate>
{{Command
|nvprof --cpu-profiling on ./cg.x
|result=
...
<Program output >
...
======== CPU profiling result (bottom up):
Time(%)      Time  Name
 83.54%  90.6757s  matvec(matrix const &, vector const &, vector const &)
 83.54%  90.6757s  {{!}} main
  7.94%  8.62146s  waxpby(double, vector const &, double, vector const &, vector const &)
  7.94%  8.62146s  {{!}} main
  5.86%  6.36584s  dot(vector const &, vector const &)
  5.86%  6.36584s  {{!}} main
  2.47%  2.67666s  allocate_3d_poisson_matrix(matrix&, int)
  2.47%  2.67666s  {{!}} main
  0.13%  140.35ms  initialize_vector(vector&, double)
  0.13%  140.35ms  {{!}} main
...
======== Data collected at 100Hz frequency
}}
<translate>
<!--T:28-->
From the above output, the <code>matvec()</code> function is responsible for 83.5% of the execution time, and this function call can be found in the <code>main()</code> function.
== Compiler Feedback == <!--T:16-->
Before working on the routine, we need to understand what the compiler is actually doing by asking ourselves the following questions:
* What optimizations were applied automatically by the compiler?
* What prevented further optimizations?
* Can very minor modifications of the code affect performance?


<!--T:17-->
The NVIDIA compiler offers a <code>-Minfo</code> flag with the following options:
* <code>all</code> - Print almost all types of compilation information, including:
** <code>accel</code> - Print compiler operations related to the accelerator
** <code>inline</code> - Print information about functions extracted and inlined
** <code>loop,mp,par,stdpar,vect</code> - Print various information about loop optimization and vectorization
* <code>intensity</code> - Print compute intensity information about loops
* (none) - If <code>-Minfo</code> is used without any option, it is the same as with the <code>all</code> option, but without the <code>inline</code> information


=== How to Enable Compiler Feedback === <!--T:18-->
* Edit the <code>Makefile</code>:
  CXX=nvc++
  CXXFLAGS=-fast -Minfo=all,intensity
  LDFLAGS=${CXXFLAGS}


<!--T:29-->
* Rebuild
</translate>
{{Command
|make clean; make
|result=
...
nvc++ -fast -Minfo=all,intensity   -c -o main.o main.cpp
initialize_vector(vector &, double):
     20, include "vector.h"
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
dot(const vector &, const vector &):
     21, include "vector_functions.h"
          27, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
          28, FMA (fused multiply-add) instruction(s) generated
waxpby(double, const vector &, double, const vector &, const vector &):
     21, include "vector_functions.h"
          39, Intensity = 1.00
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 2 times
              FMA (fused multiply-add) instruction(s) generated
          40, FMA (fused multiply-add) instruction(s) generated
allocate_3d_poisson_matrix(matrix &, int):
     22, include "matrix.h"
          43, Intensity = 0.0
              Loop not fused: different loop trip count
          44, Intensity = 0.0
              Loop not vectorized/parallelized: loop count too small
          45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
          57, Intensity = 0.0
          59, Intensity = 0.0
              Loop not vectorized: data dependency
matvec(const matrix &, const vector &, const vector &):
     23, include "matrix_functions.h"
          29, Intensity = (num_rows*((row_end-row_start)*         2))/(num_rows+(num_rows+(num_rows+((row_end-row_start)+(row_end-row_start)))))
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
          37, FMA (fused multiply-add) instruction(s) generated
main:
     38, allocate_3d_poisson_matrix(matrix &, int) inlined, size=41 (inline) file main.cpp (29)
          43, Intensity = 0.0
              Loop not fused: different loop trip count
          44, Intensity = 0.0
              Loop not vectorized/parallelized: loop count too small
          45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
          57, Intensity = 0.0
              Loop not fused: function call before adjacent loop
          59, Intensity = 0.0
              Loop not vectorized: data dependency
     42, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
     43, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
     44, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
     45, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
     46, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
     48, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
     49, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
     52, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.0
              Memory copy idiom, loop replaced by call to __c_mcopy8
     53, matvec(const matrix &, const vector &, const vector &) inlined, size=19 (inline) file main.cpp (20)
          29, Intensity = [symbolic], and not printable, try the -Mpfi -Mpfo options
              Loop not fused: different loop trip count
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
     54, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          27, FMA (fused multiply-add) instruction(s) generated
          36, FMA (fused multiply-add) instruction(s) generated
          39, Intensity = 0.67
              Loop not fused: different loop trip count
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
              FMA (fused multiply-add) instruction(s) generated
     56, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: function call before adjacent loop
              Generated vector simd code for the loop containing reductions
     61, Intensity = 0.0
     62, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.0
              Memory copy idiom, loop replaced by call to __c_mcopy8
     65, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: different controlling conditions
              Generated vector simd code for the loop containing reductions
     67, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.67
              Loop not fused: different loop trip count
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
     72, matvec(const matrix &, const vector &, const vector &) inlined, size=19 (inline) file main.cpp (20)
          29, Intensity = [symbolic], and not printable, try the -Mpfi -Mpfo options
              Loop not fused: different loop trip count
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
     73, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: different loop trip count
              Generated vector simd code for the loop containing reductions
     77, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.67
              Loop not fused: different loop trip count
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
     78, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.67
              Loop not fused: function call before adjacent loop
              Loop not vectorized: data dependency
              Generated vector simd code for the loop
              Loop unrolled 4 times
     88, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
     89, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
     90, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
     91, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
     92, free_matrix(matrix &) inlined, size=5 (inline) file main.cpp (73)
}}
<translate>


=== Interpretation of the Compiler Feedback === <!--T:19-->
The ''Computational Intensity'' of a loop is a measure of how much work is being done compared to memory operations.


<!--T:20-->
<math>\mbox{Computational Intensity} = \frac{\mbox{Compute Operations}}{\mbox{Memory Operations}}</math>


<!--T:21-->
In the compiler feedback, an <code>Intensity</code> <math>\ge</math> 1.0 suggests that the loop might run well on a GPU.
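For instance, in the <code>dot()</code> loop reported above (line 27), each iteration performs two floating-point operations (one multiplication and one addition) and two memory operations (loading one element from each vector), which is one way to arrive at the reported value:

<math>\mbox{Intensity} = \frac{2\ \mbox{compute operations}}{2\ \mbox{memory operations}} = 1.00</math>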


== Understanding the code  == <!--T:22-->
Let's look closely at the main loop in the
[https://github.com/calculquebec/cq-formation-openacc/blob/main/cpp/matrix_functions.h#L29 <code>matvec()</code> function implemented in <code>matrix_functions.h</code>]:
</translate>
<syntaxhighlight lang="cpp" line start="29" highlight="1,5,10,12">
  for(int i=0;i<num_rows;i++) {
    double sum=0;
    int row_start=row_offsets[i];
    int row_end=row_offsets[i+1];
    for(int j=row_start; j<row_end;j++) {
      unsigned int Acol=cols[j];
      double Acoef=Acoefs[j];
      double xcoef=xcoefs[Acol];
      sum+=Acoef*xcoef;
    }
    ycoefs[i]=sum;
  }
</syntaxhighlight>
<translate>
<!--T:23-->
Given the code above, we search for data dependencies:
* Does one loop iteration affect other loop iterations?
** For example, when generating the '''[https://en.wikipedia.org/wiki/Fibonacci_number Fibonacci sequence]''', each new value depends on the previous two values. Therefore, efficient parallelism is very difficult to implement, if not impossible (see the sketch after this list).
* Is the accumulation of values in <code>sum</code> a data dependency?
** No, it’s a '''[https://en.wikipedia.org/wiki/Reduction_operator reduction]'''! And modern compilers are good at optimizing such reductions.
* Do loop iterations read from and write to the same array, such that written values are used or overwritten in other iterations?
** Fortunately, that does not happen in the above code.
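To illustrate the difference, here is a minimal sketch (hypothetical code, not part of the sample application) contrasting a loop-carried dependency with a reduction:
</translate>
<syntaxhighlight lang="cpp">
#include <vector>

// Loop-carried dependency: each iteration reads values written by previous
// iterations (like the Fibonacci sequence), so iterations cannot be run
// independently. Assumes f.size() >= 2.
void fibonacci(std::vector<long> &f) {
  f[0] = 0;
  f[1] = 1;
  for (size_t i = 2; i < f.size(); i++)
    f[i] = f[i-1] + f[i-2];   // depends on the two previous iterations
}

// Reduction: each iteration only accumulates into sum, so partial sums can be
// computed independently and combined; compilers handle this pattern well.
double dot_product(const std::vector<double> &x, const std::vector<double> &y) {
  double sum = 0.0;
  for (size_t i = 0; i < x.size(); i++)
    sum += x[i] * y[i];
  return sum;
}
</syntaxhighlight>
<translate>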


<!--T:25-->
Now that the code analysis is done, we are ready to add OpenACC directives to the code.
 
<!--T:24-->
[[OpenACC Tutorial - Introduction|<- Previous unit: ''Introduction'']] | [[OpenACC Tutorial|^- Back to the lesson plan]] | [[OpenACC Tutorial - Adding directives|Onward to the next unit: ''Adding directives'' ->]]
</translate>
