# OpenACC Tutorial - Profiling

- Understand what a profiler is
- Understand how to use the `nvprof` profiler
- Understand how the code is performing
- Understand where to focus your time and rewrite the most time-consuming routines
## Code profiling
Why would one need to profile code? Because it's the only way to understand:
- Where time is being spent (hotspots)
- How the code is performing
- Where to focus your development time
What is so important about hotspots in the code? Amdahl's law says that "Parallelizing the most time-consuming routines (i.e. the hotspots) will have the most impact".
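To make this concrete, here is a minimal statement of Amdahl's law (the standard formula, not something specific to this tutorial): if a fraction $p$ of the runtime is parallelized and that portion is sped up by a factor $s$, the overall speedup is

$$S = \frac{1}{(1-p) + \frac{p}{s}}$$

For example, if a hotspot accounts for 83% of the runtime ($p = 0.83$) and is accelerated tenfold ($s = 10$), the whole program only runs about $1/(0.17 + 0.083) \approx 4$ times faster. This is why finding the real hotspots is the essential first step.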
## Build the Sample Code
For the following example, we use code from this [Git repository](https://github.com/calculquebec/cq-formation-openacc). You are invited to [download and extract the package](https://github.com/calculquebec/cq-formation-openacc/archive/refs/heads/main.zip), and go to the `cpp` or the `f90` directory. The goal of this example is to compile and link the code, obtain an executable, and then profile its source code with a profiler.
Pushed by [Cray](https://www.cray.com/) and by [NVIDIA](https://www.nvidia.com), through its [Portland Group](https://www.pgroup.com/support/release_archive.php) division until 2020 and now through its [HPC SDK](https://developer.nvidia.com/hpc-sdk), these two lines of compilers offer the most advanced OpenACC support.
As for the [GNU compilers](https://gcc.gnu.org/wiki/OpenACC), support for OpenACC 2.x has kept improving since GCC version 6. As of July 2022, GCC versions 10, 11 and 12 support OpenACC version 2.6.
For the purpose of this tutorial, we use the [NVIDIA HPC SDK](https://developer.nvidia.com/nvidia-hpc-sdk-releases), version 22.7. Please note that NVIDIA compilers are free for academic use.
```
[name@server ~]$ module load nvhpc/22.7
Lmod is automatically replacing "intel/2020.1.217" with "nvhpc/22.7".

The following have been reloaded with a version change:
  1) gcccore/.9.3.0 => gcccore/.11.3.0     3) openmpi/4.0.3 => openmpi/4.1.4
  2) libfabric/1.10.1 => libfabric/1.15.1  4) ucx/1.8.0 => ucx/1.12.1
```
```
[name@server ~]$ make
nvc++ -c -o main.o main.cpp
nvc++ main.o -o cg.x
```
Once the executable `cg.x` is created, we are going to profile its source code: the profiler will measure function calls by executing and monitoring this program. **Important:** this executable uses about 3 GB of memory and one CPU core at near 100%. Therefore, **a proper test environment should have at least 4 GB of available memory and at least two (2) CPU cores**.
For the purpose of this tutorial, we use two profilers:

- **NVIDIA `nvprof`** - a command-line, text-based profiler that can analyze non-GPU codes.
- **NVIDIA Visual Profiler `nvvp`** - a graphical cross-platform analysis tool for code written with OpenACC and CUDA C/C++ instructions.

Since our previously built `cg.x` does not yet use the GPU, we will start the analysis with the `nvprof` profiler.
### NVIDIA `nvprof` Command Line Profiler

NVIDIA usually provides `nvprof` with its HPC SDK, but the proper version to use on our clusters is included with a CUDA module:
```
[name@server ~]$ module load cuda/11.7
```
To profile a pure CPU executable, we need to add the arguments `--cpu-profiling on` to the command line:
```
[name@server ~]$ nvprof --cpu-profiling on ./cg.x
...
<Program output>
...
======== CPU profiling result (bottom up):
Time(%)      Time  Name
 83.54%  90.6757s  matvec(matrix const &, vector const &, vector const &)
 83.54%  90.6757s  | main
  7.94%  8.62146s  waxpby(double, vector const &, double, vector const &, vector const &)
  7.94%  8.62146s  | main
  5.86%  6.36584s  dot(vector const &, vector const &)
  5.86%  6.36584s  | main
  2.47%  2.67666s  allocate_3d_poisson_matrix(matrix&, int)
  2.47%  2.67666s  | main
  0.13%  140.35ms  initialize_vector(vector&, double)
  0.13%  140.35ms  | main
...
======== Data collected at 100Hz frequency
```
From the above output, the `matvec()` function is responsible for 83.5% of the execution time, and this function call can be found in the `main()` function.
## Compiler Feedback
Before working on the routine, we need to understand what the compiler is actually doing by asking ourselves the following questions:
- What optimizations were applied automatically by the compiler?
- What prevented further optimizations?
- Can very minor modifications of the code affect performance? (a classic illustration is sketched below)
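
As an aside on that last question, here is a generic illustration (not code from this tutorial's repository): compilers frequently decline to vectorize a loop because they must assume that pointers may alias each other. Promising non-aliasing with the `__restrict` qualifier, which nvc++, GCC and Clang all accept, is exactly the kind of very minor modification that can change the generated code:

```cpp
// Generic sketch: without __restrict, the compiler may report something
// like "Loop not vectorized: data dependency", since w could alias x or y.
// With __restrict, each pointer is promised to be the only access path
// to its data, so the loop can be vectorized safely.
void waxpby_like(int n, double alpha, const double *__restrict x,
                 double beta, const double *__restrict y,
                 double *__restrict w) {
  for (int i = 0; i < n; i++) {
    w[i] = alpha * x[i] + beta * y[i];
  }
}
```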
The NVIDIA compiler offers a `-Minfo` flag with the following options:

- `all` - Print almost all types of compilation information, including:
  - `accel` - Print compiler operations related to the accelerator
  - `inline` - Print information about functions extracted and inlined
  - `loop,mp,par,stdpar,vect` - Print various information about loop optimization and vectorization
- `intensity` - Print compute intensity information about loops
- (none) - If `-Minfo` is used without any option, it is the same as with the `all` option, but without the `inline` information
### How to Enable Compiler Feedback
- Edit the `Makefile`:

```make
CXX=nvc++
CXXFLAGS=-fast -Minfo=all,intensity
LDFLAGS=${CXXFLAGS}
```
- Rebuild
```
[name@server ~]$ make clean; make
...
nvc++ -fast -Minfo=all,intensity -c -o main.o main.cpp
initialize_vector(vector &, double):
20, include "vector.h"
36, Intensity = 0.0
Memory set idiom, loop replaced by call to __c_mset8
dot(const vector &, const vector &):
21, include "vector_functions.h"
27, Intensity = 1.00
Generated vector simd code for the loop containing reductions
28, FMA (fused multiply-add) instruction(s) generated
waxpby(double, const vector &, double, const vector &, const vector &):
21, include "vector_functions.h"
39, Intensity = 1.00
Loop not vectorized: data dependency
Generated vector simd code for the loop
Loop unrolled 2 times
FMA (fused multiply-add) instruction(s) generated
40, FMA (fused multiply-add) instruction(s) generated
allocate_3d_poisson_matrix(matrix &, int):
22, include "matrix.h"
43, Intensity = 0.0
Loop not fused: different loop trip count
44, Intensity = 0.0
Loop not vectorized/parallelized: loop count too small
45, Intensity = 0.0
Loop unrolled 3 times (completely unrolled)
57, Intensity = 0.0
59, Intensity = 0.0
Loop not vectorized: data dependency
matvec(const matrix &, const vector &, const vector &):
23, include "matrix_functions.h"
29, Intensity = (num_rows*((row_end-row_start)* 2))/(num_rows+(num_rows+(num_rows+((row_end-row_start)+(row_end-row_start)))))
33, Intensity = 1.00
Generated vector simd code for the loop containing reductions
37, FMA (fused multiply-add) instruction(s) generated
main:
38, allocate_3d_poisson_matrix(matrix &, int) inlined, size=41 (inline) file main.cpp (29)
43, Intensity = 0.0
Loop not fused: different loop trip count
44, Intensity = 0.0
Loop not vectorized/parallelized: loop count too small
45, Intensity = 0.0
Loop unrolled 3 times (completely unrolled)
57, Intensity = 0.0
Loop not fused: function call before adjacent loop
59, Intensity = 0.0
Loop not vectorized: data dependency
42, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
43, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
44, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
45, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
46, allocate_vector(vector &, unsigned int) inlined, size=3 (inline) file main.cpp (24)
48, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
36, Intensity = 0.0
Memory set idiom, loop replaced by call to __c_mset8
49, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
36, Intensity = 0.0
Memory set idiom, loop replaced by call to __c_mset8
52, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
39, Intensity = 0.0
Memory copy idiom, loop replaced by call to __c_mcopy8
53, matvec(const matrix &, const vector &, const vector &) inlined, size=19 (inline) file main.cpp (20)
29, Intensity = [symbolic], and not printable, try the -Mpfi -Mpfo options
Loop not fused: different loop trip count
33, Intensity = 1.00
Generated vector simd code for the loop containing reductions
54, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
27, FMA (fused multiply-add) instruction(s) generated
36, FMA (fused multiply-add) instruction(s) generated
39, Intensity = 0.67
Loop not fused: different loop trip count
Loop not vectorized: data dependency
Generated vector simd code for the loop
Loop unrolled 4 times
FMA (fused multiply-add) instruction(s) generated
56, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
27, Intensity = 1.00
Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
61, Intensity = 0.0
62, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
39, Intensity = 0.0
Memory copy idiom, loop replaced by call to __c_mcopy8
65, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
27, Intensity = 1.00
Loop not fused: different controlling conditions
Generated vector simd code for the loop containing reductions
67, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
39, Intensity = 0.67
Loop not fused: different loop trip count
Loop not vectorized: data dependency
Generated vector simd code for the loop
Loop unrolled 4 times
72, matvec(const matrix &, const vector &, const vector &) inlined, size=19 (inline) file main.cpp (20)
29, Intensity = [symbolic], and not printable, try the -Mpfi -Mpfo options
Loop not fused: different loop trip count
33, Intensity = 1.00
Generated vector simd code for the loop containing reductions
73, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
27, Intensity = 1.00
Loop not fused: different loop trip count
Generated vector simd code for the loop containing reductions
77, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
39, Intensity = 0.67
Loop not fused: different loop trip count
Loop not vectorized: data dependency
Generated vector simd code for the loop
Loop unrolled 4 times
78, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
39, Intensity = 0.67
Loop not fused: function call before adjacent loop
Loop not vectorized: data dependency
Generated vector simd code for the loop
Loop unrolled 4 times
88, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
89, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
90, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
91, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
92, free_matrix(matrix &) inlined, size=5 (inline) file main.cpp (73)
```
### Interpretation of the Compiler Feedback
The *Computational Intensity* of a loop is a measure of how much work is being done compared to memory operations. Basically:

$$\text{Computational Intensity} = \frac{\text{Compute Operations}}{\text{Memory Operations}}$$

In the compiler feedback, an `Intensity` $\ge$ 1.0 suggests that the loop might run well on a GPU.
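As a quick sanity check of the feedback above, we can count operations by hand (assuming `sum` stays in a register, which is what the compiler normally does). In the `dot()` loop, each iteration performs one multiply and one add (2 compute operations) and loads one element from each of the two vectors (2 memory operations):

$$\text{Intensity}_{dot} = \frac{2}{2} = 1.00$$

This matches the `Intensity = 1.00` reported at line 27. A similar count for the `waxpby()` loop, 3 compute operations (two multiplies and one add) against 3 memory operations (two loads and one store), also gives the reported 1.00.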
## Understanding the code
Let's look closely at the main loop in the [`matvec()` function implemented in `matrix_functions.h`](https://github.com/calculquebec/cq-formation-openacc/blob/main/cpp/matrix_functions.h#L29):
```cpp
for(int i=0;i<num_rows;i++) {
  double sum=0;
  int row_start=row_offsets[i];
  int row_end=row_offsets[i+1];
  for(int j=row_start; j<row_end;j++) {
    unsigned int Acol=cols[j];
    double Acoef=Acoefs[j];
    double xcoef=xcoefs[Acol];
    sum+=Acoef*xcoef;
  }
  ycoefs[i]=sum;
}
```
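For orientation, this loop walks a sparse matrix stored in CSR (Compressed Sparse Row) format. The sketch below is inferred from the names used in the loop; the exact types and definitions are assumptions, since the real ones live in the repository's `matrix.h`:

```cpp
// Hypothetical sketch of the CSR layout implied by the loop above
// (not the repository's actual definition).
struct matrix {
  unsigned int num_rows;  // number of rows
  int *row_offsets;       // nonzeros of row i span [row_offsets[i], row_offsets[i+1])
  unsigned int *cols;     // column index of each nonzero
  double *Acoefs;         // value of each nonzero
};
```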
Given the code above, we search for data dependencies:

- Does one loop iteration affect other loop iterations?
  - For example, when generating the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number), each new value depends on the previous two values. Therefore, efficient parallelism is very difficult to implement, if not impossible.
- Is the accumulation of values in `sum` a data dependency?
  - No, it's a [reduction](https://en.wikipedia.org/wiki/Reduction_operator)! And modern compilers are good at optimizing such reductions (see the sketch after this list).
- Do loop iterations read from and write to the same array, such that written values are used or overwritten in other iterations?
  - Fortunately, that does not happen in the above code.
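As a hedged preview of the next unit, here is one possible way to annotate the `matvec()` loop nest, declaring the accumulation into `sum` as a reduction. This is a sketch of the general technique; the directives actually used in this tutorial are introduced in the next unit:

```cpp
// Sketch only: the independent outer iterations run in parallel,
// and the inner accumulation is declared as a sum reduction.
#pragma acc parallel loop
for (int i = 0; i < num_rows; i++) {
  double sum = 0;
  int row_start = row_offsets[i];
  int row_end = row_offsets[i + 1];
  #pragma acc loop reduction(+:sum)
  for (int j = row_start; j < row_end; j++) {
    sum += Acoefs[j] * xcoefs[cols[j]];
  }
  ycoefs[i] = sum;
}
```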
Now that the code analysis is done, we are ready to add compiler directives to the code.
<- Previous unit: Introduction | ^- Back to the lesson plan | Onward to the next unit: Adding directives ->