<!--T:2-->
* Understand what a profiler is
* Understand how to use the <code>nvprof</code> profiler
* Understand how the code is performing
* Understand where to focus your time and rewrite the most time-consuming routines
}}
<translate>
== Code profiling == <!--T:8-->
Why would one need to profile code? Because it's the only way to understand:
* Where time is being spent (hotspots)
* How the code is performing
* Where to focus your development time

<!--T:9-->
What is so important about hotspots in the code?
[https://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law] says that "Parallelizing the most time-consuming routines (i.e. the hotspots) will have the most impact".
== Build the Sample Code == <!--T:10-->
For the following example, we use code from this [https://github.com/calculquebec/cq-formation-openacc Git repository].
You are invited to [https://github.com/calculquebec/cq-formation-openacc/archive/refs/heads/main.zip download and extract the package], and go to the <code>cpp</code> or the <code>f90</code> directory.
The goal of this example is to compile and link the code, obtain an executable, and then profile its source code with a profiler.
</translate>
{{Callout
|title=<translate><!--T:3-->
<translate>
<!--T:4-->
Promoted by [https://www.cray.com/ Cray] and by [https://www.nvidia.com NVIDIA] (through its [https://www.pgroup.com/support/release_archive.php Portland Group] division until 2020 and now through its [https://developer.nvidia.com/hpc-sdk HPC SDK]), these two families of compilers offer the most advanced OpenACC support.

<!--T:26-->
As for the [https://gcc.gnu.org/wiki/OpenACC GNU compilers], support for OpenACC 2.x has kept improving since GCC version 6.
As of July 2022, GCC versions 10, 11 and 12 support OpenACC version 2.6.
<!--T:5-->
For the purpose of this tutorial, we use the
[https://developer.nvidia.com/nvidia-hpc-sdk-releases NVIDIA HPC SDK], version 22.7.
Please note that NVIDIA compilers are free for academic usage.
</translate>
}}
{{Command
|module load nvhpc/22.7
|result=
Lmod is automatically replacing "intel/2020.1.217" with "nvhpc/22.7".

The following have been reloaded with a version change:
  1) gcccore/.9.3.0 => gcccore/.11.3.0        3) openmpi/4.0.3 => openmpi/4.1.4
  2) libfabric/1.10.1 => libfabric/1.15.1     4) ucx/1.8.0 => ucx/1.12.1
}}
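If it has not been built yet, the executable <code>cg.x</code> should be produced by the Makefile provided in the repository (a minimal sketch, assuming the default Makefile settings work with the loaded module; the compiler variables are revisited later on this page):
{{Command
|make
}}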
<translate>
<!--T:11-->
Once the executable <code>cg.x</code> is created, we are going to profile its source code:
the profiler will measure function calls by executing and monitoring this program.
'''Important:''' this executable uses about 3GB of memory and one CPU core at nearly 100%.
Therefore, '''a proper test environment should have at least 4GB of available memory and at least two (2) CPU cores'''.
</translate>
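For example, on our clusters, such a test environment could be requested as an interactive job (a sketch only; <code>def-someuser</code> is a placeholder for your own account name):
{{Command
|salloc --cpus-per-task=2 --mem=4000M --time=1:00:00 --account=def-someuser
}}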
<translate>
<!--T:7-->
For the purpose of this tutorial, we use two profilers:
* '''[https://docs.nvidia.com/cuda/profiler-users-guide/ NVIDIA <code>nvprof</code>]''' - a text-based command-line profiler that can also analyze non-GPU codes.
* '''[[OpenACC_Tutorial_-_Adding_directives#NVIDIA_Visual_Profiler|NVIDIA Visual Profiler <code>nvvp</code>]]''' - a cross-platform graphical analysis tool for codes written with OpenACC and CUDA C/C++ instructions.
Since our previously built <code>cg.x</code> does not use the GPU yet, we will start the analysis with the <code>nvprof</code> profiler.
</translate>
}}
<translate>
=== NVIDIA <code>nvprof</code> Command Line Profiler === <!--T:15-->
NVIDIA usually provides <code>nvprof</code> with its HPC SDK,
but the proper version to use on our clusters is included with a CUDA module:
</translate>
{{Command
|module load cuda/11.7
}}
<translate>
<!--T:27-->
To profile a pure CPU executable, we need to add the arguments <code>--cpu-profiling on</code> to the command line:
</translate>
{{Command
|nvprof --cpu-profiling on ./cg.x
|result=
...
<Program output>
...
======== CPU profiling result (bottom up):
Time(%)      Time  Name
 83.54%  90.6757s  matvec(matrix const &, vector const &, vector const &)
 83.54%  90.6757s  {{!}} main
  7.94%  8.62146s  waxpby(double, vector const &, double, vector const &, vector const &)
  7.94%  8.62146s  {{!}} main
  5.86%  6.36584s  dot(vector const &, vector const &)
  5.86%  6.36584s  {{!}} main
  2.47%  2.67666s  allocate_3d_poisson_matrix(matrix&, int)
  2.47%  2.67666s  {{!}} main
  0.13%  140.35ms  initialize_vector(vector&, double)
  0.13%  140.35ms  {{!}} main
...
======== Data collected at 100Hz frequency
}}
<translate>
<!--T:28-->
From the above output, the <code>matvec()</code> function is responsible for 83.5% of the execution time, and it is called from the <code>main()</code> function.
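If a top-down view of the call stack is preferred, <code>nvprof</code> can present the same data the other way around (assuming the <code>--cpu-profiling-mode</code> option is available in the installed CUDA version):
</translate>
{{Command
|nvprof --cpu-profiling on --cpu-profiling-mode top-down ./cg.x
}}
<translate>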
== Compiler Feedback == <!--T:16-->
Before working on the routine, we need to understand what the compiler is actually doing by asking ourselves the following questions:
* What optimizations were applied automatically by the compiler?
* What prevented further optimizations?
* Can very minor modifications of the code affect performance?

<!--T:17-->
The NVIDIA compiler offers a <code>-Minfo</code> flag with the following options (a usage example follows the list):
* <code>all</code> - Print almost all types of compilation information, including:
** <code>accel</code> - Print compiler operations related to the accelerator
** <code>inline</code> - Print information about functions extracted and inlined
** <code>loop,mp,par,stdpar,vect</code> - Print various information about loop optimization and vectorization
* <code>intensity</code> - Print compute intensity information about loops
* (none) - If <code>-Minfo</code> is used without any option, it is the same as with the <code>all</code> option, but without the <code>inline</code> information
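For instance, the effect of these flags can be previewed on a single source file before modifying the build process (same flags as in the modified <code>Makefile</code> below):
</translate>
{{Command
|nvc++ -fast -Minfo=all,intensity -c main.cpp
}}
<translate>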
=== How to Enable Compiler Feedback === <!--T:18-->
* Edit the <code>Makefile</code>:
 CXX=nvc++
 CXXFLAGS=-fast -Minfo=all,intensity
 LDFLAGS=${CXXFLAGS}

<!--T:29-->
* Rebuild
</translate>
{{Command
|make clean; make
|result=
...
nvc++ -fast -Minfo=all,intensity -c -o main.o main.cpp
initialize_vector(vector &, double):
     20, include "vector.h"
          27, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
          28, FMA (fused multiply-add) instruction(s) generated
waxpby(double, const vector &, double, const vector &, const vector &):
     21, include "vector_functions.h"
              Loop unrolled 2 times
              FMA (fused multiply-add) instruction(s) generated
          40, FMA (fused multiply-add) instruction(s) generated
allocate_3d_poisson_matrix(matrix &, int):
     22, include "matrix.h"
              Loop not vectorized/parallelized: loop count too small
          45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
          57, Intensity = 0.0
          59, Intensity = 0.0
     23, include "matrix_functions.h"
          29, Intensity = (num_rows*((row_end-row_start)* 2))/(num_rows+(num_rows+(num_rows+((row_end-row_start)+(row_end-row_start)))))
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
          37, FMA (fused multiply-add) instruction(s) generated
main:
     38, allocate_3d_poisson_matrix(matrix &, int) inlined, size=41 (inline) file main.cpp (29)
              Loop not vectorized/parallelized: loop count too small
          45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
          57, Intensity = 0.0
              Loop not fused: function call before adjacent loop
     48, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
     49, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
          36, Intensity = 0.0
              Memory set idiom, loop replaced by call to __c_mset8
     52, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          39, Intensity = 0.0
              Loop not fused: different loop trip count
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
     54, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
          27, FMA (fused multiply-add) instruction(s) generated
          36, FMA (fused multiply-add) instruction(s) generated
          39, Intensity = 0.67
              Loop not fused: different loop trip count
     65, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
              Loop not fused: different controlling conditions
              Generated vector simd code for the loop containing reductions
     67, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
              Loop not fused: different loop trip count
          33, Intensity = 1.00
              Generated vector simd code for the loop containing reductions
     73, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
          27, Intensity = 1.00
     91, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
     92, free_matrix(matrix &) inlined, size=5 (inline) file main.cpp (73)
}}
<translate>
=== Interpretation of the Compiler Feedback === <!--T:19-->
The ''Computational Intensity'' of a loop is a measure of how much work is being done compared to memory operations.
Basically:

<!--T:20-->
<math>\mbox{Computational Intensity} = \frac{\mbox{Compute Operations}}{\mbox{Memory Operations}}</math>

<!--T:21-->
In the compiler feedback, an <code>Intensity</code> <math>\ge</math> 1.0 suggests that the loop might run well on a GPU.
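As a rough check (counting only the two floating-point operations and the two <code>double</code> loads per iteration), the inner loop of the <code>matvec()</code> function examined in the next section has an estimated intensity of

<math>\mbox{Computational Intensity} \approx \frac{1\ \mbox{multiply} + 1\ \mbox{add}}{2\ \mbox{loads}} = 1.00</math>

which matches the value the compiler reports for line 33 of <code>matrix_functions.h</code>.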
== Understanding the code == <!--T:22-->
Let's look closely at the main loop in the
[https://github.com/calculquebec/cq-formation-openacc/blob/main/cpp/matrix_functions.h#L29 <code>matvec()</code> function implemented in <code>matrix_functions.h</code>]:
</translate>
<syntaxhighlight lang="cpp" line start="29" highlight="1,5,10,12">
for(int i=0;i<num_rows;i++) {
  double sum=0;
  int row_start=row_offsets[i];
  int row_end=row_offsets[i+1];
  for(int j=row_start; j<row_end;j++) {
    unsigned int Acol=cols[j];
    double Acoef=Acoefs[j];
    double xcoef=xcoefs[Acol];
    sum+=Acoef*xcoef;
  }
  ycoefs[i]=sum;
}
</syntaxhighlight>
<translate>
Given the code above, we search for data dependencies:
* Does one loop iteration affect other loop iterations?
** For example, when generating the '''[https://en.wikipedia.org/wiki/Fibonacci_number Fibonacci sequence]''', each new value depends on the previous two values. Therefore, efficient parallelism is very difficult to implement, if not impossible.
* Is the accumulation of values in <code>sum</code> a data dependency?
** No, it's a '''[https://en.wikipedia.org/wiki/Reduction_operator reduction]'''! And modern compilers are good at optimizing such reductions.
* Do loop iterations read from and write to the same array, such that written values are used or overwritten in other iterations?
** Fortunately, that does not happen in the above code.
<!--T:25-->
Now that the code analysis is done, we are ready to add compiler directives to the code.
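As a preview, here is a minimal sketch of what OpenACC directives could look like on the <code>matvec()</code> loops (illustrative only; the actual directives and the required data management are covered in the next unit):
</translate>
<syntaxhighlight lang="cpp">
#pragma acc parallel loop             // each row can be processed independently
for(int i=0;i<num_rows;i++) {
  double sum=0;
  int row_start=row_offsets[i];
  int row_end=row_offsets[i+1];
  #pragma acc loop reduction(+:sum)   // the accumulation in sum is a reduction, not a dependency
  for(int j=row_start; j<row_end;j++) {
    sum+=Acoefs[j]*xcoefs[cols[j]];
  }
  ycoefs[i]=sum;
}
</syntaxhighlight>
<translate>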
<!--T:24-->
[[OpenACC Tutorial - Introduction|<- Previous unit: ''Introduction'']] | [[OpenACC Tutorial|^- Back to the lesson plan]] | [[OpenACC Tutorial - Adding directives|Onward to the next unit: ''Adding directives'' ->]]
</translate>