OpenACC Tutorial - Profiling: Difference between revisions

Updated example with Minfo
(Making this section a subsection)
(Updated example with Minfo)
Line 127: Line 127:
From the above output, the <code>matvec()</code> function is responsible for 83.5% of the execution time, and this function call can be found in the <code>main()</code> function.


== Compiler Feedback == <!--T:16-->
Before working on the routine, we need to understand what the compiler is actually doing by asking ourselves the following questions:
* What optimizations were applied?  
* What optimizations were applied automatically by the compiler?  
* What prevented further optimizations?
* Can very minor modifications of the code affect performance?


<!--T:17-->
The PGI compiler offers you a '''-Minfo''' flag with the following options:
* accel – Print compiler operations related to the accelerator
* all – Print all compiler output
* intensity – Print loop intensity information
* ccff – Add information to the object files for use by tools
The NVIDIA compiler offers a <code>-Minfo</code> flag with the following options:
* <code>all</code> - Print almost all types of compilation information, including:
** <code>accel</code> - Print compiler operations related to the accelerator
** <code>inline</code> - Print information about functions extracted and inlined
** <code>loop,mp,par,stdpar,vect</code> - Print various information about loop optimization and vectorization
* <code>intensity</code> - Print loop intensity information
* (none) - If <code>-Minfo</code> is used without any option, it is the same as with the <code>all</code> option, but without the <code>inline</code> information
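As a rough illustration of what the <code>intensity</code> option measures, the sketch below shows a reduction loop of the kind the compiler reports on in the build output further down. It is a hypothetical stand-in for the tutorial's <code>dot()</code> function, not its actual source: the use of <code>std::vector</code> and the parameter names are assumptions. Each iteration performs two floating-point operations (one multiply, one add) for two loads, which would be consistent with a reported intensity of 1.00 if intensity is read as the ratio of arithmetic operations to memory operations.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the tutorial's dot(); the container type and
// parameter names are assumptions made for illustration only.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    // Use the shorter length so the loop never reads out of bounds.
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    double sum = 0.0;
    // Per iteration: 2 loads (x[i], y[i]) and 2 floating-point operations
    // (multiply, add). The accumulation into sum is a reduction.
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Compiling a file that contains such a loop with <code>-Minfo=all,intensity</code> should produce feedback lines for it. The exact messages depend on the compiler version and the target processor.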


=== How to Enable Compiler Feedback === <!--T:18-->
* Edit the Makefile
* Edit the <code>Makefile</code>:
   CXX=nvc++
   CXXFLAGS=-fast -Minfo=all,intensity,ccff
   CXXFLAGS=-fast -Minfo=all,intensity
   LDFLAGS=${CXXFLAGS}
* Rebuild
</translate>
Line 150: Line 153:
|make clean; make
|result=
nvc++ -fast -Minfo=all,intensity,ccff   -c -o main.o main.cpp
...
nvc++ -fast -Minfo=all,intensity  -c -o main.o main.cpp
initialize_vector(vector &, double):
     20, include "vector.h"
Line 159: Line 163:
           27, Intensity = 1.00
               Generated vector simd code for the loop containing reductions
              FMA (fused multiply-add) instruction(s) generated
          28, FMA (fused multiply-add) instruction(s) generated
waxpby(double, const vector &, double, const vector &, const vector &):
     21, include "vector_functions.h"
Line 167: Line 171:
               Loop unrolled 2 times
               FMA (fused multiply-add) instruction(s) generated
          40, FMA (fused multiply-add) instruction(s) generated
allocate_3d_poisson_matrix(matrix &, int):
     22, include "matrix.h"
Line 174: Line 179:
               Loop not vectorized/parallelized: loop count too small
           45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
           57, Intensity = 0.0
           59, Intensity = 0.0
Line 180: Line 186:
     23, include "matrix_functions.h"
           29, Intensity = (num_rows*((row_end-row_start)*        2))/(num_rows+(num_rows+(num_rows+((row_end-row_start)+(row_end-row_start)))))
              FMA (fused multiply-add) instruction(s) generated
           33, Intensity = 1.00
               Loop not vectorized: non-stride-1 array reference
               Generated vector simd code for the loop containing reductions
              Loop not vectorized: mixed data types
          37, FMA (fused multiply-add) instruction(s) generated
              Loop unrolled 2 times
              FMA (fused multiply-add) instruction(s) generated
main:
     38, allocate_3d_poisson_matrix(matrix &, int) inlined, size=41 (inline) file main.cpp (29)
Line 193: Line 196:
               Loop not vectorized/parallelized: loop count too small
           45, Intensity = 0.0
              Loop unrolled 3 times (completely unrolled)
           57, Intensity = 0.0
               Loop not fused: function call before adjacent loop
Line 204: Line 208:
     48, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
           36, Intensity = 0.0
               Loop not vectorized/parallelized: not countable
               Memory set idiom, loop replaced by call to __c_mset8
     49, initialize_vector(vector &, double) inlined, size=5 (inline) file main.cpp (34)
           36, Intensity = 0.0
               Loop not vectorized/parallelized: not countable
               Memory set idiom, loop replaced by call to __c_mset8
     52, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
           39, Intensity = 0.0
Line 215: Line 219:
               Loop not fused: different loop trip count
           33, Intensity = 1.00
               Loop not vectorized: non-stride-1 array reference
               Generated vector simd code for the loop containing reductions
              Loop not vectorized: mixed data types
              Loop unrolled 2 times
     54, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
           27, FMA (fused multiply-add) instruction(s) generated
           29, FMA (fused multiply-add) instruction(s) generated
           36, FMA (fused multiply-add) instruction(s) generated
          33, FMA (fused multiply-add) instruction(s) generated
           39, Intensity = 0.67
               Loop not fused: different loop trip count
Line 238: Line 239:
     65, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
           27, Intensity = 1.00
               Loop not fused: different loop trip count
               Loop not fused: different controlling conditions
               Generated vector simd code for the loop containing reductions
     67, waxpby(double, const vector &, double, const vector &, const vector &) inlined, size=10 (inline) file main.cpp (33)
Line 250: Line 251:
               Loop not fused: different loop trip count
           33, Intensity = 1.00
               Loop not vectorized: non-stride-1 array reference
               Generated vector simd code for the loop containing reductions
              Loop not vectorized: mixed data types
              Loop unrolled 2 times
     73, dot(const vector &, const vector &) inlined, size=9 (inline) file main.cpp (21)
           27, Intensity = 1.00
Line 274: Line 273:
     91, free_vector(vector &) inlined, size=2 (inline) file main.cpp (29)
     92, free_matrix(matrix &) inlined, size=5 (inline) file main.cpp (73)
nvc++ main.o -o cg.x -fast -Minfo=all,intensity,ccff
}}
<translate>
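The feedback for <code>matvec()</code> concerns a doubly nested loop whose inner loop is a reduction. A minimal sketch of such a loop is given here, assuming a compressed sparse row (CSR) layout. The struct <code>CsrMatrix</code>, its field names, and the function signature are hypothetical and may differ from the tutorial's own <code>matrix.h</code> and <code>matrix_functions.h</code>. The indirect access <code>x[A.cols[j]]</code> is the kind of non-stride-1 array reference that the compiler messages above mention.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR matrix; the field names are assumptions for illustration.
struct CsrMatrix {
    std::vector<std::size_t> row_offsets;  // num_rows + 1 entries
    std::vector<unsigned int> cols;        // column index of each nonzero
    std::vector<double> coefs;             // value of each nonzero
};

// Sparse matrix-vector product y = A * x.
std::vector<double> matvec(const CsrMatrix& A, const std::vector<double>& x) {
    const std::size_t num_rows =
        A.row_offsets.empty() ? 0 : A.row_offsets.size() - 1;
    std::vector<double> y(num_rows, 0.0);
    for (std::size_t i = 0; i < num_rows; ++i) {
        double sum = 0.0;
        const std::size_t row_start = A.row_offsets[i];
        const std::size_t row_end = A.row_offsets[i + 1];
        // Inner reduction: coefs[j] is read with stride 1, but x is read
        // through the index array cols, so its accesses are not contiguous.
        for (std::size_t j = row_start; j < row_end; ++j) {
            sum += A.coefs[j] * x[A.cols[j]];
        }
        y[i] = sum;
    }
    return y;
}
```

The inner loop does one multiply and one add per nonzero, and its trip count <code>row_end - row_start</code> varies per row, which is why the outer-loop intensity in the output is reported as an expression rather than a constant.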