=== Example: porting a matrix-vector product === <!--T:19-->
For this example, we use the code from the [https://github.com/calculquebec/cq-formation-openacc exercises repository].
More precisely, we will use a portion of the code from the [https://github.com/calculquebec/cq-formation-openacc/blob/main/cpp/matrix_functions.h#L20 <code>cpp/matrix_functions.h</code> file].
The equivalent Fortran code can be found in the subroutine [https://github.com/calculquebec/cq-formation-openacc/blob/main/f90/matrix.F90#L101 <code>matvec</code> contained in the <code>matrix.F90</code> file].
The C++ code is the following:
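The full listing is in the repository linked above; the sketch below is a paraphrase rather than a verbatim copy of the file, with the <code>matrix</code> and <code>vector</code> container types assumed from the exercise code and member names matching the compiler feedback shown later on this page. It computes a sparse matrix-vector product in CSR format:

<syntaxhighlight lang="cpp">
// Sketch of the routine; "matrix" and "vector" are the CSR container structs
// defined elsewhere in the exercise code.
void matvec(const matrix &A, const vector &x, const vector &y) {
  unsigned int num_rows = A.num_rows;
  unsigned int *row_offsets = A.row_offsets;  // index where each row starts in cols/Acoefs
  unsigned int *cols = A.cols;                // column index of each stored coefficient
  double *Acoefs = A.coefs;                   // non-zero coefficients of A
  double *xcoefs = x.coefs;
  double *ycoefs = y.coefs;

  for (unsigned int i = 0; i < num_rows; i++) {      // one output element per row
    double sum = 0;
    for (unsigned int j = row_offsets[i]; j < row_offsets[i + 1]; j++)
      sum += Acoefs[j] * xcoefs[cols[j]];            // dot product of row i with x
    ycoefs[i] = sum;
  }
}
</syntaxhighlight>

Each outer iteration writes a single, distinct element of <code>ycoefs</code>, so the rows can in principle be computed independently; this is exactly what we want the compiler to exploit on the GPU.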
<translate>
<!--T:20-->
The [https://github.com/calculquebec/cq-formation-openacc/blob/main/cpp/step1.kernels/matrix_functions.h#L29 first change] we make to this code to try to run it on the GPU is to add the <code>kernels</code> directive.
At this stage, we don't worry about data transfers or about giving the compiler any additional information.
</translate>
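A sketch of this step (using the same assumed names as the sketch above) simply wraps the loop nest in the directive and leaves all decisions to the compiler:

<syntaxhighlight lang="cpp">
#pragma acc kernels   // ask the compiler to turn the enclosed loops into GPU kernels
{
  for (unsigned int i = 0; i < num_rows; i++) {
    double sum = 0;
    for (unsigned int j = row_offsets[i]; j < row_offsets[i + 1]; j++)
      sum += Acoefs[j] * xcoefs[cols[j]];
    ycoefs[i] = sum;
  }
}
</syntaxhighlight>

With <code>kernels</code>, the compiler itself must prove that each loop is safe to parallelize; as we will see below, it is conservative when it cannot.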
<translate>
==== Building with OpenACC ==== <!--T:49-->

<!--T:50-->
The NVIDIA compilers use the <code>-ta</code> (target accelerator) option to enable compilation for an accelerator.
We use the sub-option <code>tesla:managed</code> to tell the compiler that we want the code compiled for Tesla GPUs
and that we want to use [https://developer.nvidia.com/blog/unified-memory-cuda-beginners/ managed memory].
''Managed memory'' simplifies the process of transferring data to and from the device.
We will remove this option in a later example.
We also use the <code>-fast</code> option, which enables general compiler optimizations.
</translate>
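As an illustration (the source file and executable names below are placeholders, and the exact command used for the exercises may differ), the build step looks something like this, with <code>-Minfo=accel</code> added so the compiler reports what it does with the accelerated region:

<syntaxhighlight lang="bash">
# -fast: general optimizations; -ta=tesla:managed: target Tesla GPUs with managed memory;
# -Minfo=accel: print feedback about the accelerator code being generated
nvc++ -fast -Minfo=accel -ta=tesla:managed main.cpp -o matvec
</syntaxhighlight>

The compiler feedback for the <code>matvec</code> routine then looks like this: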
matvec(const matrix &, const vector &, const vector &):
     23, include "matrix_functions.h"
          30, Generating implicit copyin(cols[:],row_offsets[:num_rows+1],Acoefs[:]) [if not already present]
              Generating implicit copyout(ycoefs[:num_rows]) [if not already present]
              Generating implicit copyin(xcoefs[:]) [if not already present]
          31, Loop carried dependence of ycoefs-> prevents parallelization
              Loop carried backward dependence of ycoefs-> prevents vectorization
              Complex loop carried dependence of Acoefs->,xcoefs-> prevents parallelization
              Generating NVIDIA GPU code
              31, #pragma acc loop seq
              35, #pragma acc loop vector(128) /* threadIdx.x */
                  Generating implicit reduction(+:sum)
          35, Loop is parallelizable
}}
<translate>
<!--T:51-->
As we can see in the compiler output, the compiler could not parallelize the outer loop on line 31 because of possible loop-carried dependencies.
We will see in the following sections how to deal with those dependencies.
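The underlying issue is pointer aliasing: through plain pointers, the compiler cannot prove that <code>ycoefs</code> does not overlap <code>Acoefs</code> or <code>xcoefs</code>, and if the arrays did overlap, the outer-loop iterations would not be independent. A minimal, self-contained illustration of the kind of overlap the compiler has to guard against (hypothetical values, unrelated to the exercise code):

<syntaxhighlight lang="cpp">
#include <cstdio>

int main() {
  double data[5] = {1, 2, 3, 4, 5};
  double *in  = data;      // "input" view of the buffer
  double *out = data + 1;  // "output" view that secretly overlaps the input

  // Because out[i] is the same memory as in[i+1], iteration i+1 reads a value
  // written by iteration i: the loop is only correct when run sequentially.
  for (int i = 0; i < 4; i++)
    out[i] = in[i] * 2.0;

  for (double v : data)
    std::printf("%g ", v);  // prints: 1 2 4 8 16
  std::printf("\n");
  return 0;
}
</syntaxhighlight>

In <code>matvec</code> no such overlap actually occurs, which is why these are ''false'' dependencies; the next section shows how to tell the compiler so.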
== Fixing false loop dependencies == <!--T:25-->