=== <tt>restrict</tt> keyword ===
One way to tell the compiler that pointers will '''not''' be aliased is to use a special keyword. In C, the keyword <tt>restrict</tt> was introduced in C99 for this purpose. C++ has no standard equivalent yet, but each compiler typically provides its own keyword: either <tt>__restrict</tt> or <tt>__restrict__</tt>, depending on the compiler. For Portland Group compilers, the keyword is <tt>__restrict</tt>. For an explanation of why there is no standard way to do this in C++, you can read [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3988.pdf this paper]. This concept is important not only for OpenACC, but for any C/C++ programming, since compilers can perform many more optimizations when pointers are guaranteed not to be aliased. Note that the keyword goes ''after'' the pointer, since it refers to the pointer and not to the type. In other words, you would declare <code>float * __restrict A;</code> rather than <code>float __restrict * A;</code>.
</translate>
}
</syntaxhighlight>
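<translate>
To illustrate the idea, here is a minimal, self-contained sketch, not part of the tutorial code and with illustrative function and variable names, of <tt>__restrict</tt> applied to function parameters. The qualifier promises the compiler that, within the function, the memory reached through the two pointers does not overlap:
</translate>
<syntaxhighlight lang="cpp">
#include <cassert>

// The __restrict qualifiers promise the compiler that x and y never
// point to overlapping memory, so it may, for example, vectorize the
// loop without re-loading x[i] after each store to y[i].
static void scale_add(int n, const float *__restrict x, float *__restrict y) {
    for (int i = 0; i < n; i++)
        y[i] += 2.0f * x[i];
}

int main() {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {10.0f, 10.0f, 10.0f, 10.0f};
    scale_add(4, x, y);            // y[i] becomes 10 + 2*x[i]
    assert(y[0] == 12.0f && y[3] == 18.0f);
    return 0;
}
</syntaxhighlight>
<translate>
If the caller were to pass overlapping arrays to such a function, the behavior would be undefined; the keyword is a promise made by the programmer, not something the compiler verifies.
</translate>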
=== Back to the example ===
Going back to the matrix-vector product above, we recommend fixing the false aliasing by declaring the pointers as restricted. This is done by replacing the following code in <tt>matrix_functions.h</tt>:
</translate>
<syntaxhighlight lang="cpp" line>
double *Acoefs=A.coefs;
double *xcoefs=x.coefs;
double *ycoefs=y.coefs;
</syntaxhighlight>
<translate>by this code: </translate>
<syntaxhighlight lang="cpp" line>
double *__restrict Acoefs=A.coefs;
double *__restrict xcoefs=x.coefs;
double *__restrict ycoefs=y.coefs;
</syntaxhighlight>
<translate>
Note that we do not need to declare the other pointers as restricted, since the compiler does not report them as problematic. With the above changes, recompiling produces the following compiler messages:
</translate>
{{Command
|pgc++ -fast -Minfo{{=}}accel -ta{{=}}tesla:managed main.cpp -o challenge
|result=
matvec(const matrix &, const vector &, const vector &):
23, include "matrix_functions.h"
27, Generating copyout(ycoefs[:num_rows])
Generating copyin(xcoefs[:],Acoefs[:],cols[:],row_offsets[:num_rows+1])
29, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
29, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
33, Loop is parallelizable
}}
<translate>
== How is the ported code performing? ==
Now that we have completed a first step of porting the code to the GPU, we need to analyze how it performs and whether it gives correct results. Running the original version of the code yields the following (on one of Guillimin's GPU nodes):
{{Command
|./cg.x
|result=
Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 4.0067e+08
Iteration: 10, Tolerance: 1.8772e+07
Iteration: 20, Tolerance: 6.4359e+05
Iteration: 30, Tolerance: 2.3202e+04
Iteration: 40, Tolerance: 8.3565e+02
Iteration: 50, Tolerance: 3.0039e+01
Iteration: 60, Tolerance: 1.0764e+00
Iteration: 70, Tolerance: 3.8360e-02
Iteration: 80, Tolerance: 1.3515e-03
Iteration: 90, Tolerance: 4.6209e-05
Total Iterations: 100 Total Time: 29.894881s
}}
Running the OpenACC version yields the following:
{{Command
|./challenge
|result=Rows: 8120601, nnz: 218535025
Iteration: 0, Tolerance: 4.0067e+08
Iteration: 10, Tolerance: 1.8772e+07
Iteration: 20, Tolerance: 6.4359e+05
Iteration: 30, Tolerance: 2.3202e+04
Iteration: 40, Tolerance: 8.3565e+02
Iteration: 50, Tolerance: 3.0039e+01
Iteration: 60, Tolerance: 1.0764e+00
Iteration: 70, Tolerance: 3.8360e-02
Iteration: 80, Tolerance: 1.3515e-03
Iteration: 90, Tolerance: 4.6209e-05
Total Iterations: 100 Total Time: 115.068931s
}}
The results are correct. However, not only do we get no speedup, we actually get a slowdown by a factor of almost 4! Let's profile the code again using NVIDIA's visual profiler (<tt>nvvp</tt>). The resulting profile is shown in the image on the right. As we can see, almost all of the run time is spent transferring data between the host and the device. This is very often the case when code is first ported from CPU to GPU. We will look at how to optimize this in the [[OpenACC Tutorial - Data movement|next part of the tutorial]].
</translate>
[[File:Openacc profiling1.png|thumbnail|<translate>Click to enlarge</translate>]]
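<translate>
To give a flavor of the kind of fix involved, here is a minimal, hypothetical sketch, with illustrative names and sizes that are not from the tutorial code, of how a structured <tt>acc data</tt> region keeps arrays resident on the device across kernel launches, so they are transferred once rather than at every launch:
</translate>
<syntaxhighlight lang="cpp">
#include <cassert>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<double> x(n, 1.0), y(n, 0.0);
    double *xp = x.data();
    double *yp = y.data();

    // The data region copies x to the GPU once and copies y in and out
    // once; without it, each kernel launch below could trigger its own
    // host<->device transfers. (Pragmas are ignored by non-OpenACC
    // compilers, so this also runs on the host.)
    #pragma acc data copyin(xp[0:n]) copy(yp[0:n])
    for (int iter = 0; iter < 10; iter++) {
        #pragma acc parallel loop present(xp[0:n], yp[0:n])
        for (int i = 0; i < n; i++)
            yp[i] += xp[i];
    }

    assert(yp[0] == 10.0);  // 10 iterations, each adding 1.0
    return 0;
}
</syntaxhighlight>
<translate>
This is only a preview; the next part of the tutorial covers data movement in detail.
</translate>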
<translate>
== The <tt>parallel loop</tt> directive ==
[[OpenACC Tutorial|Back to the lesson plan]]
</translate>