OpenACC Tutorial - Adding directives: Difference between revisions

Line 253: Line 253:
== How is the ported code performing? ==
== How is the ported code performing? ==
Now that we have completed a first step in porting the code to the GPU, we need to analyze how the code performs and whether it gives the correct results. Running the original version of the code yields the following (performed on one of Guillimin's GPU nodes):
Now that we have completed a first step in porting the code to the GPU, we need to analyze how the code performs and whether it gives the correct results. Running the original version of the code yields the following (performed on one of Guillimin's GPU nodes):
</translate>
[[File:Openacc profiling1.png|thumbnail|<translate>Click to enlarge</translate>]]
<translate>
{{Command
{{Command
|./cg.x  
|./cg.x  
Line 287: Line 290:
}}
}}


The results are correct. However, not only do we get no speedup, we actually get a slowdown by a factor of almost 4! Let's profile the code again using NVIDIA's visual profiler (<tt>nvvp</tt>). The corresponding profile is shown in the image on the right. As we can see, almost all of the run time is spent transferring data between the host and the device. This is very often the case when porting code from CPU to GPU. We will look at how to optimize this in the [[OpenACC Tutorial - Data movement|next part of the tutorial]]. 
The results are correct. However, not only do we get no speedup, we actually get a slowdown by a factor of almost 4! Let's profile the code again using NVIDIA's visual profiler (<tt>nvvp</tt>). This can be done with the following steps:
</translate>
# Start <tt>nvvp</tt> with the command <tt>nvvp &</tt> (the <tt>&</tt> sign starts it in the background).
[[File:Openacc profiling1.png|thumbnail|<translate>Click to enlarge</translate>]]
# Go to File -> New Session.
# In the "File:" field, search for the executable (named <tt>challenge</tt> in our example).
# Click "Next" until you can click "Finish".
 
This will run the program and generate a timeline of the execution. The resulting timeline is shown in the image on the right. As we can see, almost all of the run time is spent transferring data between the host and the device. This is very often the case when porting code from CPU to GPU. We will look at how to optimize this in the [[OpenACC Tutorial - Data movement|next part of the tutorial]]. 
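
To make the source of these transfers more concrete, here is a minimal sketch of the pattern that typically produces such a timeline. It is not taken from the tutorial's code: the functions <tt>axpy</tt> and <tt>repeated_axpy</tt> are hypothetical, and the data clauses are written out explicitly only to show where copies can happen. Assuming no enclosing data region, each call to the kernel moves its arrays to the device and back, so a solver that calls it on every iteration pays for the transfers every time.

```cpp
#include <cstddef>

// Hypothetical kernel (not from the tutorial): computes y = a*x + y.
// Assuming no enclosing data region, x and y are copied to the device
// when the parallel region starts, and y is copied back when it ends.
// Without an OpenACC compiler the pragma is ignored and the loop runs
// on the CPU.
void axpy(std::size_t n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical driver: calling the kernel repeatedly, as an iterative
// solver does, repeats the host/device transfers on every iteration.
void repeated_axpy(std::size_t n, int iterations, double a,
                   const double *x, double *y)
{
    for (int it = 0; it < iterations; ++it) {
        axpy(n, a, x, y);
    }
}
```

In a sketch like this, the arithmetic inside the loop is cheap compared to moving <tt>x</tt> and <tt>y</tt> across the bus, which would be consistent with a profile dominated by memory copies.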


<translate>
== The <tt>parallel loop</tt> directive ==
== The <tt>parallel loop</tt> directive ==


[[OpenACC Tutorial|Back to the lesson plan]]
[[OpenACC Tutorial|Back to the lesson plan]]
</translate>
</translate>