(Job script for GPU profiling) |
(Marked this version for translation) |
Line 217: | Line 217: | ||
With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. CUDA_VISIBLE_DEVICES is used to ensure that two tasks do not try to use the same GPU at the same time. | With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. CUDA_VISIBLE_DEVICES is used to ensure that two tasks do not try to use the same GPU at the same time. | ||
== Profiling GPU tasks == | == Profiling GPU tasks == <!--T:65--> | ||
<!--T:66--> | |||
On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the | On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the | ||
[https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)] | [https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)] | ||
Line 225: | Line 226: | ||
parameter is used to set the <code>DISABLE_DCGM</code> environment variable: | parameter is used to set the <code>DISABLE_DCGM</code> environment variable: | ||
<!--T:67--> | |||
{{File | {{File | ||
|name=gpu_profiling_job.sh | |name=gpu_profiling_job.sh | ||
Line 236: | Line 238: | ||
#SBATCH --time=0-03:00 | #SBATCH --time=0-03:00 | ||
<!--T:68--> | |||
# Wait until DCGM is disabled on the node | # Wait until DCGM is disabled on the node | ||
while [ ! -z "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do | while [ ! -z "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do | ||
Line 241: | Line 244: | ||
done | done | ||
<!--T:69--> | |||
./profiler arg1 arg2 ... # Edit this line. Nvprof can be used | ./profiler arg1 arg2 ... # Edit this line. Nvprof can be used | ||
}} | }} |