Using GPUs with Slurm: Difference between revisions

Marked this version for translation
(Job script for GPU profiling)
(Marked this version for translation)
Line 217: Line 217:
With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. CUDA_VISIBLE_DEVICES is used to ensure that two tasks do not try to use the same GPU at the same time.
With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. CUDA_VISIBLE_DEVICES is used to ensure that two tasks do not try to use the same GPU at the same time.


== Profiling GPU tasks ==
== Profiling GPU tasks == <!--T:65-->


<!--T:66-->
On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the
On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the
[https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)]
[https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)]
Line 225: Line 226:
parameter is used to set the <code>DISABLE_DCGM</code> environment variable:
parameter is used to set the <code>DISABLE_DCGM</code> environment variable:


<!--T:67-->
{{File
{{File
   |name=gpu_profiling_job.sh
   |name=gpu_profiling_job.sh
Line 236: Line 238:
#SBATCH --time=0-03:00
#SBATCH --time=0-03:00


<!--T:68-->
# Wait until DCGM is disabled on the node
# Wait until DCGM is disabled on the node
while [ ! -z "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do
while [ ! -z "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do
Line 241: Line 244:
done
done


<!--T:69-->
./profiler arg1 arg2 ...          # Edit this line. Nvprof can be used
./profiler arg1 arg2 ...          # Edit this line. Nvprof can be used
}}
}}