Using GPUs with Slurm

With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. <code>CUDA_VISIBLE_DEVICES</code> is used to ensure that two tasks do not try to use the same GPU at the same time.
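
As a rough sketch, a submission script combining these ideas might look like the following, assuming a hypothetical executable <code>./my_program</code> and eight input values:
{{File
  |name=gnu_parallel_gpu_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --gpus-per-node=4
#SBATCH --mem=16000M              # memory per node
#SBATCH --time=0-03:00
# Run eight tasks, at most four at a time; {%} is the GNU Parallel job slot
# number (1 to 4), so each of the four concurrent tasks sees a different GPU.
parallel -j4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) ./my_program {}' ::: {1..8}
}}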


== Profiling GPU tasks ==
On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the
[https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)]
must be disabled, and this is done at job submission time.
Starting from the simplest example on this page, the <code>--export</code>
parameter is used to set the <code>DISABLE_DCGM</code> environment variable:
{{File
  |name=gpu_profiling_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --export=ALL,DISABLE_DCGM=1
#SBATCH --gpus-per-node=1
#SBATCH --mem=4000M              # memory per node
#SBATCH --time=0-03:00
# Wait until DCGM is disabled on the node
while [ -n "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do
  sleep 5;
done
./profiler arg1 arg2 ...          # Edit this line; for example, nvprof can be used
}}
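The same variable can also be set on the <code>sbatch</code> command line rather than in the script, for example <code>sbatch --export=ALL,DISABLE_DCGM=1 gpu_profiling_job.sh</code>; options given on the command line take precedence over the corresponding <code>#SBATCH</code> directives.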
For more details about profilers, see [[Debugging and profiling]].


<!--T:54-->
[[Category:SLURM]]
</translate>