With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. <code>CUDA_VISIBLE_DEVICES</code> is used to ensure that two tasks do not try to use the same GPU at the same time.
== Profiling GPU tasks ==

On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the
[https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)]
must be disabled before GPU tasks can be profiled, and this has to be requested when you submit the job.
Building on the simplest example on this page, the <code>--export</code>
parameter is used to set the <code>DISABLE_DCGM</code> environment variable:
{{File
  |name=gpu_profiling_job.sh
  |lang="sh"
  |contents=
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --export=ALL,DISABLE_DCGM=1   # pass the environment and ask for DCGM to be disabled
#SBATCH --gpus-per-node=1
#SBATCH --mem=4000M               # memory per node
#SBATCH --time=0-03:00

# Wait until DCGM is disabled on the node:
# poll every 5 seconds for as long as dcgmi still reports a host engine
while [ -n "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do
  sleep 5
done

./profiler arg1 arg2 ...  # Edit this line; nvprof, for example, can be used
}}
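
The wait loop in the job script above has no upper bound, so a job would keep polling until its time limit if DCGM were never disabled. A minimal sketch of a bounded variant follows. The function name <code>wait_for_dcgm_disabled</code>, the timeout argument and its 300-second default are assumptions made for illustration and are not part of the original script. Only the <code>dcgmi -v</code> check is taken from it.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original job script): wait for DCGM
# to be disabled, but stop waiting after a maximum number of seconds.
# Usage: wait_for_dcgm_disabled [timeout_seconds] [poll_interval_seconds]
wait_for_dcgm_disabled() {
    local timeout="${1:-300}"   # assumed default: give up after 5 minutes
    local interval="${2:-5}"    # same 5-second polling interval as the job script
    local waited=0

    # Same test as the job script: DCGM counts as still enabled while
    # `dcgmi -v` prints a 'Hostengine build info:' line.
    while dcgmi -v 2>/dev/null | grep -q 'Hostengine build info:'; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "DCGM still enabled after ${waited}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In a job script, the function could replace the loop, for example <code>wait_for_dcgm_disabled 300 5 || exit 1</code> placed before the profiler line, so that the job stops early instead of waiting for the whole time limit.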

For more details about profilers, see [[Debugging and profiling]].
<!--T:54-->
[[Category:SLURM]]
</translate>