Using GPUs with Slurm: Difference between revisions


894 bytes added, 1 year ago

Job script for GPU profiling

cc_staff


@@ Line 217: / Line 217: @@
 With this method, you can run multiple tasks in one submission. The <code>-j4</code> parameter means that GNU Parallel can run a maximum of four concurrent tasks, launching another as soon as each one ends. CUDA_VISIBLE_DEVICES is used to ensure that two tasks do not try to use the same GPU at the same time.
+== Profiling GPU tasks ==
+On [[Béluga/en|Béluga]] and [[Narval/en|Narval]], the
+[https://developer.nvidia.com/dcgm NVIDIA Data Center GPU Manager (DCGM)]
+must be disabled for profiling, and you must request this when you submit your job.
+Starting from the simplest example on this page, use the <code>--export</code>
+parameter to set the <code>DISABLE_DCGM</code> environment variable:
+{{File
+  |name=gpu_profiling_job.sh
+  |lang="sh"
+  |contents=
+#!/bin/bash
+#SBATCH --account=def-someuser
+#SBATCH --export=ALL,DISABLE_DCGM=1
+#SBATCH --gpus-per-node=1
+#SBATCH --mem=4000M               # memory per node
+#SBATCH --time=0-03:00
+# Wait until DCGM is disabled on the node (dcgmi no longer reports a host engine)
+while [ -n "$(dcgmi -v {{!}} grep 'Hostengine build info:')" ]; do
+  sleep 5
+done
+./profiler arg1 arg2 ...          # Edit this line; nvprof can be used as the profiler
+}}
+For more details about profilers, see [[Debugging and profiling]].
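The wait loop in the job script above runs until DCGM is disabled, so a job could spend its whole time limit waiting if that never happens. A bounded variant can be sketched as follows. This is only an illustration, not part of the documented procedure: the function names, the 300-second default, and the <code>DCGM_CHECK_CMD</code> override are all assumptions introduced here.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original job script).
# DCGM_CHECK_CMD is an assumed override so the check can be swapped out;
# by default the same "dcgmi -v" query as in the job script is used.
dcgm_is_running() {
  # Exit status 0 while the host engine still reports its build info
  ${DCGM_CHECK_CMD:-dcgmi -v} 2>/dev/null | grep -q 'Hostengine build info:'
}

wait_for_dcgm_off() {
  local timeout="${1:-300}"   # maximum seconds to wait (assumed default)
  local interval="${2:-5}"    # seconds between checks, as in the job script
  local waited=0
  while dcgm_is_running; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "DCGM still enabled after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

In a job script, <code>wait_for_dcgm_off 300 5 || exit 1</code> would replace the <code>while</code> loop, so the job stops early instead of waiting indefinitely. If this code is placed inside a <code><nowiki>{{File}}</nowiki></code> template, the pipe character must be written as <code><nowiki>{{!}}</nowiki></code>, as in the script above.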
 <!--T:54-->
 [[Category:SLURM]]
 </translate>