Best practices for job submission

When submitting a job to one of the clusters, it's important to choose appropriate values for various parameters in order to ensure that your job doesn't waste resources or create problems for other users and yourself. This will ensure your job starts more quickly and that it is likely to finish correctly, producing the output you need to move your research forward.  


=Typical job submission problems=
 
* The more resources - time, memory, CPU cores, GPUs - your job asks for, the more difficult it will be for the scheduler to find them, and so the longer your job will wait in the queue.
* If too few resources are requested, however, the job can be stopped when it goes beyond its time limit or its memory limit; the example after this list shows how to check what a completed job actually used.
* Estimating the required resources based on the performance of a local computer can be misleading, since the processor type and speed can differ significantly.
* The compute tasks listed in the job script may themselves waste resources.
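
Slurm's <tt>sacct</tt> command reports what a completed job actually used, which is the most reliable way to calibrate future requests. As a minimal sketch - the job ID <tt>123456</tt> is a placeholder - the following compares the requested and actual time and memory; many clusters also install the <tt>seff</tt> summary script:
<pre>
# Compare requested vs. actual usage for a finished job (job ID is a placeholder).
# Elapsed is the actual run time; MaxRSS is the peak memory used by each job step.
sacct -j 123456 --format=JobID,State,Elapsed,Timelimit,ReqMem,MaxRSS

# If your cluster provides it, seff prints a short efficiency summary.
seff 123456
</pre>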


=Best practice tips=
For your first jobs on the cluster, it is understandably difficult to estimate how much time or memory is needed to carry out a particular simulation or analysis. The best approach is to begin by submitting a few relatively small test jobs, asking for a fairly standard amount of memory (<tt>#SBATCH --mem-per-cpu=2G</tt>) and a modest duration, for example one or two hours; a minimal example script follows below. Ideally you should already know what the answer will be in these test jobs, allowing you to verify that the software is running correctly on the cluster. If the job ends before the computation finishes, double the requested duration and resubmit until it is sufficient. A similar method applies to memory: if your job ends with a message about an "OOM event", it ran out of memory (OOM), so try doubling the requested memory and see if that is enough. Through these test jobs you will gain some familiarity with how long certain analyses take on the cluster and how much memory they need, so that for more realistic jobs you can make an intelligent estimate.
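
As a minimal sketch of such a test job - the module name, program, and input file are placeholders to be replaced with your own software - the script below asks for one core, 2 GB of memory per core, and one hour:
<pre>
#!/bin/bash
#SBATCH --time=01:00:00       # one or two hours, as suggested above
#SBATCH --mem-per-cpu=2G      # a fairly standard memory request
#SBATCH --cpus-per-task=1     # a single core for a small test case

# Placeholder module and program; replace with your own software.
module load myprogram
myprogram --input small_test_case.dat
</pre>
Submit it with <tt>sbatch</tt>; if the job is stopped at its time limit or with an OOM event, double the corresponding request and resubmit.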


==Job duration==


Much like the case of nodes with a large amount of memory, nodes with GPUs are relatively uncommon, so any job which asks for a GPU will in most cases wait significantly longer to start. It is therefore in your interest to be sure that a GPU you waited longer to obtain is used as efficiently as possible and really does contribute to improved performance. A considerable amount of software does have a GPU option, including such widely used packages as [[NAMD]] and [[GROMACS]], but often only a small part of a program's functionality has been modified to make use of GPUs. For this reason, it is wiser to first test a small sample calculation both with and without a GPU to see what kind of speed-up you obtain; a sketch of such a comparison follows below. If your job finishes only 5% or 10% more quickly with a GPU, it is probably not worth the wait, as the GPU will be idle during much of your job's execution. Other tools for monitoring the efficiency of your GPU-based jobs include [https://developer.nvidia.com/nvidia-system-management-interface nvidia-smi] and, if you're using software based on [[TensorFlow]], the [[TensorFlow#TensorBoard|TensorBoard]] utility.
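
As a sketch of such a comparison - assuming your cluster accepts the <tt>--gpus-per-node</tt> directive, and with <tt>myprogram</tt> standing in for your own GPU-enabled software - submit the same small calculation twice, once with and once without the GPU request, then compare the <tt>Elapsed</tt> times reported by <tt>sacct</tt>:
<pre>
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --gpus-per-node=1     # omit this line in the CPU-only comparison job

# Placeholder module and program; replace with your own software.
module load myprogram

# Log GPU utilization every 60 seconds in the background while the program runs.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 60 > gpu_usage.log &
SMI_PID=$!

myprogram --input small_test_case.dat

kill $SMI_PID    # stop the logging once the program finishes
</pre>
If the utilization log shows the GPU mostly idle, or the elapsed time barely improves, the GPU request is probably not worth the extra time spent in the queue.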
==Avoid wasting resources==
* In general, your jobs should never contain the command <tt>sleep</tt>.
* We strongly recommend against the use of [[Anaconda/en|Conda]] and its variants on the clusters, in favour of solutions like a [[Python#Creating_and_using_a_virtual_environment|Python virtual environment]] or [[Singularity]]; a sketch of building a virtual environment inside a job follows this list.
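
As a sketch of the virtual-environment approach - the Python module version, package, and script names are placeholders, and it assumes the scheduler provides a node-local <tt>$SLURM_TMPDIR</tt> - an environment can be built fresh inside the job:
<pre>
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G

# Placeholder module version; check which Python modules your cluster provides.
module load python/3.11

# Build a fresh virtual environment in node-local temporary space.
python -m venv $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install numpy             # placeholder package list

python my_analysis.py         # placeholder analysis script
</pre>
Because the environment is rebuilt from an explicit package list each time the job starts, it stays self-contained on the compute node and is easy to reproduce.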