When submitting a job to one of the clusters, it's important to choose appropriate values for various parameters so that your job doesn't waste resources or create problems for other users and yourself. Doing so helps your job start more quickly and makes it more likely to finish correctly, producing the output you need to move your research forward. As you might expect, the more resources (time, CPU cores, memory, GPUs) your job asks for, the more difficult it will be for the scheduler to find them, and so the longer your job will wait in the queue.
For your first jobs on the cluster, it's understandably difficult to estimate how much time or memory your job may need to carry out a particular simulation or analysis. The best approach in this case is to begin by submitting a few relatively small jobs, asking for a fairly standard amount of memory (<tt>#SBATCH --mem-per-cpu=2G</tt>) and time, for example one or two hours. Ideally you should already know what the answer will be in these test jobs, allowing you to verify that the software is running correctly on the cluster. If the job ends before the computation has finished, double the requested duration and resubmit, repeating until the duration is sufficient.
A similar method can be applied to memory: if your job ends with a message about an "OOM event", it ran out of memory (OOM), so try doubling the memory you've requested and see if this is enough. Through these test jobs you should gain some familiarity with how long certain analyses take on the cluster and how much memory they need, so that for more realistic jobs you'll be able to make an informed estimate.
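The doubling approach described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: <tt>double_request</tt> and <tt>job.sh</tt> are hypothetical names, the time limit is expressed in whole minutes, and the script only prints the next <tt>sbatch</tt> command without submitting anything.

```shell
#!/bin/bash
# Sketch of the "double and retry" approach for test jobs.
# Assumptions: job.sh is a hypothetical job script, and requests are
# whole numbers with an optional K/M/G/T suffix (e.g. "2G" or "60").

# Print twice the given request, keeping any unit suffix.
double_request() {
    local value=${1%[KMGT]}     # numeric part, e.g. "2"
    local unit=${1#"$value"}    # optional unit suffix, e.g. "G"
    echo "$((value * 2))${unit}"
}

mem="2G"       # first guess, as in --mem-per-cpu=2G
minutes="60"   # first guess: one hour, expressed in minutes

# After an "OOM event", double the memory; after a timeout, double the time.
next_mem=$(double_request "$mem")
next_minutes=$(double_request "$minutes")

# Options given on the sbatch command line take precedence over
# #SBATCH lines inside the script, so job.sh itself need not be edited.
echo "sbatch --mem-per-cpu=${next_mem} --time=${next_minutes} job.sh"
```

In practice you would normally double only the resource that was actually exhausted, not both at once.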
In general, your jobs should never contain the command <tt>sleep</tt> and we strongly recommend against the use of [[Anaconda/en|Conda]] and its variants on the clusters, in favour of solutions like a [[Python#Creating_and_using_a_virtual_environment|Python virtual environment]] or [[Singularity]].
=Job duration=