Best practices for job submission

<!--T:12-->
* Your <code>Memory Efficiency</code> in the output from the <code>seff</code> command '''should be at least 80% to 85%''' in most cases.
** Much like with the duration of your job, the goal when requesting the memory is to ensure that the amount is sufficient, with a certain margin of error.
* If you plan on using a '''whole node''' for your job, it is natural to also '''use all of its available memory''', which you can express using the line <code>#SBATCH --mem=0</code> in your job submission script.
** Note however that most of our clusters offer nodes with varying amounts of memory, so using this approach means your job will likely be assigned one of the more numerous nodes with less memory.
* If your testing has shown that you need a '''large memory node''', then you will want to use a line like <code>#SBATCH --mem=1500G</code> for example, to request a node with 1500 GB (or 1.46 TB) of memory.
** There are relatively few of these large memory nodes, so your job will wait much longer to run; make sure your job really needs all this extra memory. A sample memory request is sketched after this list.
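
A minimal sketch of how these memory guidelines fit into a submission script is shown below. The program name, time limit and memory request are hypothetical placeholders; the actual values should come from your own test runs.

<pre>
#!/bin/bash
#SBATCH --time=03:00:00   # duration established through test runs
#SBATCH --mem=8G          # memory established through test runs, plus a margin

./my_program input.dat    # hypothetical serial program

# After the job completes, check the actual usage with
#   seff <jobid>
# and aim for a Memory Efficiency of at least 80% to 85%.
</pre>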




<!--T:14-->
* By default your job will get one core on one node and this is the most sensible policy because '''most software is serial''': it can only ever make use of a single core (a serial job request is sketched after this list).
** Asking for more cores and/or nodes will not make a serial program run any faster: for it to run in parallel, its source code needs to be modified, in some cases profoundly, requiring a substantial investment of developer time.
* How can you '''determine if''' the software you're using '''can run in parallel'''?
** The best approach is to '''look in the software's documentation''' for a section on parallel execution: if you can't find anything, this is usually a sign that this program is serial.
** You can also '''contact the development team''' to ask if the software can be run in parallel and if not, to request that such a feature be added in a future version.
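
If your testing confirms that the software is serial, the scheduler's default allocation of one core is already what you want; the sketch below simply states it explicitly. The program name and resource values are hypothetical placeholders.

<pre>
#!/bin/bash
#SBATCH --ntasks=1          # one task...
#SBATCH --cpus-per-task=1   # ...with one core: the default, stated explicitly
#SBATCH --time=01:00:00
#SBATCH --mem=4G

./my_serial_program input.dat   # hypothetical serial program
</pre>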


<!--T:15-->
* If the program can run in parallel, the next question is: '''how many cores should you use?'''
** Many of the programming techniques used to allow a program to run in parallel assume the existence of a ''shared memory environment'', i.e. multiple cores can be used but they must all be located on the same node. In this case, the maximum number of cores available on a single node provides a ceiling for the number of cores you can use (a shared-memory job request is sketched after this list).
** It may be tempting to simply request "as many cores as possible", but this is often not the wisest approach. Just as having too many cooks trying to work together in a small kitchen to prepare a single meal can lead to chaos, so too adding an excessive number of CPU cores can have the perverse effect of slowing down a program.
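
For a shared-memory program (e.g. one using OpenMP), a job script might look like the sketch below. The program name and resource values are hypothetical placeholders; the core count should come from a scaling test, i.e. timing the program with 2, 4, 8, ... cores and stopping where the speed-up levels off, rather than simply requesting as many cores as possible.

<pre>
#!/bin/bash
#SBATCH --nodes=1           # shared memory: all cores must be on the same node
#SBATCH --cpus-per-task=8   # placeholder; choose this value from your scaling tests
#SBATCH --time=02:00:00
#SBATCH --mem=16G

# Tell the program how many threads to start, matching the cores requested.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program input.dat   # hypothetical shared-memory program
</pre>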