(creation, split from META: A package...) |
(Marked this version for translation) |
||
Line 2: | Line 2: | ||
<translate> | <translate> | ||
<!--T:1--> | |||
This page is about [[META: A package for job farming]]. | This page is about [[META: A package for job farming]]. | ||
== Resubmitting failed cases automatically == | == Resubmitting failed cases automatically == <!--T:2--> | ||
<!--T:3--> | |||
If your farm is particularly large, that is, if it needs more resources than ''NJOBS_MAX x job_run_time'', where ''NJOBS_MAX'' is the maximum number of jobs one is allowed to submit, you will have to run <code>resubmit.run</code> after the original farm finishes running-- perhaps more than once. You can do it by hand, but with META you can also automate this process. To enable this feature, add the <code>-auto</code> switch to your <code>submit.run</code> or <code>resubmit.run</code> command: | If your farm is particularly large, that is, if it needs more resources than ''NJOBS_MAX x job_run_time'', where ''NJOBS_MAX'' is the maximum number of jobs one is allowed to submit, you will have to run <code>resubmit.run</code> after the original farm finishes running-- perhaps more than once. You can do it by hand, but with META you can also automate this process. To enable this feature, add the <code>-auto</code> switch to your <code>submit.run</code> or <code>resubmit.run</code> command: | ||
$ submit.run N -auto | <!--T:4--> | ||
$ submit.run N -auto | |||
<!--T:5--> | |||
This can be used in either SIMPLE or META mode. If your original <code>submit.run</code> command did not have the <code>-auto</code> switch, you can add it to <code>resubmit.run</code> after the original farm finishes running, to the same effect. | This can be used in either SIMPLE or META mode. If your original <code>submit.run</code> command did not have the <code>-auto</code> switch, you can add it to <code>resubmit.run</code> after the original farm finishes running, to the same effect. | ||
<!--T:6--> | |||
When you add <code>-auto</code>, <code>(re)submit.run</code> submits one more (serial) job, in addition to the farm jobs. The purpose of this job is to run the <code>resubmit.run</code> command automatically right after the current farm finishes running. The job script for this additional job is <code>resubmit_script.sh</code>, which should be present in the farm directory; a sample file is automatically copied there when you run <code>farm_init.run</code>. The only customization you need to do to this file is to correct the account name in the <code>#SBATCH -A</code> line. | When you add <code>-auto</code>, <code>(re)submit.run</code> submits one more (serial) job, in addition to the farm jobs. The purpose of this job is to run the <code>resubmit.run</code> command automatically right after the current farm finishes running. The job script for this additional job is <code>resubmit_script.sh</code>, which should be present in the farm directory; a sample file is automatically copied there when you run <code>farm_init.run</code>. The only customization you need to do to this file is to correct the account name in the <code>#SBATCH -A</code> line. | ||
<!--T:7--> | |||
If you are using <code>-auto</code>, the value of the <code>NJOBS_MAX</code> parameter defined in the <code>config.h</code> file should be at least one smaller than the largest number of jobs you can submit on the cluster. | If you are using <code>-auto</code>, the value of the <code>NJOBS_MAX</code> parameter defined in the <code>config.h</code> file should be at least one smaller than the largest number of jobs you can submit on the cluster. | ||
E.g. if the largest number of jobs one can submit on the cluster is 999 and you intend to use <code>-auto</code>, set <code>NJOBS_MAX</code> to 998. | E.g. if the largest number of jobs one can submit on the cluster is 999 and you intend to use <code>-auto</code>, set <code>NJOBS_MAX</code> to 998. | ||
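The adjustment above is a single subtraction. A minimal sketch, in which <code>CLUSTER_JOB_LIMIT</code> is a hypothetical name for whatever job limit your cluster enforces (it is not a META parameter):

```shell
# Hypothetical variable: the largest number of jobs the cluster lets you submit.
CLUSTER_JOB_LIMIT=999

# Leave room for the one extra serial job submitted by -auto.
NJOBS_MAX=$(( CLUSTER_JOB_LIMIT - 1 ))

echo "Set NJOBS_MAX to $NJOBS_MAX in config.h"
```

The computed value still has to be entered in <code>config.h</code> by hand; the snippet only illustrates the arithmetic.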
<!--T:8--> | |||
When using <code>-auto</code>, if at some point the only cases left to be processed are the ones which failed earlier, auto-resubmission will stop, and farm computations will end. This is to avoid an infinite loop on badly-formed cases which will always fail. If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm. You can see the relevant messages in the file <code>farm.log</code> created in the farm directory. | When using <code>-auto</code>, if at some point the only cases left to be processed are the ones which failed earlier, auto-resubmission will stop, and farm computations will end. This is to avoid an infinite loop on badly-formed cases which will always fail. If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm. You can see the relevant messages in the file <code>farm.log</code> created in the farm directory. | ||
== Running a post-processing job automatically == | == Running a post-processing job automatically == <!--T:9--> | ||
<!--T:10--> | |||
Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been '''successfully''' processed. | Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been '''successfully''' processed. | ||
If any cases failed-- ''i.e.'' had a non-zero exit status-- the post-processing job will not run. | If any cases failed-- ''i.e.'' had a non-zero exit status-- the post-processing job will not run. | ||
Line 26: | Line 34: | ||
This job can be of any kind-- serial, parallel, or an array job. | This job can be of any kind-- serial, parallel, or an array job. | ||
<!--T:11--> | |||
This feature uses the same script, <code>resubmit_script.sh</code>, described for [[#Resubmitting failed cases automatically|<code>-auto</code>]] above. | This feature uses the same script, <code>resubmit_script.sh</code>, described for [[#Resubmitting failed cases automatically|<code>-auto</code>]] above. | ||
Make sure <code>resubmit_script.sh</code> has the correct account name in the <code>#SBATCH -A</code> line. | Make sure <code>resubmit_script.sh</code> has the correct account name in the <code>#SBATCH -A</code> line. | ||
<!--T:12--> | |||
The automatic post-processing feature also causes one more serial job to be submitted, in addition to the number you request. | The automatic post-processing feature also causes one more serial job to be submitted, in addition to the number you request. | ||
Adjust the parameter <code>NJOBS_MAX</code> in <code>config.h</code> accordingly (''e.g.'' if the cluster has a job limit of 999, set it to 998). | Adjust the parameter <code>NJOBS_MAX</code> in <code>config.h</code> accordingly (''e.g.'' if the cluster has a job limit of 999, set it to 998). | ||
Line 34: | Line 44: | ||
You do not need to subtract 2 from <code>NJOBS_MAX</code>. | You do not need to subtract 2 from <code>NJOBS_MAX</code>. | ||
<!--T:13--> | |||
System messages from the auto-resubmit feature are logged in <code>farm.log</code>, in the root farm directory. | System messages from the auto-resubmit feature are logged in <code>farm.log</code>, in the root farm directory. | ||
=Additional information= | =Additional information= <!--T:14--> | ||
== Using the git repository == | == Using the git repository == <!--T:15--> | ||
<!--T:16--> | |||
To use META on a cluster where it is not installed as a module | To use META on a cluster where it is not installed as a module | ||
you can clone the package from our git repository: | you can clone the package from our git repository: | ||
$ git clone https://git.computecanada.ca/syam/meta-farm.git | <!--T:17--> | ||
$ git clone https://git.computecanada.ca/syam/meta-farm.git | |||
Then modify your $PATH variable to point to the <code>bin</code> subdirectory of the newly created <code>meta-farm</code> directory. | Then modify your $PATH variable to point to the <code>bin</code> subdirectory of the newly created <code>meta-farm</code> directory. | ||
Assuming you executed <code>git clone</code> inside your home directory, do this: | Assuming you executed <code>git clone</code> inside your home directory, do this: | ||
$ export PATH=~/meta-farm/bin:$PATH | $ export PATH=~/meta-farm/bin:$PATH | ||
<!--T:18--> | |||
Then proceed as shown in the META [[META: A package for job farming#Quick start|Quick start]] from the <code>farm_init.run</code> step. | Then proceed as shown in the META [[META: A package for job farming#Quick start|Quick start]] from the <code>farm_init.run</code> step. | ||
==Passing additional sbatch arguments== | ==Passing additional sbatch arguments== <!--T:19--> | ||
<!--T:20--> | |||
If you need to use additional <code>sbatch</code> arguments (like <code>--mem 4G, --gres=gpu:1</code> ''etc.''), | If you need to use additional <code>sbatch</code> arguments (like <code>--mem 4G, --gres=gpu:1</code> ''etc.''), | ||
add them to <code>job_script.sh</code> as separate <code>#SBATCH</code> lines. | add them to <code>job_script.sh</code> as separate <code>#SBATCH</code> lines. | ||
<!--T:21--> | |||
Or if you prefer, you can add them at the end of the <code>submit.run</code> or <code>resubmit.run</code> command | Or if you prefer, you can add them at the end of the <code>submit.run</code> or <code>resubmit.run</code> command | ||
and they will be passed to <code>sbatch</code>, ''e.g.:'' | and they will be passed to <code>sbatch</code>, ''e.g.:'' | ||
Line 64: | Line 80: | ||
<translate> | <translate> | ||
==Multi-threaded applications== | ==Multi-threaded applications== <!--T:22--> | ||
<!--T:23--> | |||
For [[Running_jobs#Threaded_or_OpenMP_job|multi-threaded]] applications (such as those that use [[OpenMP]], for example), | For [[Running_jobs#Threaded_or_OpenMP_job|multi-threaded]] applications (such as those that use [[OpenMP]], for example), | ||
add the following lines to <code>job_script.sh</code>: | add the following lines to <code>job_script.sh</code>: | ||
Line 77: | Line 94: | ||
<translate> | <translate> | ||
<!--T:24--> | |||
...where ''N'' is the number of CPU cores to use, and ''M'' is the total memory to reserve in megabytes. | ...where ''N'' is the number of CPU cores to use, and ''M'' is the total memory to reserve in megabytes. | ||
You may also supply <code>--cpus-per-task=N</code> and <code>--mem=M</code> as arguments to <code>(re)submit.run</code>. | You may also supply <code>--cpus-per-task=N</code> and <code>--mem=M</code> as arguments to <code>(re)submit.run</code>. | ||
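As a concrete sketch of such a <code>job_script.sh</code> fragment, assuming 4 cores and 8000 MB (both values are illustrative only). Setting <code>OMP_NUM_THREADS</code> is an addition that is commonly needed for OpenMP codes; it falls back to 1 when the Slurm variable <code>SLURM_CPUS_PER_TASK</code> is not set, for instance when testing outside a job:

```shell
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000

# Match the OpenMP thread count to the cores Slurm allocated (1 outside a job).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "Using $OMP_NUM_THREADS thread(s)"
```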
==MPI applications== | ==MPI applications== <!--T:25--> | ||
<!--T:26--> | |||
For applications that use [[MPI]], | For applications that use [[MPI]], | ||
add the following lines to <code>job_script.sh</code>: | add the following lines to <code>job_script.sh</code>: | ||
Line 92: | Line 111: | ||
<translate> | <translate> | ||
<!--T:27--> | |||
...where ''N'' is the number of CPU cores to use, and ''M'' is the memory to reserve for each core, in megabytes. | ...where ''N'' is the number of CPU cores to use, and ''M'' is the memory to reserve for each core, in megabytes. | ||
You may also supply <code>--ntasks=N</code> and <code>--mem-per-cpu=M</code> as arguments to <code>(re)submit.run</code>. | You may also supply <code>--ntasks=N</code> and <code>--mem-per-cpu=M</code> as arguments to <code>(re)submit.run</code>. | ||
See [[Advanced MPI scheduling]] for information about more-complicated MPI scenarios. | See [[Advanced MPI scheduling]] for information about more-complicated MPI scenarios. | ||
<!--T:28--> | |||
Also add <code>srun</code> before the path to your code inside <code>single_case.sh</code>, ''e.g.'': | Also add <code>srun</code> before the path to your code inside <code>single_case.sh</code>, ''e.g.'': | ||
Line 104: | Line 125: | ||
<translate> | <translate> | ||
<!--T:29--> | |||
Alternatively, you can prepend <code>srun</code> to each line of <code>table.dat</code>: | Alternatively, you can prepend <code>srun</code> to each line of <code>table.dat</code>: | ||
Line 115: | Line 137: | ||
<translate> | <translate> | ||
==GPU applications== | ==GPU applications== <!--T:30--> | ||
<!--T:31--> | |||
For applications which use GPUs, modify <code>job_script.sh</code> following the guidance at [[Using GPUs with Slurm]]. | For applications which use GPUs, modify <code>job_script.sh</code> following the guidance at [[Using GPUs with Slurm]]. | ||
For example, if your cases each use one GPU, add this line: | For example, if your cases each use one GPU, add this line: | ||
Line 126: | Line 149: | ||
<translate> | <translate> | ||
<!--T:32--> | |||
You may also wish to copy the utility <code>~syam/bin/gpu_test</code> to your <code>~/bin</code> directory (only on Graham, Cedar, and Beluga), | You may also wish to copy the utility <code>~syam/bin/gpu_test</code> to your <code>~/bin</code> directory (only on Graham, Cedar, and Beluga), | ||
and put the following lines in <code>job_script.sh</code> right before the <code>task.run</code> line: | and put the following lines in <code>job_script.sh</code> right before the <code>task.run</code> line: | ||
Line 140: | Line 164: | ||
<translate> | <translate> | ||
<!--T:33--> | |||
This will catch those rare situations when there is a problem with the node which renders the GPU unavailable. | This will catch those rare situations when there is a problem with the node which renders the GPU unavailable. | ||
If that happens to one of your meta-jobs, and you don't detect the GPU failure somehow, | If that happens to one of your meta-jobs, and you don't detect the GPU failure somehow, | ||
then the job will try (and fail) to run all your cases from <code>table.dat</code>. | then the job will try (and fail) to run all your cases from <code>table.dat</code>. | ||
== Environment variables and --export == | == Environment variables and --export == <!--T:34--> | ||
<!--T:35--> | |||
All the jobs generated by the META package inherit the environment present when you run <code>submit.run</code> or <code>resubmit.run</code>. | All the jobs generated by the META package inherit the environment present when you run <code>submit.run</code> or <code>resubmit.run</code>. | ||
This includes all the loaded modules and environment variables. | This includes all the loaded modules and environment variables. | ||
Line 153: | Line 179: | ||
''e.g.'' <code>--export=ALL,X=1,Y=2</code>. | ''e.g.'' <code>--export=ALL,X=1,Y=2</code>. | ||
<!--T:36--> | |||
If you need to pass values of custom environment variables to all of your farm jobs | If you need to pass values of custom environment variables to all of your farm jobs | ||
(including auto-resubmitted jobs and the post-processing job if there is one), | (including auto-resubmitted jobs and the post-processing job if there is one), | ||
Line 163: | Line 190: | ||
<translate> | <translate> | ||
<!--T:37--> | |||
Here <code>VAR1, VAR2, VAR3</code> are custom environment variables which will be passed to all farm jobs. | Here <code>VAR1, VAR2, VAR3</code> are custom environment variables which will be passed to all farm jobs. | ||
== Example: Numbered input files == | == Example: Numbered input files == <!--T:38--> | ||
<!--T:39--> | |||
Suppose you have an application called <code>fcode</code>, and each case needs to read a separate file from standard input-- | Suppose you have an application called <code>fcode</code>, and each case needs to read a separate file from standard input-- | ||
say <code>data.X</code>, where ''X'' ranges from 1 to ''N_cases''. | say <code>data.X</code>, where ''X'' ranges from 1 to ''N_cases''. | ||
Line 175: | Line 204: | ||
Create <code>table.dat</code> in the farm META directory like this: | Create <code>table.dat</code> in the farm META directory like this: | ||
fcode < /home/user/IC/data.1 | <!--T:40--> | ||
fcode < /home/user/IC/data.1 | |||
fcode < /home/user/IC/data.2 | fcode < /home/user/IC/data.2 | ||
fcode < /home/user/IC/data.3 | fcode < /home/user/IC/data.3 | ||
... | ... | ||
<!--T:41--> | |||
You might wish to use a shell loop to create <code>table.dat</code>, ''e.g.'': | You might wish to use a shell loop to create <code>table.dat</code>, ''e.g.'': | ||
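One possible loop is sketched here. The case count and the data directory are placeholder values taken from the listing above and should be adjusted to your setup:

```shell
# Placeholder values: adjust to your own farm.
N_cases=3
DATA_DIR=/home/user/IC
TABLE=table.dat

# Start from an empty table, then append one command line per case.
: > "$TABLE"
i=1
while [ "$i" -le "$N_cases" ]; do
    echo "fcode < $DATA_DIR/data.$i" >> "$TABLE"
    i=$((i + 1))
done
```

Truncating the file first means that rerunning the loop does not leave duplicate lines in <code>table.dat</code>.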
Line 188: | Line 219: | ||
<translate> | <translate> | ||
== Example: Input file must have the same name == | == Example: Input file must have the same name == <!--T:42--> | ||
<!--T:43--> | |||
Some applications expect to read input from a file with a prescribed and unchangeable name, like <code>INPUT</code> for example. | Some applications expect to read input from a file with a prescribed and unchangeable name, like <code>INPUT</code> for example. | ||
To handle this situation each case must run in its own subdirectory, | To handle this situation each case must run in its own subdirectory, | ||
Line 197: | Line 229: | ||
Your <code>table.dat</code> can contain nothing but the application name, over and over again: | Your <code>table.dat</code> can contain nothing but the application name, over and over again: | ||
/path/to/code | <!--T:44--> | ||
/path/to/code | |||
/path/to/code | /path/to/code | ||
... | ... | ||
<!--T:45--> | |||
Add a line to <code>single_case.sh</code> | Add a line to <code>single_case.sh</code> | ||
which copies the input file into the farm ''sub''directory for each case-- | which copies the input file into the farm ''sub''directory for each case-- | ||
Line 213: | Line 247: | ||
<translate> | <translate> | ||
==Using all the columns in the cases table explicitly== | ==Using all the columns in the cases table explicitly== <!--T:46--> | ||
The examples shown so far assume that each line in the cases table is an executable statement, starting with either the name of the executable file (when it is on your <code>$PATH</code>) or the full path to the executable file, and then listing the command line arguments particular to that case, or something like <code> < input.$ID</code> if your code expects to read a standard input file. | The examples shown so far assume that each line in the cases table is an executable statement, starting with either the name of the executable file (when it is on your <code>$PATH</code>) or the full path to the executable file, and then listing the command line arguments particular to that case, or something like <code> < input.$ID</code> if your code expects to read a standard input file. | ||
<!--T:47--> | |||
In the most general case, you may want to be able to access all the columns in the table individually. That can be done by modifying <code>single_case.sh</code>: | In the most general case, you may want to be able to access all the columns in the table individually. That can be done by modifying <code>single_case.sh</code>: | ||
Line 248: | Line 283: | ||
<translate> | <translate> | ||
<!--T:48--> | |||
For example, you might need to provide to your code ''both'' a standard input file ''and'' a variable number of command line arguments. | For example, you might need to provide to your code ''both'' a standard input file ''and'' a variable number of command line arguments. | ||
Your cases table will look like this: | Your cases table will look like this: | ||
/path/to/IC.1 0.1 | <!--T:49--> | ||
/path/to/IC.1 0.1 | |||
/path/to/IC.2 0.2 10 | /path/to/IC.2 0.2 10 | ||
... | ... | ||
<!--T:50--> | |||
The way to implement this in <code>single_case.sh</code> is as follows: | The way to implement this in <code>single_case.sh</code> is as follows: | ||
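The idea can be illustrated with the standalone sketch below, which is not the packaged <code>single_case.sh</code>. It assumes that the text of one table line is available in a variable, here called <code>COMM</code>; the function <code>run_code</code> is a hypothetical stand-in for your application:

```shell
# Stand-in for the real application: prints its arguments, then its standard input.
run_code() {
    echo "args: $*"
    cat
}

# Create a throwaway input file so the sketch can run anywhere.
INPUT_FILE=$(mktemp)
echo "initial conditions" > "$INPUT_FILE"

# COMM is an assumed name holding the text of one line of the cases table.
COMM="$INPUT_FILE 0.2 10"

# Split the line into columns; this overwrites the positional parameters.
set -- $COMM
INPUT=$1   # first column: the standard input file
shift      # the remaining columns are the command line arguments

OUTPUT=$(run_code "$@" < "$INPUT")
echo "$OUTPUT"
rm -f "$INPUT_FILE"
```

Because the number of remaining columns is not fixed, passing <code>"$@"</code> handles lines with one, two, or more arguments in the same way.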
Line 264: | Line 302: | ||
<translate> | <translate> | ||
=Troubleshooting= | =Troubleshooting= <!--T:51--> | ||
<!--T:52--> | |||
Here we explain typical error messages you might get when using this package. | Here we explain typical error messages you might get when using this package. | ||
==Problems affecting multiple commands== | ==Problems affecting multiple commands== <!--T:53--> | ||
==="Non-farm directory, or no farm has been submitted; exiting"=== | ==="Non-farm directory, or no farm has been submitted; exiting"=== <!--T:54--> | ||
Either the current directory is not a farm directory, or you never ran <code>submit.run</code> for this farm. | Either the current directory is not a farm directory, or you never ran <code>submit.run</code> for this farm. | ||
==Problems with submit.run== | ==Problems with submit.run== <!--T:55--> | ||
==="Wrong first argument: XXX (should be a positive integer or -1) ; exiting"=== | ==="Wrong first argument: XXX (should be a positive integer or -1) ; exiting"=== | ||
Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode. | Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode. | ||
==="lockfile is not on path; exiting"=== | ==="lockfile is not on path; exiting"=== <!--T:56--> | ||
Make sure the utility <code>lockfile</code> is on your $PATH. | Make sure the utility <code>lockfile</code> is on your $PATH. | ||
This utility is critical for this package. | This utility is critical for this package. | ||
Line 283: | Line 322: | ||
it ensures that two different meta-jobs do not read the same line of <code>table.dat</code> at the same time. | it ensures that two different meta-jobs do not read the same line of <code>table.dat</code> at the same time. | ||
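A quick way to check this before submitting is sketched below. It relies only on the standard shell built-in <code>command -v</code>; the variable name <code>LOCKFILE_STATUS</code> is illustrative:

```shell
# Report whether the lockfile utility can be found on the current PATH.
if command -v lockfile >/dev/null 2>&1; then
    LOCKFILE_STATUS="found: $(command -v lockfile)"
else
    LOCKFILE_STATUS="missing"
fi
echo "lockfile $LOCKFILE_STATUS"
```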
==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== | ==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== <!--T:57--> | ||
Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files. | Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files. | ||
==="-auto option requires resubmit_script.sh file in the root farm directory; exiting"=== | ==="-auto option requires resubmit_script.sh file in the root farm directory; exiting"=== <!--T:58--> | ||
You used the <code>-auto</code> option, but you forgot to create the <code>resubmit_script.sh</code> file inside the root farm directory. A sample <code>resubmit_script.sh</code> is created automatically when you use <code>farm_init.run</code>. | You used the <code>-auto</code> option, but you forgot to create the <code>resubmit_script.sh</code> file inside the root farm directory. A sample <code>resubmit_script.sh</code> is created automatically when you use <code>farm_init.run</code>. | ||
==="File table.dat doesn't exist. Exiting"=== | ==="File table.dat doesn't exist. Exiting"=== <!--T:59--> | ||
You forgot to create the <code>table.dat</code> file in the current directory, or perhaps you are not running <code>submit.run</code> inside one of your farm sub-directories. | You forgot to create the <code>table.dat</code> file in the current directory, or perhaps you are not running <code>submit.run</code> inside one of your farm sub-directories. | ||
==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== | ==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== <!--T:60--> | ||
Make sure you provide a run-time limit for all meta-jobs as an <code>#SBATCH</code> argument inside your <code>job_script.sh</code> file. | Make sure you provide a run-time limit for all meta-jobs as an <code>#SBATCH</code> argument inside your <code>job_script.sh</code> file. | ||
The run-time limit is the only <code>sbatch</code> argument which cannot be passed as an optional argument to <code>submit.run</code>. | The run-time limit is the only <code>sbatch</code> argument which cannot be passed as an optional argument to <code>submit.run</code>. | ||
==="Wrong job runtime in job_script.sh - nnn . Exiting"=== | ==="Wrong job runtime in job_script.sh - nnn . Exiting"=== <!--T:61--> | ||
You didn't properly format the run-time argument inside your <code>job_script.sh</code> file. | You didn't properly format the run-time argument inside your <code>job_script.sh</code> file. | ||
==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== | ==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== <!--T:62--> | ||
==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"=== | ==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"=== | ||
With either of the two messages, there was an issue with submitting jobs with <code>sbatch</code>. | With either of the two messages, there was an issue with submitting jobs with <code>sbatch</code>. | ||
The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later. | The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later. | ||
==="Couldn't create subdirectories inside the farm directory ; exiting"=== | ==="Couldn't create subdirectories inside the farm directory ; exiting"=== <!--T:63--> | ||
==="Couldn't create the temp directory XXX ; exiting"=== | ==="Couldn't create the temp directory XXX ; exiting"=== | ||
==="Couldn't create a file inside XXX ; exiting"=== | ==="Couldn't create a file inside XXX ; exiting"=== | ||
Line 310: | Line 349: | ||
Either permissions got messed up, or you have exhausted a quota. Fix the issue(s), then try again. | Either permissions got messed up, or you have exhausted a quota. Fix the issue(s), then try again. | ||
==Problems with resubmit.run== | ==Problems with resubmit.run== <!--T:64--> | ||
==="Jobs are still running/queued; cannot resubmit"=== | ==="Jobs are still running/queued; cannot resubmit"=== | ||
You cannot use <code>resubmit.run</code> until all meta-jobs from this farm have finished running. | You cannot use <code>resubmit.run</code> until all meta-jobs from this farm have finished running. | ||
Use <code>list.run</code> or <code>queue.run</code> to check the status of the farm. | Use <code>list.run</code> or <code>queue.run</code> to check the status of the farm. | ||
==="No failed/unfinished jobs; nothing to resubmit"=== | ==="No failed/unfinished jobs; nothing to resubmit"=== <!--T:65--> | ||
Your farm was 100% processed. There are no more (failed or never-ran) cases to compute. | Your farm was 100% processed. There are no more (failed or never-ran) cases to compute. | ||
==Problems with running jobs== | ==Problems with running jobs== <!--T:66--> | ||
==="Too many failed (very short) cases - exiting"=== | ==="Too many failed (very short) cases - exiting"=== | ||
This happens if the first <code>$N_failed_max</code> cases are very short-- less than <code>$dt_failed</code> seconds in duration. | This happens if the first <code>$N_failed_max</code> cases are very short-- less than <code>$dt_failed</code> seconds in duration. | ||
Line 325: | Line 364: | ||
or else adjust the <code>$N_failed_max</code> and <code>$dt_failed</code> values in <code>config.h</code>. | or else adjust the <code>$N_failed_max</code> and <code>$dt_failed</code> values in <code>config.h</code>. | ||
==="lockfile is not on path on node XXX"=== | ==="lockfile is not on path on node XXX"=== <!--T:67--> | ||
As the error message suggests, somehow the utility <code>lockfile</code> is not on your <code>$PATH</code> on some node. | As the error message suggests, somehow the utility <code>lockfile</code> is not on your <code>$PATH</code> on some node. | ||
Use <code>which lockfile</code> to ensure that the utility is somewhere in your <code>$PATH</code>. | Use <code>which lockfile</code> to ensure that the utility is somewhere in your <code>$PATH</code>. | ||
Line 331: | Line 370: | ||
for example a file system may have failed to mount. | for example a file system may have failed to mount. | ||
==="Exiting after processing one case (-1 option)"=== | ==="Exiting after processing one case (-1 option)"=== <!--T:68--> | ||
This is not an error message. It simply tells you that you submitted the farm with <code>submit.run -1</code> (one case per job mode), so each meta-job is exiting after processing a single case. | This is not an error message. It simply tells you that you submitted the farm with <code>submit.run -1</code> (one case per job mode), so each meta-job is exiting after processing a single case. | ||
==="Not enough runtime left; exiting."=== | ==="Not enough runtime left; exiting."=== <!--T:69--> | ||
This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early. | This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early. | ||
==="No cases left; exiting."=== | ==="No cases left; exiting."=== <!--T:70--> | ||
This is not an error message. This is how each meta-job normally finishes, when all cases have been computed. | This is not an error message. This is how each meta-job normally finishes, when all cases have been computed. | ||
==="Only failed cases left; cannot auto-resubmit; exiting"=== | ==="Only failed cases left; cannot auto-resubmit; exiting"=== <!--T:71--> | ||
This can only happen if you used the <code>-auto</code> switch when submitting the farm. | This can only happen if you used the <code>-auto</code> switch when submitting the farm. | ||
Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>. | Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>. | ||
=Words of caution= | =Words of caution= <!--T:72--> | ||
<!--T:73--> | |||
Always start with a small test run to make sure everything works before submitting a large production run. You can test individual cases by reserving an interactive node with <code>salloc</code>, changing to the farm directory, and executing commands like <code>./single_case.sh table.dat 1</code>, <code>./single_case.sh table.dat 2</code>, ''etc.'' | Always start with a small test run to make sure everything works before submitting a large production run. You can test individual cases by reserving an interactive node with <code>salloc</code>, changing to the farm directory, and executing commands like <code>./single_case.sh table.dat 1</code>, <code>./single_case.sh table.dat 2</code>, ''etc.'' | ||
== More than 10,000 cases == | == More than 10,000 cases == <!--T:74--> | ||
<!--T:75--> | |||
If your farm is particularly large (say >10,000 cases), you should spend extra effort to make sure it runs as efficiently as possible. In particular, minimize the number of files and/or directories created during execution. If possible, instruct your code to append to existing files (one per meta-job; '''do not mix results from different meta-jobs in a single output file!''') instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case. (Yes, creating a separate subdirectory for each case is the default setup of this package, but that default was chosen for safety, not efficiency!) | If your farm is particularly large (say >10,000 cases), you should spend extra effort to make sure it runs as efficiently as possible. In particular, minimize the number of files and/or directories created during execution. If possible, instruct your code to append to existing files (one per meta-job; '''do not mix results from different meta-jobs in a single output file!''') instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case. (Yes, creating a separate subdirectory for each case is the default setup of this package, but that default was chosen for safety, not efficiency!) | ||
<!--T:76--> | |||
The following example is optimized for a very large number of cases. It assumes, for purposes of the example: | The following example is optimized for a very large number of cases. It assumes, for purposes of the example: | ||
* that your code accepts the output file name via a command line switch <code>-o</code>, | * that your code accepts the output file name via a command line switch <code>-o</code>, | ||
Line 379: | Line 421: | ||
<translate> | <translate> | ||
<!--T:77--> | |||
''Parent page:'' [[META: A package for job farming]] | ''Parent page:'' [[META: A package for job farming]] | ||
</translate> | </translate> |