META-Farm: Advanced features and troubleshooting: Difference between revisions
(creation, split from META: A package...) |
(Marked this version for translation) |
||
Line 2: | Line 2: | ||
<translate> | <translate> | ||
<!--T:1--> | |||
This page is about [[META: A package for job farming]]. | This page is about [[META: A package for job farming]]. | ||
== Resubmitting failed cases automatically == | == Resubmitting failed cases automatically == <!--T:2--> | ||
<!--T:3--> | |||
If your farm is particularly large, that is, if it needs more resources than ''NJOBS_MAX x job_run_time'', where ''NJOBS_MAX'' is the maximum number of jobs one is allowed to submit, you will have to run <code>resubmit.run</code> after the original farm finishes running-- perhaps more than once. You can do it by hand, but with META you can also automate this process. To enable this feature, add the <code>-auto</code> switch to your <code>submit.run</code> or <code>resubmit.run</code> command: | If your farm is particularly large, that is, if it needs more resources than ''NJOBS_MAX x job_run_time'', where ''NJOBS_MAX'' is the maximum number of jobs one is allowed to submit, you will have to run <code>resubmit.run</code> after the original farm finishes running-- perhaps more than once. You can do it by hand, but with META you can also automate this process. To enable this feature, add the <code>-auto</code> switch to your <code>submit.run</code> or <code>resubmit.run</code> command: | ||
$ submit.run N -auto | <!--T:4--> | ||
$ submit.run N -auto | |||
<!--T:5--> | |||
This can be used in either SIMPLE or META mode. If your original <code>submit.run</code> command did not have the <code>-auto</code> switch, you can add it to <code>resubmit.run</code> after the original farm finishes running, to the same effect. | This can be used in either SIMPLE or META mode. If your original <code>submit.run</code> command did not have the <code>-auto</code> switch, you can add it to <code>resubmit.run</code> after the original farm finishes running, to the same effect. | ||
<!--T:6--> | |||
When you add <code>-auto</code>, <code>(re)submit.run</code> submits one more (serial) job, in addition to the farm jobs. The purpose of this job is to run the <code>resubmit.run</code> command automatically right after the current farm finishes running. The job script for this additional job is <code>resubmit_script.sh</code>, which should be present in the farm directory; a sample file is automatically copied there when you run <code>farm_init.run</code>. The only customization you need to do to this file is to correct the account name in the <code>#SBATCH -A</code> line. | When you add <code>-auto</code>, <code>(re)submit.run</code> submits one more (serial) job, in addition to the farm jobs. The purpose of this job is to run the <code>resubmit.run</code> command automatically right after the current farm finishes running. The job script for this additional job is <code>resubmit_script.sh</code>, which should be present in the farm directory; a sample file is automatically copied there when you run <code>farm_init.run</code>. The only customization you need to do to this file is to correct the account name in the <code>#SBATCH -A</code> line. | ||
<!--T:7--> | |||
If you are using <code>-auto</code>, the value of the <code>NJOBS_MAX</code> parameter defined in the <code>config.h</code> file should be at least one smaller than the largest number of jobs you can submit on the cluster. | If you are using <code>-auto</code>, the value of the <code>NJOBS_MAX</code> parameter defined in the <code>config.h</code> file should be at least one smaller than the largest number of jobs you can submit on the cluster. | ||
E.g. if the largest number of jobs one can submit on the cluster is 999 and you intend to use <code>-auto</code>, set <code>NJOBS_MAX</code> to 998. | E.g. if the largest number of jobs one can submit on the cluster is 999 and you intend to use <code>-auto</code>, set <code>NJOBS_MAX</code> to 998. | ||
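As a rough worked example of the two rules above, the following sketch derives an <code>NJOBS_MAX</code> value from a cluster job limit and estimates how many submission rounds a farm might need. All numbers and variable names (<code>cluster_job_limit</code>, <code>total_case_hours</code>, <code>job_run_hours</code>) are illustrative assumptions, and the estimate ignores failed cases and imperfect packing of cases into meta-jobs.

```shell
#!/bin/sh
# Illustrative numbers only; substitute your own.
cluster_job_limit=999      # largest number of jobs the cluster lets you submit
total_case_hours=10000     # sum of the expected run times of all cases, in hours
job_run_hours=3            # run time requested for each meta-job, in hours

# Leave one slot free for the extra serial job that -auto submits.
njobs_max=$((cluster_job_limit - 1))

# Hours of computation that one full round of meta-jobs can deliver.
round_capacity=$((njobs_max * job_run_hours))

# Ceiling division: rounds needed if every round is fully used.
rounds=$(( (total_case_hours + round_capacity - 1) / round_capacity ))

echo "NJOBS_MAX=$njobs_max"
echo "estimated rounds: $rounds"
```

With these assumed numbers the estimate is four rounds, that is, the original <code>submit.run</code> plus about three <code>resubmit.run</code> rounds, which is the situation where <code>-auto</code> saves manual work.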
<!--T:8--> | |||
When using <code>-auto</code>, if at some point the only cases left to be processed are the ones which failed earlier, auto-resubmission will stop, and farm computations will end. This is to avoid an infinite loop on badly-formed cases which will always fail. If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm. You can see the relevant messages in the file <code>farm.log</code> created in the farm directory. | When using <code>-auto</code>, if at some point the only cases left to be processed are the ones which failed earlier, auto-resubmission will stop, and farm computations will end. This is to avoid an infinite loop on badly-formed cases which will always fail. If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm. You can see the relevant messages in the file <code>farm.log</code> created in the farm directory. | ||
== Running a post-processing job automatically == | == Running a post-processing job automatically == <!--T:9--> | ||
<!--T:10--> | |||
Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been '''successfully''' processed. | Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been '''successfully''' processed. | ||
If any cases failed-- ''i.e.'' had a non-zero exit status-- the post-processing job will not run. | If any cases failed-- ''i.e.'' had a non-zero exit status-- the post-processing job will not run. | ||
Line 26: | Line 34: | ||
This job can be of any kind-- serial, parallel, or an array job. | This job can be of any kind-- serial, parallel, or an array job. | ||
<!--T:11--> | |||
This feature uses the same script, <code>resubmit_script.sh</code>, described for [[#Resubmitting failed cases automatically|<code>-auto</code>]] above. | This feature uses the same script, <code>resubmit_script.sh</code>, described for [[#Resubmitting failed cases automatically|<code>-auto</code>]] above. | ||
Make sure <code>resubmit_script.sh</code> has the correct account name in the <code>#SBATCH -A</code> line. | Make sure <code>resubmit_script.sh</code> has the correct account name in the <code>#SBATCH -A</code> line. | ||
<!--T:12--> | |||
The automatic post-processing feature also causes one more serial job to be submitted, above the number you request. | The automatic post-processing feature also causes one more serial job to be submitted, above the number you request. | ||
Adjust the parameter <code>NJOBS_MAX</code> in <code>config.h</code> accordingly (''e.g.'' if the cluster has a job limit of 999, set it to 998). | Adjust the parameter <code>NJOBS_MAX</code> in <code>config.h</code> accordingly (''e.g.'' if the cluster has a job limit of 999, set it to 998). | ||
Line 34: | Line 44: | ||
You do not need to subtract 2 from <code>NJOBS_MAX</code>. | You do not need to subtract 2 from <code>NJOBS_MAX</code>. | ||
<!--T:13--> | |||
System messages from the auto-resubmit feature are logged in <code>farm.log</code>, in the root farm directory. | System messages from the auto-resubmit feature are logged in <code>farm.log</code>, in the root farm directory. | ||
=Additional information= | =Additional information= <!--T:14--> | ||
== Using the git repository == | == Using the git repository == <!--T:15--> | ||
<!--T:16--> | |||
To use META on a cluster where it is not installed as a module | To use META on a cluster where it is not installed as a module | ||
you can clone the package from our git repository: | you can clone the package from our git repository: | ||
$ git clone https://git.computecanada.ca/syam/meta-farm.git | <!--T:17--> | ||
$ git clone https://git.computecanada.ca/syam/meta-farm.git | |||
Then modify your $PATH variable to point to the <code>bin</code> subdirectory of the newly created <code>meta-farm</code> directory. | Then modify your $PATH variable to point to the <code>bin</code> subdirectory of the newly created <code>meta-farm</code> directory. | ||
Assuming you executed <code>git clone</code> inside your home directory, do this: | Assuming you executed <code>git clone</code> inside your home directory, do this: | ||
$ export PATH=~/meta-farm/bin:$PATH | $ export PATH=~/meta-farm/bin:$PATH | ||
<!--T:18--> | |||
Then proceed as shown in the META [[META: A package for job farming#Quick start|Quick start]] from the <code>farm_init.run</code> step. | Then proceed as shown in the META [[META: A package for job farming#Quick start|Quick start]] from the <code>farm_init.run</code> step. | ||
==Passing additional sbatch arguments== | ==Passing additional sbatch arguments== <!--T:19--> | ||
<!--T:20--> | |||
If you need to use additional <code>sbatch</code> arguments (like <code>--mem 4G, --gres=gpu:1</code> ''etc.''), | If you need to use additional <code>sbatch</code> arguments (like <code>--mem 4G, --gres=gpu:1</code> ''etc.''), | ||
add them to <code>job_script.sh</code> as separate <code>#SBATCH</code> lines. | add them to <code>job_script.sh</code> as separate <code>#SBATCH</code> lines. | ||
<!--T:21--> | |||
Or if you prefer, you can add them at the end of the <code>submit.run</code> or <code>resubmit.run</code> command | Or if you prefer, you can add them at the end of the <code>submit.run</code> or <code>resubmit.run</code> command | ||
and they will be passed to <code>sbatch</code>, ''e.g.:'' | and they will be passed to <code>sbatch</code>, ''e.g.:'' | ||
Line 64: | Line 80: | ||
<translate> | <translate> | ||
==Multi-threaded applications== | ==Multi-threaded applications== <!--T:22--> | ||
<!--T:23--> | |||
For [[Running_jobs#Threaded_or_OpenMP_job|multi-threaded]] applications (such as those that use [[OpenMP]], for example), | For [[Running_jobs#Threaded_or_OpenMP_job|multi-threaded]] applications (such as those that use [[OpenMP]], for example), | ||
add the following lines to <code>job_script.sh</code>: | add the following lines to <code>job_script.sh</code>: | ||
Line 77: | Line 94: | ||
<translate> | <translate> | ||
<!--T:24--> | |||
...where ''N'' is the number of CPU cores to use, and ''M'' is the total memory to reserve in megabytes. | ...where ''N'' is the number of CPU cores to use, and ''M'' is the total memory to reserve in megabytes. | ||
You may also supply <code>--cpus-per-task=N</code> and <code>--mem=M</code> as arguments to <code>(re)submit.run</code>. | You may also supply <code>--cpus-per-task=N</code> and <code>--mem=M</code> as arguments to <code>(re)submit.run</code>. | ||
==MPI applications== | ==MPI applications== <!--T:25--> | ||
<!--T:26--> | |||
For applications that use [[MPI]], | For applications that use [[MPI]], | ||
add the following lines to <code>job_script.sh</code>: | add the following lines to <code>job_script.sh</code>: | ||
Line 92: | Line 111: | ||
<translate> | <translate> | ||
<!--T:27--> | |||
...where ''N'' is the number of CPU cores to use, and ''M'' is the memory to reserve for each core, in megabytes. | ...where ''N'' is the number of CPU cores to use, and ''M'' is the memory to reserve for each core, in megabytes. | ||
You may also supply <code>--ntasks=N</code> and <code>--mem-per-cpu=M</code> as arguments to <code>(re)submit.run</code>. | You may also supply <code>--ntasks=N</code> and <code>--mem-per-cpu=M</code> as arguments to <code>(re)submit.run</code>. | ||
See [[Advanced MPI scheduling]] for information about more-complicated MPI scenarios. | See [[Advanced MPI scheduling]] for information about more-complicated MPI scenarios. | ||
<!--T:28--> | |||
Also add <code>srun</code> before the path to your code inside <code>single_case.sh</code>, ''e.g.'': | Also add <code>srun</code> before the path to your code inside <code>single_case.sh</code>, ''e.g.'': | ||
Line 104: | Line 125: | ||
<translate> | <translate> | ||
<!--T:29--> | |||
Alternatively, you can prepend <code>srun</code> to each line of <code>table.dat</code>: | Alternatively, you can prepend <code>srun</code> to each line of <code>table.dat</code>: | ||
Line 115: | Line 137: | ||
<translate> | <translate> | ||
==GPU applications== | ==GPU applications== <!--T:30--> | ||
<!--T:31--> | |||
For applications which use GPUs, modify <code>job_script.sh</code> following the guidance at [[Using GPUs with Slurm]]. | For applications which use GPUs, modify <code>job_script.sh</code> following the guidance at [[Using GPUs with Slurm]]. | ||
For example, if your cases each use one GPU, add this line: | For example, if your cases each use one GPU, add this line: | ||
Line 126: | Line 149: | ||
<translate> | <translate> | ||
<!--T:32--> | |||
You may also wish to copy the utility <code>~syam/bin/gpu_test</code> to your <code>~/bin</code> directory (only on Graham, Cedar, and Beluga), | You may also wish to copy the utility <code>~syam/bin/gpu_test</code> to your <code>~/bin</code> directory (only on Graham, Cedar, and Beluga), | ||
and put the following lines in <code>job_script.sh</code> right before the <code>task.run</code> line: | and put the following lines in <code>job_script.sh</code> right before the <code>task.run</code> line: | ||
Line 140: | Line 164: | ||
<translate> | <translate> | ||
<!--T:33--> | |||
This will catch those rare situations when there is a problem with the node which renders the GPU unavailable. | This will catch those rare situations when there is a problem with the node which renders the GPU unavailable. | ||
If that happens to one of your meta-jobs, and you don't detect the GPU failure somehow, | If that happens to one of your meta-jobs, and you don't detect the GPU failure somehow, | ||
then the job will try (and fail) to run all your cases from <code>table.dat</code>. | then the job will try (and fail) to run all your cases from <code>table.dat</code>. | ||
== Environment variables and --export == | == Environment variables and --export == <!--T:34--> | ||
<!--T:35--> | |||
All the jobs generated by the META package inherit the environment present when you run <code>submit.run</code> or <code>resubmit.run</code>. | All the jobs generated by the META package inherit the environment present when you run <code>submit.run</code> or <code>resubmit.run</code>. | ||
This includes all the loaded modules and environment variables. | This includes all the loaded modules and environment variables. | ||
Line 153: | Line 179: | ||
''e.g.'' <code>--export=ALL,X=1,Y=2</code>. | ''e.g.'' <code>--export=ALL,X=1,Y=2</code>. | ||
<!--T:36--> | |||
If you need to pass values of custom environment variables to all of your farm jobs | If you need to pass values of custom environment variables to all of your farm jobs | ||
(including auto-resubmitted jobs and the post-processing job if there is one), | (including auto-resubmitted jobs and the post-processing job if there is one), | ||
Line 163: | Line 190: | ||
<translate> | <translate> | ||
<!--T:37--> | |||
Here <code>VAR1, VAR2, VAR3</code> are custom environment variables which will be passed to all farm jobs. | Here <code>VAR1, VAR2, VAR3</code> are custom environment variables which will be passed to all farm jobs. | ||
== Example: Numbered input files == | == Example: Numbered input files == <!--T:38--> | ||
<!--T:39--> | |||
Suppose you have an application called <code>fcode</code>, and each case needs to read a separate file from standard input-- | Suppose you have an application called <code>fcode</code>, and each case needs to read a separate file from standard input-- | ||
say <code>data.X</code>, where ''X'' ranges from 1 to ''N_cases''. | say <code>data.X</code>, where ''X'' ranges from 1 to ''N_cases''. | ||
Line 175: | Line 204: | ||
Create <code>table.dat</code> in the farm META directory like this: | Create <code>table.dat</code> in the farm META directory like this: | ||
fcode < /home/user/IC/data.1 | <!--T:40--> | ||
fcode < /home/user/IC/data.1 | |||
fcode < /home/user/IC/data.2 | fcode < /home/user/IC/data.2 | ||
fcode < /home/user/IC/data.3 | fcode < /home/user/IC/data.3 | ||
... | ... | ||
<!--T:41--> | |||
You might wish to use a shell loop to create <code>table.dat</code>, ''e.g.'': | You might wish to use a shell loop to create <code>table.dat</code>, ''e.g.'': | ||
Line 188: | Line 219: | ||
<translate> | <translate> | ||
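As a minimal sketch of such a loop, assuming the <code>fcode</code> command and the <code>/home/user/IC</code> input directory from the listing above, the hypothetical helper <code>make_table</code> below writes one line per case. The function name and its arguments are illustrative and not part of META.

```shell
#!/bin/sh
# Hypothetical helper, not part of META.
# $1 = number of cases, $2 = directory holding data.1 ... data.N, $3 = table file to write
make_table() {
    i=1
    : > "$3"                                   # start from an empty table
    while [ "$i" -le "$1" ]; do
        printf 'fcode < %s/data.%s\n' "$2" "$i" >> "$3"
        i=$((i + 1))
    done
}

# Example: build a three-case table in a scratch directory and show it.
scratch=$(mktemp -d)
make_table 3 /home/user/IC "$scratch/table.dat"
cat "$scratch/table.dat"
rm -rf "$scratch"
```

In a real farm you would write to <code>table.dat</code> in the farm directory and set the case count to your own ''N_cases''.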
== Example: Input file must have the same name == | == Example: Input file must have the same name == <!--T:42--> | ||
<!--T:43--> | |||
Some applications expect to read input from a file with a prescribed and unchangeable name, like <code>INPUT</code> for example. | Some applications expect to read input from a file with a prescribed and unchangeable name, like <code>INPUT</code> for example. | ||
To handle this situation each case must run in its own subdirectory, | To handle this situation each case must run in its own subdirectory, | ||
Line 197: | Line 229: | ||
Your <code>table.dat</code> can contain nothing but the application name, over and over again: | Your <code>table.dat</code> can contain nothing but the application name, over and over again: | ||
/path/to/code | <!--T:44--> | ||
/path/to/code | |||
/path/to/code | /path/to/code | ||
... | ... | ||
<!--T:45--> | |||
Add a line to <code>single_case.sh</code> | Add a line to <code>single_case.sh</code> | ||
which copies the input file into the farm ''sub''directory for each case-- | which copies the input file into the farm ''sub''directory for each case-- | ||
Line 213: | Line 247: | ||
<translate> | <translate> | ||
==Using all the columns in the cases table explicitly== | ==Using all the columns in the cases table explicitly== <!--T:46--> | ||
The examples shown so far assume that each line in the cases table is an executable statement, starting with either the name of the executable file (when it is on your <code>$PATH</code>) or the full path to the executable file, and then listing the command line arguments particular to that case, or something like <code> < input.$ID</code> if your code expects to read a standard input file. | The examples shown so far assume that each line in the cases table is an executable statement, starting with either the name of the executable file (when it is on your <code>$PATH</code>) or the full path to the executable file, and then listing the command line arguments particular to that case, or something like <code> < input.$ID</code> if your code expects to read a standard input file. | ||
<!--T:47--> | |||
In the most general case, you may want to be able to access all the columns in the table individually. That can be done by modifying <code>single_case.sh</code>: | In the most general case, you may want to be able to access all the columns in the table individually. That can be done by modifying <code>single_case.sh</code>: | ||
Line 248: | Line 283: | ||
<translate> | <translate> | ||
<!--T:48--> | |||
For example, you might need to provide to your code ''both'' a standard input file ''and'' a variable number of command line arguments. | For example, you might need to provide to your code ''both'' a standard input file ''and'' a variable number of command line arguments. | ||
Your cases table will look like this: | Your cases table will look like this: | ||
/path/to/IC.1 0.1 | <!--T:49--> | ||
/path/to/IC.1 0.1 | |||
/path/to/IC.2 0.2 10 | /path/to/IC.2 0.2 10 | ||
... | ... | ||
<!--T:50--> | |||
The way to implement this in <code>single_case.sh</code> is as follows: | The way to implement this in <code>single_case.sh</code> is as follows: | ||
Line 264: | Line 302: | ||
<translate> | <translate> | ||
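The column-splitting idea can also be tried on its own, outside of a farm. The sketch below is plain POSIX shell and makes no claim about the variables META provides inside <code>single_case.sh</code>; the names <code>split_case_line</code>, <code>input_file</code> and <code>extra_args</code> are hypothetical. It takes one line of the cases table, treats the first column as the standard input file, and keeps the remaining columns as command line arguments.

```shell
#!/bin/sh
# Hypothetical helper, not part of META.
# $1 = one line of the cases table; sets input_file and extra_args.
split_case_line() {
    set -f              # disable globbing while the line is word-split
    set -- $1           # unquoted on purpose: split the line on whitespace
    set +f
    input_file=$1       # first column: file to feed to standard input
    shift
    extra_args=$*       # remaining columns: command line arguments
}

split_case_line "/path/to/IC.2 0.2 10"
echo "stdin file: $input_file"
echo "arguments:  $extra_args"
```

A case would then be run as something like <code>/path/to/code $extra_args < $input_file</code>, where <code>/path/to/code</code> stands for your own executable.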
=Troubleshooting= | =Troubleshooting= <!--T:51--> | ||
<!--T:52--> | |||
Here we explain typical error messages you might get when using this package. | Here we explain typical error messages you might get when using this package. | ||
==Problems affecting multiple commands== | ==Problems affecting multiple commands== <!--T:53--> | ||
==="Non-farm directory, or no farm has been submitted; exiting"=== | ==="Non-farm directory, or no farm has been submitted; exiting"=== <!--T:54--> | ||
Either the current directory is not a farm directory, or you never ran <code>submit.run</code> for this farm. | Either the current directory is not a farm directory, or you never ran <code>submit.run</code> for this farm. | ||
==Problems with submit.run== | ==Problems with submit.run== <!--T:55--> | ||
==="Wrong first argument: XXX (should be a positive integer or -1) ; exiting"=== | ==="Wrong first argument: XXX (should be a positive integer or -1) ; exiting"=== | ||
Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode. | Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode. | ||
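The rule can be expressed as a short check. This is only an illustration of the rule stated above, not the code META itself uses, and the function name <code>valid_first_arg</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of META.
# Return 0 if $1 is -1 (SIMPLE mode) or a positive integer (META mode).
valid_first_arg() {
    case $1 in
        -1) return 0 ;;                 # SIMPLE mode
        ''|*[!0-9]*) return 1 ;;        # empty, or contains a non-digit
        *) [ "$1" -gt 0 ] ;;            # digits only: must be greater than zero
    esac
}

for arg in -1 4 0 abc; do
    if valid_first_arg "$arg"; then
        echo "$arg: accepted"
    else
        echo "$arg: rejected"
    fi
done
```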
==="lockfile is not on path; exiting"=== | ==="lockfile is not on path; exiting"=== <!--T:56--> | ||
Make sure the utility <code>lockfile</code> is on your $PATH. | Make sure the utility <code>lockfile</code> is on your $PATH. | ||
This utility is critical for this package. | This utility is critical for this package. | ||
Line 283: | Line 322: | ||
it ensures that two different meta-jobs do not read the same line of <code>table.dat</code> at the same time. | it ensures that two different meta-jobs do not read the same line of <code>table.dat</code> at the same time. | ||
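To see why serialized access matters, here is a minimal sketch of the idea. It is not META's implementation: it uses the atomic <code>mkdir</code> call as a stand-in for <code>lockfile</code>, and the helper name <code>claim_next_case</code>, the counter file and the lock directory are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of META: hand out case numbers 1, 2, 3, ... one at a time.
# $1 = file holding the last claimed case number, $2 = lock directory
claim_next_case() {
    # mkdir is atomic: only one process can create the directory at a time.
    until mkdir "$2" 2>/dev/null; do
        sleep 1                         # someone else holds the lock; wait
    done
    last=0
    [ -s "$1" ] && last=$(cat "$1")     # read the last claimed number, if any
    next=$((last + 1))
    echo "$next" > "$1"                 # record the new claim
    rmdir "$2"                          # release the lock
    echo "$next"
}

workdir=$(mktemp -d)
claim_next_case "$workdir/counter" "$workdir/lock"   # first caller claims case 1
claim_next_case "$workdir/counter" "$workdir/lock"   # next caller claims case 2
rm -rf "$workdir"
```

Without the lock, two meta-jobs could read the same counter value before either writes it back, and both would process the same case.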
==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== | ==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== <!--T:57--> | ||
Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files. | Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files. | ||
==="-auto option requires resubmit_script.sh file in the root farm directory; exiting"=== | ==="-auto option requires resubmit_script.sh file in the root farm directory; exiting"=== <!--T:58--> | ||
You used the <code>-auto</code> option, but you forgot to create the <code>resubmit_script.sh</code> file inside the root farm directory. A sample <code>resubmit_script.sh</code> is created automatically when you use <code>farm_init.run</code>. | You used the <code>-auto</code> option, but you forgot to create the <code>resubmit_script.sh</code> file inside the root farm directory. A sample <code>resubmit_script.sh</code> is created automatically when you use <code>farm_init.run</code>. | ||
==="File table.dat doesn't exist. Exiting"=== | ==="File table.dat doesn't exist. Exiting"=== <!--T:59--> | ||
You forgot to create the <code>table.dat</code> file in the current directory, or perhaps you are not running <code>submit.run</code> inside one of your farm sub-directories. | You forgot to create the <code>table.dat</code> file in the current directory, or perhaps you are not running <code>submit.run</code> inside one of your farm sub-directories. | ||
==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== | ==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== <!--T:60--> | ||
Make sure you provide a run-time limit for all meta-jobs as an <code>#SBATCH</code> argument inside your <code>job_script.sh</code> file. | Make sure you provide a run-time limit for all meta-jobs as an <code>#SBATCH</code> argument inside your <code>job_script.sh</code> file. | ||
The run-time limit is the only <code>sbatch</code> argument which cannot be passed as an optional argument to <code>submit.run</code>. | The run-time limit is the only <code>sbatch</code> argument which cannot be passed as an optional argument to <code>submit.run</code>. | ||
==="Wrong job runtime in job_script.sh - nnn . Exiting"=== | ==="Wrong job runtime in job_script.sh - nnn . Exiting"=== <!--T:61--> | ||
You didn't properly format the run-time argument inside your <code>job_script.sh</code> file. | You didn't properly format the run-time argument inside your <code>job_script.sh</code> file. | ||
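For reference, <code>sbatch</code> itself accepts time limits written as minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, or days-hours:minutes:seconds. The sketch below checks a string against those shapes before you put it in <code>job_script.sh</code>. The function name <code>valid_slurm_time</code> is hypothetical, and whether META accepts every one of these forms is an assumption, not something stated on this page.

```shell
#!/bin/sh
# Hypothetical helper, not part of META or Slurm.
# Return 0 if $1 has one of the time-limit shapes sbatch documents.
valid_slurm_time() {
    printf '%s\n' "$1" |
        grep -Eq '^([0-9]+-[0-9]{1,2}(:[0-9]{1,2}){0,2}|[0-9]+(:[0-9]{1,2}){0,2})$'
}

for t in 30 1:30:00 2-12 2-12:30:00 3h 1:2:3:4; do
    if valid_slurm_time "$t"; then
        echo "$t: looks like a valid time limit"
    else
        echo "$t: badly formatted"
    fi
done
```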
==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== | ==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== <!--T:62--> | ||
==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"=== | ==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"=== | ||
With either of the two messages, there was an issue with submitting jobs with <code>sbatch</code>. | With either of the two messages, there was an issue with submitting jobs with <code>sbatch</code>. | ||
The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later. | The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later. | ||
==="Couldn't create subdirectories inside the farm directory ; exiting"=== | ==="Couldn't create subdirectories inside the farm directory ; exiting"=== <!--T:63--> | ||
==="Couldn't create the temp directory XXX ; exiting"=== | ==="Couldn't create the temp directory XXX ; exiting"=== | ||
==="Couldn't create a file inside XXX ; exiting"=== | ==="Couldn't create a file inside XXX ; exiting"=== | ||
Line 310: | Line 349: | ||
Either permissions got messed up, or you have exhausted a quota. Fix the issue(s), then try again. | Either permissions got messed up, or you have exhausted a quota. Fix the issue(s), then try again. | ||
==Problems with resubmit.run== | ==Problems with resubmit.run== <!--T:64--> | ||
==="Jobs are still running/queued; cannot resubmit"=== | ==="Jobs are still running/queued; cannot resubmit"=== | ||
You cannot use <code>resubmit.run</code> until all meta-jobs from this farm have finished running. | You cannot use <code>resubmit.run</code> until all meta-jobs from this farm have finished running. | ||
Use <code>list.run</code> or <code>queue.run</code> to check the status of the farm. | Use <code>list.run</code> or <code>queue.run</code> to check the status of the farm. | ||
==="No failed/unfinished jobs; nothing to resubmit"=== | ==="No failed/unfinished jobs; nothing to resubmit"=== <!--T:65--> | ||
Your farm was 100% processed. There are no more (failed or never-run) cases to compute. | Your farm was 100% processed. There are no more (failed or never-run) cases to compute. | ||
==Problems with running jobs== | ==Problems with running jobs== <!--T:66--> | ||
==="Too many failed (very short) cases - exiting"=== | ==="Too many failed (very short) cases - exiting"=== | ||
This happens if the first <code>$N_failed_max</code> cases are very short-- less than <code>$dt_failed</code> seconds in duration. | This happens if the first <code>$N_failed_max</code> cases are very short-- less than <code>$dt_failed</code> seconds in duration. | ||
Line 325: | Line 364: | ||
or else adjust the <code>$N_failed_max</code> and <code>$dt_failed</code> values in <code>config.h</code>. | or else adjust the <code>$N_failed_max</code> and <code>$dt_failed</code> values in <code>config.h</code>. | ||
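The condition described above can be pictured as follows. This is a sketch of the stated rule only, not META's actual code. The function name <code>too_many_short_cases</code> and the sample durations are hypothetical, and the values passed for <code>N_failed_max</code> and <code>dt_failed</code> are illustrative, not the package defaults.

```shell
#!/bin/sh
# Hypothetical helper, not part of META.
# $1 = N_failed_max, $2 = dt_failed (seconds), remaining arguments = durations
# (whole seconds) of the cases processed so far, in order.
# Return 0 if the first N_failed_max cases were all shorter than dt_failed.
too_many_short_cases() {
    n_max=$1
    dt=$2
    shift 2
    [ "$#" -ge "$n_max" ] || return 1      # not enough cases seen yet
    i=0
    for d in "$@"; do
        [ "$i" -lt "$n_max" ] || break     # only the first n_max cases count
        [ "$d" -lt "$dt" ] || return 1     # one long-enough case: keep going
        i=$((i + 1))
    done
    return 0
}

if too_many_short_cases 3 5 1 2 0; then
    echo "first 3 cases all ran for less than 5 s: the meta-job would exit"
fi
```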
==="lockfile is not on path on node XXX"=== | ==="lockfile is not on path on node XXX"=== <!--T:67--> | ||
As the error message suggests, somehow the utility <code>lockfile</code> is not on your <code>$PATH</code> on some node. | As the error message suggests, somehow the utility <code>lockfile</code> is not on your <code>$PATH</code> on some node. | ||
Use <code>which lockfile</code> to ensure that the utility is somewhere in your <code>$PATH</code>. | Use <code>which lockfile</code> to ensure that the utility is somewhere in your <code>$PATH</code>. | ||
Line 331: | Line 370: | ||
for example a file system may have failed to mount. | for example a file system may have failed to mount. | ||
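One way to make such a problem visible early is to test for the utility explicitly and report the node name. The helper <code>on_path</code> below is a hypothetical sketch, not part of META; it relies only on the standard <code>command -v</code> and <code>uname -n</code>.

```shell
#!/bin/sh
# Hypothetical helper, not part of META.
# Return 0 if the named command can be found on $PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

if on_path lockfile; then
    echo "lockfile found: $(command -v lockfile)"
else
    echo "lockfile is not on PATH on node $(uname -n)" >&2
fi
```

Running this in an interactive session on the affected node shows whether the problem is with your <code>$PATH</code> or with the node itself.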
==="Exiting after processing one case (-1 option)"=== | ==="Exiting after processing one case (-1 option)"=== <!--T:68--> | ||
This is not an error message. It simply tells you that you submitted the farm with <code>submit.run -1</code> (one case per job mode), so each meta-job is exiting after processing a single case. | This is not an error message. It simply tells you that you submitted the farm with <code>submit.run -1</code> (one case per job mode), so each meta-job is exiting after processing a single case. | ||
==="Not enough runtime left; exiting."=== | ==="Not enough runtime left; exiting."=== <!--T:69--> | ||
This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early. | This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early. | ||
==="No cases left; exiting."=== | ==="No cases left; exiting."=== <!--T:70--> | ||
This is not an error message. This is how each meta-job normally finishes, when all cases have been computed. | This is not an error message. This is how each meta-job normally finishes, when all cases have been computed. | ||
==="Only failed cases left; cannot auto-resubmit; exiting"=== | ==="Only failed cases left; cannot auto-resubmit; exiting"=== <!--T:71--> | ||
This can only happen if you used the <code>-auto</code> switch when submitting the farm. | This can only happen if you used the <code>-auto</code> switch when submitting the farm. | ||
Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>. | Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>. | ||
=Words of caution= | =Words of caution= <!--T:72--> | ||
<!--T:73--> | |||
Revision as of 15:16, 9 November 2022
This page is about META: A package for job farming.
Resubmitting failed cases automatically
If your farm is particularly large, that is, if it needs more resources than NJOBS_MAX x job_run_time, where NJOBS_MAX is the maximum number of jobs one is allowed to submit, you will have to run resubmit.run after the original farm finishes running, perhaps more than once. You can do it by hand, but with META you can also automate this process. To enable this feature, add the -auto switch to your submit.run or resubmit.run command:
$ submit.run N -auto
This can be used in either SIMPLE or META mode. If your original submit.run command did not have the -auto switch, you can add it to resubmit.run after the original farm finishes running, to the same effect.
When you add -auto, (re)submit.run submits one more (serial) job, in addition to the farm jobs. The purpose of this job is to run the resubmit.run command automatically right after the current farm finishes running. The job script for this additional job is resubmit_script.sh, which should be present in the farm directory; a sample file is automatically copied there when you run farm_init.run. The only customization you need to do to this file is to correct the account name in the #SBATCH -A line.
If you are using -auto, the value of the NJOBS_MAX parameter defined in the config.h file should be at least one smaller than the largest number of jobs you can submit on the cluster. E.g. if the largest number of jobs one can submit on the cluster is 999 and you intend to use -auto, set NJOBS_MAX to 998.
When using -auto, if at some point the only cases left to be processed are the ones which failed earlier, auto-resubmission will stop, and farm computations will end. This is to avoid an infinite loop on badly-formed cases which will always fail. If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm. You can see the relevant messages in the file farm.log created in the farm directory.
Running a post-processing job automatically
Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been successfully processed. If any cases failed, i.e. had a non-zero exit status, the post-processing job will not run. To enable this feature, simply create a script for the post-processing job with the name final.sh inside the farm directory. This job can be of any kind: serial, parallel, or an array job.
This feature uses the same script, resubmit_script.sh, described for -auto above. Make sure resubmit_script.sh has the correct account name in the #SBATCH -A line.
The automatic post-processing feature also causes one more serial job to be submitted, above the number you request. Adjust the parameter NJOBS_MAX in config.h accordingly (e.g. if the cluster has a job limit of 999, set it to 998). However, if you use both the auto-resubmit and the auto-post-processing features, they will together only submit one additional job. You do not need to subtract 2 from NJOBS_MAX.
System messages from the auto-resubmit feature are logged in farm.log, in the root farm directory.
Additional information
Using the git repository
To use META on a cluster where it is not installed as a module, you can clone the package from our git repository:
$ git clone https://git.computecanada.ca/syam/meta-farm.git
Then modify your $PATH variable to point to the bin subdirectory of the newly created meta-farm directory. Assuming you executed git clone inside your home directory, do this:
$ export PATH=~/meta-farm/bin:$PATH
Then proceed as shown in the META Quick start from the farm_init.run step.
Passing additional sbatch arguments
If you need to use additional sbatch arguments (like --mem 4G, --gres=gpu:1 etc.), add them to job_script.sh as separate #SBATCH lines. Or if you prefer, you can add them at the end of the submit.run or resubmit.run command and they will be passed to sbatch, e.g.:
$ submit.run -1 --mem 4G
Multi-threaded applications
For multi-threaded applications (such as those that use OpenMP, for example), add the following lines to job_script.sh:
#SBATCH --cpus-per-task=N
#SBATCH --mem=M
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
...where N is the number of CPU cores to use, and M is the total memory to reserve in megabytes. You may also supply --cpus-per-task=N and --mem=M as arguments to (re)submit.run.
MPI applications
For applications that use MPI, add the following lines to job_script.sh:
#SBATCH --ntasks=N
#SBATCH --mem-per-cpu=M
...where N is the number of CPU cores to use, and M is the memory to reserve for each core, in megabytes. You may also supply --ntasks=N and --mem-per-cpu=M as arguments to (re)submit.run. See Advanced MPI scheduling for information about more-complicated MPI scenarios.
Also add srun before the path to your code inside single_case.sh, e.g.:
srun $COMM
Alternatively, you can prepend srun to each line of table.dat:
srun /path/to/mpi_code arg1 arg2
srun /path/to/mpi_code arg1 arg2
...
srun /path/to/mpi_code arg1 arg2
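If table.dat already exists, the prefix can be added with a stream edit rather than by hand. The following is a minimal sketch, assuming every line of the table is an MPI case; the sample commands and the output file name table_srun.dat are placeholders chosen for illustration.

```shell
# Build a small sample table (placeholder commands, for illustration only).
workdir=$(mktemp -d)
printf '%s\n' "/path/to/mpi_code 1 2" "/path/to/mpi_code 3 4" > "$workdir/table.dat"

# Prepend "srun " to every line, writing the result to a new file
# so that the original table is left untouched.
sed 's/^/srun /' "$workdir/table.dat" > "$workdir/table_srun.dat"

cat "$workdir/table_srun.dat"
```

Writing to a separate file first makes it easy to inspect the result before replacing table.dat.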
GPU applications
For applications which use GPUs, modify job_script.sh following the guidance at Using GPUs with Slurm. For example, if your cases each use one GPU, add this line:
#SBATCH --gres=gpu:1
You may also wish to copy the utility ~syam/bin/gpu_test to your ~/bin directory (only on Graham, Cedar, and Beluga), and put the following lines in job_script.sh right before the task.run line:
~/bin/gpu_test
retVal=$?
if [ $retVal -ne 0 ]; then
    echo "No GPU found - exiting..."
    exit 1
fi
This will catch those rare situations when there is a problem with the node which renders the GPU unavailable. If that happens to one of your meta-jobs, and you don't detect the GPU failure somehow, then the job will try (and fail) to run all your cases from table.dat.
Environment variables and --export
All the jobs generated by the META package inherit the environment present when you run submit.run or resubmit.run. This includes all the loaded modules and environment variables. META relies on this behaviour for its work, using some environment variables to pass information between scripts. You have to be careful not to break this default behaviour, such as can happen if you use the --export switch. If you need to use --export in your farm, make sure ALL is one of the arguments to this command, e.g. --export=ALL,X=1,Y=2.
If you need to pass values of custom environment variables to all of your farm jobs (including auto-resubmitted jobs and the post-processing job if there is one), do not use --export. Instead, set the variables on the command line as in this example:
$ VAR1=1 VAR2=5 VAR3=3.1416 submit.run ...
Here VAR1, VAR2, VAR3 are custom environment variables which will be passed to all farm jobs.
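The VAR=value prefix works because the shell places those variables in the environment of that one command, and child processes inherit that environment. The following small sketch shows the mechanism, using a plain sh child process as a stand-in for submit.run (the stand-in and the variable child_output are assumptions made for the demonstration).

```shell
# Variables given as a command prefix are visible to the child process...
child_output=$(VAR1=1 VAR2=5 sh -c 'echo "VAR1=$VAR1 VAR2=$VAR2"')
echo "$child_output"

# ...but they are not left behind in the calling shell.
echo "VAR1 in the calling shell: '${VAR1:-}'"
```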
Example: Numbered input files
Suppose you have an application called fcode, and each case needs to read a separate file from standard input, say data.X, where X ranges from 1 to N_cases. The input files are all stored in a directory /home/user/IC. Ensure fcode is on your $PATH (e.g., put fcode in ~/bin, and ensure /home/$USER/bin is added to $PATH in ~/.bashrc), or use a full path to fcode in table.dat. Create table.dat in the farm directory like this:
fcode < /home/user/IC/data.1
fcode < /home/user/IC/data.2
fcode < /home/user/IC/data.3
...
You might wish to use a shell loop to create table.dat, e.g.:
$ for ((i=1; i<=100; i++)); do echo "fcode < /home/user/IC/data.$i"; done >table.dat
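If the number of input files is not known in advance, the table can be derived from the files that actually exist instead of a fixed loop bound. This is only a sketch under stated assumptions: the temporary directory and the three sample files stand in for /home/user/IC, and fcode is the placeholder application from the example above.

```shell
# Placeholder input directory with three sample files (illustration only).
ic_dir=$(mktemp -d)
touch "$ic_dir/data.1" "$ic_dir/data.2" "$ic_dir/data.10"

# Emit one table line per existing input file, ordered by numeric suffix.
for f in "$ic_dir"/data.*; do
    echo "${f##*.}"          # keep only the part after the last dot
done | sort -n | while read -r i; do
    echo "fcode < $ic_dir/data.$i"
done > "$ic_dir/table.dat"

cat "$ic_dir/table.dat"
```

Sorting numerically keeps data.10 after data.2, which a plain glob expansion would not.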
Example: Input file must have the same name
Some applications expect to read input from a file with a prescribed and unchangeable name, like INPUT for example. To handle this situation each case must run in its own subdirectory, and you must create an input file with the prescribed name in each subdirectory. Suppose for this example that you have prepared the different input files for each case and stored them in /path/to/data.X, where X ranges from 1 to N_cases. Your table.dat can contain nothing but the application name, over and over again:
/path/to/code
/path/to/code
...
Add a line to single_case.sh which copies the input file into the farm subdirectory for each case (the first line in the example below):
cp /path/to/data.$ID INPUT
$COMM
STATUS=$?
Using all the columns in the cases table explicitly
The examples shown so far assume that each line in the cases table is an executable statement, starting with either the name of the executable file (when it is on your $PATH) or the full path to the executable file, and then listing the command line arguments particular to that case, or something like < input.$ID if your code expects to read a standard input file.
In the most general case, you may want to be able to access all the columns in the table individually. That can be done by modifying single_case.sh:
...
# ++++++++++++ This part can be customized: ++++++++++++++++
# $ID contains the case id from the original table
# $COMM is the line corresponding to the case $ID in the original table, without the ID field
mkdir RUN$ID
cd RUN$ID
# Converting $COMM to an array:
COMM=( $COMM )
# Number of columns in COMM:
Ncol=${#COMM[@]}
# Now one can access the columns individually, as ${COMM[i]} , where i=0...$Ncol-1
# A range of columns can be accessed as ${COMM[@]:i:n} , where i is the first column
# to display, and n is the number of columns to display
# Use the ${COMM[@]:i} syntax to display all the columns starting from the i-th column
# (use for codes with a variable number of command line arguments).
# Call the user code here.
...
# Exit status of the code:
STATUS=$?
cd ..
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
For example, you might need to provide to your code both a standard input file and a variable number of command line arguments. Your cases table will look like this:
/path/to/IC.1 0.1
/path/to/IC.2 0.2 10
...
The way to implement this in single_case.sh is as follows:
# Call the user code here.
/path/to/code ${COMM[@]:1} < ${COMM[0]}
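To see what the array expansions produce, the slicing can be tried in isolation on a sample line. This is a small bash sketch; the line contents are the placeholder values from the table above, and input_file and extra_args are names introduced only for this illustration.

```shell
# Sample line from the cases table (placeholder values, illustration only).
COMM="/path/to/IC.2 0.2 10"

# Split the line into an array on whitespace, as done in single_case.sh.
COMM=( $COMM )
Ncol=${#COMM[@]}

input_file=${COMM[0]}      # first column: the standard input file
extra_args=${COMM[@]:1}    # remaining columns: the command line arguments

echo "Ncol=$Ncol input=$input_file args=$extra_args"
```

Because the split relies on whitespace, this approach assumes that no column contains embedded spaces.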
Troubleshooting
Here we explain typical error messages you might get when using this package.
Problems affecting multiple commands
"Non-farm directory, or no farm has been submitted; exiting"
Either the current directory is not a farm directory, or you never ran submit.run for this farm.
Problems with submit.run
"Wrong first argument: XXX (should be a positive integer or -1) ; exiting"
Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode.
"lockfile is not on path; exiting"
Make sure the utility lockfile is on your $PATH. This utility is critical for this package. It provides serialized access of meta-jobs to the table.dat file, that is, it ensures that two different meta-jobs do not read the same line of table.dat at the same time.
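The idea of serialized access can be illustrated without lockfile itself. The sketch below is not how META is implemented: it substitutes an atomic mkdir for lockfile and a plain counter file for the bookkeeping on table.dat, and the function name claim_case is hypothetical. It only demonstrates why a lock lets concurrent workers each claim a distinct case.

```shell
# Shared state: a counter holding the number of the last claimed case.
work=$(mktemp -d)
echo 0 > "$work/counter"

claim_case() {
    # mkdir is atomic: exactly one process succeeds in creating the lock directory.
    until mkdir "$work/lock" 2>/dev/null; do :; done
    n=$(( $(cat "$work/counter") + 1 ))
    echo "$n" > "$work/counter"
    echo "$n" >> "$work/claimed"     # record which case this worker took
    rmdir "$work/lock"               # release the lock
}

# Four concurrent workers, each claiming five cases.
for w in 1 2 3 4; do
    ( for k in 1 2 3 4 5; do claim_case; done ) &
done
wait

# Number of distinct cases claimed across all workers.
sort -n "$work/claimed" | uniq | wc -l
```

Without the lock, two workers could read the same counter value and process the same case twice.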
"Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"[edit]
Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files.
"-auto option requires resubmit_script.sh file in the root farm directory; exiting"
You used the -auto option, but you forgot to create the resubmit_script.sh file inside the root farm directory. A sample resubmit_script.sh is created automatically when you use farm_init.run.
"File table.dat doesn't exist. Exiting"
You forgot to create the table.dat file in the current directory, or perhaps you are not running submit.run inside one of your farm subdirectories.
"Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"
Make sure you provide a run-time limit for all meta-jobs as an #SBATCH argument inside your job_script.sh file. The run-time limit is the only sbatch argument that cannot be passed as an optional argument to submit.run.
"Wrong job runtime in job_script.sh - nnn . Exiting"
You didn't properly format the run-time argument inside your job_script.sh file.
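For reference, the sbatch documentation lists the accepted time formats as minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds. The sketch below checks a value against those patterns before you put it into job_script.sh; the function name is_valid_slurm_time is hypothetical, and the check META itself performs may be stricter or different.

```shell
# Return 0 if the argument matches one of the time formats documented for sbatch.
is_valid_slurm_time() {
    printf '%s\n' "$1" |
        grep -Eq '^([0-9]+(:[0-9]+){0,2}|[0-9]+-[0-9]+(:[0-9]+){0,2})$'
}

for t in 90 1:30:00 2-12 0-03:00:00 3h 1:2:3:4; do
    if is_valid_slurm_time "$t"; then
        echo "$t: ok"
    else
        echo "$t: not a valid time limit"
    fi
done
```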
"Something wrong with sbatch farm submission; jobid=XXX; aborting"[edit]
"Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"[edit]
With either of the two messages, there was an issue with submitting jobs with sbatch. The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later.
"Couldn't create subdirectories inside the farm directory ; exiting"
"Couldn't create the temp directory XXX ; exiting"
"Couldn't create a file inside XXX ; exiting"
With any of these three messages, something is wrong with a file system: either permissions got messed up, or you have exhausted a quota. Fix the issue(s), then try again.
Problems with resubmit.run
"Jobs are still running/queued; cannot resubmit"
You cannot use resubmit.run until all meta-jobs from this farm have finished running. Use list.run or queue.run to check the status of the farm.
"No failed/unfinished jobs; nothing to resubmit"
Your farm was 100% processed. There are no more (failed or never-ran) cases to compute.
Problems with running jobs
"Too many failed (very short) cases - exiting"
This happens if the first $N_failed_max cases are very short, less than $dt_failed seconds in duration. See the discussion in section job_script.sh above. Determine what is causing the cases to fail and fix that, or else adjust the $N_failed_max and $dt_failed values in config.h.
"lockfile is not on path on node XXX"
As the error message suggests, somehow the utility lockfile is not on your $PATH on some node. Use which lockfile to ensure that the utility is somewhere in your $PATH. If it is in your $PATH on a login node, then something went wrong on that particular compute node, for example a file system may have failed to mount.
"Exiting after processing one case (-1 option)"
This is not an error message. It simply tells you that you submitted the farm with submit.run -1 (one case per job mode), so each meta-job is exiting after processing a single case.
"Not enough runtime left; exiting."
This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early.
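The kind of decision behind this message can be sketched as follows. The rule used here (stop when the time left is shorter than the longest case seen so far) is an assumption made for illustration, not necessarily the rule META applies, and the function name enough_time_left is hypothetical.

```shell
# Hypothetical rule: continue only if the time left covers the longest case so far.
# Arguments: job time limit (s), elapsed time (s), then run-times (s) of finished cases.
enough_time_left() {
    limit=$1; elapsed=$2; shift 2
    longest=0
    for t in "$@"; do
        [ "$t" -gt "$longest" ] && longest=$t
    done
    [ $(( limit - elapsed )) -ge "$longest" ]
}

# 300 s remain, but one earlier case took 310 s, so the job stops here.
if enough_time_left 3600 3300 240 310 280; then
    echo "start next case"
else
    echo "Not enough runtime left; exiting."
fi
```

If you see this message long before the time limit, your cases probably vary widely in duration; cases left unprocessed can be picked up with resubmit.run.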
"No cases left; exiting."[edit]
This is not an error message. This is how each meta-job normally finishes, when all cases have been computed.
"Only failed cases left; cannot auto-resubmit; exiting"
This can only happen if you used the -auto switch when submitting the farm. Find the failed cases with Status.run -f, fix the issue(s) causing the cases to fail, then run resubmit.run.
Words of caution
Always start with a small test run to make sure everything works before submitting a large production run. You can test individual cases by reserving an interactive node with salloc, changing to the farm directory, and executing commands like ./single_case.sh table.dat 1, ./single_case.sh table.dat 2, etc.
More than 10,000 cases
If your farm is particularly large (say >10,000 cases), you should spend extra effort to make sure it runs as efficiently as possible. In particular, minimize the number of files and/or directories created during execution. If possible, instruct your code to append to existing files (one per meta-job; do not mix results from different meta-jobs in a single output file!) instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case. (Yes, creating a separate subdirectory for each case is the default setup of this package, but that default was chosen for safety, not efficiency!)
The following example is optimized for a very large number of cases. It assumes, for purposes of the example:
- that your code accepts the output file name via a command line switch -o,
- that the application opens the output file in "append" mode, that is, multiple runs will keep appending to the existing file,
- that each line of table.dat provides the rest of the command line arguments for your code,
- that multiple instances of your code can safely run concurrently inside the same directory, so there is no need to create a subdirectory for each case,
- and that each run will not produce any files besides the output file.
With this setup, even very large farms (hundreds of thousands or even millions of cases) should run efficiently, as there will be very few files created.
...
# ++++++++++++++++++++++ This part can be customized: ++++++++++++++++++++++++
# Here:
# $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
# $COMM is the line corresponding to the case $ID in the original table, without the ID field
# $METAJOB_ID is the jobid for the current meta-job (convenient for creating per-job files)
# Executing the command (a line from table.dat)
/path/to/your/code $COMM -o output.$METAJOB_ID
# Exit status of the code:
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
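With this layout the farm leaves one output file per meta-job, which can be merged once all meta-jobs have finished (for instance from a post-processing script). A minimal sketch, assuming the output.$METAJOB_ID naming used above; the temporary directory, the job ids 1001 and 1002, and the merged file name all_results.dat are placeholders for illustration.

```shell
# Placeholder per-meta-job output files (illustration only).
farm=$(mktemp -d)
printf 'r1\nr2\n' > "$farm/output.1001"
printf 'r3\n'     > "$farm/output.1002"

# Concatenate the per-job files into a single results file.
cat "$farm"/output.* > "$farm/all_results.dat"

# Number of result lines collected.
wc -l < "$farm/all_results.dat"
```

The merged file is given a name that does not match output.* so that it is never picked up as its own input.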
Parent page: META: A package for job farming