=Overview=
META is a suite of scripts designed at SHARCNET to automate high-throughput computing, that is, running a large number of related calculations.  This practice is sometimes called "farming", "serial farming", or "task farming". META works on all Alliance national systems, and could also be used on other clusters which use the same setup (most importantly, which use the [https://en.wikipedia.org/wiki/Slurm_Workload_Manager Slurm scheduler]).

We will use the term "case" in this article to describe one independent computation.  In contrast, a "job" will mean an invocation of the job scheduler (Slurm).  A "case" may involve the execution of a serial program, a parallel program, or a GPU-using program.  A job might handle several cases.


META has the following features:


* Two modes of operation:
** SIMPLE mode, which handles one case per job.
** META mode, which handles many cases per job.
* Dynamic workload balancing in META mode.
* Capture of the exit status of all individual cases.
* Automatic resubmission of all the cases which failed or never ran.
* Submission and independent operation of multiple farms (groups of cases) on the same cluster.
* Automatic execution of a post-processing job once all the cases have been processed successfully.
 
Some technical requirements:
* Each case to be computed must be described as a separate line in a file '''table.dat'''.
* One can run multiple farms independently, but each farm must have its own directory.
 
In META mode, the number of actual jobs (so-called "meta-jobs") submitted by the package is usually much smaller than the number of cases to process. Each meta-job can process multiple lines from table.dat (multiple cases). A collection of meta-jobs will read lines from table.dat, starting from the first line, in a serialized manner, using the [https://linux.die.net/man/1/lockfile lockfile] mechanism to prevent a race condition. This ensures a good dynamic workload balance between meta-jobs, as meta-jobs which happen to handle shorter cases will process more of them.

In META mode, not all of the submitted meta-jobs have to run for the farm to complete. The first meta-job to run will start processing lines from table.dat; if and when the second job starts, it joins the first one, and so on. If the run-time of an individual meta-job is long enough, all the cases might be processed with just a single meta-job.
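To picture the serialized access, here is a minimal sketch of how a meta-job could claim the next unprocessed line under a lock. This is only an illustration of the mechanism, not the actual <code>task.run</code> code, and the file names here are made up:

<source lang="bash">
# Illustrative sketch only -- NOT the actual task.run implementation.
LOCK=/home/$USER/tmp/farm.lock       # hypothetical lock file
COUNTER=/home/$USER/tmp/next_case    # hypothetical counter of the next unprocessed line
[ -f "$COUNTER" ] || echo 1 > "$COUNTER"

lockfile "$LOCK"                     # blocks until the lock can be acquired
ID=$(cat "$COUNTER")                 # the next line of table.dat to process
echo $((ID + 1)) > "$COUNTER"        # claim the line by advancing the counter
rm -f "$LOCK"                        # release the lock for the other meta-jobs

COMM=$(sed -n "${ID}p" table.dat)    # fetch the claimed case
</source>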


== META vs. GLOST ==
There are three important advantages of the META package over other approaches like [[GLOST]], where farm processing is done by bundling up all the jobs into a large parallel (MPI) job:
# As the scheduler has full flexibility to start individual meta-jobs whenever it wants, the queue wait time can be dramatically shorter with the META package than with GLOST. Consider a large farm where 1000 CPU cores need to be used for 3 days. With META, some meta-jobs start to run and produce the first results within minutes. With GLOST, with a 1000-way MPI job, the queue wait time can be weeks, so it will be weeks before you see your very first result.
# With GLOST, at the end of the farm computations some MPI ranks will finish earlier and sit idle until the very last (the slowest) MPI rank ends. With META there is no such waste at the end of the farm: individual meta-jobs exit earlier if they have no more workload to process.
# GLOST and other similar packages do not support automated resubmission of the cases which failed or never ran. META has this feature, and it is very easy to use.


== The META webinar ==
A webinar describing the META package was recorded on October 6th, 2021. You can view it [https://youtu.be/GcYbaPClwGE here].
 
=Quick start=
If you are impatient to start using the package, just follow the steps listed below. But it is highly recommended to also read the rest of the page.


* Log in to a cluster.
* Load the <code>meta-farm</code> module:
  $ module load meta-farm
* Choose a name for a farm directory, <i>e.g.</i> <code>Farm_name</code>, and create it with the following command:
  $ farm_init.run  Farm_name
* The above command will also create a few important files inside the farm directory, some of which you will need to customize.
* Copy your executable and input files to the farm directory. (You may skip this step if you plan to use full paths everywhere.)
* Edit the file <code>table.dat</code> inside the farm directory.  This is a text file describing one case (one independent computation) per line.  For examples, see one or more of:
** [[#single_case.sh|single_case.sh]]
** [[META:_Advanced_features_and_troubleshooting#Example:_Numbered_input_files|Example: Numbered input files]] (advanced)
** [[META:_Advanced_features_and_troubleshooting#Example:_Input_file_must_have_the_same_name|Example: Input file must have the same name]] (advanced)
** [[META:_Advanced_features_and_troubleshooting#Using_all_the_columns_in_the_cases_table_explicitly|Using all the columns in the cases table explicitly]] (advanced)
* Modify the <code>single_case.sh</code> script if needed. In many cases you don't have to make any changes. For more information, see one or more of:
** [[#single_case.sh|single_case.sh]]
** [[#STATUS_and_handling_errors|STATUS and handling errors]]
** [[META:_Advanced_features_and_troubleshooting#Example:_Input_file_must_have_the_same_name|Example: Input file must have the same name]] (advanced)
** [[META:_Advanced_features_and_troubleshooting#Using_all_the_columns_in_the_cases_table_explicitly|Using all the columns in the cases table explicitly]] (advanced)
* Modify the <code>job_script.sh</code> file to suit your needs, as described at [[#job_script.sh|job_script.sh]] below. In particular, use a correct account name, and set an appropriate job run-time. For more about run-times, see [[#Estimating_the_run-time_and_number_of_meta-jobs|Estimating the run-time and number of meta-jobs]].
* Inside the farm directory, execute
  $ submit.run -1
for the one case per job (SIMPLE) mode, or
  $ submit.run N
for the many cases per job (META) mode, where <i>N</i> is the number of meta-jobs to use. <i>N</i> should be significantly smaller than the total number of cases.


To run another farm concurrently with the first one, run <code>farm_init.run</code> again (providing a different farm name) and customize the files <code>single_case.sh</code> and <code>job_script.sh</code> inside the new farm directory, then create a new <code>table.dat</code> file there. Also copy the executable and all the input files as needed. Now you can execute the <code>submit.run</code> command inside the second farm directory to submit the second farm.
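For example, a second farm could be set up along these lines (the name <code>Farm2</code> is arbitrary):

<source lang="bash">
 $ farm_init.run Farm2
 $ cd Farm2
 #  ... customize table.dat, single_case.sh, and job_script.sh here ...
 $ submit.run -1
</source>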


=List of commands=
* '''farm_init.run''' : Initialize a farm. See [[#Quick start|Quick start]] above.
* '''submit.run''' : Submit the farm to the scheduler. See [[#submit.run|submit.run]] below.
* '''resubmit.run''' : Resubmit all computations which failed or never ran as a new farm. See [[#Resubmitting failed cases|Resubmitting failed cases]] below.
* '''list.run''' : List all the jobs with their current state for the farm.
* '''query.run''' : Provide a short summary of the state of the farm, showing the number of queued, running, and completed jobs. This is more convenient than <code>list.run</code> when the number of jobs is large. It will also print the progress, that is, the number of processed cases vs. the total number of cases, both for the current run and globally.
* '''kill.run''' : Kill all the running and queued jobs in the farm.
* '''prune.run''' : Remove only queued jobs.
* '''Status.run''' (capital "S"!) : List the statuses of all processed cases. With the optional <code>-f</code> switch, the non-zero status lines (if any) will be listed at the end.
* '''clean.run''' : Delete all the files in the farm directory (including subdirectories if any are present), except for <code>job_script.sh</code>, <code>single_case.sh</code>, <code>final.sh</code>, <code>resubmit_script.sh</code>, <code>config.h</code>, and <code>table.dat</code>. It will also delete all files associated with this farm in the <code>/home/$USER/tmp</code> directory. Be very careful with this script!


All of these commands (except for <code>farm_init.run</code> itself) have to be executed inside a farm directory, that is, a directory created by <code>farm_init.run</code>.
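For example, to check on a farm you have submitted (assuming the farm directory is <code>Farm_name</code>):

<source lang="bash">
 $ cd Farm_name
 $ query.run        # short summary: queued / running / completed jobs, plus progress
 $ Status.run -f    # per-case exit statuses, with failed cases listed at the end
</source>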
=Small number of cases (SIMPLE mode)=


Recall that a single execution of your code is a "case" and a "job" is an invocation of the Slurm scheduler. If:
* the total number of cases is fairly small, say, less than 500, and
* each case runs for at least 10 minutes,
then it is reasonable to dedicate a separate job to each case using SIMPLE mode.
Otherwise you should consider using META mode to handle many cases per job, for which see [[#Large number of cases (META mode)|Large number of cases (META mode)]] below.

The three essential scripts are the command <code>submit.run</code> and two user-customizable scripts, <code>single_case.sh</code> and <code>job_script.sh</code>.

==submit.run==
The command <code>submit.run</code> has one obligatory argument, the number of jobs to submit, <i>N</i>:
<source lang="bash">
   $  submit.run N [-auto] [optional_sbatch_arguments]
</source>

If <i>N</i>=-1, you are requesting the SIMPLE mode ("submit as many jobs as there are lines in table.dat"). If <i>N</i> is a positive integer, you are requesting the META mode (multiple cases per job), with <i>N</i> being the number of meta-jobs requested.  Any other value for <i>N</i> is an error.


If the optional switch <code>-auto</code> is present, the farm will resubmit itself automatically at the end, more than once if necessary, until all the cases from table.dat have been processed. This advanced feature is described at [[META:_Advanced_features_and_troubleshooting#Resubmitting_failed_cases_automatically|Resubmitting failed cases automatically]].


If a file named <code>final.sh</code> is present in the farm directory, <code>submit.run</code> will treat it as a job script for a post-processing job and it will be launched automatically once all the cases from table.dat have been successfully processed. See [[META:_Advanced_features_and_troubleshooting#Running_a_post-processing_job_automatically|Running a post-processing job automatically]] for more details.


If you supply any other arguments, they will be passed on to the Slurm command <code>sbatch</code> used to launch all meta-jobs for this farm.
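For example, to request 4 GB of memory for every meta-job, you could pass the corresponding <code>sbatch</code> switch through (the memory value here is purely illustrative):

<source lang="bash">
 $ submit.run -1 --mem=4G
</source>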


==single_case.sh==
The function of <code>single_case.sh</code> is to read one line from <code>table.dat</code>, parse it, and use the contents of that line to launch your code for one case.
You may wish to customize <code>single_case.sh</code> for your purposes.

The version of <code>single_case.sh</code> provided by <code>farm_init.run</code> treats each line in <code>table.dat</code> as a literal command and executes it in its own subdirectory <code>RUNyyy</code>, where <i>yyy</i> is the case number.  Here is the relevant section of <code>single_case.sh</code>:


<source lang="bash">
...
</source>
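Behaviourally, that part of the script amounts to something like the following sketch (our illustration, not the verbatim file; <code>$ID</code> is the case number and <code>$COMM</code> is the corresponding line from table.dat):

<source lang="bash">
# Sketch of the default behaviour -- not the verbatim script.
mkdir -p RUN$ID        # each case gets its own subdirectory, RUNyyy
cd RUN$ID
eval "$COMM"           # execute the table.dat line as a literal (compound) command
STATUS=$?              # capture the exit status, used by resubmit.run and Status.run
cd ..
</source>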
Consequently, if you are using the unmodified <code>single_case.sh</code> then each line of <code>table.dat</code> should contain a complete command.
This may be a compound command, that is, several commands separated by semicolons (;).
Typically <code>table.dat</code> will contain a list of identical commands differentiated only by their arguments, but it need not be so.
Any executable statement can go into <code>table.dat</code>.
Your <code>table.dat</code> could look like this:
   /home/user/bin/code1  1.0  10  2.1
   cp -f ~/input_dir/input1 .; ~/code_dir/code
   ./code2 < IC.2
   sleep 10m
   ...


If you intend to execute the same command for every case and don't wish to repeat it on every line of <code>table.dat</code>, then you can edit <code>single_case.sh</code> to include the common command, and edit your <code>table.dat</code> to contain only the arguments and/or redirects for each case.

For example, here is a modification of <code>single_case.sh</code> which includes the command (<code>/path/to/your/code</code>), takes the contents of <code>table.dat</code> as arguments to that command, and uses the case number <code>$ID</code> as an additional argument:
* single_case.sh:
<source lang="bash">
...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
# Here we use $ID (case number) as a unique seed for Monte-Carlo type serial farming:
/path/to/your/code -par $COMM  -seed $ID
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
</source>
* table.dat:
<source lang="bash">
  ...
</source>
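With this modification, each line of <code>table.dat</code> holds just the per-case value passed via <code>-par</code>, e.g. (the values below are made up for illustration):

  1.25
  2.35
  ...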
'''Note:'''  If your code doesn't need to read any arguments from <code>table.dat</code>, you still have to generate <code>table.dat</code>, with the number of lines equal to the number of cases you want to compute. In this case, it doesn't matter what you put inside <code>table.dat</code>; all that matters is the total number of lines. The key line in the above example might then look like

 /path/to/your/code -seed $ID

(no references to <code>$COMM</code>).

'''Another note:''' You do not need to insert line numbers at the beginning of each line of <code>table.dat</code>.  The script <code>submit.run</code> will modify <code>table.dat</code> to add line numbers if it doesn't find them there.

===STATUS and handling errors===
What is <code>STATUS</code> for in <code>single_case.sh</code>? It is a variable which should be set to 0 if your case was computed correctly, and to some positive value otherwise. It is very important: it is used by <code>resubmit.run</code> to figure out which cases failed so they can be re-computed. In the provided version of <code>single_case.sh</code>, <code>STATUS</code> is set to the exit code of your program. This may not cover all potential problems, since some programs produce an exit code of zero even if something goes wrong.  You can change how <code>STATUS</code> is set by editing <code>single_case.sh</code>.

For example, if your code is supposed to write a file (say, <code>out.dat</code>) at the end of each case, test whether the file exists and is non-empty, and set <code>STATUS</code> appropriately.
In the following code fragment, <code>$STATUS</code> will be positive if either the exit code from the program is positive, or if <code>out.dat</code> doesn't exist or is empty:
<source lang="bash">
   STATUS=$?
   if test ! -s out.dat
     then
     STATUS=1
     fi
</source>
==job_script.sh==
The file <code>job_script.sh</code> is the job script which will be submitted to Slurm for all meta-jobs in your farm.
Here is the default version created for you by <code>farm_init.run</code>:
<source lang="bash">
#!/bin/bash
# Customize the sbatch switches below; the run-time switch (-t or --time)
# is mandatory, and you should use your own account name with -A:
#SBATCH -t 0-02:00
#SBATCH -A Your_account_name

task.run
</source>
At the very least you should change the account name (the <code>-A</code> switch) and the meta-job run-time (the <code>-t</code> switch).
In SIMPLE mode, you should set the run-time to be somewhat longer than the longest expected individual case.

'''Important:''' Your <code>job_script.sh</code> ''must'' include the run-time switch (either <code>-t</code> or <code>--time</code>).
This cannot be passed to <code>sbatch</code> as an optional argument to <code>submit.run</code>.


Sometimes the following problem happens: a meta-job may be allocated to a node which has a defect, thereby causing your program to fail instantly. For example, perhaps your program needs a GPU but the GPU you're assigned is malfunctioning, or perhaps the <code>/project</code> file system is not mounted. (Please report such a defective node to support@tech.alliancecan.ca if you detect one!)  But when it happens, that single bad meta-job can quickly churn through <code>table.dat</code>, so your whole farm fails. If you can anticipate such problems, you can add tests to <code>job_script.sh</code> before the <code>task.run</code> line. For example, the following modification will test for the presence of an NVidia GPU, and if none is found it will force the meta-job to exit before it starts failing your cases:


<source lang="bash">
nvidia-smi >/dev/null
retVal=$?
if [ $retVal -ne 0 ]; then
     exit 1
fi
task.run
</source>
There is a utility <code>gpu_test</code> which does a similar job to <code>nvidia-smi</code> in the above example.
On Graham, Cedar, or Beluga you can copy it to your <code>~/bin</code> directory:

 cp ~syam/bin/gpu_test ~/bin

The META package has a built-in mechanism which tries to detect problems of this kind and kill a meta-job which churns through the cases too quickly. The two relevant parameters, <code>N_failed_max</code> and <code>dt_failed</code>, are set in the file <code>config.h</code>. The protection mechanism is triggered when the first <code>$N_failed_max</code> cases are very short, less than <code>$dt_failed</code> seconds in duration.  The default values are 5 and 5, so by default a meta-job will stop if the first 5 cases all finish in less than 5 seconds. If you get false triggering of this protective mechanism because some of your normal cases have a run-time shorter than <code>$dt_failed</code>, reduce the value of <code>dt_failed</code> in <code>config.h</code>.
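For example, to make the protection less sensitive you could lower <code>dt_failed</code>. The snippet below assumes <code>config.h</code> uses shell-style assignments, as the <code>$</code>-prefixed references above suggest; the values are illustrative:

<source lang="bash">
# In config.h (illustrative values):
N_failed_max=5   # trigger if the first 5 cases are all very short...
dt_failed=1      # ...where "very short" now means under 1 second (default is 5)
</source>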


==Output files==
Once one or more meta-jobs in your farm are running, the following files will be created in the farm directory:
* <code>OUTPUT/slurm-$JOBID.out</code>, one file per meta-job: Standard output from meta-jobs.
* <code>STATUSES/status.$JOBID</code>, one file per meta-job: Files containing the statuses of the processed cases.


In both cases, <code>$JOBID</code> stands for the jobid of the corresponding meta-job.


One more directory, <code>MISC</code>, will also be created inside the root farm directory. It contains some auxiliary data.


Also, every time <code>submit.run</code> is run it will create a unique subdirectory inside <code>/home/$USER/tmp</code>.
Inside that subdirectory, some small scratch files will be created, such as files used by <code>lockfile</code> to serialize certain operations inside the jobs.
These subdirectories have names <code>$NODE.$PID</code>, where <code>$NODE</code> is the name of the current node (typically a login node), and <code>$PID</code> is the unique process ID for the script.
Once the farm execution is done, one can safely erase this subdirectory.
This will happen automatically if you run <code>clean.run</code>, but be careful! <code>clean.run</code> also '''deletes all the results''' produced by your farm!


==Resubmitting failed cases==
The command <code>resubmit.run</code> takes the same arguments as <code>submit.run</code>:
<source lang="bash">
   $  resubmit.run N [-auto] [optional_sbatch_arguments]
</source>
<code>resubmit.run</code>:
* analyzes all those <code>status.*</code> files (see [[#Output files|Output files]] above);
* figures out which cases failed and which never ran for whatever reason (e.g. because of the meta-jobs' run-time limit);
* creates a new case table (adding "_" at the end of the original table name), which lists only the cases which still need to be run;
* launches a new farm for those cases.


You cannot run <code>resubmit.run</code> until all the jobs from the original run are done or killed.


If some cases still fail or do not run, one can resubmit the farm as many times as needed.  Of course, if certain cases fail repeatedly then there must be a problem with either the program you are running or its input. In this case you may wish to use the command <code>Status.run</code> (capital S!), which displays the statuses of all computed cases. With the optional argument <code>-f</code>, <code>Status.run</code> will sort the output according to the exit status, showing cases with non-zero status at the bottom, to make them easier to spot.


Similarly to <code>submit.run</code>, if the optional switch <code>-auto</code> is present the farm will resubmit itself automatically at the end, more than once if necessary. This advanced feature is described at [[META:_Advanced_features_and_troubleshooting#Resubmitting_failed_cases_automatically|Resubmitting failed cases automatically]].
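A typical cleanup pass might look like this (commands only; the farm directory name is illustrative):

<source lang="bash">
 $ cd Farm_name
 $ Status.run -f     # inspect per-case exit statuses; failures (if any) are listed last
 $ resubmit.run -1   # resubmit all failed/never-run cases, one case per job
</source>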


=Large number of cases (META mode)=
==META mode overview==
The SIMPLE (one case per job) mode works fine when the number of cases is fairly small (<500).
When the number of cases is much greater than 500, the following problems may arise:

* Each cluster has a limit on how many jobs a user can have at one time. (E.g. for Graham, it is 1000.)
* With a very large number of cases, each case computation is typically short. If one case runs for <20 min, CPU cycles may be wasted due to scheduling overheads.

META mode is the solution to these problems.
Instead of submitting a separate job for each case, a smaller number of "meta-jobs" are submitted, each of which processes multiple cases.
To enable META mode, the first argument to <code>submit.run</code> should be the desired number of meta-jobs, which should be much smaller than the number of cases to process.  E.g.:


<source lang="bash">
  $  submit.run  32
</source>


Since each case may take a different amount of time to process, META mode uses a dynamic workload-balancing scheme.
This is how it is implemented:


[[File:meta1.png|500px]]


As the above diagram shows, each job executes the same script, <code>task.run</code>. Inside that script, there is a <code>while</code> loop over the cases. Each iteration of the loop has to go through a serialized portion of the code (that is, only one ''job'' at a time can be in that section of code), where it gets the next case to process from <code>table.dat</code>. Then the script <code>single_case.sh</code> (see [[#single_case.sh|single_case.sh]] above) is executed once for each case, which in turn calls the user code.


This approach results in dynamic workload balancing achieved across all the running meta-jobs belonging to the same farm. The algorithm is illustrated by the diagram below:


[[File:DWB_META.png|800px]]


This can be seen more clearly in [https://www.youtube.com/watch?v=GcYbaPClwGE&t=423s this animation] from the META webinar.


The dynamic workload balancing results in all meta-jobs finishing around the same time, regardless of how different the run-times are for individual cases, regardless of how fast CPUs are on different nodes, and regardless of when individual meta-jobs start. In addition, not all meta-jobs need to start running for all the cases to be processed, and if a meta-job dies (e.g. due to a node crash), at most one case will be lost. The latter can be easily rectified with <code>resubmit.run</code>; see [[#Resubmitting failed cases|Resubmitting failed cases]] above.


Not all of the requested meta-jobs will necessarily run, depending on how busy the cluster is. But as described above, in META mode you will eventually get all your results regardless of how many meta-jobs run, although you might need to use <code>resubmit.run</code> to complete a particularly large farm.

==Estimating the run-time and number of meta-jobs==
How should you figure out the optimum number of meta-jobs, and the run-time to be used in <code>job_script.sh</code>?

First you need to figure out the average run-time for an individual case (a single line in table.dat). Supposing your application program is not parallel, allocate a single CPU core with [[Running_jobs#Interactive_jobs|<code>salloc</code>]], then execute <code>single_case.sh</code> there for a few different cases.  Measure the total run-time and divide that by the number of cases you ran to get an estimate of the average case run-time. This can be done with a shell <code>for</code> loop:
<source lang="bash">
   $  N=10; time for ((i=1; i<=$N; i++)); do  ./single_case.sh table.dat $i  ; done
</source>
Divide the "real" time reported by the above command by <code>$N</code> to get the average case run-time estimate. Let's call it ''dt_case''.

Estimate the total CPU time needed to process the whole farm by multiplying ''dt_case'' by the number of cases, that is, the number of lines in <code>table.dat</code>.
If this is in CPU-seconds, dividing that by 3600 gives you the total number of CPU-hours.
Multiply that by something like 1.1 or 1.3 to have a bit of a safety margin.


Now you can make a sensible choice for the run-time of meta-jobs, and that will also determine the number of meta-jobs needed to finish the whole farm.


The run-time you choose should be significantly larger than the average run-time of an individual case, ideally by a factor of 100 or more.
It must definitely be larger than the longest run-time you expect for an individual case.
On the other hand it should not be too large; say, no more than 3 days.
The longer a job's run-time is, the longer it will usually wait to be scheduled.
On Alliance general-purpose clusters, a good choice would be 12h or 24h due to [[Job_scheduling_policies#Time_limits|scheduling policies]].
Once you have settled on a run-time, divide the total number of CPU-hours by the run-time you have chosen (in hours) to get the required number of meta-jobs.
Round up this number to the next integer.


With the above choices, the queue wait time should be fairly small, and the throughput and efficiency of the farm should be fairly high.


Let's consider a specific example. Suppose you ran the above <code>for</code> loop on a dedicated CPU obtained with <code>salloc</code>, and the output said the "real" time was 15m50s, which is 950 seconds. Divide that by the number of sample cases, 10, to find that the average time for an individual case is 95 seconds. Suppose also the total number of cases you have to process (the number of lines in <code>table.dat</code>) is 1000. The total CPU time required to compute all your cases is then 95 x 1000 = 95,000 CPU-seconds = 26.4 CPU-hours. Multiply that by a factor of 1.2 as a safety measure, to yield 31.7 CPU-hours.  A run-time of 3 hours for your meta-jobs would work here, and should lead to good queue wait times.  Edit the value of <code>#SBATCH -t</code> in <code>job_script.sh</code> to be <code>3:00:00</code>. Now estimate how many meta-jobs you'll need to process all the cases: N = 31.7 core-hours / 3 hours = 10.6, which rounded up to the next integer is 11. Then you can launch the farm by executing a single <code>submit.run 11</code>.
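The same arithmetic can be scripted if you prefer. This is just a convenience sketch, not part of the META package, using the numbers from the example above:

<source lang="bash">
# Estimate the number of meta-jobs (illustrative helper, not part of META):
dt_case=95     # average case run-time in seconds, measured as described above
n_cases=1000   # number of lines in table.dat
t_job=3        # chosen meta-job run-time, in hours
awk -v dt="$dt_case" -v n="$n_cases" -v t="$t_job" 'BEGIN {
  cpu_hours = dt * n / 3600 * 1.2                             # 1.2 = safety margin
  n_jobs = int(cpu_hours / t) + (cpu_hours % t > 0 ? 1 : 0)   # round up
  printf "CPU-hours: %.1f   meta-jobs needed: %d\n", cpu_hours, n_jobs
}'
</source>

With the example numbers this prints 31.7 CPU-hours and 11 meta-jobs, matching the calculation above.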
 
If the number of jobs in the above analysis is larger than 1000, you have a particularly large farm. 
The maximum number of jobs which can be submitted on Graham and Beluga is 1000, so you won't be able to run the whole collection with a single command. 
The workaround would be to go through the following sequence of commands.  
Remember each command can only be executed after the previous farm has finished running:


<source lang="bash">
   $  submit.run 1000
   $  resubmit.run 1000
   $  resubmit.run 1000
   ...
</source>
If this seems rather tedious, consider using an advanced feature of the META package for such large farms: [[META:_Advanced_features_and_troubleshooting#Resubmitting_failed_cases_automatically|Resubmitting failed cases automatically]]. This will fully automate the farm resubmission steps.


<!--T:77-->
==Reducing waste==
When you add the "-auto" switch, the (re)submit.run script submits one more (serial) job, in addition to the farm jobs. (This should be accommodated for in the value of the NJOBS_MAX parameter defined in config.h file. E.g. if the largest number of jobs one can submit on the cluster is 999, set this parameter to 998 when using the "-auto" feature.) The purpose of this job is to run the resubmit.run command automatically right after the current farm finished running. The job script for this additional job is resubmit_script.sh (should be present in the root farm directory; a sample file is automatically copied there when you run farm_init.run command). The only customization you need to do to this file is to put the correct account name as the "#SBATCH -A" argument.


<!--T:78-->
To prevent the risk of creating an infinite loop when using the "-auto" feature, auto-resubmission will stop, and farm computations will end, if at some point the only cases left to process are ones which failed earlier. (You can see the relevant messages in the file farm.log created in the root farm directory.) If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm.

==Reducing waste==
Here is one potential problem when running multiple cases per job: what if the number of running meta-jobs times the requested run-time per meta-job (say, 3 days) is not enough to process all your cases? E.g., suppose you managed to start the maximum allowed 1000 meta-jobs, each with a 3-day run-time limit. Your farm can then only process all the cases in a single run if ''average_case_run_time x N_cases < 1000 x 3 d = 3000'' CPU-days. Once your meta-jobs start hitting the 3-day run-time limit, they will start dying in the middle of processing a case, resulting in up to 1000 interrupted case calculations. This is not a big deal in terms of completing the work: <code>resubmit.run</code> will find all the cases which failed or never ran, and will restart them automatically. But it does waste CPU cycles - on average, ''0.5 x N_jobs x average_case_run_time''. E.g., if your cases have an average run-time of 1 hour and you have 1000 meta-jobs running, you will waste about 500 CPU-hours, or about 20 CPU-days, which is not acceptable.


<!--T:79-->
Fortunately, the scripts we provide have some built-in intelligence to mitigate this problem. It is implemented in <code>task.run</code> as follows:


<!--T:80-->
* The script measures the run-time of each case, and adds the value as one line in a scratch file <code>times</code> created in directory <code>/home/$USER/tmp/$NODE.$PID/</code>. (See [[#Output files|Output files]].) This is done by all running meta-jobs.
* Once the first 8 cases have been computed, one of the meta-jobs reads the contents of the file <code>times</code> and computes the cutoff for the longest 12.5% of the case run-times so far (i.e., the 87.5th percentile of the run-time distribution). This serves as a conservative estimate of the run-time of your individual cases, ''dt_cutoff''. The current estimate is stored in file <code>dt_cutoff</code> in <code>/home/$USER/tmp/$NODE.$PID/</code>.
* From then on, each meta-job estimates whether it has the time to finish the case it is about to start computing, by checking that ''t_finish - t_now > dt_cutoff''. Here, ''t_finish'' is the time when the job will die because of the job's run-time limit, and ''t_now'' is the current time. If the meta-job computes that it doesn't have the time, it exits early, which minimizes the chance of a case being aborted half-way through due to the job's run-time limit.
* At every subsequent power of two of the number of computed cases (8, then 16, then 32, and so on), ''dt_cutoff'' is recomputed using the above algorithm, which makes the estimate more and more accurate. Powers of two are used to minimize the overhead related to computing ''dt_cutoff''; the algorithm is equally efficient for both very small (tens) and very large (many thousands) numbers of cases.
* The above algorithm reduces the amount of CPU cycles wasted due to jobs hitting the run-time limit by a factor of 8, on average.


<!--T:81-->
As a useful side effect, every time you run a farm you get the individual run-times of all of your cases, stored in <code>/home/$USER/tmp/$NODE.$PID/times</code>.
You can analyze that file to fine-tune your farm setup, to profile your code, etc.
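For example, a quick summary of the case run-times can be obtained with the one-liner below (a minimal sketch, assuming the file contains one run-time in seconds per line; replace NODE.PID with the actual subdirectory name):

  $  awk '{s+=$1; if ($1>max) max=$1} END {print NR" cases; mean: "s/NR" s; max: "max" s"}' /home/$USER/tmp/NODE.PID/times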


<!--T:82-->
== Running a post-processing job automatically ==
Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been '''successfully''' processed. (If any cases failed, i.e. had a non-zero exit status, the post-processing job will not run.) To enable this feature, simply create a job script for the post-processing job, named final.sh, inside the root farm directory. This job can be of any kind: serial, parallel, or an array job.

Similarly to the [[#Running resubmit.run automatically|auto-resubmit]] feature, this feature uses one additional serial job, described by the file resubmit_script.sh in the root farm directory (make sure it uses the correct account name); adjust the NJOBS_MAX parameter defined in the config.h file accordingly (e.g., if the cluster has a job limit of 999, set it to 998). You can use the auto-resubmit and auto-post-processing features together, and that will still use only one serial job in addition to your farm jobs.
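For illustration, a minimal serial final.sh might look like this (the run-time, memory, account name, and the post-processing command are all placeholders; RUN*/out.dat assumes the default one-subdirectory-per-case setup with a hypothetical output file name):

  #!/bin/bash
  #SBATCH -t 0-01:00
  #SBATCH --mem=4G
  #SBATCH -A def-yourpi
  # Hypothetical post-processing step: gather all per-case results into one file
  cat RUN*/out.dat > all_results.dat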


=Additional information= <!--T:83-->
See [[META: Advanced features and troubleshooting]] for a more detailed discussion of some features, and for troubleshooting suggestions.
==Passing additional sbatch arguments==
What if you need to use additional sbatch arguments (like --mem 4G, --gres=gpu:1, etc.)? Simple: just add all those arguments at the end of the “submit.run” or “resubmit.run” command line, and they will be passed to sbatch, e.g.:
</translate>


<source lang="bash">
  $  submit.run  -1  --mem 4G
</source>
 
<translate>
<!--T:84-->
Alternatively, you can supply these arguments as separate "#SBATCH" lines in your job_script.sh file.
 
==Jobs environment== <!--T:130-->
All the jobs generated by the META package inherit the environment present at farm submission time (when you run submit.run or resubmit.run manually), including all the loaded modules and any user-created environment variables. The META package relies on this behaviour for its work (it uses some environment variables internally to pass parameters between scripts). You have to be careful not to break this default behaviour, for example when using the "--export" switch in farm jobs. If you do need to use "--export" in your farm, make sure "ALL" is one of its arguments (this preserves the whole existing environment), e.g. "--export=ALL,X=1,Y=2".
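For example, the following submits a farm while setting two extra variables (the number of meta-jobs and the variable names/values are illustrative) and preserving the rest of the environment:

  $  submit.run  32  --export=ALL,X=1,Y=2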
 
<!--T:131-->
If you need to pass the values of custom environment variables to all of your farm jobs (including the auto-resubmit jobs and the final post-processing job), do not use the "--export" switch for sbatch. Instead, submit (or resubmit) your farm as follows:
</translate>
 
<source lang="bash">
  $  VAR1=1 VAR2=5 VAR3=3.1416 submit.run ...
</source>
 
<translate>
<!--T:132-->
Here VAR1, VAR2, and VAR3 are custom environment variables which will be passed to all farm jobs.
 
==Multi-threaded farming== <!--T:85-->
For “multi-threaded farming” (OpenMP etc.), add the "--cpus-per-task=N" and "--mem=XXX" sbatch arguments to “(re)submit.run” (or add the corresponding #SBATCH lines to your job_script.sh file). Here “N” is the number of CPU cores/threads to use, and “XXX” is the memory to request per meta-job (e.g. 16G). Also, add the following line to your “job_script.sh” file, right before the task.run line:
</translate>
 
<source lang="bash">
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
</source>
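
<translate>
For example, to run a farm where each case uses 4 threads (the number of meta-jobs, cores, and memory below are only placeholders), one might submit:
</translate>

<source lang="bash">
  $  submit.run  32  --cpus-per-task=4  --mem=16G
</source>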
 
<translate>
==MPI farming== <!--T:86-->
For “MPI farming”, use these sbatch arguments with “(re)submit.run” (or add the corresponding #SBATCH lines to your job_script.sh file):
</translate>
 
<source lang="bash">
  --ntasks=N  --mem-per-cpu=XXX
</source>
 
<translate>
<!--T:87-->
Also add “srun” before the path to your code inside “single_case.sh”, e.g.:
</translate>
 
<source lang="bash">
  srun  $COMM
</source>
 
<translate>
<!--T:88-->
Alternatively, you can prepend “srun” on each line of your table.dat:
</translate>
 
<source lang="bash">
  srun /path/to/mpi_code arg1 arg2
  srun /path/to/mpi_code arg1 arg2
  ...
  srun /path/to/mpi_code arg1 arg2
</source>
 
<translate>
==GPU farming== <!--T:89-->
For GPU farming, you only need to modify your job_script.sh file accordingly. For example, for a farm where each case uses one GPU, add one extra line:
</translate>
 
<source lang="bash">
#SBATCH --gres=gpu:1
</source>
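
<translate>
Putting it together, a minimal job_script.sh for single-GPU cases might look like this (the run-time, memory, and account name are placeholders; the structure follows the default script created by farm_init.run):
</translate>

<source lang="bash">
#!/bin/bash
#SBATCH -t 0-03:00
#SBATCH --mem=4G
#SBATCH --gres=gpu:1
#SBATCH -A def-yourpi

# Don't change this line:
task.run
</source>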
 
<translate>
<!--T:90-->
It is also a good idea to copy the utility ~syam/bin/gpu_test to your ~/bin directory (this utility is available only on Graham, Cedar, and Beluga), and to put the following lines in your job_script.sh file right before the "task.run" line:
</translate>
 
<source lang="bash">
gpu_test
retVal=$?
if [ $retVal -ne 0 ]; then
    echo "No GPU found - exiting..."
    exit 1
fi
</source>
 
<translate>
<!--T:91-->
This will catch those rare situations when a technical issue with the node renders the GPU unavailable. If that happens to one of your meta-jobs and you don't have the above lines in your script, the rogue meta-job will churn through (and fail) all your cases from table.dat.
 
==FORTRAN code example: using standard input== <!--T:92-->
Suppose you have a serial FORTRAN (or C/C++) code, “fcode”, and each case needs to read a separate file from standard input - say “data.xxx” (in the /home/user/IC directory), where xxx runs from 1 to N_cases. Place “fcode” on your $PATH (e.g., put it in ~/bin, making sure /home/$USER/bin is added to $PATH in .bashrc; alternatively, use the full path to your code in the cases table). Create table.dat (inside the farm directory) like this:
 
  <!--T:93-->
  fcode < /home/user/IC/data.1
  fcode < /home/user/IC/data.2
  ...
  fcode < /home/user/IC/data.N_cases
 
<!--T:94-->
The task of creating the table can be greatly simplified if you use a BASH loop command, e.g.:
</translate>
 
<source lang="bash">
  $  for ((i=1; i<=10; i++)); do echo "fcode < /home/user/IC/data.$i"; done >table.dat
</source>
 
<translate>
==FORTRAN code example: copying files for each case== <!--T:95-->
Another typical FORTRAN code situation: you need to copy a file (say, /path/to/data.xxx) into each case subdirectory before executing the code, renaming it to some standard input file name. Your table.dat can look like this:
 
  <!--T:96-->
  /path/to/code
  /path/to/code
  ...
 
<!--T:97-->
Add one line (the first line in the example below) to your “single_case.sh”:
</translate>
 
<source lang="bash">
  \cp /path/to/data.$ID standard_name
  $COMM
  STATUS=$?
</source>
 
<translate>
==Using all the columns in the cases table explicitly== <!--T:98-->
The examples shown so far presume that each line in the cases table is an executable statement: it starts with either the name of the code binary (when the binary is on your $PATH) or the full path to the binary, followed by the command line arguments (if any) particular to that case, or by something like " < input.$ID" if your code expects its input via standard input.
 
<!--T:99-->
In the most general case, you may want the full flexibility of accessing all the columns in the table individually. That is easy to achieve by slightly modifying the "single_case.sh" script:
</translate>
 
<source lang="bash">
...
# ++++++++++++  This part can be customized:  ++++++++++++++++
#  $ID contains the case id from the original table
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
mkdir RUN$ID
cd RUN$ID
 
# Converting $COMM to an array:
COMM=( $COMM )
# Number of columns in COMM:
Ncol=${#COMM[@]}
# Now one can access the columns individually, as ${COMM[i]} , where i=0...$Ncol-1
# A range of columns can be accessed as ${COMM[@]:i:n} , where i is the first column
# to display, and n is the number of columns to display
# Use the ${COMM[@]:i} syntax to display all the columns starting from the i-th column
# (use for codes with a variable number of command line arguments).
 
# Call the user code here.
...
 
# Exit status of the code:
STATUS=$?
cd ..
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
</source>
 
<translate>
<!--T:100-->
For example, suppose you need to provide to your code both an initial conditions file (to be read via standard input) and a variable number of command line arguments. Your cases table will look like this:
 
  <!--T:101-->
  /path/to/IC.1 0.1
  /path/to/IC.2 0.2 10
  ...
 
<!--T:102-->
The way to implement this in "single_case.sh" is as follows:
</translate>
 
<source lang="bash">
# Call the user code here.
/path/to/code ${COMM[@]:1} < ${COMM[0]}
</source>
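
<translate>
To see how the array slicing works, here is a small stand-alone illustration (the table line used is hypothetical):
</translate>

<source lang="bash">
# A case line "/path/to/IC.2 0.2 10" parsed as in single_case.sh:
COMM=( /path/to/IC.2 0.2 10 )
echo "${COMM[0]}"      # prints: /path/to/IC.2
echo "${COMM[@]:1}"    # prints: 0.2 10
echo "${#COMM[@]}"     # prints: 3 (the number of columns)
</source>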
 
<translate>
=Troubleshooting= <!--T:103-->
Here we explain typical error messages you might get when using this package.
 
==Problems affecting multiple commands== <!--T:104-->
==="Non-farm directory, or no farm has been submitted; exiting"===
Either the current directory is not a farm directory, or you never ran the "submit.run" command for this farm.
 
==Problems with submit.run== <!--T:105-->
===Wrong first argument: XXX (should be a positive integer or -1) ; exiting===
Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode.
 
==="lockfile is not on path; exiting"=== <!--T:106-->
Make sure the utility lockfile is on your $PATH.
 
==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== <!--T:107-->
Either the current directory is not a farm directory, or some important files are missing. Cd to the correct (farm) directory, or create the missing files.
 
==="-auto option requires resubmit_script.sh file in the root farm directory; exiting"=== <!--T:108-->
You used the "-auto" option, but you forgot to create the resubmit_script.sh file inside the root farm directory. A sample resubmit_script.sh file is created automatically when you use farm_init.run.
 
==="File table.dat doesn't exist. Exiting"=== <!--T:109-->
You forgot to create the table.dat file in the current directory, or perhaps you are not running submit.run inside one of your farm directories.
 
==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== <!--T:110-->
Make sure you provide the runtime for all meta-jobs as an #SBATCH argument inside your job_script.sh file. This is a requirement - the runtime sbatch argument is the only one which cannot be passed as an optional argument for submit.run.
 
==="Wrong job runtime in job_script.sh - nnn . Exiting"=== <!--T:111-->
The runtime argument inside your job_script.sh file is not formatted correctly.
 
==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== <!--T:112-->
==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"===
With either of the two messages, there was an issue with submitting jobs with sbatch. The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later.
 
==="Couldn't create subdirectories inside the farm directory ; exiting"=== <!--T:113-->
==="Couldn't create the temp directory XXX ; exiting"===
==="Couldn't create a file inside XXX ; exiting"===
With any of the above messages, something is wrong with a file system - either permissions got messed up, or you hit the file system quota. Fix the issue(s), then try again.
 
==Problems with resubmit.run== <!--T:114-->
==="Jobs are still running/queued; cannot resubmit"===
You cannot use resubmit.run until all meta-jobs from this farm have finished running. Use list.run or query.run to check the status of the farm.
 
==="No failed/unfinished jobs; nothing to resubmit"=== <!--T:115-->
Not an error - simply tells you that your farm was 100% processed, and that there are no more (failed or never-ran) cases to compute.
 
==Problems with running jobs== <!--T:116-->
==="Too many failed (very short) cases - exiting"===
This happens if the first $N_failed_max (5 by default) cases are very short - less than $dt_failed (5 by default) seconds in duration. The two variables, $N_failed_max and $dt_failed, can be adjusted by editing the config.h file (in the root farm directory). This is a protection mechanism for when anything is amiss - a problem with the node (file system not mounted, GPU missing, etc.), with the job parameters (not enough RAM, etc.), or with the code (the binary missing or instantly crashing, input files missing, etc.). The protection prevents a bad meta-job from churning through (and failing) all the cases in table.dat.
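For example, if some of your normal cases legitimately finish in under 5 seconds, you could relax the trigger by editing these two lines in config.h (illustrative values; the parameters are assumed here to be plain bash-style assignments):

  N_failed_max=5
  dt_failed=2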
 
==="lockfile is not on path on node XXX"=== <!--T:117-->
As the error message suggests, the utility lockfile is not on your $PATH - either you forgot to modify your $PATH variable accordingly, or to copy lockfile into your ~/bin directory, or perhaps something is wrong on that particular compute node (e.g. the home file system is not mounted). The lockfile utility is critical for this package (it serializes the access of meta-jobs to the table.dat file), and the package won't work if the utility is not accessible.
 
==="Exiting after processing one case (-1 option)"=== <!--T:118-->
This is actually not an error - it simply tells you that you submitted the farm via "submit.run -1" (one case per job mode), so each meta-job exits after processing a single case.
 
==="Not enough runtime left; exiting."=== <!--T:119-->
This message tells you that the meta-job would likely not have enough time left to process the next case (based on an analysis of the run-times of all the cases processed so far), so it is exiting early.
 
==="No cases left; exiting."=== <!--T:120-->
This is not an error message - this is how each meta-job normally finishes, when all cases have already been computed.
 
==="Only failed cases left; cannot auto-resubmit; exiting"=== <!--T:121-->
This can only happen if you used the "-auto" switch when submitting the farm. Find the failed cases ("Status.run -f"), fix the issue(s) causing them to fail, then run the resubmit.run command to process those cases.
 
=Words of caution= <!--T:122-->
Always start with a much smaller test farm, to make sure everything works, before submitting a large production farm. You can test individual cases by reserving an interactive node with the salloc command, cd-ing to the farm directory, and executing commands like "./single_case.sh table.dat 1", "./single_case.sh table.dat 2", etc.
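For example (the requested resources and the account name are placeholders):

  $  salloc --time=1:00:00 --mem=4G -A def-yourpi
  $  cd /path/to/farm_directory
  $  ./single_case.sh table.dat 1
  $  ./single_case.sh table.dat 2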
 
<!--T:123-->
If your farm is particularly large (say, more than 10,000 cases), extra effort should be spent to make sure it runs as efficiently as possible. In particular, you should minimize the number of files and/or directories created during job execution. If possible, instruct your code to append to existing files (one per meta-job; '''do not mix results from different meta-jobs in a single output file!''') instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case (which is the default setup of this package).
 
<!--T:124-->
The following example (optimized for a very large number of cases) assumes that your code accepts the output file name via the "-o" command line switch, that the output file is used in append mode (multiple code runs will keep appending to the existing file), and that each line of table.dat provides the rest of the command line arguments for your code. It is also assumed that multiple instances of your code can safely run concurrently inside the same directory (so there is no need to create subdirectories for each case), and that each code run will not produce any other files (besides the output file). With this setup, even very large farms (hundreds of thousands, or even millions, of cases) should run efficiently, as very few files will be created.
</translate>
<source lang="bash">
...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $METAJOB_ID is the jobid for the current meta-job (convenient for creating per-job files)
 
# Executing the command (a line from table.dat)
/path/to/your/code  $COMM  -o output.$METAJOB_ID
 
# Exit status of the code:
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
</source>
 
<translate>

= If more help is needed = <!--T:126-->
== Glossary ==
* '''case''': One independent computation. The file <code>table.dat</code> should list one case per line.
* '''farm / farming''' (verb): Running many jobs on a cluster which carry out independent (but related) computations, of the same kind.
* '''farm''' (noun): The directory and files involved in running one instance of the package.
* '''meta-job''': A job which can process multiple cases (independent computations) from <code>table.dat</code>.
* '''META mode''': The mode of operation of the package in which each job can process ''multiple'' cases from <code>table.dat</code>.
* '''SIMPLE mode''': The mode of operation of the package in which each job will process only one case from <code>table.dat</code>.


<!--T:127-->
If more help is needed, submit a ticket by contacting our [[Technical support]], mentioning the name of the package (META) and the name of the staff member who wrote the software (Sergey Mashchenko).


</translate>
[[Category:Tutorials]]

META vs. GLOST

There are three important advantages of the META package over other approaches, like GLOST, where farm processing is done by bundling up all the jobs into a large parallel (MPI) job:

  1. As the scheduler has full flexibility to start individual meta-jobs when it wants, the queue wait time can be dramatically shorter with the META package than with GLOST. Consider a large farm where 1000 CPU cores need to be used for 3 days. With META, some meta-jobs start to run and produce the first results within minutes. With GLOST, with a 1000-way MPI job, queue wait time can be weeks, so it'll be weeks before you see your very first result.
  2. With GLOST, at the end of the farm computations, some MPI ranks will finish earlier and will sit idle until the very last -- the slowest -- MPI rank ends. In META package there is no such waste at the end of the farm -- individual meta-jobs exit earlier if they have no more workload to process.
  3. GLOST and other similar packages do not support automated resubmission of the cases which failed or never ran. META has this feature, and it is very easy to use.

The META webinar

A webinar was recorded on October 6th, 2021 describing the META package. You can view it here.

Quick start

If you are impatient to start using the package, just follow the steps listed below. But it is highly recommended to also read the rest of the page.

  • Log in to a cluster.
  • Load the meta-farm module:
$ module load meta-farm
  • Choose a name for a farm directory, e.g. Farm_name, and create it with the following command:
$ farm_init.run  Farm_name
  • Customize the files single_case.sh and job_script.sh inside the new farm directory as needed (see single_case.sh and job_script.sh below), and create a table.dat file there, with one case per line. Then, inside the farm directory, run
$ submit.run -1

for the one case per job (SIMPLE) mode, or

$ submit.run N

for the many cases per job (META) mode, where N is the number of meta-jobs to use. N should be significantly smaller than the total number of cases.

To run another farm concurrently with the first one, run farm_init.run again (providing a different farm name) and customize the files single_case.sh and job_script.sh inside the new farm directory, then create a new table.dat file there. Also copy the executable and all the input files as needed. Now you can execute the submit.run command inside the second farm directory to submit the second farm.

List of commands

  • farm_init.run: Initialize a farm. See Quick start above.
  • submit.run: Submit the farm to the scheduler. See submit.run below.
  • resubmit.run: Resubmit all computations which failed or never ran as a new farm. See Resubmitting failed cases below.
  • list.run: List all the jobs of the farm with their current state.
  • query.run: Provide a short summary of the state of the farm, showing the number of queued, running, and completed jobs. More convenient than list.run when the number of jobs is large. It will also print the progress (the number of processed cases vs. the total number of cases), both for the current run and globally.
  • kill.run: Kill all the running and queued jobs in the farm.
  • prune.run: Remove only the queued jobs.
  • Status.run (capital "S"!): List the statuses of all processed cases. With the optional -f switch, the non-zero status lines (if any) will be listed at the end.
  • clean.run: Delete all the files in the farm directory (including subdirectories if any are present), except for job_script.sh, single_case.sh, final.sh, resubmit_script.sh, config.h, and table.dat. It will also delete all files associated with this farm in the /home/$USER/tmp directory. Be very careful with this script!

All of these commands (except for farm_init.run itself) have to be executed inside a farm directory, that is, a directory created by farm_init.run.

Small number of cases (SIMPLE mode)

Recall that a single execution of your code is a "case" and a "job" is an invocation of the Slurm scheduler. If:

  • the total number of cases is fairly small (say, less than 500), and
  • each case runs for at least 10 minutes,

then it is reasonable to dedicate a separate job to each case using SIMPLE mode. Otherwise, you should consider using META mode to handle many cases per job; see Large number of cases (META mode) below.

The three essential scripts are the command submit.run, and two user-customizable scripts single_case.sh and job_script.sh.

submit.run

The command submit.run has one obligatory argument, the number of jobs to submit, N:

   $ submit.run N [-auto] [optional_sbatch_arguments]

If N=-1, you are requesting the SIMPLE mode ("submit as many jobs as there are lines in table.dat"). If N is a positive integer, you are requesting the META mode (multiple cases per job), with N being the number of meta-jobs requested. Any other value for N is an error.

If the optional switch -auto is present, the farm will resubmit itself automatically at the end, more than once if necessary, until all the cases from table.dat have been processed. This feature is described at Running resubmit.run automatically.

If a file named final.sh is present in the farm directory, submit.run will treat it as a job script for a post-processing job and it will be launched automatically once all the cases from table.dat have been successfully processed. See Running a post-processing job automatically for more details.

If you supply any other arguments, they will be passed on to the Slurm command sbatch used to launch all meta-jobs for this farm.

single_case.sh

The function of single_case.sh is to read one line from table.dat, parse it, and use the contents of that line to launch your code for one case. You may wish to customize single_case.sh for your purposes.

The version of single_case.sh provided by farm_init.run treats each line in table.dat as a literal command and executes it in its own subdirectory RUNyyy, where yyy is the case number. Here is the relevant section of single_case.sh:

...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $METAJOB_ID is the jobid for the current meta-job (convenient for creating per-job files)

mkdir -p RUN$ID
cd RUN$ID

echo "Case $ID:"

# Executing the command (a line from table.dat)
# It's allowed to use more than one shell command (separated by semicolons) on a single line
eval "$COMM"

# Exit status of the code:
STATUS=$?

cd ..
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...

Consequently, if you are using the unmodified single_case.sh then each line of table.dat should contain a complete command. This may be a compound command, that is, several commands separated by semicolons (;).

Typically table.dat will contain a list of identical commands differentiated only by their arguments, but it need not be so. Any executable statement can go into table.dat. Your table.dat could look like this:

 /home/user/bin/code1  1.0  10  2.1
 cp -f ~/input_dir/input1 .; ~/code_dir/code 
 ./code2 < IC.2

If you intend to execute the same command for every case and don't wish to repeat it on every line of table.dat, then you can edit single_case.sh to include the common command. Then edit your table.dat to contain only the arguments and/or redirects for each case.

For example, here is a modification of single_case.sh which includes the command (/path/to/your/code), takes the contents of table.dat as arguments to that command, and uses the case number $ID as an additional argument:

  • single_case.sh:
...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
# Here we use $ID (case number) as a unique seed for Monte-Carlo type serial farming:
/path/to/your/code -par $COMM  -seed $ID
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
  • table.dat:
 12.56
 21.35
 ...

Note: If your code doesn't need to read any arguments from table.dat, you still have to generate table.dat, with the number of lines equal to the number of cases you want to compute. In this case, it doesn't matter what you put inside table.dat; all that matters is the total number of lines. The key line in the above example might then look like

/path/to/your/code -seed $ID

Another note: You do not need to insert line numbers at the beginning of each line of table.dat. The script submit.run will modify table.dat to add line numbers if it doesn't find them there.
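For example, a dummy table with 1000 lines (one line per case; the content of the lines is irrelevant in this scenario) can be generated with:

$ for ((i=1; i<=1000; i++)); do echo $i; done > table.dat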

STATUS and handling errors

What is STATUS for in single_case.sh? It is a variable which should be set to “0” if your case was computed correctly, and some positive value (that is, greater than 0) otherwise. It is very important: It is used by resubmit.run to figure out which cases failed so they can be re-computed. In the provided version of single_case.sh, STATUS is set to the exit code of your program. This may not cover all potential problems, since some programs produce an exit code of zero even if something goes wrong. You can change how STATUS is set by editing single_case.sh.

For example if your code is supposed to write a file (say, out.dat) at the end of each case, test whether the file exists and set STATUS appropriately. In the following code fragment, $STATUS will be positive if either the exit code from the program is positive, or if out.dat doesn't exist or is empty:

  STATUS=$?
  if test ! -s out.dat
     then
     STATUS=1
     fi

job_script.sh

The file job_script.sh is the job script which will be submitted to SLURM for all meta-jobs in your farm. Here is the default version created for you by farm_init.run:

#!/bin/bash
# Here you should provide the sbatch arguments to be used in all jobs in this serial farm
# It has to contain the runtime switch (either -t or --time):
#SBATCH -t 0-00:10
#SBATCH --mem=4G
#SBATCH -A Your_account_name

# Don't change this line:
task.run

At the very least you should change the account name (the -A switch), and the meta-job run-time (the -t switch). In SIMPLE mode, you should set the run-time to be somewhat longer than the longest expected individual case.

Important: Your job_script.sh must include the run-time switch (either -t or --time). This cannot be passed to sbatch as an optional argument to submit.run.

Sometimes the following problem happens: A meta-job may be allocated to a node which has a defect, thereby causing your program to fail instantly. For example, perhaps your program needs a GPU but the GPU you're assigned is malfunctioning, or perhaps the /project file system is not mounted. (Please report such a defective node to support@tech.alliancecan.ca if you detect one!) But when it happens, that single bad meta-job can quickly churn through table.dat, so your whole farm fails. If you can anticipate such problems, you can add tests to job_script.sh before the task.run line. For example, the following modification will test for the presence of an NVidia GPU, and if none is found it will force the meta-job to exit before it starts failing your cases:

nvidia-smi >/dev/null
retVal=$?
if [ $retVal -ne 0 ]; then
    exit 1
fi
task.run

There is a utility gpu_test which does a similar job to nvidia-smi in the above example. On Graham, Cedar, or Beluga you can copy it to your ~/bin directory:

cp ~syam/bin/gpu_test ~/bin

The META package has a built-in mechanism which tries to detect problems of this kind and kill a meta-job which churns through the cases too quickly. The two relevant parameters, N_failed_max and dt_failed are set in the file config.h. The protection mechanism is triggered when the first $N_failed_max cases are very short - less than $dt_failed seconds in duration. The default values are 5 and 5, so by default a meta-job will stop if the first 5 cases all finish in less than 5 seconds. If you get false triggering of this protective mechanism because some of your normal cases have run-time shorter than $dt_failed, reduce the value of dt_failed in config.h.

Output files

Once one or more meta-jobs in your farm are running, the following files will be created in the farm directory:

  • OUTPUT/slurm-$JOBID.out, one file per meta-job: Standard output from meta-jobs.
  • STATUSES/status.$JOBID, one file per meta-job: Files containing the statuses of the processed cases.

In both cases, $JOBID stands for the jobid of the corresponding meta-job.

One more directory, MISC, will also be created inside the root farm directory. It contains some auxiliary data.

Also, every time submit.run is run it will create a unique subdirectory inside /home/$USER/tmp. Inside that subdirectory, some small scratch files will be created, such as files used by lockfile to serialize certain operations inside the jobs. These subdirectories have names $NODE.$PID, where $NODE is the name of the current node (typically a login node), and $PID is the unique process ID for the script. Once the farm execution is done, one can safely erase this subdirectory. This will happen automatically if you run clean.run, but be careful! clean.run also deletes all the results produced by your farm!

Resubmitting failed cases

The command resubmit.run takes the same arguments as submit.run:

   $  resubmit.run N [-auto] [optional_sbatch_arguments]

resubmit.run:

  • analyzes all those status.* files (see Output files above);
  • figures out which cases failed and which never ran for whatever reason (e.g. because of the meta-jobs' run-time limit);
  • creates a new case table (adding “_” at the end of the original table name), which lists only the cases which still need to be run;
  • launches a new farm for those cases.

You cannot run resubmit.run until all the jobs from the original run are done or killed.

If some cases still fail or do not run, you can resubmit the farm as many times as needed. Of course, if certain cases fail repeatedly, there must be a problem with either the program you are running or its input. In this case you may wish to use the command Status.run (capital S!), which displays the statuses of all computed cases. With the optional argument -f, Status.run will sort the output according to the exit status, showing cases with non-zero status at the bottom, to make them easier to spot.

Similarly to submit.run, if the optional switch -auto is present the farm will resubmit itself automatically at the end, more than once if necessary. This advanced feature is described at Resubmitting failed cases automatically.

Large number of cases (META mode)

META mode overview

The SIMPLE (one case per job) mode works fine when the number of cases is fairly small (<500). When the number of cases is much greater than 500, the following problems may arise:

  • Each cluster has a limit on how many jobs a user can have at one time. (E.g. for Graham, it is 1000.)
  • With a very large number of cases, each case computation is typically short. If one case runs for <20 min, CPU cycles may be wasted due to scheduling overheads.

META mode is the solution to these problems. Instead of submitting a separate job for each case, a smaller number of "meta-jobs" is submitted, each of which processes multiple cases. To enable META mode, the first argument to submit.run should be the desired number of meta-jobs, which should be much smaller than the number of cases to process. E.g.:

   $  submit.run  32

Since each case may take a different amount of time to process, META mode uses a dynamic workload-balancing scheme. This is how it is implemented:

[Figure Meta1.png: schematic of how META mode is implemented]

As the above diagram shows, each meta-job executes the same script, task.run. Inside that script, there is a while loop over the cases. Each iteration of the loop has to go through a serialized portion of the code (that is, only one meta-job at a time can be in that section of code), where it gets the next case to process from table.dat. Then the script single_case.sh (see single_case.sh above) is executed once for that case, which in turn calls the user code.
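Conceptually, the loop inside task.run resembles the following sketch, where get_next_case_id is a hypothetical stand-in for the serialized, lockfile-protected bookkeeping (the actual script is more involved):

while true; do
    ID=$(get_next_case_id)            # hypothetical helper; serialized via lockfile in the real script
    if [ -z "$ID" ]; then break; fi   # no cases left - exit this meta-job
    ./single_case.sh table.dat $ID    # process one case
done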

This approach results in dynamic workload balancing achieved across all the running "meta-jobs" belonging to the same farm. The algorithm is illustrated by the diagram below:

[Figure DWB META.png: dynamic workload balancing between meta-jobs]

This can be seen more clearly in this animation from the META webinar.

The dynamic workload balancing results in all meta-jobs finishing around the same time, regardless of how different the run-times of individual cases are, how fast the CPUs are on different nodes, and when the individual meta-jobs start. In addition, not all meta-jobs need to start running for all the cases to be processed, and if a meta-job dies (e.g. due to a node crash), at most one case will be lost. The latter is easily rectified with resubmit.run; see Resubmitting failed cases above.

Not all of the requested meta-jobs will necessarily run, depending on how busy the cluster is. But as described above, in META mode you will eventually get all your results regardless of how many meta-jobs run, although you might need to use resubmit.run to complete a particularly large farm.

Estimating the run-time and number of meta-jobs

How should you figure out the optimum number of meta-jobs, and the run-time to be used in job_script.sh?

First you need to figure out the average run-time for an individual case (a single line in table.dat). Supposing your application program is not parallel, allocate a single CPU core with salloc, then execute single_case.sh there for a few different cases. Measure the total run-time and divide that by the number of cases you ran to get an estimate of the average case run-time. This can be done with a shell for loop:

   $  N=10; time for ((i=1; i<=$N; i++)); do  ./single_case.sh table.dat $i  ; done

Divide the "real" time reported by the above command by N to get an estimate of the average case run-time. Let's call it dt_case.

Estimate the total CPU time needed to process the whole farm by multiplying dt_case by the number of cases, that is, the number of lines in table.dat. If this is in CPU-seconds, dividing that by 3600 gives you the total number of CPU-hours. Multiply that by something like 1.1 or 1.3 to have a bit of a safety margin.

Now you can make a sensible choice for the run-time of meta-jobs, and that will also determine the number of meta-jobs needed to finish the whole farm.

The run-time you choose should be significantly larger than the average run-time of an individual case, ideally by a factor of 100 or more. It must definitely be larger than the longest run-time you expect for an individual case. On the other hand it should not be too large; say, no more than 3 days. The longer a job's run-time is, the longer it will usually wait to be scheduled. On Alliance general-purpose clusters, a good choice would be 12h or 24h due to scheduling policies. Once you have settled on a run-time, divide the total number of CPU-hours by the run-time you have chosen (in hours) to get the required number of meta-jobs. Round up this number to the next integer.

With the above choices, the queue wait time should be fairly small, and the throughput and efficiency of the farm should be fairly high.

Let's consider a specific example. Suppose you ran the above for loop on a dedicated CPU obtained with salloc, and the output said the "real" time was 15m50s, which is 950 seconds. Divide that by the number of sample cases, 10, to find that the average time for an individual case is 95 seconds. Suppose also the total number of cases you have to process (the number of lines in table.dat) is 1000. The total CPU time required to compute all your cases is then 95 x 1000 = 95,000 CPU-seconds = 26.4 CPU-hours. Multiply that by a factor of 1.2 as a safety measure, to yield 31.7 CPU-hours. A run-time of 3 hours for your meta-jobs would work here, and should lead to good queue wait times. Edit the value of the #SBATCH -t in job_script.sh to be 3:00:00. Now estimate how many meta-jobs you'll need to process all the cases: N = 31.7 core-hours / 3 hours = 10.6, which rounded up to the next integer is 11. Then you can launch the farm by executing a single submit.run 11.

If the number of jobs in the above analysis is larger than 1000, you have a particularly large farm. The maximum number of jobs which can be submitted on Graham and Beluga is 1000, so you won't be able to run the whole collection with a single command. The workaround would be to go through the following sequence of commands. Remember each command can only be executed after the previous farm has finished running:

   $  submit.run 1000
   $  resubmit.run 1000
   $  resubmit.run 1000
   ...

If this seems rather tedious, consider using an advanced feature of the META package for such large farms: Resubmitting failed cases automatically. This will fully automate the farm resubmission steps.
