META-Farm: Advanced features and troubleshooting

From Alliance Doc
Revision as of 15:16, 9 November 2022 by Rdickson (talk | contribs) (creation, split from META: A package...)

This page is about META: A package for job farming.

Resubmitting failed cases automatically

If your farm is particularly large, that is, if it needs more resources than NJOBS_MAX x job_run_time, where NJOBS_MAX is the maximum number of jobs you are allowed to submit, you will have to run resubmit.run after the original farm finishes running, perhaps more than once. You can do this by hand, but META can also automate the process. To enable this feature, add the -auto switch to your submit.run or resubmit.run command:

$ submit.run N -auto

This can be used in either SIMPLE or META mode. If your original submit.run command did not have the -auto switch, you can add it to resubmit.run after the original farm finishes running, to the same effect.

When you add -auto, (re)submit.run submits one more (serial) job, in addition to the farm jobs. The purpose of this job is to run the resubmit.run command automatically right after the current farm finishes running. The job script for this additional job is resubmit_script.sh, which should be present in the farm directory; a sample file is automatically copied there when you run farm_init.run. The only customization you need to do to this file is to correct the account name in the #SBATCH -A line.
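The account edit can also be scripted. The following is a minimal sketch, assuming GNU sed and assuming the sample file contains a line that starts with #SBATCH -A; the helper name set_account and the account def-someuser are placeholders for illustration and are not part of META.

```shell
# Replace the account on the "#SBATCH -A" line of a job script.
# Usage: set_account ACCOUNT FILE   (hypothetical helper, not part of META)
set_account() {
    local account=$1 file=$2
    # GNU sed: edit the file in place, rewriting only the "#SBATCH -A" line
    sed -i "s/^#SBATCH -A .*/#SBATCH -A ${account}/" "$file"
}

# Example (def-someuser is a placeholder account name):
# set_account def-someuser resubmit_script.sh
```

Check the result with grep '^#SBATCH -A' resubmit_script.sh before submitting the farm.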

If you are using -auto, the value of the NJOBS_MAX parameter defined in the config.h file should be at least one smaller than the largest number of jobs you can submit on the cluster. E.g. if the largest number of jobs one can submit on the cluster is 999 and you intend to use -auto, set NJOBS_MAX to 998.

When using -auto, if at some point the only cases left to be processed are the ones which failed earlier, auto-resubmission will stop, and farm computations will end. This is to avoid an infinite loop on badly-formed cases which will always fail. If this happens, you will have to address the reasons for these cases failing before attempting to resubmit the farm. You can see the relevant messages in the file farm.log created in the farm directory.

Running a post-processing job automatically

Another advanced feature is the ability to run a post-processing job automatically once all the cases from table.dat have been successfully processed. If any cases failed, i.e. had a non-zero exit status, the post-processing job will not run. To enable this feature, simply create a script for the post-processing job with the name final.sh inside the farm directory. This job can be of any kind: serial, parallel, or an array job.

This feature uses the same script, resubmit_script.sh, described for -auto above. Make sure resubmit_script.sh has the correct account name in the #SBATCH -A line.

The automatic post-processing feature also causes one more serial job to be submitted, above the number you request. Adjust the parameter NJOBS_MAX in config.h accordingly (e.g. if the cluster has a job limit of 999, set it to 998). However, if you use both the auto-resubmit and the auto-post-processing features, together they will submit only one additional job, so you do not need to subtract 2 from NJOBS_MAX.
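The rule for NJOBS_MAX can be written out as a small calculation. This is only an illustrative sketch: the variable names CLUSTER_JOB_LIMIT, USE_AUTO, USE_FINAL and EXTRA are made up for the example, and the limit of 999 is the same hypothetical limit used in the text.

```shell
# Hypothetical inputs: the cluster's job limit and which features are in use.
CLUSTER_JOB_LIMIT=999
USE_AUTO=yes        # the farm is submitted with -auto
USE_FINAL=yes       # final.sh is present in the farm directory

# Both features share a single extra serial job, so reserve at most one slot.
if [ "$USE_AUTO" = yes ] || [ "$USE_FINAL" = yes ]; then
    EXTRA=1
else
    EXTRA=0
fi

NJOBS_MAX=$((CLUSTER_JOB_LIMIT - EXTRA))
echo "NJOBS_MAX=$NJOBS_MAX"   # prints NJOBS_MAX=998
```

The resulting value is what you would then set for NJOBS_MAX in config.h.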

System messages from the auto-resubmit feature are logged in farm.log, in the root farm directory.

Additional information

Using the git repository

To use META on a cluster where it is not installed as a module, you can clone the package from our git repository:

$ git clone https://git.computecanada.ca/syam/meta-farm.git

Then modify your $PATH variable to point to the bin subdirectory of the newly created meta-farm directory. Assuming you executed git clone inside your home directory, do this:

$ export PATH=~/meta-farm/bin:$PATH

Then proceed as shown in the META Quick start from the farm_init.run step.
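The export command above only affects the current shell session. If you want the setting to persist, one option is to put something like the following in ~/.bashrc. This is a sketch, assuming the clone lives in ~/meta-farm; the variable name META_BIN is introduced here for illustration. The case test avoids adding the directory a second time when the file is sourced repeatedly.

```shell
# Assumed clone location; adjust if you cloned elsewhere.
META_BIN="$HOME/meta-farm/bin"

# Prepend the directory to PATH only if it is not already there.
case ":$PATH:" in
    *":$META_BIN:"*) ;;                        # already present, do nothing
    *) export PATH="$META_BIN:$PATH" ;;
esac
```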

Passing additional sbatch arguments

If you need to use additional sbatch arguments (like --mem 4G, --gres=gpu:1 etc.), add them to job_script.sh as separate #SBATCH lines.

Or if you prefer, you can add them at the end of the submit.run or resubmit.run command and they will be passed to sbatch, e.g.:

   $  submit.run  -1  --mem 4G

Multi-threaded applications

For multi-threaded applications (such as those that use OpenMP, for example), add the following lines to job_script.sh:

   #SBATCH --cpus-per-task=N
   #SBATCH --mem=M
   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

...where N is the number of CPU cores to use, and M is the total memory to reserve in megabytes. You may also supply --cpus-per-task=N and --mem=M as arguments to (re)submit.run.
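Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task was requested, so the variable may be empty if you run a case by hand outside such an allocation. A possible variant of the export line, offered as a sketch rather than as part of the package, falls back to a single thread in that situation:

```shell
# Use the Slurm allocation if it is defined, otherwise fall back to one thread.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
```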

MPI applications

For applications that use MPI, add the following lines to job_script.sh:

   #SBATCH --ntasks=N  
   #SBATCH --mem-per-cpu=M

...where N is the number of CPU cores to use, and M is the memory to reserve for each core, in megabytes. You may also supply --ntasks=N and --mem-per-cpu=M as arguments to (re)submit.run. See Advanced MPI scheduling for information about more-complicated MPI scenarios.

Also add srun before the path to your code inside single_case.sh, e.g.:

   srun  $COMM

Alternatively, you can prepend srun to each line of table.dat:

   srun /path/to/mpi_code arg1 arg2
   srun /path/to/mpi_code arg1 arg2
   ...
   srun /path/to/mpi_code arg1 arg2
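If you already have a table without the srun prefix, it can be added mechanically. The following is a minimal sketch; the helper name add_srun and the output file name table_srun.dat are hypothetical and not part of META. It writes to a new file so that the original table is left untouched, and it skips blank lines.

```shell
# Prefix every non-empty line of a cases table with "srun ".
# Usage: add_srun INPUT OUTPUT   (hypothetical helper, not part of META)
add_srun() {
    # Lines consisting only of whitespace are left as they are
    sed '/^[[:space:]]*$/!s/^/srun /' "$1" > "$2"
}

# Example:
# add_srun table.dat table_srun.dat
```

Run it only once per table, since a second pass would prepend srun again.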

GPU applications

For applications which use GPUs, modify job_script.sh following the guidance at Using GPUs with Slurm. For example, if your cases each use one GPU, add this line:

#SBATCH --gres=gpu:1

You may also wish to copy the utility ~syam/bin/gpu_test to your ~/bin directory (only on Graham, Cedar, and Beluga), and put the following lines in job_script.sh right before the task.run line:

~/bin/gpu_test
retVal=$?
if [ $retVal -ne 0 ]; then
    echo "No GPU found - exiting..."
    exit 1
fi

This will catch those rare situations when there is a problem with the node which renders the GPU unavailable. If that happens to one of your meta-jobs, and you don't detect the GPU failure somehow, then the job will try (and fail) to run all your cases from table.dat.

Environment variables and --export

All the jobs generated by the META package inherit the environment present when you run submit.run or resubmit.run. This includes all the loaded modules and environment variables. META relies on this behaviour, using some environment variables to pass information between scripts. Be careful not to break this default behaviour, which can happen if you use the --export switch. If you need to use --export in your farm, make sure ALL is one of the arguments to this switch, e.g. --export=ALL,X=1,Y=2.

If you need to pass values of custom environment variables to all of your farm jobs (including auto-resubmitted jobs and the post-processing job if there is one), do not use --export. Instead, set the variables on the command line as in this example:

   $  VAR1=1 VAR2=5 VAR3=3.1416 submit.run ...

Here VAR1, VAR2, VAR3 are custom environment variables which will be passed to all farm jobs.
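This relies on standard shell behaviour: assignments placed in front of a command are exported to that command's environment only. The sketch below illustrates this with bash -c standing in for submit.run, purely for demonstration; the variable name seen is made up for the example.

```shell
# Variables placed before a command are exported to that command only.
# "bash -c" stands in for submit.run here, purely for illustration.
seen=$(VAR1=1 VAR2=5 bash -c 'echo "$VAR1 $VAR2"')
echo "child saw: $seen"          # prints: child saw: 1 5

# The calling shell is not modified by the prefix assignments.
echo "parent sees VAR1 as: ${VAR1:-unset}"
```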

Example: Numbered input files

Suppose you have an application called fcode, and each case needs to read a separate file from standard input, say data.X, where X ranges from 1 to N_cases. The input files are all stored in a directory /home/user/IC. Ensure fcode is on your $PATH (e.g., put fcode in ~/bin, and ensure /home/$USER/bin is added to $PATH in ~/.bashrc), or use a full path to fcode in table.dat. Create table.dat in the farm directory like this:

  fcode < /home/user/IC/data.1
  fcode < /home/user/IC/data.2
  fcode < /home/user/IC/data.3
  ...

You might wish to use a shell loop to create table.dat, e.g.:

   $  for ((i=1; i<=100; i++)); do echo "fcode < /home/user/IC/data.$i"; done >table.dat

Example: Input file must have the same name

Some applications expect to read input from a file with a prescribed and unchangeable name, like INPUT for example. To handle this situation each case must run in its own subdirectory, and you must create an input file with the prescribed name in each subdirectory. Suppose for this example that you have prepared the different input files for each case and stored them in /path/to/data.X, where X ranges from 1 to N_cases. Your table.dat can contain nothing but the application name, over and over again:

  /path/to/code
  /path/to/code
  ...

Add a line to single_case.sh which copies the input file into the farm subdirectory for each case (the first line in the example below):

   cp /path/to/data.$ID INPUT
   $COMM
   STATUS=$?

Using all the columns in the cases table explicitly

The examples shown so far assume that each line in the cases table is an executable statement, starting with either the name of the executable file (when it is on your $PATH) or the full path to the executable file, and then listing the command line arguments particular to that case, or something like < input.$ID if your code expects to read a standard input file.

In the most general case, you may want to be able to access all the columns in the table individually. That can be done by modifying single_case.sh:

...
# ++++++++++++  This part can be customized:  ++++++++++++++++
#  $ID contains the case id from the original table
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
mkdir RUN$ID
cd RUN$ID

# Converting $COMM to an array:
COMM=( $COMM )
# Number of columns in COMM:
Ncol=${#COMM[@]}
# Now one can access the columns individually, as ${COMM[i]} , where i=0...$Ncol-1
# A range of columns can be accessed as ${COMM[@]:i:n} , where i is the first column
# to display, and n is the number of columns to display
# Use the ${COMM[@]:i} syntax to display all the columns starting from the i-th column
# (use for codes with a variable number of command line arguments).

# Call the user code here.
...

# Exit status of the code:
STATUS=$?
cd ..
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...

For example, you might need to provide to your code both a standard input file and a variable number of command line arguments. Your cases table will look like this:

  /path/to/IC.1 0.1
  /path/to/IC.2 0.2 10
  ...

The way to implement this in single_case.sh is as follows:

# Call the user code here.
/path/to/code "${COMM[@]:1}" < "${COMM[0]}"
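To see how the array expansions behave, you can try them in an ordinary bash session. The sketch below uses the second line of the example table as a stand-in for $COMM; the variable names input_file and extra_args are introduced only for this illustration.

```shell
# One line of the example cases table (ID field already removed)
COMM="/path/to/IC.2 0.2 10"

# Split on whitespace into an array, as in single_case.sh
COMM=( $COMM )
Ncol=${#COMM[@]}

input_file=${COMM[0]}          # first column: the standard input file
extra_args="${COMM[*]:1}"      # remaining columns, joined by single spaces

echo "columns: $Ncol"              # prints: columns: 3
echo "stdin from: $input_file"     # prints: stdin from: /path/to/IC.2
echo "arguments: $extra_args"      # prints: arguments: 0.2 10
```

Note that the unquoted $COMM in the array assignment is split on whitespace, so this approach assumes that no single column contains spaces.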

Troubleshooting

Here we explain typical error messages you might get when using this package.

Problems affecting multiple commands

"Non-farm directory, or no farm has been submitted; exiting"

Either the current directory is not a farm directory, or you never ran submit.run for this farm.

Problems with submit.run

"Wrong first argument: XXX (should be a positive integer or -1) ; exiting"

Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode.

"lockfile is not on path; exiting"

Make sure the utility lockfile is on your $PATH. This utility is critical for this package. It provides serialized access of meta-jobs to the table.dat file, that is, it ensures that two different meta-jobs do not read the same line of table.dat at the same time.
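A quick way to check this before submitting is sketched below. The helper name have_tool is hypothetical and not part of META; command -v is the standard shell mechanism for looking a command up on $PATH.

```shell
# Report whether a required utility can be found on $PATH.
# have_tool is a hypothetical helper, not part of META.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool lockfile; then
    echo "lockfile found at: $(command -v lockfile)"
else
    echo "lockfile is not on PATH" >&2
fi
```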

"Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"

Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files.
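To find out which of the files named in the message are absent, something like the following sketch can be used. The helper name missing_farm_files is hypothetical and not part of META; it checks only the four files listed in the error message.

```shell
# Print the names of the required farm files that are missing from a directory.
# missing_farm_files is a hypothetical helper, not part of META.
missing_farm_files() {
    local dir=${1:-.} f
    for f in config.h job_script.sh single_case.sh table.dat; do
        [ -e "$dir/$f" ] || echo "$f"
    done
}

# Example: check the current directory (prints nothing if all files exist)
missing_farm_files .
```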

"-auto option requires resubmit_script.sh file in the root farm directory; exiting"

You used the -auto option, but you forgot to create the resubmit_script.sh file inside the root farm directory. A sample resubmit_script.sh is created automatically when you use farm_init.run.

"File table.dat doesn't exist. Exiting"

You forgot to create the table.dat file in the current directory, or perhaps you are not running submit.run inside one of your farm subdirectories.

"Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"

Make sure you provide a run-time limit for all meta-jobs as an #SBATCH argument inside your job_script.sh file. The run-time limit is the only sbatch argument which cannot be passed as an optional argument to submit.run.

"Wrong job runtime in job_script.sh - nnn . Exiting"

The run-time argument inside your job_script.sh file is not formatted properly.

"Something wrong with sbatch farm submission; jobid=XXX; aborting"

"Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"

With either of the two messages, there was an issue with submitting jobs with sbatch. The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later.

"Couldn't create subdirectories inside the farm directory ; exiting"

"Couldn't create the temp directory XXX ; exiting"

"Couldn't create a file inside XXX ; exiting"

With any of these three messages, something is wrong with a file system: either the permissions are incorrect, or you have exhausted a quota. Fix the issue(s), then try again.

Problems with resubmit.run

"Jobs are still running/queued; cannot resubmit"

You cannot use resubmit.run until all meta-jobs from this farm have finished running. Use list.run or queue.run to check the status of the farm.

"No failed/unfinished jobs; nothing to resubmit"

Your farm was 100% processed. There are no more (failed or never-ran) cases to compute.

Problems with running jobs

"Too many failed (very short) cases - exiting"

This happens if the first $N_failed_max cases are very short, that is, less than $dt_failed seconds in duration. See the discussion of job_script.sh on the parent page, META: A package for job farming. Determine what is causing the cases to fail and fix that, or else adjust the $N_failed_max and $dt_failed values in config.h.

"lockfile is not on path on node XXX"

As the error message suggests, somehow the utility lockfile is not on your $PATH on some node. Use which lockfile to ensure that the utility is somewhere in your $PATH. If it is in your $PATH on a login node, then something went wrong on that particular compute node, for example a file system may have failed to mount.

"Exiting after processing one case (-1 option)"

This is not an error message. It simply tells you that you submitted the farm with submit.run -1 (one case per job mode), so each meta-job is exiting after processing a single case.

"Not enough runtime left; exiting."

This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early.

"No cases left; exiting."

This is not an error message. This is how each meta-job normally finishes, when all cases have been computed.

"Only failed cases left; cannot auto-resubmit; exiting"

This can only happen if you used the -auto switch when submitting the farm. Find the failed cases with Status.run -f, fix the issue(s) causing the cases to fail, then run resubmit.run.

Words of caution

Always start with a small test run to make sure everything works before submitting a large production run. You can test individual cases by reserving an interactive node with salloc, changing to the farm directory, and executing commands like ./single_case.sh table.dat 1, ./single_case.sh table.dat 2, etc.

More than 10,000 cases

If your farm is particularly large (say >10,000 cases), you should spend extra effort to make sure it runs as efficiently as possible. In particular, minimize the number of files and/or directories created during execution. If possible, instruct your code to append to existing files (one per meta-job; do not mix results from different meta-jobs in a single output file!) instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case. (Yes, creating a separate subdirectory for each case is the default setup of this package, but that default was chosen for safety, not efficiency!)

The following example is optimized for a very large number of cases. It assumes, for purposes of the example:

  • that your code accepts the output file name via a command line switch -o,
  • that the application opens the output file in "append" mode, that is, multiple runs will keep appending to the existing file,
  • that each line of table.dat provides the rest of the command line arguments for your code,
  • that multiple instances of your code can safely run concurrently inside the same directory, so there is no need to create a subdirectory for each case,
  • and that each run will not produce any files besides the output file.

With this setup, even very large farms (hundreds of thousands or even millions of cases) should run efficiently, as there will be very few files created.

...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $METAJOB_ID is the jobid for the current meta-job (convenient for creating per-job files)

# Executing the command (a line from table.dat)
/path/to/your/code  $COMM  -o output.$METAJOB_ID

# Exit status of the code:
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
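Once the farm has finished, the per-meta-job files can be combined into a single file if that is convenient for later analysis. This is a minimal sketch, assuming the files are named output.<jobid> as in the example above and all sit in one directory; the helper name merge_outputs and the file name all_results.dat are hypothetical and not part of META. The merged file name is chosen so that it does not itself match the output.* pattern.

```shell
# Concatenate the per-meta-job files output.<jobid> into one file.
# merge_outputs and all_results.dat are hypothetical names, not part of META.
merge_outputs() {
    local dir=${1:-.} merged=${2:-all_results.dat}
    # The shell expands output.* in sorted order before cat runs
    cat "$dir"/output.* > "$dir/$merged"
}

# Example, run in the directory holding the output files after the farm ends:
# merge_outputs . all_results.dat
```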

Parent page: META: A package for job farming