META-Farm: Advanced features and troubleshooting



==Using all the columns in the cases table explicitly== <!--T:46-->
The examples shown so far assume that each line in the cases table is an executable statement: it starts with either the name of the executable file (when it is on your <code>$PATH</code>) or the full path to the executable file, followed by the command-line arguments particular to that case, or by something like <code> < input.$ID</code> if your code expects to read a standard input file.
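For instance, a cases table following this convention could contain lines like these (the executable path, options, and input file names are purely illustrative):

<pre>
/home/user/bin/my_code -L 10 -T 300
/home/user/bin/my_code -L 10 -T 350
/home/user/bin/my_code < input.3
</pre>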




==Reducing waste== <!--T:78-->


<!--T:79-->
Here is one potential problem when running multiple cases per job: what if the number of running meta-jobs times the requested run-time per meta-job (say, 3 days) is not enough to process all your cases? For example, suppose you managed to start the maximum allowed 1000 meta-jobs, each with a 3-day run-time limit. That means your farm can only process all the cases in a single run if ''average_case_run_time x N_cases < 1000 x 3d = 3000'' CPU days. Once your meta-jobs start hitting the 3-day run-time limit, they will start dying in the middle of processing one of your cases, resulting in up to 1000 interrupted case calculations. This is not a big deal in terms of completing the work, since <code>resubmit.run</code> will find all the cases which failed or never ran and will restart them automatically. But it can waste CPU cycles: on average, you will waste ''0.5 x N_jobs x average_case_run_time''. For example, if your cases have an average run-time of 1 hour and you have 1000 meta-jobs running, you will waste about 500 CPU-hours, or about 20 CPU-days, which is not acceptable.


<!--T:80-->
Fortunately, the provided scripts have some built-in intelligence to mitigate this problem. It is implemented in <code>task.run</code> as follows:


<!--T:81-->
* The script measures the run-time of each case, and adds the value as one line in a scratch file <code>times</code> created in directory <code>/home/$USER/tmp/$NODE.$PID/</code>. (See [[#Output files|Output files]].) This is done by all running meta-jobs.
* Once the first 8 cases have been computed, one of the meta-jobs reads the contents of the file <code>times</code> and computes the largest 12.5% quantile of the current distribution of case run-times. This serves as a conservative estimate of the run-time of your individual cases, ''dt_cutoff''. The current estimate is stored in the file <code>dt_cutoff</code> in <code>/home/$USER/tmp/$NODE.$PID/</code>.
* The above algorithm reduces the amount of CPU cycles wasted due to jobs hitting the run-time limit by a factor of 8, on average.
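To illustrate how such an early-exit check works in principle (this is a simplified bash sketch, not the actual <code>task.run</code> code, and the variable names are illustrative), a meta-job can compare the time left in its allocation against ''dt_cutoff'' before picking up the next case:

<pre>
#!/bin/bash
# Simplified sketch only -- not the real task.run logic.
# t_limit   : run-time limit of the meta-job, in seconds
# t_start   : time (epoch seconds) when the meta-job started
# dt_cutoff : conservative per-case run-time estimate, in seconds

t_now=$(date +%s)
t_left=$(( t_limit - (t_now - t_start) ))

if (( t_left < dt_cutoff )); then
    echo "Not enough time left to process another case; exiting early."
    exit 0
fi
</pre>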


<!--T:82-->
As a useful side effect, every time you run a farm you get individual run-times for all of your cases stored in <code>/home/$USER/tmp/$NODE.$PID/times</code>.
You can analyze that file to fine-tune your farm setup, to profile your code, etc.
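For example, assuming the file contains one run-time value (in seconds) per line, a quick summary can be produced with standard shell tools (adjust the parsing if the actual format differs):

<pre>
# Quick summary of case run-times (assumes one value in seconds per line)
sort -n times | awk '{ t[NR]=$1; sum+=$1 }
    END { printf "cases: %d   mean: %.1f s   median: %.1f s   max: %.1f s\n",
                 NR, sum/NR, t[int((NR+1)/2)], t[NR] }'
</pre>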
Either the current directory is not a farm directory, or you never ran <code>submit.run</code> for this farm.


==Problems with submit.run== <!--T:55-->
===Wrong first argument: XXX (should be a positive integer or -1) ; exiting===
Use the correct first argument: -1 for the SIMPLE mode, or a positive integer N (number of requested meta-jobs) for the META mode.
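For example (illustrative invocations; any additional options your farm needs would follow the first argument):

<pre>
submit.run -1     # SIMPLE mode
submit.run 32     # META mode, requesting 32 meta-jobs
</pre>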


==="lockfile is not on path; exiting"=== <!--T:56-->
==="lockfile is not on path; exiting"=== <!--T:56-->
Make sure the utility <code>lockfile</code> is on your <code>$PATH</code>.
This utility is critical for this package.
It ensures that two different meta-jobs do not read the same line of <code>table.dat</code> at the same time.
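A quick check is shown below (on many Linux systems <code>lockfile</code> is provided by the <code>procmail</code> package; how it is made available on your cluster may differ):

<pre>
# Verify that lockfile can be found
which lockfile || echo "lockfile not found in PATH"
</pre>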


==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== <!--T:57-->
==="Non-farm directory (config.h, job_script.sh, single_case.sh, and/or table.dat are missing); exiting"=== <!--T:57-->
Either the current directory is not a farm directory, or some important files are missing. Change to the correct (farm) directory, or create the missing files.


You used the <code>-auto</code> option, but you forgot to create the <code>resubmit_script.sh</code> file inside the root farm directory. A sample <code>resubmit_script.sh</code> is created automatically when you use <code>farm_init.run</code>.
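As a rough illustration only (the sample generated by <code>farm_init.run</code> is the authoritative template, and the exact <code>resubmit.run</code> arguments it uses may differ), such a script simply re-launches the unfinished part of the farm:

<pre>
#!/bin/bash
# Hypothetical sketch -- start from the sample created by farm_init.run instead.
# Re-submit the remaining (failed or never-run) cases, keeping auto-resubmission enabled.
resubmit.run 32 -auto
</pre>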


==="File table.dat doesn't exist. Exiting"=== <!--T:59-->
==="File table.dat doesn't exist. Exiting"=== <!--T:59-->
You forgot to create the <code>table.dat</code> file in the current directory, or perhaps you are not running <code>submit.run</code> inside one of your farm sub-directories.


==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== <!--T:60-->
==="Job runtime sbatch argument (-t or --time) is missing in job_script.sh. Exiting"=== <!--T:60-->
Make sure you provide a run-time limit for all meta-jobs as an <code>#SBATCH</code> argument inside your <code>job_script.sh</code> file.
The run-time limit is the only <code>sbatch</code> argument which cannot be passed to <code>submit.run</code> as an optional argument.
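For example, a line like the following inside <code>job_script.sh</code> satisfies this requirement (the 3-hour value is just an illustration):

<pre>
#SBATCH -t 0-03:00     # run-time limit for each meta-job (days-hours:minutes)
</pre>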


==="Wrong job runtime in job_script.sh - nnn . Exiting"=== <!--T:61-->
==="Wrong job runtime in job_script.sh - nnn . Exiting"=== <!--T:61-->
The run-time argument inside your <code>job_script.sh</code> file is not formatted properly.
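Slurm accepts several equivalent ways of writing the time limit, for example:

<pre>
#SBATCH --time=180          # minutes
#SBATCH --time=3:00:00      # hours:minutes:seconds
#SBATCH --time=0-03:00      # days-hours:minutes
</pre>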


==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== <!--T:62-->
==="Something wrong with sbatch farm submission; jobid=XXX; aborting"=== <!--T:62-->
==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"===
==="Something wrong with a auto-resubmit job submission; jobid=XXX; aborting"===
Either of these two messages indicates a problem submitting jobs with <code>sbatch</code>.
The cluster's scheduler might be misbehaving, or simply too busy. Try again a bit later.


==="Couldn't create subdirectories inside the farm directory ; exiting"=== <!--T:63-->
==="Couldn't create subdirectories inside the farm directory ; exiting"=== <!--T:63-->
==="Couldn't create the temp directory XXX ; exiting"===
==="Couldn't create the temp directory XXX ; exiting"===
==="Couldn't create a file inside XXX ; exiting"===
==="Couldn't create a file inside XXX ; exiting"===
Either permissions got messed up, or you have exhausted a quota. Fix the issue(s), then try again.
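A quick sanity check of the directory the farm scripts use for their metadata (mentioned earlier on this page) might look like this:

<pre>
# Check that you can still create files where the farm scripts keep their metadata
ls -ld /home/$USER/tmp
touch /home/$USER/tmp/.write_test && rm /home/$USER/tmp/.write_test && echo "write OK"
</pre>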


==Problems with resubmit.run== <!--T:64-->
==="Jobs are still running/queued; cannot resubmit"===
==="Jobs are still running/queued; cannot resubmit"===
You cannot use <code>resubmit.run</code> until all meta-jobs from this farm have finished running.
Use <code>list.run</code> or <code>queue.run</code> to check the status of the farm.


==="No failed/unfinished jobs; nothing to resubmit"=== <!--T:65-->
==="No failed/unfinished jobs; nothing to resubmit"=== <!--T:65-->
Your farm was 100% processed. There are no more (failed or never-run) cases to compute.


==Problems with running jobs== <!--T:66-->
==="Too many failed (very short) cases - exiting"===
==="Too many failed (very short) cases - exiting"===
This happens if the first <code>$N_failed_max</code> cases are very short, i.e. less than <code>$dt_failed</code> seconds in duration.
or else adjust the <code>$N_failed_max</code> and <code>$dt_failed</code> values in <code>config.h</code>.
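If you do need to change these thresholds, the corresponding lines in <code>config.h</code> might look like this (the values shown are purely illustrative; keep whatever defaults your copy of <code>config.h</code> already contains unless you have a reason to change them):

<pre>
# Illustrative values only -- adjust to your workload
N_failed_max=5   # the meta-jobs exit if the first N_failed_max cases are all "very short"
dt_failed=5      # a case shorter than dt_failed seconds is considered "very short" (failed)
</pre>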


==="lockfile is not on path on node XXX"=== <!--T:67-->
==="lockfile is not on path on node XXX"=== <!--T:67-->
As the error message suggests, somehow the utility <code>lockfile</code> is not on your <code>$PATH</code> on some node.
Use <code>which lockfile</code> to ensure that the utility is somewhere in your <code>$PATH</code>.
This message tells you that the meta-job would likely not have enough time left to process the next case (based on the analysis of run-times for all the cases processed so far), so it is exiting early.


==="No cases left; exiting."=== <!--T:70-->
==="No cases left; exiting."=== <!--T:70-->
This is not an error message. This is how each meta-job normally finishes when all cases have been computed.

