META-Farm: Advanced features and troubleshooting

</source>
<translate>
==Reducing waste==
Here is one potential problem when one is running multiple cases per job: what if the number of running meta-jobs times the requested run-time per meta-job (say, 3 days) is not enough to process all your cases? E.g., suppose you managed to start the maximum allowed 1000 meta-jobs, each with a 3-day run-time limit. Your farm can then only process all the cases in a single run if ''average_case_run_time x N_cases < 1000 x 3d = 3000'' CPU-days. Once your meta-jobs start hitting the 3-day run-time limit, they will start dying in the middle of processing one of your cases, which can result in up to 1000 interrupted case calculations. This is not a big deal in terms of completing the work: <code>resubmit.run</code> will find all the cases which failed or never ran, and will restart them automatically. But it does waste CPU cycles: on average, ''0.5 x N_jobs x average_case_run_time''. E.g., if your cases have an average run-time of 1 hour and you have 1000 meta-jobs running, you will waste about 500 CPU-hours, or about 20 CPU-days, which is not acceptable.
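If you want to check these numbers for your own farm before submitting it, the arithmetic is simple enough to do in a few lines of shell. The values below are just the example figures from the paragraph above; the number of cases is a made-up placeholder.

<source lang="bash">
N_JOBS=1000      # concurrently running meta-jobs
T_JOB=72         # run-time limit per meta-job, in hours (3 days)
T_CASE=1         # average run-time per case, in hours
N_CASES=50000    # total number of cases (placeholder value)

echo "Capacity of one pass : $((N_JOBS * T_JOB)) CPU-hours"
echo "Work to be done      : $((N_CASES * T_CASE)) CPU-hours"
echo "Expected waste       : $((N_JOBS * T_CASE / 2)) CPU-hours"
</source>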
Fortunately, the scripts we provide have some built-in intelligence to mitigate this problem. This is implemented in <code>task.run</code> as follows (a simplified sketch of the logic appears after this list):
* The script measures the run-time of each case, and adds the value as one line in a scratch file <code>times</code> created in directory <code>/home/$USER/tmp/$NODE.$PID/</code>. (See [[#Output files|Output files]].) This is done by all running meta-jobs.
* Once the first eight cases have been computed, one of the meta-jobs will read the contents of the file <code>times</code> and compute the boundary of the largest 12.5% of the measured case run-times. This will serve as a conservative estimate of the run-time of your individual cases, ''dt_cutoff''. The current estimate is stored in the file <code>dt_cutoff</code> in <code>/home/$USER/tmp/$NODE.$PID/</code>.
* From then on, each meta-job will estimate whether it has enough time to finish the case it is about to start, by checking that ''t_finish - t_now > dt_cutoff''. Here ''t_finish'' is the time when the job will be killed because of its run-time limit, and ''t_now'' is the current time. If the meta-job concludes that it does not have enough time, it will exit early, which minimizes the chance of a case being aborted half-way through because of the job's run-time limit.
* Each time the number of computed cases reaches the next power of two (8, then 16, then 32, and so on), ''dt_cutoff'' is recomputed with the above algorithm, making the estimate progressively more accurate. Powers of two are used to minimize the overhead of computing ''dt_cutoff''; the algorithm is equally efficient for very small (tens) and very large (many thousands) numbers of cases.
* The above algorithm reduces the amount of CPU cycles wasted due to jobs hitting the run-time limit by a factor of 8, on average.
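To make the description above more concrete, here is a simplified, stand-alone sketch of this bookkeeping. It is '''not''' the actual code from <code>task.run</code>; the variable names, the way the job end time is obtained from Slurm, and the exact percentile computation are all illustrative assumptions.

<source lang="bash">
# Illustrative sketch only -- not the real task.run implementation.
TIMES=/home/$USER/tmp/$NODE.$PID/times          # one case run-time (seconds) per line
CUTOFF_FILE=/home/$USER/tmp/$NODE.$PID/dt_cutoff

N_DONE=$(wc -l < "$TIMES")

# Recompute dt_cutoff whenever the number of computed cases hits a power of two (>= 8):
if [ "$N_DONE" -ge 8 ] && [ $((N_DONE & (N_DONE - 1))) -eq 0 ]; then
    # Take the boundary of the largest 12.5% of the measured run-times as dt_cutoff:
    sort -n "$TIMES" | awk -v n="$N_DONE" 'NR == int(0.875 * n) + 1 {print; exit}' > "$CUTOFF_FILE"
fi

# Before starting the next case, check that enough time is left in this meta-job:
DT_CUTOFF=$(cat "$CUTOFF_FILE")
T_NOW=$(date +%s)
# One possible way to obtain the job's end time (epoch seconds) from Slurm:
T_FINISH=$(date -d "$(squeue -h -j "$SLURM_JOB_ID" -o %e)" +%s)
if [ $((T_FINISH - T_NOW)) -le "$DT_CUTOFF" ]; then
    exit 0   # exit early instead of risking an interrupted case
fi
</source>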
As a useful side effect, every time you run a farm you get the individual run-times of all of your cases stored in <code>/home/$USER/tmp/$NODE.$PID/times</code>.
You can analyze that file to fine-tune your farm setup, profile your code, and so on.
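For example, assuming each line of the <code>times</code> file contains a single run-time in seconds (check the actual format on your system before relying on it), a quick summary can be obtained with <code>awk</code>:

<source lang="bash">
# Minimum, maximum and mean case run-time, assuming one value (seconds) per line:
awk '{ s += $1; if (NR == 1 || $1 < min) min = $1; if ($1 > max) max = $1 }
     END { printf "N=%d  min=%ds  max=%ds  mean=%.1fs\n", NR, min, max, s / NR }' \
    /home/$USER/tmp/*/times
</source>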


=Troubleshooting= <!--T:51-->
This can only happen if you used the <code>-auto</code> switch when submitting the farm.
Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>.
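A typical recovery session might look like the lines below. The exact arguments to <code>resubmit.run</code> should mirror whatever you originally passed to <code>submit.run</code>; the value shown here is only a placeholder.

<source lang="bash">
Status.run -f        # list only the cases that failed
# ... fix whatever caused those cases to fail ...
resubmit.run N       # N: the same argument(s) you used with submit.run (placeholder)
</source>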
=Words of caution= <!--T:72-->
<!--T:73-->
Always start with a small test run to make sure everything works before submitting a large production run. You can test individual cases by reserving an interactive node with <code>salloc</code>, changing to the farm directory, and executing commands like <code>./single_case.sh table.dat 1</code>, <code>./single_case.sh table.dat 2</code>, ''etc.''
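For example, a short interactive test of the first few cases could look like this (the <code>salloc</code> options and the farm directory are placeholders; adjust them to your cluster, account, and setup):

<source lang="bash">
# Reserve an interactive core for an hour (placeholder account name):
salloc --time=1:00:00 --mem-per-cpu=4000M --account=def-someuser

# In the interactive shell, run a few cases by hand from the farm directory:
cd ~/my_farm             # placeholder path to your farm directory
./single_case.sh table.dat 1
./single_case.sh table.dat 2
</source>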
== More than 10,000 cases == <!--T:74-->
<!--T:75-->
If your farm is particularly large (say >10,000 cases), you should spend extra effort to make sure it runs as efficiently as possible. In particular, minimize the number of files and/or directories created during execution. If possible, instruct your code to append to existing files (one per meta-job; '''do not mix results from different meta-jobs in a single output file!''') instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case.  (Yes, creating a separate subdirectory for each case is the default setup of this package, but that default was chosen for safety, not efficiency!)
<!--T:76-->
The following example is optimized for a very large number of cases.  It assumes, for purposes of the example:
* that your code accepts the output file name via a command line switch <code>-o</code>,
* that the application opens the output file in "append" mode, that is, multiple runs will keep appending to the existing file,
* that each line of <code>table.dat</code> provides the rest of the command line arguments for your code,
* that multiple instances of your code can safely run concurrently inside the same directory, so there is no need to create a subdirectory for each case,
* and that each run will not produce any files besides the output file.
With this setup, even very large farms (hundreds of thousands or even millions of cases) should run efficiently, as there will be very few files created.


</translate>
<source lang="bash">
...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $METAJOB_ID is the jobid for the current meta-job (convenient for creating per-job files)
# Executing the command (a line from table.dat)
/path/to/your/code  $COMM  -o output.$METAJOB_ID


# Exit status of the code:
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
</source>
<translate>
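With this setup, each meta-job keeps appending to its own <code>output.$METAJOB_ID</code> file, so after the farm finishes you will have one output file per meta-job rather than one per case. If you need a single combined file, a plain concatenation is usually enough (the destination file name here is just a placeholder):

<source lang="bash">
# Run this in the directory holding the output.* files, after all meta-jobs have finished:
cat output.* > all_results.dat
</source>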

