META-Farm: Advanced features and troubleshooting

</source>
<translate>
==Reducing waste==
Here is one potential problem when one is running multiple cases per job: what if the number of running meta-jobs times the requested run-time per meta-job (say, 3 days) is not enough to process all your cases? E.g., suppose you managed to start the maximum allowed 1000 meta-jobs, each with a 3-day run-time limit. Your farm can then only process all the cases in a single run if ''average_case_run_time x N_cases < 1000 x 3d = 3000'' CPU-days. Once your meta-jobs start hitting the 3-day run-time limit, they will start dying in the middle of processing one of your cases, which can result in up to 1000 interrupted case calculations. This is not a big deal in terms of completing the work: <code>resubmit.run</code> will find all the cases which failed or never ran, and will restart them automatically. But it does waste CPU cycles: on average, ''0.5 x N_jobs x average_case_run_time''. E.g., if your cases have an average run-time of 1 hour and you have 1000 meta-jobs running, you will waste about 500 CPU-hours, or about 20 CPU-days, which is not acceptable.
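If you want to check these numbers for your own farm before submitting it, the arithmetic is simple enough to do in a few lines of shell. The values below are just the example figures from the paragraph above; the number of cases is a made-up placeholder.

<source lang="bash">
N_JOBS=1000      # concurrently running meta-jobs
T_JOB=72         # run-time limit per meta-job, in hours (3 days)
T_CASE=1         # average run-time per case, in hours
N_CASES=50000    # total number of cases (placeholder value)

echo "Capacity of one pass : $((N_JOBS * T_JOB)) CPU-hours"
echo "Work to be done      : $((N_CASES * T_CASE)) CPU-hours"
echo "Expected waste       : $((N_JOBS * T_CASE / 2)) CPU-hours"
</source>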
Fortunately, the scripts we provide have some built-in intelligence to mitigate this problem. This is implemented in <code>task.run</code> as follows (a simplified sketch of the logic appears after this list):
* The script measures the run-time of each case, and adds the value as one line in a scratch file <code>times</code> created in directory <code>/home/$USER/tmp/$NODE.$PID/</code>. (See [[#Output files|Output files]].) This is done by all running meta-jobs.
* Once the first eight cases have been computed, one of the meta-jobs will read the contents of the file <code>times</code> and compute the boundary of the largest 12.5% of the measured case run-times. This will serve as a conservative estimate of the run-time of your individual cases, ''dt_cutoff''. The current estimate is stored in the file <code>dt_cutoff</code> in <code>/home/$USER/tmp/$NODE.$PID/</code>.
* From then on, each meta-job will estimate whether it has enough time to finish the case it is about to start, by checking that ''t_finish - t_now > dt_cutoff''. Here ''t_finish'' is the time when the job will be killed because of its run-time limit, and ''t_now'' is the current time. If the meta-job concludes that it does not have enough time, it will exit early, which minimizes the chance of a case being aborted half-way through because of the job's run-time limit.
* Each time the number of computed cases reaches the next power of two (8, then 16, then 32, and so on), ''dt_cutoff'' is recomputed with the above algorithm, making the estimate progressively more accurate. Powers of two are used to minimize the overhead of computing ''dt_cutoff''; the algorithm is equally efficient for very small (tens) and very large (many thousands) numbers of cases.
* The above algorithm reduces the amount of CPU cycles wasted due to jobs hitting the run-time limit by a factor of 8, on average.
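To make the description above more concrete, here is a simplified, stand-alone sketch of this bookkeeping. It is '''not''' the actual code from <code>task.run</code>; the variable names, the way the job end time is obtained from Slurm, and the exact percentile computation are all illustrative assumptions.

<source lang="bash">
# Illustrative sketch only -- not the real task.run implementation.
TIMES=/home/$USER/tmp/$NODE.$PID/times          # one case run-time (seconds) per line
CUTOFF_FILE=/home/$USER/tmp/$NODE.$PID/dt_cutoff

N_DONE=$(wc -l < "$TIMES")

# Recompute dt_cutoff whenever the number of computed cases hits a power of two (>= 8):
if [ "$N_DONE" -ge 8 ] && [ $((N_DONE & (N_DONE - 1))) -eq 0 ]; then
    # Take the boundary of the largest 12.5% of the measured run-times as dt_cutoff:
    sort -n "$TIMES" | awk -v n="$N_DONE" 'NR == int(0.875 * n) + 1 {print; exit}' > "$CUTOFF_FILE"
fi

# Before starting the next case, check that enough time is left in this meta-job:
DT_CUTOFF=$(cat "$CUTOFF_FILE")
T_NOW=$(date +%s)
# One possible way to obtain the job's end time (epoch seconds) from Slurm:
T_FINISH=$(date -d "$(squeue -h -j "$SLURM_JOB_ID" -o %e)" +%s)
if [ $((T_FINISH - T_NOW)) -le "$DT_CUTOFF" ]; then
    exit 0   # exit early instead of risking an interrupted case
fi
</source>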
As a useful side effect, every time you run a farm you get the individual run-times of all of your cases stored in <code>/home/$USER/tmp/$NODE.$PID/times</code>.
You can analyze that file to fine-tune your farm setup, profile your code, and so on.
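For example, assuming each line of the <code>times</code> file contains a single run-time in seconds (check the actual format on your system before relying on it), a quick summary can be obtained with <code>awk</code>:

<source lang="bash">
# Minimum, maximum and mean case run-time, assuming one value (seconds) per line:
awk '{ s += $1; if (NR == 1 || $1 < min) min = $1; if ($1 > max) max = $1 }
     END { printf "N=%d  min=%ds  max=%ds  mean=%.1fs\n", NR, min, max, s / NR }' \
    /home/$USER/tmp/*/times
</source>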


=Troubleshooting= <!--T:51-->
This can only happen if you used the <code>-auto</code> switch when submitting the farm.
Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>.
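A typical recovery session might look like the lines below. The exact arguments to <code>resubmit.run</code> should mirror whatever you originally passed to <code>submit.run</code>; the value shown here is only a placeholder.

<source lang="bash">
Status.run -f        # list only the cases that failed
# ... fix whatever caused those cases to fail ...
resubmit.run N       # N: the same argument(s) you used with submit.run (placeholder)
</source>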
=Words of caution= <!--T:72-->
<!--T:73-->
Always start with a small test run to make sure everything works before submitting a large production run. You can test individual cases by reserving an interactive node with <code>salloc</code>, changing to the farm directory, and executing commands like <code>./single_case.sh table.dat 1</code>, <code>./single_case.sh table.dat 2</code>, ''etc.''
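For example, a short interactive test of the first few cases could look like this (the <code>salloc</code> options and the farm directory are placeholders; adjust them to your cluster, account, and setup):

<source lang="bash">
# Reserve an interactive core for an hour (placeholder account name):
salloc --time=1:00:00 --mem-per-cpu=4000M --account=def-someuser

# In the interactive shell, run a few cases by hand from the farm directory:
cd ~/my_farm             # placeholder path to your farm directory
./single_case.sh table.dat 1
./single_case.sh table.dat 2
</source>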
== More than 10,000 cases == <!--T:74-->
<!--T:75-->
If your farm is particularly large (say >10,000 cases), you should spend extra effort to make sure it runs as efficiently as possible. In particular, minimize the number of files and/or directories created during execution. If possible, instruct your code to append to existing files (one per meta-job; '''do not mix results from different meta-jobs in a single output file!''') instead of creating a separate file for each case. Avoid creating a separate subdirectory for each case.  (Yes, creating a separate subdirectory for each case is the default setup of this package, but that default was chosen for safety, not efficiency!)
<!--T:76-->
The following example is optimized for a very large number of cases.  It assumes, for purposes of the example:
* that your code accepts the output file name via a command line switch <code>-o</code>,
* that the application opens the output file in "append" mode, that is, multiple runs will keep appending to the existing file,
* that each line of <code>table.dat</code> provides the rest of the command line arguments for your code,
* that multiple instances of your code can safely run concurrently inside the same directory, so there is no need to create a subdirectory for each case,
* and that each run will not produce any files besides the output file.
With this setup, even very large farms (hundreds of thousands or even millions of cases) should run efficiently, as there will be very few files created.


</translate>
<source lang="bash">
...
# ++++++++++++++++++++++  This part can be customized:  ++++++++++++++++++++++++
#  Here:
#  $ID contains the case id from the original table (can be used to provide a unique seed to the code etc)
#  $COMM is the line corresponding to the case $ID in the original table, without the ID field
#  $METAJOB_ID is the jobid for the current meta-job (convenient for creating per-job files)
# Executing the command (a line from table.dat)
/path/to/your/code  $COMM  -o output.$METAJOB_ID


# Exit status of the code:
STATUS=$?
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...
</source>
<translate>
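With this setup, each meta-job keeps appending to its own <code>output.$METAJOB_ID</code> file, so after the farm finishes you will have one output file per meta-job rather than one per case. If you need a single combined file, a plain concatenation is usually enough (the destination file name here is just a placeholder):

<source lang="bash">
# Run this in the directory holding the output.* files, after all meta-jobs have finished:
cat output.* > all_results.dat
</source>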

