</source>
<translate>
==Reducing waste==
Here is one potential problem when you are running multiple cases per job: what if the number of running meta-jobs times the requested run-time per meta-job (say, 3 days) is not enough to process all your cases? Suppose you managed to start the maximum allowed 1000 meta-jobs, each with a 3-day run-time limit. Then your farm can only process all the cases in a single run if ''average_case_run_time x N_cases < 1000 x 3d = 3000'' CPU-days. Once your meta-jobs start hitting the 3-day run-time limit, they will die in the middle of processing one of your cases, leaving up to 1000 interrupted case calculations. This is not a big deal in terms of completing the work: <code>resubmit.run</code> will find all the cases which failed or never ran, and will restart them automatically. But it can waste a lot of CPU cycles: on average, you will lose ''0.5 x N_jobs x average_case_run_time''. For example, if your cases have an average run-time of 1 hour and you have 1000 meta-jobs running, you will waste about 500 CPU-hours (about 20 CPU-days), which is not acceptable.
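You can do this arithmetic ahead of time to see whether a planned farm fits in a single run. The following is a minimal sketch; the values assigned below are hypothetical examples, not recommendations, and should be replaced with your own farm's parameters:

<source lang="bash">
#!/bin/bash
N_JOBS=1000      # concurrent meta-jobs (hypothetical example value)
LIMIT_H=72       # run-time limit per meta-job, in hours (3 days)
N_CASES=5000     # total number of cases in the farm
AVG_CASE_H=1     # average run-time per case, in hours

# Single-run feasibility: average_case_run_time x N_cases < N_jobs x limit
echo "Needed:   $(echo "$N_CASES * $AVG_CASE_H" | bc) CPU-hours"
echo "Capacity: $(echo "$N_JOBS * $LIMIT_H" | bc) CPU-hours"

# Expected waste from jobs dying mid-case: 0.5 x N_jobs x average_case_run_time
echo "Expected waste: $(echo "0.5 * $N_JOBS * $AVG_CASE_H" | bc) CPU-hours"
</source>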
Fortunately, the scripts we provide have some built-in intelligence to mitigate this problem. It is implemented in <code>task.run</code> as follows (a simplified sketch of the logic is shown after this list):
* The script measures the run-time of each case and appends the value as one line to a scratch file <code>times</code>, created in the directory <code>/home/$USER/tmp/$NODE.$PID/</code>. (See [[#Output files|Output files]].) This is done by all running meta-jobs.
* Once the first 8 cases have been computed, one of the meta-jobs reads the contents of the file <code>times</code> and computes the cutoff below which 87.5% of the measured case run-times fall; in other words, only the slowest 12.5% of cases exceed it. This serves as a conservative estimate of the run-time of your individual cases, ''dt_cutoff''. The current estimate is stored in the file <code>dt_cutoff</code> in <code>/home/$USER/tmp/$NODE.$PID/</code>.
* From then on, each meta-job estimates whether it has enough time left to finish the case it is about to start, by checking that ''t_finish - t_now > dt_cutoff''. Here ''t_finish'' is the time when the job will be killed because of its run-time limit, and ''t_now'' is the current time. If a meta-job concludes that it does not have enough time, it exits early, which minimizes the chance of a case being aborted half-way through due to the job's run-time limit.
* At every subsequent power-of-two number of computed cases (8, then 16, then 32, and so on), ''dt_cutoff'' is recomputed using the above algorithm, making the estimate progressively more accurate. Powers of two are used to minimize the overhead of recomputing ''dt_cutoff''; the algorithm is equally efficient for both very small (tens) and very large (many thousands) numbers of cases.
* On average, the above algorithm reduces the CPU cycles wasted due to jobs hitting the run-time limit by a factor of 8.
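Here is a minimal shell sketch of that logic. It is ''not'' the actual <code>task.run</code> code: the functions <code>next_case</code> and <code>run_case</code> and the variables <code>TMPDIR</code> and <code>T_FINISH</code> are hypothetical placeholders; only the file layout (<code>times</code> and <code>dt_cutoff</code>) follows the description above.

<source lang="bash">
#!/bin/bash
# Assumptions (hypothetical): TMPDIR=/home/$USER/tmp/$NODE.$PID ;
# T_FINISH = Unix time at which the job hits its run-time limit.

record_runtime () {   # append one case's run-time (in seconds) to "times"
    echo "$1" >> "$TMPDIR/times"
}

update_cutoff () {    # recompute dt_cutoff at 8, 16, 32, ... computed cases
    local n idx
    n=$(wc -l < "$TMPDIR/times")
    if (( n >= 8 && (n & (n - 1)) == 0 )); then  # n is a power of two
        idx=$(( n * 7 / 8 ))  # 87.5th percentile; exact since 8 divides n
        sort -n "$TMPDIR/times" | sed -n "${idx}p" > "$TMPDIR/dt_cutoff"
    fi
}

have_time_for_next_case () {  # true if t_finish - t_now > dt_cutoff
    local cutoff now
    cutoff=$(cat "$TMPDIR/dt_cutoff" 2>/dev/null || echo 0)
    now=$(date +%s)
    (( T_FINISH - now > cutoff ))
}

while next_case; do           # next_case: hypothetical case dispatcher
    have_time_for_next_case || exit 0  # exit early, don't die mid-case
    start=$(date +%s)
    run_case                  # run_case: hypothetical; runs one case
    record_runtime $(( $(date +%s) - start ))
    update_cutoff
done
</source>

Note that before the first 8 cases are done there is no <code>dt_cutoff</code> file, so the check passes unconditionally, matching the behaviour described above.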
As a useful side effect, every time you run a farm you get the individual run-times of all of your cases stored in <code>/home/$USER/tmp/$NODE.$PID/times</code>.
You can analyze that file to fine-tune your farm setup, to profile your code, and so on.
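For example, the following one-liner (an illustration, not part of the package; substitute the actual directory for <code>$NODE.$PID</code>) prints simple statistics of your case run-times:

<source lang="bash">
sort -n /home/$USER/tmp/$NODE.$PID/times | awk '
    { a[NR] = $1; s += $1 }
    END { printf "min %s  median %s  mean %.2f  max %s  (N cases: %d)\n",
                 a[1], a[int((NR + 1) / 2)], s / NR, a[NR], NR }'
</source>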
=Troubleshooting= <!--T:51-->
This can only happen if you used the <code>-auto</code> switch when submitting the farm.
Find the failed cases with <code>Status.run -f</code>, fix the issue(s) causing the cases to fail, then run <code>resubmit.run</code>.
</translate>
<translate>