Points de contrôle/en: Difference between revisions

no edit summary
(Updating to match new version of source page)
No edit summary
 
(13 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Draft}}
 


<languages />
<languages />
A program's execution time is sometimes longer than the maximum job duration permitted by the job schedulers used on Compute Canada clusters. Long-running jobs are also exposed to system instability caused by power outages, hardware defects and so forth. A program with a short execution time can simply be restarted, but for long-running software it is preferable to use checkpoints to minimize the risk of losing several days' worth of computation. These checkpoints take the form of binary disk files from which the program can be restarted at the point in the computation where the checkpoint file was created.
A program's execution time is sometimes longer than the maximum job duration permitted by the job schedulers used on the clusters. Long-running jobs are also exposed to system instability caused by power outages, hardware defects and so forth. A program with a short execution time can simply be restarted, but for long-running software it is preferable to use checkpoints to minimize the risk of losing several days' worth of computation. These checkpoints take the form of binary disk files from which the program can be restarted at the point in the computation where the checkpoint file was created.


== Creating and Loading a Checkpoint  ==
== Creating and Loading a Checkpoint  ==
Line 15: Line 15:
** Once the atomic write has been completed, one can choose whether or not to delete any older checkpoints.
** Once the atomic write has been completed, one can choose whether or not to delete any older checkpoints.
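
The atomic write and reload described above could be sketched as follows. This is only a minimal illustration, not code taken from any particular application: the helper names <tt>save_checkpoint</tt>, <tt>load_checkpoint</tt> and <tt>run</tt>, the use of <tt>pickle</tt> for the binary file, and the toy computation are all assumptions made for the example. The state is first written to a temporary file in the same directory and then renamed over the final name, so an interrupted write should never leave a partial checkpoint under that name.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Atomically write `state` to `path` (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory as the target so the
    # final rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Rename over the final name; readers see the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every):
    """Toy computation (sum of integers) that resumes from `path`."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final state, written the same way
    return state
```

Calling <tt>run("state.ckpt", 1000, 100)</tt> a second time after an interruption would resume from the last saved step instead of starting from zero. Because this sketch overwrites a single file, there are no older checkpoints to delete; a program that keeps several numbered checkpoints would remove the old ones only after the rename has succeeded.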


So as not to reinvent the wheel, especially when modifying the source code is not an option, we suggest using [http://dmtcp.sourceforge.net/ DMTCP].
<!-- So as not to re-invent the wheel, particularly in situations where modifying the source code isn't an option, an alternative solution is the use of the software [http://dmtcp.sourceforge.net/ DMTCP]. -->
<div class="mw-translate-fuzzy">
So as not to re-invent the wheel, particularly in situations where modifying the source code isn't an option, an alternative solution is the use of the software [http://dmtcp.sourceforge.net/ DMTCP].
</div>  


<div class="mw-translate-fuzzy">
<!-- === DMTCP === -->  


.
<!-- The software [http://dmtcp.sourceforge.net/ DMTCP] (Distributed Multithreaded CheckPointing) permits checkpointing of programs without having to recompile them. The initial run is launched with <tt>dmtcp_launch</tt>, specifying the checkpoint interval. The program can then be restarted by running the script <tt>dmtcp_restart_script.sh</tt>. By default, this script and the checkpoint files are written to the directory where the program was started. You can change the location of the checkpoint files with the option <tt>--ckptdir <checkpoint directory></tt>. Run <tt>dmtcp_launch --help</tt> to see all the options for DMTCP. Note that, for the moment, DMTCP does not work with software parallelized using MPI. -->
</div>  




.
<!-- An example of a job script: -->
<!--
{{Fichier
{{Fichier
   |name=job_with_dmtcp.sh
   |name=job_with_dmtcp.sh
Line 60: Line 56:
# ---------------------------------------------------------------------
# ---------------------------------------------------------------------
}}
}}
-->
== Resubmitting a Job for Long-Running Computations ==
== Resubmitting a Job for Long-Running Computations ==
If you plan on breaking up a lengthy computation into several Slurm jobs, there are [[Running_jobs#Resubmitting_jobs_for_long_running_computations|two recommended methods]]:  
If you plan on breaking up a lengthy computation into several Slurm jobs, there are [[Running_jobs#Resubmitting_jobs_for_long_running_computations|two recommended methods]]:  
* [[Running_jobs#Restarting_using_job_arrays|using Slurm job arrays]];
* [[Running_jobs#Restarting_using_job_arrays|using Slurm job arrays]];
* [[Running_jobs#Resubmission_from_the_job_script|resubmission from the end of the job script]].
* [[Running_jobs#Resubmission_from_the_job_script|resubmission from the end of the job script]].