Points de contrôle/en: Difference between revisions

Updating to match new version of source page
(Created page with ".")
(Updating to match new version of source page)
Line 1: Line 1:
{{Draft}}
<languages />
The execution time of a program is sometimes longer than the maximum job duration permitted by the job schedulers used on Compute Canada clusters. Long-running jobs are also exposed to the risks of system instability, such as power outages and hardware defects. A program with a short execution time can simply be restarted, but for long-running software it is preferable to use checkpoints to minimize the risk of losing several days' worth of computation. These checkpoints take the form of binary disk files from which the program can be restarted at the point in the computation where the checkpoint file was created.
Line 12: Line 14:
** The creation of the checkpoint file can be made ''atomic'' by performing an operation which confirms the end of the checkpoint process. For example, the checkpoint file can be initially named based on the date and time and, as the final step, a symbolic link ''latest-version'' is pointed at this new checkpoint file. Another more advanced method would be to create a second file which contains a hash of the checkpoint file's content by means of which the restart function can verify the integrity of the checkpoint when it is loaded.  
** Once the atomic write has been completed, one can choose whether or not to delete any older checkpoints.
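
The two techniques described above (a ''latest-version'' symbolic link updated as the final step, and a companion hash file checked at restart) can be sketched in bash as follows. This is a minimal illustration only: the function names <code>save_checkpoint</code> and <code>load_checkpoint</code>, the file naming scheme and the use of SHA-256 are assumptions made for the example, and the sketch assumes GNU coreutils (<code>sha256sum</code>, <code>mv -T</code>, <code>date +%N</code>) are available.

```shell
#!/bin/bash
# Sketch only: save_checkpoint and load_checkpoint are hypothetical helpers,
# not part of any existing tool. Assumes GNU coreutils.

# Write a checkpoint, then its hash, and only as the last step repoint
# the "latest-version" symbolic link at the new file.
save_checkpoint() {
    local dir="$1" payload="$2"
    # Name the checkpoint after the date and time (nanoseconds avoid clashes).
    local name="checkpoint-$(date +%Y%m%d-%H%M%S-%N).bin"
    local tmp_link="latest-version.tmp.$$"

    printf '%s' "$payload" > "$dir/$name" || return 1
    # Second file containing a hash of the checkpoint's content.
    ( cd "$dir" && sha256sum "$name" > "$name.sha256" ) || return 1
    # Create the link under a temporary name, then rename it over the old
    # link, so that "latest-version" always points at a complete checkpoint.
    ln -s "$name" "$dir/$tmp_link" || return 1
    mv -T "$dir/$tmp_link" "$dir/latest-version" || return 1
}

# Print the content of the latest checkpoint, but only if its hash matches.
load_checkpoint() {
    local dir="$1" name
    name="$(readlink "$dir/latest-version")" || return 1
    ( cd "$dir" && sha256sum --check --status "$name.sha256" ) || return 1
    cat "$dir/$name"
}
```

A program would call <code>save_checkpoint</code> at regular intervals and <code>load_checkpoint</code> once at start-up; a non-zero return from the latter signals that no valid checkpoint exists and that the computation should start from the beginning.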
To avoid reinventing the wheel, especially if modifying the source code is not an option, we suggest using [http://dmtcp.sourceforge.net/ DMTCP].


<div class="mw-translate-fuzzy">
<div class="mw-translate-fuzzy">
Line 18: Line 22:
</div>  
</div>  


<div class="mw-translate-fuzzy">


.  
.
</div>




Line 37: Line 43:
#SBATCH --mem=100M
# ---------------------------------------------------------------------
echo "Current working directory: $(pwd)"
echo "Starting run at: $(date)"
# ---------------------------------------------------------------------
# Run your simulation step here...
Line 44: Line 50:
if test -e "dmtcp_restart_script.sh"; then
     # There is a checkpoint file, restart from it.
     ./dmtcp_restart_script.sh -h "$(hostname)"
else
     # There is no checkpoint file, start a new simulation.
Line 51: Line 57:


# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: $(date)"
# ---------------------------------------------------------------------
}}
}}
-->
== Resubmitting a Job for Long-Running Computations ==
If you plan on breaking up a lengthy computation into several Slurm jobs, there are [[Running_jobs#Resubmitting_jobs_for_long_running_computations|two recommended methods]]:
* [[Running_jobs#Restarting_using_job_arrays|using Slurm job arrays]];
* [[Running_jobs#Resubmission_from_the_job_script|resubmission from the end of the job script]].
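
As a rough illustration of the second method, the end of a job script could decide whether more work remains and submit the script again if so. The sketch below is an assumption-laden example rather than the procedure described on the linked page: the function name <code>resubmit_if_unfinished</code>, the progress counter file and the <code>SUBMIT_CMD</code> variable are all hypothetical, and only the plain <code>sbatch script</code> invocation is taken from Slurm itself.

```shell
#!/bin/bash
# Sketch only: resubmit_if_unfinished, the counter file and SUBMIT_CMD are
# hypothetical names chosen for this example.

# Submit the job script again unless the recorded number of completed
# steps has reached the requested total.
resubmit_if_unfinished() {
    local script="$1" counter_file="$2" max_steps="$3"
    local done_steps=0

    # The simulation is assumed to record its progress as a single integer.
    if [ -f "$counter_file" ]; then
        done_steps="$(cat "$counter_file")"
    fi

    if [ "$done_steps" -lt "$max_steps" ]; then
        # SUBMIT_CMD defaults to sbatch; override it to try this without Slurm.
        "${SUBMIT_CMD:-sbatch}" "$script"
    else
        echo "All $max_steps steps completed; nothing to resubmit."
    fi
}
```

The function would be called as the last command of the job script, with the path of that script as its first argument, after the simulation step has written its checkpoint and updated the counter file.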