Points de contrôle/en

<languages />
The execution time for a program is sometimes too long for the maximum duration of a job permitted by the job schedulers used on the clusters. Long-running jobs are also subject to all of the risks of system instability due to power outages, hardware defects and so forth. A program with a short execution time can easily be restarted with little concern but for long-running software it is preferable to use checkpoints to minimize the risk of losing several days' worth of computation. These checkpoints take the form of binary disk files from which the program can be restarted at the point in the computation where the checkpoint file was initially created.


== Creating and Loading a Checkpoint ==
The creation and loading of a checkpoint may already be taken care of by the application you're using. In this case you simply need to read the relevant documentation about how to use this functionality.

However, if you have access to the source code of the software and/or if you are the author, you can implement checkpoint/restart functionality in the program yourself. The essential steps are:

* The creation of a checkpoint file is done periodically, with a suggested frequency of every 2 to 24 hours.
* While writing the checkpoint file, it's important to remember that the program could be interrupted at any moment, for a variety of reasons. As a consequence:
** It is preferable to not delete the preceding checkpoint when creating the new one.
** The creation of the checkpoint file can be made atomic by performing an operation which confirms the end of the checkpoint process. For example, the checkpoint file can initially be named based on the date and time and, as the final step, a symbolic link <tt>latest-version</tt> is pointed at this new checkpoint file. Another, more advanced, method is to create a second file containing a hash of the checkpoint file's content, by means of which the restart function can verify the integrity of the checkpoint when it is loaded (see the sketch below).
** Once the atomic write has been completed, one can choose whether or not to delete any older checkpoints.
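The atomic publication of a checkpoint described above can be illustrated with a short shell sketch. This is only an illustration under stated assumptions: the command <tt>./write_checkpoint</tt>, the file names and the use of <tt>sha256sum</tt> are placeholders for whatever your own program actually does.
{{Fichier
   |name=save_checkpoint.sh
   |lang="sh"
   |contents=
#!/bin/bash
# Illustration only: create a new checkpoint without deleting the previous one,
# then publish it atomically through the "latest-version" symbolic link.

CKPT="checkpoint-$(date +%Y%m%d-%H%M%S).bin"

# 1. Write the new checkpoint to its own file; older checkpoints are left in place.
./write_checkpoint "$CKPT"          # placeholder for your program's checkpoint step

# 2. Record a hash so the restart code can verify the file's integrity when loading it.
sha256sum "$CKPT" > "$CKPT.sha256"

# 3. Atomically repoint "latest-version" at the new checkpoint.
ln -sfn "$CKPT" latest-version.tmp
mv -T latest-version.tmp latest-version

# 4. Only now is it safe to delete older checkpoints, if desired.
}}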


So as not to re-invent the wheel, particularly in situations where modifying the source code isn't an option, an alternative solution is the use of the software [http://dmtcp.sourceforge.net/ DMTCP].


=== DMTCP ===
The software [http://dmtcp.sourceforge.net/ DMTCP] (Distributed Multithreaded CheckPointing) allows programs to be checkpointed without having to recompile them. To use it, load the DMTCP module. The initial execution is done with the program <tt>dmtcp_launch</tt>, specifying the interval between checkpoints. The restart is then carried out by running the script <tt>dmtcp_restart_script.sh</tt>. By default, this script and the checkpoint files are written in the directory where the program was started. You can change the location of the checkpoint files with the option <tt>--ckptdir <checkpoint directory></tt>. Run <tt>dmtcp_launch --help</tt> to see all the available options. Note that DMTCP does not currently work with software parallelized using MPI.
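For example, a first interactive run and a later restart could look roughly as follows; the module name <tt>dmtcp</tt> and the program <tt>./simulation.exe</tt> are placeholders to adapt to your own environment.
 module load dmtcp
 dmtcp_launch --interval 3600 ./simulation.exe    # first run: write a checkpoint every hour
 ./dmtcp_restart_script.sh                        # later: resume from the most recent checkpoint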


An example of a job script:
{{Fichier
   |name=job_with_dmtcp.sh
#SBATCH --mem=100M
# ---------------------------------------------------------------------
echo "Current working directory: $(pwd)"
echo "Starting run at: $(date)"
# ---------------------------------------------------------------------
# Run your simulation step here...
if test -e "dmtcp_restart_script.sh"; then
     # There is a checkpoint file, restart;
     ./dmtcp_restart_script.sh -h $(hostname)
else
     # There is no checkpoint file, start a new simulation.
     # Write a checkpoint every 3600 seconds; replace ./simulation.exe with your own program.
     dmtcp_launch --interval 3600 ./simulation.exe
fi


# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: $(date)"
# ---------------------------------------------------------------------
}}
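If the job reaches its time limit before the computation is done, the checkpoint files and the <tt>dmtcp_restart_script.sh</tt> script written by DMTCP remain in the working directory; submitting the same script again with <tt>sbatch job_with_dmtcp.sh</tt> should then take the restart branch and resume the computation from the most recent checkpoint.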
== Resubmitting a Job for Long-Running Computations ==
If you plan on breaking up a lengthy computation into several Slurm jobs, there are [[Running_jobs#Resubmitting_jobs_for_long_running_computations|two recommended methods]]:
* [[Running_jobs#Restarting_using_job_arrays|using Slurm job arrays]];
* [[Running_jobs#Resubmission_from_the_job_script|resubmission from the end of the job script]] (see the sketch below).
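As a minimal sketch of the second method, the end of the job script can test whether the computation is finished and, if not, submit a follow-up job. The script name, the marker file <tt>finished.txt</tt> and the program <tt>./simulation.exe</tt> are assumptions chosen for this example; see the linked [[Running_jobs|Running jobs]] page for the recommended patterns.
{{Fichier
   |name=job_with_resubmission.sh
   |lang="sh"
   |contents=
#!/bin/bash
#SBATCH --time=0-03:00
#SBATCH --mem=100M

# Run one chunk of the computation. The program is expected to write its own
# checkpoints and to create "finished.txt" (a name chosen for this example)
# once the final result has been produced.
./simulation.exe

# If the computation is not finished, submit a follow-up job that will
# resume from the most recent checkpoint.
if [ ! -f finished.txt ]; then
     sbatch job_with_resubmission.sh
fi
}}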
