Points de contrôle/en: Difference between revisions

** Once the atomic write has been completed, one can choose whether or not to delete any older checkpoints.


<div class="mw-translate-fuzzy">
So as not to reinvent the wheel, particularly when modifying the source code isn't an option, an alternative is to use the software [http://dmtcp.sourceforge.net/ DMTCP].
</div>


=== DMTCP ===
[http://dmtcp.sourceforge.net/ DMTCP] (Distributed Multithreaded CheckPointing) lets you checkpoint applications without recompiling them. To use it, first load the DMTCP module. Start the application with the command <tt>dmtcp_launch</tt>, which lets you specify the time interval between checkpoints. To restart, run the script <tt>dmtcp_restart_script.sh</tt>. By default, this script and the checkpoint files are written to the directory where the program was started; you can change the location of the checkpoint files with the option <tt>--ckptdir <checkpoint directory></tt>. Run <tt>dmtcp_launch --help</tt> for a list of all the options. Note that DMTCP cannot currently be used to checkpoint applications parallelized with MPI.
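The launch-or-restart choice described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name <tt>dmtcp_command</tt>, the 3600-second interval and <tt>./my_app</tt> are placeholders that are not part of DMTCP, the restart script is assumed to be in its default location (the directory where the program was started), and <tt>--interval</tt> is assumed to be the option that sets the time between checkpoints (confirm with <tt>dmtcp_launch --help</tt>).

```shell
#!/bin/bash
# Sketch only: print the command a job should run, depending on whether a
# previous run already left a DMTCP restart script behind.
# dmtcp_command is a hypothetical helper, not part of DMTCP.
#
# Usage: dmtcp_command <run_dir> <interval_seconds> <application> [args...]
dmtcp_command() {
    local run_dir="$1" interval="$2"
    shift 2
    if [ -x "$run_dir/dmtcp_restart_script.sh" ]; then
        # A restart script exists: resume from the last checkpoint.
        printf '%q\n' "$run_dir/dmtcp_restart_script.sh"
    else
        # First run: launch the application under DMTCP with periodic checkpoints.
        printf 'dmtcp_launch --interval %q' "$interval"
        printf ' %q' "$@"
        printf '\n'
    fi
}
```

In a job script, after loading the DMTCP module, the result could be executed with, for example, <tt>eval "$(dmtcp_command "$PWD" 3600 ./my_app)"</tt>. Printing the command rather than running it directly makes the decision easy to inspect before committing to it.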


An example of a job script:
{{Fichier
   |name=job_with_dmtcp.sh
# ---------------------------------------------------------------------
}}
-->
== Resubmitting a Job for Long-Running Computations ==
If you plan on breaking up a lengthy computation into several Slurm jobs, there are [[Running_jobs#Resubmitting_jobs_for_long_running_computations|two recommended methods]]:
* [[Running_jobs#Restarting_using_job_arrays|using Slurm job arrays]];
* [[Running_jobs#Resubmission_from_the_job_script|resubmission from the end of the job script]].
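
As a rough sketch of the second method, a job script can end by checking whether the computation is complete and, if not, submitting itself again. The marker-file convention, the function name <tt>resubmit_if_unfinished</tt> and the <tt>SUBMIT</tt> variable are assumptions made for this illustration; only <tt>sbatch</tt> itself comes from Slurm.

```shell
#!/bin/bash
# Sketch only: resubmit the job script unless the computation has finished.
# Assumption: the application creates a marker file once all work is done.
# SUBMIT can be overridden (e.g. SUBMIT=echo) for a dry run.
SUBMIT="${SUBMIT:-sbatch}"

# Usage: resubmit_if_unfinished <marker_file> <job_script>
resubmit_if_unfinished() {
    local marker="$1" script="$2"
    if [ -e "$marker" ]; then
        echo "Computation finished; not resubmitting."
    else
        # Work remains: queue the same script again.
        "$SUBMIT" "$script"
    fi
}
```

At the end of a job script this might be called as <tt>resubmit_if_unfinished finished.marker "${BASH_SOURCE[0]}"</tt>, where <tt>finished.marker</tt> is a hypothetical file name.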