Translations:AI and Machine Learning/16/en: Difference between revisions

Jump to navigation Jump to search
Importing a new version from external source
(Importing a new version from external source)
 
(Importing a new version from external source)
Line 1: Line 1:
If your computations are long, you should use checkpointing. For example, if your training time is 3 days, you could split it in 3 chunks of 24 hours. This would prevent you from losing all the work in case of an outage, and would give you an edge in terms of priority (more nodes are available for short jobs). Most machine learning libraries natively support checkpointing. Please see our suggestions about [[Running jobs#Resubmitting_jobs_for_long_running_computations|resubmitting jobs for long running computations]]. If your program does not natively support this, we provide a [[Points de contrôle|general checkpointing solution]].
If your computations are long, you should use checkpointing. For example, if your training time is 3 days, you could split it in 3 chunks of 24 hours. This would prevent you from losing all the work in case of an outage, and would give you an edge in terms of priority (more nodes are available for short jobs). Most machine learning libraries natively support checkpointing. Please see our suggestions about [[Running jobs#Resubmitting_jobs_for_long_running_computations|resubmitting jobs for long running computations]]. If your program does not natively support this, we provide a [[Points de contrôle/en|general checkpointing solution]].
38,892

edits

Navigation menu